WAXAL: A Large-Scale Multilingual African Language Speech Corpus

🇳🇬🇪🇹🇨🇩🇰🇪🇺🇬🇬🇭🇲🇬🇿🇼🇨🇬 Hausa • Yoruba • Igbo • Amharic • Oromo • Swahili • Lingala • Akan (Fante, Twi) • Luganda • Kikuyu • Luo • Shona • Malagasy • Fulani (Fula) • Tigrinya • Sidama • Wolaytta • Ewe • Nyankole • Rukiga • Masaaba • Soga • Dagbani • Dagaare • Acholi • Ikposo

This took several years, and I’m so happy it is finally out. We just released an open-source dataset of nearly 2M African-language speech records for speech recognition and speech synthesis (27 languages). As of today it already has close to 10k downloads, and it is being used to train ASR and TTS models.

Blog: blog.google/intl/en-afri…

Paper: arxiv.org/abs/2602.02734

Dataset: huggingface.co/datasets/goo…

News articles:

https://techcabal.com/2026/02/12/voice-is-africas-gateway-to-ai-and-google-wants-to-lead-it

https://restofworld.org/2026/google-waxal-african-languages-ai-sovereignty

https://techpoint.africa/insight/google-is-expanding-waxal-beyond-21-languages-what-it-means-for-african-researchers