🎉 My colleagues and members of the language community have released SMOL, a new open-source dataset (CC BY 4.0) for machine translation research. SMOL contains professionally translated parallel text for over 115 low-resource languages, including more than 50 African languages. It is intended as a resource for researchers working on machine translation for under-represented languages.
Kindly check the paper for more details, including the limitations of this dataset.
Paper: https://arxiv.org/pdf/2502.12301
Dataset: https://huggingface.co/datasets/google/smol
List of African languages:
Afar
Acoli
Afrikaans
Alur
Amharic
Bambara
Baoulé
Bemba (Zambia)
Berber
Chiga
Dinka
Dombe
Dyula
Efik
Ewe
Fon
Fulfulde
Ga
Hausa
Igbo
Kikuyu
Kongo
Kanuri
Krio
Kituba (DRC)
Lingala
Luo
Kiluba (Luba-Katanga)
Malagasy
Mossi
North Ndebele
Ndau
Nigerian Pidgin
Oromo
Rundi
Kinyarwanda
Sepedi
Shona
Somali
South Ndebele
Susu
Swati
Swahili
Tamazight
Tigrinya
Tiv
Tsonga
Tumbuka
Tswana
Twi
Venda
Wolof
Xhosa
Yoruba
Zulu
Credits:
Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry, Geza Kovacs, Hadar Shemtov, Partha Talukdar, Dinesh Tewari, Baba Mamadi Diane, Koulako Moussa Doumbouya, Djibrila Diane, and Solo Farabado Cissé. SMOL: Professionally translated parallel data for 115 under-represented languages.