Pula: Training Large Language Models for Setswana
Published in NAACL 2025
Developed in partnership with the DSFSI group at the University of Pretoria, this work introduces Pula, the first suite of LLMs built for Setswana; Marothodi, the largest Setswana pre-training corpus; and Medupi, the first extensive Setswana instruction-tuning dataset.
Recommended citation: Brown, Nathan and Marivate, Vukosi (2025). "Pula: Training Large Language Models for Setswana." NAACL 2025. https://aclanthology.org/2025.naacl-long.338/