
David Adelani: “How Good are Large Language Models on African Languages?”

Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to yield good performance, using in-context learning, even on tasks and languages they are not trained on. However, their performance on African languages is largely understudied relative to high-resource languages. We present an analysis of four popular large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks (topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition) across 60 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs produce lower performance for African languages, and there is a large gap in performance compared to high-resource languages (such as English) for most tasks. We find that GPT-4 has average-to-good performance on classification tasks, yet its performance on generative tasks such as machine translation and summarization is significantly lacking. Surprisingly, we find that mT0 had the best overall performance for cross-lingual QA, better than the state-of-the-art supervised model (i.e. fine-tuned mT5) and GPT-4 on African languages. Similarly, we find the recent Aya model to have comparable results to mT0 on almost all tasks, except for topic classification, where it outperforms mT0. Overall, LLaMa 2 showed the worst performance, which we believe is due to its English- and code-centric (around 98%) pre-training corpus. Our findings confirm that performance on African languages remains a hurdle for current LLMs, underscoring the need for additional efforts to close this gap.
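As a rough illustration of the in-context learning setup the abstract refers to (prompting an LLM on a task and language it was not explicitly fine-tuned for), the sketch below queries an mT0 checkpoint zero-shot for topic classification. The checkpoint name, label set, prompt template, and example text are illustrative assumptions, not the evaluation protocol used in the paper.

```python
# Minimal zero-shot prompting sketch with an mT0-family checkpoint from the
# Hugging Face Hub. This is NOT the authors' evaluation code; the prompt,
# labels, and input sentence are placeholders for illustration only.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "bigscience/mt0-small"  # assumption: any mT0 checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

labels = ["politics", "sports", "health", "business", "entertainment"]
sentence = "..."  # a news sentence in an African language, e.g. Hausa or Yoruba

# Zero-shot prompt: the model sees only an instruction and the input text,
# with no task-specific fine-tuning on the target language.
prompt = (
    f"Classify the topic of the following news text as one of "
    f"{', '.join(labels)}.\nText: {sentence}\nTopic:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The generated label would then be compared against gold annotations to compute accuracy or F1 per language, which is how gaps between high-resource and African languages become visible.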


Dr David Ifeoluwa Adelani is a DeepMind Academic Fellow/Research Fellow at University College London, an incoming Assistant Professor at the McGill School of Computer Science in Canada, a Core Academic Member at Mila - Quebec Artificial Intelligence Institute, and a Canada CIFAR AI Chair. Dr. Adelani is also actively involved in Masakhane, a grassroots organization dedicated to advancing natural language processing (NLP) research in African languages. He received his Ph.D. in Computer Science from the Department of Language Science and Technology, Saarland University. Dr. Adelani's research primarily focuses on NLP for under-resourced languages, particularly those of Africa. His interests extend to multilingual representation learning, machine translation, and speech processing. With over 20 publications in prestigious NLP and speech processing venues such as ACL, TACL, EMNLP, NAACL, COLING, and Interspeech, Dr. Adelani's contributions are highly regarded in NLP for African languages. Notably, one of his publications received the best paper award at COLING 2022 for developing a multilingual pre-trained language model for African languages. In 2023, another publication received an Area Chair Award at IJCNLP-AACL for developing a news topic classification dataset for 16 African languages.
