Igbo is a language spoken by twenty million people, mostly in southern Nigeria. This EAGER project makes use of a corpus of spoken Igbo that will cover all of the dialects of the language. The corpus will be used for explorations in which statistical machine learning (ML) programs are created to learn an "inter-Igbo" consisting of cognate sets (Igbo words pronounced differently in different locations but having the same meaning). These cognate sets enable the corpus to be treated as if it were spoken as a single language, even though the dialects at the extreme ends of the Igbo homeland are mutually unintelligible. Another aspect of our work is the extension of the existing corpus to fill in gaps in dialect coverage, where the current recordings come from locations that have no near geographical neighbors. The need for this stems from the fact that the closer a dialect's neighbors are, the more similar they are to it, and the easier it is for programs to locate words that differ systematically.
Achieving the goals of this exploratory project is of considerable interest for computational linguistics. In contrast to language change over time, language change over geography has received little computational attention, and finding appropriate ML models for this aspect of language variation is a considerable challenge.