Dr. Peter K. Norquest (University of Arizona) and Dr. Sean S. Downey (Stanford University and University of Arizona) will undertake research on the effects of community-level language interactions on prehistoric migration patterns in Nusa Tenggara, a large region in eastern Indonesia. The history of this region remains unclear because contact between Papuan speakers and Austronesian speakers (circa 4,500 BP) and subsequent drift have obscured many of the linguistic criteria that historical linguists typically use to reconstruct migrations. Recent developments in computational linguistics can help overcome some of these obstacles. The researchers will integrate two independent methodologies and improve on them by including both local and regional scales of analysis. The project will thus produce new methodologies that other researchers can use to reconstruct population movements where historical records are lacking.
The researchers will investigate the relationship between community-level and regional-level language patterns by analyzing a database of words collected from numerous locations across Nusa Tenggara. Two computational analyses will be used, each of which has proven accurate at a different spatial scale, and these will be complemented by the more traditional "comparative method," in which a trained historical linguist manually classifies languages based on the phonological and morphological differences in their lexicons. The results of these analyses are expected to shed light on how linguistic interaction among people in neighboring communities has affected the language patterns that are observed in the wake of large-scale demographic transitions.
The research is significant because it will clarify a poorly understood aspect of the last great human migration prior to the modern era: the Austronesian expansion from Mainland China into Indonesia and ultimately the colonization of islands across the Pacific Ocean. The methodology developed during this project for integrating computational linguistic analysis across scales will be applicable to reconstructing the demographic history of indigenous populations wherever historical records are unavailable.
There are relatively few sources of information available to paleoanthropologists wishing to reconstruct prehistoric population migrations. These include archaeological artifacts (the pottery, stone, and bone remains of human occupation); genetic samples taken from the blood and saliva of contemporary peoples; and historical linguistics, which uses phonological and lexical differences among the words spoken by contemporary populations to infer how languages evolved and traveled with earlier speakers. These are the main types of information commonly used to understand intercontinental population migrations that occurred thousands of years ago. Our project focused on this last category, using lists of words with shared meanings. We analyzed thousands of words from Nusa Tenggara, a relatively remote and under-studied region in eastern Indonesia, using a combination of the traditional methods of historical linguistics and a new computational approach developed by our research team. Our methods were designed to use the small phonological differences between words to establish the historical relationships between contemporary languages in this part of the world. In historical linguistics, two words are considered ‘cognate’ if they share a common historical ancestor. The process of manually cognate-coding words is tedious and requires years of specialized training, yet the most sophisticated computational methods for inferring the historical relationships between languages rely on pre-coded cognacy determinations. In contrast, our approach was to develop a computer program that calculates a numeric "distance" reflecting the difference in sounds between words.
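To make the idea of a numeric "distance" between words concrete, the sketch below computes a length-normalized edit distance between two words represented as sequences of sound segments. This is a generic illustration and not the program developed in this project: the report does not specify the algorithm, so the choice of Levenshtein distance, the uniform substitution cost, the function names, and the example words are all assumptions.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of sound segments.

    Assumption: every insertion, deletion, and substitution costs 1.
    A phonologically informed metric would weight substitutions by
    how similar the two sounds are.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, seg_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i segments of a
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Scale the edit distance to the range 0..1 by the longer word's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Illustrative strings only, not data from the project's wordlists.
print(normalized_distance(list("mata"), list("mata")))  # identical forms
print(normalized_distance(list("mata"), list("mate")))  # one segment differs
```

Normalizing by word length keeps long and short words comparable, so distances could in principle be averaged over a whole wordlist to compare two languages or dialects.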
In addition to new insights into the region’s demographic history, the intellectual merit of this project lies in developing a high-resolution metric for the phonological differences between words, which we use to detect subtle differences between the languages and dialects spoken in different communities in eastern Indonesia. To do this, we assembled a large database of wordlists and analyzed them in parallel using the traditional methods of historical linguistics and the new computational methods. We found that our methodology performed well across a range of scales (from small dialectal differences to mutually unintelligible languages) when compared to the traditional methods of historical linguistics, with two additional advantages. First, it does not rely on the skills of trained linguists and is therefore highly reproducible. Second, it is a quantitative measurement, which means it can be compared directly to other quantifiable data, such as those from archaeology, genetics, paleoclimatology, and geography.
An unexpected result of this project was the discovery of a new series of phonological distinctions in Proto-Malayo-Polynesian, the ancestor of the languages we analyzed. These were first noted while coding data and performing reconstruction and subgrouping of the languages of Nusa Tenggara in the study sample. To determine their origin, additional data were gathered from other Austronesian languages of Indonesia, the Philippines, and Oceania. As more data were reviewed, it became clear that these distinctions were not secondary developments within the subgroups of Nusa Tenggara, but also occur in various languages of the Philippines, Borneo, the Barrier Islands off the west coast of Sumatra, and the Oceanic languages. As research progressed, the same distinctions were also found in cognate words in the languages of the Kra-Dai (Tai-Kadai) phylum.
Since Kra-Dai is often suggested to be genetically related to Austronesian, this discovery is quite exciting: the data from these two language phyla are mutually reinforcing and have allowed new insights into the phonological reconstruction of both groups. This in turn strengthens the hypothesis of a relationship between them and allows additional inferences to be made regarding the prehistory of the speakers of these languages.
While our methodology was developed for a specific purpose, the general approach of measuring differences between the sounds of words is broadly applicable and may have other practical uses. For example, calculating a quantitative distance between words while accounting for phonological differences could be used to improve Internet search results or to mine online textual data (web pages, Twitter feeds, etc.) for patterns. We have therefore released our software as open source to facilitate technology transfer. A more widely accessible version of our method could be picked up by commercial online search firms or software developers and incorporated into search results or apps, which would have a direct impact outside the academic world.