It is widely recognized that language impairment can have a negative effect on literacy skills, and that children with language impairment are at a higher risk of academic under-achievement and lower overall social development. Hence, early and accurate language assessment for children is critical, especially for those with non-mainstream linguistic backgrounds. Spontaneous language samples are commonly used in communication disorders to measure the speaker's competence across a range of complementary language skills. These elicitation tasks allow clinicians and clinical researchers to analyze speech fluency by looking at the patterns of disfluencies and other speech disruptions. Language productivity can be gauged by computing mean length of utterance, along with measures of vocabulary and total utterances produced. Morpho-syntactic skills can also be analyzed from these data, by manually coding for specific grammatical constructions that are known to signal developmental milestones. At present, use of the information contained in these language samples is restricted to the capacity of human experts to manually analyze the data, since little has been done to use computational models for this task. In this collaborative effort by PIs at the University of Alabama at Birmingham and the University of Texas at Dallas, the objective is to address this problem by developing computational approaches for scoring samples from children along different language dimensions, including speech fluency, syntactic structure, content, and coherence, with the long-term goal of building robust computational linguistic approaches for identifying language impairments in children. With these ends in mind, the PIs will investigate a number of core research questions, including measuring syntactic complexity in children's language, evaluating content in story retelling and play sessions, and detecting disfluencies in children's transcripts.
Moreover, this research will focus on analyzing samples from children with three different language backgrounds: English monolinguals, Spanish monolinguals, and Spanish-English bilinguals of Mexican descent (the latter representing the fastest-growing minority in this country). Since their models will be data-driven, the PIs expect to be able to evaluate empirically the differences in developmental patterns of speech in children across these linguistic backgrounds. Addressing the bilingual population involves modeling code-switching behavior; thus, additional core research questions include measuring syntactic complexity in code-switched data, and identification and categorization of code-switching patterns in bilingual children.
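
As an illustration of the productivity measures named above (mean length of utterance, vocabulary, and total utterances), the following is a minimal sketch and not the project's own method. It assumes a transcript is already split into one utterance per string and counts mean length of utterance in words, whereas clinical practice usually counts morphemes. The function name `productivity_measures` and the sample utterances are hypothetical.

```python
import re

# Assumption: a "word" is a run of letters, optionally followed by an
# apostrophe and more letters (e.g. "don't"). The pattern accepts accented
# letters, so it can be applied to both English and Spanish transcripts.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def productivity_measures(utterances):
    """Compute simple productivity measures for a list of utterance strings."""
    # Tokenize each utterance and drop utterances that contain no words.
    tokenized = [WORD_PATTERN.findall(u.lower()) for u in utterances]
    tokenized = [tokens for tokens in tokenized if tokens]

    total_utterances = len(tokenized)
    total_words = sum(len(tokens) for tokens in tokenized)
    vocabulary = {word for tokens in tokenized for word in tokens}

    # Mean length of utterance in words (a simplification of MLU in morphemes).
    mlu_words = total_words / total_utterances if total_utterances else 0.0

    return {
        "total_utterances": total_utterances,
        "total_words": total_words,
        "mlu_words": mlu_words,
        "different_words": len(vocabulary),
    }


if __name__ == "__main__":
    sample = ["the dog runs", "he is big", "el perro corre mucho"]
    print(productivity_measures(sample))
```

Counting words keeps the sketch free of language-specific tools; a morpheme-based count would need a morphological analyzer for each language.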

Broader Impacts: This research will contribute to developing more accurate and practical tools for assessing language development in children, a field to which little attention has been paid to date. Addressing the challenges involved in the automated analysis of children's speech will also advance the field of Natural Language Processing (NLP) in general. Moreover, since the project involves children with three different linguistic backgrounds, the new technology will have low language dependency and so should be easily portable to other languages and domains. In the field of communication disorders, applying corpus-based approaches to language assessment is still in its infancy; project outcomes will have a direct impact on this field, by providing new metrics for scoring spontaneous language samples of children that can complement the battery of assessment tools currently used.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
1018124
Program Officer
Ephraim Glinert
Project Start
Project End
Budget Start
2010-09-01
Budget End
2014-10-31
Support Year
Fiscal Year
2010
Total Cost
$301,055
Indirect Cost
Name
University of Alabama Birmingham
Department
Type
DUNS #
City
Birmingham
State
AL
Country
United States
Zip Code
35294