For nearly five decades sociolinguists and dialectologists have studied the differences in pronunciation, word choice and grammar as correlated with the demographics and attitudes of the speakers and the situations in which they find themselves. This work has important implications for society, education, politics, technology development and forensics. Sociolinguists routinely produce recordings of natural speech, variously transcribed and quantitatively analyzed for dialect features, plus careful descriptions of the speakers' characteristics and the interview situation. These data have important potential for linguists, scholars in language-related fields and computer scientists developing human language technologies. Although many sociolinguists are eager to share their work, there have been impediments to such sharing. The proposed workshop will address two of the most important. First, within the United States, an Institutional Review Board (IRB) must approve any research involving human subjects. The vast majority of sociolinguistic research involves extremely low risk and potentially high social benefit, particularly for minority communities, but no common body of practice exists for permitting data to be shared. Second, there is no common body of practice with respect to the demographic, attitudinal and situational information collected, which complicates sharing and comparison across studies.
The workshop will gather leading sociolinguists and dialectologists, and other field researchers with extensive experience, to develop common practice in preparing for institutional review and sharing of data. Expected outcomes are a sketch of an IRB protocol and demographic, attitudinal and situational questionnaires, each containing a core set that scholars should collect for every subject as well as a larger set whose relevance will depend upon the interview itself. The Linguistic Society of America and the Linguistic Data Consortium will publish the protocol and modules on their web sites and announce them via their newsletters and mailing lists, which reach more than 8000 scholars worldwide.
Sociolinguists study variation and change in language correlating with social factors: the age, sex, socioeconomic class and region of the speakers, the nature of the interaction and speakers’ attitudes towards their demographics and situation. Over the past fifty years, sociolinguists have typically worked in their own speech communities conducting interviews, collecting data about their human subjects and analyzing the resulting language. The speech communities studied have been very different from each other: metropolitan New York, rural communities in the U.S. South, towns along the England-Scotland border and communities in Israel and Iran. Thus, one naturally expects differences in the social factors that vary with language. However, our need for consistency in the terms of analysis becomes more acute because: 1) after numerous community studies it is natural to compare across communities, 2) such comparisons reveal similarities in how language and social factors correlate, suggesting that eliciting consistent information in a consistent way improves our ability to compare and build upon prior work, as all good science does, 3) technology now available permits researchers to combine and analyze data from multiple communities, making such work desirable partially because it is possible, and 4) emerging connections between sociolinguistics and related fields emphasize the need for sociolinguists to document methods so that they can be replicated by other practitioners and understood by outsiders. Yet a review of published studies reveals wide-ranging differences in practice, not all of which are motivated by differences in the communities studied. To address these issues we convened a workshop, inviting speakers recognized for fieldwork experience and expertise on specific social factors. Although attendance was open to all, we provided stipends to attract graduate students embarking on their own fieldwork and willing to share resulting data.
We asked presenters to: 1) provide evidence, for the social factors in which they specialized, that supports one approach over others and 2) offer strategies for eliciting this information while protecting the rights of human subjects. For example, several presenters specializing in ethnic identity showed evidence that the common categories Asian and Latino are insufficient, especially when applied to communities of speakers from multiple countries. They also showed that facts surrounding emigration (age, generation) affect language. Some also showed that speakers associate with multiple identities and that their speech changes accordingly. Concretely, a subject whose mother is African-American and father is Latino may associate with both labels and may alter her speech, perhaps even subconsciously, based on her situation, for example when she is with other African-Americans or Latinos. The papers from this workshop are available at: http://projects.ldc.upenn.edu/NSF_Coding_Workshop_LSA and expanded versions are being edited for publication in the journal Language and Linguistics Compass. Students and junior researchers can consult these to inform their own work. Subgroups of workshop participants are planning follow-up efforts, including publication and sharing of existing data and new collections that will also be shared. The workshop benefits the public in several ways. Sociolinguists are motivated by their interest in how language reflects and impacts social trends, for example the ongoing national debates about African American English. Broader data sharing helps researchers reach conclusions faster and at lower public cost, and so inform public debate more efficiently. One pair of workshop presenters collaborated on a proposal to collect and publish a corpus of African-American English for exactly this purpose. An additional benefit will result from the use of sociolinguistic data to improve human language technologies.
We saw that speaker recognition systems were unable to resolve identities in 911 calls during the George Zimmerman/Trayvon Martin trial (and in fact were misused by some expert witnesses). Technology developers have used sociolinguistic data to improve system performance but were hampered by the lack of consistency that the workshop addressed. The papers that resulted from the workshop will guide future data collection efforts so that they are consistent and sufficient to capture the social factors that vary with language. This will improve researchers' ability to draw conclusions from the data and will allow analysts to compare, combine and otherwise build upon prior work. It will also allow consumers of sociolinguistic data to better understand the terms of analysis so that they can build systems that perform better at, for example, identifying the speaker or perhaps the social characteristics of the speaker in a 911 call. Human language technologies are currently limited by the data used to build them. For example, if a dataset distinguishes only between speakers of Asian and Latino descent, then a system built from this data will be unable to make finer distinctions, such as among speakers who emigrated from, or are descended from emigrants from, different Asian countries. Thus by refining and improving the consistency of the analyses performed by sociolinguists we can also improve the performance of human language technologies built from that data.