Thought disorder in psychotic disorders and their risk states has typically been evaluated using clinical rating scales and, occasionally, labor-intensive manual methods of linguistic analysis. We propose instead to use a novel automated, corpus-based approach to language analysis informed by artificial intelligence. The method derives the semantic meaning of words and phrases by drawing on a large corpus of text, similar to how humans assign meaning to language, and yields measures of semantic coherence from one phrase to the next. It also evaluates syntactic complexity through "part-of-speech" tagging and analysis of speech graphs. These analyses yield fine-grained indices of speech semantics and syntax that may capture thought disorder more accurately. Using these automated methods of speech analysis, in collaboration with computer scientists from IBM, we identified a classifier with high accuracy for psychosis onset in a small clinical high-risk (CHR) cohort. Its features included decreased semantic coherence from phrase to phrase and decreased syntactic complexity, including shortened phrase length and decreased use of determiner pronouns ("which", "what", "that"). These features correlated with prodromal symptoms but outperformed them in classification accuracy. They also discriminated schizophrenia from normal speech. We further cross-validated this automated approach in a second small CHR cohort, identifying a semantics/syntax classifier that classified psychosis outcome in both cohorts and discriminated speech in recent-onset psychosis patients from normal speech. These automated linguistic analytic methods hold great promise, but their use thus far has been limited to a few small studies that aim to discriminate schizophrenia from the norm and, in our own work, to predict psychosis. There remains a critical gap in our understanding of the linguistic mechanisms that underlie thought disorder.
To address this gap, in response to PAR-16-136, we propose to use the RDoC construct of language production, and its linguistic corpus-based analytic paradigm, to study thought disorder dimensionally and transdiagnostically in a large cohort of 150 putatively healthy volunteers, 150 CHR patients, and 150 recent-onset psychosis patients. We expect that latent semantic analysis will yield measures of semantic coherence that index positive thought disorder (tangentiality, derailment), whereas part-of-speech (POS) tagging and speech graphs will yield measures of syntactic complexity that index negative thought disorder (concreteness, poverty of content). This large language dataset will be obtained from two PSYSCAN/HARMONY sites, so that these language data will be available for secondary analyses with PSYSCAN/HARMONY imaging and EEG data to study language production at the circuit and physiological levels. The language and clinical dataset will also be archived at NIH for further linguistic analyses by other investigators.
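As an illustration only, the notion of phrase-to-phrase semantic coherence can be sketched as the cosine similarity between the vectors of consecutive phrases. The sketch below is not the project's pipeline: the word vectors in `TOY_VECTORS` are invented for the example (in practice they would be derived from a large text corpus, for instance by latent semantic analysis), and the function names `phrase_vector`, `cosine`, and `coherence_scores` are hypothetical.

```python
import math

# Toy word vectors, invented purely for illustration. In a real
# analysis these would be derived from a large text corpus
# (for example, by latent semantic analysis).
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "barked": [0.8, 0.2, 0.1],
    "cat": [0.85, 0.15, 0.05],
    "ran": [0.7, 0.3, 0.1],
    "taxes": [0.0, 0.1, 0.9],
    "rose": [0.1, 0.2, 0.8],
}


def phrase_vector(phrase):
    """Average the vectors of the words in a phrase that have a vector."""
    vecs = [TOY_VECTORS[w] for w in phrase.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None  # no known words in this phrase
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coherence_scores(phrases):
    """Cosine similarity between each pair of consecutive phrases.

    Phrases with no known words are dropped before pairing.
    """
    vectors = [phrase_vector(p) for p in phrases]
    vectors = [v for v in vectors if v is not None]
    return [cosine(a, b) for a, b in zip(vectors, vectors[1:])]


# Two related phrases followed by an abrupt topic shift.
phrases = ["the dog barked", "the cat ran", "taxes rose"]
scores = coherence_scores(phrases)
print(scores)
```

In this toy input the first pair of phrases stays on topic and scores high, while the shift to an unrelated topic scores low. Summary statistics over such scores (for example, the minimum or mean across a transcript) are one plausible way to turn them into a speaker-level index; which statistics are used in the proposed work is not specified here.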
Language offers a privileged view into the mind: it is the basis by which we infer others' thoughts. In collaboration with computer scientists at IBM, we will use advanced computational speech analytic approaches to identify the linguistic basis (semantics and syntax) that underlies language production along a spectrum from normal speech to gradations of thought disorder. Our large international language dataset on 450 individuals will be archived at NIH as a resource for further linguistic analyses.