Statistical analysis is key to many challenging applications such as text classification, speech recognition, and DNA analysis. However, the amount of data available is often comparable to, or even smaller than, the size of the symbol set (the alphabet) from which the data are drawn. Unfortunately, little is known about optimal inference in this so-called large-alphabet regime. Recently, several promising approaches have been developed by different scientific communities, including Bayesian nonparametrics in statistics and machine learning, universal compression in information theory, and the theory of graph limits in mathematics and computer science.
The investigators study this problem by drawing on these multiple perspectives, with a particular focus on developing the information-theoretic approach. The research studies analytical properties of the "pattern maximum likelihood" estimator, which performs well in practice but is not well understood theoretically, and also explores computational speedups. Moreover, it attempts to delineate which problem classes are better handled by Bayesian nonparametric techniques and which by the pattern approach, and explores links between the two. The investigators apply the resulting theory to automatic document classification, allowing for more automation in storing, retrieving, and analyzing data. They also use the theory to study genetic variations; linking such variations to disease diagnosis is a crucial step in the systematic quantification of biology, which is playing an increasingly important role in medical advancement. The research also brings new courses to the classroom, with a special outreach effort to involve women and underrepresented minorities, including through the Native Hawaiian Science and Engineering Mentorship Program.
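To make the central object concrete, the following is a minimal illustrative sketch in Python, not drawn from the project itself; the function names and the brute-force enumeration are purely expository assumptions. The pattern of a sequence replaces each symbol by the order of its first appearance, and the pattern maximum likelihood estimator seeks the distribution (viewed as a multiset of probabilities, of any support size) that maximizes the probability of observing that pattern. The enumeration below has exponential cost and is tractable only for tiny examples.

    from itertools import permutations

    def pattern(seq):
        # Replace each symbol by the order of its first appearance,
        # e.g. "abracadabra" -> (1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1).
        order = {}
        return tuple(order.setdefault(s, len(order) + 1) for s in seq)

    def pattern_probability(pat, probs):
        # Probability that an i.i.d. sample from the distribution `probs`
        # exhibits the pattern `pat`, computed by summing over all
        # injective assignments of pattern indices to support elements.
        k = max(pat)  # number of distinct symbols in the pattern
        total = 0.0
        for assign in permutations(range(len(probs)), k):
            p = 1.0
            for idx in pat:
                p *= probs[assign[idx - 1]]
            total += p
        return total

    # The pattern maximum likelihood estimate is the distribution that
    # maximizes pattern_probability. For the pattern of "aab", a uniform
    # pair beats a skewed pair:
    pat = pattern("aab")                          # (1, 1, 2)
    print(pattern_probability(pat, [0.5, 0.5]))   # 0.25
    print(pattern_probability(pat, [0.8, 0.2]))   # 0.8*0.8*0.2 + 0.2*0.2*0.8 = 0.16

Note that the estimator optimizes over the shape of the distribution rather than over labeled symbol probabilities, which is what makes it attractive when the alphabet is large relative to the sample, and also what makes its analysis and computation challenging.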