Relational database-management systems constitute a mature, ubiquitous, sophisticated technology that is deeply entrenched. Seemingly all organizations are collecting vastly increasing volumes of structured as well as unstructured data, and want to extract knowledge from this data using machine-learning techniques/algorithms. Thus many learning tasks faced by working data scientists involve relational data. Thus a marriage of machine learning and relational databases seems inevitable. However, standard machine-learning algorithms are not designed to operate directly on relational data, and further, it is far from obvious if and how one can adapt many of these algorithms to work on relational data without suffering a significant loss of efficiency. The current standard practice for a data scientist, confronted with a machine-learning task on relational data, is to issue a feature-extraction query to extract the (carefully curated) data from the relational database by joining together multiple tables to create a design matrix, and then to import this design matrix into some machine-learning tool to train the model. This standard practice is wasteful because (1) computing relational joins is computationally expensive, both in terms of time and space, (2) the resulting design matrix will likely contain much redundant information and consume much more space than the original tables, and thus (3) the machine-learning task takes more time than should conceptually be necessary. Algorithms that are orders of magnitude faster for standard machine-learning problems on relational data are certain to exist, and the goal of this research program is to discover them. Such algorithms would allow the extraction of information from data that is now not currently feasibly extractable.

The research goals of this project are threefold. The first goal is to design and analyze relational algorithms for common machine-learning queries. A relational algorithm works directly on the relational data, without forming the design matrix, and can be orders of magnitude faster than standard machine-learning practice for such data. The second goal is to design a layer of relational algorithms for commonly arising subproblems and that can be utilized by the data scientists as a sort of middleware toolkit when designing their algorithms. The third goal is to build some intuition as to what problems are, and are not, solvable by relational algorithms and that researchers/practitioners can rely on when faced with a new problem. These goals will require the development of new algorithmic-design and -analysis techniques.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of Pittsburgh
United States
Zip Code