This Small Business Innovation Research (SBIR) Phase II project will apply multithreading techniques to provide multi-terabyte (and larger) high-performance databases in MySQL. The company has developed a high-performance storage engine for MySQL, which maintains indexes on live data 100 times faster than current commonly-used structures. The technology solves the problem of maintaining indexes on large databases in the face of high trickle-load indexing rates. In Phase I, the company developed a multithreaded bulk loader to solve the problem of how to load data quickly. The next significant research problems for large MySQL databases are to allow online, or "hot", schema changes in which, for example, an index can be added without taking the database down, and to use multithreading to speed up joins and reductions so that the large data sets can be queried quickly. In this project, the researchers will investigate the use of multithreading to support hot indexing and parallel joins and reductions.
If successful, multi-terabyte and larger databases will be manageable and fast on modest hardware, and the hardware will be scalable both with CPU cores and disks. The broader impact of this work is driven by faster, cheaper, lower-power on-disk storage. Organizations that have very large databases will be able to use much less hardware, both saving money and reducing power consumption significantly. Currently, many application areas do not employ databases because their performance is too slow. Speeding up databases by two orders of magnitude can help grow the market. Currently, many organizations fail to make good use of the data they have collected because they cannot manage it, index it, or query it fast enough to be useful. Applications in finance, retail, homeland security, telecommunications, and scientific computing will benefit from improved manageability and performance. As users' appetite for data continues to outstrip the availability of fast memory, organizing multithreaded queries on disk-based data for performance will continue to grow in importance.
This Small Business Innovation Research (SBIR) Phase II project applied novel multithreading techniques to provide multi-terabyte (and larger) high-performance databases in MySQL. Using traditional B-tree indexes, MySQL becomes unmanageable beyond about 500 GB. Tokutek has developed TokuDB, a high-performance storage engine for MySQL, which maintains indexes on live data 100 times faster than B-trees. TokuDB solves the problem of maintaining indexes on large databases in the face of high trickle-load indexing rates. In Phase I, Tokutek developed a multithreaded bulk loader using Cilk to solve the problem of how to load data quickly. The next big research problems for large MySQL databases were to allow online, or "hot", schema changes in which, for example, an index can be added without taking the database down, and to use multithreading to speed up joins and reductions so that the large data sets can be queried quickly. In this project, the researchers investigated the use of multithreading to support hot indexing (which employed new fractal-tree-specific techniques), and improved multithreaded query performance. As a result of this research, multi-terabyte and larger databases are more manageable and faster on modest hardware, and the performance scales better both with CPU cores and disks. The broader impact of this work is driven by faster, cheaper, lower-power on-disk storage. Organizations that have very large databases can use much less hardware, both saving money and reducing power consumption significantly. Many application areas do not employ databases because their performance is too slow. Speeding up databases by two orders of magnitude can help grow the market by additional billions of dollars per year. Currently, many organizations fail to make good use of the data they have collected because they cannot manage it, index it, or query it fast enough to be useful.
Applications in finance, retail, homeland security, telecommunications, and scientific computing will benefit from improved manageability and performance. The research has also furthered our basic scientific understanding of how to lay out data on disk and perform queries for a wide variety of data-intensive problems. As users' appetite for data continues to outstrip the availability of fast memory, organizing multithreaded queries on disk-based data for performance will continue to grow in importance.