Big data is not only voluminous but also varied. An individual data analysis task must first collect data sets that are often stored in unrelated locations and frequently in vastly disparate formats. These data sets almost always need several transformations before analysis can begin, such as format normalization, data cleansing, type checking, and outlier detection. These "data integration" activities typically consume an inordinate amount of time and effort, both on the part of the data analyst and on the part of the computing systems.
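As a rough software illustration of the data integration steps named above, the following Python sketch chains format normalization, data cleansing, type checking, and outlier detection over a handful of records. All function names, the `value` field, the placeholder strings, and the median-absolute-deviation outlier rule are assumptions chosen for illustration; they are not part of the project's design.

```python
from statistics import median

# Hypothetical placeholder strings treated as "missing" by the cleansing step.
PLACEHOLDERS = {"", "n/a", "null"}

def normalize(record):
    """Format normalization: trim whitespace and lower-case field names."""
    return {key.strip().lower(): value.strip() for key, value in record.items()}

def cleanse(record):
    """Data cleansing: drop fields whose value is empty or a placeholder."""
    return {k: v for k, v in record.items() if v.lower() not in PLACEHOLDERS}

def type_check(record, field="value"):
    """Type checking: return the field as a float, or None if missing or malformed."""
    try:
        return float(record[field])
    except (KeyError, ValueError):
        return None

def find_outliers(values, k=5.0):
    """Outlier detection: flag values more than k median absolute deviations from the median."""
    if not values:
        return []
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []
    return [v for v in values if abs(v - center) > k * mad]

def integrate(raw_records):
    """Run all four steps and return (clean_values, outliers)."""
    values = []
    for raw in raw_records:
        value = type_check(cleanse(normalize(raw)))
        if value is not None:
            values.append(value)
    outliers = find_outliers(values)
    clean = [v for v in values if v not in outliers]
    return clean, outliers

raw = [
    {" Value ": " 10.0 "},   # needs normalization
    {"VALUE": "11.0"},
    {"value": "n/a"},        # removed by cleansing
    {"value": "abc"},        # rejected by type checking
    {"value": "9.5"},
    {"value": "10.5"},
    {"value": "500"},        # flagged as an outlier
]
clean, outliers = integrate(raw)
print(clean, outliers)
```

Even in this toy form, every record passes through several small, data-dependent steps before any analysis can start, which is the kind of per-record overhead described above.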
This project will design, prototype, and evaluate an Application-Specific Instruction-set Processor (ASIP) that supports the concurrent execution of data integration workloads for multiple streams of big data. The ASIP will execute not just a single integration stream but a number of distinct data integration streams concurrently, each with its own processing requirements, enabling data from disparate sources to be utilized for analysis. Successful ASIP deployment will substantially increase the throughput (and therefore effectiveness) of big data analysis across a range of fields.
What is unique about the ASIP design is that not just the instruction set will be customized; the entire data path will be optimized for the data integration problem. Both very long instruction word (VLIW) and vector techniques will be used to expose and exploit parallelism. Complex transformations will be supported by a combination of customized engines and hardware virtualization. The optimization for data integration covers not only the computational data path but also the design of the memory subsystem, which will receive explicit attention. The project will extend super-optimization of memory subsystems from individual applications to the application class composed of data integration workflows.
The result should be a dramatic reduction in the overhead of preparing data for analysis.