Over the last two decades there has been explosive growth in the online storage of data in various forms. These large data sets have motivated the rapid development of data mining methods. Until now, however, there has been no online repository of large data sets on which researchers can evaluate and compare their methods. In this project, an online repository of large and difficult data sets is being gathered, chosen to be representative of the diverse character of many important scientific and business domains. The repository includes high-dimensional data sets as well as data sets of different data types (time series, spatial data, transaction data, and so forth). Its primary role is that of a benchmark testbed that enables researchers in data mining (including computer scientists, statisticians, engineers, and mathematicians) to scale existing and future data analysis algorithms to very large data sets. Each data set in the repository includes online documentation, metadata, and links to relevant background domain information such as prior published work. The availability of a standard set of large benchmark data sets will directly stimulate and foster systematic progress in data mining research, similar to the effect that the UCI Machine Learning Data Repository (www.ics.uci.edu/~mlearn/MLRepository.html) has had on machine learning research. The repository will play a substantial role in bridging the gap between research-oriented algorithm development in the laboratory and the real-world practicalities and challenges of very large data sets.