Recent years have seen the emergence of large-scale data clusters consisting of hundreds of thousands of hard drives for data-intensive computing. Because the failure of a disk can cause the loss of valuable data, the direct and indirect costs of increasingly frequent disk failures are becoming critical issues in the deployment and operation of these computational platforms. Recent studies conclude that the trend in data-intensive computing is toward much higher disk replacement rates in the field than the Mean-Time-To-Failure estimates in manufacturer datasheets would suggest, even in the first years of operation. Hence, recovery is becoming a common operating state of large-scale storage systems.
Existing storage recovery solutions have successfully developed optimal and near-optimal parallelism layouts, such as declustered parity organizations, for small-scale storage architectures. However, there are very few studies on multi-way replication-based storage architectures, which are equally important but significantly different from erasure-code-based storage architectures. Moreover, current placement-ideal solutions are difficult to scale up to large sizes because they admit only a limited number of configurations. Lastly, fast recovery demands efficient reverse data lookup, which is not well studied in current scalable data distribution schemes. The investigators develop methods and tools for achieving fast recovery by exploiting optimal and near-optimal parallelism techniques, together with distributed hash table and reverse hashing techniques, to improve the scalability of reverse data lookup in high-performance storage systems. The proposed research, if successful, will have broad impact in both the fault-tolerant computing and high-performance computing communities by providing a scalable and fast storage recovery solution.
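To illustrate why reverse data lookup matters for recovery, the following Python sketch contrasts a deterministic, formula-driven layout with random placement. The placement formula and function names here are purely illustrative assumptions, not the project's actual scheme: under random placement a system must consult a central metadata table to learn what a failed disk held, whereas a deterministic layout lets the recovery process derive that list from the placement function itself (a real group-based scheme would invert the formula analytically rather than enumerate, as done here for brevity).

```python
def replica_disks(unit, n_disks, n_replicas=3):
    """Deterministic shifted placement (simplified; assumes n_disks is odd).

    Replica r of `unit` lands on (unit + r * offset) mod n_disks, with a
    per-unit offset so that no two replicas of a unit share a disk.
    """
    offset = 1 + unit % (n_disks - 1)  # per-unit shift, never zero
    return [(unit + r * offset) % n_disks for r in range(n_replicas)]

def units_on_failed_disk(failed, n_units, n_disks):
    """Reverse lookup: recover a failed disk's unit list from the formula.

    No per-object metadata is consulted; a random-placement store
    (e.g., a GFS-style design) would need a metadata scan instead.
    """
    return [u for u in range(n_units)
            if failed in replica_disks(u, n_disks)]

# Example: with 7 disks and 10 data units, the units that had a replica
# on failed disk 0 can be reconstructed directly from the layout.
recovery_set = units_on_failed_disk(0, n_units=10, n_disks=7)
```

Because the mapping is deterministic, every surviving node can compute the same recovery set independently, which is what makes the lookup scalable.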
First, our theoretical proofs and comprehensive simulation results show that the proposed shifted declustering data layout scheme is superior in performance and load balancing to traditional replication layout schemes.
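To make the layout idea concrete, here is a minimal Python sketch of a shifted-declustering-style replica placement. The formula is a simplified stand-in for illustration, not the scheme proved and evaluated in this project; it assumes an odd number of disks so the per-unit shift never maps two replicas of the same unit to the same disk.

```python
def replica_disks(unit, n_disks, n_replicas=3):
    """Place the replicas of `unit` deterministically across `n_disks`.

    Illustrative rule: replica r lands on (unit + r * offset) mod n_disks,
    where the offset varies per unit. Different units therefore pair up
    different disks, so when one disk fails its rebuild traffic is spread
    across many surviving disks instead of a single mirror partner.
    Assumes n_disks is odd so the replicas stay distinct.
    """
    offset = 1 + unit % (n_disks - 1)  # per-unit shift in 1..n_disks-1
    return [(unit + r * offset) % n_disks for r in range(n_replicas)]

# Example on 7 disks: consecutive units fan their replicas out over
# different disk sets, which is the source of the recovery parallelism.
layout = [replica_disks(u, 7) for u in range(5)]
```

The load-balancing claim above corresponds to the property that, over many units, each surviving disk holds roughly the same number of replicas of any failed disk's data.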
Second, the proposed shifted declustering data layout scheme outperforms current layouts in multi-way replication storage systems in terms of mean time to data loss (MTTDL). Third, our mathematical proofs and real-life experiments show that the group-based shifted declustering data layout scheme realizes scalable reverse lookup that is up to one order of magnitude faster than existing schemes such as Google's random placement solution. Moreover, this research yields indirect key outcomes for industrial impact and cost savings in super data clusters and data-intensive computing centers, as well as environmental benefits in fault-tolerant and green computing. The outcomes include: (i) new storage system organizations and architectures for high RAS (Reliability, Availability, Serviceability), high performance, and high energy efficiency; (ii) a broad-based fundamental understanding of the relationships and tradeoffs among fault-tolerance and performance control techniques; and (iii) models, algorithms, methods, and tools to analyze and control the reliability and performance of large-scale data-intensive computational systems. In addition, several graduate and undergraduate students, including minority representatives such as female and/or Hispanic students, have been trained.