With the globalization of the economy, business data needs to be available twenty-four hours a day, seven days a week. Furthermore, in the event of a disaster, the data must be restored as quickly as possible to minimize the business's financial loss. Since today's Internet environment is truly distributed, data are copied and revised many times, so the data, or portions of it, are highly redundant. Storing, preserving, and managing this enormous amount of digital data at a reasonable cost has become very challenging. Data de-duplication supports many data-driven applications in our daily lives and is widely deployed to eliminate data redundancy so that huge volumes of data can be more easily managed and better preserved. However, a theoretical understanding of the problem is still largely missing. In this project, the PI plans first to investigate several fundamental issues of data de-duplication and then to use these new insights to design new algorithms for efficient data archiving and backup. The anticipated prototype system will be open source and made available to others. The proposed project will enhance the education process by bringing in input from industry, developing new courses at both the undergraduate and graduate levels, and emphasizing the diversity of the student population. The efficiency of data de-duplication has a great impact on both long-term data preservation and the ease of managing the existing huge volume of digital data. Many crucial applications, from large-scale simulation and modeling to electronic patient records to preserving and managing our personal data, depend on both preservation and management, enhancing the impact of this work.
Data deduplication is used to reduce data volume in backup systems. It first partitions data into chunks based on statistical techniques; if a chunk has appeared before, it is not stored again. However, an index of each unique chunk has to be maintained to verify whether an incoming chunk is new. Our newly designed data chunking algorithm can be shown to outperform the other existing chunking algorithms. In a recent paper, we have also proposed a new evaluation model to compare the performance of different data deduplication algorithms. We expect the proposed algorithm can be incorporated into existing data deduplication software to enhance its performance.

Reliability is another major issue for data deduplication, since multiple redundant copies of data are reduced to a single copy. Our work has enhanced our understanding of this issue: we have proposed new ways to measure reliability at deduplication time so that a demanded level of reliability can be guaranteed.

When data deduplication is used for primary storage, read performance becomes extremely important, whereas in the past data deduplication was concerned only with write performance. In our current study, we have shown why read performance can be degraded. We have also completed several further studies. One is an efficient way of using flash-memory-based solid-state drives (SSDs) to store the deduplication index as a key-value store: we have designed an efficient Bloom-filter-based index structure on SSD for data deduplication and shown that this design can support general key-value store applications as well.

We have further explored reliability-aware deduplication storage. Since the major goal of the original data deduplication lies in saving storage space, its design has focused primarily on improving write performance by removing as many duplicate chunks as possible from incoming data streams. Although fast recovery from a system crash relies mainly on the read performance provided by deduplication storage, little investigation into read performance improvement has been made. In general, as the amount of deduplicated data increases, write performance improves accordingly, whereas the associated read performance becomes worse. We therefore propose a deduplication scheme that assures the demanded read performance of each data stream while achieving write performance at a reasonable level, eventually being able to guarantee a target system recovery time. For this, we first propose an indicator called the cache-aware Chunk Fragmentation Level (CFL) that estimates degraded read performance on the fly by taking into account both incoming chunk information and read-cache effects. We also show a strong correlation between the CFL and read performance on backup datasets. To guarantee a demanded read performance expressed as a CFL value, we propose a read performance enhancement scheme called selective duplication that is activated whenever the current CFL becomes worse than the demanded one. The key idea is to judiciously write non-unique (shared) chunks into storage together with unique chunks unless the shared chunks already exhibit good enough spatial locality. We quantify the spatial locality by using a selective duplication threshold value. Illustrative sketches of the chunking-and-indexing pipeline, the Bloom-filter-fronted SSD index, and the CFL-guided selective duplication idea are given below.
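To make the chunking-and-indexing pipeline concrete, the following Python sketch shows generic content-defined chunking with a rolling hash and a fingerprint index that stores each unique chunk only once. The boundary test, chunk-size limits, and class names are illustrative assumptions; this is not the specific chunking algorithm proposed in this project.

    import hashlib
    import random

    random.seed(0)
    GEAR = [random.getrandbits(32) for _ in range(256)]   # fixed random per-byte table

    MASK = 0x1FFF                 # low 13 bits: a boundary fires roughly every 8 KiB
    MIN_CHUNK = 2 * 1024
    MAX_CHUNK = 64 * 1024

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets of content-defined chunks."""
        start, h = 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF          # Gear-style rolling hash:
            size = i - start + 1                           # the checked low bits depend
            if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
                yield start, i + 1                         # only on the last few bytes
                start, h = i + 1, 0
        if start < len(data):
            yield start, len(data)                         # trailing partial chunk

    class DedupStore:
        """Keep one copy of each unique chunk, keyed by its SHA-1 fingerprint."""
        def __init__(self):
            self.index = {}                                # fingerprint -> stored chunk

        def write_stream(self, data: bytes):
            recipe = []                                    # fingerprints to rebuild the stream
            for s, e in chunk_boundaries(data):
                chunk = data[s:e]
                fp = hashlib.sha1(chunk).hexdigest()
                if fp not in self.index:                   # unseen chunk: store it once
                    self.index[fp] = chunk
                recipe.append(fp)
            return recipe

        def read_stream(self, recipe):
            return b"".join(self.index[fp] for fp in recipe)

Writing the same backup stream twice with this sketch returns identical recipes while each unique chunk is stored only once; because chunk boundaries depend on content rather than fixed offsets, an insertion early in a file shifts only the chunks around the edit.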
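The next sketch illustrates the general idea behind an SSD-resident deduplication index fronted by an in-memory Bloom filter: lookups for fingerprints that have never been seen are answered from RAM without touching the SSD, and index updates are buffered and flushed in large batches, the access pattern flash handles best. All class names and parameters here are hypothetical, and the structure is a deliberate simplification of the index design developed in the project.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
            self.num_bits, self.num_hashes = num_bits, num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key: bytes):
            for i in range(self.num_hashes):
                digest = hashlib.sha1(key + bytes([i])).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, key: bytes):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key: bytes) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    class SSDBackedIndex:
        """Fingerprint -> chunk-location index with a RAM Bloom filter front end."""
        def __init__(self, flush_threshold: int = 1024):
            self.bloom = BloomFilter()
            self.write_buffer = {}            # recent insertions, still in RAM
            self.on_ssd = {}                  # stands in for index pages laid out on the SSD
            self.flush_threshold = flush_threshold

        def lookup(self, fingerprint: bytes):
            if not self.bloom.might_contain(fingerprint):
                return None                   # definitely new: no SSD access needed
            if fingerprint in self.write_buffer:
                return self.write_buffer[fingerprint]
            return self.on_ssd.get(fingerprint)   # possible false positive: check the SSD

        def insert(self, fingerprint: bytes, location):
            self.bloom.add(fingerprint)
            self.write_buffer[fingerprint] = location
            if len(self.write_buffer) >= self.flush_threshold:
                self.on_ssd.update(self.write_buffer)   # one large sequential flash write
                self.write_buffer.clear()

The design choice this sketch highlights is that most fingerprints in a backup stream of new data are misses, so screening them with a small in-memory filter avoids the bulk of random reads on the SSD, while batching insertions avoids small random writes that flash handles poorly.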
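Finally, the simplified sketch below shows how a cache-aware fragmentation indicator and a selective duplication decision could fit together: a monitor estimates read performance on the fly from container accesses under a small read cache, and a shared chunk is re-written next to the new data only when the estimated CFL falls below the demanded value and the existing copy shows poor spatial locality. The CFL formula, cache model, container size, and names such as demanded_cfl and locality_threshold are illustrative stand-ins for the definitions in our papers, not the exact scheme.

    from collections import OrderedDict

    CONTAINER_SIZE = 4 * 1024 * 1024      # 4 MiB containers, an illustrative choice

    class CFLMonitor:
        def __init__(self, cache_containers: int = 32):
            self.cache = OrderedDict()    # LRU cache of container ids (models the read cache)
            self.cache_containers = cache_containers
            self.sequential_bytes = 0     # bytes written for this stream
            self.container_reads = 0      # cache-missing container accesses on read-back

        def observe(self, chunk_len: int, container_id: int):
            self.sequential_bytes += chunk_len
            if container_id in self.cache:
                self.cache.move_to_end(container_id)        # cache hit: no extra read
            else:
                self.container_reads += 1
                self.cache[container_id] = True
                if len(self.cache) > self.cache_containers:
                    self.cache.popitem(last=False)           # evict least recently used

        def cfl(self) -> float:
            # Containers needed if the stream were stored sequentially, divided by
            # the containers actually touched; values near 1.0 mean little fragmentation.
            optimal = max(1, -(-self.sequential_bytes // CONTAINER_SIZE))
            return optimal / max(1, self.container_reads)

    def should_duplicate(monitor: CFLMonitor, demanded_cfl: float,
                         shared_chunk_container: int, open_container: int,
                         locality_threshold: int = 8) -> bool:
        """Re-write an already-stored (shared) chunk next to the new data only when
        read performance is falling short and the old copy shows poor locality."""
        if monitor.cfl() >= demanded_cfl:
            return False                                     # read speed still on target
        # Treat a shared chunk stored close to the currently open container as having
        # good enough spatial locality to be read cheaply later; only distant ones
        # are duplicated, trading some write performance for read performance.
        return abs(open_container - shared_chunk_container) > locality_threshold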
Our experiments with actual backup datasets demonstrate that the proposed scheme achieves the demanded read performance in most cases at a reasonable cost in write performance. We are currently working with EMC to validate some of the concepts we have proposed so far. It is also important to try out our proposed schemes on a large set of backup data provided by EMC.