Record and document retention (document disposition) has become a serious problem for both organizations and individuals now that most documents are created digitally. Digital documents bring both advantages and problems: they are easily versioned, copied and disseminated, so several similar copies or versions of an important or relevant document can exist in many locations. Individuals, organizations and domains (such as law, science and policy) all need document or record disposition for effective information management over long periods of time. The problem is growing worldwide wherever effective record disposition is required by law, by the organization, or by practical limitations in systems.
This exploratory project investigates possible automatic document disposition methods based on algorithms for text inspection, mining, and search. The challenge lies in finding scalable, adaptable algorithms that can be used in several, if not all, application domains. Variability among users adds further difficulty: a disposition method or procedure may vary depending on the user, organization and domain (e.g., law or health records). The approach explored in this project applies and extends machine learning methods to these problems because such methods adapt to variability in data and domains. With such approaches, automated disposition methods can be readily applied to different areas such as science, email and legal records. This research lays the groundwork for adaptive methods for a variety of domains in terms of applicability, performance and scalability. This proof-of-concept project initially focuses on the publicly available Enron email data set, which is used to demonstrate the feasibility of the approach, since email can be considered a special case of document disposition. If the approach proves successful, other disposition domains such as science and government data will be explored. This work will show the viability of developing and applying machine learning methods to an important and diverse problem domain.
The results from this exploratory project, together with insights gathered from methods used in large-scale document search, are expected to yield an understanding of how to better manage our digital past and the rapidly expanding digital future. The results are also expected to introduce this important problem to other researchers and document disposition professionals and to lead to collaborations with industry. Data and research results will be made available through a public website (http://clgiles.ist.psu.edu/disposeseer/), and research papers will be published and presented in appropriate venues. The project provides research experience for graduate and undergraduate students.