Software engineers often cannot access confidential data, both because of internal security rules set by the organizations that own the data and because of laws that regulate data protection and privacy. This restriction complicates basic software engineering tasks such as testing. To grant some access to the required data, data owners typically use a commercial tool to anonymize or "sanitize" it. Unfortunately, none of the existing tools takes these tasks into account, so the anonymized data can be of little to no value to software engineers. As a result, software engineers currently work with little or no meaningful data, which is a major obstacle to creating high-quality software.
This research program addresses a fundamental question of software engineering: how can a data owner protect private information so that the data subjects (e.g., persons, equipment) cannot be re-identified while the data retains its efficacy for software engineering tasks? Preserving the usefulness of data for these tasks requires algorithms that take into account the structure of the applications. This work will lay a foundation for a new direction of research on the interaction between software engineering and data privacy, and the PIs will support it with a set of tools for low-cost software development and evolution.
Under this grant, we conducted research to address the fundamental problem of releasing real data to software testers while protecting sensitive information that can be inferred from the data. The core idea investigated is to link attributes of the database with the Database-Centric Application (DCA) that uses it. The results of this research program include fundamental theories on how to balance utility and privacy for software engineering tasks, so that stakeholders can preserve the utility of these tasks while achieving the desired level of privacy. The results also include practical implementations of these theories in privacy frameworks (the PI's work includes one such theory, which he has already implemented for the utility of software testing) and thorough evaluations of these frameworks against competing approaches. Our work is seminal in the area of data privacy at its intersection with software engineering, specifically software testing. The results have been published in top-tier software engineering conferences. All software developed as part of this research program is freely available on the web to other researchers and industry practitioners. One paper published under this grant won the best paper award at the IEEE International Symposium on Software Reliability Engineering (ISSRE'10), San Jose, CA, November 1-4, 2010.