The amount of data generated during the development of today's software systems is staggering. It includes the source code, developer e-mails, bug information, testing results, analysis data, process information, requirements, etc. The size and complexity of this information make it impossible for developers to reason about it.
Data mining techniques are a common solution for extracting what is relevant to developers and managers. The success and quality of software projects depend on the software engineers' ability to customize generic data mining algorithms to specific software engineering data. This project will produce tools and techniques that allow software developers and managers to easily customize and apply data mining techniques to a variety of software engineering problems. Such solutions will become more practical and will help many existing approaches migrate from the research lab into industry. Students from underrepresented groups will participate in this research. The project will enhance the existing software engineering curriculum and facilitate the inclusion of data mining solutions in the repertoire of future software engineering practitioners and researchers.
Specifically, the project will improve the state-of-the-art solutions to three important software engineering tasks: concept location in software, software defect prediction, and development effort estimation. The project will produce an algorithm customization methodology and a framework that will be instantiated for a variety of combinations of data mining algorithm x software engineering task x software system data. The customization problem is framed and addressed as an optimization problem. The resulting customization agent will assist the software engineering user in efficiently selecting the best configuration, which consists of a set of algorithms and their parameter values, customized for a particular task and software system. All tools and methodologies will be empirically evaluated in academic and industrial settings.
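The framing of customization as an optimization problem can be illustrated with a minimal sketch. Everything named here is hypothetical and is not taken from the project: the search space `SPACE`, the learner names, the parameter ranges, and the scoring function `toy_score`, which stands in for an empirical evaluation of one configuration on one task and one software system. The sketch also assumes plain exhaustive search, whereas a real customization agent would likely use a more efficient optimizer.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Tuple

# A configuration pairs an algorithm name with concrete parameter values.
Config = Tuple[str, Dict[str, int]]
# A search space maps each algorithm to the candidate values of its parameters.
Space = Dict[str, Dict[str, List[int]]]


def enumerate_configs(space: Space) -> Iterator[Config]:
    """Yield every (algorithm, parameters) combination in the space."""
    for algo, params in space.items():
        names = sorted(params)
        for values in product(*(params[n] for n in names)):
            yield algo, dict(zip(names, values))


def best_config(space: Space,
                score: Callable[[str, Dict[str, int]], float]) -> Tuple[Config, float]:
    """Exhaustively score every configuration and return the best one."""
    scored = ((cfg, score(*cfg)) for cfg in enumerate_configs(space))
    return max(scored, key=lambda pair: pair[1])


# Hypothetical search space: two learners with one tunable parameter each.
SPACE: Space = {
    "knn": {"k": [1, 3, 5]},
    "tree": {"depth": [2, 4]},
}


def toy_score(algo: str, params: Dict[str, int]) -> float:
    """Made-up objective; a real one would measure task performance
    (for example, defect prediction accuracy) on the target system's data."""
    if algo == "knn":
        return 1 - abs(params["k"] - 3) / 10
    return 0.5 + params["depth"] / 10


if __name__ == "__main__":
    config, value = best_config(SPACE, toy_score)
    print(config, value)
```

Replacing `toy_score` with a function that trains and evaluates a model on local project data would turn the same loop into a per-system configuration search, at the cost of one evaluation per candidate configuration.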
Intellectual Merit. Many software engineering problems require data mining solutions. Customizing these data mining algorithms is a difficult challenge, and experience has shown that the process is both costly and confusing. Without a well-defined methodology that allows for effective customization to specific software engineering applications, research in the field is limited to proof-of-concept efforts. This problem is one of the reasons many data mining solutions fail to migrate from research labs to industry and government applications. This project offers a unique solution to the problem of tailoring general-purpose algorithms to local conditions. We have shown that it is better to develop models that use software data in specific contexts than to develop models intended to work for any software data. As a consequence, the project resulted in improved techniques that directly support defect prediction in software, development effort estimation, software refactoring, and concept location in software. In addition, the project showed that combining local information from a range of sources (such as text and software metrics) results in better solutions for software engineering problems such as concept location in software, refactoring, and defect prediction.
Broader Impacts. For the research community, this project has spearheaded a novel research area in software engineering, in which researchers now focus on finding theories and prediction models that are based on local data (that is, data specific to a software system or to just one part of it). These theories and models will be used to address software engineering problems not covered in this project. The project resulted in new and improved solutions and tool support for software engineering problems such as effort estimation, concept location in software, refactoring, and defect prediction. These are available to other researchers and practitioners.
In addition, the project resulted in substantial data used in empirical evaluations, which is also shared with the research community. As for other broader impacts, this project has allowed the project investigators to extend advanced research training to students from traditionally under-represented areas. This project trained four Ph.D. students who completed their doctoral degrees (two female students and one from an acutely economically depressed region of central Pennsylvania). Further, three more graduate students are working towards their degrees (one female student and one Latino student). As to other training possibilities, based on the experience of this project, the principal investigators have extensively revised their training materials. Through the use of those new materials in our teaching, 100+ graduate students (master's and Ph.D.) now have a better understanding of the leading edge of research in software engineering.