The widespread practice of open source development is changing the IT industry in significant ways. Companies now treat open source as a strategy that contributes to a product's marketability. In Science and Engineering, open source has an established track record, and making source code available to everyone is now as important as making available the data that support scientific claims, since Science and Engineering rely more and more on software to substantiate those claims. Unfortunately, undocumented source code is as difficult to understand as raw, undocumented data; having it available without being able to understand it is of little benefit. Open source projects, in particular, are notorious for their lack of documentation: developers often lack the resources to produce artifacts beyond the code, so "the code is the documentation." This pervasive problem affects Science the most, because Science increasingly relies on software produced under slim budgets that leave no margin for documentation efforts.
This project seeks to automatically recover high-level knowledge from software artifacts in order to make software components understandable in the absence of documentation. Recovering such knowledge has been a long-sought goal of software engineering research, but the achievements so far have been limited. The approach taken here is to use machine learning techniques, which may finally start to produce usable solutions to this elusive problem. In pursuing this goal, the project will produce knowledge and tools related to open source projects. First, it will establish which relations among source code artifacts, and which kinds of relations, correlate with successful architecture recovery. Second, it will produce a catalog of unsupervised learning algorithms tailored for software component identification, which will be publicly available for others to use and study. Third, it will produce a benchmark of software architectures of projects from various domains. Fourth, it will produce a catalog that describes the artifacts of each project and the learning technique that best recovered its architecture. Finally, it will produce reusable implementations of (i) several component identification algorithms; and (ii) structural, behavioral, and domain feature extraction. The project will combine this knowledge and these tools in a plugin for Eclipse that supports automatic recovery of software architecture.
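To make the idea of unsupervised component identification concrete, the following is a minimal illustrative sketch, not the project's actual method. It assumes one possible structural feature (the set of files each source file depends on) and one possible learning technique (agglomerative clustering with average linkage over Jaccard similarity). The file names, the function names, and the threshold value are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of dependencies."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_similarity(c1, c2, features):
    """Average-linkage similarity: mean Jaccard score over all cross pairs."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(jaccard(features[x], features[y]) for x, y in pairs) / len(pairs)


def recover_components(features, threshold=0.3):
    """Merge the most similar clusters until no pair reaches the threshold.

    `features` maps each source file to the set of files it depends on
    (an assumed structural feature; others could be substituted).
    """
    clusters = [frozenset([name]) for name in sorted(features)]
    while len(clusters) > 1:
        best_score, (a, b) = max(
            ((cluster_similarity(x, y, features), (x, y))
             for x, y in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        if best_score < threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical dependency data for a small, undocumented project.
example_features = {
    "ui/Window.java": {"ui/Widget.java", "ui/Theme.java"},
    "ui/Dialog.java": {"ui/Widget.java", "ui/Theme.java", "ui/Window.java"},
    "ui/Button.java": {"ui/Widget.java", "ui/Theme.java"},
    "db/Query.java": {"db/Connection.java", "db/Schema.java"},
    "db/Table.java": {"db/Connection.java", "db/Schema.java"},
    "db/Index.java": {"db/Schema.java", "db/Connection.java", "db/Table.java"},
}

if __name__ == "__main__":
    for component in recover_components(example_features):
        print(component)
```

On this toy input the files fall into two groups, one for the user interface files and one for the database files, without any labels being supplied. A real study would compare several feature types and clustering algorithms against a benchmark of known architectures, as the project proposes.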