Software comprises a multitude of artifacts; some are intended to be read by the compiler, while many others are intended to be read by developers. This user-centric information is often expressed in natural language, and it is usually much larger in size than the source code. Given the amount of unstructured data present in existing software systems, tools are necessary for its storage, retrieval, and analysis before it is delivered to users. This type of information is essential during software evolution, when developers need it to understand the software.
The research will define and evaluate an infrastructure for managing the textual information present in software systems. The infrastructure will combine Information Retrieval techniques with statistical and rule-based Natural Language Processing methods and other text analysis techniques. It will be used to define a new type of conceptual model of a software system, one that complements the structural, behavioral, and architectural models that can be extracted and built with traditional analysis and modeling methods. The new conceptual model and infrastructure will in turn be used to define novel methodologies and build tools that support a variety of software evolution tasks, such as change propagation, traceability link recovery between software artifacts, error and change prediction, quality measurement, concept location, refactoring, and program comprehension in general. The planned infrastructure will offer a platform for researchers from different areas of computer science (such as software engineering and computational linguistics) to use state-of-the-art results from each other's fields. The empirical work will produce a repository of software artifacts and of analysis data on the textual information in software, which will support rigorously controlled experimentation and benchmarking in the research community.