Computational biology researchers often need to reconstruct an unobserved phenomenon from available data, such as the unknown evolutionary tree for a set of species, the unknown evolutionary alignment of related protein sequences, or the unseen folded state of an RNA molecule. Finding the reconstruction is usually modeled as an optimization problem, whose optimal solution is intended to correspond to the correct reconstruction. A key ingredient in such a model is the criterion that is being optimized, which is captured by the objective function. This function often has many free parameters, and the choice of values for these parameters critically affects whether the correct reconstruction is actually an optimal solution. This project will build a general software system that solves two complementary problems: (1) parameter inference, which learns optimal values for the parameters of the objective function for classes of inputs, given examples of correct reconstructions; and (2) parameter advising, which recommends good values of the parameters for a specific input. Consider a developer and a user of a software tool that solves a reconstruction problem: parameter inference finds the best default values for the tool developer, while parameter advising finds good specific values for the particular input being run by the tool user.
Parameter inference will be tackled via a recent algorithmic breakthrough in inverse parametric optimization. Given examples of input-output pairs for a reconstruction problem, values for the parameters of the objective function that make each output be an optimal solution for its input can be found in polynomial time, as long as: (a) the objective function is linear in the parameters, and (b) the reconstruction problem is efficiently solvable when parameter values are fixed. Parameter advising will be tackled by: (1) learning a polynomial that combines feature functions into an estimator for the accuracy of a reconstruction, and (2) selecting the parameter value from a set of choices that yields the reconstruction of highest estimated accuracy. An optimal parameter set that maximizes the average true accuracy of the advisor will be found by integer linear programming.
The general software system for parameter inference and advising will be made freely available for researchers as open source through the Internet. Outreach educational activities include conducting a summer workshop for high school biology and mathematics teachers to develop bioinformatics curriculum modules for their classrooms, and mentoring economically disadvantaged and minority undergraduate students through the McNair Scholars program. The tools developed by the project are poised to have very broad impact throughout computer science, not just computational biology, as the most widely-used optimization models, such as shortest paths, minimum spanning trees, maximum matchings, and network flow, all have linear objective functions, to which the parameter inference techniques apply.