Much of the internet's business is conducted with dynamically generated documents (HTML pages, SQL queries, and execution scripts) that are produced on the fly by document-generator scripts written in PHP, JavaScript, and JSP. This situation threatens internet security because document-generator scripts are often faulty and emit malformed documents that attackers can exploit. To remove this vulnerability, a new approach, abstract parsing, is developed and applied to ensure, in advance of execution, that every dynamically generated document emitted by a script is grammatically well formed (spelled correctly). The technique also predicts the range of semantics (intended meanings) of the generated documents, to help prevent attacker exploitation. The impact of the research lies in its tools and methodologies, which help programmers assemble a more secure internet.
The technical approach starts from a document-generator script and a context-free reference grammar for the document language. An LALR(1) parser is generated from the grammar and applied within a data-flow analysis of the script to predict the grammatical structure of the string documents that the script will generate. The analysis computes abstract LALR parse stacks that overapproximate the grammatical structure of those documents. Attribute-grammar machinery predicts the documents' context-sensitive and semantic properties. The technology is applied to (i) generate a semantically aware implementation of taint analysis; (ii) implement automata-defined filters for dynamic string updates; and (iii) combine abstract-interpretation technology with abstract parsing to analyze a wider class of program constructions, in particular dictionary data structures.
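
The idea of running a parser inside a data-flow analysis can be illustrated with a minimal sketch. It is not the project's implementation: a hand-written bracket-matching pushdown recognizer stands in for the generated LALR(1) parser, and the script language (Emit, If, Loop), the depth bound MAX_DEPTH, and all function names are hypothetical, chosen only for illustration. The abstract value at each program point is the set of parser stacks that some execution of the script could have produced so far.

```python
from dataclasses import dataclass

ERROR = ("<error>",)          # sentinel stack: some execution emits malformed text
PAIRS = {")": "(", "]": "["}  # closing bracket -> matching opening bracket
MAX_DEPTH = 8                 # assumed bound that keeps the set of stacks finite


def step(stack, ch):
    """One concrete parser step on one emitted character."""
    if stack == ERROR:
        return ERROR
    if ch in "([":
        # Exceeding the bound is reported conservatively as an error.
        return stack + (ch,) if len(stack) < MAX_DEPTH else ERROR
    if ch in PAIRS:
        if stack and stack[-1] == PAIRS[ch]:
            return stack[:-1]
        return ERROR
    return stack              # any other character is treated as an atom


def feed(stacks, text):
    """Abstract transfer function: advance every possible stack over `text`."""
    out = set()
    for s in stacks:
        for ch in text:
            s = step(s, ch)
        out.add(s)
    return frozenset(out)


@dataclass(frozen=True)
class Emit:                   # append a string literal to the document
    text: str


@dataclass(frozen=True)
class If:                     # branch on a condition the analysis cannot see
    then: tuple
    orelse: tuple = ()


@dataclass(frozen=True)
class Loop:                   # repeat the body zero or more times
    body: tuple


def analyze(stmts, stacks):
    """Propagate the set of possible parser stacks through the script."""
    for st in stmts:
        if isinstance(st, Emit):
            stacks = feed(stacks, st.text)
        elif isinstance(st, If):
            # Join: either branch may run, so keep the stacks of both.
            stacks = analyze(st.then, stacks) | analyze(st.orelse, stacks)
        elif isinstance(st, Loop):
            # Fixed point: add stacks until another iteration adds nothing.
            while True:
                nxt = stacks | analyze(st.body, stacks)
                if nxt == stacks:
                    break
                stacks = nxt
    return stacks


def well_formed(script):
    """True only if every execution ends with an empty parser stack."""
    return analyze(script, frozenset({()})) == frozenset({()})


if __name__ == "__main__":
    script = (Emit("("), If((Emit("a"),), (Emit("[b]"),)), Emit(")"))
    print(well_formed(script))
```

The script in the usage example is accepted because both branches leave the parser in the same stack before the closing bracket is emitted. A script whose branches disagree, such as one that closes the bracket on only one path, yields two distinct final stacks and is rejected. The depth bound is a crude substitute for a proper treatment of recursion: it guarantees termination at the cost of rejecting some well-formed but deeply nested output.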