Verifying that web and mobile applications will protect user privacy requires knowledge of which kinds of data and data practices are sensitive to users. Privacy impact assessments are standardized procedures that companies and government agencies use to identify what personal information is collected, how and for what purpose it is used, with whom it is shared, and what steps are taken to protect that information. Conducting privacy impact assessments on applications is time-consuming, because evaluators often have limited knowledge of the software's behavior, and the assessments are often performed after the software has been constructed, which is costly. Because developers are under pressure to continuously release new application versions, they have little time for extensive documentation of their data practices. Today, the status quo in documenting privacy is the privacy policy, which regulators increasingly check for data practice misrepresentations during the application's lifetime. This project seeks to develop methods and tools to automatically and quickly conduct privacy impact assessments from software artifacts, called user stories, that are easier for developers to produce. Based on a risk assessment informed by which data practices are most sensitive to users, developers can prioritize where best to introduce the privacy controls that users want. Furthermore, by conducting risk assessments from user stories, regulators and developers would have greater assurance that assessments accurately reflect current app behavior. Finally, these assessments save developer time, because a change to a user story could trigger an automatic re-assessment that alerts the developer to changes in privacy risk. This research is transformative because it allows software developers to respond to changes in privacy risk at design time, when important safeguards can be introduced, rather than waiting for lengthier impact assessments that are harder to integrate after the software has been constructed.

The project investigates the symbolic and statistical relationships among agile requirements, privacy risk, and privacy policies. The research explores strategies for scoring user stories for privacy risk and for prioritizing which stories are most important to users' privacy comprehension. The components of the solution will be investigated as follows: (1) corpora of user stories and privacy policies expressed in natural language will be acquired and annotated using coding theory; (2) semantic frames and an ontology expressed in Description Logic will be extracted from the corpora using entity and relation extraction; and (3) risk scores will be collected using privacy risk surveys that measure how users perceive privacy risk under different scenarios derived from user stories and mitigations. A key obstacle to effectively scoring risk is the inherent ambiguity and vagueness of natural language; the semantic frames and ontology will be used to encode and resolve ambiguity and vagueness in the scenarios. Furthermore, the survey results will be used to model changes in risk due to selected mitigations, so that developers can explore the local design space around a specific user story and the available mitigation choices, as illustrated in the sketch below.
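To make the risk-scoring idea concrete, the following Python sketch shows one way survey-derived sensitivity weights and mitigation discounts could combine into a per-story risk score. This is a minimal illustration, not the project's actual models: the data types, weights, mitigation factors, and the naive lexicon matcher are all hypothetical placeholders standing in for the entity and relation extraction and survey-based estimates described above.

# Minimal sketch, assuming hypothetical weights: scoring a user story
# for privacy risk. All names and numbers below are illustrative.
from dataclasses import dataclass

# Hypothetical sensitivity weights (0 to 1), as might be estimated from
# privacy risk surveys over scenarios derived from user stories.
SENSITIVITY = {
    "location": 0.9,
    "contact list": 0.8,
    "email address": 0.6,
    "usage statistics": 0.3,
}

# Hypothetical multiplicative risk reductions for selected mitigations.
MITIGATION_FACTOR = {
    "on-device processing": 0.4,
    "data minimization": 0.5,
    "explicit consent": 0.7,
}

@dataclass
class UserStory:
    text: str

def extract_data_types(story):
    """Naive lexicon match standing in for entity and relation extraction."""
    lowered = story.text.lower()
    return [dt for dt in SENSITIVITY if dt in lowered]

def risk_score(story, mitigations=()):
    """Sum per-data-type sensitivities, then discount by each mitigation."""
    score = sum(SENSITIVITY[dt] for dt in extract_data_types(story))
    for m in mitigations:
        score *= MITIGATION_FACTOR.get(m, 1.0)
    return round(score, 2)

story = UserStory("As a user, I want the app to share my location and "
                  "contact list so my friends can find me.")
print(risk_score(story))                            # 1.7 (unmitigated)
print(risk_score(story, ["on-device processing"]))  # 0.68 (mitigated)

Under this scheme, a change to the story text or to the chosen mitigations changes the score immediately, which is the property that the automatic re-assessment described above relies on.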

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Budget Start: 2020-10-01
Budget End: 2023-09-30
Fiscal Year: 2020
Total Cost: $498,221
Name: Carnegie-Mellon University
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213