This research explores application development and optimization for cloud platforms by developing: 1) cloud-based parallelization for data-intensive graph algorithms, 2) a framework for efficient scheduling and execution of applications in a heterogeneous cloud environment, and 3) a hierarchical programming abstraction for specifying parallelism. The work investigates and adapts a wealth of techniques from traditional parallel computing to graph problems, based on a performance model of the cloud, and explores strategies for scheduling and load balancing applications on the cloud. These include centralized and distributed approaches to scheduling, work stealing, and work sharing. Methodologies for evaluating the framework on applications that involve data-intensive graph computations are being developed. The broader impact of this project includes addressing key challenges in the areas of application mapping and performance optimization. The research makes it easier to develop data-intensive graph applications across public and private clouds. The developed software will be released as free and open source software to the community, making it possible for researchers and engineers in academia and industry to leverage this work and develop applications for the cloud. Graph problems and streaming applications arising in the area of energy informatics are considered to illustrate the techniques.
In the age of cyber-social-physical networks, there is tremendous value in data integration and association. The emergence of social and mobile computing and the rapid adoption of near-ubiquitous sensors that monitor and transmit data about physical objects have led to applications handling data at a scale never seen before. This has made the task of meaningful data integration, or "connecting the dots", a challenge. It is in this context that cloud computing, with the promise of on-demand scalability and virtually unlimited resources, has emerged as a powerful platform for developing and deploying applications that scale to meet the data challenge. Despite its significant advantages, the adoption of cloud computing as a computing paradigm has been impeded by several open challenges: (1) developing cloud-based parallelization techniques for data-intensive applications, (2) providing a simple programming abstraction for specifying parallelism, (3) building a framework for efficient scheduling and execution of applications in a heterogeneous private-public cloud environment, and (4) ensuring data privacy and security. The project addressed these challenges by developing and implementing architectures that demonstrated the feasibility, performance improvements, and cost benefits of using cloud infrastructures. These scalable platforms were used to handle large-scale problems such as graph algorithms, which form the basis of social network analysis, and large-scale machine learning models. We evaluated our general-purpose graph processing framework on public cloud platforms, developed resource allocation strategies for hybrid cloud computing environments, modeled application dynamism to enable elasticity in continuous workflows, and benchmarked the system on a practical real-world application: enabling dynamic demand response in a Smart Power Grid cyber-physical system using cloud technologies.
Our findings extend beyond the field of computer science and engineering. The architectures and algorithms we proposed can be used in many disciplines. For example, social science researchers can use them to analyze social graphs and thereby study human behavior. We also built a cloud-based software platform for data-driven smart grid analytics that will be part of the Los Angeles Smart Grid Project. The graph algorithms that we used to evaluate the performance of our proposed systems form the basis for social network analysis. Our software products are open source and can be freely used and extended by the research community for specific needs. We have implemented three open source systems: "Pillcrow" for scalable graph analysis, "OpenPlanet" for large-scale machine learning, and "Cryptonite" for a secure data repository. Of these, Pillcrow and Cryptonite were implemented on the Microsoft Azure platform, and OpenPlanet was implemented using Hadoop, an open source implementation of the MapReduce programming model.