Center for Experimental Research in Computer Systems Proposal #1127904
This proposal seeks funding for the Center for Experimental Research in Computer Systems at the Georgia Institute of Technology. Funding requests for fundamental research are authorized by an NSF-approved solicitation, NSF 10-601, which invites I/UCRCs to submit proposals for the support of industry-defined fundamental research.
While cloud computing is rising in importance for numerous applications, there remains a fundamental lack of understanding of the performance achievable under different configurations, especially for N-tier applications common in such areas as e-commerce and social networking. The proposed research will systematically design large-scale experiments from which performance data will be derived and performance metrics established for N-tier applications. The resulting large data sets can enable researchers to explore means of achieving optimal allocation of hardware and software resources for specific applications. The proposed comparative experimental study will enable the development of models through which N-tier application performance can be predicted, and as such holds the opportunity for significant breakthroughs in the understanding of cloud performance for this class of problems.
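To make the notion of systematic experiment design concrete, the following minimal sketch enumerates a configuration space by sweeping the number of application servers, the number of database servers, and the offered workload. The class name, parameter ranges, and workload levels are purely illustrative assumptions for this sketch, not part of the proposal's actual design; each resulting descriptor would drive one automated deploy-run-collect cycle, and the collected measurements would form the comparative data set.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of systematic experiment enumeration for N-tier
// performance studies. All parameter ranges below are assumptions.
public class ExperimentPlan {

    // One point in the configuration space to be benchmarked.
    record Config(int appServers, int dbServers, int concurrentUsers) {}

    public static void main(String[] args) {
        int[] appServerCounts = {1, 2, 4};          // hypothetical range
        int[] dbServerCounts  = {1, 2, 4};          // hypothetical range
        int[] workloads       = {1000, 2000, 4000}; // concurrent users (assumed)

        // Cartesian product of the dimensions: one experiment per combination.
        List<Config> plan = new ArrayList<>();
        for (int app : appServerCounts)
            for (int db : dbServerCounts)
                for (int users : workloads)
                    plan.add(new Config(app, db, users));

        // Each descriptor would drive one deploy-run-collect cycle.
        plan.forEach(c -> System.out.printf("run: app=%d db=%d users=%d%n",
            c.appServers(), c.dbServers(), c.concurrentUsers()));
    }
}
```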
The proposed research has the potential to seed the development of tools with which industry providers of cloud resources can better manage their resources and offer services in a cost-effective way. This optimization can additionally be applied to the achievement of Green IT goals. The work is well supported by individual industry members of the center and has the potential to extend the center's portfolio by virtue of the many studies and modeling efforts achievable using the data set generated by this study. Beyond the center, the data set, if properly designed, has the potential for broad impact in the research community as a resource for studies in this area. The proposal furthermore provides a solid plan for student and UREP involvement.
Cloud computing infrastructures (data centers) have become an important building block, or even the foundation, of mission-critical interactive applications such as e-commerce and customer relationship management. These mission-critical applications have stringent quality of service requirements (e.g., predictable response time) despite fluctuations in workload. A large-scale shared infrastructure such as a cloud can be beneficial in two ways. First, clouds offer sufficient aggregate resources to support the scale-up and scale-out of mission-critical n-tier applications when bursty workloads demand a sudden increase in capacity. Second, virtualized cloud environments support the consolidation of applications to increase the utilization of shared nodes when workloads decrease. Despite these potential advantages, however, data centers have in practice seen very low average utilization, around 18%. In this project, we have found several non-trivial reasons for mission-critical application providers to keep utilization low in a shared cloud environment, even though there are strong economic incentives to increase it.

The first reason is the significant impact of software resource allocations (e.g., the number of threads and database connections) on the performance of n-tier applications. Specifically, the best software resource setting for one hardware configuration (e.g., two application servers and two database servers) can become seriously sub-optimal for other configurations (e.g., four application servers and four database servers); the first sketch below illustrates this shifting optimum. These complications lead service providers to keep system utilization low to prevent potential bottlenecks from arising.

The second reason is the significant impact of transient bottlenecks on the response time of n-tier applications, a problem well known (anecdotally) in industry but lacking solid scientific documentation and understanding. Transient bottlenecks last only a very short time (tens to hundreds of milliseconds), but can cause very wide response time variations (up to several seconds) through server queue overflow, dropped network packets, and packet retransmissions. These system-wide queuing effects arise from the high request arrival rates (thousands to tens of thousands per second) during typical workload bursts. Simply lengthening the queues is no remedy: complex interactions among the n-tier system components would make performance worse in normal situations. Consequently, the transient bottleneck problem introduces non-trivial trade-offs that also cause service providers to keep system utilization low.

In our project, we have developed software techniques and tools to monitor n-tier systems in cloud environments, analyze system performance, find transient bottlenecks (the second sketch below illustrates the fine-grained monitoring idea), and optimize system configuration to minimize or avoid the impact of transient bottlenecks. These results can help cloud providers increase the utilization (and return on investment) of their data centers and improve the quality of service of n-tier applications for web users.

An integral aspect of the research supported by this grant is the continued collaboration with industry and international partners. We have been working closely with Fujitsu partners, with joint publications and patent applications. Ongoing collaborations include HP Labs, Intel, VMware, Xerox Research, and Oracle, as well as international collaborations with university and industrial partners in Germany, China, and Taiwan.
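The shifting optimum of software resource settings can be illustrated with a toy model. The throughput formula, per-thread rates, and database limits below are assumptions made for this sketch, not measured values from the project: threads add useful capacity until the database tier saturates, after which each extra thread only adds contention, so the best pool size moves with the hardware configuration.

```java
// Toy model of why the best thread pool size shifts with the hardware
// configuration. All numeric constants are illustrative assumptions.
public class SoftResourceSketch {

    // Hypothetical throughput model: threads add capacity up to the
    // downstream (database) limit; beyond that, extra threads only
    // contribute contention overhead.
    static double throughput(int threads, double dbLimit) {
        double offered = threads * 10.0;                       // 10 req/s per thread (assumed)
        double useful  = Math.min(offered, dbLimit);
        double penalty = Math.max(0, offered - dbLimit) * 0.2; // contention cost (assumed)
        return useful - penalty;
    }

    // Exhaustive sweep for the pool size that maximizes modeled throughput.
    static int bestThreads(double dbLimit) {
        int best = 1;
        for (int t = 1; t <= 400; t++)
            if (throughput(t, dbLimit) > throughput(best, dbLimit)) best = t;
        return best;
    }

    public static void main(String[] args) {
        // Doubling the database tier's capacity doubles the optimal pool
        // size, so a setting tuned for one configuration is sub-optimal
        // when carried over to the other.
        System.out.println("best threads, 2 DB servers: " + bestThreads(1000.0));
        System.out.println("best threads, 4 DB servers: " + bestThreads(2000.0));
    }
}
```

Running the sketch reports a best pool size of 100 threads for the smaller configuration and 200 for the larger one, so carrying either setting over to the other configuration leaves it seriously sub-optimal.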
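The fine-grained monitoring idea can likewise be sketched. The trace below is synthetic, and the 50 ms window and saturation threshold are assumptions; the point, following the description above, is that a coarse average (here about 20% utilization over 10 seconds) hides brief saturation episodes that fine-grained sampling exposes as transient bottlenecks.

```java
import java.util.Random;

// Sketch of fine-grained monitoring for transient bottlenecks, on a
// synthetic utilization trace sampled in 50 ms windows (assumed).
public class TransientBottleneckSketch {

    public static void main(String[] args) {
        Random rng = new Random(42);
        int windows = 200;            // 200 x 50 ms = 10 s of samples
        double[] util = new double[windows];

        // Synthetic trace: low baseline load with two short bursts that
        // saturate the server for 150 ms and 100 ms respectively.
        for (int i = 0; i < windows; i++)
            util[i] = 0.15 + 0.05 * rng.nextDouble();
        for (int i = 40; i < 43; i++)   util[i] = 1.0; // 150 ms burst
        for (int i = 120; i < 122; i++) util[i] = 1.0; // 100 ms burst

        double avg = 0;
        for (double u : util) avg += u;
        avg /= windows;

        // Coarse view: the 10 s average looks safely low.
        System.out.printf("10 s average utilization: %.0f%%%n", avg * 100);

        // Fine-grained view: flag each saturated 50 ms window; these are
        // exactly the intervals where queues overflow and latency spikes.
        for (int i = 0; i < windows; i++)
            if (util[i] >= 0.99)
                System.out.printf("transient bottleneck at t=%d ms%n", i * 50);
    }
}
```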
Elba Project: www.cc.gatech.edu/systems/projects/Elba/