CMU proposes to explore methodologies and algorithms for automating the analysis of failures and performance degradations in large-scale storage systems. Problem analysis includes such crucial tasks as identifying which component(s) misbehaved, determining likely root causes, and gathering supporting evidence for any conclusions. By combining statistical tools with appropriate instrumentation, we hope to dramatically reduce the difficulty of analyzing performance and reliability problems in deployed storage systems. Such tools, integrated with automated reaction logic, also provide an essential building block for the longer-term goal of self-healing. Automating problem analysis is crucial to achieving cost-effective storage at the scales needed for tomorrow's high-end computing systems. The sheer number of hardware and software components will make problems common rather than anomalous, so it must be possible to move quickly from problem to fix with little to no system downtime for analysis. Further, the complexity of the distributed software in such systems makes by-hand analysis increasingly untenable. A subtler, but perhaps the most serious, concern is that implementors of these storage systems are increasingly unable to test in representative high-end computing environments because they simply cannot afford to recreate the necessary system scale. As a result, scale-related problems must be analyzed in the field before improvements can be made, which introduces delays and productivity reductions for customers and users, and raises clearance issues for systems deployed to support highly sensitive activities. Current designs and tools fall far short of what is needed.