Virtually all fields of science and engineering depend on fundamental advances in computing. High-End Computing (HEC) simulations in various areas of science enable us to understand the world around us. Unfortunately, HEC is known for its lack of sustained performance and reliability: the system-wide failure rate increases significantly as the number of components grows. The conventional method for fault tolerance in HEC, checkpointing, is costly and triggers a cycle of deterioration that is fueled by ever-increasing HEC complexity. A new fault-tolerant approach is a must for next-generation HEC. In this research, the PIs propose a novel Hybrid Fault Tolerant (HFT) approach for HEC that combines long-term and short-term techniques to improve fault management. Long-term prediction models the likelihood of faults based on historical data and thereby facilitates failure-aware scheduling, which intelligently maps jobs to available resources. Short-term prediction diagnoses the root causes of unusual runtime events and triggers on-the-fly job rescheduling to move running jobs away from the troublesome resources. The long-term and short-term support complement each other: failure-aware scheduling protects inactive jobs (i.e., jobs that are not yet scheduled) from failures that are well captured by the long-term failure models, while failure-aware rescheduling enables active jobs (i.e., jobs that are already scheduled and running) to avoid irregular failures that may not follow any long-term pattern but can be discovered at runtime (e.g., sudden hardware and software errors). The integrated long-term and short-term approach promotes a better understanding of failure trends and modes and consequently improves system productivity in HEC.
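
The interplay between the two mechanisms can be illustrated with a minimal Python sketch. It is not the HFT implementation, which the abstract does not specify. It assumes that the long-term model is simply the fraction of past intervals in which a node failed, and that the short-term diagnosis is an externally supplied flag naming a suspect node. All names (Node, long_term_failure_rate, schedule, reschedule) are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    failure_history: list  # assumed encoding: 1 = failure in a past interval, 0 = none
    healthy: bool = True   # cleared when a (hypothetical) runtime monitor flags the node


def long_term_failure_rate(node):
    """Stand-in long-term model: fraction of past intervals with a failure."""
    if not node.failure_history:
        return 0.0
    return sum(node.failure_history) / len(node.failure_history)


def schedule(jobs, nodes):
    """Failure-aware scheduling: map waiting jobs to the healthy nodes
    with the lowest historical failure rate, one job per node."""
    ranked = sorted((n for n in nodes if n.healthy), key=long_term_failure_rate)
    return {job: node.name for job, node in zip(jobs, ranked)}


def reschedule(placement, nodes, suspect):
    """Failure-aware rescheduling: move running jobs off a node that was
    flagged at runtime, onto the most reliable idle healthy node."""
    for node in nodes:
        if node.name == suspect:
            node.healthy = False
    busy = set(placement.values())
    spare = sorted(
        (n for n in nodes if n.healthy and n.name not in busy),
        key=long_term_failure_rate,
    )
    new_placement = dict(placement)
    for job, node_name in placement.items():
        if node_name == suspect and spare:
            new_placement[job] = spare.pop(0).name
    return new_placement


if __name__ == "__main__":
    nodes = [
        Node("n0", [0, 1, 1, 0]),
        Node("n1", [0, 0, 0, 0]),
        Node("n2", [0, 0, 1, 0]),
        Node("n3", [1, 1, 1, 0]),
    ]
    placement = schedule(["jobA", "jobB"], nodes)
    print(placement)
    # Suppose the runtime monitor now flags n1 despite its clean history.
    print(reschedule(placement, nodes, "n1"))
```

In this sketch the scheduler only sees failures that appear in the history, while the rescheduler reacts to a node whose history is clean but which misbehaves at runtime. This is the complementarity described above, reduced to its simplest form.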