Traditionally, the goal of most reliability projects has been to prevent errors of all kinds. Researchers are now discovering that not all errors cause a failure: some are masked within the circuits because not all inputs affect the final result. Preventing every error, rather than only those that change a result, wastes time and power. This project proposes to explore techniques that allow errors to occur when they do not change final results. In many applications, such as facial recognition or voice recognition, many data errors will go unnoticed by the software, depending on the particular data. For example, if one bit gets flipped in an incoming audio signal for voice recognition, it may not affect the result at all; the proper word may be recognized despite the error in one sample. A key observation, however, is that even these applications are not very resistant to control flow errors. For example, if the voice recognition software stops before it completes its analysis of the audio signal, the wrong word would most likely be recognized, leading to failure.
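The contrast between a data error and a control flow error can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `classify` function is a hypothetical stand-in for a recognizer (it only thresholds the mean amplitude), and the signal values are invented. It is not the software studied in this project.

```python
def flip_bit(sample: int, bit: int) -> int:
    """Flip one bit of a 16-bit unsigned sample value."""
    return (sample ^ (1 << bit)) & 0xFFFF


def classify(samples, threshold=1000):
    """Hypothetical recognizer: 'loud' if the mean amplitude exceeds threshold."""
    mean = sum(samples) / len(samples)
    return "loud" if mean > threshold else "quiet"


# Invented signal: a quiet first half followed by a loud second half.
signal = [200] * 50 + [4000] * 50
clean_result = classify(signal)

# Data error: one bit flipped in a single sample.
corrupted = list(signal)
corrupted[10] = flip_bit(corrupted[10], 3)
data_error_result = classify(corrupted)

# Control flow error: the analysis stops after the first half of the signal.
control_error_result = classify(signal[:50])

print(clean_result, data_error_result, control_error_result)
```

In this sketch the single flipped bit leaves the decision unchanged, while stopping the analysis early changes it. Whether a given bit flip is masked in a real application depends on the data and on which bit is affected.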
This project explores how to take advantage of partial tolerance to unreliability. More efficient reliability mechanisms can be designed that target only the important instructions, not all instructions. In even more tolerant applications, errors can be introduced into the system in order to speed it up, allowing the process to proceed without waiting for slow operations. In order to discover and exploit error tolerance, this project will identify 10-15 applications that are tolerant to errors, develop heuristics to determine which instructions are more tolerant to error than others, develop specific techniques for efficiently protecting only critical instructions from errors, and develop mechanisms to introduce errors into less important, high-latency instructions in order to save power and/or improve performance.
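The idea of protecting only critical instructions can be sketched in software as selective redundant execution. This is a minimal illustration under stated assumptions, not the project's mechanism: the `SelectiveExecutor` class, the `critical` flag, and the `fault` hook are all hypothetical names, and the sketch assumes faults are transient, so a re-execution after a detected mismatch is taken to be correct.

```python
class SelectiveExecutor:
    """Hypothetical executor that duplicates only operations marked critical."""

    def __init__(self):
        self.executions = 0  # total number of times any operation was run

    def _run_once(self, op, args, fault=None):
        self.executions += 1
        value = op(*args)
        # An optional fault hook models a transient error in this execution.
        return fault(value) if fault else value

    def run(self, op, args, critical, fault=None):
        first = self._run_once(op, args, fault)
        if not critical:
            return first  # unprotected: any error passes through
        second = self._run_once(op, args)  # redundant execution
        if first == second:
            return first
        # Mismatch detected. Assumption: the fault was transient, so run again.
        return self._run_once(op, args)


def add(a, b):
    return a + b


def flip_low_bit(value):
    return value ^ 1


executor = SelectiveExecutor()
protected = executor.run(add, (2, 3), critical=True, fault=flip_low_bit)
unprotected = executor.run(add, (2, 3), critical=False, fault=flip_low_bit)
print(protected, unprotected, executor.executions)
```

The protected operation pays for extra executions but returns the correct sum, while the unprotected one runs once and lets the error through. Deciding which operations deserve the `critical` flag is the role of the heuristics described above.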
This CAREER award funded four research efforts: two in computer architecture and two outreach projects for computer science education. In our first project, we looked at how to protect the computer from errors caused by unexpected events, such as radiation from the sun. A common way to protect against this is to run every instruction in the program twice, which requires twice as much hardware for the same program. We found that, especially in multimedia applications like streaming video, audio, or artificial intelligence applications, there is no appreciable loss in accuracy when specific instructions produce incorrect results. Therefore, we built a system that runs only the important instructions twice. Depending on the application, we were able to reduce the overhead by 20-50%.
Our second project looked at running parallel programs on future parallel chips more efficiently. There are two parts to this: the data the program uses and the instructions it runs. Data is a problem because parallel programs have not caught up with parallel architectures. These programs were written for computers in which each part of the program ran on a separate machine with its own memory, so much of the same data was stored on every machine. Now all of these parts of the program run on the same chip, sharing the same memory. To take advantage of this, the programs would all need to be rewritten. Instead, we propose to have the system detect when the same data is being used and store it as one piece of data rather than a separate copy for each part of the program. We applied this technique at a few different levels in the machine. Applied to on-chip cache, it gave orders of magnitude speedup. Applied to off-chip memory, it saved 37-60% of the memory, allowing larger programs to be executed on the same machine. We then tackled computation on the chip, reducing instructions rather than memory space.
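The data-merging idea described above, detecting identical data and storing it once, can be sketched in software as a content-addressed store. This is only an analogy under stated assumptions: the `DedupStore` class and its methods are hypothetical, the project's technique operates in cache and memory hardware rather than in Python, and a real design would compare full contents rather than rely on a hash alone.

```python
import hashlib


class DedupStore:
    """Hypothetical store that keeps one copy of each distinct data block."""

    def __init__(self):
        self._blocks = {}  # digest -> block contents (one copy per distinct block)
        self._refs = {}    # (owner, name) -> digest of the block it refers to

    def write(self, owner, name, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        # Store the contents only if no identical block is already present.
        self._blocks.setdefault(digest, data)
        # Note: overwritten blocks are never reclaimed in this sketch.
        self._refs[(owner, name)] = digest

    def read(self, owner, name) -> bytes:
        return self._blocks[self._refs[(owner, name)]]

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self._blocks.values())


# Four program parts each hold the same 1024-byte table plus 16 private bytes.
shared_table = bytes(1024)
store = DedupStore()
for rank in range(4):
    store.write(rank, "table", shared_table)
    store.write(rank, "private", bytes([rank]) * 16)

naive_bytes = 4 * (1024 + 16)
print(store.stored_bytes(), naive_bytes)
```

In this invented workload the store holds the shared table once and each private block separately, so it uses far less space than four full copies. The savings in any real program depend on how much data its parts actually have in common.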
We found that many parallel applications execute many of the same instructions, sometimes with the same data. This is not programmer error: if these instructions were executed only once and all parts of the program needed the results, communicating those results to the other parts of the program would be very slow. We designed a processor that detects when the instructions being executed are the same and, if so, executes them only once. The key is that the communication is free because this is all done in the same processor with shared resources.
Finally, we made significant contributions to broader impacts through computer science education work. We had three undergraduate students create the pilot system for a summer camp that was eventually funded through the NSF Broadening Participation in Computing Grant. This summer camp introduces female and Latina/o middle school students to computer science through engaging projects based on Mayan culture and endangered species. The camp has been very successful, increasing interest in computer science as a field, confidence in computer science skills, and experience with programming. Each summer, almost half of the female students eligible to return to the camp have applied to return. In addition, we analyzed the needs of K-12 educational cell phone applications and compared them to the known behaviors of cell phones as they fail from old age. We determined that K-12 education is an excellent match for used cell phones and hope this will spur innovation in the use of cell phone technology in K-12 education.