An important class of practical parallel computers consists of machines comprising many large grained processors. These processors execute asynchronously and memory is assumed to be distributed. Trying to execute a general n-thread parallel program P on such a parallel system poses theoretical and practical challenges. One must provide mechanisms for implementing the synchronization barriers, for load sharing and balancing amongst the processors, and for program executing that respects memory locality. The approach used in this research is to define algorithms and a program transformation (compilation) T so that T(P) executes correctly and efficiently on any P. This simulated execution does not require any specialized hardware or operating system mechanisms in the target machine. Of particular interest is the case that the parallel program is large-grained, i.e. that in each parallel step a substantial block of instructions is executed on some or most of the parallel threads. It is assumed that the parallel program is already optimized so as to make the blocks of code within the parallel step quite large. This project studies the most general case of large grained parallel programs, paying particular attention to memory locality and to memory fault-tolerance, employing randomization. A further goal is to achieve Las Vegas style results, i.e. small expected running time and no errors.