There is a rising demand to scale application performance by distributing computation across many machines in a data-center. Writing efficient and robust parallel programs in this setting is difficult because programmers must minimize communication overhead while tolerating machine failures.
This project investigates a new data-centric parallel programming model, called Piccolo, that can simplify the construction of in-memory data-center applications such as PageRank and distributed neural network training.
In-memory applications can hold all their intermediate state in the aggregate memory of many machines and benefit from sharing this state between machines during computation. Traditionally, such applications have been built using low-level, communication-centric primitives such as MPI, resulting in significant programming complexity. The recently popular MapReduce and Dryad frameworks are also a poor fit for these applications because their data-flow programming model lacks support for shared state.
Unlike data-flow models, Piccolo explicitly supports the sharing of mutable, distributed state via a key/value table interface. Piccolo makes sharing efficient by optimizing for locality of access to shared tables and by automatically resolving write-write conflicts using user-defined accumulation functions. As a result, Piccolo is easy to program, enables applications that do not fit the MapReduce model, and achieves good, scalable performance.
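To illustrate the accumulation idea, here is a minimal single-machine sketch in Python. The class name and methods are hypothetical, not Piccolo's actual API; it only shows how a user-defined accumulation function can merge concurrent updates to the same key instead of treating them as conflicting writes.

```python
class AccumulatingTable:
    """Hypothetical sketch of a Piccolo-style key/value table: updates to
    an existing key are merged with a user-defined accumulation function,
    so write-write conflicts resolve automatically."""

    def __init__(self, accumulate):
        self.accumulate = accumulate  # user-defined merge, e.g. addition
        self.data = {}

    def update(self, key, value):
        if key in self.data:
            # Resolve the write-write conflict by accumulating the values.
            self.data[key] = self.accumulate(self.data[key], value)
        else:
            self.data[key] = value

    def get(self, key):
        return self.data[key]


# Example: PageRank-style rank contributions from two workers are summed.
ranks = AccumulatingTable(accumulate=lambda old, new: old + new)
ranks.update("page_a", 0.25)  # contribution from worker 1
ranks.update("page_a", 0.5)   # contribution from worker 2
print(ranks.get("page_a"))    # -> 0.75
```

In a real distributed setting the table would be partitioned across machines and updates applied remotely, but the conflict-resolution logic follows the same pattern: because accumulation functions like addition are commutative, the order in which workers' updates arrive does not affect the result.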