Room 25-26/105, 4 place Jussieu - 75005 Paris
In this talk, we will present MemProf, the first profiler that allows programmers to choose and implement efficient application-level optimizations for NUMA systems. MemProf achieves this goal by allowing programmers to (i) precisely identify which memory objects are accessed remotely and (ii) build temporal flows of the interactions between threads and objects. We evaluated MemProf using four applications (FaceRec, Streamcluster, Psearchy, and Apache) on three different machines. In each case, we will show how MemProf helped us choose and implement efficient optimizations, unlike existing profilers. These optimizations provide significant performance gains on the studied applications (up to 161%), while requiring very lightweight modifications (10 lines of code or fewer).
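To give a feel for what such a small optimization can look like, here is a minimal sketch (not taken from MemProf itself; the array name and size are invented) of the kind of fix a memory profiler can motivate: once a hot object is known to be accessed mostly from remote nodes, its pages can be interleaved across nodes with libnuma in just a few lines.

    /* Illustrative only: once profiling shows that a hot array is mostly
     * accessed from remote NUMA nodes, interleaving its pages across nodes
     * is a typical few-line fix.  Requires libnuma; link with -lnuma. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_ELEMS (16 * 1024 * 1024)   /* made-up size for the example */

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this machine\n");
            return 1;
        }
        /* Instead of malloc(), spread the pages of the hot array over all
         * nodes so no single memory controller becomes the bottleneck. */
        double *hot_array = numa_alloc_interleaved(N_ELEMS * sizeof(double));
        if (hot_array == NULL)
            return 1;

        for (size_t i = 0; i < N_ELEMS; i++)
            hot_array[i] = (double)i;    /* stand-in for the real workload */

        numa_free(hot_array, N_ELEMS * sizeof(double));
        return 0;
    }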
Transactional memory (TM) provides a scalable and easy-to-use alternative to locks. One of the key results highlighted by previous research is that, independently of the nature of the synchronization scheme adopted by a TM platform, its actual performance is strongly workload-dependent and affected by a number of complex, often intertwined factors (e.g., duration of transactions, level of data contention, ratio of update vs. read-only transactions). In this talk, we will explore the assumption that most workloads have a natural degree of parallelism, i.e., there is a workload-specific threshold below which adding more threads improves transaction throughput, and above which additional threads do not help and may even degrade performance because of higher contention and abort rates, even if sufficiently many cores are available. We will present techniques to dynamically identify the "best" degree of parallelism that should be used for a given application.
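As a rough illustration of the idea (this is not the technique presented in the talk), a controller can probe increasing thread counts and stop as soon as throughput no longer improves meaningfully. In the sketch below, run_interval() is a hypothetical hook standing in for one measurement interval of the real workload.

    /* Hedged sketch of a hill-climbing concurrency controller: keep adding
     * threads while measured throughput improves, stop once it plateaus.
     * run_interval() is a hypothetical hook; here it is stubbed with a
     * synthetic curve whose throughput peaks at around 7 threads. */
    #include <stdio.h>

    static double run_interval(int threads) {
        double t = threads;
        return 1000.0 * t / (1.0 + 0.02 * t * t); /* contention grows with t^2 */
    }

    int main(void) {
        int threads = 1;
        const int max_threads = 32;
        double best = run_interval(threads);

        while (threads < max_threads) {
            double next = run_interval(threads + 1);
            if (next <= best * 1.02)      /* demand a 2% gain to keep growing */
                break;
            best = next;
            threads++;
        }
        printf("selected degree of parallelism: %d (%.0f tx/s)\n", threads, best);
        return 0;
    }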
The key problem with getting multithreaded programs right is non-determinism. Programs with data races behave differently depending on the vagaries of thread scheduling: different runs of the same multithreaded program can unexpectedly produce different results. These "Heisenbugs" greatly complicate debugging, and eliminating them requires extensive testing to account for possible thread interleavings.
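A minimal example of such a race (a contrived sketch, not an example from the Dthreads work): two threads increment an unprotected shared counter, and different runs of the same program can print different totals.

    /* Classic data race: both threads perform unsynchronized read-modify-write
     * operations on the same counter, so the printed total varies from run to
     * run.  Compile and link with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                 /* shared, unprotected */

    static void *work(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;                       /* racy increment */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }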
We attack the problem of non-determinism with Dthreads, an efficient deterministic multithreading system for general-purpose, unmodified C/C++ programs. Dthreads directly replaces the pthreads library and eliminates races by making all executions deterministic. Not only does Dthreads dramatically outperform a state-of-the-art deterministic runtime system, it often matches—and occasionally exceeds—the performance of pthreads.
While correctness is important, it is not enough. Multithreaded applications also need to be efficient and scalable. Key to achieving high performance and scalability is reducing contention for shared resources. However, even when sharing has been reduced to a minimum, threads can still suffer from false sharing: multiple objects that are not logically shared can end up on the same cache line, leading to invalidation traffic. False sharing is insidious: not only can it be disastrous to performance, causing it to plummet by as much as an order of magnitude, but it is also difficult to diagnose and track down.
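For illustration (a contrived sketch, not an example from the Sheriff work): the two counters below are logically private to their threads, but they sit on the same cache line, so every increment by one thread invalidates the line in the other thread's cache.

    /* False-sharing sketch: counters[0] and counters[1] share a 64-byte
     * cache line, so the two threads continually bounce that line between
     * their caches even though they never touch each other's data.
     * Compile and link with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    static long counters[2];                 /* adjacent: same cache line */

    static void *work(void *arg) {
        long id = (long)arg;
        for (int i = 0; i < 100000000; i++)
            counters[id]++;                  /* ping-pongs the cache line */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, work, (void *)id);
        for (int id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("%ld %ld\n", counters[0], counters[1]);
        return 0;
    }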
We have developed two systems to attack the problem of false sharing: Sheriff-Detect and Sheriff-Protect. Sheriff-Detect is a false sharing detection tool that is precise (no false positives), runs with low overhead (on average, 20%), and is accurate, pinpointing the exact objects involved in false sharing. When rewriting a program to fix false sharing is infeasible (source code is unavailable, or padding objects would consume too much memory), programmers can instead use Sheriff-Protect. Sheriff-Protect is a runtime system that automatically eliminates most of the performance impact of false sharing. Sheriff-Protect can improve performance by up to 9X without the need for programmer intervention.
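When the source can be changed, the manual fix for the sketch above is padding: give each counter its own cache line (assuming 64-byte lines), at the cost of the wasted memory mentioned above. The fragment below replaces the counter declaration in the earlier sketch; the loop body then becomes counters[id].value++. Sheriff-Protect achieves a comparable effect at runtime, without any source modification.

    /* Manual padding fix for the previous sketch (assumes 64-byte cache
     * lines): each counter now occupies its own line, so the threads no
     * longer invalidate each other's caches. */
    #include <stdalign.h>

    struct padded_counter {
        alignas(64) long value;              /* one cache line per counter */
    };

    static struct padded_counter counters[2];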
Bio: Emery Berger is an Associate Professor in the Department of Computer Science at the University of Massachusetts Amherst, the flagship campus of the UMass system. He graduated with a Ph.D. in Computer Science from the University of Texas at Austin in 2002. Professor Berger has been a Visiting Scientist at Microsoft Research and at the Universitat Politècnica de Catalunya (UPC) / Barcelona Supercomputing Center (BSC).
Professor Berger's research spans programming languages, runtime systems, and operating systems, with a particular focus on systems that transparently improve reliability, security, and performance. He is the creator of various widely-used software systems including Hoard, a fast and scalable memory manager that accelerates multithreaded applications (used by companies including British Telecom, Cisco, Royal Bank of Canada, SAP, and Tata, and on which the Mac OS X memory manager is based), and DieHard, an error-avoiding memory manager that directly influenced the design of the Windows 7 Fault-Tolerant Heap.
His honors include a Microsoft Research Fellowship (2001), an NSF CAREER Award (2003), a Lilly Teaching Fellowship (2006), and a Best Paper Award at FAST 2007. Professor Berger served as the General Chair of the Memory Systems Performance and Correctness workshop (MSPC 2008), co-Program Chair of the 2010 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2010), Program Chair of the 2012 Workshop on Determinism and Correctness in Parallel Programming (WoDET 2012), and co-Program Chair of the Fifth USENIX Workshop on Hot Topics in Parallelism (HotPar 2013). He is a Senior Member of the ACM, and is currently an Associate Editor of the ACM Transactions on Programming Languages and Systems.