Room 25-26/105, 4 place Jussieu - 75005 Paris
In this talk, we will present MemProf, the first profiler that allows programmers to choose and implement efficient application-level optimizations for NUMA systems. MemProf achieves this goal by allowing programmers to (i) precisely identify which memory objects are accessed remotely and (ii) build temporal flows of the interactions between threads and objects. We evaluated MemProf using four applications (FaceRec, Streamcluster, Psearchy, and Apache) on three different machines. In each case, we will show how MemProf helped us choose and implement efficient optimizations, unlike existing profilers. These optimizations provide significant performance gains on the studied applications (up to 161%), while requiring very lightweight modifications (10 lines of code or fewer).
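To give a feel for what such a small optimization can look like, here is a minimal sketch (not taken from MemProf itself; the array name and size are invented) of the kind of fix a memory profiler can motivate: once a hot object is known to be accessed mostly from remote nodes, its pages can be interleaved across nodes with libnuma in just a few lines.

    /* Illustrative only: once profiling shows that a hot array is mostly
     * accessed from remote NUMA nodes, interleaving its pages across nodes
     * is a typical few-line fix.  Requires libnuma; link with -lnuma. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_ELEMS (16 * 1024 * 1024)   /* made-up size for the example */

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this machine\n");
            return 1;
        }
        /* Instead of malloc(), spread the pages of the hot array over all
         * nodes so no single memory controller becomes the bottleneck. */
        double *hot_array = numa_alloc_interleaved(N_ELEMS * sizeof(double));
        if (hot_array == NULL)
            return 1;

        for (size_t i = 0; i < N_ELEMS; i++)
            hot_array[i] = (double)i;    /* stand-in for the real workload */

        numa_free(hot_array, N_ELEMS * sizeof(double));
        return 0;
    }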
Transactional memory (TM) provides a scalable and easy-to-use alternative to locks. One of the key results highlighted by previous research is that, independently of the nature of the synchronization scheme adopted by a TM platform, its actual performance is strongly workload-dependent and affected by a number of complex, often intertwined factors (e.g., duration of transactions, level of data contention, ratio of update vs. read-only transactions). In this talk, we will explore the assumption that most workloads have a natural degree of parallelism, i.e., there is a workload-specific threshold below which adding more threads improves transaction throughput, and above which additional threads do not help and may even degrade performance because of higher contention and abort rates, even if sufficiently many cores are available. We will present techniques to dynamically identify the "best" degree of parallelism that should be used for a given application.
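As a rough illustration of the idea (this is not the technique presented in the talk), a controller can probe increasing thread counts and stop as soon as throughput no longer improves meaningfully. In the sketch below, run_interval() is a hypothetical hook standing in for one measurement interval of the real workload.

    /* Hedged sketch of a hill-climbing concurrency controller: keep adding
     * threads while measured throughput improves, stop once it plateaus.
     * run_interval() is a hypothetical hook; here it is stubbed with a
     * synthetic curve whose throughput peaks at around 7 threads. */
    #include <stdio.h>

    static double run_interval(int threads) {
        double t = threads;
        return 1000.0 * t / (1.0 + 0.02 * t * t); /* contention grows with t^2 */
    }

    int main(void) {
        int threads = 1;
        const int max_threads = 32;
        double best = run_interval(threads);

        while (threads < max_threads) {
            double next = run_interval(threads + 1);
            if (next <= best * 1.02)      /* demand a 2% gain to keep growing */
                break;
            best = next;
            threads++;
        }
        printf("selected degree of parallelism: %d (%.0f tx/s)\n", threads, best);
        return 0;
    }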
The key problem with getting multithreaded programs right is non-determinism. Programs with data races behave differently depending on the vagaries of thread scheduling: different runs of the same multithreaded program can unexpectedly produce different results. These "Heisenbugs" greatly complicate debugging, and eliminating them requires extensive testing to account for possible thread interleavings.
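A minimal example of such a race (a contrived sketch, not an example from the Dthreads work): two threads increment an unprotected shared counter, and different runs of the same program can print different totals.

    /* Classic data race: both threads perform unsynchronized read-modify-write
     * operations on the same counter, so the printed total varies from run to
     * run.  Compile and link with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                 /* shared, unprotected */

    static void *work(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;                       /* racy increment */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }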
We attack the problem of non-determinism with Dthreads, an efficient deterministic multithreading system for general-purpose, unmodified C/C++ programs. Dthreads directly replaces the pthreads library and eliminates races by making all executions deterministic. Not only does Dthreads dramatically outperform a state-of-the-art deterministic runtime system, it often matches—and occasionally exceeds—the performance of pthreads.
While correctness is important, it is not enough. Multithreaded applications also need to be efficient and scalable. Key to achieving high performance and scalability is reducing contention for shared resources. However, even when sharing has been reduced to a minimum, threads can still suffer from false sharing: multiple objects that are not logically shared can end up on the same cache line, leading to invalidation traffic. False sharing is insidious: not only can it be disastrous to performance, causing it to plummet by as much as an order of magnitude, but it is also difficult to diagnose and track down.
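For illustration (a contrived sketch, not an example from the Sheriff work): the two counters below are logically private to their threads, but they sit on the same cache line, so every increment by one thread invalidates the line in the other thread's cache.

    /* False-sharing sketch: counters[0] and counters[1] share a 64-byte
     * cache line, so the two threads continually bounce that line between
     * their caches even though they never touch each other's data.
     * Compile and link with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    static long counters[2];                 /* adjacent: same cache line */

    static void *work(void *arg) {
        long id = (long)arg;
        for (int i = 0; i < 100000000; i++)
            counters[id]++;                  /* ping-pongs the cache line */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, work, (void *)id);
        for (int id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("%ld %ld\n", counters[0], counters[1]);
        return 0;
    }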
We have developed two systems to attack the problem of false sharing: Sheriff-Detect and Sheriff-Protect. Sheriff-Detect is a false sharing detection tool that is precise (no false positives), runs with low overhead (on average, 20%), and is accurate, pinpointing the exact objects involved in false sharing. When rewriting a program to fix false sharing is infeasible (source code is unavailable, or padding objects would consume too much memory), programmers can instead use Sheriff-Protect. Sheriff-Protect is a runtime system that automatically eliminates most of the performance impact of false sharing. Sheriff-Protect can improve performance by up to 9X without the need for programmer intervention.
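When the source can be changed, the manual fix for the sketch above is padding: give each counter its own cache line (assuming 64-byte lines), at the cost of the wasted memory mentioned above. The fragment below replaces the counter declaration in the earlier sketch; the loop body then becomes counters[id].value++. Sheriff-Protect achieves a comparable effect at runtime, without any source modification.

    /* Manual padding fix for the previous sketch (assumes 64-byte cache
     * lines): each counter now occupies its own line, so the threads no
     * longer invalidate each other's caches. */
    #include <stdalign.h>

    struct padded_counter {
        alignas(64) long value;              /* one cache line per counter */
    };

    static struct padded_counter counters[2];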
Bio: Emery Berger is an Associate Professor in the Department of Computer Science at the University of Massachusetts Amherst, the flagship campus of the UMass system. He graduated with a Ph.D. in Computer Science from the University of Texas at Austin in 2002. Professor Berger has been a Visiting Scientist at Microsoft Research and at the Universitat Politècnica de Catalunya (UPC) / Barcelona Supercomputing Center (BSC).
Professor Berger's research spans programming languages, runtime systems, and operating systems, with a particular focus on systems that transparently improve reliability, security, and performance. He is the creator of various widely-used software systems including Hoard, a fast and scalable memory manager that accelerates multithreaded applications (used by companies including British Telecom, Cisco, Royal Bank of Canada, SAP, and Tata, and on which the Mac OS X memory manager is based), and DieHard, an error-avoiding memory manager that directly influenced the design of the Windows 7 Fault-Tolerant Heap.
His honors include a Microsoft Research Fellowship (2001), an NSF CAREER Award (2003), a Lilly Teaching Fellowship (2006), and a Best Paper Award at FAST 2007. Professor Berger served as the General Chair of the Memory Systems Performance and Correctness workshop (MSPC 2008), co-Program Chair of the 2010 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2010), Program Chair of the 2012 Workshop on Determinism and Correctness in Parallel Programming (WoDET 2012), and co-Program Chair of the Fifth USENIX Workshop on Hot Topics in Parallelism (HotPar 2013). He is a Senior Member of the ACM, and is currently an Associate Editor of the ACM Transactions on Programming Languages and Systems.