
June 21st, 2024. Colocated with PODC’24 and SPAA’24. Nantes, France.

Overview and scope

EMERALD aims to investigate how the use of future and emerging hardware technology can influence (or add to) the foundations of concurrent and distributed computing. The workshop will host a series of invited talks that will contribute to a better understanding of how such technology could change what we know about concurrent and distributed algorithms and models. The emphasis will be on persistent memory computing, NUMA-aware computing, distributed computing in systems with Remote Direct Memory Access (RDMA), distributed and concurrent computing on top of heterogeneous hardware, processing in memory (PIM), near-memory processing (NMP), disaggregated/composable systems, and others. The topics covered by EMERALD specialize broader topics in concurrent and distributed computing for systems that feature modern, emerging, or future hardware technology.

Through a carefully chosen collection of invited talks, EMERALD aspires to reveal realistic and practical aspects that can positively influence theory research: the choice of problems to work on, the right level of abstraction at which to study them, how to come up with realistic models of computation, and how to efficiently support additional desirable features (persistence, NUMA-awareness, architecture-specific design) when designing concurrent and distributed algorithms while maintaining or improving their performance characteristics. EMERALD will be a full-day event. In addition to hosting lectures by invited speakers, it will feature an open discussion session aiming to highlight important problems for future research.

Program

The workshop takes place on June 21st, 2024.

Time Activity
9:00–10:40
Session 1
Persistent Computing – Theory, Algorithms, Data Structures
Session Chair: Maurice Herlihy

1. Wojciech Golab: How Persistent Memory Changed Distributed Computing Theory
2. Erez Petrank: The Persistence Bug: Dead or Alive?
3. Panagiota Fatourou: Highly-Efficient Persistent Data Structures: The performance principles that govern their design
4. Eleni Kanellou: The Power of Software Combining in Persistence
5. Eric Ruppert: When is Recoverable Consensus harder than Consensus?
10:40–11:00 Coffee Break I
11:00–12:20
Session 2
Hardware-Aware/Accelerated/Supported Distributed Computing
Session Chair: João Barreto

1. Roberto Palmieri: Synchronization using RDMA: High Performance, Programmability, and Scalability
2. George Hodgkins: LOCO: Building Applications in Network Memory with Cross-Node Objects
3. Jim Larus: Hardware-Accelerated, Fine-Grain BSP
4. Denisa-Andreea Constantinescu: Neuro-Inspired Edge AI Architectures for Distributed Federated Learning
12:40–13:50 Lunch
13:50–15:30
Session 3
Memory & Storage – Systems, Practice, Applications
Session Chair: Erez Petrank

1. Naama Ben-David: The Effects of Fast I/O on Concurrent Computing
2. Gala Yadgar: Who moved my cheese? How storage systems deal with changes in their media
3. Gregory Chockler: Mangosteen: Fast Transparent Durability for Linearizable Applications using NVM
4. João Barreto: Persistent hardware transactions can scale
5. Samuel Thomas: Rethinking Secure NVM in the Age of CXL
15:30–15:50 Coffee Break II
15:50–16:30
Session 4
Processing in Memory
Session Chair: Panagiota Fatourou

1. Phillip Gibbons: Processing-in-Memory: Theory and Practice
2. Paolo Romano: Transactional Memory for Processing-In-Memory Systems: a Software-based Implementation for the UPMEM platform
16:30–16:45 Break
16:45–18:00 Discussion Session - Panel

Talks

Persistent hardware transactions can scale [slides]


João Barreto
University of Lisbon and INESC-ID, Portugal

Persistent Memory (PM) has paved the way for researchers to develop Transactional Memories (TMs) that, besides providing atomic transactions in main memory, also deliver durable transactions. Unfortunately, combining PM and TM is challenging, as the most efficient implementations of TM, i.e., Hardware Transactional Memories (HTMs), operate at the level of CPU caches. As caches are volatile, the durability of transactions needs to be enforced by additional software mechanisms that explicitly, yet transparently to the programmer, flush the cache contents via carefully designed protocols at commit time.

In this talk, I'll present how we can work around these issues with the help of a carefully crafted software layer that is interposed between the application and the HTM. Moreover, I'll highlight the scalability shortcomings of the state of the art in persistent hardware transactions, and present some recent ideas to overcome them.
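
To make the flush-on-commit idea concrete, here is a minimal C++ sketch of how a software layer can persist a redo log and a commit marker after the hardware transaction has committed. It assumes an x86 machine with CLWB support; the names persist, LogEntry, and commit_durably are illustrative, and the sketch is not the protocol presented in the talk.

```cpp
// Illustrative sketch of commit-time cache flushing for durable hardware
// transactions (not the speaker's protocol). Assumes x86 with CLWB support
// and cache-line-aligned buffers; all names are made up.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Write back every cache line covering [p, p+len) and fence, so the stores
// are durable before the caller proceeds.
inline void persist(void* p, std::size_t len) {
    char* c = static_cast<char*>(p);
    for (std::size_t off = 0; off < len; off += 64)
        _mm_clwb(c + off);
    _mm_sfence();
}

struct LogEntry { void* addr; std::uint64_t new_val; };

// The HTM transaction itself runs on volatile memory and appends redo
// entries to a per-thread log; durability is added in software afterwards.
void commit_durably(LogEntry* log, std::size_t n, std::uint64_t* commit_marker) {
    persist(log, n * sizeof(LogEntry));          // 1. make the redo log durable
    *commit_marker = 1;                          // 2. mark the transaction as committed
    persist(commit_marker, sizeof *commit_marker);
    // 3. the log can later be replayed into the persistent heap and trimmed
}
```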

Mangosteen: Fast Transparent Durability for Linearizable Applications using NVM


Gregory Chockler
University of Surrey, UK

The advent of byte-addressable non-volatile memory (NVM) technologies has enabled the development of low-latency high-throughput durable applications, i.e., applications that are capable of recovering from full-system crashes. However, programming such applications is error-prone as efficiency gains often require fine-grained (programmer-controlled) management of low-level persistence instructions.

In this talk, I will discuss our recent work on Mangosteen (to appear in USENIX ATC’24), a high-level programming framework that allows developers to transform an existing linearizable in-memory application into a corresponding durably linearizable version using NVM. Our framework’s API consists of a set of callback hooks that interpose on an application’s request processing flow with minimal developer effort. Mangosteen executes client operations on DRAM and persists their effects using binary instrumentation and redo logging. Mangosteen’s concurrency control facilitates batching of read-write requests to minimize the cost of persistence, while allowing read-only requests to execute concurrently. A novel intra-batch deduplication mechanism further reduces persistence overheads for common OLTP workloads. Our empirical evaluation results show that Mangosteen-enabled applications outperform state-of-the-art solutions across the entire spectrum of read-write ratios. In particular, the Mangosteen-based version of Redis demonstrates throughput gains of 2×–5× in comparison to prior work.
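
A rough C++ illustration of the batching and intra-batch deduplication idea described above (not Mangosteen's code): a batch of redo records is collapsed so that each address is persisted at most once, and the whole batch is made durable with a single flush-and-fence call. RedoRecord, persist_batch, and persist_range are hypothetical names.

```cpp
// Sketch of batched redo logging with intra-batch deduplication
// (illustrative only; not Mangosteen's implementation).
#include <immintrin.h>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct RedoRecord { std::uintptr_t addr; std::uint64_t value; };

// Stand-in durability primitive: cache-line write-backs plus a fence.
inline void persist_range(void* p, std::size_t len) {
    char* c = static_cast<char*>(p);
    for (std::size_t off = 0; off < len; off += 64) _mm_clwb(c + off);
    _mm_sfence();
}

void persist_batch(const std::vector<RedoRecord>& batch,
                   std::vector<RedoRecord>& durable_log) {
    // Only the last write to each address matters, because the whole
    // batch becomes durable atomically.
    std::unordered_map<std::uintptr_t, std::uint64_t> last_write;
    for (const RedoRecord& r : batch) last_write[r.addr] = r.value;

    const std::size_t start = durable_log.size();
    for (const auto& [addr, value] : last_write)
        durable_log.push_back({addr, value});

    // One flush amortizes the persistence cost over every request in the batch.
    persist_range(durable_log.data() + start,
                  (durable_log.size() - start) * sizeof(RedoRecord));
}
```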

Neuro-Inspired Edge AI Architectures for Distributed Federated Learning [slides]


Denisa-Andreea Constantinescu
EPFL, Switzerland

Edge computing is becoming an essential paradigm across multiple domains as our world becomes increasingly connected in pursuit of the smart-world vision. In addition, the new wave of Artificial Intelligence (AI), particularly complex Machine Learning (ML) and Deep Learning (DL) models, is driving the need for new computing paradigms and edge AI architectures beyond traditional general-purpose computing to make a sustainable smart world viable. In this presentation, Dr. Constantinescu will discuss the potential benefits and challenges of using emerging edge AI hardware architectures for distributed Federated Learning (FL) in the biomedical domain. These novel computing architectures take inspiration from how the brain processes incoming information and adapts to changing conditions. First, they exploit the idea of accepting computing inexactness at the system level while integrating multiple computing accelerators (such as in-memory computing or coarse-grained reconfigurable accelerators). Second, these edge AI architectures can operate ensembles of neural networks to improve the robustness of ML/DL outputs at the system level while minimizing memory and computation resources for the final target application. These two concepts have enabled the development of the open-source eXtended and Heterogeneous Energy-Efficient Hardware Platform (X-HEEP). X-HEEP will be showcased as a means for developing new edge AI and distributed FL systems for personalized healthcare.

The Effects of Fast I/O on Concurrent Computing [slides]


Naama Ben-David
Technion, Israel

Traditionally, algorithmic design in systems that use I/O, both for communicating with other machines and for accessing storage, has focused exclusively on reducing I/O costs, paying little attention to optimization in other parts of the system. This was motivated by the extremely high cost of I/O, which formed the bottleneck in any system that used it.

However, recent hardware trends have seen I/O speeds increase at a much higher rate than CPU speeds. Thus, it is now possible for distributed systems that use messages and access storage to bottleneck on the CPU.

In this talk, I will discuss how this hardware trend affects concurrency control in transactional memory, in particular in distributed transactional systems, and, time permitting, on-disk transactional databases.

Highly-Efficient Persistent Data Structures: The principles that govern their design [slides]


Panagiota Fatourou
University of Crete and FORTH, Greece

In this talk, we present fundamental persistence principles, crucial for performance, that an algorithm designer has to take into consideration when designing persistent data structures. We illustrate the performance power of respecting these principles when designing fundamental data structures, such as stacks and queues.

The talk will also provide a methodology for analyzing the cost of persistence instructions, not by simply counting them but by separating them into categories based on the impact they have on performance. This analysis reveals that understanding the actual persistence cost of an algorithm on machines with NVM is more complicated than previously thought and requires a thorough evaluation, since the performance impact of different persistence instructions may vary greatly.
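
As a small illustration of why instruction placement matters (my sketch, not the talk's methodology), the push below adds persistence instructions to a Treiber-style stack: the write-back of the new node can overlap with the CAS retry loop, whereas the final ordering fence sits on the operation's critical path and its latency is always paid. Here pwb/pfence stand for CLWB/SFENCE, and the sketch ignores the extra helping needed for full durable linearizability.

```cpp
// Where persistence instructions land in a Treiber-style push (sketch only;
// ignores the helping/flush-on-read needed for full durable linearizability).
#include <immintrin.h>
#include <atomic>

inline void pwb(void* p) { _mm_clwb(p); }   // asynchronous cache-line write-back
inline void pfence()     { _mm_sfence(); }  // orders/awaits prior write-backs

struct Node { int value; Node* next; };
std::atomic<Node*> top{nullptr};

void push_durably(int v) {
    Node* n = new Node{v, nullptr};
    Node* old = top.load();
    do {
        n->next = old;
        pwb(n);          // cheap in practice: can overlap with the CAS attempt below
    } while (!top.compare_exchange_weak(old, n));
    pwb(&top);           // write back the new top pointer
    pfence();            // on the critical path: the operation cannot return
                         // before this completes, so its latency is always paid
}
```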

Processing-in-Memory: Theory and Practice [slides]


Phillip Gibbons
Carnegie Mellon University, US

As computational resources become more efficient and data sizes grow, data movement is fast becoming the dominant cost in computing. Processing-in-Memory (a.k.a., near-data-processing), an idea dating back to at least 1970, is now emerging as a key technique for reducing costly data movement, by enabling computation to be executed on compute resources embedded in memory modules. While there has been considerable recent work on the systems/architecture/technology side of PIM, there has been very little work addressing the theory/programming/algorithm side. Open problems include: How does/should programming/algorithm design on PIM systems differ from traditional shared/distributed settings? What are the fundamental limitations/trade-offs in using PIM?

This talk highlights our results-to-date addressing these questions. As a driving application kernel, we focus on a novel PIM-friendly index structure supporting batch parallel inserts/deletes, point queries, and range queries. Our index addresses head-on the inherent tension between minimizing communication and achieving load balance in PIM systems. We also summarize our results for other types of indexes and for database systems more generally. Finally, the talk will report on our experiences implementing our ideas on UPMEM’s 2,560-module PIM system.
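
As a toy illustration of the communication/load-balance tension (not the index from the talk), the C++ sketch below routes a batch of keys to PIM modules by range partitioning: range partitioning keeps range queries local to few modules, but a skewed batch can overload a single module. The names route_batch and splitters are made up.

```cpp
// Routing a batch of keys to PIM modules by range partition (toy sketch;
// not the batch-parallel index presented in the talk).
#include <algorithm>
#include <cstdint>
#include <vector>

using Key = std::uint64_t;

// splitters has one entry per module boundary (size = #modules - 1).
// Returns one bucket of keys per module; each bucket would then be shipped
// to its module and processed near the data.
std::vector<std::vector<Key>> route_batch(std::vector<Key> batch,
                                          const std::vector<Key>& splitters) {
    std::sort(batch.begin(), batch.end());
    std::vector<std::vector<Key>> per_module(splitters.size() + 1);
    for (Key k : batch) {
        // The first splitter greater than k determines the destination module.
        std::size_t m = std::upper_bound(splitters.begin(), splitters.end(), k)
                        - splitters.begin();
        per_module[m].push_back(k);
    }
    // Range partitioning minimizes cross-module communication for range queries,
    // but a skewed batch can leave one bucket far larger than the rest,
    // which is the load-balance side of the tension.
    return per_module;
}
```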

How Persistent Memory Changed Distributed Computing Theory [slides]


Wojciech Golab
University of Waterloo, Canada

Persistent memory (PMem) marries the low latency of DRAM with the durability of secondary storage devices, enabling both simpler and faster data-intensive software applications. Recent years have witnessed the next evolution of this technology as Intel, in partnership with Micron, developed and commercially released the Optane persistent memory module. In parallel with these efforts, the Storage Networking Industry Association (SNIA) devised a standard programming model for persistent memory, based on the notion that persistent memory is exposed to applications by the operating system through memory-mapped files. The Compute Express Link (CXL) standard promises to create further opportunities for memory expansion and tiering of persistent memory across devices.
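
A minimal sketch of this memory-mapped-file programming model (my illustration of the general SNIA-style approach, not code from the talk): a file on a DAX-capable persistent-memory file system is mapped into the address space, updated with ordinary stores, and then explicitly made durable. The path and the choice of msync are assumptions.

```cpp
// Minimal sketch of the memory-mapped persistent-memory programming model
// (illustrative; path and error handling are simplified).
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    // A file on a DAX-mounted persistent-memory file system (path is an assumption).
    int fd = open("/mnt/pmem/counter", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, sizeof(std::uint64_t)) != 0) { perror("ftruncate"); return 1; }

    void* p = mmap(nullptr, sizeof(std::uint64_t),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Persistent memory is updated with ordinary loads and stores...
    auto* counter = static_cast<std::uint64_t*>(p);
    *counter += 1;

    // ...and made durable explicitly. msync works on any mapping; on a true
    // DAX mapping, user-space cache write-backs (CLWB plus a fence) would do.
    msync(p, sizeof(std::uint64_t), MS_SYNC);

    munmap(p, sizeof(std::uint64_t));
    close(fd);
    return 0;
}
```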

The emergence of novel high-density persistent memories upends some long-standing traditions in distributed computing theory, such as the strong emphasis on parallelism over fault tolerance in the design of in-memory data structures and synchronization primitives. In this talk, I will discuss the impact of the modern PMem-empowered memory hierarchy on research in distributed computing theory, particularly in the area of shared memory algorithms.

LOCO: Building Applications in Network Memory with Cross-Node Objects [slides]


George Hodgkins
University of Colorado, Boulder and Sandia National Lab, US

In this talk, we explore the idea of objects as a programming model for clusters using network memory (RDMA or CXL). We argue that the natural representation of an application designed for network memory is a system of interconnected objects which expose well-defined methods to the programmer, similar to traditional object-oriented application designs. These concurrent objects store their state in a distributed fashion across all participating nodes, especially in an incoherent or uncacheable memory network. In a sense, channel state is stored "across the network".

Based on this philosophy, we introduce the Library of Channel Objects (LOCO), a shared-memory-like, object-based library for RDMA that is extensible to other weak memories. Channels are composable and reusable, and designed for both the strong locality effects and the weak consistency of RDMA. Unlike prior work, LOCO channels do not hide memory complexity, instead relying on the programmer to use NUMA-like techniques to explicitly manage each object. As a consequence, our channel objects have performance similar to custom RDMA systems (e.g., distributed maps), but with a far simpler programming model. Our distributed map channel has better read performance than, and comparable write performance to, a state-of-the-art custom RDMA solution, using well-encapsulated and reusable primitives.

The Performance Power of Software Combining in Persistence [slides]


Eleni Kanellou
FORTH ICS, Greece

In this talk, we will study the power of software combining in achieving recoverable synchronization and designing persistent data structures. Software combining is a general synchronization approach which attempts to simulate the ideal world when executing synchronization requests (i.e., requests that must be executed in mutual exclusion). Software combining has been shown to significantly decrease the synchronization cost and to outperform many other synchronization techniques in various cases. We will present two recoverable software combining protocols, satisfying different progress properties, that are many times faster and have much lower persistence cost than a large collection of previously presented persistent techniques for achieving scalable synchronization. Based on these protocols, we build fundamental recoverable data structures, such as stacks and queues, that outperform existing recoverable implementations of such data structures. We also provide the first recoverable implementation of a concurrent heap and present experiments showing that it performs well when the size of the heap is not very large.
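
For readers unfamiliar with the technique, here is a bare-bones C++ sketch of the combining pattern on a trivial counter object. It is only an illustration, not one of the recoverable protocols from the talk, and the persistence steps are reduced to a comment: threads announce their requests, and whichever thread takes the combiner role applies all pending requests in a single pass.

```cpp
// Flat-combining-style sketch of software combining on a fetch&add counter
// (illustrative only; not the recoverable protocols from the talk).
#include <array>
#include <atomic>
#include <mutex>

constexpr int kThreads = 8;          // callers must pass tid in [0, kThreads)

struct Announce {
    std::atomic<bool> pending{false};
    int arg = 0;
    int result = 0;
};

std::array<Announce, kThreads> announce;
std::mutex combiner_lock;            // only one combiner at a time
int sequential_counter = 0;          // the "object": a sequential counter

int fetch_add(int tid, int delta) {
    announce[tid].arg = delta;
    announce[tid].pending.store(true, std::memory_order_release);

    // Try to become the combiner; otherwise wait until a combiner serves us.
    while (announce[tid].pending.load(std::memory_order_acquire)) {
        if (combiner_lock.try_lock()) {
            // Combiner: apply every pending request in one pass. In a
            // recoverable version, this is also where the combined state
            // would be flushed to persistent memory once per pass.
            for (auto& a : announce) {
                if (a.pending.load(std::memory_order_acquire)) {
                    a.result = sequential_counter;
                    sequential_counter += a.arg;
                    a.pending.store(false, std::memory_order_release);
                }
            }
            combiner_lock.unlock();
        }
    }
    return announce[tid].result;     // old counter value, fetch&add semantics
}
```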

Hardware-Accelerated, Fine-Grain BSP [slides]


James Larus
EPFL, Switzerland

Hardware verification and testing heavily rely on cycle-accurate RTL simulation. Conventional, single-threaded RTL simulation is a bottleneck in the design of increasingly complex chips and systems. Parallel RTL simulation, where, ideally, simulators run on hundreds or thousands of parallel cores, is an appealing remedy. However, existing simulators only exploit tens of cores due to the high cost of synchronization and communication.

RTL simulation generally conforms to Valiant’s BSP pattern, where barriers separate the computation of logic values from their communication to consumers in the following clock cycle. BSP is challenging to apply effectively in fine-grain computation like RTL simulation because of its synchronization and communication costs. This work explores two solutions.

The first is Manticore, a parallel computer designed to accelerate RTL simulation by using static scheduling to minimize runtime overhead. The Manticore compiler statically schedules resources and communication, which is feasible since RTL code contains few divergent code paths. Communication and synchronization overhead is minimal, making fine-grained parallelism practical. Moreover, static scheduling dramatically simplifies the processor, significantly increasing the number of cores that fit on a chip. Our 225-core FPGA prototype runs at 475 MHz and outperforms state-of-the-art RTL simulators.

However, the widening gap between chip size and processor performance means we need parallel RTL simulators capable of running across thousands of cores. We studied the challenges inherent in massively parallel RTL simulation on the Graphcore computer, constructed from multiple 1472-core IPUs, which have hardware support for BSP programming. We analyzed the IPU’s synchronization and communication performance and built Parendi, which runs RTL simulations across 5888 cores in 4 IPU sockets.
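
The barrier-separated BSP structure described above is easy to see in code. Below is a toy C++20 sketch of a cycle-based simulation loop, unrelated to Manticore or Parendi: each worker evaluates its slice from the previous cycle's values, one barrier ends the compute phase, and a second barrier ends the communication phase before the next cycle begins.

```cpp
// Toy BSP superstep structure for a cycle-based simulation (illustration
// only). Requires C++20 for std::barrier and std::jthread.
#include <barrier>
#include <thread>
#include <vector>

void simulate(int workers, int cycles) {
    std::vector<int> current(workers, 1), next(workers, 0);
    std::barrier sync(workers);

    auto worker = [&](int id) {
        for (int c = 0; c < cycles; ++c) {
            // Compute phase: evaluate this worker's slice from the values
            // produced in the previous cycle (a stand-in for logic evaluation).
            next[id] = current[id] ^ current[(id + 1) % workers];
            sync.arrive_and_wait();          // everyone has computed

            // Communication phase: publish values for the next cycle.
            current[id] = next[id];
            sync.arrive_and_wait();          // everyone has published
        }
    };

    std::vector<std::jthread> pool;
    for (int id = 0; id < workers; ++id) pool.emplace_back(worker, id);
    // jthreads join on destruction, after all supersteps complete.
}
```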

Synchronization using RDMA: High Performance, Programmability, and Scalability [slides]


Roberto Palmieri
Lehigh University, US

Remote Direct Memory Access (RDMA) is a networking technology that allows a machine to access the memory of another machine directly, without exchanging messages and with performance significantly better than that of legacy networking infrastructures. The RDMA programming model resembles shared memory, which is appealing because it increases the programmability of distributed concurrent applications. In practice, developing high-performance, scalable systems using RDMA is still an open challenge. This talk focuses on two well-known issues when building RDMA-enabled distributed concurrent applications. First, the absence of atomicity between shared-memory APIs and RDMA APIs mandates using loopback channels, which affects performance and parallelism. Second, the limited resources available on RDMA cards make it difficult to increase the size of the distributed system, which hampers scalability. In the talk, a solution for each of these issues is discussed, and future directions are presented.

The Persistence Bug: Dead or Alive? [slides]


Erez Petrank
Technion, Israel

In this short talk, I will briefly survey some constructions of persistent lock-free data structures. I will then focus on the persistence bug that has previously misled some naive designs on the Intel ADR Optane platform, and I will question whether we should still be concerned about it.

Rethinking Secure NVM in the Age of CXL [slides]


Samuel Thomas
Brown University, US

As computation moves increasingly towards the cloud, making guarantees to end-users about the confidentiality and integrity of their applications and data is an important primitive. Unfortunately, this is difficult to achieve given the vulnerability of memory devices to corruption-based attacks that exploit physical device properties. In this talk, I will present secure memory, a protocol that allows memory architectures to make these guarantees. In doing so, I will describe the challenges posed by emerging non-volatile memories and disaggregated memory topologies, such as CXL, on the secure memory protocol and how our work reconciles these challenges.

Transactional Memory for Processing-In-Memory Systems: a Software-based Implementation for the UPMEM platform [slides]


Paolo Romano
University of Lisbon and INESC-ID, Portugal

Processing-In-Memory (PIM) is a novel computing architecture that augments existing DRAM memory chips with lightweight logic. By allowing computations to be offloaded to the PIM system, this approach aims to avoid the data-bottleneck problem that affects many modern workloads. This talk addresses the challenge of creating efficient software implementations of the Transactional Memory (TM) abstraction by presenting PIM-STM, a library offering a variety of TM implementations for UPMEM, one of the first commercial PIM systems. Through an extensive analysis, we evaluate the efficiency of different TM algorithm design choices on this emerging architecture. Additionally, we measure the impact of utilizing various memory tiers within the UPMEM system, each with different latency and capacity trade-offs, to store the metadata for different TM implementations. Finally, we demonstrate the performance and memory efficiency gains achievable by using PIM-STM to accelerate TM applications originally designed for conventional CPU-based systems.
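
To ground the TM abstraction discussed above, here is a deliberately tiny C++ software TM, unrelated to PIM-STM's algorithms: a single global lock plus an undo log is already a correct, if unscalable, TM, and it makes visible the kind of metadata (a lock word and per-transaction logs) whose placement across UPMEM's memory tiers the talk evaluates.

```cpp
// A deliberately tiny software TM: one global lock plus an undo log
// (illustrative only; not one of PIM-STM's algorithms).
#include <cstdint>
#include <mutex>
#include <vector>

struct UndoEntry { std::uint64_t* addr; std::uint64_t old_val; };

class TinyTM {
    std::mutex global_lock;             // TM metadata: serializes transactions
    std::vector<UndoEntry> undo_log;    // TM metadata: enables rollback on abort
public:
    void begin() { global_lock.lock(); undo_log.clear(); }

    std::uint64_t read(std::uint64_t* addr) { return *addr; }

    void write(std::uint64_t* addr, std::uint64_t val) {
        undo_log.push_back({addr, *addr});  // remember the old value first
        *addr = val;
    }

    void commit() { undo_log.clear(); global_lock.unlock(); }

    void abort() {
        // Roll back in reverse order, then release the lock.
        for (auto it = undo_log.rbegin(); it != undo_log.rend(); ++it)
            *it->addr = it->old_val;
        undo_log.clear();
        global_lock.unlock();
    }
};
```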

When is Recoverable Consensus Harder Than Consensus? [slides]


Eric Ruppert
York University, Canada

We study the ability of different shared object types to solve recoverable consensus using non-volatile shared memory in a system with crashes and recoveries. In particular, we compare the difficulty of solving recoverable consensus to the difficulty of solving the standard wait-free consensus problem in a system with halting failures. We focus on the model where individual processes may crash and recover and on the large class of object types that are equipped with a read operation. We provide a characterization of the readable object types that can solve recoverable consensus among a given number of processes. (In his PODC 2024 paper, Ovens showed that our characterization is exact.) Using this characterization, we show that the number of processes that can solve consensus using a readable type can be larger than the number of processes that can solve recoverable consensus using that type, but only slightly larger.

Who moved my cheese? How storage systems deal with changes in their media. [slides]


Gala Yadgar
Technion, Israel

Storage systems must continuously evolve to take full advantage of new technologies as they emerge. Non-volatile memory has recently become the focus of many such optimization and re-design efforts. In this talk, I will describe some of these efforts to demonstrate the stages of new media adoption: from ensuring backwards compatibility and easy deployment, through identifying and addressing media-specific bottlenecks, to rethinking the core interfaces of the system. These stages have been followed in the adoption of several past technologies, for surprisingly many years!

Organization committee

Contact: Panagiota Fatourou faturu@ics.forth.gr