Context

Relational database management systems (RDBMS) have long been the trusty workhorse of the information technology (IT) industry. By holding all shared mutable state and being responsible for its durability, the RDBMS is the key component in system scale-out and availability, making database server clusters a perennial hot topic of research in industry and academia.

The current state of the art to address these challenges is still, after a long-standing debate, split between shared-storage and shared-nothing clustering architectures. On the one hand, a shared-storage cluster allows maximum resource efficiency: One uses as many nodes as required to process the workload and ensure the desired availability, while the storage is configured solely according to the desired storage bandwidth and disk resilience. Unfortunately, a shared-storage approach based on distributed shared memory and distributed locking raises a number of problems that make such solutions costly to develop and deploy. Namely, server software needs to be heavily refactored to deal with distributed locking, buffer invalidation, and recovery from partial cluster failure. Anecdotal evidence of this difficulty is that none of the mainstream open-source database servers provides this option, and most commercial database servers also lack a shared-storage configuration. Moreover, true write sharing is a potential source of corruption upon software or hardware faults, as well as an additional vulnerability to malicious intrusions.

On the other hand, there have been a number of proposals for shared-nothing database server clusters based on consistent replication. All of these share the same basic approach: Updates are ordered and propagated before replying to the client, thus ensuring that no conflicts arise after a transaction commits. The resulting performance and scalability are very good, especially with the mostly read-only workloads that are common today. The logical independence of database replicas also increases resilience to data corruption, whether malicious or not. Moreover, they are inexpensive and widely available as an add-on to all major DBMSs, as no changes to the server software are required. Unfortunately, in a shared-nothing cluster a separate physical copy of the data is required for each node. Therefore, even if only a few copies are required for dependability, a large cluster with hundreds of nodes must also be configured with sufficient storage capacity for hundreds of copies of the data. In large-scale systems, this imposes a hardware and operational cost that offsets the initial advantage.
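
The order-then-propagate pattern can be summarized by the following Python sketch; it is purely illustrative of the generic approach described above, and the Sequencer/Replica/commit names are hypothetical, not taken from any particular replication product.

    import itertools

    class Sequencer:
        """Stand-in for a total-order broadcast: assigns a global position
        to every update before it is delivered to the replicas."""
        def __init__(self):
            self._counter = itertools.count()
        def order(self, update):
            return (next(self._counter), update)

    class Replica:
        def __init__(self):
            self.state = {}
        def apply(self, ordered_update):
            _, (key, value) = ordered_update
            self.state[key] = value

    def commit(update, sequencer, replicas):
        """Order and propagate the update to all replicas *before* replying
        to the client, so no conflict can surface after the commit."""
        ordered = sequencer.order(update)
        for r in replicas:
            r.apply(ordered)
        return "committed"          # only now is the client acknowledged

    seq = Sequencer()
    nodes = [Replica(), Replica(), Replica()]
    print(commit(("x", 42), seq, nodes))
    print([r.state for r in nodes])  # every replica holds the same logical state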

Goal

The goal of project ReD is to achieve a generic, robust, and inexpensive shared-storage cluster from an off-the-shelf RDBMS. In detail, the project will deliver the following concrete results:

  • A general architecture and specification of the proposed approach.
  • An exploration of the performance, scalability, and dependability aspects of the approach, highlighting the most interesting tradeoffs.
  • A detailed experimental evaluation, using a prototype implementation and industry-standard transaction processing benchmarks.

Challenge and Approach

At first sight, it might look simple to extend the shared-nothing protocol to cope with shared storage: If all replicas perform exactly the same write operations, the database state would be identical and could thus be shared. Unfortunately, internal server non-determinism means that different physical images are produced even when the replicas are logically consistent, leading to corruption. Moreover, such a simple approach would not preserve the logical independence of replicas, ruling out tolerance of Byzantine faults.
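
To make the failure mode concrete, the Python toy below (purely illustrative, not part of ReD) applies the same logical updates on two replicas whose internal page-allocation order differs; the resulting physical images hash differently even though the logical contents are equivalent.

    import hashlib
    import random

    def apply_updates(updates, seed):
        """Toy replica: applies the same logical updates, but models internal
        non-determinism as a per-replica page-allocation order (the seed)."""
        rng = random.Random(seed)
        pages = {}                      # page number -> on-disk contents
        free = list(range(16))
        rng.shuffle(free)               # e.g. free-space reuse order differs per replica
        for key, value in updates:
            pages[free.pop()] = f"{key}={value}".encode()  # physical placement differs
        image = b"".join(pages.get(p, b"\x00") for p in range(16))
        return hashlib.sha256(image).hexdigest()

    updates = [("a", 1), ("b", 2), ("c", 3)]
    print(apply_updates(updates, seed=1))   # replica 1
    print(apply_updates(updates, seed=2))   # replica 2: same logical state,
                                            # different physical image

Sharing a single physical copy between such replicas would therefore corrupt it, even though each replica is individually correct.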

The ReD approach is to combine the replication protocol with a specialized copy-on-write volume management system that holds transient, logically independent partial copies, thus masking internal server non-determinism and isolating the multiple logical replicas for resilience.
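
As a minimal sketch of the copy-on-write idea, the Python class below keeps a private overlay of modified blocks on top of a shared, read-only base image; the class name, block size, and read/write/discard interface are hypothetical and do not reflect ReD's actual volume manager.

    class CowVolume:
        """Minimal copy-on-write overlay: reads fall through to a shared,
        read-only base image unless the block was written locally; writes
        never touch the base."""

        def __init__(self, base, block_size=4096):
            self.base = base            # shared backing store (bytes-like)
            self.block_size = block_size
            self.overlay = {}           # block number -> private copy

        def read(self, block_no):
            if block_no in self.overlay:
                return self.overlay[block_no]
            offset = block_no * self.block_size
            return self.base[offset:offset + self.block_size]

        def write(self, block_no, data):
            assert len(data) == self.block_size
            self.overlay[block_no] = bytes(data)   # transient, logically independent copy

        def discard(self):
            """Drop the transient copy, e.g. when the replica is retired."""
            self.overlay.clear()

In such a scheme, each replica attaches its own overlay to the same shared base image; the divergent physical images produced by different replicas stay in their private overlays and never collide in the shared store, while an overlay can be discarded or checked against others without affecting the base copy.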