
Smart and Scalable Data Systems Lab

@ Department of Computer Science | Brandeis University

Latest News

Jan, 2025

We are hosting North-East Database Day 2025 at Brandeis! SSD Lab presents three posters!

Oct, 2024

Our paper "QuIT your B+-tree for the Quick Insertion Tree" is accepted for publication in EDBT 2025.

Sep, 2024

Congratulations, Alex Ott! Alex receives a full scholarship to attend the Brandeis x Possible program.

May, 2024

Our paper "Anatomy of the LSM Memory Buffer: Insights & Implications" is accepted for publication in DBTest 2024.

May, 2024

Our paper "KVBench: A Key-Value Benchmarking Suite" is accepted for publication in DBTest 2024.

Apr, 2024

SSD Lab goes to North-East Database Day 2024 with two posters!

Mar, 2024

Congratulations to Alex Ott for being selected for the Gordon Science Research Fellowship!

Aug, 2023

Our paper "Enabling Timely and Persistent Deletion in LSM-Engines" is now published in ACM TODS.

Explore Our Research

Welcome to the Smart & Scalable Data Systems (SSD) lab! We are a group of passionate systems enthusiasts who are obsessed with designing, optimizing, and building highly efficient data systems. At SSD Lab, we strive to solve cutting-edge research challenges at the intersection of data systems design, storage engine optimization, and data privacy protection in modern systems. Our mission is to advance knowledge across the broader domains of database and storage engine design, machine learning for systems, systems for machine learning, and privacy-aware data systems. Below are the three main research thrusts of the lab.

Query Path Optimization

Optimizing the read and write paths of modern storage engines, from sortedness-aware indexes to LSM memory buffers, to deliver fast ingestion without sacrificing query performance.

GitHub

Deletion-Compliant Databases

Designing storage engines that delete data in a timely and persistent manner, so that systems can comply with privacy regulations such as GDPR, CCPA, and VCDPA without sacrificing performance.

GitHub

Self-Designing Data Systems

Building data systems that adapt their design to the workload and the underlying hardware, drawing on machine learning for systems and systems for machine learning.

GitHub

Publications

  1. DBTest
    Anatomy of the LSM Memory Buffer: Insights & Implications
    Shubham Kaushik and Subhadeep Sarkar
    In Proceedings of the International Workshop on Testing Database Systems (DBTest), 2024
    Abs PDF

    Log-structured merge (LSM) tree is an ingestion-optimized data structure that is widely used in modern NoSQL key-value stores. To support high throughput for writes, LSM-trees maintain an in-memory buffer that absorbs the incoming entries before writing them to slower secondary storage. We point out that the choice of the data structure and implementation of the memory buffer has a significant impact on the overall performance of LSM-based storage engines. In fact, even with the same implementation of the buffer, the performance of a storage engine can vary by up to several orders of magnitude if there is a shift in the input workload.
    In this paper, we benchmark the performance of LSM-based storage engines with different memory buffer implementations and under different workload characteristics. We experiment with four buffer implementations, namely, (i) vector, (ii) skip-list, (iii) hash skip-list, and (iv) hash linked-list, and for each implementation, we vary the design choices (such as bucket count in a hash skip-list and prefix length in a hash linked-list). We present a comprehensive performance benchmark for each buffer configuration, and highlight how the relative performance of the different buffer implementations varies with a shift in input workload. Lastly, we present a guideline for selecting the appropriate buffer implementation for a given workload and performance goal.
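    The four buffer variants above correspond to RocksDB's pluggable memtable factories. As a minimal sketch (assuming RocksDB as the engine under test, which the abstract does not state, and with placeholder knob values), the snippet below shows how one would switch the buffer implementation and its design choices:

    // Minimal sketch: selecting the memory-buffer (memtable) implementation in
    // RocksDB. Engine choice, path, and knob values are illustrative assumptions.
    #include <rocksdb/db.h>
    #include <rocksdb/memtablerep.h>
    #include <rocksdb/options.h>
    #include <rocksdb/slice_transform.h>

    int main() {
      rocksdb::Options opt;
      opt.create_if_missing = true;

      // (i) vector buffer:
      //   opt.memtable_factory.reset(new rocksdb::VectorRepFactory());
      // (ii) skip-list buffer (the default):
      //   opt.memtable_factory.reset(new rocksdb::SkipListFactory());
      // (iii) hash skip-list, where the bucket count is a tunable design choice:
      //   opt.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(4));
      //   opt.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory(1000000));

      // (iv) hash linked-list, where the prefix length (here 4 bytes) is a
      // tunable design choice.
      opt.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(4));
      opt.memtable_factory.reset(rocksdb::NewHashLinkListRepFactory(50000));
      // Non-default memtables do not support concurrent writes.
      opt.allow_concurrent_memtable_write = false;

      rocksdb::DB* db = nullptr;
      if (!rocksdb::DB::Open(opt, "/tmp/buffer_demo", &db).ok()) return 1;
      db->Put(rocksdb::WriteOptions(), "key1", "value1");
      delete db;
      return 0;
    }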

  2. DBTest
    KVBench: A Key-Value Benchmarking Suite
    Zichen Zhu, Arpita Saha, Manos Athanassoulis, and Subhadeep Sarkar
    In Proceedings of the International Workshop on Testing Database Systems (DBTest), 2024
    Abs PDF

    Key-value stores are at the core of several modern NoSQL-based data systems, and thus, a comprehensive benchmark tool is of paramount importance in evaluating their performance under different workloads. Prior research reveals that real-world workloads have a diverse range of characteristics, such as the fraction of point queries that target non-existing keys, point and range deletes, as well as different distributions for queries and updates, all of which have very different performance implications. State-of-the-art key-value workload generators, such as YCSB and db_bench, fail to generate workloads that emulate these practical workloads, limiting the dimensions on which we can benchmark the systems’ performance.
    In this paper, we present KVBench, a novel synthetic workload generator that fills the gap between classical key-value workload generators and more complex real-life workloads. KVBench supports a wide range of operations, including point queries, range queries, inserts, updates, deletes, and range deletes; among these, inserts, queries, and updates can be customized with different distributions. Compared to state-of-the-art key-value workload generators, KVBench offers a richer array of knobs, including the proportion of empty point queries, customized distributions for updates and queries, and range deletes with specific selectivity, constituting a significantly more flexible framework that can better emulate real-world workloads.
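    To illustrate what the "proportion of empty point queries" knob controls, here is a small, hypothetical sketch (not KVBench's actual interface): each point query is drawn either from the inserted key domain or from a disjoint, never-inserted domain, according to a configurable fraction.

    // Hypothetical illustration of an "empty point query" knob; function and
    // parameter names are not KVBench's API.
    #include <cstdint>
    #include <random>
    #include <vector>

    std::vector<uint64_t> generate_point_queries(size_t num_queries,
                                                 double empty_fraction,
                                                 uint64_t max_existing_key,
                                                 std::mt19937_64& rng) {
      std::uniform_real_distribution<double> coin(0.0, 1.0);
      std::uniform_int_distribution<uint64_t> existing(0, max_existing_key);
      // "Empty" queries target a disjoint part of the key domain that was
      // never inserted, so they are guaranteed to return no result.
      std::uniform_int_distribution<uint64_t> missing(max_existing_key + 1,
                                                      2 * max_existing_key);
      std::vector<uint64_t> queries;
      queries.reserve(num_queries);
      for (size_t i = 0; i < num_queries; ++i) {
        queries.push_back(coin(rng) < empty_fraction ? missing(rng) : existing(rng));
      }
      return queries;
    }

    int main() {
      std::mt19937_64 rng(42);
      // 30% of one million point queries target non-existing keys.
      auto queries = generate_point_queries(1'000'000, 0.3, 10'000'000, rng);
      return queries.empty() ? 1 : 0;
    }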

  3. EDBT
    QuIT your B+-tree for the Quick Insertion Tree
    Aneesh Raman, Konstantinos Karatsenidis, Shaolin Xie, Matthaios Olma, Subhadeep Sarkar, and Manos Athanassoulis
    In Proceedings of the International Conference on Extending Database Technology (EDBT), 2025
    Abs PDF

    Search trees, like B+-trees, are often used as index structures in data systems to improve query performance at the cost of index construction and maintenance. Production data systems drastically reduce the index construction cost when the data arrives fully sorted by employing a fast-path ingestion technique to their B+-tree that directly appends the incoming entries to the tail leaf. However, this optimization is only effective if the incoming data is fully sorted or has very few out-of-order entries. The state-of-the-art sortedness-aware design (SWARE) employs an in-memory buffer to capture near-sortedness to reduce the index construction cost when the data is nearly sorted. This, however, sacrifices performance during lookups and introduces additional design complexity.
    To address these challenges, we present Quick Insertion Tree (QuIT), a new sortedness-aware index that improves ingestion performance with minimal design complexity and no read overhead. QuIT maintains in memory a pointer to the predicted-ordered-leaf (pole) that provides a sortedness-aware fast-path optimization, and facilitates faster ingestion. The key benefit comes from accurately predicting the pole throughout data ingestion. Further, QuIT achieves high memory utilization by maintaining tightly packed leaf nodes when the ingested data arrives with high sortedness. This, in turn, helps improve performance during range lookups. Overall, QuIT outperforms B+-tree (SWARE) by up to 3× (2×) for ingestion, while also offering up to 1.32× faster (than SWARE) point lookup performance and accessing up to 2× fewer leaf nodes than the B+-tree during range lookups.
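    The fast-path idea can be made concrete with a heavily simplified sketch (illustrative only, not QuIT's actual implementation): the tree caches a pointer to the leaf where the next key is predicted to land, appends there when the prediction holds, and falls back to a regular root-to-leaf insert otherwise.

    // Toy sketch of a sortedness-aware fast-path insert; the leaf level is
    // modeled as a sorted vector of leaves and internal nodes are elided.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Leaf {
      std::vector<uint64_t> keys;  // kept sorted
      uint64_t upper_bound;        // smallest key that belongs to the next leaf
    };

    struct ToyTree {
      static constexpr size_t kLeafCapacity = 4;
      std::vector<Leaf> leaves;
      size_t predicted = 0;  // index of the cached predicted leaf

      ToyTree() { leaves.push_back(Leaf{{}, UINT64_MAX}); }

      void insert(uint64_t key) {
        Leaf& p = leaves[predicted];
        // Fast path: the key lands in the predicted leaf, after its largest
        // key, and the leaf has room, so append without descending the tree.
        if (p.keys.size() < kLeafCapacity && key < p.upper_bound &&
            (p.keys.empty() || key >= p.keys.back())) {
          p.keys.push_back(key);
          return;
        }
        // Slow path: locate the target leaf (stands in for a root-to-leaf walk).
        size_t idx = 0;
        while (key >= leaves[idx].upper_bound) ++idx;
        if (leaves[idx].keys.size() == kLeafCapacity) {  // split a full leaf
          std::vector<uint64_t> right_half(
              leaves[idx].keys.begin() + kLeafCapacity / 2, leaves[idx].keys.end());
          uint64_t split_key = right_half.front();
          leaves[idx].keys.resize(kLeafCapacity / 2);
          Leaf right{std::move(right_half), leaves[idx].upper_bound};
          leaves[idx].upper_bound = split_key;
          leaves.insert(leaves.begin() + idx + 1, std::move(right));
          if (key >= split_key) ++idx;
        }
        Leaf& target = leaves[idx];
        target.keys.insert(
            std::lower_bound(target.keys.begin(), target.keys.end(), key), key);
        predicted = idx;  // re-establish the predicted leaf after the slow path
      }
    };

    int main() {
      ToyTree tree;
      for (uint64_t k : {1, 2, 3, 5, 4, 6, 7, 9, 8, 10}) tree.insert(k);  // nearly sorted
      return 0;
    }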

  4. TODS
    Enabling Timely and Persistent Deletion in LSM-Engines
    Subhadeep Sarkar, Tarikul Islam Papon, Zichen Zhu, Dimitris Staratzis, and Manos Athanassoulis
    ACM Transactions on Database Systems, 2023
    Abs PDF

    Data-intensive applications have fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as second-class citizens. A delete operation inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM-engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion.
    Further, LSM-engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written, leading to undesired latency spikes and increasing the overall operational cost of a database. Efficient and persistent deletion is key to support: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on data deletion, and (iii) en masse cloud deployment of data systems. Further, we document that LSM-based key-value engines perform suboptimally in the presence of deletes in a workload. Tombstone-driven logical deletes, by design, are unable to purge the deleted entries in a timely manner, and retaining the invalidated entries perpetually affects the overall performance of LSM-engines in terms of space amplification, write amplification, and read performance. Moreover, the potentially unbounded latency for persistent deletes brings in critical privacy concerns in light of data privacy protection regulations, such as the right to be forgotten in EU’s GDPR, the right to delete in California’s CCPA and CPRA, and the deletion right in Virginia’s VCDPA. Toward this, we introduce the delete design space for LSM-trees and highlight the performance implications of the different classes of delete operations.
    To address these challenges, in this article, we build a new key-value storage engine, Lethe+, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order. We show that Lethe+ supports any user-defined threshold for the delete persistence latency offering higher read throughput (1.17x-1.4x) and lower space amplification (2.1x-9.8x), with a modest increase in write amplification (between 4% and 25%) that can be further amortized to less than 1%. In addition, Lethe+ supports efficient range deletes on a secondary delete key by dropping entire data pages without sacrificing read performance or employing a costly full tree merge.
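    To make the tombstone mechanism concrete, the sketch below uses RocksDB as a representative LSM engine (an illustrative assumption, not necessarily the paper's experimental setup): a Delete() merely writes a tombstone, the deleted value physically persists in older sorted runs until compaction reaches it, and the stock way to force immediate physical deletion is a full-range compaction, exactly the costly operation that Lethe+ is designed to avoid.

    // Logical vs. physical deletion in an LSM engine, illustrated with RocksDB.
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options opt;
      opt.create_if_missing = true;
      rocksdb::DB* db = nullptr;
      if (!rocksdb::DB::Open(opt, "/tmp/delete_demo", &db).ok()) return 1;

      db->Put(rocksdb::WriteOptions(), "user42", "sensitive-value");

      // Logical delete: inserts a tombstone that shadows older versions of the
      // key; the old value still resides on disk in older sorted runs.
      db->Delete(rocksdb::WriteOptions(), "user42");

      // Forcing the deleted value to be physically purged requires compacting
      // the full key range, i.e., rewriting (potentially) the entire tree.
      db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);

      delete db;
      return 0;
    }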

Classes

Database Management Systems
Advanced Database Systems
Advanced Data Systems

Team

Subhadeep Sarkar

Assistant Professor
Office Volen 259
[email protected]

Shubham Kaushik

PhD Researcher
Office Volen 110
[email protected]

James Chen

PhD Researcher
Office Volen 111
[email protected]

Yu-Cheng Huang

Researcher
[email protected]

Arpita Saha

Researcher

Alexander Ott

Researcher
[email protected]

Find Us

© Copyright 2025 SSD Lab.