DataManagementLab/ScaleStore

This is the source code for our (Tobias Ziegler, Carsten Binnig, and Viktor Leis) paper published at SIGMOD'22: ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. The paper can be found here: Paper Link ACM or Paper Link PDF

Abstract

In this paper, we propose ScaleStore, a novel distributed storage engine that exploits DRAM caching, NVMe storage, and RDMA networking to achieve high performance, cost-efficiency, and scalability at the same time. Using low latency RDMA messages, ScaleStore implements a transparent memory abstraction that provides access to the aggregated DRAM memory and NVMe storage of all nodes. In contrast to existing distributed RDMA designs such as NAM-DB or FaRM, ScaleStore integrates seamlessly with NVMe SSDs, lowering the overall hardware cost significantly. The core of ScaleStore is a distributed caching strategy that dynamically decides which data to keep in memory (and which on SSDs) based on the workload. The caching protocol also provides strong consistency in the presence of concurrent data modifications. In our YCSB-based evaluation, we show that ScaleStore can provide high performance for various types of workloads (read/write-dominated, uniform/skewed) even when the data size is larger than the aggregated memory of all nodes. We further show that ScaleStore can efficiently handle dynamic workload changes and support elasticity.

Citation

@inproceedings{DBLP:conf/sigmod/0001BL22,
  author    = {Tobias Ziegler and
               Carsten Binnig and
               Viktor Leis},
  title     = {ScaleStore: {A} Fast and Cost-Efficient Storage Engine using DRAM,
               NVMe, and {RDMA}},
  booktitle = {{SIGMOD} '22: International Conference on Management of Data, Philadelphia,
               PA, USA, June 12 - 17, 2022},
  pages     = {685--699},
  publisher = {{ACM}},
  year      = {2022},
  url       = {https://doi.org/10.1145/3514221.3526187},
  doi       = {10.1145/3514221.3526187}
}

Setup

Cluster Setup

All experiments were conducted on a 5-node cluster running Ubuntu 18.04.1 LTS with a Linux 4.15.0 kernel. Each node is equipped with two Intel(R) Xeon(R) Gold 5120 CPUs (14 cores), 512 GB of main memory split between both sockets, and four Samsung SSD 980 Pro M.2 1 TB drives connected via PCIe through one ASRock Hyper Quad M.2 PCIe card. The nodes of the cluster are connected with an InfiniBand network using one Mellanox ConnectX-5 MT27800 NIC (InfiniBand EDR 4x, 100 Gbps) per node.

Mellanox RDMA

We used the following Mellanox OFED installation:

ofed_info

MLNX_OFED_LINUX-5.1-2.5.8.0 (OFED-5.1-2.5.8):
Installed Packages:
-------------------
ii  ar-mgr                       1.0-0.3.MLNX20200824.g8577618.51258      amd64  Adaptive Routing Manager
ii  dapl2-utils                  2.1.10.1.mlnx-OFED.51258                 amd64  Utilities for use with the DAPL libraries
ii  dpcp                         1.1.0-1.51258                            amd64  Direct Packet Control Plane (DPCP) is a library to use Devx
ii  dump-pr                      1.0-0.3.MLNX20200824.g8577618.51258      amd64  Dump PathRecord Plugin
ii  hcoll                        4.6.3125-1.51258                         amd64  Hierarchical collectives (HCOLL)
ii  ibacm                        51mlnx1-1.51258                          amd64  InfiniBand Communication Manager Assistant (ACM)
ii  ibdump                       6.0.0-1.51258                            amd64  Mellanox packets sniffer tool
ii  ibsim                        0.9-1.51258                              amd64  InfiniBand fabric simulator for management
ii  ibsim-doc                    0.9-1.51258                              all    documentation for ibsim
ii  ibutils2                     2.1.1-0.126.MLNX20200721.gf95236b.51258  amd64  OpenIB Mellanox InfiniBand Diagnostic Tools
ii  ibverbs-providers:amd64      51mlnx1-1.51258                          amd64  User space provider drivers for libibverbs
ii  ibverbs-utils                51mlnx1-1.51258                          amd64  Examples for the libibverbs library
ii  infiniband-diags             51mlnx1-1.51258                          amd64  InfiniBand diagnostic programs
ii  iser-dkms                    5.1-OFED.5.1.2.5.3.1                     all    DKMS support for iser kernel modules
ii  isert-dkms                   5.1-OFED.5.1.2.5.3.1                     all    DKMS support for isert kernel modules
ii  kernel-mft-dkms              4.15.1-100                               all    DKMS support for kernel-mft kernel modules
ii  knem                         1.1.4.90mlnx1-OFED.5.1.2.5.0.1           amd64  userspace tools for the KNEM kernel module
ii  knem-dkms                    1.1.4.90mlnx1-OFED.5.1.2.5.0.1           all    DKMS support for mlnx-ofed kernel modules
ii  libdapl-dev                  2.1.10.1.mlnx-OFED.51258                 amd64  Development files for the DAPL libraries
ii  libdapl2                     2.1.10.1.mlnx-OFED.51258                 amd64  The Direct Access Programming Library (DAPL)
ii  libibmad-dev:amd64           51mlnx1-1.51258                          amd64  Development files for libibmad
ii  libibmad5:amd64              51mlnx1-1.51258                          amd64  Infiniband Management Datagram (MAD) library
ii  libibnetdisc5:amd64          51mlnx1-1.51258                          amd64  InfiniBand diagnostics library
ii  libibumad-dev:amd64          51mlnx1-1.51258                          amd64  Development files for libibumad
ii  libibumad3:amd64             51mlnx1-1.51258                          amd64  InfiniBand Userspace Management Datagram (uMAD) library
ii  libibverbs-dev:amd64         51mlnx1-1.51258                          amd64  Development files for the libibverbs library
ii  libibverbs1:amd64            51mlnx1-1.51258                          amd64  Library for direct userspace use of RDMA (InfiniBand/iWARP)
ii  libibverbs1-dbg:amd64        51mlnx1-1.51258                          amd64  Debug symbols for the libibverbs library
ii  libopensm                    5.7.3.MLNX20201102.e56fd90-0.1.51258     amd64  Infiniband subnet manager libraries
ii  libopensm-devel              5.7.3.MLNX20201102.e56fd90-0.1.51258     amd64  Development files for OpenSM
ii  librdmacm-dev:amd64          51mlnx1-1.51258                          amd64  Development files for the librdmacm library
ii  librdmacm1:amd64             51mlnx1-1.51258                          amd64  Library for managing RDMA connections
ii  mlnx-ethtool                 5.4-1.51258                              amd64  This utility allows querying and changing settings such as speed,
ii  mlnx-iproute2                5.6.0-1.51258                            amd64  This utility allows querying and changing settings such as speed,
ii  mlnx-ofed-kernel-dkms        5.1-OFED.5.1.2.5.8.1                     all    DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils       5.1-OFED.5.1.2.5.8.1                     amd64  Userspace tools to restart and tune mlnx-ofed kernel modules
ii  mpitests                     3.2.20-5d20b49.51258                     amd64  Set of popular MPI benchmarks and tools IMB 2018 OSU benchmarks ver 4.0.1 mpiP-3.3 IPM-2.0.6
ii  mstflint                     4.14.0-3.51258                           amd64  Mellanox firmware burning application
ii  openmpi                      4.0.4rc3-1.51258                         all    Open MPI
ii  opensm                       5.7.3.MLNX20201102.e56fd90-0.1.51258     amd64  An Infiniband subnet manager
ii  opensm-doc                   5.7.3.MLNX20201102.e56fd90-0.1.51258     amd64  Documentation for opensm
ii  perftest                     4.4+0.5-1                                amd64  Infiniband verbs performance tests
ii  rdma-core                    51mlnx1-1.51258                          amd64  RDMA core userspace infrastructure and documentation
ii  rdmacm-utils                 51mlnx1-1.51258                          amd64  Examples for the librdmacm library
ii  sharp                        2.2.2.MLNX20201102.b26a0fd-1.51258       amd64  SHArP switch collectives
ii  srp-dkms                     5.1-OFED.5.1.2.5.3.1                     all    DKMS support for srp kernel modules
ii  srptools                     51mlnx1-1.51258                          amd64  Tools for Infiniband attached storage (SRP)
ii  ucx                          1.9.0-1.51258                            amd64  Unified Communication X

SSD

Each node's four Samsung SSD 980 Pro M.2 1 TB drives are connected via PCIe through one ASRock Hyper Quad M.2 PCIe card. All SSDs are used as block devices and organized as a RAID 0 via:

sudo mdadm --create /dev/md0 --auto md --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
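The array can then be sanity-checked with standard mdadm tooling (a generic check, not specific to this repository):

```shell
# Confirm the RAID 0 array is active and lists all four NVMe members.
cat /proc/mdstat
sudo mdadm --detail /dev/md0
```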

Huge pages

We use huge pages for the memory buffers, where N is the number of 2 MB huge pages to reserve:

echo N | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
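As a rough sizing sketch for N (assuming 2 MiB huge pages and the dramGB=150 buffer size used in the YCSB example below; adjust for your configuration and NUMA split):

```shell
# Number of 2 MiB huge pages needed to back a 150 GiB DRAM buffer.
DRAM_GB=150
N=$(( DRAM_GB * 1024 / 2 ))  # GiB -> MiB, then divide by 2 MiB per huge page
echo "$N"                    # -> 76800
```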

CMake build

To build ScaleStore we use CMake. First we create a build folder in the top level folder of scalestore:

mkdir build
cd build

Afterwards, we can build the executables either in debug mode with address sanitizers enabled:

cmake -D CMAKE_C_COMPILER=gcc-10 -D CMAKE_CXX_COMPILER=g++-10 -DCMAKE_BUILD_TYPE=Debug -DSANI=On .. && make -j

or in release mode:

cmake -D CMAKE_C_COMPILER=gcc-10 -D CMAKE_CXX_COMPILER=g++-10 -DCMAKE_BUILD_TYPE=Release .. && make -j

Libraries

  • gflags
  • lib_aio
  • ibverbs
  • tabulate
  • rdma cm
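On Ubuntu, most of these can typically be installed from the distribution repositories. The package names below are the usual Ubuntu names and are an assumption, not taken from this repository; tabulate is a header-only C++ library that may need to be fetched separately:

```shell
# Assumed Ubuntu package names -- verify against your distribution.
sudo apt-get install -y libgflags-dev libaio-dev libibverbs-dev librdmacm-dev
```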

Run executable

All executables can be found in scalestore/build/frontend. For instance, the following command can be used to run ycsb in a single-node setup:

make -j&& numactl --membind=0 --cpunodebind=0  ./ycsb -ownIp=172.18.94.80 -nodes=1 -YCSB_all_workloads -worker=20 -YCSB_tuple_count=1000000000 -dramGB=150 -csvFile=singlenode_oom_scalestore_ycsb_zipf.csv  -YCSB_run_for_seconds=60 -ssd_path=/dev/md0 --ssd_gib=400 -pageProviderThreads=4 -YCSB_all_zipf

Configuration

The main configuration file needed to execute ScaleStore can be found in shared-headers/Defs.hpp.

IPs

To configure the servers and their IPs, the following configuration needs to be adapted:

const std::vector<std::vector<std::string>> NODES{
    {""},  // 0 to allow direct offset
    {"172.18.94.80"},  // 1
    {"172.18.94.80", "172.18.94.70"},  // 2
    {"172.18.94.80", "172.18.94.70", "172.18.94.10"},  // 3
    {"172.18.94.80", "172.18.94.70", "172.18.94.10", "172.18.94.20"},  // 4
    {"172.18.94.80", "172.18.94.70", "172.18.94.10", "172.18.94.20", "172.18.94.40"},  // 5
    {"172.18.94.80", "172.18.94.70", "172.18.94.10", "172.18.94.20", "172.18.94.40", "172.18.94.30"},  // 6
};

CPU Cores

We implemented a very simple CoreManager, which can be found in scalestore/backend/threads/CoreManager.hpp. All configurations are hard-coded to fit our servers (2 NUMA nodes) and might need to be adapted to fit yours.

Gflags help

Besides the Defs.hpp file there are gflags parameters. Most of them are defined in backend/ScaleStore/Config.hpp. However, some are attached to the main executable file; e.g., ycsb has the YCSB_tuple_count flag. To see all (custom) gflags parameters and their descriptions, one can run:

./exe --help

Paper Benchmarks

The paper benchmark implementations can be found in frontend/ycsb. The distributed experiment runner scripts can be found in distexperiments/experiments. To run them, please consult the following GitHub page: https://github.com/mjasny/distexprunner

Benchmark Runners

  • YCSB runner
  • OLAP scan queries

Tests

  • consistency checks
  • TPC-C consistency checks

Known Issues/Bugs

Startup

If you see the following exception at the startup of ScaleStore:

"Consider adjusting BATCH_SIZE and PARTITIONS" in /home/tziegler/ScaleStore/backend/scalestore/storage/buffermanager/Buffermanager.cpp:62

you need to change the PARTITIONS and BATCH_SIZE variables in the Defs.hpp file. The reason is that we use a partitioned queue of batches to reduce contention on the free lists and accesses to the latch. To calculate the number of batches per partition we use:

NUMBER_BATCHES = (DRAM_SIZE / PAGE_SIZE) / PARTITIONS / BATCH_SIZE

Therefore, adjusting these values may be necessary if DRAM_SIZE is too small or the page size has been changed.
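As a worked example of this formula (the PAGE_SIZE, PARTITIONS, and BATCH_SIZE values below are illustrative assumptions, not the repository defaults; check Defs.hpp for the actual values):

```shell
DRAM_SIZE=$(( 150 * 1024 * 1024 * 1024 ))  # 150 GiB buffer pool (as in the YCSB example)
PAGE_SIZE=$(( 4 * 1024 ))                  # assumed 4 KiB pages
PARTITIONS=64                              # assumed partition count
BATCH_SIZE=128                             # assumed batch size
# NUMBER_BATCHES = (DRAM_SIZE / PAGE_SIZE) / PARTITIONS / BATCH_SIZE
echo $(( DRAM_SIZE / PAGE_SIZE / PARTITIONS / BATCH_SIZE ))  # -> 4800
```

If NUMBER_BATCHES comes out as zero (e.g., a very small DRAM_SIZE or a larger page size), the startup check above fires, and PARTITIONS or BATCH_SIZE must be reduced accordingly.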
