- Notifications
You must be signed in to change notification settings - Fork17
This is the source code for our (Tobias Ziegler, Carsten Binnig and Viktor Leis) published paper at SIGMOD’22: ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA.
License
DataManagementLab/ScaleStore
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This is the source code for our (Tobias Ziegler, Carsten Binnig and Viktor Leis) published paper at SIGMOD’22: ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. Paper can be found here:Paper Link ACM orPaper Link PDF
In this paper, we propose ScaleStore, a novel distributed storage engine that exploits DRAM caching, NVMe storage, and RDMA networking to achieve high performance, cost-efficiency, and scalability at the same time. Using low latency RDMA messages, ScaleStore implements a transparent memory abstraction that provides access to the aggregated DRAM memory and NVMe storage of all nodes. In contrast to existing distributed RDMA designs such as NAM-DB or FaRM, ScaleStore integrates seamlessly with NVMe SSDs, lowering the overall hardware cost significantly. The core of ScaleStore is a distributed caching strategy that dynamically decides which data to keep in memory (and which on SSDs) based on the workload. The caching protocol also provides strong consistency in the presence of concurrent data modifications. In our YCSB-based evaluation, we show that ScaleStore can provide high performance for various types of workloads (read/write-dominated, uniform/skewed) even when the data size is larger than the aggregated memory of all nodes. We further show that ScaleStore can efficiently handle dynamic workload changes and support elasticity.
@inproceedings{DBLP:conf/sigmod/0001BL22, author = {Tobias Ziegler and Carsten Binnig and Viktor Leis}, title = {ScaleStore: {A} Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and {RDMA}}, booktitle = {{SIGMOD} '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022}, pages = {685--699}, publisher = {{ACM}}, year = {2022}, url = {https://doi.org/10.1145/3514221.3526187}, doi = {10.1145/3514221.3526187}}
All experiments were conducted on a 5-node cluster running Ubuntu 18.04.1 LTS, with Linux 4.15.0 kernel. Each node is equipped with two Intel(R) Xeon(R) Gold 5120 CPUs (14 cores), 512 GB main-memory split between both sockets, and four Samsung SSD 980 Pro M.2 1 TB connected via PCIe by one ASRock Hyper Quad M.2 PCIe card. The nodes of the cluster are connected with an InfiniBand network using one Mellanox ConnectX-5 MT27800 NICs (InfiniBand EDR 4x, 100 Gbps) per node.
We used the following Mellanox OFED installation:
MLNX_OFED_LINUX-5.1-2.5.8.0 (OFED-5.1-2.5.8):Installed Packages:-------------------ii ar-mgr 1.0-0.3.MLNX20200824.g8577618.51258 amd64 Adaptive Routing Managerii dapl2-utils 2.1.10.1.mlnx-OFED.51258 amd64 Utilitiesfor use with the DAPL librariesii dpcp 1.1.0-1.51258 amd64 Direct Packet Control Plane (DPCP) is a library to use Devxii dump-pr 1.0-0.3.MLNX20200824.g8577618.51258 amd64 Dump PathRecord Pluginii hcoll 4.6.3125-1.51258 amd64 Hierarchical collectives (HCOLL)ii ibacm 51mlnx1-1.51258 amd64 InfiniBand Communication Manager Assistant (ACM)ii ibdump 6.0.0-1.51258 amd64 Mellanox packets sniffer toolii ibsim 0.9-1.51258 amd64 InfiniBand fabric simulatorfor managementii ibsim-doc 0.9-1.51258 all documentationfor ibsimii ibutils2 2.1.1-0.126.MLNX20200721.gf95236b.51258 amd64 OpenIB Mellanox InfiniBand Diagnostic Toolsii ibverbs-providers:amd64 51mlnx1-1.51258 amd64 User space provider driversfor libibverbsii ibverbs-utils 51mlnx1-1.51258 amd64 Examplesfor the libibverbs libraryii infiniband-diags 51mlnx1-1.51258 amd64 InfiniBand diagnostic programsii iser-dkms 5.1-OFED.5.1.2.5.3.1 all DKMS support fo iser kernel modulesii isert-dkms 5.1-OFED.5.1.2.5.3.1 all DKMS support fo isert kernel modulesii kernel-mft-dkms 4.15.1-100 all DKMS supportfor kernel-mft kernel modulesii knem 1.1.4.90mlnx1-OFED.5.1.2.5.0.1 amd64 userspace toolsfor the KNEM kernel moduleii knem-dkms 1.1.4.90mlnx1-OFED.5.1.2.5.0.1 all DKMS supportfor mlnx-ofed kernel modulesii libdapl-dev 2.1.10.1.mlnx-OFED.51258 amd64 Development filesfor the DAPL librariesii libdapl2 2.1.10.1.mlnx-OFED.51258 amd64 The Direct Access Programming Library (DAPL)ii libibmad-dev:amd64 51mlnx1-1.51258 amd64 Development filesfor libibmadii libibmad5:amd64 51mlnx1-1.51258 amd64 Infiniband Management Datagram (MAD) libraryii libibnetdisc5:amd64 51mlnx1-1.51258 amd64 InfiniBand diagnostics libraryii libibumad-dev:amd64 51mlnx1-1.51258 amd64 Development filesfor libibumadii libibumad3:amd64 51mlnx1-1.51258 amd64 InfiniBand Userspace Management Datagram (uMAD) libraryii libibverbs-dev:amd64 51mlnx1-1.51258 amd64 Development filesfor the libibverbs libraryii libibverbs1:amd64 51mlnx1-1.51258 amd64 Libraryfor direct userspace use of RDMA (InfiniBand/iWARP)ii libibverbs1-dbg:amd64 51mlnx1-1.51258 amd64 Debug symbolsfor the libibverbs libraryii libopensm 5.7.3.MLNX20201102.e56fd90-0.1.51258 amd64 Infiniband subnet manager librariesii libopensm-devel 5.7.3.MLNX20201102.e56fd90-0.1.51258 amd64 Developement filesfor OpenSMii librdmacm-dev:amd64 51mlnx1-1.51258 amd64 Development filesfor the librdmacm libraryii librdmacm1:amd64 51mlnx1-1.51258 amd64 Libraryfor managing RDMA connectionsii mlnx-ethtool 5.4-1.51258 amd64 This utility allows querying and changing settings such as speed,ii mlnx-iproute2 5.6.0-1.51258 amd64 This utility allows querying and changing settings such as speed,ii mlnx-ofed-kernel-dkms 5.1-OFED.5.1.2.5.8.1 all DKMS supportfor mlnx-ofed kernel modulesii mlnx-ofed-kernel-utils 5.1-OFED.5.1.2.5.8.1 amd64 Userspace tools to restart and tune mlnx-ofed kernel modulesii mpitests 3.2.20-5d20b49.51258 amd64 Set of popular MPI benchmarks and tools IMB 2018 OSU benchmarks ver 4.0.1 mpiP-3.3 IPM-2.0.6ii mstflint 4.14.0-3.51258 amd64 Mellanox firmware burning applicationii openmpi 4.0.4rc3-1.51258 all Open MPIii opensm 5.7.3.MLNX20201102.e56fd90-0.1.51258 amd64 An Infiniband subnet managerii opensm-doc 5.7.3.MLNX20201102.e56fd90-0.1.51258 amd64 Documentationfor opensmii perftest 4.4+0.5-1 amd64 Infiniband verbs performance testsii rdma-core 51mlnx1-1.51258 amd64 RDMA core userspace infrastructure and documentationii rdmacm-utils 51mlnx1-1.51258 amd64 Examplesfor the librdmacm libraryii sharp 2.2.2.MLNX20201102.b26a0fd-1.51258 amd64 SHArP switch collectivesii srp-dkms 5.1-OFED.5.1.2.5.3.1 all DKMS support fo srp kernel modulesii srptools 51mlnx1-1.51258 amd64 Toolsfor Infiniband attached storage (SRP)ii ucx 1.9.0-1.51258 amd64 Unified Communication X
4x 512 GB main-memory split between both sockets, and four Samsung SSD 980 Pro M.2 1 TB connected via PCIe by one ASRock Hyper Quad M.2 PCIe card. All SSDs are used as block device and organized as a RAID 0 via
sudo mdadm --create /dev/md0 --auto md --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
We are using huge pages for the memory buffers:
echo N| sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
To build ScaleStore we use CMake. First we create a build folder in the top level folder of scalestore:
mkdir buildcd build
Afterwards, we can build the executable with either in debug mode with address sanitizers enabled:
cmake -D CMAKE_C_COMPILER=gcc-10 -D CMAKE_CXX_COMPILER=g++-10 -DCMAKE_BUILD_TYPE=Debug -DSANI=On ..&& make -j
or in release mode:
cmake -D CMAKE_C_COMPILER=gcc-10 -D CMAKE_CXX_COMPILER=g++-10 -DCMAKE_BUILD_TYPE=Release ..&& make -j
- gflags
- lib_aio
- ibverbs
- tabulate
- rdma cm
All executables can be found inscalestore/build/frontend
. For instance, the follwoing command can be used to run ycsb in a single node setup:
make -j&& numactl --membind=0 --cpunodebind=0 ./ycsb -ownIp=172.18.94.80 -nodes=1 -YCSB_all_workloads -worker=20 -YCSB_tuple_count=1000000000 -dramGB=150 -csvFile=singlenode_oom_scalestore_ycsb_zipf.csv -YCSB_run_for_seconds=60 -ssd_path=/dev/md0 --ssd_gib=400 -pageProviderThreads=4 -YCSB_all_zipf
The main configuration file in order to execute ScaleStore can be found inshared-headers/Defs.hpp
.
To configure the servers and their ips the following configuration needs to be adapted:
const std::vector<std::vector<std::string>> NODES{ {""},// 0 to allow direct offset {"172.18.94.80"},// 1 {"172.18.94.80","172.18.94.70"},// 2 {"172.18.94.80","172.18.94.70","172.18.94.10"},// 3 {"172.18.94.80","172.18.94.70","172.18.94.10","172.18.94.20"},// 4 {"172.18.94.80","172.18.94.70","172.18.94.10","172.18.94.20","172.18.94.40"},// 5 {"172.18.94.80","172.18.94.70","172.18.94.10","172.18.94.20","172.18.94.40","172.18.94.30"},// 6};
We implemented a very simpleCoreManager
which can be found in (scalestore/backend/threads/CoreManager.hpp
). All configurations are hard-coded to fit our servers (2 NUMA nodes) and might need to be adapted to fit yours.
Besides theDefs.hpp
file there are gflags parameters. Most of them are stored inbackend/ScaleStore/Config.hpp
. However, some are attached to the main executable file, e.g. ycsb has theYCSB_tuple_count
flag. To see all (custom) gflags parameters and their description one can run:
./exe --help
The paper benchmark implementations can be found infrontend/ycsb
. The distributed experiment runner scripts can be found indistexperiments/experiments
. In order to run them please consult the following github page:https://github.com/mjasny/distexprunner
- YCSB runner
- OLAP scan queries
- consistency checks
- TPC-C consistency checks
If you see the following exception at the startup of ScaleStore:
"Consider adjusting BATCH_SIZE and PARTITIONS"in /home/tziegler/ScaleStore/backend/scalestore/storage/buffermanager/Buffermanager.cpp:62
You would need to change thePARTITIONS
andBATCH_SIZE
variable in theDefs.hpp
file. The reason is that we use a partitioned queue of batches to reduce contention in the free lists and accesses to the latch. To calculate the right number of batches per partition we use.
NUMBER_BATCHES = (DRAM_SIZE / PAGE_SIZE) / PARTITIONS / BATCH_SIZE
Therefore, this may be needed if the DRAM_SIZE is too small or the page size has been changed.
About
This is the source code for our (Tobias Ziegler, Carsten Binnig and Viktor Leis) published paper at SIGMOD’22: ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA.