Build a simple functioning distributed file system

Design decisions

Cases covered:

  • PUT
  • GET
  • Verify file integrity via md5 checksum (see the sketch after this list)
  • Dynamic adding of storage nodes
  • Replication of file
  • If the store operation is not successful, retry based on the error code
  • While retrieving, if the primary node is unavailable, fetch from a replica
  • Multiple simultaneous requests to get/put
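
The md5 check above could look roughly like the following on the client side; the helper names and the chunk size are illustrative assumptions, not the repo's actual code.

```python
import hashlib

def md5_of_file(path, chunk_size=8192):
    """Compute the md5 digest of a file without reading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_integrity(local_path, expected_md5):
    """Compare the checksum of the received file with the one reported by the server."""
    return md5_of_file(local_path) == expected_md5
```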

Cases not covered:

  • Single server — no load balancing
  • No periodic heartbeat check for the storage machines
  • No ACL for the files
  • No proper storage distribution
  • Blocking operation for the client
  • No support for multiple files with the same name.
    If such a request comes, the file is stored under a different unique name and the client is notified of the generated name.
  • No support for a delete operation (though it could be added easily)

Primary node selection algorithm:   Random
Client-server protocol:             HTTP
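
A small sketch of random primary-node selection as described above; the node registry and helper name below are assumptions (the host/port pairs simply mirror the docker-compose setup shown later).

```python
import random

# hypothetical registry of storage nodes known to the master, as (host, port) pairs
STORAGE_NODES = [("sn0", 5000), ("sn1", 5050), ("sn2", 6000), ("sn3", 6050), ("sn4", 7000)]

def pick_primary_node(nodes=STORAGE_NODES):
    """Pick one of the registered storage nodes at random to act as the primary."""
    if not nodes:
        raise RuntimeError("no storage nodes registered")
    return random.choice(nodes)
```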

Trade-offs made

  1. Responsibility to get file from storage node — thick client vs thin client

    * Chosen:       thin client
    * Motivation:   ability to modify the business logic without touching the client
    * Cons:         one extra file transfer
  2. Replication responsibility — server vs primary storage node vs dedicated node for replication

    * Chosen:       dedicated node for replication (see the sketch after this list)
    * Motivation:   no extra load on the server and primary nodes + async
    * Cons:         eventual replication instead of immediate
  3. Selection of primary node while storing file — based on hash vs random

    * Chosen:       random
    * Motivation:   ability to add new storage nodes without handling rebalance
    * Cons:         additional step at retrieval compared to the hash-based approach
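
The Celery logs in the Usage section show the replication task as dfs_tasks.replicate. Below is a hedged sketch of how such a dedicated, asynchronous replication task might be wired up with Celery and Redis; the broker URL, task arguments, and HTTP endpoints are assumptions, not the repo's actual code.

```python
import requests
from celery import Celery

# broker URL assumes the Redis container from docker-compose
app = Celery("dfs_tasks", broker="redis://redis:6379/0")

@app.task
def replicate(filename, src_node, dst_node):
    """Copy a file from the primary storage node to a replica node, asynchronously."""
    # pull the file from the primary node ...
    resp = requests.get(f"http://{src_node}/files/{filename}", timeout=10)
    resp.raise_for_status()
    # ... and push it to the replica node
    put_resp = requests.put(f"http://{dst_node}/files/{filename}", data=resp.content, timeout=10)
    put_resp.raise_for_status()
    return f"Successful. {filename} : {src_node} -> {dst_node}"
```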

Design

1. Put a file

[Diagram: dfs_put]

2. Get a file

[Diagram: dfs_get]
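
A sketch of the retrieval flow with the replica fallback listed under the covered cases; the endpoint path and error handling are assumptions, and the actual routes in the repo may differ.

```python
import requests

def fetch_file(filename, primary, replicas):
    """Try the primary storage node first; fall back to the replicas if it is unreachable."""
    for host, port in [primary, *replicas]:
        try:
            resp = requests.get(f"http://{host}:{port}/files/{filename}", timeout=5)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException:
            continue  # node down or errored; try the next one
    raise RuntimeError(f"could not retrieve {filename} from any node")
```

For the example in the Usage section, this would amount to fetch_file("dummy.txt", ("sn1", 5050), [("sn4", 7000), ("sn0", 5000)]).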

Technology stack

Language:               python
Server:                 Flask
DB:                     sqlite3
Task queue:             Celery
Message broker:         Redis
Cluster management:     docker-compose
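
To make the stack concrete, here is a minimal sketch of what one storage-node service could look like with Flask; the route names and storage directory are illustrative assumptions, not the repo's actual code.

```python
import os
from flask import Flask, request, send_from_directory

STORAGE_DIR = os.environ.get("STORAGE_DIR", "storage")
app = Flask(__name__)

@app.route("/files/<filename>", methods=["PUT"])
def put_file(filename):
    """Persist the uploaded bytes under this node's storage directory."""
    os.makedirs(STORAGE_DIR, exist_ok=True)
    with open(os.path.join(STORAGE_DIR, filename), "wb") as f:
        f.write(request.get_data())
    return {"message": f"{filename} saved", "status_code": 200}

@app.route("/files/<filename>", methods=["GET"])
def get_file(filename):
    """Serve a previously stored file back to the caller."""
    return send_from_directory(STORAGE_DIR, filename)
```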

Requirements

  • python
  • docker, docker-compose

Setup

$ git clone https://github.com/DevipriyaSarkar/distributed_file_system_sample.git
$ cd distributed_file_system_sample
$ virtualenv --python=python3 ~/PythonEnvs/distributed_file_system_sample
$ source ~/PythonEnvs/distributed_file_system_sample/bin/activate
(distributed_file_system_sample) $ python one_time_setup.py
Created database dfs.db.
Table created for master node.
Table created for storing replication data.
Setup done!

$ docker-compose up -d

# verify whether all the containers are up and running
$ docker ps
CONTAINER ID        IMAGE                                       COMMAND                  CREATED             STATUS              PORTS                    NAMES
de17bdc9ad3e        distributed_file_system_sample_sn3          "/bin/sh -c 'flask r…"   21 hours ago        Up 4 hours          0.0.0.0:6050->6050/tcp   sn3
40d3f9d18281        distributed_file_system_sample_sn0          "/bin/sh -c 'flask r…"   21 hours ago        Up 4 hours          0.0.0.0:5000->5000/tcp   sn0
af9201ea6b44        redis:5-alpine                              "docker-entrypoint.s…"   21 hours ago        Up 4 hours          0.0.0.0:6379->6379/tcp   redis
4d4001d234d4        distributed_file_system_sample_dfs_celery   "/bin/sh -c 'celery …"   21 hours ago        Up 4 hours                                   dfs_celery
54fe337a616b        distributed_file_system_sample_sn1          "/bin/sh -c 'flask r…"   21 hours ago        Up 4 hours          0.0.0.0:5050->5050/tcp   sn1
efd85b121822        distributed_file_system_sample_master       "/bin/sh -c 'flask r…"   21 hours ago        Up 4 hours          0.0.0.0:8820->8820/tcp   master
0f7ed8810411        distributed_file_system_sample_sn4          "/bin/sh -c 'flask r…"   21 hours ago        Up 4 hours          0.0.0.0:7000->7000/tcp   sn4
777586996a22        distributed_file_system_sample_sn2          "/bin/sh -c 'flask r…"   21 hours ago        Up 4 hours          0.0.0.0:6000->6000/tcp   sn2

Usage

# store file
$ python client.py put dummy.txt
Initiating store dummy.txt to server.
File stored successfully!
{'message': 'File storage_sn1_5050/dummy.txt saved successfully.', 'status_code': 200}

# retrieve file
$ python client.py get dummy.txt
Initiating retrive dummy.txt from server.
File retrieved successfully!
Checking file integrity.
{'message': 'received_files/dummy.txt received. File integrity validated.', 'status_code': 200}

# received files are stored at $PROJECT_ROOT/received_files

# to check where the replicas are
$ docker logs dfs_celery    # or check the db table `replication_data`
[2020-04-14 14:00:09,985: INFO/ForkPoolWorker-8] Task dfs_tasks.replicate[7d5fa969-d85b-4f28-80ad-efdd55898d16] succeeded in 0.04845419999946898s: 'Successful. dummy.txt : sn1:5050 —> sn4:7000'
[2020-04-14 14:00:10,001: INFO/ForkPoolWorker-1] Task dfs_tasks.replicate[a364a055-7f43-47da-b3a8-b20699f62bb8] succeeded in 0.06273929999952088s: 'Successful. dummy.txt : sn1:5050 —> sn0:5000'

# so the file is stored primarily on sn1
# the replicas are there at sn4 and sn0
# can be verified by checking the following directories for dummy.txt
# - $PROJECT_ROOT/storage_sn1_5050
# - $PROJECT_ROOT/storage_sn4_7000
# - $PROJECT_ROOT/storage_sn0_5000
# this is possible without ssh-ing into the individual containers
# because the $PROJECT_ROOT is volume mounted for ease in debugging
