sl-solution/InMemoryDatasets.jl


Multithreaded package for working with tabular data in Julia


InMemoryDatasets

Documentation

Documentation for the latest release is available at https://sl-solution.github.io/InMemoryDatasets.jl/stable.

Introduction

InMemoryDatasets.jl is a multithreaded package for data manipulation, designed for Julia 1.6+ (64-bit OS). The core computation engine of the package is a set of customised algorithms developed specifically for columnar tables. The package's performance is tuned with two goals in mind: a) low overhead for allowing missing values everywhere, and b) the following priorities, in order of importance:

Low compilation time
Memory efficiency
High performance

We do our best to keep the overall complexity of the package as low as possible, to simplify:

the maintenance of the package
adding new features to the package
contributing to the package

See here for some benchmarks.

Features

InMemoryDatasets.jl has many interesting features; here we highlight some of our favourites (in no particular order):

Assigning a named function to a column as its format
- By default, formatted values are used for operations such as displaying, sorting, grouping, and joining
- Format evaluation is lazy
- Formats don't change the actual values
Multi-threading across the whole package
- Most functions in InMemoryDatasets.jl exploit all cores available to Julia by default
- Parallel computation can be disabled by passing the threads = false keyword argument to functions
Powerful row-wise operations
- Support for many common operations
- Specialised operations for modifying columns
- Customised row-wise operations for filtering observations (filter simply wraps byrow)
Unique approach for reshaping data
- Unified syntax for all types of reshaping
- Covers all reshaping functions:
  - stacking and un-stacking on single/multiple columns
  - wide to long and long to wide reshaping
  - transposing and more
Fast sorting algorithms
- Stable and unstable HeapSort and QuickSort algorithms
- Count sort for integers
Compiler-friendly grouping algorithms
- groupby!/groupby to group observations using sorting algorithms (sorted order)
- gatherby to group observations using hybrid hash algorithms (observations order)
- Incremental grouping operation for groupby!/groupby, i.e. adding a column at a time
Efficient joining algorithms
- Preserve the order of observations in the left data set
- Support two methods for joining: sort-merge join and hash join
- Customised columnar-hybrid-hash algorithms for join
- Inequality-kind (non-equi) and range joins for innerjoin, contains, semijoin!/semijoin, antijoin!/antijoin
- closejoin!/closejoin for non-exact match join
- update!/update for updating a master data set with values from a transaction data set
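
As a small illustration of the format and threading features listed above, the sketch below assigns a named function as a column format, sorts on it, and then sums two columns row-wise with threading switched off. The data sets, their column names, and the values are made up for illustration. The setformat! call with a column => function pair is taken from the package documentation rather than from this README, so treat that signature as an assumption and check the documentation before relying on it.

```julia
using InMemoryDatasets

# Hypothetical data, invented for this sketch.
ds = Dataset(id = [1, 2, 3, 4, 5, 6], x = [10, 25, 31, 44, 52, 67])

# Assign the named function `isodd` as the format of column :x
# (assumed signature: setformat!(ds, col => fn)).
setformat!(ds, :x => isodd)

# Sorting uses the formatted values by default, so rows with even x
# (formatted as false) are placed before rows with odd x (true).
sorted = sort(ds, :x)

# The stored values are unchanged; the format only affects how they are seen.
raw = ds[:, :x]

# Row-wise sum over two columns of a separate, unformatted data set,
# once with the default threading and once with threads = false.
nums = Dataset(a = [1, 2, 3], b = [10, 20, 30])
totals = byrow(nums, sum, [:a, :b])
totals_serial = byrow(nums, sum, [:a, :b], threads = false)
```

Both byrow calls should return the same vector; the keyword only controls whether the work is spread across threads.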

Example

julia> using InMemoryDatasets

julia> g1 = repeat(1:6, inner = 4);

julia> g2 = repeat(1:4, 6);

julia> y = ["d8888b.  ", " .d8b.   ", "d888888b ", "  .d8b.  ",
            "88  `8D  ", "d8' `8b  ", "`~~88~~' ", " d8' `8b ",
            "88   88  ", "88ooo88  ", "   88    ", " 88ooo88 ",
            "88   88  ", "88~~~88  ", "   88    ", " 88~~~88 ",
            "88  .8D  ", "88   88  ", "   88    ", " 88   88 ",
            "Y8888D'  ", "YP   YP  ", "   YP    ", " YP   YP "];

julia> ds = Dataset(g1 = g1, g2 = g2, y = y)
24×3 Dataset
 Row │ g1        g2        y
     │ identity  identity  identity
     │ Int64?    Int64?    String?
─────┼───────────────────────────────
   1 │        1         1  d8888b.
   2 │        1         2   .d8b.
   3 │        1         3  d888888b
   4 │        1         4    .d8b.
   5 │        2         1  88  `8D
   6 │        2         2  d8' `8b
   7 │        2         3  `~~88~~'
   8 │        2         4   d8' `8b
   9 │        3         1  88   88
  10 │        3         2  88ooo88
  11 │        3         3     88
  12 │        3         4   88ooo88
  13 │        4         1  88   88
  14 │        4         2  88~~~88
  15 │        4         3     88
  16 │        4         4   88~~~88
  17 │        5         1  88  .8D
  18 │        5         2  88   88
  19 │        5         3     88
  20 │        5         4   88   88
  21 │        6         1  Y8888D'
  22 │        6         2  YP   YP
  23 │        6         3     YP
  24 │        6         4   YP   YP

julia> sort(ds, :g2)
24×3 Sorted Dataset
 Sorted by: g2
 Row │ g1        g2        y
     │ identity  identity  identity
     │ Int64?    Int64?    String?
─────┼───────────────────────────────
   1 │        1         1  d8888b.
   2 │        2         1  88  `8D
   3 │        3         1  88   88
   4 │        4         1  88   88
   5 │        5         1  88  .8D
   6 │        6         1  Y8888D'
   7 │        1         2   .d8b.
   8 │        2         2  d8' `8b
   9 │        3         2  88ooo88
  10 │        4         2  88~~~88
  11 │        5         2  88   88
  12 │        6         2  YP   YP
  13 │        1         3  d888888b
  14 │        2         3  `~~88~~'
  15 │        3         3     88
  16 │        4         3     88
  17 │        5         3     88
  18 │        6         3     YP
  19 │        1         4    .d8b.
  20 │        2         4   d8' `8b
  21 │        3         4   88ooo88
  22 │        4         4   88~~~88
  23 │        5         4   88   88
  24 │        6         4   YP   YP

julia> tds = transpose(groupby(ds, :g1), :y)
6×6 Dataset
 Row │ g1        _variables_  _c1        _c2        _c3        _c4
     │ identity  identity     identity   identity   identity   identity
     │ Int64?    String?      String?    String?    String?    String?
─────┼───────────────────────────────────────────────────────────────────
   1 │        1  y            d8888b.     .d8b.     d888888b     .d8b.
   2 │        2  y            88  `8D    d8' `8b    `~~88~~'    d8' `8b
   3 │        3  y            88   88    88ooo88       88       88ooo88
   4 │        4  y            88   88    88~~~88       88       88~~~88
   5 │        5  y            88  .8D    88   88       88       88   88
   6 │        6  y            Y8888D'    YP   YP       YP       YP   YP

julia> mds = map(tds, x -> replace(x, r"[^ ]" => "∑"), r"_c")
6×6 Dataset
 Row │ g1        _variables_  _c1        _c2        _c3        _c4
     │ identity  identity     identity   identity   identity   identity
     │ Int64?    String?      String?    String?    String?    String?
─────┼───────────────────────────────────────────────────────────────────
   1 │        1  y            ∑∑∑∑∑∑∑     ∑∑∑∑∑     ∑∑∑∑∑∑∑∑     ∑∑∑∑∑
   2 │        2  y            ∑∑  ∑∑∑    ∑∑∑ ∑∑∑    ∑∑∑∑∑∑∑∑    ∑∑∑ ∑∑∑
   3 │        3  y            ∑∑   ∑∑    ∑∑∑∑∑∑∑       ∑∑       ∑∑∑∑∑∑∑
   4 │        4  y            ∑∑   ∑∑    ∑∑∑∑∑∑∑       ∑∑       ∑∑∑∑∑∑∑
   5 │        5  y            ∑∑  ∑∑∑    ∑∑   ∑∑       ∑∑       ∑∑   ∑∑
   6 │        6  y            ∑∑∑∑∑∑∑    ∑∑   ∑∑       ∑∑       ∑∑   ∑∑

julia> byrow(mds, sum, r"_c", by = x -> count(isequal('∑'), x))
6-element Vector{Union{Missing, Int64}}:
 25
 25
 20
 20
 15
 17

julia> using Chain

julia> @chain mds begin
           repeat!(2)
           sort!(:g1)
           flatten!(r"_c")
           insertcols!(:g2 => repeat(1:9, 12))
           groupby(:g2)
           transpose(r"_c")
           modify!(r"_c" => byrow(x -> join(reverse(x))))
           select!(r"row")
           insertcols!(1, :g => repeat(1:4, 9))
           sort!(:g)
       end
36×2 Sorted Dataset
 Sorted by: g
 Row │ g         row_function
     │ identity  identity
     │ Int64?    String?
─────┼────────────────────────
   1 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   2 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   3 │        1  ∑∑        ∑∑
   4 │        1  ∑∑        ∑∑
   5 │        1  ∑∑∑∑    ∑∑∑∑
   6 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   7 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   8 │        1
   9 │        1
  10 │        2  ∑∑∑∑∑∑∑∑∑∑
  11 │        2  ∑∑∑∑∑∑∑∑∑∑∑∑
  12 │        2      ∑∑∑∑∑∑∑∑
  13 │        2      ∑∑∑∑  ∑∑
  14 │        2      ∑∑∑∑∑∑∑∑
  15 │        2  ∑∑∑∑∑∑∑∑∑∑∑∑
  16 │        2  ∑∑∑∑∑∑∑∑∑∑
  17 │        2
  18 │        2
  19 │        3          ∑∑∑∑
  20 │        3          ∑∑∑∑
  21 │        3          ∑∑∑∑
  22 │        3  ∑∑∑∑∑∑∑∑∑∑∑∑
  23 │        3  ∑∑∑∑∑∑∑∑∑∑∑∑
  24 │        3          ∑∑∑∑
  25 │        3          ∑∑∑∑
  26 │        3          ∑∑∑∑
  27 │        3
  28 │        4
  29 │        4  ∑∑∑∑∑∑∑∑∑∑
  30 │        4  ∑∑∑∑∑∑∑∑∑∑∑∑
  31 │        4      ∑∑∑∑∑∑∑∑
  32 │        4      ∑∑∑∑  ∑∑
  33 │        4      ∑∑∑∑∑∑∑∑
  34 │        4  ∑∑∑∑∑∑∑∑∑∑∑∑
  35 │        4  ∑∑∑∑∑∑∑∑∑∑
  36 │        4

Acknowledgement

We would like to acknowledge the contributors to Julia's data ecosystem, especially DataFrames.jl, since the existence of their work gave the development of InMemoryDatasets.jl a head start.

About

Releases: 52 (latest: v0.7.24, Nov 10, 2025)
Contributors: 11
Languages: Julia 100.0%
