Multithreaded package for working with tabular data in Julia
sl-solution/InMemoryDatasets.jl
The latest release's documentation is available at https://sl-solution.github.io/InMemoryDatasets.jl/stable.
InMemoryDatasets.jl is a multithreaded package for data manipulation, designed for Julia 1.6+ (64-bit OS). The core computation engine of the package is a set of customised algorithms developed specifically for columnar tables. The package performance is tuned with two goals in mind: a) low overhead for allowing missing values everywhere, and b) the following priorities, in order of importance:
- Low compilation time
- Memory efficiency
- High performance
We do our best to keep the overall complexity of the package as low as possible, to simplify:
- the maintenance of the package
- adding new features to the package
- contributing to the package
See here for some benchmarks.
InMemoryDatasets.jl has many interesting features. Here we highlight some of our favourites (in no particular order):
- Assigning a named function to a column as its format
  - By default, formatted values are used for operations like displaying, sorting, grouping, joining, ...
  - Format evaluation is lazy
  - Formats don't change the actual values
- Multi-threading across the whole package
  - Most functions in `InMemoryDatasets.jl` exploit all cores available to `Julia` by default
  - Parallel computation can be disabled by passing the `threads = false` keyword argument to functions
- Powerful row-wise operations
  - Support for many common operations
  - Specialised operations for modifying columns
  - Customised row-wise operations for filtering observations (`filter` simply wraps `byrow`)
- Unique approach for reshaping data
  - Unified syntax for all types of reshaping
  - Covers all reshaping functions:
    - stacking and un-stacking on single/multiple columns
    - wide to long and long to wide reshaping
    - transposing and more
- Fast sorting algorithms
  - Stable and unstable `HeapSort` and `QuickSort` algorithms
  - Count sort for integers
- Compiler friendly grouping algorithms
  - `groupby!`/`groupby` to group observations using sorting algorithms (sorted order)
  - `gatherby` to group observations using hybrid hash algorithms (observations order)
  - Incremental grouping operation for `groupby!`/`groupby`, i.e. adding a column at a time
- Efficient joining algorithms
  - Preserve the order of observations in the left data set
  - Support two methods for joining: `sort-merge` join and `hash` join
  - Customised columnar-hybrid-hash algorithms for join
  - Inequality-kind (non-equi) and range joins for `innerjoin`, `contains`, `semijoin!`/`semijoin`, `antijoin!`/`antijoin`
  - `closejoin!`/`closejoin` for non exact match join
  - `update!`/`update` for updating a master data set with values from a transaction data set
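As a rough illustration of the format and row-wise features listed above, the following sketch attaches a named function to a column as its format, checks that the stored values are untouched, and then runs two row-wise operations. The data set, its column names and the `parity` helper are invented for this example, and passing `threads = false` to `byrow` is assumed to follow the package-wide keyword convention described above.

```julia
using InMemoryDatasets

# Hypothetical toy data: names and values are for illustration only.
ds = Dataset(id = [1, 2, 3, 4], x = [10, 25, 30, 45], y = [1, 2, 3, 4])

# A named function used as a format (hypothetical helper).
parity(v) = isodd(v) ? "odd" : "even"
setformat!(ds, :id => parity)

# Formats do not change the stored values.
raw_ids = ds[:, :id]

# Row-wise sum of x and y with multithreading disabled
# (assumed to accept the package-wide `threads` keyword).
row_totals = byrow(ds, sum, [:x, :y], threads = false)

# `filter` wraps `byrow`: keep the rows where x exceeds 20.
large = filter(ds, :x, by = >(20))
```

Because the format is only a view on the column, operations such as sorting or grouping on `:id` would see `"odd"`/`"even"`, while `raw_ids` still holds the original integers.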
```julia
julia> using InMemoryDatasets

julia> g1 = repeat(1:6, inner = 4);

julia> g2 = repeat(1:4, 6);

julia> y = ["d8888b.  ", " .d8b.   ", "d888888b ", "  .d8b.  ",
            "88  `8D  ", "d8' `8b  ", "`~~88~~' ", " d8' `8b ",
            "88   88  ", "88ooo88  ", "   88    ", " 88ooo88 ",
            "88   88  ", "88~~~88  ", "   88    ", " 88~~~88 ",
            "88  .8D  ", "88   88  ", "   88    ", " 88   88 ",
            "Y8888D'  ", "YP   YP  ", "   YP    ", " YP   YP "];

julia> ds = Dataset(g1 = g1, g2 = g2, y = y)
24×3 Dataset
 Row │ g1        g2        y
     │ identity  identity  identity
     │ Int64?    Int64?    String?
─────┼───────────────────────────────
   1 │        1         1  d8888b.
   2 │        1         2   .d8b.
   3 │        1         3  d888888b
   4 │        1         4    .d8b.
   5 │        2         1  88  `8D
   6 │        2         2  d8' `8b
   7 │        2         3  `~~88~~'
   8 │        2         4   d8' `8b
   9 │        3         1  88   88
  10 │        3         2  88ooo88
  11 │        3         3     88
  12 │        3         4   88ooo88
  13 │        4         1  88   88
  14 │        4         2  88~~~88
  15 │        4         3     88
  16 │        4         4   88~~~88
  17 │        5         1  88  .8D
  18 │        5         2  88   88
  19 │        5         3     88
  20 │        5         4   88   88
  21 │        6         1  Y8888D'
  22 │        6         2  YP   YP
  23 │        6         3     YP
  24 │        6         4   YP   YP

julia> sort(ds, :g2)
24×3 Sorted Dataset
 Sorted by: g2
 Row │ g1        g2        y
     │ identity  identity  identity
     │ Int64?    Int64?    String?
─────┼───────────────────────────────
   1 │        1         1  d8888b.
   2 │        2         1  88  `8D
   3 │        3         1  88   88
   4 │        4         1  88   88
   5 │        5         1  88  .8D
   6 │        6         1  Y8888D'
   7 │        1         2   .d8b.
   8 │        2         2  d8' `8b
   9 │        3         2  88ooo88
  10 │        4         2  88~~~88
  11 │        5         2  88   88
  12 │        6         2  YP   YP
  13 │        1         3  d888888b
  14 │        2         3  `~~88~~'
  15 │        3         3     88
  16 │        4         3     88
  17 │        5         3     88
  18 │        6         3     YP
  19 │        1         4    .d8b.
  20 │        2         4   d8' `8b
  21 │        3         4   88ooo88
  22 │        4         4   88~~~88
  23 │        5         4   88   88
  24 │        6         4   YP   YP

julia> tds = transpose(groupby(ds, :g1), :y)
6×6 Dataset
 Row │ g1        _variables_  _c1        _c2        _c3        _c4
     │ identity  identity     identity   identity   identity   identity
     │ Int64?    String?      String?    String?    String?    String?
─────┼───────────────────────────────────────────────────────────────────
   1 │        1  y            d8888b.     .d8b.     d888888b     .d8b.
   2 │        2  y            88  `8D    d8' `8b    `~~88~~'    d8' `8b
   3 │        3  y            88   88    88ooo88       88       88ooo88
   4 │        4  y            88   88    88~~~88       88       88~~~88
   5 │        5  y            88  .8D    88   88       88       88   88
   6 │        6  y            Y8888D'    YP   YP       YP       YP   YP

julia> mds = map(tds, x -> replace(x, r"[^ ]" => "∑"), r"_c")
6×6 Dataset
 Row │ g1        _variables_  _c1        _c2        _c3        _c4
     │ identity  identity     identity   identity   identity   identity
     │ Int64?    String?      String?    String?    String?    String?
─────┼───────────────────────────────────────────────────────────────────
   1 │        1  y            ∑∑∑∑∑∑∑     ∑∑∑∑∑     ∑∑∑∑∑∑∑∑     ∑∑∑∑∑
   2 │        2  y            ∑∑  ∑∑∑    ∑∑∑ ∑∑∑    ∑∑∑∑∑∑∑∑    ∑∑∑ ∑∑∑
   3 │        3  y            ∑∑   ∑∑    ∑∑∑∑∑∑∑       ∑∑       ∑∑∑∑∑∑∑
   4 │        4  y            ∑∑   ∑∑    ∑∑∑∑∑∑∑       ∑∑       ∑∑∑∑∑∑∑
   5 │        5  y            ∑∑  ∑∑∑    ∑∑   ∑∑       ∑∑       ∑∑   ∑∑
   6 │        6  y            ∑∑∑∑∑∑∑    ∑∑   ∑∑       ∑∑       ∑∑   ∑∑

julia> byrow(mds, sum, r"_c", by = x -> count(isequal('∑'), x))
6-element Vector{Union{Missing, Int64}}:
 25
 25
 20
 20
 15
 17

julia> using Chain

julia> @chain mds begin
           repeat!(2)
           sort!(:g1)
           flatten!(r"_c")
           insertcols!(:g2 => repeat(1:9, 12))
           groupby(:g2)
           transpose(r"_c")
           modify!(r"_c" => byrow(x -> join(reverse(x))))
           select!(r"row")
           insertcols!(1, :g => repeat(1:4, 9))
           sort!(:g)
       end
36×2 Sorted Dataset
 Sorted by: g
 Row │ g         row_function
     │ identity  identity
     │ Int64?    String?
─────┼────────────────────────
   1 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   2 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   3 │        1  ∑∑        ∑∑
   4 │        1  ∑∑        ∑∑
   5 │        1  ∑∑∑∑    ∑∑∑∑
   6 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   7 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   8 │        1
   9 │        1
  10 │        2  ∑∑∑∑∑∑∑∑∑∑
  11 │        2  ∑∑∑∑∑∑∑∑∑∑∑∑
  12 │        2      ∑∑∑∑∑∑∑∑
  13 │        2      ∑∑∑∑  ∑∑
  14 │        2      ∑∑∑∑∑∑∑∑
  15 │        2  ∑∑∑∑∑∑∑∑∑∑∑∑
  16 │        2  ∑∑∑∑∑∑∑∑∑∑
  17 │        2
  18 │        2
  19 │        3          ∑∑∑∑
  20 │        3          ∑∑∑∑
  21 │        3          ∑∑∑∑
  22 │        3  ∑∑∑∑∑∑∑∑∑∑∑∑
  23 │        3  ∑∑∑∑∑∑∑∑∑∑∑∑
  24 │        3          ∑∑∑∑
  25 │        3          ∑∑∑∑
  26 │        3          ∑∑∑∑
  27 │        3
  28 │        4
  29 │        4  ∑∑∑∑∑∑∑∑∑∑
  30 │        4  ∑∑∑∑∑∑∑∑∑∑∑∑
  31 │        4      ∑∑∑∑∑∑∑∑
  32 │        4      ∑∑∑∑  ∑∑
  33 │        4      ∑∑∑∑∑∑∑∑
  34 │        4  ∑∑∑∑∑∑∑∑∑∑∑∑
  35 │        4  ∑∑∑∑∑∑∑∑∑∑
  36 │        4
```
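The session above groups with `groupby`, which the feature list describes as producing groups in sorted order, whereas `gatherby` keeps the order of observations. The sketch below contrasts the two on a small, invented data set. It assumes that `combine` accepts both kinds of grouped data with the `:x => sum` form and that the grouping column comes first in the result; the data and variable names are hypothetical.

```julia
using InMemoryDatasets

# Hypothetical toy data: the groups first appear in the order 3, 1, 2.
ds = Dataset(g = [3, 1, 3, 2, 1], x = [1, 2, 3, 4, 5])

# groupby relies on sorting, so groups come out in sorted order.
by_sort = combine(groupby(ds, :g), :x => sum)

# gatherby relies on hashing and keeps the order of observations.
by_hash = combine(gatherby(ds, :g), :x => sum)

# Group keys as plain vectors, for comparison.
sorted_keys = by_sort[:, :g]
gathered_keys = by_hash[:, :g]
```

When the original row order carries meaning (for example, time-ordered records), `gatherby` avoids having to re-sort the result afterwards.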
We would like to acknowledge the contributors to Julia's data ecosystem, especially DataFrames.jl, since their work gave the development of InMemoryDatasets.jl a head start.