Multithreaded package for working with tabular data in Julia

sl-solution/InMemoryDatasets.jl

Documentation

Documentation for the latest release is available at https://sl-solution.github.io/InMemoryDatasets.jl/stable.

Introduction

InMemoryDatasets.jl is a multithreaded package for data manipulation, designed for Julia 1.6+ (64-bit OS). The core computation engine of the package is a set of customised algorithms developed specifically for columnar tables. The package performance is tuned with two goals in mind: a) low overhead for allowing missing values everywhere, and b) the following priorities, in order of importance:

  1. Low compilation time
  2. Memory efficiency
  3. High performance

We do our best to keep the overall complexity of the package as low as possible, in order to simplify:

  • the maintenance of the package
  • adding new features to the package
  • contributing to the package

See here for some benchmarks.

Features

InMemoryDatasets.jl has many interesting features. Here we highlight some of our favourites (in no particular order):

  • Assigning a named function to a column as its format
    • By default, formatted values are used for operations such as displaying, sorting, grouping, and joining
    • Format evaluation is lazy
    • Formats don't change the actual values
  • Multi-threading across the whole package
    • Most functions in InMemoryDatasets.jl exploit all cores available to Julia by default
    • Parallel computation can be disabled by passing the threads = false keyword argument to functions
  • Powerful row-wise operations
    • Support for many common operations
    • Specialised operations for modifying columns
    • Customised row-wise operations for filtering observations (filter simply wraps byrow)
  • Unique approach for reshaping data
    • Unified syntax for all types of reshaping
    • Covers all reshaping functions:
      • stacking and un-stacking on single/multiple columns
      • wide to long and long to wide reshaping
      • transposing and more
  • Fast sorting algorithms
    • Stable and unstable HeapSort and QuickSort algorithms
    • Count sort for integers
  • Compiler-friendly grouping algorithms
    • groupby!/groupby to group observations using sorting algorithms (groups appear in sorted order)
    • gatherby to group observations using hybrid hash algorithms (groups appear in observation order)
    • Incremental grouping for groupby!/groupby, i.e. adding one column at a time
  • Efficient joining algorithms
    • Preserve the order of observations in the left data set
    • Support for two joining methods: sort-merge join and hash join
    • Customised columnar-hybrid-hash algorithms for join
    • Inequality-kind (non-equi) and range joins for innerjoin, contains, semijoin!/semijoin, antijoin!/antijoin
    • closejoin!/closejoin for non-exact match joins
    • update!/update for updating a master data set with values from a transaction data set
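
As a rough illustration of the format, grouping, row-wise, and threading features listed above, here is a minimal sketch. The data set, its column names, and the helper `tens` are hypothetical and chosen only for this example. The calls are written from our understanding of the package's documented API, so please check the documentation for exact signatures and defaults.

```julia
using InMemoryDatasets

# Hypothetical data, used only for illustration.
ds = Dataset(g = [31, 12, 45, 18], a = [1, 2, 3, 4], b = [10, 20, 30, 40])

# A named function used as a format (hypothetical helper).
# It passes `missing` through so it is safe on columns that allow missing values.
tens(v) = ismissing(v) ? missing : v ÷ 10

# Attach the format to column :g; the stored values are not modified.
setformat!(ds, :g => tens)
raw_g = ds[:, :g]

# Sorting and grouping work on the formatted values by default,
# so rows are ordered and grouped by tens(g) rather than by g itself.
sorted = sort(ds, :g)
per_group = combine(groupby(ds, :g), :a => sum)

# Row-wise sum over two columns, with multithreading switched off.
row_totals = byrow(ds, sum, [:a, :b], threads = false)

# filter wraps byrow: keep the rows where b is greater than 15.
kept = filter(ds, :b, by = >(15))
```

Defining the format as a named function (rather than an anonymous one) follows the first item in the list; whether anonymous functions are also accepted is not something this sketch relies on.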

Example

julia> using InMemoryDatasets

julia> g1 = repeat(1:6, inner = 4);

julia> g2 = repeat(1:4, 6);

julia> y = ["d8888b.  ", " .d8b.   ", "d888888b ", "  .d8b.  ",
            "88  `8D  ", "d8' `8b  ", "`~~88~~' ", " d8' `8b ",
            "88   88  ", "88ooo88  ", "   88    ", " 88ooo88 ",
            "88   88  ", "88~~~88  ", "   88    ", " 88~~~88 ",
            "88  .8D  ", "88   88  ", "   88    ", " 88   88 ",
            "Y8888D'  ", "YP   YP  ", "   YP    ", " YP   YP "];

julia> ds = Dataset(g1 = g1, g2 = g2, y = y)
24×3 Dataset
 Row │ g1        g2        y
     │ identity  identity  identity
     │ Int64?    Int64?    String?
─────┼───────────────────────────────
   1 │        1         1  d8888b.
   2 │        1         2   .d8b.
   3 │        1         3  d888888b
   4 │        1         4    .d8b.
   5 │        2         1  88  `8D
   6 │        2         2  d8' `8b
   7 │        2         3  `~~88~~'
   8 │        2         4   d8' `8b
   9 │        3         1  88   88
  10 │        3         2  88ooo88
  11 │        3         3     88
  12 │        3         4   88ooo88
  13 │        4         1  88   88
  14 │        4         2  88~~~88
  15 │        4         3     88
  16 │        4         4   88~~~88
  17 │        5         1  88  .8D
  18 │        5         2  88   88
  19 │        5         3     88
  20 │        5         4   88   88
  21 │        6         1  Y8888D'
  22 │        6         2  YP   YP
  23 │        6         3     YP
  24 │        6         4   YP   YP

julia> sort(ds, :g2)
24×3 Sorted Dataset
 Sorted by: g2
 Row │ g1        g2        y
     │ identity  identity  identity
     │ Int64?    Int64?    String?
─────┼───────────────────────────────
   1 │        1         1  d8888b.
   2 │        2         1  88  `8D
   3 │        3         1  88   88
   4 │        4         1  88   88
   5 │        5         1  88  .8D
   6 │        6         1  Y8888D'
   7 │        1         2   .d8b.
   8 │        2         2  d8' `8b
   9 │        3         2  88ooo88
  10 │        4         2  88~~~88
  11 │        5         2  88   88
  12 │        6         2  YP   YP
  13 │        1         3  d888888b
  14 │        2         3  `~~88~~'
  15 │        3         3     88
  16 │        4         3     88
  17 │        5         3     88
  18 │        6         3     YP
  19 │        1         4    .d8b.
  20 │        2         4   d8' `8b
  21 │        3         4   88ooo88
  22 │        4         4   88~~~88
  23 │        5         4   88   88
  24 │        6         4   YP   YP

julia> tds = transpose(groupby(ds, :g1), :y)
6×6 Dataset
 Row │ g1        _variables_  _c1        _c2        _c3        _c4
     │ identity  identity     identity   identity   identity   identity
     │ Int64?    String?      String?    String?    String?    String?
─────┼───────────────────────────────────────────────────────────────────
   1 │        1  y            d8888b.     .d8b.     d888888b     .d8b.
   2 │        2  y            88  `8D    d8' `8b    `~~88~~'    d8' `8b
   3 │        3  y            88   88    88ooo88       88       88ooo88
   4 │        4  y            88   88    88~~~88       88       88~~~88
   5 │        5  y            88  .8D    88   88       88       88   88
   6 │        6  y            Y8888D'    YP   YP       YP       YP   YP

julia> mds = map(tds, x -> replace(x, r"[^ ]" => "∑"), r"_c")
6×6 Dataset
 Row │ g1        _variables_  _c1        _c2        _c3        _c4
     │ identity  identity     identity   identity   identity   identity
     │ Int64?    String?      String?    String?    String?    String?
─────┼───────────────────────────────────────────────────────────────────
   1 │        1  y            ∑∑∑∑∑∑∑     ∑∑∑∑∑     ∑∑∑∑∑∑∑∑     ∑∑∑∑∑
   2 │        2  y            ∑∑  ∑∑∑    ∑∑∑ ∑∑∑    ∑∑∑∑∑∑∑∑    ∑∑∑ ∑∑∑
   3 │        3  y            ∑∑   ∑∑    ∑∑∑∑∑∑∑       ∑∑       ∑∑∑∑∑∑∑
   4 │        4  y            ∑∑   ∑∑    ∑∑∑∑∑∑∑       ∑∑       ∑∑∑∑∑∑∑
   5 │        5  y            ∑∑  ∑∑∑    ∑∑   ∑∑       ∑∑       ∑∑   ∑∑
   6 │        6  y            ∑∑∑∑∑∑∑    ∑∑   ∑∑       ∑∑       ∑∑   ∑∑

julia> byrow(mds, sum, r"_c", by = x -> count(isequal('∑'), x))
6-element Vector{Union{Missing, Int64}}:
 25
 25
 20
 20
 15
 17

julia> using Chain

julia> @chain mds begin
           repeat!(2)
           sort!(:g1)
           flatten!(r"_c")
           insertcols!(:g2 => repeat(1:9, 12))
           groupby(:g2)
           transpose(r"_c")
           modify!(r"_c" => byrow(x -> join(reverse(x))))
           select!(r"row")
           insertcols!(1, :g => repeat(1:4, 9))
           sort!(:g)
       end
36×2 Sorted Dataset
 Sorted by: g
 Row │ g         row_function
     │ identity  identity
     │ Int64?    String?
─────┼────────────────────────
   1 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   2 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   3 │        1  ∑∑        ∑∑
   4 │        1  ∑∑        ∑∑
   5 │        1  ∑∑∑∑    ∑∑∑∑
   6 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   7 │        1  ∑∑∑∑∑∑∑∑∑∑∑∑
   8 │        1
   9 │        1
  10 │        2  ∑∑∑∑∑∑∑∑∑∑
  11 │        2  ∑∑∑∑∑∑∑∑∑∑∑∑
  12 │        2      ∑∑∑∑∑∑∑∑
  13 │        2      ∑∑∑∑  ∑∑
  14 │        2      ∑∑∑∑∑∑∑∑
  15 │        2  ∑∑∑∑∑∑∑∑∑∑∑∑
  16 │        2  ∑∑∑∑∑∑∑∑∑∑
  17 │        2
  18 │        2
  19 │        3          ∑∑∑∑
  20 │        3          ∑∑∑∑
  21 │        3          ∑∑∑∑
  22 │        3  ∑∑∑∑∑∑∑∑∑∑∑∑
  23 │        3  ∑∑∑∑∑∑∑∑∑∑∑∑
  24 │        3          ∑∑∑∑
  25 │        3          ∑∑∑∑
  26 │        3          ∑∑∑∑
  27 │        3
  28 │        4
  29 │        4  ∑∑∑∑∑∑∑∑∑∑
  30 │        4  ∑∑∑∑∑∑∑∑∑∑∑∑
  31 │        4      ∑∑∑∑∑∑∑∑
  32 │        4      ∑∑∑∑  ∑∑
  33 │        4      ∑∑∑∑∑∑∑∑
  34 │        4  ∑∑∑∑∑∑∑∑∑∑∑∑
  35 │        4  ∑∑∑∑∑∑∑∑∑∑
  36 │        4

Acknowledgement

We would like to acknowledge the contributors to Julia's data ecosystem, especially DataFrames.jl, since the existence of their work gave the development of InMemoryDatasets.jl a head start.
