cfhammill/Frames: data frames for tabular data.
User-friendly, type-safe, runtime-efficient tooling for working with tabular data deserialized from comma-separated values (CSV) files. The type of each row is inferred from the data itself, and rows can then be streamed from disk or worked with in memory.
We provide streaming and in-memory interfaces for efficiently working with datasets that can be safely indexed by column names found in the data files themselves. This type safety of column access and manipulation is checked at compile time.
For a running example, we will use variations of the `prestige.csv` data set. Each row includes 7 columns, but we just want to compute the average ratio of `income` to `prestige`.
If you have CSV data where the values of each column may be classified by a single type, and ideally a header row giving each column a name, you may simply want to avoid writing out the Haskell type corresponding to each row. `Frames` provides Template Haskell machinery to infer a Haskell type for each row of your data set, thus preventing the situation where your code quietly diverges from your data.
We generate a collection of definitions by inspecting the data file at compile time (using `tableTypes`), then, at runtime, load that data into column-oriented storage in memory (an in-core array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the `foldl` library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that program.
```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell #-}
module UncurryFold where
import qualified Control.Foldl as L
import Data.Vinyl (rcast)
import Data.Vinyl.Curry (runcurryX)
import Frames

-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
tableTypes "Row" "test/data/prestige.csv"

loadRows :: IO (Frame Row)
loadRows = inCoreAoS (readTable "test/data/prestige.csv")

-- | Compute the ratio of income to prestige for a record containing
-- only those fields.
ratio :: Record '[Income, Prestige] -> Double
ratio = runcurryX (\i p -> fromIntegral i / p)

averageRatio :: IO Double
averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows
  where avg = (/) <$> L.sum <*> L.genericLength
```
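The `avg` helper above combines two folds applicatively via the `foldl` library. The same shape can be seen without any external libraries by using the function applicative over a plain list; this is a sketch for intuition only, since unlike `Control.Foldl` it traverses the list twice:

```haskell
import Data.List (genericLength)

-- Average as an applicative combination of sum and length, mirroring
-- the foldl-style 'avg' helper above. The applicative here is
-- ((->) [Double]): both sum and genericLength consume the same input.
avg :: [Double] -> Double
avg = (/) <$> sum <*> genericLength

main :: IO ()
main = print (avg [1, 2, 3, 4])  -- prints 2.5
```

`Control.Foldl` exists precisely to recover this applicative style while fusing both component folds into a single pass over the data.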
Now consider a case where our data file lacks a header row (I deleted the first row from `prestige.csv`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names *do* come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by `rowGen` we care to change, passing the result to `tableTypes'`. Link to code.
```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell #-}
module UncurryFoldNoHeader where
import qualified Control.Foldl as L
import Data.Vinyl (rcast)
import Data.Vinyl.Curry (runcurryX)
import Frames
import Frames.TH (rowGen, RowGen(..))

-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
tableTypes' (rowGen "test/data/prestigeNoHeader.csv")
            { rowTypeName = "NoH"
            , columnNames = [ "Job", "Schooling", "Money", "Females"
                            , "Respect", "Census", "Category" ]
            , tablePrefix = "NoHead" }

loadRows :: IO (Frame NoH)
loadRows = inCoreAoS (readTableOpt noHParser "test/data/prestigeNoHeader.csv")

-- | Compute the ratio of money to respect for a record containing
-- only those fields.
ratio :: Record '[NoHeadMoney, NoHeadRespect] -> Double
ratio = runcurryX (\m r -> fromIntegral m / r)

averageRatio :: IO Double
averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows
  where avg = (/) <$> L.sum <*> L.genericLength
```
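The customization above is ordinary Haskell record-update syntax applied to the value `rowGen` returns. As a self-contained analogy, with a hypothetical `Gen` record standing in for `RowGen`:

```haskell
-- A hypothetical stand-in for RowGen: a plain record whose fields we
-- override with record-update syntax, just as the tableTypes' call does.
data Gen = Gen { rowTypeName :: String
               , columnNames :: [String]
               , tablePrefix :: String }
  deriving Show

defaults :: Gen
defaults = Gen { rowTypeName = "Row", columnNames = [], tablePrefix = "" }

main :: IO ()
main = print defaults { rowTypeName = "NoH", tablePrefix = "NoHead" }
-- Record update overrides only the named fields; columnNames keeps
-- its default.
```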
Sometimes not every row has a value for every column. I went ahead and blanked the `prestige` column of every row whose `type` column was `NA` in `prestige.csv`. For example, the first such row now reads,

```
"athletes",11.44,8206,8.13,,3373,NA
```
We can no longer parse a `Double` for that row, so we will work with row types parameterized by a `Maybe` type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the `prestige` column was parsed, only keeping those rows for which it was not, then project the `income` column from those rows, and finally throw away `Nothing` elements. Link to code.
```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell,
             TypeApplications, TypeOperators #-}
module UncurryFoldPartialData where
import qualified Control.Foldl as L
import Data.Maybe (isNothing)
import Data.Vinyl.XRec (toHKD)
import Frames
import Pipes (Producer, (>->))
import qualified Pipes.Prelude as P

-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
-- The prestige column has been left blank for rows whose "type" is
-- listed as "NA".
tableTypes "Row" "test/data/prestigePartial.csv"

-- | A pipes 'Producer' of our 'Row' type with a column functor of
-- 'Maybe'. That is, each element of each row may have failed to parse
-- from the CSV file.
maybeRows :: MonadSafe m
          => Producer (Rec (Maybe :. ElField) (RecordColumns Row)) m ()
maybeRows = readTableMaybe "test/data/prestigePartial.csv"

-- | Return the number of rows with unknown prestige, and the average
-- income of those rows.
incomeOfUnknownPrestige :: IO (Int, Double)
incomeOfUnknownPrestige =
  runSafeEffect . L.purely P.fold avg $
    maybeRows >-> P.filter prestigeUnknown
              >-> P.map getIncome
              >-> P.concat
  where avg = (\s l -> (l, s / fromIntegral l)) <$> L.sum <*> L.length
        getIncome = fmap fromIntegral . toHKD . rget @Income
        prestigeUnknown :: Rec (Maybe :. ElField) (RecordColumns Row) -> Bool
        prestigeUnknown = isNothing . toHKD . rget @Prestige
```
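To see the row-level logic of this pipeline in isolation, here is a minimal sketch over plain lists; the rows are hypothetical two-field tuples, and there is no Frames, pipes, or streaming involved:

```haskell
import Data.Maybe (isNothing, mapMaybe)

-- A hypothetical row: (income, prestige), either of which may have
-- failed to parse.
type MaybeRow = (Maybe Int, Maybe Double)

-- Keep rows whose prestige failed to parse, project their income,
-- and drop any income value that itself failed to parse.
incomeOfUnknownPrestige :: [MaybeRow] -> (Int, Double)
incomeOfUnknownPrestige rows = (length incomes, avg)
  where incomes = mapMaybe fst (filter (isNothing . snd) rows)
        avg = fromIntegral (sum incomes) / fromIntegral (length incomes)

main :: IO ()
main = print (incomeOfUnknownPrestige
                [ (Just 8206, Nothing)
                , (Just 1000, Just 3.5)
                , (Just 4000, Nothing) ])
-- prints (2,6103.0)
```

The streaming version above does the same filter, projection, and `Nothing`-dropping (`P.concat`), but as a single pass over rows produced incrementally from disk.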
For comparison to working with data frames in other languages, see the tutorial.
There are various demos in the repository. Be sure to run the `getdata` build target to download the data files used by the demos! You can also download the data files manually and put them in a `data` directory in the directory from which you will be running the executables.
The benchmark shows several ways of dealing with data when you want to perform multiple traversals.
Another demo shows how to fuse multiple passes into one so that the full data set is never resident in memory. A Pandas version of a similar program is also provided for comparison.
This is a trivial program, but shows that performance is comparable to Pandas, and the memory savings of a compiled program are substantial.