Immutable data frame for Go
tobgu/qframe
QFrame is an immutable data frame that supports filtering, aggregation and data manipulation. Any operation on a QFrame results in a new QFrame; the original QFrame remains unchanged. This can be done fairly efficiently since much of the underlying data will be shared between the two frames.
The design of QFrame has mainly been driven by the requirements from qocache, but it is in many aspects a general-purpose data frame. Suggestions for added or improved functionality to support a wider scope are always of interest, as long as they don't conflict with the requirements from qocache! See Contribute.
go get github.com/tobgu/qframe
Below are some examples of common use cases. The list is not exhaustive in any way. For a complete description of all operations, including more examples, see the docs.
QFrames can currently be read from and written to CSV, record-oriented JSON, and any SQL database supported by the Go `database/sql` driver.
Read CSV data:

    input := `COL1,COL2
    a,1.5
    b,2.25
    c,3.0`

    f := qframe.ReadCSV(strings.NewReader(input))
    fmt.Println(f)

Output:

    COL1(s) COL2(f)
    ------- -------
          a     1.5
          b    2.25
          c       3

    Dims = 2 x 3

QFrame supports reading and writing data from the standard library `database/sql` drivers. It has been tested with SQLite, Postgres, and MariaDB.
Load data to and from an in-memory SQLite database. Note that this example requires you to have go-sqlite3 installed prior to running.

    package main

    import (
    	"database/sql"
    	"fmt"

    	_ "github.com/mattn/go-sqlite3"
    	"github.com/tobgu/qframe"
    	qsql "github.com/tobgu/qframe/config/sql"
    )

    func main() {
    	// Create a new in-memory SQLite database.
    	db, _ := sql.Open("sqlite3", ":memory:")
    	// Add a new table.
    	db.Exec(`
    	CREATE TABLE test (
    		COL1 INT,
    		COL2 REAL,
    		COL3 TEXT,
    		COL4 BOOL
    	);`)
    	// Create a new QFrame to populate our table with.
    	qf := qframe.New(map[string]interface{}{
    		"COL1": []int{1, 2, 3},
    		"COL2": []float64{1.1, 2.2, 3.3},
    		"COL3": []string{"one", "two", "three"},
    		"COL4": []bool{true, true, true},
    	})
    	fmt.Println(qf)
    	// Start a new SQL Transaction.
    	tx, _ := db.Begin()
    	// Write the QFrame to the database.
    	qf.ToSQL(tx,
    		// Write only to the test table
    		qsql.Table("test"),
    		// Explicitly set SQLite compatibility.
    		qsql.SQLite(),
    	)
    	// Create a new QFrame from SQL.
    	newQf := qframe.ReadSQL(tx,
    		// A query must return at least one column. In this
    		// case it will return all of the columns we created above.
    		qsql.Query("SELECT * FROM test"),
    		// SQLite stores boolean values as integers, so we
    		// can coerce them back to bools with the CoercePair option.
    		qsql.Coerce(qsql.CoercePair{Column: "COL4", Type: qsql.Int64ToBool}),
    		qsql.SQLite(),
    	)
    	fmt.Println(newQf)
    	fmt.Println(newQf.Equals(qf))
    }

Output:

    COL1(i) COL2(f) COL3(s) COL4(b)
    ------- ------- ------- -------
          1     1.1     one    true
          2     2.2     two    true
          3     3.3   three    true

    Dims = 4 x 3
    true

Filtering can be done either by applying individual filters to the QFrame or by combining filters using AND and OR.
Filter with OR-clause:

    f := qframe.New(map[string]interface{}{
    	"COL1": []int{1, 2, 3},
    	"COL2": []string{"a", "b", "c"},
    })
    newF := f.Filter(qframe.Or(
    	qframe.Filter{Column: "COL1", Comparator: ">", Arg: 2},
    	qframe.Filter{Column: "COL2", Comparator: "=", Arg: "a"}))
    fmt.Println(newF)

Output:

    COL1(i) COL2(s)
    ------- -------
          1       a
          3       c

    Dims = 2 x 2

Grouping and aggregation are done in two distinct steps. The function used in the aggregation step takes a slice of elements and returns an element. For floats this function signature matches many of the statistical functions in Gonum, so these can be applied directly.

    intSum := func(xx []int) int {
    	result := 0
    	for _, x := range xx {
    		result += x
    	}
    	return result
    }

    f := qframe.New(map[string]interface{}{
    	"COL1": []int{1, 2, 2, 3, 3},
    	"COL2": []string{"a", "b", "c", "a", "b"},
    })
    f = f.GroupBy(groupby.Columns("COL2")).Aggregate(qframe.Aggregation{Fn: intSum, Column: "COL1"})
    fmt.Println(f.Sort(qframe.Order{Column: "COL2"}))

Output:

    COL2(s) COL1(i)
    ------- -------
          a       4
          b       5
          c       2

    Dims = 2 x 3

There are two different functions by which data can be manipulated, `Apply` and `Eval`. `Eval` is slightly more high level and takes a more data-driven approach, but it basically boils down to a series of `Apply` calls in the end.
Example using `Apply` to string concatenate two columns:

    f := qframe.New(map[string]interface{}{
    	"COL1": []int{1, 2, 3},
    	"COL2": []string{"a", "b", "c"},
    })
    f = f.Apply(
    	qframe.Instruction{Fn: function.StrI, DstCol: "COL1", SrcCol1: "COL1"},
    	qframe.Instruction{Fn: function.ConcatS, DstCol: "COL3", SrcCol1: "COL1", SrcCol2: "COL2"})
    fmt.Println(f.Select("COL3"))

Output:

    COL3(s)
    -------
         1a
         2b
         3c

    Dims = 1 x 3

The same example using `Eval` instead:

    f := qframe.New(map[string]interface{}{
    	"COL1": []int{1, 2, 3},
    	"COL2": []string{"a", "b", "c"},
    })
    f = f.Eval("COL3", qframe.Expr("+", qframe.Expr("str", types.ColumnName("COL1")), types.ColumnName("COL2")))
    fmt.Println(f.Select("COL3"))

Examples of the most common operations are available in the docs.
All operations that may result in errors will set the `Err` variable on the returned QFrame to indicate that an error occurred. The presence of an error on the QFrame will prevent any future operations from being executed on the frame (i.e., it follows a monad-like pattern). This allows for smooth chaining of multiple operations without having to explicitly check errors between each operation.
API functions that require configuration parameters make use of functional options to allow more options to be easily added in the future in a backwards-compatible way.
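The `qsql.Table("test")` and `qsql.SQLite()` arguments in the SQL example are instances of this pattern. As a rough illustration of how functional options work in general, here is a standard-library-only sketch; `csvConfig`, `csvOption`, `delimiter` and `emptyAsNull` are hypothetical names and are not taken from the qframe API.

```go
package main

import "fmt"

// csvConfig and the option constructors below are hypothetical and
// only illustrate the functional options pattern.
type csvConfig struct {
	delimiter   byte
	emptyAsNull bool
}

// csvOption is a function that adjusts a configuration in place.
type csvOption func(*csvConfig)

// delimiter returns an option that sets the field separator.
func delimiter(d byte) csvOption {
	return func(c *csvConfig) { c.delimiter = d }
}

// emptyAsNull returns an option that controls empty-field handling.
func emptyAsNull(b bool) csvOption {
	return func(c *csvConfig) { c.emptyAsNull = b }
}

// newCSVConfig starts from defaults and applies the options in order.
func newCSVConfig(opts ...csvOption) csvConfig {
	c := csvConfig{delimiter: ','}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", newCSVConfig())
	fmt.Printf("%+v\n", newCSVConfig(delimiter(';'), emptyAsNull(true)))
}
```

Because callers only pass the options they care about, a new option can be added later without changing the signature of `newCSVConfig` or breaking existing call sites.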
- Performance
  - Speed should be on par with, or better than, Python Pandas for corresponding operations.
  - No or very little memory overhead per data element.
  - Performance impact of operations should be straightforward to reason about.
- API
  - Should be reasonably small and low ceremony.
  - Should allow custom, user-provided functions to be used for data processing.
  - Should provide built-in functions for most common operations.
A QFrame is a collection of columns which can be of type int, float, string, bool or enum. For more information about the data types see the types docs.
In addition to the columns there is also an index which controls which rows in the columns are part of the QFrame and the sort order of these rows. Many operations on QFrames only affect the index; the underlying data remains the same.
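To make the role of the index more concrete, here is a simplified sketch of the idea using only the standard library. It is not QFrame's actual implementation: the `frame` type, its methods, and the choice of `uint32` for index entries are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// frame is a hypothetical, simplified data frame with one int column.
type frame struct {
	data  []int    // column data, shared between frames and never mutated
	index []uint32 // positions in data that belong to this frame, in order
}

// newFrame creates a frame whose index covers all rows in input order.
func newFrame(data []int) frame {
	index := make([]uint32, len(data))
	for i := range index {
		index[i] = uint32(i)
	}
	return frame{data: data, index: index}
}

// filter returns a new frame containing only the rows accepted by keep.
// Only a new index is allocated; the column data is shared.
func (f frame) filter(keep func(int) bool) frame {
	var index []uint32
	for _, i := range f.index {
		if keep(f.data[i]) {
			index = append(index, i)
		}
	}
	return frame{data: f.data, index: index}
}

// sorted returns a new frame whose index is ordered by column value.
func (f frame) sorted() frame {
	index := append([]uint32(nil), f.index...)
	sort.Slice(index, func(a, b int) bool {
		return f.data[index[a]] < f.data[index[b]]
	})
	return frame{data: f.data, index: index}
}

// values materializes the rows that are visible through the index.
func (f frame) values() []int {
	out := make([]int, 0, len(f.index))
	for _, i := range f.index {
		out = append(out, f.data[i])
	}
	return out
}

func main() {
	f := newFrame([]int{30, 10, 20, 40})
	g := f.filter(func(x int) bool { return x > 10 }).sorted()

	fmt.Println(f.values()) // the original frame is unchanged
	fmt.Println(g.values()) // the derived frame is a filtered, sorted view
}
```

In this sketch both filtering and sorting leave `data` untouched and only produce a new `index`, which is why deriving a new frame from an existing one can be cheap even though frames are immutable.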
Many functions and methods in qframe take the empty interface as a parameter, for example for functions to be applied or for string references to internal functions. These always correspond to a union/sum type with a fixed set of valid types that are checked at runtime through type switches (there's hardly any reflection applied in QFrame, for performance reasons). Which types are valid depends on the function called and the column type that is affected. Modelling this statically is hard or impossible in Go, hence the dynamic approach. If you plan to use QFrame with datasets with a fixed layout and fixed types, it should be a small task to write tiny wrappers for the types you are using to regain static type safety.
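The following standard-library-only sketch illustrates the general technique: a function that accepts `interface{}` arguments and validates them with a type switch, plus a thin typed wrapper that restores compile-time checking. The names `aggregate` and `aggregateInts` are hypothetical and do not exist in qframe.

```go
package main

import "fmt"

// aggregate is a hypothetical function in the style described above.
// Both arguments are empty interfaces; valid combinations are checked
// at runtime with a type switch and type assertions.
func aggregate(column interface{}, fn interface{}) (interface{}, error) {
	switch col := column.(type) {
	case []int:
		f, ok := fn.(func([]int) int)
		if !ok {
			return nil, fmt.Errorf("int column: unsupported function type %T", fn)
		}
		return f(col), nil
	case []float64:
		f, ok := fn.(func([]float64) float64)
		if !ok {
			return nil, fmt.Errorf("float column: unsupported function type %T", fn)
		}
		return f(col), nil
	default:
		return nil, fmt.Errorf("unsupported column type %T", column)
	}
}

// aggregateInts is a tiny typed wrapper. The compiler now rejects
// invalid combinations, so the dynamic check cannot fail here.
func aggregateInts(col []int, fn func([]int) int) int {
	result, err := aggregate(col, fn)
	if err != nil {
		panic(err) // not reachable given the static types above
	}
	return result.(int)
}

func main() {
	intSum := func(xx []int) int {
		total := 0
		for _, x := range xx {
			total += x
		}
		return total
	}

	fmt.Println(aggregate([]int{1, 2, 3}, intSum)) // valid combination
	fmt.Println(aggregate([]int{1, 2, 3}, "sum"))  // rejected at runtime
	fmt.Println(aggregateInts([]int{1, 2, 3}, intSum))
}
```

The wrapper costs only a few lines per concrete type, which is the trade-off the paragraph above suggests for datasets with a fixed layout.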
- The API cannot yet be considered stable.
- The maximum number of rows in a QFrame is 4294967296 (2^32).
- The CSV parser only handles ASCII characters as separators.
- Individual strings cannot be longer than 268 MB (2^28 bytes).
- A string column cannot contain more than a total of 34 GB (2^35 bytes).
- At the moment you cannot rely on any of the errors returned to fulfill anything other than the `error` interface. In the future this will hopefully be improved to provide more help in identifying the root cause of errors.
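The size limits listed above are all powers of two. As a quick check of the arithmetic, this small program derives the decimal figures from the exponents; the constant names are made up for the illustration.

```go
package main

import "fmt"

// Limits taken from the list above, expressed as powers of two.
// The constant names are hypothetical.
const (
	maxRows        = uint64(1) << 32 // rows per QFrame
	maxStringBytes = uint64(1) << 28 // bytes per individual string
	maxColumnBytes = uint64(1) << 35 // total bytes per string column
)

func main() {
	fmt.Println(maxRows)                                 // 4294967296
	fmt.Printf("%.1f MB\n", float64(maxStringBytes)/1e6) // 268.4 MB
	fmt.Printf("%.1f GB\n", float64(maxColumnBytes)/1e9) // 34.4 GB
}
```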
There are a number of benchmarks in qbench comparing QFrame to Pandas and Gota where applicable.
The work on QFrame has been inspired by Python Pandas and Gota.
Want to contribute? Great! Open an issue on GitHub and let the discussions begin! Below are some instructions for working with the QFrame repo.
Below are some ideas of areas where contributions would be welcome.
- Support for more input and output formats.
- Support for additional column formats.
- Support for using the Arrow format for columns.
- General CPU and memory optimizations.
- Improve documentation.
- More analytical functionality.
- Dataset joins.
- Improved interoperability with other libraries in the Go data science ecosystem.
- Improve string representation of QFrames.
Install development dependencies: `make dev-deps`
Please contribute tests together with any code. The tests should be written against the public API to avoid lockdown of the implementation and internal structure, which would make it more difficult to change in the future.
Run tests: `make test`
This will also trigger code to be regenerated.
The codebase contains some generated code to reduce the amount of duplication required for similar functionality across different column types. Generated code is recognized by file names ending with `_gen.go`. These files must never be edited directly.
To trigger code generation: `make generate`