The foundational library of the Morpheus data science framework
zavtech/morpheus-core
The Morpheus library is designed to facilitate the development of high performance analytical software involving large datasets for both offline and real-time analysis on the Java Virtual Machine (JVM). The library is written in Java 8 with extensive use of lambdas, but is accessible to all JVM languages.
For detailed documentation with examples, see here.
At its core, Morpheus provides a versatile two-dimensional memory efficient tabular data structure called a DataFrame, similar to that first popularised in R. While dynamically typed scientific computing languages like R, Python & Matlab are great for doing research, they are not well suited for large scale production systems as they become extremely difficult to maintain, and dangerous to refactor. The Morpheus library attempts to retain the power and versatility of the DataFrame concept, while providing a much more type safe and self describing set of interfaces, which should make developing, maintaining & scaling code complexity much easier.
Another advantage of the Morpheus library is that it is extremely good at scaling on multi-core processor architectures given the powerful threading capabilities of the Java Virtual Machine. Many operations on a Morpheus DataFrame can seamlessly be run in parallel by simply calling parallel() on the entity you wish to operate on, much like with Java 8 Streams. Internally, these parallel implementations are based on the Fork & Join framework, and near linear improvements in performance are observed for certain types of operations as CPU cores are added.
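The comparison with Java 8 Streams can be made concrete with a minimal JDK-only sketch. It does not use any Morpheus classes; the class and method names are made up for illustration. It shows the same pattern the paragraph describes: a single call to parallel() switches a pipeline onto the common Fork/Join pool without changing the rest of the code.

```java
import java.util.stream.IntStream;

/** Illustrative sketch using plain JDK streams; this is not Morpheus code. */
final class ParallelSketch {

    /** Sums the squares of 0..n-1 on the calling thread. */
    static long sumOfSquares(int n) {
        return IntStream.range(0, n).mapToLong(i -> (long) i * i).sum();
    }

    /** Same pipeline, but parallel() lets the common Fork/Join pool split the range across cores. */
    static long sumOfSquaresParallel(int n) {
        return IntStream.range(0, n).parallel().mapToLong(i -> (long) i * i).sum();
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // long addition is associative, so both versions give the same total
        System.out.println(sumOfSquares(n) == sumOfSquaresParallel(n)); // prints true
    }
}
```

Whether the parallel version is actually faster depends on the workload and the number of cores; for small inputs the splitting overhead can dominate.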
A Morpheus DataFrame is a column store structure where each column is represented by a Morpheus Array of which there are many implementations, including dense, sparse and memory mapped versions. Morpheus arrays are optimized and wherever possible are backed by primitive native Java arrays (even for types such as LocalDate, LocalDateTime etc.) as these are far more efficient from a storage, access and garbage collection perspective. Memory mapped Morpheus Arrays, while still experimental, allow very large DataFrames to be created using off-heap storage that are backed by files.
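To illustrate why a primitive backing array helps for a type like LocalDate, here is a hypothetical sketch of a date column stored as epoch days in a long[]. The class name and layout are assumptions made for this example and may differ from how Morpheus arrays are actually implemented.

```java
import java.time.LocalDate;

/** Hypothetical date column backed by a primitive array; not the Morpheus implementation. */
final class LocalDateColumn {

    // One long per value (days since 1970-01-01) instead of one LocalDate object per value
    private final long[] epochDays;

    LocalDateColumn(int length) {
        this.epochDays = new long[length];
    }

    /** Stores the date as its epoch-day number. */
    void set(int index, LocalDate value) {
        epochDays[index] = value.toEpochDay();
    }

    /** Rebuilds a LocalDate on demand from the stored epoch day. */
    LocalDate get(int index) {
        return LocalDate.ofEpochDay(epochDays[index]);
    }

    int length() {
        return epochDays.length;
    }

    public static void main(String[] args) {
        LocalDateColumn column = new LocalDateColumn(2);
        column.set(0, LocalDate.of(1995, 1, 1));
        column.set(1, LocalDate.of(2014, 12, 31));
        System.out.println(column.get(1)); // prints 2014-12-31
    }
}
```

This sketch ignores null handling: an unset slot reads back as 1970-01-01, so a real implementation would need a sentinel value or a separate null mask.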
While the complete feature set of the Morpheus DataFrame is still evolving, there are already many powerful APIs to effect complex transformations and analytical operations with ease. There are standard functions to compute summary statistics, perform various types of Linear Regressions, and apply Principal Component Analysis (PCA), to mention just a few. The DataFrame is indexed in both the row and column dimension, allowing data to be efficiently sorted, sliced, grouped, and aggregated along either axis.
Morpheus also aims to provide a standard mechanism to load datasets from various data providers. The hope is that this API will be embraced by the community in order to grow the catalogue of supported data sources. Currently, providers are implemented to enable data to be loaded from Quandl, The Federal Reserve, The World Bank, Yahoo Finance and Google Finance.
Consider a dataset of motor vehicle characteristics accessible here. The code below loads this CSV data into a Morpheus DataFrame, filters the rows to only include those vehicles that have a power to weight ratio > 0.1 (where weight is converted into kilograms), then adds a column to record the relative efficiency between highway and city mileage (MPG), sorts the rows by this newly added column in descending order, and finally records this transformed result to a CSV file.

    DataFrame.read().csv(options -> {
        options.setResource("http://zavtech.com/data/samples/cars93.csv");
        options.setExcludeColumnIndexes(0);
    }).rows().select(row -> {
        double weightKG = row.getDouble("Weight") * 0.453592d;
        double horsepower = row.getDouble("Horsepower");
        return horsepower / weightKG > 0.1d;
    }).cols().add("MPG(Highway/City)", Double.class, v -> {
        double cityMpg = v.row().getDouble("MPG.city");
        double highwayMpg = v.row().getDouble("MPG.highway");
        return highwayMpg / cityMpg;
    }).rows().sort(false, "MPG(Highway/City)").write().csv(options -> {
        options.setFile("/Users/witdxav/cars93m.csv");
        options.setTitle("DataFrame");
    });

This example demonstrates the functional nature of the Morpheus API, where many method return types are in fact a DataFrame and therefore allow this form of method chaining. In this example, the methods csv(), select(), add(), and sort() all return a frame. In some cases this is the same frame that the method operates on, and in other cases it is a filter or shallow copy of the frame being operated on. The first 10 rows of the transformed dataset in this example look as follows, with the newly added column appearing on the far right of the frame.

| Index | Manufacturer | Model | Type | Min.Price | Price | Max.Price | MPG.city | MPG.highway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | Rev.per.mile | Man.trans.avail | Fuel.tank.capacity | Passengers | Length | Wheelbase | Width | Turn.circle | Rear.seat.room | Luggage.room | Weight | Origin | Make | MPG(Highway/City) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Cadillac | DeVille | Large | 33.0000 | 34.7000 | 36.3000 | 16 | 25 | Driver only | Front | 8 | 4.9000 | 200 | 4100 | 1510 | No | 18.0000 | 6 | 206 | 114 | 73 | 43 | 35 | 18 | 3620 | USA | Cadillac DeVille | 1.5625 |
| 10 | Cadillac | Seville | Midsize | 37.5000 | 40.1000 | 42.7000 | 16 | 25 | Driver & Passenger | Front | 8 | 4.6000 | 295 | 6000 | 1985 | No | 20.0000 | 5 | 204 | 111 | 74 | 44 | 31 | 14 | 3935 | USA | Cadillac Seville | 1.5625 |
| 70 | Oldsmobile | Eighty-Eight | Large | 19.5000 | 20.7000 | 21.9000 | 19 | 28 | Driver only | Front | 6 | 3.8000 | 170 | 4800 | 1570 | No | 18.0000 | 6 | 201 | 111 | 74 | 42 | 31.5 | 17 | 3470 | USA | Oldsmobile Eighty-Eight | 1.47368421 |
| 74 | Pontiac | Firebird | Sporty | 14.0000 | 17.7000 | 21.4000 | 19 | 28 | Driver & Passenger | Rear | 6 | 3.4000 | 160 | 4600 | 1805 | Yes | 15.5000 | 4 | 196 | 101 | 75 | 43 | 25 | 13 | 3240 | USA | Pontiac Firebird | 1.47368421 |
| 6 | Buick | LeSabre | Large | 19.9000 | 20.8000 | 21.7000 | 19 | 28 | Driver only | Front | 6 | 3.8000 | 170 | 4800 | 1570 | No | 18.0000 | 6 | 200 | 111 | 74 | 42 | 30.5 | 17 | 3470 | USA | Buick LeSabre | 1.47368421 |
| 13 | Chevrolet | Camaro | Sporty | 13.4000 | 15.1000 | 16.8000 | 19 | 28 | Driver & Passenger | Rear | 6 | 3.4000 | 160 | 4600 | 1805 | Yes | 15.5000 | 4 | 193 | 101 | 74 | 43 | 25 | 13 | 3240 | USA | Chevrolet Camaro | 1.47368421 |
| 76 | Pontiac | Bonneville | Large | 19.4000 | 24.4000 | 29.4000 | 19 | 28 | Driver & Passenger | Front | 6 | 3.8000 | 170 | 4800 | 1565 | No | 18.0000 | 6 | 177 | 111 | 74 | 43 | 30.5 | 18 | 3495 | USA | Pontiac Bonneville | 1.47368421 |
| 56 | Mazda | RX-7 | Sporty | 32.5000 | 32.5000 | 32.5000 | 17 | 25 | Driver only | Rear | rotary | 1.3000 | 255 | 6500 | 2325 | Yes | 20.0000 | 2 | 169 | 96 | 69 | 37 | NA | NA | 2895 | non-USA | Mazda RX-7 | 1.47058824 |
| 18 | Chevrolet | Corvette | Sporty | 34.6000 | 38.0000 | 41.5000 | 17 | 25 | Driver only | Rear | 8 | 5.7000 | 300 | 5000 | 1450 | Yes | 20.0000 | 2 | 179 | 96 | 74 | 43 | NA | NA | 3380 | USA | Chevrolet Corvette | 1.47058824 |
| 51 | Lincoln | Town_Car | Large | 34.4000 | 36.1000 | 37.8000 | 18 | 26 | Driver & Passenger | Rear | 8 | 4.6000 | 210 | 4600 | 1840 | No | 20.0000 | 6 | 219 | 117 | 77 | 45 | 31.5 | 22 | 4055 | USA | Lincoln Town_Car | 1.44444444 |

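As a sanity check on the filter and the derived column, the arithmetic can be reproduced by hand for the Cadillac DeVille row in the output above. The following JDK-only sketch uses the same conversion factor as the example; the class and method names are invented for illustration and are not part of Morpheus.

```java
/** Worked check of the row filter and derived column using values from the Cadillac DeVille row. */
final class Cars93Check {

    // Pounds to kilograms, the same factor used in the row filter of the example
    static final double LB_TO_KG = 0.453592d;

    /** Horsepower per kilogram, given weight in pounds. */
    static double powerToWeight(double horsepower, double weightLb) {
        return horsepower / (weightLb * LB_TO_KG);
    }

    /** Highway MPG relative to city MPG, as in the added column. */
    static double mpgRatio(double highwayMpg, double cityMpg) {
        return highwayMpg / cityMpg;
    }

    public static void main(String[] args) {
        // Horsepower 200, Weight 3620, MPG.city 16, MPG.highway 25
        double ratio = powerToWeight(200, 3620);
        System.out.printf("power/weight = %.4f, passes filter: %b%n", ratio, ratio > 0.1d);
        System.out.println("MPG(Highway/City) = " + mpgRatio(25, 16)); // prints MPG(Highway/City) = 1.5625
    }
}
```

The power to weight ratio comes out at roughly 0.12, which is why this row survives the 0.1 threshold, and the MPG ratio of 1.5625 matches the last column of the row.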
The Morpheus API includes a regression interface in order to fit data to a linear model using either OLS, WLS or GLS. The code below uses the same car dataset introduced in the previous example, and regresses Horsepower on EngineSize. The code example prints the model results to standard out, which is shown below, and then creates a scatter chart with the regression line clearly displayed.

    //Load the data
    DataFrame<Integer,String> data = DataFrame.read().csv(options -> {
        options.setResource("http://zavtech.com/data/samples/cars93.csv");
        options.setExcludeColumnIndexes(0);
    });

    //Run OLS regression and plot
    String regressand = "Horsepower";
    String regressor = "EngineSize";
    data.regress().ols(regressand, regressor, true, model -> {
        System.out.println(model);
        DataFrame<Integer,String> xy = data.cols().select(regressand, regressor);
        Chart.create().withScatterPlot(xy, false, regressor, chart -> {
            chart.title().withText(regressand + " regressed on " + regressor);
            chart.subtitle().withText("Single Variable Linear Regression");
            chart.plot().style(regressand).withColor(Color.RED).withPointsVisible(true);
            chart.plot().trend(regressand).withColor(Color.BLACK);
            chart.plot().axes().domain().label().withText(regressor);
            chart.plot().axes().domain().format().withPattern("0.00;-0.00");
            chart.plot().axes().range(0).label().withText(regressand);
            chart.plot().axes().range(0).format().withPattern("0;-0");
            chart.show();
        });
        return Optional.empty();
    });


    ==============================================================================================
                                       Linear Regression Results
    ==============================================================================================
    Model:                   OLS          R-Squared:             0.5360
    Observations:            93           R-Squared(adjusted):   0.5309
    DF Model:                1            F-Statistic:           105.1204
    DF Residuals:            91           F-Statistic(Prob):     1.11E-16
    Standard Error:          35.8717      Runtime(millis)        52
    Durbin-Watson:           1.9591
    ==============================================================================================
       Index     |  PARAMETER  |  STD_ERROR  |  T_STAT   |   P_VALUE   |  CI_LOWER  |  CI_UPPER  |
    ----------------------------------------------------------------------------------------------
     Intercept   |    45.2195  |    10.3119  |   4.3852  |   3.107E-5  |    24.736  |   65.7029  |
     EngineSize  |    36.9633  |     3.6052  |  10.2528  |  7.573E-17  |    29.802  |   44.1245  |
    ==============================================================================================

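For a single regressor, the OLS parameter estimates have a simple closed form: the slope is the sample covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below implements that textbook formula in plain Java on made-up data; it is not the Morpheus regression code, and the class name is an assumption for this example.

```java
/** Closed-form OLS for one regressor; a sketch of the textbook formula, not the Morpheus implementation. */
final class SimpleOls {

    /** Returns {intercept, slope} minimising the squared residuals of y regressed on x. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("x and y must have equal length of at least 2");
        }
        double meanX = 0d;
        double meanY = 0d;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double sxy = 0d; // sum of cross-deviations
        double sxx = 0d; // sum of squared deviations of x
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx; // NaN if all x values are identical
        double intercept = meanY - slope * meanX;
        return new double[] { intercept, slope };
    }

    public static void main(String[] args) {
        // Made-up points lying exactly on y = 1 + 2x
        double[] x = { 1, 2, 3, 4 };
        double[] y = { 3, 5, 7, 9 };
        double[] beta = fit(x, y);
        System.out.println("intercept=" + beta[0] + ", slope=" + beta[1]); // prints intercept=1.0, slope=2.0
    }
}
```

The reported results are internally consistent with the usual definitions: each T_STAT is the parameter divided by its standard error (36.9633 / 3.6052 ≈ 10.2528), and with one regressor the F-statistic is approximately the square of that t-statistic.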
It is possible to access all UK residential real-estate transaction records from 1995 through to current day via the UK Government Open Data initiative. The data is presented in CSV format, and contains numerous columns, including such information as the transaction date, price paid, fully qualified address (including postal code), property type, lease type and so on.
Let us begin by writing a function to load these CSV files from Amazon S3 buckets, and since they are stored one file per year, we provide a parameterized function accordingly. Given the requirements of our analysis, there is no need to load all the columns in the file, so below we only choose to read columns at index 1, 2, 4, and 11. In addition, since the files do not include a header, we rename columns to something more meaningful to make subsequent access a little clearer.

    /**
     * Loads UK house price from the Land Registry stored in an Amazon S3 bucket
     * Note the data does not have a header, so columns will be named Column-0, Column-1 etc...
     * @param year      the year for which to load prices
     * @return          the resulting DataFrame, with some columns renamed
     */
    private DataFrame<Integer,String> loadHousePrices(Year year) {
        String resource = "http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-%s.csv";
        return DataFrame.read().csv(options -> {
            options.setResource(String.format(resource, year.getValue()));
            options.setHeader(false);
            options.setCharset(StandardCharsets.UTF_8);
            options.setIncludeColumnIndexes(1, 2, 4, 11);
            options.getFormats().setParser("TransactDate", Parser.ofLocalDate("yyyy-MM-dd HH:mm"));
            options.setColumnNameMapping((colName, colOrdinal) -> {
                switch (colOrdinal) {
                    case 0:     return "PricePaid";
                    case 1:     return "TransactDate";
                    case 2:     return "PropertyType";
                    case 3:     return "City";
                    default:    return colName;
                }
            });
        });
    }

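The date pattern passed to the parser includes a time component even though the target type is a LocalDate. The JDK-only sketch below shows how such a pattern behaves with java.time; it assumes the raw field looks like the sample string shown, and it makes no claim about how Parser.ofLocalDate is implemented internally. The class name is hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Sketch of parsing a date-time string down to a LocalDate with plain java.time. */
final class TransactDateParse {

    // Same pattern string that the loader above passes to Parser.ofLocalDate
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");

    /** Parses the full text, including the time fields, and keeps only the date part. */
    static LocalDate parse(String text) {
        return LocalDate.parse(text, FORMAT);
    }

    public static void main(String[] args) {
        // Assumed sample value; the real file contents may differ
        System.out.println(parse("1995-03-24 00:00")); // prints 1995-03-24
    }
}
```

The pattern has to describe the whole input string, which is why the time fields appear in it even though they are discarded in the result.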
Below we use this data in order to compute the median nominal price (not inflation adjusted) of an apartment for each year from 1995 through 2014 for a subset of the largest cities in the UK. There are about 20 million records in the unfiltered dataset between 1995 and 2014, and while it takes a fairly long time to load and parse (approximately 3.5GB of data), Morpheus executes the analytical portion of the code in about 5 seconds (not including load time) on a standard Apple Macbook Pro purchased in late 2013. Note how we use parallel processing to load and process the data by calling results.rows().keys().parallel().

    //Create a data frame to capture the median prices of Apartments in the UK's largest cities
    DataFrame<Year,String> results = DataFrame.ofDoubles(
        Range.of(1995, 2015).map(Year::of),
        Array.of("LONDON", "BIRMINGHAM", "SHEFFIELD", "LEEDS", "LIVERPOOL", "MANCHESTER")
    );

    //Process yearly data in parallel to leverage all CPU cores
    results.rows().keys().parallel().forEach(year -> {
        System.out.printf("Loading UK house prices for %s...\n", year);
        DataFrame<Integer,String> prices = loadHousePrices(year);
        prices.rows().select(row -> {
            //Filter rows to include only apartments in the relevant cities
            final String propType = row.getValue("PropertyType");
            final String city = row.getValue("City");
            final String cityUpperCase = city != null ? city.toUpperCase() : null;
            return propType != null && propType.equals("F") && results.cols().contains(cityUpperCase);
        }).rows().groupBy("City").forEach(0, (groupKey, group) -> {
            //Group row filtered frame so we can compute median prices in selected cities
            final String city = groupKey.item(0);
            final double priceStat = group.colAt("PricePaid").stats().median();
            results.data().setDouble(year, city, priceStat);
        });
    });

    //Map row keys to LocalDates, and map values to be percentage changes from start date
    final DataFrame<LocalDate,String> plotFrame = results.mapToDoubles(v -> {
        final double firstValue = v.col().getDouble(0);
        final double currentValue = v.getDouble();
        return (currentValue / firstValue - 1d) * 100d;
    }).rows().mapKeys(row -> {
        final Year year = row.key();
        return LocalDate.of(year.getValue(), 12, 31);
    });

    //Create a plot, and display it
    Chart.create().withLinePlot(plotFrame, chart -> {
        chart.title().withText("Median Nominal House Price Changes");
        chart.title().withFont(new Font("Arial", Font.BOLD, 14));
        chart.subtitle().withText("Date Range: 1995 - 2014");
        chart.plot().axes().domain().label().withText("Year");
        chart.plot().axes().range(0).label().withText("Percent Change from 1995");
        chart.plot().axes().range(0).format().withPattern("0.##'%';-0.##'%'");
        chart.plot().style("LONDON").withColor(Color.BLACK);
        chart.legend().on().bottom();
        chart.show();
    });

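The two numerical steps in the code above, taking a median per group and rebasing each value as a percentage change from the first year, can be sketched without Morpheus as follows. The class name and the sample prices are made up, and averaging the two middle values for an even count is a common convention that is assumed here; the Morpheus median() may treat that case differently.

```java
import java.util.Arrays;

/** JDK-only sketch of the median and percent-change steps; not Morpheus code. */
final class MedianPercentChange {

    /** Median of the values; averages the two middle values when the count is even (assumed convention). */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2d;
    }

    /** Same formula as the mapToDoubles step: (current / first - 1) * 100. */
    static double percentChange(double first, double current) {
        return (current / first - 1d) * 100d;
    }

    public static void main(String[] args) {
        // Made-up prices for one city in the base year
        double base = median(new double[] { 60000d, 45000d, 52000d });
        System.out.println(base);                         // prints 52000.0
        System.out.println(percentChange(base, 78000d));  // prints 50.0
    }
}
```

Because every value is divided by its column's first-year value, all cities start at 0% in 1995, which is what makes their lines directly comparable on one chart.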
The percent change in nominal median prices for apartments in the subset of chosen cities is shown in the plot below. It shows that London did not suffer any nominal house price decline as a result of the Global Financial Crisis (GFC), however not all cities in the UK proved as resilient. What is slightly surprising is that some of the less affluent northern cities saw a higher rate of appreciation in the 2003 to 2006 period compared to London. One thing to note is that while London did not see any nominal price reduction, there was certainly a fairly severe correction in terms of EUR and USD since Pound Sterling depreciated heavily against these currencies during the GFC.
Visualizing data in Morpheus DataFrames is made easy via a simple chart abstraction API with adapters supporting both JFreeChart as well as Google Charts (with others to follow by popular demand). This design makes it possible to generate interactive Java Swing charts as well as HTML5 browser based charts via the same programmatic interface. For more details on how to use this API, see the section on visualization here, and the code here.
Morpheus is published to Maven Central so it can be easily added as a dependency in your build tool of choice. The codebase is currently divided into 5 repositories to allow each module to be evolved independently. The core module, which is aptly named morpheus-core, is the foundational library on which all other modules depend. The various Maven artifacts are as follows:

Morpheus Core
The foundational library that contains Morpheus Arrays, DataFrames and other key interfaces & implementations.

    <dependency>
        <groupId>com.zavtech</groupId>
        <artifactId>morpheus-core</artifactId>
        <version>${VERSION}</version>
    </dependency>

Morpheus Visualization
The visualization components to display DataFrames in charts and tables.

    <dependency>
        <groupId>com.zavtech</groupId>
        <artifactId>morpheus-viz</artifactId>
        <version>${VERSION}</version>
    </dependency>

Morpheus Quandl
The adapter to load data from Quandl.

    <dependency>
        <groupId>com.zavtech</groupId>
        <artifactId>morpheus-quandl</artifactId>
        <version>${VERSION}</version>
    </dependency>

Morpheus Google
The adapter to load data from Google Finance.

    <dependency>
        <groupId>com.zavtech</groupId>
        <artifactId>morpheus-google</artifactId>
        <version>${VERSION}</version>
    </dependency>

Morpheus Yahoo
The adapter to load data from Yahoo Finance.

    <dependency>
        <groupId>com.zavtech</groupId>
        <artifactId>morpheus-yahoo</artifactId>
        <version>${VERSION}</version>
    </dependency>

A Questions & Answers forum has been set up using Google Groups and is accessible here.
Morpheus Javadocs can be accessed online here.
A Continuous Integration build server, which builds code after each merge, can be accessed here.
Morpheus is released under the Apache Software Foundation License Version 2.