Multi-dimensional data manipulation and easy access to Eurostat data. In Java.
eurostat/java4eurostat
Java4eurostat is a Java library for statistical data manipulation. It provides functions to load statistical data into a 'hypercube' structure and index it for fast in-memory computations. Specific functions are also provided to access Eurostat data easily.
Let's start with a simple dataset:
| country | gender | year | population |
|---|---|---|---|
| Brasil | Male | 2013 | 45.1 |
| Brasil | Female | 2013 | 48.3 |
| Brasil | Total | 2013 | 93.4 |
| Brasil | Male | 2014 | 46.2 |
| Brasil | Female | 2014 | 47.7 |
| Brasil | Total | 2014 | 93.9 |
| Japan | Male | 2013 | 145.1 |
| Japan | Female | 2013 | 148.3 |
| Japan | Total | 2013 | 293.4 |
| Japan | Male | 2014 | 146.2 |
| Japan | Female | 2014 | 147.7 |
| Japan | Total | 2014 | 293.9 |
stored as a CSV file `example.csv`:

```
country,gender,year,population
Brasil,Male,2013,45.1
Brasil,Female,2013,48.3
Brasil,Total,2013,93.4
Brasil,Male,2014,46.2
Brasil,Female,2014,47.7
Brasil,Total,2014,93.9
Japan,Male,2013,145.1
Japan,Female,2013,148.3
Japan,Total,2013,293.4
Japan,Male,2014,146.2
Japan,Female,2014,147.7
Japan,Total,2014,293.9
```

This file can be loaded into a hypercube structure with:

```java
StatsHypercube hc = CSV.load("example.csv", "population");
```
Information on the hypercube structure is shown with `hc.printInfo();`, which prints:

```
Information: 12 value(s) with 3 dimension(s).
   Dimension: gender (3 dimension values)
      Female
      Male
      Total
   Dimension: year (2 dimension values)
      2013
      2014
   Dimension: country (2 dimension values)
      Brasil
      Japan
```

Several input formats are supported. For example, Eurostat data can be loaded directly from the web. For that, only the database code given in the Eurostat databases catalog is required. For example, the database on HICP - Country weights (code `prc_hicp_cow`) can be downloaded and loaded simply with:

```java
StatsHypercube hc2 = EurobaseIO.getData("prc_hicp_cow");
```
The structure shown by `hc2.printInfo();` is:

```
Information: 2001 value(s) with 3 dimension(s).
   Dimension: time (21 dimension values)
      1996
      1997
      1998
      ...
   Dimension: geo (35 dimension values)
      AT
      BE
      BG
      ...
   Dimension: statinfo (6 dimension values)
      COWEA
      COWEA18
      COWEA19
      ...
```

Once loaded, data can be filtered/selected. For example, `hc.selectDimValueEqualTo("country","Brasil")` selects data for Brasil and `hc.selectValueGreaterThan(147)` selects data with values greater than 147. Selection criteria can be chained, as in `hc.selectDimValueEqualTo("country","Brasil").selectDimValueGreaterThan("year",2012)`, which selects Brasil data after 2012. The logical operators 'AND', 'OR' and 'NOT' can also be used to build more complex selection criteria. Fully generic selection criteria can be specified, such as:

```java
hc.select(new Criteria() {
	@Override
	public boolean keep(Stat stat) {
		return stat.dims.get("country").contains("r") && Math.sqrt(stat.value) > 7;
	}
});
```
which selects all statistics whose country name contains an "r" character and whose value has a square root greater than 7.
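To illustrate how 'AND', 'OR' and 'NOT' combinations of criteria behave, here is a minimal, self-contained sketch built on `java.util.function.Predicate`. It does not use the library: the names `SelectionSketch`, `Obs`, `dimEquals` and `valueGreaterThan` are hypothetical stand-ins, and the library's own criteria classes may be organised differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for a selection mechanism; not the java4eurostat API.
class SelectionSketch {

    // A statistical value with its position (dimension label -> dimension value).
    static class Obs {
        final Map<String, String> dims = new HashMap<>();
        final double value;

        Obs(double value, String... dimPairs) {
            this.value = value;
            for (int i = 0; i + 1 < dimPairs.length; i += 2) dims.put(dimPairs[i], dimPairs[i + 1]);
        }
    }

    // Criterion: the dimension 'dim' has the value 'expected'.
    static Predicate<Obs> dimEquals(String dim, String expected) {
        return o -> expected.equals(o.dims.get(dim));
    }

    // Criterion: the statistical value is strictly greater than 'threshold'.
    static Predicate<Obs> valueGreaterThan(double threshold) {
        return o -> o.value > threshold;
    }

    // Keep only the observations that satisfy the criterion.
    static List<Obs> select(List<Obs> data, Predicate<Obs> criterion) {
        List<Obs> out = new ArrayList<>();
        for (Obs o : data) if (criterion.test(o)) out.add(o);
        return out;
    }

    // A few rows taken from the example dataset.
    static List<Obs> sample() {
        List<Obs> data = new ArrayList<>();
        data.add(new Obs(93.4, "country", "Brasil", "gender", "Total", "year", "2013"));
        data.add(new Obs(93.9, "country", "Brasil", "gender", "Total", "year", "2014"));
        data.add(new Obs(293.4, "country", "Japan", "gender", "Total", "year", "2013"));
        data.add(new Obs(148.3, "country", "Japan", "gender", "Female", "year", "2013"));
        return data;
    }

    public static void main(String[] args) {
        List<Obs> data = sample();
        // AND: Japan figures above 147
        Predicate<Obs> japanAbove147 = dimEquals("country", "Japan").and(valueGreaterThan(147));
        // OR combined with NOT: Brasil figures, or any figure that is not a "Total"
        Predicate<Obs> brasilOrNotTotal = dimEquals("country", "Brasil").or(dimEquals("gender", "Total").negate());
        System.out.println(select(data, japanAbove147).size());    // prints 2
        System.out.println(select(data, brasilOrNotTotal).size()); // prints 3
    }
}
```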
A single value can be retrieved with, for example, `hc.selectDimValueEqualTo("country", "Japan", "gender", "Total", "year", "2014").stats.iterator().next().value`, but the fastest way to retrieve a value and scan a dataset is to use an index:

```java
StatsIndex index = new StatsIndex(hc, "gender", "year", "country");
```
This index is a tree structure based on the dimension values. This structure can be displayed with `index.print();`:

```
Total
  2014
    Brasil -> 93.9
    Japan -> 293.9
  2013
    Brasil -> 93.4
    Japan -> 293.4
Male
  2014
    Brasil -> 46.2
    Japan -> 146.2
  2013
    Brasil -> 45.1
    Japan -> 145.1
Female
  2014
    Brasil -> 47.7
    Japan -> 147.7
  2013
    Brasil -> 48.3
    Japan -> 148.3
```

A statistical value is accessed quickly from the index and its dimension values: `double value = index.getSingleValue("Total", "2014", "Japan");`. Scanning a full dataset across its dimensions is very fast with:

```java
for (String gender : index.getKeys())
	for (String year : index.getKeys(gender))
		for (String country : index.getKeys(gender, year)) {
			System.out.println(gender + " " + year + " " + country);
			System.out.println(index.getSingleValue(gender, year, country));
		}
```
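Why a tree index makes lookups fast can be seen in a small sketch based on nested maps: each dimension value is resolved by one map lookup, instead of filtering the whole collection of values. This is only an illustration under that assumption. `IndexSketch` and its methods are hypothetical, and the actual `StatsIndex` implementation may differ.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical nested-map index over (gender, year, country); not the library's StatsIndex.
class IndexSketch {
    // gender -> year -> country -> value
    private final Map<String, Map<String, Map<String, Double>>> tree = new LinkedHashMap<>();

    void put(String gender, String year, String country, double value) {
        tree.computeIfAbsent(gender, k -> new LinkedHashMap<>())
            .computeIfAbsent(year, k -> new LinkedHashMap<>())
            .put(country, value);
    }

    // One map lookup per dimension; returns NaN when nothing is stored at that position.
    double get(String gender, String year, String country) {
        Map<String, Map<String, Double>> byYear = tree.get(gender);
        if (byYear == null) return Double.NaN;
        Map<String, Double> byCountry = byYear.get(year);
        if (byCountry == null) return Double.NaN;
        Double v = byCountry.get(country);
        return v == null ? Double.NaN : v;
    }

    // Keys of the first level (gender values).
    Set<String> keys() {
        return tree.keySet();
    }

    // Keys of the second level (year values) below a gender.
    Set<String> keys(String gender) {
        Map<String, Map<String, Double>> byYear = tree.get(gender);
        return byYear == null ? Collections.<String>emptySet() : byYear.keySet();
    }

    // A few rows taken from the example dataset.
    static IndexSketch sample() {
        IndexSketch idx = new IndexSketch();
        idx.put("Total", "2014", "Brasil", 93.9);
        idx.put("Total", "2014", "Japan", 293.9);
        idx.put("Male", "2014", "Brasil", 46.2);
        idx.put("Male", "2014", "Japan", 146.2);
        return idx;
    }

    public static void main(String[] args) {
        IndexSketch idx = sample();
        System.out.println(idx.get("Total", "2014", "Japan")); // prints 293.9
        for (String gender : idx.keys())
            for (String year : idx.keys(gender))
                System.out.println(gender + " " + year);
    }
}
```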
Java4eurostat uses Apache Maven. To use java4eurostat, add it as a dependency to the `pom.xml` file:

```xml
<dependency>
	<groupId>eu.europa.ec.eurostat</groupId>
	<artifactId>java4eurostat</artifactId>
	<version>X.Y.Z</version>
</dependency>
```

Where `X.Y.Z` is the latest version number, as available on the Maven Central repository.
For more information on how to set up a coding environment based on Eclipse, see this page.
See the Javadoc API.
Statistical data such as:

```
country,gender,year,population
Brasil,Male,2013,45.1
Brasil,Female,2013,48.3
Japan,Total,2013,93.4
...
```

can simply be loaded and saved with:

```java
//load
StatsHypercube hc = CSV.load("C:/datafolder/myFile.csv", "population");
//save
CSV.save(hc, "population", "C:/datafolder/myFile.csv");
```
For tabular data with several value columns such as:

```
country,gender,year,2010,2015,2020
Brasil,Male,2013,45.1,45.1,45.1
Brasil,Total,2013,93.4,45.1,45.1
Japan,Male,2014,46.2,45.1,45.1
...
```

just use:

```java
//load
StatsHypercube hc = CSV.loadMultiValues("C:/datafolder/myFile.csv", "year", "2010", "2015", "2020");
//save
CSV.saveMultiValues(hc, "C:/datafolder/myFile.csv", "year");
```
The class `EurobaseIO` provides several functions to handle Eurostat data. For example, `StatsHypercube hc = EurobaseIO.getData("prc_hicp_cow");` loads the database `prc_hicp_cow`. Selection parameters may also be specified: `getData("prc_hicp_cow", "geo", "EU", "geo", "EA", "time", "2016")` loads the `prc_hicp_cow` figures for 2016, for both `EU` and `EA`. Additionally, `getData("prc_hicp_cow", "lastTimePeriod", "4")` returns the figures for the last 4 time periods, while `getData("prc_hicp_cow", "sinceTimePeriod", "2005")` returns all figures since 2005.
Eurostat TSV files can be downloaded manually from the bulk download facility or using:

```java
//download from Eurostat bulk download facility
EurobaseIO.getDataBulkDownload("eurobase_code", "/home/datafolder/");
//load
StatsHypercube hc = EurostatTSV.load("/home/datafolder/eurobase_code.tsv");
//save
// not implemented (yet)
```
The last publication date of a database can be retrieved with `getUpdateDate`. For example, `EurobaseIO.getUpdateDate("prc_hicp_cow");` returns the last publication date of the database with code `prc_hicp_cow`.
When some Eurostat databases are used regularly as TSV files, these files can be downloaded and updated only when new data is published. For example:

```java
EurobaseIO.update("C:/my_data_folder/", "my_database_code1", "my_database_code2", "my_database_code3", ...);
```

retrieves new files `my_database_code1.tsv`, `my_database_code2.tsv` and `my_database_code3.tsv` only when they have been updated. This function creates a file `update.txt` in the `C:/my_data_folder/` folder, which gives the last update dates of the files.
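The "download only when updated" logic can be pictured as a comparison between locally recorded update dates and the latest publication dates. The sketch below is an assumption about how such a check could work, not a description of what `EurobaseIO.update` does internally. The class `UpdateSketch`, the method `codesToRefresh` and the dates in `main` are all made up for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical update check; the library's own bookkeeping (update.txt) may differ.
class UpdateSketch {

    // Returns the codes that were never downloaded, or whose published date
    // is more recent than the locally recorded one.
    static List<String> codesToRefresh(Map<String, LocalDate> local, Map<String, LocalDate> published) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, LocalDate> e : published.entrySet()) {
            LocalDate known = local.get(e.getKey());
            if (known == null || e.getValue().isAfter(known)) out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative dates only.
        Map<String, LocalDate> local = new LinkedHashMap<>();
        local.put("my_database_code1", LocalDate.of(2020, 1, 10));
        local.put("my_database_code2", LocalDate.of(2020, 1, 10));

        Map<String, LocalDate> published = new LinkedHashMap<>();
        published.put("my_database_code1", LocalDate.of(2020, 3, 1));  // newer: refresh
        published.put("my_database_code2", LocalDate.of(2020, 1, 10)); // unchanged: skip
        published.put("my_database_code3", LocalDate.of(2019, 6, 5));  // never downloaded: refresh

        System.out.println(codesToRefresh(local, published)); // prints [my_database_code1, my_database_code3]
    }
}
```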
Code list dictionaries are loaded with, for example, `EurobaseIO.getDictionnary("geo")`, which retrieves the dictionary of geographical locations (code `geo`). `EurobaseIO.getDictionnary("geo").get("IT")` returns "Italy". Last update dates are retrieved with, for example, `getDictionnaryUpdateDate("geo")`.
For JSON-stat data, simply use:

```java
//load
String jsonStatString = "{\"version\":\"2.0\", \"class\":\"dataset\", \"label\":\"Population data\", \"source\":\"\", \"id\":[...], \"size\":[...], \"dimension\":{...}, \"value\":[...]}";
StatsHypercube hc = JSONStat.load(jsonStatString);
//save
// not implemented (yet)
```
To ensure efficient memory usage, a selection criterion can be specified when loading from a data source. For example, `StatsHypercube hc = EurobaseIO.getData("prc_hicp_cow", new DimValueEqualTo("geo","BG"))` loads only data for country `BG`.
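The memory benefit of filtering at load time comes from never storing the rejected rows. The following sketch shows the idea on plain comma-separated text, under the assumption that the input has no quoted fields. `FilteredLoadSketch` and `load` are hypothetical names, unrelated to the library's loaders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical loader that applies a criterion row by row while parsing.
class FilteredLoadSketch {

    // Parses simple comma-separated text (no quoting) and keeps only the rows accepted by 'keep'.
    static List<Map<String, String>> load(String csv, Predicate<Map<String, String>> keep) {
        String[] lines = csv.split("\n");
        String[] header = lines[0].split(",");
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].trim().isEmpty()) continue;
            String[] cells = lines[i].split(",");
            Map<String, String> row = new LinkedHashMap<>();
            for (int j = 0; j < header.length && j < cells.length; j++)
                row.put(header[j].trim(), cells[j].trim());
            if (keep.test(row)) rows.add(row); // rejected rows are never stored
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "country,gender,year,population\n"
                + "Brasil,Male,2013,45.1\n"
                + "Japan,Male,2013,145.1\n"
                + "Japan,Female,2013,148.3\n";
        List<Map<String, String>> japan = load(csv, row -> "Japan".equals(row.get("country")));
        System.out.println(japan.size()); // prints 2
    }
}
```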
The base classes are `Stat` and `StatsHypercube`. A `Stat` object represents a statistical value, which is stored as an element of the `StatsHypercube` structure.
A `Stat` object is characterised by its value and its position in the hypercube. The position is represented as a dictionary of `(dimension label, dimension value)` pairs, which gives its coordinates within the hypercube. Flags can also be attached to a statistical value. The class `StatsHypercube` is simply characterised by its collection of `Stat` elements and its dimension names.
[TODO: describe HierarchicalCode]
Data of a hypercube are accessed using either the `StatsHypercube.select()` method or a `StatsIndex` object. Access with a `StatsIndex` is faster, but requires the construction of an index object, which can be resource-consuming.
Basic operations based on selection and indexing are presented in the quick start section above.
[TODO: extend description.]
The class `Selection` provides various ways to navigate the hypercube structure by selecting specific values based on various criteria.
Operations can be quickly applied to the statistical values of a hypercube, such as:

```java
//divide all values by 100.
hc.div(100);
//add 0.185 to all values.
hc.add(0.185);
```
It is also possible to combine the values of two hypercubes, for example:

```java
//get population data for 2020 and 2010
StatsHypercube hcPop2020 = ...;
StatsHypercube hcPop2010 = ...;
//compute population change
StatsHypercube hcPopChange = hcPop2020.diff(hcPop2010);
```
These operations can easily be combined:

```java
//get population data for 2020 and 2010
StatsHypercube hcPop2020 = ...;
StatsHypercube hcPop2010 = ...;
//compute population rate of change, in percentage
StatsHypercube hcPopRateOfChange = hcPop2020.diff(hcPop2010).div(hcPop2010).mult(100);
```
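The arithmetic behind such a combination, `(new - old) / old * 100` applied position by position, can be checked with a small stand-alone sketch. It uses the "Total" figures of the example dataset for 2013 and 2014 rather than 2010 and 2020 data, encodes positions as plain strings for brevity, and skips positions with no counterpart or a zero base, which is an assumption about handling such cases. `RateOfChangeSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical position-wise rate of change; not the library's diff/div/mult implementation.
class RateOfChangeSketch {

    // For every position present in both inputs: (newer - older) / older * 100.
    static Map<String, Double> rateOfChange(Map<String, Double> newer, Map<String, Double> older) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : newer.entrySet()) {
            Double base = older.get(e.getKey());
            if (base == null || base == 0) continue; // no counterpart, or undefined ratio
            out.put(e.getKey(), (e.getValue() - base) / base * 100);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> pop2013 = new LinkedHashMap<>();
        pop2013.put("Brasil", 93.4);
        pop2013.put("Japan", 293.4);
        Map<String, Double> pop2014 = new LinkedHashMap<>();
        pop2014.put("Brasil", 93.9);
        pop2014.put("Japan", 293.9);
        // Brasil: 0.5 / 93.4 * 100, roughly 0.535 %; Japan: 0.5 / 293.4 * 100, roughly 0.170 %
        System.out.println(rateOfChange(pop2014, pop2013));
    }
}
```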
New statistical values can also be computed from existing hypercube values. For example, to compute the total value along a dimension `age_group`:

```java
//get population data by age group
StatsHypercube hcPopByAge = ...;
Collection<Stat> totals = Operations.computeSumDim(hcPopByAge, "age_group", "TOTAL");
hcPopByAge.stats.addAll(totals);
```
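Summing along a dimension amounts to grouping the values by their position in all the other dimensions and adding them up. The sketch below shows this on the "Male" and "Female" rows for Brasil from the example dataset, summing along `gender`. `SumAlongDimSketch`, `Obs`, `position` and `sumAlong` are hypothetical names, and the approach is an assumption about the general technique, not the code behind `Operations.computeSumDim`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical aggregation along one dimension.
class SumAlongDimSketch {

    static class Obs {
        final Map<String, String> dims;
        final double value;

        Obs(double value, String... dimPairs) {
            this.value = value;
            this.dims = position(dimPairs);
        }
    }

    // Builds a position from (label, value) pairs; a TreeMap gives content-based equality.
    static Map<String, String> position(String... dimPairs) {
        Map<String, String> p = new TreeMap<>();
        for (int i = 0; i + 1 < dimPairs.length; i += 2) p.put(dimPairs[i], dimPairs[i + 1]);
        return p;
    }

    // Groups values by their position without 'dim', and sums each group.
    static Map<Map<String, String>, Double> sumAlong(List<Obs> data, String dim) {
        Map<Map<String, String>, Double> totals = new LinkedHashMap<>();
        for (Obs o : data) {
            Map<String, String> rest = new TreeMap<>(o.dims);
            rest.remove(dim);
            totals.merge(rest, o.value, Double::sum);
        }
        return totals;
    }

    // Male and Female rows for Brasil, from the example dataset.
    static List<Obs> sample() {
        List<Obs> data = new ArrayList<>();
        data.add(new Obs(45.1, "country", "Brasil", "gender", "Male", "year", "2013"));
        data.add(new Obs(48.3, "country", "Brasil", "gender", "Female", "year", "2013"));
        data.add(new Obs(46.2, "country", "Brasil", "gender", "Male", "year", "2014"));
        data.add(new Obs(47.7, "country", "Brasil", "gender", "Female", "year", "2014"));
        return data;
    }

    public static void main(String[] args) {
        // Two totals, close to 93.4 (2013) and 93.9 (2014), up to floating-point rounding.
        System.out.println(sumAlong(sample(), "gender"));
    }
}
```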
More operations are available from the `Operations` class. Custom unary, binary or aggregation operators can be implemented.
The class `Compacity` provides various methods to analyse how full/empty the hypercube structure is. This compacity computation can be restricted to single dimensions, which gives a good overview of the completeness of the input data and of which dimension is worth focusing on. See for example the `Compacity.getDimensionValuesByCompacity` method.
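One plausible way to quantify how full a hypercube is consists of dividing the number of filled positions by the number of possible positions, that is, the product of the numbers of distinct values per dimension. The sketch below assumes this definition, which may not match the one used by the `Compacity` class. `CompacitySketch`, `pos` and `compacity` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical fill ratio: filled positions / possible positions.
class CompacitySketch {

    // Builds a position from (label, value) pairs.
    static Map<String, String> pos(String... dimPairs) {
        Map<String, String> p = new TreeMap<>();
        for (int i = 0; i + 1 < dimPairs.length; i += 2) p.put(dimPairs[i], dimPairs[i + 1]);
        return p;
    }

    static double compacity(List<Map<String, String>> positions) {
        if (positions.isEmpty()) return 0;
        // Distinct values observed for each dimension.
        Map<String, Set<String>> dimValues = new LinkedHashMap<>();
        for (Map<String, String> p : positions)
            for (Map.Entry<String, String> e : p.entrySet())
                dimValues.computeIfAbsent(e.getKey(), k -> new HashSet<>()).add(e.getValue());
        // Number of cells of the full hypercube.
        long cells = 1;
        for (Set<String> values : dimValues.values()) cells *= values.size();
        // Duplicate positions are counted once.
        int filled = new HashSet<>(positions).size();
        return (double) filled / cells;
    }

    public static void main(String[] args) {
        List<Map<String, String>> positions = new ArrayList<>();
        positions.add(pos("country", "Brasil", "year", "2013"));
        positions.add(pos("country", "Brasil", "year", "2014"));
        positions.add(pos("country", "Japan", "year", "2013"));
        // 3 filled positions out of 2 x 2 possible ones
        System.out.println(compacity(positions)); // prints 0.75
    }
}
```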
The class `Validation` provides various methods to check the compliance of the dimension codes with some specified values (`Validation.Compacity.checkDimensionValuesValidity` method). The `Validation.Compacity.checkUnicity` method also checks the uniqueness of statistical values per position in the hypercube.
The class `TimeSeriesUtil` provides several functions for time series analysis, such as the computation of moving averages, gap analysis and outlier detection.
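As an illustration of one of these analyses, here is a minimal sketch of a simple moving average over a fixed window, computed with a running sum. It is a generic textbook formulation with a hypothetical class name, `MovingAverageSketch`, and makes no claim about the options or conventions of `TimeSeriesUtil`.

```java
import java.util.Arrays;

// Hypothetical simple moving average; not the TimeSeriesUtil implementation.
class MovingAverageSketch {

    // Returns the averages of every run of 'window' consecutive values.
    static double[] movingAverage(double[] series, int window) {
        if (window < 1 || window > series.length)
            throw new IllegalArgumentException("window must be between 1 and the series length");
        double[] out = new double[series.length - window + 1];
        double sum = 0;
        for (int i = 0; i < series.length; i++) {
            sum += series[i];                            // value entering the window
            if (i >= window) sum -= series[i - window];  // value leaving the window
            if (i >= window - 1) out[i - window + 1] = sum / window;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] smoothed = movingAverage(new double[] {1, 2, 3, 4, 5}, 3);
        System.out.println(Arrays.toString(smoothed)); // prints [2.0, 3.0, 4.0]
    }
}
```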
[TODO: extend description.]
Feel free to ask for support, fork the project or simply star it (it's always a pleasure).