Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
Many national, regional and local governments, as well as other organizations inside and outside of the public sector, collect numeric data and aggregate this data into statistics. There is a need to publish these statistics in a standardized, machine-readable way on the Web, so that they can be freely integrated and reused in consuming applications.
In this document, the W3C Government Linked Data Working Group presents use cases and lessons supporting a recommendation of the RDF Data Cube Vocabulary [QB-2013]. We describe case studies of existing deployments of an earlier version of the Data Cube Vocabulary [QB-2010] as well as other possible use cases that would benefit from using the vocabulary. In particular, we identify benefits and challenges in using a vocabulary for representing statistics. Also, we derive lessons that can be used for future work on the vocabulary as well as for useful tools complementing the vocabulary.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document was published by the Government Linked Data Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-gld-comments@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
The rest of this document is structured as follows. We will first give a short introduction to modeling statistics. Then, we will describe use cases that have been derived from existing deployments or from feedback on the earlier version of the Data Cube Vocabulary. In particular, we describe possible benefits and challenges of the use cases. Afterwards, we will describe lessons derived from the use cases.
We use the term "Data Cube Vocabulary" throughout the document when referring to the vocabulary.
In the following, we describe the challenge of authoring an RDF vocabulary for publishing statistics as Linked Data. Describing statistics — collected and aggregated numeric data — is challenging for the following reasons:
The Statistical Data and Metadata eXchange [SDMX] — the ISO standard for exchanging and sharing statistical data and metadata among organizations — uses a "multidimensional model" to meet the above challenges in modeling statistics. It can describe statistics as observations. Observations exhibit values (Measures) that depend on dimensions (Members of Dimensions). Since the SDMX standard has proven applicable in many contexts, the Data Cube Vocabulary adopts the multidimensional model that underlies SDMX and will be compatible with SDMX.
Statistics is the study of the collection, organization, analysis, and interpretation of data. Statistics comprise statistical data.
The basic structure of statistical data is a multidimensional table (also called a data cube) [SDMX], i.e., a set of observed values organized along a group of dimensions, together with associated metadata. We refer to aggregated statistical data as "macro-data" and unaggregated statistical data as "micro-data".
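To give a flavor of this structure in RDF, here is a minimal sketch of a single observed value organized along two dimensions (all identifiers are illustrative):

@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/ns#> .

ex:obs a qb:Observation ;
    qb:dataSet ex:dataset ;    # the dataset the value belongs to
    ex:refArea ex:uk ;         # dimension 1: area
    ex:refPeriod "2011" ;      # dimension 2: period
    ex:population 60 .         # the observed value (measure)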
Statistical data can be collected in a dataset, typically published and maintained by an organization [SDMX]. The dataset contains metadata, e.g., about the time of collection and publication or about the maintaining and publishing organization.
Source data is data from data stores such as relational databases or spreadsheets that acts as a source for the Linked Data publishing process.
Metadata about statistics defines the data structure and gives contextual information about the statistics.
A format is machine-readable if it is amenable to automated processing by a machine, as opposed to presentation to a human user.
A publisher is a person or organization that exposes source data as Linked Data on the Web.
A consumer is a person or agent that uses Linked Data from the Web.
A registry allows a publisher to announce that data or metadata exists and to add information about how to obtain that data [SDMX 2.1].
This section presents scenarios that are enabled by the existence of a standard vocabulary for the representation of statistics as Linked Data.
(Use case taken from the SDMX Web Dissemination Use Case [SDMX 2.1])
Since we have adopted the multidimensional model that underlies SDMX, we also adopt the "Web Dissemination Use Case", which is the prime use case for SDMX, since it is an increasingly popular use of SDMX and enables organizations to build a self-updating dissemination system.
The Web Dissemination Use Case contains three actors: a structural metadata Web service (registry) that collects metadata about statistical data in a registration fashion; a data Web service (publisher) that publishes statistical data and its metadata as registered in the structural metadata Web service; and a data consumption application (consumer) that first discovers data from the registry, then queries data from the corresponding publisher of selected data, and then visualizes the data.
(This use case has been summarized from Ian Dickinson et al. [COINS])
More and more organizations want to publish statistics on the Web, for reasons such as increasing transparency and trust. Although, in the ideal case, published data can be understood by both humans and machines, data often is simply published as CSV, PDF, XLS etc., lacking elaborate metadata, which makes free usage and analysis difficult.
Therefore, the goal in this scenario is to use a machine-readable and application-independent description of common statistics, expressed using open standards, to foster usage of and innovation on the published data. In the "COINS as Linked Data" project [COINS], the Combined Online Information System (COINS) shall be published using a standard Linked Data vocabulary. Via COINS, HM Treasury, the principal custodian of financial data for the UK government, releases previously restricted financial information about government spending.
The COINS data has a hypercube structure. It describes financial transactions using seven independent dimensions (time, data-type, department etc.) and one dependent measure (value). Also, it allows thirty-three attributes that may further describe each transaction. COINS is an example of one of the more complex statistical datasets being published via data.gov.uk.
Part of the complexity of COINS arises from the nature of the data being released:
The published COINS datasets cover expenditure related to five different years (2005–06 to 2009–10). The actual COINS database at HM Treasury is updated daily. In principle at least, multiple snapshots of the COINS data could be released throughout the year.
The actual data and its hypercube structure are to be represented separately so that an application can first examine the structure before deciding to download the actual data, i.e., the transactions. The hypercube structure also defines, for each dimension and attribute, a range of permitted values that are to be represented.
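A minimal Turtle sketch of how such a separation might look (the ex: identifiers are illustrative, not the actual COINS vocabulary; only three of the seven dimensions and one attribute are shown):

@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/coins#> .

# The structure is a separate resource that an application can examine
# before deciding to download the transactions themselves.
ex:coins-structure a qb:DataStructureDefinition ;
    qb:component [ qb:dimension ex:refPeriod ] ,   # time
                 [ qb:dimension ex:dataType ] ,
                 [ qb:dimension ex:department ] ;
    qb:component [ qb:measure ex:value ] ;         # the one dependent measure
    qb:component [ qb:attribute ex:currency ] .    # one of the 33 optional attributes

ex:coins a qb:DataSet ;
    qb:structure ex:coins-structure .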
An access or query interface to the COINS data, e.g., via a SPARQL endpoint or the Linked Data API, is planned. Queries that are expected to be interesting are: "spending for one department", "total spending by department", "retrieving all data for a given observation" etc.
According to the COINS as Linked Data project, the reasons for publishing COINS as Linked Data are threefold:
The COINS use case leads to the following challenges:
(This use case has been contributed by Rinke Hoekstra. See CEDA_R and Data2Semantics for more information.)
Not only in government is there a need to publish considerable amounts of statistical data to be consumed in various (also unexpected) application scenarios. Typically, Microsoft Excel sheets are made available for download.
For instance, in the CEDA_R and Data2Semantics projects, publishing and harmonizing Dutch historical census data (from 1795 onwards) is a goal. These censuses are currently only available as Excel spreadsheets (obtained by data entry) that closely mimic the way in which the data was originally published; they shall be published as Linked Data.
Those Excel sheets contain single spreadsheets with several multidimensional data tables, having a name and notes, as well as column values, row values, and cell values.
Another concrete example is the Stats2RDF project that intends to publish Excel sheets with biomedical statistical data. Here, Excel files are first translated into CSV and then translated into RDF using OntoWiki, a semantic wiki.
(Use case has been taken from [QB4OLAP] and from discussions on the publishing-statistical-data mailing list)
It often comes up in statistical data that you have some kind of 'overall' figure, which is then broken down into parts.
Example (in pseudo-turtle RDF):
ex:obs1 sdmx:refArea <uk>; sdmx:refPeriod "2011"; ex:population "60" .
ex:obs2 sdmx:refArea <england>; sdmx:refPeriod "2011"; ex:population "50" .
ex:obs3 sdmx:refArea <scotland>; sdmx:refPeriod "2011"; ex:population "5" .
ex:obs4 sdmx:refArea <wales>; sdmx:refPeriod "2011"; ex:population "3" .
ex:obs5 sdmx:refArea <northernireland>; sdmx:refPeriod "2011"; ex:population "2" .
We are looking for the best way (in the context of the RDF/Data Cube/SDMX approach) to express that the values for England, Scotland, Wales and Northern Ireland ought to add up to the value for the UK and constitute a more detailed breakdown of the overall UK figure. Since we might also have population figures for France, Germany, EU28 etc., it is not as simple as just taking a qb:Slice where you fix the time period and the measure.
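One direction that has been discussed (see the lesson on hierarchical code lists below) is to express the part-whole relation in the code list itself rather than in the observations; a minimal sketch, reusing the pseudo-turtle area codes above:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# The breakdown is attached to the area codes, not to the observations,
# so figures for France, Germany, EU28 etc. remain unaffected.
<uk> a skos:Concept ;
    skos:narrower <england>, <scotland>, <wales>, <northernireland> .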
Similarly, Etcheverry and Vaisman [QB4OLAP] present the use case of publishing household data from StatsWales and Open Data Communities.
This multidimensional data contains for each fact a time dimension with one level, Year, and a location dimension with levels Unitary Authority, Government Office Region, Country, and ALL. The measure is expressed in units of 1000 households.
In this use case, one wants to publish not only a dataset on the bottom-most level, i.e., the number of households at each Unitary Authority in each year, but also datasets on more aggregated levels. For instance, in order to publish a dataset with the number of households at each Government Office Region per year, one needs to aggregate the measure of each fact having the same Government Office Region using the SUM function.
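A minimal sketch (identifiers and figures illustrative) of the same measure published at two levels as two separate datasets:

@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/households#> .

# Bottom-most level: one observation per Unitary Authority and year.
ex:obsA a qb:Observation ;
    qb:dataSet ex:householdsByAuthority ;
    ex:refArea ex:authority1 ;    # Unitary Authority level
    ex:refPeriod "2011" ;
    ex:households 142 .           # unit: 1000 households

# Aggregated level: the SUM over all authorities in region1.
ex:obsB a qb:Observation ;
    qb:dataSet ex:householdsByRegion ;
    ex:refArea ex:region1 ;       # Government Office Region level
    ex:refPeriod "2011" ;
    ex:households 455 .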
Similarly, for many uses the population broken down by some category (e.g., ethnicity) is expressed as a percentage. Separate datasets give the actual counts per category and aggregate counts. In such cases it is common to talk about the denominator (often DENOM), which is the aggregate count against which the percentages can be interpreted.
(Use case has been provided by Epimorphics Ltd, in their UK Bathing Water Quality deployment)
As part of their work with data.gov.uk and the UK Location Programme, Epimorphics Ltd have been working to pilot the publication of both current and historic bathing water quality information from the UK Environment Agency as Linked Data.
The UK has a number of areas, typically beaches, that are designated as bathing waters where people routinely enter the water. The Environment Agency monitors and reports on the quality of the water at these bathing waters.
The Environment Agency's data can be thought of as structured in three groups:
The most important dimensions of the data are bathing water, sampling point, and compliance classification.
The Met Office, the UK's National Weather Service, provides a range of weather forecast products, including openly available site-specific forecasts for the UK. The site-specific forecasts cover over 5000 forecast points; each forecast predicts 10 parameters and spans a 5-day window at 3-hourly intervals; the whole forecast is updated each hour. A proof-of-concept project investigated the challenge of publishing this information as Linked Data using the Data Cube Vocabulary.
This weather forecasts case study leads to the following challenges:
The World Meteorological Organization (WMO) develops and recommends data interchange standards, and within that community compatibility with ISO 19156 "Geographic information — Observations and measurements" (O&M) is regarded as important. Thus, this supports the lesson "Modelers using ISO 19156 Observations & Measurements may need clarification regarding the relationship to the Data Cube Vocabulary".
Solution in this case study: O&M provides a data model for an Observation with associated Phenomenon, measurement ProcessUsed, Domain (feature of interest) and Result. Prototype vocabularies developed at CSIRO and extended within this project allow this data model to be represented in RDF. For the site-specific forecasts, a 5-day forecast for all 5000+ sites is regarded as a single O&M Observation.
To represent the forecast data itself, the Result in the O&M model, the relevant standard is ISO 19123 "Geographic information — Schema for coverage geometry and functions". This provides a data model for a Coverage, which can represent a set of values across some space. It defines different types of Coverage, including a DiscretePointCoverage suited to representing site-specific forecast results.
It turns out that it is straightforward to treat an RDF Data Cube as a particular concrete representation of the DiscretePointCoverage logical model. The cube has dimensions corresponding to the forecast time and location, and the measure is a record representing the forecast values of the 10 phenomena. Slices by time and location provide subsets of the data that directly match the data packages supported by an existing on-line service.
Note that in this situation an observation in the sense of qb:Observation and an observation in the sense of ISO 19156 Observations and Measurements are different things. The O&M Observation is the whole forecast, whereas each qb:Observation corresponds to a single GeometryValuePair within the forecast results Coverage.
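A minimal sketch of one such qb:Observation (property names and values are illustrative; only two of the 10 forecast parameters are shown):

@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/forecast#> .

# One GeometryValuePair: the forecast for one site at one time step.
ex:fc1 a qb:Observation ;
    qb:dataSet ex:forecastRun ;
    ex:site ex:site3002 ;                                   # location dimension
    ex:forecastTime "2013-04-01T09:00:00Z"^^xsd:dateTime ;  # time dimension
    ex:airTemperature 8.5 ;                                 # measures: two of the
    ex:windSpeed 13 .                                       # 10 phenomena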
Each hourly update comprises over 2 million data points, and forecast data is requested by a large number of data consumers. Bandwidth costs are thus a key consideration, and the apparent verbosity of RDF in general, and Data Cube specifically, was a concern. This supports the lesson "Publishers and consumers may need more guidance in efficiently processing data using the Data Cube Vocabulary".
Solution in this case study: Regarding bandwidth costs, the key is not raw data volume but compressibility, since such data is transmitted in compressed form. A Turtle representation of a non-abbreviated data cube compresses to within 15-20% of the size of compressed, hand-crafted XML and JSON representations, thus obviating the need for abbreviations or custom serialization.
(This use case has been taken from the Eurostat Linked Data Wrapper and Linked Statistics Eurostat Data, both deployments for publishing Eurostat SDMX as Linked Data using the draft version of the Data Cube Vocabulary)
As mentioned already, the ISO standard for exchanging and sharing statistical data and metadata among organizations is the Statistical Data and Metadata eXchange [SDMX]. Since this standard has proven applicable in many contexts, we adopt the multidimensional model that underlies SDMX and intend the standard vocabulary to be compatible with SDMX. Therefore, in this use case we explain the benefits and challenges of publishing SDMX data as Linked Data.
As one of the main adopters of SDMX, Eurostat publishes large amounts of European statistics coming from a data warehouse as SDMX and other formats on the Web. Eurostat also provides an interface to browse and explore the datasets. However, linking such multidimensional data to related data sets and concepts would require downloading of interesting datasets and manual integration. The goal here is to improve integration with other datasets; Eurostat data should be published on the Web in a machine-readable format, possibly to be linked with other datasets, and possibly to be freely consumed by applications.

Both the Eurostat Linked Data Wrapper and Linked Statistics Eurostat Data intend to publish Eurostat SDMX data as 5-Star Linked Open Data. Eurostat data is partly published as SDMX, partly as tabular data (TSV, similar to CSV). Eurostat provides a TOC of published datasets as well as a feed of modified and new datasets. Eurostat also provides a list of used code lists, i.e., the range of permitted dimension values. Any Eurostat dataset contains a varying set of dimensions (e.g., date, geo, obs_status, sex, unit) as well as measures (generic value, content is specified by dataset, e.g., GDP per capita in PPS, Total population, Employment rate by sex).
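As a minimal sketch, a single observation from such a dataset might look as follows in Turtle (identifiers, codes and the figure are illustrative; the actual wrappers use their own URI schemes):

@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/eurostat#> .

# One value from a dataset with dimensions date, geo, sex and unit.
ex:obs1 a qb:Observation ;
    qb:dataSet ex:totalPopulation ;
    ex:date "2012" ;
    ex:geo ex:DE ;
    ex:sex ex:T ;        # T = total, taken from the sex code list
    ex:unit ex:NR ;      # NR = number
    ex:value 81800000 .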
(This use case has mainly been taken from [COGS])
In several applications, relationships between statistical data need to be represented.
The goal of this use case is to describe provenance, transformations, and versioning around statistical data, so that the history of statistics published on the Web becomes clear. This may also relate to the issue of having relationships between published datasets.
A concrete example is given by Freitas et al. [COGS], where transformations on financial datasets, e.g., the addition of derived measures, conversion of units, aggregations, OLAP operations, and enrichment, are executed before showing the data in a Web-based report.
See the SWPM2012 Provenance Example for screenshots about this use case.
Making transparent the transformations a dataset has undergone increases trust in the data.
How to relate two instances of qb:DataSet (e.g., ex:populationCount and ex:populationPercent)?

(Use case taken from the SMART natural sciences research project)
Data that is published on the Web is typically visualized by transforming it manually into CSV or Excel and then creating a visualization on top of these formats using Excel, Tableau, RapidMiner, Rattle, Weka etc.
This use case shall demonstrate how statistical data published on the Web can be visualized inside a webpage with little effort and without using commercial or highly complex tools.
An example scenario is environmental research done within the SMART research project. Here, statistics about environmental aspects (e.g., measurements about the climate in the Lower Jordan Valley) shall be visualized for scientists and decision makers. It should also be possible to integrate and display several statistics together. The data is available as XML files on the Web, which are re-published as Linked Data using the Data Cube Vocabulary. On a separate website, specific parts of the data shall be queried and visualized in simple charts, e.g., line diagrams.
Figure 1: HTML-embedded line chart of an environmental measure over time for three regions in the Lower Jordan Valley
Figure 2: The same data shown in a pivot table aggregating to single months. Here, the aggregate COUNT of measures per cell is given.
Easy, flexible and powerful visualizations of published statistical data.
(Use case taken from the Google Public Data Explorer (GPDE))
The Google Public Data Explorer (GPDE) provides an easy possibility to visualize and explore statistical data. Data needs to be in the Dataset Publishing Language (DSPL) to be uploaded to the data explorer. A DSPL dataset is a bundle that contains an XML file, the schema, and a set of CSV files, the actual data. Google provides a tutorial to create a DSPL dataset from your data, e.g., in CSV. This requires a good understanding of XML, as well as a good understanding of the data that shall be visualized and explored.
In this use case, the goal is to take statistical data published as Linked Data re-using the Data Cube Vocabulary and to transform it into DSPL for visualization and exploration using GPDE with as little effort as possible.
For instance, consider Eurostat data about the unemployment rate downloaded from the Web and visualized as shown in the following figure:
Figure 3: An interactive chart in GPDE for visualizing Eurostat data described with DSPL
There are different possible approaches, each having advantages and disadvantages: 1) a customer downloads this data into a triple store; SPARQL queries on this data can then be used to transform it into DSPL, which is uploaded and visualized using GPDE; or 2) one or more XSLT transformations on the RDF/XML serialization transform the data into DSPL.
(Use case taken from the Financial Information Observation System (FIOS))
Online Analytical Processing (OLAP) [OLAP] is an analysis method for multidimensional data. It is an explorative analysis method that allows users to interactively view the data from different angles (rotate, select) or granularities (drill-down, roll-up), and to filter it for specific information (slice, dice).
OLAP systems are commonly used in industry to analyze statistical data on a regular basis. OLAP systems first use ETL pipelines to extract, transform, and load relevant data into a data warehouse, and then provide interfaces to efficiently issue OLAP queries on the data.
The goal in this use case is to allow analysis of published statistical data with common OLAP systems [OLAP4LD].
For that, a multidimensional model of the data needs to be generated. A multidimensional model consists of facts summarized in data cubes. Facts exhibit measures depending on members of dimensions. Members of dimensions can be further structured along hierarchies of levels.
An example scenario of this use case is the Financial Information Observation System (FIOS) [FIOS], where XBRL data provided by the SEC on the Web is re-published as Linked Data and made available for exploration and analysis by stakeholders in Saiku, a Web-based OLAP client.
The following figure shows an example of using FIOS. Here, for three different companies, the Cost of Goods Sold as disclosed in XBRL documents is analyzed. As cell values, either the number of disclosures or — if only one is available — the actual number in USD is given:
Figure 4: Example of using FIOS for OLAP operations on financial data
(Use case motivated by the Data Catalog Vocabulary and the RDF Data Cube Vocabulary datasets in the PlanetData Wiki)
After statistics have been published as Linked Data, the question remains how to communicate the publication and to let users discover the statistics. There are catalogs to register datasets, e.g., CKAN, datacite.org, da|ra, and Pangea. Those catalogs require specific configurations to register statistical data.
The goal of this use case is to demonstrate how to expose and distribute statistics after publication, for instance, to allow automatic registration of statistical data in such catalogs for finding and evaluating datasets. To solve this issue, it should be possible to transform the published statistical data into formats that can be used by data catalogs.
A concrete use case is the structured collection of RDF Data Cube Vocabulary datasets in the PlanetData Wiki. This list is supposed to describe statistical datasets on a higher level — for easy discovery and selection — and to provide a useful overview of RDF Data Cube deployments in the Linked Data cloud.
The use cases presented in the previous section give rise to the following lessons that can motivate future work on the vocabulary as well as associated tools or services complementing the vocabulary.
The draft version of the vocabulary builds upon SDMX Standards Version 2.0. A newer version of SDMX, SDMX Standards, Version 2.1, is available.
The requirement is to at least build upon Version 2.0; if specific use cases derived from Version 2.1 become available, the working group may consider building upon Version 2.1.
Background information:
Supporting use cases:
There should be a consensus on the issue of flattening or abbreviating data; one suggestion is to author data without the duplication, but have the data publication tools "flatten" the compact representation into standalone observations during the publication process.
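A minimal sketch of what such an abbreviated form might look like, reusing the pseudo-turtle population example from above (whether this exact mechanism is adopted is up to the vocabulary):

# Abbreviated: the shared reference period is stated once, on a slice.
# (The data structure definition would declare sdmx:refPeriod as
# attached at the slice level.)
ex:slice2011 a qb:Slice ;
    sdmx:refPeriod "2011" ;
    qb:observation ex:obs2, ex:obs3 .

ex:obs2 sdmx:refArea <england> ; ex:population "50" .
ex:obs3 sdmx:refArea <scotland> ; ex:population "5" .

# Flattened by a publication tool into standalone observations:
# ex:obs2 sdmx:refArea <england> ; sdmx:refPeriod "2011" ; ex:population "50" .
# ex:obs3 sdmx:refArea <scotland> ; sdmx:refPeriod "2011" ; ex:population "5" .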
Background information:
Regarding qb:subslice, the vocabulary should clarify or drop the use of qb:subslice; see issue: http://www.w3.org/2011/gld/track/issues/34

Supporting use cases:
First, hierarchical code lists may be supported via SKOS [SKOS]. This would allow for cross-location and cross-time analysis of statistical datasets.
Second, one can think of non-SKOS hierarchical code lists, e.g., if simple skos:narrower/skos:broader relationships are not sufficient or if a vocabulary uses specific hierarchical properties, e.g., geo:containedIn.
Also, the use of hierarchy levels needs to be clarified. It has been suggested to allow skos:Collection as a value of qb:codeList.
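A minimal sketch of attaching a hierarchical SKOS code list to a dimension (identifiers illustrative; whether a skos:Collection may appear in place of the concept scheme is exactly the open question):

@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ns#> .

ex:refArea a qb:DimensionProperty ;
    rdfs:range skos:Concept ;
    qb:codeList ex:areaScheme .    # the permitted, hierarchically organized codes

ex:areaScheme a skos:ConceptScheme ;
    skos:hasTopConcept ex:uk .
ex:uk skos:narrower ex:england, ex:scotland, ex:wales, ex:northernireland .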
Richard Cyganiak gave a summary of different options for specifying the allowed dimension values of a coded property, possibly including hierarchies (see mail):
Background information:
Supporting use cases:
A number of organizations, particularly in the climate and meteorological area, already have some commitment to the OGC "Observations and Measurements" (O&M) logical data model, also published as ISO 19156. Are there any statements about compatibility and interoperability between O&M and Data Cube that can be made to give guidance to such organizations?
Partly solved by the description for the publisher case study "Site-specific weather forecasts from the Met Office, the UK's National Weather Service".
Background information:
Supporting use cases:
Clarify the relationship between DCAT and QB.
Background information:
Supporting use cases:
We thank Phil Archer, John Erickson, Rinke Hoekstra, Bernadette Hyland, Aftab Iqbal, James McKinney, Dave Reynolds, Biplav Srivastava, and Boris Villazón-Terrazas for feedback and input.
We thank Hadley Beeman, Sandro Hawke, Bernadette Hyland, and George Thomas for their help with publishing this document.