Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
Many national, regional and local governments, as well as other organizations inside and outside of the public sector, collect numeric data and aggregate this data into statistics. There is a need to publish these statistics in a standardized, machine-readable way on the Web, so that they can be freely integrated and reused in consuming applications.
In this document, the W3C Government Linked Data Working Group presents use cases and lessons supporting a recommendation of the RDF Data Cube Vocabulary [QB-2013]. We describe case studies of existing deployments of an earlier version of the Data Cube Vocabulary [QB-2010] as well as other possible use cases that would benefit from using the vocabulary. In particular, we identify benefits and challenges in using a vocabulary for representing statistics. Also, we derive lessons that can be used for future work on the vocabulary as well as for useful tools complementing the vocabulary.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document was published by the Government Linked Data Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-gld-comments@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
The rest of this document is structured as follows. We will first give a short introduction to modeling statistics. Then, we will describe use cases that have been derived from existing deployments or from feedback on the earlier version of the Data Cube Vocabulary. In particular, we describe possible benefits and challenges of the use cases. Afterwards, we will describe lessons derived from the use cases.
We use the term "Data Cube Vocabulary" throughout the document when referring to the vocabulary.
In the following, we describe the challenge of authoring an RDF vocabulary for publishing statistics as Linked Data. Describing statistics — collected and aggregated numeric data — is challenging for the following reasons:
The Statistical Data and Metadata eXchange [SDMX] — the ISO standard for exchanging and sharing statistical data and metadata among organizations — uses a "multidimensional model" to meet the above challenges in modeling statistics. It can describe statistics as observations. Observations exhibit values (Measures) that depend on dimensions (Members of Dimensions). Since the SDMX standard has proven applicable in many contexts, the Data Cube Vocabulary adopts the multidimensional model that underlies SDMX and will be compatible with SDMX.
Statistics is the study of the collection, organization, analysis, and interpretation of data. Statistics comprise statistical data.
The basic structure of statistical data is a multidimensional table (also called a data cube) [SDMX], i.e., a set of observed values organized along a group of dimensions, together with associated metadata. We refer to aggregated statistical data as "macro-data" and unaggregated statistical data as "micro-data".
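To give a flavor of this structure in RDF, here is a minimal sketch of a single observed value organized along two dimensions (all identifiers are illustrative):

@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/ns#> .

ex:obs a qb:Observation ;
    qb:dataSet ex:dataset ;    # the dataset the value belongs to
    ex:refArea ex:uk ;         # dimension 1: area
    ex:refPeriod "2011" ;      # dimension 2: period
    ex:population 60 .         # the observed value (measure)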
Statistical data can be collected in a dataset, typically published and maintained by an organization [SDMX]. The dataset contains metadata, e.g., about the time of collection and publication or about the maintaining and publishing organization.
Source data is data from data stores such as relational databases or spreadsheets that acts as a source for the Linked Data publishing process.
Metadata about statistics defines the data structure and gives contextual information about the statistics.
A format is machine-readable if it is amenable to automated processing by a machine, as opposed to presentation to a human user.
A publisher is a person or organization that exposes source data as Linked Data on the Web.
A consumer is a person or agent that uses Linked Data from the Web.
A registry allows a publisher to announce that data or metadata exists and to add information about how to obtain that data [SDMX 2.1].
This section presents scenarios that are enabled by the existence of a standard vocabulary for the representation of statistics as Linked Data.
(Use case taken from the SDMX Web Dissemination Use Case [SDMX 2.1])
Since we have adopted the multidimensional model that underlies SDMX, we also adopt the "Web Dissemination Use Case", which is the prime use case for SDMX, since it is an increasingly popular use of SDMX and enables organizations to build a self-updating dissemination system.
The Web Dissemination Use Case contains three actors: a structural metadata Web service (registry) that collects metadata about statistical data in a registration fashion; a data Web service (publisher) that publishes statistical data and its metadata as registered in the structural metadata Web service; and a data consumption application (consumer) that first discovers data from the registry, then queries data from the corresponding publisher of selected data, and then visualizes the data.
(This use case has been summarized from Ian Dickinson et al. [COINS])
More and more organizations want to publish statistics on the Web, for reasons such as increasing transparency and trust. Although, in the ideal case, published data can be understood by both humans and machines, data often is simply published as CSV, PDF, XLS etc., lacking elaborate metadata, which makes free usage and analysis difficult.
Therefore, the goal in this scenario is to use a machine-readable and application-independent description of common statistics, expressed using open standards, to foster usage of and innovation on the published data. In the "COINS as Linked Data" project [COINS], the Combined Online Information System (COINS) shall be published using a standard Linked Data vocabulary. Via COINS, HM Treasury, the principal custodian of financial data for the UK government, releases previously restricted financial information about government spending.
The COINS data has a hypercube structure. It describes financial transactions using seven independent dimensions (time, data-type, department etc.) and one dependent measure (value). Also, it allows thirty-three attributes that may further describe each transaction. COINS is an example of one of the more complex statistical datasets being published via data.gov.uk.
Part of the complexity of COINS arises from the nature of the data being released:
The published COINS datasets cover expenditure related to five different years (2005–06 to 2009–10). The actual COINS database at HM Treasury is updated daily. In principle at least, multiple snapshots of the COINS data could be released throughout the year.
The actual data and its hypercube structure are to be represented separately so that an application can first examine the structure before deciding to download the actual data, i.e., the transactions. The hypercube structure also defines, for each dimension and attribute, a range of permitted values that are to be represented.
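A minimal Turtle sketch of how such a separation might look (the ex: identifiers are illustrative, not the actual COINS vocabulary; only three of the seven dimensions and one attribute are shown):

@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/coins#> .

# The structure is a separate resource that an application can examine
# before deciding to download the transactions themselves.
ex:coins-structure a qb:DataStructureDefinition ;
    qb:component [ qb:dimension ex:refPeriod ] ,   # time
                 [ qb:dimension ex:dataType ] ,
                 [ qb:dimension ex:department ] ;
    qb:component [ qb:measure ex:value ] ;         # the one dependent measure
    qb:component [ qb:attribute ex:currency ] .    # one of the 33 optional attributes

ex:coins a qb:DataSet ;
    qb:structure ex:coins-structure .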
An access or query interface to the COINS data, e.g., via a SPARQL endpoint or the Linked Data API, is planned. Queries that are expected to be interesting are: "spending for one department", "total spending by department", "retrieving all data for a given observation" etc.
According to the COINS as Linked Data project, the reasons for publishing COINS as Linked Data are threefold:
The COINS use case leads to the following challenges:
(This use case has been contributed by Rinke Hoekstra. See CEDA_R and Data2Semantics for more information.)
Not only in government is there a need to publish considerable amounts of statistical data to be consumed in various (also unexpected) application scenarios. Typically, Microsoft Excel sheets are made available for download.
For instance, in the CEDA_R and Data2Semantics projects, publishing and harmonizing Dutch historical census data (from 1795 onwards) is a goal. These censuses are currently only available as Excel spreadsheets (obtained by data entry) that closely mimic the way in which the data was originally published; they shall be published as Linked Data.
Those Excel sheets contain single spreadsheets with several multidimensional data tables, having a name and notes, as well as column values, row values, and cell values.
Another concrete example is the Stats2RDF project that intends to publish Excel sheets with biomedical statistical data. Here, Excel files are first translated into CSV and then translated into RDF using OntoWiki, a semantic wiki.
(Use case has been taken from [QB4OLAP] and from discussions on the publishing-statistical-data mailing list)
It often comes up in statistical data that you have some kind of 'overall' figure, which is then broken down into parts.
Example (in pseudo-turtle RDF):
ex:obs1 sdmx:refArea <uk>; sdmx:refPeriod "2011"; ex:population "60" .
ex:obs2 sdmx:refArea <england>; sdmx:refPeriod "2011"; ex:population "50" .
ex:obs3 sdmx:refArea <scotland>; sdmx:refPeriod "2011"; ex:population "5" .
ex:obs4 sdmx:refArea <wales>; sdmx:refPeriod "2011"; ex:population "3" .
ex:obs5 sdmx:refArea <northernireland>; sdmx:refPeriod "2011"; ex:population "2" .
We are looking for the best way (in the context of the RDF/Data Cube/SDMX approach) to express that the values for England, Scotland, Wales and Northern Ireland ought to add up to the value for the UK and constitute a more detailed breakdown of the overall UK figure. Since we might also have population figures for France, Germany, EU28 etc., it is not as simple as just taking a qb:Slice where you fix the time period and the measure.
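One direction that has been discussed (see the lesson on hierarchical code lists below) is to express the part-whole relation in the code list itself rather than in the observations; a minimal sketch, reusing the pseudo-turtle area codes above:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# The breakdown is attached to the area codes, not to the observations,
# so figures for France, Germany, EU28 etc. remain unaffected.
<uk> a skos:Concept ;
    skos:narrower <england>, <scotland>, <wales>, <northernireland> .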
Similarly, Etcheverry and Vaisman [QB4OLAP] present the use case of publishing household data from StatsWales and Open Data Communities.
This multidimensional data contains for each fact a time dimension with one level, Year, and a location dimension with levels Unitary Authority, Government Office Region, Country, and ALL. The measure is expressed in units of 1000 households.
In this use case, one wants to publish not only a dataset on the bottom-most level, i.e., the number of households at each Unitary Authority in each year, but also datasets on more aggregated levels. For instance, in order to publish a dataset with the number of households at each Government Office Region per year, one needs to aggregate the measure of each fact having the same Government Office Region using the SUM function.
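A minimal sketch (identifiers and figures illustrative) of the same measure published at two levels as two separate datasets:

@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/households#> .

# Bottom-most level: one observation per Unitary Authority and year.
ex:obsA a qb:Observation ;
    qb:dataSet ex:householdsByAuthority ;
    ex:refArea ex:authority1 ;    # Unitary Authority level
    ex:refPeriod "2011" ;
    ex:households 142 .           # unit: 1000 households

# Aggregated level: the SUM over all authorities in region1.
ex:obsB a qb:Observation ;
    qb:dataSet ex:householdsByRegion ;
    ex:refArea ex:region1 ;       # Government Office Region level
    ex:refPeriod "2011" ;
    ex:households 455 .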
Similarly, for many uses the population broken down by some category (e.g., ethnicity) is expressed as a percentage. Separate datasets give the actual counts per category and aggregate counts. In such cases it is common to talk about the denominator (often DENOM), which is the aggregate count against which the percentages can be interpreted.
(Use case has been provided by Epimorphics Ltd, in their UK Bathing Water Quality deployment)
As part of their work with data.gov.uk and the UK Location Programme, Epimorphics Ltd have been working to pilot the publication of both current and historic bathing water quality information from the UK Environment Agency as Linked Data.
The UK has a number of areas, typically beaches, that are designated as bathing waters where people routinely enter the water. The Environment Agency monitors and reports on the quality of the water at these bathing waters.
The Environment Agency's data can be thought of as structured in three groups:
The most important dimensions of the data are bathing water, sampling point, and compliance classification.
The Met Office, the UK's National Weather Service, provides a range of weather forecast products, including openly available site-specific forecasts for the UK. The site-specific forecasts cover over 5000 forecast points; each forecast predicts 10 parameters and spans a 5-day window at 3-hourly intervals; the whole forecast is updated each hour. A proof-of-concept project investigated the challenge of publishing this information as Linked Data using the Data Cube Vocabulary.
This weather forecasts case study leads to the following challenges:
The World Meteorological Organization (WMO) develops and recommends data interchange standards, and within that community compatibility with ISO 19156 "Geographic information — Observations and measurements" (O&M) is regarded as important. Thus, this supports the lesson "Modelers using ISO 19156 Observations & Measurements may need clarification regarding the relationship to the Data Cube Vocabulary".
Solution in this case study: O&M provides a data model for an Observation with associated Phenomenon, measurement ProcessUsed, Domain (feature of interest) and Result. Prototype vocabularies developed at CSIRO and extended within this project allow this data model to be represented in RDF. For the site-specific forecasts, a 5-day forecast for all 5000+ sites is regarded as a single O&M Observation.
To represent the forecast data itself, the Result in the O&M model, the relevant standard is ISO 19123 "Geographic information — Schema for coverage geometry and functions". This provides a data model for a Coverage, which can represent a set of values across some space. It defines different types of Coverage, including a DiscretePointCoverage suited to representing site-specific forecast results.
It turns out that it is straightforward to treat an RDF Data Cube as a particular concrete representation of the DiscretePointCoverage logical model. The cube has dimensions corresponding to the forecast time and location, and the measure is a record representing the forecast values of the 10 phenomena. Slices by time and location provide subsets of the data that directly match the data packages supported by an existing on-line service.
Note that in this situation an observation in the sense of qb:Observation and an observation in the sense of ISO 19156 Observations and Measurements are different things. The O&M Observation is the whole forecast, whereas each qb:Observation corresponds to a single GeometryValuePair within the forecast results Coverage.
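A minimal sketch of one such qb:Observation (property names and values are illustrative; only two of the 10 forecast parameters are shown):

@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/forecast#> .

# One GeometryValuePair: the forecast for one site at one time step.
ex:fc1 a qb:Observation ;
    qb:dataSet ex:forecastRun ;
    ex:site ex:site3002 ;                                   # location dimension
    ex:forecastTime "2013-04-01T09:00:00Z"^^xsd:dateTime ;  # time dimension
    ex:airTemperature 8.5 ;                                 # measures: two of the
    ex:windSpeed 13 .                                       # 10 phenomena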
Each hourly update comprises over 2 million data points, and forecast data is requested by a large number of data consumers. Bandwidth costs are thus a key consideration, and the apparent verbosity of RDF in general, and Data Cube specifically, was a concern. This supports the lesson "Publishers and consumers may need more guidance in efficiently processing data using the Data Cube Vocabulary".
Solution in this case study: Regarding bandwidth costs, the key is not raw data volume but compressibility, since such data is transmitted in compressed form. A Turtle representation of a non-abbreviated data cube compresses to within 15-20% of the size of compressed, hand-crafted XML and JSON representations, thus obviating the need for abbreviations or custom serialization.
(This use case has been taken from the Eurostat Linked Data Wrapper and Linked Statistics Eurostat Data, both deployments for publishing Eurostat SDMX as Linked Data using the draft version of the Data Cube Vocabulary)
As mentioned already, the ISO standard for exchanging and sharing statistical data and metadata among organizations is the Statistical Data and Metadata eXchange [SDMX]. Since this standard has proven applicable in many contexts, we adopt the multidimensional model that underlies SDMX and intend the standard vocabulary to be compatible with SDMX. Therefore, in this use case we explain the benefits and challenges of publishing SDMX data as Linked Data.
As one of the main adopters of SDMX, Eurostat publishes large amounts of European statistics coming from a data warehouse as SDMX and other formats on the Web. Eurostat also provides an interface to browse and explore the datasets. However, linking such multidimensional data to related data sets and concepts would require downloading of interesting datasets and manual integration. The goal here is to improve integration with other datasets; Eurostat data should be published on the Web in a machine-readable format, possibly to be linked with other datasets, and possibly to be freely consumed by applications.

Both the Eurostat Linked Data Wrapper and Linked Statistics Eurostat Data intend to publish Eurostat SDMX data as 5-Star Linked Open Data. Eurostat data is partly published as SDMX, partly as tabular data (TSV, similar to CSV). Eurostat provides a TOC of published datasets as well as a feed of modified and new datasets. Eurostat also provides a list of used code lists, i.e., the range of permitted dimension values. Any Eurostat dataset contains a varying set of dimensions (e.g., date, geo, obs_status, sex, unit) as well as measures (generic value, content is specified by dataset, e.g., GDP per capita in PPS, Total population, Employment rate by sex).
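As a minimal sketch, a single observation from such a dataset might look as follows in Turtle (identifiers, codes and the figure are illustrative; the actual wrappers use their own URI schemes):

@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/eurostat#> .

# One value from a dataset with dimensions date, geo, sex and unit.
ex:obs1 a qb:Observation ;
    qb:dataSet ex:totalPopulation ;
    ex:date "2012" ;
    ex:geo ex:DE ;
    ex:sex ex:T ;        # T = total, taken from the sex code list
    ex:unit ex:NR ;      # NR = number
    ex:value 81800000 .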
(This use case has mainly been taken from [COGS])
In several applications, relationships between statistical data need to be represented.
The goal of this use case is to describe provenance, transformations, and versioning around statistical data, so that the history of statistics published on the Web becomes clear. This may also relate to the issue of having relationships between published datasets.
A concrete example is given by Freitas et al. [COGS], where transformations on financial datasets, e.g., the addition of derived measures, conversion of units, aggregations, OLAP operations, and enrichment, are executed before showing the data in a Web-based report.
See the SWPM2012 Provenance Example for screenshots about this use case.
Making transparent the transformations a dataset has undergone increases trust in the data.
How to relate two instances of qb:DataSet (e.g., ex:populationCount and ex:populationPercent)?

(Use case taken from the SMART natural sciences research project)
Data that is published on the Web is typically visualized by transforming it manually into CSV or Excel and then creating a visualization on top of these formats using Excel, Tableau, RapidMiner, Rattle, Weka etc.
This use case shall demonstrate how statistical data published on the Web can be visualized inside a webpage with little effort and without using commercial or highly complex tools.
An example scenario is environmental research done within the SMART research project. Here, statistics about environmental aspects (e.g., measurements about the climate in the Lower Jordan Valley) shall be visualized for scientists and decision makers. It should also be possible to integrate and display several statistics together. The data is available as XML files on the Web, which are re-published as Linked Data using the Data Cube Vocabulary. On a separate website, specific parts of the data shall be queried and visualized in simple charts, e.g., line diagrams.
Figure 1: HTML-embedded line chart of an environmental measure over time for three regions in the Lower Jordan Valley
Figure 2: The same data shown in a pivot table aggregating to single months. Here, the aggregate COUNT of measures per cell is given.
Easy, flexible and powerful visualizations of published statistical data.
(Use case taken from the Google Public Data Explorer (GPDE))
The Google Public Data Explorer (GPDE) provides an easy possibility to visualize and explore statistical data. Data needs to be in the Dataset Publishing Language (DSPL) to be uploaded to the data explorer. A DSPL dataset is a bundle that contains an XML file, the schema, and a set of CSV files, the actual data. Google provides a tutorial to create a DSPL dataset from your data, e.g., in CSV. This requires a good understanding of XML, as well as a good understanding of the data that shall be visualized and explored.
In this use case, the goal is to take statistical data published as Linked Data re-using the Data Cube Vocabulary and to transform it into DSPL for visualization and exploration using GPDE with as little effort as possible.
For instance, consider Eurostat data about the unemployment rate downloaded from the Web and visualized as shown in the following figure:
Figure 3: An interactive chart in GPDE for visualizing Eurostat data described with DSPL
There are different possible approaches, each having advantages and disadvantages: 1) a customer downloads this data into a triple store; SPARQL queries on this data can then be used to transform it into DSPL, which is uploaded and visualized using GPDE; or 2) one or more XSLT transformations on the RDF/XML serialization transform the data into DSPL.
(Use case taken from the Financial Information Observation System (FIOS))
Online Analytical Processing (OLAP) [OLAP] is an analysis method for multidimensional data. It is an explorative analysis method that allows users to interactively view the data from different angles (rotate, select) or granularities (drill-down, roll-up), and to filter it for specific information (slice, dice).
OLAP systems are commonly used in industry to analyze statistical data on a regular basis. OLAP systems first use ETL pipelines to extract, transform, and load relevant data into a data warehouse, and then provide interfaces to efficiently issue OLAP queries on the data.
The goal in this use case is to allow analysis of published statistical data with common OLAP systems [OLAP4LD].
For that, a multidimensional model of the data needs to be generated. A multidimensional model consists of facts summarized in data cubes. Facts exhibit measures depending on members of dimensions. Members of dimensions can be further structured along hierarchies of levels.
An example scenario of this use case is the Financial Information Observation System (FIOS) [FIOS], where XBRL data provided by the SEC on the Web is re-published as Linked Data and made available for exploration and analysis by stakeholders in Saiku, a Web-based OLAP client.
The following figure shows an example of using FIOS. Here, for three different companies, the Cost of Goods Sold as disclosed in XBRL documents is analyzed. As cell values, either the number of disclosures or — if only one is available — the actual number in USD is given:
Figure 4: Example of using FIOS for OLAP operations on financial data
(Use case motivated by the Data Catalog Vocabulary and the RDF Data Cube Vocabulary datasets in the PlanetData Wiki)
After statistics have been published as Linked Data, the question remains how to communicate the publication and to let users discover the statistics. There are catalogs to register datasets, e.g., CKAN, datacite.org, da|ra, and Pangea. Those catalogs require specific configurations to register statistical data.
The goal of this use case is to demonstrate how to expose and distribute statistics after publication, for instance, to allow automatic registration of statistical data in such catalogs for finding and evaluating datasets. To solve this issue, it should be possible to transform the published statistical data into formats that can be used by data catalogs.
A concrete use case is the structured collection of RDF Data Cube Vocabulary datasets in the PlanetData Wiki. This list is supposed to describe statistical datasets on a higher level — for easy discovery and selection — and to provide a useful overview of RDF Data Cube deployments in the Linked Data cloud.
The use cases presented in the previous section give rise to the following lessons that can motivate future work on the vocabulary as well as associated tools or services complementing the vocabulary.
The draft version of the vocabulary builds upon SDMX Standards Version 2.0. A newer version of SDMX, SDMX Standards, Version 2.1, is available.
The requirement is to at least build upon Version 2.0; if specific use cases derived from Version 2.1 become available, the working group may consider building upon Version 2.1.
Background information:
Supporting use cases:
There should be a consensus on the issue of flattening or abbreviating data; one suggestion is to author data without the duplication, but have the data publication tools "flatten" the compact representation into standalone observations during the publication process.
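A minimal sketch of what such an abbreviated form might look like, reusing the pseudo-turtle population example from above (whether this exact mechanism is adopted is up to the vocabulary):

# Abbreviated: the shared reference period is stated once, on a slice.
# (The data structure definition would declare sdmx:refPeriod as
# attached at the slice level.)
ex:slice2011 a qb:Slice ;
    sdmx:refPeriod "2011" ;
    qb:observation ex:obs2, ex:obs3 .

ex:obs2 sdmx:refArea <england> ; ex:population "50" .
ex:obs3 sdmx:refArea <scotland> ; ex:population "5" .

# Flattened by a publication tool into standalone observations:
# ex:obs2 sdmx:refArea <england> ; sdmx:refPeriod "2011" ; ex:population "50" .
# ex:obs3 sdmx:refArea <scotland> ; sdmx:refPeriod "2011" ; ex:population "5" .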
Background information:
Regarding qb:subslice, the vocabulary should clarify or drop the use of qb:subslice; see issue: http://www.w3.org/2011/gld/track/issues/34

Supporting use cases:
First, hierarchical code lists may be supported via SKOS [SKOS]. This would allow for cross-location and cross-time analysis of statistical datasets.
Second, one can think of non-SKOS hierarchical code lists, e.g., if simple skos:narrower/skos:broader relationships are not sufficient or if a vocabulary uses specific hierarchical properties, e.g., geo:containedIn.
Also, the use of hierarchy levels needs to be clarified. It has been suggested to allow skos:Collection as a value of qb:codeList.
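A minimal sketch of attaching a hierarchical SKOS code list to a dimension (identifiers illustrative; whether a skos:Collection may appear in place of the concept scheme is exactly the open question):

@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ns#> .

ex:refArea a qb:DimensionProperty ;
    rdfs:range skos:Concept ;
    qb:codeList ex:areaScheme .    # the permitted, hierarchically organized codes

ex:areaScheme a skos:ConceptScheme ;
    skos:hasTopConcept ex:uk .
ex:uk skos:narrower ex:england, ex:scotland, ex:wales, ex:northernireland .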
Richard Cyganiak gave a summary of different options for specifying the allowed dimension values of a coded property, possibly including hierarchies (see mail):
Background information:
Supporting use cases:
A number of organizations, particularly in the climate and meteorological area, already have some commitment to the OGC "Observations and Measurements" (O&M) logical data model, also published as ISO 19156. Are there any statements about compatibility and interoperability between O&M and Data Cube that can be made to give guidance to such organizations?
Partly solved by the description for the publisher case study "Site-specific weather forecasts from the Met Office, the UK's National Weather Service".
Background information:
Supporting use cases:
Clarify the relationship between DCAT and QB.
Background information:
Supporting use cases:
We thank Phil Archer, John Erickson, Rinke Hoekstra, Bernadette Hyland, Aftab Iqbal, James McKinney, Dave Reynolds, Biplav Srivastava, and Boris Villazón-Terrazas for feedback and input.
We thank Hadley Beeman, Sandro Hawke, Bernadette Hyland, and George Thomas for their help with publishing this document.