W3C

Model for Tabular Data and Metadata on the Web

W3C Recommendation

This version:
http://www.w3.org/TR/2015/REC-tabular-data-model-20151217/
Latest published version:
http://www.w3.org/TR/tabular-data-model/
Latest editor's draft:
http://w3c.github.io/csvw/syntax/
Test suite:
http://www.w3.org/2013/csvw/tests/
Implementation report:
http://www.w3.org/2013/csvw/implementation_report.html
Previous version:
http://www.w3.org/TR/2015/PR-tabular-data-model-20151117/
Editors:
Jeni Tennison, Open Data Institute
Gregg Kellogg, Kellogg Associates
Authors:
Jeni Tennison, Open Data Institute
Gregg Kellogg, Kellogg Associates
Ivan Herman, W3C
Repository:
We are on GitHub
File a bug
Changes:
Diff to previous version
Commit history

Please check the errata for any errors or issues reported since publication.

This document is also available in this non-normative format: ePub

The English version of this specification is the only normative version. Non-normative translations may also be available.

Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.


Abstract

Tabular data is routinely transferred on the web in a variety of formats, including variants on CSV, tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps. This document outlines a data model, or infoset, for tabular data and metadata about that tabular data that can be used as a basis for validation, display, or creating other formats. It also contains some non-normative guidance for publishing tabular data as CSV and how that maps into the tabular data model.

An annotated model of tabular data can be supplemented by separate metadata about the table. This specification defines how implementations should locate that metadata, given a file containing tabular data. The standard syntax for that metadata is defined in [tabular-metadata]. Note, however, that applications may have other means to create annotated tables, e.g., through some application-specific APIs; this model does not depend on the specificities described in [tabular-metadata].

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the "Access methods for CSV Metadata" recommendation (see section 5. Locating Metadata), though it also specifies an underlying model for tabular data and is therefore a basis for the other chartered Recommendations.

The definition of CSV used in this document is based on IETF's [RFC4180], which is an Informational RFC. The working group's expectation is that future suggestions to refine RFC 4180 will be relayed to the IETF (e.g. around encoding and line endings) and contribute to its discussions about moving CSV to the Standards track.

Many files containing tabular data embed metadata, for example in lines before the header row of an otherwise standard CSV document. This specification does not define any formats for embedding metadata within CSV files, aside from the titles of columns in the header row which is defined in CSV. We would encourage groups that define tabular data formats to also define a mapping into the annotated tabular data model defined in this document.

This document was published by the CSV on the Web Working Group as a Recommendation. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). All comments are welcome.

Please see the Working Group's implementation report.

This document has been reviewed by W3C Members, by software developers, and by other W3C groups and interested parties, and is endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited from another document. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.

Table of Contents

1.Introduction

Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of cells (although some of these cells may be empty), which provide values of properties of the thing described by the row. In tabular data, cells within the same column provide values for the same property of the things described by each row. This is what differentiates tabular data from other line-oriented formats.

Tabular data is routinely transferred on the web in a textual format called CSV, but the definition of CSV in practice is very loose. Some people use the term to mean any delimited text file. Others stick more closely to the most standard definition of CSV that there is, [RFC4180]. Appendix A describes the various ways in which CSV is defined. This specification refers to such files, as well as tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps as tabular data files.

In section 4. Tabular Data Models, this document defines a model for tabular data that abstracts away from the varying syntaxes that are used when exchanging tabular data. The model includes annotations, or metadata, about collections of individual tables, rows, columns, and cells. These annotations are typically supplied through separate metadata files; section 5. Locating Metadata defines how these metadata files can be located, while [tabular-metadata] defines what they contain.

Once an annotated table has been created, it can be processed in various ways, such as display, validation, or conversion into other formats. This processing is described in section 6. Processing Tables.

This specification does not normatively define a format for exchanging tabular data. However, it does provide some best practice guidelines for publishing tabular data as CSV, in section 7. Best Practice CSV, and for parsing both this syntax and those similar to it, in section 8. Parsing Tabular Data.

2.Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY, MUST, MUST NOT, SHOULD, and SHOULD NOT are to be interpreted as described in [RFC2119].

This specification makes use of the compact IRI syntax; please refer to Compact IRIs in [JSON-LD].

This specification makes use of the following namespaces:

csvw:
http://www.w3.org/ns/csvw#
dc:
http://purl.org/dc/terms/
rdf:
http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:
http://www.w3.org/2000/01/rdf-schema#
schema:
http://schema.org/
xsd:
http://www.w3.org/2001/XMLSchema#

3.Typographical conventions

The following typographic conventions are used in this specification:

markup
Markup (elements, attributes, properties), machine-processable values (strings, characters, media types), property names, and file names are in red-orange monospace font.
variable
A variable in pseudo-code or in an algorithm description is in italics.
definition
A definition of a term, to be used elsewhere in this or other specifications, is in bold and italics.
definition reference
A reference to a definition in this document is underlined and is also an active link to the definition itself.
markup definition reference
A reference to a definition in this document, when the reference itself is also markup, is underlined, in red-orange monospace font, and is also an active link to the definition itself.
external definition reference
A reference to a definition in another document is underlined, in italics, and is also an active link to the definition itself.
markup external definition reference
A reference to a definition in another document, when the reference itself is also markup, is underlined, in italic red-orange monospace font, and is also an active link to the definition itself.
hyperlink
A hyperlink is underlined and in blue.
[reference]
A document reference (normative or informative) is enclosed in square brackets and links to the references section.
Note

Notes are in light green boxes with a green left border and with a "Note" header in green. Notes are normative or informative depending on whether they are in a normative or informative section, respectively.

Example 1
Examples are in light khaki boxes, with khaki left border, and with a numbered "Example" header in khaki. Examples are always informative. The content of the example is in monospace font and may be syntax colored.

4.Tabular Data Models

This section defines an annotated tabular data model: a model for tables that are annotated with metadata. Annotations provide information about the cells, rows, columns, tables, and groups of tables with which they are associated. The values of these annotations may be lists, structured objects, or atomic values. Core annotations are those that affect the behavior of processors defined in this specification, but other annotations may also be present on any of the components of the model.

Annotations may be described directly in [tabular-metadata], be embedded in a tabular data file, or created during the process of generating an annotated table.

String values within the tabular data model (such as column titles or cell string values) MUST contain only Unicode characters.

Note

In this document, the term annotation refers to any metadata associated with an object in the annotated tabular data model. These are not necessarily web annotations in the sense of [annotation-model].

4.1Table groups

A group of tables comprises a set of annotated tables and a set of annotations that relate to that group of tables. The core annotations of a group of tables are:

Groups of tables MAY in addition have any number of annotations which provide information about the group of tables. Annotations on a group of tables may include:

When originating from [tabular-metadata], these annotations arise from common properties defined on table group descriptions within metadata documents.

4.2Tables

An annotated table is a table that is annotated with additional metadata. The core annotations of a table are:

The table MAY in addition have any number of other annotations. Annotations on a table may include:

When originating from [tabular-metadata], these annotations arise from common properties defined on table descriptions within metadata documents.

4.3Columns

A column represents a vertical arrangement of cells within a table. The core annotations of a column are:

Note

Several of these annotations arise from inherited properties that may be defined within metadata on table group, table or individual column descriptions.

Columns MAY in addition have any number of other annotations, such as a description. When originating from [tabular-metadata], these annotations arise from common properties defined on column descriptions within metadata documents.

4.4Rows

A row represents a horizontal arrangement of cells within a table. The core annotations of a row are:

Rows MAY have any number of additional annotations. The annotations on a row provide additional metadata about the information held in the row, such as:

Neither this specification nor [tabular-metadata] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to rows by interpreting notes on the table.

4.5Cells

A cell represents a cell at the intersection of a row and a column within a table. The core annotations of a cell are:

Note

The presence or absence of quotes around a value within a CSV file is a syntactic detail that is not reflected in the tabular data model. In other words, there is no distinction in the model between the second value in a,,z and the second value in a,"",z.

Note

Several of these annotations arise from or are constructed based on inherited properties that may be defined within metadata on table group, table or column descriptions.

Cells MAY have any number of additional annotations. The annotations on a cell provide metadata about the value held in the cell, particularly when this overrides the information provided for the column and row that the cell falls within. Annotations on a cell might be:

Neither this specification nor [tabular-metadata] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to cells by interpreting notes on the table.

Note

Units of measure are not a built-in part of the tabular data model. However, they can be captured through notes or included in the converted output of tabular data through defining datatypes with identifiers that indicate the unit of measure, using virtual columns to create nested data structures, or using common properties to specify Data Cube attributes as defined in [vocab-data-cube].

4.6Datatypes

Columns and cell values within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell.

Datatypes are based on a subset of those defined in [xmlschema11-2]. The annotated tabular data model limits cell values to have datatypes as shown on the diagram:

Built-in Datatype Hierarchy diagram
Fig. 1 Diagram showing the built-in datatypes, based on [xmlschema11-2]; names in parentheses denote aliases to the [xmlschema11-2] terms (see the diagram in SVG or PNG formats)

The core annotations of a datatype are:

If the id of a datatype is that of a built-in datatype, the values of the other core annotations listed above MUST be consistent with the values defined in [xmlschema11-2] or above. For example, if the id is xsd:integer then the base must be xsd:decimal.

Datatypes MAY have any number of additional annotations. The annotations on a datatype provide metadata about the datatype such as title or description. These arise from common properties defined on datatype descriptions within metadata documents, as defined in [tabular-metadata].

Note

The id annotation may reference an XSD, OWL or other datatype definition, which is not used by this specification for validating column values, but may be useful for further processing.

4.6.1Length Constraints

The length, minimum length and maximum length annotations indicate the exact, minimum and maximum lengths for cell values.

The length of a value is determined as defined in [xmlschema11-2], namely as follows:

  • if the value is null, its length is zero.
  • if the value is a string or one of its subtypes, its length is the number of characters (i.e., [UNICODE] code points) in the value.
  • if the value is of a binary type, its length is the number of bytes in the binary value.

If the value is a list, the constraint applies to each element of the list.
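
As an illustration, the following is a minimal, non-normative sketch in Python of how a processor might compute lengths and apply these constraints; it assumes cell values are represented as strings, byte sequences, None (null), or lists of such values.

def value_length(value):
    """Length of a value as used by the length constraints."""
    if value is None:
        return 0                          # a null value has length zero
    if isinstance(value, str):
        return len(value)                 # number of Unicode code points
    if isinstance(value, (bytes, bytearray)):
        return len(value)                 # number of bytes for binary types
    raise TypeError("length constraints do not apply to this datatype")

def check_length(value, minimum=None, maximum=None, exact=None):
    """Apply length constraints; for a list, each element is checked."""
    values = value if isinstance(value, list) else [value]
    for v in values:
        n = value_length(v)
        if exact is not None and n != exact:
            return False
        if minimum is not None and n < minimum:
            return False
        if maximum is not None and n > maximum:
            return False
    return True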

4.6.2Value Constraints

The minimum, maximum, minimum exclusive, and maximum exclusive annotations indicate limits on cell values. These apply to numeric, date/time, and duration types.

Validation of cell values against these datatypes is as defined in [xmlschema11-2]. If the value is a list, the constraint applies to each element of the list.

5.Locating Metadata

As described in section 4. Tabular Data Models, tabular data may have a number of annotations associated with it. Here we describe the different methods that can be used to locate metadata that provides those annotations.

In the methods of locating metadata described here, metadata is provided within a single document. The syntax of such documents is defined in [tabular-metadata]. Metadata is located using a specific order of precedence:

  1. metadata supplied by the user of the implementation that is processing the tabular data, see section 5.1 Overriding Metadata.
  2. metadata in a document linked to using a Link header associated with the tabular data file, see section 5.2 Link Header.
  3. metadata located through default paths which may be overridden by a site-wide location configuration, see section 5.3 Default Locations and Site-wide Location Configuration.
  4. metadata embedded within the tabular data file itself, see section 5.4 Embedded Metadata.

Processors MUST use the first metadata found for processing a tabular data file by using overriding metadata, if provided. Otherwise processors MUST attempt to locate the first metadata document from the Link header or the metadata located through site-wide configuration. If no metadata is supplied or found, processors MUST use embedded metadata. If the metadata does not originate from the embedded metadata, validators MUST verify that the table group description within that metadata is compatible with that in the embedded metadata, as defined in [tabular-metadata].
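
The order of precedence above can be summarised with a minimal, non-normative sketch in Python; the helper callables are assumptions standing in for the mechanisms described in the following subsections.

def locate_metadata(tabular_data_url, user_metadata=None,
                    from_link_header=lambda url: None,
                    from_site_wide_config=lambda url: None,
                    embedded=lambda url: {}):
    # 1. overriding (user-supplied) metadata always takes precedence
    if user_metadata is not None:
        return user_metadata
    # 2. metadata referenced from a Link header on the tabular data file
    linked = from_link_header(tabular_data_url)
    if linked is not None:
        return linked
    # 3. metadata located through default paths or site-wide configuration
    site_wide = from_site_wide_config(tabular_data_url)
    if site_wide is not None:
        return site_wide
    # 4. otherwise, fall back to metadata embedded in the file itself
    return embedded(tabular_data_url)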

Note

When feasible, processors should start from a metadata file and publishers should link to metadata files directly, rather than depend on mechanisms outlined in this section for locating metadata from a tabular data file. Otherwise, if possible, publishers should provide a Link header on the tabular data file as described in section 5.2 Link Header.

Note

If there is no site-wide location configuration, section 5.3 Default Locations and Site-wide Location Configuration specifies default URI patterns or paths to be used to locate metadata.

5.1Overriding Metadata

Processors SHOULD provide users with the facility to provide their own metadata for tabular data files that they process. This might be provided:

For example, a processor might be invoked with:

Example 2: Command-line CSV processing with column types
$ csvlint data.csv --datatypes:string,float,string,string

to enable the testing of the types of values in the columns of a CSV file, or with:

Example 3: Command-line CSV processing with a schema
$ csvlint data.csv --schema:schema.json

to supply a schema that describes the contents of the file, against which it can be validated.

Metadata supplied in this way is called overriding, or user-supplied, metadata. Implementations SHOULD define how any options they define are mapped into the vocabulary defined in [tabular-metadata]. If the user selects existing metadata files, implementations MUST NOT use metadata located through the Link header (as described in section 5.2 Link Header) or site-wide location configuration (as described in section 5.3 Default Locations and Site-wide Location Configuration).

Note

Users should ensure that any metadata from those locations that they wish to use is explicitly incorporated into the overriding metadata that they use to process tabular data. Processors may provide facilities to make this easier by automatically merging metadata files from different locations, but this specification does not define how such merging is carried out.

5.2Link Header

If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata, then when retrieving a tabular data file via HTTP, processors MUST retrieve the metadata file referenced by any Link header with:

so long as this referenced metadata file describes the retrieved tabular data file (i.e., contains a table description whose url matches the request URL).

If there is more than one valid metadata file linked to through multiple Link headers, then implementations MUST use the metadata file referenced by the last Link header.

For example, when the response to requesting a tab-separated file looks like:

Example 4: HTTP response including Link headers
HTTP/1.1 200 OK
Content-Type: text/tab-separated-values
...
Link: <metadata.json>; rel="describedBy"; type="application/csvm+json"

an implementation must use the referenced metadata.json to supply metadata for processing the file.

If the metadata file found at this location does not explicitly include a reference to the requested tabular data file then it MUST be ignored. URLs MUST be normalized as described in section 6.3 URL Normalization.
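
A minimal, non-normative sketch in Python of selecting the referenced metadata file, assuming Link headers of the shape shown in the example above (rel="describedBy" and type="application/csvm+json"); it is a deliberately simplified parser, not a full implementation of the Link header syntax.

import re

def parse_link_header(value):
    # very small Link header parser: returns (target, parameter-string) pairs;
    # assumes no commas occur inside parameter values
    links = []
    for part in value.split(","):
        m = re.match(r'\s*<([^>]*)>\s*;?\s*(.*)', part)
        if m:
            links.append((m.group(1), m.group(2)))
    return links

def metadata_url_from_links(link_header_values):
    # link_header_values: the Link header field values, in the order received
    candidate = None
    for value in link_header_values:
        for target, params in parse_link_header(value):
            p = params.lower()
            if 'rel="describedby"' in p and 'application/csvm+json' in p:
                candidate = target   # a later matching Link header overrides earlier ones
    return candidate

# metadata_url_from_links(
#     ['<metadata.json>; rel="describedBy"; type="application/csvm+json"'])
# returns 'metadata.json'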

Note

The Link header of the metadata file MAY include references to the CSV files it describes, using the describes relationship. For example, in the countries' metadata example, the server might return the following headers:

Link: <http://example.org/countries.csv>; rel="describes"; type="text/csv"
Link: <http://example.org/country_slice.csv>; rel="describes"; type="text/csv"

However, locating the metadata SHOULD NOT depend on this mechanism.

5.3Default Locations and Site-wide Location Configuration

If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata, and no applicable metadata file has been discovered through a Link header, described in section 5.2 Link Header, processors MUST attempt to locate a metadata document through site-wide configuration.

In this case, processors MUST retrieve the file from the well-known URI /.well-known/csvm. (Well-known URIs are defined by [RFC5785].) If no such file is located (i.e. the response results in a client error 4xx status code or a server error 5xx status code), processors MUST proceed as if this file were found with the following content which defines default locations:

{+url}-metadata.json
csv-metadata.json

The response to retrieving /.well-known/csvm MAY be cached, subject to cache control directives. This includes caching an unsuccessful response such as a 404 Not Found.

This file MUST contain a URI template, as defined by [URI-TEMPLATE], on each line. Starting with the first such URI template, processors MUST:

  1. Expand the URI template, with the variable url being set to the URL of the requested tabular data file (with any fragment component of that URL removed).
  2. Resolve the resulting URL against the URL of the requested tabular data file.
  3. Attempt to retrieve a metadata document at that URL.
  4. If no metadata document is found at that location, or if the metadata file found at the location does not explicitly include a reference to the relevant tabular data file, perform these same steps on the next URI template, otherwise use that metadata document.

For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors must attempt to locate a well-known file at http://example.org/.well-known/csvm. If that file contains:

Example 5
{+url}.json
csvm.json
/csvm?file={url}

the processor will first look for http://example.org/south-west/devon.csv.json. If there is no metadata file in that location, it will then look for http://example.org/south-west/csvm.json. Finally, if that also fails, it will look for http://example.org/csvm?file=http://example.org/south-west/devon.csv.

If no file were found at http://example.org/.well-known/csvm, the processor will use the default locations and try to retrieve metadata from http://example.org/south-west/devon.csv-metadata.json and, if unsuccessful, http://example.org/south-west/csv-metadata.json.
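
A minimal, non-normative sketch in Python of generating the candidate metadata URLs for the default templates; only the two variable forms used above, {+url} and {url}, are handled naively, and a full [URI-TEMPLATE] processor should be used in practice.

from urllib.parse import urldefrag, urljoin

DEFAULT_TEMPLATES = ["{+url}-metadata.json", "csv-metadata.json"]

def candidate_metadata_urls(tabular_url, templates=DEFAULT_TEMPLATES):
    base, _fragment = urldefrag(tabular_url)      # drop any fragment component
    for template in templates:
        # naive expansion of the two variable forms used above; a real
        # implementation should use a full [URI-TEMPLATE] processor
        expanded = template.replace("{+url}", base).replace("{url}", base)
        yield urljoin(base, expanded)             # resolve against the file URL

# list(candidate_metadata_urls("http://example.org/south-west/devon.csv")) yields
# ['http://example.org/south-west/devon.csv-metadata.json',
#  'http://example.org/south-west/csv-metadata.json']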

5.4Embedded Metadata

Most syntaxes for tabular data provide a facility for embedding metadata within the tabular data file itself. The definition of a syntax for tabular data SHOULD include a description of how the syntax maps to an annotated data model, and in particular how any embedded metadata is mapped into the vocabulary defined in [tabular-metadata]. Parsing based on the default dialect for CSV, as described in 8. Parsing Tabular Data, will extract column titles from the first row of a CSV file.

Example 6: http://example.org/tree-ops.csv
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

The results of this can be found in section 8.2.1 Simple Example.

For another example, the following tab-delimited file contains embedded metadata where it is assumed that comments may be added using a #, and that the column types may be indicated using a #datatype annotation:

Example 7: Tab-separated file containing embedded metadata
# publisher City of Palo Alto
# updated 12/31/2010
#name GID on_street species trim_cycle  inventory_date
#datatype string  string  string  string  date:M/D/YYYY
  GID On Street Species Trim Cycle  Inventory Date
  1 ADDISON AV  Celtis australis  Large Tree Routine Prune  10/18/2010
  2 EMERSON ST  Liquidambar styraciflua Large Tree Routine Prune  6/2/2010

A processor that recognises this format may be able to extract and make sense of this embedded metadata.

6.Processing Tables

This section describes how particular types of applications should process tabular data and metadata files.

In many cases, an application will start processing from a metadata file. In that case, the initial metadata file is treated as overriding metadata and the application MUST NOT continue to retrieve other available metadata about each of the tabular data files referenced by that initial metadata file other than embedded metadata.

In other cases, applications will start from a tabular data file, such as a CSV file, and locate metadata from that file. This metadata will be used to process the file as if the processor were starting from that metadata file.

For example, if a validator is passed a locally authored metadata file spending.json, which contains:

Example 8: Metadata file referencing multiple tabular data files sharing a schema
{"tableSchema":"government-spending.csv","tables":[{"url":"http://example.org/east-sussex-2015-03.csv",},{"url":"http://example.org/east-sussex-2015-02.csv"},...]}

the validator would validate all the listed tables, using the locally defined schema atgovernment-spending.csv. It would also use the metadata embedded in the referenced CSV files; for example, when processinghttp://example.org/east-sussex-2015-03.csv, it would useembedded metadata within that file to verify that the CSV iscompatible with the metadata.

If a validator is passed atabular data filehttp://example.org/east-sussex-2015-03.csv, the validator would use the metadata located from the CSV file: the first metadata file found through theLink headers found when retrieving that file, or located through a site-wide location configuration.

Note

Starting with a metadata file can remove the need to perform additional requests to locate linked metadata, or metadata retrieved through site-wide location configuration.

6.1Creating Annotated Tables

After locating metadata, metadata is normalized and coerced into a single table group description. When starting with a metadata file, this involves normalizing the provided metadata file and verifying that the embedded metadata for each tabular data file referenced from the metadata is compatible with the metadata. When starting with a tabular data file, this involves locating the first metadata file as described in section 5. Locating Metadata and normalizing into a single descriptor.

If processing starts with a tabular data file, implementations:

  1. Retrieve the tabular data file.
  2. Retrieve the first metadata file (FM) as described in section 5. Locating Metadata:
    1. metadata supplied by the user (see section 5.1 Overriding Metadata).
    2. metadata referenced from a Link Header that may be returned when retrieving the tabular data file (see section 5.2 Link Header).
    3. metadata retrieved through a site-wide location configuration (see section 5.3 Default Locations and Site-wide Location Configuration).
    4. embedded metadata as defined in section 5.4 Embedded Metadata with a single tables entry where the url property is set from that of the tabular data file.
  3. Proceed as if the process starts with FM.

If the process starts with a metadata file:

  1. Retrieve the metadata file yielding the metadata UM (which is treated as overriding metadata, see section 5.1 Overriding Metadata).
  2. Normalize UM using the process defined in Normalization in [tabular-metadata], coercing UM into a table group description, if necessary.
  3. For each table (TM) in UM in order, create one or more annotated tables:
    1. Extract the dialect description (DD) from UM for the table associated with the tabular data file. If there is no such dialect description, extract the first available dialect description from a group of tables in which the tabular data file is described. Otherwise use the default dialect description.
    2. If using the default dialect description, override default values in DD based on HTTP headers found when retrieving the tabular data file:
      • If the media type from the Content-Type header is text/tab-separated-values, set delimiter to TAB in DD.
      • If the Content-Type header includes the header parameter with a value of absent, set header to false in DD.
      • If the Content-Type header includes the charset parameter, set encoding to this value in DD.
    3. Parse the tabular data file, using DD as a guide, to create a basic tabular data model (T) and extract embedded metadata (EM), for example from the header line.

      Note

      This specification provides a non-normative definition for parsing CSV-based files, including the extraction of embedded metadata, in section 8. Parsing Tabular Data. This specification does not define any syntax for embedded metadata beyond this; whatever syntax is used, it's assumed that metadata can be mapped to the vocabulary defined in [tabular-metadata].

    4. If a Content-Language HTTP header was found when retrieving the tabular data file, and the value provides a single language, set the lang inherited property to this value in TM, unless TM already has a lang inherited property.
    5. Verify that TM is compatible with EM using the procedure defined in Table Description Compatibility in [tabular-metadata]; if TM is not compatible with EM validators MUST raise an error, other processors MUST generate a warning and continue processing.
    6. Use the metadata TM to add annotations to the tabular data model T as described in Section 2 Annotating Tables in [tabular-metadata].
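
A minimal, non-normative sketch in Python of step 3.2 (and the Content-Language check in step 3.4), overriding default dialect values from HTTP headers; the dialect is assumed to be a simple dictionary of flags.

def override_default_dialect(dialect, content_type=None, content_language=None):
    # dialect: a dict holding the default dialect flags, e.g. {"delimiter": ","}
    if content_type:
        media_type, _, rest = content_type.partition(";")
        params = dict(p.strip().split("=", 1) for p in rest.split(";") if "=" in p)
        if media_type.strip() == "text/tab-separated-values":
            dialect["delimiter"] = "\t"        # TAB-separated values
        if params.get("header") == "absent":
            dialect["header"] = False          # no header row
        if "charset" in params:
            dialect["encoding"] = params["charset"]
    lang = None
    if content_language and "," not in content_language:
        lang = content_language.strip()        # only a single language counts
    return dialect, lang

# override_default_dialect({"delimiter": ","},
#                          "text/tab-separated-values; header=absent")
# returns ({'delimiter': '\t', 'header': False}, None)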

6.2Metadata Compatibility

When processing a tabular data file using metadata as discovered using section 5. Locating Metadata, processors MUST ensure that the metadata and tabular data file are compatible; this is typically done by extracting embedded metadata from the tabular data file and determining that the provided or discovered metadata is compatible with the embedded metadata using the procedure defined in Table Compatibility in [tabular-metadata].

6.3URL Normalization

Metadata Discovery and Compatibility involve comparing URLs. When comparing URLs, processors MUST use Syntax-Based Normalization as defined in [RFC3986]. Processors MUST perform Scheme-Based Normalization for HTTP (80) and HTTPS (443) and SHOULD perform Scheme-Based Normalization for other well-known schemes.
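
A minimal, non-normative sketch in Python of such normalization for HTTP and HTTPS URLs (lowercasing the scheme and host and removing default ports); user information in the authority and percent-encoding normalization are not handled here.

from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                       # lowercase the scheme
    host = (parts.hostname or "").lower()               # lowercase the host
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = "%s:%d" % (host, parts.port)           # keep non-default ports
    path = parts.path or ("/" if scheme in DEFAULT_PORTS else "")
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

# normalize_url("HTTP://Example.ORG:80/south-west/devon.csv")
# == "http://example.org/south-west/devon.csv"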

6.4Parsing Cells

Unlike many other data formats, tabular data is designed to be read by humans. For that reason, it's common for data to be represented within tabular data in a human-readable way. The datatype, default, lang, null, required, and separator annotations provide the information needed to parse the string value of a cell into its (semantic) value annotation. This is used:

The process of parsing a cell creates a cell with annotations based on the original string value, parsed value and other column annotations and adds the cell to the list of cells in a row and cells in a column:

After parsing, the cell value can be:

The process of parsing the string value into a single value or a list of values is as follows:

  1. unless the datatype base is string, json, xml, html or anyAtomicType, replace all carriage return (#xD), line feed (#xA), and tab (#x9) characters with space characters.
  2. unless the datatype base is string, json, xml, html, anyAtomicType, or normalizedString, strip leading and trailing whitespace from the string value and replace all instances of two or more whitespace characters with a single space character.
  3. if the normalized string is an empty string, apply the remaining steps to the string given by the column default annotation.
  4. if the column separator annotation is not null and the normalized string is an empty string, the cell value is an empty list. If the column required annotation is true, add an error to the list of errors for the cell.
  5. if the column separator annotation is not null, the cell value is a list of values; set the list annotation on the cell to true, and create the cell value by:
    1. if the normalized string is the same as any one of the values of the column null annotation, then the resulting value is null.
    2. split the normalized string at the character specified by the column separator annotation.
    3. unless the datatype base is string or anyAtomicType, strip leading and trailing whitespace from these strings.
    4. applying the remaining steps to each of the strings in turn.
  6. if the string is an empty string, apply the remaining steps to the string given by the column default annotation.
  7. if the string is the same as any one of the values of the column null annotation, then the resulting value is null. If the column separator annotation is null and the column required annotation is true, add an error to the list of errors for the cell.
  8. parse the string using the datatype format if one is specified, as described below, to give a value with an associated datatype. If the datatype base is string, or there is no datatype, the value has an associated language from the column lang annotation. If there are any errors, add them to the list of errors for the cell; in this case the value has a datatype of string; if the datatype base is string, or there is no datatype, the value has an associated language from the column lang annotation.
  9. validate the value based on the length constraints described in section 4.6.1 Length Constraints, the value constraints described in section 4.6.2 Value Constraints and the datatype format annotation if one is specified, as described below. If there are any errors, add them to the list of errors for the cell.

The final value (or values) become the value annotation on the cell.
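
A minimal, non-normative sketch in Python of the whitespace normalization and list-splitting steps above; datatype-specific parsing (step 8) and validation (step 9) are not shown.

EXEMPT = {"string", "json", "xml", "html", "anyAtomicType"}

def normalize_string_value(string_value, base="string"):
    s = string_value
    if base not in EXEMPT:
        for ch in ("\r", "\n", "\t"):
            s = s.replace(ch, " ")              # step 1: control characters to spaces
    if base not in EXEMPT | {"normalizedString"}:
        s = " ".join(s.split())                 # step 2: trim and collapse whitespace
    return s

def split_list_value(normalized, separator, null_values=("",)):
    if normalized == "":
        return []                               # step 4: empty string gives an empty list
    items = []
    for item in normalized.split(separator):    # step 5.2: split at the separator
        item = item.strip()                     # step 5.3 (for non-string bases)
        items.append(None if item in null_values else item)  # step 7: null values
    return items

# split_list_value(normalize_string_value("1 5 7.0", "integer"), " ")
# returns ['1', '5', '7.0'], ready for datatype parsing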

If there is an about URL annotation on the column, it becomes the about URL annotation on the cell, after being transformed into an absolute URL as described in URI Template Properties of [tabular-metadata].

If there is a property URL annotation on the column, it becomes the property URL annotation on the cell, after being transformed into an absolute URL as described in URI Template Properties of [tabular-metadata].

If there is a value URL annotation on the column, it becomes the value URL annotation on the cell, after being transformed into an absolute URL as described in URI Template Properties of [tabular-metadata]. The value URL annotation is null if the cell value is null and the column virtual annotation is false.

6.4.1Parsing examples

This section is non-normative.

When no datatype annotation is available, the value of a cell is the same as its string value. For example, a cell with a string value of "99" would have the (semantic) value "99".

If a datatype base is provided for the cell, that is used to create a (semantic) value for the cell. For example, if the metadata contains:

Example 9
"datatype":"integer"

for the cell with the string value "99" then the value of that cell will be the integer 99. A cell whose string value was not a valid integer (such as "one" or "1.0") would be assigned that string value as its (semantic) value annotation, but also have a validation error listed in its errors annotation.

Sometimes data uses special codes to indicate unknown or null values. For example, a particular column might contain a number that is expected to be between 1 and 10, with the string 99 used in the original tabular data file to indicate a null value. The metadata for such a column would include:

Example 10
"datatype":{"base":"integer","minimum":1,"maximum":10},"null":"99"

In this case, a cell with a string value of "5" would have the (semantic) value of the integer 5; a cell with a string value of "99" would have the value null.

Similarly, a cell may be assigned a default value if the string value for the cell is empty. A configuration such as:

Example 11
"datatype":{"base":"integer","minimum":1,"maximum":10},"default":"5"

In this case, a cell whose string value is "" would be assigned the value of the integer 5. A cell whose string value contains whitespace, such as a single tab character, would also be assigned the value of the integer 5: when the datatype is something other than string or anyAtomicType, leading and trailing whitespace is stripped from string values before the remainder of the processing is carried out.

Cells can contain sequences of values. For example, a cell might have the string value "1 5 7.0". In this case, the separator is a space character. The appropriate configuration would be:

Example 12
"datatype":{"base":"integer","minimum":1,"maximum":10},"default":"5","separator":" "

and this would mean that the cell's value would be an array containing two integers and a string: [1, 5, "7.0"]. The final value of the array is a string because it is not a valid integer; the cell's errors annotation will also contain a validation error.

Also, with this configuration, if the string value of the cell were "" (i.e. it was an empty cell) the value of the cell would be an empty list.

A cell value can be inserted into a URL created using a URI template property such as valueUrl. For example, if a cell with the string value "1 5 7.0" were in a column named values, defined with:

Example 13
"datatype":"decimal","separator":" ","valueUrl":"{?values}"

then after expansion of the URI template, the resulting valueUrl would be ?values=1.0,5.0,7.0. The canonical representations of the decimal values are used within the URL.

6.4.2Formats for numeric types

By default, numeric values must be in the formats defined in [xmlschema11-2]. It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding percent signs to the number.

If the datatype base is a numeric type, the datatype format annotation indicates the expected format for that number. Its value MUST be either a single string or an object with one or more of the properties:

decimalChar
A string whose value is used to represent a decimal point within the number. The default value is ".". If the supplied value is not a string, implementations MUST issue a warning and proceed as if the property had not been specified.
groupChar
A string whose value is used to group digits within the number. The default value is null. If the supplied value is not a string, implementations MUST issue a warning and proceed as if the property had not been specified.
pattern
A number format pattern as defined in [UAX35]. Implementations MUST recognise number format patterns containing the symbols 0, #, the specified decimalChar (or "." if unspecified), the specified groupChar (or "," if unspecified), E, +, % and ‰. Implementations MAY additionally recognise number format patterns containing other special pattern characters defined in [UAX35]. If the supplied value is not a string, or if it contains an invalid number format pattern or uses special pattern characters that the implementation does not recognise, implementations MUST issue a warning and proceed as if the property had not been specified.

If the datatype format annotation is a single string, this is interpreted in the same way as if it were an object with a pattern property whose value is that string.

If the groupChar is specified, but no pattern is supplied, when parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that consist of:

  1. an optional + or - sign,
  2. followed by a decimal digit (0-9),
  3. followed by any number of decimal digits (0-9) and the string specified as the groupChar,
  4. followed by an optional decimalChar followed by one or more decimal digits (0-9),
  5. followed by an optional exponent, consisting of an E followed by an optional + or - sign followed by one or more decimal digits (0-9), or
  6. followed by an optional percent (%) or per-mille (‰) sign.

or that are one of the special values:

  1. NaN,
  2. INF, or
  3. -INF.

Implementations MAY also recognise numeric values that are in any of the standard-decimal, standard-percent or standard-scientific formats listed in the Unicode Common Locale Data Repository.

Implementations MUST add a validation error to the errors annotation for the cell, and set the cell value to a string rather than a number if the string being parsed:

  • is not in the format specified in the pattern, if one is defined
  • otherwise, if the string
    • does not meet the numeric format defined above,
    • contains two consecutive groupChar strings,
  • contains the decimalChar, if the datatype base is integer or one of its sub-types,
  • contains an exponent, if the datatype base is decimal or one of its sub-types, or
  • is one of the special values NaN, INF, or -INF, if the datatype base is decimal or one of its sub-types.

Implementations MUST use the sign, exponent, percent, and per-mille signs when parsing the string value of a cell to provide the value of the cell. For example, the string value "-25%" must be interpreted as -0.25 and the string value "1E6" as 1000000.
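
A minimal, non-normative sketch in Python of recognising numbers of this shape when a groupChar is specified but no pattern is supplied; error handling is reduced to raising an exception.

import re

def parse_number(text, group_char=",", decimal_char="."):
    if text in ("NaN", "INF", "-INF"):
        return float(text.replace("INF", "inf"))
    g, d = re.escape(group_char), re.escape(decimal_char)
    pattern = (r"^[+-]?"                        # optional sign
               r"[0-9](?:[0-9]|" + g + r")*"    # digits, possibly grouped
               r"(?:" + d + r"[0-9]+)?"         # optional decimal part
               r"(?:E[+-]?[0-9]+)?"             # optional exponent
               r"([%\u2030])?$")                # optional percent / per-mille sign
    m = re.match(pattern, text)
    if m is None:
        raise ValueError("not in the expected numeric format")
    number = text[:m.start(1)] if m.group(1) else text
    number = number.replace(group_char, "").replace(decimal_char, ".")
    value = float(number)
    if m.group(1) == "%":
        value /= 100                            # percent scales by 1/100
    elif m.group(1) == "\u2030":
        value /= 1000                           # per-mille scales by 1/1000
    return value

# parse_number("-25%") == -0.25 ; parse_number("1E6") == 1000000.0
# parse_number("1,234.5") == 1234.5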

6.4.3Formats for booleans

Boolean values may be represented in many ways aside from the standard 1 and 0 or true and false.

If the datatype base for a cell is boolean, the datatype format annotation provides the true value followed by the false value, separated by |. For example, if format is Y|N then cells must hold either Y or N, with Y meaning true and N meaning false. If the format does not follow this syntax, implementations MUST issue a warning and proceed as if no format had been provided.

The resulting cell value will be one or more boolean true or false values.
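
A minimal, non-normative sketch in Python of parsing boolean cell values against such a format; an invalid format falls back to the default lexical forms from [xmlschema11-2].

def parse_boolean(string_value, fmt=None):
    true_values, false_values = {"true", "1"}, {"false", "0"}
    if fmt is not None:
        parts = fmt.split("|")
        if len(parts) == 2 and parts[0] and parts[1]:
            true_values, false_values = {parts[0]}, {parts[1]}
        # otherwise: issue a warning and proceed as if no format were given
    if string_value in true_values:
        return True
    if string_value in false_values:
        return False
    raise ValueError("not a valid boolean for this format")

# parse_boolean("Y", "Y|N") is True; parse_boolean("N", "Y|N") is False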

6.4.4Formats for dates and times

By default, dates and times are assumed to be in the format defined in [xmlschema11-2]. However, dates and times are commonly represented in tabular data in other formats.

If the datatype base is a date or time type, the datatype format annotation indicates the expected format for that date or time.

The supported date and time format patterns listed here are expressed in terms of the date field symbols defined in [UAX35]. These formats MUST be recognised by implementations and MUST be interpreted as defined in that specification. Implementations MAY additionally recognise other date format patterns. Implementations MUST issue a warning if the date format pattern is invalid or not recognised and proceed as if no date format pattern had been provided.

Note

For interoperability, authors of metadata documents SHOULD use only the formats listed in this section.

The following date format patterns MUST be recognized by implementations:

  • yyyy-MM-dd e.g., 2015-03-22
  • yyyyMMdd e.g., 20150322
  • dd-MM-yyyy e.g., 22-03-2015
  • d-M-yyyy e.g., 22-3-2015
  • MM-dd-yyyy e.g., 03-22-2015
  • M-d-yyyy e.g., 3-22-2015
  • dd/MM/yyyy e.g., 22/03/2015
  • d/M/yyyy e.g., 22/3/2015
  • MM/dd/yyyy e.g., 03/22/2015
  • M/d/yyyy e.g., 3/22/2015
  • dd.MM.yyyy e.g., 22.03.2015
  • d.M.yyyy e.g., 22.3.2015
  • MM.dd.yyyy e.g., 03.22.2015
  • M.d.yyyy e.g., 3.22.2015

The following time format patterns MUST be recognized by implementations:

  • HH:mm:ss.S with one or more trailing S characters indicating the maximum number of fractional seconds e.g., HH:mm:ss.SSS for 15:02:37.143
  • HH:mm:ss e.g., 15:02:37
  • HHmmss e.g., 150237
  • HH:mm e.g., 15:02
  • HHmm e.g., 1502

The following date/time format patterns MUST be recognized by implementations:

  • yyyy-MM-ddTHH:mm:ss.S with one or more trailing S characters indicating the maximum number of fractional seconds e.g., yyyy-MM-ddTHH:mm:ss.SSS for 2015-03-15T15:02:37.143
  • yyyy-MM-ddTHH:mm:ss e.g., 2015-03-15T15:02:37
  • yyyy-MM-ddTHH:mm e.g., 2015-03-15T15:02
  • any of the date formats above, followed by a single space, followed by any of the time formats above, e.g., M/d/yyyy HH:mm for 3/22/2015 15:02 or dd.MM.yyyy HH:mm:ss for 22.03.2015 15:02:37

Implementations MUST also recognise date, time, and date/time format patterns that end with timezone markers consisting of between one and three x or X characters, possibly after a single space. These MUST be interpreted as follows:

  • X e.g., -08, +0530, or Z (minutes are optional)
  • XX e.g., -0800, +0530, or Z
  • XXX e.g., -08:00, +05:30, or Z
  • x e.g., -08 or +0530 (Z is not permitted)
  • xx e.g., -0800 or +0530 (Z is not permitted)
  • xxx e.g., -08:00 or +05:30 (Z is not permitted)

For example, date format patterns could include yyyy-MM-ddTHH:mm:ssXXX for 2015-03-15T15:02:37Z or 2015-03-15T15:02:37-05:00, or HH:mm x for 15:02 -05.

The cell value will be one or more date/time values extracted using the format.
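
A minimal, non-normative sketch in Python that handles a subset of the patterns above by translating them into strptime directives; fractional seconds (S) and timezone markers (X, x) are not handled.

import re
from datetime import datetime

TOKEN_MAP = {"yyyy": "%Y", "MM": "%m", "M": "%m", "dd": "%d", "d": "%d",
             "HH": "%H", "mm": "%M", "ss": "%S"}
TOKEN_RE = re.compile("|".join(sorted(TOKEN_MAP, key=len, reverse=True)))

def parse_date_time(string_value, pattern):
    # translate the [UAX35] date field symbols into strptime directives
    strptime_format = TOKEN_RE.sub(lambda m: TOKEN_MAP[m.group(0)], pattern)
    return datetime.strptime(string_value, strptime_format)

# parse_date_time("22.03.2015", "dd.MM.yyyy")          -> datetime(2015, 3, 22, 0, 0)
# parse_date_time("3/22/2015 15:02", "M/d/yyyy HH:mm") -> datetime(2015, 3, 22, 15, 2)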

Note

For simplicity, this version of this standard does not support abbreviated or full month or day names, or double digit years. Future versions of this standard may support other date and time formats, or general purpose date/time pattern strings. Authors of schemas SHOULD use appropriate regular expressions, along with the string datatype, for dates and times that use a format other than that specified here.

6.4.5Formats for durations

Durations MUST be formatted and interpreted as defined in [xmlschema11-2], using the [ISO8601] format -?PnYnMnDTnHnMnS. For example, the duration P1Y1D is used for a year and a day; the duration PT2H30M for 2 hours and 30 minutes.

If the datatype base is a duration type, the datatype format annotation provides a regular expression for the string values, with syntax and processing defined by [ECMASCRIPT]. If the supplied value is not a valid regular expression, implementations MUST issue a warning and proceed as if no format had been provided.

Note

Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.

The cell value will be one or more durations extracted using the format.

6.4.6Formats for other types

If the datatype base is not numeric, boolean, a date/time type, or a duration type, the datatype format annotation provides a regular expression for the string values, with syntax and processing defined by [ECMASCRIPT]. If the supplied value is not a valid regular expression, implementations MUST issue a warning and proceed as if no format had been provided.

Note

Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.

Values that are labelled as html, xml, or json SHOULD NOT be validated against those formats.

Note

Metadata creators who wish to check the syntax of HTML, XML, or JSON within tabular data should use the datatype format annotation to specify a regular expression against which such values will be tested.

6.5Presenting Tables

This section is non-normative.

When presenting tables, implementations should:

6.5.1Bidirectional Tables

There are two levels of bidirectionality to consider when displaying tables: the directionality of the table (i.e., whether the columns should be arranged left-to-right or right-to-left) and the directionality of the content of individual cells.

The table direction annotation on the table provides information about the desired display of the columns in the table. If table direction is ltr then the first column should be displayed on the left and the last column on the right. If table direction is rtl then the first column should be displayed on the right and the last column on the left.

If table direction is auto then tables should be displayed with attention to the bidirectionality of the content of the cells in the table. Specifically, the values of the cells in the table should be scanned breadth first: from the first cell in the first column through to the last cell in the first row, down to the last cell in the last column. If the first character in the table with a strong type as defined in [BIDI] indicates an RTL directionality, the table should be displayed with the first column on the right and the last column on the left. Otherwise, the table should be displayed with the first column on the left and the last column on the right. Characters such as whitespace, quotes, commas, and numbers do not have a strong type, and therefore are skipped when identifying the character that determines the directionality of the table.

Implementations should enable user preferences to override the indicated metadata about the directionality of the table.

Once the directionality of the table has been determined, each cell within the table should be considered as a separate paragraph, as defined by the Unicode Bidirectional Algorithm (UBA) in [BIDI]. The directionality for the cell is determined by looking at the text direction annotation for the cell, as follows:

  1. If the text direction is ltr then the base direction for the cell content should be set to left-to-right.
  2. If the text direction is rtl then the base direction for the cell content should be set to right-to-left.
  3. If the text direction is auto then the base direction for the cell content should be set to the direction determined by the first character in the cell with a strong type as defined in [BIDI].
Note

If the textDirection property in metadata has the value "inherit", the text direction annotation for a cell inherits its value from the table direction annotation on the table.

When the titles of a column are displayed, these should be displayed in the direction determined by the first character in the title with a strong type as defined in [BIDI]. Titles for the same column in different languages may be displayed in different directions.

6.5.2Column and row labelling

The labelling of columns and rows helps those who are attempting to understand the content of a table to grasp what a particular cell means. Implementations should present appropriate titles for columns, and ensure that the most important information in a row is kept apparent to the user, to aid their understanding. For example:

  • a table presented on the screen might retain certain columns in view so that readers can easily glance at the identifying information in each row
  • as the user moves focus into a cell, screen readers announce a label for the new column if the user has changed column, or for the new row if the user has changed row

When labelling a column, either on the screen or aurally, implementations should use the first available of:

  1. the column's titles in the preferred language of the user, or with an undefined language if there is no title available in a preferred language; there may be multiple such titles in which case all should be announced
  2. the column's name
  3. the column's number

When labelling a row, either on the screen or aurally, implementations should use the first available of:

  1. the row's titles in the preferred language of the user, or with an undefined language if there is no title available in a preferred language; there may be multiple such titles in which case all should be announced
  2. the values of the cells in the row's primary key
  3. the row's number

6.6Validating Tables

Validators test whether given tabular data files adhere to the structure defined within a schema. Validators MUST raise errors (and halt processing) and issue warnings (and continue processing) as defined in [tabular-metadata]. In addition, validators MUST raise errors but MAY continue validating in the following situations:

6.7Converting Tables

Conversions of tabular data to other formats operate over an annotated table constructed as defined in Annotating Tables in [tabular-metadata]. The mechanics of these conversions to other formats are defined in other specifications such as [csv2json] and [csv2rdf].

Conversion specificationsMUST define a default mapping from an annotated table that lacks any annotations (i.e., that is equivalent to an un-annotated table).

Conversion specifications MUST use the property value of the propertyUrl of a column as the basis for naming machine-readable fields in the target format, such as the name of the equivalent element or attribute in XML, property in JSON or property URI in RDF.

Conversion specifications MAY use any of the annotations found on an annotated table group, table, column, row or cell, including non-core annotations, to adjust the mapping into another format.

Conversion specifications MAY define additional annotations, not defined in this specification, which are specifically used when converting to the target format of the conversion. For example, a conversion to XML might specify a http://example.org/conversion/xml/element-or-attribute property on columns that determines whether a particular column is represented through an element or an attribute in the data.

7.Best Practice CSV

This section is non-normative.

There is no standard for CSV, and there are many variants of CSV used on the web today. This section defines a method for expressing tabular data adhering to the annotated tabular data model in CSV. Authors are encouraged to adhere to the constraints described in this section as implementations should process such CSV files consistently.

Note

This syntax is not compliant with text/csv as defined in [RFC4180] in that it permits line endings other than CRLF. Supporting LF line endings is important for data formats that are used on non-Windows platforms. However, all files that adhere to [RFC4180]'s definition of CSV meet the constraints described in this section.

Developing a standard for CSV is outside the scope of the Working Group. The details here aim to help shape any future standard.

7.1Content Type

The appropriate content type for a CSV file is text/csv. For example, when a CSV file is transmitted via HTTP, the HTTP response should include a Content-Type header with the value text/csv:

Content-Type: text/csv

7.2Encoding

CSV files should be encoded using UTF-8, and should be in Unicode Normal Form C as defined in [UAX15]. If a CSV file is not encoded using UTF-8, the encoding should be specified through the charset parameter in the Content-Type header:

Content-Type: text/csv;charset=ISO-8859-1

7.3Line Endings

The ends of rows in a CSV file should be CRLF (U+000D U+000A) but may be LF (U+000A). Line endings within escaped cells are not normalised.

7.4Lines

Each line of a CSV file should contain the same number of comma-separated values.

Values that contain commas, line endings, or double quotes should be escaped by having the entire value wrapped in double quotes. There should not be whitespace before or after the double quotes. Within these escaped cells, any double quotes should be escaped with two double quotes ("").
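
As a non-normative illustration, Python's csv module produces exactly this kind of escaping with its default settings:

import csv, io

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerow(["GID", "On Street", "Species"])
writer.writerow(["1", 'ADDISON AV, "north end"', "Celtis australis"])
print(buffer.getvalue())
# GID,On Street,Species
# 1,"ADDISON AV, ""north end""",Celtis australis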

7.4.1Headers

The first line of a CSV file should contain a comma-separated list of names of columns. This is known as the header line and provides titles for the columns. There are no constraints on these titles.

If a CSV file does not include a header line, this should be specified using the header parameter of the media type:

Content-Type: text/csv;header=absent

7.5Grammar

This grammar is a generalization of that defined in [RFC4180] and is included for reference only.

The EBNF used here is defined in XML 1.0 [EBNF-NOTATION].

[1] csv      ::= header record+
[2] header   ::= record
[3] record   ::= fields #x0D? #x0A
[4] fields   ::= field ("," fields)*
[5] field    ::= WS* rawfield WS*
[6] rawfield ::= '"' QCHAR* '"' | SCHAR*
[7] QCHAR    ::= [^"] | '""'
[8] SCHAR    ::= [^",#x0A#x0D]
[9] WS       ::= [#x20#x09]

8.Parsing Tabular Data

This section is non-normative.

As described in section 7. Best Practice CSV, there may be many formats which an application might interpret into the tabular data model described in section 4. Tabular Data Models, including using different separators or fixed format tables, multiple tables within a single file, or ones that have metadata lines before a table header.

Note

Standardizing the parsing of CSV is outside the chartered scope of the Working Group. This non-normative section is intended to help the creators of parsers handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardization of the format.

This section describes an algorithm for parsing formats that do not adhere to the constraints described in section 7. Best Practice CSV, as well as those that do, and extracting embedded metadata. The parsing algorithm uses the following flags. These may be set by metadata properties found while Locating Metadata, including through user input (see Overriding Metadata), or through the inclusion of a dialect description within a metadata file:

comment prefix
A string that, when it appears at the beginning of a row, indicates that the row is a comment that should be associated as an rdfs:comment annotation to the table. This is set by the commentPrefix property of a dialect description. The default is null, which means no rows are treated as comments. A value other than null may mean that the source numbers of rows are different from their numbers.
delimiter
The separator between cells, set by the delimiter property of a dialect description. The default is "," (comma).
encoding
The character encoding for the file, one of the encodings listed in [encoding], set by the encoding property of a dialect description. The default is utf-8.
escape character
The string that is used to escape the quote character within escaped cells, or null, set by the doubleQuote property of a dialect description. The default is " (such that "" is used to escape " within an escaped cell).
header row count
The number of header rows (following the skipped rows) in the file, set by the header or headerRowCount property of a dialect description. The default is 1. A value other than 0 will mean that the source numbers of rows will be different from their numbers.
line terminators
The strings that can be used at the end of a row, set by the lineTerminators property of a dialect description. The default is [CRLF, LF].
quote character
The string that is used around escaped cells, or null, set by the quoteChar property of a dialect description. The default is ".
skip blank rows
Indicates whether to ignore wholly empty rows (i.e. rows in which all the cells are empty), set by the skipBlankRows property of a dialect description. The default is false. A value other than false may mean that the source numbers of rows are different from their numbers.
skip columns
The number of columns to skip at the beginning of each row, set by the skipColumns property of a dialect description. The default is 0. A value other than 0 will mean that the source numbers of columns will be different from their numbers.
skip rows
The number of rows to skip at the beginning of the file, before a header row or tabular data, set by the skipRows property of a dialect description. The default is 0. A value greater than 0 will mean that the source numbers of rows will be different from their numbers.
trim
Indicates whether to trim whitespace around cells; may be true, false, start, or end, set by the skipInitialSpace or trim property of a dialect description. The default is true.
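The defaults listed above can be summarised as a plain object. The sketch below is purely illustrative; where a flag is controlled by a dialect description property, that property name is used.

const defaultDialect = {
  commentPrefix: null,              // no rows are treated as comments
  delimiter: ",",
  encoding: "utf-8",
  doubleQuote: true,                // escape character is '"', so '""' escapes '"'
  headerRowCount: 1,
  lineTerminators: ["\r\n", "\n"],  // CRLF and LF
  quoteChar: '"',
  skipBlankRows: false,
  skipColumns: 0,
  skipRows: 0,
  trim: true
};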

The algorithm for using these flags to parse a document containing tabular data to create a basic annotated tabular data model and to extract embedded metadata is as follows:

  1. Create a new table T with the annotations:
    • columns set to an empty list
    • rows set to an empty list
  2. Create a metadata document structure M that looks like:
    {"@context":"http://www.w3.org/ns/csvw","rdfs:comment":[],"tableSchema":{"columns":[]}}
  3. If the URL of the tabular data file being parsed is known, set the url property on M to that URL.
  4. Set source row number to 1.
  5. Read the file using the encoding, as specified in [encoding], using the replacement error mode. If the encoding is not a Unicode encoding, use a normalizing transcoder to normalize into Unicode Normal Form C as defined in [UAX15].

    Note

    The replacement error mode ensures that any non-Unicode characters within the CSV file are replaced by U+FFFD, ensuring that strings within the tabular data model such as column titles and cell string values only contain valid Unicode characters.

  6. Repeat the following the number of times indicated by skip rows:
    1. Read a row to provide the row content.
    2. If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
    3. Otherwise, if the row content is not an empty string, add the row content to the M.rdfs:comment array.
    4. Add 1 to the source row number.
  7. Repeat the following the number of times indicated by header row count:
    1. Read a row to provide the row content.
    2. If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
    3. Otherwise, parse the row to provide a list of cell values, and:
      1. Remove the first skip columns number of values from the list of cell values.
      2. For each of the remaining values at index i in the list of cell values:
        1. If the value at index i in the list of cell values is an empty string or consists only of whitespace, do nothing.
        2. Otherwise, if there is no column description object at index i in M.tableSchema.columns, create a new one with a title property whose value is an array containing a single value that is the value at index i in the list of cell values.
        3. Otherwise, add the value at index i in the list of cell values to the array at M.tableSchema.columns[i].title.
    4. Add 1 to the source row number.
  8. If header row count is zero, create an empty column description object in M.tableSchema.columns for each column in the current row after skip columns.
  9. Set row number to 1.
  10. While it is possible to read another row, do the following:
    1. Set the source column number to 1.
    2. Read a row to provide the row content.
    3. If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
    4. Otherwise, parse the row to provide a list of cell values, and:
      1. If all of the values in the list of cell values are empty strings, and skip blank rows is true, add 1 to the source row number and move on to process the next row.
      2. Otherwise, create a new row R, with:
      3. Append R to the rows of table T.
      4. Remove the first skip columns number of values from the list of cell values and add that number to the source column number.
      5. For each of the remaining values at index i in the list of cell values (where i starts at 1):
        1. Identify the column C at index i within the columns of table T. If there is no such column:
          1. Create a new column C with:
          2. Append C to the columns of table T (at index i).
        2. Create a new cell D, with:
        3. Append cell D to the cells of column C.
        4. Append cell D to the cells of row R (at index i).
        5. Add 1 to the source column number.
    5. Add 1 to the source row number.
  11. If M.rdfs:comment is an empty array, remove the rdfs:comment property from M.
  12. Return the table T and the embedded metadata M.
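The following compressed, non-normative sketch shows the shape of this algorithm, assuming a flags object holding the parsing flags described above and the readRow and parseRow helpers sketched after the row-reading and row-parsing steps below; all names are illustrative.

function parseTabularData(input, flags) {
  const T = { columns: [], rows: [] };                            // step 1
  const M = {                                                     // step 2
    "@context": "http://www.w3.org/ns/csvw",
    "rdfs:comment": [],
    "tableSchema": { "columns": [] }
  };
  let sourceRowNumber = 1;                                        // step 4

  for (let n = 0; n < flags.skipRows; n++) {                      // step 6
    const content = readRow(input, flags);
    if (flags.commentPrefix !== null && content.startsWith(flags.commentPrefix)) {
      M["rdfs:comment"].push(content.slice(flags.commentPrefix.length));
    } else if (content !== "") {
      M["rdfs:comment"].push(content);
    }
    sourceRowNumber += 1;
  }

  for (let n = 0; n < flags.headerRowCount; n++) {                // step 7
    // (comment-prefixed rows in the header area are omitted here for brevity)
    const values = parseRow(readRow(input, flags), flags).slice(flags.skipColumns);
    values.forEach((v, i) => {
      if (v.trim() === "") return;                                // ignore blank titles
      const columns = M["tableSchema"]["columns"];
      if (!columns[i]) columns[i] = { "title": [] };
      columns[i]["title"].push(v);
    });
    sourceRowNumber += 1;
  }

  // Steps 9–12: each remaining row is parsed into cell values and becomes a
  // row of T, with cells appended to the corresponding columns, tracking the
  // row number and the source row/column numbers as described above.
  // ...
  return { table: T, embeddedMetadata: M };
}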

To read a row to provide row content, perform the following steps:

  1. Set the row content to an empty string.
  2. Read initial characters and process as follows:
    1. If the string starts with the escape character followed by the quote character, append both strings to the row content, and move on to process the string following the quote character.
    2. Otherwise, if the string starts with the escape character and the escape character is not the same as the quote character, append the escape character and the single character following it to the row content and move on to process the string following that character.
    3. Otherwise, if the string starts with the quote character, append the quoted value obtained by reading a quoted value to the row content and move on to process the string following the quoted value.
    4. Otherwise, if the string starts with one of the line terminators, return the row content.
    5. Otherwise, append the first character to the row content and move on to process the string following that character.
  3. If there are no more characters to read, return the row content.

To read a quoted value to provide a quoted value, perform the following steps:

  1. Set the quoted value to an empty string.
  2. Read the initial quote character and add a quote character to the quoted value.
  3. Read initial characters and process as follows:
    1. If the string starts with the escape character followed by the quote character, append both strings to the quoted value, and move on to process the string following the quote character.
    2. Otherwise, if the string starts with the escape character and the escape character is not the same as the quote character, append the escape character and the character following it to the quoted value and move on to process the string following that character.
    3. Otherwise, if the string starts with the quote character, return the quoted value.
    4. Otherwise, append the first character to the quoted value and move on to process the string following that character.
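A non-normative sketch of the two algorithms above follows. It assumes a stream object offering peek(n) (look at the next n characters without consuming them, returning "" at end of input) and read(n) (consume and return the next n characters); all names are illustrative.

function readQuotedValue(stream, flags) {
  const { quoteChar, escapeChar } = flags;
  let quoted = stream.read(quoteChar.length);          // the initial quote character
  while (stream.peek(1) !== "") {
    if (escapeChar !== null &&
        stream.peek(escapeChar.length + quoteChar.length) === escapeChar + quoteChar) {
      quoted += stream.read(escapeChar.length + quoteChar.length);
    } else if (escapeChar !== null && escapeChar !== quoteChar &&
               stream.peek(escapeChar.length) === escapeChar) {
      quoted += stream.read(escapeChar.length + 1);
    } else if (stream.peek(quoteChar.length) === quoteChar) {
      return quoted + stream.read(quoteChar.length);   // closing quote ends the value
    } else {
      quoted += stream.read(1);
    }
  }
  return quoted;
}

function readRow(stream, flags) {
  const { quoteChar, escapeChar, lineTerminators } = flags;
  let content = "";
  while (stream.peek(1) !== "") {
    if (escapeChar !== null && quoteChar !== null &&
        stream.peek(escapeChar.length + quoteChar.length) === escapeChar + quoteChar) {
      content += stream.read(escapeChar.length + quoteChar.length);
    } else if (escapeChar !== null && escapeChar !== quoteChar &&
               stream.peek(escapeChar.length) === escapeChar) {
      content += stream.read(escapeChar.length + 1);
    } else if (quoteChar !== null && stream.peek(quoteChar.length) === quoteChar) {
      // line terminators inside an escaped (quoted) cell do not end the row
      content += readQuotedValue(stream, flags);
    } else {
      const terminator = lineTerminators.find(t => stream.peek(t.length) === t);
      if (terminator !== undefined) {
        stream.read(terminator.length);
        return content;
      }
      content += stream.read(1);
    }
  }
  return content;
}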

To parse a row to provide a list of cell values, perform the following steps:

  1. Set the list of cell values to an empty list and the current cell value to an empty string.
  2. Set the quoted flag to false.
  3. Read initial characters and process as follows:
    1. If the string starts with the escape character followed by the quote character, append the quote character to the current cell value, and move on to process the string following the quote character.
    2. Otherwise, if the string starts with the escape character and the escape character is not the same as the quote character, append the character following the escape character to the current cell value and move on to process the string following that character.
    3. Otherwise, if the string starts with the quote character then:
      1. If quoted is false, set the quoted flag to true, and move on to process the remaining string. If the current cell value is not an empty string, raise an error.
      2. Otherwise, set quoted to false, and move on to process the remaining string. If the remaining string does not start with the delimiter, raise an error.
    4. Otherwise, if the string starts with the delimiter, then:
      1. If quoted is true, append the delimiter string to the current cell value and move on to process the remaining string.
      2. Otherwise, conditionally trim the current cell value, add the resulting trimmed cell value to the list of cell values and move on to process the following string.
    5. Otherwise, append the first character to the current cell value and move on to process the remaining string.
  4. If there are no more characters to read, conditionally trim the current cell value, add the resulting trimmed cell value to the list of cell values and return the list of cell values.
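A non-normative sketch of these steps follows, using the conditionallyTrim helper sketched after the trimming steps below. It simplifies error handling (the requirement that a delimiter or end of row follow a closing quote is not checked), and all names are illustrative.

function parseRow(content, flags) {
  const { delimiter, quoteChar, escapeChar, trim } = flags;
  const values = [];
  let current = "";
  let quoted = false;
  let i = 0;
  while (i < content.length) {
    if (escapeChar !== null && quoteChar !== null &&
        content.startsWith(escapeChar + quoteChar, i)) {
      current += quoteChar;                              // escaped quote character
      i += escapeChar.length + quoteChar.length;
    } else if (escapeChar !== null && escapeChar !== quoteChar &&
               content.startsWith(escapeChar, i)) {
      current += content.charAt(i + escapeChar.length);
      i += escapeChar.length + 1;
    } else if (quoteChar !== null && content.startsWith(quoteChar, i)) {
      if (!quoted && current !== "") {
        throw new Error("quote character inside an unquoted cell");
      }
      quoted = !quoted;                                  // opening or closing quote
      i += quoteChar.length;
    } else if (content.startsWith(delimiter, i)) {
      if (quoted) {
        current += delimiter;                            // delimiter inside quotes
      } else {
        values.push(conditionallyTrim(current, trim));
        current = "";
      }
      i += delimiter.length;
    } else {
      current += content.charAt(i);
      i += 1;
    }
  }
  values.push(conditionallyTrim(current, trim));
  return values;
}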

To conditionally trim a cell value to provide a trimmed cell value, perform the following steps:

  1. Set the trimmed cell value to the provided cell value.
  2. If trim is true or start then remove any leading whitespace from the start of the trimmed cell value and move on to the next step.
  3. If trim is true or end then remove any trailing whitespace from the end of the trimmed cell value and move on to the next step.
  4. Return the trimmed cell value.
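A minimal, non-normative helper matching these trimming steps (the function name is illustrative):

function conditionallyTrim(value, trim) {
  let result = value;
  if (trim === true || trim === "start") result = result.replace(/^\s+/, "");
  if (trim === true || trim === "end") result = result.replace(/\s+$/, "");
  return result;
}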
Note

This parsing algorithm does not account for the possibility of there being more than one area of tabular data within a single CSV file.

8.1Bidirectionality in CSV Files

This section is non-normative.

Bidirectional content does not alter the definition of rows or the assignment of cells to columns. Whether or not a CSV file contains right-to-left characters, the first column's content is the first cell of each row, which is the text prior to the first occurrence of a comma within that row.

For example,Egyptian Referendum results are available as a CSV file athttps://egelections-2011.appspot.com/Referendum2012/results/csv/EG.csv. Over the wire and in non-Unicode-aware text editors, the CSV looks like:

‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌,‌ن‌س‌ب‌ة‌ ‌م‌و‌ا‌ف‌ق‌,‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌,‌ع‌د‌د‌ ‌ا‌ل‌ن‌ا‌خ‌ب‌ي‌ن‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ص‌ح‌ي‌ح‌ة‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ب‌ا‌ط‌ل‌ة‌,‌ن‌س‌ب‌ة‌ ‌ا‌ل‌م‌ش‌ا‌ر‌ك‌ة‌,‌م‌و‌ا‌ف‌ق‌,‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌‌ا‌ل‌ق‌ل‌ي‌و‌ب‌ي‌ة‌,60.0,40.0,"2,639,808","853,125","15,224",32.9,"512,055","341,070"‌ا‌ل‌ج‌ي‌ز‌ة‌,66.7,33.3,"4,383,701","1,493,092","24,105",34.6,"995,417","497,675"‌ا‌ل‌ق‌ا‌ه‌ر‌ة‌,43.2,56.8,"6,580,478","2,254,698","36,342",34.8,"974,371","1,280,327"‌ق‌ن‌ا‌,84.5,15.5,"1,629,713","364,509","6,743",22.8,"307,839","56,670"...

Within this CSV file, the first column appears as the content of each line before the first comma and is namedالمحافظة (appearing at the start of each row as‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌ in the example, which is displaying the relevant characters from left to right in the order they appear "on the wire").

The CSV translates to a table model that looks like:

Column / Row | column 1 | column 2 | column 3 | column 4 | column 5 | column 6 | column 7 | column 8 | column 9
column names | المحافظة | نسبة موافق | نسبة غير موافق | عدد الناخبين | الأصوات الصحيحة | الأصوات الباطلة | نسبة المشاركة | موافق | غير موافق
row 1 | القليوبية | 60.0 | 40.0 | 2,639,808 | 853,125 | 15,224 | 32.9 | 512,055 | 341,070
row 2 | الجيزة | 66.7 | 33.3 | 4,383,701 | 1,493,092 | 24,105 | 34.6 | 995,417 | 497,675
row 3 | القاهرة | 43.2 | 56.8 | 6,580,478 | 2,254,698 | 36,342 | 34.8 | 974,371 | 1,280,327
row 4 | قنا | 84.5 | 15.5 | 1,629,713 | 364,509 | 6,743 | 22.8 | 307,839 | 56,670

The fragment identifier#col=3 identifies the third of the columns, namedنسبة غير موافق (appearing as‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌ in the example).

Section 6.5.1 Bidirectional Tables defines how this table model should be displayed by compliant applications, and how metadata can affect the display. The default is for the display to be determined by the content of the table. For example, if this CSV were turned into an HTML table for display in a web page, it should be displayed with the first column on the right and the last on the left, as follows:

غير موافق | موافق | نسبة المشاركة | الأصوات الباطلة | الأصوات الصحيحة | عدد الناخبين | نسبة غير موافق | نسبة موافق | المحافظة
341,070 | 512,055 | 32.9 | 15,224 | 853,125 | 2,639,808 | 40.0 | 60.0 | القليوبية
497,675 | 995,417 | 34.6 | 24,105 | 1,493,092 | 4,383,701 | 33.3 | 66.7 | الجيزة
1,280,327 | 974,371 | 34.8 | 36,342 | 2,254,698 | 6,580,478 | 56.8 | 43.2 | القاهرة
56,670 | 307,839 | 22.8 | 6,743 | 364,509 | 1,629,713 | 15.5 | 84.5 | قنا

The fragment identifier#col=3 still identifies the third of the columns, namedنسبة غير موافق, which appears in the HTML display as the third column from the right and is what those who read right-to-left would think of as the third column.

Note that this display matches that shown on the original website.

8.2Examples

8.2.1Simple Example

A simple CSV file that complies with the constraints described in section 7. Best Practice CSV, at http://example.org/tree-ops.csv, might look like:

Example 14: http://example.org/tree-ops.csv
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

Parsing this file results in an annotated tabular data model of a single table T with five columns and two rows. The columns have the annotations shown in the following table:

id | table | number | source number | cells | titles
C1 | T | 1 | 1 | C1.1, C2.1 | GID
C2 | T | 2 | 2 | C1.2, C2.2 | On Street
C3 | T | 3 | 3 | C1.3, C2.3 | Species
C4 | T | 4 | 4 | C1.4, C2.4 | Trim Cycle
C5 | T | 5 | 5 | C1.5, C2.5 | Inventory Date

The extractedembedded metadata, as defined in [tabular-metadata], would look like:

Example 15: tree-ops.csv Embedded Metadata
{"@type":"Table","url":"http://example.org/tree-ops.csv","tableSchema":{"columns":[{"titles":["GID"]},{"titles":["On Street"]},{"titles":["Species"]},{"titles":["Trim Cycle"]},{"titles":["Inventory Date"]}]}}

The rows have the annotations shown in the following table:

id | table | number | source number | cells
R1 | T | 1 | 2 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | 3 | C2.1, C2.2, C2.3, C2.4, C2.5
Note

The source number of each row is offset by one from the number of each row because, in the source CSV file, the header line is the first line. It is possible to reconstruct a [RFC7111] compliant reference to the first record in the original CSV file (http://example.org/tree-ops.csv#row=2) using the value of the row's source number. This enables implementations to retain provenance between the table model and the original file.
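For example, an application might reconstruct such a fragment identifier from a row's source number as in the following illustrative sketch (the function name is not defined by this specification):

function rowFragment(csvUrl, sourceNumber) {
  // [RFC7111] row fragments count records in the original file
  return csvUrl + "#row=" + sourceNumber;
}

rowFragment("http://example.org/tree-ops.csv", 2);
// "http://example.org/tree-ops.csv#row=2" – the first record, i.e. row 1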

The cells have the annotations shown in the following table (note that thevalues of all the cells in the table are strings, denoted by the double quotes in the table below):

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010"
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | "6/2/2010"
8.2.1.1Using Overriding Metadata

The tools that the consumer of this data uses may provide a mechanism for overriding the metadata that has been provided within the file itself. For example, they might enable the consumer to add machine-readable names to the columns, or to mark the fifth column as holding a date in the formatM/D/YYYY. These facilities are implementation defined; the code for invoking a Javascript-based parser might look like:

Example 16: Javascript implementation configuration
data.parse({"column-names":["GID","on_street","species","trim_cycle","inventory_date"],"datatypes":["string","string","string","string","date"],"formats":[null,null,null,null,"M/D/YYYY"]});

This is equivalent to a metadata file expressed in the syntax defined in [tabular-metadata], looking like:

Example 17: Equivalent metadata syntax
{"@type":"Table","url":"http://example.org/tree-ops.csv","tableSchema":{"columns":[{"name":"GID","datatype":"string"},{"name":"on_street","datatype":"string"},{"name":"species","datatype":"string"},{"name":"trim_cycle","datatype":"string"},{"name":"inventory_date","datatype":{"base":"date","format":"M/d/yyyy"}}]}}

This would be merged with theembedded metadata found in the CSV file, providing the titles for the columns to create:

Example 18: Merged metadata
{"@type":"Table","url":"http://example.org/tree-ops.csv","tableSchema":{"columns":[{"name":"GID","titles":"GID","datatype":"string"},{"name":"on_street","titles":"On Street","datatype":"string"},{"name":"species","titles":"Species","datatype":"string"},{"name":"trim_cycle","titles":"Trim Cycle","datatype":"string"},{"name":"inventory_date","titles":"Inventory Date","datatype":{"base":"date","format":"M/d/yyyy"}}]}}

The processor can then create an annotated tabular data model that includes name annotations on the columns and datatype annotations on the cells, and that creates cells whose values are of appropriate types (in the case of this Javascript implementation, the cells in the last column would be Date objects, for example).
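As an illustration only (this is not how any particular implementation is required to work), a parser given the "M/D/YYYY" format might turn the string value of such a cell into a Date object like this; the function name is hypothetical and handles only this one format:

function parseMDYYYY(stringValue) {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(stringValue);
  if (!match) return null;                       // leave non-matching values alone
  const [, month, day, year] = match.map(Number);
  return new Date(Date.UTC(year, month - 1, day));
}

parseMDYYYY("10/18/2010").toISOString().slice(0, 10); // "2010-10-18"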

Assuming this kind of implementation-defined parsing, the columns would then have the annotations shown in the following table:

id | table | number | source number | cells | name | titles | datatype
C1 | T | 1 | 1 | C1.1, C2.1 | GID | GID | string
C2 | T | 2 | 2 | C1.2, C2.2 | on_street | On Street | string
C3 | T | 3 | 3 | C1.3, C2.3 | species | Species | string
C4 | T | 4 | 4 | C1.4, C2.4 | trim_cycle | Trim Cycle | string
C5 | T | 5 | 5 | C1.5, C2.5 | inventory_date | Inventory Date | { "base": "date", "format": "M/d/yyyy" }

The cells have the annotations shown in the following table. Because of the overrides provided by the consumer to guide the parsing, and the way the parser works, the cells in theInventory Date column (cellsC1.5 andC2.5) havevalues that are parsed dates rather than unparsed strings.

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02
8.2.1.2Using a Metadata File

A similar set of annotations could be provided through a metadata file, located as discussed insection5.Locating Metadata and defined in [tabular-metadata]. For example, this might look like:

Example 19: http://example.org/tree-ops.csv-metadata.json
{"@context":["http://www.w3.org/ns/csvw",{"@language":"en"}],"url":"tree-ops.csv","dc:title":"Tree Operations","dcat:keyword":["tree","street","maintenance"],"dc:publisher":{"schema:name":"Example Municipality","schema:url":{"@id":"http://example.org"}},"dc:license":{"@id":"http://opendefinition.org/licenses/cc-by/"},"dc:modified":{"@value":"2010-12-31","@type":"xsd:date"},"tableSchema":{"columns":[{"name":"GID","titles":["GID","Generic Identifier"],"dc:description":"An identifier for the operation on a tree.","datatype":"string","required":true},{"name":"on_street","titles":"On Street","dc:description":"The street that the tree is on.","datatype":"string"},{"name":"species","titles":"Species","dc:description":"The species of the tree.","datatype":"string"},{"name":"trim_cycle","titles":"Trim Cycle","dc:description":"The operation performed on the tree.","datatype":"string"},{"name":"inventory_date","titles":"Inventory Date","dc:description":"The date of the operation that was performed.","datatype":{"base":"date","format":"M/d/yyyy"}}],"primaryKey":"GID","aboutUrl":"#gid-{GID}"}}

Theannotated tabular data model generated from this would be more sophisticated again. Thetable itself would have the following annotations:

dc:title
{"@value": "Tree Operations", "@language": "en"}
dcat:keyword
[{"@value": "tree", "@language", "en"}, {"@value": "street", "@language": "en"}, {"@value": "maintenance", "@language": "en"}]
dc:publisher
[{ "schema:name": "Example Municipality", "schema:url": {"@id": "http://example.org"} }]
dc:license
{"@id": "http://opendefinition.org/licenses/cc-by/"}
dc:modified
{"@value": "2010-12-31", "@type": "date"}

The columns would have the annotations shown in the following table:

id | table | number | source number | cells | name | titles | datatype | dc:description
C1 | T | 1 | 1 | C1.1, C2.1 | GID | GID, Generic Identifier | string | An identifier for the operation on a tree.
C2 | T | 2 | 2 | C1.2, C2.2 | on_street | On Street | string | The street that the tree is on.
C3 | T | 3 | 3 | C1.3, C2.3 | species | Species | string | The species of the tree.
C4 | T | 4 | 4 | C1.4, C2.4 | trim_cycle | Trim Cycle | string | The operation performed on the tree.
C5 | T | 5 | 5 | C1.5, C2.5 | inventory_date | Inventory Date | { "base": "date", "format": "M/d/yyyy" } | The date of the operation that was performed.

The rows have an additionalprimary key annotation, as shown in the following table:

id | table | number | source number | cells | primary key
R1 | T | 1 | 2 | C1.1, C1.2, C1.3, C1.4, C1.5 | C1.1
R2 | T | 2 | 3 | C2.1, C2.2, C2.3, C2.4, C2.5 | C2.1

Thanks to the provided metadata, the cells again have the annotations shown in the following table. The metadata file has provided the information to supplement the model with additional annotations; it also means that the cells in the Inventory Date column (C1.5 and C2.5) have a value that is a parsed date rather than an unparsed string.

id | table | column | row | string value | value | about URL
C1.1 | T | C1 | R1 | "1" | "1" | http://example.org/tree-ops.csv#gid-1
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" | http://example.org/tree-ops.csv#gid-1
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" | http://example.org/tree-ops.csv#gid-1
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | http://example.org/tree-ops.csv#gid-1
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18 | http://example.org/tree-ops.csv#gid-1
C2.1 | T | C1 | R2 | "2" | "2" | http://example.org/tree-ops.csv#gid-2
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST" | http://example.org/tree-ops.csv#gid-2
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" | http://example.org/tree-ops.csv#gid-2
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | http://example.org/tree-ops.csv#gid-2
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02 | http://example.org/tree-ops.csv#gid-2

8.2.2Empty and Quoted Cells

The following slightly amended CSV file contains quoted and missing cell values:

Example 20: CSV file containing quoted and missing cell values
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,"Celtis australis","Large Tree Routine Prune",10/18/2010
2,,"Liquidambar styraciflua","Large Tree Routine Prune",

Parsing this file similarly results in an annotated tabular data model of a single table T with five columns and two rows. The columns and rows have exactly the same annotations as previously, but there are two null cell values, for C2.2 and C2.5. Note that the quoting of values within the CSV makes no difference to either the string value or the value of the cell.

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010"
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "" | null
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "" | null
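A minimal, non-normative illustration of the behaviour shown for C2.2 and C2.5 above, assuming the default column settings (no datatype or default value supplied); the function name is hypothetical:

function cellValue(stringValue) {
  // With the default settings, an empty string value yields a null cell value.
  return stringValue === "" ? null : stringValue;
}

cellValue("EMERSON ST"); // "EMERSON ST"
cellValue("");           // null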

8.2.3Tabular Data Embedding Annotations

The following example illustrates some of the complexities that can be involved in parsing tabular data, how the flags described above can be used, and how new tabular data formats could be defined that embed additional annotations into the tabular data model.

In this example, the publishers of the data are using an internal convention to supply additional metadata about the tabular data embedded within the file itself. They are also using a tab as a separator rather than a comma.

Example 21: Tab-separated file containing embedded metadata
#	publisher	City of Palo Alto
#	updated	12/31/2010
#name	GID	on_street	species	trim_cycle	inventory_date
#datatype	string	string	string	string	date:M/D/YYYY
	GID	On Street	Species	Trim Cycle	Inventory Date
	1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010
	2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010
8.2.3.1Naive Parsing

Naive parsing of the above data will assume a comma separator and thus results in a singletableT with a single column and six rows. The column has the annotations shown in the following table:

id | table | number | source number | cells | titles
C1 | T | 1 | 1 | C1.1, C2.1, C3.1, C4.1, C5.1, C6.1 | # publisher City of Palo Alto

The rows have the annotations shown in the following table:

id | table | number | source number | cells
R1 | T | 1 | 2 | C1.1
R2 | T | 2 | 3 | C2.1
R3 | T | 3 | 4 | C3.1
R4 | T | 4 | 5 | C4.1
R5 | T | 5 | 6 | C5.1
R6 | T | 6 | 7 | C6.1

The cells have the annotations shown in the following table (note that thevalues of all the cells in the table are strings, denoted by the double quotes in the table below):

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "#	updated	12/31/2010" | "#	updated	12/31/2010"
C2.1 | T | C1 | R2 | "#name	GID	on_street	species	trim_cycle	inventory_date" | "#name	GID	on_street	species	trim_cycle	inventory_date"
C3.1 | T | C1 | R3 | "#datatype	string	string	string	string	date:M/D/YYYY" | "#datatype	string	string	string	string	date:M/D/YYYY"
C4.1 | T | C1 | R4 | "GID	On Street	Species	Trim Cycle	Inventory Date" | "GID	On Street	Species	Trim Cycle	Inventory Date"
C5.1 | T | C1 | R5 | "1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010" | "1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010"
C6.1 | T | C1 | R6 | "2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010" | "2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010"
8.2.3.2Parsing with Flags

The consumer of the data may use the flags described above to create a more useful set of data from this file. Specifically, they could set:

  • the delimiter to the tab character,
  • skip rows to 4,
  • skip columns to 1, and
  • the comment prefix to #.

Setting these is done in an implementation-defined way. It could be done, for example, by sniffing the contents of the file itself, through command-line options, or by embedding a dialect description into a metadata file associated with the tabular data, which would look like:

Example 22: Dialect description
{"delimiter":"\t","skipRows":4,"skipColumns":1,"commentPrefix":"#"}

With these flags in operation, parsing this file results in anannotated tabular data model of a singletableT with five columns and two rows which is largely the same as that created from the original simple example described insection8.2.1Simple Example. There are three differences.

First, because the four skipped rows began with thecomment prefix, the table itself now has fourrdfs:comment annotations, with the values:

  1. publisher	City of Palo Alto
  2. updated	12/31/2010
  3. name	GID	on_street	species	trim_cycle	inventory_date
  4. datatype	string	string	string	string	date:M/D/YYYY

Second, because the first column has been skipped, thesource number of each of the columns is offset by one from thenumber of each column:

id | table | number | source number | cells | titles
C1 | T | 1 | 2 | C1.1, C2.1 | GID
C2 | T | 2 | 3 | C1.2, C2.2 | On Street
C3 | T | 3 | 4 | C1.3, C2.3 | Species
C4 | T | 4 | 5 | C1.4, C2.4 | Trim Cycle
C5 | T | 5 | 6 | C1.5, C2.5 | Inventory Date

Finally, because four additional rows have been skipped, thesource number of each of the rows is offset by five from therow number (the four skipped rows plus the single header row):

id | table | number | source number | cells
R1 | T | 1 | 6 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | 7 | C2.1, C2.2, C2.3, C2.4, C2.5
8.2.3.3Recognizing Tabular Data Formats

The conventions used in this data (invented for the purpose of this example) are in fact intended to create anannotated tabular data model which includes named annotations on the table itself, on the columns, and on the cells. The creator of these conventions could create a specification for this particular tabular data syntax and register a media type for it. The specification would include statements like:

  • A tab delimiter is always used.
  • The first column is always ignored.
  • When the first column of a row has the value "#", the second column is the name of an annotation on the table and the values of the remaining columns are concatenated to create the value of that annotation.
  • When the first column of a row has the value #name, the remaining cells in the row provide a name annotation for each column in the table.
  • When the first column of a row has the value #datatype, the remaining cells in the row provide datatype/format annotations for the cells within the relevant column, and these are interpreted to create the value for each cell in that column.
  • The first row where the first column is empty is a row of headers; these provide title annotations on the columns in the table.
  • The remaining rows make up the data of the table.

Parsers that recognized the format could then build a more sophisticated annotated tabular data model using only the embedded information in thetabular data file. They would extractembedded metadata looking like:

Example 23: Embedded metadata in the format of the annotated tabular model
{"@context":"http://www.w3.org/ns/csvw","url":"tree-ops.csv","dc:publisher":"City of Palo Alto","dc:updated":"12/31/2010","tableSchema":{"columns":[{"name":"GID","titles":"GID","datatype":"string",},{"name":"on_street","titles":"On Street","datatype":"string"},{"name":"species","titles":"Species","datatype":"string"},{"name":"trim_cycle","titles":"Trim Cycle","datatype":"string"},{"name":"inventory_date","titles":"Inventory Date","datatype":{"base":"date","format":"M/d/yyyy"}}]}}

As before, the result would be a singletableT with five columns and two rows. The table itself would have two annotations:

dc:publisher
{"@value": "City of Palo Alto"}
dc:updated
{"@value": "12/31/2010"}

The columns have the annotations shown in the following table:

id | table | number | source number | cells | name | titles
C1 | T | 1 | 2 | C1.1, C2.1 | GID | GID
C2 | T | 2 | 3 | C1.2, C2.2 | on_street | On Street
C3 | T | 3 | 4 | C1.3, C2.3 | species | Species
C4 | T | 4 | 5 | C1.4, C2.4 | trim_cycle | Trim Cycle
C5 | T | 5 | 6 | C1.5, C2.5 | inventory_date | Inventory Date

The rows have the annotations shown in the following table, exactly as in previous examples:

id | table | number | source number | cells
R1 | T | 1 | 6 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | 7 | C2.1, C2.2, C2.3, C2.4, C2.5

The cells have the annotations shown in the following table. Because of the way this particular tabular data format has been specified, these include additional annotations; the cells in the Inventory Date column (C1.5 and C2.5) also have a value that is a parsed date rather than an unparsed string.

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02

8.2.4Parsing Multiple Header Lines

The following example shows a CSV file with multiple header lines:

Example 24: CSV file with multiple header lines
Who,What,,Where,
Organization,Sector,Subsector,Department,Municipality
#org,#sector,#subsector,#adm1,#adm2
UNICEF,Education,Teacher training,Chocó,Quidbó
UNICEF,Education,Teacher training,Chocó,Bojayá

Here, the first line contains some grouping titles, which are not particularly helpful. The lines following it contain useful titles for the columns. Thus the appropriate configuration for a dialect description is:

Example 25: Dialect description for multiple header lines
{"skipRows":1,"headerRowCount":2}

With this configuration, the table model contains five columns, each of which has two titles, summarized in the following table:

id | table | number | source number | cells | titles
C1 | T | 1 | 1 | C1.1, C2.1 | Organization, #org
C2 | T | 2 | 2 | C1.2, C2.2 | Sector, #sector
C3 | T | 3 | 3 | C1.3, C2.3 | Subsector, #subsector
C4 | T | 4 | 4 | C1.4, C2.4 | Department, #adm1
C5 | T | 5 | 5 | C1.5, C2.5 | Municipality, #adm2

As metadata, this would look like:

Example 26: Extracted metadata
{"tableSchema":{"columns":[{"titles":["Organization","#org"]},{"titles":["Sector","#sector"]},{"titles":["Subsector","#subsector"]},{"titles":["Department","#adm1"]},{"titles":["Municipality","#adm2"]},]}}

A separate metadata file could contain just the second of each of these titles, for example:

Example 27: Metadata file
{"tableSchema":{"columns":[{"name":"org","titles":#org" },{"name":"sector","titles":#sector" },{"name":"subsector","titles":#subsector" },{"name":"adm1","titles":#adm1" },{"name":"adm2","titles":#adm2" },]}}

This enables people from multiple jurisdictions to use the same tabular data structures without having to use exactly the same titles within their documents.

A.IANA Considerations

/.well-known/csvm
URI suffix:
csvm
Change controller:
W3C
Specification document(s):
This document,section5.3Default Locations and Site-wide Location Configuration

B.Existing Standards

This section is non-normative.

This appendix outlines various ways in which CSV is defined.

B.1RFC 4180

[RFC4180] defines CSV with the following ABNF grammar:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE = %x22
LF = %x0A
CRLF = CR LF
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

Of particular note here are:

B.2Excel

Excel is a common tool for both creating and reading CSV documents, and therefore the CSV that it produces is a de facto standard.

Note

The following describes the behavior of Microsoft Excel for Mac 2011 with an English locale. Further testing is needed to see the behavior of Excel in other situations.

B.2.1Saved CSV

Excel generates CSV files encoded using Windows-1252 withLF line endings. Characters that cannot be represented within Windows-1252 are replaced by underscores. Only those cells that need escaping (e.g. because they contain commas or double quotes) are escaped, and double quotes are escaped with two double quotes.

Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.

B.2.2Opened CSV

When opening CSV files, Excel interprets CSV files saved in UTF-8 as being encoded as Windows-1252 (whether or not aBOM is present). It correctly deals with double quoted cells, except that it converts line breaks within cells into spaces. It understandsCRLF as a line break. It detects dates (formatted asYYYY-MM-DD) and formats them in the default date formatting for files.

B.2.3Imported CSV

Excel provides more control when importing CSV files into Excel. However, it does not properly understand UTF-8 (with or withoutBOM). It does however properly understand UTF-16 and can read non-ASCII characters from a UTF-16-encoded file.

A particular quirk in the importing of CSV is that if a cell contains a line break, the final double quote that escapes the cell will be included within it.

B.2.4Copied Tabular Data

When tabular data is copied from Excel, it is copied in a tab-delimited format, withLF line breaks.

B.3Google Spreadsheets

B.3.1Downloading CSV

Downloaded CSV files are encoded in UTF-8, without aBOM, and withLF line endings. Dates and numbers are formatted as they appear within the spreadsheet.

B.3.2Importing CSV

CSV files can be imported as UTF-8 (with or withoutBOM).CRLF line endings are correctly recognized. Dates are reformatted to the default date format on load.

B.4CSV Files in a Tabular Data Package

Tabular Data Packages place the following restrictions on CSV files:

As a starting point, CSV files included in a Tabular Data Package must conform to the RFC for CSV (4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:

  • File names MUST end with .csv.

  • Files MUST be encoded as UTF-8.

  • Files MUST have a single header row. This row MUST be the first row in the file.

    • Terminology: each column in the CSV file is termed a field and its name is the string in that column in the header row.

    • The name MUST be unique amongst fields, MUST contain at least one character, and MUST conform to the character restrictions defined for the name property.

  • Rows in the file MUST NOT contain more fields than are in the header row (though they may contain less).

  • Each file MUST have an entry in the tables array in the datapackage.json file.

  • The resource metadata MUST include a tableSchema attribute whose value MUST be a valid schema description.

  • All fields in the CSV files MUST be described in the schema description.

CSV files generated by different applications often vary in their syntax, e.g. use of quoting characters, delimiters, etc. To encourage conformance, CSV files in a Tabular Data Package SHOULD:

  • Use "," as field delimiters.
  • Use CRLF (U+000D U+000A) or LF (U+000A) as line terminators.

If a CSV file does not follow these rules then its specific CSV dialectMUST be documented. The resource hash for the resource in thedatapackage.json descriptorMUST:

Applications processing the CSV file SHOULD use the dialect of the CSV file to guide parsing.

Note

To replicate the findings above, test files which include non-ASCII characters, double quotes, and line breaks within cells are:

C.Acknowledgements

This section is non-normative.

At the time of publication, the following individuals had participated in the Working Group, in the order of their first name: Adam Retter, Alf Eaton, Anastasia Dimou, Andy Seaborne, Axel Polleres, Christopher Gutteridge, Dan Brickley, Davide Ceolin, Eric Stephan, Erik Mannens, Gregg Kellogg, Ivan Herman, Jeni Tennison, Jeremy Tandy, Jürgen Umbrich, Rufus Pollock, Stasinos Konstantopoulos, William Ingram, and Yakov Shafranovich.

D.Changes from previous drafts

D.1Changes since the candidate recommendation of 16 July 2015

D.2Changes since the working draft of 16 April 2015

D.3Changes since the working draft of 08 January 2015

The document has undergone substantial changes since the last working draft. Below are some of the changes made:

E.References

E.1Normative references

[BCP47]
A. Phillips; M. Davis.Tags for Identifying Languages. September 2009. IETF Best Current Practice. URL:https://tools.ietf.org/html/bcp47
[BIDI]
Mark Davis; Aharon Lanin; Andrew Glass.Unicode Bidirectional Algorithm. 5 June 2014. Unicode Standard Annex #9. URL:http://www.unicode.org/reports/tr9/
[ECMASCRIPT]
ECMAScript Language Specification. URL:https://tc39.github.io/ecma262/
[ISO8601]
Representation of dates and times. International Organization for Standardization. 2004. ISO 8601:2004. URL:http://www.iso.org/iso/catalogue_detail?csnumber=40874
[JSON-LD]
Manu Sporny; Gregg Kellogg; Markus Lanthaler.JSON-LD 1.0. 16 January 2014. W3C Recommendation. URL:http://www.w3.org/TR/json-ld/
[RFC2119]
S. Bradner.Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL:https://tools.ietf.org/html/rfc2119
[RFC3968]
G. Camarillo.The Internet Assigned Number Authority (IANA) Header Field Parameter Registry for the Session Initiation Protocol (SIP). December 2004. Best Current Practice. URL:https://tools.ietf.org/html/rfc3968
[RFC4180]
Y. Shafranovich.Common Format and MIME Type for Comma-Separated Values (CSV) Files. October 2005. Informational. URL:https://tools.ietf.org/html/rfc4180
[RFC5785]
M. Nottingham; E. Hammer-Lahav.Defining Well-Known Uniform Resource Identifiers (URIs). April 2010. Proposed Standard. URL:https://tools.ietf.org/html/rfc5785
[UAX35]
Mark Davis; CLDR committee members.Unicode Locale Data Markup Language (LDML). 15 March 2013. Unicode Standard Annex #35. URL:http://www.unicode.org/reports/tr35/tr35-31/tr35.html
[UNICODE]
The Unicode Standard. URL:http://www.unicode.org/versions/latest/
[URI-TEMPLATE]
J. Gregorio; R. Fielding; M. Hadley; M. Nottingham; D. Orchard.URI Template. March 2012. Proposed Standard. URL:https://tools.ietf.org/html/rfc6570
[tabular-metadata]
Jeni Tennison; Gregg Kellogg.Metadata Vocabulary for Tabular Data. W3C Recommendation. URL:http://www.w3.org/TR/2015/REC-tabular-metadata-20151217/
[xmlschema11-2]
David Peterson; Sandy Gao; Ashok Malhotra; Michael Sperberg-McQueen; Henry Thompson; Paul V. Biron et al.W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. 5 April 2012. W3C Recommendation. URL:http://www.w3.org/TR/xmlschema11-2/

E.2Informative references

[EBNF-NOTATION]
Tim Bray; Jean Paoli; C. Michael Sperberg-McQueen; Eve Maler; François Yergau.EBNF Notation. W3C Recommendation. URL:http://www.w3.org/TR/xml/#sec-notation
[RFC7111]
M. Hausenblas; E. Wilde; J. Tennison.URI Fragment Identifiers for the text/csv Media Type. January 2014. Informational. URL:https://tools.ietf.org/html/rfc7111
[UAX15]
Mark Davis; Ken Whistler.Unicode Normalization Forms. 31 August 2012. Unicode Standard Annex #15. URL:http://www.unicode.org/reports/tr15
[annotation-model]
Robert Sanderson; Paolo Ciccarese; Benjamin Young.Web Annotation Data Model. 15 October 2015. W3C Working Draft. URL:http://www.w3.org/TR/annotation-model/
[csv2json]
Jeremy Tandy; Ivan Herman.Generating JSON from Tabular Data on the Web. W3C Recommendation. URL:http://www.w3.org/TR/2015/REC-csv2json-20151217/
[csv2rdf]
Jeremy Tandy; Ivan Herman; Gregg Kellogg.Generating RDF from Tabular Data on the Web. W3C Recommendation. URL:http://www.w3.org/TR/2015/REC-csv2rdf-20151217/
[encoding]
Anne van Kesteren; Joshua Bell; Addison Phillips.Encoding. 20 October 2015. W3C Candidate Recommendation. URL:http://www.w3.org/TR/encoding/
[vocab-data-cube]
Richard Cyganiak; Dave Reynolds.The RDF Data Cube Vocabulary. 16 January 2014. W3C Recommendation. URL:http://www.w3.org/TR/vocab-data-cube/
