RealNest - A Collection of Nested Data from Real-World Datasets

cwida/RealNest

This repository contains the details of the RealNest dataset, a collection of nested data derived from real-world datasets. The dataset is designed to help computer science researchers benchmark and evaluate data systems and data formats supporting nested data types.

RealNest is provided as a script that downloads and generates the data, but for convenience and to facilitate standardized comparisons, we host (outside of this repository) on the CWI website (https://event.cwi.nl/da/RealNest) two static datasets with data in .jsonl.gz format, in sizes of 64 * 1024 and 10 * 64 * 1024 rows, respectively. These sample datasets were downloaded and generated by our script in mid-May 2024.

Furthermore, the sample-data directory inside this repository contains a small sample of the datasets mentioned above (the first 1024 rows and 100 MiB of each table) as a preview.

Because we provide the script that downloads the original datasets and processes them into a common format, one can create the dataset from newer versions of the underlying data and also enlarge them with respect to the static datasets, since even the larger of the two statically downloadable datasets contains only a small part of each of the original data sources. Please note that the availability of the original datasets is outside our control, and over time, some of the original datasets may become unavailable. The download script will attempt to download the data from the sources, skipping the ones that are not available.

Please refer to the README in the scripts directory for more details.

All materials in this GitHub repository, except the files under the sample-data folder, are released under the CC-NC-SA license (https://creativecommons.org/licenses/by-nc-sa/4.0/); hence, this repository is open-source, requires attribution to this page (which includes the Attribution section below) and does not allow commercial exploitation.

Note that the sample datasets inside this repository and the two static datasets hosted at CWI (linked above) remain under the same licenses and terms of use as the original datasets they are generated from. If you are the owner of an original dataset, and object to the inclusion of your data in the RealNest static datasets hosted at CWI or to the samples hosted in this repository, please contact Peter Boncz (boncz@cwi.nl), and we will take action.

Please note that below we attempt to properly attribute the individual datasets as required by their various open-source licenses and terms of usage.

Dataset Structure

The dataset contains a directory for each table with the following files:

  • schema.json: The schema of the table. The schema is a JSON object with a single key, columns, containing a list of columns. Each column is a JSON object with 2 or 3 keys:
    • name - The name of the column as a string.
    • type - The type of the column as a string.
    • children - Optional, only exists for nested types (list, struct, map). Describes the child types of the nested type as a list of column objects. The list type always has a single child column with the name child. The map type always has two child columns with the names key and value.
  • data.jsonl or data.jsonl.gz: The data of the table in JSON Lines format (optionally Gzip compressed).
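
The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the RealNest scripts: the helper names describe and read_rows are hypothetical, the example schema is invented for illustration, and the type names are compared case-insensitively because the exact spelling used in schema.json is an assumption here.

```python
import gzip
import json
from pathlib import Path


def describe(column):
    """Render one schema column object as a compact type string."""
    kind = column["type"]
    children = column.get("children")
    if not children:
        # Non-nested column: only "name" and "type" are present.
        return kind
    lowered = kind.lower()  # assumption: casing of type names is not guaranteed
    if lowered == "list":
        # A list has exactly one child column, named "child".
        return f"list<{describe(children[0])}>"
    if lowered == "map":
        # A map has exactly two child columns, named "key" and "value".
        by_name = {child["name"]: child for child in children}
        return f"map<{describe(by_name['key'])}, {describe(by_name['value'])}>"
    # Anything else with children is treated as a struct with named fields.
    fields = ", ".join(f"{c['name']}: {describe(c)}" for c in children)
    return f"{kind}<{fields}>"


def read_rows(table_dir):
    """Yield one parsed JSON value per line of data.jsonl or data.jsonl.gz."""
    table_dir = Path(table_dir)
    plain = table_dir / "data.jsonl"
    if plain.exists():
        handle = plain.open("rt", encoding="utf-8")
    else:
        handle = gzip.open(table_dir / "data.jsonl.gz", "rt", encoding="utf-8")
    with handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Hypothetical schema, invented for illustration only.
EXAMPLE_SCHEMA = {
    "columns": [
        {"name": "id", "type": "BIGINT"},
        {
            "name": "tags",
            "type": "list",
            "children": [{"name": "child", "type": "VARCHAR"}],
        },
    ]
}

if __name__ == "__main__":
    for column in EXAMPLE_SCHEMA["columns"]:
        print(column["name"], describe(column))
```

To use it on a real table, load schema.json with json.load, call describe on each entry of its columns list, and iterate over read_rows with the path of the table directory.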

The schema might contain a JSON type, which may happen for empty JSON objects in the data ({}) or when DuckDB's schema inference detects incompatible types. The columns of this type can be ignored since they are not typical for structured data, or they can be handled as VARCHAR columns, where the value is the JSON string.
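
Both options for JSON-typed columns can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, only top-level columns are handled, and the type name is matched case-insensitively as an assumption about how it is spelled in schema.json.

```python
import json


def json_column_names(schema):
    """Return the names of top-level columns whose type is JSON."""
    return {
        column["name"]
        for column in schema["columns"]
        if column["type"].upper() == "JSON"
    }


def drop_json_columns(row, schema):
    """Option 1: ignore JSON-typed columns entirely."""
    skipped = json_column_names(schema)
    return {name: value for name, value in row.items() if name not in skipped}


def stringify_json_columns(row, schema):
    """Option 2: keep JSON-typed columns as strings holding the JSON text."""
    as_text = json_column_names(schema)
    return {
        name: json.dumps(value) if name in as_text and value is not None else value
        for name, value in row.items()
    }


# Hypothetical schema and row, invented for illustration only.
example_schema = {
    "columns": [
        {"name": "id", "type": "BIGINT"},
        {"name": "extra", "type": "JSON"},
    ]
}
example_row = {"id": 1, "extra": {}}

if __name__ == "__main__":
    print(drop_json_columns(example_row, example_schema))
    print(stringify_json_columns(example_row, example_schema))
```

JSON-typed columns nested inside a list, struct, or map would need a recursive walk over children, which this sketch leaves out.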

Attribution

The data has been downloaded from various public sources and converted to a common format. We note that the real-world datasets from which RealNest is derived are released under varying open-source licenses and terms of usage.

The sources of the original datasets are:

  1. Amazon Berkeley Objects (LICENSE)
    • J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu, X. Zhang, T. F. Yago Vicente, T. Dideriksen, H. Arora, M. Guillaumin, and J. Malik, "Abo: Dataset and benchmarks for real-world 3d object understanding," CVPR, 2022.
  2. AWS Public Blockchain Data (LICENSE)
  3. Data Lake as Code (ATTRIBUTIONS)
  4. CORD-19 (LICENSE)
    • L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. M. Kinney, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, C. Wilhelm, B. Xie, D. A. Raymond, D. S. Weld, O. Etzioni, and S. Kohlmeier, "Cord-19: The covid-19 open research dataset," ArXiv, 2020.
  5. Daylight Map Distribution of OpenStreetMap (Open Database License (ODbL))
  6. GitHub Archive
  7. CERN Open Data
    • CMS collaboration (2017). SingleMu primary dataset in AOD format from Run of 2012 (/SingleMu/Run2012B-22Jan2013-v1/AOD). CERN Open Data Portal. DOI: 10.7483/OPENDATA.CMS.IYVQ.1J0W
  8. Overture Maps Foundation Open Map Data
    • Overture data is licensed under the Community Database License Agreement Permissive v2 (CDLA) unless derived from a source that requires publishing under a different license, such as data derived from OpenStreetMap, that constitutes a 'Derivative Database' (as defined under ODbL v1.0), which will be licensed under ODbL v1.0.
  9. Twitter Stream Archive
