US20170193128A1 - Method, apparatus, and computer-readable medium for encoding repetition and definition level values for semi-structured data - Google Patents

Method, apparatus, and computer-readable medium for encoding repetition and definition level values for semi-structured data
Info

Publication number
US20170193128A1
Authority
US
United States
Prior art keywords
entry
column
leaf
level
repetition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/208,032
Inventor
Sattam Alsubaiee
Vinayak Borkar
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mandiant Inc
Original Assignee
X15 Software Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by X15 Software Inc
Priority to US15/208,032
Assigned to X15 Software, Inc. (assignment of assignors interest; see document for details). Assignors: ALSUBAIEE, SATTAM; BORKAR, VINAYAK
Publication of US20170193128A1
Assigned to FIREEYE, INC. (merger; see document for details). Assignor: X15 Software, Inc.
Legal status: Abandoned (current)

Abstract

An apparatus, computer-readable medium, and computer-implemented method for encoding repetition and definition level values for a semi-structured data record, including identifying a leaf of an entry in a data record, the entry specifying one or more data fields in the data record and the leaf corresponding to a last data field in the entry, storing a value of the leaf in a column which corresponds to a nesting path of the leaf within the data record, the column being in a current row group, determining a repetition level of the entry for the column, determining a definition level of the entry for the column based at least in part on a nesting level of the leaf in the data record, and storing the repetition level and definition level of the entry for the column in a table of repetition and definition levels for the current row group.
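The flow in the abstract can be illustrated with a small sketch. This is a toy model, not the patented implementation: it assumes an entry is a `(path, value)` pair, that the repetition level is 0 for the first entry of a record and otherwise the number of leading path fields shared with the previous entry, and that the definition level is simply the depth of the leaf. The `RowGroup` class and its attribute names are hypothetical.

```python
from collections import defaultdict


def common_prefix_len(a, b):
    """Number of leading fields two nesting paths share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class RowGroup:
    """Hypothetical row group: one value list and one level list per column."""

    def __init__(self):
        self.columns = defaultdict(list)  # nesting path -> leaf values
        self.levels = defaultdict(list)   # nesting path -> (repetition, definition)

    def add_record(self, entries):
        prev_path = None
        for path, value in entries:
            path = tuple(path)
            # Store the leaf value in the column named by its nesting path.
            self.columns[path].append(value)
            # ASSUMPTION: first entry gets 0, later entries get the length of
            # the prefix shared with the previous entry's path.
            rep = 0 if prev_path is None else common_prefix_len(prev_path, path)
            # ASSUMPTION: definition level equals the leaf's nesting depth.
            definition = len(path)
            self.levels[path].append((rep, definition))
            prev_path = path


rg = RowGroup()
rg.add_record([
    (("user", "name"), "ann"),
    (("user", "tags"), "x"),
    (("user", "tags"), "y"),
])
```

Here the two `user.tags` entries land in the same column, and the second one carries a higher repetition level because it shares its full path with the entry before it.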

Claims (15)

What is claimed is:
1. A method executed by one or more computing devices for encoding repetition and definition level values for a semi-structured data record while storing the semi-structured data in columnar format, the method comprising:
identifying, by at least one of the one or more computing devices, a leaf of an entry in a data record, wherein the entry specifies one or more data fields in the data record and wherein the leaf corresponds to a last data field in the entry;
storing, by at least one of the one or more computing devices, a value of the leaf in a column which corresponds to a nesting path of the leaf within the data record, wherein the column is in a current row group;
determining, by at least one of the one or more computing devices, a repetition level of the entry for the column, wherein the repetition level of the entry for the column is based at least in part on a determination of whether a previous entry occurs prior to the entry in the data record and whether the previous entry includes a previous leaf which shares at least a portion of the nesting path of the leaf;
determining, by at least one of the one or more computing devices, a definition level of the entry for the column based at least in part on a nesting level of the leaf in the data record; and
storing, by at least one of the one or more computing devices, the repetition level and definition level of the entry for the column in a table of repetition and definition levels for the current row group.
2. The method of claim 1, further comprising:
determining, by at least one of the one or more computing devices, one or more repetition levels of the entry for one or more other columns in the current row group;
determining, by at least one of the one or more computing devices, one or more definition levels of the entry for the one or more other columns in the current row group; and
storing, by at least one of the one or more computing devices, the one or more repetition levels and the one or more definition levels of the entry for the one or more other columns in the current row group in the table of repetition and definition levels for current row group.
3. The method of claim 2, wherein determining one or more repetition levels of the entry for the one or more other columns in the current row group comprises, for each other column in the one or more other columns:
determining whether the entry is a first entry in the data record; and
setting a repetition level of the entry for the other column to zero based at least in part on a determination that the entry is a first entry in the data record.
4. The method of claim 2, wherein determining one or more repetition levels of the entry for the one or more other columns in the current row group comprises, for each other column in the one or more other columns:
determining whether the entry is a first entry in the data record; and
responsive to determining that the entry is not a first entry in the data record:
determining a longest common prefix between one or more data fields in the other column and one or more data fields in the nesting path of the leaf;
determining a first data field which is common to both the entry and the longest common prefix; and
setting a repetition level of the entry for the other column to be a level of nesting of the first data field within the longest common prefix.
5. The method of claim 2, wherein determining one or more definition levels of the entry for the one or more other columns in the current row group comprises, for each other column in the one or more other columns:
determining a longest common prefix between one or more data fields in the other column and one or more data fields in the nesting path of the leaf;
determining a last data field of the longest common prefix; and
setting a definition level of the entry for the other column to be a level of nesting of the last data field within the longest common prefix.
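The longest-common-prefix step for the other columns can be sketched as follows. This is an illustrative reading only: it assumes nesting paths are tuples of field names and that the "level of nesting" of the last shared field is its 1-based position, so a column sharing nothing with the leaf's path gets 0. The function names are hypothetical.

```python
def longest_common_prefix(a, b):
    """Leading fields shared by two nesting paths, as a tuple."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def definition_level_for_other_column(other_column, leaf_path):
    """Definition level of the current entry for a column it does not fill.

    ASSUMPTION: nesting levels are 1-based, so the level of the last field
    of the shared prefix equals the prefix length (0 if nothing is shared).
    """
    return len(longest_common_prefix(other_column, leaf_path))


level_sibling = definition_level_for_other_column(("a", "b", "c"), ("a", "b", "d"))
level_unrelated = definition_level_for_other_column(("x",), ("a", "b"))
```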
6. The method of claim 1, wherein storing a value of the leaf in a column which corresponds to a nesting path of the leaf within the data record comprises:
determining whether a column which corresponds to a nesting path of the leaf within the data record exists in the current row group;
creating a new column in the current row group which corresponds to a nesting path of the leaf within the data record based at least in part on a determination that the column does not exist; and
storing the value of the leaf in the new column which corresponds to a nesting path of the leaf within the data record.
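A minimal sketch of the check-then-create step, assuming the row group is modeled as a plain dictionary keyed by nesting path (a hypothetical representation chosen for illustration):

```python
def store_leaf(row_group, path, value):
    """Store a leaf value, creating its column on first use.

    Returns True when a new column had to be created, which is the case
    where earlier entries would need levels backfilled for that column.
    """
    key = tuple(path)
    created = key not in row_group
    if created:
        row_group[key] = []  # new column for this nesting path
    row_group[key].append(value)
    return created


row_group = {}
first = store_leaf(row_group, ["user", "name"], "ann")
second = store_leaf(row_group, ["user", "name"], "bob")
```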
7. The method of claim 6, further comprising:
determining, by at least one of the one or more computing devices, one or more repetition levels of one or more other entries for the new column;
determining, by at least one of the one or more computing devices, one or more definition levels of the one or more other entries for the new column; and
storing, by at least one of the one or more computing devices, the one or more repetition levels and the one or more definition levels of the one or more other entries for the new column in the table of repetition and definition levels for current row group.
8. The method of claim 7, wherein determining one or more repetition levels of one or more other entries for the new column comprises:
comparing one or more data fields in the new column with one or more data fields in each of one or more prior columns to identify a prior column which has a longest common prefix with the new column; and
for each other entry in the one or more other entries:
determining whether a repetition level corresponding to the other entry for the identified prior column is greater than a maximum repetition level; and
setting a repetition level corresponding to the other entry for the new column to be equal to the repetition level corresponding to the other entry for the identified prior column based at least in part on a determination that the repetition level corresponding to the other entry for the identified prior column is not greater than a maximum repetition level.
9. The method of claim 8, wherein the maximum repetition level comprises a highest possible repetition level for a longest shared prefix between fields of the new column and fields of the identified prior column.
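The backfill of repetition levels for a newly created column might look like the sketch below. Two points are assumptions rather than anything the claims state: the maximum repetition level is taken to be the length of the shared prefix, and entries whose prior repetition level exceeds that maximum are marked `None`, since the claims do not say what happens in that case. All names are hypothetical.

```python
def lcp_len(a, b):
    """Number of leading fields two nesting paths share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def backfill_repetition(new_column, prior_rep_levels):
    """Copy repetition levels of earlier entries into a new column.

    prior_rep_levels maps each prior column path to its list of
    repetition levels, one per earlier entry.
    """
    # Identify the prior column with the longest common prefix.
    prior = max(prior_rep_levels, key=lambda col: lcp_len(new_column, col))
    # ASSUMPTION: the highest possible repetition level for the shared
    # prefix is the prefix length.
    max_rep = lcp_len(new_column, prior)
    # Copy levels that do not exceed the maximum; None is a placeholder
    # for the case the claims leave unspecified.
    copied = [rep if rep <= max_rep else None for rep in prior_rep_levels[prior]]
    return prior, copied


prior_levels = {("a", "b", "c"): [0, 2, 3], ("x",): [0, 0, 0]}
chosen, new_levels = backfill_repetition(("a", "b", "d"), prior_levels)
```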
10. The method of claim 7, wherein determining one or more definition levels of one or more other entries for the new column comprises:
comparing one or more data fields in the new column with one or more data fields in each of one or more prior columns to identify a prior column which has a longest common prefix with the new column; and
for each other entry in the one or more other entries:
determining whether a repetition level corresponding to the other entry for the identified prior column is greater than a maximum repetition level;
determining whether a definition level corresponding to the other entry for the identified prior column is greater than a maximum definition level; and either
setting a definition level corresponding to the other entry for the new column to be equal to the definition level corresponding to the other entry for the identified prior column based at least in part on a determination that the definition level corresponding to the other entry for the identified prior column is not greater than a maximum definition level and a determination that the repetition level corresponding to the other entry for the identified prior column is not greater than a maximum repetition level; or
setting a definition level corresponding to the other entry for the new column to be equal to the maximum definition level based at least in part on a determination that the definition level corresponding to the other entry for the identified prior column is greater than a maximum definition level and a determination that the repetition level corresponding to the other entry for the identified prior column is not greater than a maximum repetition level.
11. The method of claim 8, wherein the maximum repetition level comprises a highest possible repetition level for a longest shared prefix between fields of the new column and fields of the identified prior column and wherein the maximum definition level comprises a highest possible definition level for a longest shared prefix between fields of the new column and fields of the identified prior column.
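The two definition-level branches amount to clamping the prior column's definition level at the maximum, whenever the repetition level is within bounds. A hedged sketch, with the maxima passed in as plain integers and `None` used as a placeholder for the case the claims do not cover (repetition level above the maximum); the function name is hypothetical:

```python
def backfill_definition(prior_levels, max_rep, max_def):
    """Derive definition levels for a new column from a prior column.

    prior_levels is a list of (repetition, definition) pairs, one per
    earlier entry of the identified prior column.
    """
    result = []
    for rep, definition in prior_levels:
        if rep > max_rep:
            # Not covered by the claimed branches; placeholder only.
            result.append(None)
        else:
            # Copy the level when it is within bounds, else use the maximum.
            result.append(min(definition, max_def))
    return result


new_def_levels = backfill_definition([(0, 3), (1, 1), (2, 2)], max_rep=1, max_def=2)
```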
12. The method of claim 1, wherein identifying a leaf of an entry in a data record, wherein the entry specifies one or more data fields in the data record and wherein the leaf corresponds to a last data field in the entry comprises:
determining whether the last field of the entry comprises a primitive data type; and
setting the last field of the entry as the leaf based at least in part on a determination that the last field of the entry comprises a primitive data type.
13. The method of claim 1, wherein identifying a leaf of an entry in a data record, wherein the entry specifies one or more data fields in the data record and wherein the leaf corresponds to a last data field in the entry comprises:
determining whether the last field of the entry comprises a primitive data type; and
responsive to determining that the last field of the entry does not comprise a primitive data type:
adding a new Boolean field to the entry after the last data field;
setting the new Boolean field to true; and
setting the new Boolean field as the leaf.
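Leaf identification for primitive and non-primitive last fields can be sketched as below. Which types count as primitive and the name of the added Boolean field (`_present`) are assumptions made for illustration; the claims specify neither.

```python
# ASSUMPTION: these Python types stand in for "primitive data types".
PRIMITIVES = (str, int, float, bool)


def identify_leaf(path, value, marker="_present"):
    """Return (leaf_path, leaf_value) for an entry.

    If the last field holds a primitive, it is the leaf. Otherwise a new
    Boolean field (hypothetical name `marker`) is appended, set to True,
    and used as the leaf.
    """
    if isinstance(value, PRIMITIVES):
        return tuple(path), value
    return tuple(path) + (marker,), True


primitive_leaf = identify_leaf(["a", "b"], 5)
object_leaf = identify_leaf(["a", "b"], {})
```

With this shape, an empty object or array still produces a column entry, so its presence in the record is not lost.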
14. An apparatus for encoding repetition and definition level values for a semi-structured data record while storing the semi-structured data in columnar format, the apparatus comprising:
one or more processors; and
one or more memories operatively coupled to at least one of the one or more processors and having instructions stored thereon that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to:
identify a leaf of an entry in a data record, wherein the entry specifies one or more data fields in the data record and wherein the leaf corresponds to a last data field in the entry;
store a value of the leaf in a column which corresponds to a nesting path of the leaf within the data record, wherein the column is in a current row group;
determine a repetition level of the entry for the column, wherein the repetition level of the entry for the column is based at least in part on a determination of whether a previous entry occurs prior to the entry in the data record and whether the previous entry includes a previous leaf which shares at least a portion of the nesting path of the leaf;
determine a definition level of the entry for the column based at least in part on a nesting level of the leaf in the data record; and
store the repetition level and definition level of the entry for the column in a table of repetition and definition levels for the current row group.
15. At least one non-transitory computer-readable medium storing computer-readable instructions that, when executed by one or more computing devices, cause at least one of the one or more computing devices to:
identify a leaf of an entry in a data record, wherein the entry specifies one or more data fields in the data record and wherein the leaf corresponds to a last data field in the entry;
store a value of the leaf in a column which corresponds to a nesting path of the leaf within the data record, wherein the column is in a current row group;
determine a repetition level of the entry for the column, wherein the repetition level of the entry for the column is based at least in part on a determination of whether a previous entry occurs prior to the entry in the data record and whether the previous entry includes a previous leaf which shares at least a portion of the nesting path of the leaf;
determine a definition level of the entry for the column based at least in part on a nesting level of the leaf in the data record; and
store the repetition level and definition level of the entry for the column in a table of repetition and definition levels for the current row group.
US15/208,032 | Priority date: 2015-12-31 | Filing date: 2016-07-12 | Method, apparatus, and computer-readable medium for encoding repetition and definition level values for semi-structured data | Abandoned | US20170193128A1 (en)

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
US15/208,032 | US20170193128A1 (en) | 2015-12-31 | 2016-07-12 | Method, apparatus, and computer-readable medium for encoding repetition and definition level values for semi-structured data

Applications Claiming Priority (3)

Application Number | Publication | Priority Date | Filing Date | Title
US201562274098P | | 2015-12-31 | 2015-12-31 |
US15/164,287 | US10866940B2 (en) | 2015-12-31 | 2016-05-25 | Method, apparatus, and computer-readable medium for ingesting semi-structured data in a columnar format
US15/208,032 | US20170193128A1 (en) | 2015-12-31 | 2016-07-12 | Method, apparatus, and computer-readable medium for encoding repetition and definition level values for semi-structured data

Related Parent Applications (1)

Application Number | Relation | Publication | Priority Date | Filing Date | Title
US15/164,287 | Continuation | US10866940B2 (en) | 2015-12-31 | 2016-05-25 | Method, apparatus, and computer-readable medium for ingesting semi-structured data in a columnar format

Publications (1)

Publication Number | Publication Date
US20170193128A1 (en) | 2017-07-06

Family

ID=59226422

Family Applications (2)

Application Number | Status | Publication | Priority Date | Filing Date | Title
US15/164,287 | Active 2038-04-17 | US10866940B2 (en) | 2015-12-31 | 2016-05-25 | Method, apparatus, and computer-readable medium for ingesting semi-structured data in a columnar format
US15/208,032 | Abandoned | US20170193128A1 (en) | 2015-12-31 | 2016-07-12 | Method, apparatus, and computer-readable medium for encoding repetition and definition level values for semi-structured data

Family Applications Before (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
US15/164,287 | Active 2038-04-17 | US10866940B2 (en) | 2015-12-31 | 2016-05-25 | Method, apparatus, and computer-readable medium for ingesting semi-structured data in a columnar format

Country Status (1)

Country | Link
US (2) | US10866940B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20240411756A1 (en) * | 2023-06-06 | 2024-12-12 | Microsoft Technology Licensing, Llc | Selection pushdown in column stores using bit manipulation instructions

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10706038B2 (en) * | 2017-07-27 | 2020-07-07 | Cisco Technology, Inc. | System and method for state object data store
US10681106B2 (en) | 2017-09-26 | 2020-06-09 | Oracle International Corporation | Entropy sharing across multiple compression streams
CN110895582A (en) * | 2018-09-12 | 2020-03-20 | 珠海格力电器股份有限公司 | Data processing method and device
US11562085B2 (en) * | 2018-10-19 | 2023-01-24 | Oracle International Corporation | Anisotropic compression as applied to columnar storage formats
US11074248B2 (en) | 2019-03-31 | 2021-07-27 | Oracle International Corporation | Map of operations for ingesting external data
US11586587B2 (en) * | 2020-09-24 | 2023-02-21 | Speedata Ltd. | Hardware-implemented file reader
US20220121640A1 (en) * | 2020-10-21 | 2022-04-21 | Western Digital Technologies, Inc. | Emulation of relational data table relationships using a schema
US11775487B2 (en) * | 2020-11-12 | 2023-10-03 | Western Digital Technologies, Inc. | Automatic flexible schema detection and migration

Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20110225038A1 (en) * | 2010-03-15 | 2011-09-15 | Yahoo! Inc. | System and Method for Efficiently Evaluating Complex Boolean Expressions
US20110307521A1 (en) * | 2010-06-14 | 2011-12-15 | Infobright, Inc. | System and method for storing data in a relational database
US20140279838A1 (en) * | 2013-03-15 | 2014-09-18 | Amiato, Inc. | Scalable Analysis Platform For Semi-Structured Data
US20140365500A1 (en) * | 2013-06-11 | 2014-12-11 | InfiniteBio | Fast, scalable dictionary construction and maintenance
US20160350375A1 (en) * | 2015-05-29 | 2016-12-01 | Oracle International Corporation | Optimizing execution plans for in-memory-aware joins

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US8290822B2 (en) * | 2010-08-20 | 2012-10-16 | Valuemomentum, Inc. | Product configuration server for efficiently displaying selectable attribute values for configurable products
WO2013074665A1 (en) * | 2011-11-14 | 2013-05-23 | Google Inc. | Data processing service
US9087138B2 (en) * | 2013-01-15 | 2015-07-21 | Xiaofan Zhou | Method for representing and storing hierarchical data in a columnar format

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20110225038A1 (en) * | 2010-03-15 | 2011-09-15 | Yahoo! Inc. | System and Method for Efficiently Evaluating Complex Boolean Expressions
US20110307521A1 (en) * | 2010-06-14 | 2011-12-15 | Infobright, Inc. | System and method for storing data in a relational database
US20140279838A1 (en) * | 2013-03-15 | 2014-09-18 | Amiato, Inc. | Scalable Analysis Platform For Semi-Structured Data
US20140365500A1 (en) * | 2013-06-11 | 2014-12-11 | InfiniteBio | Fast, scalable dictionary construction and maintenance
US20160350375A1 (en) * | 2015-05-29 | 2016-12-01 | Oracle International Corporation | Optimizing execution plans for in-memory-aware joins

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20240411756A1 (en) * | 2023-06-06 | 2024-12-12 | Microsoft Technology Licensing, Llc | Selection pushdown in column stores using bit manipulation instructions
US12229125B2 (en) * | 2023-06-06 | 2025-02-18 | Microsoft Technology Licensing, Llc | Selection pushdown in column stores using bit manipulation instructions

Also Published As

Publication number | Publication date
US10866940B2 (en) | 2020-12-15
US20170193019A1 (en) | 2017-07-06

Similar Documents

Publication | Title
US10866940B2 (en) | Method, apparatus, and computer-readable medium for ingesting semi-structured data in a columnar format
JP5961689B2 (en) | Incremental data extraction
US11256852B2 (en) | Converting portions of documents between structured and unstructured data formats to improve computing efficiency and schema flexibility
US9953102B2 (en) | Creating NoSQL database index for semi-structured data
AU2017340761B2 (en) | Techniques for generating and operating on in-memory datasets
US10565208B2 (en) | Analyzing multiple data streams as a single data object
US9633073B1 (en) | Distributed data store for hierarchical data
US9002907B2 (en) | Method and system for storing binary large objects (BLObs) in a distributed key-value storage system
US10242059B2 (en) | Distributed execution of expressions in a query
US20200117676A1 (en) | Method and system for executing queries on indexed views
US20150347484A1 (en) | Combining row based and column based tables to form mixed-mode tables
US20140214838A1 (en) | Method and system for processing large amounts of data
CN111190895B (en) | Organization method, device and storage medium of column-type storage data
US9898501B2 (en) | Method and system for performing transactional updates in a key-value store
Aye et al. | A platform for big data analytics on distributed scale-out storage system
CN107977396A (en) | A kind of update method of the tables of data of KeyValue databases and table data update apparatus
KR20200019734A (en) | Parallel compute offload to database accelerator
WO2017036348A1 (en) | Method and device for compressing and decompressing extensible markup language document
CN106462591B (en) | Partition filtering using intelligent indexing in memory
US9576008B2 (en) | System and method for search indexing
KR101772333B1 (en) | INTELLIGENT JOIN TECHNIQUE PROVIDING METHOD AND SYSTEM BETWEEN HETEROGENEOUS NoSQL DATABASES
US10083121B2 (en) | Storage system and storage method
CN115994148B (en) | Multi-table data updating method and device, electronic equipment and readable storage medium
US20140081986A1 (en) | Computing device and method for generating sequence indexes for data files
US20240045872A1 (en) | Partitioning, processing, and protecting multi-dimensional data

Legal Events

Code | Title | Description
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
AS | Assignment | Owner name: X15 SOFTWARE, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: BORKAR, VINAYAK; ALSUBAIEE, SATTAM; SIGNING DATES FROM 20160524 TO 20160525; REEL/FRAME: 040075/0425
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION
AS | Assignment | Owner name: FIREEYE, INC., CALIFORNIA. Free format text: MERGER; ASSIGNOR: X15 SOFTWARE, INC.; REEL/FRAME: 056005/0538. Effective date: 20180111
