Movatterモバイル変換


[0]ホーム

URL:


US20170031903A1 - Encoding and accessing position data - Google Patents

Encoding and accessing position data
Download PDF

Info

Publication number
US20170031903A1
US20170031903A1US15/295,007US201615295007AUS2017031903A1US 20170031903 A1US20170031903 A1US 20170031903A1US 201615295007 AUS201615295007 AUS 201615295007AUS 2017031903 A1US2017031903 A1US 2017031903A1
Authority
US
United States
Prior art keywords
positions
sets
data
ntokens
npadding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/295,007
Inventor
Tomi Poutanen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Altaba Inc
Original Assignee
Yahoo Inc until 2017
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yahoo Inc until 2017filedCriticalYahoo Inc until 2017
Priority to US15/295,007priorityCriticalpatent/US20170031903A1/en
Assigned to YAHOO! INC.reassignmentYAHOO! INC.ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).Assignors: POUTANEN, TOMI
Publication of US20170031903A1publicationCriticalpatent/US20170031903A1/en
Abandonedlegal-statusCriticalCurrent

Links

Images

Classifications

Definitions

Landscapes

Abstract

In one embodiment, a data structure comprises: a primary index comprising one or more position-block references; and one or more position blocks sequentially following the primary index, wherein: each one of the position-block references corresponds to one of the position blocks; and each one of the position blocks comprises: a secondary index comprising one or more position-data references; and one or more sets of positions sequentially following the secondary index, wherein each one of the position-data references corresponds to of one of the sets of positions in the position block. In one embodiment, an instance of the data structure is stored in a computer-readable memory and accessible by an application executed by a process.

Description

Claims (31)

What is claimed:
1. A method comprising: by one or more computing devices, accessing position data of a token for one or more documents, wherein:
the token appears in each one of the documents at one or more positions in the document; and
the position data comprises one or more sets of positions of the token appearing in the documents;
dividing the sets of positions into one or more position blocks;
storing one or more position-block offsets referencing the position blocks in a primary index in an instance of a data structure; and
storing the position blocks sequentially following the primary index in the instance of the data structure, wherein storing one of the position blocks comprises:
storing one or more position-data lengths in a secondary index of the position block, wherein each one of the position-data lengths equals a total size of one of the sets of positions in the position block; and
storing the sets of positions in the position block sequentially following the secondary index.
2. The method recited inclaim 1, further comprising encoding the position-block offsets in the primary index, comprising:
selecting the largest one of the position-block offsets;
determining a first bit number that is a minimum number of bits necessary to encode the largest one of the position-block offsets; and
encoding each one of the position-block offsets using the first bit number.
3. The method recited inclaim 1, wherein each one of the position-block offsets in the primary index references a memory location at the beginning of the corresponding one of the position blocks.
4. The method recited inclaim 1, further comprising encoding each one of the position-data lengths in the secondary index in each one of the position blocks using one byte.
5. The method recited inclaim 1, further comprising encoding each one of one or more of the sets of positions, wherein if the position-data length of one of the sets of positions is less than or equal to 255 bits, then encoding the set of the positions comprises:
selecting the largest one of the positions in the set of positions;
determining a second bit number that is a minimum number of bits necessary to encode the largest one of the positions; and
encoding each one of the positions in the set of positions using the second bit number.
6. The method recited inclaim 1, further comprising encoding each one of one or more of the sets of positions, wherein if the position-data length of one of the sets of positions is greater than 255 bits, then the set of positions is encoded using the exponential-Golomb code and stored at the end of the corresponding position block in which the set of positions is stored.
7. The method recited inclaim 1, wherein for each one of the sets of positions, the first one of the positions is absolute position; and
the subsequent ones of the positions are relative positions, wherein each one of the subsequent ones of the positions is a difference between the position and another position immediately proceeding the position.
8. The method recited inclaim 1, wherein storing one of the position blocks further comprises storing one or more padding bits following one or more of the sets of positions.
9. The method recited inclaim 1, wherein:
a same number of sets of positions is stored in each one of the position blocks, except the last one of the position blocks; and
the same number or less than the number of sets of positions stored in other ones of the position blocks is stored in the last one of the position blocks comprises.
10. The method recited inclaim 9, wherein:
sixteen sets of positions are stored in each one of the position blocks, except the last one of the position blocks; and
sixteen or less than sixteen sets of positions are stored in the last one of the position blocks comprises.
11. A system, comprising:
a memory comprising instructions executable by one or more processors; and
one or more processors coupled to the memory and operable to execute the instructions, the one or more processors being operable when executing the instructions to:
access position data of a token for one or more documents, wherein:
the token appears in each one of the documents at one or more positions in the document; and
the position data comprises one or more sets of positions of the token appearing in the documents;
divide the sets of positions into one or more position blocks;
store one or more position-block offsets referencing the position blocks in a primary index in an instance of a data structure; and
store the position blocks sequentially following the primary index in the instance of the data structure, wherein storing one of the position blocks comprises:
store one or more position-data lengths in a secondary index of the position block, wherein each one of the position-data lengths equals a total size of one of the sets of positions in the position block; and
store the sets of positions in the position block sequentially following the secondary index.
12. The system recited inclaim 11, wherein the one or more processors are further operable when executing the instructions to encode the position-block offsets in the primary index, comprising:
select the largest one of the position-block offsets;
determine a first bit number that is a minimum number of bits necessary to encode the largest one of the position-block offsets; and
encode each one of the position-block offsets using the first bit number.
13. The system recited inclaim 11, wherein each one of the position-block offsets in the primary index references a memory location at the beginning of the corresponding one of the position blocks.
14. The system recited inclaim 11, wherein the one or more processors are further operable when executing the instructions to encode each one of the position-data lengths in the secondary index in each one of the position blocks using one byte.
15. The system recited inclaim 11, wherein the one or more processors are further operable when executing the instructions to encode each one of one or more of the sets of positions, wherein if the position-data length of one of the sets of positions is less than or equal to 255 bits, then to encode the set of the positions comprises:
select the largest one of the positions in the set of positions;
determine a second bit number that is a minimum number of bits necessary to encode the largest one of the positions; and
encode each one of the positions in the set of positions using the second bit number.
16. The system recited inclaim 11, wherein the one or more processors are further operable when executing the instructions to encode each one of one or more of the sets of positions, wherein if the position-data length of one of the sets of positions is greater than 255 bits, then the set of positions is encoded using the exponential-Golomb code and stored at the end of the corresponding position block in which the set of positions is stored.
17. The system recited inclaim 11, wherein for each one of the sets of positions, the first one of the positions is absolute position; and
the subsequent ones of the positions are relative positions, wherein each one of the subsequent ones of the positions is a difference between the position and another position immediately proceeding the position.
18. The system recited inclaim 11, wherein to store one of the position blocks further comprises store one or more padding bits following one or more of the sets of positions.
19. The system recited inclaim 11, wherein:
a same number of sets of positions is stored in each one of the position blocks, except the last one of the position blocks; and
the same number or less than the number of sets of positions stored in other ones of the position blocks is stored in the last one of the position blocks comprises.
20. The system recited inclaim 19, wherein:
sixteen sets of positions are stored in each one of the position blocks, except the last one of the position blocks; and
sixteen or less than sixteen sets of positions are stored in the last one of the position blocks comprises.
21. One or more computer-readable tangible storage media embodying software operable when executed by one or more computer systems to:
access position data of a token for one or more documents, wherein:
the token appears in each one of the documents at one or more positions in the document; and
the position data comprises one or more sets of positions of the token appearing in the documents;
divide the sets of positions into one or more position blocks;
store one or more position-block offsets referencing the position blocks in a primary index in an instance of a data structure; and
store the position blocks sequentially following the primary index in the instance of the data structure, wherein storing one of the position blocks comprises:
store one or more position-data lengths in a secondary index of the position block, wherein each one of the position-data lengths equals a total size of one of the sets of positions in the position block; and
store the sets of positions in the position block sequentially following the secondary index.
22. The media recited inclaim 21, wherein the one or more processors are further operable when executing the instructions to encode the position-block offsets in the primary index, comprising:
select the largest one of the position-block offsets;
determine a first bit number that is a minimum number of bits necessary to encode the largest one of the position-block offsets; and
encode each one of the position-block offsets using the first bit number.
23. The media recited inclaim 21, wherein each one of the position-block offsets in the primary index references a memory location at the beginning of the corresponding one of the position blocks.
24. The media recited inclaim 21, wherein the one or more processors are further operable when executing the instructions to encode each one of the position-data lengths in the secondary index in each one of the position blocks using one byte.
25. The media recited inclaim 21, wherein the one or more processors are further operable when executing the instructions to encode each one of one or more of the sets of positions, wherein if the position-data length of one of the sets of positions is less than or equal to 255 bits, then to encode the set of the positions comprises:
select the largest one of the positions in the set of positions;
determine a second bit number that is a minimum number of bits necessary to encode the largest one of the positions; and
encode each one of the positions in the set of positions using the second bit number.
26. The media recited inclaim 21, wherein the one or more processors are further operable when executing the instructions to encode each one of one or more of the sets of positions, wherein if the position-data length of one of the sets of positions is greater than 255 bits, then the set of positions is encoded using the exponential-Golomb code and stored at the end of the corresponding position block in which the set of positions is stored.
27. The media recited inclaim 21, wherein for each one of the sets of positions,
the first one of the positions is absolute position; and
the subsequent ones of the positions are relative positions, wherein each one of the subsequent ones of the positions is a difference between the position and another position immediately proceeding the position.
28. The media recited inclaim 21, wherein to store one of the position blocks further comprises store one or more padding bits following one or more of the sets of positions.
29. The media recited inclaim 21, wherein:
a same number of sets of positions is stored in each one of the position blocks, except the last one of the position blocks; and
the same number or less than the number of sets of positions stored in other ones of the position blocks is stored in the last one of the position blocks comprises.
30. The media recited inclaim 29, wherein:
sixteen sets of positions are stored in each one of the position blocks, except the last one of the position blocks; and
sixteen or less than sixteen sets of positions are stored in the last one of the position blocks comprises.
31. A system comprising:
means for accessing position data of a token for one or more documents, wherein:
the token appears in each one of the documents at one or more positions in the document; and
the position data comprises one or more sets of positions of the token appearing in the documents;
means for dividing the sets of positions into one or more position blocks;
means for storing one or more position-block offsets referencing the position blocks in a primary index in an instance of a data structure; and
means for storing the position blocks sequentially following the primary index in the instance of the data structure, wherein the means for storing one of the position blocks comprises:
means for storing one or more position-data lengths in a secondary index of the position block, wherein each one of the position-data lengths equals a total size of one of the sets of positions in the position block; and
means for storing the sets of positions in the position block sequentially following the secondary index.
US15/295,0072010-03-252016-10-17Encoding and accessing position dataAbandonedUS20170031903A1 (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
US15/295,007US20170031903A1 (en)2010-03-252016-10-17Encoding and accessing position data

Applications Claiming Priority (3)

Application NumberPriority DateFiling DateTitle
US73145810A2010-03-252010-03-25
US13/173,532US9507827B1 (en)2010-03-252011-06-30Encoding and accessing position data
US15/295,007US20170031903A1 (en)2010-03-252016-10-17Encoding and accessing position data

Related Parent Applications (1)

Application NumberTitlePriority DateFiling Date
US13/173,532DivisionUS9507827B1 (en)2010-03-252011-06-30Encoding and accessing position data

Publications (1)

Publication NumberPublication Date
US20170031903A1true US20170031903A1 (en)2017-02-02

Family

ID=57351951

Family Applications (2)

Application NumberTitlePriority DateFiling Date
US13/173,532Expired - Fee RelatedUS9507827B1 (en)2010-03-252011-06-30Encoding and accessing position data
US15/295,007AbandonedUS20170031903A1 (en)2010-03-252016-10-17Encoding and accessing position data

Family Applications Before (1)

Application NumberTitlePriority DateFiling Date
US13/173,532Expired - Fee RelatedUS9507827B1 (en)2010-03-252011-06-30Encoding and accessing position data

Country Status (1)

CountryLink
US (2)US9507827B1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
WO2017131753A1 (en)*2016-01-292017-08-03Entit Software LlcText search of database with one-pass indexing including filtering
CN108256096B (en)*2018-01-302021-01-22北京搜狐新媒体信息技术有限公司Data processing method and device
US12099530B2 (en)*2020-06-162024-09-24Nippon Telegraph And Telephone CorporationSearch using words from user interface

Citations (29)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US3670310A (en)*1970-09-161972-06-13Infodata Systems IncMethod for information storage and retrieval
US4849878A (en)*1984-06-281989-07-18Wang Laboratories, Inc.Self-extending administrative memory file
US5848407A (en)*1996-05-221998-12-08Matsushita Electric Industrial Co., Ltd.Hypertext document retrieving apparatus for retrieving hypertext documents relating to each other
US5920854A (en)*1996-08-141999-07-06Infoseek CorporationReal-time document collection search engine with phrase indexing
US6020972A (en)*1997-11-142000-02-01Xerox CorporationSystem for performing collective symbol-based compression of a corpus of document images
US6041323A (en)*1996-04-172000-03-21International Business Machines CorporationInformation search method, information search device, and storage medium for storing an information search program
US6275301B1 (en)*1996-05-232001-08-14Xerox CorporationRelabeling of tokenized symbols in fontless structured document image representations
US6349308B1 (en)*1998-02-252002-02-19Korea Advanced Institute Of Science & TechnologyInverted index storage structure using subindexes and large objects for tight coupling of information retrieval with database management systems
US6374232B1 (en)*1996-08-292002-04-16Oracle Corp.Method and mechanism for retrieving values from a database
US20030126116A1 (en)*2001-12-282003-07-03Lucent Technologies Inc.System and method for improving index performance through prefetching
US20030135495A1 (en)*2001-06-212003-07-17Isc, Inc.Database indexing method and apparatus
US20040044659A1 (en)*2002-05-142004-03-04Douglass Russell JuddApparatus and method for searching and retrieving structured, semi-structured and unstructured content
US20040133557A1 (en)*2003-01-062004-07-08Ji-Rong WenRetrieval of structured documents
US20040205044A1 (en)*2003-04-112004-10-14International Business Machines CorporationMethod for storing inverted index, method for on-line updating the same and inverted index mechanism
US6826672B1 (en)*2000-05-162004-11-30Massachusetts Institute Of TechnologyCapability addressing with tight object bounds
US20050220094A1 (en)*2004-03-302005-10-06Parker David KPacket data modification processor command instruction set
US20070050384A1 (en)*2005-08-262007-03-01Korea Advanced Institute Of Science And TechnologyTwo-level n-gram index structure and methods of index building, query processing and index derivation
US20070220023A1 (en)*2004-08-132007-09-20Jeffrey DeanDocument compression system and method for use with tokenspace repository
US20080082554A1 (en)*2006-10-032008-04-03Paul PedersenSystems and methods for providing a dynamic document index
US20090150605A1 (en)*2007-12-062009-06-11David FlynnApparatus, system, and method for converting a storage request into an append data storage command
US20090271353A1 (en)*2008-04-282009-10-29Ben FeiMethod and device for tagging a document
US20100054616A1 (en)*2008-01-022010-03-04Samsung Electronics Co., Ltd.Methods and apparatuses for encoding and decoding image by using improved compression ratio of encoding information
US20100074332A1 (en)*2008-09-232010-03-25Qualcomm IncorporatedOffset calculation in switched interpolation filters
US20100118943A1 (en)*2007-01-092010-05-13Kabushiki Kaisha ToshibaMethod and apparatus for encoding and decoding image
US20100205172A1 (en)*2009-02-092010-08-12Robert Wing Pong LukMethod for using dual indices to support query expansion, relevance/non-relevance models, blind/relevance feedback and an intelligent search interface
US20110052087A1 (en)*2009-08-272011-03-03Debargha MukherjeeMethod and system for coding images
US20110093680A1 (en)*2009-10-152011-04-21Freescale Semiconductor, Inc.Flexible memory controller for autonomous mapping of memory
US8219562B1 (en)*2009-06-292012-07-10Facebook, Inc.Efficient storage and retrieval for large number of data objects
US8904032B2 (en)*2009-05-262014-12-02Cisco Technology, Inc.Prefetch optimization of the communication of data using descriptor lists

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US4335447A (en)*1980-02-051982-06-15Sangamo Weston, Inc.Power outage recovery method and apparatus for demand recorder with solid state memory
US4891771A (en)*1987-12-181990-01-02International Business Machines CorporationColumn balancing control
JP4131894B2 (en)*2000-02-292008-08-13株式会社東芝 Disk control mechanism suitable for random disk write
US7398271B1 (en)*2001-04-162008-07-08Yahoo! Inc.Using network traffic logs for search enhancement
US6851039B2 (en)*2002-09-302005-02-01Lucent Technologies Inc.Method and apparatus for generating an interleaved address
US8078686B2 (en)*2005-09-272011-12-13Siemens Product Lifecycle Management Software Inc.High performance file fragment cache
US8706914B2 (en)*2007-04-232014-04-22David D. DuchesneauComputing infrastructure
JP5169919B2 (en)*2009-03-062013-03-27セイコーエプソン株式会社 Electronic equipment, time difference data acquisition method, data structure of time difference data

Patent Citations (29)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US3670310A (en)*1970-09-161972-06-13Infodata Systems IncMethod for information storage and retrieval
US4849878A (en)*1984-06-281989-07-18Wang Laboratories, Inc.Self-extending administrative memory file
US6041323A (en)*1996-04-172000-03-21International Business Machines CorporationInformation search method, information search device, and storage medium for storing an information search program
US5848407A (en)*1996-05-221998-12-08Matsushita Electric Industrial Co., Ltd.Hypertext document retrieving apparatus for retrieving hypertext documents relating to each other
US6275301B1 (en)*1996-05-232001-08-14Xerox CorporationRelabeling of tokenized symbols in fontless structured document image representations
US5920854A (en)*1996-08-141999-07-06Infoseek CorporationReal-time document collection search engine with phrase indexing
US6374232B1 (en)*1996-08-292002-04-16Oracle Corp.Method and mechanism for retrieving values from a database
US6020972A (en)*1997-11-142000-02-01Xerox CorporationSystem for performing collective symbol-based compression of a corpus of document images
US6349308B1 (en)*1998-02-252002-02-19Korea Advanced Institute Of Science & TechnologyInverted index storage structure using subindexes and large objects for tight coupling of information retrieval with database management systems
US6826672B1 (en)*2000-05-162004-11-30Massachusetts Institute Of TechnologyCapability addressing with tight object bounds
US20030135495A1 (en)*2001-06-212003-07-17Isc, Inc.Database indexing method and apparatus
US20030126116A1 (en)*2001-12-282003-07-03Lucent Technologies Inc.System and method for improving index performance through prefetching
US20040044659A1 (en)*2002-05-142004-03-04Douglass Russell JuddApparatus and method for searching and retrieving structured, semi-structured and unstructured content
US20040133557A1 (en)*2003-01-062004-07-08Ji-Rong WenRetrieval of structured documents
US20040205044A1 (en)*2003-04-112004-10-14International Business Machines CorporationMethod for storing inverted index, method for on-line updating the same and inverted index mechanism
US20050220094A1 (en)*2004-03-302005-10-06Parker David KPacket data modification processor command instruction set
US20070220023A1 (en)*2004-08-132007-09-20Jeffrey DeanDocument compression system and method for use with tokenspace repository
US20070050384A1 (en)*2005-08-262007-03-01Korea Advanced Institute Of Science And TechnologyTwo-level n-gram index structure and methods of index building, query processing and index derivation
US20080082554A1 (en)*2006-10-032008-04-03Paul PedersenSystems and methods for providing a dynamic document index
US20100118943A1 (en)*2007-01-092010-05-13Kabushiki Kaisha ToshibaMethod and apparatus for encoding and decoding image
US20090150605A1 (en)*2007-12-062009-06-11David FlynnApparatus, system, and method for converting a storage request into an append data storage command
US20100054616A1 (en)*2008-01-022010-03-04Samsung Electronics Co., Ltd.Methods and apparatuses for encoding and decoding image by using improved compression ratio of encoding information
US20090271353A1 (en)*2008-04-282009-10-29Ben FeiMethod and device for tagging a document
US20100074332A1 (en)*2008-09-232010-03-25Qualcomm IncorporatedOffset calculation in switched interpolation filters
US20100205172A1 (en)*2009-02-092010-08-12Robert Wing Pong LukMethod for using dual indices to support query expansion, relevance/non-relevance models, blind/relevance feedback and an intelligent search interface
US8904032B2 (en)*2009-05-262014-12-02Cisco Technology, Inc.Prefetch optimization of the communication of data using descriptor lists
US8219562B1 (en)*2009-06-292012-07-10Facebook, Inc.Efficient storage and retrieval for large number of data objects
US20110052087A1 (en)*2009-08-272011-03-03Debargha MukherjeeMethod and system for coding images
US20110093680A1 (en)*2009-10-152011-04-21Freescale Semiconductor, Inc.Flexible memory controller for autonomous mapping of memory

Also Published As

Publication numberPublication date
US9507827B1 (en)2016-11-29

Similar Documents

PublicationPublication DateTitle
US9443008B2 (en)Clustering of search results
US20110184981A1 (en)Personalize Search Results for Search Queries with General Implicit Local Intent
JP6646030B2 (en) Sentence extraction method and system
US9152717B2 (en)Search engine suggestion
CA3088695C (en)Method and system for decoding user intent from natural language queries
US10216831B2 (en)Search results summarized with tokens
US8903837B2 (en)Incorporating geographical locations in a search process
US8843608B2 (en)Methods and systems for caching popular network content
US8255414B2 (en)Search assist powered by session analysis
US10423672B2 (en)Network resource-specific search assistance
US8078642B1 (en)Concurrent traversal of multiple binary trees
US20150058308A1 (en)Generating cache query requests
US20110040769A1 (en)Query-URL N-Gram Features in Web Ranking
US10296497B2 (en)Storing a key value to a deleted row based on key range density
US11074266B2 (en)Semantic concept discovery over event databases
US9940355B2 (en)Providing answers to questions having both rankable and probabilistic components
CN110851136A (en) Data acquisition method, device, electronic device and storage medium
KR102125407B1 (en)Method and system for extracting sentences
US20170031903A1 (en)Encoding and accessing position data
US9705972B2 (en)Managing a set of data
US20170277751A1 (en)Optimizing searches
JP2016519370A (en) DATA PROCESSING DEVICE, DATA PROCESSING METHOD, AND ELECTRONIC DEVICE
KR102034302B1 (en)Method and system for extracting sentences
US9792358B2 (en)Generating and using socially-curated brains
CN115905044A (en) A cache management method, device and computer equipment

Legal Events

DateCodeTitleDescription
ASAssignment

Owner name:YAHOO! INC., CALIFORNIA

Free format text:ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:POUTANEN, TOMI;REEL/FRAME:040030/0309

Effective date:20100323

STPPInformation on status: patent application and granting procedure in general

Free format text:RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPPInformation on status: patent application and granting procedure in general

Free format text:FINAL REJECTION MAILED

STCBInformation on status: application discontinuation

Free format text:ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION


[8]ページ先頭

©2009-2025 Movatter.jp