Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.
Popular search engines focus on the full-text indexing of online, natural language documents.[1] Media types such as pictures, video, audio,[2] and graphics[3] are also searchable.
Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.
Major factors in designing a search engine's architecture include:
Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors.
A major challenge in the design of search engines is the management of serial computing processes. There are many opportunities for race conditions and coherence faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.[13]
Many search engines incorporate an inverted index when evaluating a search query to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:
| Word | Documents |
|---|---|
| the | Document 1, Document 3, Document 4, Document 5, Document 7 |
| cow | Document 2, Document 3, Document 4 |
| says | Document 5 |
| moo | Document 7 |
This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a Boolean index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document.[14] Position information enables the search algorithm to identify word proximity to support searching for phrases; frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.
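As a rough illustration, the following Python sketch builds such a Boolean inverted index from an in-memory collection and answers a conjunctive (AND) query by intersecting postings. The document names and contents are hypothetical, and a production index would be stored and queried very differently.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the set of document IDs containing it (a Boolean index)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def and_query(index, terms):
    """Return the documents that contain every query term, via set intersection."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical corpus
docs = {
    "Document 3": "the cow jumped",
    "Document 5": "the cow says moo",
    "Document 7": "the moon",
}
index = build_inverted_index(docs)
print(and_query(index, ["the", "cow"]))  # {'Document 3', 'Document 5'}
```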
The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two-dimensional array. The index is similar to the term-document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table.[15]
For phrase searching, a specialized form of an inverted index called a positional index is used. A positional index not only stores the ID of the document containing the token but also the exact position(s) of the token within the document in the postings list. The occurrences of the phrase specified in the query are retrieved by navigating these postings lists and identifying the positions at which the desired terms occur in the expected order (the same as the order in the phrase). So, to search for occurrences of the phrase "First Witch", we would retrieve the postings lists for both terms, find the documents common to both, and then check whether a position of "Witch" immediately follows a position of "First" in each of those documents.
The postings lists can be navigated using a binary search in order to minimize the time complexity of this procedure.[16]
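A minimal sketch of phrase search over a positional index follows. As the text notes, real implementations keep postings sorted and navigate them with binary search; for clarity this sketch uses simple membership tests, and the documents are invented for the example.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each token to {doc_id: [positions]} so phrases can be matched."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return documents where the phrase's tokens occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or terms[0] not in index:
        return set()
    results = set()
    for doc_id, first_positions in index[terms[0]].items():
        for start in first_positions:
            if all(start + i in index.get(term, {}).get(doc_id, [])
                   for i, term in enumerate(terms[1:], start=1)):
                results.add(doc_id)
                break
    return results

docs = {"Document 1": "enter the first witch", "Document 2": "the witch is first"}
index = build_positional_index(docs)
print(phrase_query(index, "first witch"))  # {'Document 1'}
```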
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing,[17] where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.
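The merge step can be sketched as follows, under the simplifying assumption that both the existing index and the newly built in-memory index are plain dictionaries mapping each word to a sorted list of document IDs; real engines merge on-disk structures rather than Python dictionaries.

```python
import heapq

def merge_indexes(disk_index, memory_index):
    """Merge newly built in-memory postings into the existing index.

    Both arguments map word -> sorted list of document IDs; this sketches only
    the merge logic, not the on-disk layout a real engine would use."""
    merged = dict(disk_index)
    for word, new_postings in memory_index.items():
        if word in merged:
            combined = list(heapq.merge(merged[word], new_postings))
            # Drop adjacent duplicates so re-indexed documents appear only once.
            merged[word] = [d for i, d in enumerate(combined)
                            if i == 0 or d != combined[i - 1]]
        else:
            merged[word] = list(new_postings)
    return merged

old = {"cow": ["Document 2", "Document 3"]}
new = {"cow": ["Document 3", "Document 9"], "moo": ["Document 9"]}
print(merge_indexes(old, new))
# {'cow': ['Document 2', 'Document 3', 'Document 9'], 'moo': ['Document 9']}
```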
After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts, the development of a forward index and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.
The forward index stores a list of words for each document. The following is a simplified form of the forward index:
| Document | Words |
|---|---|
| Document 1 | the, cow, says, moo |
| Document 2 | the, cat, and, the, hat |
| Document 3 | the, dish, ran, away, with, the, spoon |
The rationale behind developing a forward index is that as documents are parsed, it is better to store the words per document in an intermediate form. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck.[18] The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
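The sort-based conversion described above might be sketched as follows; the forward index shown echoes the illustrative table, and the pair representation is a simplification.

```python
def invert_forward_index(forward_index):
    """Convert a forward index (document -> words) into an inverted index
    (word -> documents) by emitting (word, document) pairs and sorting by word."""
    pairs = [(word, doc_id)
             for doc_id, words in forward_index.items()
             for word in words]
    pairs.sort()  # sort by word, then by document
    inverted = {}
    for word, doc_id in pairs:
        postings = inverted.setdefault(word, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return inverted

forward = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
}
print(invert_forward_index(forward)["the"])  # ['Document 1', 'Document 2']
```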
Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk.[19] Consider the following scenario for a full-text, Internet search engine.
Given this scenario, an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone.[citation needed] This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression.[citation needed]
Notably, large-scale search engine designs incorporate the cost of storage as well as the costs of electricity to power the storage. Thus compression is a measure of cost.[citation needed]
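One widely used family of compression techniques (not necessarily the one any particular engine employs) stores the gaps between sorted document IDs and encodes each gap in a variable number of bytes, since small gaps dominate. A rough sketch:

```python
def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers: each value is split into
    7-bit groups; the high bit marks the final byte of a value."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()
        chunk[-1] |= 0x80  # flag the last byte of this value
        out.extend(chunk)
    return bytes(out)

def gaps(postings):
    """Turn a sorted list of document IDs into first-ID-then-gaps; the gaps are
    smaller numbers and therefore compress better."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

doc_ids = [3, 7, 21, 150, 100000]
encoded = vbyte_encode(gaps(doc_ids))
print(len(encoded), "bytes instead of", 4 * len(doc_ids))  # 8 bytes instead of 20
```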
Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.
Natural language processing is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementation of which are commonly kept as corporate secrets.[citation needed]
Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a tokenizer, parser, or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.
During tokenization, the parser identifies sequences of characters that represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
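A toy tokenizer along these lines might look like the following; the regular expression and the recorded characteristics (category, position, case) are illustrative simplifications rather than any particular engine's rules.

```python
import re

# Illustrative token pattern: entities such as emails and URLs are tried before plain words.
TOKEN_PATTERN = re.compile(r"""
    (?P<email>[\w.+-]+@[\w-]+\.[\w.-]+) |
    (?P<url>https?://\S+)               |
    (?P<word>\w+)                       |
    (?P<punct>[^\w\s])
""", re.VERBOSE)

def tokenize(text):
    """Yield (token, category, position, case) tuples for a document."""
    for position, match in enumerate(TOKEN_PATTERN.finditer(text)):
        token = match.group()
        category = match.lastgroup
        if token.isupper():
            case = "upper"
        elif token.istitle():
            case = "proper"
        elif token.islower():
            case = "lower"
        else:
            case = "mixed"
        yield token, category, position, case

for t in tokenize("Email admin@example.com about the NEW page at https://example.com"):
    print(t)
```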
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language; many of the subsequent steps are language dependent (such as stemming and part of speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing. Finding which language the words belong to may involve the use of a language recognition chart.
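As a heavily simplified illustration, a language can sometimes be guessed by counting function words; production systems rely on much richer models, such as character n-gram statistics. The stopword lists below are tiny, hypothetical samples.

```python
# Tiny illustrative stopword lists; real systems use much larger models.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is"},
    "french":  {"le", "la", "et", "de", "est"},
    "german":  {"der", "die", "und", "von", "ist"},
}

def guess_language(text):
    """Return the language whose stopwords appear most often in the text."""
    words = text.lower().split()
    scores = {lang: sum(w in stops for w in words)
              for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(guess_language("the cow says moo and the cat is happy"))  # english
```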
If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, HTML documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and font size or style. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results. Format analysis is the identification and handling of the formatting content embedded within documents which controls the way the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, and text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:
Options for dealing with various formats include using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format, and writing a custom parser.
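As an example of a custom parser for format analysis, the following sketch uses Python's standard html.parser module to keep only textual content and skip non-content elements; a real tag stripper would handle many more cases (comments, entities, malformed markup).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the textual content of an HTML document, skipping markup
    and the contents of non-content elements such as <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

html = "<html><body><h1>Title</h1><script>var x=1;</script><p>Body text.</p></body></html>"
extractor = TextExtractor()
extractor.feed(html)
print(" ".join(extractor.parts))  # Title Body text.
```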
Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:
Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content creators can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side-sections that do not contain primary material (that which the document is about). For example, articles on the Wikipedia website display a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:
Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the noscript tag to ensure that the web page is indexed properly. At the same time, this fact can also be exploited to cause the search engine indexer to 'see' different content than the viewer.
Indexing often has to recognize HTML tags in order to organize priority. For example, assigning lower or higher priority to terms marked up with labels such as strong and link helps order terms by importance, since emphasis appearing at the beginning of a text does not by itself prove relevance. Some indexers, such as Google and Bing, ensure that the search engine does not treat large blocks of emphasized text as a relevant source merely because of the markup.[22]
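Purely as an illustration of tag-sensitive weighting, the sketch below assigns higher scores to terms that appear inside more prominent HTML elements; the weights are invented for the example and do not reflect any actual engine's values.

```python
# Hypothetical weights per HTML element; real engines' weights are not public.
TAG_WEIGHTS = {"title": 10, "h1": 5, "strong": 2, "a": 2, "body": 1}

def weight_terms(tagged_tokens):
    """Accumulate a score per term based on the element it appeared in.

    `tagged_tokens` is a list of (term, tag) pairs produced during parsing."""
    scores = {}
    for term, tag in tagged_tokens:
        scores[term] = scores.get(term, 0) + TAG_WEIGHTS.get(tag, 1)
    return scores

tokens = [("indexing", "title"), ("indexing", "body"), ("cows", "body")]
print(weight_terms(tokens))  # {'indexing': 11, 'cows': 1}
```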
Meta tag indexing plays an important role in organizing and categorizing web content. Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the meta tag contains keywords which are also included in the index. Earlier Internet search engine technology would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was computer hardware able to support such technology. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization.[23]
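A small sketch of meta tag extraction, again using Python's standard html.parser; only the keywords and description tags are collected here, which is a simplification of real indexers.

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collect the content of <meta name="keywords"> and <meta name="description"> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content", "")

reader = MetaTagReader()
reader.feed('<head><meta name="keywords" content="indexing, search"></head>')
print(reader.meta)  # {'keywords': 'indexing, search'}
```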
As the Internet grew through the 1990s, many brick-and-mortar corporations went 'online' and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively specified was leading to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research of full-text indexing technologies.[citation needed]
In desktop search, many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content. Desktop search is more under the control of the user, while Internet search engines must focus more on the full-text index.[citation needed]