Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms "comprising" and "having," and any variations thereof, in the description and claims of this application and the drawings described herein are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
First, a part of vocabulary and application scenarios related to the embodiments of the present application will be described.
Article (Article): the sub-character string is embedded in a static hypertext Markup Language (HTML) character string and takes < article > </article > as a mark.
Inline keyword (Tag): keywords embedded in articles and used for marking the articles belong to an article classification mode, and one article may contain one or more embedded keywords.
The next set of keyword sets refers to a set of other keywords that may be of interest to the user that are related to the current set of keywords.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 1, the system architecture of the embodiment of the present application may include, but is not limited to:electronic device 11 and server 12.
Theelectronic device 11 and the server 12 may be connected via a network.
The method provided by the embodiment of the application can be realized by an electronic device such as a processor executing corresponding software codes, and can also be realized by an electronic device performing data interaction with a controller while executing the corresponding software codes.
In the related art, associated articles are generally queried based on a set of keywords. However, with the development of the internet, more and more information is provided on the internet, and it takes a long time for a user to search for a related target article from a large amount of information, so that the data query efficiency is low.
In the query method of the embodiment of the application, at least one keyword input by a user is acquired, a next keyword set is found and returned according to a matching result of a character string of the input keyword, and then an associated article is found according to the next keyword set.
The technical idea of the method of the embodiment of the application is as follows: by storing the corresponding relationship between the first index information corresponding to the first keyword set and the second index information corresponding to the second keyword set in advance and the corresponding relationship between the index information corresponding to the keyword set and the index value of the query result, the time consumption during query is reduced, namely, the next group of keyword sets corresponding to the first keyword set, namely, the second keyword set and the target query result finally corresponding to the keyword sets can be quickly determined, and the query efficiency is improved.
The scheme of the embodiment of the application aims to enable a user to obtain more detailed information through a selected single keyword or a plurality of keywords. For example, in the field of network customer service, most users generally do not know what type of customer service they need. If some preset keywords can be provided for the user, the user can select a keyword most relevant to the problems encountered by the user as a starting keyword, and continuously filter the query of the user according to the next set of provided keyword sets until a query result matched with the keyword set selected by the user is obtained.
In the following embodiments of the present application, the query result is described by taking the article as an example, which is not limited in the embodiments of the present application.
For convenience of explaining the method of the embodiment of the present application, the following assumptions are made:
1. in the embodiment of the application, the article is a complete HTML character string, and is identified by an < article > </article > tag, and the keyword list embedded in the article is used as the attribute of the < article > tag, but some articles may not have embedded keywords. Such as: the keyword data-tags is characterized in that the keyword class is ' article ', the id is ' 12345 ', the keyword data-tags is ' C, A, B ' > < h2> S1</h2 </article ', and the values C, A and B of the data-tags attribute are the embedded keyword set. For another example, the article has no embedded keyword, and the article class is "article" id is "12345" > < h2> S1</h2> </article >.
2. The articles in the embodiment of the application are not nested, in other words, one article can not be nested with another article.
3. The embedded keywords of the single article in the embodiment of the application have no repeatability. For example, one valid set of inline keywords: data-tags is "C, a, B". But the following set of inline keywords is invalid: data-tags is "C, a, C, B".
4. The embedded keyword set of each article in the embodiment of the application has uniqueness. For example, this set of embedded keywords may only appear once in a large number of articles: data-tags is "C, a, B". It is further assumed that the keywords of the inline keyword set are also unordered, in other words: data-tags ═ C, a, B "and data-tags ═ a, B, C" are the same. Likewise, the keywords in the next set of keyword sets are also unordered.
5. In the embodiment of the present application, the format of the keyword (for example, containing english, capital and small letters, or containing latin special characters, etc.) is not considered.
The technical solution of the present application will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a schematic flowchart of a data query method according to an embodiment of the present application. As shown in fig. 2, the method provided by this embodiment includes:
step 101, a first keyword set input by a user and first index information corresponding to the first keyword set are obtained, wherein the first index information is obtained according to indexes of keywords in the first keyword set.
In particular, the first set of keywords may include one or more keywords, and the first set of keywords may be selected by the user based on a plurality of keywords provided by the device, or may be actively input by the user.
The corresponding index may be set for the keyword, and the index may be a positive integer value, for example, as shown in table 1:
TABLE 1
| Index (integer value) | Keyword |
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | D |
| 4 | E |
According to the index of each keyword in the first keyword set, first index information corresponding to the first keyword set is obtained, where the first index information may be a numerical value, such as an integer, calculated by a preset function according to at least one index.
For example, the first keyword set includes { A, B, C, D }, and the corresponding first index information is, for example, 20+2 }1+22+23=15。
The preset function is, for example, an exponential function, a binary summation, or the like, which is not limited in the embodiment of the present application.
Step 102, if the number of the query results corresponding to the first keyword set is determined not to be a first preset value, determining a second keyword set according to the first index information and the first corresponding relation; the first correspondence includes: and the first index information and the second index information corresponding to the second keyword set correspond to each other.
Specifically, the query result obtained according to the first keyword set includes the following possible situations:
1. the first keyword set input by the current user completely matches the embedded keyword set of an article, for example, the embedded keyword set of the article S1 is { A, B, C, D }, and completely matches the first keyword set { A, B, C, D } input by the current user; at this point, the first set of keywords { A, B, C, D } may also be a subset of the set of inline keywords of another article. In which case the article S1 may be returned to the user.
The number of query results in this case is at least one, and may be one or more.
If there is a full match, the second set of keywords still needs to be returned, and the query cannot be terminated by a full match because the set of keywords currently entered by the user may completely match the set of embedded keywords of one article, and may also be a subset of the set of embedded keywords of another article.
2. The first keyword set currently input by the user is a subset of the embedded keyword set of an article, and the following cases can be distinguished:
(1) if the first set of keywords entered by the current user does not exactly match the set of inline keywords of an article, but may point to a unique article. For example: the first keyword set currently entered by the user is { A, E }, and there are no articles in the stored massive articles that can be completely matched with the keyword set, but the keyword set can only point to the articles of which the embedded keyword set is { A, B, C, E, F }, because the keyword E only appears in the articles. In this case, all subsequent queries can be skipped directly, returning the article directly. Also in this case the next set of keywords, i.e. the second set of keywords, may not be returned, since the final query result is, anyway, only this article. This situation is called "accelerated query": whether the keyword set of the current user and the embedded keyword set of a certain article are matched in subset or in full set, as long as the article can be uniquely pointed by the keyword set, the subsequent query process is skipped, and only the article is returned.
In which case the number of query results is one.
(2) While the set of inline keywords for each article is unique, a subset of this set is not unique. For example: the current set of inline keywords of an article is { A, B, C, D, E }, which is unique among the set of inline keywords of the article, but the subset { A, C, D } of which is not unique, since this subset may also be a subset of the set of inline keywords of other articles.
For example, if the first keyword set input by the user is { a, C, D }, the articles corresponding to the first keyword set may be multiple, i.e., not point to a single article, i.e., the number of query results is multiple.
In summary, when the number of the query results corresponding to the first keyword set is not the first preset value, for example, is not one, the next group of keyword sets is determined, that is, the second keyword set is determined according to the first index information and the first corresponding relationship.
The first corresponding relation refers to the corresponding relation between the first index information and second index information corresponding to the second keyword set, the second index information is obtained according to indexes corresponding to all keywords in the second keyword set, and the calculation mode of the second index information is the same as that of the first index information.
The first correspondence may be represented, for example, by table 2 below:
TABLE 2
According to the first index information and the first corresponding relation, second index information corresponding to the first index information can be obtained, and then according to the second index information, a next group of keyword sets corresponding to the first keyword sets, namely a second keyword set, can be obtained.
For example, the set of keywords entered by the current user is a subset of the set of inline keywords of an article, and the next set of keyword sets, i.e., the second set of keywords, may contain keywords in the article that are not present in the set of keywords entered by the current user. For example, the following steps are carried out: if the set of inline keywords of the Article1 is { a, C, B, D } and the set of keywords currently entered by the user is { a, B }, then { C, D } may be included in the second set of keywords.
103, determining a target query result corresponding to the second keyword set according to the second keyword set and the second corresponding relation; the second correspondence includes: and the corresponding relation between the index information and the index value of the query result.
Specifically, the second correspondence is a correspondence between index information corresponding to the keyword set and an index value of the query result. For example as shown in table 3 below:
TABLE 3
If the unique query result cannot be determined according to the second keyword set, the scheme ofstep 102 and step 103 may be repeatedly executed to determine a next keyword set, and then query the articles corresponding to the next keyword set until the unique target article, that is, the unique target query result, is queried.
In one embodiment, determining the target query result may be implemented as follows:
determining an index value of a target query result corresponding to the second keyword set according to the second keyword set and the second corresponding relation;
and determining the target query result corresponding to the second keyword set according to the index value of the target query result corresponding to the second keyword set.
For example, as shown in Table 4, each article corresponds to a unique index value.
TABLE 4
The query result has a one-to-one correspondence with the index value of the query result.
That is, according to the second corresponding relationship, the index value of the query result corresponding to the keyword set can be determined, and then the target query result is obtained according to the index value.
The method comprises the steps of obtaining a first keyword set input by a user and first index information corresponding to the first keyword set, wherein the first index information is obtained according to indexes of keywords in the first keyword set; furthermore, since the corresponding relationship between the first index information and the second index information corresponding to the second keyword set is preset in advance, the second keyword set, that is, the next group of keyword sets, can be determined quickly according to the first index information and the first corresponding relationship; and the corresponding relation between the index information corresponding to the keyword set and the index value of the query result is preset, so that the target query result corresponding to the second keyword set can be quickly determined according to the second keyword set and the second corresponding relation, the query time is less, and the query efficiency is improved.
On the basis of the above embodiment, the second correspondence is a correspondence between the index information and a second preset value and an index value of a query result, the second preset value is used to indicate that the keyword set is completely matched with the corresponding query result, or the keyword set corresponds to a single query result or a non-single query result, and the number of the query results corresponding to the first keyword set may be determined in the following manner:
determining a second preset value corresponding to the first keyword set according to the second corresponding relation and first index information corresponding to the first keyword set;
and determining the number of the query results corresponding to the first keyword set according to a second preset value corresponding to the first keyword set.
Specifically, the matching information is stored through a second corresponding relationship, and the second corresponding relationship is stored through a MATCH _ LOOKUP dictionary, for example, and the following cases can be classified as follows:
1. if the keyword set is completely matched with the query result, storing (the keyword set (represented by index information such as an integer) and 1) the corresponding relation between the index (such as an integer) of the article character string;
2. if the keyword set is a subset of an embedded keyword set of an article and the keyword set only points to the article, storing (the keyword set (represented by index information such as an integer)), 0) the corresponding relation between the keyword set and an index (such as an integer) of a character string of the article;
3. if the keyword set is a subset of an embedded keyword set of a certain article, and the keyword set does not uniquely point to the article, that is, the keyword set is also a subset of an embedded keyword set of another article, a correspondence between (the keyword set (represented by index information such as an integer), 0) and a preset character string, for example, a NONE, and for example, the preset character string may also be a preset numerical value, is stored.
However, 1 in the above correspondence may be replaced by another integer, which is not limited in the embodiment of the present application.
Searching the first index information in the stored second corresponding relation, if a certain corresponding relation exists, acquiring a second preset value and an index value corresponding to a query result, and if the corresponding relation is a completely matched corresponding relation, namely the condition of 1 st, the second preset value is 1 for example; the index value is, for example, the index value of an article; if the second condition is the 2 nd condition, the second preset value is 0 for example; the index value is, for example, the index value of an article; if the situation is the 3 rd situation, the second preset value is 0 for example; the index value is, for example, a predetermined value.
According to the obtained second preset value, the number of the query results corresponding to the first keyword set can be determined, for example, if the number is the 1 st condition, the number of the query results is not the first preset value; if the condition is the 2 nd condition, the quantity of the query results is a first preset value, namely the keyword set only points to an article; in case of the 3 rd situation, the number of the query results is not the first preset value.
In the foregoing embodiment, the second corresponding relationship further includes a second preset value, which is used to indicate that the keyword set is completely matched with the corresponding query result, or the keyword set corresponds to a single query result or a non-single query result, and the number of the query results corresponding to the keyword set can be distinguished, for example, if the keyword set is completely matched with the query result, the number of the corresponding query results may be multiple, that is, the number of the corresponding query results is not the first preset value, for example, if the keyword set corresponds to a single query result, the number of the query results is the first preset value, for example, if the keyword set corresponds to a non-single query result, the number of the query results is not the first preset value, which is relatively simple to implement, and if the second preset value is only added in the second corresponding relationship, the spatial complexity is.
In an embodiment, the step of "determining the number of query results corresponding to the first keyword set" may be implemented as follows:
if the second preset value is M1, determining that the number of the query results corresponding to the first keyword set is not the first preset value; m1 is an integer, or,
if the second preset value is M2 and the index value corresponding to the query result is not equal to the index value of the single query result, determining that the number of the query results corresponding to the first keyword set is not the first preset value; wherein M2 is not equal to M1.
Specifically, if the second preset value is M1, that is, the above-mentioned situation 1 is satisfied, and the matching is complete, but there may exist embedded keyword sets of other articles including the first keyword set, it is determined that the number of query results corresponding to the first keyword set is not the first preset value.
If the second preset value is M2, it indicates that the first keyword set is a subset of the embedded keyword sets of some articles, and further, if the index value corresponding to the query result is not equal to the index value of the single query result, that is, not the index value of some article, it indicates that the first keyword set is a subset of the embedded keyword sets of multiple articles, that is, not uniquely pointing to some article, so that it is determined that the number of the query results corresponding to the first keyword set is not the first preset value.
In other embodiments, the second preset value may be, for example, other values, and different three values may be adopted to distinguish the three cases, which is not limited in the embodiment of the present application.
In the above embodiment, the second preset value distinguishes different conditions of the query result corresponding to the keyword set by different values, and the implementation is simpler.
In an embodiment, determining the first index information may be implemented as follows:
for any keyword in the first keyword set, determining the value of an exponential function taking a as a base number and taking the index of the keyword as a power; a is an integer greater than 1
And taking the sum of the values of the exponential functions corresponding to the keywords in the first keyword set as first index information corresponding to the first keyword set.
Specifically, the following description will be given by taking a as 2:
as shown in table 1, for example, if the first keyword set is { a, B, C }, the first index information corresponding to the first keyword set is 20+21+22I.e., 1+2+4 ═ 7.
For example, if the first keyword set is { a, D, E }, the first index information corresponding to the first keyword set is 20+23+24I.e., 1+8+ 16-25.
In one embodiment, each keyword corresponds to an index (e.g., an integer starting with 0). Assume a keyword set KeyList, where each keyword has an integer value of intkey (N), where N is 0,1,2,3NWherein a is, for example, 2. The keyword set KeyList may be represented as follows:
the logic of binary storage mode in the computer by using the integer can regard the keyword set as the binary sum of all keywords in the keyword set
Namely: the index information corresponding to the keyword set is
Thus, a value (e.g., an integer) may be used to represent a set of keywords. Space complexity can be optimized, space is saved by storing integers compared with character strings, meanwhile, the disorder of the keywords in the keyword set has no influence on query, each keyword corresponds to a unique integer, and no matter where the keyword appears in the keyword set, the binary system of the integer represented by the keyword only needs to be left-shifted and added to the integer value of the keyword set.
In the embodiment, the index information corresponding to the keyword set is obtained according to the index of each keyword in the keyword set, and the index information is simple, low in implementation complexity and high in efficiency.
In an embodiment, determining the second keyword set may be implemented by:
determining second index information corresponding to the second keyword set according to the first index information and the first corresponding relation;
and determining the second keyword set according to the second index information.
According to the second index information, determining the second keyword set can be realized in the following manner:
converting the second index information into a binary number, and performing bitwise AND operation on the binary number and the negative number of the binary number to obtain a target value;
acquiring indexes corresponding to the keywords in the second keyword set according to the position of a third preset value in the target value;
and determining each keyword in the second keyword set according to the index corresponding to each keyword in the second keyword set.
Specifically, the second index information is converted into a binary number, the binary number and a negative number of the binary number are subjected to bitwise and operation, and positions of 1 in a lower order are sequentially found, so that each keyword in the second keyword set is obtained.
For example, the second index information is 4, the binary number is 100, the negative number of the binary number is bitwise and operated to obtain 100, the lower 1 is the 2 nd digit from the right to the left, and the 2 is the index to find the corresponding keyword as { C }.
For example, the second index information is 12, the binary is 1100, 1100 is obtained by bitwise and operation on the negative number of the binary, the lower 1 is 2 nd digit from right to left, 1000 is obtained by subtracting 100 from 1100, and the lower 1 is 3 rd digit from right to left, so the corresponding keyword is { C, D }.
In the embodiment, each keyword list in the keyword set is obtained according to the index information corresponding to the keyword set, so that the implementation complexity is low and the efficiency is high.
In an embodiment, in order to facilitate the query, the correspondence needs to be stored first, such as the first correspondence and the second correspondence, the articles and the keywords need to be extracted from the HTML string first, and the extraction process of the articles and the keywords is described as follows:
as shown in fig. 3, the keyword tag "data-tags ═ substring" is searched from the start position of the HTML string.
If not, it shows that the article has no embedded keywords, then jump to the next HTML string.
If the keyword tag "data-tags" is found, assuming that the obtained result is the keyword P, taking this as a boundary, the search for the character string from the starting position of the HTML character string to the keyword P from the right side to the left is performed for < articule >, that is, the search for < articule > from the character string of the keyword P to the starting position of the HTML character string is performed, and the search for </articule > from this P to the ending position of the HTML character string from the left side to the right is performed at the same time, that is, the search for </articule > from the character string of the keyword P to the ending position of the HTML character string is performed. After the two marks are found, the article can be extracted. In order to obtain the embedded keyword set, the next substring "data-tags" is searched from the beginning of P to the end position of HTML, so that the embedded keyword set of the article can be extracted.
In one embodiment, for a huge amount of HTML strings, the method can be processed by writing a MapReduce task, namely extracting articles and embedded keywords through a parallel program.
Further, the method of the embodiment of the present application provides an effective data structure for supporting efficient query. In the embodiment of the present application, a dictionary is used as a main data structure because the dictionary can provide a query of a constant time O (1). That is, the first correspondence and the second correspondence in the foregoing embodiments are stored by a dictionary.
In the method according to the embodiment of the present application, two dictionaries are defined, one dictionary MATCH _ logkup stores matching information (i.e., stores the second correspondence), and the other dictionary NEXT _ TAGS _ logkup stores "NEXT group of keyword" information (i.e., stores the first correspondence).
The dictionary MATCH _ LOOKUP stores three kinds of correspondence among the second correspondence:
if there is a full match (i.e., the keyword set matches the embedded keyword set of an article), store: and the corresponding relation between the keyword set and the article character string.
If a keyword set is used as a subset of the embedded keyword set of an article, the article can be uniquely pointed to, and the following steps are stored: and the corresponding relation between the keyword set and the article character string.
If a keyword set is a subset of the set of embedded keywords of an article but does not uniquely point to the article, i.e., the keyword set is also a subset of the embedded keywords of other articles, storing: a keyword set, and a preset string (e.g., None). NONE may mean that it is not the only one pointing to a single article.
How to know which is a perfect match and which is a unique pointing article when updating this dictionary information? Therefore, the two cases need to be distinguished, and the storage mode in the embodiment of the present application may be modified as follows:
if the matching is complete, the corresponding relation between the character strings of the article is stored (keyword set, 1).
If the keyword set can uniquely point to a single article, the correspondence between (keyword set, 0) and the article character string is stored.
If the keyword set can not only point to a single article, storing the corresponding relation between (keyword set, 0) and None.
The article strings may be repeated in the dictionary MATCH _ lokup many times, for example, if the embedded keyword set of an article is { a, B, E }, then ({ a, E },0) ({ B, E },0) will appear in the dictionary MATCH _ lokup, and the article strings of the article will be copied many times. However, an article string may occupy a large space, so to reduce the space complexity, an additional list is required to store the correspondence between the article string and the index value. Instead of storing the article string in the MATCH _ logkup dictionary, the index value of the article string in this list, is stored. This is because the article string is unique, and the article string needs to be stored only once by this method, and the use of the index value in MATCH _ LOOKUP saves much space compared to the use of the article string. Therefore, the dictionary MATCH _ LOOKUP is stored in the following manner:
1. complete matching, storing: (keyword set, 1), and the correspondence between indexes (e.g., integers) of article strings;
2. a certain keyword set is used as a subset of an embedded keyword set of a single article, and uniquely points to the article, and stores: (keyword set, 0) corresponding to an index (e.g., integer) of an article string;
3. a keyword set is used as a subset of the embedded keyword set of a single article, but does not only point to the article, and stores: (keyword set, 0), and None.
In order to further reduce the spatial complexity, since the keyword set is a character string, the stored keyword set may be replaced by index information, for example, a numerical value obtained according to an index of the keyword, and the MATCH _ LOOKUP storage manner is, for example:
1. complete matching, storing: (index information (e.g., integer) corresponding to the keyword set, 1), and an index (e.g., integer) of the article string;
2. a certain keyword set is used as a subset of an embedded keyword set of a single article, and uniquely points to the article, and stores: (index information (e.g., integer) corresponding to the keyword set, 0), and an index (e.g., integer) of the article string;
3. a keyword set is used as a subset of the embedded keyword set of a single article, but does not only point to the article, and stores: (index information (e.g., integer) corresponding to the keyword set, 0), and None.
For the second dictionary NEXT _ TAGS _ LOOKUP, this dictionary only stores "the correspondence between the keyword sets and the NEXT set of keyword sets, i.e. the correspondence between the first keyword set and the second keyword set", so that the NEXT set of keyword sets can be searched through the dictionary and the current keyword set.
Since the keyword sets are used as the keys of the dictionaries in both dictionaries, but each keyword is a character string, the same keyword may appear multiple times (for example, in the NEXT _ TAGS _ LOOKUP, both the full set and the subset are stored), and in order to reduce the space complexity, the character string itself without the keyword may be considered.
But also to take into account that the keywords are unordered within the keyword set. Therefore, before storing in these two dictionaries, the keywords in the keyword set may need to be ranked first to facilitate later query, but if ranking is performed, the time complexity is increased greatly (because the best ranking algorithm may only have the lowest complexity of o (nlogn), where N is the length of the keyword set).
Therefore, in the embodiment of the present application, a value (e.g., an integer) obtained according to the index of the keyword is used to represent a keyword set. Spatial complexity can be optimized because storing a value saves space compared to storing a string, and the disorder of the keywords in the keyword set has no effect on the query because each keyword corresponds to a unique index (e.g., an integer), and no matter where in the keyword set the keyword appears, for example, only the index represented by the keyword needs to be calculated as an exponential function value and added to the value of the keyword set. Therefore, the NEXT _ TAGS _ LOOKUP dictionary is adjusted.
NEXT _ TAGS _ logkup stores: and the index information corresponding to the keyword set and the index information corresponding to the next group of keywords set.
Furthermore, another dictionary WORD _ AND _ INDEX is added on the basis of the original data structure AND is used for storing the association information of the keywords AND the indexes. Unlike the first two dictionaries, this dictionary can be queried from keywords to indexes, and also from indexes to keywords. The reason for querying the index from the keyword is to record the index corresponding to each keyword, and multiple indexes cannot be allocated to the same keyword, that is, the keyword and the index correspond to each other one to one. The reason for searching the keywords from the index is that the set of "next group of keywords" is also represented by the index information, and when determining the next group of keywords, all the keywords included in the next group of keywords need to be determined according to the index.
In summary, the following 3 dictionaries can be obtained, as shown in table 5 below:
TABLE 5
In an embodiment, the method of this embodiment further includes:
acquiring at least one first data and a keyword set of each first data;
for any keyword set of the first data, acquiring index information corresponding to the keyword set of the first data;
acquiring at least one first subset of the keyword set of the first data and index information corresponding to each first subset;
and storing the corresponding relation between the index information corresponding to each first subset and the index value corresponding to the first data according to the index information corresponding to the keyword set of each first data.
Specifically, the first data is, for example, an article character string, and it is assumed that the article character string and the set of embedded keywords have already been obtained.
Firstly, storing the character string of the article into an ARTICLES list, and returning the index value corresponding to the article.
Second, the set of keywords is processed. Traversing the keyword set, for each keyword in the keyword set, obtaining an INDEX corresponding to the keyword, such as an integer shown in table 1 (if the keyword does not exist in the WORD _ AND _ INDEX dictionary, giving it an integer value AND storing it in WORD _ AND _ INDEX), AND in addition, calculating INDEX information corresponding to the keyword set, that is, INDEX information corresponding to the keyword set of the first data, for example, calculating to obtain an integer value TAGS _ SUM corresponding to the keyword set.
Then, all the first subsets (and also the complete set) of each keyword set need to be obtained. Recursion can be used to deal with this problem. While recursion is performed, not only the index information corresponding to the current first subset, for example, an integer value, is calculated, but also the correspondence between the index information corresponding to each first subset and the index value of the article (i.e., the first data) corresponding to the first subset is determined, and an insertion operation is performed into two dictionaries, MATCH _ logkup and NEXT _ TAGS _ logkup.
In one embodiment, the insertion operation may be implemented as follows:
if the keyword sets of the first subset and the second data are the same, inserting the corresponding relation among the index information corresponding to the first subset, a second preset value and the index value of the second data into the second corresponding relation; wherein the first data comprises the second data;
if the first subset is a second subset of a keyword set of third data, determining whether the second corresponding relationship has the corresponding relationship between the index information corresponding to the first subset and a second preset value;
if so, and the second preset value is M2, modifying the quantity of the third data in the corresponding relationship of the index information corresponding to the first subset in the second corresponding relationship to a fourth preset value;
if not, inserting the corresponding relation among the index information corresponding to the first subset, a second preset value and the index value of the third data into the second corresponding relation; wherein the first data comprises the third data.
How the correspondence is stored in the dictionary (i.e., in the first correspondence and the second correspondence) is explained below. As shown in fig. 4:
the specific implementation process of the insertion is as follows:
(1) if the current keyword set (i.e. the first subset) is the complete set of the embedded keyword set of a certain article, i.e. the first subset is the same as the embedded keyword set of a certain article, the keyword set is directly inserted into the MATCH _ LOOKUP dictionary, i.e. the association between the keyword set and the character strings of the article is directly inserted into the MATCH _ LOOKUP dictionary, and the corresponding relationship between the index information corresponding to the first subset, the second preset value and the index values of the article is inserted, wherein the second preset value is, for example, M1, and is, for example, 1. Whether the current set of keywords is a subset of the set of in-line keywords of other articles, or whether this set of keywords already exists in the dictionary, assume that the full set of keyword sets matches have a higher priority. For NEXT _ TAGS _ LOOKUP, no action is done because the current set of keywords does not have a corresponding "NEXT set of keywords" set.
(2) If the current set of keywords (the first subset) is only a second subset of the set of inline keywords of an article, then the following are several cases:
(a) inserted into a MATCH _ LOOKUP dictionary.
And determining whether the index information corresponding to the first subset exists in the MATCH _ LOOKUP dictionary or not, and determining the corresponding relation between the index information and a second preset value, wherein if the same complete set as the keyword set already exists in the MATCH _ LOOKUP dictionary, in other words, (the keyword set (integer), 1) already exists in the dictionary, the index information is not inserted. The second preset value is, for example, M1, and is, for example, 1.
If there is a subset in the MATCH _ LOOKUP dictionary that is the same as this keyword set, in other words (keyword set (integer), 0) exists in the dictionary, then no insertion is made and the number of articles in the dictionary corresponding to (keyword set (integer), 0) is modified to a fourth predetermined value, e.g., None, since this means that the current keyword set may not uniquely point to a single article. It should be noted that it may not be deleted from the dictionary, since this subset may appear later in view of the repetitiveness of the subset. The second preset value is, for example, 0.
If neither case exists, the index values of the first subset and the corresponding articles are inserted into the MATCH _ LOOKUP dictionary, that is, the subset can only point to the single article, that is, the correspondence between the index information corresponding to the first subset, the second preset value and the index value of the third data is inserted, where the second preset value is, for example, M2, and is, for example, 0.
In the above embodiment, the index information corresponding to the first subset, the second preset value, and the index value corresponding to the first data are stored, so that the space complexity is low, and the query efficiency can be improved.
(b) Inserted into NEXT _ TAGS _ LOOKUP dictionary
In an embodiment, the method further comprises:
determining index information corresponding to a second keyword set according to the index information corresponding to the first subset and the index information corresponding to the keyword set of the first data;
if the index information corresponding to the first subset does not exist in the first corresponding relationship, inserting the corresponding relationship between the index information corresponding to the first subset and the index information corresponding to the second keyword set into the first corresponding relationship;
if the index information corresponding to the first subset exists in the first corresponding relationship, updating index information corresponding to a third keyword set in the first corresponding relationship according to the index information corresponding to the second keyword set, wherein the third keyword set is the keyword set corresponding to the first subset in the first corresponding relationship.
Specifically, it is assumed that index information, such as an integer value, corresponding to the full set of the keyword sets is already obtained, and in the recursive process, index information, such as an integer value, corresponding to the current first subset may be calculated. In order to obtain a "next group of keyword" set (i.e. a second keyword set) corresponding to the current subset, according to the index information corresponding to the full set of keyword sets and the index information corresponding to the first subset, the index information corresponding to the second keyword set, i.e. the index information corresponding to the next group of keyword set, is obtained, for example, the above two values are subjected to xor operation, so as to obtain the index information corresponding to the next group of keyword set. Then attempt to insert the subset and "NEXT set of keywords" set into NEXT _ TAGS _ LOOKUP.
Determining whether index information corresponding to the first subset exists in a NEXT _ TAGS _ LOOKUP dictionary;
if the current first subset does not exist in the NEXT _ TAGS _ LOOKUP dictionary, the insertion is directly performed, that is, the corresponding relationship between the index information corresponding to the first subset and the index information corresponding to the second keyword set is inserted.
If the current first subset already exists in the NEXT _ TAGS _ LOOKUP dictionary, the index information corresponding to the NEXT group of keyword sets corresponding to the first subset in the dictionary and the index information corresponding to the NEXT group of keyword sets to be inserted need to be merged, that is, the index information corresponding to the NEXT group of keyword sets corresponding to the first subset in the dictionary is updated.
In the above embodiment, by storing the index information corresponding to the first subset and the index information corresponding to the second keyword set, the spatial complexity is low, and the query efficiency can be improved.
For example, assume that the embedded keyword set of each article is unique, such as the following four articles:
TABLE 6
Defining an index corresponding to the keyword, A: 0, B: 1, C: 2, D: 3, E: 4, …, as shown in table 1.
The integer value of the index information corresponding to the keyword set corresponding to the article S1 is 20I.e. 1.
Keyword set corresponding to article S2The integer value corresponding to the index information is 20+21+22I.e., 1+2+4 ═ 7.
The integer value of the index information corresponding to the keyword set corresponding to the article S3 is 20+21+23I.e., 1+2+8 equals 11.
The integer value of the index information corresponding to the keyword set corresponding to the article S4 is 20+23+24I.e., 1+8+ 16-25.
The index values corresponding to the articles are shown in table 4.
Taking the article S2 as an example, the storage logic of the two dictionaries MATCH _ logkup and NEXT _ TAGS _ logkup is introduced. The keyword set corresponding to the S2 is { a, B, C }, all subsets of { a, B, C } are obtained through binary bit calculation, that is, { a, B, C }, { a, B }, { a, C }, { B, C }, { a }, { B }, and { C }, and the subsets are sequentially processed in a circulating manner.
The following description takes the processing subset { A, B } as an example:
for the dictionary NEXT _ TAGS _ LOOKUP:
performing XOR operation on the binary values corresponding to the { A, B, C } and the { A, B }, namely 111XOR operation 11 to obtain 100 (namely { C }), so that the NEXT set of key words corresponding to the { A, B } is { C }, and if the { A, B } does not exist in NEXT _ TAGS _ LOOKUP, directly storing the association relationship between the { A, B } and the { C } into
NEXT _ TAGS _ LOOKUP. For example, as shown in table 7:
TABLE 7
| Keyword set (integer value) | The next set of keywords (integer values) |
| 3 (i.e. { A, B }) | 4 (namely { C }) |
If { A, B } exists in NEXT _ TAGS _ LOOKUP, the corresponding value of the subset in the dictionary and the value to be inserted are merged. For example, { A, B } stored before merging is shown in Table 8, and { A, B } stored after merging is shown in Table 9:
TABLE 8
TABLE 9
| Keyword set (integer value) | The next set of keywords (integer values) |
| 3 (i.e. { A, B }) | 12 (i.e., { C, D }) |
For the dictionary MATCH _ logkup:
(1) if ({ A, B },1) exists in MATCH _ LOOKUP, i.e., a complete MATCH, then nothing is done and processing continues with the next set of keywords.
(2) If ({ A, B },0) exists in MATCH _ LOOKUP, then the value corresponding to { A, B } in the dictionary is changed to None, since this indicates that the current set of keywords may not uniquely point to a single article, and then processing continues with the next set of keywords. For example, as shown in table 10:
watch 10
| Keyword set (integer value) | Index of article (integer value) |
| 3 (i.e. { A, B }) | None |
If neither of (1) and (2) above holds, i.e., uniquely points to a single article, then the index values of { A, B } and the corresponding article are inserted into MATCH _ LOOKUP. For example, as shown in table 11:
TABLE 11
| Keyword set (integer value) | Index of article (integer value) |
| 3 (i.e. { A, B }) | n (integer index for article Sn) |
And sequentially and circularly processing the rest other subsets of the keyword set { A, B, C }, and inserting all the association relations into MATCH _ LOOKUP and NEXT _ TAGS _ LOOKUP to finish the preparation work of storing data.
In the above embodiment, the stored correspondence is low in spatial complexity, and the query efficiency can be improved.
Fig. 5 is a schematic structural diagram of a data query apparatus according to an embodiment of the present application, and as shown in fig. 5, the data query apparatus according to the embodiment includes:
an obtainingmodule 110, configured to obtain a first keyword set input by a user and first index information corresponding to the first keyword set, where the first index information is obtained according to an index of a keyword in the first keyword set;
a determiningmodule 111, configured to determine a second keyword set according to the first index information and the first corresponding relationship if it is determined that the number of the query results corresponding to the first keyword set is not the first preset value; the first correspondence includes: the first index information and second index information corresponding to the second keyword set are in corresponding relation;
aprocessing module 112, configured to determine, according to the second keyword set and the second corresponding relationship, a target query result corresponding to the second keyword set; the second correspondence includes: and the corresponding relation between the index information and the index value of the query result.
In a possible implementation manner, the second correspondence is a correspondence between the index information and a second preset value and an index value of a query result, where the second preset value is used to indicate that the keyword set completely matches the corresponding query result, or that the keyword set corresponds to a single query result or a non-single query result, and the determiningmodule 111 is specifically configured to:
determining a second preset value corresponding to the first keyword set according to the second corresponding relation and first index information corresponding to the first keyword set;
and determining the number of the query results corresponding to the first keyword set according to a second preset value corresponding to the first keyword set.
In a possible implementation manner, the determiningmodule 111 is specifically configured to:
if the second preset value is M1, determining that the number of the query results corresponding to the first keyword set is not the first preset value; m1 is an integer, or,
if the second preset value is M2 and the index value corresponding to the query result is not equal to the index value of the single query result, determining that the number of the query results corresponding to the first keyword set is not the first preset value; wherein M2 is not equal to M1.
In a possible implementation manner, the determiningmodule 111 is specifically configured to:
determining second index information corresponding to the second keyword set according to the first index information and the first corresponding relation;
and determining the second keyword set according to the second index information.
In a possible implementation manner, theprocessing module 112 is specifically configured to:
determining an index value of a target query result corresponding to the second keyword set according to the second keyword set and the second corresponding relation;
and determining the target query result corresponding to the second keyword set according to the index value of the target query result corresponding to the second keyword set.
In a possible implementation manner, the obtainingmodule 110 is specifically configured to:
for any keyword in the first keyword set, determining the value of an exponential function taking a as a base number and taking the index of the keyword as a power; a is an integer greater than 1
And taking the sum of the values of the exponential functions corresponding to the keywords in the first keyword set as first index information corresponding to the first keyword set.
In a possible implementation manner, the obtainingmodule 110 is further configured to:
acquiring at least one first data and a keyword set of each first data;
for any keyword set of the first data, acquiring index information corresponding to the keyword set of the first data;
acquiring at least one first subset of the keyword set of the first data and index information corresponding to each first subset;
theprocessing module 112 is specifically configured to:
and storing the corresponding relation between the index information corresponding to each first subset and the index value corresponding to the first data according to the index information corresponding to the keyword set of each first data.
In a possible implementation manner, theprocessing module 112 is specifically configured to:
if the keyword sets of the first subset and the second data are the same, inserting the corresponding relation among the index information corresponding to the first subset, a second preset value and the index value of the second data into the second corresponding relation; wherein the first data comprises the second data;
if the first subset is a second subset of a keyword set of third data, determining whether the second corresponding relationship has the corresponding relationship between the index information corresponding to the first subset and a second preset value;
if so, and the second preset value is M2, modifying the quantity of the third data in the corresponding relationship of the index information corresponding to the first subset in the second corresponding relationship to a fourth preset value;
if not, inserting the corresponding relation among the index information corresponding to the first subset, a second preset value and the index value of the third data into the second corresponding relation; wherein the first data comprises the third data.
In one possible implementation, theprocessing module 112 is further configured to:
determining index information corresponding to a second keyword set according to the index information corresponding to the first subset and the index information corresponding to the keyword set of the first data;
if the index information corresponding to the first subset does not exist in the first corresponding relationship, inserting the corresponding relationship between the index information corresponding to the first subset and the index information corresponding to the second keyword set into the first corresponding relationship;
if the index information corresponding to the first subset exists in the first corresponding relationship, updating index information corresponding to a third keyword set in the first corresponding relationship according to the index information corresponding to the second keyword set, wherein the third keyword set is the keyword set corresponding to the first subset in the first corresponding relationship.
In a possible implementation manner, if the first subset is the same as the keyword set of the second data, the second preset value is M1;
if the first subset is a second subset of the keyword set of the third data and the second corresponding relationship does not have the index information corresponding to the first subset, the second corresponding relationship is a corresponding relationship between the index information and a second preset value, and the second preset value is M2.
The apparatus of this embodiment may be configured to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 6, the electronic device includes:
aprocessor 210, and amemory 211 for storing executable instructions for theprocessor 210.
Optionally, the method may further include: acommunication interface 212 for enabling communication with other devices.
The above components may communicate over one or more buses.
Theprocessor 210 is configured to execute the corresponding method in the foregoing method embodiment by executing the executable instruction, and the specific implementation process of the method may refer to the foregoing method embodiment, which is not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method in the foregoing method embodiment is implemented.
An embodiment of the present application further provides a computer program product, including a computer program, where the computer program is executed by a processor to implement the method according to any one of the foregoing method embodiments, and specific implementation processes thereof may refer to the foregoing method embodiments, which implement similar principles and technical effects, and are not described herein again.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.