Disclosure of Invention
The invention aims to provide a satisfactory technical solution in the technical field of word frequency statistics so as to further apply the method to the related technical fields of word teaching, word retrieval and the like.
In order to solve the technical problems, the invention provides a word frequency statistics method and a word frequency statistics device, and adopts the following technical scheme:
in a first aspect, the present invention provides a word frequency statistics method for counting total number of occurrences of each word entry in a word set in an entire corpus, the method comprising:
step S1: setting an Index table Index of a linked list array structure storage, wherein each array element corresponds to a head pointer of a linear linked list and initializes a value of NULL, each node in the linear linked list comprises three fields, namely a word string S for storing the word item in the word set, a numerical value T for storing the occurrence number of the word item in the whole corpus and a pointer N of a storage position of a relay node after the current node;
step S2: for each word entry in the word set, firstly calculating a code value C1 by utilizing a hash function, then newly creating a node, assigning a word string field S of the node as the word entry and a numerical field T as 0, and finally inserting the node into a word set composed of array elements Index [ C1 ]]In the indicated linear chain table, the Hash function is c1=hash (U1 ,U2 ,U3 ……Ui ) Wherein C1 is a non-negative integer, i is the number of characters in the word entry, i is the maximum value of the number of characters in each word entry in the word set, and L and U are the maximum values of the number of characters in each word entry in the word set1 、U2 、U3 ……Ui A numerical value corresponding to a part of digits in the binary Unicode coding represented by each character in sequence in the word entry;
step S3: scanning the text of each file in the whole corpus, firstly calculating non-negative integer code values C2 of the text strings by utilizing the same hash function rule as the previous text strings aiming at all text strings with continuous characters not exceeding the L in number, and then sequentially reading each node in a linear chain table indicated by array elements Index [ C2] aiming at each text string and the corresponding calculated code value C2 thereof, wherein if the strings in the S domain of the node are completely the same as the text strings in comparison, the T domain value in the node is increased by 1;
step S4: and sequentially reading and outputting an S domain word string and a T domain numerical value in each node aiming at a linear linked list indicated by each non-NULL value array element in the Index table Index, wherein the S domain word string forms the word entry serving as a frequency statistics object, and the T domain numerical value correspondingly forms the total occurrence frequency of the word entry in the whole corpus.
Preferably, the subscript addressing range of the Index table Index [ ] is 0-65535, the C1 and the C2 are integers greater than or equal to 0 and less than 65536, and the numerical value corresponding to a part of digits in the binary Unicode in the step S2 is the numerical value corresponding to the lower 16 digits in the binary Unicode.
Preferably, the c1=hash (U1 ,U2 ,U3 ……Ui ) Is c1=u1 +U2 +U3 +……+Ui And when the result value calculated by the calculation rule is greater than or equal to 2≡16, C1 adjusts the binary value of the result value to represent the value corresponding to the middle and low 16 digits.
Preferably, the c1=hash (U1 ,U2 ,U3 ……Ui ) Is c1= |u1 -U2 -U3 -……-Ui I, or c1= |ui -Ui-1 -Ui-2 -……-U1 And when the result value calculated by the calculation rule is greater than or equal to 2≡16, C1 adjusts the binary value of the result value to represent the value corresponding to the middle-low 16-bit number.
Preferably, in the step S2, the node is inserted into a linear linked list indicated by the array element Index [ C1], and the insertion method is that the N pointer field in the node is assigned to the value of the array element Index [ C1], and then the value of the array element Index [ C1] is updated to the pointer value of the self-saved location generated by the node when the node is newly created.
Preferably, in said step S2 and said step S3, the sub-step of converting the character encoding form is further included: before the hash function is used to calculate the code value C1 of each word entry and before the hash function is used to calculate the code value C2 of each text string, the codes of each word entry and each character involved in each text string are converted from codes of other non-Unicode forms to codes of Unicode forms.
Preferably, in the step S4, the method further includes the substep of calculating the frequency percentage of word entries: dividing the total number of occurrences of each word entry in the word set in the whole corpus by the total number of characters in the whole corpus to calculate the percentage of the occurrence frequency of each word entry in the word set in the whole corpus.
In a second aspect, the present invention provides a word frequency statistics apparatus for counting a total number of occurrences of each word entry in a word set in an entire corpus, the apparatus comprising:
module M1 for: setting an Index table Index of a linked list array structure storage, wherein each array element corresponds to a head pointer of a linear linked list and initializes a value of NULL, each node in the linear linked list comprises three fields, namely a word string S for storing the word item in the word set, a numerical value T for storing the occurrence number of the word item in the whole corpus and a pointer N of a storage position of a relay node after the current node;
Module M2 for: for each word entry in the word set, firstly calculating a code value C1 by utilizing a hash function, then newly creating a node, assigning a word string field S of the node as the word entry and a numerical field T as 0, and finally inserting the node into a word set composed of array elements Index [ C1 ]]In the indicated linear chain table, the Hash function is c1=hash (U1 ,U2 ,U3 ……Ui ) Wherein C1 is a non-negative integer, i is the number of characters in the word entry, i is the maximum value of the number of characters in each word entry in the word set, and L and U are the maximum values of the number of characters in each word entry in the word set1 、U2 、U3 ……Ui A numerical value corresponding to a part of digits in the binary Unicode coding represented by each character in sequence in the word entry;
module M3 for: scanning the text of each file in the whole corpus, firstly calculating non-negative integer code values C2 of the text strings by utilizing the same hash function rule as the previous text strings aiming at all text strings with continuous characters not exceeding the L in number, and then sequentially reading each node in a linear chain table indicated by array elements Index [ C2] aiming at each text string and the corresponding calculated code value C2 thereof, wherein if the strings in the S domain of the node are completely the same as the text strings in comparison, the T domain value in the node is increased by 1;
Module M4 for: and sequentially reading and outputting an S domain word string and a T domain numerical value in each node aiming at a linear linked list indicated by each non-NULL value array element in the Index table Index, wherein the S domain word string forms the word entry serving as a frequency statistics object, and the T domain numerical value correspondingly forms the total occurrence frequency of the word entry in the whole corpus.
Preferably, in the modules M2 and M3, a sub-module for converting character encoding forms is further included, for: before the hash function is used to calculate the code value C1 of each word entry and before the hash function is used to calculate the code value C2 of each text string, the codes of each word entry and each character involved in each text string are converted from codes of other non-Unicode forms to codes of Unicode forms.
Preferably, in the module M4, a sub-module for calculating the frequency percentage of word entries is further included, and is configured to: dividing the total number of occurrences of each word entry in the word set in the whole corpus by the total number of characters in the whole corpus to calculate the percentage of the occurrence frequency of each word entry in the word set in the whole corpus.
The technical scheme of the invention has the following beneficial effects:
in the technical scheme design of word frequency statistics, the traversal process of the whole corpus is generally unavoidable, and the efficiency optimization direction of the technical scheme design can be an index or positioning strategy for each word entry in the word set, namely, how to find the data storage position for storing the total occurrence times of the word statistical entries in the corpus, namely, the addressing mode, is an important factor affecting the retrieval efficiency. In the prior art, each word item in the word set is positioned in a one-by-one or halved search mode; in the invention, all word entries in the word set are structured and built into an index table through a hash function rule, when traversing and reading texts in each file in a corpus, for each text word string formed by scanning, the same hash function rule is utilized to generate an addressing position of a data element and directly read a linked list at a corresponding position in the index table, then a word entry serving as a statistical object is searched in a limited number of nodes in the linked list, and in hash function calculation, unicode coding information naturally used by related characters in computer storage is directly utilized, so that direct mapping from the word entry in the word set or the text word string in the file to a specific linear linked list in the index table is realized.
In the technical scheme provided by the invention, for each text string read in the text of each file in the corpus, in the traversal of matching the text string with all word entries serving as statistical objects in the word set, the text string is not required to be searched one by one, but a certain word entry at a specific position in a word set index table is quickly read by utilizing coding information naturally carried by each text string and combining with hash function calculation, even if the number of nodes in one linked list directly mapped by the text string is possibly non-unique, the total number of the nodes is limited, so the technical scheme of the invention can be used for counting the time complexity T (m) or O (m) log of a word frequency statistical method in the prior art2 n) is reduced to O (m), the technical scheme of the invention is also particularly suitable for the application scene of counting the word frequency under the condition of more word items in the word set and larger corpus scale, and has more practical value significance.
Detailed Description
In order to more clearly illustrate the features of the technical solution of the present invention, the present invention will be further described in detail below by means of specific embodiments in combination with the accompanying drawings.
The word frequency statistics related by the invention refers to statistics of the frequency of each word item in the word set in the whole corpus for a given word set and corpus, wherein the frequency value can be expressed as the total number of occurrences of each word item in the whole corpus, the frequency value can also be expressed as the percentage of occurrences of each word item in the whole corpus, and the total number or percentage of occurrences can be further ordered if necessary, in fact, the frequency value expressed by the percentage and the ordering of the frequency value can obviously obtain certain rules in word use, and the word frequency statistics method has practical significance for realizing the utilization of word frequency statistics results. The term "a word entry in a word set" or "a text string" in a text in this document may be a single character, i.e., a word, or may be a plurality of consecutive characters, i.e., a word, which is commonly referred to as "two" cases. Nodes referred to herein are also referred to as nodes in certain textbooks and literature.
The specific meaning of the terms concerned in the present invention can be understood as appropriate to one of ordinary skill in the art.
As shown in FIG. 1, the invention provides a word frequency statistical method for counting the total occurrence times of each word item in a word set in the whole corpus, comprising the following steps:
step S1: setting an Index table Index of a linked list array structure storage, wherein each array element corresponds to a head pointer of a linear linked list and initializes a value of NULL, each node in the linear linked list comprises three fields, namely a word string S for storing the word item in the word set, a numerical value T for storing the occurrence number of the word item in the whole corpus and a pointer N of a storage position of a relay node after the current node;
step S2: for each word entry in the word set, firstly calculating a code value C1 by utilizing a hash function, then newly creating a node, assigning a word string field S of the node as the word entry and a numerical field T as 0, and finally inserting the node into a word set composed of array elements Index [ C1 ]]In the indicated linear chain table, the Hash function is c1=hash (U1 ,U2 ,U3 ……Ui ) Wherein C1 is a non-negative integer, i is the number of characters in the word entry, i is the maximum value of the number of characters in each word entry in the word set, and L and U are the maximum values of the number of characters in each word entry in the word set1 、U2 、U3 ……Ui A numerical value corresponding to a part of digits in the binary Unicode coding represented by each character in sequence in the word entry;
step S3: scanning the text of each file in the whole corpus, firstly calculating non-negative integer code values C2 of the text strings by utilizing the same hash function rule as the previous text strings aiming at all text strings with continuous characters not exceeding the L in number, and then sequentially reading each node in a linear chain table indicated by array elements Index [ C2] aiming at each text string and the corresponding calculated code value C2 thereof, wherein if the strings in the S domain of the node are completely the same as the text strings in comparison, the T domain value in the node is increased by 1;
step S4: and sequentially reading and outputting an S domain word string and a T domain numerical value in each node aiming at a linear linked list indicated by each non-NULL value array element in the Index table Index, wherein the S domain word string forms the word entry serving as a frequency statistics object, and the T domain numerical value correspondingly forms the total occurrence frequency of the word entry in the whole corpus.
In step S1 of the method of the present invention, an Index table is set in the computer memory, and because the Index table adopts a storage structure of linked list array, it can be expressed as Index [ ], as shown in FIG. 3, each array element in the Index table correspondingly stores pointer data of a storage position of a linear linked list head node. As shown in fig. 4, in the index table, each node in each linear chain table includes at least three fields, namely an S-string field, a T-number field, and an N-pointer field, where: the S-word string fields of all nodes store one word entry in the word set; the T numerical value domains of all nodes store the occurrence times of word entries in the S domain in the same node in the whole corpus; the N pointer field of the last node points to the storage position of the next direct successor node in the linear chain table, the NULL value of the N pointer field in the last node is used as a chain tail mark in each linear chain table, the storage position of the head node of the chain table is indicated by one data element in the chain table array Index, and the head node position of one linear chain table pointed by the value of each array element in the chain table array Index and the N pointer field value of each node in the chain table can read word items stored in each node in the chain table and the corresponding frequency statistical data thereof. All nodes in the same linked list store statistical results of the occurrence times of a plurality of word entries with the same code value in the word set in the whole corpus. The initialization data setting state of the Index table, namely the linked list array Index after the completion of creation is illustrated by fig. 7, wherein the initialization values of all array elements are NULL, that is, each linear linked list indicated by the linked list array Index does not actually exist in the initialization setting state; and the storage state of the Index table Index [ ] as an example after creation of the node data is illustrated by fig. 8 and 9.
In step S2 of the method, a word entry is required to be sequentially read from the word set, the code value C1 of the word entry is calculated by utilizing a hash function, then a node is created, initialization assignment is carried out on the node, finally the node is inserted into a linear linked list indicated by an array element Index [ C1] until all the word entries in the word set are respectively and correspondingly created in the linked list in the Index table, so that the aim of structurally building all the word entries in the whole word set into the Index table is fulfilled.
In step S3 of the method of the present invention, the text of each document in the whole corpus needs to be scanned, any text string with the number not exceeding L is formed for consecutive characters, which is compared with a word entry which is stored in a linear chain table at a corresponding position in the index table and is possibly matched, and the word entry is subjected to 1-added quantity adjustment if the comparison is consistent. The text of each document in the whole corpus is scanned by setting main and auxiliary pointers for the text of each document in the whole corpus, respectively pointing to two characters at the same or different positions in the text, wherein one character pointed by the main pointer moves one by one from the first character position of the whole text to the last character of the whole text on different characters, one character pointed by the auxiliary pointer moves one by one from the current pointed character of the main pointer to the last L character, and one character pointed by the auxiliary pointer moves one by one from the last L character, and in each pointing combination state of the scanning movements of the main pointer and the auxiliary pointer, the main pointer points to the characters and ends at the auxiliary pointer All consecutive characters between the two pointers to the character form a text string. In step S3 of the method of the present invention, when the text of each document in the whole corpus is scanned, reading of a text string at a specified position in the document is performed by a document object, and methods (or processes, functions, etc.) such as tell (), seek () and the like are usually provided in a computer programming language, and reading of a text string at a specified position and a specified length in the document text is implemented by using a read pointer indicating the current reading position of the document. In addition, in step S3 of the method of the present invention, the Hash function used is the same as the calculation rule of the Hash function used in step S2, and the code value C2 of the one text string formed is calculated using the same Hash function, that is, c2=hash (U1 ,U2 ,U3 ……Uj ) Wherein C2 is a non-negative integer, j is the total number of characters in the text string, and the maximum value is L, U1 、U2 、U3 ……Uj And sequentially forming numerical values corresponding to partial digits in the binary Unicode codes represented by the characters for the text strings.
In step S4 of the method, the statistical result of word frequency is derived according to the final data of the Index table after word entry statistics, and the method is that each array element in Index is read in turn, if the value is not NULL, each node data in a linear chain list indicated by the array element is read in turn, wherein the S domain word string forms the word entry as a frequency statistical object, the T domain value forms the total occurrence number of the word entry in the whole corpus, and finally the table is integrated to realize the statistical result of the total occurrence number of each word entry in the whole corpus.
The technical scheme of the method is that in short, the following steps are: in the process of traversing the whole corpus to count the word frequency, an index table with an auxiliary function is arranged for facilitating the rapid positioning of each word entry in the word set, and the task of word frequency statistics is finally completed through four links of data initialization, structured construction, data statistics and data derivation of the index table; when the structural construction of the index table is carried out on each word item in the word set and the text string is positioned in the index table by traversing the whole corpus, character coding information, namely Unicode, naturally carried by the related characters in the computer storage is directly generated through hash function operation so as to map the related word item and the text string into a linked list of related positions in the index table. In short, the arrangement of the data storage structure of the index table and the coding and addressing modes of the key word index table form the key points of the technical scheme of the invention.
In order to structurally store each word entry of the word set as a frequency statistics object, the index table integrally adopts a storage mode of a one-dimensional array and a linear chain table. The storage structure of the one-dimensional array is adopted, so that quick access to array elements is facilitated by using a code value C1 of a word entry and a code value C2 of a text string generated by a hash function as subscript addresses; the storage structure of the linear chain table is adopted because the specific constitution and distribution of the word entries as the frequency statistics object are not expected in advance, and the linear chain table storage structure which can be long or short and is convenient for flexible setting creates convenient conditions for data storage, and is easier in software implementation of an algorithm.
In the technical scheme of the method, aiming at word items in a word set and text strings in a corpus, the computer coding information carried by the word items is extracted, and the corresponding node positions of the word items in an index table can be positioned through simple calculation and a small amount of matching. The extraction of the code information carried by the text character is that the text character is stored in the form of built-in code and exchanged in the computer document system, and the built-in code information of each character in the composition of any character string, such as Unicode code of the character, can be naturally obtained. The Unicode coding value of each character in the character string is formed, a corresponding code value is generated through simple Hash function operation, and then a corresponding mapping relation is established between the code value and the subscript address of the Index table array Index, so that a direct addressing mode from the word entry or the text character string to the Index table is realized. Although linear linked lists employ a method of sequentially reading data from a node to a node from the node, since word entries or text strings are randomly and relatively uniformly mapped on a plurality of linear linked lists within a key index table, after a linear linked list pointed to by a linked list array element is obtained, the time complexity consumed for reading all nodes on the linked list is typically much less than the time complexity consumed for reading the entire document text.
In the method of the invention, the code value of each word item in the word set and the text word string read in the corpus generated by the calculation of the same hash function is always used as the subscript access address of each linked list array in the index table, the linear linked list is positioned in a positioning mode other than sequential searching or halving searching, and the corresponding node storage position of one word item in the word set can be accurately positioned through the node matching operation of limited times under the condition that the linked list is not empty, so that the node data addressing efficiency is higher in the whole process.
It should be noted that, although the code value C1 of the word entry or the code value C2 of the text string and the Index value of the linked list array Index are mapped in a one-to-one correspondence, the word entry or the text string itself and the Index address of the linked list array are not in a one-to-one correspondence, in other words, the code values generated by two word entries or text strings having different character compositions or numbers after hash function operations may be the same, in which case, the nodes for storing the occurrence frequencies of two different word entries will be stored in the same linear linked list, and these two word entries having the same code value but different character compositions or word lengths may be so-called parity relations. In addition, a linked list corresponding to the code value of a text string, although present in the index table, does not necessarily have the same word entries stored in all nodes in the linked list as the text string.
The Index table Index is stored with different word entries in the word set with the same code value C1, so after the text string is calculated to generate a corresponding code value C2 through the hash function rule, even if a chain table indicated by the array element Index [ C2] is not empty, the string stored in the S domain of all nodes in the chain table is not necessarily included in the current text string. Therefore, the word strings stored in the S fields of all nodes in the linked list indicated by Index [ C2] and the current text word string can only be calculated as suspected matches, the hash function calculation process of the code value C2 and the addressing process of inquiring the Index [ C2] value can also be calculated as a fuzzy retrieval process, and after each suspected matched word item and the current text word string are subjected to accurate matching operation, the method can obtain whether the current text word string belongs to one word item serving as a statistical object in the word set. The meaning of the suspected matching and fuzzy retrieval is that the workload of matching the text strings with each word item in the word set is greatly reduced. Under the condition of successful fuzzy matching, the exact conclusion of whether the current text string belongs to one word item of the word set as a statistical object can be obtained only through a small number of times of precise matching processes, and all word items in the whole word set do not need to be traversed, so that the work efficiency of word frequency statistics is improved finally.
As a preferred embodiment, the subscript addressing range of the Index table Index [ ] is 0-65535, the C1 and the C2 are integers greater than or equal to 0 and less than 65536, and the numerical value corresponding to a part of digits in the binary Unicode in the step S2 is the numerical value corresponding to the lower 16 digits in the binary Unicode.
In the preferred embodiment, not only is the calculation result value of the hash function limited to an integer in the range of 0 to 65536, but also the argument of the hash function is limited to the value corresponding to the lower 16 bits in the character Unicode code involved in the word entry or text string. According to Unicode coding specifications, unicode uses four bytes at most and maps different characters in various language words by using codes 0-0 x0010FFFF, in the invention, whether the code value C1 of each word entry in a word set or the code value C2 of each text string in a corpus is calculated, the argument in a hash function of the Unicode adopts the low 16 bits in the related character binary Unicode code value, specifically, for the characters stored in double bytes, the value corresponding to the Unicode code is directly taken as the argument in the hash function, and for the characters stored in three bytes or four bytes, the value corresponding to the low 16 bits in the Unicode is taken as the argument in the hash function.
In the initialization setting of the Index table, the number of subscript addressing bits of the linked list array Index is 16, so that the size of the addressing range, that is, the number of array elements contained in the array, that is, the total number of linear linked lists is 2≡16 (=64k=65536), and the subscript addressing range of the array elements storing the head node pointers of the linked list corresponds to 0 to (2≡16-1), or 0 to 65535; on the other hand, the code value C1 of the word entry in the word set and the code value C2 of the Chinese string in the corpus are integers ranging from 0 to 65535, which creates the possibility for the element access of the word entry and the text string in the Index table array Index, and the aim of structuring the word entry into the Index table is to realize the convenience of addressing.
As a preferred embodiment, the c1=hash (U1 ,U2 ,U3 ……Ui ) Is c1=u1 +U2 +U3 +……+Ui And when the result value calculated by the calculation rule is greater than or equal to 2≡16, C1 adjusts the binary value of the result value to represent the value corresponding to the middle and low 16 digits.
In the preferred embodiment, the Unicode code values corresponding to the related characters are added, so that the calculation method is simpler, the execution efficiency is higher, and the occupied calculation resources are less.
In the addition operation, U1 、U2 、U3 ……Ui When the sum of the function operation results generates high-order overflow, i.e. the sum is greater than or equal to 2≡16, the overflow high-order is discarded and only 16-bit binary numbers in two low bytes are reserved, and the value expressed as decimal is obtainedThe code value C1, which is the result of the final hash function calculation, is always less than 2A 16, otherwise the array element Index is accessed]The problem of out-of-limit of the reading and writing of the array can occur. In summary, the overflow high order bits are discarded to ensure that the code value C1 is limited to a range of 0 to (2≡16-1) data values to match the linked list array Index [ []Is used for the subscript value range of (2).
In addition, two low-byte data instead of high-byte data are taken as the operation result of the final code value C1, and the reason is that the code value distribution of the data in the two low-byte data after addition operation is more random and uniform, and the dispersion performance is better.
The inventor tries to take a default and expansion-defined word input unit library in each mainstream Chinese input method at present as a test object, gathers the word libraries of each input method according to the addition operation adopted in the preferred embodiment, eliminates repeated data, calculates and counts hash function digital values of 223678 non-repeated words, and discovers that the maximum number of a group of words with the same code value is 21. Therefore, although the linear chain table adopts an algorithm of sequentially reading and writing, namely sequentially reading data of each node, in the linear chain table formed by at most 21 nodes, the time complexity generated by the data reading process performed by each node is controllable after all, and the time complexity is relative to the quantity scale of word entries in the whole word set.
For a group of words whose code values are repeated, the calculation process and calculation values of the code value C1 of each character Unicode value and the entire word in the character configuration may be partially as follows: (all expressed in decimal numbers)
And (3) carrying out the following steps: 28866+28459=57325;
waste islands: 33618+23707=57325;
jumping out: 36339+20986=57325;
listening and evidence: 21548+35777=57325;
and (3) funding: 36164+21161=57325;
white runs for one time: 30333+36305+19968+36255= 122861,122861-65536 =57325;
……
the process of building the Index table Index [ ] and counting the number of occurrences of each word entry in the word set according to various text strings occurring in the corpus is described in detail below by specific cases with reference to the accompanying drawings:
FIG. 5 shows, as an example, a word set requiring statistics of the total number of occurrences of each word entry in the corpus, and listing, as an example, seven word entries: "equivalent", "control", "carry", "safety", "discard", "not at all" and "walk in rain".
Based on the initialization state of the index table shown in fig. 7, the storage structure of the index table is built according to the seven word entries example shown in fig. 5, and for this purpose, the code values of the seven word entries should be calculated first, where the addition rule as the hash function of the preferred embodiment is adopted, and the code value calculation process is as follows:
Equal amount of: 31561+37327= 68888, 68888-65536=3352;
and (3) control: 25511+21046= 46557;
and (3) carrying: 25645+3673= 62378;
safety: 23433+20840+24615= 68888, 68888-65536=3352;
and (5) picking off: 21076+25481= 46557;
also at all: 19968+28857+20063= 68888, 68888-65536=3352;
walk in rain: 38632+20013+25955+27493= 112093, 112093-65536= 46557.
In calculating the code values of the above seven word entries, four entries of "equal amount", "security", "both" and "walk in rain" have occurred in the case where the higher order parts exceeding 16 bits in the binary value representation of the sum of the code values of the respective characters overflow and are discarded.
The storage structure of the Index table after completion of construction is shown in fig. 8, seven word entries use the respective code value calculation results as subscript addresses of linked list array Index [ ], nodes are created in one linear linked list indicated by a corresponding array element in the linked list array Index [ ], and the number of times of initialization occurrence frequency recorded in each node is set to 0.
Fig. 6 shows the text of a document in the corpus as an example, the number of occurrences of each word entry in the word set in fig. 5 in the text of fig. 6 is counted, for this purpose, the whole text is scanned by using two pointers, namely a main pointer and an auxiliary pointer, for each text string formed in the scanned whole text, the code value of each text string is calculated and compared with one word entry stored in the linked list node at the corresponding position in the index table, and if the comparison matches, the frequency value stored in the corresponding node is increased by 1. Taking four text strings "security", "control", "questions" and "let cloud" in fig. 6 as examples, the code value is calculated by using the same hash function rule as before, and the process is as follows:
Safety: 23433+20840+24615= 68888, 68888-65536=3352;
and (3) control: 25511+21046= 46557;
the problem is that: 39064+65292+23558=127914, 127914-65536=62378;
let cloud: 35753+20113= 55866.
The index table storage structure after completion of statistics for the above four text strings is shown in fig. 9, in which:
the code value of the word string 'security' is 3352, index [3352] is queried, and the word string 'security' is successfully matched with each node in a linked list indicated by Index [3352], which indicates that 'security' belongs to one word entry which is taken as a statistical object in a word set, and the occurrence number of records in the 'security' node in an Index table is adjusted to 2 on the basis that the original occurrence number is 0 because the security is twice in the text of FIG. 6;
the code value of the word string control is 46557, index [46557] is queried, and the word string control is successfully matched with each node in a linked list indicated by Index [46557], which indicates that the control belongs to a word entry of a word set as a statistical object, and the occurrence number of records in the control node in the Index table is adjusted to 1 on the basis that the original occurrence number is 0 because the control appears 1 time in the text of FIG. 6;
The code value of the word string 'question is 62378, index [62378] is queried, and although the value of the word string' question 'is not NULL, the word string' question 'is not matched with the S domain word string' piggy-backed 'in a unique node in a linked list indicated by the Index [62378], which indicates that the' question 'is word entries taking a' non-word set as a statistical object;
the string "let cloud" has a code value of 55866, query Index [55866], and a value of NULL, which indicates that "let cloud" is also not a word entry in the word set as a statistical object.
As a preferred embodiment, the c1=hash (U1 ,U2 ,U3 ……Ui ) Is c1= |u1 -U2 -U3 -……-Ui I, or c1= |ui -Ui-1 -Ui-2 -……-U1 And when the result value calculated by the calculation rule is greater than or equal to 2≡16, C1 adjusts the binary value of the result value to represent the value corresponding to the middle-low 16-bit number.
In the preferred embodiment, the scheme idea of adopting the subtracting operation rule is the same as that of the previous adding operation, because compared with other operation rules, the results generated by the two operations are distributed more uniformly in the code value resources of 0-65535, which can make full use of the addressing space of the index table, namely, the length of each linear linked list tends to be minimized by increasing the number of the linear linked list as far as possible in the index table, so as to reduce the average time consumed by one linked list when node data is sequentially read.
In the preferred embodiment, under the calculation rule that the difference value is first obtained and then the absolute value is obtained, and under the calculation rule that the sum value in the previous preferred embodiment, the code value distribution of the calculated result value is more uniform in the range of 0-65535, because although Unicode codes have interruption and vacancy phenomena on the whole, unicode codes of different characters are linearly continuous in part and have no problem of repeated codes, the numerical value of Unicode codes of a plurality of characters after addition and subtraction operation still keeps linear characteristics, and especially under the rule that only two low-byte data are reserved when the addition and subtraction operation result generates high-order overflow, the original interrupted and vacant coding resources can be randomly occupied, so that the whole coding resources in the range of 0-65535 can be fully utilized.
As a preferred embodiment, in the step S2, the node is inserted into a linear linked list indicated by an array element Index [ C1], and the insertion method is that the N pointer field in the node is assigned to the value of the array element Index [ C1], and then the value of the array element Index [ C1] is updated to the pointer value of the self-saved position generated by the node when the node is newly created.
In the preferred embodiment, the newly created linked list nodes are inserted into the head positions of the linear linked list to become new head nodes, which improves the execution efficiency of the index table when the linked list nodes are inserted. According to the characteristic that node data are sequentially read from the linear linked list, if each new node is inserted into the tail position of the linear linked list after being newly created, the tail position of the chain to be inserted can be found after the N pointer domain values in each node in the linked list are sequentially read from the head node pointer of the linked list each time. Otherwise, if each newly created node is inserted into the head position of the linked list, the time-consuming process of sequentially reading all nodes in the linked list can be avoided.
As a preferred embodiment, in said step S2 and said step S3, the sub-steps of character encoding form conversion are further included: before the hash function is used to calculate the code value C1 of each word entry and before the hash function is used to calculate the code value C2 of each text string, the codes of each word entry and each character involved in each text string are converted from codes of other non-Unicode forms to codes of Unicode forms.
In the technical scheme of the invention, the character coding based on hash function operation finally adopts a Unicode coding scheme, unicode is used as a character coding standard, and the Unicode comprises letters, numbers, punctuation marks and special symbols in almost all languages in the world, and is one of a character set and a character coding scheme which are widely used worldwide at present, especially in the fields of word processing and document preservation, the character coding supports a multi-language environment, and the unified standard and the like of the character coding also lead to the problems of incapability of recognition or messy codes and the like caused by different character sets in the data transmission process. When each word entry as a frequency statistics object stored in a word set and each text word string appearing in a corpus adopt a Unicode character coding scheme, hash operation can be naturally performed directly according to Unicode code values of each character in the related word string composition when an index table is created and read, but when the word set and the corpus are not stored in a Unicode coding format, the applicability problem of the technical scheme of the invention needs to be considered, although the Unicode character coding format has universality and trend in document storage. The character coding scheme other than Unicode standard is commonly used in the standards of GB2312, GBK, GB18030 and the like as far as the character coding scheme is implemented by the national standard of China.
In the preferred embodiment, for storage and interface reasons, the word set and corpus documents processed by the technical scheme of the invention adopt special cases of non-Unicode coding standards, and before hash function calculation is executed, for each word entry read from the word set and each text string read from the corpus, a Unicode coding value of each constituent character is firstly generated, then the code value calculation of the hash function is carried out, and fortunately, most of software development languages at present provide the function of the character coding form conversion.
As a preferred embodiment, in the step S4, the method further includes the substep of calculating the frequency percentage of word entries: dividing the total number of occurrences of each word entry in the word set in the whole corpus by the total number of characters in the whole corpus to calculate the percentage of the occurrence frequency of each word entry in the word set in the whole corpus.
In the preferred embodiment, the percentage of the occurrence frequency of each word item in the word set in the whole corpus is counted, and even the word items are further ordered, so that the purpose of finding the word usage rule is to better apply the frequency rule result to the related technical fields of word teaching, word retrieval and the like.
In order to more clearly describe the specific implementation manner of the technical scheme of the invention, the working process of the relevant steps in the method of the invention is described in detail through a flow chart, wherein:
the flow chart of fig. 11 shows the operation of step S2; the flow chart of fig. 12 shows the operation of step S3; the flow chart of fig. 13 shows the operation of said step S4.
As shown in fig. 2, the present invention provides a word frequency statistics apparatus for counting the total number of occurrences of each word entry in a word set in an entire corpus, the apparatus including:
module M1 for: setting an Index table Index of a linked list array structure storage, wherein each array element corresponds to a head pointer of a linear linked list and initializes a value of NULL, each node in the linear linked list comprises three fields, namely a word string S for storing the word item in the word set, a numerical value T for storing the occurrence number of the word item in the whole corpus and a pointer N of a storage position of a relay node after the current node;
module M2 for: for each word entry in the word set, firstly calculating a code value C1 by utilizing a hash function, then newly creating a node, assigning a word string field S of the node as the word entry and a numerical field T as 0, and finally inserting the node into a word set composed of array elements Index [ C1 ] ]In the indicated linear chain table, the Hash function is c1=hash (U1 ,U2 ,U3 ……Ui ) Wherein C1 is a non-negative integer, i is the number of characters in the word entry, i is the maximum value of the number of characters in each word entry in the word set, and L and U are the maximum values of the number of characters in each word entry in the word set1 、U2 、U3 ……Ui A numerical value corresponding to a part of digits in the binary Unicode coding represented by each character in sequence in the word entry;
module M3 for: scanning the text of each file in the whole corpus, firstly calculating non-negative integer code values C2 of the text strings by utilizing the same hash function rule as the previous text strings aiming at all text strings with continuous characters not exceeding the L in number, and then sequentially reading each node in a linear chain table indicated by array elements Index [ C2] aiming at each text string and the corresponding calculated code value C2 thereof, wherein if the strings in the S domain of the node are completely the same as the text strings in comparison, the T domain value in the node is increased by 1;
module M4 for: and sequentially reading and outputting an S domain word string and a T domain numerical value in each node aiming at a linear linked list indicated by each non-NULL value array element in the Index table Index, wherein the S domain word string forms the word entry serving as a frequency statistics object, and the T domain numerical value correspondingly forms the total occurrence frequency of the word entry in the whole corpus.
As a preferred embodiment, in said module M2 and said module M3, a sub-module for character encoding format conversion is further included for: before the hash function is used to calculate the code value C1 of each word entry and before the hash function is used to calculate the code value C2 of each text string, the codes of each word entry and each character involved in each text string are converted from codes of other non-Unicode forms to codes of Unicode forms.
As a preferred embodiment, in the module M4, a sub-module for calculating the frequency percentage of word entries is further included, where: dividing the total number of occurrences of each word entry in the word set in the whole corpus by the total number of characters in the whole corpus to calculate the percentage of the occurrence frequency of each word entry in the word set in the whole corpus.
Finally, it should be noted that, although the present invention has been illustrated by the specific embodiments, it should not be construed as limiting the scope of the invention, and those skilled in the art should understand that various equivalent substitutions and optimization modifications can be made to the specific embodiments of the present invention without departing from the spirit and scope of the invention.