Disclosure of Invention
In view of this, embodiments of the present invention provide a dictionary construction method and apparatus, which can use a text classification model constructed based on a semi-supervised learning algorithm to automatically expand and update a commodity element dictionary, so that the constructed commodity element dictionary is more complete and accurate.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a dictionary construction method including:
dividing sentences in the target text into one or more clauses according to punctuation marks;
predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm;
calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability;
adding the word as an element word of the commodity element to the commodity element dictionary in a case where a second probability that the word belongs to the commodity element is greater than a second threshold probability.
Optionally, training the text classification model based on the semi-supervised learning algorithm includes:
obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element;
inputting the training data into the text classification model to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function;
predicting a third probability that a clause not containing the commodity element belongs to the commodity element by using the optimized text classification model;
and under the condition that the third probability is greater than the first threshold probability, adding the clauses and the third probability corresponding to the clauses to the training data so as to continuously optimize the text classification model.
Optionally, the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, includes:
judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary;
if the clause contains the element words of the commodity elements, the first probability is a first value;
and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value.
Optionally, the calculating, when a first probability that the clause is attributed to the commodity element is greater than a first threshold probability, a second probability that a word other than an element word currently included in the commodity element in the clause is attributed to the commodity element includes:
determining the probability that the word belongs to the commodity element in different clauses according to the probability that the clauses belong to the commodity element;
calculating an average of the probabilities that the word belongs to the commodity element in the different clauses to obtain the second probability.
Optionally, in a case that a first probability that the clause belongs to the commodity element is greater than a first threshold probability, further comprising:
calculating the occurrence frequency of the words in all the clauses belonging to the same commodity element;
determining one or more words according to the sequence of the appearance frequency of the words from high to low so as to add the words as element words of the commodity elements to the commodity element dictionary.
Optionally, the method further comprises:
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to a forbidden element word list;
in the case where the word does not belong to the list of forbidden element words, calculating a second probability that the word belongs to the commodity element.
Optionally, the method further comprises:
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to stop words or not;
in the event that the word does not belong to a stop word, calculating a second probability that the word belongs to the commodity element.
To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided a dictionary construction apparatus including a clause acquisition module, a first probability prediction module, a second probability calculation module, and a dictionary expansion module; wherein,
the clause acquisition module is used for dividing sentences in the target text into one or more clauses according to punctuation marks;
the first probability prediction module is used for predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm;
the second probability calculation module is configured to calculate a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability;
the dictionary expansion module is configured to add the word as a component word of the commodity component to the commodity component dictionary when a second probability that the word belongs to the commodity component is greater than a second threshold probability.
Optionally, the apparatus further comprises a classification model training module; wherein the classification model training module is configured to:
obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element;
inputting the training data into the text classification model to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function;
predicting a third probability that a clause not containing the commodity element belongs to the commodity element by using the optimized text classification model;
and under the condition that the third probability is greater than the first threshold probability, adding the clauses and the third probability corresponding to the clauses to the training data so as to continuously optimize the text classification model.
Optionally, the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, includes:
judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary;
if the clause contains the element words of the commodity elements, the first probability is a first value;
and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value.
Optionally, the calculating, when a first probability that the clause is attributed to the commodity element is greater than a first threshold probability, a second probability that a word other than an element word currently included in the commodity element in the clause is attributed to the commodity element includes:
determining the probability that the word belongs to the commodity element in different clauses according to the probability that the clauses belong to the commodity element;
calculating an average of the probabilities that the word belongs to the commodity element in the different clauses to obtain the second probability.
Optionally, in a case that a first probability that the clause belongs to the commodity element is greater than a first threshold probability, further comprising:
calculating the occurrence frequency of the words in all the clauses belonging to the same commodity element;
determining one or more words according to the sequence of the appearance frequency of the words from high to low so as to add the words as element words of the commodity elements to the commodity element dictionary.
Optionally, the second probability calculation module is further configured to,
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to a forbidden element word list;
in the case where the word does not belong to the list of forbidden element words, calculating a second probability that the word belongs to the commodity element.
Optionally, the second probability calculation module is further configured to,
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to stop words or not;
in the event that the word does not belong to a stop word, calculating a second probability that the word belongs to the commodity element.
To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided a dictionary construction electronic device including:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any one of the dictionary construction methods described above.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing any one of the dictionary construction methods described above.
One embodiment of the above invention has the following advantages or benefits: dividing sentences in the target text into one or more clauses according to punctuation marks; predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm; calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability; adding the word as an element word of the commodity element to the commodity element dictionary in a case where a second probability that the word belongs to the commodity element is greater than a second threshold probability. Based on the method, the automatic expansion of the commodity element dictionary is realized, a large amount of manual labeling is not needed, the efficiency of constructing the commodity element dictionary is improved, the problem of poor consistency caused by manual labeling is avoided, and the quality of the commodity element dictionary is improved.
Further effects of the above-mentioned optional embodiments will be described below in connection with specific embodiments.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of a dictionary construction method according to an embodiment of the present invention, and as shown in fig. 1, the dictionary construction method may specifically include the following steps:
step S101, dividing the sentences in the target text into one or more clauses according to punctuation marks.
Specifically, take the sentence "The mobile phone adopts a full screen, the screen occupation ratio is up to 90%, and the screen brightness is high." as an example. According to the punctuation marks "," and "." in the sentence, the sentence can be divided into the following three clauses:
Clause 1: the mobile phone adopts a full screen
Clause 2: the screen occupation ratio is up to 90%
Clause 3: the screen brightness is high
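The splitting in step S101 can be sketched as follows. This is a minimal illustration only; the delimiter set and the function name `split_into_clauses` are assumptions for illustration, not part of the disclosure:

```python
import re

# Punctuation marks treated as clause boundaries (an assumed set).
CLAUSE_DELIMITERS = r"[,.;!?]"

def split_into_clauses(sentence: str) -> list:
    """Divide a sentence into clauses at punctuation marks."""
    parts = re.split(CLAUSE_DELIMITERS, sentence)
    return [p.strip() for p in parts if p.strip()]

sentence = ("The mobile phone adopts a full screen, "
            "the screen occupation ratio is up to 90%, "
            "and the screen brightness is high.")
print(split_into_clauses(sentence))
```

For a Chinese-language target text, the delimiter set would instead contain the full-width marks "，" and "。".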
Step S102, a text classification model trained based on a semi-supervised learning algorithm is used for predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary.
Semi-supervised learning algorithms that may be employed include, but are not limited to, generative semi-supervised models, self-training, co-training, semi-supervised support vector machines (S3VMs), BERT (Bidirectional Encoder Representations from Transformers), and the like.
The commodity element dictionary comprises one or more commodity types, one or more commodity elements corresponding to each commodity type, and the element words corresponding to each commodity element. The pre-constructed commodity element dictionary is completed based on manual labeling: drawing on actual experience, one or more commodity elements, each with one or more element words, can be pre-constructed for a commodity type, and further element words are then added to the commodity elements while labeling clauses in commodity description texts. Meanwhile, the process need not be limited to the predefined commodity elements; new commodity elements can be added, or existing unreasonable ones deleted, according to the actual situation. For example, when labeling the two adjacent clauses "the mobile phone adopts a ceramic body" and "feels warm" in the same sentence, it can be seen that "ceramic" can serve as an element word of the commodity element "material", while "warm" describes the feel of the mobile phone; since no such element exists among the predefined commodity elements, a new commodity element "feel" can be added to the commodity element dictionary, with corresponding element words including "warm" and "feel". Specifically, an example of the pre-constructed commodity element dictionary is shown in Table 1 below.
TABLE 1 Commodity element dictionary example
In an alternative embodiment, training the text classification model based on the semi-supervised learning algorithm includes: obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element; inputting the training data into the text classification model to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function; predicting a third probability that a clause not containing the commodity element belongs to the commodity element by using the optimized text classification model; and under the condition that the third probability is greater than the first threshold probability, adding the clauses and the third probability corresponding to the clauses to the training data so as to continuously optimize the text classification model.
That is to say, in the process of training the classification model, a small amount of training data obtained by manual labeling can be used for training in advance, and the probability that the clause belongs to the commodity element is used for adjusting the loss function of the sample, so that a more accurate and real loss function is obtained, and further, the optimization of the text classification model is realized. On the basis, the probability that the clause not containing the commodity element belongs to the commodity element is predicted by using the text classification model, and on the basis that the third probability that the predicted clause belongs to the commodity element is larger than the first threshold probability, the clause and the corresponding third probability are added into training data to realize continuous expansion of the training data so as to further optimize the text classification model.
Still further, the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, comprises: judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary; if the clause contains the element words of the commodity elements, the first probability is a first value; and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value. The first value and the second value are preset according to actual requirements, generally, the second value is smaller than the first value, for example, the first value is 1, and the second value is 0.8.
Specifically, referring to Table 1, the sentence "The mobile phone adopts a full screen, the screen occupation ratio is up to 90%, and the screen brightness is high." is taken as an example; it includes the following three clauses:
Clause 1: the mobile phone adopts a full screen
Clause 2: the screen occupation ratio is up to 90%
Clause 3: the screen brightness is high
Since clause 1 contains "full screen", an element word of the commodity element "screen", the first probability that clause 1 belongs to the commodity element "screen" is determined to be the first value (e.g., 1); clause 3 contains the element word "screen" of the commodity element "screen", so the first probability that clause 3 belongs to the commodity element "screen" is likewise determined to be the first value (e.g., 1). Clause 2 does not contain any element word of a commodity element included in the commodity element dictionary, but since clauses 1 and 3, which are adjacent to clause 2 in the same sentence, both contain an element word of the commodity element "screen", clause 2 is highly likely to be related to the commodity element "screen", and the first probability that clause 2 belongs to the commodity element "screen" is set to the second value (e.g., 0.8). In this way, clauses in a small number of commodity description texts can be labeled based on the commodity element dictionary, and the values of their corresponding first probabilities determined, so as to obtain training data. It should be noted that, when acquiring the training data, only clauses that contain an element word of a commodity element, or that have at least one adjacent clause in the same sentence containing one, are considered; for other clauses it cannot be determined whether they are related to a commodity element, so they are not considered in this step.
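The labeling rule above can be sketched as follows. This is a minimal sketch under the stated rule; the function name `label_clauses`, the substring matching, and the example data are assumptions for illustration (the values 1 and 0.8 come from the text):

```python
FIRST_VALUE = 1.0    # clause contains an element word of the commodity element
SECOND_VALUE = 0.8   # only an adjacent clause in the same sentence contains one

def label_clauses(clauses, element_words):
    """Assign a first probability to each clause of one sentence.

    Returns (clause, probability) pairs; clauses whose relation to the
    commodity element cannot be determined are skipped entirely.
    """
    contains = [any(w in c for w in element_words) for c in clauses]
    labeled = []
    for i, clause in enumerate(clauses):
        if contains[i]:
            labeled.append((clause, FIRST_VALUE))
        elif (i > 0 and contains[i - 1]) or (i + 1 < len(clauses) and contains[i + 1]):
            labeled.append((clause, SECOND_VALUE))
    return labeled

clauses = ["the phone adopts a full screen",
           "occupation ratio is up to 90%",
           "the screen brightness is high"]
print(label_clauses(clauses, {"full screen", "screen"}))
```

The middle clause receives 0.8 because it contains no element word itself but both of its neighbors do.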
Step S103, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability, calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element.
Wherein the first threshold probability is any value in the interval 0 to 1, such as 0.8, 0.7, etc., set according to practical experience. Specifically, the text classification model may predict, for each clause, a first probability of belonging to each commodity element included in the commodity element dictionary, and the first probabilities of the same clause belonging to the different commodity elements sum to 1. On this basis, the commodity element to which the clause belongs can be determined from the maximum of these first probabilities; it is then judged whether this first probability is greater than the first threshold probability. If so, the attribution of the clause to the commodity element is credible; if not, it is not credible. Only when the attribution of the clause to the commodity element is credible is the second probability that a word in the clause belongs to the commodity element calculated.
In an alternative embodiment, the calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element when a first probability that the clause belongs to the commodity element is greater than a first threshold probability includes: determining the probability that the word belongs to the commodity element in different clauses according to the probability that the clauses belong to the commodity element; calculating an average of the probabilities that the word belongs to the commodity element in the different clauses to obtain the second probability.
Specifically, taking the example that the first probability that the clause a belongs to the commodity element a is 0.8, the probability that any word X in the clause a other than the element word currently included in the commodity element belongs to the commodity element is determined to be 0.8 as well. It is understood that, in the actual implementation process, it may also be adjusted based on the first probability according to a preset weight (for example, 0.8), and then the probability that the word X belongs to the commodity element a is 0.64. Based on the above, after calculating the probability that the same word X belongs to the commodity element a in different clauses, calculating the average probability that the word X belongs to the commodity element a, namely the second probability that the word X belongs to the commodity element.
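The averaging described above can be sketched as follows. This is a minimal sketch; the function name, the `weight` parameter (the optional damping factor mentioned in the text), and the example data are assumptions for illustration:

```python
from collections import defaultdict

def second_probabilities(clause_probs, clause_words, weight=1.0):
    """Average, over clauses, the probability that each word belongs
    to the commodity element.

    clause_probs: first probabilities, one per clause (all above threshold).
    clause_words: per-clause word lists, element words already removed.
    weight: optional factor applied to each first probability.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for p, words in zip(clause_probs, clause_words):
        for w in words:
            sums[w] += p * weight   # word inherits the clause's probability
            counts[w] += 1
    return {w: sums[w] / counts[w] for w in sums}

probs = second_probabilities([0.8, 0.9], [["bezel", "bright"], ["bright"]])
print(probs)  # {'bezel': 0.8, 'bright': 0.85}
```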
In an optional embodiment, in a case that the first probability that the clause is attributed to the commodity element is greater than a first threshold probability, the method further includes: calculating the occurrence frequency of the words in all the clauses belonging to the same commodity element; and determining one or more words according to the sequence of the appearance frequency of the words from high to low so as to add the words serving as element words of the commodity elements to the commodity element dictionary to realize the expansion of the commodity element dictionary.
Specifically, take as an example that the first probability that clause A belongs to commodity element a is 0.8, that clause B belongs to commodity element a is 0.7, that clause C belongs to commodity element a is 0.75, and that clause D belongs to commodity element a is 0.9. The occurrence frequencies of all words (take word 1, word 2, and word 3 as examples) other than the element words currently contained in the commodity element are counted over clauses A, B, C, and D. If the occurrence frequencies of word 1, word 2, and word 3 are 5, 4, and 3 respectively, the two most frequent words, word 1 and word 2, can be added to the commodity element dictionary as element words of commodity element a, thereby further expanding the commodity element dictionary.
In an alternative embodiment, before calculating a second probability that a word other than the element word currently contained in the commodity element in the clause belongs to the commodity element, determining whether the word belongs to a forbidden element word list; in the case where the word does not belong to the list of forbidden element words, calculating a second probability that the word belongs to the commodity element.
That is to say, under the condition that forbidden element words are constructed according to actual requirements, words in clauses can be preliminarily screened so as to reduce the amount of words needing to calculate the second probability, and further, the calculation efficiency can be improved.
In an alternative embodiment, before calculating a second probability that a word other than the element word currently contained in the commodity element in the clause belongs to the commodity element, judging whether the word belongs to a stop word; in the event that the word does not belong to a stop word, calculating a second probability that the word belongs to the commodity element.
Stop words are words or characters that are automatically filtered out before or after processing natural language data (or text) in information retrieval, in order to save storage space and improve search efficiency; examples include "this", "that", and so on. Filtering out stop words therefore further reduces the number of words for which the second probability needs to be calculated, improving calculation efficiency.
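The two filtering steps above (forbidden element words and stop words) can be combined into one screening pass. A minimal sketch; both word lists and the function name are assumed examples, not part of the disclosure:

```python
STOP_WORDS = {"this", "that", "the", "a"}        # assumed examples
FORBIDDEN_ELEMENT_WORDS = {"free", "shipping"}   # assumed examples

def candidate_words(words):
    """Keep only words for which a second probability must be computed,
    skipping stop words and words on the forbidden element word list."""
    return [w for w in words
            if w not in STOP_WORDS and w not in FORBIDDEN_ELEMENT_WORDS]

print(candidate_words(["this", "screen", "free", "bright"]))
# ['screen', 'bright']
```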
Step S104, adding the word to the commodity element dictionary as an element word of the commodity element when the second probability that the word belongs to the commodity element is greater than a second threshold probability. The second threshold probability is likewise any value in the interval 0 to 1, such as 0.75, 0.8, etc., set based on practical experience or requirements.
Based on the embodiment, the sentences in the target text are divided into one or more clauses according to punctuation marks; predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm; calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability; adding the word as an element word of the commodity element to the commodity element dictionary in a case where a second probability that the word belongs to the commodity element is greater than a second threshold probability. Based on the method, the automatic expansion of the commodity element dictionary is realized, a large amount of manual labeling is not needed, the efficiency of constructing the commodity element dictionary is improved, the problem of poor consistency caused by manual labeling is avoided, and the quality of the commodity element dictionary is improved.
Referring to fig. 2, on the basis of the foregoing embodiment, a method for training a text classification model is provided, which may specifically include the following steps:
step S201, obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element.
Wherein the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, comprises: judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary; if the clause contains the element words of the commodity elements, the first probability is a first value; and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value. The first value and the second value are set according to actual requirements, generally, the second value is smaller than the first value, for example, the first value is 1, and the second value is 0.8.
Specifically, referring to table 1, the sentence "this mobile phone uses a full-screen, the screen occupation ratio is up to 90%, and the screen brightness is high. "is illustrated as an example, which includes the following three clauses:
Clause 1: the mobile phone adopts a full screen
Clause 2: the screen occupation ratio is up to 90%
Clause 3: the screen brightness is high
Since clause 1 contains "full screen", an element word of the commodity element "screen", the first probability that clause 1 belongs to the commodity element "screen" is determined to be the first value (e.g., 1); clause 3 contains the element word "screen" of the commodity element "screen", so the first probability that clause 3 belongs to the commodity element "screen" is likewise determined to be the first value (e.g., 1). Clause 2 does not contain any element word of a commodity element included in the commodity element dictionary, but since clauses 1 and 3, which are adjacent to clause 2 in the same sentence, both contain an element word of the commodity element "screen", clause 2 is highly likely to be related to the commodity element "screen", and the first probability that clause 2 belongs to the commodity element "screen" is set to the second value (e.g., 0.8). In this way, clauses in a small number of commodity description texts can be labeled based on the commodity element dictionary, and the values of their corresponding first probabilities determined, so as to obtain training data. It should be noted that, when acquiring the training data, only clauses that contain an element word of a commodity element, or that have at least one adjacent clause in the same sentence containing one, are considered; for other clauses it cannot be determined whether they are related to a commodity element, so they are not considered in this step.
Step S202, inputting the training data into the text classification model, so as to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function.
Specifically, after the training data is input into the text classification model, the text classification model predicts a first probability that each training sample in the training data belongs to each commodity element in the commodity element words, and then calculates a loss function corresponding to each clause based on the first probability that the text classification model predicts the clause and the first probability of the clause in the training data. If the loss function corresponding to the clause i in the training data is li, the loss function of the text classification model at the current time can be calculated according to the following formula:
L = ∑i Pi*li
wherein, L is the loss function of the text classification model, li is the loss function corresponding to the clause i in the training data, and Pi is the first probability that the clause i in the training data belongs to the commodity element.
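As one possible concrete form of this formula, the weighted loss can be computed as below. The choice of binary cross-entropy for the per-clause loss li is an assumption made for illustration; the embodiment does not fix the form of li:

```python
import math

def clause_loss(pred, target):
    """Per-clause loss li: binary cross-entropy between the model's predicted
    first probability and the labeled first probability (one reasonable
    choice; the formula above leaves the form of li open)."""
    eps = 1e-12  # guard against log(0)
    return -(target * math.log(pred + eps) + (1 - target) * math.log(1 - pred + eps))

def total_loss(preds, targets):
    """L = sum_i Pi * li: each clause's loss is weighted by its labeled first
    probability Pi, so confidently labeled clauses contribute more."""
    return sum(p_i * clause_loss(pred, p_i) for pred, p_i in zip(preds, targets))
```

Weighting by Pi means a clause labeled with the second value (e.g., 0.8, inferred only from its neighbors) influences optimization less than a clause that directly contains an element word.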
Step S203, using the optimized text classification model to predict a third probability that a clause not including a commodity element belongs to the commodity element.
Step S204, under the condition that the third probability is larger than the first threshold probability, the clause and the third probability corresponding to the clause are added to the training data so as to continuously optimize the text classification model.
That is, the probability that a clause not containing a commodity element belongs to the commodity element is predicted by the text classification model; when the predicted third probability that the clause belongs to the commodity element is greater than the first threshold probability, the clause and the corresponding third probability are added to the training data. The training data are thereby continuously expanded, so that the text classification model can be further optimized based on the newly expanded training data; through this iterative cycle, the accuracy of the text classification model improves without requiring a large amount of manual labeling.
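The iterative expansion described in steps S202 to S204 may be sketched as follows. The model interface (fit/predict) is hypothetical and stands in for any text classification model trained on (clause, probability) pairs:

```python
def self_train(model, labeled_data, unlabeled_clauses, first_threshold=0.8, rounds=3):
    """Semi-supervised self-training loop.

    labeled_data: list of (clause, first_probability) pairs (step S201).
    unlabeled_clauses: clauses containing no element word of any commodity element.
    """
    training_data = list(labeled_data)
    remaining = list(unlabeled_clauses)
    for _ in range(rounds):
        model.fit(training_data)                 # S202: optimize on current training data
        still_remaining = []
        for clause in remaining:                 # S203: predict third probability
            p3 = model.predict(clause)
            if p3 > first_threshold:             # S204: expand the training data
                training_data.append((clause, p3))
            else:
                still_remaining.append(clause)
        remaining = still_remaining
    return model, training_data
```

Clauses whose predicted third probability never exceeds the threshold simply stay out of the training data, so the loop only ever adds pseudo-labels the model is relatively confident about.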
Referring to fig. 3, on the basis of the above embodiment, an embodiment of the present invention provides a dictionary construction apparatus 300, including: a clause acquisition module 301, a first probability prediction module 303, a second probability calculation module 304, and a dictionary expansion module 305; wherein,
the clause acquisition module 301 is configured to divide a sentence in a target text into one or more clauses according to punctuation marks;
the first probability prediction module 303 is configured to predict a first probability that a clause in the target text belongs to a commodity element included in a pre-constructed commodity element dictionary, using a text classification model trained based on a semi-supervised learning algorithm;
the second probability calculation module 304 is configured to calculate a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability;
the dictionary expansion module 305 is configured to add the word as the element word of the commodity element to the commodity element dictionary if a second probability that the word belongs to the commodity element is greater than a second threshold probability.
In an optional embodiment, the apparatus further comprises: a classification model training module 302; wherein the classification model training module 302 is configured to,
obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element;
inputting the training data into the text classification model to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function;
predicting a third probability that a clause not containing the commodity element belongs to the commodity element by using the optimized text classification model;
and under the condition that the third probability is greater than the first threshold probability, adding the clauses and the third probability corresponding to the clauses to the training data so as to continuously optimize the text classification model.
In an alternative embodiment, the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, includes:
judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary;
if the clause contains the element words of the commodity elements, the first probability is a first value;
and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value.
In an alternative embodiment, the calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element when a first probability that the clause belongs to the commodity element is greater than a first threshold probability includes:
determining the probability that the word belongs to the commodity element in different clauses according to the probability that the clauses belong to the commodity element;
calculating an average of the probabilities that the word belongs to the commodity element in the different clauses to obtain the second probability.
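A minimal sketch of this averaging step, assuming that the probability of the word in a clause is approximated by that clause's own first probability (one possible reading of the step above):

```python
def second_probability(word, clauses_with_probs):
    """Average, over all clauses containing the word, of the probability that
    the word belongs to the commodity element in that clause.

    clauses_with_probs: list of (clause, first_probability) pairs for clauses
    predicted to belong to the commodity element.
    """
    probs = [p for clause, p in clauses_with_probs if word in clause]
    if not probs:
        return 0.0  # word never appears in a clause attributed to this element
    return sum(probs) / len(probs)
```

A word that appears only in clauses with high first probabilities thus receives a high second probability, making it a good candidate element word.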
In an optional embodiment, in a case that the first probability that the clause is attributed to the commodity element is greater than a first threshold probability, the method further includes:
calculating the occurrence frequency of the words in all the clauses belonging to the same commodity element;
determining one or more words according to the sequence of the appearance frequency of the words from high to low so as to add the words as element words of the commodity elements to the commodity element dictionary.
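This frequency-based selection may be sketched as follows; whitespace tokenization and the exclusion sets are simplifying assumptions for illustration:

```python
from collections import Counter

def top_frequent_words(clauses, element_words, stop_words, k=5):
    """Count how often each candidate word occurs across all clauses belonging
    to the same commodity element, excluding words already in the dictionary
    and stop words, and return the k most frequent candidates."""
    counts = Counter()
    for clause in clauses:
        for word in clause.split():  # assumes clauses are pre-tokenized by spaces
            if word not in element_words and word not in stop_words:
                counts[word] += 1
    return [w for w, _ in counts.most_common(k)]
```

The returned words, ordered from highest to lowest frequency, are the candidates to be added to the commodity element dictionary as element words.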
In an alternative embodiment, the second probability calculation module 304 is further configured to,
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to a forbidden element word list;
in the case where the word does not belong to the list of forbidden element words, calculating a second probability that the word belongs to the commodity element.
In an alternative embodiment, the second probability calculation module 304 is further configured to,
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to stop words or not;
in the event that the word does not belong to a stop word, calculating a second probability that the word belongs to the commodity element.
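The two pre-filters above (the forbidden element word list and the stop words) can be combined into a single candidate-screening step, sketched here with hypothetical names:

```python
def candidate_words(clause_words, forbidden_words, stop_words, current_element_words):
    """Return the words in a clause that are eligible for second-probability
    scoring: not already element words of the commodity element, not in the
    forbidden element word list, and not stop words."""
    return [w for w in clause_words
            if w not in current_element_words
            and w not in forbidden_words
            and w not in stop_words]
```

Only the words surviving this screening need to have their second probability computed, which avoids wasting computation on words that could never be added to the dictionary.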
Fig. 4 shows an exemplary system architecture 400 to which the dictionary construction method or the dictionary construction apparatus of the embodiment of the present invention can be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. The network 404 may include various types of connections, such as wired or wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 401, 402, 403 to interact with the server 405 over the network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, and the like.
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 401, 402, and 403. The background management server can analyze and process the received data, such as a product information query request, so as to obtain one or more clauses in the commodity description text, and the like.
It should be noted that the dictionary construction method provided in the embodiment of the present invention is generally executed by the server 405; accordingly, the dictionary construction apparatus is generally provided in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for implementing a terminal device of an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a clause acquisition module, a first probability prediction module, a second probability calculation module, and a dictionary expansion module. The names of these modules do not in some cases constitute a limitation to the modules themselves; for example, the sending unit may also be described as a "module that sends a picture acquisition request to a connected server".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: dividing sentences in the target text into one or more clauses according to punctuation marks; predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm; calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability; adding the word as an element word of the commodity element to the commodity element dictionary in a case where a second probability that the word belongs to the commodity element is greater than a second threshold probability.
According to the technical scheme of the embodiment of the invention, the sentences in the target text are divided into one or more clauses according to punctuation marks; predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm; calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability; adding the word as an element word of the commodity element to the commodity element dictionary in a case where a second probability that the word belongs to the commodity element is greater than a second threshold probability. Based on the method, the automatic expansion of the commodity element dictionary is realized, a large amount of manual labeling is not needed, the efficiency of constructing the commodity element dictionary is improved, the problem of poor consistency caused by manual labeling is avoided, and the quality of the commodity element dictionary is improved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.