Disclosure of Invention
The present application provides a risk phrase identification method and device, an electronic device and a storage medium, which address the technical problem that existing approaches identify risk phrases in risk description texts poorly.
In a first aspect, a risk phrase identification method is provided, which includes:
performing phrase recognition on the risk description text by adopting a preset phrase recognition algorithm to obtain a first risk phrase list;
processing the risk description text by adopting a preset word segmentation tool to obtain a second risk phrase list;
and combining the first risk phrase list and the second risk phrase list to determine a risk phrase list comprising a plurality of risk phrases.
On the basis of the above technical solution, the performing phrase recognition on the risk description text by using the predetermined phrase recognition algorithm comprises: filtering the risk description text based on a predetermined filtering rule;
performing part-of-speech tagging on the filtered risk description text, and screening words with preset parts-of-speech to form a text to be recognized;
counting word strings whose frequency of occurrence in the text to be recognized is greater than a preset number threshold, as candidate phrases;
and selecting the risk phrases from the candidate phrases by using a predetermined phrase recognition algorithm.
On the basis of the above technical solution, the predetermined filtering rule includes: filtering stop words according to a preset stop word list;
the screening of words with predetermined parts of speech comprises:
screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text.
On the basis of the above technical solution, the selecting a risk phrase from the candidate phrases by using a predetermined phrase recognition algorithm includes:
calculating a mutual information value of each candidate phrase by using mutual information;
calculating a left entropy value and a right entropy value of each candidate phrase by using left entropy and right entropy;
calculating a weight value of each candidate phrase according to a predetermined weight algorithm based on the statistic value of each candidate phrase; the statistic value comprises a mutual information value, a left entropy value, a right entropy value and the occurrence frequency of candidate phrases in the text to be identified; or a mutual information value, a left entropy value, and a right entropy value;
and selecting the risk phrases from the candidate phrases by adopting a preset selection rule according to the weight value of each candidate phrase.
On the basis of the above technical solution, the predetermined selection rule includes:
sorting the weight values of the candidate phrases from largest to smallest, and selecting a predetermined number of top-ranked candidate phrases as risk phrases; or
and when the weight value of the candidate phrase is not less than a preset threshold value, taking the candidate phrase as a risk phrase.
On the basis of the above technical solution, the processing of the risk description text by using the predetermined word segmentation tool comprises the following steps:
combining all the characters in the risk description text to form a phrase to be matched;
the phrases to be matched are inquired and matched in a preset vocabulary library of the word segmentation tool, and phrases matched with the vocabulary in the preset vocabulary library are determined;
and filtering the matched phrases based on a preset filtering rule, and using the filtered phrases as the risk phrases of the second risk phrase list.
On the basis of the above technical solution, the predetermined filtering rule includes at least one of:
filtering single characters; filtering numbers; and filtering phrases whose number of constituent characters is less than a predetermined number.
On the basis of the above technical solution, before the performing phrase recognition on the risk description text by using the predetermined phrase recognition algorithm, the method further comprises the following steps:
identifying a predetermined text by paragraphs, and extracting paragraphs containing risk descriptions from the text to serve as the risk description text.
In a second aspect, a risk phrase recognition apparatus is provided, including:
the first acquisition module is used for carrying out phrase recognition on the risk description text by adopting a preset phrase recognition algorithm to obtain a first risk phrase list;
the second acquisition module is used for processing the risk description text by adopting a preset word segmentation tool to obtain a second risk phrase list;
and the merging processing module is used for merging the first risk phrase list and the second risk phrase list and determining a risk phrase list comprising a plurality of risk phrases.
On the basis of the above technical solution, the first obtaining module includes:
the first filtering module is used for filtering the risk description text based on a preset filtering rule;
the screening module is used for performing part-of-speech tagging on the filtered risk description texts and screening words with preset parts-of-speech to form texts to be identified;
the counting module is used for counting word strings with the frequency greater than a preset number threshold in the text to be recognized as candidate phrases;
and the selecting module is used for selecting the risk phrases from the candidate phrases by using a predetermined phrase recognition algorithm.
On the basis of the above technical solution, the second obtaining module includes:
the combination module is used for combining all the characters in the risk description text to form a phrase to be matched;
the matching module is used for inquiring and matching phrases to be matched in a preset vocabulary library of the word segmentation tool and determining phrases matched with the vocabulary in the preset vocabulary library;
and the second filtering module is used for filtering the matched phrases based on a preset filtering rule and taking the filtered phrases as the risk phrases of the second risk phrase list.
In a third aspect, an electronic device is provided, including:
a processor; and
a memory configured to store machine readable instructions that, when executed by the processor, cause the processor to perform the risk phrase identification method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions that, when executed on a computer, enable the computer to perform the risk phrase identification method of the first aspect.
The beneficial effects brought by the technical solution provided by the present application are as follows:
Phrase recognition is performed on the risk description text by using a predetermined phrase recognition algorithm, preliminarily extracting phrases to obtain a first risk phrase list; the risk description text is also processed by a predetermined word segmentation tool, expanding the risk phrases to obtain a second risk phrase list. The first risk phrase list and the second risk phrase list are then merged to determine a risk phrase list comprising a plurality of risk phrases.

The information representation capability of a phrase is significantly higher than that of a single keyword, and the phrases extracted by the phrase recognition algorithm are accurate; however, the phrase recognition algorithm alone yields too few risk phrases to represent all of the information expressed by the original risk text. Further processing the risk description text with a word segmentation tool expands the risk phrases while maintaining accuracy, so the risk phrases obtained by combining the two approaches are more comprehensive.

The risk phrase identification method and device can therefore identify risk phrases quickly and accurately; the identified phrases are more comprehensive, carry more information, and better reveal the risk subject, which solves the technical problem that existing approaches identify risk phrases in risk description texts poorly.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to elements that are the same or similar, or that have the same or similar functions, throughout. The embodiments described below with reference to the accompanying drawings are illustrative, are only for the purpose of explaining the present application, and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms referred to in this application will first be introduced and explained:
Mutual information is an information measure from information theory that quantifies the correlation between two event sets. It is a common tool in computational linguistic model analysis for measuring the association between two objects, and is used in filtering problems to measure how well a feature discriminates a topic. Mutual information, originally a concept of information theory used to represent relationships between pieces of information, is a measure of the statistical correlation of two random variables and is closely related to cross entropy. Feature extraction based on mutual information rests on the following assumption: if a term occurs frequently in one category but rarely in the others, the mutual information between the term and that category is large. In general, mutual information is used as a measure between feature words and categories; if a feature word belongs to a category, their mutual information is maximal.
Left-right entropy is an important statistical feature of a pattern, although computing it for massive word strings over a large-scale corpus requires reading a large number of irrelevant characters. The larger the left-right entropy, the richer the words surrounding the word string, meaning a greater degree of freedom and a greater likelihood that it forms an independent word.
Existing recognition techniques are not suitable for identifying and extracting risk phrases from annual-report risk description texts. Subject-word extraction tends toward single words that are short, cannot represent the risk subject well, and lose a large amount of semantic content. The information representation capability of a phrase is significantly higher than that of a single keyword: for example, the phrase 'account-age increase' expresses a richer meaning than the separate words 'account age' and 'increase', and 'manager talent' conveys more information than 'manager' and 'talent' do individually. Given that annual-report risk descriptions are short, existing key-phrase extraction algorithms can identify only a limited number of phrases; even important words in the risk descriptions, such as 'increase', 'decrease' and 'shortage', which carry strong risk early-warning meaning, are assigned reduced importance, and the extracted phrases consist mostly of nouns. The overall recognition effect is poor and cannot meet the requirements of risk phrase identification for annual-report risk description texts.
The application provides a risk phrase identification method, a risk phrase identification device, an electronic device and a storage medium, and aims to solve the above technical problems in the prior art.
The following describes the technical solution of the present application and how to solve the above technical problems in detail by specific embodiments. These several specific embodiments may be combined with each other below, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Example one
The embodiment of the application provides a risk phrase identification method, and as shown in fig. 1, the method comprises the following steps:
S100, performing phrase recognition on the risk description text by adopting a preset phrase recognition algorithm to obtain a first risk phrase list.
S200, processing the risk description text by adopting a preset word segmentation tool to obtain a second risk phrase list.
And S300, merging the first risk phrase list and the second risk phrase list, and determining a risk phrase list comprising a plurality of risk phrases. Specifically, duplicate risk phrases are removed when the merging process is performed.
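Step S300 amounts to a deduplicating union of the two lists. A minimal Python sketch, with illustrative function and phrase names that are not from the application itself:

```python
def merge_phrase_lists(first, second):
    """Merge two risk-phrase lists, removing duplicates while keeping order."""
    seen = set()
    merged = []
    for phrase in first + second:
        if phrase not in seen:
            seen.add(phrase)
            merged.append(phrase)
    return merged
```

Keeping the first list's order first mirrors the idea that the phrase-recognition results are the primary extraction and the segmentation results are the expansion.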
In this embodiment, the number of risk phrases obtained by the predetermined phrase recognition algorithm alone is small and insufficient to represent the information expressed by the original risk text. Processing the risk description text with a word segmentation tool as well expands the risk phrases with high accuracy, so the risk phrases obtained by combining the two approaches are more comprehensive.
Example two
Referring to fig. 2, an embodiment of the present invention provides a possible implementation manner, and on the basis of the first embodiment, the step S100 includes the following steps:
S101, filtering the risk description text based on a preset filtering rule. Wherein the filtering rule is as follows: stop words are filtered according to a predetermined stop word list, while punctuation, nouns, verbs, adjectives and degree adverbs are retained.
And S102, performing part-of-speech tagging on the filtered risk description text, and screening words with preset parts-of-speech to form a text to be recognized.
Further, the screening of words with predetermined parts of speech includes: screening nouns, verbs, adjectives and degree adverbs from the filtered risk description text. Punctuation marks, nouns, verbs, adjectives and degree adverbs are not filtered out. Retaining punctuation prevents content separated by punctuation marks from being merged together, ensures that the words forming a risk phrase are adjacent to each other, and avoids extracting noise word strings such as 'risk company' and 'risk country'. Retaining nouns, verbs, adjectives and degree adverbs ensures that the extracted phrases are of higher quality, preserves words carrying more information such as 'increase', 'shortage' and 'decrease', and avoids extracting only nouns.
S103, counting word strings with the frequency greater than a preset number threshold in the text to be recognized as candidate phrases. Specifically, the preset number threshold may be set according to an actual text, and the word string may be a combination of two words.
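Step S103 can be sketched as counting adjacent two-word strings with a frequency threshold. In the sketch below, English tokens stand in for the segmented Chinese words, and the function name and default threshold are illustrative assumptions:

```python
from collections import Counter

def candidate_bigrams(tokens, min_count=2):
    """Count adjacent two-word strings in the part-of-speech-filtered
    token sequence and keep those meeting the frequency threshold (S103)."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {" ".join(pair): n for pair, n in counts.items() if n >= min_count}
```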
And S104, selecting the risk phrases from the candidate phrases by adopting a preset phrase identification algorithm.
Specifically, in step S104, the step of selecting a risk phrase from the candidate phrases by using a predetermined phrase recognition algorithm includes:
s1041, calculating mutual information value of each candidate phrase by adopting mutual information; mutual information values are used to represent the likelihood that a candidate phrase makes up a phrase, and the mutual information values are proportional to the likelihood that the candidate phrase makes up a phrase.
S1042, calculating a left entropy value and a right entropy value of each candidate phrase by adopting left entropy and right entropy; the left entropy value and the right entropy value are respectively used for representing the possibility of words collocated left and right of the candidate phrase, and the left entropy value and the right entropy value are in direct proportion to the possibility of the candidate phrase forming the phrase.
Further, steps S1041 and S1042 have no required order and may be performed simultaneously or sequentially.
S1043, calculating a weight value of each candidate phrase according to a preset weight algorithm based on the statistic value of each candidate phrase; the statistic value comprises a mutual information value, a left entropy value, a right entropy value and the occurrence frequency of the candidate phrases in the text to be identified; or a mutual information value, a left entropy value and a right entropy value.
S1044, selecting the risk phrases from the candidate phrases by adopting a preset selection rule according to the weight value of each candidate phrase.
Further, the weight values of the candidate phrases are sorted from largest to smallest, and a predetermined number of top-ranked candidate phrases are selected as risk phrases. In practical applications, meaningless phrases such as numbers may first be removed, and then the top 20 ranked phrases selected.
Alternatively, when the weight value of a candidate phrase is not less than a preset threshold, the candidate phrase is taken as a risk phrase. In practical applications, the threshold is set in advance, and candidates whose weight values are below it are eliminated.
As an optional implementation manner, in step S1043, if the statistic value includes a mutual information value, a left entropy value, a right entropy value, and a frequency of occurrence of the candidate phrase, based on the mutual information value, the left entropy value, the right entropy value, and the frequency of occurrence of each candidate phrase, a first weight value of each candidate phrase is calculated according to a first predetermined algorithm;
and selecting the risk phrases from the candidate phrases by adopting a preset selection rule according to the first weight value of each candidate phrase.
If the statistic value comprises a mutual information value, a left entropy value, a right entropy value and the occurrence frequency of the candidate phrase, the adopted calculation mode is as follows:
(1) Calculating mutual information value of each candidate phrase by using mutual information
Denote the two component words of candidate phrase t as a and b. The mutual information is then calculated as shown in formula 1.1:

MI(t) = log2( p(t) / (p(a) * p(b)) ) (formula 1.1)

where p(t), p(a) and p(b) are the probabilities of t, a and b respectively. These probabilities can be estimated simply, using the normalized frequency form:

p(t) = n_t / N_P (formula 1.2)
p(a) = n_a / N_T (formula 1.3)
p(b) = n_b / N_T (formula 1.4)

where n_t, n_a and n_b are the numbers of occurrences of t, a and b in the corpus, N_P is the total number of occurrences of candidate phrases in the corpus set, and N_T is the total number of occurrences of single words in the corpus.
The higher the value of mutual information, the higher the correlation between a and b, the more likely a and b are to constitute a phrase; conversely, the lower the value of mutual information, the lower the correlation between a and b, the greater the likelihood of a phrase boundary between a and b, and thus the less likely a and b constitute a phrase.
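Using the normalized-frequency probability estimates, the mutual information value can be computed as in this sketch; the base-2 logarithm is an assumption, since the application does not fix the base:

```python
import math

def mutual_information(n_t, n_a, n_b, N_P, N_T):
    """Mutual information of candidate phrase t = ab:
    MI(t) = log2(p(t) / (p(a) * p(b)))."""
    p_t = n_t / N_P  # phrase probability over candidate-phrase occurrences
    p_a = n_a / N_T  # word probabilities over single-word occurrences
    p_b = n_b / N_T
    return math.log2(p_t / (p_a * p_b))
```

A higher value indicates a and b co-occur more often than chance, i.e. they are more likely to form a phrase.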
(2) Calculating left entropy value and right entropy value of each candidate phrase by using left entropy and right entropy
The adjacency entropy includes left adjacency entropy and right adjacency entropy; it uses information entropy to measure the uncertainty of the words adjacent to the candidate phrase on the left and right. The lower the uncertainty of the adjacent words, the fewer and more stable the words before and after the candidate phrase, and the lower the possibility that it forms a word; conversely, the more numerous, varied and unstable the adjacent words, the higher the possibility that the candidate phrase forms a word. The left and right entropy values are calculated as shown in formulas 2.1 and 2.2:

E_L(W) = -Σ_{a∈A} p(a|W) log2 p(a|W) (formula 2.1)
E_R(W) = -Σ_{b∈B} p(b|W) log2 p(b|W) (formula 2.2)

where E_L and E_R are the left and right entropy values of the candidate phrase, W denotes the candidate phrase, W = {w_1, w_2, ..., w_n}; A is the set of all words appearing to the left of the candidate phrase, with a a word in A; and B is the set of all words appearing to the right, with b a word in B. The larger the values of E_L and E_R for a candidate phrase, the more varied and unstable the words appearing to its left and right, the richer its collocations, and thus the more likely the candidate is to be a phrase.
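Each side's entropy is the Shannon entropy of the words observed adjacent to the candidate phrase on that side. A hedged sketch (in practice the neighbor lists would be collected from the corpus):

```python
import math
from collections import Counter

def adjacency_entropy(neighbor_words):
    """Entropy of the words observed on one side (left or right) of a
    candidate phrase; higher means richer, less stable collocations."""
    counts = Counter(neighbor_words)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Calling it once on the left-neighbor list and once on the right-neighbor list yields E_L and E_R.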
(3) Calculating a first weight value of each candidate phrase according to a first predetermined algorithm
The left and right adjacency entropies are used to judge the composition boundaries of a phrase, and extraction is performed in combination with the occurrence frequency TF. The mutual information, left entropy, right entropy and frequency TF are fused into a first weight value Score, and a threshold is set for phrase identification. Score is calculated as:

Score = (NorFreq + NorMI + NorLE + NorRE) / 4 (formula 3.1)

where NorFreq, NorMI, NorLE and NorRE are the occurrence frequency TF, mutual information, left entropy and right entropy after min-max normalization, calculated as follows:

NorFreq_i = (Freq_i - MIN_Freq) / (MAX_Freq - MIN_Freq) (formula 3.2)
NorMI_i = (MI_i - MIN_MI) / (MAX_MI - MIN_MI) (formula 3.3)
NorLE_i = (LE_i - MIN_LE) / (MAX_LE - MIN_LE) (formula 3.4)
NorRE_i = (RE_i - MIN_RE) / (MAX_RE - MIN_RE) (formula 3.5)
Then, the higher the first weight value Score is, the higher the possibility that the candidate phrase becomes a phrase is represented; otherwise, it means that the candidate phrase is less likely to be a phrase.
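The fusion of the four normalized statistics into a single Score can be sketched as follows; the guard for a constant statistic (max equals min) is an added assumption the application does not discuss:

```python
def min_max_normalize(values):
    """Min-max normalize a list of statistics to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant stat contributes 0
    return [(v - lo) / (hi - lo) for v in values]

def score(freqs, mis, les, res):
    """Score = (NorFreq + NorMI + NorLE + NorRE) / 4, per candidate."""
    norm = [min_max_normalize(stat) for stat in (freqs, mis, les, res)]
    return [sum(vals) / 4 for vals in zip(*norm)]
```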
As another implementation manner, in step S1043, if the statistic value includes a mutual information value, a left entropy value, and a right entropy value, a second weight value of each candidate phrase is calculated according to a second predetermined algorithm based on the mutual information value, the left entropy value, and the right entropy value of each candidate phrase;
and selecting the risk phrases from the candidate phrases by adopting a preset selection rule according to the second weight value of each candidate phrase.
If the statistic value comprises a mutual information value, a left entropy value and a right entropy value, the adopted calculation mode is as follows:
(1) Computing mutual information value of each candidate phrase using mutual information
The mutual information may use the following calculation formula:

MI(t) = log2( (n_t * N) / (n_a * n_b) )

where t is a candidate phrase, N is the number of all candidate phrases in the set whose length satisfies the requirement, and n_t, n_a and n_b are the frequencies with which t, a and b appear in the text, respectively. The larger the mutual information value, the more tightly the words are bound and the higher the possibility that they form a phrase; conversely, the smaller the mutual information value, the less correlated the words are and the less likely they are to constitute a phrase.
(2) Calculating left entropy value and right entropy value of each candidate phrase by using left entropy and right entropy
The left-right entropy can adopt the following calculation formulas:

E_L(W) = -Σ_{a∈A} p(a|W) log2 p(a|W)
E_R(W) = -Σ_{b∈B} p(b|W) log2 p(b|W)

where E_L is the left entropy of the word string and E_R the right entropy, W denotes the candidate phrase, A is the set of all words appearing to the left of the candidate phrase (a ∈ A), and similarly B is the set of all words appearing to the right (b ∈ B). The larger the values of E_L and E_R for a word string, the richer and more varied the words collocated to its left and right, and the higher the probability that the word string forms a phrase.
(3) Calculating a second weight value of each candidate phrase according to a second predetermined algorithm
And fitting the mutual information value, the left entropy value and the right entropy value according to a preset algorithm, such as taking an average value, so as to obtain a second weight value.
In both implementations above, after the weight value of each candidate phrase is computed (the first weight value or the second weight value), the risk phrases are selected from the candidate phrases using a predetermined selection rule, which includes:
sorting the first or second weight values of the candidate phrases from largest to smallest, and selecting a predetermined number of top-ranked candidate phrases as risk phrases; or, when the first or second weight value of a candidate phrase is not less than a preset threshold, taking that candidate phrase as a risk phrase.
Example three
Referring to fig. 3, an embodiment of the present invention provides a possible implementation manner, and on the basis of the first embodiment, the step S200 includes the following steps:
S201, combining all the words in the risk description text to form phrases to be matched.
S202, the phrases to be matched are inquired and matched in a preset vocabulary library of the word segmentation tool, and phrases matched with the vocabulary in the preset vocabulary library are determined.
S203, filtering the matched phrases based on a predetermined filtering rule, and taking the filtered phrases as the risk phrases of the second risk phrase list. The predetermined filtering rule includes at least one of: filtering single characters; filtering numbers; and filtering phrases whose number of constituent characters is less than a predetermined number. Further, the predetermined number may be set to 3, since phrases shorter than 3 characters are usually common words; filtering them further improves word segmentation efficiency.
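Steps S201–S203 can be sketched as brute-force substring matching against the segmentation tool's vocabulary, followed by the filtering rules. The O(n²) substring enumeration is an illustrative simplification of what a real word segmentation tool does, and the sample vocabulary is hypothetical:

```python
def dictionary_phrases(text, vocab, min_len=3):
    """Enumerate substrings of the text (S201), keep those found in the
    segmentation tool's vocabulary (S202), then filter single characters,
    numbers, and phrases shorter than min_len characters (S203)."""
    matched = set()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in vocab:
                matched.add(text[i:j])
    return sorted(p for p in matched
                  if len(p) >= min_len and not p.isdigit())
```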
Using only the phrase recognition algorithm of the above embodiments, the number of phrases formed by combining words via mutual information is limited, and when the left- and right-entropy thresholds are set high, some phrases containing information are easily missed: words such as 'economic downturn' and 'rising unemployment rate' cannot be recognized.
In particular, the dictionaries of common word segmentation tools cover few risk phrases, and in most cases one risk phrase is segmented into several words. In this embodiment, the word segmentation tool may be the jieba tool: a large number of encyclopedia entries are obtained from a general-purpose Chinese encyclopedia knowledge graph, each entry is stored as one line in a txt file, and the txt file is then used as the vocabulary to replace jieba's original dictionary. This expands the vocabulary, makes the obtained phrases more comprehensive, and avoids missing phrases that would leave the risk subject poorly expressed. At the same time, jieba's built-in segmentation dictionary contains about 350,000 entries, while the encyclopedia vocabulary contains about 12 million, roughly 34 times the size of the original dictionary; directly using the encyclopedia vocabulary as the segmentation dictionary and initializing jieba with it may cause the program to crash. Since jieba also uses a Hidden Markov Model (HMM) based on the word-forming capability of Chinese characters to identify out-of-vocabulary words, that model is disabled here to ensure segmentation efficiency.
The Chinese encyclopedia knowledge graph data used for recognition can be obtained from the Chinese open knowledge graph website of Fudan University, and comprises more than 9 million encyclopedia entities and more than 66 million triple relations. The general Chinese encyclopedia knowledge graph provided by Fudan University covers entries from Chinese encyclopedia websites such as Baidu Baike, Hudong Baike and the Chinese Wikipedia, including concrete things, famous people, abstract concepts, literary works, hot events, technical terms, Chinese words and phrases, and specific topic collections. In total, more than 12 million encyclopedia entity words are obtained from these entries, covering almost all fields with high accuracy.
When any phrase to be matched belongs to the vocabulary in the encyclopedia, the phrase to be matched is used as a risk phrase, and the vocabulary of the risk phrase is enlarged.
The embodiment of the present invention provides another possible implementation manner, and on the basis of the first embodiment, before the step S100, the method further includes the following steps:
and identifying the preset text by paragraphs, and extracting the paragraphs containing the risk description in the text to serve as risk description texts.
Specifically, an annual report of a listed company is taken as an example for explanation.
First, all A-share annual reports of listed companies are obtained, and risk description information is extracted from them. Owing to the conventions of annual-report writing, the risk description information mainly exists as short texts, and each risk type (such as the risks of accounts-receivable increase and cash-flow decrease shown in the figure) corresponds to a brief, specific risk description. Second, the risk description information is processed by paragraphs, so that each listed company corresponds to several risk description texts. Finally, the data are cleaned: content containing almost no risk information, such as 'commitment to avoid horizontal competition' and 'share sale commitment' in the annual-report risk descriptions, is removed, along with illegal symbols and obviously garbled, non-text content.
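The paragraph-level extraction of risk descriptions can be approximated by a simple keyword filter; the keyword heuristic below is a hypothetical illustration, not the application's actual extraction method:

```python
def extract_risk_paragraphs(report_text, keywords=("risk", "Risk")):
    """Hypothetical heuristic: keep the paragraphs of an annual report
    that mention a risk keyword, dropping boilerplate paragraphs."""
    paragraphs = [p.strip() for p in report_text.split("\n") if p.strip()]
    return [p for p in paragraphs if any(k in p for k in keywords)]
```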
On the basis of embodiments one to four, a risk description text from the 2016 annual report of Bishui Source Science and Technology Co., Ltd. was randomly selected as the experimental text, a risk phrase extraction experiment was performed on it, and the extraction results of embodiments one to four were compared with those of the existing HanLP phrase recognition algorithm, as shown in Table 1.
Experimental text: the account amount to be collected is larger, the risk of account age increase is 30 days 6 months in 2016, the balance of the account to be collected of the company is nearly 2.19 hundred million yuan, part of the account age increases, the preparation of the bad account to be collected is correspondingly increased, and the operation performance of the company is influenced. In order to solve the problem of overhigh balance of receivable accounts, a company increases the responsibility of business personnel for urging receipt of the receivable accounts, brings the recovery condition of the receivable accounts into performance assessment, and directly hooks the recovery condition of the receivable accounts with the income of the business personnel; the company establishes an owing group to settle the owed items of the important large-amount owed customers, strengthens the risk assessment of accounts receivable of business units, and adopts proper legal means for units which are owed for a long time and have small business volume in recent years. "
Table 1 Comparison of extraction results
Obviously, the risk phrases extracted by the identification methods in the first to fourth embodiments are more comprehensive and better represent the information conveyed by the original risk text.
EXAMPLE four
Fig. 4 is a schematic diagram of a risk phrase identifying apparatus according to an embodiment of the present invention. As shown in Fig. 4, the risk phrase identifying apparatus 1 includes:
The first obtaining module 11, configured to perform phrase recognition on the risk description text by using a predetermined phrase recognition algorithm to obtain a first risk phrase list.
The second obtaining module 12, configured to process the risk description text by using a predetermined word segmentation tool to obtain a second risk phrase list.
The merging processing module 13, configured to merge the first risk phrase list and the second risk phrase list to determine a risk phrase list including a plurality of risk phrases.
The number of risk phrases obtained by the first obtaining module 11 alone is small and not sufficient to represent the information expressed by the original risk text. The second obtaining module 12 therefore processes the risk description text to expand the set of risk phrases. The merging processing module 13 then merges the first risk phrase list and the second risk phrase list to determine the final risk phrase list, so the accuracy is high, and the risk phrases obtained by combining the two approaches are more comprehensive.
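Assuming the two lists are simple lists of phrase strings, the merging performed by the merging processing module amounts to an order-preserving union, for example:

```python
def merge_phrase_lists(first_list, second_list):
    """Merge two risk-phrase lists into one, keeping order and removing duplicates."""
    seen = set()
    merged = []
    for phrase in first_list + second_list:
        if phrase not in seen:
            seen.add(phrase)
            merged.append(phrase)
    return merged
```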
In addition, the risk phrase recognition device of the present invention may further include a text acquisition module, configured to process a predetermined text by paragraphs, and extract paragraphs of the text containing risk descriptions to obtain a risk description text.
EXAMPLE five
Fig. 5 shows the specific contents of the first obtaining module 11 according to an embodiment of the present invention. As shown in Fig. 5, the first obtaining module 11 includes:
A first filtering module 111, configured to filter the risk description text based on a predetermined filtering rule. In the actual filtering process, the first filtering module 111 is specifically configured to filter out stop words according to a predetermined stop word list.
A screening module 112, configured to perform part-of-speech tagging on the filtered risk description text and screen words with predetermined parts of speech to form a text to be recognized. In the actual screening process, the screening module 112 is specifically configured to screen nouns, verbs, adjectives, and degree adverbs from the filtered risk description text.
A counting module 113, configured to count word strings whose frequency of occurrence in the text to be recognized is greater than a preset number threshold, as candidate phrases.
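The counting step can be sketched as n-gram counting over the token sequence of the text to be recognized. The maximum phrase length and the count threshold below are illustrative parameters, not values fixed by the patent.

```python
from collections import Counter

def count_candidate_phrases(tokens, max_len=3, threshold=1):
    """Count contiguous word strings (n-grams of 2..max_len words) and keep
    those occurring more often than the preset threshold as candidate phrases."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c > threshold}
```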
A selecting module 114, configured to select the risk phrases from the candidate phrases by using a predetermined phrase recognition algorithm.
In the actual processing, the selecting module 114 is specifically configured to calculate a mutual information value of each candidate phrase using mutual information, calculate a left entropy value and a right entropy value of each candidate phrase using left and right entropy, calculate a weight value of each candidate phrase from its statistics according to a predetermined weight algorithm, and select the risk phrases from the candidate phrases according to the weight values using a predetermined selection rule. The statistics include the mutual information value, the left entropy value, the right entropy value, and the frequency of occurrence of the candidate phrase; or the mutual information value, the left entropy value, and the right entropy value. Specifically, the calculation method of embodiment two may be applied.
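The statistics the selecting module computes can be sketched as follows. The pointwise mutual information and left/right (boundary) entropy formulas are standard; the weight combination shown is a hypothetical example, since this excerpt defers the exact weight algorithm to embodiment two.

```python
import math

def mutual_information(count_xy, count_x, count_y, total):
    """Pointwise mutual information of a two-word candidate phrase xy."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

def boundary_entropy(neighbor_counts):
    """Entropy of the words adjacent to the phrase on one side
    (left neighbors give the left entropy, right neighbors the right entropy)."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())

def phrase_weight(mi, left_entropy, right_entropy, frequency=None):
    """Hypothetical weight: sum of the statistics, optionally adding
    a log-damped frequency term when frequency is part of the statistics."""
    weight = mi + left_entropy + right_entropy
    if frequency is not None:
        weight += math.log2(frequency)
    return weight
```

High mutual information indicates the words inside the phrase co-occur far more often than chance, while high left and right entropy indicate the phrase's boundaries are free (many different neighbors), both signs of a well-formed phrase.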
EXAMPLE six
Fig. 6 further shows the specific contents of the second obtaining module 12 according to an embodiment of the present invention. As shown in Fig. 6, the second obtaining module 12 includes:
A combination module 121, configured to combine the words in the risk description text to form phrases to be matched;
A matching module 122, configured to query the phrases to be matched against a predetermined vocabulary library of the word segmentation tool and determine the phrases that match vocabularies in the predetermined vocabulary library;
A second filtering module 123, configured to filter the matched phrases based on a predetermined filtering rule and use the filtered phrases as the risk phrases of the second risk phrase list. In the actual filtering process, the second filtering module 123 is further configured to perform at least one of the following: filtering out single words; filtering out numbers; filtering out phrases composed of fewer than a predetermined number of words.
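Putting modules 121 to 123 together, a minimal sketch (with a toy in-memory set standing in for the word segmentation tool's vocabulary library) might look like:

```python
def second_risk_phrase_list(tokens, lexicon, min_words=2, max_words=3):
    """Combine adjacent words, keep combinations found in the lexicon,
    then filter out numbers and phrases with too few words."""
    phrases = []
    for n in range(min_words, max_words + 1):  # combination module
        for i in range(len(tokens) - n + 1):
            words = tokens[i:i + n]
            phrase = " ".join(words)
            if phrase not in lexicon:
                continue  # matching module: keep only lexicon entries
            if any(w.isdigit() for w in words):
                continue  # filtering module: drop numbers
            phrases.append(phrase)
    return phrases
```

Starting the combinations at `min_words=2` also implements the single-word filter, since one-word candidates are never generated.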
EXAMPLE seven
An embodiment of the present invention further provides an electronic device. As shown in Fig. 7, the electronic device 4000 includes:
a processor 4001; and
a memory 4003 configured to store machine-readable instructions that, when executed by the processor, cause the processor to perform the risk phrase identification method in the foregoing method embodiments.
The processor 4001 is coupled to the memory 4003, for example via a bus 4002. The electronic device 4000 may further comprise a transceiver 4004. In practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 is used in this embodiment of the application to implement the first obtaining module 11, the second obtaining module 12, and the merging processing module 13 shown in Fig. 4.
The transceiver 4004 comprises a receiver and a transmitter and, in the embodiments of the present application, is used by the text acquisition module to acquire the risk description text. The processor 4001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein. The processor 4001 may also be a combination that performs a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 7, but this does not mean there is only one bus or one type of bus.
The memory 4003 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage (including compact disk, laser disk, digital versatile disk, Blu-ray disk, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 4003 is used for storing the application code for executing the scheme of the present application, and its execution is controlled by the processor 4001. The processor 4001 is configured to execute the application code stored in the memory 4003 to implement the actions of the risk phrase recognition apparatus provided by the embodiment shown in Fig. 4.
Embodiments of the present invention further provide a computer-readable storage medium storing computer instructions which, when run on a computer, enable the computer to execute the corresponding contents in the foregoing method embodiments.
In addition, the risk phrase identification method is mainly used to improve the accuracy of phrase extraction from annual report risk description texts, so that the content of such texts is effectively mined and the utilization of the textual content of annual reports is improved. The method can be used in annual report database projects for listed companies, can broaden research ideas for enterprise risk early warning, and makes up for the shortcoming of existing research that emphasizes annual report financial data while neglecting the descriptive text of annual reports. The identification of risk phrases plays the following two important practical roles:
(1) It provides technical support for follow-up research on annual report database projects of listed companies. In recent years, annual reports have become longer and longer, yet the contents of the three financial statements that form the core of the annual report have hardly grown; the financial information that can be disclosed has reached its upper limit, while the text outside the financial statements is increasingly rich, with more and more information disclosed through various supplementary descriptions and explanations. How to mine valuable information from a large amount of risk description text is an important problem in annual report database projects for listed companies, and it greatly affects the accuracy and comprehensiveness of later analysis and prediction. The risk phrase identification provided by the invention lays a foundation for subsequent project research.
(2) It provides technical support for research related to risk early warning. The identification method can identify and extract risk phrases of high quality, high accuracy, and large information content. As the decision relevance of data disclosed in annual reports is increasingly recognized, the text information in annual reports is also receiving more and more attention. The invention provides technical support for researchers and enterprises to mine annual report information, helps make up for the gaps in existing risk early warning research, and improves the comprehensiveness of enterprise risk early warning.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these should also be regarded as falling within the scope of the present invention.