Summary of the invention
The technical problem to be solved by the present invention is to provide a method and a device for extracting business behavior relationships based on "citation relationship" data, which can improve the relevance of the extracted business relationships, reflect business reality more closely than plain word distance, and improve the accuracy of task-based knowledge retrieval.
To solve the above problem, one embodiment of the present invention provides a business behavior relationship extraction method based on "citation relationship" data, comprising:
acquiring corpus material, preprocessing it, and constructing a corpus;
extracting business behavior words from all file titles in the corpus, and classifying the business behavior words by business domain to form a business behavior dictionary for each functional area;
extracting, from the corpus, relationship data between all file titles and the titles of the files they cite, and constructing a citation relationship database;
according to the citation relationship database, counting the number of business behavior words and cited business behavior words and the number of times they occur together, generating business behavior relationships, and constructing a business behavior relationship library.
Further, acquiring the corpus material specifically comprises searching for existing corpora and downloading or crawling material from the Internet; preprocessing the corpus material specifically comprises cleaning the material, segmenting it into words, tagging parts of speech, and removing stop words.
Further, extracting business behavior words from all file titles in the corpus specifically comprises:
parsing and segmenting all file titles in the corpus;
collecting business behavior words, including known business behavior words, continuously derived business behavior words, and business behavior words that need to be converted;
screening and testing the business behavior words;
preliminarily classifying the business behavior words and reasoning over them.
Further, extracting from the corpus the relationship data between all file titles and the cited file titles, and constructing the citation relationship database, specifically comprises:
parsing the content of every file in the corpus, and extracting the relationship data between the file title and the titles of the files it cites;
according to the file title, attaching a business behavior label to every file to form citation relationship data, and constructing the citation relationship database; wherein the citation relationship data includes the file title, the behavior label, the cited file title, and the cited behavior label.
Another embodiment of the present invention provides a business behavior relationship extraction device based on "citation relationship" data, comprising:
a corpus module, configured to acquire corpus material, preprocess it, and construct a corpus;
a business behavior dictionary module, configured to extract business behavior words from all file titles in the corpus, and to classify the business behavior words by business domain to form a business behavior dictionary for each functional area;
a citation relationship database module, configured to extract, from the corpus, relationship data between all file titles and the cited file titles, and to construct a citation relationship database;
a business behavior relationship library module, configured to count, according to the citation relationship database, the number of business behavior words and cited business behavior words and the number of times they occur together, to generate business behavior relationships, and to construct a business behavior relationship library.
Further, the corpus module is specifically configured to: search for existing corpora and download or crawl material from the Internet; and clean the material, segment it into words, tag parts of speech, and remove stop words.
Further, the business behavior dictionary module is specifically configured to:
parse and segment all file titles in the corpus;
collect business behavior words, including known business behavior words, continuously derived business behavior words, and business behavior words that need to be converted;
screen and test the business behavior words;
preliminarily classify the business behavior words and reason over them.
Further, the citation relationship database module is specifically configured to:
parse the content of every file in the corpus, and extract the relationship data between the file title and the titles of the files it cites;
according to the file title, attach a business behavior label to every file to form citation relationship data, and construct the citation relationship database; wherein the citation relationship data includes the file title, the behavior label, the cited file title, and the cited behavior label.
Another embodiment of the present invention provides a business behavior relationship extraction device based on "citation relationship" data, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the above business behavior relationship extraction method based on "citation relationship" data.
Implementing the embodiments of the present invention can improve the relevance of the extracted business relationships, reflect business reality more closely than plain word distance, and improve the accuracy of task-based knowledge retrieval.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
In a first aspect, referring to Figs. 1-2, one embodiment of the present invention provides a business behavior relationship extraction method based on "citation relationship" data, comprising:
S1: acquiring corpus material, preprocessing it, and constructing a corpus.
Acquiring the corpus material specifically comprises searching for existing corpora and downloading or crawling material from the Internet; preprocessing the corpus material specifically comprises cleaning the material, segmenting it into words, tagging parts of speech, and removing stop words.
In a specific embodiment, the special policies and leaders' speech documents of the central and provincial governments are mainly collected from official government websites and organized.
It can be understood that many organizations, such as business departments and companies, accumulate large amounts of paper or electronic text as their business develops. Where permitted, this material can be lightly consolidated, and the paper documents fully digitized, to serve as the corpus.
Alternatively, standard open data sets from China or abroad can be used; for Chinese, examples include the Sogou corpus and the People's Daily corpus. Data can also be collected with a self-written crawler before the subsequent steps are carried out.
In a specific embodiment, corpus preprocessing accounts for roughly 50%-70% of the workload of a complete Chinese natural language processing application, so developers spend most of their time on it. The preprocessing is completed in four main steps, described below: data cleaning, word segmentation, part-of-speech tagging, and stop word removal.
1. Corpus cleaning
Data cleaning, as the name suggests, means keeping what is of interest in the corpus and cleaning out, as noise, what is not. For original text this includes extracting the title, abstract, body, and similar information; for crawled web pages it includes removing advertisements, tags, HTML and JS code, comments, and so on. Common cleaning approaches include manual deduplication, alignment, deletion, and annotation; rule-based content extraction; regular expression matching; extraction by part of speech and named entities; and batch processing with scripts or code.
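As a minimal sketch of the regular expression approach mentioned above, the following strips scripts, styles, comments, and tags from a crawled page and removes exact duplicates. The patterns and the function names `clean_page` and `deduplicate` are illustrative assumptions, not part of the embodiment; real pages would likely need site-specific rules.

```python
import html
import re

# Hypothetical patterns for illustration only.
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def clean_page(raw: str) -> str:
    """Strip scripts, styles, comments and tags, then normalise whitespace."""
    text = SCRIPT_OR_STYLE.sub(" ", raw)
    text = COMMENT.sub(" ", text)
    text = TAG.sub(" ", text)
    text = html.unescape(text)  # turn entities such as &amp; back into characters
    return WHITESPACE.sub(" ", text).strip()


def deduplicate(docs):
    """Drop exact duplicates while keeping first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique
```

Exact-match deduplication is the simplest choice here; near-duplicate detection would need a different technique.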
2. Word segmentation
Chinese corpus data is a collection of short or long texts, such as sentences, article abstracts, paragraphs, or whole articles. The characters and words within a sentence or paragraph are generally written continuously and carry meaning. For text mining, however, the smallest processing unit should be the character or the word, so the text must first be segmented.
Common segmentation approaches are: segmentation based on string matching, segmentation based on understanding, segmentation based on statistics, and rule-based segmentation; each approach covers many specific algorithms.
The main difficulties for current Chinese segmentation algorithms are ambiguity resolution and new word recognition. For example, "羽毛球拍卖完了" can be segmented as "羽毛球拍 / 卖 / 完 / 了" (the badminton rackets are sold out) or as "羽毛球 / 拍卖 / 完 / 了" (the badminton auction is over); without the context of the other sentences it is hard to know which reading is intended.
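The ambiguity problem can be illustrated with string-matching segmentation. The sketch below implements forward and backward maximum matching, two common string-matching algorithms, and runs them on the well-known ambiguous sentence 羽毛球拍卖完了, which is assumed here to be the racket example referred to above. The vocabulary is a hypothetical toy dictionary chosen so that the two directions disagree.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len=4):
    """Greedy right-to-left segmentation: take the longest dictionary word."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


# Hypothetical toy vocabulary.
VOCAB = {"羽毛球拍", "羽毛球", "拍卖", "卖", "完", "了"}
SENTENCE = "羽毛球拍卖完了"
```

With this vocabulary the forward pass keeps "羽毛球拍" (badminton racket) together, while the backward pass finds "拍卖" (auction) first, so neither direction alone can settle the reading without context.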
3. Part-of-speech tagging
Part-of-speech tagging assigns a part-of-speech label, such as adjective, verb, or noun, to each character or word. This lets later processing draw on more useful linguistic information. Part-of-speech tagging is a classic sequence labeling problem, but it is not required for every Chinese natural language processing task. For example, ordinary text classification need not consider part of speech, whereas tasks such as sentiment analysis and knowledge reasoning do. The following figure summarizes the common Chinese parts of speech.
Common part-of-speech tagging methods are either rule-based or statistics-based. Statistics-based methods include tagging based on maximum entropy, tagging that outputs the statistically most probable part of speech, and tagging based on HMM.
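Of the statistics-based methods listed above, the simplest is to output the statistically most probable tag for each word. The sketch below trains such a tagger from a tiny tagged sample; the sample sentences, the tag names "v" and "n", and the default tag for unseen words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {
        word: tag_counts.most_common(1)[0][0]
        for word, tag_counts in counts.items()
    }


def tag_words(words, model, default="n"):
    """Tag each word; unseen words get the assumed default tag."""
    return [(word, model.get(word, default)) for word in words]


# Hypothetical training sample: "通知" is seen twice as a noun, once as a verb.
SAMPLE = [
    [("发布", "v"), ("通知", "n")],
    [("通知", "v"), ("企业", "n")],
    [("印发", "v"), ("通知", "n")],
]
MODEL = train_most_frequent_tag(SAMPLE)
```

This baseline ignores context entirely, which is the limitation that sequence models such as HMM address.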
4. Stop word removal
Stop words are generally words that contribute nothing to text features, such as punctuation marks, modal particles, and personal pronouns. In ordinary text processing, therefore, the step after segmentation is stop word removal. For Chinese, however, stop word removal is not fixed: the stop word dictionary is determined by the specific scenario. In sentiment analysis, for example, modal particles and exclamation marks should be retained, because they contribute to expressing the degree of tone and emotion.
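The scenario dependence described above can be sketched as follows. The stop word sets and the scenario names are hypothetical; an actual stop word dictionary would be far larger and chosen for the task.

```python
# Hypothetical stop word lists for illustration.
BASE_STOP_WORDS = {"的", "了", "吗", "啊", "，", "。", "！"}
# Modal particles and exclamation marks carry tone, so sentiment analysis keeps them.
SENTIMENT_KEEP = {"吗", "啊", "！"}


def stop_words_for(scenario):
    """Return the stop word set that applies to the given scenario."""
    if scenario == "sentiment":
        return BASE_STOP_WORDS - SENTIMENT_KEEP
    return set(BASE_STOP_WORDS)


def remove_stop_words(tokens, scenario="general"):
    """Filter segmented tokens against the scenario's stop word set."""
    stop = stop_words_for(scenario)
    return [token for token in tokens if token not in stop]


TOKENS = ["政策", "的", "效果", "真", "好", "啊", "！"]
```

In the general scenario the particle "啊" and the exclamation mark are dropped; in the sentiment scenario they survive.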
S2: extracting business behavior words from all file titles in the corpus, and classifying the business behavior words by business domain to form a business behavior dictionary for each functional area.
Extracting business behavior words from all file titles in the corpus specifically comprises:
parsing and segmenting all file titles in the corpus;
collecting business behavior words, including known business behavior words, continuously derived business behavior words, and business behavior words that need to be converted;
screening and testing the business behavior words;
preliminarily classifying the business behavior words and reasoning over them.
In a specific embodiment, well-targeted business behavior words give customers who are likely to convert an entry point.
The procedure mainly includes:
1. Collecting business behavior words.
(1) Continuously derived business behavior words;
(2) Existing business behavior words that most people are unaware of (that is, they do not know that these words have a conversion rate). It can be understood that whenever a user searches, the system can pick such words out, so it is only necessary to find the core words among them.
2. Screening business behavior words.
New words are constantly produced and old words constantly disappear, so the system keeps checking whether new kinds of words have appeared. Not every word is useful, however: words that clearly do not meet user needs should be cut. Business behavior words that are clearly useless are removed, and those that cannot be judged are passed on to business behavior word testing.
3. Testing business behavior words.
A testing tool is used to check the conversion rate of each business behavior word. A word cannot be judged by conversion rate alone, however, because other links in the chain, such as customer service and website content, also determine whether the conversion rate is high or low. The business behavior words that remain after testing are the effective business behavior words.
4. Classifying and reasoning over business behavior words.
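The collection, screening, and classification steps can be sketched as follows, assuming the titles have already been segmented. All word lists, the conversion table, and the domain names are hypothetical placeholders; the testing step is represented only by a fixed rejection set, since the embodiment does not specify the testing tool.

```python
from collections import defaultdict

# Hypothetical inputs for illustration.
KNOWN = {"申报", "审批", "备案"}                # known business behavior words
DERIVED_CANDIDATES = {"核准", "办理"}           # newly derived candidates
CONVERSIONS = {"报批": "审批"}                  # variant -> canonical word
REJECTED_AFTER_SCREENING = {"办理"}             # assumed outcome of screening/testing
DOMAIN_OF = {
    "申报": "project management",
    "审批": "administrative approval",
    "核准": "administrative approval",
    "备案": "administrative approval",
}


def collect_behavior_words(segmented_titles):
    """Collect known, derived, and converted behavior words from titles."""
    candidates = KNOWN | DERIVED_CANDIDATES
    found = set()
    for tokens in segmented_titles:
        for token in tokens:
            word = CONVERSIONS.get(token, token)
            if word in candidates:
                found.add(word)
    return found


def screen(words):
    """Drop words that screening or testing marked as not useful."""
    return {word for word in words if word not in REJECTED_AFTER_SCREENING}


def build_behavior_dictionary(words):
    """Group the surviving words into one dictionary per business domain."""
    dictionary = defaultdict(set)
    for word in words:
        dictionary[DOMAIN_OF.get(word, "unclassified")].add(word)
    return dict(dictionary)


TITLES = [
    ["项目", "申报", "通知"],
    ["建设", "项目", "报批", "办法"],
    ["业务", "办理", "指南"],
    ["投资", "项目", "核准", "办法"],
]
```

Keeping the conversion table separate from the known word list means a variant such as "报批" is normalised before it is looked up, so the dictionary stores only canonical words.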
S3: extracting, from the corpus, relationship data between all file titles and the titles of the files they cite, and constructing a citation relationship database. Specifically:
parsing the content of every file in the corpus, and extracting the relationship data between the file title and the titles of the files it cites;
according to the file title, attaching a business behavior label to every file to form citation relationship data, and constructing the citation relationship database; wherein the citation relationship data includes the file title, the behavior label, the cited file title, and the cited behavior label.
In a specific embodiment, when a file is processed, the citation relationship data is extracted from the reference database that is consulted and written to the citation relationship database.
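A minimal sketch of step S3 follows. It assumes that cited documents appear in the file body inside Chinese title marks 《》, that a behavior label is the first behavior word found in a title, and that the database is a single SQLite table named `citation`; none of these details is specified by the embodiment.

```python
import re
import sqlite3
from typing import NamedTuple, Optional

CITED_TITLE = re.compile(r"《([^《》]+)》")   # assumed citation convention
BEHAVIOR_WORDS = ("申报", "审批", "备案")     # hypothetical behavior dictionary


class CitationRecord(NamedTuple):
    file_title: str
    behavior_label: Optional[str]
    cited_title: str
    cited_label: Optional[str]


def behavior_label(title):
    """Return the first behavior word contained in the title, if any."""
    for word in BEHAVIOR_WORDS:
        if word in title:
            return word
    return None


def extract_citations(file_title, body):
    """Build one record per distinct cited title found in the body."""
    label = behavior_label(file_title)
    records, seen = [], set()
    for cited in CITED_TITLE.findall(body):
        if cited == file_title or cited in seen:
            continue
        seen.add(cited)
        records.append(
            CitationRecord(file_title, label, cited, behavior_label(cited))
        )
    return records


def store(conn, records):
    """Write the records to the (assumed) citation table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation "
        "(file_title TEXT, behavior_label TEXT, cited_title TEXT, cited_label TEXT)"
    )
    conn.executemany("INSERT INTO citation VALUES (?, ?, ?, ?)", records)
    conn.commit()


TITLE = "项目申报管理办法"
BODY = "根据《项目审批暂行规定》和《企业备案指南》，制定本办法。参见《项目审批暂行规定》。"
```

Each record carries the four fields named above: file title, behavior label, cited file title, and cited behavior label.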
S4: according to the citation relationship database, counting the number of business behavior words and cited business behavior words and the number of times they occur together, generating business behavior relationships, and constructing a business behavior relationship library.
In a specific embodiment, the degree of correlation between business behaviors is evaluated based on the number of co-occurrences.
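The counting in step S4 can be sketched as below. The input is a list of (behavior label, cited behavior label) pairs taken from the citation relationship data. The embodiment states only that correlation is evaluated from the co-occurrence count; the `share` value (co-occurrences divided by the citing word's total) is one possible normalisation, added here as an assumption.

```python
from collections import Counter


def build_relations(label_pairs):
    """Count citing words, cited words, and their co-occurrences."""
    citing = Counter()
    cited = Counter()
    together = Counter()
    for label, cited_label in label_pairs:
        if label is None or cited_label is None:
            continue  # skip files without a behavior label
        citing[label] += 1
        cited[cited_label] += 1
        together[(label, cited_label)] += 1

    relations = {}
    for (a, b), n in together.items():
        relations[(a, b)] = {
            "count": n,                # number of co-occurrences
            "cited_total": cited[b],   # how often b is cited overall
            "share": n / citing[a],    # assumed normalisation, not from the source
        }
    return relations


# Hypothetical label pairs.
PAIRS = [
    ("申报", "审批"),
    ("申报", "审批"),
    ("申报", "备案"),
    ("审批", "备案"),
    (None, "备案"),
]
RELATIONS = build_relations(PAIRS)
```

Here "申报" cites "审批" in two of its three labelled citations, so that pair receives a higher score than the pair ("申报", "备案").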
Compared with building relationships between business behaviors manually or with the word2vec algorithm, the present embodiment has the following main advantages:
(1) It combines manual work with machine processing, which is more efficient than purely manual construction.
(2) Citation relationship data implies strong relationships between business matters and reflects business reality more closely than plain word distance, so the resulting relationships are more strongly correlated in business terms than those built by the word2vec algorithm.
Implementing the embodiments of the present invention can improve the relevance of the extracted business relationships, reflect business reality more closely than plain word distance, and improve the accuracy of task-based knowledge retrieval.
In a second aspect, as shown in Fig. 3, another embodiment of the present invention provides a business behavior relationship extraction device based on "citation relationship" data, comprising:
a corpus module 21, configured to acquire corpus material, preprocess it, and construct a corpus.
The corpus module 21 is specifically configured to: search for existing corpora and download or crawl material from the Internet; and clean the material, segment it into words, tag parts of speech, and remove stop words.
In a specific embodiment, the special policies and leaders' speech documents of the central and provincial governments are mainly collected from official government websites and organized.
It can be understood that many organizations, such as business departments and companies, accumulate large amounts of paper or electronic text as their business develops. Where permitted, this material can be lightly consolidated, and the paper documents fully digitized, to serve as the corpus.
Alternatively, standard open data sets from China or abroad can be used; for Chinese, examples include the Sogou corpus and the People's Daily corpus. Data can also be collected with a self-written crawler before the subsequent steps are carried out.
In a specific embodiment, corpus preprocessing accounts for roughly 50%-70% of the workload of a complete Chinese natural language processing application, so developers spend most of their time on it. The preprocessing is completed in four main steps: data cleaning, word segmentation, part-of-speech tagging, and stop word removal.
A business behavior dictionary module 22, configured to extract business behavior words from all file titles in the corpus, and to classify the business behavior words by business domain to form a business behavior dictionary for each functional area.
The business behavior dictionary module 22 is specifically configured to:
parse and segment all file titles in the corpus;
collect business behavior words, including known business behavior words, continuously derived business behavior words, and business behavior words that need to be converted;
screen and test the business behavior words;
preliminarily classify the business behavior words and reason over them.
In a specific embodiment, well-targeted business behavior words give customers who are likely to convert an entry point.
The procedure mainly includes:
1. Collecting business behavior words.
(1) Continuously derived business behavior words;
(2) Existing business behavior words that most people are unaware of (that is, they do not know that these words have a conversion rate). It can be understood that whenever a user searches, the system can pick such words out, so it is only necessary to find the core words among them.
2. Screening business behavior words.
New words are constantly produced and old words constantly disappear, so the system keeps checking whether new kinds of words have appeared. Not every word is useful, however: words that clearly do not meet user needs should be cut. Business behavior words that are clearly useless are removed, and those that cannot be judged are passed on to business behavior word testing.
3. Testing business behavior words.
A testing tool is used to check the conversion rate of each business behavior word. A word cannot be judged by conversion rate alone, however, because other links in the chain, such as customer service and website content, also determine whether the conversion rate is high or low. The business behavior words that remain after testing are the effective business behavior words.
4. Classifying and reasoning over business behavior words.
A citation relationship database module 23, configured to extract, from the corpus, relationship data between all file titles and the cited file titles, and to construct a citation relationship database.
The citation relationship database module 23 is specifically configured to:
parse the content of every file in the corpus, and extract the relationship data between the file title and the titles of the files it cites;
according to the file title, attach a business behavior label to every file to form citation relationship data, and construct the citation relationship database; wherein the citation relationship data includes the file title, the behavior label, the cited file title, and the cited behavior label.
In a specific embodiment, when a file is processed, the citation relationship data is extracted from the reference database that is consulted and written to the citation relationship database.
A business behavior relationship library module 24, configured to count, according to the citation relationship database, the number of business behavior words and cited business behavior words and the number of times they occur together, to generate business behavior relationships, and to construct a business behavior relationship library.
In a specific embodiment, the degree of correlation between business behaviors is evaluated based on the number of co-occurrences.
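How modules 21 to 24 hand data to one another can be sketched as a simple pipeline. The class name `ExtractionDevice`, the field names, and the data shapes are assumptions made for illustration; each module is represented by a callable so that any concrete implementation can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class ExtractionDevice:
    """Wires the four modules together in the order 21 -> 22 -> 23 -> 24."""

    corpus_module: Callable[[List[str]], List[dict]]
    dictionary_module: Callable[[List[dict]], Dict[str, Set[str]]]
    citation_module: Callable[
        [List[dict], Dict[str, Set[str]]], List[Tuple[str, str]]
    ]
    relation_module: Callable[
        [List[Tuple[str, str]]], Dict[Tuple[str, str], int]
    ]

    def run(self, sources: List[str]) -> Dict[Tuple[str, str], int]:
        corpus = self.corpus_module(sources)                  # module 21
        dictionary = self.dictionary_module(corpus)           # module 22
        citations = self.citation_module(corpus, dictionary)  # module 23
        return self.relation_module(citations)                # module 24
```

Passing the modules in as callables keeps the sketch independent of how each module is actually implemented.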
Implementing the embodiments of the present invention can improve the relevance of the extracted business relationships, reflect business reality more closely than plain word distance, and improve the accuracy of task-based knowledge retrieval.
Another embodiment of the present invention provides a business behavior relationship extraction device based on "citation relationship" data, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the above business behavior relationship extraction method based on "citation relationship" data.
The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.