The object of the present invention is to provide a Chinese personal résumé information processing system and method that automatically extract relevant information from Chinese personal résumé text written in any format and process that information into a standard format.
According to one aspect of the present invention, a Chinese personal résumé information processing method is provided, the method comprising the following steps:
preprocessing the input Chinese personal résumé text to form an annotated first résumé text;
performing word segmentation on the first résumé text to form an annotated second résumé text;
identifying and annotating the proper noun phrases commonly used in résumés that occur in the second résumé text, to form an annotated third résumé text; and
performing text structure analysis on the annotated third résumé text to form annotated text blocks, each having a particular type.
According to a further aspect of the present invention, a Chinese personal résumé information processing system is provided, comprising:
a résumé text information identification and annotation device for annotating the characters, words, phrases, and proper nouns in the input résumé text; and
a résumé text structure analysis and annotation device for dividing the identified and annotated résumé text into blocks and for annotating, splitting, and merging the resulting text blocks.
With the Chinese personal résumé information processing system and method of the present invention, résumé text written in any style can be processed and its main information extracted to produce a single unified format, which facilitates the building of talent databases and the retrieval of talent information.
The present invention is described in further detail below with reference to the accompanying drawings and preferred embodiments. Other objects, features, and effects of the present invention will become clearer from the following description.
Referring to Fig. 1, the Chinese personal résumé information processing system of the present invention comprises: a résumé text information identification and annotation device 1 for annotating the characters, words, phrases, and proper nouns in the input résumé text; a résumé text structure analysis and annotation device 2 for dividing the identified and annotated résumé text into blocks and for annotating, splitting, and merging the resulting text blocks; and an information collection and aggregation device 3 that aggregates the various items of information in a specific order and outputs them as the information extraction result.
The résumé text information identification and annotation device 1 comprises: a preprocessing device 11 for identifying and annotating specific characters in the text; a word segmentation device 12 for performing word segmentation on the text; and a proper noun identification and annotation device 13 for identifying and annotating the proper noun phrases commonly used in résumés that occur in the text.
The résumé text structure analysis and annotation device 2 comprises: a résumé text blocking device 21 for initially dividing the text into blocks by natural paragraph; a text block annotation device 22 for matching and annotating the initially divided text blocks; a text block splitting device 23 for splitting the annotated text blocks into text blocks of a single type; and a text block merging device 24 for merging the text blocks of the same type that result from the splitting into one larger text block of a single type.
Referring next to Fig. 2 to Fig. 4, these show the operational flowchart of the Chinese personal résumé information processing system of the present invention. In step S1, a Chinese personal résumé text is input to the system. In step S2, the system preprocesses the input résumé text. This comprises step S21, in which the system identifies and annotates the numbers, foreign-language words, punctuation marks, and the like in the original résumé text, and step S22, in which the system further identifies and annotates dates and times, URL web addresses, e-mail addresses, and the like in the text. At this point the system has formed the annotated first résumé text.
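The preprocessing of step S2 can be illustrated with a minimal sketch. The description does not specify how the annotation is carried out, so the regular expressions, the tag names (`EMAIL`, `URL`, `DATE`, `NUM`, `FOREIGN`, `PUNCT`, `TEXT`), and the function name `preprocess` below are all assumptions made for illustration. For brevity the sketch also folds steps S21 and S22 into a single pass.

```python
import re

# Hypothetical tag names and patterns; the description does not define them.
# More specific patterns are listed first so they take precedence over
# generic ones (for example, an e-mail address over a plain foreign word).
TOKEN_PATTERNS = [
    ("EMAIL", r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    ("URL", r"(?:https?://|www\.)[A-Za-z0-9./?=&_%#-]+"),
    ("DATE", r"\d{4}年\d{1,2}月(?:\d{1,2}日)?|\d{4}[./-]\d{1,2}(?:[./-]\d{1,2})?"),
    ("NUM", r"\d+(?:\.\d+)?"),
    ("FOREIGN", r"[A-Za-z]+"),
    ("PUNCT", r"[，。、；：？！,.;:?!()（）]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def preprocess(text):
    """Return (tag, substring) pairs; text matched by no pattern is tagged TEXT."""
    tokens, pos = [], 0
    for match in MASTER.finditer(text):
        if match.start() > pos:
            # Unannotated run (typically Chinese text) left for word segmentation.
            tokens.append(("TEXT", text[pos:match.start()]))
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    if pos < len(text):
        tokens.append(("TEXT", text[pos:]))
    return tokens


# "E-mail: zhang@example.com, enrolled September 2001"
for tag, value in preprocess("邮箱：zhang@example.com，2001年9月入学"):
    print(tag, value)
```

In this sketch the runs tagged `TEXT` are the portions that would be passed on to the word segmentation of step S3.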
In step S3, the system performs word segmentation on the first résumé text using a general-purpose dictionary and a résumé dictionary. The résumé dictionary is a special dictionary configured specifically for Chinese résumé text; it contains a large number of coarse-grained compound terms extracted from real résumé texts. After the word segmentation step, the system has formed the annotated second résumé text. The second résumé text now contains identifiable Chinese words, common phrases, and résumé-specific proper nouns and phrases, for example "Beijing", "Tsinghua University", "undergraduate", "graduated", "Carefree Working Net", "development department", "engineer", "technical director", "education background", "work experience", "hobbies", and so on.
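The effect of adding a coarse-grained résumé dictionary to a general-purpose one can be sketched as follows. The description does not name a segmentation algorithm, so forward maximum matching is used here purely as an assumption, and the dictionary contents and the names `GENERAL_DICT`, `RESUME_DICT`, and `segment` are hypothetical.

```python
# Hypothetical dictionary contents, for illustration only.
GENERAL_DICT = {"北京", "毕业", "大学"}                      # Beijing, graduate, university
RESUME_DICT = {"清华大学", "教育背景", "工作经历", "技术总监"}  # Tsinghua University, ...


def segment(text, dictionaries=(RESUME_DICT, GENERAL_DICT)):
    """Forward maximum matching over the union of the given dictionaries.

    At each position the longest dictionary entry is taken; a character
    that starts no entry is emitted on its own.
    """
    vocab = set().union(*dictionaries)
    max_len = max(map(len, vocab))
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


# "Graduated from Tsinghua University"
print(segment("毕业于清华大学"))
```

Because the longest entry wins, the institution name from the résumé dictionary is kept as one unit instead of being broken at the general-purpose entry for "university".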
In step S4, the system uses a proper noun phrase identification knowledge base (hereinafter the first knowledge base) and a first rule interpreter to identify and annotate the proper noun phrases commonly used in résumés that occur in the second résumé text (for example, personal names, educational institution names, major names, employer names, department names, professional or job titles, project names, roles held, and so on). The first knowledge base is constructed around the characteristics of the proper noun phrases commonly used in résumés and contains a number of structural feature rules for such phrases. For example, under one such rule, a proper noun phrase with the structure "place noun (such as Beijing, Shanghai, Jiangsu Province) + one or more other nouns (such as aviation, transportation) + educational institution suffix (such as university, college)" is identified and annotated as an "educational institution name". The first rule interpreter interprets and analyzes the phrase structure feature rules in the first knowledge base and thereby identifies these proper noun phrases. After the proper noun identification and annotation step, the system has formed the annotated third résumé text.
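The structural rule quoted above, "place noun + one or more other nouns + educational institution suffix", can be sketched as a small rule interpreter. The rule encoding (one category letter per word, matched with a regular expression), the word lists, and the names `RULES`, `category`, and `tag_proper_nouns` are assumptions of this sketch; the description does not say how the first knowledge base stores its rules.

```python
import re

# Hypothetical word lists standing in for the first knowledge base.
PLACES = {"北京", "上海", "江苏省"}        # Beijing, Shanghai, Jiangsu Province
OTHER_NOUNS = {"航空", "航天", "交通"}     # aviation, aerospace, transportation
EDU_SUFFIXES = {"大学", "学院"}            # university, college

# Each rule is (label, pattern over category codes):
# P = place noun, N = other noun, S = institution suffix, O = anything else.
RULES = [("EDU_INSTITUTION", re.compile(r"PN+S"))]


def category(word):
    """Map a segmented word to its one-letter category code."""
    if word in PLACES:
        return "P"
    if word in EDU_SUFFIXES:
        return "S"
    if word in OTHER_NOUNS:
        return "N"
    return "O"


def tag_proper_nouns(words):
    """Return (label, text) pairs; words matched by no rule get label None."""
    codes = "".join(category(w) for w in words)  # one code letter per word
    result, pos = [], 0
    while pos < len(words):
        for label, pattern in RULES:
            match = pattern.match(codes, pos)
            if match:
                # The matched words are joined into one annotated phrase.
                result.append((label, "".join(words[pos:match.end()])))
                pos = match.end()
                break
        else:
            result.append((None, words[pos]))
            pos += 1
    return result


# "graduated / from / Beijing / aviation / aerospace / university"
print(tag_proper_nouns(["毕业", "于", "北京", "航空", "航天", "大学"]))
```

Since each word maps to exactly one code letter, positions in the code string coincide with word indices, which keeps the interpreter short. A real knowledge base would need many more categories and rules.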
In step S5, the system performs text structure analysis on the annotated third résumé text. This comprises the following sub-steps. In step S51, the third résumé text is initially divided into blocks by natural paragraph. In step S52, the system uses a text pattern knowledge base (hereinafter the second knowledge base) and a second rule interpreter to match and annotate the initially divided text blocks. A text block that has been matched and annotated may be a block containing information of a single type only, or it may be a mixed text block containing information of several types. The second knowledge base contains a number of pattern rules constructed from the features of the different types of text block found in résumé text, and the second rule interpreter interprets and analyzes these pattern rules. For example, under one such rule, a text block matching "a start and end date range exists AND an educational institution name exists AND a major name exists AND a degree name exists" is annotated as an "education background block". In step S53, the system uses a first database and specific decision criteria to determine the type of the head of each mixed text block. The head of a text block is the run of consecutive sentences at its beginning that contain information of one and the same type; the information that immediately follows the head, if any, is of a different type. The first database, also called the "information frequency weight database", contains statistics, gathered from a large number of real résumé texts, on how frequently the different kinds of information occur in the different types of text block. In step S54, the system uses a résumé text blocking clue dictionary and probability database to split each mixed text block into finer text blocks of a single type. The blocking clue dictionary and probability database contain a number of blocking clue words trained on and extracted from a large number of real résumé texts, together with the statistical probability that each of these words acts as a block boundary marker in résumé text. In step S55, the system merges the text blocks of the same type that result from the splitting into one larger text block of a single type, for example a basic information block, an education background block, a work experience block, a project experience block, a job-seeking requirements block, and an other information block.
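Steps S52, S54, and S55 can be sketched together under some strong simplifying assumptions. The tag names, the rule table `PATTERN_RULES`, the clue words and probability values in `CLUE_PROBABILITY`, the 0.5 threshold, and all function names are hypothetical; in particular the probabilities are made-up illustrative numbers, not data from the description. Treating a block that satisfies more than one rule as a mixed block is also an assumption, and the head type decision of step S53 is not modelled.

```python
# Hypothetical AND-rules standing in for the second knowledge base (step S52).
PATTERN_RULES = [
    ("EDUCATION", {"DATE_RANGE", "EDU_INSTITUTION", "MAJOR", "DEGREE"}),
    ("WORK", {"DATE_RANGE", "EMPLOYER", "JOB_TITLE"}),
]

# Hypothetical clue words with made-up boundary probabilities (step S54).
# "education background", "work experience", "Beijing"
CLUE_PROBABILITY = {"教育背景": 0.95, "工作经历": 0.93, "北京": 0.02}


def label_block(tags):
    """Return the labels of all rules whose required tags are all present."""
    present = set(tags)
    return [label for label, required in PATTERN_RULES if required <= present]


def split_mixed_block(sentences, threshold=0.5):
    """Start a new block at each sentence that begins with a likely clue word."""
    blocks = []
    for sentence in sentences:
        is_boundary = any(
            sentence.startswith(word) and prob >= threshold
            for word, prob in CLUE_PROBABILITY.items()
        )
        if is_boundary or not blocks:
            blocks.append([sentence])
        else:
            blocks[-1].append(sentence)
    return blocks


def merge_blocks(labelled_blocks):
    """Group (label, text) pairs so each type ends up in one combined block (S55)."""
    merged = {}
    for label, text in labelled_blocks:
        merged.setdefault(label, []).append(text)
    return merged


# A block carrying both education and work tags matches two rules.
print(label_block(["DATE_RANGE", "EDU_INSTITUTION", "MAJOR", "DEGREE",
                   "EMPLOYER", "JOB_TITLE"]))
print(split_mixed_block(["教育背景", "2001-2005 清华大学",
                         "工作经历", "北京某公司 工程师"]))
print(merge_blocks([("EDUCATION", "a"), ("WORK", "b"), ("EDUCATION", "c")]))
```

In the splitting example the last sentence begins with a place name whose assumed boundary probability is low, so it stays in the work experience block instead of opening a new one.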
In step S6, the system collects the corresponding information from each type of text block; the information to be collected has already been progressively identified and annotated in the preceding steps. For example, from the personal basic information block it collects the name, sex, date of birth, marital status, postal code, telephone number, e-mail address, city of residence, mailing or residential address, identity card number, and similar information. From the education background text block it collects the start and end dates of each period of education, the educational institution name, the department or major name, the educational level or degree name, the highest educational level attained, foreign languages and proficiency levels, and similar information. From the work experience text block it collects the start and end dates of employment, the employer name, the department name, the professional or job title held, the number of years worked, and similar information. From the project experience text block it collects the project start and end dates, the project name, the development tool names, the hardware environment, the software environment, the role or responsibilities held, and similar information. From the job-seeking requirements text block it collects the desired industry, the job function title, the work location, the monthly salary requirement, the desired type of employer, and similar information. From the other information text block it collects information not covered by the blocks above, such as professional skills, training experience, names of certificates obtained, names of awards, personal interests, and hobbies.
In step S7, the system aggregates the various items of information in a specific order and outputs them as the information extraction result.
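The collection and ordered output of steps S6 and S7 can be sketched as follows. The description does not state what the "specific order" is or what the output looks like, so the order table `OUTPUT_ORDER`, the field names, and the function `collect` are assumptions, and only three of the block types are shown.

```python
# Hypothetical output order: block types and, within each, the fields to emit.
OUTPUT_ORDER = [
    ("BASIC_INFO", ["NAME", "GENDER", "BIRTH_DATE"]),
    ("EDUCATION", ["DATE_RANGE", "EDU_INSTITUTION", "MAJOR", "DEGREE"]),
    ("WORK", ["DATE_RANGE", "EMPLOYER", "JOB_TITLE"]),
]


def collect(blocks):
    """Build the ordered extraction result.

    `blocks` maps a block type to the (tag, value) pairs annotated in it.
    Simplification: if a tag occurs more than once in a block, only the
    last value is kept; missing fields are reported as None.
    """
    result = []
    for block_type, fields in OUTPUT_ORDER:
        tagged = dict(blocks.get(block_type, []))
        record = {field: tagged.get(field) for field in fields}
        result.append((block_type, record))
    return result


annotated = {
    "EDUCATION": [("EDU_INSTITUTION", "清华大学"), ("DEGREE", "学士")],
    "BASIC_INFO": [("NAME", "张三")],
}
for block_type, record in collect(annotated):
    print(block_type, record)
```

The output always follows the order table regardless of where each block appeared in the résumé, which is what allows résumés written in different layouts to end up in one unified format. A fuller version would keep one record per education or work entry instead of one per block type.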
The above is merely a preferred embodiment of the Chinese personal résumé information processing system and method of the present invention. Those skilled in the art may make various modifications and variations in accordance with the concept of the present invention, and all such modifications and variations fall within the scope of the present invention.