Detailed Description of the Embodiments
In a first aspect, an embodiment of the present application provides an address information classification method. Referring to Fig. 1, the method includes the following steps:
Step 101: extract all address information to be processed from a text.
The address information to be processed can be extracted from the text using an address information extraction model. Specifically, a Chinese word segmentation system performs word segmentation and part-of-speech tagging on a sufficient number of training texts one by one, and a BiLSTM model is then trained on the training texts to generate the address extraction model. Staff can then use this model to extract the address information from a text.
Step 102: according to each piece of address information to be processed, determine the completeness type of that address information. The completeness type is either positive address information or negative address information: positive address information is complete or partial address information, and negative address information is address information that contains other words.
Step 103: according to the completeness type of each piece of address information to be processed and its position in the text, use a forward search algorithm and a backward search algorithm to obtain the address information to be classified that corresponds to each piece of address information to be processed, where the address information to be classified is complete address information.
Combining the forward search algorithm with the backward search algorithm allows the boundary of the address information to be processed to be marked off accurately, which improves the accuracy and completeness of subsequent data processing.
Step 104: classify each piece of address information to be classified using its context information, to obtain the category corresponding to each piece of address information to be classified.
Step 105: output each piece of address information to be classified and its corresponding category.
As can be seen from the above technical solution, the present application provides an address information classification method. The method first extracts the address information in a text as address information to be processed. Then, according to the completeness of the address information to be processed and its position in the text, it uses a forward search algorithm and a backward search algorithm to obtain complete address information to be classified. Finally, it classifies the address information to be classified using its context information. Therefore, regardless of whether the extracted address information is complete, complete address information can ultimately be obtained and classified accurately, which improves the accuracy of the classification results.
Referring to Fig. 2, an address information classification method provided by another embodiment of the present application includes the following steps:
Step 201: extract all address information to be processed from a text.
The address information to be processed can be extracted from the text using an address information extraction model. Specifically, a Chinese word segmentation system performs word segmentation and part-of-speech tagging on a sufficient number of training texts one by one, and a BiLSTM model is then trained on the training texts to generate the address extraction model. Staff can then use this model to extract the address information from a text.
Step 202: according to each piece of address information to be processed, determine the completeness type of that address information. The completeness type is either positive address information or negative address information: positive address information is complete or partial address information, and negative address information is address information that contains other words.
The address information that the address information extraction model extracts from a text may be complete address information, partial address information, or address information that contains other words. For example, suppose the text is "So-and-so (household registration: aa Province bb City cc District dd Street e Community Building x Unit x No. x, ID number: xxxxxxxxxxxxx) reported stating at AA Province BB City CC District G Town H Community Building x Unit x No. x burgled, door lock intact, home safe pried open", and suppose the address information extraction model returns "aa Province bb City cc District dd Street e Community Building x Unit x No. x" and "CC District G Town H Community". Then "aa Province bb City cc District dd Street e Community Building x Unit x No. x" is complete address information, that is, positive address information; "CC District G Town H Community" is partial address information, which is also positive address information. If the extraction result is "Unit x No. x burgled", it contains the word "burgled" and is therefore negative address information.
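The text does not specify how the completeness type is decided. As a minimal sketch, assuming pre-segmented input and a hypothetical lexicon of address-unit words (`ADDRESS_SUFFIXES`, `ADDRESS_PREFIXES`), a span could be treated as positive when every word in it looks like an address unit and as negative otherwise. The names `Completeness`, `is_address_word` and `completeness_type` are illustrative and do not come from the application.

```python
from enum import Enum
from typing import Sequence


class Completeness(Enum):
    POSITIVE = "positive"  # complete or partial address information
    NEGATIVE = "negative"  # address information containing other words


# Hypothetical lexicon for the English rendering of the example; a real
# system would use its own rules or a trained model instead.
ADDRESS_SUFFIXES = ("Province", "City", "District", "Street", "Town", "Community")
ADDRESS_PREFIXES = ("Building", "Unit", "No.")


def is_address_word(word: str) -> bool:
    """Return True if the word looks like an address unit (assumed rule)."""
    return word.endswith(ADDRESS_SUFFIXES) or word.startswith(ADDRESS_PREFIXES)


def completeness_type(words: Sequence[str]) -> Completeness:
    """Label a segmented span as positive or negative address information."""
    if words and all(is_address_word(w) for w in words):
        return Completeness.POSITIVE
    return Completeness.NEGATIVE


partial = ["CC District", "G Town", "H Community"]
mixed = ["Unit x", "No. x", "burgled"]
print(completeness_type(partial).value)  # positive
print(completeness_type(mixed).value)    # negative
```

Note that a partial address is still positive under this rule; only the presence of a non-address word makes a span negative, which matches the definition in step 202.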
Step 203: if the address information to be processed is positive address information, search from the position of the address information to be processed in the text in the first direction corresponding to the first search algorithm, and merge one adjacent word with the address information to be processed to obtain merged address information. When the first search algorithm is the forward search algorithm, the first direction is the forward direction; when the first search algorithm is the backward search algorithm, the first direction is the backward direction.
Step 204: if the merged address information is positive address information, take the merged address information as the address information to be processed and return to step 203, until the search in the first direction reaches a preset stop symbol adjacent to the address information to be processed.
Step 205: if the merged address information is negative address information, record the number of consecutive times that negative address information has been determined, take the merged address information as the address information to be processed, and return to step 203, until the number of consecutive negative determinations equals a preset consecutive count, or the search in the first direction reaches a preset stop symbol adjacent to the address information to be processed.
Step 206: take the address information to be processed that was most recently determined to be positive address information as the first target address information.
The preset stop symbol can be set by staff, for example a comma or a semicolon. Continuing with the text in the above example, suppose the address information extraction model returns "aa Province bb City cc District dd Street e Community Building x Unit x No. x" and "CC District G Town H Community". The address information to be processed "aa Province bb City cc District dd Street e Community Building x Unit x No. x" is determined to be positive address information. When the first search algorithm is the forward search algorithm, the search proceeds forward from the position of this address information in the text. The element adjacent to it is a comma, so the search loop does not continue, and the first target address information is "aa Province bb City cc District dd Street e Community Building x Unit x No. x".
The address information to be processed "CC District G Town H Community" is also determined to be positive address information. When the first search algorithm is the forward search algorithm, the search proceeds forward from its position in the text. The adjacent word is "BB City"; merging "BB City" with "CC District G Town H Community" gives the merged address information "BB City CC District G Town H Community", which is still determined to be positive address information. The search continues forward: the adjacent word is "AA Province", and merging it with "BB City CC District G Town H Community" gives "AA Province BB City CC District G Town H Community", which is still positive address information. The search continues forward: the adjacent word is "at", and merging it gives "at AA Province BB City CC District G Town H Community", which is determined to be negative address information, so the number of consecutive negative determinations is recorded as 1. The search continues forward: the adjacent word is "stating", and merging it gives "stating at AA Province BB City CC District G Town H Community", which is still negative address information, so the count becomes 2. The search continues forward: the adjacent word is "reported", and merging it gives "reported stating at AA Province BB City CC District G Town H Community", which is still negative address information, so the count becomes 3. If the preset consecutive count is 3, the forward search stops, and the address information most recently determined to be positive, "AA Province BB City CC District G Town H Community", is taken as the first target address information.
When the first search algorithm is the backward search algorithm, only the search direction differs from the above example; everything else is the same and is not repeated here.
Step 207: search from the position of the first target address information in the text in the second direction corresponding to the second search algorithm, and merge one adjacent word with the first target address information to obtain merged address information. When the second search algorithm is the forward search algorithm, the second direction is the forward direction; when the second search algorithm is the backward search algorithm, the second direction is the backward direction.
Step 208: if the merged address information is positive address information, take the merged address information as the first target address information and return to step 207, until the search in the second direction reaches a preset stop symbol adjacent to the first target address information.
Step 209: if the merged address information is negative address information, record the number of consecutive times that negative address information has been determined, take the merged address information as the first target address information, and return to step 207, until the number of consecutive negative determinations equals the preset consecutive count, or the search in the second direction reaches a preset stop symbol adjacent to the first target address information.
Step 210: take the first target address information that was most recently determined to be positive address information as the address information to be classified.
Continuing with the above example, the first target address information is "aa Province bb City cc District dd Street e Community Building x Unit x No. x" and "AA Province BB City CC District G Town H Community". The first target address information "aa Province bb City cc District dd Street e Community Building x Unit x No. x" is determined to be positive address information. With the backward search algorithm as the second search algorithm, the search proceeds backward from its position in the text. The element adjacent to it is a comma, so the search loop does not continue, and the address information to be classified is "aa Province bb City cc District dd Street e Community Building x Unit x No. x".
The first target address information "AA Province BB City CC District G Town H Community" is also determined to be positive address information. With the backward search algorithm as the second search algorithm, the search proceeds backward from its position in the text. The adjacent word is "Building x"; merging it with "AA Province BB City CC District G Town H Community" gives "AA Province BB City CC District G Town H Community Building x", which is still determined to be positive address information. The search continues backward: the adjacent word is "Unit x", and merging it gives "AA Province BB City CC District G Town H Community Building x Unit x", which is still positive address information. The search continues backward: the adjacent word is "No. x", and merging it gives "AA Province BB City CC District G Town H Community Building x Unit x No. x", which is still positive address information. The search continues backward: the adjacent word is "burgled", and merging it gives "AA Province BB City CC District G Town H Community Building x Unit x No. x burgled", which is determined to be negative address information, so the number of consecutive negative determinations is recorded as 1. The search continues backward: the adjacent element is a comma, so the backward search stops, and the address information most recently determined to be positive, "AA Province BB City CC District G Town H Community Building x Unit x No. x", is taken as the address information to be classified.
When the second search algorithm is the forward search algorithm, only the search direction differs from the above example; everything else is the same and is not repeated here.
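Steps 203 to 210 can be sketched as two passes of one directional search. This is a minimal sketch under stated assumptions, not the application's implementation: the text is assumed to be already segmented into a word list, an address is represented as a half-open index range, and a hypothetical `is_positive` rule built on an assumed address lexicon stands in for the real completeness check. Following the usage in this description, "forward" means toward the beginning of the text. Resetting the counter after a positive result is one reading of "consecutive".

```python
from typing import Callable, FrozenSet, Sequence, Tuple

Span = Tuple[int, int]       # half-open word range [start, end)
FORWARD, BACKWARD = -1, +1   # forward = toward the start of the text

# Hypothetical lexicon; stands in for the real completeness check.
ADDRESS_SUFFIXES = ("Province", "City", "District", "Street", "Town", "Community")
ADDRESS_PREFIXES = ("Building", "Unit", "No.")


def is_positive(words: Sequence[str]) -> bool:
    """Assumed rule: positive if every word looks like an address unit."""
    return all(
        w.endswith(ADDRESS_SUFFIXES) or w.startswith(ADDRESS_PREFIXES)
        for w in words
    )


def directional_search(
    tokens: Sequence[str],
    span: Span,
    direction: int,
    positive: Callable[[Sequence[str]], bool] = is_positive,
    stop_symbols: FrozenSet[str] = frozenset({",", ";"}),
    max_negatives: int = 3,
) -> Span:
    """Extend span one word at a time; return the last positive span seen."""
    start, end = span
    last_positive = span
    negatives = 0
    while negatives < max_negatives:
        neighbor = start - 1 if direction == FORWARD else end
        # Stop at the text boundary or at a preset stop symbol.
        if not 0 <= neighbor < len(tokens) or tokens[neighbor] in stop_symbols:
            break
        if direction == FORWARD:
            start = neighbor
        else:
            end = neighbor + 1
        if positive(tokens[start:end]):
            last_positive, negatives = (start, end), 0
        else:
            negatives += 1  # merged span is kept, as in steps 205 and 209
    return last_positive


TOKENS = [
    "reported", "stating", "at", "AA Province", "BB City", "CC District",
    "G Town", "H Community", "Building x", "Unit x", "No. x", "burgled",
    ",", "door lock intact",
]
seed = (5, 8)  # "CC District G Town H Community"
first_target = directional_search(TOKENS, seed, FORWARD)
to_classify = directional_search(TOKENS, first_target, BACKWARD)
print(" ".join(TOKENS[slice(*to_classify)]))
```

The forward pass stops after three consecutive negative merges ("at", "stating", "reported"), and the backward pass stops at the comma after one negative merge ("burgled"), mirroring the worked example.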
Step 211: if the address information to be processed is negative address information, perform word segmentation on the address information to be processed to obtain multiple segmented words.
Suppose the address information to be processed extracted in the above example includes "Unit x No. x burgled". Since this is negative address information, word segmentation is performed on it, giving "Unit x", "No. x" and "burgled".
Step 212: extract any one address word from the multiple segmented words, and take that address word as the address information to be processed.
Since the address words in the segmentation result are "Unit x" and "No. x", either of them can be extracted as the address information to be processed.
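Steps 211 and 212 can be sketched as follows, assuming the negative span is already segmented and using a hypothetical `is_address_word` rule built on an assumed lexicon. The description allows any address word to be chosen; this sketch arbitrarily picks the first one. The function name `seed_from_negative` is illustrative.

```python
from typing import Optional, Sequence, Tuple

Span = Tuple[int, int]  # half-open word range [start, end)

# Hypothetical lexicon of address-unit words.
ADDRESS_SUFFIXES = ("Province", "City", "District", "Street", "Town", "Community")
ADDRESS_PREFIXES = ("Building", "Unit", "No.")


def is_address_word(word: str) -> bool:
    """Return True if the word looks like an address unit (assumed rule)."""
    return word.endswith(ADDRESS_SUFFIXES) or word.startswith(ADDRESS_PREFIXES)


def seed_from_negative(tokens: Sequence[str], span: Span) -> Optional[Span]:
    """Pick one address word inside a negative span as the new starting span.

    Returns None when the span contains no address word at all.
    """
    start, end = span
    for i in range(start, end):
        if is_address_word(tokens[i]):
            return (i, i + 1)  # first address word; any one would do
    return None


TOKENS = ["H Community", "Building x", "Unit x", "No. x", "burgled", ","]
negative_span = (2, 5)  # "Unit x No. x burgled"
seed = seed_from_negative(TOKENS, negative_span)
print(TOKENS[seed[0]:seed[1]])  # ['Unit x']
```

The returned range then plays the role of the address information to be processed in the directional searches that follow.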
Step 213: search from the position of the address information to be processed in the text in the first direction corresponding to the first search algorithm, and merge one adjacent word with the address information to be processed to obtain merged address information. When the first search algorithm is the forward search algorithm, the first direction is the forward direction; when the first search algorithm is the backward search algorithm, the first direction is the backward direction.
Step 214: if the merged address information is positive address information, take the merged address information as the address information to be processed and return to step 213, until the search in the first direction reaches a preset stop symbol adjacent to the address information to be processed.
Step 215: if the merged address information is negative address information, record the number of consecutive times that negative address information has been determined, take the merged address information as the address information to be processed, and return to step 213, until the number of consecutive negative determinations equals the preset consecutive count, or the search in the first direction reaches a preset stop symbol adjacent to the address information to be processed.
Step 216: take the address information to be processed that was most recently determined to be positive address information as the first target address information.
Step 217: search from the position of the first target address information in the text in the second direction corresponding to the second search algorithm, and merge one adjacent word with the first target address information to obtain merged address information. When the second search algorithm is the forward search algorithm, the second direction is the forward direction; when the second search algorithm is the backward search algorithm, the second direction is the backward direction.
Step 218: if the merged address information is positive address information, take the merged address information as the first target address information and return to step 217, until the search in the second direction reaches a preset stop symbol adjacent to the first target address information.
Step 219: if the merged address information is negative address information, record the number of consecutive times that negative address information has been determined, take the merged address information as the first target address information, and return to step 217, until the number of consecutive negative determinations equals the preset consecutive count, or the search in the second direction reaches a preset stop symbol adjacent to the first target address information.
Step 220: take the first target address information that was most recently determined to be positive address information as the address information to be classified.
Steps 213 to 220 are processed in the same way as steps 203 to 210 and are not described again here. It can thus be seen that combining the forward search algorithm with the backward search algorithm marks off the boundary of the complete address information without including other words, which improves the accuracy and completeness of the results of subsequent processing.
Step 221: obtain the context information of each piece of address information to be classified, to obtain the target text information to which each piece of address information to be classified belongs.
The context information of a piece of address information to be classified is a preset number of words before and after its position in the text. If these words include a preset punctuation mark, such as a comma, full stop or semicolon, only the words up to that punctuation mark are taken. This gives the target text information that contains the address information to be classified. For example, the address information to be classified is "aa Province bb City cc District dd Street e Community Building x Unit x No. x", and the preset number of words forward and backward is 3. However, because this address information is immediately followed by a comma and preceded by only two words, the target text information to which it belongs is "So-and-so (household registration: aa Province bb City cc District dd Street e Community Building x Unit x No. x".
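The context window of step 221 can be sketched as follows. This is a minimal sketch that assumes a pre-segmented word list and two hypothetical symbol sets: `PRESET_PUNCTUATION` ends the window, while `OTHER_SYMBOLS` (brackets and colons) are kept in the output but not counted as words. The second assumption is inferred from the example, in which only two words precede the address even though a bracket and a colon also appear.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open word range [start, end)

PRESET_PUNCTUATION = frozenset({",", ".", ";"})  # ends the context window
OTHER_SYMBOLS = frozenset({"(", ")", ":"})       # kept, but not counted


def context_window(tokens: Sequence[str], span: Span, k: int = 3) -> List[str]:
    """Return the address plus up to k words of context on each side."""
    start, end = span
    left, words = start, 0
    while left > 0 and words < k and tokens[left - 1] not in PRESET_PUNCTUATION:
        left -= 1
        if tokens[left] not in OTHER_SYMBOLS:
            words += 1
    right, words = end, 0
    while right < len(tokens) and words < k and tokens[right] not in PRESET_PUNCTUATION:
        if tokens[right] not in OTHER_SYMBOLS:
            words += 1
        right += 1
    return list(tokens[left:right])


TOKENS = [
    "So-and-so", "(", "household registration", ":",
    "aa Province", "bb City", "cc District", "dd Street", "e Community",
    "Building x", "Unit x", "No. x",
    ",", "ID number", ":", "xxxxxxxxxxxxx", ")",
]
address_span = (4, 12)
target_text = context_window(TOKENS, address_span, k=3)
print(target_text[:4])  # ['So-and-so', '(', 'household registration', ':']
```

Because the address is immediately followed by a comma, nothing is added on the right, and the window on the left is limited by the start of the text rather than by `k`.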
Step 222: replace the address information to be classified in each piece of target text information with preset characters.
The preset characters are not limited in the embodiments of the present application and may be letters, digits, and so on. For example, replacing the address information to be classified in "So-and-so (household registration: aa Province bb City cc District dd Street e Community Building x Unit x No. x" with the character string aaaaaa gives "So-and-so (household registration: aaaaaa". Replacing the address information to be classified with preset characters prevents the address information itself from interfering with the subsequent semantic analysis and improves the accuracy of classification.
Step 223: using a semantic classification model, classify the address information to be classified in each piece of replaced target text information according to the semantics of that replaced target text information, to obtain the category of each piece of address information to be classified.
The semantic classification model is obtained by training TextCNN on training samples. TextCNN achieves high accuracy when applied to Chinese text processing. Its common usage scenario is single-label classification, in which the convolutional layer, pooling layer and fully connected layer are followed by a Softmax layer. The Softmax layer outputs a probability distribution over the categories, and the category with the highest probability is the final output of the classification model. In business scenarios, a single-label classification model can reach an accuracy of up to 97%.
The semantic classification model performs semantic analysis on the replaced target text information and then classifies it, which gives the category of the address information to be classified. For example, after semantic analysis and classification of the replaced target text information "So-and-so (household registration: aaaaaa", aaaaaa is found to be a household registration address. aaaaaa is then converted back to the corresponding address information to be classified, and the final result is household registration address: aa Province bb City cc District dd Street e Community Building x Unit x No. x.
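The flow of steps 222 and 223 (replace, score, apply Softmax, pick the most probable category, restore the address) can be sketched as below. The scoring function `toy_logits` is a hypothetical keyword rule that merely stands in for the trained TextCNN model, and the category "other address" is an assumed second label; only "household registration address" comes from the example above.

```python
import math
from typing import List, Sequence, Tuple

PLACEHOLDER = "aaaaaa"
# "other address" is a hypothetical second category for illustration.
CATEGORIES = ("household registration address", "other address")


def mask_address(target_text: str, address: str) -> str:
    """Step 222: replace the address with the preset characters."""
    return target_text.replace(address, PLACEHOLDER, 1)


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(masked_text: str) -> List[float]:
    """Stand-in for the TextCNN model: a keyword rule, not a trained network."""
    cue = 2.0 if "household registration" in masked_text else 0.0
    return [cue, 1.0]


def classify_address(target_text: str, address: str) -> Tuple[str, str]:
    """Step 223: classify the masked text, then restore the real address."""
    probs = softmax(toy_logits(mask_address(target_text, address)))
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], address


ADDRESS = "aa Province bb City cc District dd Street e Community Building x Unit x No. x"
TEXT = "So-and-so (household registration: " + ADDRESS
category, restored = classify_address(TEXT, ADDRESS)
print(f"{category}: {restored}")
```

Because the classifier only ever sees the placeholder, the wording of the address itself cannot influence the predicted category.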
Step 224: output each piece of address information to be classified and its corresponding category.
As can be seen from the above technical solution, the present application provides an address information classification method. The method first extracts the address information in a text as address information to be processed. Then, according to the completeness of the address information to be processed and its position in the text, it uses a forward search algorithm and a backward search algorithm to obtain complete address information to be classified. Finally, it classifies the address information to be classified using its context information. Therefore, regardless of whether the extracted address information is complete, complete address information can ultimately be obtained and classified accurately, which improves the accuracy of the classification results.
In a second aspect, referring to Fig. 3, the present application provides an address information classification apparatus, which includes:
an extraction module 301, configured to extract all address information to be processed from a text;
a determining module 302, configured to determine, according to each piece of address information to be processed, the completeness type of that address information, where the completeness type is either positive address information or negative address information, positive address information is complete or partial address information, and negative address information is address information that contains other words;
a to-be-classified address determining module 303, configured to use a forward search algorithm and a backward search algorithm, according to the completeness type of each piece of address information to be processed and its position in the text, to obtain the address information to be classified that corresponds to each piece of address information to be processed, where the address information to be classified is complete address information;
a classification module 304, configured to classify each piece of address information to be classified using its context information, to obtain the category corresponding to each piece of address information to be classified;
an output module 305, configured to output each piece of address information to be classified and its corresponding category.
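The division into modules 301 to 305 can be sketched as a container of pluggable callables. This is only an illustration of how the modules hand data to one another: the class name `AddressClassificationApparatus`, the field names and the call signatures are assumptions, and the stub functions in the usage lines are placeholders rather than real implementations.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Result = Tuple[str, str]  # (address information to be classified, category)


@dataclass
class AddressClassificationApparatus:
    extract: Callable[[str], List[str]]          # extraction module 301
    determine: Callable[[str], str]              # determining module 302
    complete: Callable[[str, str, str], str]     # module 303: text, address, type
    classify: Callable[[str, str], str]          # classification module 304
    output: Callable[[List[Result]], None]       # output module 305

    def run(self, text: str) -> List[Result]:
        """Pass each extracted address through modules 302 to 304, then output."""
        results: List[Result] = []
        for candidate in self.extract(text):
            kind = self.determine(candidate)
            address = self.complete(text, candidate, kind)
            results.append((address, self.classify(text, address)))
        self.output(results)
        return results


# Usage with placeholder stubs; every lambda here is hypothetical.
collected: List[Result] = []
apparatus = AddressClassificationApparatus(
    extract=lambda text: ["G Town H Community"],
    determine=lambda address: "positive",
    complete=lambda text, address, kind: "AA Province BB City CC District " + address,
    classify=lambda text, address: "other address",
    output=collected.extend,
)
apparatus.run("reported stating at AA Province BB City CC District G Town H Community")
print(collected)
```

Keeping each module behind a plain callable makes it possible to swap a rule-based stand-in for a trained model without changing the surrounding flow.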
Further, referring to Fig. 4, the to-be-classified address determining module 303 includes:
a first search algorithm unit 401, configured to, if the address information to be processed is positive address information, start from the position of the address information to be processed in the text and use a first search algorithm to obtain first target address information, where the first search algorithm is the forward search algorithm or the backward search algorithm;
a second search algorithm unit 402, configured to start from the position of the first target address information in the text and use a second search algorithm to obtain the address information to be classified, where the address information to be classified is complete address information; when the first search algorithm is the forward search algorithm, the second search algorithm is the backward search algorithm, and when the first search algorithm is the backward search algorithm, the second search algorithm is the forward search algorithm.
Further, referring to Fig. 5, the to-be-classified address determining module 303 further includes:
a word segmentation unit 501, configured to, if the address information to be processed is negative address information, perform word segmentation on the address information to be processed to obtain multiple segmented words;
an extraction unit 502, configured to extract any one address word from the multiple segmented words and take that address word as the address information to be processed;
the first search algorithm unit 401 is further configured to start from the position of the address information to be processed in the text and use the first search algorithm to obtain first target address information, where the first search algorithm is the forward search algorithm or the backward search algorithm;
the second search algorithm unit 402 is further configured to start from the position of the first target address information in the text and use the second search algorithm to obtain the address information to be classified, where the address information to be classified is complete address information; when the first search algorithm is the forward search algorithm, the second search algorithm is the backward search algorithm, and when the first search algorithm is the backward search algorithm, the second search algorithm is the forward search algorithm.
Further, referring to Fig. 6, the first search algorithm unit 401 includes:
a first direction search subunit 601, configured to search from the position of the address information to be processed in the text in the first direction corresponding to the first search algorithm, and merge one adjacent word with the address information to be processed to obtain merged address information, where the first direction is the forward direction when the first search algorithm is the forward search algorithm, and the backward direction when the first search algorithm is the backward search algorithm;
a loop determination subunit 602, configured to, if the merged address information is positive address information, take the merged address information as the address information to be processed and repeat the above step of searching in the first direction, until the search in the first direction reaches a preset stop symbol adjacent to the address information to be processed; and, if the merged address information is negative address information, record the number of consecutive times that negative address information has been determined, take the merged address information as the address information to be processed, and repeat the above step of searching in the first direction, until the number of consecutive negative determinations equals the preset consecutive count, or the search in the first direction reaches a preset stop symbol adjacent to the address information to be processed;
a determining subunit 603, configured to take the address information to be processed that was most recently determined to be positive address information as the first target address information.
As can be seen from the above technical solution, the present application provides an address information classification method. The method first extracts the address information in a text as address information to be processed. Then, according to the completeness of the address information to be processed and its position in the text, it uses a forward search algorithm and a backward search algorithm to obtain complete address information to be classified. Finally, it classifies the address information to be classified using its context information. Therefore, regardless of whether the extracted address information is complete, complete address information can ultimately be obtained and classified accurately, which improves the accuracy of the classification results.
Those skilled in the art can clearly understand that the techniques in the embodiments of the present application can be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. The apparatus embodiments in particular are described relatively briefly because they are substantially similar to the method embodiments; for related details, refer to the description of the method embodiments.