Disclosure of Invention
In view of the above, an object of the present invention is to provide an address identification method and apparatus, which overcome the above technical defects in the prior art.
The embodiment of the application provides an address identification method, which comprises the following steps:
segmenting the address so as to identify the address according to a hash table, and determining an unidentifiable address element in the address;
and performing, according to an address element similarity model, similarity judgment on the unidentifiable address elements in the Chinese address so as to identify the unidentifiable address elements in the Chinese address.
Optionally, in any embodiment of the present application, the splitting the address to identify the address according to a hash table, and determining an unrecognizable address element in the address includes:
performing character segmentation on the address, and performing word segmentation on the address according to the hash table and characters obtained by performing character segmentation on the address;
and identifying the address according to the result of word segmentation processing on the address, and determining an unidentifiable address element in the address.
Optionally, in any embodiment of the present application, performing character segmentation on the address, and performing word segmentation processing on the address according to the hash table and the character obtained by performing character segmentation on the address includes:
performing character segmentation on the address, and determining the character state of the character obtained by the character segmentation and the node relation in the hash table according to the hash table and the character obtained by performing the character segmentation on the address;
and performing word segmentation processing on the address according to the character state of the character and the node relation in the hash table.
Optionally, in any embodiment of the present application, performing character segmentation on the address, and determining the character state of the characters obtained by the character segmentation and the node relationship in the hash table according to the hash table and the characters obtained by performing character segmentation on the address includes: performing character segmentation on the address, and determining, according to the hash table and the current character and the next character obtained by the character segmentation, the character state of the next character and the node relationship between the current character and the next character in the hash table.
Optionally, in any embodiment of the present application, determining an unidentifiable address element in the address includes: determining the unidentifiable address element in the address according to the length of the character string of the corresponding address element in the address and the index of the character string in the hash table.
Optionally, in any embodiment of the present application, before performing similarity judgment on the unidentifiable address elements in the Chinese address according to the address element similarity model so as to identify the unidentifiable address elements in the Chinese address, the method further includes: establishing the address element similarity model according to unidentifiable historical address elements.
Optionally, in any embodiment of the present application, establishing the address element similarity model according to unidentifiable historical address elements includes: establishing the address element similarity model according to the distribution probability of the address elements and the historical addresses and the distribution probability of the historical addresses and the administrative regions.
Optionally, in any embodiment of the present application, each historical address is abstracted into a document, and each word in the document corresponds to one address element; the administrative region is abstracted into a topic;
correspondingly, establishing the address element similarity model according to the distribution probability of the address elements and the historical addresses and the distribution probability of the historical addresses and the administrative regions includes:
determining the conditional probability of the topic according to the distribution probability of the address elements and the historical addresses;
determining the conditional probability of the words in the document according to the distribution probability of the historical addresses and the administrative regions;
and establishing the address element similarity model according to the conditional probability of the topic and the conditional probability of the words in the document.
Optionally, in any embodiment of the present application, performing similarity judgment on the unidentifiable address elements in the Chinese address according to the address element similarity model so as to identify the unidentifiable address elements in the Chinese address includes:
according to an address element similarity model, similarity judgment is carried out on unidentifiable address elements in the Chinese address, and the probability that the unidentifiable address elements belong to different administrative regions is obtained;
and identifying the unidentifiable address elements in the Chinese address according to the probability that the unidentifiable address elements belong to different administrative regions.
The embodiment of the present application further provides an address recognition apparatus, which includes:
the first unit is used for carrying out segmentation processing on the address so as to identify the address according to a hash table and determining an unidentifiable address element in the address;
and the second unit is used for judging the similarity of the unidentifiable address elements in the Chinese address according to the address element similarity model so as to identify the unidentifiable address elements in the Chinese address.
According to the address identification method and apparatus provided by the embodiments of the application, the address is segmented so as to be identified according to a hash table, and the unidentifiable address elements in the address are determined; similarity judgment is then performed on the unidentifiable address elements in the Chinese address according to the address element similarity model, so that the address elements of the Chinese address that cannot be identified from the hash table are identified.
Detailed Description
It is not necessary for any particular embodiment of the invention to achieve all of the above advantages at the same time.
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention shall fall within the scope of the protection of the embodiments of the present invention.
In the following embodiments, identification of a Chinese address is taken as an example. A Chinese address consists of four parts:
(1) Administrative divisions, that is, the administrative regions at and above the township level, ordered from largest to smallest. According to the "Codes for the administrative divisions of the People's Republic of China" (GB2260-1995), the administrative regions are divided into four levels: the first level is the province, autonomous region, municipality directly under the central government and special administrative region; the second level is the city, prefecture, autonomous prefecture, league, and the municipal districts and counties under a municipality directly under the central government; the third level is the county, municipal district, county-level city and banner; the fourth level is the township, town and village.
(2) Streets, mainly road names, street names and the like.
(3) House numbers, mainly house numbers, building names, room numbers and the like.
(4) Supplementary information, namely the name of an organization appended after the house number, or words expressing a spatial relationship.
The following embodiments of the present invention identify Chinese addresses based on the above rules.
The following further describes specific implementation of the embodiments of the present invention with reference to the drawings.
FIG. 1 is a schematic flowchart of an address identification method according to an embodiment of the present application. As shown in FIG. 1, the method includes:
S101, performing segmentation processing on the Chinese address so as to identify the Chinese address according to a hash table, and determining the unidentifiable address elements in the Chinese address;
In this embodiment, segmenting the Chinese address in step S101 so as to identify it according to the hash table, and determining the unidentifiable address elements in it, may specifically include:
firstly, performing character segmentation on the Chinese address, and performing word segmentation processing on the Chinese address according to the hash table and the characters obtained by the character segmentation;
secondly, identifying the Chinese address according to the result of the word segmentation processing, and determining the unidentifiable address elements in the Chinese address.
Optionally, in this embodiment, performing character segmentation on the Chinese address in step S101, and performing word segmentation processing on it according to the hash table and the characters obtained by the character segmentation, may specifically include:
firstly, performing character segmentation on the Chinese address, and determining, according to the hash table and the characters obtained by the character segmentation, the character state of each character and its node relationship in the hash table;
secondly, performing word segmentation processing on the Chinese address according to the character state of the characters and the node relationship in the hash table.
Optionally, in step S101, performing character segmentation on the Chinese address and determining the character state and the node relationship may specifically be: performing character segmentation on the Chinese address, and determining, according to the hash table and the current character and the next character obtained by the character segmentation, the character state of the next character and the node relationship between the current character and the next character in the hash table.
In a specific application scenario, the step S101 is implemented based on a forward maximum matching mechanism, and the specific process is as follows:
(1) reading the i-th character $C_i$ of the Chinese address;
(2) looking up the current character $C_i$ in the hash table that records the address elements and the relationships between the address elements, the matching entry forming the current node;
(3) reading the (i+1)-th character $C_{i+1}$ of the Chinese address and searching for $C_{i+1}$ among the child nodes (next-level nodes) of the current node of $C_i$ in the hash table; if it is not found, the word segmentation ends and the process goes to step (5); otherwise, the matching child node is taken as the new current node and its character state is read; if the character state is not the termination state, the process jumps to step (2); otherwise it goes to step (4);
(4) extracting the word and returning to step (1);
(5) judging the state of the current character $C_i$; if it is an extended state, the index is increased by 1 (that is, the child nodes of the current node are searched) and the process returns to step (1) to read the (i+1)-th character $C_{i+1}$ of the Chinese address again.
The above processing procedures (1) to (5) are a specific implementation of forward maximum matching, by which the Chinese address is segmented into the longest matching address elements. If, after word segmentation, the index into the address string equals the length of the address string, the whole Chinese address has been recognized; the process may then end directly, or step S102 described below may still be performed to calculate the address similarity.
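Steps (1) to (5) can be sketched roughly as follows. This is a minimal illustration only, assuming that the hash table is a trie of nested Python dictionaries and that the termination state is marked by a sentinel key; the names `build_table`, `segment` and `END` are hypothetical and do not come from the embodiment.

```python
# Hypothetical sketch: the nested-dict layout and all names are assumptions.
END = "$end"  # sentinel key: a complete address element terminates at this node


def build_table(elements):
    """Build a nested-dict (hash-table-backed) trie from known address elements."""
    root = {}
    for element in elements:
        node = root
        for ch in element:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def segment(address, table):
    """Forward maximum matching; returns (words, index).

    `index` is the position at which matching stopped. It equals
    len(address) only when the whole address was recognized.
    """
    words, i = [], 0
    while i < len(address):
        node, j, last_end = table, i, None
        # Follow child nodes character by character and remember the
        # longest position at which a termination state was seen.
        while j < len(address) and address[j] in node:
            node = node[address[j]]
            j += 1
            if END in node:
                last_end = j
        if last_end is None:
            break  # no known address element starts at position i
        words.append(address[i:last_end])
        i = last_end
    return words, i


table = build_table(["浙江省", "杭州市", "杭州", "西湖区"])
words, index = segment("浙江省杭州市西湖区文三路", table)
print(words, index)  # ['浙江省', '杭州市', '西湖区'] 9
```

In this sketch, "杭州市" is preferred over the shorter entry "杭州" because the longest terminal match is kept, and matching stops at index 9 because "文三路" is not in the table.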
S102, according to the address element similarity model, similarity judgment is carried out on the unidentifiable address elements in the Chinese address so as to identify the unidentifiable address elements in the Chinese address.
Optionally, in this embodiment, determining the unidentifiable address elements in the Chinese address includes: determining the unidentifiable address elements in the Chinese address according to the length of the character string of the corresponding address elements in the Chinese address and the index of the character string in the hash table.
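The comparison between the string length and the index can be illustrated with a short sketch. It assumes that the recognized words are matched contiguously from the start of the address; the function name `unrecognized_remainder` is hypothetical.

```python
def unrecognized_remainder(address, recognized_words):
    """Return (fully_recognized, remainder) for a segmented address.

    Assumption: the recognized words cover a contiguous prefix of the
    address, so the index is the sum of their lengths.
    """
    index = 0
    for word in recognized_words:
        if not address.startswith(word, index):
            raise ValueError(f"{word!r} does not occur at index {index}")
        index += len(word)
    # The address is fully recognized only if the index reached its length.
    return index == len(address), address[index:]


print(unrecognized_remainder("浙江省杭州市文三路", ["浙江省", "杭州市"]))
# (False, '文三路')
```

Any non-empty remainder would then be passed on to the similarity judgment of step S102.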
Optionally, in this embodiment, before performing similarity judgment on the unidentifiable address elements in the Chinese address according to the address element similarity model so as to identify the unidentifiable address elements in the Chinese address, the method further includes: establishing the address element similarity model according to unidentifiable historical address elements.
Optionally, in this embodiment, the establishing an address element similarity model according to the unrecognizable historical address elements includes: and establishing an address element similarity model according to the distribution probability of the address elements and the historical addresses and the distribution probability of the historical addresses and the administrative regions.
Optionally, in this embodiment, each historical address is abstracted into a document, and each word in the document corresponds to one address element; the administrative region is abstracted into a topic;
correspondingly, establishing the address element similarity model according to the distribution probability of the address elements and the historical addresses and the distribution probability of the historical addresses and the administrative regions includes:
determining the conditional probability of the topic according to the distribution probability of the address elements and the historical addresses;
determining the conditional probability of the words in the document according to the distribution probability of the historical addresses and the administrative regions;
and establishing the address element similarity model according to the conditional probability of the topic and the conditional probability of the words in the document.
Optionally, in this embodiment, performing similarity judgment on the unidentifiable address elements in the Chinese address according to the address element similarity model so as to identify the unidentifiable address elements in the Chinese address includes:
according to an address element similarity model, similarity judgment is carried out on unidentifiable address elements in the Chinese address, and the probability that the unidentifiable address elements belong to different administrative regions is obtained;
and identifying the unidentifiable address elements in the Chinese address according to the probability that the unidentifiable address elements belong to different administrative regions.
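The ranking by probability can be sketched as follows. This is only an illustration under stated assumptions: the trained model is represented as a per-region word distribution `phi` and a region prior `region_prior`, the posterior is computed with Bayes' rule, and unseen elements receive a small floor probability. None of these names or choices are specified by the embodiment.

```python
def rank_regions(element, phi, region_prior, floor=1e-9):
    """Rank administrative regions by P(region | element).

    phi[region][element] is the assumed probability of the element under
    that region; `floor` is an assumed smoothing value for unseen elements.
    """
    scores = {
        region: region_prior[region] * words.get(element, floor)
        for region, words in phi.items()
    }
    total = sum(scores.values())
    # Normalize and sort from the most to the least probable region.
    ranked = sorted(((s / total, r) for r, s in scores.items()), reverse=True)
    return [(region, prob) for prob, region in ranked]


# Illustrative values only, not data from the embodiment.
phi = {
    "西湖区": {"文三路": 0.6, "教工路": 0.4},
    "滨江区": {"江南大道": 0.9, "文三路": 0.1},
}
prior = {"西湖区": 0.5, "滨江区": 0.5}
best_region, best_prob = rank_regions("文三路", phi, prior)[0]
print(best_region)  # 西湖区
```

The region with the maximum probability is taken as the administrative division combined with the unidentifiable element.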
The following illustrates a detailed implementation flow of the step S102 in a specific application scenario:
Each historical address is abstracted into a document, and a plurality of historical addresses are abstracted into a document set; each word in a document corresponds to an address element; the administrative region is abstracted into a topic.
1. First, a document-topic distribution $\vec{\theta}_m$ is drawn at random for the m-th document; then the topic $z_{m,n}$ of the n-th word of the m-th document is drawn from $\vec{\theta}_m$;
2. With K topics in the training set, for the topic $z_{m,n}$ so obtained, the word $w_{m,n}$ is drawn according to the topic-word distribution $\vec{\varphi}_{z_{m,n}}$;
3. $\vec{\alpha} \to \vec{\theta}_m$ corresponds to a Dirichlet distribution, whose physical meaning is a random mixture distribution over the latent topics, $\vec{\alpha}$ being its prior parameter; $\vec{\theta}_m \to \vec{z}_m$ corresponds to a multinomial distribution, whose physical meaning is the multinomial distribution of the latent topics; the whole is a Dirichlet-multinomial conjugate structure;
4. The conditional probability of the topics is thereby obtained:
$$p(\vec{z}_m \mid \vec{\alpha}) = \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})}$$
where $\vec{n}_m = (n_m^{(1)}, \dots, n_m^{(K)})$ is the count vector of the words in the m-th document, $n_m^{(k)}$ being the number of words assigned to the k-th topic, and $\Delta(\cdot)$ is the normalization factor of the Dirichlet distribution;
5. $\vec{\beta} \to \vec{\varphi}_k$ conforms to a Dirichlet distribution, whose physical meaning is a random mixture distribution over the words, $\vec{\beta}$ being its prior parameter, and $\vec{\varphi}_k \to \vec{w}_{(k)}$ conforms to a multinomial distribution, whose physical meaning is the multinomial distribution of the words.
Thus, the conditional probability of the words is obtained:
$$p(\vec{w}_{(k)} \mid \vec{\beta}) = \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})}$$
where $\vec{n}_k = (n_k^{(1)}, \dots, n_k^{(V)})$ is the count vector of the words generated by the k-th topic, V being the number of distinct words;
6. Based on the two distributions above, the joint probability distribution of the words in the documents and the topics is obtained:
$$p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta}) = p(\vec{w} \mid \vec{z}, \vec{\beta})\, p(\vec{z} \mid \vec{\alpha}) = \prod_{k=1}^{K} \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \cdot \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})}$$
After the joint probability distribution of the topics and words is obtained, training is carried out by means of the MCMC algorithm and a Gibbs sampling process to obtain the variables $\vec{\theta}$ and $\vec{\varphi}$, which completes the LDA probabilistic model.
7. For a new unidentifiable address element, the values of the variables $\vec{\theta}$ and $\vec{\varphi}$ are substituted into the formula in step 6 to obtain the probability that the address belongs to each administrative division, and the results are finally sorted by probability.
Since $\vec{\theta}$ and $\vec{\varphi}$ in the above formula are known after training, the formula in step 6 yields a probability value for each different topic; the combination of the unidentifiable address element and the administrative division with the maximum probability value is preferably selected, so as to obtain the recognized Chinese address.
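The training in step 6 can be illustrated with a small collapsed Gibbs sampler for LDA. This is a generic textbook-style sketch rather than the implementation of the embodiment: symmetric priors `alpha` and `beta`, the iteration count, the seed and the function name `train_lda` are all assumptions, and the topics are unlabeled indices whose mapping to administrative regions is not shown.

```python
import random


def train_lda(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, phi, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    wid = {w: t for t, w in enumerate(vocab)}
    V = len(vocab)
    n_mk = [[0] * K for _ in docs]      # topic counts per document
    n_kt = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total words per topic
    z = []                              # topic assignment of every word

    def update(m, t, k, delta):
        n_mk[m][k] += delta
        n_kt[k][t] += delta
        n_k[k] += delta

    # Random initial topic assignment.
    for m, d in enumerate(docs):
        z.append([rng.randrange(K) for _ in d])
        for w, k in zip(d, z[m]):
            update(m, wid[w], k, +1)

    for _ in range(iters):
        for m, d in enumerate(docs):
            for n, w in enumerate(d):
                t = wid[w]
                update(m, t, z[m][n], -1)  # remove the current assignment
                weights = [
                    (n_mk[m][k] + alpha) * (n_kt[k][t] + beta) / (n_k[k] + V * beta)
                    for k in range(K)
                ]
                z[m][n] = rng.choices(range(K), weights=weights)[0]
                update(m, t, z[m][n], +1)

    theta = [[(n_mk[m][k] + alpha) / (len(d) + K * alpha) for k in range(K)]
             for m, d in enumerate(docs)]
    phi = [[(n_kt[k][t] + beta) / (n_k[k] + V * beta) for t in range(V)]
           for k in range(K)]
    return theta, phi, vocab


# Illustrative historical addresses, each already segmented into elements.
docs = [["西湖区", "文三路"], ["西湖区", "教工路"], ["滨江区", "江南大道"]]
theta, phi, vocab = train_lda(docs, K=2, iters=50)
```

Each row of `theta` is the estimated topic distribution of one historical address, and each row of `phi` is the estimated word distribution of one topic; these are the quantities substituted in step 7.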
FIG. 2 is a schematic structural diagram of an address recognition apparatus according to a second embodiment of the present application. As shown in FIG. 2, the apparatus includes:
a first unit 201, configured to perform segmentation processing on the Chinese address so as to identify the Chinese address according to a hash table, and to determine the unidentifiable address elements in the Chinese address;
a second unit 202, configured to perform similarity judgment on the unidentifiable address elements in the Chinese address according to the address element similarity model, so as to identify the unidentifiable address elements in the Chinese address.
Optionally, in any embodiment of the present application, the first unit is further configured to:
performing character segmentation on the Chinese address, and performing word segmentation processing on the Chinese address according to the hash table and the characters obtained by the character segmentation;
and identifying the Chinese address according to the result of the word segmentation processing, and determining the unidentifiable address elements in the Chinese address.
Optionally, in any embodiment of the present application, the first unit is further configured to perform character segmentation on the Chinese address, and to determine, according to the hash table and the characters obtained by the character segmentation, the character state of each character and its node relationship in the hash table; and to perform word segmentation processing on the Chinese address according to the character state of the characters and the node relationship in the hash table.
Optionally, in any embodiment of the present application, the first unit is further configured to perform character segmentation on the Chinese address, and to determine, according to the hash table and the current character and the next character obtained by the character segmentation, the character state of the next character and the node relationship between the current character and the next character in the hash table.
Optionally, in any embodiment of the present application, the first unit is further configured to determine the unidentifiable address elements in the Chinese address according to the length of the character string of the corresponding address elements in the Chinese address and the index of the character string in the hash table.
Optionally, in any embodiment of the present application, the first unit is further configured to establish an address element similarity model according to the unrecognizable historical address elements.
Optionally, in any embodiment of the present application, the first unit is further configured to establish an address element similarity model according to a distribution probability of an address element and a historical address and a distribution probability of the historical address and an administrative region.
Optionally, in any embodiment of the present application, each historical address is abstracted into a document, and each word in the document corresponds to one address element; the administrative region is abstracted into a topic;
Optionally, in any embodiment of the present application, the first unit is further configured to: determine the conditional probability of the topic according to the distribution probability of the address elements and the historical addresses; determine the conditional probability of the words in the document according to the distribution probability of the historical addresses and the administrative regions; and establish the address element similarity model according to the conditional probability of the topic and the conditional probability of the words in the document.
Optionally, in any embodiment of the present application, the second unit is further configured to perform similarity judgment on the unidentifiable address elements in the Chinese address according to the address element similarity model, so as to obtain the probabilities that the unidentifiable address elements belong to different administrative regions; and to identify the unidentifiable address elements in the Chinese address according to the probabilities that the unidentifiable address elements belong to different administrative regions.
The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions, or the portions thereof that contribute to the prior art, may be embodied in the form of a software product that can be stored on a computer-readable storage medium, which includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes Read Only Memory (ROM), Random Access Memory (RAM), magnetic disk storage media, optical storage media, flash memory storage media, electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), and others, and the computer software product includes instructions for causing a computing device (which may be a personal computer, server, or network device, etc.) to perform the methods described in the various embodiments or portions of the embodiments.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the embodiments of the present application, and are not limited thereto; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus (device), or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.