Disclosure of Invention
In view of the above problems, the present invention has been made to provide an address text similarity determination method and an address search method that overcome or at least partially solve the above problems.
According to an aspect of the present invention, there is provided an address text similarity determination method, the address text including a plurality of address elements arranged from high to low in rank, the method including:
acquiring an address text pair with similarity to be determined;
inputting the address text pair into a preset address text similarity calculation model to output the similarity of two address texts included in the address text pair;
the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data comprising at least a first address text, a second address text, and a third address text, wherein the first address text and the second address text share the same first n levels of address elements, forming a positive sample pair, and the first address text and the third address text share the same first (n-1) levels of address elements but differ at the nth level, forming a negative sample pair.
Optionally, in the address text similarity determining method according to the present invention, the address text similarity calculation model includes a word embedding layer, a text encoding layer, and a similarity calculation layer, and the step of training the address text similarity calculation model includes: inputting the first, second and third address texts of each piece of training data into a word embedding layer to obtain corresponding first, second and third word vector sets; inputting the first, second and third word vector sets into a text coding layer to obtain corresponding first, second and third text vectors; calculating a first similarity of the first text vector and the second text vector and a second similarity of the first text vector and the third text vector by using a similarity calculation layer; and adjusting the network parameters of the address text similarity calculation model according to the first similarity and the second similarity.
Optionally, in the address text similarity determining method according to the present invention, the network parameter includes: parameters of a word embedding layer and/or parameters of a text encoding layer.
Optionally, in the address text similarity determining method according to the present invention, each word vector set in the first, second, and third word vector sets includes a plurality of word vectors, and each word vector corresponds to one address element in the address text.
Optionally, in the address text similarity determining method according to the present invention, the word embedding layer employs a GloVe model or a Word2Vec model.
Optionally, in the address text similarity determining method according to the present invention, the first similarity and the second similarity each adopt at least one of a Euclidean distance, a cosine similarity, or a Jaccard coefficient.
Optionally, in the method for determining similarity of address texts according to the present invention, the adjusting network parameters of the address text similarity calculation model according to the first and second similarities includes: calculating a loss function value according to the first similarity and the second similarity; and adjusting the network parameters of the address text similarity calculation model by using a back propagation algorithm until the loss function value is lower than a preset value or the training times reach a preset number.
Optionally, in the address text similarity determining method according to the present invention, the loss function value is: Loss = Margin - (first similarity - second similarity), where Loss is the loss function value and Margin is a hyperparameter.
Optionally, in the address text similarity determining method according to the present invention, the text encoding layer includes at least one of an RNN model, a CNN model, or a DBN model.
According to another aspect of the present invention, there is provided an address search method including:
acquiring one or more candidate address texts corresponding to an address text to be queried;
inputting the address text to be queried and a candidate address text into a preset address text similarity calculation model to obtain the similarity between the two, wherein the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data comprising at least a first address text, a second address text, and a third address text, wherein the first address text and the second address text share the same first n levels of address elements, forming a positive sample pair, and the first address text and the third address text share the same first (n-1) levels of address elements but differ at the nth level, forming a negative sample pair;
and determining the candidate address text with the maximum similarity as the target address text corresponding to the address text to be queried.
According to another aspect of the present invention, there is provided an address search apparatus including:
the query module is suitable for acquiring one or more candidate address texts corresponding to the address texts to be queried;
the first similarity calculation module is suitable for inputting the address text to be queried and a candidate address text into a preset address text similarity calculation model to obtain the similarity between the two, wherein the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data comprising at least a first address text, a second address text, and a third address text, wherein the first address text and the second address text share the same first n levels of address elements, forming a positive sample pair, and the first address text and the third address text share the same first (n-1) levels of address elements but differ at the nth level, forming a negative sample pair;
and the output module is suitable for determining the candidate address text with the maximum similarity as the target address text corresponding to the address text to be queried.
According to another aspect of the present invention, there is provided an apparatus for training an address text similarity calculation model, the address text including a plurality of address elements arranged in a high-to-low order, the address text similarity calculation model including a word embedding layer, a text encoding layer, and a similarity calculation layer, the apparatus comprising:
the training data acquisition module is suitable for acquiring a training data set, the training data set comprising a plurality of pieces of training data, each piece of training data comprising at least a first address text, a second address text, and a third address text, wherein the first address text and the second address text share the same first n levels of address elements, forming a positive sample pair, and the first address text and the third address text share the same first (n-1) levels of address elements but differ at the nth level, forming a negative sample pair;
the word vector acquisition module is suitable for inputting the first, second and third address texts of each piece of training data into the word embedding layer to obtain corresponding first, second and third word vector sets;
the text vector acquisition module is suitable for inputting the first, second and third word vector sets into a text coding layer to obtain corresponding first, second and third text vectors;
the second similarity calculation module is suitable for calculating, by using the similarity calculation layer, the first similarity of the first text vector and the second text vector and the second similarity of the first text vector and the third text vector;
and the parameter adjusting module is suitable for adjusting the network parameters of the address text similarity calculation model according to the first similarity and the second similarity.
According to another aspect of the present invention, there is provided a computing device comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.
Since address text naturally contains hierarchical relationships, address elements of different levels play different roles in address similarity calculation. The embodiment of the invention uses the hierarchical relationship in the address text to automatically learn the weights of address elements at different levels, avoiding the subjectivity of manually assigned weights and adapting to the target data source, so that the similarity of two address texts can be calculated accurately.
The foregoing is merely an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood, and that the above and other objects, features, and advantages of the present invention may be more readily comprehensible, embodiments of the present invention are described below.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
First, some terms appearing in the description of the embodiments of the present invention are explained as follows:
Address text: for example, "Alibaba, No. 969 Wenyi West Road, Hangzhou", or a full address in Maxi Town, Pengshan District, Meishan, Sichuan. The address text includes a plurality of address elements arranged in a high-to-low order.
Address element: the elements constituting each granularity of the address text. For example, in "Alibaba, No. 969 Wenyi West Road, Hangzhou", "Hangzhou" represents a city, "Wenyi West Road" represents a road, "969" represents a road number, and "Alibaba" represents a Point of Interest (POI).
Address level: the areas corresponding to address elements have a size-inclusion relationship, i.e., each address element has a corresponding address level, for example: province > city > district > street/community > road > building.
Address similarity: the similarity between two pieces of address text, with a value from 0 to 1. The larger the value, the higher the probability that the two addresses refer to the same location: a value of 1 indicates the same address, and a value of 0 indicates that the two addresses are unrelated.
Partial order relationship: regions in an address have a hierarchical size-inclusion relationship, for example: province > city > district > street/community > road > building.
Since the address text naturally contains a hierarchical relationship, i.e., the partial order relationship described above, address elements of different levels play different roles in address similarity calculation. The embodiment of the invention automatically generates the weights of the address elements with different levels by utilizing the hierarchical relationship in the address text, and the weights are implicitly embodied in the network parameters of the address text similarity calculation model, thereby accurately calculating the similarity degree of the two address texts.
FIG. 1 shows a schematic diagram of an address search system 100 according to one embodiment of the invention. As shown in FIG. 1, the address search system 100 includes a user terminal 110 and a computing device 200.
The user terminal 110 is a terminal device used by a user, which may specifically be a personal computer such as a desktop or notebook computer, or a mobile phone, tablet computer, multimedia device, smart wearable device, and the like, but is not limited thereto. Computing device 200 is used to provide services to user terminal 110 and may be implemented as a server, such as an application server or a Web server; it may also be implemented as a desktop computer, notebook computer, processor chip, mobile phone, tablet computer, etc., but is not limited thereto.
In an embodiment of the present invention, the computing device 200 may be used to provide address search services to the user; for example, the computing device 200 may be a server of an electronic map application. However, those skilled in the art will understand that the computing device 200 may be any device capable of providing address search services and is not limited to a server of an electronic map application.
In one embodiment, the address search system 100 also includes a data storage 120. The data storage 120 may be a relational database such as MySQL or ACCESS, or a non-relational database such as a NoSQL database. The data storage 120 may be a local database residing in the computing device 200, or may be deployed at multiple geographic locations as a distributed database, such as HBase. In short, the data storage 120 is used for storing data, and the present invention does not limit its specific deployment and configuration. The computing device 200 may connect with the data storage 120 and retrieve the data stored in it. For example, the computing device 200 may directly read the data in the data storage 120 (when the data storage 120 is a local database of the computing device 200), or may access the Internet in a wired or wireless manner and obtain the data in the data storage 120 through a data interface.
In the embodiment of the present invention, the data storage 120 stores a standard address library, in which the address texts are standard address texts (complete and accurate address texts). In the address search service, a user inputs a query address text (query) through the user terminal 110; generally, the user inputs an incomplete or inaccurate address text. The user terminal 110 sends the query to the computing device 200, and the address search apparatus in the computing device 200 recalls a batch of candidate address texts, usually several to several thousand, by searching the standard address library. The address search apparatus then calculates the degree of relevance between the candidate address texts and the query, for which the address similarity is important reference information. After the address similarity between the query and every candidate address text is calculated, the candidate address text with the maximum similarity is determined as the target address text corresponding to the address text to be queried, and the target address text is returned to the user.
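The recall-then-rank flow just described can be sketched as follows. The `similarity` function here is a hypothetical stand-in for the trained model (it merely counts matching hierarchy levels), and the addresses are illustrative; a real deployment would score candidates with the address text similarity calculation model.

```python
def similarity(query_elems, cand_elems):
    # Dummy stand-in for the trained similarity model: the fraction of
    # hierarchy levels on which the two addresses agree.
    same = sum(1 for q, c in zip(query_elems, cand_elems) if q == c)
    return same / max(len(query_elems), len(cand_elems))

def search(query, candidates):
    # Score every recalled candidate and return the best-scoring one,
    # mirroring "determine the candidate with the maximum similarity".
    return max(candidates, key=lambda cand: similarity(query, cand))

query = ["Zhejiang", "Hangzhou", "Xihu", "Wenyi West Road"]
candidates = [
    ["Zhejiang", "Ningbo", "Yinzhou", "Garden Road"],
    ["Zhejiang", "Hangzhou", "Xihu", "Wenyi West Road"],
    ["Shanghai", "Shanghai", "Changning", "Hongqiao Road"],
]
best = search(query, candidates)
```

In the actual system the candidate set comes from a recall over the standard address library, and the scoring function is the trained model described in the following sections.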
Specifically, the address search apparatus may calculate the similarity between the address text to be queried and a candidate address text using the address text similarity calculation model. Correspondingly, the computing device 200 may further include a training apparatus for the address text similarity calculation model, and the data storage 120 further stores a training address library, which may be the same as or different from the standard address library. The training address library includes a plurality of address texts, and the training apparatus trains the address text similarity calculation model using the address texts in the training address library.
FIG. 2 shows a block diagram of a computing device 200 according to one embodiment of the invention. As shown in FIG. 2, in a basic configuration 202, the computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a Digital Signal Processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. Example processor cores 214 may include Arithmetic Logic Units (ALUs), Floating Point Units (FPUs), digital signal processing cores (DSP cores), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. The application 222 is actually a plurality of program instructions that direct the processor 204 to perform corresponding operations. In some embodiments, the application 222 may be arranged to cause the processor 204 to operate with the program data 224 on the operating system.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250, which may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner) via one or more I/O ports 258. The example communication device 246 may include a network controller 260, which may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
In the computing device 200 according to the invention, the application 222 comprises a training apparatus 600 for the address text similarity calculation model and an address search apparatus 700. The apparatus 600 includes a plurality of program instructions that may direct the processor 204 to perform the method 300 of training an address text similarity calculation model. The apparatus 700 includes a plurality of program instructions that may direct the processor 204 to perform the address search method 500.
FIG. 3 shows a flow diagram of a method 300 for training an address text similarity calculation model according to one embodiment of the invention. The method 300 is suitable for execution in a computing device, such as the computing device 200 described above. As shown in FIG. 3, the method 300 begins at step S310. In step S310, a training data set is obtained, where the training data set includes a plurality of pieces of training data, and each piece of training data includes 3 address texts: a first address text, a second address text, and a third address text. Each address text comprises a plurality of address elements ranked from high to low; the first n levels of address elements of the first address text and the second address text are the same, while the first (n-1) levels of address elements of the first address text and the third address text are the same and the nth level of address elements are different. Here, the value range of n is (1, N), where N is the number of address levels included in the address text; for example, if the address text includes 5 address levels, namely province, city, district, road, and road number, then N is 5. Of course, n may adopt other value ranges according to the specific application scenario.
In the embodiment of the present invention, each piece of training data is a triplet {target_addr, pos_addr, neg_addr} formed by 3 address texts, where target_addr corresponds to the first address text, pos_addr corresponds to the second address text, and neg_addr corresponds to the third address text. {target_addr, pos_addr} constitutes a positive sample pair, and {target_addr, neg_addr} constitutes a negative sample pair.
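The triplet structure of a piece of training data can be illustrated as follows; the sample addresses are purely illustrative, not drawn from a real training address library.

```python
from collections import namedtuple

# One piece of training data: the triplet {target_addr, pos_addr, neg_addr}.
Triplet = namedtuple("Triplet", ["target_addr", "pos_addr", "neg_addr"])

t = Triplet(
    target_addr="Alibaba Xixi Campus, No. 969 Wenyi West Road, Hangzhou, Zhejiang",
    pos_addr="No. 1008 Wenyi West Road, Hangzhou, Zhejiang",
    neg_addr="Hongqiao International Airport, Changning District, Shanghai",
)

# {target_addr, pos_addr} is the positive pair; {target_addr, neg_addr} the negative pair.
positive_pair = (t.target_addr, t.pos_addr)
negative_pair = (t.target_addr, t.neg_addr)
```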
In one embodiment, the training data set is obtained as follows:
Firstly, an original address text is obtained from the training address library (or the standard address library) and parsed: the character string of the address text is segmented and formatted into address elements. For example, the address text "Room 910, Floor 7, Building 1, Alibaba Xixi Campus, No. 969 Wenyi West Road, Hangzhou, Zhejiang" may be formatted as "prov=Zhejiang city=Hangzhou road=Wenyi West Road roadno=969 poi=Alibaba Xixi Campus houseno=Building 1 floorno=Floor 7 roomno=910". Specifically, this parsing may be completed by combining a word segmentation model and a named entity recognition model; the embodiment of the present invention does not limit the specific word segmentation model and named entity model, and those skilled in the art may select them reasonably as needed.
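A minimal sketch of the formatting step, assuming the segmentation/NER stage has already produced labeled elements; the level names (`prov`, `city`, ...) follow the formatted example above and are illustrative, not a prescribed schema.

```python
# Address levels in high-to-low order, mirroring the formatted example.
LEVELS = ["prov", "city", "district", "road", "roadno", "poi",
          "houseno", "floorno", "roomno"]

def format_elements(parsed):
    # Keep only known levels, in hierarchy order (high to low).
    return {lvl: parsed[lvl] for lvl in LEVELS if lvl in parsed}

# Hypothetical output of the segmentation + named-entity stage.
parsed = {
    "prov": "Zhejiang", "city": "Hangzhou",
    "road": "Wenyi West Road", "roadno": "969",
    "poi": "Alibaba Xixi Campus",
}
elems = format_elements(parsed)
```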
Then, the address texts formatted as address elements are aggregated (deduplicated and sorted) according to the address elements of different levels, so as to form the following table:
Finally, the aggregated data in the table are combined into positive and negative sample pairs of training data according to the different address levels, with the output format: {target_addr, pos_addr, neg_addr}. As described above, {target_addr, pos_addr} constitutes a positive sample pair, and {target_addr, neg_addr} constitutes a negative sample pair. It should be noted that one positive sample pair may correspond to multiple negative sample pairs; that is, one target_addr corresponds to one pos_addr but may correspond to multiple neg_addr.
The specific operation is as follows:
(1) Select an address text, for example: prov=Zhejiang city=Hangzhou district=Yuhang road=Wenyi West Road roadno=969 poi=Alibaba Xixi Campus;
(2) Traverse all address levels, e.g., province -> city -> district -> road; at each address level, find address elements that are respectively the same as and different from the current address element, and form positive and negative sample pairs with the current address text, for example:
At the province level, for "Alibaba Xixi Campus, No. 969 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang", a positive example is: "Yijia Garden, No. 245, Yinzhou District, Ningbo, Zhejiang"; a negative example is: "Hongqiao International Airport, No. 2550 Hongqiao Road, Changning District, Shanghai".
At the city level, for "Alibaba Xixi Campus, No. 969 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang", a positive example is: "Zhejiang Academy of Socialism, No. 1008 Wenyi West Road, Hangzhou, Zhejiang"; a negative example is: "No. 525 Garden Road, Yinzhou District, Ningbo, Zhejiang".
At the district level, for "Alibaba Xixi Campus, No. 969 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang", a positive example is: "Saiyin International Square, No. 248 Gaojiao Road, Yuhang District, Hangzhou, Zhejiang"; a negative example is: "Nanshan Campus of the China Academy of Art, No. 218 Nanshan Road, Hangzhou, Zhejiang".
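Steps (1) and (2) can be sketched as follows. Addresses are simplified to lists of elements in high-to-low order, and the pool is a toy stand-in for the aggregated table: at level n, a positive shares the first n elements with the target, while a negative shares the first n-1 but differs at level n.

```python
def make_pairs(target, pool, n):
    # Positives: agree with the target on the first n levels (excluding the
    # target itself). Negatives: agree on the first n-1 levels, differ at n.
    positives = [a for a in pool if a[:n] == target[:n] and a != target]
    negatives = [a for a in pool
                 if a[:n - 1] == target[:n - 1] and a[n - 1] != target[n - 1]]
    return positives, negatives

target = ["Zhejiang", "Hangzhou", "Yuhang"]
pool = [
    ["Zhejiang", "Hangzhou", "Yuhang"],    # duplicate of the target
    ["Zhejiang", "Hangzhou", "Xihu"],      # same first 2 levels, differs at level 3
    ["Zhejiang", "Ningbo", "Yinzhou"],     # same first 1 level, differs at level 2
]
pos, neg = make_pairs(target, pool, n=2)
```

At n=2 (the city level), the Hangzhou address is a positive and the Ningbo address a negative, matching the worked examples above.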
After the training data set is acquired, the method 300 proceeds to step S320. Before describing the processing procedure of step S320, the structure of the address text similarity calculation model according to the embodiment of the present invention will be described.
Referring to fig. 4, an address text similarity calculation model 400 according to an embodiment of the present invention includes: a word embedding layer 410, a text encoding layer 420, and a similarity calculation layer 430. The word embedding layer 410 is adapted to convert each address element in the address text into a word vector and combine the word vectors into a word vector set corresponding to the address text; the text encoding layer 420 is adapted to encode a word vector set corresponding to the address text into a text vector; the similarity calculation layer 430 is adapted to calculate the similarity between two text vectors, which characterizes the similarity between the address texts.
In step S320, the first address text, the second address text, and the third address text in each piece of training data are respectively input to the word embedding layer for processing, so as to obtain a first word vector set corresponding to the first address text, a second word vector set corresponding to the second address text, and a third word vector set corresponding to the third address text.
The word embedding layer converts each word in a sentence into a numeric vector (word vector). The weights of the embedding layer can be pre-computed from text co-occurrence information of a massive corpus, for example using the GloVe algorithm, or the CBOW and skip-gram algorithms in Word2Vec. These algorithms rest on the observation that different textual expressions of the same latent semantics repeatedly appear in similar contexts; by predicting the context from a word, or predicting a word from its context, the latent semantics of each word are obtained. In the embodiment of the present invention, the parameters of the word embedding layer may be trained separately using a corpus; alternatively, the word embedding layer and the text encoding layer may be trained together, so that the parameters of both are obtained simultaneously. The following description takes joint training of the word embedding layer and the text encoding layer as an example.
Specifically, the address text comprises a plurality of formatted address elements. After the address text is input into the word embedding layer, the layer treats each address element as a word and converts it into a word vector, thereby obtaining a plurality of word vectors, which are then combined into a word vector set.
In one implementation, the word vector set is represented as a list, i.e., a word vector list, each list item in the word vector list corresponds to a word vector, and the number of items in the list is the number of address elements in the address text. In another implementation, the word vector set is represented as a matrix, that is, a word vector matrix, each column of the matrix corresponds to a word vector, and the number of columns of the matrix is the number of address elements in the address text.
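The two representations can be illustrated with toy three-dimensional embeddings; the vectors below are made up for illustration, not real model output.

```python
# Hypothetical 3-dimensional embeddings, one per address element.
word_vectors = {
    "Zhejiang": [0.1, 0.2, 0.3],
    "Hangzhou": [0.4, 0.5, 0.6],
    "Yuhang":   [0.7, 0.8, 0.9],
}
address = ["Zhejiang", "Hangzhou", "Yuhang"]

# List form: one list item per address element.
vector_list = [word_vectors[w] for w in address]

# Matrix form: column j is the j-th word vector, so the number of columns
# equals the number of address elements and the rows are embedding dims.
dims = len(vector_list[0])
vector_matrix = [[col[i] for col in vector_list] for i in range(dims)]
```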
After obtaining the set of word vectors, themethod 300 proceeds to step S330. In step S330, the first word vector set, the second word vector set, and the third word vector set are respectively input to the text encoding layer for processing, so that the first word vector set is encoded as a first text vector, the second word vector set is encoded as a second text vector, and the third word vector set is encoded as a third text vector.
The text encoding layer is implemented with a Deep Neural Network (DNN) model, for example a Recurrent Neural Network (RNN) model, a Convolutional Neural Network (CNN) model, or a Deep Belief Network (DBN) model. The DNN encodes the variable-length embedding output of the address text into a fixed-length sentence vector; at this point, target_addr, pos_addr, and neg_addr are converted into vector_A, vector_B, and vector_C respectively. vector_A is the first text vector, vector_B is the second text vector, and vector_C is the third text vector.
Taking RNN as an example, a word vector sequence corresponding to the address text may be regarded as a time sequence, word vectors in the word vector sequence are sequentially input into RNN, and a finally output vector is a text vector (sentence vector) corresponding to the address text.
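A minimal vanilla-RNN encoder sketch with fixed toy weights (a real text encoding layer would learn its weights during training). It shows the property the text relies on: however many word vectors are fed in, the final hidden state is a fixed-length text vector.

```python
import math

def rnn_encode(word_vectors, hidden_dim=2):
    # Toy weights: W_xh maps input -> hidden, W_hh maps hidden -> hidden.
    in_dim = len(word_vectors[0])
    W_xh = [[0.1] * in_dim for _ in range(hidden_dim)]
    W_hh = [[0.2] * hidden_dim for _ in range(hidden_dim)]
    h = [0.0] * hidden_dim
    for x in word_vectors:  # word vectors fed in sequence order
        h = [math.tanh(sum(W_xh[i][j] * x[j] for j in range(in_dim))
                       + sum(W_hh[i][k] * h[k] for k in range(hidden_dim)))
             for i in range(hidden_dim)]
    return h  # final hidden state = the text (sentence) vector

text_vec = rnn_encode([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```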
Taking CNN as an example, the word vector matrix corresponding to the address text is input into the CNN, which processes it through a plurality of convolutional layers and pooling layers; finally, a fully-connected layer converts the two-dimensional feature map into a one-dimensional feature vector, which is the text vector corresponding to the address text.
After the text vectors are obtained, the method 300 proceeds to step S340. In step S340, a first similarity between the first text vector and the second text vector and a second similarity between the first text vector and the third text vector are calculated by using the similarity calculation layer. In this way, the first similarity may represent the similarity between the first address text and the second address text, and the second similarity may represent the similarity between the first address text and the third address text.
Various similarity measures can be selected, for example: Euclidean distance, cosine similarity, Jaccard coefficient, etc. In this embodiment, the similarity between vector_A and vector_B is denoted SIM_AB, and the similarity between vector_A and vector_C is denoted SIM_AC.
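The three named measures can be sketched for plain Python lists as follows. Note that the Jaccard coefficient is set-based, so in practice it would apply to the sets of address elements rather than to dense text vectors; it is included here only for completeness.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean_dist(a, b):
    # Euclidean distance (a distance, so smaller means more similar).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard(a, b):
    # Jaccard coefficient on sets of elements: |A ∩ B| / |A ∪ B|.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)
```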
Finally, in step S350, network parameters of the word embedding layer and the text encoding layer are adjusted according to the first similarity and the second similarity. The method specifically comprises the following steps: calculating a loss function value according to the first similarity and the second similarity; and adjusting network parameters of the word embedding layer and the text coding layer by using a back propagation algorithm until the loss function value is lower than a preset value or the training times reach a preset number.
The loss function is a triplet loss function; by using the triplet loss function, the distance between the positive sample pair can be shortened and the distance between the negative sample pair can be enlarged. The loss function may be specifically expressed as: Loss = Margin - (SIM_AB - SIM_AC). The network is optimized toward the objective min(Loss) by using a back propagation algorithm, so that the network actively learns parameters that make target_addr closer to pos_addr in the semantic space and farther from neg_addr.
Margin is a hyper-parameter indicating that the training objective must keep a certain gap between SIM_AB and SIM_AC so as to increase the discriminative power of the model; its value can be adjusted repeatedly according to the data and the actual task until the effect is optimal.
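The loss formula above can be sketched as a small function. One hedged addition: the value is clamped at zero below, as is customary for triplet losses, so that triplets already separated by more than the margin contribute no gradient; the margin value 0.2 is an arbitrary illustration.

```python
def triplet_loss(sim_ab, sim_ac, margin=0.2):
    """Loss = Margin - (SIM_AB - SIM_AC), clamped at zero (a common
    convention for triplet losses). Minimizing it pulls the positive
    pair together and pushes the negative pair apart by at least
    `margin` in similarity."""
    return max(0.0, margin - (sim_ab - sim_ac))

# Positive pair already more similar than negative pair by > margin: zero loss.
print(triplet_loss(0.9, 0.3))    # 0.0
# Gap smaller than the margin: a positive loss drives further separation.
print(triplet_loss(0.6, 0.55))   # approximately 0.15
```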
After the training process is completed, a similarity calculation model for calculating the similarity between two address texts is finally obtained. Based on this similarity calculation model, an embodiment of the present invention further provides an address text similarity determination method, which comprises the following steps:
1) acquiring an address text pair with similarity to be determined;
2) inputting the address text pair into the trained address text similarity calculation model to output the similarity of the two address texts included in the address text pair.
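The two steps above can be sketched end to end. The encoder below is a hypothetical stand-in (a character-count hash) for the trained word embedding and text encoding layers; in the actual method those layers come from the trained model, and cosine similarity is assumed as the metric.

```python
import numpy as np

def toy_encode(text, dim=16):
    """Hypothetical stand-in for the trained word-embedding and
    text-encoding layers: character counts hashed into a fixed vector."""
    v = np.zeros(dim)
    for ch in text:
        v[ord(ch) % dim] += 1.0
    return v

def pair_similarity(addr_1, addr_2, encode=toy_encode):
    """Step 1) takes the address text pair; step 2) encodes both texts
    and outputs their cosine similarity as the model's score."""
    a, b = encode(addr_1), encode(addr_2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s_same = pair_similarity("10 West Lake Road", "10 West Lake Road")
s_diff = pair_similarity("10 West Lake Road", "99 Qiantang Avenue")
```

An identical pair scores 1.0, and a dissimilar pair scores lower, matching the intended behavior of the trained model.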
In addition, the similarity calculation model can be applied to various scenarios in which the similarity of address texts needs to be calculated, such as address standardization in the fields of public security, express delivery, logistics, and electronic maps. In these scenarios, an address search service can be provided for the user by using the address text similarity calculation model of the embodiment of the present invention.
FIG. 5 shows a flow diagram of an address search method 500 according to one embodiment of the invention. Referring to FIG. 5, the method 500 includes steps S510 to S530.
In step S510, one or more candidate address texts corresponding to the address text to be queried are obtained. In the address search service, a user inputs a query address text (query) through a user terminal; generally, the user's input is an incomplete and inaccurate address text. The user terminal sends the query to the computing device, and an address searching apparatus in the computing device searches a standard address library and recalls a batch of candidate address texts, the number of which usually ranges from several to thousands.
In step S520, the address text to be queried and the candidate address texts are input into a preset address text similarity calculation model to obtain the similarity between them, wherein the address text similarity calculation model is obtained by training according to the method 300. In this step, the similarity between the address text to be queried and each candidate address text is calculated respectively.
After the similarities between the address text to be queried and all candidate address texts are obtained, the method 500 proceeds to step S530. In step S530, the candidate address text with the maximum similarity is determined as the target address text corresponding to the address text to be queried, and the target address text is returned to the user.
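Steps S520 and S530 can be sketched as follows, assuming the text vectors have already been produced by the trained model and that cosine similarity is the chosen metric; the query and candidate values are toy data.

```python
import numpy as np

def search_address(query_vec, candidate_vecs, candidate_texts):
    """Score every candidate against the query (step S520) and return
    the candidate with the maximum similarity (step S530)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cos(query_vec, c) for c in candidate_vecs]
    best = int(np.argmax(sims))
    return candidate_texts[best], sims[best]

query = np.array([0.9, 0.1, 0.0])            # vector of the query text
cands = [np.array([1.0, 0.0, 0.0]),          # closest to the query
         np.array([0.0, 1.0, 0.0]),
         np.array([0.0, 0.0, 1.0])]
texts = ["No.1 Xihu Rd", "No.2 Gongshu Rd", "No.3 Binjiang Rd"]
best_text, best_sim = search_address(query, cands, texts)
print(best_text)   # No.1 Xihu Rd
```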
Fig. 6 is a schematic diagram of a training apparatus 600 for an address text similarity calculation model according to an embodiment of the present invention. The address text similarity calculation model includes a word embedding layer, a text encoding layer, and a similarity calculation layer, and the training apparatus 600 includes:
the obtaining module 610 is adapted to obtain a training data set, where the training data set includes a plurality of pieces of training data, and each piece of training data includes first, second, and third address texts, where the address elements of the first n levels of the first and second address texts are the same, the address elements of the first (n-1) levels of the first and third address texts are the same, and the address elements of the nth level are different. The obtaining module 610 is specifically configured to execute the method of step S310; for the processing logic and functions of the obtaining module 610, reference may be made to the related description of step S310, which is not repeated herein.
The word vector obtaining module 620 is adapted to input the first, second, and third address texts of each piece of training data into the word embedding layer to obtain corresponding first, second, and third word vector sets. The word vector obtaining module 620 is specifically configured to execute the method of step S320; for the processing logic and functions of the word vector obtaining module 620, reference may be made to the related description of step S320, which is not repeated herein.
The text vector obtaining module 630 is adapted to input the first, second, and third word vector sets into the text encoding layer to obtain corresponding first, second, and third text vectors. The text vector obtaining module 630 is specifically configured to execute the method of step S330; for the processing logic and functions of the text vector obtaining module 630, reference may be made to the related description of step S330, which is not repeated herein.
The second similarity calculation module 640 is adapted to calculate the first similarity of the first and second text vectors and the second similarity of the first and third text vectors by using the similarity calculation layer. The second similarity calculation module 640 is specifically configured to execute the method of step S340; for the processing logic and functions of the second similarity calculation module 640, reference may be made to the related description of step S340, which is not repeated herein.
The parameter adjusting module 650 is adapted to adjust the network parameters of the word embedding layer and the text encoding layer according to the first similarity and the second similarity. The parameter adjusting module 650 is specifically configured to execute the method of step S350; for the processing logic and functions of the parameter adjusting module 650, reference may be made to the related description of step S350, which is not repeated herein.
Fig. 7 shows a schematic diagram of an address search apparatus 700 according to an embodiment of the present invention. Referring to Fig. 7, the address search apparatus 700 includes:
the query module 710, adapted to obtain one or more candidate address texts corresponding to the address text to be queried;
the first similarity calculation module 720, adapted to input the address text to be queried and the candidate address texts into a preset address text similarity calculation model to obtain the similarity between them, wherein the address text similarity calculation model is obtained by training with the training apparatus 600; and
the output module 730, adapted to determine the candidate address text with the maximum similarity as the target address text corresponding to the address text to be queried.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute the address text similarity determination method of the present invention according to instructions in the program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.