Disclosure of Invention
In view of the above problems, the present invention has been made to provide an address text similarity determination method and an address search method that overcome or at least partially solve the above problems.
According to an aspect of the present invention, there is provided an address text similarity determination method, the address text including a plurality of address elements arranged from high to low in rank, the method including:
acquiring an address text pair with similarity to be determined;
inputting the address text pair into a preset address text similarity calculation model to output the similarity of two address texts included in the address text pair;
the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data comprising at least a first address text, a second address text, and a third address text, wherein the first address text and the second address text share the same first n levels of address elements, forming a positive sample pair, and the first address text and the third address text share the same first (n-1) levels of address elements but differ at the nth level, forming a negative sample pair.
Optionally, in the address text similarity determining method according to the present invention, the address text similarity calculation model includes a word embedding layer, a text encoding layer, and a similarity calculation layer, and the step of training the address text similarity calculation model includes: inputting the first, second and third address texts of each piece of training data into a word embedding layer to obtain corresponding first, second and third word vector sets; inputting the first, second and third word vector sets into a text coding layer to obtain corresponding first, second and third text vectors; calculating a first similarity of the first text vector and the second text vector and a second similarity of the first text vector and the third text vector by using a similarity calculation layer; and adjusting the network parameters of the address text similarity calculation model according to the first similarity and the second similarity.
Optionally, in the address text similarity determining method according to the present invention, the network parameter includes: parameters of a word embedding layer and/or parameters of a text encoding layer.
Optionally, in the address text similarity determining method according to the present invention, each word vector set in the first, second, and third word vector sets includes a plurality of word vectors, and each word vector corresponds to one address element in the address text.
Optionally, in the address text similarity determining method according to the present invention, the word embedding layer employs a GloVe model or a Word2Vec model.
Optionally, in the address text similarity determining method according to the present invention, the first similarity and the second similarity each adopt at least one of a Euclidean distance, a cosine similarity, or a Jaccard coefficient.
Optionally, in the method for determining similarity of address texts according to the present invention, the adjusting network parameters of the address text similarity calculation model according to the first and second similarities includes: calculating a loss function value according to the first similarity and the second similarity; and adjusting the network parameters of the address text similarity calculation model by using a back propagation algorithm until the loss function value is lower than a preset value or the training times reach a preset number.
Optionally, in the address text similarity determining method according to the present invention, the loss function value is: Loss = Margin - (first similarity - second similarity), where Loss is the loss function value and Margin is a hyperparameter.
Optionally, in the address text similarity determining method according to the present invention, the text encoding layer includes at least one of an RNN model, a CNN model, or a DBN model.
According to another aspect of the present invention, there is provided an address search method including:
acquiring one or more candidate address texts corresponding to an address text to be queried;
inputting the address text to be queried and a candidate address text into a preset address text similarity calculation model to obtain the similarity between the two, wherein the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data comprising at least a first address text, a second address text, and a third address text, wherein the first address text and the second address text share the same first n levels of address elements, forming a positive sample pair, and the first address text and the third address text share the same first (n-1) levels of address elements but differ at the nth level, forming a negative sample pair;
and determining the candidate address text with the maximum similarity as the target address text corresponding to the address text to be queried.
According to another aspect of the present invention, there is provided an address search apparatus including:
the query module is suitable for acquiring one or more candidate address texts corresponding to the address texts to be queried;
the first similarity calculation module is suitable for inputting the address text to be queried and a candidate address text into a preset address text similarity calculation model to obtain the similarity between the two, wherein the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data comprising at least a first address text, a second address text, and a third address text, wherein the first address text and the second address text share the same first n levels of address elements, forming a positive sample pair, and the first address text and the third address text share the same first (n-1) levels of address elements but differ at the nth level, forming a negative sample pair;
and the output module is suitable for determining the candidate address text with the maximum similarity as the target address text corresponding to the address text to be queried.
According to another aspect of the present invention, there is provided an apparatus for training an address text similarity calculation model, the address text including a plurality of address elements arranged in a high-to-low order, the address text similarity calculation model including a word embedding layer, a text encoding layer, and a similarity calculation layer, the apparatus comprising:
the training data acquisition module is suitable for acquiring a training data set, the training data set comprising a plurality of pieces of training data, each piece of training data comprising at least a first address text, a second address text, and a third address text, wherein the first address text and the second address text share the same first n levels of address elements, forming a positive sample pair, and the first address text and the third address text share the same first (n-1) levels of address elements but differ at the nth level, forming a negative sample pair;
the word vector acquisition module is suitable for inputting the first, second and third address texts of each piece of training data into the word embedding layer to obtain corresponding first, second and third word vector sets;
the text vector acquisition module is suitable for inputting the first, second and third word vector sets into a text coding layer to obtain corresponding first, second and third text vectors;
the second similarity calculation module is suitable for calculating, by using the similarity calculation layer, the first similarity of the first text vector and the second text vector and the second similarity of the first text vector and the third text vector;
and the parameter adjusting module is suitable for adjusting the network parameters of the address text similarity calculation model according to the first similarity and the second similarity.
According to another aspect of the present invention, there is provided a computing device comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.
Since address text naturally contains hierarchical relationships, address elements of different levels play different roles in address similarity calculation. The embodiment of the invention uses the hierarchical relationship in the address text to automatically learn the weights of address elements at different levels, avoiding the subjectivity of manually assigned weights and adapting to the target data source, so that the similarity of two address texts can be calculated accurately.
The foregoing is merely an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood, and that the above and other objects, features, and advantages of the present invention may be more readily comprehensible, embodiments of the present invention are described below.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
First, some terms appearing in the description of the embodiments of the present invention are explained as follows:
Address text: for example, "Alibaba, No. 969 Wenyi West Road, Hangzhou", or a full address in Maxi Town, Pengshan District, Meishan, Sichuan. The address text includes a plurality of address elements arranged in a high-to-low order.
Address element: the elements constituting each granularity of the address text. For example, in "Alibaba, No. 969 Wenyi West Road, Hangzhou", "Hangzhou" represents a city, "Wenyi West Road" represents a road, "969" represents a road number, and "Alibaba" represents a Point of Interest (POI).
Address level: the areas corresponding to address elements have a size-inclusion relationship, i.e., each address element has a corresponding address level, for example: province > city > district > street/community > road > building.
Address similarity: the similarity between two pieces of address text, with a value from 0 to 1. The larger the value, the higher the probability that the two addresses refer to the same location: a value of 1 indicates the same address, and a value of 0 indicates that the two addresses are unrelated.
Partial order relationship: regions in an address have a hierarchical size-inclusion relationship, for example: province > city > district > street/community > road > building.
Since the address text naturally contains a hierarchical relationship, i.e., the partial order relationship described above, address elements of different levels play different roles in address similarity calculation. The embodiment of the invention automatically generates the weights of the address elements with different levels by utilizing the hierarchical relationship in the address text, and the weights are implicitly embodied in the network parameters of the address text similarity calculation model, thereby accurately calculating the similarity degree of the two address texts.
FIG. 1 shows a schematic diagram of an address search system 100 according to one embodiment of the invention. As shown in FIG. 1, the address search system 100 includes a user terminal 110 and a computing device 200.
The user terminal 110 is a terminal device used by a user, which may specifically be a personal computer such as a desktop or notebook computer, or a mobile phone, tablet computer, multimedia device, smart wearable device, and the like, but is not limited thereto. Computing device 200 is used to provide services to user terminal 110 and may be implemented as a server, such as an application server or a Web server; it may also be implemented as a desktop computer, notebook computer, processor chip, mobile phone, tablet computer, etc., but is not limited thereto.
In an embodiment of the present invention, the computing device 200 may be used to provide address search services to the user; for example, the computing device 200 may be a server of an electronic map application. However, those skilled in the art will understand that the computing device 200 may be any device capable of providing address search services and is not limited to a server of an electronic map application.
In one embodiment, the address search system 100 also includes a data storage 120. The data storage 120 may be a relational database such as MySQL or ACCESS, or a non-relational database such as a NoSQL database. The data storage 120 may be a local database residing in the computing device 200, or may be deployed at multiple geographic locations as a distributed database, such as HBase. In short, the data storage 120 is used for storing data, and the present invention does not limit its specific deployment and configuration. The computing device 200 may connect with the data storage 120 and retrieve the data stored in it. For example, the computing device 200 may directly read the data in the data storage 120 (when the data storage 120 is a local database of the computing device 200), or may access the Internet in a wired or wireless manner and obtain the data in the data storage 120 through a data interface.
In the embodiment of the present invention, the data storage 120 stores a standard address library, in which the address texts are standard address texts (complete and accurate address texts). In the address search service, a user inputs a query address text (query) through the user terminal 110; generally, the user inputs an incomplete or inaccurate address text. The user terminal 110 sends the query to the computing device 200, and the address search apparatus in the computing device 200 recalls a batch of candidate address texts, usually several to several thousand, by searching the standard address library. The address search apparatus then calculates the degree of relevance between the candidate address texts and the query, for which the address similarity is important reference information. After the address similarity between the query and every candidate address text is calculated, the candidate address text with the maximum similarity is determined as the target address text corresponding to the address text to be queried, and the target address text is returned to the user.
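The recall-then-rank flow just described can be sketched as follows. The `similarity` function here is a hypothetical stand-in for the trained model (it merely counts matching hierarchy levels), and the addresses are illustrative; a real deployment would score candidates with the address text similarity calculation model.

```python
def similarity(query_elems, cand_elems):
    # Dummy stand-in for the trained similarity model: the fraction of
    # hierarchy levels on which the two addresses agree.
    same = sum(1 for q, c in zip(query_elems, cand_elems) if q == c)
    return same / max(len(query_elems), len(cand_elems))

def search(query, candidates):
    # Score every recalled candidate and return the best-scoring one,
    # mirroring "determine the candidate with the maximum similarity".
    return max(candidates, key=lambda cand: similarity(query, cand))

query = ["Zhejiang", "Hangzhou", "Xihu", "Wenyi West Road"]
candidates = [
    ["Zhejiang", "Ningbo", "Yinzhou", "Garden Road"],
    ["Zhejiang", "Hangzhou", "Xihu", "Wenyi West Road"],
    ["Shanghai", "Shanghai", "Changning", "Hongqiao Road"],
]
best = search(query, candidates)
```

In the actual system the candidate set comes from a recall over the standard address library, and the scoring function is the trained model described in the following sections.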
Specifically, the address search apparatus may calculate the similarity between the address text to be queried and a candidate address text using the address text similarity calculation model. Correspondingly, the computing device 200 may further include a training apparatus for the address text similarity calculation model, and the data storage 120 further stores a training address library, which may be the same as or different from the standard address library. The training address library includes a plurality of address texts, and the training apparatus trains the address text similarity calculation model using the address texts in the training address library.
FIG. 2 shows a block diagram of a computing device 200 according to one embodiment of the invention. As shown in FIG. 2, in a basic configuration 202, the computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a Digital Signal Processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. Example processor cores 214 may include Arithmetic Logic Units (ALUs), Floating Point Units (FPUs), digital signal processing cores (DSP cores), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. The application 222 is actually a plurality of program instructions that direct the processor 204 to perform corresponding operations. In some embodiments, the application 222 may be arranged to cause the processor 204 to operate with the program data 224 on the operating system.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250, which may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner) via one or more I/O ports 258. The example communication device 246 may include a network controller 260, which may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
In the computing device 200 according to the invention, the application 222 comprises a training apparatus 600 for the address text similarity calculation model and an address search apparatus 700. The apparatus 600 includes a plurality of program instructions that may direct the processor 204 to perform the method 300 of training an address text similarity calculation model. The apparatus 700 includes a plurality of program instructions that may direct the processor 204 to perform the address search method 500.
FIG. 3 shows a flow diagram of a method 300 for training an address text similarity calculation model according to one embodiment of the invention. The method 300 is suitable for execution in a computing device, such as the computing device 200 described above. As shown in FIG. 3, the method 300 begins at step S310. In step S310, a training data set is obtained, where the training data set includes a plurality of pieces of training data, and each piece of training data includes 3 address texts: a first address text, a second address text, and a third address text. Each address text comprises a plurality of address elements ranked from high to low; the first n levels of address elements of the first address text and the second address text are the same, while the first (n-1) levels of address elements of the first address text and the third address text are the same and the nth level of address elements are different. Here, the value range of n is (1, N), where N is the number of address levels included in the address text; for example, if the address text includes 5 address levels, namely province, city, district, road, and road number, then N is 5. Of course, n may adopt other value ranges according to the specific application scenario.
In the embodiment of the present invention, each piece of training data is a triplet {target_addr, pos_addr, neg_addr} formed by 3 address texts, where target_addr corresponds to the first address text, pos_addr corresponds to the second address text, and neg_addr corresponds to the third address text. {target_addr, pos_addr} constitutes a positive sample pair, and {target_addr, neg_addr} constitutes a negative sample pair.
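The triplet structure of a piece of training data can be illustrated as follows; the sample addresses are purely illustrative, not drawn from a real training address library.

```python
from collections import namedtuple

# One piece of training data: the triplet {target_addr, pos_addr, neg_addr}.
Triplet = namedtuple("Triplet", ["target_addr", "pos_addr", "neg_addr"])

t = Triplet(
    target_addr="Alibaba Xixi Campus, No. 969 Wenyi West Road, Hangzhou, Zhejiang",
    pos_addr="No. 1008 Wenyi West Road, Hangzhou, Zhejiang",
    neg_addr="Hongqiao International Airport, Changning District, Shanghai",
)

# {target_addr, pos_addr} is the positive pair; {target_addr, neg_addr} the negative pair.
positive_pair = (t.target_addr, t.pos_addr)
negative_pair = (t.target_addr, t.neg_addr)
```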
In one embodiment, the training data set is obtained as follows:
Firstly, an original address text is obtained from the training address library (or the standard address library) and parsed: the character string of the address text is segmented and formatted into address elements. For example, the address text "Room 910, Floor 7, Building 1, Alibaba Xixi Campus, No. 969 Wenyi West Road, Hangzhou, Zhejiang" may be formatted as "prov=Zhejiang city=Hangzhou road=Wenyi West Road roadno=969 poi=Alibaba Xixi Campus houseno=Building 1 floorno=Floor 7 roomno=910". Specifically, this parsing may be completed by combining a word segmentation model and a named entity recognition model; the embodiment of the present invention does not limit the specific word segmentation model and named entity model, and those skilled in the art may select them reasonably as needed.
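A minimal sketch of the formatting step, assuming the segmentation/NER stage has already produced labeled elements; the level names (`prov`, `city`, ...) follow the formatted example above and are illustrative, not a prescribed schema.

```python
# Address levels in high-to-low order, mirroring the formatted example.
LEVELS = ["prov", "city", "district", "road", "roadno", "poi",
          "houseno", "floorno", "roomno"]

def format_elements(parsed):
    # Keep only known levels, in hierarchy order (high to low).
    return {lvl: parsed[lvl] for lvl in LEVELS if lvl in parsed}

# Hypothetical output of the segmentation + named-entity stage.
parsed = {
    "prov": "Zhejiang", "city": "Hangzhou",
    "road": "Wenyi West Road", "roadno": "969",
    "poi": "Alibaba Xixi Campus",
}
elems = format_elements(parsed)
```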
Then, the address texts formatted as address elements are aggregated (deduplicated and sorted) according to the address elements of different levels, so as to form the following table:
Finally, the aggregated data in the table are combined into positive and negative sample pairs of training data according to the different address levels, with the output format: {target_addr, pos_addr, neg_addr}. As described above, {target_addr, pos_addr} constitutes a positive sample pair, and {target_addr, neg_addr} constitutes a negative sample pair. It should be noted that one positive sample pair may correspond to multiple negative sample pairs; that is, one target_addr corresponds to one pos_addr but may correspond to multiple neg_addr.
The specific operation is as follows:
(1) Select an address text, for example: prov=Zhejiang city=Hangzhou district=Yuhang road=Wenyi West Road roadno=969 poi=Alibaba Xixi Campus;
(2) Traverse all address levels, e.g., province -> city -> district -> road; at each address level, find address elements that are respectively the same as and different from the current address element, and form positive and negative sample pairs with the current address text, for example:
At the province level, for "Alibaba Xixi Campus, No. 969 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang", a positive example is: "Yijia Garden, No. 245, Yinzhou District, Ningbo, Zhejiang"; a negative example is: "Hongqiao International Airport, No. 2550 Hongqiao Road, Changning District, Shanghai".
At the city level, for "Alibaba Xixi Campus, No. 969 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang", a positive example is: "Zhejiang Academy of Socialism, No. 1008 Wenyi West Road, Hangzhou, Zhejiang"; a negative example is: "No. 525 Garden Road, Yinzhou District, Ningbo, Zhejiang".
At the district level, for "Alibaba Xixi Campus, No. 969 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang", a positive example is: "Saiyin International Square, No. 248 Gaojiao Road, Yuhang District, Hangzhou, Zhejiang"; a negative example is: "Nanshan Campus of the China Academy of Art, No. 218 Nanshan Road, Hangzhou, Zhejiang".
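Steps (1) and (2) can be sketched as follows. Addresses are simplified to lists of elements in high-to-low order, and the pool is a toy stand-in for the aggregated table: at level n, a positive shares the first n elements with the target, while a negative shares the first n-1 but differs at level n.

```python
def make_pairs(target, pool, n):
    # Positives: agree with the target on the first n levels (excluding the
    # target itself). Negatives: agree on the first n-1 levels, differ at n.
    positives = [a for a in pool if a[:n] == target[:n] and a != target]
    negatives = [a for a in pool
                 if a[:n - 1] == target[:n - 1] and a[n - 1] != target[n - 1]]
    return positives, negatives

target = ["Zhejiang", "Hangzhou", "Yuhang"]
pool = [
    ["Zhejiang", "Hangzhou", "Yuhang"],    # duplicate of the target
    ["Zhejiang", "Hangzhou", "Xihu"],      # same first 2 levels, differs at level 3
    ["Zhejiang", "Ningbo", "Yinzhou"],     # same first 1 level, differs at level 2
]
pos, neg = make_pairs(target, pool, n=2)
```

At n=2 (the city level), the Hangzhou address is a positive and the Ningbo address a negative, matching the worked examples above.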
After the training data set is acquired, the method 300 proceeds to step S320. Before describing the processing procedure of step S320, the structure of the address text similarity calculation model according to the embodiment of the present invention will be described.
Referring to fig. 4, an address text similarity calculation model 400 according to an embodiment of the present invention includes: a word embedding layer 410, a text encoding layer 420, and a similarity calculation layer 430. The word embedding layer 410 is adapted to convert each address element in the address text into a word vector and combine the word vectors into a word vector set corresponding to the address text; the text encoding layer 420 is adapted to encode a word vector set corresponding to the address text into a text vector; the similarity calculation layer 430 is adapted to calculate the similarity between two text vectors, which characterizes the similarity between the address texts.
In step S320, the first address text, the second address text, and the third address text in each piece of training data are respectively input to the word embedding layer for processing, so as to obtain a first word vector set corresponding to the first address text, a second word vector set corresponding to the second address text, and a third word vector set corresponding to the third address text.
The word embedding layer converts each word in a sentence into a numeric vector (word vector). The weights of the embedding layer can be pre-computed from text co-occurrence information of a massive corpus, for example using the GloVe algorithm, or the CBOW and skip-gram algorithms in Word2Vec. These algorithms rest on the observation that different textual expressions of the same latent semantics repeatedly appear in similar contexts; by predicting the context from a word, or predicting a word from its context, the latent semantics of each word are obtained. In the embodiment of the present invention, the parameters of the word embedding layer may be trained separately using a corpus; alternatively, the word embedding layer and the text encoding layer may be trained together, so that the parameters of both are obtained simultaneously. The following description takes joint training of the word embedding layer and the text encoding layer as an example.
Specifically, the address text comprises a plurality of formatted address elements. After the address text is input into the word embedding layer, the layer treats each address element as a word and converts it into a word vector, thereby obtaining a plurality of word vectors, which are then combined into a word vector set.
In one implementation, the word vector set is represented as a list, i.e., a word vector list, each list item in the word vector list corresponds to a word vector, and the number of items in the list is the number of address elements in the address text. In another implementation, the word vector set is represented as a matrix, that is, a word vector matrix, each column of the matrix corresponds to a word vector, and the number of columns of the matrix is the number of address elements in the address text.
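The two representations can be illustrated with toy three-dimensional embeddings; the vectors below are made up for illustration, not real model output.

```python
# Hypothetical 3-dimensional embeddings, one per address element.
word_vectors = {
    "Zhejiang": [0.1, 0.2, 0.3],
    "Hangzhou": [0.4, 0.5, 0.6],
    "Yuhang":   [0.7, 0.8, 0.9],
}
address = ["Zhejiang", "Hangzhou", "Yuhang"]

# List form: one list item per address element.
vector_list = [word_vectors[w] for w in address]

# Matrix form: column j is the j-th word vector, so the number of columns
# equals the number of address elements and the rows are embedding dims.
dims = len(vector_list[0])
vector_matrix = [[col[i] for col in vector_list] for i in range(dims)]
```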
After obtaining the set of word vectors, themethod 300 proceeds to step S330. In step S330, the first word vector set, the second word vector set, and the third word vector set are respectively input to the text encoding layer for processing, so that the first word vector set is encoded as a first text vector, the second word vector set is encoded as a second text vector, and the third word vector set is encoded as a third text vector.
The text encoding layer is implemented with a Deep Neural Network (DNN) model, for example a Recurrent Neural Network (RNN) model, a Convolutional Neural Network (CNN) model, or a Deep Belief Network (DBN) model. The DNN encodes the variable-length embedding output of the address text into a fixed-length sentence vector; at this point, target_addr, pos_addr, and neg_addr are converted into vector_A, vector_B, and vector_C respectively. vector_A is the first text vector, vector_B is the second text vector, and vector_C is the third text vector.
Taking RNN as an example, a word vector sequence corresponding to the address text may be regarded as a time sequence, word vectors in the word vector sequence are sequentially input into RNN, and a finally output vector is a text vector (sentence vector) corresponding to the address text.
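A minimal vanilla-RNN encoder sketch with fixed toy weights (a real text encoding layer would learn its weights during training). It shows the property the text relies on: however many word vectors are fed in, the final hidden state is a fixed-length text vector.

```python
import math

def rnn_encode(word_vectors, hidden_dim=2):
    # Toy weights: W_xh maps input -> hidden, W_hh maps hidden -> hidden.
    in_dim = len(word_vectors[0])
    W_xh = [[0.1] * in_dim for _ in range(hidden_dim)]
    W_hh = [[0.2] * hidden_dim for _ in range(hidden_dim)]
    h = [0.0] * hidden_dim
    for x in word_vectors:  # word vectors fed in sequence order
        h = [math.tanh(sum(W_xh[i][j] * x[j] for j in range(in_dim))
                       + sum(W_hh[i][k] * h[k] for k in range(hidden_dim)))
             for i in range(hidden_dim)]
    return h  # final hidden state = the text (sentence) vector

text_vec = rnn_encode([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```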
Taking CNN as an example, the word vector matrix corresponding to the address text is input into the CNN, which processes it through a plurality of convolutional layers and pooling layers; finally, a fully-connected layer converts the two-dimensional feature map into a one-dimensional feature vector, which is the text vector corresponding to the address text.
After the text vectors are obtained, the method 300 proceeds to step S340. In step S340, a first similarity between the first text vector and the second text vector and a second similarity between the first text vector and the third text vector are calculated by using the similarity calculation layer. In this way, the first similarity may represent the similarity between the first address text and the second address text, and the second similarity may represent the similarity between the first address text and the third address text.
Various similarity measures can be selected, for example: Euclidean distance, cosine similarity, Jaccard coefficient, etc. In this embodiment, the similarity between vector_A and vector_B is denoted SIM_AB, and the similarity between vector_A and vector_C is denoted SIM_AC.
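The three named measures can be sketched for plain Python lists as follows. Note that the Jaccard coefficient is set-based, so in practice it would apply to the sets of address elements rather than to dense text vectors; it is included here only for completeness.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean_dist(a, b):
    # Euclidean distance (a distance, so smaller means more similar).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard(a, b):
    # Jaccard coefficient on sets of elements: |A ∩ B| / |A ∪ B|.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)
```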
Finally, in step S350, network parameters of the word embedding layer and the text encoding layer are adjusted according to the first similarity and the second similarity. The method specifically comprises the following steps: calculating a loss function value according to the first similarity and the second similarity; and adjusting network parameters of the word embedding layer and the text coding layer by using a back propagation algorithm until the loss function value is lower than a preset value or the training times reach a preset number.
The loss function is a triplet loss function; by using the triplet loss function, the distance between the positive sample pair can be shortened and the distance between the negative sample pair can be enlarged. The loss function may be specifically expressed as: Loss = Margin - (SIM_AB - SIM_AC). The network is optimized toward the objective min(Loss) by using a back propagation algorithm, so that the network actively learns parameters that make target_addr closer to pos_addr in the semantic space and farther from neg_addr.
Margin is a hyper-parameter indicating that the training objective must keep a certain gap between SIM_AB and SIM_AC so as to increase the discriminative power of the model; its value can be adjusted repeatedly according to the data and the actual task until the effect is optimal.
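The loss formula above can be sketched as a small function. One hedged addition: the value is clamped at zero below, as is customary for triplet losses, so that triplets already separated by more than the margin contribute no gradient; the margin value 0.2 is an arbitrary illustration.

```python
def triplet_loss(sim_ab, sim_ac, margin=0.2):
    """Loss = Margin - (SIM_AB - SIM_AC), clamped at zero (a common
    convention for triplet losses). Minimizing it pulls the positive
    pair together and pushes the negative pair apart by at least
    `margin` in similarity."""
    return max(0.0, margin - (sim_ab - sim_ac))

# Positive pair already more similar than negative pair by > margin: zero loss.
print(triplet_loss(0.9, 0.3))    # 0.0
# Gap smaller than the margin: a positive loss drives further separation.
print(triplet_loss(0.6, 0.55))   # approximately 0.15
```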
After the training process is completed, a similarity calculation model for calculating the similarity between two address texts is finally obtained. Based on this similarity calculation model, an embodiment of the present invention further provides an address text similarity determination method, which comprises the following steps:
1) acquiring an address text pair with similarity to be determined;
2) inputting the address text pair into the trained address text similarity calculation model to output the similarity of the two address texts included in the address text pair.
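The two steps above can be sketched end to end. The encoder below is a hypothetical stand-in (a character-count hash) for the trained word embedding and text encoding layers; in the actual method those layers come from the trained model, and cosine similarity is assumed as the metric.

```python
import numpy as np

def toy_encode(text, dim=16):
    """Hypothetical stand-in for the trained word-embedding and
    text-encoding layers: character counts hashed into a fixed vector."""
    v = np.zeros(dim)
    for ch in text:
        v[ord(ch) % dim] += 1.0
    return v

def pair_similarity(addr_1, addr_2, encode=toy_encode):
    """Step 1) takes the address text pair; step 2) encodes both texts
    and outputs their cosine similarity as the model's score."""
    a, b = encode(addr_1), encode(addr_2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s_same = pair_similarity("10 West Lake Road", "10 West Lake Road")
s_diff = pair_similarity("10 West Lake Road", "99 Qiantang Avenue")
```

An identical pair scores 1.0, and a dissimilar pair scores lower, matching the intended behavior of the trained model.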
In addition, the similarity calculation model can be applied to various scenarios in which the similarity of address texts needs to be calculated, such as address standardization in the fields of public security, express delivery, logistics, and electronic maps. In these scenarios, an address search service can be provided for the user by using the address text similarity calculation model of the embodiment of the present invention.
FIG. 5 shows a flow diagram of an address search method 500 according to one embodiment of the invention. Referring to FIG. 5, the method 500 includes steps S510 to S530.
In step S510, one or more candidate address texts corresponding to the address text to be queried are obtained. In the address search service, a user inputs a query address text (query) through a user terminal; generally, the user's input is an incomplete and inaccurate address text. The user terminal sends the query to the computing device, and an address searching apparatus in the computing device searches a standard address library and recalls a batch of candidate address texts, the number of which usually ranges from several to thousands.
In step S520, the address text to be queried and the candidate address texts are input into a preset address text similarity calculation model to obtain the similarity between them, wherein the address text similarity calculation model is obtained by training according to the method 300. In this step, the similarity between the address text to be queried and each candidate address text is calculated respectively.
After the similarities between the address text to be queried and all candidate address texts are obtained, the method 500 proceeds to step S530. In step S530, the candidate address text with the maximum similarity is determined as the target address text corresponding to the address text to be queried, and the target address text is returned to the user.
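Steps S520 and S530 can be sketched as follows, assuming the text vectors have already been produced by the trained model and that cosine similarity is the chosen metric; the query and candidate values are toy data.

```python
import numpy as np

def search_address(query_vec, candidate_vecs, candidate_texts):
    """Score every candidate against the query (step S520) and return
    the candidate with the maximum similarity (step S530)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cos(query_vec, c) for c in candidate_vecs]
    best = int(np.argmax(sims))
    return candidate_texts[best], sims[best]

query = np.array([0.9, 0.1, 0.0])            # vector of the query text
cands = [np.array([1.0, 0.0, 0.0]),          # closest to the query
         np.array([0.0, 1.0, 0.0]),
         np.array([0.0, 0.0, 1.0])]
texts = ["No.1 Xihu Rd", "No.2 Gongshu Rd", "No.3 Binjiang Rd"]
best_text, best_sim = search_address(query, cands, texts)
print(best_text)   # No.1 Xihu Rd
```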
Fig. 6 is a schematic diagram of a training apparatus 600 for an address text similarity calculation model according to an embodiment of the present invention. The address text similarity calculation model includes a word embedding layer, a text encoding layer, and a similarity calculation layer, and the training apparatus 600 includes:
the obtaining module 610 is adapted to obtain a training data set, where the training data set includes a plurality of pieces of training data, and each piece of training data includes first, second, and third address texts, where the address elements of the first n levels of the first and second address texts are the same, the address elements of the first (n-1) levels of the first and third address texts are the same, and the address elements of the nth level are different. The obtaining module 610 is specifically configured to execute the method of step S310; for the processing logic and functions of the obtaining module 610, reference may be made to the related description of step S310, which is not repeated herein.
The word vector obtaining module 620 is adapted to input the first, second, and third address texts of each piece of training data into the word embedding layer to obtain corresponding first, second, and third word vector sets. The word vector obtaining module 620 is specifically configured to execute the method of step S320; for the processing logic and functions of the word vector obtaining module 620, reference may be made to the related description of step S320, which is not repeated herein.
The text vector obtaining module 630 is adapted to input the first, second, and third word vector sets into the text encoding layer to obtain corresponding first, second, and third text vectors. The text vector obtaining module 630 is specifically configured to execute the method of step S330; for the processing logic and functions of the text vector obtaining module 630, reference may be made to the related description of step S330, which is not repeated herein.
The second similarity calculation module 640 is adapted to calculate the first similarity of the first and second text vectors and the second similarity of the first and third text vectors by using the similarity calculation layer. The second similarity calculation module 640 is specifically configured to execute the method of step S340; for the processing logic and functions of the second similarity calculation module 640, reference may be made to the related description of step S340, which is not repeated herein.
The parameter adjusting module 650 is adapted to adjust the network parameters of the word embedding layer and the text encoding layer according to the first similarity and the second similarity. The parameter adjusting module 650 is specifically configured to execute the method of step S350; for the processing logic and functions of the parameter adjusting module 650, reference may be made to the related description of step S350, which is not repeated herein.
Fig. 7 shows a schematic diagram of an address search apparatus 700 according to an embodiment of the present invention. Referring to Fig. 7, the address search apparatus 700 includes:
the query module 710, adapted to obtain one or more candidate address texts corresponding to the address text to be queried;
the first similarity calculation module 720, adapted to input the address text to be queried and the candidate address texts into a preset address text similarity calculation model to obtain the similarity between them, wherein the address text similarity calculation model is obtained by training with the training apparatus 600; and
the output module 730, adapted to determine the candidate address text with the maximum similarity as the target address text corresponding to the address text to be queried.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute the address text similarity determination method of the present invention according to instructions in the program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.