Disclosure of Invention
Embodiments of the invention provide a domain name management method and device, which determine the semantic similarity between unknown domain names and key domain names through cluster analysis of vectorized domain names, thereby improving the accuracy of identifying domain name priority.
In a first aspect, an embodiment of the present invention provides a domain name management method, which may be applied to a server (such as a DNS server). The method comprises the following steps:
acquiring a domain name set, wherein the domain name set comprises a first domain name with a first label and a second domain name without the first label, and the first label indicates the domain name priority;
vectorizing the first domain name and the second domain name to obtain vectors corresponding to each domain name in the domain name set;
clustering vectors corresponding to each domain name in the domain name set to obtain at least two clusters, and determining a first cluster from the at least two clusters, wherein the number of vectors corresponding to the first domain name in the first cluster is greater than the number of vectors corresponding to the second domain name, and the at least two clusters comprise the first cluster and at least one second cluster;
determining a domain name corresponding to the second cluster meeting the distance condition according to the distance between the first cluster and each second cluster, wherein the response priority of the domain name corresponding to the second cluster meeting the distance condition is greater than that of the domain name corresponding to the second cluster not meeting the distance condition.
In the above technical solution, the first label may be understood as a target label that indicates the priority of a domain name. In the domain name protection scenario, the first label is set according to the importance of the domain name, so the first label may also be referred to as an importance label, and the first domain name may accordingly be referred to as an important domain name or a key domain name. The second domain name does not carry the first label, so its priority is unknown, and the second domain name may therefore be referred to as an unknown domain name. The first domain name and the second domain name are then vectorized to obtain a vector corresponding to each domain name, so that the semantic features of the first and second domain names are better represented. The vectors corresponding to each domain name are clustered to obtain at least two clusters, and the first cluster is determined from among them. Finally, according to the distance between the first cluster and each second cluster, the domain name corresponding to the second cluster meeting the distance condition is determined, which realizes the identification of domain name priority; clustering the vectors improves the accuracy of that identification.
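Illustratively, and without limiting the claimed solution, the overall flow can be sketched in Python as follows; the choice of k-means, Euclidean distance, and all function and variable names here are assumptions made for illustration only.

```python
# Non-limiting sketch of the claimed flow; k-means and Euclidean
# distance are illustrative assumptions, not the only possible choices.
import numpy as np
from sklearn.cluster import KMeans

def manage_domain_names(first_domains, second_domains, vectorize, k):
    domains = list(first_domains) + list(second_domains)
    labels = np.array([1] * len(first_domains) + [0] * len(second_domains))
    vecs = np.array([vectorize(d) for d in domains])          # vectorize each domain name

    cluster_ids = KMeans(n_clusters=k, n_init=10).fit_predict(vecs)  # at least two clusters

    # First cluster: clusters where first-label vectors outnumber unlabeled ones.
    first = [c for c in range(k)
             if ((cluster_ids == c) & (labels == 1)).sum() >
                ((cluster_ids == c) & (labels == 0)).sum()]

    # Rank the remaining (second) clusters by centroid distance to the first cluster.
    c0 = vecs[np.isin(cluster_ids, first)].mean(axis=0)
    second = sorted((c for c in range(k) if c not in first),
                    key=lambda c: np.linalg.norm(vecs[cluster_ids == c].mean(axis=0) - c0))
    return first, second  # earlier second clusters get higher response priority
```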
Optionally, the vectorizing the first domain name and the second domain name to obtain a vector corresponding to each domain name in the domain name set includes:
inputting the first domain name and the second domain name into a first model to obtain a vector corresponding to each domain name in the domain name set output by the first model; the first model is a mask language model and is used for outputting semantic feature vectors of domain names.
In this technical scheme, each domain name among the first domain names and the second domain names is input into the first model to obtain the vector corresponding to that domain name output by the first model, thereby obtaining a first vector set. Because the first model is a mask language model, the output vectors express the semantic features of the domain names, so the clustering result of the vectors is more accurate, which improves the accuracy of identifying domain name priority.
Optionally, the method further comprises:
dividing each domain name in the training sample into word segments according to a vocabulary;
for any domain name, inputting the domain name into a pre-training model to obtain a vector of each word segment of the domain name output by the pre-training model;
masking the vectors of one or more word segments of the domain name, and training the pre-training model according to the masked vectors of the domain name to obtain the first model, wherein the loss function of the pre-training model represents the difference between the output at the masked positions and the true vectors corresponding to the masked word segments.
Optionally, the determining a first cluster from the at least two clusters includes:
determining N clusters from the at least two clusters, wherein the number of vectors corresponding to the first domain name in each of the N clusters is greater than the number of vectors corresponding to the second domain name, and N is an integer greater than or equal to 1;
and if N is greater than 1, merging the N clusters to obtain the first cluster.
In the above technical solution, the at least two clusters include first vectors having the first label and second vectors not having the first label, so the first cluster can be determined from the number of vectors corresponding to the first domain name and the number corresponding to the second domain name in each cluster. In the first cluster, the number of vectors corresponding to the first domain name is greater than the number corresponding to the second domain name. If more than one cluster exists in which the vectors corresponding to the first domain name outnumber those corresponding to the second domain name, these clusters are merged, and the merged cluster is taken as the first cluster.
Optionally, the number of the at least two clusters is determined according to a preset cache rate, the number of cache levels, and the space multiple between adjacent cache levels;
the determining the domain name corresponding to the second cluster meeting the distance condition according to the distance between the first cluster and each second cluster comprises the following steps:
determining centroids of the first cluster and the at least one second cluster, respectively;
calculating the distance between the centroid of the first cluster and the centroid of each second cluster, and taking that distance as the distance between the first cluster and the corresponding second cluster;
determining the priority of each second cluster according to its distance from the first cluster, in ascending order of distance;
sorting the at least one second cluster in descending order of priority;
determining a first number according to the number of the at least two clusters and the preset cache rate;
and selecting, in descending order of priority from the sorting result of the at least one second cluster, the first number of second clusters as the second clusters meeting the distance condition.
In the above technical solution, the distance between the centroid of the first cluster and the centroid of each second cluster is taken as the distance between the first cluster and that second cluster. The priority ranking of the other clusters is carried out according to these distances, thereby identifying domain name priority. The priority ranking of the other clusters represents the degree to which the domain names in the remaining clusters may resemble the key domain names at the semantic level: the higher the priority of a cluster, the closer, semantically, the domain names corresponding to the vectors in that cluster are to the key domain names. The number of clusters obtained after clustering is determined according to the preset cache rate, the number of cache levels, and the space multiple between adjacent cache levels, which ensures that the storage space required by the finally obtained domain names to be written into the cache does not exceed the cache space. Caches of different levels have different read-write speeds; the higher the level, the faster the read-write speed. The first number of second clusters, determined from the priority ranking of the at least one second cluster in descending order, are taken as the second clusters meeting the distance condition.
Optionally, after determining the domain name corresponding to the second cluster that satisfies the distance condition, the method further includes:
sequentially storing domain names corresponding to the second clusters meeting the distance condition into caches of different levels, wherein the domain names corresponding to the first cluster are recorded in the highest-level cache, and the priority of the second clusters meeting the distance condition is positively correlated with the level of the cache.
In the above technical solution, the domain name corresponding to each vector in the first cluster is recorded in the highest-level cache. The domain names corresponding to the vectors contained in each of the second clusters meeting the distance condition are written sequentially into caches of different levels according to the priority order of those second clusters. The higher the priority of a domain name, the faster the response speed is ensured when the corresponding domain name request is answered.
Optionally, the number of at least two clusters is determined according to the following formula (1):
wherein K is the number of the at least two clusters, L is the number of cache levels, R is the preset cache rate, and m is the space multiple between adjacent cache levels.
In a second aspect, an embodiment of the present invention provides a domain name management apparatus, including:
an acquisition module, configured to acquire a domain name set, wherein the domain name set comprises a first domain name with a first label and a second domain name without the first label, and the first label indicates the domain name priority;
a processing module, configured to vectorize the first domain name and the second domain name to obtain a vector corresponding to each domain name in the domain name set;
cluster vectors corresponding to each domain name in the domain name set to obtain at least two clusters, and determine a first cluster from the at least two clusters, wherein the number of vectors corresponding to the first domain name in the first cluster is greater than the number of vectors corresponding to the second domain name, and the at least two clusters comprise the first cluster and at least one second cluster;
and determine a domain name corresponding to the second cluster meeting the distance condition according to the distance between the first cluster and each second cluster, wherein the response priority of the domain name corresponding to the second cluster meeting the distance condition is greater than that of the domain name corresponding to the second cluster not meeting the distance condition.
Optionally, the processing module is specifically configured to:
inputting the first domain name and the second domain name into a first model to obtain a vector corresponding to each domain name in the domain name set output by the first model; the first model is a mask language model and is used for outputting semantic feature vectors of domain names.
Optionally, the processing module is further configured to:
dividing each domain name in the training sample into word segments according to a vocabulary;
for any domain name, inputting the domain name into a pre-training model to obtain a vector of each word segment of the domain name output by the pre-training model;
masking the vectors of one or more word segments of the domain name, and training the pre-training model according to the masked vectors of the domain name to obtain the first model, wherein the loss function of the pre-training model represents the difference between the output at the masked positions and the true vectors corresponding to the masked word segments.
Optionally, the processing module is specifically configured to:
determining N clusters from the at least two clusters, wherein the number of vectors corresponding to the first domain name in each of the N clusters is greater than the number of vectors corresponding to the second domain name, and N is an integer greater than or equal to 1;
and if N is greater than 1, merging the N clusters to obtain the first cluster.
Optionally, the number of the at least two clusters is determined according to a preset cache rate, the number of cache levels, and the space multiple between adjacent cache levels, and the processing module is specifically configured to:
determine centroids of the first cluster and the at least one second cluster, respectively;
calculate the distance between the centroid of the first cluster and the centroid of each second cluster, and take that distance as the distance between the first cluster and the corresponding second cluster;
determine the priority of each second cluster according to its distance from the first cluster, in ascending order of distance;
sort the at least one second cluster in descending order of priority;
determine a first number according to the number of the at least two clusters and the preset cache rate;
and select, in descending order of priority from the sorting result of the at least one second cluster, the first number of second clusters as the second clusters meeting the distance condition.
Optionally, the processing module is further configured to:
sequentially store domain names corresponding to the second clusters meeting the distance condition into caches of different levels, wherein the domain names corresponding to the first cluster are recorded in the highest-level cache, and the priority of the second clusters meeting the distance condition is positively correlated with the level of the cache.
In a third aspect, an embodiment of the present invention further provides a computer apparatus, including:
A memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing the domain name management method according to the obtained program.
In a fourth aspect, embodiments of the present invention further provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the domain name management method described above.
In a fifth aspect, embodiments of the present invention further provide a computer program product, where the computer program product includes an executable program that is executed by a processor to perform the domain name management method described above.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The application scenarios described in the embodiments of the present application are intended to more clearly illustrate the technical solutions of the embodiments of the present application and do not constitute a limitation on them; as a person of ordinary skill in the art will appreciate, with the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems. The terms "first" and "second" in the description, claims and drawings of the application are used to distinguish between different objects, not to describe a particular order. In the description of the present application, unless otherwise indicated, "a plurality" means two or more.
Before describing the domain name management method provided by the embodiments of the present application, and for ease of understanding, the following first explains terms related to the embodiments of the present application.
DNS server: the Domain Name System (DNS) is a distributed database that manages the mapping between domain names and IP addresses; a DNS server is a host that provides domain name resolution services for users.
Pre-training model: an artificial intelligence model in the field of Natural Language Processing (NLP) that performs tasks by analyzing, understanding and generating human language. These models are typically based on deep neural networks, which capture semantic information in text by learning language patterns and context in a large text corpus, and can thus convert text into a vector representation. The resulting vectors may reflect the semantics and information content of the text to some extent.
Model pre-training: the first stage in the pre-training model development process. At this stage, the model is trained using a large, diverse corpus with the goal of letting the model learn the basic structure and patterns of the language. The pre-trained model is able to understand and generate the underlying language, but has not been optimized for a particular task or application.
Local fine-tuning: the process of adapting a pre-trained model to a particular task. At this stage, the model is additionally trained on smaller, more specialized data sets that are typically related to a specific task or application scenario. Through local fine-tuning, the model adapts to the specific language application and, while retaining its broad language knowledge, achieves higher accuracy and efficiency on the specific task.
Self-supervised learning: a term in machine learning and deep learning, used particularly when processing unlabeled data. In self-supervised learning, the algorithm automatically generates labels or supervision signals from the raw data and then uses these signals to train a model. This approach allows the model to learn from large amounts of unlabeled data, avoiding the reliance on extensive manual labeling required by traditional supervised learning. In a language model, self-supervised learning can be used to predict the next word or masked (missing) words in a sentence, providing an efficient way to process and understand large amounts of unlabeled language data.
Fig. 1 illustrates a system architecture to which embodiments of the present invention are applicable, the system architecture including a server 100, the server 100 may include a processor 110, a communication interface 120, and a memory 130.
Wherein the communication interface 120 is used for transmitting data.
The processor 110 is a control center of the server 100, connects various parts of the entire server 100 using various interfaces and routes, and performs various functions of the server 100 and processes data by running or executing software programs and/or modules stored in the memory 130, and calling data stored in the memory 130. Optionally, the processor 110 may include one or more processing units.
The memory 130 may be used to store software programs and modules, and the processor 110 performs various functional applications and data processing by executing the software programs and modules stored in the memory 130. The memory 130 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required for at least one function, and the like, and the data storage area may store data created according to business processes, and the like. In addition, the memory 130 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
It should be noted that the structure shown in fig. 1 is merely an example, and the server may be a DNS server, which is not limited in this embodiment of the present invention.
Based on the above description, fig. 2 is a schematic flow chart illustrating a domain name management method according to an embodiment of the present invention, where the flow may be executed by a domain name management device.
As shown in fig. 2, the process specifically includes:
Step 210, obtaining a domain name set, wherein the domain name set comprises a first domain name with a first label and a second domain name without the first label, and the first label indicates the domain name priority.
In the embodiment of the invention, the first label may be understood as a target label that indicates the priority of a domain name. Illustratively, in the domain name protection scenario, the first label is set according to the importance of the domain name, for example, the first label is 1, so the first label may also be called an importance label, and the first domain name may accordingly be called an important domain name or a key domain name. The second domain name does not carry the first label, so its priority is unknown, and the second domain name may therefore be called an unknown domain name. In some embodiments, the second domain name carries a second label indicating that the priority of the domain name is unknown, e.g., the second label is 0.
The first domain name in the domain name set may be extracted from specified domain name requests, or may be a preset key domain name. Illustratively, domain names are extracted from domain name requests acquired when accessing a specified protected application or website and de-duplicated to serve as key domain names; alternatively, pre-divided key domain names are used. The second domain name may be extracted from domain name requests received during the operation of other servers. For example, domain names are extracted from domain name requests at an operator metropolitan-area backbone egress, and duplicates and key domain names are removed from the extracted domain names to obtain the second domain names.
Step 220, vectorizing the first domain name and the second domain name to obtain a vector corresponding to each domain name in the domain name set.
In the embodiment of the invention, the first domain name and the second domain name are vectorized to obtain the vector corresponding to each domain name in the domain name set. Because the first domain name carries the first label and the label itself is not vectorized when the first domain name is vectorized, the vector obtained after vectorizing the first domain name also carries the first label. Similarly, the vector obtained after vectorizing the second domain name does not carry the first label. It will be appreciated that vectorizing the first domain name and the second domain name serves to represent their semantic features. Specifically, the first domain name and the second domain name are input into a first model to obtain the vector corresponding to each domain name in the domain name set output by the first model. The first model is a mask language model, obtained by local fine-tuning of a pre-training model, and is used for outputting semantic feature vectors of domain names.
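As a minimal, non-limiting sketch of this vectorization step: the text does not specify which model output serves as the domain name vector, so taking the hidden state at the [CLS] position of a BERT-style model is an assumption here, as is the base checkpoint name.

```python
# Minimal sketch of vectorizing a domain name with a masked language
# model; using the [CLS] hidden state and the "bert-base-uncased"
# checkpoint are assumptions, since the text does not specify either.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # stands in for the fine-tuned first model

def domain_vector(domain: str) -> torch.Tensor:
    inputs = tokenizer(domain, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # hidden state at the [CLS] position
```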
The construction method of the first model specifically comprises the following steps: each domain name in the training sample is divided into word segments according to the vocabulary. For example, the domain names in the training sample may be extracted from domain name requests received during the operation of other servers. The vocabulary may be preset. The word segmentation operation may be performed by the tokenizer built into the pre-trained model (e.g., WordPiece, BPE or SentencePiece), which decomposes the domain name into the smallest units present in the model vocabulary.
For any domain name, the domain name is input into the pre-training model to obtain the vector of each word segment of the domain name output by the pre-training model. Illustratively, for a domain name d, the pre-trained model represents it as a sequence of numerical tokens that the model can understand, i.e. d = (t1, t2, …, tn), denoted as the vector of values input_ids, where ti is the token corresponding to the i-th word segment and n is the number of word segments, n being an integer greater than or equal to 1. It is to be understood that the pre-training models mentioned in the present application include, but are not limited to, BERT, GPT, RoBERTa and other pre-training models, and are not specifically limited herein.
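A small sketch of this tokenization step, assuming a WordPiece tokenizer and the standard BERT vocabulary (in which [CLS] = 101 and [SEP] = 102, matching the special values used below):

```python
# Sketch: a domain name decomposed into word segments and numeric
# input_ids; the BERT vocabulary is an assumption for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("mail.example.com")    # word segments t1 ... tn
input_ids = tokenizer.encode("mail.example.com")   # adds [CLS]=101 and [SEP]=102
print(tokens, input_ids)
```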
The vectors of one or more tokens of the domain name are then masked; the masked vectors serve as the label data for the model's self-supervised learning. Illustratively, for each domain name, a subset T of token positions is randomly selected as the mask positions, and for each i ∈ T, ti is replaced with the special mask token [MASK]. The masked vectors are used as the labeling data. The specific masking step comprises the following: a probability matrix rand for randomly sampling the mask is generated, with the same shape as input_ids and values uniformly distributed over the interval [0,1]. A binary (Boolean) vector is then computed, in which the element at a position is True if and only if the value of rand at that position is smaller than the preset mask probability p and the input_ids element at that position is not equal to a special value such as 101 (the CLS token) or 102 (the SEP token); otherwise the element at that position is set to False. That is: mask = (rand < p) & (input_ids ≠ CLS) & (input_ids ≠ SEP) & (input_ids ≠ PAD). Finally, the mask array mask is used to select which tokens are to be masked, and the input_ids at these positions are replaced with 103 (the MASK token).
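The masking step just described can be sketched as follows; the mask probability p and the PAD id are illustrative assumptions, while 101, 102 and 103 follow the values given above:

```python
# Sketch of the described masking: sample a uniform probability matrix,
# select non-special positions where rand < p, and replace their
# input_ids with 103 ([MASK]); labels keep -100 at unmasked positions.
import torch

CLS, SEP, PAD, MASK_ID = 101, 102, 0, 103  # 101/102/103 per the text; PAD=0 assumed
p = 0.15                                   # preset mask probability (assumed value)

def mask_inputs(input_ids: torch.Tensor):
    rand = torch.rand(input_ids.shape)     # uniform on [0, 1], same shape as input_ids
    mask = (rand < p) & (input_ids != CLS) & (input_ids != SEP) & (input_ids != PAD)
    labels = torch.where(mask, input_ids, torch.full_like(input_ids, -100))
    masked_ids = input_ids.clone()
    masked_ids[mask] = MASK_ID
    return masked_ids, labels
```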
The pre-training model is then trained according to the masked vectors of the domain names to obtain the first model, wherein the loss function of the pre-training model represents the difference between the output at the masked positions and the true vectors corresponding to the masked tokens. Illustratively, the batch size and the number of epochs are set first to determine the amount of training data per step and the number of training iterations. The specific steps in the i-th training epoch are as follows: the model first performs forward propagation and the loss function of the model is computed. The loss function uses a modified cross-entropy (Cross Entropy Loss) function to compare the difference between the model's predicted output at the masked positions and the true vectors (i.e., the original, unmasked tokens) corresponding to those positions. It is understood that the "loss function" in this patent includes, but is not limited to, the triplet loss function or other functions that can serve as a convergence condition for the model parameters.
Specifically, the loss function of the model prediction is determined by the difference between the predicted values and the true values, and can be expressed as:

L(θ) = −Σi∈T log P(ti | t1, …, ti−1, [MASK], ti+1, …, tn; θ)

where L(θ) is the loss function, and P(ti | t1, …, ti−1, [MASK], ti+1, …, tn; θ) represents the probability, given the model parameters θ, that the prediction for token ti is correct. In the masking task, only the loss at masked positions is computed and back-propagated. Thus, if a token position is not masked, its contribution to the loss function must be ignored. This can be implemented by setting the labels of these positions to −100, as most deep learning frameworks automatically ignore positions labeled −100.
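A sketch of this masked-position loss, using the −100 convention described above; the masked-LM head and checkpoint name are assumptions:

```python
# Sketch: cross-entropy computed only at masked positions, because
# positions labeled -100 are ignored, as described in the text.
import torch
from transformers import AutoModelForMaskedLM

mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # assumed checkpoint

def mlm_loss(masked_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    logits = mlm_model(input_ids=masked_ids.unsqueeze(0)).logits  # (1, n, vocab_size)
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)        # skips -100 labels
    return loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
```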
Then, the loss function L(θ) is back-propagated and the gradient ∇θL is computed, and a gradient descent algorithm is used to minimize the loss function. In the implementation, an optimizer is used for the parameter update, which can be expressed as:

θnew = θold − η · ∇θL

where η denotes the learning rate, ∇θL is the gradient of the total loss function with respect to the parameters θ, θnew is the new model parameter, and θold is the old model parameter.
Iteration stops when the pre-training model has been iterated for the preset number of training epochs, and the model at that point is taken as the first model. The model training process can be performed on a device or platform with abundant computing resources to speed up training. The principle of this training is that if the model can predict the original token at a masked position, it demonstrates that the model can vectorize the target domain name according to the semantic characteristics and context of the domain name.
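The training loop described above can be sketched as follows; the optimizer choice (AdamW), learning rate, epoch count, and the train_batches iterable are assumptions, and mlm_loss refers to the sketch above:

```python
# Sketch of the iterative training: forward pass, loss, back-propagation,
# optimizer update (theta_new = theta_old - eta * gradient), fixed epochs.
from torch.optim import AdamW

optimizer = AdamW(mlm_model.parameters(), lr=5e-5)  # eta = 5e-5 (assumed)
num_epochs = 3                                      # preset iteration count (assumed)

for epoch in range(num_epochs):
    for masked_ids, labels in train_batches:        # hypothetical batch iterable
        loss = mlm_loss(masked_ids, labels)         # forward pass and loss
        loss.backward()                             # compute gradients of the loss
        optimizer.step()                            # gradient-descent parameter update
        optimizer.zero_grad()
```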
Step 230, clustering vectors corresponding to each domain name in the domain name set to obtain at least two clusters, and determining a first cluster from the at least two clusters, wherein the number of vectors corresponding to the first domain name in the first cluster is greater than the number of vectors corresponding to the second domain name, and the at least two clusters comprise the first cluster and at least one second cluster.
In the embodiment of the invention, the number of the at least two clusters is determined according to a preset cache rate, the number of cache levels, and the space multiple between adjacent cache levels. Caches of different levels have different read-write speeds; the higher the level, the faster the read-write speed. Illustratively, the domain name cache space has L levels in total, with the level-1 cache being the highest level. The preciousness of the cache space and the read-write speed decrease monotonically from level 1 to level L. L is the number of cache levels and is an integer greater than 1. The space multiple between adjacent cache levels means that the (M+1)-th level cache is m times the size of the M-th level cache, where m is an integer greater than 1 and M is an integer greater than or equal to 1 and less than L.
Determining the number of at least two clusters according to the following formula (1):
wherein K is the number of the at least two clusters, L is the number of cache levels, R is the preset cache rate, and m is the space multiple between adjacent cache levels. The preset cache rate R takes values in [0,1] and is determined by the available cache space and the actual size of the domain name set. The preset cache rate indicates the proportion of the second domain names added to the cache, or equivalently, how many of the second domain names are discarded.
After the number of clusters is determined, clustering vectors corresponding to each domain name in the domain name set based on the number of clusters to obtain at least two clusters. Illustratively, an unsupervised learning algorithm (e.g., k-means, etc.) that specifies the number of clusters is used to cluster the vectors corresponding to each domain name in the set of domain names.
After the at least two clusters are obtained, a first cluster is determined from among them. The at least two clusters include the first cluster and at least one second cluster, and the number of vectors corresponding to the first domain name in the first cluster is greater than the number corresponding to the second domain name. Specifically, N clusters are determined from the at least two clusters, where the number of vectors corresponding to the first domain name in each of the N clusters is greater than the number corresponding to the second domain name, and N is an integer greater than or equal to 1. If N is greater than 1, the N clusters are merged to obtain the first cluster; if N equals 1, that cluster is taken as the first cluster. Illustratively, the label of a vector corresponding to a first domain name is set to 1 and the label of a vector corresponding to a second domain name is set to 0. For each cluster Si, the dominant label within the cluster (i.e., the label that appears most often, or the label of the vectors that appear most often in the cluster) is yi. A cluster with yi = 1 is recorded as a first cluster and labeled S0; if more than one cluster has yi = 1, those clusters are merged as the first cluster. A second cluster is any cluster other than the first cluster, i.e., each of the at least two clusters other than the first cluster may be referred to as a second cluster.
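As a sketch of this step, with labels 1 and 0 as above; the helper name is an assumption:

```python
# Sketch: record each cluster whose dominant label y_i is 1, then merge
# all such clusters into the first cluster S0, per the text.
import numpy as np

def first_cluster_mask(cluster_ids: np.ndarray, labels: np.ndarray) -> np.ndarray:
    first = []
    for c in np.unique(cluster_ids):
        members = labels[cluster_ids == c]
        if (members == 1).sum() > (members == 0).sum():  # dominant label y_i = 1
            first.append(c)
    return np.isin(cluster_ids, first)  # True for vectors in the (merged) first cluster
```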
Step 240, determining a domain name corresponding to the second cluster meeting the distance condition according to the distance between the first cluster and each second cluster; and the response priority of the domain name corresponding to the second cluster meeting the distance condition is greater than that of the domain name corresponding to the second cluster not meeting the distance condition.
In the embodiment of the present invention, the distance between the first cluster and each second cluster is determined first. Specifically, the centroids of the first cluster and the at least one second cluster are determined separately. The distance between the centroid of the first cluster and the centroid of each second cluster is then calculated and taken as the distance between the first cluster and that second cluster. The priorities of the second clusters are then determined in ascending order of their distance from the first cluster, and the at least one second cluster is sorted in descending order of priority.
Illustratively, fig. 3 is a schematic view of cluster distances according to an embodiment of the present invention. The geometric centroid Ci of each cluster is determined separately; centroid C0 corresponds to the first cluster S0. The distance between the centroid of each second cluster and the centroid C0 of the first cluster S0 is determined separately (the distance between centroids includes, but is not limited to, the Euclidean distance), and the distances are then sorted in ascending order as d = [d1, d2, …, dK]. Each second cluster and its centroid are labeled according to the distance order: the second cluster corresponding to distance d1 is labeled S1 and its centroid C1; the second cluster corresponding to distance d2 is labeled S2 and its centroid C2; …; the second cluster corresponding to distance dK is labeled SK and its centroid CK. It will be appreciated that labeling the second clusters by distance order amounts to determining their priorities, so the sorting result of the at least one second cluster is {S1, S2, …, SK}.
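A sketch of the centroid-distance ranking of fig. 3, assuming the Euclidean distance mentioned above:

```python
# Sketch: centroid C0 of S0, Euclidean distances d_i to each second
# cluster's centroid C_i, sorted ascending to give {S1, S2, ..., SK}.
import numpy as np

def rank_second_clusters(vecs: np.ndarray, cluster_ids: np.ndarray,
                         first_mask: np.ndarray) -> list:
    c0 = vecs[first_mask].mean(axis=0)                       # centroid C0
    ranked = []
    for c in np.unique(cluster_ids[~first_mask]):            # each second cluster
        ci = vecs[cluster_ids == c].mean(axis=0)             # centroid C_i
        ranked.append((np.linalg.norm(ci - c0), c))          # distance d_i
    ranked.sort()                                            # ascending distance
    return [c for _, c in ranked]                            # priority order
```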
Sorting by the distance between the first cluster and each second cluster can be understood as sorting, at the semantic level, the degree to which the remaining domain names not included in the first cluster (i.e., the domain names in each second cluster) resemble the listed key domain names in the first cluster.
Since the caches are divided into different levels, the domain name corresponding to the second cluster satisfying the distance condition is determined according to the distance between the first cluster and each second cluster. The second clusters satisfying the distance condition are determined from the second clusters according to the priority ranking result, and the domain names corresponding to the vectors in those second clusters are determined. It will be appreciated that the response priority of a domain name corresponding to a second cluster satisfying the distance condition is greater than that of a domain name corresponding to a second cluster not satisfying it, and lower than the priority of the domain names corresponding to the first cluster. The distance condition is that a preset number of second clusters are selected from the at least one second cluster in ascending order of distance (or descending order of priority); these are the second clusters satisfying the distance condition. In the present invention, the preset number is the first number; the preset number may also be a value set in advance according to experience, which is not specifically limited herein. The first number is determined according to the number of the at least two clusters and the preset cache rate. It is also understood that the distance condition may instead be that a second cluster satisfies the condition when its distance is smaller than a threshold.
Specifically, the first number is determined according to the number of the at least two clusters and the preset cache rate; the first number is the number of second clusters that need to be written to the cache. The first number of second clusters are selected, in descending order of priority, from the sorting result of the at least one second cluster as the second clusters satisfying the distance condition. In some embodiments, the second clusters satisfying the distance condition (i.e., the first number of second clusters) are stored sequentially into caches of different levels. The domain names corresponding to the first cluster are recorded in the highest-level cache, and the priority of a second cluster satisfying the distance condition is positively correlated with the level of the cache, i.e., the higher the priority of the cluster, the higher the level of the cache it is written to. Illustratively, the number of clusters that need to be written to the cache, and hence the first number of second clusters to be written, is determined from the preset cache rate and the number of clusters; the clusters to be written into the cache include the first cluster and the selected second clusters. Based on the above ranking result {S1, S2, …, SK}, the domain names corresponding to the second clusters ranked beyond the first number are judged to be non-key domain names and are discarded without being written into the cache. The selected second clusters, starting from the first second cluster S1, are written sequentially into the L levels of cache, and the domain name corresponding to the first cluster S0 is written, as an important domain name, into the highest-level cache, namely the first-level cache.
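The exact count expressions were not recoverable from this text; the sketch below assumes, consistently with the worked example that follows (K = 4, R = 0.5, two clusters kept), that ⌈R·K⌉ clusters in total are written to the cache:

```python
# Sketch of the cache write-out. The retained total ceil(R * K) is an
# assumption inferred from the worked example (K=4, R=0.5 -> 2 kept);
# the exact formula was not recoverable from the source text.
import math

def assign_to_caches(ranked_seconds: list, K: int, R: float, L: int):
    kept_total = math.ceil(R * K)             # clusters written to cache (assumed)
    first_number = kept_total - 1             # second clusters meeting the distance condition
    kept = ranked_seconds[:first_number]
    discarded = ranked_seconds[first_number:] # judged non-key, not cached
    plan = {1: "S0"}                          # first cluster -> highest-level cache
    for level, cluster in zip(range(2, L + 1), kept):
        plan[level] = f"S{cluster}"           # S1, S2, ... into levels 2..L
    return plan, discarded
```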
According to the above method, fig. 4 is a schematic diagram of a clustering result provided by an embodiment of the present invention. For example, for a cache with m = 2 and L = 2 and a preset cache rate R = 0.5, K = 4 can be calculated. As shown, four clusters are obtained after clustering. The first cluster is then determined, and the clusters are sorted according to the distance between the first cluster and each second cluster, giving: first cluster Cluster0, the listed key domain names; second cluster Cluster1, sub-domain names with explicit service semantics under the key domain names; second cluster Cluster2, sub-domain names of general API services or semantically ambiguous key domain names; second cluster Cluster3, disposable CDN split domain names of some services, with low importance. From the preset cache rate it follows that only the first 2 clusters are retained and the last 2 are discarded. Clearly, in order to protect the services carried by the key domain names, Cluster0 and Cluster1 need to be written into the first-level and second-level caches; Cluster2 and Cluster3 are less significant, are non-key domain names, and can be discarded.
In the embodiment of the invention, based on a self-supervised learning task, a mask language model trained with a modified cross-entropy loss function takes into account the difference between the model's predicted output at the masked positions and the true labels, which improves the model's prediction accuracy and enables it to vectorize domain names according to their semantic features. The unlabeled domain names are then vectorized using the first model obtained after local fine-tuning of the pre-training model. By cluster analysis of semantic similarity with the key domain names, and according to the characteristics of the cache space and the distribution of the clustering results, the priorities of the domain names to be examined are classified into levels, which improves the accuracy of identifying domain name priority.
Based on the same technical concept, fig. 5 schematically illustrates a structural diagram of a domain name management apparatus according to an embodiment of the present invention, where the apparatus may perform a flow of a domain name management method.
As shown in fig. 5, the apparatus specifically includes:
An obtaining module 510, configured to obtain a domain name set, where the domain name set includes a first domain name with a first label and a second domain name without the first label, and the first label indicates a domain name priority;
The processing module 520 is configured to vectorize the first domain name and the second domain name to obtain a vector corresponding to each domain name in the domain name set;
cluster vectors corresponding to each domain name in the domain name set to obtain at least two clusters, and determine a first cluster from the at least two clusters, wherein the number of vectors corresponding to the first domain name in the first cluster is greater than the number of vectors corresponding to the second domain name, and the at least two clusters comprise the first cluster and at least one second cluster;
and determine a domain name corresponding to the second cluster meeting the distance condition according to the distance between the first cluster and each second cluster, wherein the response priority of the domain name corresponding to the second cluster meeting the distance condition is greater than that of the domain name corresponding to the second cluster not meeting the distance condition.
Optionally, the processing module 520 is specifically configured to:
inputting the first domain name and the second domain name into a first model to obtain a vector corresponding to each domain name in the domain name set output by the first model; the first model is a mask language model and is used for outputting semantic feature vectors of domain names.
Optionally, the processing module 520 is further configured to:
dividing each domain name in the training sample into word segments according to a vocabulary;
for any domain name, inputting the domain name into a pre-training model to obtain a vector of each word segment of the domain name output by the pre-training model;
masking the vectors of one or more word segments of the domain name, and training the pre-training model according to the masked vectors of the domain name to obtain the first model, wherein the loss function of the pre-training model represents the difference between the output at the masked positions and the true vectors corresponding to the masked word segments.
Optionally, the processing module 520 is specifically configured to:
determining N clusters from the at least two clusters, wherein the number of vectors corresponding to the first domain name in each of the N clusters is greater than the number of vectors corresponding to the second domain name, and N is an integer greater than or equal to 1;
and if N is greater than 1, merging the N clusters to obtain the first cluster.
Optionally, the number of the at least two clusters is determined according to a preset cache rate, the number of cache levels, and the space multiple between adjacent cache levels, and the processing module 520 is specifically configured to:
determine centroids of the first cluster and the at least one second cluster, respectively;
calculate the distance between the centroid of the first cluster and the centroid of each second cluster, and take that distance as the distance between the first cluster and the corresponding second cluster;
determine the priority of each second cluster according to its distance from the first cluster, in ascending order of distance;
sort the at least one second cluster in descending order of priority;
determine a first number according to the number of the at least two clusters and the preset cache rate;
and select, in descending order of priority from the sorting result of the at least one second cluster, the first number of second clusters as the second clusters meeting the distance condition.
Optionally, the processing module 520 is further configured to:
sequentially store domain names corresponding to the second clusters meeting the distance condition into caches of different levels, wherein the domain names corresponding to the first cluster are recorded in the highest-level cache, and the priority of the second clusters meeting the distance condition is positively correlated with the level of the cache.
Based on the same technical concept, the embodiment of the invention further provides a computer device, including:
A memory for storing program instructions;
And the processor is used for calling the program instructions stored in the memory and executing the domain name management method according to the obtained program.
Based on the same technical concept, the embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the above domain name management method.
Based on the same technical concept, an embodiment of the present invention further provides a computer program product, wherein the computer program product comprises an executable program, and the executable program is executed by a processor to perform the above domain name management method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.