Disclosure of Invention
Embodiments of the invention provide a domain name management method and device, which determine the semantic similarity between unknown domain names and key domain names through cluster analysis of vectorized domain names, thereby improving the accuracy of identifying domain name priority.
In a first aspect, an embodiment of the present invention provides a domain name management method, which may be applied to a server (such as a DNS server). The method comprises the following steps:
acquiring a domain name set, wherein the domain name set comprises a first domain name with a first label and a second domain name without the first label, and the first label indicates the domain name priority;
vectorizing the first domain name and the second domain name to obtain vectors corresponding to each domain name in the domain name set;
clustering vectors corresponding to each domain name in the domain name set to obtain at least two clusters, and determining a first cluster from the at least two clusters, wherein the number of vectors corresponding to the first domain name in the first cluster is greater than the number of vectors corresponding to the second domain name, and the at least two clusters comprise the first cluster and at least one second cluster;
determining a domain name corresponding to the second cluster meeting the distance condition according to the distance between the first cluster and each second cluster, wherein the response priority of the domain name corresponding to the second cluster meeting the distance condition is greater than that of the domain name corresponding to the second cluster not meeting the distance condition.
In the above technical solution, the first label may be understood as a target label that indicates the priority of a domain name. In the domain name protection scenario, the first label is set according to the importance of the domain name, so the first label may also be referred to as an importance label, and the first domain name may accordingly be referred to as an important domain name or a key domain name. The second domain name does not carry the first label, so its priority is unknown, and the second domain name may therefore be referred to as an unknown domain name. The first domain name and the second domain name are then vectorized to obtain a vector corresponding to each domain name, so that the semantic features of the first and second domain names are better represented. The vectors corresponding to each domain name are clustered to obtain at least two clusters, and the first cluster is determined from among them. Finally, according to the distance between the first cluster and each second cluster, the domain name corresponding to the second cluster meeting the distance condition is determined, which realizes the identification of domain name priority; clustering the vectors improves the accuracy of that identification.
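Illustratively, and without limiting the claimed solution, the overall flow can be sketched in Python as follows; the choice of k-means, Euclidean distance, and all function and variable names here are assumptions made for illustration only.

```python
# Non-limiting sketch of the claimed flow; k-means and Euclidean
# distance are illustrative assumptions, not the only possible choices.
import numpy as np
from sklearn.cluster import KMeans

def manage_domain_names(first_domains, second_domains, vectorize, k):
    domains = list(first_domains) + list(second_domains)
    labels = np.array([1] * len(first_domains) + [0] * len(second_domains))
    vecs = np.array([vectorize(d) for d in domains])          # vectorize each domain name

    cluster_ids = KMeans(n_clusters=k, n_init=10).fit_predict(vecs)  # at least two clusters

    # First cluster: clusters where first-label vectors outnumber unlabeled ones.
    first = [c for c in range(k)
             if ((cluster_ids == c) & (labels == 1)).sum() >
                ((cluster_ids == c) & (labels == 0)).sum()]

    # Rank the remaining (second) clusters by centroid distance to the first cluster.
    c0 = vecs[np.isin(cluster_ids, first)].mean(axis=0)
    second = sorted((c for c in range(k) if c not in first),
                    key=lambda c: np.linalg.norm(vecs[cluster_ids == c].mean(axis=0) - c0))
    return first, second  # earlier second clusters get higher response priority
```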
Optionally, the vectorizing the first domain name and the second domain name to obtain a vector corresponding to each domain name in the domain name set includes:
inputting the first domain name and the second domain name into a first model to obtain a vector corresponding to each domain name in the domain name set output by the first model; the first model is a mask language model and is used for outputting semantic feature vectors of domain names.
In this technical scheme, each domain name among the first domain names and the second domain names is input into the first model to obtain the vector corresponding to that domain name output by the first model, thereby obtaining a first vector set. Because the first model is a mask language model, the output vectors express the semantic features of the domain names, so the clustering result of the vectors is more accurate, which improves the accuracy of identifying domain name priority.
Optionally, the method further comprises:
dividing each domain name in the training sample into word segments according to a vocabulary;
for any domain name, inputting the domain name into a pre-training model to obtain a vector of each word segment of the domain name output by the pre-training model;
masking the vectors of one or more word segments of the domain name, and training the pre-training model according to the masked vectors of the domain name to obtain the first model, wherein the loss function of the pre-training model represents the difference between the output at the masked positions and the true vectors corresponding to the masked word segments.
Optionally, the determining a first cluster from the at least two clusters includes:
determining N clusters from the at least two clusters, wherein the number of vectors corresponding to the first domain name in each of the N clusters is greater than the number of vectors corresponding to the second domain name, and N is an integer greater than or equal to 1;
and if N is greater than 1, merging the N clusters to obtain the first cluster.
In the above technical solution, the at least two clusters include first vectors having the first label and second vectors not having the first label, so the first cluster can be determined from the number of vectors corresponding to the first domain name and the number corresponding to the second domain name in each cluster. In the first cluster, the number of vectors corresponding to the first domain name is greater than the number corresponding to the second domain name. If more than one cluster exists in which the vectors corresponding to the first domain name outnumber those corresponding to the second domain name, these clusters are merged, and the merged cluster is taken as the first cluster.
Optionally, the number of the at least two clusters is determined according to a preset cache rate, the number of cache levels, and the space multiple between adjacent cache levels;
the determining the domain name corresponding to the second cluster meeting the distance condition according to the distance between the first cluster and each second cluster comprises the following steps:
determining centroids of the first cluster and the at least one second cluster, respectively;
calculating the distance between the centroid of the first cluster and the centroid of each second cluster, and taking that distance as the distance between the first cluster and the corresponding second cluster;
determining the priority of each second cluster according to its distance from the first cluster, in ascending order of distance;
sorting the at least one second cluster in descending order of priority;
determining a first number according to the number of the at least two clusters and the preset cache rate;
and selecting, in descending order of priority from the sorting result of the at least one second cluster, the first number of second clusters as the second clusters meeting the distance condition.
In the above technical solution, the distance between the centroid of the first cluster and the centroid of each second cluster is taken as the distance between the first cluster and that second cluster. The priority ranking of the other clusters is carried out according to these distances, thereby identifying domain name priority. The priority ranking of the other clusters represents the degree to which the domain names in the remaining clusters may resemble the key domain names at the semantic level: the higher the priority of a cluster, the closer, semantically, the domain names corresponding to the vectors in that cluster are to the key domain names. The number of clusters obtained after clustering is determined according to the preset cache rate, the number of cache levels, and the space multiple between adjacent cache levels, which ensures that the storage space required by the finally obtained domain names to be written into the cache does not exceed the cache space. Caches of different levels have different read-write speeds; the higher the level, the faster the read-write speed. The first number of second clusters, determined from the priority ranking of the at least one second cluster in descending order, are taken as the second clusters meeting the distance condition.
Optionally, after determining the domain name corresponding to the second cluster that satisfies the distance condition, the method further includes:
sequentially storing domain names corresponding to the second clusters meeting the distance condition into caches of different levels, wherein the domain names corresponding to the first cluster are recorded in the highest-level cache, and the priority of the second clusters meeting the distance condition is positively correlated with the level of the cache.
In the above technical solution, the domain name corresponding to each vector in the first cluster is recorded in the highest-level cache. The domain names corresponding to the vectors contained in each of the second clusters meeting the distance condition are written sequentially into caches of different levels according to the priority order of those second clusters. The higher the priority of a domain name, the faster the response speed is ensured when the corresponding domain name request is answered.
Optionally, the number of at least two clusters is determined according to the following formula (1):
wherein K is the number of the at least two clusters, L is the number of cache levels, R is the preset cache rate, and m is the space multiple between adjacent cache levels.
In a second aspect, an embodiment of the present invention provides a domain name management apparatus, including:
an acquisition module, configured to acquire a domain name set, wherein the domain name set comprises a first domain name with a first label and a second domain name without the first label, and the first label indicates the domain name priority;
a processing module, configured to vectorize the first domain name and the second domain name to obtain a vector corresponding to each domain name in the domain name set;
cluster vectors corresponding to each domain name in the domain name set to obtain at least two clusters, and determine a first cluster from the at least two clusters, wherein the number of vectors corresponding to the first domain name in the first cluster is greater than the number of vectors corresponding to the second domain name, and the at least two clusters comprise the first cluster and at least one second cluster;
and determine a domain name corresponding to the second cluster meeting the distance condition according to the distance between the first cluster and each second cluster, wherein the response priority of the domain name corresponding to the second cluster meeting the distance condition is greater than that of the domain name corresponding to the second cluster not meeting the distance condition.
Optionally, the processing module is specifically configured to:
inputting the first domain name and the second domain name into a first model to obtain a vector corresponding to each domain name in the domain name set output by the first model; the first model is a mask language model and is used for outputting semantic feature vectors of domain names.
Optionally, the processing module is further configured to:
dividing each domain name in the training sample into word segments according to a vocabulary;
for any domain name, inputting the domain name into a pre-training model to obtain a vector of each word segment of the domain name output by the pre-training model;
masking the vectors of one or more word segments of the domain name, and training the pre-training model according to the masked vectors of the domain name to obtain the first model, wherein the loss function of the pre-training model represents the difference between the output at the masked positions and the true vectors corresponding to the masked word segments.
Optionally, the processing module is specifically configured to:
determining N clusters from the at least two clusters, wherein the number of vectors corresponding to the first domain name in each of the N clusters is greater than the number of vectors corresponding to the second domain name, and N is an integer greater than or equal to 1;
and if N is greater than 1, merging the N clusters to obtain the first cluster.
Optionally, the number of the at least two clusters is determined according to a preset cache rate, the number of cache levels, and the space multiple between adjacent cache levels, and the processing module is specifically configured to:
determine centroids of the first cluster and the at least one second cluster, respectively;
calculate the distance between the centroid of the first cluster and the centroid of each second cluster, and take that distance as the distance between the first cluster and the corresponding second cluster;
determine the priority of each second cluster according to its distance from the first cluster, in ascending order of distance;
sort the at least one second cluster in descending order of priority;
determine a first number according to the number of the at least two clusters and the preset cache rate;
and select, in descending order of priority from the sorting result of the at least one second cluster, the first number of second clusters as the second clusters meeting the distance condition.
Optionally, the processing module is further configured to:
sequentially store domain names corresponding to the second clusters meeting the distance condition into caches of different levels, wherein the domain names corresponding to the first cluster are recorded in the highest-level cache, and the priority of the second clusters meeting the distance condition is positively correlated with the level of the cache.
In a third aspect, an embodiment of the present invention further provides a computer apparatus, including:
A memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing the domain name management method according to the obtained program.
In a fourth aspect, embodiments of the present invention further provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the domain name management method described above.
In a fifth aspect, embodiments of the present invention further provide a computer program product, where the computer program product includes an executable program that is executed by a processor to perform the domain name management method described above.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The application scenarios described in the embodiments of the present application are intended to more clearly illustrate the technical solutions of the embodiments of the present application and do not constitute a limitation on them; as a person of ordinary skill in the art will appreciate, with the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems. The terms "first" and "second" in the description, claims and drawings of the application are used to distinguish between different objects, not to describe a particular order. In the description of the present application, unless otherwise indicated, "a plurality" means two or more.
Before describing the domain name management method provided by the embodiments of the present application, and for ease of understanding, the following first explains terms related to the embodiments of the present application.
DNS server: the Domain Name System (DNS) is a distributed database that manages the mapping between domain names and IP addresses; a DNS server is a host that provides domain name resolution services for users.
Pre-training model: an artificial intelligence model in the field of Natural Language Processing (NLP) that performs tasks by analyzing, understanding and generating human language. These models are typically based on deep neural networks, which capture semantic information in text by learning language patterns and context in a large text corpus, and can thus convert text into a vector representation. The resulting vectors may reflect the semantics and information content of the text to some extent.
Model pre-training: the first stage in the pre-training model development process. At this stage, the model is trained using a large, diverse corpus with the goal of letting the model learn the basic structure and patterns of the language. The pre-trained model is able to understand and generate the underlying language, but has not been optimized for a particular task or application.
Local fine-tuning: the process of adapting a pre-trained model to a particular task. At this stage, the model is additionally trained on smaller, more specialized data sets that are typically related to a specific task or application scenario. Through local fine-tuning, the model adapts to the specific language application and, while retaining its broad language knowledge, achieves higher accuracy and efficiency on the specific task.
Self-supervised learning: a term in machine learning and deep learning, used particularly when processing unlabeled data. In self-supervised learning, the algorithm automatically generates labels or supervision signals from the raw data and then uses these signals to train a model. This approach allows the model to learn from large amounts of unlabeled data, avoiding the reliance on extensive manual labeling required by traditional supervised learning. In a language model, self-supervised learning can be used to predict the next word or masked (missing) words in a sentence, providing an efficient way to process and understand large amounts of unlabeled language data.
Fig. 1 illustrates a system architecture to which embodiments of the present invention are applicable, the system architecture including a server 100, the server 100 may include a processor 110, a communication interface 120, and a memory 130.
Wherein the communication interface 120 is used for transmitting data.
The processor 110 is a control center of the server 100, connects various parts of the entire server 100 using various interfaces and routes, and performs various functions of the server 100 and processes data by running or executing software programs and/or modules stored in the memory 130, and calling data stored in the memory 130. Optionally, the processor 110 may include one or more processing units.
The memory 130 may be used to store software programs and modules, and the processor 110 performs various functional applications and data processing by executing the software programs and modules stored in the memory 130. The memory 130 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required for at least one function, and the like, and the data storage area may store data created according to business processes, and the like. In addition, the memory 130 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
It should be noted that the structure shown in fig. 1 is merely an example, and the server may be a DNS server, which is not limited in this embodiment of the present invention.
Based on the above description, fig. 2 is a schematic flow chart illustrating a domain name management method according to an embodiment of the present invention, where the flow may be executed by a domain name management device.
As shown in fig. 2, the process specifically includes:
Step 210, obtaining a domain name set, wherein the domain name set comprises a first domain name with a first label and a second domain name without the first label, and the first label indicates the domain name priority.
In the embodiment of the invention, the first label may be understood as a target label that indicates the priority of a domain name. Illustratively, in the domain name protection scenario, the first label is set according to the importance of the domain name, for example, the first label is 1, so the first label may also be called an importance label, and the first domain name may accordingly be called an important domain name or a key domain name. The second domain name does not carry the first label, so its priority is unknown, and the second domain name may therefore be called an unknown domain name. In some embodiments, the second domain name carries a second label indicating that the priority of the domain name is unknown, e.g., the second label is 0.
The first domain name in the domain name set may be extracted from specified domain name requests, or may be a preset key domain name. Illustratively, domain names are extracted from domain name requests acquired when accessing a specified protected application or website and de-duplicated to serve as key domain names; alternatively, pre-divided key domain names are used. The second domain name may be extracted from domain name requests received during the operation of other servers. For example, domain names are extracted from domain name requests at an operator metropolitan-area backbone egress, and duplicates and key domain names are removed from the extracted domain names to obtain the second domain names.
Step 220, vectorizing the first domain name and the second domain name to obtain a vector corresponding to each domain name in the domain name set.
In the embodiment of the invention, the first domain name and the second domain name are vectorized to obtain the vector corresponding to each domain name in the domain name set. Because the first domain name carries the first label and the label itself is not vectorized when the first domain name is vectorized, the vector obtained after vectorizing the first domain name also carries the first label. Similarly, the vector obtained after vectorizing the second domain name does not carry the first label. It will be appreciated that vectorizing the first domain name and the second domain name serves to represent their semantic features. Specifically, the first domain name and the second domain name are input into a first model to obtain the vector corresponding to each domain name in the domain name set output by the first model. The first model is a mask language model, obtained by local fine-tuning of a pre-training model, and is used for outputting semantic feature vectors of domain names.
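As a minimal, non-limiting sketch of this vectorization step: the text does not specify which model output serves as the domain name vector, so taking the hidden state at the [CLS] position of a BERT-style model is an assumption here, as is the base checkpoint name.

```python
# Minimal sketch of vectorizing a domain name with a masked language
# model; using the [CLS] hidden state and the "bert-base-uncased"
# checkpoint are assumptions, since the text does not specify either.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # stands in for the fine-tuned first model

def domain_vector(domain: str) -> torch.Tensor:
    inputs = tokenizer(domain, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # hidden state at the [CLS] position
```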
The construction method of the first model specifically comprises the following steps: each domain name in the training sample is divided into word segments according to the vocabulary. For example, the domain names in the training sample may be extracted from domain name requests received during the operation of other servers. The vocabulary may be preset. The word segmentation operation may be performed by the tokenizer built into the pre-trained model (e.g., WordPiece, BPE or SentencePiece), which decomposes the domain name into the smallest units present in the model vocabulary.
For any domain name, the domain name is input into the pre-training model to obtain the vector of each word segment of the domain name output by the pre-training model. Illustratively, for a domain name d, the pre-trained model represents it as a sequence of numerical tokens that the model can understand, i.e. d = (t1, t2, …, tn), denoted as the vector of values input_ids, where ti is the token corresponding to the i-th word segment and n is the number of word segments, n being an integer greater than or equal to 1. It is to be understood that the pre-training models mentioned in the present application include, but are not limited to, BERT, GPT, RoBERTa and other pre-training models, and are not specifically limited herein.
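A small sketch of this tokenization step, assuming a WordPiece tokenizer and the standard BERT vocabulary (in which [CLS] = 101 and [SEP] = 102, matching the special values used below):

```python
# Sketch: a domain name decomposed into word segments and numeric
# input_ids; the BERT vocabulary is an assumption for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("mail.example.com")    # word segments t1 ... tn
input_ids = tokenizer.encode("mail.example.com")   # adds [CLS]=101 and [SEP]=102
print(tokens, input_ids)
```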
The vectors of one or more tokens of the domain name are then masked; the masked vectors serve as the label data for the model's self-supervised learning. Illustratively, for each domain name, a subset T of token positions is randomly selected as the mask positions, and for each i ∈ T, ti is replaced with the special mask token [MASK]. The masked vectors are used as the labeling data. The specific masking step comprises the following: a probability matrix rand for randomly sampling the mask is generated, with the same shape as input_ids and values uniformly distributed over the interval [0,1]. A binary (Boolean) vector is then computed, in which the element at a position is True if and only if the value of rand at that position is smaller than the preset mask probability p and the input_ids element at that position is not equal to a special value such as 101 (the CLS token) or 102 (the SEP token); otherwise the element at that position is set to False. That is: mask = (rand < p) & (input_ids ≠ CLS) & (input_ids ≠ SEP) & (input_ids ≠ PAD). Finally, the mask array mask is used to select which tokens are to be masked, and the input_ids at these positions are replaced with 103 (the MASK token).
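The masking step just described can be sketched as follows; the mask probability p and the PAD id are illustrative assumptions, while 101, 102 and 103 follow the values given above:

```python
# Sketch of the described masking: sample a uniform probability matrix,
# select non-special positions where rand < p, and replace their
# input_ids with 103 ([MASK]); labels keep -100 at unmasked positions.
import torch

CLS, SEP, PAD, MASK_ID = 101, 102, 0, 103  # 101/102/103 per the text; PAD=0 assumed
p = 0.15                                   # preset mask probability (assumed value)

def mask_inputs(input_ids: torch.Tensor):
    rand = torch.rand(input_ids.shape)     # uniform on [0, 1], same shape as input_ids
    mask = (rand < p) & (input_ids != CLS) & (input_ids != SEP) & (input_ids != PAD)
    labels = torch.where(mask, input_ids, torch.full_like(input_ids, -100))
    masked_ids = input_ids.clone()
    masked_ids[mask] = MASK_ID
    return masked_ids, labels
```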
The pre-training model is then trained according to the masked vectors of the domain names to obtain the first model, wherein the loss function of the pre-training model represents the difference between the output at the masked positions and the true vectors corresponding to the masked tokens. Illustratively, the batch size and the number of epochs are set first to determine the amount of training data per step and the number of training iterations. The specific steps in the i-th training epoch are as follows: the model first performs forward propagation and the loss function of the model is computed. The loss function uses a modified cross-entropy (Cross Entropy Loss) function to compare the difference between the model's predicted output at the masked positions and the true vectors (i.e., the original, unmasked tokens) corresponding to those positions. It is understood that the "loss function" in this patent includes, but is not limited to, the triplet loss function or other functions that can serve as a convergence condition for the model parameters.
Specifically, the loss function of the model prediction is determined by the difference between the predicted values and the true values, and can be expressed as:

L(θ) = −Σi∈T log P(ti | t1, …, ti−1, [MASK], ti+1, …, tn; θ)

where L(θ) is the loss function, and P(ti | t1, …, ti−1, [MASK], ti+1, …, tn; θ) represents the probability, given the model parameters θ, that the prediction for token ti is correct. In the masking task, only the loss at masked positions is computed and back-propagated. Thus, if a token position is not masked, its contribution to the loss function must be ignored. This can be implemented by setting the labels of these positions to −100, as most deep learning frameworks automatically ignore positions labeled −100.
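A sketch of this masked-position loss, using the −100 convention described above; the masked-LM head and checkpoint name are assumptions:

```python
# Sketch: cross-entropy computed only at masked positions, because
# positions labeled -100 are ignored, as described in the text.
import torch
from transformers import AutoModelForMaskedLM

mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # assumed checkpoint

def mlm_loss(masked_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    logits = mlm_model(input_ids=masked_ids.unsqueeze(0)).logits  # (1, n, vocab_size)
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)        # skips -100 labels
    return loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
```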
Then, the loss function L(θ) is back-propagated and the gradient ∇θL is computed, and a gradient descent algorithm is used to minimize the loss function. In the implementation, an optimizer is used for the parameter update, which can be expressed as:

θnew = θold − η · ∇θL

where η denotes the learning rate, ∇θL is the gradient of the total loss function with respect to the parameters θ, θnew is the new model parameter, and θold is the old model parameter.
Iteration stops when the pre-training model has been iterated for the preset number of training epochs, and the model at that point is taken as the first model. The model training process can be performed on a device or platform with abundant computing resources to speed up training. The principle of this training is that if the model can predict the original token at a masked position, it demonstrates that the model can vectorize the target domain name according to the semantic characteristics and context of the domain name.
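The training loop described above can be sketched as follows; the optimizer choice (AdamW), learning rate, epoch count, and the train_batches iterable are assumptions, and mlm_loss refers to the sketch above:

```python
# Sketch of the iterative training: forward pass, loss, back-propagation,
# optimizer update (theta_new = theta_old - eta * gradient), fixed epochs.
from torch.optim import AdamW

optimizer = AdamW(mlm_model.parameters(), lr=5e-5)  # eta = 5e-5 (assumed)
num_epochs = 3                                      # preset iteration count (assumed)

for epoch in range(num_epochs):
    for masked_ids, labels in train_batches:        # hypothetical batch iterable
        loss = mlm_loss(masked_ids, labels)         # forward pass and loss
        loss.backward()                             # compute gradients of the loss
        optimizer.step()                            # gradient-descent parameter update
        optimizer.zero_grad()
```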
Step 230, clustering vectors corresponding to each domain name in the domain name set to obtain at least two clusters, and determining a first cluster from the at least two clusters, wherein the number of vectors corresponding to the first domain name in the first cluster is greater than the number of vectors corresponding to the second domain name, and the at least two clusters comprise the first cluster and at least one second cluster.
In the embodiment of the invention, the number of the at least two clusters is determined according to a preset cache rate, the number of cache levels, and the space multiple between adjacent cache levels. Caches of different levels have different read-write speeds; the higher the level, the faster the read-write speed. Illustratively, the domain name cache space has L levels in total, with the level-1 cache being the highest level. The preciousness of the cache space and the read-write speed decrease monotonically from level 1 to level L. L is the number of cache levels and is an integer greater than 1. The space multiple between adjacent cache levels means that the (M+1)-th level cache is m times the size of the M-th level cache, where m is an integer greater than 1 and M is an integer greater than or equal to 1 and less than L.
Determining the number of at least two clusters according to the following formula (1):
wherein K is the number of the at least two clusters, L is the number of cache levels, R is the preset cache rate, and m is the space multiple between adjacent cache levels. The preset cache rate R takes values in [0,1] and is determined by the available cache space and the actual size of the domain name set. The preset cache rate indicates the proportion of the second domain names added to the cache, or equivalently, how many of the second domain names are discarded.
After the number of clusters is determined, clustering vectors corresponding to each domain name in the domain name set based on the number of clusters to obtain at least two clusters. Illustratively, an unsupervised learning algorithm (e.g., k-means, etc.) that specifies the number of clusters is used to cluster the vectors corresponding to each domain name in the set of domain names.
After the at least two clusters are obtained, a first cluster is determined from among them. The at least two clusters include the first cluster and at least one second cluster, and the number of vectors corresponding to the first domain name in the first cluster is greater than the number corresponding to the second domain name. Specifically, N clusters are determined from the at least two clusters, where the number of vectors corresponding to the first domain name in each of the N clusters is greater than the number corresponding to the second domain name, and N is an integer greater than or equal to 1. If N is greater than 1, the N clusters are merged to obtain the first cluster; if N equals 1, that cluster is taken as the first cluster. Illustratively, the label of a vector corresponding to a first domain name is set to 1 and the label of a vector corresponding to a second domain name is set to 0. For each cluster Si, the dominant label within the cluster (i.e., the label that appears most often, or the label of the vectors that appear most often in the cluster) is yi. A cluster with yi = 1 is recorded as a first cluster and labeled S0; if more than one cluster has yi = 1, those clusters are merged as the first cluster. A second cluster is any cluster other than the first cluster, i.e., each of the at least two clusters other than the first cluster may be referred to as a second cluster.
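As a sketch of this step, with labels 1 and 0 as above; the helper name is an assumption:

```python
# Sketch: record each cluster whose dominant label y_i is 1, then merge
# all such clusters into the first cluster S0, per the text.
import numpy as np

def first_cluster_mask(cluster_ids: np.ndarray, labels: np.ndarray) -> np.ndarray:
    first = []
    for c in np.unique(cluster_ids):
        members = labels[cluster_ids == c]
        if (members == 1).sum() > (members == 0).sum():  # dominant label y_i = 1
            first.append(c)
    return np.isin(cluster_ids, first)  # True for vectors in the (merged) first cluster
```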
Step 240, determining a domain name corresponding to the second cluster meeting the distance condition according to the distance between the first cluster and each second cluster; and the response priority of the domain name corresponding to the second cluster meeting the distance condition is greater than that of the domain name corresponding to the second cluster not meeting the distance condition.
In the embodiment of the present invention, the distance between the first cluster and each second cluster is determined first. Specifically, the centroids of the first cluster and the at least one second cluster are determined separately. The distance between the centroid of the first cluster and the centroid of each second cluster is then calculated and taken as the distance between the first cluster and that second cluster. The priorities of the second clusters are then determined in ascending order of their distance from the first cluster, and the at least one second cluster is sorted in descending order of priority.
Illustratively, fig. 3 is a schematic view of cluster distances according to an embodiment of the present invention. The geometric centroid Ci of each cluster is determined separately; centroid C0 corresponds to the first cluster S0. The distance between the centroid of each second cluster and the centroid C0 of the first cluster S0 is determined separately (the distance between centroids includes, but is not limited to, the Euclidean distance), and the distances are then sorted in ascending order as d = [d1, d2, …, dK]. Each second cluster and its centroid are labeled according to the distance order: the second cluster corresponding to distance d1 is labeled S1 and its centroid C1; the second cluster corresponding to distance d2 is labeled S2 and its centroid C2; …; the second cluster corresponding to distance dK is labeled SK and its centroid CK. It will be appreciated that labeling the second clusters by distance order amounts to determining their priorities, so the sorting result of the at least one second cluster is {S1, S2, …, SK}.
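A sketch of the centroid-distance ranking of fig. 3, assuming the Euclidean distance mentioned above:

```python
# Sketch: centroid C0 of S0, Euclidean distances d_i to each second
# cluster's centroid C_i, sorted ascending to give {S1, S2, ..., SK}.
import numpy as np

def rank_second_clusters(vecs: np.ndarray, cluster_ids: np.ndarray,
                         first_mask: np.ndarray) -> list:
    c0 = vecs[first_mask].mean(axis=0)                       # centroid C0
    ranked = []
    for c in np.unique(cluster_ids[~first_mask]):            # each second cluster
        ci = vecs[cluster_ids == c].mean(axis=0)             # centroid C_i
        ranked.append((np.linalg.norm(ci - c0), c))          # distance d_i
    ranked.sort()                                            # ascending distance
    return [c for _, c in ranked]                            # priority order
```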
Sorting by the distance between the first cluster and each second cluster can be understood as sorting, at the semantic level, the degree to which the remaining domain names not included in the first cluster (i.e., the domain names in each second cluster) resemble the listed key domain names in the first cluster.
Since the caches are divided into different levels, the domain name corresponding to the second cluster satisfying the distance condition is determined according to the distance between the first cluster and each second cluster. The second clusters satisfying the distance condition are determined from the second clusters according to the priority ranking result, and the domain names corresponding to the vectors in those second clusters are determined. It will be appreciated that the response priority of a domain name corresponding to a second cluster satisfying the distance condition is greater than that of a domain name corresponding to a second cluster not satisfying it, and lower than the priority of the domain names corresponding to the first cluster. The distance condition is that a preset number of second clusters are selected from the at least one second cluster in ascending order of distance (or descending order of priority); these are the second clusters satisfying the distance condition. In the present invention, the preset number is the first number; the preset number may also be a value set in advance according to experience, which is not specifically limited herein. The first number is determined according to the number of the at least two clusters and the preset cache rate. It is also understood that the distance condition may instead be that a second cluster satisfies the condition when its distance is smaller than a threshold.
Specifically, the first number is determined according to the number of the at least two clusters and the preset cache rate; the first number is the number of second clusters that need to be written to the cache. The first number of second clusters are selected, in descending order of priority, from the sorting result of the at least one second cluster as the second clusters satisfying the distance condition. In some embodiments, the second clusters satisfying the distance condition (i.e., the first number of second clusters) are stored sequentially into caches of different levels. The domain names corresponding to the first cluster are recorded in the highest-level cache, and the priority of a second cluster satisfying the distance condition is positively correlated with the level of the cache, i.e., the higher the priority of the cluster, the higher the level of the cache it is written to. Illustratively, the number of clusters that need to be written to the cache, and hence the first number of second clusters to be written, is determined from the preset cache rate and the number of clusters; the clusters to be written into the cache include the first cluster and the selected second clusters. Based on the above ranking result {S1, S2, …, SK}, the domain names corresponding to the second clusters ranked beyond the first number are judged to be non-key domain names and are discarded without being written into the cache. The selected second clusters, starting from the first second cluster S1, are written sequentially into the L levels of cache, and the domain name corresponding to the first cluster S0 is written, as an important domain name, into the highest-level cache, namely the first-level cache.
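The exact count expressions were not recoverable from this text; the sketch below assumes, consistently with the worked example that follows (K = 4, R = 0.5, two clusters kept), that ⌈R·K⌉ clusters in total are written to the cache:

```python
# Sketch of the cache write-out. The retained total ceil(R * K) is an
# assumption inferred from the worked example (K=4, R=0.5 -> 2 kept);
# the exact formula was not recoverable from the source text.
import math

def assign_to_caches(ranked_seconds: list, K: int, R: float, L: int):
    kept_total = math.ceil(R * K)             # clusters written to cache (assumed)
    first_number = kept_total - 1             # second clusters meeting the distance condition
    kept = ranked_seconds[:first_number]
    discarded = ranked_seconds[first_number:] # judged non-key, not cached
    plan = {1: "S0"}                          # first cluster -> highest-level cache
    for level, cluster in zip(range(2, L + 1), kept):
        plan[level] = f"S{cluster}"           # S1, S2, ... into levels 2..L
    return plan, discarded
```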
According to the above method, fig. 4 is a schematic diagram of a clustering result provided by an embodiment of the present invention. For example, for a cache with m = 2 and L = 2 and a preset cache rate R = 0.5, K = 4 can be calculated. As shown, four clusters are obtained after clustering. The first cluster is then determined, and the clusters are sorted according to the distance between the first cluster and each second cluster, giving: first cluster Cluster0, the listed key domain names; second cluster Cluster1, sub-domain names with explicit service semantics under the key domain names; second cluster Cluster2, sub-domain names of general API services or semantically ambiguous key domain names; second cluster Cluster3, disposable CDN split domain names of some services, with low importance. From the preset cache rate it follows that only the first 2 clusters are retained and the last 2 are discarded. Clearly, in order to protect the services carried by the key domain names, Cluster0 and Cluster1 need to be written into the first-level and second-level caches; Cluster2 and Cluster3 are less significant, are non-key domain names, and can be discarded.
In the embodiment of the invention, based on a self-supervised learning task, a mask language model trained with a modified cross-entropy loss function takes into account the difference between the model's predicted output at the masked positions and the true labels, which improves the model's prediction accuracy and enables it to vectorize domain names according to their semantic features. The unlabeled domain names are then vectorized using the first model obtained after local fine-tuning of the pre-training model. By cluster analysis of semantic similarity with the key domain names, and according to the characteristics of the cache space and the distribution of the clustering results, the priorities of the domain names to be examined are classified into levels, which improves the accuracy of identifying domain name priority.
Based on the same technical concept, fig. 5 schematically illustrates a structural diagram of a domain name management apparatus according to an embodiment of the present invention, where the apparatus may perform a flow of a domain name management method.
As shown in fig. 5, the apparatus specifically includes:
An obtaining module 510, configured to obtain a domain name set, where the domain name set includes a first domain name with a first label and a second domain name without the first label, and the first label indicates a domain name priority;
The processing module 520 is configured to vectorize the first domain name and the second domain name to obtain a vector corresponding to each domain name in the domain name set;
cluster vectors corresponding to each domain name in the domain name set to obtain at least two clusters, and determine a first cluster from the at least two clusters, wherein the number of vectors corresponding to the first domain name in the first cluster is greater than the number of vectors corresponding to the second domain name, and the at least two clusters comprise the first cluster and at least one second cluster;
and determine a domain name corresponding to the second cluster meeting the distance condition according to the distance between the first cluster and each second cluster, wherein the response priority of the domain name corresponding to the second cluster meeting the distance condition is greater than that of the domain name corresponding to the second cluster not meeting the distance condition.
Optionally, the processing module 520 is specifically configured to:
inputting the first domain name and the second domain name into a first model to obtain a vector corresponding to each domain name in the domain name set output by the first model; the first model is a mask language model and is used for outputting semantic feature vectors of domain names.
Optionally, the processing module 520 is further configured to:
dividing each domain name in the training sample into word segments according to a vocabulary;
for any domain name, inputting the domain name into a pre-training model to obtain a vector of each word segment of the domain name output by the pre-training model;
masking the vectors of one or more word segments of the domain name, and training the pre-training model according to the masked vectors of the domain name to obtain the first model, wherein the loss function of the pre-training model represents the difference between the output at the masked positions and the true vectors corresponding to the masked word segments.
Optionally, the processing module 520 is specifically configured to:
determining N clusters from the at least two clusters, wherein the number of vectors corresponding to the first domain name in each of the N clusters is greater than the number of vectors corresponding to the second domain name, and N is an integer greater than or equal to 1;
and if N is greater than 1, merging the N clusters to obtain the first cluster.
Optionally, the number of the at least two clusters is determined according to a preset cache rate, the number of cache levels, and the space multiple between adjacent cache levels, and the processing module 520 is specifically configured to:
determine centroids of the first cluster and the at least one second cluster, respectively;
calculate the distance between the centroid of the first cluster and the centroid of each second cluster, and take that distance as the distance between the first cluster and the corresponding second cluster;
determine the priority of each second cluster according to its distance from the first cluster, in ascending order of distance;
sort the at least one second cluster in descending order of priority;
determine a first number according to the number of the at least two clusters and the preset cache rate;
and select, in descending order of priority from the sorting result of the at least one second cluster, the first number of second clusters as the second clusters meeting the distance condition.
Optionally, the processing module 520 is further configured to:
sequentially store domain names corresponding to the second clusters meeting the distance condition into caches of different levels, wherein the domain names corresponding to the first cluster are recorded in the highest-level cache, and the priority of the second clusters meeting the distance condition is positively correlated with the level of the cache.
Based on the same technical concept, the embodiment of the invention further provides a computer device, including:
A memory for storing program instructions;
And the processor is used for calling the program instructions stored in the memory and executing the domain name management method according to the obtained program.
Based on the same technical concept, the embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the above domain name management method.
Based on the same technical concept, an embodiment of the present invention further provides a computer program product, wherein the computer program product comprises an executable program, and the executable program is executed by a processor to perform the above domain name management method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.