CN114186150A - URL similarity detection method, apparatus, device and storage medium - Google Patents


Info

Publication number
CN114186150A
Authority
CN
China
Prior art keywords
url
source
target
sequence
participles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111545196.9A
Other languages
Chinese (zh)
Other versions
CN114186150B (en)
Inventor
游丽娜
钟良志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202111545196.9A
Publication of CN114186150A
Application granted
Publication of CN114186150B
Legal status: Active
Anticipated expiration

Abstract

The application relates to the technical field of web security and discloses a URL similarity detection method, apparatus, device and storage medium. The method comprises: acquiring a source URL and a target URL; segmenting the source URL according to the hierarchical structure and the special characters of the URL to obtain a source participle sequence; segmenting the target URL in the same way to obtain a target participle sequence; determining the edit distance between the target URL and the source URL by taking the participles (word segments) in the two sequences as the minimum unit of measurement; and determining the similarity between the target URL and the source URL according to the edit distance. The method improves the accuracy of URL similarity detection.

Description

URL similarity detection method, apparatus, device and storage medium
Technical Field
The present application relates to the field of web security technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting URL similarity.
Background
A URL serves as a network address identifier and usually contains keywords related to the page's resources or topic, or the brand keywords of a well-known company, which makes it easy to remember and search for. Attackers, however, often forge URLs with confusingly similar words to deceive users and carry out phishing attacks. Therefore, when a Web application firewall detects abnormal traffic with statistical learning methods, it often evaluates the similarity between the URL under inspection and historical normal URLs.
The Levenshtein edit distance can quickly measure how similar two character strings are, but it is sensitive to string length and measures only the character-level edit distance, so it cannot accurately measure the similarity between the URL under inspection and a normal URL.
Disclosure of Invention
Embodiments of the present application provide a method, an apparatus, a device, and a storage medium for detecting URL similarity, so that the problem that the similarity between URLs cannot be accurately measured by a Levenshtein edit distance calculation method can be solved at least to a certain extent, and the accuracy of URL similarity detection can be improved.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to a first aspect of the embodiments of the present application, a method for detecting URL similarity is provided, where the method includes:
acquiring a source URL and a target URL;
segmenting the source URL according to the hierarchical structure and the special characters of the URL to obtain a source participle sequence;
segmenting the target URL according to the hierarchical structure and the special characters of the URL to obtain a target participle sequence;
determining the editing distance between the target URL and the source URL by taking the participles in the target participle sequence and the participles in the source participle sequence as a minimum measuring unit;
and determining the similarity between the target URL and the source URL according to the editing distance.
In some aspects of the present application, based on the above aspects, the method further includes:
generalizing the character strings meeting preset rules in the source URL to obtain a first generalized variable set;
generalizing the character strings meeting the preset rules in the target URL to obtain a second generalized variable set;
obtaining a third generalized variable set according to the union set of the first generalized variable set and the second generalized variable set;
determining the editing distance between the target URL and the source URL by taking the participles in the target participle sequence and the source participle sequence as a minimum measurement unit, wherein the step of determining the editing distance comprises the following steps:
recursively calculating the editing distance between the first M participles in the source participle sequence and the first N participles in the target participle sequence by taking the participles in the target participle sequence and the source participle sequence as a minimum measurement unit;
if the Mth participle in the source participle sequence and the Nth participle in the target participle sequence both belong to the third generalized variable set, the edit distance between the first M participles in the source participle sequence and the first N participles in the target participle sequence is less than or equal to the edit distance between the first M-1 participles in the source participle sequence and the first N-1 participles in the target participle sequence, and M is greater than or equal to 1, and N is greater than or equal to 1.
In some aspects of the application, based on the above scheme, determining the edit distance between the target URL and the source URL by using the participles in the target participle sequence and the source participle sequence as minimum measurement units includes:
determining a detection matrix of the target URL and the source URL by adopting the following formula;
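$$
\mathrm{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j), & \min(i,j)=0,\\
\min\begin{cases}
\mathrm{lev}_{a,b}(i-1,j)+1\\
\mathrm{lev}_{a,b}(i,j-1)+1\\
\mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\notin L\ \lor\ b_j\notin L)}
\end{cases} & \text{otherwise,}
\end{cases}
$$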
wherein a represents the source participle sequence, b represents the target participle sequence, m and n are the numbers of participles in a and b respectively, lev_{a,b}(i, j) represents the edit distance between the first i participles in a and the first j participles in b, 0 ≤ i ≤ m, 0 ≤ j ≤ n, and L represents the third generalized variable set;
determining lev_{a,b}(m, n) in the detection matrix as the edit distance between the target URL and the source URL.
In some aspects of the present application, based on the above scheme, the determining the similarity between the target URL and the source URL according to the edit distance includes:
normalizing the editing distance of the source URL and the target URL to obtain the normalized editing distance;
and determining the similarity between the target URL and the source URL according to the normalized editing distance.
In some embodiments of the present application, based on the above scheme, normalizing the edit distance between the source URL and the target URL to obtain a normalized edit distance includes:
acquiring the word segmentation number of the target word segmentation sequence;
and taking the ratio of the editing distance of the source URL and the target URL to the number of the participles as the normalized editing distance.
In some aspects of the present application, based on the above scheme, the determining the similarity between the target URL and the source URL according to the edit distance includes:
if the editing distance is larger than or equal to a preset threshold value, determining that the target URL is not similar to the source URL;
and if the editing distance is smaller than the preset threshold value, determining that the target URL is similar to the source URL.
In some solutions of the present application, based on the above scheme, the segmenting the source URL according to the hierarchical structure and the special characters of the URL to obtain a source participle sequence includes:
dividing the source URL according to the hierarchical structure of the URL to obtain a plurality of subunits;
respectively carrying out symbol segmentation on the plurality of subunits according to the special characters of the URL and a preset regular expression to obtain a plurality of character substrings;
and respectively segmenting the plurality of character substrings through a maximum matching algorithm to obtain the source word segmentation sequence.
According to a second aspect of the embodiments of the present application, there is provided a URL similarity obtaining apparatus, including:
a URL acquisition unit for acquiring a source URL and a target URL;
the word segmentation unit is used for segmenting the source URL according to the hierarchical structure of the URL and the special characters to obtain a source word segmentation sequence;
the word segmentation unit is also used for segmenting words of the target URL according to the hierarchical structure of the URL and special characters to obtain a target word segmentation sequence;
the editing distance calculation unit is used for determining the editing distance between the target URL and the source URL by taking the participles in the target participle sequence and the participles in the source participle sequence as a minimum measurement unit;
and the similarity calculation unit is used for determining the similarity between the target URL and the source URL according to the editing distance.
According to a third aspect of embodiments of the present application, there is provided an electronic apparatus, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to carry out the method according to the first aspect as described above.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable medium having a computer program stored thereon, wherein the computer program, when executed, implements the method of the first aspect as described above.
The method and the device have the advantages that the source URL and the target URL are segmented respectively, the segmentation is used as the minimum measurement unit, the editing distance between the source URL and the target URL is calculated, the structure of the URL is fully considered, and the similarity detection accuracy is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
In the drawings:
Fig. 1 is a flowchart of a URL similarity detection method according to an embodiment of the present disclosure.
Fig. 2 is a flowchart of a URL word segmentation method according to an embodiment of the present disclosure.
Fig. 3 is a flowchart of another URL similarity detection method according to an embodiment of the present disclosure.
Fig. 4 is a block diagram of a URL similarity detection apparatus according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of a program product for implementing the above method according to an embodiment of the present application.
Fig. 6 shows a schematic diagram of an electronic device according to one embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In the description of the present application, it is to be understood that the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.
A URL (uniform resource locator) is the address of a standard resource on the Internet, through which information resources can be accessed and retrieved. A URL is a special character string expressed with a subset of ASCII characters and containing no spaces. It differs from traditional natural-language text such as Chinese or English, and as hierarchically structured network data it has its own linguistic characteristics: a single URL is not a complete sentence, has limited length, and usually contains special strings with particular meanings, such as IP addresses, dates and version numbers.
Fig. 1 is a flowchart of a URL similarity detection method according to an embodiment of the present disclosure. As shown in fig. 1, the method includes at least the following steps.
Step 110: a source URL and a target URL are obtained.
Step 120: and segmenting the source URL according to the hierarchical structure of the URL and the special characters to obtain a source segmentation sequence.
The syntax of the URL is extensible, and the standard structure is as follows:
protocol type:[//server address[:port number]][/path][?query][#fragment]
Most URLs include three main parts: protocol type (scheme), server address (domain), and path (path). The protocol type part indicates the transfer protocol used by the URL; the common protocols in the network field are http and https. The server address part typically uses a domain name or an IP address to specify the location of the resource on the network. The path part specifies the location of the resource file at the server address and is itself hierarchical, with "/" as the separator between path levels.
According to the RFC 1738 specification for URLs, only letters and digits [0-9a-zA-Z], the special symbols "$-_.+!*'()," (excluding the double quotes), and reserved characters used for their reserved purposes can be used in a URL directly without encoding.
The method for segmenting the source URL according to the hierarchical structure of the URL and the special characters is to segment the source URL in each hierarchical structure of the URL according to special symbols allowed by specifications to obtain a source segmentation sequence.
For example, suppose the source URL is http://10.1.1.1.abc.xyz.com/foods/apple.html.
Its three levels are the scheme part "http://", the domain part "10.1.1.1.abc.xyz.com", and the path part "foods/apple.html".
Splitting each part at the special characters yields the participles http, 10.1.1.1, abc, xyz, com, foods, apple and html.
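The level-by-level split for the example URL http://10.1.1.1.abc.xyz.com/foods/apple.html can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `segment_url`, the `IPV4_PREFIX` pattern, and the choice to split only on "." and "/" are assumptions made for the example.

```python
import re
from typing import List
from urllib.parse import urlsplit

# Assumed pattern: a dotted-quad IPv4 address at the start of the host.
IPV4_PREFIX = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})(?:\.|$)")


def segment_url(url: str) -> List[str]:
    """Split a URL into participles, one hierarchy level at a time."""
    parts = urlsplit(url)
    tokens = [parts.scheme] if parts.scheme else []

    host = parts.hostname or ""
    match = IPV4_PREFIX.match(host)
    if match:
        # Keep the IP-like prefix as a single participle.
        tokens.append(match.group(1))
        host = host[match.end():]
    tokens.extend(label for label in host.split(".") if label)

    # Split the path on "/" and "." only; other special characters
    # are not handled in this sketch.
    tokens.extend(piece for piece in re.split(r"[/.]", parts.path) if piece)
    return tokens


tokens = segment_url("http://10.1.1.1.abc.xyz.com/foods/apple.html")
```

Here the IP-like prefix is matched before the host is split on ".", so that it survives as one participle instead of four separate numbers.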
Step 130: and segmenting the target URL according to the hierarchical structure of the URL and the special characters to obtain a target segmentation sequence.
Step 130 may employ the same method as step 120 to segment the target URL.
Step 140: and determining the editing distance between the target URL and the source URL by taking the participles in the target participle sequence and the participles in the source participle sequence as minimum measurement units.
The edit distance, also called the Levenshtein distance, is the minimum number of editing operations required to change one character string into another. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character. The smaller the edit distance, the more similar the two character strings are.
The traditional Levenshtein distance takes a single character as the minimum unit of measurement. Applied directly to a source URL and a target URL, it measures only the edit distance between the two character strings and ignores the fact that a URL is hierarchically structured data. In the embodiments of the application, the participles in the target participle sequence and in the source participle sequence are therefore used as the minimum input units of the Levenshtein edit distance algorithm when calculating the edit distance between the target URL and the source URL.
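The difference between character-level and participle-level measurement can be illustrated with a standard Levenshtein routine that accepts any sequence. This is a sketch of the textbook algorithm, without the generalization step introduced later; the function name and the sample participle lists are illustrative assumptions.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Standard edit distance; the elements of a and b are the edit units."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # replace x with y
        prev = curr
    return prev[-1]


# Illustrative participle sequences that differ in one participle.
src = ["http", "abc", "xyz", "com", "foods", "apple", "html"]
tgt = ["http", "abc", "xyz", "com", "foods", "banana", "html"]

token_distance = levenshtein(src, tgt)          # one participle replaced
char_distance = levenshtein("apple", "banana")  # several character edits
```

Passing lists of participles makes each participle one edit unit, so replacing "apple" with "banana" counts as a single operation, whereas the character-level call counts every changed character.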
Step 150: and determining the similarity between the target URL and the source URL according to the editing distance.
In the embodiments of the application, the source URL and the target URL are each segmented, and the edit distance between them is calculated with the participle as the minimum editable unit. Compared with the traditional edit distance algorithm, which takes single characters as the minimum editable unit, this fully takes the structure of the URL into account and improves the accuracy of similarity detection.
In some aspects of the present application, based on the above aspects, the method further includes:
generalizing character strings which meet preset rules in a source URL to obtain a first generalized variable set;
generalizing the character strings meeting the preset rules in the target URL to obtain a second generalized variable set;
and obtaining a third generalized variable set according to the union set of the first generalized variable set and the second generalized variable set.
In a specific implementation, the characteristics of character strings such as <IP>, <CH>, <EMAIL>, <TIME> and <COOKIE> can be extracted, a regular expression designed for each, and the source URL and the target URL generalized with these regular expressions to obtain their respective generalized variable sets. For example, if the character strings of the source URL satisfy the characteristics of <IP>, <EMAIL> and <TIME>, the first generalized variable set contains <IP>, <EMAIL> and <TIME>; if the character strings of the target URL satisfy the characteristics of <IP> and <EMAIL>, the second generalized variable set contains <IP> and <EMAIL>; the third generalized variable set contains <IP>, <EMAIL>.
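A minimal sketch of the generalization step is given below. The text names the labels but not the regular expressions, so the three patterns in `RULES` are hypothetical, as are the function name and the sample participles. The third set is computed as a union, following the wording of the steps above.

```python
import re

# Hypothetical rules: the labels come from the text, the patterns do not.
RULES = {
    "<IP>": re.compile(r"\d{1,3}(?:\.\d{1,3}){3}"),
    "<EMAIL>": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "<TIME>": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def generalize(tokens):
    """Replace participles that match a rule with the rule's label.

    Returns the generalized participle list and the set of labels used.
    """
    generalized, labels = [], set()
    for token in tokens:
        for label, pattern in RULES.items():
            if pattern.fullmatch(token):
                generalized.append(label)
                labels.add(label)
                break
        else:
            # No rule matched: keep the participle unchanged.
            generalized.append(token)
    return generalized, labels


src_tokens, first_set = generalize(["10.1.1.1", "abc", "2021-12-16"])
tgt_tokens, second_set = generalize(["10.2.3.4", "abc", "user@sn.com"])
third_set = first_set | second_set  # union of the two label sets
```

After this step the two IP addresses are both represented by `<IP>`, so a later distance computation can treat them as having the same attribute even though their literal values differ.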
Based on the generalization processing step, determining the edit distance between the target URL and the source URL by taking the participles in the target participle sequence and the source participle sequence as a minimum measurement unit, comprising:
recursively calculating the editing distances between the first M participles in the source participle sequence and the first N participles in the target participle sequence by taking the participles in the target participle sequence and the source participle sequence as a minimum measurement unit;
if the Mth participle in the source participle sequence and the Nth participle in the target participle sequence both belong to the third generalized variable set, the editing distance between the first M participles in the source participle sequence and the first N participles in the target participle sequence is smaller than or equal to the editing distance between the first M-1 participles in the source participle sequence and the first N-1 participles in the target participle sequence, M is larger than or equal to 1, and N is larger than or equal to 1.
Because the Mth participle in the source participle sequence and the Nth participle in the target participle sequence both belong to the third generalized variable set, the two participles have the same attribute. Suppose matching the first M-1 participles in the source participle sequence with the first N-1 participles in the target participle sequence requires K editing operations, and the Mth participle and the Nth participle are then appended to the respective sequences. If the two newly added participles have the same attribute, the number of operations remains K. If they differ, a replacement is performed, for example replacing the Mth participle in the source participle sequence with the Nth participle in the target participle sequence, and the minimum number of operations needed to match the first M participles in the source participle sequence with the first N participles in the target participle sequence is K + 1.
In the embodiments of the application, the source URL and the target URL are generalized, and the recursive calculation of the edit distance considers whether the participles newly added on the two sides have the same attribute. If they do, the two sides are highly similar, and no editing operation is needed to match the sequences after the new participles are added; that is, the edit distance is not increased by 1 for the two newly added participles.
In some aspects of the present application, based on the above-mentioned schemes, determining an edit distance between a target URL and a source URL using a participle in a target participle sequence and a participle in a source participle sequence as a minimum measurement unit includes:
determining a detection matrix of a target URL and a source URL by adopting the following formula;
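$$
\mathrm{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j), & \min(i,j)=0,\\
\min\begin{cases}
\mathrm{lev}_{a,b}(i-1,j)+1\\
\mathrm{lev}_{a,b}(i,j-1)+1\\
\mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\notin L\ \lor\ b_j\notin L)}
\end{cases} & \text{otherwise,}
\end{cases}
$$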
wherein a represents the source participle sequence, b represents the target participle sequence, m and n are the numbers of participles in a and b respectively, lev_{a,b}(i, j) represents the edit distance between the first i participles in a and the first j participles in b, 0 ≤ i ≤ m, 0 ≤ j ≤ n, and L represents the third generalized variable set;
determining lev_{a,b}(m, n) in the detection matrix as the edit distance between the target URL and the source URL.
If the participle sequences of both the source URL and the target URL are empty, the minimum edit distance is 0. If one sequence is empty and the other is not, the minimum edit distance is the number of participles currently being matched in the non-empty sequence.
In the following, a[1..i] denotes the first i participles of a and b[1..j] the first j participles of b. Suppose matching a[1..i-1] with b[1..j] requires K1 editing operations. When a[i] is appended to the source participle sequence, a[i] must be removed for a[1..i] to match b[1..j], so matching a[1..i] with b[1..j] requires K1 + 1 editing operations.
Similarly, suppose matching a[1..i] with b[1..j-1] requires K2 editing operations. When b[j] is appended to the target participle sequence, b[j] must be removed for a[1..i] to match b[1..j], so matching a[1..i] with b[1..j] requires K2 + 1 editing operations.
Similarly, suppose matching a[1..i-1] with b[1..j-1] requires K3 editing operations, and a[i] is appended to the source participle sequence and b[j] to the target participle sequence. If a[i] and b[j] both belong to the third generalized variable set, no editing operation is needed for a[1..i] to match b[1..j], and the number of editing operations remains K3. If either a[i] or b[j] does not belong to the third generalized variable set, a replacement is needed for a[1..i] to match b[1..j], and the number of editing operations is K3 + 1.
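The three cases above can be sketched as a table-filling routine. This is one possible reading of the description, not the application's own code. The function name is illustrative, and one detail is an added assumption: identical participles are also given a replacement cost of 0, as in standard Levenshtein distance, because the text only states the zero-cost case for participles in the third generalized variable set.

```python
def generalized_edit_distance(a, b, generalized):
    """Participle-level edit distance with a generalized variable set.

    a, b: participle sequences (source and target).
    generalized: the third generalized variable set L.
    """
    m, n = len(a), len(b)
    lev = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        lev[i][0] = i  # target prefix empty: remove i participles
    for j in range(n + 1):
        lev[0][j] = j  # source prefix empty: add j participles

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            both_generalized = (a[i - 1] in generalized
                                and b[j - 1] in generalized)
            # Assumption (not stated in the text): equal participles
            # also match at no cost.
            identical = a[i - 1] == b[j - 1]
            cost = 0 if (both_generalized or identical) else 1
            lev[i][j] = min(lev[i - 1][j] + 1,         # K1 + 1
                            lev[i][j - 1] + 1,         # K2 + 1
                            lev[i - 1][j - 1] + cost)  # K3 or K3 + 1
    return lev[m][n]


L = {"<IP>", "<EMAIL>"}
same_shape = generalized_edit_distance(["<IP>", "abc"], ["<EMAIL>", "abc"], L)
one_extra = generalized_edit_distance(["art", "doc"], ["art", "img", "doc"], set())
```

The matrix entry `lev[m][n]` plays the role of lev_{a,b}(m, n) in the formula, and the comments K1, K2 and K3 refer to the three cases described above.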
Fig. 2 is a flowchart of a URL word segmentation method according to an embodiment of the present disclosure. As shown in fig. 2, the method includes at least the following steps.
Step 210: and dividing the source URL according to the hierarchical structure of the URL to obtain a plurality of subunits.
As introduced above, a URL mainly contains three parts: the protocol type (scheme), the server address (domain), and the path (path). The domain name in the server address can be divided into two parts: the free domain name (FDN), which can be managed and modified by the domain owner, and the registered domain name (RDN), which must be managed and allocated by a domain registrar or registration authority. The registered domain name can be further divided, from left to right, into the second-level domain (SLD), which is located before the top-level domain and defined by the domain registrant, and the top-level domain (TLD), also known as the public suffix, which is managed by the registrar or registry. In the present application, hierarchical segmentation of a URL therefore yields five parts: the protocol type (scheme), the free domain name (FDN), the second-level domain (SLD), the top-level domain (TLD), and the path (path).
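The five-part split (scheme, FDN, SLD, TLD, path) might be sketched as below. The function name and the sample URL are illustrative. The code makes a strong simplifying assumption that the TLD is always the last host label; multi-label suffixes such as "com.cn" would need a public suffix list, which this sketch does not use.

```python
from urllib.parse import urlsplit


def split_levels(url):
    """Split a URL into scheme, FDN, SLD, TLD and path.

    Simplifying assumption: the TLD is the last host label and the SLD
    is the label immediately before it.
    """
    parts = urlsplit(url)
    labels = [label for label in (parts.hostname or "").split(".") if label]
    tld = labels[-1] if labels else ""
    sld = labels[-2] if len(labels) >= 2 else ""
    fdn = ".".join(labels[:-2])  # empty when there are fewer than 3 labels
    return {
        "scheme": parts.scheme,
        "fdn": fdn,
        "sld": sld,
        "tld": tld,
        "path": parts.path,
    }


levels = split_levels("http://mail.shop.example.com/foods/apple.html")
```

Each of the five values can then be passed separately to the symbol segmentation of step 220.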
Step 220: and respectively carrying out symbol segmentation on the plurality of subunits according to the special characters of the URL and a preset regular expression to obtain a plurality of character substrings.
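The URL may use an IP address to specify the server address, and the path part or the FDN part may include character strings in the form of a date, a version number, or consecutive digits. The preset regular expressions allow such strings to be kept as whole character substrings instead of being split at their special characters.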
Step 230: and respectively segmenting the plurality of character substrings through a maximum matching algorithm to obtain a source word segmentation sequence.
The two-way maximum matching algorithm comprises two kinds of matching: the forward maximum matching and the reverse maximum matching are both character string matching based on a dictionary prepared in advance. The inverse maximum matching algorithm is to read a string of unsegmented text from a pointer starting at the end of the string, checking if the current string is a word in the dictionary. If so, insert a space and repeat the process. If not, the pointer is moved one to the right, the string length is decreased, and the matching process is repeated until a single character finally remains. If no word is found, a single character, i.e., a non-dictionary word, is created to represent the final segmented word. The forward maximum matching algorithm works similarly, but with pointer reads starting at the beginning of the string.
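The reverse maximum matching procedure described above can be sketched as follows. The function name, the `max_len` parameter and the sample dictionary are assumptions for illustration; the text only says that a dictionary is prepared in advance.

```python
def reverse_max_match(text, dictionary, max_len=None):
    """Segment text from right to left, preferring the longest dictionary word."""
    if max_len is None:
        # Longest candidate worth trying is the longest dictionary word.
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Shrink the candidate from the left until it is a dictionary
        # word or only a single character remains.
        while start < end - 1 and text[start:end] not in dictionary:
            start += 1
        tokens.append(text[start:end])
        end = start
    tokens.reverse()  # tokens were collected from the end of the string
    return tokens


words = {"apple", "food", "foods"}
segmented = reverse_max_match("applefoods", words)
```

A forward variant would scan from the start of the string and shrink candidates from the right instead.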
For example, the source URL is as follows:
/art/3492/3/3/art_35_349341/doc?/name=3298sfjkasdk&mail=123@sn.com
Segmenting it yields "art", "3492", "3", "3", "art", "35", "349341", "doc", "?", "name", "=", "3298sfjkasdk", "&", "mail", "=", "123@sn.com".
For example, the target URL is as follows:
/art/1034/11/23/art_35_14891/doc?/name=klak324345lkl3456jd&mail=y2ln@sd.com
Segmenting it yields "art", "1034", "11", "23", "art", "35", "14891", "doc", "?", "name", "=", "klak324345lkl3456jd", "&", "mail", "=", "y2ln@sd.com".
From the word segmentation results, the source URL and the target URL have the same number of participles, and the participles at the same positions satisfy the same preset rule, that is, they belong to the third generalized variable set. The edit distance between the source URL and the target URL is therefore 0, and the two URLs are similar.
Fig. 3 is a flowchart illustrating another URL similarity detection method according to an embodiment of the present disclosure. As shown in fig. 3, the method includes at least the following steps.
Step 310: a source URL and a target URL are obtained.
Step 320: word segmentation: segmenting the source URL according to the hierarchical structure of the URL and the special characters to obtain a source segmentation sequence; and segmenting the target URL according to the hierarchical structure of the URL and the special characters to obtain a target segmentation sequence.
Step 330: generalization: generalizing the character strings meeting preset rules in the source URL to obtain a first generalized variable set; generalizing the character strings meeting the preset rules in the target URL to obtain a second generalized variable set; and obtaining a third generalized variable set according to the union set of the first generalized variable set and the second generalized variable set.
Step 340: edit distance calculation: determining the edit distance between the target URL and the source URL by taking the participles in the target participle sequence and the participles in the source participle sequence as the minimum unit of measurement.
Step 350: edit distance normalization: normalizing the edit distance between the source URL and the target URL to obtain a normalized edit distance.
In a specific implementation, normalizing the edit distance between the source URL and the target URL to obtain a normalized edit distance, includes:
acquiring the word segmentation number of a target word segmentation sequence;
and taking the ratio of the edit distance between the source URL and the target URL to the number of participles as the normalized edit distance.
Step 360: if the normalized editing distance is larger than or equal to a preset threshold value, determining that the target URL is not similar to the source URL; and if the normalized editing distance is smaller than a preset threshold value, determining that the target URL is similar to the source URL.
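Steps 340 to 360 can be sketched as follows. This is a minimal sketch under stated assumptions, not the formula of the application: it uses a standard Levenshtein recurrence over participles, and it assumes that substituting one participle for another costs 0 when both belong to the third generalized variable set. The function names and the threshold value 0.3 are hypothetical.

```python
def token_edit_distance(source_tokens, target_tokens, generalized):
    """Levenshtein distance with whole participles as the unit of comparison.

    Assumption: two participles count as equal if they are identical or if
    both belong to the generalized variable set.
    """
    m, n = len(source_tokens), len(target_tokens)
    lev = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        lev[i][0] = i          # delete all i source participles
    for j in range(n + 1):
        lev[0][j] = j          # insert all j target participles
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = source_tokens[i - 1], target_tokens[j - 1]
            same = a == b or (a in generalized and b in generalized)
            lev[i][j] = min(
                lev[i - 1][j] + 1,                        # deletion
                lev[i][j - 1] + 1,                        # insertion
                lev[i - 1][j - 1] + (0 if same else 1),   # match or substitution
            )
    return lev[m][n]


def normalized_edit_distance(distance, target_tokens):
    """Step 350: ratio of the edit distance to the target participle count."""
    if not target_tokens:
        raise ValueError("target participle sequence must not be empty")
    return distance / len(target_tokens)


def is_similar(source_tokens, target_tokens, generalized, threshold=0.3):
    """Step 360: similar only if the normalized distance is below the threshold."""
    distance = token_edit_distance(source_tokens, target_tokens, generalized)
    return normalized_edit_distance(distance, target_tokens) < threshold
```

For example, `["art", "3492", "doc"]` and `["art", "1034", "doc"]` have a distance of 0 when both numeric participles are in the generalized set, and a distance of 1 (normalized to 1/3) when the set is empty.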
In the embodiment of the application, an edit distance between the target URL and the source URL that is smaller than the preset threshold indicates that the target URL and the source URL are very similar. Whether the target URL is a safe URL may therefore be determined from the similarity between the target URL and a normal source URL.
An embodiment of an apparatus for performing the URL similarity detection method is described below. For details not described in the apparatus embodiment, please refer to the embodiment of the URL similarity detection method.
Fig. 4 is a schematic diagram of a URL similarity detection apparatus according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus 400 includes at least the following.
a URL obtaining unit 410, configured to obtain a source URL and a target URL;
a word segmentation unit 420, configured to perform word segmentation on the source URL according to the hierarchical structure of the URL and the special characters, to obtain a source word segmentation sequence;
the word segmentation unit 420 is further configured to perform word segmentation on the target URL according to the hierarchical structure of the URL and the special characters, to obtain a target word segmentation sequence;
an edit distance calculation unit 430, configured to determine an edit distance between the target URL and the source URL by using the participles in the target participle sequence and the participles in the source participle sequence as minimum measurement units;
and a similarity calculation unit 440, configured to determine a similarity between the target URL and the source URL according to the edit distance.
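The division of the apparatus into units can be pictured with a small composition sketch. This is only an illustration: the class name `UrlSimilarityDetector`, its field names, and the callable signatures are hypothetical, and each unit is represented by a plain function supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Tokens = List[str]


@dataclass
class UrlSimilarityDetector:
    """Hypothetical wiring of units 410 to 440 as interchangeable callables."""

    acquire: Callable[[], Tuple[str, str]]           # URL obtaining unit 410
    segment: Callable[[str], Tokens]                 # word segmentation unit 420
    edit_distance: Callable[[Tokens, Tokens], int]   # edit distance calculation unit 430
    similarity: Callable[[int, Tokens], bool]        # similarity calculation unit 440

    def run(self) -> bool:
        source_url, target_url = self.acquire()
        source_tokens = self.segment(source_url)     # unit 420 handles the source URL
        target_tokens = self.segment(target_url)     # and, further, the target URL
        distance = self.edit_distance(source_tokens, target_tokens)
        return self.similarity(distance, target_tokens)
```

Keeping the units as separate callables mirrors the description, in which the same word segmentation unit processes both URLs before the edit distance and similarity units run.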
The word segmentation method provided by the embodiment of the application fully considers the hierarchical information and the special character information of the URL and divides the URL into a plurality of structural units.
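A minimal sketch of such a segmentation is given below. It rests on several assumptions: the hierarchy is taken to be the "/"-separated path levels, the special characters are limited to a hypothetical set (`_`, `?`, `=`, `&`), the "/" separators themselves are dropped rather than kept as participles, and the dictionary-based maximum matching step is omitted.

```python
import re

# Hypothetical special-character set; the capturing group makes re.split
# keep each special character as its own participle.
SPECIAL_CHARS = re.compile(r"([_?=&])")


def segment_url(url):
    """Split a URL path first by hierarchy level, then by special characters."""
    tokens = []
    for unit in url.split("/"):          # hierarchical structure of the URL
        if not unit:
            continue                     # skip empty levels such as the leading "/"
        for piece in SPECIAL_CHARS.split(unit):
            if piece:                    # drop empty strings produced by the split
                tokens.append(piece)
    return tokens
```

Under these assumptions, `segment_url("/a/b_c?x=1")` returns `["a", "b", "_", "c", "?", "x", "=", "1"]`, so each path level, delimiter and parameter value becomes a separate structural unit.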
Referring to fig. 5, a program product 500 for implementing the above method according to an embodiment of the present application is described. The program product may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present application is not limited thereto. In this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
As another aspect, the present application further provides an electronic device capable of implementing the above method. As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 600 according to this embodiment of the present application is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, and a bus 630 that couples the various system components including the storage unit 620 and the processing unit 610.
The storage unit stores program code that can be executed by the processing unit 610, to cause the processing unit 610 to perform the steps according to various exemplary embodiments of the present application described in the "example methods" section above in this description.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 621 and/or a cache memory unit 622, and may further include a read only memory unit (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set (at least one) of program modules 626, such program modules 626 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 660. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiments of the present application.
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present application, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules. Finally, it should be noted that: as will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A URL similarity detection method, characterized by comprising the following steps:
acquiring a source URL and a target URL;
segmenting the source URL according to the hierarchical structure of the URL and the special characters to obtain a source segmentation sequence;
segmenting words of the target URL according to the hierarchical structure and the special characters of the URL to obtain a target segmentation sequence;
determining the editing distance between the target URL and the source URL by taking the participles in the target participle sequence and the participles in the source participle sequence as a minimum measuring unit;
and determining the similarity between the target URL and the source URL according to the editing distance.
2. The URL similarity detection method according to claim 1, further comprising:
generalizing the character strings meeting preset rules in the source URL to obtain a first generalized variable set;
generalizing the character strings meeting the preset rules in the target URL to obtain a second generalized variable set;
obtaining a third generalized variable set according to the union set of the first generalized variable set and the second generalized variable set;
wherein the determining the editing distance between the target URL and the source URL by taking the participles in the target participle sequence and the source participle sequence as a minimum measurement unit comprises:
recursively calculating the editing distance between the first M participles in the source participle sequence and the first N participles in the target participle sequence by taking the participles in the target participle sequence and the source participle sequence as a minimum measurement unit;
if the Mth participle in the source participle sequence and the Nth participle in the target participle sequence both belong to the third generalized variable set, the edit distance between the first M participles in the source participle sequence and the first N participles in the target participle sequence is less than or equal to the edit distance between the first M-1 participles in the source participle sequence and the first N-1 participles in the target participle sequence, and M is greater than or equal to 1, and N is greater than or equal to 1.
3. The URL similarity detection method according to claim 2, wherein the determining the edit distance between the target URL and the source URL by using the participles in the target participle sequence and the source participle sequence as a minimum measurement unit includes:
determining a detection matrix of the target URL and the source URL by adopting the following formula;
Figure FDA0003415559500000021
wherein a represents the source participle sequence, b represents the target participle sequence, lev_{a,b}(i, j) represents the edit distance between the first i participles in a and the first j participles in b, 0 ≤ i ≤ m, 0 ≤ j ≤ n, and L represents the third generalized variable set;
determining lev_{a,b}(m, n) in the detection matrix as the edit distance between the target URL and the source URL.
4. The URL similarity detection method according to claim 1, wherein the determining the similarity between the target URL and the source URL according to the edit distance includes:
normalizing the editing distance of the source URL and the target URL to obtain the normalized editing distance;
and determining the similarity between the target URL and the source URL according to the normalized editing distance.
5. The URL similarity detection method according to claim 4, wherein the normalizing the edit distance between the source URL and the target URL to obtain a normalized edit distance includes:
acquiring the word segmentation number of the target word segmentation sequence;
and taking the ratio of the editing distance of the source URL and the target URL to the number of the participles as the normalized editing distance.
6. The URL similarity detection method according to claim 1, wherein the determining the similarity between the target URL and the source URL according to the edit distance includes:
if the editing distance is larger than or equal to a preset threshold value, determining that the target URL is not similar to the source URL;
and if the editing distance is smaller than the preset threshold value, determining that the target URL is similar to the source URL.
7. The URL similarity detection method according to claim 1, wherein the segmenting the source URL according to the hierarchical structure of the URL and the special character to obtain a source segmentation sequence includes:
dividing the source URL according to the hierarchical structure of the URL to obtain a plurality of subunits;
respectively carrying out symbol segmentation on the plurality of subunits according to the special characters of the URL and a preset regular expression to obtain a plurality of character substrings;
and respectively segmenting the plurality of character substrings through a maximum matching algorithm to obtain the source word segmentation sequence.
8. An apparatus for detecting URL similarity, the apparatus comprising:
a URL acquisition unit for acquiring a source URL and a target URL;
the word segmentation unit is used for segmenting the source URL according to the hierarchical structure of the URL and the special characters to obtain a source word segmentation sequence;
the word segmentation unit is also used for segmenting words of the target URL according to the hierarchical structure of the URL and special characters to obtain a target word segmentation sequence;
the editing distance calculation unit is used for determining the editing distance between the target URL and the source URL by taking the participles in the target participle sequence and the participles in the source participle sequence as a minimum measurement unit;
and the similarity calculation unit is used for determining the similarity between the target URL and the source URL according to the editing distance.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to carry out the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, characterized in that the computer program, when executed, implements the method according to any of claims 1-7.

CN114186150B (en): URL similarity detection method, device, equipment and storage medium (Active)

Application number: CN202111545196.9
Priority date: 2021-12-16
Filing date: 2021-12-16

Publications (2)

CN114186150A (en), published 2022-03-15
CN114186150B (en), published 2025-03-28

Family ID: 80605413

Country Status (1)

CN (1): CN114186150B (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
