CN119203976B - A web page deduplication method based on structure perception - Google Patents

A web page deduplication method based on structure perception

Info

Publication number
CN119203976B
Authority
CN
China
Prior art keywords
web page
simhash
url
content
collected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411687242.2A
Other languages
Chinese (zh)
Other versions
CN119203976A (en)
Inventor
赵云
苏杭
许源
仲思超
嵇程
高亮
马锐
陆茹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Jiangsu Co ltd Suqian Branch
Jiangsu Branch Center National Computer Network And Information Security Management Center
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Jiangsu Co ltd Suqian Branch
Jiangsu Branch Center National Computer Network And Information Security Management Center
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Jiangsu Co ltd Suqian Branch, Jiangsu Branch Center National Computer Network And Information Security Management Center, and China Mobile Hangzhou Information Technology Co Ltd
Priority to CN202411687242.2A
Publication of CN119203976A
Application granted
Publication of CN119203976B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a web page deduplication method based on structure perception, belonging to the technical field of security, infrastructure and IT support. The method comprises the following steps: 1. collect web pages; 2. parse the URLs of the collected web pages to generate URL parameter feature tags; 3. construct a mapping table of URL feature information and difference analysis results, perform initialization comparison and filtering on the collected web pages, and update the mapping table; 4. if a collected web page is not filtered out, extract weighted areas from it according to the mapping table and calculate its SimHash value; 5. judge, from the SimHash value and the page URL, whether a similar web page already exists, and if so, perform deduplication. The method improves deduplication efficiency and accuracy and is suitable for efficient, accurate deduplication of large-scale web page data.

Description

Webpage deduplication method based on structure perception
Technical Field
The invention belongs to the technical fields of security, infrastructure and IT support, and particularly relates to a webpage deduplication method based on structural awareness.
Background
With the rapid development of the internet, the amount of network information increases exponentially. Many web sites contain a large number of duplicate or highly similar web pages, such as multi-version stories for the same event in a news web site, or a merchandise listing page of an e-commerce web site. This situation results in the generation of a large amount of redundant information when data is collected, reducing the efficiency and quality of information acquisition.
In the prior art, URL-based deduplication is mainly implemented through URL mapping relations, using techniques such as hash algorithms, MD5 (message digest algorithm), or bit mapping (e.g. Bitmap or Bloom filter). These methods can only deduplicate at the URL level and cannot effectively handle web pages that are identical or highly similar but have different URLs. For example, different report links for the same news event may have different URLs while their page content is very similar. Furthermore, such methods cannot handle dynamically generated web pages, because the URLs of dynamic pages often contain parameters or timestamps, so the actual content may be identical even though the URLs differ.
Another prior art is a SimHash-based web page deduplication method. Its basic principle is to segment the entire text content of the page, compute a hash value for each word, and combine these hash values by weighted averaging to generate a SimHash fingerprint of the page. By computing the Hamming distance between the SimHash fingerprints of different pages, it can be determined whether the pages are duplicates. The drawback of this method is that it does not account for page structure or the differing importance of different regions: the entire page text is treated as equally important, so template content interferes excessively with similarity computation. For example, common parts such as navigation bars and footers may cause unrelated pages to be misjudged as similar. In addition, the method has difficulty recognizing pages whose body text is identical but whose templates differ, such as the same news article republished on different sites: although the body text is the same, the template differences can cause misjudgment.
Disclosure of Invention
Aiming at the defects of the prior art, the invention addresses the technical problem of providing a web page deduplication method based on structure perception.
In order to solve the above technical problems, the invention discloses a web page deduplication method based on structure awareness, which comprises the following steps.
Step 1: collect web pages.
Step 2: parse the URL of each collected web page, extract the parameter part of the URL, and generate a URL parameter feature tag from the pattern of parameter names and values.
Step 3: construct a mapping table of URL feature information and difference analysis results, perform initialization comparison and filtering on the collected web pages, and update the mapping table.
Step 4: if the collected web page has not been filtered out, extract weighted areas from it according to the mapping table and calculate its SimHash value.
Step 5: judge from the SimHash value and the page URL whether a similar web page exists; if so, perform deduplication.
Further, step 2 includes performing tag replacement on URL parameter names, parameter values and the path according to rules, re-splicing the tag-replaced parts into one string, and computing a unique request ID of the string; the unique request ID serves as the URL parameter feature tag.
Further, constructing the mapping table of URL feature information and difference analysis results in step 3 includes storing and managing URL feature information and its difference analysis results in a Map data structure; the mapping table uniqueMarkedMap is defined as Map<String, DiffRes> uniqueMarkedMap.
The key of the mapping table is the URL parameter feature tag, i.e. the unique request ID uniqueMarkedId, of type String. The value is a DiffRes object comprising a direct-filter flag directFilter, the cumulative comparison count compareCount, the number similarCount of similar results among those comparisons, the SimHash value lastSimhash of the last comparison, the last page source rawHtml, and the set of content change area paths diffPathList; each entry of diffPathList records the change type of the content area, the element XPath, and the number of times the element changed.
Further, the initialization comparison and filtering of collected pages in step 3, with updates to the mapping table, proceeds as follows. Look up the mapping table by the page's URL parameter feature tag. If no difference analysis result exists, this is the first page collected for that feature tag: initialize its difference analysis entry in the mapping table, save the page source into the rawHtml attribute of DiffRes, compute the SimHash value of the current source with the standard SimHash algorithm, and save it into the lastSimhash attribute of DiffRes.
If an entry exists and its directFilter flag is true, the collected page is judged a duplicate and is filtered directly. If directFilter is false and the cumulative comparison count has reached the comparison count threshold, initialization comparison is complete and step 4 is executed.
If the cumulative comparison count is below the threshold, identify the content change areas of the collected page, generate a content change area path list, update it into the mapping table, and return to step 1.
Further, identifying the content change areas of the collected page in step 3, generating the content change area path list and updating the mapping table includes comparing DOM structures: take as the comparison sample the page whose source is stored in rawHtml under the same URL parameter feature tag, start from the root node of the DOM tree, and compare node structure and content level by level using breadth-first traversal.
Compare the number of child nodes of the current node. If the counts differ, collect and compare the text of all child nodes under the current node; if the aggregated text differs, record the change type of the content area under the current node as a structure change and increment its change count.
If the child counts match, compare the text content of each node one by one. If the text differs, record the change type as a text change and increment the change count; if the text matches, continue comparing all child nodes of the current node.
After the comparison, if no content change area is found, filter the collected page directly. Otherwise, generate the content change area path list for the page, merge it into the diffPathList of the corresponding DiffRes in the mapping table, increment compareCount, update rawHtml to the source of the collected page, recompute the SimHash value of the new source, and compare it with the previous one; if the Hamming distance is less than 2, increment similarCount and update lastSimhash with the new SimHash value. When compareCount reaches the comparison count threshold, if similarCount is greater than or equal to the similarity count threshold, set directFilter to true.
A pre-filtering mechanism based on URL features is thus introduced before SimHash computation. For pages with a given URL feature, if no content change area is found, or the SimHash Hamming distances stay below 2, the feature is specially marked. Feature links whose DiffRes directFilter flag is true in the mapping table can be filtered directly, without downloading the page or recomputing similarity, improving efficiency and resource utilization.
Further, step 4 includes:
Step 4-1: download the content of the collected page using a headless browser.
Step 4-2: parse the downloaded page content.
Step 4-3: compute weights for the content change areas recorded in the mapping table under the same URL parameter feature tag.
Step 4-4: traverse all content change areas, locate the corresponding DOM elements in the page content parsed in step 4-2 using XPath locators, and extract their text content.
Step 4-5: compute the SimHash value of the collected page from the content change area weights and the corresponding text content.
Further, step 4-3 includes: given n content change areas, compute the weight Wi of the i-th area as
Wi = (Ci / M) * (1 + log(Ci))
where n > 0, 1 ≤ i ≤ n, Ci is the number of changes of the i-th content change area, and M is the comparison count threshold.
This adaptive weight formula balances the importance of change frequency against its marginal effect: it highlights important change areas without letting frequent changes dominate, providing a reasonable weighting basis for SimHash computation and improving the system's ability to recognize substantive content changes.
Further, step 4-5 includes: Step 4-5-1, create a 64-element integer array bits[64] initialized to 0; initialize the total weight sumWeight = 0 and the maximum weight maxWeight = 0.
Step 4-5-2, calculating a weighting characteristic for each content variation area to obtain a combined weight accumulated by all the content variation areas:
Use a word segmentation tool to segment the text content of the i-th content change area, obtaining the word set Ti;
For each word t in Ti, compute its 64-bit hash value h, word weight w and combined weight cw, and update the total weight sumWeight, the maximum weight maxWeight and the integer array bits[64];
Here cw = round(w * Wi); sumWeight is updated to sumWeight + cw; maxWeight is updated to max(maxWeight, cw). Traverse bits 0 to 63 of the hash value h: if the j-th bit of h is 1, update bits[j] to bits[j] + cw, otherwise to bits[j] - cw, where j is the index, 0 ≤ j ≤ 63.
Step 4-5-3, generating SimHash values of the acquired web pages based on the combined weights accumulated in all content variation areas:
Initialize an empty string simHash and traverse positions 0 to 63 of bits[64]: if bits[k] > 0, append 1 to simHash, otherwise append 0, where k is the index, 0 ≤ k ≤ 63.
The application introduces the concept of a document region structure and assigns different weights to different regions. When computing the SimHash, word weights are combined with region weights to form combined weights. This more accurately reflects the structural characteristics and content importance of the page and improves the expressiveness and discrimination of the SimHash.
Further, step 5 includes creating L index tables, where L evenly divides the SimHash string; the L index tables store the L segments of the SimHash values of historically collected pages.
Split the SimHash value of the collected page into L equal segments, look up each segment in its corresponding index table, and collect the page IDs of matching values to form a candidate page list.
Obtain the SimHash value of each page in the candidate list and compute the Hamming distance between the collected page's SimHash and each candidate's. If the Hamming distance is greater than or equal to a preset similarity threshold, the two pages are judged dissimilar: the L segments of the collected page's SimHash are stored in the corresponding index tables, and its SimHash value and URL are recorded. Otherwise, the two pages are judged similar and the collected page's SimHash value and URL are not stored.
Further, the rule-based tag replacement of URL parameter names, parameter values and the path in step 2 includes: replacing parameter names that exceed a fixed length with a first tag.
For parameter names within the fixed length, if the parameter name and/or value contains digits, replace the digits with a second tag.
Replace Chinese text in parameter values and/or the path with a third tag.
Replace path segments containing Unicode characters with a fourth tag.
Replace times in parameter values with a fifth tag.
Replacing digits, times, over-long strings and similar content with uniform tags effectively reduces parameter diversity. Based on the assumption that sub-pages with the same URL features under the same site usually share the same page structure, the text content that deserves attention is identified and template content is ignored, improving deduplication efficiency and accuracy.
The beneficial effect of the structure-aware web page deduplication method is to improve the accuracy and efficiency of the prior art when handling sites built from similar templates. By analyzing the DOM tree of a page, identifying its regions and assigning weights by regional importance, the method extracts text content accurately and performs weighted similarity computation, achieving higher-precision deduplication. Combined with URL parameter feature tagging and SimHash optimization, deduplication efficiency and accuracy are further improved, so the method can process large-scale web page data efficiently and accurately, ultimately reducing storage cost and server load.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings.
Fig. 1 is a flow chart of a web page deduplication method based on structure awareness according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings.
Related terms are explained as follows.
Headless browser (Headless Browser): a browser without a graphical user interface whose behavior, such as loading pages, executing JavaScript and capturing screenshots, can be controlled programmatically. Commonly used for web page testing and content extraction.
DOM tree (Document Object Model Tree): the tree-shaped data structure produced by parsing an HTML or XML document, representing the elements, attributes and text content of the document. The DOM tree is the basis for web page structure analysis.
XPath (XML Path Language): a language for locating nodes in an XML document, also commonly used to locate nodes in HTML documents.
SimHash (Similarity Hash): a technique that converts text content into a fixed-length fingerprint for fast similarity computation between texts. The closer two SimHash values, the more similar the texts they represent.
Hamming distance (Hamming Distance): a measure of the difference between two strings of equal length, namely the number of positions at which the corresponding characters differ.
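For concreteness, a minimal Java sketch of this distance over the bit-string SimHash fingerprints used throughout this method (an illustrative helper, not code from the patent):

    // Hamming distance between two equal-length SimHash bit strings,
    // e.g. "10110..." vs "10010...": the count of differing positions.
    public static int hammingDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("fingerprints must have equal length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) distance++;
        }
        return distance;
    }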
The structure-aware web page deduplication method can be applied in many scenarios that require deduplication: search engines, to improve result quality and reduce redundancy; news aggregation, to filter duplicate news and provide more comprehensive coverage; network content monitoring, to identify duplicated information and improve analysis efficiency; e-commerce platforms and knowledge Q&A, to recommend non-duplicated information and increase user stickiness; and data analysis and mining fields such as market research and user profiling, where deduplicated web page data yields more accurate and reliable results.
The embodiment of the application discloses a webpage deduplication method based on structure perception, which comprises the following steps as shown in fig. 1.
Step 1: web page collection.
The method uses efficient web content extraction combined with headless browser rendering to collect web pages within the same domain in batches. To improve collection efficiency, extraction supports multithreading and asynchronous requests and simulates a real browser environment, e.g. by setting User-Agent and Referer headers and generating a browser fingerprint. During collection, the page URL, the rendered HTML source, the HTTP status code, and metadata such as collection time are recorded, providing a complete data basis for subsequent deduplication analysis.
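As an illustration, collection via a headless browser might look like the following Selenium-based sketch; Selenium and the specific options are assumptions, since the patent does not name a particular automation library:

    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;

    public class PageCollector {
        // Fetch the rendered HTML of a URL with a headless Chrome instance.
        public static String fetchRenderedHtml(String url) {
            ChromeOptions options = new ChromeOptions();
            options.addArguments("--headless=new");
            // Simulate a real browser environment, e.g. via a User-Agent header.
            options.addArguments("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
            WebDriver driver = new ChromeDriver(options);
            try {
                driver.get(url);
                return driver.getPageSource(); // post-rendering HTML source
            } finally {
                driver.quit();
            }
        }
    }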
Step 2: generate URL parameter feature tags.
This step parses each collected URL and extracts the parameter part (e.g. query and paging parameters of a search page). A URL parameter feature tag is then generated from the pattern of parameter names and values. Replacing digits, dates and over-long strings with uniform tags effectively reduces the diversity of parameter names, values and paths, facilitating subsequent feature extraction and comparison. Concretely: perform tag replacement on parameter names, parameter values and the path according to rules, re-splice the tagged parts into one string, and compute the string's unique request ID, which serves as the URL parameter feature tag.
The rule-based tag replacement works as follows. Parameter names exceeding a fixed length are replaced with the first tag {{long}}; the fixed length may be taken as 32 characters.
For parameter names within the fixed length, digits in the parameter name and/or value are replaced with the second tag {{number}}.
Chinese text in parameter values and/or the path is replaced with the third tag {{chinese}}.
Path segments containing Unicode characters are replaced with the fourth tag {{unicode}}.
Times in parameter values are replaced with the fifth tag {{time}}.
In addition, purely upper-case parts of parameter values and/or the path are replaced with the tag {{upper}}, and purely lower-case parts with {{lower}}.
URL-encoded strings in parameter values and/or the path are replaced with the tag {{urlencode}}.
Boolean parts of parameter values are replaced with the tag {{bool}}.
2.1 Feature tagging process
To generalize URLs that meet certain characteristics, the present embodiment tags the various components of the URL.
2.1.1 The parameter names are marked.
URL parameter names are tagged so that structurally similar parameters fall into the same class, reducing the diversity of parameter names and improving the efficiency and accuracy of URL feature extraction. All parameter names and values are traversed: purely alphabetic parameter names of at most 32 characters are left unchanged; parameter names longer than 32 characters are replaced with {{long}}, indicating an over-long name; digits within parameter names are replaced with {{number}} using a regular expression.
Examples:
The parameter name "keyword" is not processed and remains as is.
The parameter name "productName123" would be replaced with "productName{{number}}".
The parameter name "aVeryVeryVeryLongParameterName" would be replaced with "{{long}}".
2.1.2 The parameter values are marked.
Depending on the type of the parameter value, a different tag is substituted. Examples:
The numeric parameter value "123" is tagged "{{number}}".
A Chinese parameter value such as "你好" ("hello") is tagged "{{chinese}}".
2.1.3 The path is marked.
The path is split into segments by "/" and each segment is tagged: purely numeric segments are replaced with the {{number}} tag.
Segments containing Chinese are tagged {{chinese}}.
Segments containing Unicode characters are tagged {{unicode}}.
Finally, all tagged segments are re-joined into one string, which serves as the tagged path.
2.1.4 The tagged unique request ID is calculated.
The parameter names, parameter values and path segments matched in the original URL are replaced by their corresponding feature tags, and the resulting feature template is hashed (e.g. with MD5) to obtain a 32-character unique request ID, which serves as the URL parameter feature tag.
For example, for the URL "https://www.example.com/path/api/search?page=2&begin=2024-09-21", the unique request ID can be computed as:
MD5("https://www.example.com/path/api/search?page={{number}}&begin={{time}}")
2.2 Example
Assume there are two URLs:
URL1: http://www.test.com/search?name=张三&age=21
URL2: http://www.test.com/search?name=李四&age=24
After feature tagging, both correspond to the same unique request ID:
MD5("http://www.test.com/search?name={{chinese}}&age={{number}}")
In this way, URLs with the same features are grouped together, facilitating subsequent page structure analysis and deduplication.
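A minimal Java sketch of the tagging-plus-hashing pipeline, under the simplifying assumption that only the {{long}}, {{number}}, {{chinese}} and {{time}} rules are applied and that parameters arrive in a fixed order (class and helper names are illustrative, not from the patent):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class UrlFeatureTagger {

        // Tag a parameter name: names over 32 chars become {{long}};
        // digit runs inside shorter names become {{number}}.
        static String tagParamName(String name) {
            if (name.length() > 32) return "{{long}}";
            return name.replaceAll("\\d+", "{{number}}");
        }

        // Tag a parameter value: dates become {{time}}, Chinese text {{chinese}},
        // remaining digit runs {{number}} (a simplified subset of the rules).
        static String tagParamValue(String value) {
            if (value.matches("\\d{4}-\\d{2}-\\d{2}.*")) return "{{time}}";
            if (value.matches(".*[\\u4e00-\\u9fa5].*")) return "{{chinese}}";
            return value.replaceAll("\\d+", "{{number}}");
        }

        // Splice the tagged parts back together and MD5 them into the
        // 32-character unique request ID used as the URL feature tag.
        public static String uniqueRequestId(String base, Map<String, String> params) throws Exception {
            StringBuilder templated = new StringBuilder(base).append('?');
            boolean first = true;
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (!first) templated.append('&');
                templated.append(tagParamName(e.getKey())).append('=').append(tagParamValue(e.getValue()));
                first = false;
            }
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(templated.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            Map<String, String> params = new LinkedHashMap<>();
            params.put("name", "张三");
            params.put("age", "21");
            // Both example URLs from section 2.2 map to the same ID.
            System.out.println(uniqueRequestId("http://www.test.com/search", params));
        }
    }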
Step 3: construct the mapping table of URL feature information and difference analysis results, perform initialization comparison and filtering on the collected pages, and update the mapping table.
3.1 Construct the URL feature information and difference analysis result mapping table uniqueMarkedMap.
A Map data structure is used to store and manage URL feature information and its difference analysis results. This structure is called the URL feature information and difference analysis result mapping table (hereinafter, the mapping table) and is implemented as follows.
3.1.1. Mapping table definition.
Map<String, DiffRes> uniqueMarkedMap
3.1.2. Mapping table key value pair description.
Key: the unique request ID uniqueMarkedId computed after tagging; type String.
Value: a DiffRes object containing the content change information of pages with that feature.
3.1.3. DiffRes object structure.
public class DiffRes {
    private boolean directFilter;       // whether the page can be filtered directly
    private int compareCount;           // cumulative comparison count, capped at 20
    private int similarCount;           // similar results among the 20 comparisons
    private String lastSimhash;         // last SimHash (standard algorithm, unweighted)
    private String rawHtml;             // last collected page source
    private Set<DiffPath> diffPathList; // content change area path set
}
3.1.4. DiffPath object structure.
public class DiffPath {
    private String pathType; // change type
    private String path;     // XPath of the element
    private Integer count;   // number of times the element changed
}
3.1.5. PathType values are described.
"TEXT_DIFF" means TEXT change
"Structure_diff" means page STRUCTURE variation
After tagging the currently collected sub-link in step 2, the corresponding uniqueMarkedId can be computed and used to check whether the mapping table uniqueMarkedMap already holds a comparison result.
3.2 There is already a corresponding comparison result.
First check the directFilter field of the comparison result:
If directFilter is true, the link is filtered directly, without downloading or computing similarity.
If directFilter is false, check whether the comparison count has reached the comparison count threshold (20 in this embodiment).
If the threshold has been reached, initialization comparison is complete; proceed to step 4.
If not, continue with 3.3 to compare and update the comparison result.
3.3 There is no corresponding comparison result or a need to continue the comparison.
If uniqueMarkedMap holds no comparison result for the current uniqueMarkedId, this is the first page collected for this feature. In that case:
1) A difference analysis entry for the feature is initialized in uniqueMarkedMap.
2) The web page source code is saved to rawHtml attribute of DiffRes.
3) The SimHash value of the current source is computed using the standard SimHash algorithm and saved to the lastSimhash attribute of DiffRes.
When the web pages with the same characteristics are acquired again later, the following steps are executed:
1) The newly collected source is compared with the rawHtml saved in DiffRes (i.e. the previously collected source), and the difference paths are recorded in the diffPathList of DiffRes.
2) rawHtml is updated to the newly collected source.
3) The SimHash value of the new source is recomputed and compared with the previous one. If the Hamming distance is less than 2, similarCount in DiffRes is incremented by 1.
4) lastSimhash is updated with the SimHash value of the new source.
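Before turning to the fine-grained comparison procedure below, the bookkeeping in steps 1)-4) can be sketched as follows; the getter/setter names and the diffDomTrees() and computeSimHash() helpers are assumptions, not code from the patent:

    // Bookkeeping on re-acquisition of a page whose feature tag already has a DiffRes.
    void updateOnReacquisition(DiffRes res, String newHtml) {
        // 1) diff the new source against the previously saved source
        res.getDiffPathList().addAll(diffDomTrees(res.getRawHtml(), newHtml));
        // 2) keep the latest source as the next comparison sample
        res.setRawHtml(newHtml);
        // 3) recompute SimHash and compare with the previous one
        String newSimhash = computeSimHash(newHtml);
        if (hammingDistance(newSimhash, res.getLastSimhash()) < 2) {
            res.setSimilarCount(res.getSimilarCount() + 1);
        }
        // 4) remember the new SimHash for the next round
        res.setLastSimhash(newSimhash);
        res.setCompareCount(res.getCompareCount() + 1);
    }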
The specific comparison procedure is as follows.
A) DOM structure comparison
Select the page most recently collected in the same feature class (same URL parameter feature tag) as the comparison sample.
Compare the current page source with the selected sample to identify content change areas.
B) Breadth-first traversal
Starting from the root node of the DOM tree, adopting a breadth-first traversal mode.
The structure and content of the nodes are compared layer by layer.
C) Node difference identification
C.i) Structural difference: compare the number of child nodes of the current node.
If the counts differ, collect and compare the text of all child nodes under the current node. If the aggregated text differs, record the area as STRUCTURE_DIFF (structure change) and increment the change count.
C.ii) Text difference: if the child counts match, compare the ownText content of each node one by one. If ownText differs, record the area as TEXT_DIFF (text change) and increment the change count. If ownText matches, continue the deep comparison over all child nodes of the node.
D) Generating a content change area path list
After the comparison, if no content change area is found, the collected page is filtered directly; otherwise the content change area path list for this feature page is generated.
The path list is bound to the current URL feature tag for processing subsequent pages of the same class.
E) Updating uniqueMarkedMap
The generated comparison result (including the content change area path list and the special flag) is stored in uniqueMarkedMap.
The current uniqueMarkedId is used as the key, ensuring fast subsequent retrieval.
F) Comparison count management
After each comparison, compareCount is incremented. If the Hamming distance to the previous SimHash is less than 2, similarCount is also incremented.
F.1 If, when compareCount reaches the threshold, similarCount is greater than or equal to the similarity count threshold (which may be set to 18 in this embodiment), directFilter in DiffRes is set to true, indicating that pages with this feature are highly similar.
The initialization comparison stage is then marked complete, and step 4 is executed.
F.2 If the comparison count is below 20, the current progress is saved and comparison continues on the next collection.
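A Jsoup-based sketch of the breadth-first DOM comparison in steps A)-C); the DiffPath constructor and the use of cssSelector() in place of a true XPath builder are assumptions:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Element;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    public class DomDiff {
        // Breadth-first comparison of two DOM trees, returning changed regions.
        public static List<DiffPath> diff(String oldHtml, String newHtml) {
            List<DiffPath> diffs = new ArrayList<>();
            Queue<Element[]> queue = new ArrayDeque<>();
            queue.add(new Element[]{Jsoup.parse(oldHtml).body(), Jsoup.parse(newHtml).body()});
            while (!queue.isEmpty()) {
                Element[] pair = queue.poll();
                Element a = pair[0], b = pair[1];
                if (a.children().size() != b.children().size()) {
                    // structural difference: compare the aggregated child text
                    if (!a.text().equals(b.text())) {
                        diffs.add(new DiffPath("STRUCTURE_DIFF", b.cssSelector(), 1));
                    }
                } else if (!a.ownText().equals(b.ownText())) {
                    // text difference on the node itself
                    diffs.add(new DiffPath("TEXT_DIFF", b.cssSelector(), 1));
                } else {
                    // structures and own text match: descend level by level
                    for (int i = 0; i < a.children().size(); i++) {
                        queue.add(new Element[]{a.child(i), b.child(i)});
                    }
                }
            }
            return diffs;
        }
    }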
Step 4: if the collected page has not been filtered, extract weighted areas from it according to the mapping table and calculate its SimHash value.
This step processes the collected page and computes similarity based on the content change area information obtained for the corresponding feature pages in step 3.
4.1 Web page collection.
Download the page content for the current URL using the headless browser.
If the download fails or times out, record an error and end the current process.
4.2 Content extraction.
The downloaded page content is parsed with an HTML parser (Jsoup).
The text content of the page is extracted, removing HTML tags and scripts.
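A minimal Jsoup sketch of this step (the selector list is illustrative):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Parse the downloaded page and strip scripts/styles, keeping plain text.
    Document doc = Jsoup.parse(html);
    doc.select("script, style, noscript").remove();
    String textContent = doc.text(); // tag-free text of the page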
4.3 Change area weight calculation.
Record the n content change areas and compute the weight Wi of the i-th area:
Wi = (Ci / M) * (1 + log(Ci))
where n > 0, 1 ≤ i ≤ n; Ci is the number of changes of the i-th content change area; M is the total comparison count (the comparison count threshold, at most 20).
The design of the weight formula balances the importance of change frequency against its marginal effect: it highlights important change areas without exaggerating the influence of frequent changes, providing a reasonable weighting basis for SimHash computation. This helps the system identify substantive content changes while remaining sensitive to subtle ones.
Assume the total comparison count M = 20 (the maximum) and compute weights in some typical scenarios,
using the formula Wi = (Ci / M) * (1 + log(Ci)), where log is the natural logarithm (this matches the worked values below).
Example 1 Low frequency variation
Ci=1 (only once changed)
Wi = (1 / 20) * (1 + log(1)) = 0.05 * 1 = 0.05
Example 2 Medium frequency variation
Ci = 5
Wi = (5 / 20) * (1 + log(5)) ≈ 0.25 * 2.61 ≈ 0.65
Example 3 high frequency variation
Ci = 10
Wi = (10 / 20) * (1 + log(10)) ≈ 0.5 * 3.30 ≈ 1.65
Example 4 very high frequency variation
Ci = 15
Wi = (15 / 20) * (1 + log(15)) ≈ 0.75 * 3.71 ≈ 2.78
Example 5 extreme case (maximum number of changes)
Ci = 20
Wi = (20 / 20) * (1 + log(20)) = 1 * 3.99 ≈ 3.99
This design ensures that frequently changing areas receive higher weights, reflecting that they may contain more important or fresher information.
At the same time the growth rate of the weight is controlled, preventing excessive impact in extreme cases.
Low-frequency changes, although weighted less, are still considered, preserving the ability to capture occasional but possibly important changes.
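The formula and the worked values above can be reproduced with a one-line Java helper (Math.log is the natural logarithm, matching the examples):

    // Adaptive area weight: Wi = (Ci / M) * (1 + ln(Ci)).
    static double regionWeight(int ci, int m) {
        return ((double) ci / m) * (1 + Math.log(ci));
    }
    // regionWeight(1, 20)  = 0.05   (Example 1)
    // regionWeight(10, 20) ≈ 1.65   (Example 3)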
4.4 Weighted feature extraction.
A) Traverse all content change areas.
B) For each area, locate the corresponding DOM element using its XPath locator and extract the element's text content. In step 4.5 this text is weighted by the area weight Wi.
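For example, with Jsoup 1.14.3 or later the recorded XPath can be evaluated directly (the path below is hypothetical):

    import org.jsoup.select.Elements;

    // Locate one recorded change area by its XPath and pull its text.
    Elements region = doc.selectXpath("/html/body/div[2]/div[1]"); // hypothetical recorded path
    String regionText = region.text();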
4.5 Calculate the SimHash value of the collected page from the content change area weights and corresponding text content.
A) Input preparation, namely preparing a text region with weight as input.
Inputting a document D, which is composed of n text regions, each region Ri having a corresponding weight Wi
D = {(R1, W1), (R2, W2), ..., (Rn, Wn)}
B) Initialize the necessary variables and data structures. As in conventional SimHash, create a 64-element integer array bits[64] initialized to 0, with sumWeight = 0 (total weight) and maxWeight = 0 (maximum weight).
C) Text region processing, namely processing each text region and calculating weighting characteristics.
For i = 1 to n:
    Segment Ri with the word segmentation tool to obtain the word set Ti.
    For each word t in Ti:
        Compute the word hash: h = hash(t)              // 64-bit hash value
        Compute the word weight: w = wordWeight(t)      // weight function based on word features
        Compute the combined weight: cw = round(w * Wi) // rounded to an integer
        Update the total weight: sumWeight += cw
        Update the maximum weight: maxWeight = max(maxWeight, cw)
        Update the integer array bits[64]: for index j = 0 to 63:
            if the j-th bit of h is 1:
                bits[j] += cw
            otherwise:
                bits[j] -= cw
The word-feature-based weight function determines word weights by the following rules.
Empty or blank words: weight 0.
Single-character words: weight 1. Because the hash of a single-character word has a limited number of effective bits (about 40), an excessive weight tends to zero out the high-order part of the SimHash value and reduce discrimination.
Multi-character words: words beginning with CJK characters and other words are treated differently, by checking whether the Unicode code point of the first character is at least 0x3040 (taken here as the start of the CJK character ranges).
Words of length 2 beginning with a CJK character: weight 8.
Words of length greater than 2 beginning with a CJK character: weight 16.
Words of length 2 not beginning with a CJK character: weight 2.
Words of length greater than 2 not beginning with a CJK character: weight 4.
This weight distribution strategy gives CJK words higher weight, as they typically carry more important information in the text. Words of different lengths also receive different weights; longer words generally carry stronger semantic information and therefore weigh more.
D) Generate the SimHash value: the final SimHash value of the collected page is generated from the accumulated combined weights.
As in conventional SimHash, briefly:
Initialize an empty string simHash
For index k = 0 to 63:
    if bits[k] > 0:
        simHash += "1"
    otherwise:
        simHash += "0"
Return the result.
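Putting steps B)-D) together, a self-contained Java sketch of the weighted SimHash follows; tokenize() (a whitespace split) and hash64() (FNV-1a) are stand-ins for the unspecified segmentation tool and word hash, and the sumWeight/maxWeight bookkeeping from step B) is omitted since it does not affect the fingerprint:

    import java.util.List;

    public class WeightedSimHash {

        // A text region with its area weight Wi.
        static class Region {
            final String text;
            final double weight;
            Region(String text, double weight) { this.text = text; this.weight = weight; }
        }

        // Weighted 64-bit SimHash over the change regions.
        public static String compute(List<Region> regions) {
            long[] bits = new long[64];
            for (Region r : regions) {
                for (String t : tokenize(r.text)) {
                    long h = hash64(t);
                    long cw = Math.round(wordWeight(t) * r.weight); // combined weight
                    for (int j = 0; j < 64; j++) {
                        bits[j] += ((h >>> j) & 1L) == 1L ? cw : -cw;
                    }
                }
            }
            StringBuilder simHash = new StringBuilder(64);
            for (int k = 0; k < 64; k++) simHash.append(bits[k] > 0 ? '1' : '0');
            return simHash.toString();
        }

        // Word weight per the rules above: single chars 1, CJK-leading words 8/16,
        // other multi-character words 2/4, empty strings 0.
        static int wordWeight(String t) {
            if (t == null || t.isEmpty()) return 0;
            if (t.length() == 1) return 1;
            boolean cjk = t.charAt(0) >= 0x3040;
            if (t.length() == 2) return cjk ? 8 : 2;
            return cjk ? 16 : 4;
        }

        // Placeholder segmentation: whitespace split; a real deployment would use
        // a CJK-capable word segmentation tool.
        static String[] tokenize(String text) {
            return text.trim().split("\\s+");
        }

        // 64-bit FNV-1a hash as a stand-in for the unspecified word hash.
        static long hash64(String t) {
            long h = 0xcbf29ce484222325L;
            for (char c : t.toCharArray()) {
                h ^= c;
                h *= 0x100000001b3L;
            }
            return h;
        }
    }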
Step 5: judge from the SimHash value and URL of the collected page whether a similar page exists; if so, perform deduplication.
A) Create index tables.
L index tables are created, where L evenly divides the SimHash string; in this embodiment L may take the value 4. The L index tables store the L segments of the SimHash values of historically collected pages.
B) Candidate web pages are screened.
Split the SimHash value of the collected page into L equal segments, look each segment up in its corresponding index table, and collect the page IDs of identical values into a candidate page list.
C) Similarity calculation.
Traverse each page ID in the candidate page list.
Retrieve the complete SimHash information of the corresponding page.
Compute the Hamming distance between the SimHash of the collected page and each candidate's SimHash as the degree of difference.
D) Similar page confirmation.
If the Hamming distance is greater than or equal to the preset similarity threshold, the two pages are judged dissimilar: the L segments of the collected page's SimHash value are stored in the corresponding index tables, and the page's SimHash value and URL are recorded. If the Hamming distance is below the threshold, the two pages are considered similar, and the collected page's SimHash value and URL are not stored. In this embodiment the preset similarity threshold may be set to 4.
The number of hits on similar pages may also be recorded for later cleaning of the index.
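Steps A)-D) can be sketched as a small in-memory index (a minimal illustration with 64-bit fingerprints; class and method names are assumptions):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class SimHashIndex {
        private final int l;                                 // number of index tables (4 here)
        private final int segLen;                            // 64 / l bits per segment
        private final List<Map<String, List<Long>>> tables;  // segment -> page IDs
        private final Map<Long, String> fingerprints;        // page ID -> full SimHash

        public SimHashIndex(int l) {
            this.l = l;
            this.segLen = 64 / l;
            this.tables = new ArrayList<>();
            for (int i = 0; i < l; i++) tables.add(new HashMap<>());
            this.fingerprints = new HashMap<>();
        }

        // Returns true if a similar page exists (Hamming distance < threshold);
        // otherwise indexes the new page and returns false.
        public boolean isDuplicateOrInsert(long pageId, String simHash, int threshold) {
            Set<Long> candidates = new HashSet<>();
            for (int i = 0; i < l; i++) {
                String seg = simHash.substring(i * segLen, (i + 1) * segLen);
                candidates.addAll(tables.get(i).getOrDefault(seg, List.of()));
            }
            for (long candidate : candidates) {
                if (hamming(simHash, fingerprints.get(candidate)) < threshold) {
                    return true; // similar page found: do not store
                }
            }
            for (int i = 0; i < l; i++) {
                String seg = simHash.substring(i * segLen, (i + 1) * segLen);
                tables.get(i).computeIfAbsent(seg, k -> new ArrayList<>()).add(pageId);
            }
            fingerprints.put(pageId, simHash);
            return false;
        }

        private static int hamming(String a, String b) {
            int d = 0;
            for (int i = 0; i < a.length(); i++) if (a.charAt(i) != b.charAt(i)) d++;
            return d;
        }
    }

Note on the design: with a similarity threshold of 4 and L = 4 segments of a 64-bit fingerprint, any pair at Hamming distance 3 or less must agree on at least one full segment, so the segment lookup cannot miss a near-duplicate.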
The application provides a structure-aware web page deduplication method that effectively distinguishes page structure, reduces template influence, and improves deduplication accuracy and efficiency. It can:
effectively identify pages with identical or similar content even when URLs or page structures differ;
reduce the influence of site template content on similarity computation, improving deduplication accuracy;
accurately identify pages whose body text is similar but whose template content differs.
Compared with the traditional SimHash-based deduplication technology, the application has the following advantages:
1. Structure-aware accurate deduplication.
Conventional approach: conventional SimHash methods generally treat the entire page content as a whole, ignoring page structure and the differing importance of regions.
This application: a document region structure and a region weighting mechanism are introduced. Based on the regularity that pages with the same URL features have similar structures, the content difference frequency of different areas is computed by batch analysis and comparison of pages with the same URL features (fixed areas such as navigation bars, headers and footers change little, while core areas such as titles and body text differ markedly). Differentiated weights are then assigned to the areas so that the SimHash more accurately reflects the structural characteristics and content value of the page. This effectively separates template content from core content and markedly improves deduplication of pages that share a structure but differ in substance.
2. Adaptive content weights.
Conventional approach: conventional SimHash generally uses a fixed weighting scheme, which adapts poorly to the content change characteristics of different pages.
This application: the adaptive weight formula Wi = (Ci / M) * (1 + log(Ci)) dynamically adjusts weights according to how often different areas of a page change, highlighting important change areas without letting frequent changes dominate. This strengthens the ability to capture substantive content changes while remaining sensitive to subtle ones, making deduplication results more useful in practice.
3. Efficient prefiltering mechanisms.
Conventional approach: conventional SimHash techniques typically download and process all page content before computing SimHash values for comparison, which is inefficient on large-scale collections.
This application: a pre-filtering mechanism based on URL features is introduced. By analyzing URL structure and parameters, potential duplicates can be judged preliminarily before downloading page content. For pages with a given URL feature, if no content change area is found, or the SimHash Hamming distances stay below the threshold, filtering can be done directly without downloading or further processing. This pre-filtering effectively reduces the number of pages to process and improves the efficiency and resource utilization of deduplication.
In a concrete implementation, the application also provides a computer storage medium and a corresponding data processing unit. The storage medium stores a computer program which, when executed by the data processing unit, can perform some or all of the steps of the structure-aware web page deduplication method provided above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random-access memory (RAM), or the like.
It will be apparent to those skilled in the art that the technical solutions in the embodiments of the invention can be implemented by a computer program on a general hardware platform. Based on this understanding, the technical solutions may be embodied essentially as a software product stored on a storage medium, comprising several instructions that cause a device containing a data processing unit (a personal computer, server, single-chip microcomputer (MCU), network device, etc.) to perform the methods described in the embodiments of the invention, or parts thereof.
The invention provides a web page deduplication method based on structure perception, and there are many ways to implement the technical solution; the above is only a preferred embodiment. It should be noted that those skilled in the art can make improvements and modifications without departing from the principle of the invention, and these are also considered within its scope of protection. Components not explicitly described in this embodiment can be implemented with existing technology.

Claims (7)

Translated from Chinese
1.一种基于结构感知的网页去重方法,其特征在于,包括以下步骤:1. A web page deduplication method based on structure perception, characterized in that it comprises the following steps:步骤1,网页采集;Step 1, web page collection;步骤2,对采集网页的URL进行解析,提取URL中参数部分,根据参数名称和值的模式,生成URL参数特征标签;Step 2: parse the URL of the collected web page, extract the parameter part in the URL, and generate the URL parameter feature tag according to the pattern of the parameter name and value;步骤3,构建URL特征信息及差异分析结果映射表,对采集网页进行初始化对比和过滤,更新URL特征信息及差异分析结果映射表;Step 3: construct a URL feature information and difference analysis result mapping table, perform initial comparison and filtering on the collected web pages, and update the URL feature information and difference analysis result mapping table;步骤4,若采集网页未被过滤,根据URL特征信息及差异分析结果映射表对采集网页提取加权区域,计算采集网页的SimHash值;Step 4: If the collected web page is not filtered, extract the weighted area of the collected web page according to the URL feature information and the difference analysis result mapping table, and calculate the SimHash value of the collected web page;步骤5,根据采集网页的SimHash值和网页URL,判断采集网页是否存在相似网页,若存在相似网页,进行去重处理;Step 5: Based on the SimHash value and URL of the collected webpage, determine whether there are similar webpages on the collected webpage. If there are similar webpages, perform deduplication processing;步骤4包括:步骤4-1,使用无头浏览器下载采集网页的网页内容;Step 4 includes: step 4-1, using a headless browser to download web page content of the collected web page;步骤4-2,解析下载的网页内容;Step 4-2, parsing the downloaded web page content;步骤4-3,对映射表中相同URL参数特征标签对应的内容变动区域,计算权重;Step 4-3, calculating the weight of the content change area corresponding to the same URL parameter feature tag in the mapping table;步骤4-4,遍历所有内容变动区域,使用XPath定位器查找步骤4-2中解析后的网页内容对应的DOM元素,提取所述DOM元素的文本内容;Step 4-4, traverse all content change areas, use the XPath locator to find the DOM element corresponding to the web page content parsed in step 4-2, and extract the text content of the DOM element;步骤4-5,根据内容变动区域权重和对应的文本内容,计算采集网页的SimHash值;Step 4-5, calculate the SimHash value of the collected web page according to the content change area weight and the corresponding text content;步骤4-3包括:记包括n个内容变动区域,计算第i个内容变动区域的权重Wi:Step 4-3 includes: recording n content change regions, and calculating the weight Wi of the i-th content change region:Wi= (Ci / M) * (1 + log(Ci))Wi = (Ci / M) * (1 + log(Ci))其中,n > 0,1 ≤ i ≤ n,Ci表示第i个内容变动区域的变动次数,M表示对比次数阈值;Where n > 0, 1 ≤ i ≤ n, Ci represents the number of changes in the i-th content change area, and M represents the comparison number threshold;步骤4-5包括:步骤4-5-1,创建64位整数数组bits[64],初始化为0;初始化总权重sumWeight = 0,最大权重maxWeight = 0;Step 4-5 includes: Step 4-5-1, creating a 64-bit integer array bits[64], initializing it to 0; initializing the total weight sumWeight = 0, the maximum weight maxWeight = 0;步骤4-5-2,对每个内容变动区域,计算加权特征,获得所有内容变动区域累加的组合权重:Step 4-5-2, for each content change area, calculate the weighted features and obtain the cumulative combined weight of all content change areas:使用分词工具对第i个内容变动区域的文本内容进行分词,获得词语集合Ti;Use the word segmentation tool to segment the text content of the i-th content change area to obtain the word set Ti;对Ti中的每个词语t,计算词语t的64位哈希值h、词语权重w和组合权重cw,更新总权重sumWeight、最大权重maxWeight和整数数组bits[64];For each word t in Ti, calculate the 64-bit hash value h, word weight w and combined weight cw of word t, and update the total weight sumWeight, maximum weight maxWeight and integer array bits[64];其中,组合权重cw = round(w * 
Wi),总权重sumWeight更新为已有总权重sumWeight加上组合权重cw,最大权重maxWeight更新为已有最大权重maxWeight和组合权重cw两者中的最大值;遍历哈希值h的0至63位,如果哈希值h的第j位为1,则bits[j]更新为已有bits[j]加上组合权重cw,否则bits[j]更新为已有bits[j]减去组合权重cw,j表示索引,0 ≤ j≤63;Among them, combined weight cw = round(w * Wi), the total weight sumWeight is updated to the existing total weight sumWeight plus the combined weight cw, and the maximum weight maxWeight is updated to the maximum value of the existing maximum weight maxWeight and the combined weight cw; traverse the 0 to 63 bits of the hash value h, if the j-th bit of the hash value h is 1, then bits[j] is updated to the existing bits[j] plus the combined weight cw, otherwise bits[j] is updated to the existing bits[j] minus the combined weight cw, j represents the index, 0 ≤ j≤63;步骤4-5-3,基于所有内容变动区域累加的组合权重生成采集网页的SimHash值:Step 4-5-3, generate the SimHash value of the collected web page based on the combined weight accumulated from all content change areas:初始化空字符串simHash,遍历整数数组bits[64]的0至63位,如果bits[k] > 0,则在字符串simHash后追加1,否则在字符串simHash后追加0,k表示索引,0 ≤ k≤63。Initialize an empty string simHash, traverse bits 0 to 63 of the integer array bits[64], and if bits[k] > 0, append 1 to the string simHash, otherwise append 0 to the string simHash, where k represents the index, 0 ≤ k≤63.2.根据权利要求1所述的一种基于结构感知的网页去重方法,其特征在于,步骤2包括对URL参数名称、参数值和路径按照规则进行标记替换,将标记替换后的各个部分重新拼接成一个字符串,计算所述字符串的唯一请求ID,所述唯一请求ID作为URL参数特征标签。2. According to the structure-aware web page deduplication method described in claim 1, it is characterized in that step 2 includes replacing the mark of the URL parameter name, parameter value and path according to the rules, re-joining the parts after the mark replacement into a string, and calculating the unique request ID of the string, and the unique request ID is used as the URL parameter feature tag.3.根据权利要求2所述的一种基于结构感知的网页去重方法,其特征在于,步骤3中构建URL特征信息及差异分析结果映射表包括:使用Map数据结构存储和管理URL特征信息及其差异分析结果,映射表uniqueMarkedMap定义为:3. According to the structure-aware web page deduplication method of claim 2, the construction of the URL feature information and difference analysis result mapping table in step 3 comprises: using a Map data structure to store and manage the URL feature information and its difference analysis results, and the mapping table uniqueMarkedMap is defined as:Map<String, DiffRes> uniqueMarkedMapMap<String, DiffRes> uniqueMarkedMap映射表的键Key为URL参数特征标签,即唯一请求ID uniqueMarkedId,类型为String;值Value为DiffRes对象,所述DiffRes对象包括是否直接过滤的标识directFilter、累计对比次数compareCount、累计对比中相似的次数similarCount、上一次比较的SimHash值lastSimhash、上一次的网页源码rawHtml和网页内容变动区域路径集合diffPathList,所述网页内容变动区域路径集合diffPathList包括内容区域变动类型、元素XPath和元素变动次数。The key Key of the mapping table is the URL parameter feature tag, that is, the unique request ID uniqueMarkedId, of type String; the value Value is a DiffRes object, which includes an identifier of whether to directly filter directFilter, a cumulative number of comparisons compareCount, a cumulative number of similarities in the comparisons similarCount, a SimHash value of the last comparison lastSimhash, the last web page source code rawHtml, and a web page content change area path set diffPathList, wherein the web page content change area path set diffPathList includes a content area change type, an element XPath, and an element change number.4.根据权利要求3所述的一种基于结构感知的网页去重方法,其特征在于,步骤3中对采集网页进行初始化对比和过滤,更新URL特征信息及差异分析结果映射表包括:4. 
According to the structure-aware web page deduplication method of claim 3, the method is characterized in that the initial comparison and filtering of the collected web pages in step 3 and the updating of the URL feature information and the difference analysis result mapping table include:根据采集网页的URL参数特征标签在映射表中查找是否存在差异分析结果,若不存在,则表示这是首次采集所述URL参数特征标签的网页,在映射表中初始化所述URL参数特征标签的差异分析表,将网页源码保存到DiffRes的上一次的网页源码rawHtml属性中,使用标准SimHash算法计算当前源码的SimHash值,保存至DiffRes的上一次比较的SimHash值lastSimhash属性中;Check whether there is a difference analysis result in the mapping table according to the URL parameter feature tag of the collected web page. If not, it means that this is the first time to collect the URL parameter feature tag of the web page. Initialize the difference analysis table of the URL parameter feature tag in the mapping table, save the web page source code to the rawHtml attribute of the last web page source code of DiffRes, use the standard SimHash algorithm to calculate the SimHash value of the current source code, and save it to the lastSimhash attribute of the last compared SimHash value of DiffRes;若存在且映射表中是否直接过滤的标识directFilter为true,则判定采集网页重复,直接过滤所述采集网页;若映射表中是否直接过滤的标识directFilter为false,累计对比次数大于等于对比次数阈值,则初始化对比完成,执行步骤4;If it exists and the flag directFilter in the mapping table indicating whether it is directly filtered is true, it is determined that the collected web page is duplicated and the collected web page is directly filtered; if the flag directFilter in the mapping table indicating whether it is directly filtered is false and the cumulative number of comparisons is greater than or equal to the comparison number threshold, the initialization comparison is completed and step 4 is executed;若累计对比次数小于对比次数阈值,则识别采集网页的内容变化区域,生成内容变动区域路径列表,更新至映射表,执行步骤1。If the cumulative comparison times are less than the comparison times threshold, the content change area of the collected web page is identified, a content change area path list is generated, the list is updated to the mapping table, and step 1 is executed.5.根据权利要求4所述的一种基于结构感知的网页去重方法,其特征在于,步骤3中识别采集网页的内容变化区域,生成内容变动区域路径列表,更新至映射表包括:5. A webpage deduplication method based on structure perception according to claim 4, characterized in that in step 3, identifying the content change area of the collected webpage, generating a content change area path list, and updating the mapping table comprises:对比DOM结构,将映射表中具有相同URL参数特征标签的差异分析结果中上一次的网页源码rawHtml对应的网页作为比较样本,从DOM树的根节点开始,采用广度优先遍历的方式逐层比较节点的结构和内容;Compare the DOM structure, take the web page corresponding to the last web page source code rawHtml in the difference analysis result with the same URL parameter feature label in the mapping table as the comparison sample, start from the root node of the DOM tree, and use the breadth-first traversal method to compare the structure and content of the nodes layer by layer;比较当前节点对应的子节点数量,若子节点数量不一致,获取当前节点下所有子节点的文本并进行对比,若汇总文本不一致,将当前节点下的内容区域变动类型记录为结构变动,增加变动次数;Compare the number of child nodes corresponding to the current node. If the number of child nodes is inconsistent, obtain the text of all child nodes under the current node and compare them. If the summary text is inconsistent, record the content area change type under the current node as a structural change and increase the number of changes.若子节点数量一致,逐个比较各节点的文本内容,若文本内容不一致,将当前节点下的内容区域变动类型记录为文本变动,增加变动次数;若文本内容一致,继续比较当前节点的所有子节点;If the number of child nodes is the same, compare the text content of each node one by one. 
5. The structure-aware web page deduplication method of claim 4, wherein identifying the content-change regions of the collected web page, generating the content-change-region path list and updating the mapping table in step 3 comprises:

comparing DOM structures, taking as the comparison sample the page corresponding to the last source rawHtml in the difference-analysis result carrying the same URL parameter feature tag, and comparing node structure and content level by level from the root of the DOM tree using breadth-first traversal;

comparing the number of child nodes of the current node: if the counts differ, collecting and comparing the text of all child nodes under the current node; if the aggregated text differs, recording the change type of the content region under the current node as a structural change and incrementing the change count;

if the child-node counts match, comparing the text content of the nodes one by one: if the text differs, recording the change type of the content region under the current node as a text change and incrementing the change count; if the text matches, continuing to compare all child nodes of the current node;

after the comparison, if no content-change region is found, filtering the collected page directly; otherwise generating the content-change-region path list of the collected page, updating it into the diffPathList of the corresponding DiffRes in the mapping table, incrementing compareCount, updating rawHtml to the source of the collected page, recomputing the SimHash value of the new source and comparing it with the previous one; if the Hamming distance is less than 2, incrementing similarCount by 1 and updating lastSimhash with the SimHash value of the new source; when compareCount reaches the comparison-count threshold, if similarCount is greater than or equal to the similarity-count threshold, setting directFilter to true.

6. The structure-aware web page deduplication method of claim 5, wherein step 5 comprises: creating L index tables, L dividing the SimHash string into equal parts, the L index tables respectively storing the L parts of the SimHash values of historically collected pages;

dividing the SimHash value of the collected page into L equal parts, looking each part up for identical values in the corresponding index table, collecting the page IDs of the matches, and forming a candidate page list;

retrieving the SimHash value of each page in the candidate list and computing the Hamming distance between the SimHash value of the collected page and that of the candidate; if the Hamming distance is greater than or equal to the preset similarity threshold, judging the two pages dissimilar, storing the L parts of the SimHash value of the collected page in the corresponding index tables, and recording the SimHash value and URL of the collected page; otherwise judging the two pages similar, and not saving the SimHash value and URL of the collected page.
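For illustration only, a minimal Java sketch of the claim-6 index tables follows; it is not part of the claims. The class name, the choice L = 4 and the pageId and similarThreshold parameters are assumptions. With a 64-bit fingerprint split into four 16-bit parts, any pair within Hamming distance 3 must agree on at least one part, so the part lookup finds every candidate when the similarity threshold is at most L.

    import java.util.*;

    final class SimHashIndex {
        static final int L = 4;                                         // number of index tables (assumed)
        final List<Map<String, List<Long>>> tables = new ArrayList<>(); // part value -> page IDs
        final Map<Long, String> simhashById = new HashMap<>();          // page ID -> full SimHash string

        SimHashIndex() { for (int i = 0; i < L; i++) tables.add(new HashMap<>()); }

        /** Returns true if a similar page is already indexed; otherwise indexes the page. */
        boolean isDuplicate(long pageId, String simHash, int similarThreshold) {
            int part = simHash.length() / L;
            Set<Long> candidates = new HashSet<>();
            for (int i = 0; i < L; i++) {                               // look each part up in its table
                String key = simHash.substring(i * part, (i + 1) * part);
                candidates.addAll(tables.get(i).getOrDefault(key, List.of()));
            }
            for (long id : candidates) {                                // exact check on candidates only
                if (hamming(simHash, simhashById.get(id)) < similarThreshold) return true;
            }
            simhashById.put(pageId, simHash);                           // dissimilar: record and index
            for (int i = 0; i < L; i++) {
                String key = simHash.substring(i * part, (i + 1) * part);
                tables.get(i).computeIfAbsent(key, k -> new ArrayList<>()).add(pageId);
            }
            return false;
        }

        static int hamming(String a, String b) {
            int d = 0;
            for (int i = 0; i < a.length(); i++) if (a.charAt(i) != b.charAt(i)) d++;
            return d;
        }
    }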
7. The structure-aware web page deduplication method of claim 6, wherein the marker replacement of URL parameter names, parameter values and path according to rules in step 2 comprises:

replacing parameter names exceeding a fixed length with a first marker;

for parameter names within the fixed length, if the parameter name and/or parameter value contains digits, replacing those digits with a second marker;

replacing Chinese characters in parameter values and/or the path with a third marker;

replacing the portions of the path containing Unicode characters with a fourth marker;

replacing times in parameter values with a fifth marker.
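For illustration only, a regex-based Java sketch of these substitutions follows; it is not part of the claims. The marker strings, the fixed length of 16, the Unicode-escape form and the time pattern are all assumptions, since the claim does not fix them.

    import java.util.regex.Pattern;

    final class UrlMarker {
        static final int MAX_NAME_LEN = 16;                     // the claim's "fixed length" (assumed)
        static final Pattern DIGITS  = Pattern.compile("\\d+");
        static final Pattern CHINESE = Pattern.compile("[\\u4e00-\\u9fa5]+");
        static final Pattern UNICODE = Pattern.compile("(?:%u[0-9a-fA-F]{4})+");   // assumed escape form
        static final Pattern TIME    = Pattern.compile("\\d{4}-\\d{2}-\\d{2}(?:[ T]\\d{2}:\\d{2}(?::\\d{2})?)?");

        static String markName(String name) {
            if (name.length() > MAX_NAME_LEN) return "{LONG}";  // first marker: over-long name
            return DIGITS.matcher(name).replaceAll("{NUM}");    // second marker: digits in the name
        }

        static String markValue(String value) {
            value = TIME.matcher(value).replaceAll("{TIME}");   // fifth marker, applied before plain digits
            value = DIGITS.matcher(value).replaceAll("{NUM}");  // second marker: digits in the value
            return CHINESE.matcher(value).replaceAll("{CN}");   // third marker: Chinese characters
        }

        static String markPath(String path) {
            path = UNICODE.matcher(path).replaceAll("{UNI}");   // fourth marker: Unicode escapes
            return CHINESE.matcher(path).replaceAll("{CN}");    // third marker: Chinese characters
        }
    }

The marked name, value and path parts are then re-joined into one string and hashed into the unique request ID described in claim 2.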
CN202411687242.2A (Active), filed 2024-11-25, priority date 2024-11-25, granted as CN119203976B (en): A web page deduplication method based on structure perception

Priority Applications (1)

Application Number: CN202411687242.2A; Priority Date: 2024-11-25; Filing Date: 2024-11-25; Title: A web page deduplication method based on structure perception


Publications (2)

Publication Number: CN119203976A (en); Publication Date: 2024-12-27
Publication Number: CN119203976B (en); Publication Date: 2025-02-11 (granted publication)

Family

Family ID: 94070839

Family Applications (1)

Application Number: CN202411687242.2A (Active); Priority Date: 2024-11-25; Filing Date: 2024-11-25; Title: A web page deduplication method based on structure perception

Country Status (1)

Country: CN; Link: CN119203976B (en)

Citations (2)

* Cited by examiner, † Cited by third party

CN103559259A *, priority 2013-11-04, published 2014-02-05, Tongji University: Method for eliminating similar-duplicate webpages on the basis of a cloud platform
CN104778164A *, priority 2014-01-09, published 2015-07-15, China UnionPay Co., Ltd.: Method and device for detecting repeated URLs (Uniform Resource Locators)

Family Cites Families (3)

* Cited by examiner, † Cited by third party

CN108984703B *, priority 2018-07-05, published 2023-04-18, Ping An Technology (Shenzhen) Co., Ltd.: Uniform Resource Locator (URL) duplicate removal method and device
CN110008419B *, priority 2019-03-11, published 2023-07-14, Advanced New Technologies Co., Ltd.: Webpage deduplication method, device and equipment
CN111444411A, priority 2020-03-30, published 2020-07-24, Shenzhen Qianhai WeBank Co., Ltd.: Network data increment acquisition method, device, equipment and storage medium


Also Published As

Publication Number: CN119203976A; Publication Date: 2024-12-27

Similar Documents

KR101119290B1 (en): Information processing apparatus, database system, information processing method, and program
CN113254751B (en): Method, equipment and storage medium for accurately extracting complex webpage structured information
CN102662969B (en): A method for locating Internet information objects based on webpage structural semantics
US20100169311A1 (en): Approaches for the unsupervised creation of structural templates for electronic documents
CN106960058B (en): Webpage structure change detection method and system
US20090248707A1 (en): Site-specific information-type detection methods and systems
CN109344355B (en): Automatic regression detection and block matching self-adaption method and device for webpage change
Manabe et al.: Extracting logical hierarchical structure of HTML documents based on headings
Cardoso et al.: An efficient language-independent method to extract content from news webpages
CN109165373B (en): Data processing method and device
CN112818200A (en): Data crawling and event analyzing method and system based on static website
Mehta et al.: DOM tree based approach for web content extraction
CN112818279A (en): Method and device for determining similarity of web pages and computer-readable storage medium
CN112035723A (en): Resource library determination method and device, storage medium and electronic device
CN115269923A (en): Method, system, equipment and medium for processing webpage text area and text information
CN119203976B (en): A web page deduplication method based on structure perception
US20080033953A1 (en): Method to search transactional web pages
Liu et al.: An automated algorithm for extracting website skeleton
Guo et al.: RED: Redundancy-Driven Data Extraction from Result Pages?
CN117390329A (en): Webpage labeling method, device and equipment
Arya et al.: Content extraction from news web pages using tag tree
Zeleny et al.: Cluster-based page segmentation: a fast and precise method for web page pre-processing
CN114201698A (en): Website homepage identification method and electronic device based on URL features
Gao et al.: Detecting data records in semi-structured web sites based on text token clustering
CN116362223B (en): Automatic identification method and device for web page article titles and texts

Legal Events

Code: PB01; Title: Publication
Code: SE01; Title: Entry into force of request for substantive examination
Code: GR01; Title: Patent grant
