Disclosure of Invention
The technical problem to be solved by the invention is to provide a webpage deduplication method based on structure perception that addresses the defects of the prior art.
In order to solve the above technical problem, the invention discloses a webpage deduplication method based on structure perception, which comprises the following steps.
Step 1, collecting web pages.
Step 2, analyzing the URL of the collected webpage, extracting the parameter part of the URL, and generating a URL parameter feature tag according to the parameter names and value patterns.
Step 3, constructing a mapping table of URL feature information and difference analysis results, performing initialization comparison and filtering on the collected web pages, and updating the mapping table.
Step 4, if the collected web page is not filtered, extracting weighted areas from it according to the mapping table of URL feature information and difference analysis results, and calculating the SimHash value of the collected web page.
Step 5, judging whether a similar web page already exists according to the SimHash value and URL of the collected web page, and performing deduplication if a similar web page exists.
Further, step 2 includes performing mark replacement on the URL parameter names, parameter values and path according to rules, re-splicing the replaced parts into a character string, and calculating a unique request ID of that string; the unique request ID is used as the URL parameter feature tag.
Further, constructing the mapping table of URL feature information and difference analysis results in step 3 includes storing and managing URL feature information and the corresponding difference analysis results using a Map data structure; the mapping table uniqueMarkedMap is defined as Map<String, DiffRes> uniqueMarkedMap.
The key of the mapping table is the URL parameter feature tag, i.e. the unique request ID uniqueMarkedId, of type String; the value is a DiffRes object, which contains a flag directFilter indicating whether direct filtering is possible, the cumulative number of comparisons compareCount, the number of similar results among those comparisons similarCount, the SimHash value of the last comparison lastSimHash, the last web page source code rawHtml, and the set of content change area paths diffPathList; each entry of diffPathList contains a content change type, an element XPath and the number of times the element changed.
Further, performing initialization comparison and filtering on the collected web page and updating the mapping table in step 3 includes: looking up whether a difference analysis result exists in the mapping table according to the URL parameter feature tag of the collected web page; if not, the collected web page is the first page collected for this URL parameter feature tag, so the difference analysis entry for this feature is initialized in the mapping table, the web page source code is saved to the rawHtml attribute of DiffRes, the SimHash value of the current source code is calculated using the standard SimHash algorithm, and the result is saved to the lastSimHash attribute of DiffRes.
If such a result exists, the directFilter flag in the mapping table is checked: if it is true, the collected web page is treated as a duplicate; if it is false and the cumulative number of comparisons is greater than or equal to the comparison count threshold, the initialization comparison is complete and step 4 is executed.
If the cumulative number of comparisons is smaller than the comparison count threshold, the content change areas of the collected web page are identified, a content change area path list is generated and written to the mapping table, and step 1 is executed.
Further, identifying the content change areas of the collected web page, generating the content change area path list and updating the mapping table in step 3 includes comparing DOM structures: the web page corresponding to the last web page source code rawHtml in the difference analysis result with the same URL parameter feature tag in the mapping table is taken as the comparison sample, and starting from the root node of the DOM tree, the structure and content of the nodes are compared layer by layer in a breadth-first traversal.
The numbers of child nodes of the corresponding current nodes are compared; if they differ, the texts of all child nodes under the current node are collected and compared, and if the aggregated texts differ, the change type of the content area under the current node is recorded as a structure change and its change count is increased.
If the numbers of child nodes are the same, the text contents of the nodes are compared one by one; if they differ, the change type of the content area under the current node is recorded as a text change and its change count is increased; if they are the same, all child nodes of the current node continue to be compared.
After the comparison is completed, if no content change area is found, the collected web page is directly filtered; otherwise, a content change area path list corresponding to the collected web page is generated and used to update the content change area path set diffPathList of the corresponding DiffRes in the mapping table, the cumulative comparison count compareCount is increased, the last web page source code rawHtml is updated to the source code of the collected web page, and the SimHash value of the new source code is recalculated and compared with the previous one; if the Hamming distance is smaller than 2, the similar count similarCount is increased by 1, and lastSimHash is updated with the SimHash value of the new source code. If, when the cumulative comparison count compareCount reaches the comparison count threshold, the similar count similarCount is greater than or equal to the similarity count threshold, the direct-filtering flag directFilter is set to true.
Before the SimHash computation, a pre-filtering mechanism based on URL features is thus introduced. For web pages with a given URL feature, if no content change area is found, or the SimHash values repeatedly have a Hamming distance of less than 2, a special mark is set. Feature links whose directFilter flag in the DiffRes object of the mapping table is true can then be filtered directly without downloading the page and recalculating the similarity, which improves efficiency and resource utilization.
Further, step 4 includes: step 4-1, downloading the web content of the collected web page using a headless browser.
Step 4-2, parsing the downloaded webpage content.
Step 4-3, calculating the weights of the content change areas corresponding to the same URL parameter feature tag in the mapping table.
Step 4-4, traversing all content change areas, locating the corresponding DOM elements in the webpage content parsed in step 4-2 using an XPath locator, and extracting the text content of those DOM elements.
Step 4-5, calculating the SimHash value of the collected web page according to the content change area weights and the corresponding text contents.
Further, step 4-3 includes: given n recorded content change areas, calculating the weight Wi of the i-th content change area as
Wi = (Ci / M) * (1 + log(Ci))
where n > 0, 1 ≤ i ≤ n, Ci represents the number of changes of the i-th content change area, and M represents the comparison count threshold.
This adaptive weight formula takes both the importance and the marginal effect of the change frequency into account: it highlights important change areas without exaggerating the influence of frequent changes, provides a reasonable weighting basis for the SimHash calculation, and improves the system's ability to identify substantive content changes.
Further, step 4-5 includes: step 4-5-1, creating a 64-element integer array bits[64] initialized to 0, and initializing the total weight sumWeight = 0 and the maximum weight maxWeight = 0.
Step 4-5-2, calculating weighted features for each content change area to accumulate the combined weights of all content change areas:
using a word segmentation tool, segmenting the text content of the i-th content change area to obtain a word set Ti;
for each word t in Ti, calculating a 64-bit hash value h, a word weight w and a combined weight cw of the word t, and updating the total weight sumWeight, the maximum weight maxWeight and the integer array bits[64];
where the combined weight cw = round(w * Wi); the total weight sumWeight is updated to the existing sumWeight plus the combined weight cw; the maximum weight maxWeight is updated to the maximum of the existing maxWeight and the combined weight cw; and bits 0 to 63 of the hash value h are traversed: if the j-th bit of h is 1, bits[j] is updated to the existing bits[j] plus cw, otherwise bits[j] is updated to the existing bits[j] minus cw, where j is the index and 0 ≤ j ≤ 63.
Step 4-5-3, generating the SimHash value of the collected web page based on the combined weights accumulated over all content change areas:
initializing an empty string simHash, and traversing bits 0 to 63 of the integer array bits[64]: if bits[k] > 0, appending "1" to the string simHash, otherwise appending "0", where k is the index and 0 ≤ k ≤ 63.
The application introduces the concept of document area structure and assigns different weights to different areas. When calculating SimHash, the word weights are combined with the area weights to form combined weights. This reflects the structural characteristics and content importance of the webpage more accurately, and improves the expressiveness and discrimination of the SimHash fingerprint.
Further, step 5 includes creating L index tables, where L evenly divides the SimHash string, and the L index tables are used to store the L segments of the SimHash values of historically collected web pages respectively.
The SimHash value of the collected web page is divided into L equal segments, each segment is looked up in the corresponding index table, and the web page IDs with identical segments are gathered into a candidate web page list.
The SimHash value of each web page in the candidate list is retrieved, and the Hamming distance between the SimHash value of the collected web page and that of the candidate web page is calculated. If the Hamming distance is greater than or equal to a preset similarity threshold, the two web pages are judged dissimilar, the L segments of the SimHash value of the collected web page are stored in the corresponding index tables, and the SimHash value and URL of the collected web page are recorded; otherwise, the two web pages are judged similar, and the SimHash value and URL of the collected web page are not stored.
Further, performing mark replacement on the URL parameter names, parameter values and path according to rules in step 2 includes: replacing a parameter name exceeding a fixed length with a first mark.
For parameter names not exceeding the fixed length, if the parameter names and/or parameter values contain numbers, the numbers in them are replaced with a second mark.
Chinese text in parameter values and/or the path is replaced with a third mark.
Portions of the path containing Unicode characters are replaced with a fourth mark.
Times in parameter values are replaced with a fifth mark.
By replacing numbers, times, overlong strings and similar elements with unified marks, the diversity of parameters is effectively reduced. Based on the assumption that sub-pages of the same website with the same URL features generally share the same page structure, the text content that deserves attention is identified and template content is ignored, which improves deduplication efficiency and accuracy.
The beneficial effects of the webpage deduplication method based on structure perception are as follows: it improves the accuracy and efficiency of the prior art when processing websites built from similar templates. By analyzing the DOM tree structure of the webpage, identifying different areas and assigning them weights according to their importance, the method can accurately extract text content and perform weighted similarity calculation, achieving higher-precision webpage deduplication. In addition, by combining URL parameter feature tags with SimHash optimization, deduplication efficiency and accuracy are further improved, so that large-scale webpage data can be processed efficiently and accurately, and storage cost and server load are effectively reduced.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings.
Related terms are explained as follows.
Headless browser (Headless Browser): a browser without a graphical user interface whose behavior can be controlled programmatically, such as loading web pages, executing JavaScript and capturing screenshots. It is commonly used in scenarios such as web testing and web content extraction.
DOM tree (Document Object Model tree): a tree-shaped data structure obtained by parsing an HTML or XML document, representing the elements, attributes and text content of the document. The DOM tree is the basis of web page structure analysis.
XPath (XML Path Language): a language for locating nodes in an XML document. XPath is also commonly used to locate nodes in HTML documents.
SimHash (similarity hash): a technique that converts text content into a fixed-length fingerprint for fast computation of similarity between texts. The closer two SimHash values are, the more similar the corresponding texts are.
Hamming distance: an indicator of the degree of difference between two strings of the same length, defined as the number of positions at which the corresponding characters of the two strings differ.
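As a small illustration of the last two terms, a hypothetical helper (not part of the claimed method) that computes the Hamming distance between two equal-length SimHash bit strings:

import java.util.Objects;

public class HammingDistanceDemo {
    // Number of positions at which two equal-length bit strings differ.
    static int hammingDistance(String a, String b) {
        Objects.requireNonNull(a);
        Objects.requireNonNull(b);
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        // Two 8-bit fingerprints differing in two positions.
        System.out.println(hammingDistance("10110100", "10010110")); // prints 2
    }
}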
The webpage deduplication method based on structure perception can be applied to various scenarios requiring webpage deduplication: search engines, to improve result quality and reduce redundant information; news aggregation, to filter repeated news and provide more comprehensive information; network content monitoring, to identify repeated information and improve analysis efficiency; e-commerce platforms and knowledge Q&A, to recommend non-repetitive information to users and increase user stickiness; and data analysis and mining fields such as market research and user profiling, where more accurate and reliable results can be obtained from deduplicated webpage data.
An embodiment of the application discloses a webpage deduplication method based on structure perception, which, as shown in Fig. 1, comprises the following steps.
Step 1, collecting web pages.
The method combines efficient webpage content extraction with headless browser rendering to collect multiple webpages in the same domain in batches. To improve collection efficiency, web content extraction supports multithreading and asynchronous requests and simulates a real browser environment, for example by setting header information such as User-Agent and Referer and generating a browser fingerprint. During collection, the page URL, the HTML source code after rendering, the HTTP status code, and metadata such as collection time are recorded, providing a complete data basis for subsequent deduplication analysis.
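As one possible realization of this step, a minimal collection sketch using Selenium's headless Chrome driver is given below; the embodiment does not prescribe a particular headless browser, and the class names, the "--headless=new" option and the User-Agent string here are illustrative assumptions rather than required choices.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class PageCollector {
    // Downloads the rendered HTML of one page with a headless browser (illustrative sketch).
    public static String fetchRenderedHtml(String url) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");  // run Chrome without a GUI
        options.addArguments("--user-agent=Mozilla/5.0 (compatible; DedupCollector/1.0)");  // illustrative User-Agent
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);                 // load the page; JavaScript is executed during rendering
            return driver.getPageSource();   // HTML source after rendering
        } finally {
            driver.quit();
        }
    }
}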
Step 2, generating URL parameter feature tags.
This step analyzes each collected URL and extracts the parameter part of the URL (for example query parameters and paging parameters in a search page). A URL parameter feature tag is then generated according to the names and value patterns of the parameters. By replacing numbers, dates, overlong strings and similar elements with unified marks, the diversity of parameter names, values and paths is effectively reduced, which facilitates subsequent feature extraction and comparison. Specifically, mark replacement is applied to the URL parameter names, parameter values and path according to rules, the replaced parts are re-spliced into a character string, and the unique request ID of that string is calculated and used as the URL parameter feature tag.
Mark replacement of the URL parameter names, parameter values and path according to the rules includes: replacing a parameter name exceeding a fixed length with a first mark {{long}}, where the fixed length may be 32 characters.
For parameter names not exceeding the fixed length, if the parameter names and/or parameter values contain numbers, the numbers in them are replaced with a second mark {{number}}.
Chinese text in parameter values and/or the path is replaced with a third mark {{chinese}}.
Portions of the path containing Unicode characters are replaced with a fourth mark {{unicode}}.
Times in parameter values are replaced with a fifth mark {{time}}.
Mark replacement further includes: replacing portions of parameter values and/or the path consisting purely of uppercase letters with a sixth mark {{upper}}, and portions consisting purely of lowercase letters with a seventh mark {{lower}}.
URL-encoded strings in parameter values and/or the path are replaced with an eighth mark {{urlencode}}.
Portions of parameter values containing boolean values are replaced with a ninth mark {{bool}}.
2.1 Feature tagging process
To generalize URLs that meet certain characteristics, the present embodiment tags the various components of the URL.
2.1.1 The parameter names are marked.
The URL parameter names are marked so that parameters with similar structures can be grouped, which reduces the diversity of parameter names and improves the efficiency and accuracy of URL feature extraction. All parameter names and values are traversed first. Parameter names consisting purely of letters and not longer than 32 characters are left unchanged; parameter names longer than 32 characters are replaced with {{long}}, indicating that the name is overlong; and digits contained in parameter names are replaced with {{number}} using a regular expression.
Examples:
The parameter name "keyword" is not processed and remains as it is.
A parameter name such as "productName123" would be replaced with "productName{{number}}".
The parameter name "aVeryVeryVeryLongParameterName" would be replaced with "{{long}}".
2.1.2 The parameter values are marked.
Depending on the type of parameter value, a different mark is used for substitution. Examples are as follows.
The numeric parameter value "123" is marked as "{{number}}".
A Chinese parameter value (for example the Chinese word for "hello") is marked as "{{chinese}}".
2.1.3 The path is marked.
The path is split into parts according to "/", and each part is marked: a purely numeric part is replaced with the {{number}} mark.
A part containing Chinese text is marked with {{chinese}}.
A part containing Unicode characters is marked with {{unicode}}.
Finally, all marked parts are spliced back into a character string and used as the marked path.
2.1.4 The tagged unique request ID is calculated.
The matched parameter names, parameter values and path in the original URL are replaced to form a feature template, and a hash (for example MD5) of the feature template is then computed to obtain a 32-character unique request ID, which is used as the URL parameter feature tag.
For example, for the URL "https://www.example.com/path/api/search?page=2&begin=2024-09-21", its unique request ID can be calculated as:
MD5("https://www.example.com/path/api/search?page={{number}}&begin={{time}}")
2.2 Example
Assume that there are two URLs:
URL1: http://www.test.com/search?name=张三&age=21
URL2: http://www.test.com/search?name=李四&age=24
After feature marking, both URLs correspond to the same unique request ID:
MD5("http://www.test.com/search?name={{chinese}}&age={{number}}")
In this way, URLs with the same characteristics can be categorized together, facilitating subsequent web structure analysis and deduplication operations.
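A simplified sketch of the marking and ID calculation for the two URLs above follows. Only the {{number}} and {{chinese}} rules are applied, and the regular expressions and helper names (markValue, uniqueMarkedId) are illustrative simplifications of the full rule set of section 2.1.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class UrlFeatureTag {
    // Replace digit runs and Chinese character runs in a parameter value with unified marks.
    static String markValue(String value) {
        return value.replaceAll("[0-9]+", "{{number}}")
                    .replaceAll("[\\u4e00-\\u9fa5]+", "{{chinese}}");
    }

    // Rebuild the URL with marked parameter values and hash the template into a 32-character unique request ID.
    static String uniqueMarkedId(String baseAndPath, String query) throws Exception {
        StringBuilder template = new StringBuilder(baseAndPath).append("?");
        String[] params = query.split("&");
        for (int i = 0; i < params.length; i++) {
            String[] kv = params[i].split("=", 2);
            if (i > 0) template.append("&");
            template.append(kv[0]).append("=").append(kv.length > 1 ? markValue(kv[1]) : "");
        }
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(template.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();  // used as the URL parameter feature tag
    }

    public static void main(String[] args) throws Exception {
        // URL1 and URL2 yield the same feature template and therefore the same ID.
        System.out.println(uniqueMarkedId("http://www.test.com/search", "name=张三&age=21"));
        System.out.println(uniqueMarkedId("http://www.test.com/search", "name=李四&age=24"));
    }
}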
Step 3, constructing the mapping table of URL feature information and difference analysis results, performing initialization comparison and filtering on the collected web pages, and updating the mapping table.
3.1 Constructing the URL feature information and difference analysis result mapping table uniqueMarkedMap.
A Map data structure is used to store and manage URL feature information and its difference analysis results. This data structure is called the URL feature information and difference analysis result mapping table (hereinafter, the mapping table) and is implemented as follows.
3.1.1. Mapping table definition.
Map<String, DiffRes> uniqueMarkedMap
3.1.2. Mapping table key value pair description.
Key: the unique request ID uniqueMarkedId computed after marking, of type String.
Value: a DiffRes object containing the content change information of the web pages with this feature.
3.1.3. DiffRes object structure.
public class DiffRes {
    private boolean directFilter;        // whether the page can be filtered directly
    private int compareCount;            // cumulative number of comparisons, upper limit 20
    private int similarCount;            // number of similar results within the (up to 20) comparisons
    private String lastSimHash;          // SimHash of the last comparison (standard method, unweighted)
    private String rawHtml;              // source code of the last collected web page
    private Set<DiffPath> diffPathList;  // set of content change area paths
}
3.1.4. DiffPath object structure.
public class DiffPath {
    private String pathType;  // change type
    private String path;      // XPath of the element
    private Integer count;    // number of element changes
}
3.1.5. pathType value description.
"TEXT_DIFF" indicates a text change.
"STRUCTURE_DIFF" indicates a page structure change.
After the currently collected sub-link has been marked in step 2, its uniqueMarkedId can be calculated and it can be checked whether the mapping table uniqueMarkedMap already contains a corresponding comparison result.
3.2 A corresponding comparison result already exists.
First, the directFilter field of the comparison result is checked:
If directFilter is true, the link is filtered directly without downloading the page or computing similarity.
If directFilter is false, it is checked whether the number of comparisons has reached the comparison count threshold (20 in this embodiment).
If the comparison count threshold has been reached, the initialization comparison is complete and step 4 is entered.
If the comparison count threshold has not been reached, section 3.3 is executed to continue the comparison and update the comparison result.
3.3 No corresponding comparison result exists, or the comparison needs to continue.
If uniqueMarkedMap contains no comparison result for the current uniqueMarkedId, this is the first collected web page with this feature. The following is then required:
1) A difference analysis entry for the feature is initialized in uniqueMarkedMap.
2) The web page source code is saved to the rawHtml attribute of DiffRes.
3) The SimHash value of the current source code is calculated using the standard SimHash algorithm and saved to the lastSimHash attribute of DiffRes.
When a web page with the same feature is collected again later, the following steps are executed:
1) The newly collected source code is compared with the rawHtml saved in DiffRes (i.e. the last collected source code), and the difference paths are recorded in the diffPathList of DiffRes.
2) rawHtml is updated to the newly collected source code.
3) The SimHash value of the new source code is recalculated and compared with the previous one. If the Hamming distance is less than 2, similarCount in DiffRes is incremented by 1.
4) lastSimHash is updated with the SimHash value of the new source code.
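Before the detailed comparison procedure below, the lookup-or-initialize flow of sections 3.2 and 3.3 can be condensed into a sketch like the following. The DiffRes here mirrors the class of section 3.1.3 with package-private fields for brevity, and computeStandardSimHash stands in for the standard (unweighted) SimHash routine, which is assumed to exist elsewhere.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InitComparison {
    static final int COMPARE_THRESHOLD = 20;  // comparison count threshold

    static class DiffPath { String pathType; String path; Integer count; }  // as in 3.1.4
    static class DiffRes {                                                   // simplified mirror of 3.1.3
        boolean directFilter;
        int compareCount;
        int similarCount;
        String lastSimHash;
        String rawHtml;
        Set<DiffPath> diffPathList = new HashSet<>();
    }

    static final Map<String, DiffRes> uniqueMarkedMap = new HashMap<>();

    // Returns "FILTER", "GOTO_STEP_4" or "CONTINUE_COMPARE" for a newly collected page.
    static String handlePage(String uniqueMarkedId, String rawHtml) {
        DiffRes res = uniqueMarkedMap.get(uniqueMarkedId);
        if (res == null) {
            // First page with this feature tag: initialize its difference analysis entry (section 3.3).
            res = new DiffRes();
            res.rawHtml = rawHtml;
            res.lastSimHash = computeStandardSimHash(rawHtml);
            uniqueMarkedMap.put(uniqueMarkedId, res);
            return "CONTINUE_COMPARE";
        }
        if (res.directFilter) {
            return "FILTER";                    // highly similar feature pages are dropped directly (section 3.2)
        }
        if (res.compareCount >= COMPARE_THRESHOLD) {
            return "GOTO_STEP_4";               // initialization comparison finished
        }
        return "CONTINUE_COMPARE";              // keep comparing and updating this entry
    }

    static String computeStandardSimHash(String html) {
        return "0".repeat(64);                  // placeholder for the standard SimHash algorithm
    }
}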
The specific comparison procedure is as follows.
A) DOM structure comparison
The most recently collected web page in the same feature category (with the same URL parameter feature tag) is selected as the comparison sample.
The current web page source code is compared with the selected sample to identify content change areas.
B) Breadth-first traversal
Starting from the root node of the DOM tree, a breadth-first traversal is performed.
The structure and content of the nodes are compared layer by layer (a Jsoup-based sketch of this traversal appears after item F below).
C) Node difference identification
C.i) Structural difference: the numbers of child nodes of the corresponding current nodes are compared.
If the numbers differ, the texts of all child nodes under the current node are collected and compared. If the aggregated texts differ, the area is recorded as STRUCTURE_DIFF (structure change) and its change count is increased.
C.ii) Text difference: if the numbers of child nodes are the same, the ownText contents of the nodes are compared one by one. If the ownText differs, the area is recorded as TEXT_DIFF (text change) and its change count is increased. If the ownText is identical, the comparison continues in depth over all child nodes of the node.
D) Generating a content change area path list
After the comparison is completed, if no content change area is found, the collected web page is filtered directly; otherwise, a content change area path list corresponding to this feature's web pages is generated.
The path list is bound to the current URL feature tag for processing subsequent web pages of the same class.
E) Updating uniqueMarkedMap
The generated comparison results (including the content change area path list and the special mark) are stored in uniqueMarkedMap.
The current uniqueMarkedId is used as the key so that subsequent retrievals are fast.
F) Comparison count management
After each comparison, the comparison count compareCount is incremented; if the pages were judged similar (Hamming distance less than 2), similarCount is also incremented.
F.1) When the comparison count reaches the threshold, if similarCount is greater than or equal to the similarity count threshold (which may be set to 18 in this embodiment), directFilter in DiffRes is set to true, indicating that web pages with this feature are highly similar.
Otherwise, the initialization comparison stage is marked as finished and step 4 is executed.
F.2) If the number of comparisons is less than 20, the current progress is saved and the comparison continues at the next collection.
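The breadth-first node comparison of items B) and C) may be sketched with Jsoup roughly as follows. This is a simplified illustration: it records only the change type and an element locator, uses Jsoup's cssSelector() in place of the XPath the embodiment stores, and omits the change-count bookkeeping and the updates to uniqueMarkedMap.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class DomDiff {
    static class Change {                    // "TEXT_DIFF" or "STRUCTURE_DIFF" plus an element locator
        final String pathType;
        final String path;
        Change(String pathType, String path) { this.pathType = pathType; this.path = path; }
    }

    // Breadth-first comparison of the previous and the current page with the same URL feature tag.
    static List<Change> compare(String lastRawHtml, String currentHtml) {
        List<Change> changes = new ArrayList<>();
        Queue<Element[]> queue = new ArrayDeque<>();
        queue.add(new Element[]{Jsoup.parse(lastRawHtml).body(), Jsoup.parse(currentHtml).body()});
        while (!queue.isEmpty()) {
            Element[] pair = queue.poll();
            Element oldEl = pair[0];
            Element newEl = pair[1];
            if (oldEl.children().size() != newEl.children().size()) {
                // Structural difference: child counts differ, so compare the aggregated texts once.
                if (!oldEl.text().equals(newEl.text())) {
                    changes.add(new Change("STRUCTURE_DIFF", newEl.cssSelector()));
                }
            } else if (!oldEl.ownText().equals(newEl.ownText())) {
                // Text difference on the node itself.
                changes.add(new Change("TEXT_DIFF", newEl.cssSelector()));
            } else {
                // Identical so far: continue the comparison over the children.
                for (int i = 0; i < oldEl.children().size(); i++) {
                    queue.add(new Element[]{oldEl.children().get(i), newEl.children().get(i)});
                }
            }
        }
        return changes;
    }
}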
Step 4, if the collected web page is not filtered, extracting weighted areas from it according to the mapping table of URL feature information and difference analysis results, and calculating the SimHash value of the collected web page.
This step processes the collected web page and calculates its similarity based on the content change area information for the corresponding feature obtained in step 3.
4.1 Collecting the web page.
The webpage content corresponding to the current URL is downloaded using the headless browser.
If the download fails or times out, an error is recorded and the current process is ended.
4.2 Extracting content.
The downloaded web page content is parsed using an HTML parser (Jsoup).
The text content of the webpage is extracted, removing HTML tags and scripts.
4.3 Calculating the content change area weights.
Given n recorded content change areas, the weight Wi of the i-th content change area is calculated as
Wi = (Ci / M) * (1 + log(Ci))
where n > 0, 1 ≤ i ≤ n, Ci represents the number of changes of the i-th content change area, and M is the total number of comparisons (the comparison count threshold, at most 20).
The design of the weight formula takes both the importance and the marginal effect of the change frequency into account: it highlights important change areas without exaggerating the influence of frequent changes, and provides a reasonable weighting basis for the SimHash calculation. This helps the system identify substantive content changes while remaining sensitive to subtle changes.
Assume the total number of comparisons M = 20 (the maximum), and calculate the weights in some typical scenarios.
Using the formula Wi = (Ci / M) * (1 + log(Ci)), where log denotes the natural logarithm:
Example 1 Low frequency variation
Ci=1 (only once changed)
Wi = (1 / 20) * (1 + log(1)) = 0.05 * 1 = 0.05
Example 2 Medium frequency variation
Ci = 5
Wi = (5 / 20) * (1 + log(5)) ≈ 0.25 * 2.61 ≈ 0.65
Example 3 high frequency variation
Ci = 10
Wi = (10 / 20) * (1 + log(10)) ≈ 0.5 * 3.30 ≈ 1.65
Example 4 very high frequency variation
Ci = 15
Wi = (15 / 20) * (1 + log(15)) ≈ 0.75 * 3.71 ≈ 2.78
Example 5 extreme case (maximum number of changes)
Ci = 20
Wi = (20 / 20) * (1 + log(20)) = 1 * 3.99 ≈ 3.99
This design ensures that frequently changing areas receive higher weights, reflecting that they may contain more important or more frequently updated information.
At the same time, the rate of weight growth is controlled, preventing excessive impact in extreme cases.
Low-frequency changes, although weighted less, are still taken into account, preserving the ability to capture occasional but possibly important changes.
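The formula and the examples above translate directly into a small helper; the natural logarithm is assumed, since that is what reproduces the numbers in Examples 1 to 5.

public class RegionWeight {
    static final int M = 20;  // comparison count threshold (total number of comparisons)

    // Wi = (Ci / M) * (1 + ln(Ci)) for an area that changed ci times.
    static double weight(int ci) {
        return ((double) ci / M) * (1 + Math.log(ci));
    }

    public static void main(String[] args) {
        for (int ci : new int[]{1, 5, 10, 15, 20}) {
            System.out.printf("Ci=%d -> Wi=%.2f%n", ci, weight(ci));  // approximately reproduces Examples 1-5
        }
    }
}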
4.4 Extracting weighted features.
A) All content change areas are traversed.
B) For each area, the corresponding DOM element is found using its XPath locator and the text content of the element is extracted. In section 4.5 the text content is weighted according to the weight Wi of the area.
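A sketch of this extraction step, assuming a Jsoup version that provides selectXpath for XPath lookup; the DiffPath objects come from the mapping table entry of the current feature tag, and the weight is computed with the formula of section 4.3.

import java.util.ArrayList;
import java.util.List;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WeightedRegionExtractor {
    static class DiffPath { String pathType; String path; Integer count; }  // as in 3.1.4

    static class WeightedRegion {  // text of one content change area together with its weight Wi
        final String text;
        final double weight;
        WeightedRegion(String text, double weight) { this.text = text; this.weight = weight; }
    }

    static List<WeightedRegion> extract(Document doc, List<DiffPath> diffPathList, int compareThreshold) {
        List<WeightedRegion> regions = new ArrayList<>();
        for (DiffPath diffPath : diffPathList) {
            double wi = ((double) diffPath.count / compareThreshold) * (1 + Math.log(diffPath.count));  // Wi from 4.3
            // Locate the element recorded for this change area; pages that no longer contain it contribute nothing.
            for (Element el : doc.selectXpath(diffPath.path)) {
                regions.add(new WeightedRegion(el.text(), wi));
            }
        }
        return regions;
    }
}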
4.5 Calculating the SimHash value of the collected web page according to the content change area weights and the corresponding text contents.
A) Input preparation: weighted text areas are prepared as input.
The input is a document D composed of n text areas, each area Ri having a corresponding weight Wi:
D = {(R1, W1), (R2, W2), ..., (Rn, Wn)}
B) Initialization of the necessary variables and data structures. As in conventional SimHash, a 64-element integer array bits[64] is created and initialized to 0, with sumWeight = 0 (total weight) and maxWeight = 0 (maximum weight).
C) Text area processing: each text area is processed and its weighted features are calculated.
For i=1 to n:
Segment Ri using a word segmentation tool to obtain the word set Ti.
For each word t in Ti:
Compute the word hash: h = hash(t)  // 64-bit hash value
Compute the word weight: w = wordWeight(t)  // weight function based on word features
Compute the combined weight: cw = round(w * Wi)  // rounded to an integer
Update the total weight: sumWeight += cw
Update the maximum weight: maxWeight = max(maxWeight, cw)
Update the integer array bits[64]: for index j = 0 to 63:
if the j-th bit of h is 1:
bits[j] += cw
Otherwise:
bits[j] -= cw
The word-feature-based weighting function wordWeight may determine word weights according to the following rules.
Empty or blank words: weight 0.
Single-character words: weight 1. Since the hashes of single-character words carry a limited number of bits (about 40), an excessive weight tends to drive the high-order portion of the SimHash value to all zeros, reducing discrimination.
Multi-character words: words starting with a CJK character are treated differently from other words, the distinction being made by checking whether the Unicode code point of the first character is greater than or equal to 0x3040 (the lower bound used here for the CJK-related character range).
Words beginning with a CJK character and of length 2: weight 8.
Words beginning with a CJK character and of length greater than 2: weight 16.
Words of length 2 not beginning with a CJK character: weight 2.
Words of length greater than 2 not beginning with a CJK character: weight 4.
This weight distribution strategy gives CJK words a higher weight, as they typically carry more important information in the text. Words of different lengths also receive different weights: longer words generally carry stronger semantic information and are therefore weighted higher.
D) Generating the SimHash value: the final SimHash value of the collected web page is generated based on the accumulated combined weights.
As in conventional SimHash, briefly:
Initialize an empty string simHash
For index k = 0 to 63:
If bits[k] > 0:
simHash += "1"
Otherwise:
simHash += "0"
Return the result.
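A compact Java rendering of steps B) through D), including the word-weight rules above, is sketched below. The word segmentation is assumed to have been done by an external tool, the 64-bit FNV-1a hash stands in for whatever word hash function is actually used, and the sumWeight and maxWeight bookkeeping is omitted because it does not affect the resulting fingerprint.

import java.util.List;

public class WeightedSimHash {
    // Word-weight rules of step 4-5: single characters 1, CJK-initial words 8/16, other words 2/4.
    static int wordWeight(String t) {
        if (t == null || t.isBlank()) return 0;
        if (t.length() == 1) return 1;
        boolean cjk = t.codePointAt(0) >= 0x3040;  // code-point threshold used by the embodiment
        if (t.length() == 2) return cjk ? 8 : 2;
        return cjk ? 16 : 4;
    }

    // 64-bit FNV-1a hash, used here only as a stand-in for the 64-bit word hash.
    static long hash64(String t) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < t.length(); i++) {
            h ^= t.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // regions: the word sets T1..Tn of the n content change areas; weights: the area weights W1..Wn (same order).
    static String simHash(List<List<String>> regions, List<Double> weights) {
        long[] bits = new long[64];
        for (int i = 0; i < regions.size(); i++) {
            for (String t : regions.get(i)) {
                long h = hash64(t);
                long cw = Math.round(wordWeight(t) * weights.get(i));  // combined weight cw = round(w * Wi)
                for (int j = 0; j < 64; j++) {
                    bits[j] += ((h >>> j) & 1L) == 1L ? cw : -cw;      // per-bit vote weighted by cw
                }
            }
        }
        StringBuilder simHash = new StringBuilder(64);
        for (int k = 0; k < 64; k++) {
            simHash.append(bits[k] > 0 ? '1' : '0');                   // final 64-character fingerprint
        }
        return simHash.toString();
    }
}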
Step 5, judging whether a similar web page already exists according to the SimHash value and URL of the collected web page, and performing deduplication if a similar web page exists.
A) Index tables are created.
L index tables are created, where L evenly divides the SimHash string; in this embodiment L may be 4. The L index tables respectively store the L segments of the SimHash values of historically collected web pages.
B) Candidate web pages are screened.
The SimHash value of the collected web page is divided into L equal segments, each segment is looked up in the corresponding index table, and the web page IDs with identical segments form the candidate web page list.
C) Calculating the similarity.
Each web page ID in the candidate web page list is traversed.
The complete SimHash value of the corresponding web page is retrieved.
The degree of difference between the SimHash of the collected web page and the SimHash of the candidate web page is calculated as the Hamming distance.
D) Confirming similar web pages.
If the Hamming distance is greater than or equal to the preset similarity threshold, the two web pages are judged dissimilar: the L segments of the SimHash value of the collected web page are stored in the corresponding index tables, and its SimHash value and URL are recorded. If the Hamming distance is less than the preset similarity threshold, the two web pages are considered similar, and the SimHash value and URL of the collected web page are not stored. In this embodiment, the preset similarity threshold may be set to 4.
The number of accesses to similar web pages may be recorded for use in subsequent cleaning of the index.
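An in-memory sketch of the L-way index and the candidate lookup (L = 4, so a 64-character fingerprint is split into four 16-character segments); the hash maps here stand in for whatever index storage is actually deployed, and the recording of page URLs and access counts is omitted.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimHashIndex {
    static final int L = 4;                     // number of index tables; 64 / 4 = 16 characters per segment
    static final int SIMILARITY_THRESHOLD = 4;  // preset similarity threshold on the Hamming distance

    private final List<Map<String, List<Long>>> index = new ArrayList<>();  // index.get(p): p-th segment -> page IDs
    private final Map<Long, String> simHashById = new HashMap<>();          // full fingerprint of each stored page
    private long nextId = 0;

    public SimHashIndex() {
        for (int p = 0; p < L; p++) index.add(new HashMap<>());
    }

    static int hammingDistance(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++) if (a.charAt(i) != b.charAt(i)) d++;
        return d;
    }

    // Returns true if a similar page already exists; otherwise stores the new fingerprint and returns false.
    public boolean isDuplicate(String simHash) {
        int seg = simHash.length() / L;
        List<Long> candidates = new ArrayList<>();
        for (int p = 0; p < L; p++) {
            // Candidate screening: any stored page sharing at least one identical segment.
            List<Long> ids = index.get(p).get(simHash.substring(p * seg, (p + 1) * seg));
            if (ids != null) candidates.addAll(ids);
        }
        for (long id : candidates) {
            if (hammingDistance(simHash, simHashById.get(id)) < SIMILARITY_THRESHOLD) {
                return true;                    // similar page found: deduplicate, do not store
            }
        }
        long id = nextId++;                     // dissimilar: store the L segments and the full fingerprint
        simHashById.put(id, simHash);
        for (int p = 0; p < L; p++) {
            index.get(p).computeIfAbsent(simHash.substring(p * seg, (p + 1) * seg), k -> new ArrayList<>()).add(id);
        }
        return false;
    }
}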
The application provides a webpage deduplication method based on structure perception, which can effectively distinguish webpage structures, reduce the influence of templates and improve deduplication accuracy and efficiency. It can be used to:
Effectively identify web pages with the same or similar content even when their URLs or page structures differ.
Reduce the influence of website template content on similarity calculation and improve deduplication accuracy.
Accurately identify web pages whose text content is similar but whose template content differs.
Compared with the traditional SimHash-based deduplication technology, the application has the following advantages:
1. Structure-aware accurate deduplication.
Conventional approach: conventional SimHash generally treats the entire web page content as a whole, without considering the structural features of the page or the differing importance of its areas.
The application: a document area structure and an area weighting mechanism are introduced. Based on the observation that web pages with the same URL features have similar structures, the content difference frequency of different areas is calculated by batch analysis and comparison of web pages with the same URL features (fixed areas such as navigation bars, headers and footers change little, while core areas such as titles and body text differ markedly). Differentiated weights are then assigned to these areas, so that SimHash reflects the structural characteristics and content value of the webpage more accurately. This effectively distinguishes template content from core content and markedly improves the deduplication of pages with similar structures but substantively different content.
2. Adaptive content weights.
Conventional approach: conventional SimHash generally uses a fixed weighting scheme, which adapts poorly to the content change characteristics of different web pages.
The application: an adaptive weight formula Wi = (Ci / M) * (1 + log(Ci)) is adopted. The formula dynamically adjusts the weight according to the change frequency of different areas of the webpage, highlighting important change areas without exaggerating the influence of frequent changes. This enhances the ability to capture substantive content changes while maintaining sensitivity to subtle changes, making the deduplication results more useful in practice.
3. Efficient pre-filtering mechanism.
Conventional approach: conventional SimHash techniques typically require downloading and processing all web page content and then calculating SimHash values for comparison, which is inefficient when processing large-scale web page collections.
The application: a pre-filtering mechanism based on URL features is introduced. By analyzing the URL structure and parameters, a preliminary determination of potentially duplicate content can be made before downloading the web page content. For web pages with a given URL feature, if no content change area is found, or the SimHash Hamming distances remain below the threshold, filtering can be performed directly without downloading and further processing. This pre-filtering mechanism effectively reduces the number of web pages to be processed and improves the efficiency and resource utilization of the deduplication process.
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit, where the computer storage medium stores a computer program that, when executed by the data processing unit, can perform some or all of the steps of the webpage deduplication method based on structure perception provided by the present application. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like.
It will be apparent to those skilled in the art that the technical solutions in the embodiments of the present invention may be implemented by means of a computer program and a corresponding general hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied essentially in the form of a computer program, i.e. a software product, which may be stored in a storage medium and includes several instructions to cause a device containing a data processing unit (which may be a personal computer, a server, a single-chip microcomputer (MCU) or a network device, etc.) to perform the methods described in the embodiments or in parts of the embodiments of the present invention.
The invention provides a webpage deduplication method based on structure perception, and there are many ways to implement this technical solution; the above is only a preferred embodiment of the invention. It should be noted that a person skilled in the art can make a number of improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also considered to fall within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.