Big data URL De-weight method based on convolution feature and multiple Hash mappingTechnical field
The present invention relates to URL duplicate removal technical field, especially a kind of big number based on convolution feature and multiple Hash mappingAccording to URL De-weight method.
Background technique
Existing big data URL processing technique first is that the URL accessed is saved with HashSet, only need to be close to O(1) cost can find whether a URL is accessed.The method deposits following deficiency:Memory is consumed very much, with URLIncrease, the memory of occupancy can be more and more, even if only 100,000,000 URL, each URL calculates 50 characters, it is necessary to 5GB memory,Quantity is so great that more when the processing of practical big data.
Existing big data URL processing technique second is that URL is saved in again after the one-way hash functions such as MD5 or SHA-1HashSet or database, since character string is by MD5 treated informative abstract length only has 128Bit, after SHA-1 processingOnly 160Bit saves several-fold memory.The method has the following disadvantages:Hash mapping has been used, has saved several timesMemory, but many Hash table conflicts are still had in the data volume treatment process of tens ranks, wrong report quantity reachesVery important quantity causes duplicate removal effect bad.
Relevant technical terms
Convolution:In functional analysis, convolution (Convolution) is to generate third function by two functions f and gA kind of mathematical algorithm, the mapping of the lap of characterization function f and g by overturning and translation.
URL:Uniform resource locator is one kind of the position and access method to the resource that can be obtained from internetSuccinct expression is the address of standard resource on internet.
Hash function:Keyword in data element is mapped to Hash table by certain functional relation.
Summary of the invention
The big data based on convolution feature and multiple Hash mapping that technical problem to be solved by the invention is to provide a kind ofURL De-weight method can be quick for solving the problems, such as that big data WEB access log URL duplicate removal processing speed is slow, effect is poorThe independent URL that high value is filtered out in super amount data, is convenient for post-processing.
In order to solve the above technical problems, the technical solution adopted by the present invention is that:
A kind of big data URL De-weight method based on convolution feature and multiple Hash mapping, includes the following steps:
Step 1:WEB access log is extracted from WEB server or in tradition WAF equipment, then filters out the domain of requirementName HOST and URL;
Step 2:Sequence convolution is carried out to url field and Function Mapping is wished in Doha;
Customized one convolution kernel comprising a numeric string, the Serial No. of convolution kernel arbitrarily determine one, will be eachNeed the character string of duplicate removal to be determined as a Serial No. according to mapping table, convolution operation the result is that corresponding number phaseMultiply;A step-length is defined, successively convolution can generate many convolution values to convolution kernel on Serial No.;
Be arranged hash function number k, bit array size m, addition character string quantity n, occur Hash under this conditionTable report by mistake a possibility that be:
The value that k is arranged is k=ln2*m/n, reaches minimum rate of false alarm at this time:
Hash function number to be used is determined according to expected rate of false alarm according to above-mentioned formula;
Step 3:Convolution function is f, is mapped using k hash function, hash function h1, h2, h3, h4...hk,Then the binary digit assignment principle of convolution characteristic value and hash function mapping value in BitSet is:Initialize one m bitArray (every is initially set to 0), convolution output and the output of each hash function are a numbers between one (0, m-1)(corresponding bit array index);
X is inputted, for each hash function, calculates j=hi (x), m_bit [j] is set as 1;Similarly, for convolution letterM_bit [c] is set as 1 to number by convolution operation, calculating c=f (x) each time;
Step 4:According to the BitSet after assignment, as the keyword or label of URL, and then independent URL is identified.
Further, the step 1 is specially:
1) the extra field in WAF log is filtered out;
2) tag match is carried out using script, filters the URL of static file and undesirable status code;
3) domain name HOST and URL character string is spliced, is exported.
Further, " filtering out the extra field in WAF log " uses the filter in scala language in step 1Method is filtered, i.e., is first split every log by space character using the split method in scala, then makes by oneselfAn adopted filter method, wherein including the rule for wanting filtering.
Further, when domain name HOST and URL character string is spliced, the character string combinations side built in scala is usedMethod "+" is spliced.
Compared with prior art, the beneficial effects of the invention are as follows:This method utilizes the convolution feature for representing URL independent characteristicFunction Mapping is wished in value and Doha, represents multiword by the BitSet of generation and accords with URL, more saves resource than traditional Duplicate Removal Algorithm,Greatly reduce the probability of hash-collision again simultaneously, recognition speed is very fast.
Detailed description of the invention
Fig. 1 is the flow diagram of the big data URL De-weight method the present invention is based on convolution feature and multiple Hash mapping.
Fig. 2 is convolution value and hash function value mapping principle.
Specific embodiment
The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
By the present invention in that establishing one with the swift nature mapping algorithm based on convolution feature and the mapping of multiple hash functionEach URL is passed through a convolution algorithm and multiple hash functions is mapped to multiple positions, greatly reduces conflict by a BitSetProbability, to identify the URL in independent web log.Details are as follows:
One, WEB access log is extracted from WEB server or in tradition WAF equipment, then filters out the HOST of requirement(domain name) and URL (uniform resource locator).Specific screening technique:
1, the extra field in WAF log is filtered out
The filter method that can be used in scala language is filtered.It first will using the split method in scalaEvery log is split by space character:
Val fields=line.split (" ")
A subsequent customized filter method, wherein including the rule for wanting filtering.Rule settings method can be usedThe matched mode of scala canonical carries out:
def myfilter()
Val ho=" gov.cn " .r//example:Only filter out government website
ho.findFirstMatchIn(host)!=None
2, tag match is carried out using script, filters the URL of static file and undesirable status code
Wherein static file refers to that URL is accessed is the static page or file, such as .html .xml .js .cssEtc., these URL are not often needed in the application of practical duplicate removal, are equally carried out using filter function customized in scala:
3, by HOST and URL combination, export
The host and URL that filter out are the forms of character string, behind to carry out deduplication operation, need one domain name of duplicate removalUnder all URL the character string combinations method built in scala can be used so host and URL character string is spliced"+" is spliced:
Val fields=line.split (" ")
Val host=fields (8)
Val url=fields (9)
Val uRL=host+url
Two, sequence convolution is carried out to url field and Function Mapping is wished in Doha
Convolution operation explanation:Customized one convolution kernel comprising a numeric string, wherein convolution kernel length is not preferably lowIn 6 numbers.The Serial No. of convolution kernel is arbitrary, and such as (" 453752 "), but once it is determined that cannot be modified, Yi HousuoSome convolution operations all use this convolution kernel.Each character string for needing duplicate removal is determined as a digital sequence according to mapping tableColumn, (being mapped as " 12345 " such as " abcde "), convolution operation the result is that corresponding number is multiplied, such as 123 convolution 234, knotFruit is exactly 1*2+2*3+3*4=20.A step-length is defined, successively convolution can produce many convolution to convolution kernel on Serial No.Value, as convolution kernel " 123 " on " 23456 " convolution, one is generated after 123 and 234 convolution and is worth, 123 generate with 345 convolution again againOne value (step-length 1).
General MD5 algorithm can be used in Hash mapping algorithm.Hash function number k is set, bit array size m, is addedCharacter string quantity n, a possibility that occurring Hash table wrong report under this condition is:
The value that k is arranged is k=ln2*m/n, reaches minimum rate of false alarm at this time:
Hash function number to be used is determined according to expected rate of false alarm according to above-mentioned formula.
Three, convolution function f is mapped using k hash function, hash function h1, h2, h3, h4...hk, thenThe binary digit assignment principle of convolution characteristic value and hash function mapping value in BitSet is as shown in Figure 2.Initialize one mBit array, convolution output and the output of each hash function are a numbers between one (0, m-1);X is inputted, forEach hash function calculates j=hi (x), m_bit [j] is set as 1;Similarly, for convolution function convolution operation each time,It calculates c=f (x), m_bit [c] is set as 1.
Note:Many values (value that big step-length generates is few, and the value that small step-length generates is more) can be generated in convolution process, according to certainlyOneself needs to define step-length, each convolution value should be mapped in BitSet (the convolution value only generated in Fig. 2 with convolution intoRow image display).
It is as follows that Function Mapping procedure division Java code is wished in Doha:
Four, according to the BitSet after assignment, it is easy to the keyword or label as URL, and then identify independentURL.For example the BitSet that a URL is mapped to has existed, then it is assumed that this URL is duplicate.
Five, the partial results after identifying are as follows:
www.xxxx.com/piwik.php?Action_name=www.wdzj.com%2F%E7%A4%BC%E5%BE%B7%E8%B4%A2%E5%AF%8C%E7%BD%91%E8%B4%B7%E6%A1%A3%E6 %A1%88_%E7%A4%BC%E5%BE%B7%E8%B4%A2%E5%AF%8C%E5%AE%98%E7%B D%91%E8%B5%84%E6%96%99_p2p%E5%B9%B3%E5%8F%B0%E6%A1%A3%E6%A 1%88_%E7%BD%91%E8%B4%B7%E4%B9%8B%E5%AE%B6&idsite=1&rec=1&r=931653&h=23&m=31&s=47&url=https%3A%2F%2Fwww.wdzj.com%2Fd angan%2Fldcf1%2F&urlref=https%3A%2F%2Fwww.wdzj.com%2Fdangan%2F search%3Ffilter%3De1-b41-n44%26show%3D1&_id=747107e1f17b5566&_i dts=1521648124&_Idvc=3&_idn=0&_refts=1521732597&_viewts=1521732597&_ref=https%3A%2F%2Fwww.google.com%2F&send_image=0&pdf=1&qt=0&realp=0&wma=0&dir=0&fla=0&java=0&gears=0&ag=0&cookie=1&res=1440x900&cvar=%7B%223 %22%3A%5B%22www%22%2C%22%22%5D%2C%225%22%3A%5B%22uid%22%2C%220% 22%5D%7D>_ms=888>-1
www.xxxx.com/m/c.ashx?S=35&u=100000&c=4&P=170663&Fl=https%3A//www.google.com.hk/>3
www.xxxx.com/user/action?Event_type=load&curt_id=7f8745b8-2d39-11e8-897e-00163e131d5b&prev_id=&event_info=%7B%22ad_uuid %22%3A%22add_Trwgzad1tkva%22%7D&event=ad_exposure&target=http%3A%2F%2Fwww.shixiseng.com%2Ftc%2Frpo&uuid=9f2cd019-7402-8948-9 c0b-501353d6a9e5&Url=https%253A%2F%2Fwww.shixiseng.com%2F&referrer=https% 3A%2F%2Fwww.google.com%2F&uri=%2F&source=pc>---1
www.xxxx.com/user/action?Event_type=load&curt_id=54c06304-2d8a-11e8-97ea-00163e131d5b&prev_id=42b3df23-dfea-4651-a86f-8 8ba92a4e42d&event_Info=%7B%22ad_uuid%22%3A%22add_77mcl4cyo2uu%22%7D&event=ad_exposure&Target=%2Fcom%2Fcom_qrf1ioxwhvxk&uuid=6e65a594-9034-97f3-ad00-B1e7a46d39ca&url=https%253A%2F%2Fwww.shixiseng.com%2F&re ferrer=https%3A%2F%2Fwww.google.com%2F&uri=%2F&source=pc>---1
www.xxxx.com/user/action?Event_type=load&curt_id=d72ad59e-2dcc-11e8-97ea-00163e131d5b&prev_id=&event_info=%7B%22ad_uuid %22%3A%22add_5zj7701ibn7t%22%7D&event=ad_exposure&target=http%3A%2F%2Fcampus.51job.com%2Funiqlo%2F&uuid=f6878545-7dff-4c90-9 ec4-d3f0b0be2cb7&Url=https%253A%2F%2Fwww.shixiseng.com%2F&referrer=https% 3A%2F%2Fwww.google.com%2F&uri=%2F&source=pc>---1
www.xxxx.com/user/action?Event_type=load&curt_id=dc320a40-2dd5-11e8-86b1-00163e0e0af8&prev_id=&event_info=%7B%22ad_uuid %22%3A%22add_Q5h0sozpfgsg%22%7D&event=ad_exposure&target=%2Fcom%2Fcom _ ohgsahcs55rv&Uuid=4d6eab5f-7e0a-da6c-a70a-75cacb5b8e2f&url=https%253A %2F%2Fwww.shixiseng.com%2F&referrer=https%3A%2F%2Fwww.google .com.hk%2F&uri=%2F&source=pc>1
www.xxxx.com/user/action?Event_type=load&curt_id=0ae48218-2db9-11e8-99e6-00163e040372&prev_id=&event_info=%7B%22ad_uuid %22%3A%22add_Trwgzad1tkva%22%7D&event=ad_exposure&target=http%3A%2F%2Fwww.shixiseng.com%2Ftc%2Frpo&uuid=21bf3ae9-d652-1e0a-8 67a-1f9c29660cd5&Url=https%253A%2F%2Fwww.shixiseng.com%2F&referrer=https% 3A%2F%2Fwww.google.com.hk%2F&uri=%2F&source=pc>1