CN108874941A - Big data URL deduplication method based on convolution features and multiple hash mapping


Publication number: CN108874941A
Authority: CN (China)
Prior art keywords: convolution, url, mapping, hash, big data
Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201810562678.7A
Other languages: Chinese (zh)
Other versions: CN108874941B (en)
Inventors: 宋绪成, 邓金城
Current Assignee: Chengdu Zhidaochuangyu Information Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Chengdu Zhidaochuangyu Information Technology Co Ltd
Priority date: 2018-06-04 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2018-06-04
Publication date: 2018-11-23
Application filed by Chengdu Zhidaochuangyu Information Technology Co Ltd
Priority to CN201810562678.7A
Publication of CN108874941A
Application granted
Publication of CN108874941B
Status: Active

Abstract

The invention discloses a big data URL deduplication method based on convolution features and multiple hash mapping. Using a fast feature-mapping algorithm that combines convolution features with multiple hash functions, the method builds a BitSet and maps each URL to multiple bit positions through one convolution operation and several hash functions. This greatly reduces the collision probability, so that the unique URLs in web logs can be identified. Compared with traditional deduplication algorithms, the method uses fewer resources, greatly reduces the probability of hash collisions, and identifies URLs quickly.

Description

Big data URL deduplication method based on convolution features and multiple hash mapping
Technical field
The present invention relates to the technical field of URL deduplication, and in particular to a big data URL deduplication method based on convolution features and multiple hash mapping.
Background art
The first existing technique for big data URL processing stores visited URLs in a HashSet, so that checking whether a URL has been visited costs close to O(1). Its drawback is memory consumption: memory use grows with the number of URLs. Even with only 100 million URLs at 50 characters each, about 5 GB of memory is needed, and real big data workloads involve far more.
The second existing technique passes each URL through a one-way hash function such as MD5 or SHA-1 and stores the digest in a HashSet or database. An MD5 digest is only 128 bits long and a SHA-1 digest only 160 bits, so this saves memory by a factor of several. Its drawback is that, although hash mapping saves memory, many hash collisions still occur when processing very large data volumes, and the number of false positives becomes significant, which degrades the deduplication result.
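The memory figures above can be checked with a short calculation. The following is a minimal Scala sketch, assuming one byte per URL character and counting only the stored payload (HashSet object overhead is ignored); the object and method names are illustrative and do not come from the patent.

```scala
object MemoryEstimate {
  // 100 million URLs, the figure used in the background section.
  val urlCount: Long = 100000000L

  /** Payload size in GB (10^9 bytes) for a given number of bytes per stored entry. */
  def payloadGb(bytesPerEntry: Int): Double = urlCount * bytesPerEntry / 1e9

  def main(args: Array[String]): Unit = {
    println(f"raw URLs, 50 bytes each:    ${payloadGb(50)}%.1f GB")
    println(f"MD5 digests, 16 bytes each:   ${payloadGb(16)}%.1f GB")
    println(f"SHA-1 digests, 20 bytes each: ${payloadGb(20)}%.1f GB")
  }
}
```

Under these assumptions the digests reduce the payload by a factor of roughly 2.5 to 3, which matches the "several-fold" saving described above.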
Relevant technical terms
Convolution: in functional analysis, convolution is a mathematical operation that produces a third function from two functions f and g; it characterizes the overlap between f and g as one of them is flipped and shifted.
URL: a uniform resource locator is a concise representation of the location of, and access method for, a resource available on the internet; it is the standard address of a resource on the internet.
Hash function: a function that maps the key of a data element to a position in a hash table.
Summary of the invention
The technical problem to be solved by the invention is to provide a big data URL deduplication method based on convolution features and multiple hash mapping. The method addresses the slow speed and poor results of URL deduplication on big data web access logs, and can quickly filter high-value unique URLs out of very large data sets for later processing.
To solve the above technical problem, the invention adopts the following technical solution:
A big data URL deduplication method based on convolution features and multiple hash mapping, comprising the following steps:
Step 1: extract web access logs from a web server or a traditional WAF device, then filter out the required domain name (HOST) and URL;
Step 2: apply sequence convolution and multi-hash function mapping to the URL field;
Define a convolution kernel consisting of a digit string; the digit sequence of the kernel may be chosen arbitrarily. Convert each string to be deduplicated into a digit sequence according to a mapping table. A convolution operation multiplies corresponding digits and sums the products. Define a stride; sliding the kernel along the digit sequence by that stride produces a series of convolution values;
Let k be the number of hash functions, m the size of the bit array, and n the number of strings added. Under these conditions the probability of a false positive is:
$$p \approx \left(1 - e^{-kn/m}\right)^{k}$$
Setting k = ln 2 · m/n gives the minimum false-positive rate:
$$p_{\min} = \left(\tfrac{1}{2}\right)^{k} \approx 0.6185^{m/n}$$
The number of hash functions to use is determined from the expected false-positive rate according to the above formulas;
Step 3: let the convolution function be f and map with k hash functions h1, h2, h3, h4, ..., hk. Bits in the BitSet are assigned from the convolution feature values and the hash function values as follows: initialize an m-bit array with every bit set to 0; the convolution output and the output of each hash function are each a number in the range [0, m-1], which serves as an index into the bit array;
For an input x and each hash function hi, compute j = hi(x) and set m_bit[j] to 1; likewise, for each convolution operation of the convolution function, compute c = f(x) and set m_bit[c] to 1;
Step 4: use the assigned BitSet as the key or label of the URL, and thereby identify unique URLs.
Further, Step 1 specifically comprises:
1) filtering out redundant fields in the WAF log;
2) performing tag matching with a script to filter out URLs of static files and URLs with undesired status codes;
3) concatenating the domain name (HOST) and URL strings and outputting the result.
Further, "filtering out redundant fields in the WAF log" in Step 1 uses the filter method of the Scala language: each log line is first split on the space character with Scala's split method, and a custom filter method containing the desired filtering rules is then defined.
Further, the domain name (HOST) and URL strings are concatenated with Scala's built-in string concatenation operator "+".
Compared with the prior art, the invention has the following beneficial effects: the method uses convolution feature values, which represent the distinguishing characteristics of a URL, together with multi-hash function mapping, and represents a multi-character URL by the resulting BitSet. It uses fewer resources than traditional deduplication algorithms, greatly reduces the probability of hash collisions, and identifies URLs quickly.
Brief description of the drawings
Fig. 1 is a flow diagram of the big data URL deduplication method based on convolution features and multiple hash mapping according to the invention.
Fig. 2 shows the principle by which convolution values and hash function values are mapped.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings and specific embodiments.
Using a fast feature-mapping algorithm that combines convolution features with multiple hash functions, the invention builds a BitSet and maps each URL to multiple bit positions through one convolution operation and several hash functions. This greatly reduces the collision probability, so that the unique URLs in web logs can be identified. Details are as follows:
I. Extract web access logs from a web server or a traditional WAF device, then filter out the required HOST (domain name) and URL (uniform resource locator). The specific screening method is:
1. Filter out redundant fields in the WAF log
The filter method of the Scala language can be used. First split each log line on the space character with Scala's split method:
val fields = line.split(" ")
Then define a custom filter method containing the desired filtering rules. The rules can be expressed with Scala regular-expression matching:
def myfilter(host: String): Boolean = {
  val ho = "gov\\.cn".r // example: keep only government websites
  ho.findFirstMatchIn(host).isDefined
}
2. Perform tag matching with a script to filter out URLs of static files and URLs with undesired status codes
Static files here means URLs that access static pages or files, such as .html, .xml, .js and .css. Such URLs are usually not needed in practical deduplication applications and are filtered in the same way, with a custom filter function in Scala (the listing is not reproduced in this text).
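As a stand-in for the filter described in this step, the following is a minimal Scala sketch. The extension list comes from the text; the object and method names, the decision to ignore the query string when checking the extension, and the choice of "200" as the only desired status code are assumptions made for illustration.

```scala
object StaticFilterSketch {
  // Extensions named in the text as static content.
  private val staticExtensions = Set(".html", ".xml", ".js", ".css")

  /** True if the URL path (query string and fragment stripped) ends in a static extension. */
  def isStatic(url: String): Boolean = {
    val path = url.takeWhile(c => c != '?' && c != '#').toLowerCase
    staticExtensions.exists(ext => path.endsWith(ext))
  }

  /** Keep a record only if it is not a static file and its status code is wanted.
    * The default set of wanted status codes is a hypothetical choice. */
  def keep(url: String, status: String, wanted: Set[String] = Set("200")): Boolean =
    !isStatic(url) && wanted.contains(status)
}
```

With a collection of (url, status) pairs, the sketch would be applied as `records.filter { case (u, s) => StaticFilterSketch.keep(u, s) }`.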
3. Combine HOST and URL and output the result
The HOST and URL that are filtered out are strings. Because the later deduplication step deduplicates all URLs under a domain name, the HOST and URL strings are concatenated; Scala's built-in string concatenation operator "+" can be used:
val fields = line.split(" ")
val host = fields(8)
val url = fields(9)
val uRL = host + url
II. Apply sequence convolution and multi-hash function mapping to the URL field
Convolution operation: define a convolution kernel consisting of a digit string, preferably at least 6 digits long. The digit sequence of the kernel is arbitrary, for example "453752", but once chosen it must not be changed; all subsequent convolution operations use this kernel. Each string to be deduplicated is converted into a digit sequence according to a mapping table (for example, "abcde" maps to "12345"). A convolution operation multiplies corresponding digits and sums the products: convolving 123 with 234 gives 1*2 + 2*3 + 3*4 = 20. Define a stride; sliding the kernel along the digit sequence by that stride produces a series of convolution values. For example, convolving the kernel "123" over "23456" with stride 1 produces one value from 123 and 234, then another from 123 and 345, and so on.
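The sliding multiply-and-sum described above can be sketched in Scala as follows. The mapping table used here (letters a to z become 1 to 26, digits map to themselves, everything else becomes 0) is an assumption chosen only so that "abcde" maps to 1, 2, 3, 4, 5 as in the text; the patent does not specify the table, and the names `toNumbers` and `convolve` are illustrative.

```scala
object ConvolutionSketch {
  // Hypothetical mapping table: 'a'..'z' -> 1..26, digits -> their value, anything else -> 0.
  def toNumbers(s: String): Vector[Int] = s.toLowerCase.toVector.map {
    case c if c >= 'a' && c <= 'z' => c - 'a' + 1
    case c if c.isDigit            => c.asDigit
    case _                         => 0
  }

  /** Slides the kernel over the sequence with the given stride; each window yields
    * the sum of element-wise products. Returns no values if the sequence is shorter
    * than the kernel. */
  def convolve(kernel: Seq[Int], sequence: Seq[Int], stride: Int = 1): Seq[Int] = {
    require(stride > 0, "stride must be positive")
    (0 to sequence.length - kernel.length by stride).map { start =>
      kernel.indices.map(i => kernel(i) * sequence(start + i)).sum
    }
  }

  def main(args: Array[String]): Unit = {
    // Kernel "123" over "23456" with stride 1: windows 234, 345, 456.
    println(convolve(toNumbers("123"), toNumbers("23456")))
  }
}
```

For the example in the text, the three windows give 1*2 + 2*3 + 3*4 = 20, 1*3 + 2*4 + 3*5 = 26 and 1*4 + 2*5 + 3*6 = 32. With stride 2 only the first and third windows are evaluated, which illustrates why a larger stride produces fewer values.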
The hash mapping can use the common MD5 algorithm. Let k be the number of hash functions, m the size of the bit array, and n the number of strings added. Under these conditions the probability of a false positive is:
$$p \approx \left(1 - e^{-kn/m}\right)^{k}$$
Setting k = ln 2 · m/n gives the minimum false-positive rate:
$$p_{\min} = \left(\tfrac{1}{2}\right)^{k} \approx 0.6185^{m/n}$$
The number of hash functions to use is determined from the expected false-positive rate according to the above formulas.
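To make the sizing rule concrete, the following is a minimal Scala sketch based on the standard Bloom filter estimate, which is assumed here to be the formula the patent refers to. It counts only the k hash-function bits per string; the extra bits set by convolution values are not modelled, so the real false-positive rate of the method may be higher. All names are illustrative.

```scala
object BloomSizing {
  /** Estimated false-positive rate (1 - e^(-kn/m))^k for k hashes, m bits, n strings. */
  def falsePositiveRate(k: Int, m: Long, n: Long): Double =
    math.pow(1.0 - math.exp(-k.toDouble * n / m), k)

  /** k = ln2 * m / n, rounded to the nearest integer and never below 1. */
  def optimalK(m: Long, n: Long): Int =
    math.max(1, math.round(math.log(2) * m / n).toInt)

  /** Bits needed to reach target rate p at the optimal k: m = -n ln p / (ln 2)^2. */
  def bitsFor(n: Long, p: Double): Long =
    math.ceil(-n * math.log(p) / (math.log(2) * math.log(2))).toLong

  def main(args: Array[String]): Unit = {
    val n = 100000000L          // 100 million URLs, as in the background section
    val m = bitsFor(n, 0.01)    // target a 1% false-positive rate
    val k = optimalK(m, n)
    println(s"m = $m bits, k = $k, estimated rate = ${falsePositiveRate(k, m, n)}")
  }
}
```

At a 1% target this works out to a little under 10 bits per URL, far below the 128 bits of an MD5 digest, which is the source of the resource saving the method claims.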
III. Let the convolution function be f and map with k hash functions h1, h2, h3, h4, ..., hk. The principle by which bits in the BitSet are assigned from the convolution feature values and the hash function values is shown in Fig. 2. Initialize an m-bit array; the convolution output and the output of each hash function are each a number in the range [0, m-1]. For an input x and each hash function hi, compute j = hi(x) and set m_bit[j] to 1; likewise, for each convolution operation of the convolution function, compute c = f(x) and set m_bit[c] to 1.
Note: the convolution process produces many values (a large stride yields fewer values, a small stride yields more); choose the stride as needed. Every convolution value should be mapped into the BitSet (Fig. 2 illustrates the mapping with the value from only one convolution).
Part of the Java code for the multi-hash function mapping procedure is as follows (the listing is not reproduced in this text):
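As a stand-in for the missing listing, the following is a minimal Scala sketch of mapping a string to k bit positions with MD5. The text only says that MD5 can be used; deriving k different hash functions by prefixing the input with the function index, and reducing the digest modulo m, are assumptions made for illustration, as are all names.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import java.util.BitSet

object MultiHashSketch {
  /** i-th hash of x: MD5 of "<i>:<x>", read as a non-negative integer, modulo m. */
  def hashIndex(x: String, i: Int, m: Int): Int = {
    val digest = MessageDigest.getInstance("MD5")
      .digest(s"$i:$x".getBytes(StandardCharsets.UTF_8))
    (BigInt(1, digest) mod BigInt(m)).toInt
  }

  /** The k positions h1(x) ... hk(x), each in the range [0, m-1]. */
  def hashIndices(x: String, k: Int, m: Int): Seq[Int] =
    (1 to k).map(i => hashIndex(x, i, m))

  /** Sets m_bit[j] = 1 for every j = hi(x). */
  def setBits(bits: BitSet, x: String, k: Int, m: Int): Unit =
    hashIndices(x, k, m).foreach(j => bits.set(j))
}
```

Salting a single digest function by index is one common way to obtain several hash functions from MD5; independent hash families would serve equally well.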
IV. The assigned BitSet readily serves as the key or label of a URL and thus identifies unique URLs. For example, if the BitSet pattern that a URL maps to already exists, the URL is considered a duplicate.
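The duplicate check can be sketched end to end in Scala as below. This is one possible reading, not the patent's implementation: a URL is treated as a duplicate when every bit position it maps to is already set in a shared BitSet. The byte-to-digit mapping (`byte % 10`), the index-salted MD5 hashes, and reducing convolution values modulo m are all assumptions, and the class and method names are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import java.util.BitSet

final class UrlDeduplicator(m: Int, k: Int, kernel: Vector[Int], stride: Int = 1) {
  require(m > 0 && k > 0 && stride > 0, "m, k and stride must be positive")
  private val bits = new BitSet(m)

  // k positions from MD5, salted with the hash-function index (an assumption).
  private def hashPositions(url: String): Seq[Int] = (1 to k).map { i =>
    val d = MessageDigest.getInstance("MD5").digest(s"$i:$url".getBytes(StandardCharsets.UTF_8))
    (BigInt(1, d) mod BigInt(m)).toInt
  }

  // One position per convolution window; bytes are mapped to digits 0..9 (an assumption).
  private def convPositions(url: String): Seq[Int] = {
    val seq = url.getBytes(StandardCharsets.UTF_8).map(b => (b & 0xff) % 10)
    (0 to seq.length - kernel.length by stride).map { start =>
      Math.floorMod(kernel.indices.map(i => kernel(i) * seq(start + i)).sum, m)
    }
  }

  /** Records the URL. Returns true if it looked new, false if all its bits were already set. */
  def addIfNew(url: String): Boolean = {
    val positions = hashPositions(url) ++ convPositions(url)
    val alreadySeen = positions.forall(p => bits.get(p))
    positions.foreach(p => bits.set(p))
    !alreadySeen
  }
}
```

A usage example with the kernel "453752" from the text would be `val d = new UrlDeduplicator(1 << 20, 7, Vector(4, 5, 3, 7, 5, 2))`, after which `d.addIfNew(url)` returns false for a URL that has been added before. As with any Bloom-style structure, a false result can occasionally be a false positive.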
V. Part of the identification results is as follows:
www.xxxx.com/piwik.php?Action_name=www.wdzj.com%2F%E7%A4%BC%E5%BE%B7%E8%B4%A2%E5%AF%8C%E7%BD%91%E8%B4%B7%E6%A1%A3%E6 %A1%88_%E7%A4%BC%E5%BE%B7%E8%B4%A2%E5%AF%8C%E5%AE%98%E7%B D%91%E8%B5%84%E6%96%99_p2p%E5%B9%B3%E5%8F%B0%E6%A1%A3%E6%A 1%88_%E7%BD%91%E8%B4%B7%E4%B9%8B%E5%AE%B6&idsite=1&rec=1&r=931653&h=23&m=31&s=47&url=https%3A%2F%2Fwww.wdzj.com%2Fd angan%2Fldcf1%2F&urlref=https%3A%2F%2Fwww.wdzj.com%2Fdangan%2F search%3Ffilter%3De1-b41-n44%26show%3D1&_id=747107e1f17b5566&_i dts=1521648124&_Idvc=3&_idn=0&_refts=1521732597&_viewts=1521732597&_ref=https%3A%2F%2Fwww.google.com%2F&send_image=0&pdf=1&qt=0&realp=0&wma=0&dir=0&fla=0&java=0&gears=0&ag=0&cookie=1&res=1440x900&cvar=%7B%223 %22%3A%5B%22www%22%2C%22%22%5D%2C%225%22%3A%5B%22uid%22%2C%220% 22%5D%7D&gt_ms=888>-1
www.xxxx.com/m/c.ashx?S=35&u=100000&c=4&P=170663&Fl=https%3A//www.google.com.hk/>3
www.xxxx.com/user/action?Event_type=load&curt_id=7f8745b8-2d39-11e8-897e-00163e131d5b&prev_id=&event_info=%7B%22ad_uuid %22%3A%22add_Trwgzad1tkva%22%7D&event=ad_exposure&target=http%3A%2F%2Fwww.shixiseng.com%2Ftc%2Frpo&uuid=9f2cd019-7402-8948-9 c0b-501353d6a9e5&Url=https%253A%2F%2Fwww.shixiseng.com%2F&referrer=https% 3A%2F%2Fwww.google.com%2F&uri=%2F&source=pc>---1
www.xxxx.com/user/action?Event_type=load&curt_id=54c06304-2d8a-11e8-97ea-00163e131d5b&prev_id=42b3df23-dfea-4651-a86f-8 8ba92a4e42d&event_Info=%7B%22ad_uuid%22%3A%22add_77mcl4cyo2uu%22%7D&event=ad_exposure&Target=%2Fcom%2Fcom_qrf1ioxwhvxk&uuid=6e65a594-9034-97f3-ad00-B1e7a46d39ca&url=https%253A%2F%2Fwww.shixiseng.com%2F&re ferrer=https%3A%2F%2Fwww.google.com%2F&uri=%2F&source=pc>---1
www.xxxx.com/user/action?Event_type=load&curt_id=d72ad59e-2dcc-11e8-97ea-00163e131d5b&prev_id=&event_info=%7B%22ad_uuid %22%3A%22add_5zj7701ibn7t%22%7D&event=ad_exposure&target=http%3A%2F%2Fcampus.51job.com%2Funiqlo%2F&uuid=f6878545-7dff-4c90-9 ec4-d3f0b0be2cb7&Url=https%253A%2F%2Fwww.shixiseng.com%2F&referrer=https% 3A%2F%2Fwww.google.com%2F&uri=%2F&source=pc>---1
www.xxxx.com/user/action?Event_type=load&curt_id=dc320a40-2dd5-11e8-86b1-00163e0e0af8&prev_id=&event_info=%7B%22ad_uuid %22%3A%22add_Q5h0sozpfgsg%22%7D&event=ad_exposure&target=%2Fcom%2Fcom _ ohgsahcs55rv&Uuid=4d6eab5f-7e0a-da6c-a70a-75cacb5b8e2f&url=https%253A %2F%2Fwww.shixiseng.com%2F&referrer=https%3A%2F%2Fwww.google .com.hk%2F&uri=%2F&source=pc>1
www.xxxx.com/user/action?Event_type=load&curt_id=0ae48218-2db9-11e8-99e6-00163e040372&prev_id=&event_info=%7B%22ad_uuid %22%3A%22add_Trwgzad1tkva%22%7D&event=ad_exposure&target=http%3A%2F%2Fwww.shixiseng.com%2Ftc%2Frpo&uuid=21bf3ae9-d652-1e0a-8 67a-1f9c29660cd5&Url=https%253A%2F%2Fwww.shixiseng.com%2F&referrer=https% 3A%2F%2Fwww.google.com.hk%2F&uri=%2F&source=pc>1

Claims (4)

CN201810562678.7A | 2018-06-04 | 2018-06-04 | Big data URL duplication removing method based on convolution characteristics and multiple Hash mapping | Active | CN108874941B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810562678.7A | 2018-06-04 | 2018-06-04 | Big data URL duplication removing method based on convolution characteristics and multiple Hash mapping (CN108874941B (en))

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810562678.7A | 2018-06-04 | 2018-06-04 | Big data URL duplication removing method based on convolution characteristics and multiple Hash mapping (CN108874941B (en))

Publications (2)

Publication Number | Publication Date
CN108874941A | 2018-11-23
CN108874941B | 2021-09-21

Family

ID=64335996

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201810562678.7A | Active | CN108874941B (en) | 2018-06-04 | 2018-06-04 | Big data URL duplication removing method based on convolution characteristics and multiple Hash mapping

Country Status (1)

Country | Link
CN (1) | CN108874941B (en)

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
CP02 | Change in the address of a patent holder

Address after: 9/F, Block C, No. 28 Tianfu Avenue North Section, Chengdu High tech Zone, China (Sichuan) Pilot Free Trade Zone, Chengdu City, Sichuan Province, 610000
Patentee after: CHENGDU KNOWNSEC INFORMATION TECHNOLOGY Co.,Ltd.
Address before: 610000, 11th floor, building 2, No. 219, Tianfu Third Street, hi tech Zone, Chengdu, Sichuan Province
Patentee before: CHENGDU KNOWNSEC INFORMATION TECHNOLOGY Co.,Ltd.

