




















































Slides © 2007 Filippo Menczer, Indiana University School of Informatics. Bing Liu: Web Data Mining. Springer, 2007. Ch. 8 Web Crawling by Filippo Menczer.

The “link-cluster” conjecture

- Connection between semantic topology (relevance) and link topology (hypertext)
  - G = Pr[rel(p)] ~ fraction of relevant/topical pages (topic generality)
  - R = Pr[rel(p) | rel(q) AND link(q,p)] ~ conditional probability, given a neighbor on topic
- Related nodes are clustered if R > G
  - Necessary and sufficient condition for a random crawler to find pages related to start points
  - Example: 2 topical clusters with stronger modularity within each cluster than outside

[Figure: example graph annotated with G = 5/15, C = 2, R = 3/6 = 2/4]
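The two quantities on this slide can be estimated directly from a labeled link graph. The following is a minimal sketch, assuming a hypothetical toy graph (the page names, the `links` adjacency map, and the `relevant` set are made up for illustration and are not the graph drawn on the slide). It estimates G as the fraction of relevant pages and R as the fraction of links leaving relevant pages that land on relevant pages, then checks the clustering condition R > G.

```python
from typing import Dict, List, Set


def generality(pages: Set[str], relevant: Set[str]) -> float:
    """G = Pr[rel(p)]: fraction of all pages that are relevant."""
    return len(relevant & pages) / len(pages)


def link_relevance(links: Dict[str, List[str]], relevant: Set[str]) -> float:
    """R = Pr[rel(p) | rel(q) AND link(q, p)].

    Estimated as the fraction of links leaving relevant pages
    whose target is also relevant.
    """
    from_relevant = [
        (q, p) for q, targets in links.items() if q in relevant for p in targets
    ]
    if not from_relevant:
        return 0.0
    hits = sum(1 for _, p in from_relevant if p in relevant)
    return hits / len(from_relevant)


# Hypothetical toy graph: {a, b, c} form a topical cluster, {x, y, z} do not.
links = {
    "a": ["b", "c"],
    "b": ["a", "x"],
    "c": ["a"],
    "x": ["y"],
    "y": ["x", "z"],
    "z": [],
}
pages = set(links)
relevant = {"a", "b", "c"}

G = generality(pages, relevant)      # 3 relevant pages out of 6
R = link_relevance(links, relevant)  # 4 of the 5 links from relevant pages stay on topic
clustered = R > G                    # the slide's clustering condition
print(f"G = {G:.2f}, R = {R:.2f}, clustered = {clustered}")
```

In this toy graph only one link leaves the topical cluster, so R exceeds G and a crawler that follows links from an on-topic page is more likely to stay on topic than one that samples pages uniformly.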

Link-cluster conjecture

- Preservation of semantics (meaning) across links
- 1000 times more likely to be on topic if near an on-topic page!

$$\frac{R(q,\delta)}{G(q)} \equiv \frac{\Pr[\mathrm{rel}(p) \mid \mathrm{rel}(q) \wedge \mathrm{path}(q,p) \le \delta]}{\Pr[\mathrm{rel}(p)]}$$

$$L(q,\delta) \equiv \frac{\sum_{\{p \,:\, \mathrm{path}(q,p) \le \delta\}} \mathrm{path}(q,p)}{\left|\{p : \mathrm{path}(q,p) \le \delta\}\right|}$$
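The likelihood ratio R(q,δ)/G(q) and the mean path length L(q,δ) can be sketched for a single start page with a breadth-first search. This is only an illustration under stated assumptions: path(q, p) is taken to be the shortest directed link distance, the start page q itself is excluded from the neighborhood, every page appears as a key of `links`, and the graph and relevance labels are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def distances_within(links: Dict[str, List[str]], q: str, delta: int) -> Dict[str, int]:
    """Shortest link distance from q to every other page with path(q, p) <= delta."""
    dist = {q: 0}
    queue = deque([q])
    while queue:
        u = queue.popleft()
        if dist[u] == delta:  # do not expand beyond the radius
            continue
        for v in links.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[q]  # assumption: q itself is not counted as its own neighbor
    return dist


def likelihood_ratio(links: Dict[str, List[str]], relevant: Set[str], q: str, delta: int) -> float:
    """R(q, delta) / G(q) for one relevant start page q."""
    near = distances_within(links, q, delta)
    if not near:
        raise ValueError("no pages within delta of q")
    r = sum(1 for p in near if p in relevant) / len(near)
    g = len(relevant) / len(links)  # assumes every page is a key of links
    return r / g


def mean_path_length(links: Dict[str, List[str]], q: str, delta: int) -> float:
    """L(q, delta): average distance of the pages within delta of q."""
    near = distances_within(links, q, delta)
    if not near:
        raise ValueError("no pages within delta of q")
    return sum(near.values()) / len(near)


# Hypothetical graph: 8 pages, 3 of them relevant.
links = {"q": ["a", "b"], "a": ["c"], "b": [], "c": ["d"],
         "d": [], "e": [], "f": [], "g": []}
relevant = {"q", "a", "c"}

ratio = likelihood_ratio(links, relevant, "q", delta=2)
avg_len = mean_path_length(links, "q", delta=2)
print(f"R/G = {ratio:.3f}, L = {avg_len:.3f}")
```

A ratio above 1 means pages near an on-topic page are more likely to be on topic than a randomly chosen page, which is the effect the slide quantifies.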













Link scoring and selection by each crawling agent

$$\Pr[l] = \frac{e^{\beta \lambda_l}}{\sum_{l'} e^{\beta \lambda_{l'}}}$$

$$\lambda_l = \mathrm{nnet}(in_1, \ldots, in_N)$$

$$in_k = \sum_{w \in D} \frac{\delta(k, w)}{\mathrm{dist}(w, l)}$$

[Figure: for each link l, the instances of each keyword k_i are summed as matches with inverse-distance weighting; these sums feed the agent's neural net, whose output λ_l goes to a stochastic selector]
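The scoring pipeline on this slide can be sketched in three small steps: inverse-distance keyword inputs, a link score, and a stochastic (softmax) selector. This is a rough illustration, not the agent's actual implementation: the "neural net" is replaced by a single hypothetical tanh unit with made-up weights, dist(w, l) is assumed to be the token offset plus one (to avoid division by zero), and the token list, keywords, and link positions are invented.

```python
import math
import random
from typing import List, Sequence


def keyword_inputs(tokens: Sequence[str], link_pos: int, keywords: Sequence[str]) -> List[float]:
    """in_k = sum over words w in the page of delta(k, w) / dist(w, l)."""
    inputs = []
    for k in keywords:
        total = 0.0
        for i, w in enumerate(tokens):
            if w == k:  # delta(k, w) = 1 on a match, 0 otherwise
                total += 1.0 / (abs(i - link_pos) + 1)  # assumed distance measure
        inputs.append(total)
    return inputs


def score_link(inputs: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Stand-in for nnet(in_1, ..., in_N): one tanh unit with hypothetical weights."""
    return math.tanh(bias + sum(w * x for w, x in zip(weights, inputs)))


def selection_probabilities(scores: Sequence[float], beta: float) -> List[float]:
    """Pr[l] = exp(beta * lambda_l) / sum over l' of exp(beta * lambda_l')."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical page text and link positions (token indices).
tokens = "focused web crawler follows each link near crawler keywords".split()
keywords = ["crawler", "web"]
link_positions = [2, 8]
weights = [1.0, 1.0]

scores = [score_link(keyword_inputs(tokens, pos, keywords), weights) for pos in link_positions]
probs = selection_probabilities(scores, beta=2.0)

rng = random.Random(0)  # seeded only to make the sketch repeatable
chosen = rng.choices(range(len(link_positions)), weights=probs, k=1)[0]
print(scores, probs, chosen)
```

The parameter β controls how greedy the selector is: β = 0 picks links uniformly at random, while a large β almost always follows the highest-scoring link.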





Reinforcement Learning

- In general, reward function $R: S \times A \rightarrow \Re$
- Learn policy ($\pi: S \rightarrow A$) to maximize reward over time, typically discounted in the future:

$$V = \sum_t \gamma^t r(t), \quad 0 \le \gamma < 1$$

- Q-learning: optimal policy

$$\pi^*(s) = \arg\max_a Q(s,a) = \arg\max_a \left[ R(s,a) + \gamma V^*(s') \right]$$

where $V^*(s')$ is the value of following the optimal policy in the future.

[Figure: state s with actions a1 and a2 leading to states s1 and s2]
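The optimal-policy equation can be made concrete on a tiny deterministic problem shaped like the slide's diagram (state s, actions a1 and a2, successor states s1 and s2). The sketch below is an assumption-laden illustration: it uses repeated Bellman sweeps over a known transition table rather than sampled Q-learning updates, and the rewards, the `stay` action, and the discount γ = 0.9 are all hypothetical.

```python
from typing import Dict, Tuple

State = str
Action = str
# (state, action) -> (immediate reward, next state); deterministic by assumption
Transitions = Dict[Tuple[State, Action], Tuple[float, State]]


def q_values(transitions: Transitions, gamma: float, sweeps: int = 500) -> Dict[Tuple[State, Action], float]:
    """Iterate Q(s, a) = R(s, a) + gamma * max over a' of Q(s', a') until it settles."""
    q = {sa: 0.0 for sa in transitions}
    for _ in range(sweeps):
        new_q = {}
        for (s, a), (reward, s_next) in transitions.items():
            # V(s') is the best Q-value available from the next state
            v_next = max((q[(s2, a2)] for (s2, a2) in q if s2 == s_next), default=0.0)
            new_q[(s, a)] = reward + gamma * v_next
        q = new_q
    return q


def greedy_policy(q: Dict[Tuple[State, Action], float]) -> Dict[State, Action]:
    """pi*(s) = argmax over a of Q(s, a)."""
    best: Dict[State, Action] = {}
    for (s, a), value in q.items():
        if s not in best or value > q[(s, best[s])]:
            best[s] = a
    return best


# Hypothetical rewards: a2 pays 1 immediately, a1 pays 0 now but leads to
# s1, which pays 2 on every later step.
transitions: Transitions = {
    ("s", "a1"): (0.0, "s1"),
    ("s", "a2"): (1.0, "s2"),
    ("s1", "stay"): (2.0, "s1"),
    ("s2", "stay"): (0.0, "s2"),
}

q = q_values(transitions, gamma=0.9)
policy = greedy_policy(q)
print(policy["s"], round(q[("s", "a1")], 3), round(q[("s", "a2")], 3))
```

Here the policy prefers a1 even though a2 has the larger immediate reward, because the discounted future value of s1 outweighs it. This is the same trade-off a crawler faces when a low-scoring link leads toward a rich cluster of relevant pages.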






































This document discusses web crawling techniques. It begins with an outline of topics covered, including the motivation and taxonomy of crawlers, basic crawlers and implementation issues, universal crawlers, preferential crawlers, crawler evaluation, ethics, and new developments. It then covers basic crawlers and their implementation, including graph traversal techniques, a basic crawler code example in Perl, and various implementation issues around fetching, parsing, indexing text, dealing with dynamic content, relative URLs, and URL canonicalization.












































































































