This document discusses web crawling techniques. It begins with an outline of topics covered, including the motivation and taxonomy of crawlers, basic crawlers and implementation issues, universal crawlers, preferential crawlers, crawler evaluation, ethics, and new developments. It then covers basic crawlers and their implementation, including graph traversal techniques, a basic crawler code example in Perl, and various implementation issues around fetching, parsing, indexing text, dealing with dynamic content, relative URLs, and URL canonicalization.

Ch. 8: Web Crawling
By Filippo Menczer
Indiana University School of Informatics
In: Web Data Mining by Bing Liu, Springer, 2007
Slides © 2007 Filippo Menczer, Indiana University School of Informatics. Bing Liu: Web Data Mining, Springer, 2007.

Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
Crawler: basic idea
• Start from a set of starting pages (seeds) and follow links to discover new pages.
Many names
• Crawler
• Spider
• Robot (or bot)
• Web agent
• Wanderer, worm, …
• And famous instances: googlebot, scooter, slurp, msnbot, …
Googlebot & you
Motivation for crawlers
• Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
• Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
• Business intelligence: keep track of potential competitors, partners
• Monitor Web sites of interest
• Evil: harvest emails for spamming, phishing…
• … Can you think of some others?
A crawler within a search engine
(Diagram: the crawler, e.g. googlebot, fetches pages from the Web into a page repository; text & link analysis produce the text index and PageRank, which the ranker combines to return hits for a query.)
One taxonomy of crawlers
• Crawlers
– Universal crawlers
– Preferential crawlers
• Focused crawlers
• Topical crawlers
– Static crawlers: best-first, PageRank, etc.
– Adaptive topical crawlers: evolutionary crawlers, reinforcement learning crawlers, etc.
• Many other criteria could be used:
– Incremental, interactive, concurrent, etc.
Basic crawlers
• This is a sequential crawler
• Seeds can be any list of starting URLs
• Order of page visits is determined by the frontier data structure
• Stop criterion can be anything
Graph traversal (BFS or DFS?)
• Breadth First Search
– Implemented with a QUEUE (FIFO)
– Finds pages along shortest paths
– If we start with “good” pages, this keeps us close; maybe other good stuff…
• Depth First Search
– Implemented with a STACK (LIFO)
– Wander away (“lost in cyberspace”)
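The two traversal orders differ only in how the frontier is used. A minimal Python sketch over a toy in-memory link graph (a real crawler would fetch and parse pages instead):

```python
from collections import deque

def crawl_order(graph, seed, dfs=False):
    """Return the visit order over an in-memory link graph.
    BFS uses the frontier as a FIFO queue (popleft); DFS pops from
    the same end it pushes to, i.e. treats it as a LIFO stack."""
    frontier = deque([seed])
    visited = []
    seen = {seed}
    while frontier:
        url = frontier.pop() if dfs else frontier.popleft()
        visited.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

web = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
print(crawl_order(web, "a"))            # BFS: ['a', 'b', 'c', 'd', 'e']
print(crawl_order(web, "a", dfs=True))  # DFS: ['a', 'c', 'e', 'b', 'd']
```

Swapping a single line (queue vs. stack discipline) changes the whole character of the crawl, which is the point the slide makes.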
A basic crawler in Perl
• Queue: a FIFO list (shift and push)

my @frontier = read_seeds($file);
while (@frontier && $tot < $max) {
    my $next_link = shift @frontier;
    my $page = fetch($next_link);
    add_to_index($page);
    my @links = extract_links($page, $next_link);
    push @frontier, process(@links);
}
Implementation issues
• Don’t want to fetch same page twice!
– Keep lookup table (hash) of visited pages
– What if not visited but in frontier already?
• The frontier grows very fast!
– May need to prioritize for large crawls
• Fetcher must be robust!
– Don’t crash if download fails
– Timeout mechanism
• Determine file type to skip unwanted files
– Can try using extensions, but not reliable
– Can issue ‘HEAD’ HTTP commands to get Content-Type (MIME) headers, but overhead of extra Internet requests
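The first two bullets (already visited vs. already in the frontier) can be covered by a single "seen" set that records every URL ever enqueued; a sketch with assumed names:

```python
def enqueue_if_new(frontier, seen, url):
    """Add url to the frontier only if it is neither visited nor
    already waiting in the frontier: one set answers both questions,
    because a URL enters 'seen' the moment it is enqueued."""
    if url not in seen:
        seen.add(url)
        frontier.append(url)

frontier, seen = [], set()
for u in ["http://a/", "http://b/", "http://a/"]:
    enqueue_if_new(frontier, seen, u)
print(frontier)  # ['http://a/', 'http://b/'] (duplicate dropped)
```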
More implementation issues
• Fetching
– Get only the first 10-100 KB per page
– Take care to detect and break redirection loops
– Soft fail for timeout, server not responding, file not found, and other errors
More implementation issues: Parsing
• HTML has the structure of a DOM (Document Object Model) tree
• Unfortunately actual HTML is often incorrect in a strict syntactic sense
• Crawlers, like browsers, must be robust/forgiving
• Fortunately there are tools that can help
– E.g. tidy.sourceforge.net
• Must pay attention to HTML entities and unicode in text
• What to do with a growing number of other formats?
– Flash, SVG, RSS, AJAX…
More implementation issues
• Stop words
– Noise words that do not carry meaning should be eliminated (“stopped”) before they are indexed
– E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc…
– Typically syntactic markers
– Typically the most common terms
– Typically kept in a negative dictionary
• 10–1,000 elements
• E.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
– Parser can detect these right away and disregard them
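Filtering against a negative dictionary at parse time is a one-line set lookup; a tiny sketch (the stop list here is just the slide's examples, not a real dictionary):

```python
STOP_WORDS = {"and", "the", "a", "at", "or", "on", "for"}  # toy negative dictionary

def index_terms(text):
    """Tokenize naively and drop stop words before indexing."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(index_terms("The crawler waits for a page"))  # ['crawler', 'waits', 'page']
```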
More implementation issues: Conflation and thesauri
• Idea: improve recall by merging words with the same meaning
1. We want to ignore superficial morphological features, thus merge semantically similar tokens
– {student, study, studying, studious} => studi
2. We can also conflate synonyms into a single form using a thesaurus
– 30-50% smaller index
– Doing this in both pages and queries allows us to retrieve pages about ‘automobile’ when the user asks for ‘car’
– Thesaurus can be implemented as a hash table
More implementation issues
• Stemming
– Morphological conflation based on rewrite rules
– Language dependent!
– Porter stemmer very popular for English
• http://www.tartarus.org/~martin/PorterStemmer/
• Context-sensitive grammar rules, e.g.: “IES” except (“EIES” or “AIES”) --> “Y”
• Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.
– Porter has also developed Snowball, a language to create stemming algorithms in any language
• http://snowball.tartarus.org/
• Ex. Perl modules: Lingua::Stem and Lingua::Stem::Snowball
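The single quoted rule can be written as a toy illustration (just this one rule, not the full Porter stemmer):

```python
def ies_rule(word):
    """Apply only the rule quoted above: 'IES' -> 'Y', except when the
    ending is 'EIES' or 'AIES'. A toy fragment, not a full stemmer."""
    w = word.lower()
    if w.endswith("ies") and not (w.endswith("eies") or w.endswith("aies")):
        return w[:-3] + "y"
    return w

print(ies_rule("ponies"))  # pony
print(ies_rule("cities"))  # city
```

The exception clauses are what makes the rule "context-sensitive": the rewrite depends on characters beyond the suffix itself.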
More implementation issues
• Static vs. dynamic pages
– Is it worth trying to eliminate dynamic pages and only index static pages?
– Examples:
• http://www.census.gov/cgi-bin/gazetteer
• http://informatics.indiana.edu/research/colloquia.asp
• http://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452
• http://www.imdb.com/Name?Menczer,+Erico
• http://www.imdb.com/name/nm0578801/
– Why or why not? How can we tell if a page is dynamic? What about ‘spider traps’?
– What do Google and other search engines do?
More implementation issues
• Relative vs. absolute URLs
– Crawler must translate relative URLs into absolute URLs
– Need to obtain the Base URL from the HTTP header, or an HTML Meta tag, or else the current page path by default
– Examples:
• Base: http://www.cnn.com/linkto/
• Relative URL: intl.html
→ Absolute URL: http://www.cnn.com/linkto/intl.html
• Relative URL: /US/
→ Absolute URL: http://www.cnn.com/US/
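Python's standard library implements exactly this resolution; the slide's two examples can be reproduced with urllib.parse.urljoin:

```python
from urllib.parse import urljoin

base = "http://www.cnn.com/linkto/"
# A relative path is resolved against the base URL's directory
print(urljoin(base, "intl.html"))  # http://www.cnn.com/linkto/intl.html
# A path starting with '/' is resolved against the site root
print(urljoin(base, "/US/"))       # http://www.cnn.com/US/
```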
More implementation issues
• URL canonicalization
– All of these:
• http://www.cnn.com/TECH
• http://WWW.CNN.COM/TECH/
• http://www.cnn.com:80/TECH/
• http://www.cnn.com/bogus/../TECH/
– Are really equivalent to this canonical form:
• http://www.cnn.com/TECH/
– In order to avoid duplication, the crawler must transform all URLs into canonical form
– Definition of “canonical” is arbitrary, e.g.:
• Could always include the port
• Or only include the port when it is not the default :80
More on Canonical URLs
• Some transformations are trivial, for example:
– http://informatics.indiana.edu
→ http://informatics.indiana.edu/
– http://informatics.indiana.edu/index.html#fragment
→ http://informatics.indiana.edu/index.html
– http://informatics.indiana.edu/dir1/./../dir2/
→ http://informatics.indiana.edu/dir2/
– http://informatics.indiana.edu/%7Efil/
→ http://informatics.indiana.edu/~fil/
– http://INFORMATICS.INDIANA.EDU/fil/
→ http://informatics.indiana.edu/fil/
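A minimal canonicalizer covering several of the trivial transformations (lowercase host, drop the default :80 port and the fragment, resolve '.'/'..' segments); one possible sketch, not a complete implementation (e.g. it does not percent-decode %7E to ~):

```python
from urllib.parse import urlsplit, urlunsplit, urljoin

def canonicalize(url):
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme default (:80 for http)
    if parts.port and not (parts.scheme == "http" and parts.port == 80):
        host = f"{host}:{parts.port}"
    # Re-joining the path against the bare host resolves '.' and '..' segments
    path = urlsplit(urljoin(f"{parts.scheme}://{host}/", parts.path)).path or "/"
    # Rebuild without the fragment
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

print(canonicalize("http://WWW.CNN.COM:80/bogus/../TECH/"))  # http://www.cnn.com/TECH/
```

Note that the heuristic choices on the next slide (default file names, trailing slashes) are deliberately left out, since they cannot be decided from the URL alone.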
More on Canonical URLs
Other transformations require heuristic assumptions about the intentions of the author or the configuration of the Web server:
1. Removing the default file name
– http://informatics.indiana.edu/fil/index.html
→ http://informatics.indiana.edu/fil/
– This is reasonable in general but would be wrong in this case because the default happens to be ‘default.asp’ instead of ‘index.html’
2. Trailing directory
– http://informatics.indiana.edu/fil
→ http://informatics.indiana.edu/fil/
– This is correct in this case, but how can we be sure in general that there isn’t a file named ‘fil’ in the root dir?
More implementation issues
• Spider traps
– Misleading sites: indefinite number of pages dynamically generated by CGI scripts
– Paths of arbitrary depth created using soft directory links and path rewriting features in the HTTP server
– Only heuristic defensive measures:
• Check URL length; assume spider trap above some threshold, for example 128 characters
• Watch for sites with a very large number of URLs
• Eliminate URLs with non-textual data types
• May disable crawling of dynamic pages, if they can be detected
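The heuristic defenses above are simple predicates over the URL and per-site counters; a sketch (the extension list and the per-site limit are illustrative assumptions, only the 128-character threshold comes from the slide):

```python
MAX_URL_LEN = 128  # threshold from the slide
SKIP_EXTENSIONS = (".jpg", ".png", ".zip", ".exe")  # assumed non-textual types

def looks_like_trap(url, urls_seen_per_site, max_per_site=10000):
    """Heuristics only: very long URLs, non-textual extensions, and
    sites that have already emitted a huge number of URLs are skipped."""
    if len(url) > MAX_URL_LEN:
        return True
    if url.lower().endswith(SKIP_EXTENSIONS):
        return True
    site = url.split("/")[2] if "//" in url else url
    return urls_seen_per_site.get(site, 0) > max_per_site

print(looks_like_trap("http://example.com/" + "a/" * 100, {}))  # True (too long)
```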
More implementation issues
• Page repository
– Naïve: store each page as a separate file
• Can map URL to unique filename using a hashing function, e.g. MD5
• This generates a huge number of files, which is inefficient from the storage perspective
– Better: combine many pages into a single large file, using some XML markup to separate and identify them
• Must map URL to {filename, page_id}
– Database options
• Any RDBMS -- large overhead
• Light-weight, embedded databases such as Berkeley DB
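The naïve URL-to-filename mapping is one hash call; a sketch of that scheme:

```python
import hashlib

def page_filename(url):
    """Map a URL to a unique, fixed-length filename via MD5,
    the naive one-file-per-page scheme described above."""
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"

print(page_filename("http://www.cnn.com/TECH/"))
```

Every filename is 32 hex digits plus the extension, regardless of URL length, which is exactly why hashing is used instead of escaping the URL itself.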
Concurrency
• A crawler incurs several delays:
– Resolving the host name in the URL to an IP address using DNS
– Connecting a socket to the server and sending the request
– Receiving the requested page in response
• Solution: overlap the above delays by fetching many pages concurrently
Architecture of a concurrent crawler
Concurrent crawlers
• Can use multi-processing or multi-threading
• Each process or thread works like a sequential crawler, except they share data structures: frontier and repository
• Shared data structures must be synchronized (locked for concurrent writes)
• Speedups by a factor of 5-10 are easy to obtain this way
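A thread-based sketch of the idea using Python's concurrent.futures; fetch here is a stand-in (a real crawler would issue HTTP requests, and the lock shows the synchronization point on the shared repository):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

lock = threading.Lock()
repository = {}

def fetch(url):
    # Stand-in for a network fetch; the overlapped delay would normally be I/O wait
    return f"<html>page at {url}</html>"

def worker(url):
    page = fetch(url)
    with lock:  # shared data structures must be locked for concurrent writes
        repository[url] = page

urls = [f"http://example.com/{i}" for i in range(20)]
with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(worker, urls))
print(len(repository))  # 20
```

Five workers each behave like a small sequential crawler; only the writes to the shared repository (and, in a full crawler, the frontier) need the lock.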
Universal crawlers
• Support universal search engines
• Large-scale
• Huge cost (network bandwidth) of crawl is amortized over many queries from users
• Incremental updates to existing index and other data repositories
Large-scale universal crawlers
• Two major issues:
1. Performance
• Need to scale up to billions of pages
2. Policy
• Need to trade off coverage, freshness, and bias (e.g. toward “important” pages)
Large-scale crawlers: scalability
• Need to minimize overhead of DNS lookups
• Need to optimize utilization of network bandwidth and disk throughput (I/O is the bottleneck)
• Use asynchronous sockets
– Multi-processing or multi-threading do not scale up to billions of pages
– Non-blocking: hundreds of network connections open simultaneously
– Polling sockets to monitor completion of network transfers
High-level architecture of a scalable universal crawler
(Diagram annotations:)
• Several parallel queues to spread load across servers (keep connections alive)
• DNS server using UDP (less overhead than TCP), large persistent in-memory cache, and prefetching
• Optimize use of network bandwidth
• Optimize disk I/O throughput
• Huge farm of crawl machines
Universal crawlers: Policy
• Coverage
– New pages get added all the time
– Can the crawler find every page?
• Freshness
– Pages change over time, get removed, etc.
– How frequently can a crawler revisit?
• Trade-off!
– Focus on most “important” pages (crawler bias)?
– “Importance” is subjective
Web coverage by search engine crawlers
(Chart: estimated coverage of roughly 35% in 1997, 34% in 1998, 16% in 1999, and 50% in 2000.)
This assumes we know the size of the entire Web. Do we? Can you define “the size of the Web”?
Maintaining a “fresh” collection
• Universal crawlers are never “done”
• High variance in rate and amount of page changes
• HTTP headers are notoriously unreliable
– Last-modified
– Expires
• Solution
– Estimate the probability that a previously visited page has changed in the meanwhile
– Prioritize by this probability estimate
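One common way to sketch such an estimate (an assumed model, not prescribed by the slides) is a Poisson change process: if a page changes at an estimated rate λ per day, the probability it has changed d days after the last visit is 1 − e^(−λd), and pages can be revisited in decreasing order of that probability:

```python
import math

def change_probability(rate_per_day, days_since_visit):
    """P(at least one change since last visit) under a Poisson
    change process with the given rate."""
    return 1.0 - math.exp(-rate_per_day * days_since_visit)

# (url, estimated changes/day, days since last visit) -- made-up numbers
pages = [("a.html", 0.5, 3), ("b.html", 0.01, 3), ("c.html", 0.2, 10)]
revisit_order = sorted(pages, key=lambda p: -change_probability(p[1], p[2]))
print([url for url, _, _ in revisit_order])  # ['c.html', 'a.html', 'b.html']
```

Note the ordering is not simply by rate: the slow-changing page that has not been visited for a long time outranks a faster-changing page visited recently.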
Estimating page change rates
• Algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
– Brewington & Cybenko; Cho, Garcia-Molina & Page
• Assumption: recent past predicts the future (Ntoulas, Cho & Olston 2004)
– Frequency of change is not a good predictor
– Degree of change is a better predictor
Do we need to crawl the entire Web?
• If we cover too much, it will get stale
• There is an abundance of pages on the Web
• For PageRank, pages with very low prestige are largely useless
• What is the goal?
– General search engines: pages with high prestige
– News portals: pages that change often
– Vertical portals: pages on some topic
• What are appropriate priority measures in these cases? Approximations?
Breadth-first crawlers
• A BF crawler tends to crawl high-PageRank pages very early
• Therefore, a BF crawler is a good baseline to gauge other crawlers
• But why is this so? (Najork and Wiener 2001)
Bias of breadth-first crawlers
• The structure of the Web graph is very different from a random network
• Power-law distribution of in-degree
• Therefore there are hub pages with very high PR and many incoming links
• These are attractors: you cannot avoid them!
Preferential crawlers
• Assume we can estimate for each page an importance measure, I(p)
• Want to visit pages in order of decreasing I(p)
• Maintain the frontier as a priority queue sorted by I(p)
• Possible figures of merit:
– Precision ~ | p: crawled(p) & I(p) > threshold | / | p: crawled(p) |
– Recall ~ | p: crawled(p) & I(p) > threshold | / | p: I(p) > threshold |
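The priority-queue frontier is a small wrapper around a binary heap; a sketch using Python's heapq (a min-heap, so the key is negated to dequeue the highest I(p) first; the counter breaks ties stably):

```python
import heapq
import itertools

class Frontier:
    """Priority queue keyed on -I(p), so pop() returns the
    highest-importance URL first."""
    def __init__(self):
        self._heap = []
        self._count = itertools.count()  # tie-breaker for equal scores
    def push(self, url, importance):
        heapq.heappush(self._heap, (-importance, next(self._count), url))
    def pop(self):
        return heapq.heappop(self._heap)[2]
    def __len__(self):
        return len(self._heap)

f = Frontier()
for url, imp in [("a", 0.1), ("b", 0.9), ("c", 0.5)]:
    f.push(url, imp)
print(f.pop(), f.pop(), f.pop())  # b c a
```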
Preferential crawlers
• Selective bias toward some pages, e.g. most “relevant”/topical, closest to seeds, most popular/largest PageRank, unknown servers, highest rate/amount of change, etc…
• Focused crawlers
– Supervised learning: classifier based on labeled examples
• Topical crawlers
– Best-first search based on similarity(topic, parent)
– Adaptive crawlers
• Reinforcement learning
• Evolutionary algorithms/artificial life
Preferential crawling algorithms: Examples
• Breadth-First
– Exhaustively visit all links in order encountered
• Best-N-First
– Priority queue sorted by similarity, explore top N at a time
– Variants: DOM context, hub scores
• PageRank
– Priority queue sorted by keywords, PageRank
• SharkSearch
– Priority queue sorted by combination of similarity, anchor text, similarity of parent, etc. (powerful cousin of FishSearch)
• InfoSpiders
– Adaptive distributed algorithm using an evolving population of learning agents
Preferential crawlers: Examples
• For I(p) = PageRank (estimated based on pages crawled so far), we can find high-PR pages faster than a breadth-first crawler (Cho, Garcia-Molina & Page 1998)
(Plot: recall vs. crawl size.)
Focused crawlers: Basic idea
• Naïve-Bayes classifier based on example pages in desired topic, c*
• Score(p) = Pr(c*|p)
– Soft focus: frontier is a priority queue using page score
– Hard focus:
• Find best leaf ĉ for p
• If an ancestor c’ of ĉ is in c* then add links from p to frontier, else discard
– Soft and hard focus work equally well empirically
• Example: Open Directory
Focused crawlers
• Can have multiple topics with as many classifiers, with scores appropriately combined (Chakrabarti et al. 1999)
• Can use a distiller to find topical hubs periodically, and add these to the frontier
• Can accelerate with the use of a critic (Chakrabarti et al. 2002)
• Can use alternative classifier algorithms to naïve-Bayes, e.g. SVM and neural nets have reportedly performed better (Pant & Srinivasan 2005)
Context-focused crawlers
• Same idea, but multiple classes (and classifiers) based on link distance from relevant targets
– ℓ=0 is the topic of interest
– ℓ=1 links to the topic of interest
– Etc.
• Initially needs a back-crawl from seeds (or known targets) to train classifiers to estimate distance
• Links in frontier prioritized based on estimated distance from targets
• Outperforms standard focused crawler empirically
(Diagram: context graph.)
Topical crawlers
• All we have is a topic (query, description, keywords) and a set of seed pages (not necessarily relevant)
• No labeled examples
• Must predict relevance of unvisited links to prioritize
• Original idea: Menczer 1997, Menczer & Belew 1998
Example: myspiders.informatics.indiana.edu
Topical locality
• Topical locality is a necessary condition for a topical crawler to work, and for surfing to be a worthwhile activity for humans
• Links must encode semantic information, i.e. say something about neighbor pages, not be random
• It is also a sufficient condition if we start from “good” seed pages
• Indeed we know that Web topical locality is strong:
– Indirectly (crawlers work and people surf the Web)
– From direct measurements (Davison 2000; Menczer 2004, 2005)
Quantifying topical locality
• Different ways to pose the question:
– How quickly does semantic locality decay?
– How fast is topic drift?
– How quickly does content change as we surf away from a starting page?
• To answer these questions, let us consider exhaustive breadth-first crawls from 100 topic pages
The “link-cluster” conjecture
• Connection between semantic topology (relevance) and link topology (hypertext)
– G = Pr[rel(p)] ~ fraction of relevant/topical pages (topic generality)
– R = Pr[rel(p) | rel(q) AND link(q,p)] ~ conditional probability given a neighbor on topic
• Related nodes are clustered if R > G
– Necessary and sufficient condition for a random crawler to find pages related to start points
– Example: 2 topical clusters with stronger modularity within each cluster than outside (in the figure: C = 2 clusters, G = 5/15, R = 3/6 and 2/4)
Link-cluster conjecture
• Stationary hit rate for a random crawler:
η(t+1) = η(t)·R + (1 − η(t))·G ≥ η(t)
η(t) → η* = G / (1 − (R − G)) as t → ∞
• Conjecture: η* > G ⇔ R > G
• Value added of links:
η*/G − 1 = (R − G) / (1 − (R − G))
Link-cluster conjecture
• Preservation of semantics (meaning) across links
• 1000 times more likely to be on topic if near an on-topic page!
• Generalizing to a neighborhood of link radius δ:
R(q,δ) / G(q) ≡ Pr[ rel(p) | rel(q) ∧ path(q,p) ≤ δ ] / Pr[rel(p)]
L(q,δ) ≡ Σ_{p: path(q,p) ≤ δ} path(q,p) / |{p: path(q,p) ≤ δ}|
The “link-content” conjecture
• Correlation of lexical (content) and linkage topology
• L(δ): average link distance
• S(δ): average content similarity to the start (topic) page from pages up to distance δ:
S(q,δ) ≡ Σ_{p: path(q,p) ≤ δ} sim(q,p) / |{p: path(q,p) ≤ δ}|
• Correlation ρ(L,S) = –0.76
Heterogeneity of link-content correlation
• Decay fit: S = c + (1 − c)·e^(−a·L^b)
(Plot: fitted decay curves for edu, net, gov, org, and com pages; some domains differ significantly in a only, others in both a and b, at α=0.05.)
• .com has more drift
Topical locality-inspired tricks for topical crawlers
• Co-citation (a.k.a. sibling locality): A and C are good hubs, thus A and D should be given high priority
• Co-reference (a.k.a. bibliographic coupling): E and G are good authorities, thus E and H should be given high priority
Correlations between different similarity measures
• Semantic similarity measured from ODP, correlated with:
– Content similarity: TF or TF-IDF vector cosine
– Link similarity: Jaccard coefficient of (in+out) link neighborhoods
• Correlation overall is significant but weak
• Much stronger topical locality in some topics, e.g.:
– Links very informative in news sources
– Text very informative in recipes
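The link-similarity measure mentioned above is a plain Jaccard coefficient over the pages' combined in- and out-link neighborhoods; a minimal sketch:

```python
def jaccard(neighbors_p, neighbors_q):
    """Jaccard coefficient of the (in+out) link neighborhoods of two
    pages: |intersection| / |union| of the neighbor sets."""
    p, q = set(neighbors_p), set(neighbors_q)
    union = p | q
    return len(p & q) / len(union) if union else 0.0

# Two pages sharing 2 of 4 distinct neighbors
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```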
Naïve Best-First
Simplest topical crawler: the frontier is a priority queue based on text similarity between topic and parent page.

BestFirst(topic, seed_urls) {
    foreach link (seed_urls) {
        enqueue(frontier, link);
    }
    while (#frontier > 0 and visited < MAX_PAGES) {
        link := dequeue_link_with_max_score(frontier);
        doc := fetch_new_document(link);
        score := sim(topic, doc);
        foreach outlink (extract_links(doc)) {
            if (#frontier >= MAX_BUFFER) {
                dequeue_link_with_min_score(frontier);
            }
            enqueue(frontier, outlink, score);
        }
    }
}
Best-first variations
• Many in the literature, mostly stemming from different ways to score unvisited URLs. E.g.:
– Giving more importance to certain HTML markup in parent page
– Extending text representation of parent page with anchor text from “grandparent” pages (SharkSearch)
– Limiting link context to less than the entire page
– Exploiting topical locality (co-citation)
– Exploration vs exploitation: relax priorities
• Any of these can be (and many have been) combined
Link context based on text neighborhood
• Often consider a fixed-size window, e.g. 50 words around the anchor
• Can weigh links based on their distance from topic keywords within the document (InfoSpiders, Clever)
• Anchor text deserves extra importance
Link context based on DOM tree
• Consider the DOM subtree rooted at the parent node of the link’s <a> tag
• Or can go further up in the tree (Naïve Best-First is the special case of the entire document body)
• Trade-off between noise due to too small or too large a context tree (Pant 2003)
DOM context
• Link score = linear combination between page-based and context-based similarity scores
Co-citation: hub scores
• Link score with hubs = linear combination between link and hub score
• Hub score = number of seeds linked from the page
Combining DOM context and hub scores
• Experiment based on 159 ODP topics (Pant & Menczer 2003)
• Split ODP URLs between seeds and targets
• Add the 10 best hubs to seeds for 94 topics
Exploration vs Exploitation
• Best-N-First (or BFSN)
• Rather than re-sorting the frontier every time you add links, be lazy and sort only every N pages visited
• Empirically, being less greedy helps crawler performance significantly: escape “local topical traps” by exploring more (Pant et al. 2002)
InfoSpiders
• A series of intelligent multi-agent topical crawling algorithms employing various adaptive techniques:
– Evolutionary bias of exploration/exploitation
– Selective query expansion
– (Connectionist) reinforcement learning
• Menczer & Belew 1998, 2000; Menczer et al. 2004
Link scoring and selection by each crawling agent
(Diagram: for each link l, instances of each keyword k_i near the link feed the agent’s neural net; a stochastic selector then picks the next link. Reconstructed formulas, assuming the standard InfoSpiders design:)
• Input for keyword k_i around link l: sum of matches with inverse-distance weighting
in_i = Σ_{ω ∈ Δ} δ(k_i, ω) / dist(ω, l)
• Stochastic link selection over the neural net’s link scores λ:
Pr[l] = e^(β·λ_l) / Σ_{l′} e^(β·λ_l′)
Artificial life-inspired Evolutionary Local Selection Algorithm (ELSA):

For each agent thread:
  Pick & follow link from local frontier
  Evaluate new links, merge frontier   (match vs. resource bias)
  Adjust link estimator                (reinforcement learning)
  E := E + payoff - cost
  If E < 0:
    Die
  Elsif E > Selection_Threshold:
    Clone offspring                    (selective query expansion)
    Split energy with offspring
    Split frontier with offspring
    Mutate offspring
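The energy dynamics above can be illustrated with a toy update step (a simplified sketch; the agent representation and the even/odd frontier split are illustrative choices, not the original implementation):

```python
def elsa_step(agent, payoff, cost, threshold=2.0):
    """One ELSA energy update for an agent dict with keys 'E' and 'frontier'.
    Returns (agent, offspring): agent is None if it died; offspring is None
    unless the agent's energy exceeded the selection threshold."""
    agent["E"] += payoff - cost
    if agent["E"] < 0:
        return None, None                              # agent dies
    if agent["E"] > threshold:
        child = {"E": agent["E"] / 2,                  # split energy
                 "frontier": agent["frontier"][1::2]}  # split frontier
        agent["E"] /= 2
        agent["frontier"] = agent["frontier"][0::2]
        return agent, child                            # clone offspring
    return agent, None
```

Mutation of the offspring's keywords and weights would follow the clone step.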
Adaptation in InfoSpiders
• Unsupervised population evolution
– Select agents to match resource bias
– Mutate internal queries: selective query expansion
– Mutate weights
• Unsupervised individual adaptation
– Q-learning: adjust neural net weights to predict relevance locally
InfoSpiders evolutionary bias: an agent in a relevant area will spawn other agents to exploit/explore that neighborhood. Each agent carries a keyword vector, a neural net, and a local frontier, which it shares with its offspring.
Multithreaded InfoSpiders (MySpiders)
• Different ways to compute the cost of visiting a document:
– Constant: cost_const = E0 · p0 / T_max
– Proportional to download time: cost_time = f(cost_const · t / timeout)
• The latter is of course more efficient (faster crawling), but it also yields better-quality pages!
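The two cost schemes can be written out directly (a sketch; taking f to be linear and the parameter values are illustrative assumptions):

```python
def cost_const(E0, p0, T_max):
    """Constant cost per page: start-up energy spread over T_max visits."""
    return E0 * p0 / T_max

def cost_time(E0, p0, T_max, t, timeout):
    """Cost proportional to download time t (with f taken as linear):
    fast servers are cheap to visit, slow ones drain the agent's energy."""
    return cost_const(E0, p0, T_max) * t / timeout
```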
Selective Query Expansion in InfoSpiders: internalization of local text features. When a new agent is spawned, it picks up a common term from the current page (here the stem 'th') and adds it, with a weight, to its keyword vector. [Figure: weighted keyword vector with stems POLIT, CONSTITUT, TH, SYSTEM, GOVERN]
Reinforcement Learning
• In general, a reward function R: S × A → ℝ
• Learn a policy π: S → A to maximize reward over time, typically discounted in the future:
  V = Σ_t γ^t r(t),  0 ≤ γ < 1
• Q-learning: the optimal policy picks the action whose value includes that of following the optimal policy in the future:
  π*(s) = argmax_a Q(s,a) = argmax_a [ R(s,a) + γ V*(s') ]
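The discounted objective and the greedy policy can be checked concretely (a toy sketch with a hypothetical tabular Q; real InfoSpiders agents estimate Q with a neural net):

```python
def discounted_return(rewards, gamma=0.9):
    """V = sum over t of gamma^t * r(t), with 0 <= gamma < 1."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def greedy_policy(Q, state):
    """pi*(s) = argmax over a of Q(s, a), for a dict-of-dicts Q-table."""
    actions = Q[state]
    return max(actions, key=actions.get)
```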
Q-learning in InfoSpiders
• Use neural nets to estimate Q scores
• Compare the estimated relevance of a visited page with the Q score of its link (estimated from the parent page) to obtain a feedback signal
• Learn neural net weights using back-propagation of error, with teaching input E(D) + γ · max_{l ∈ D} λ_l
Other Reinforcement Learning Crawlers
• Rennie & McCallum (1999):
– Naïve-Bayes classifier trained on text near links in pre-labeled examples to estimate Q values
– Immediate reward R = 1 for "on-topic" pages (with desired CS papers for the CORA repository)
– All RL algorithms outperform Breadth-First Search
• Future discounting: "For spidering, it is always better to choose immediate over delayed rewards" -- or is it?
– We cannot possibly cover the entire search space, and recall that by being greedy we can be trapped in local topical clusters and fail to discover better ones
– Need to explore!
Evaluation of topical crawlers
• Goal: build "better" crawlers to support applications (Srinivasan et al. 2005)
• Build an unbiased evaluation framework
– Define common tasks of measurable difficulty
– Identify topics and relevant targets
– Identify appropriate performance measures
• Effectiveness: quality of crawled pages, order, etc.
• Efficiency: separate the CPU & memory of crawler algorithms from bandwidth & common utilities
Evaluation corpus = ODP + Web
• Automate evaluation using edited directories
• Different sources of relevance assessments
Topics and Targets
• Topic level corresponds to specificity; depth corresponds to generality
Tasks
• Start from seeds, find targets and/or pages similar to target descriptions (at distance d=2 or d=3 from the targets)
• Back-crawl from the targets to get the seeds
Target-based performance measures
Q: What assumption are we making?
A: Independence!
Performance matrix (S_c^t = set of pages crawled by crawler c at time t; T_d = target pages at depth d; D_d = target descriptions at depth d, for d = 0, 1, 2; σ = similarity):

                        "recall"                        "precision"
Target pages            |S_c^t ∩ T_d| / |T_d|           |S_c^t ∩ T_d| / |S_c^t|
Target descriptions     Σ_{p ∈ S_c^t} σ(p, D_d)         Σ_{p ∈ S_c^t} σ(p, D_d) / |S_c^t|
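The four cells of this matrix translate directly into code (a sketch; `sigma` stands for any page-to-description similarity function, e.g. cosine similarity):

```python
def target_recall(crawled, targets):
    """|S ∩ T_d| / |T_d|: fraction of the known targets found."""
    return len(crawled & targets) / len(targets)

def target_precision(crawled, targets):
    """|S ∩ T_d| / |S|: fraction of crawled pages that are targets."""
    return len(crawled & targets) / len(crawled)

def description_recall(crawled, descriptions, sigma):
    """Sum over crawled pages of similarity to the target descriptions."""
    return sum(sigma(p, descriptions) for p in crawled)

def description_precision(crawled, descriptions, sigma):
    """The same sum, averaged over the number of crawled pages."""
    return description_recall(crawled, descriptions, sigma) / len(crawled)
```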
Crawling evaluation framework
• A main process supplies keywords and seed URLs to N crawlers
• Each crawler's logic keeps private data structures (a limited resource) and shares common data structures, including the URL store
• Concurrent fetch/parse/stem modules issue the HTTP requests to the Web
Using the framework to compare crawler performance: plot average target-page recall against the number of pages crawled.
Efficiency & scalability: plot performance/cost against link frontier size.
Topical crawler performance depends on topic characteristics:
• L = seed-target similarity
• P = popularity (topic keyword generality)
• A = target authoritativeness
• C = target link cohesiveness
[Table: how these characteristics relate to the performance of BreadthFirst, BFS-1, BFS-256, and InfoSpiders, on both target pages and target descriptions]
Crawler ethics and conflicts
• Crawlers can cause trouble, even unwillingly, if not properly designed to be "polite" and "ethical"
• For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
– Server administrator and users will be upset
– The crawler developer/admin IP address may be blacklisted
Crawler etiquette (important!)
• Identify yourself
– Use the 'User-Agent' HTTP header to identify the crawler, with a website describing the crawler and contact information for its developer
– Use the 'From' HTTP header to specify the crawler developer's email
– Do not disguise the crawler as a browser by using a browser's 'User-Agent' string
• Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem
• Pay attention to anything that may lead to too many requests to any one server, even unwillingly, e.g.:
– redirection loops
– spider traps
Crawler etiquette (important!)
• Spread the load, do not overwhelm a server
– Make sure to send no more than some maximum number of requests to any single server per unit time, say < 1/second
• Honor the Robot Exclusion Protocol
– A server can specify which parts of its document tree any crawler is or is not allowed to crawl in a file named 'robots.txt' placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt
– A crawler should always check, parse, and obey this file before sending any requests to a server
– More info at:
• http://www.google.com/robots.txt
• http://www.robotstxt.org/wc/exclusion.html
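The "spread the load" rule can be sketched as a minimal per-host rate limiter (a toy sketch; production crawlers interleave requests across hosts instead of sleeping):

```python
import time
from urllib.parse import urlparse

_last_request = {}  # host -> monotonic time of the last request to it

def polite_wait(url, min_delay=1.0):
    """Block until at least min_delay seconds have passed since the
    previous request to the same host, then record this request."""
    host = urlparse(url).netloc
    now = time.monotonic()
    wait = _last_request.get(host, float("-inf")) + min_delay - now
    if wait > 0:
        time.sleep(wait)
    _last_request[host] = time.monotonic()
```

Requests to different hosts proceed without waiting on each other.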
More on robot exclusion
• Make sure URLs are canonical before checking against robots.txt
• Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler
• Let's look at some examples to understand the protocol…
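The caching advice can be sketched with the Python standard library's robots.txt parser (a sketch; `fetch_robots` is a hypothetical function that would download http://host/robots.txt):

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

_robots_cache = {}  # host -> parsed robots.txt policy

def allowed(url, agent, fetch_robots):
    """Check url against its host's robots.txt, fetching and parsing
    the file only on the first request to that host."""
    host = urlsplit(url).netloc
    if host not in _robots_cache:
        rp = RobotFileParser()
        rp.parse(fetch_robots(host).splitlines())
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(agent, url)
```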
www.apple.com/robots.txt

# robots.txt for http://www.apple.com/
User-agent: *
Disallow:

All crawlers can go anywhere!
www.microsoft.com/robots.txt

# Robots.txt file for http://www.microsoft.com
User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
#etc…

All crawlers are disallowed from these paths.
www.springer.com/robots.txt

# Robots.txt for http://www.springer.com (fragment)
User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

The Google crawler is allowed everywhere except these paths; Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down; AltaVista (scooter) has no limits; everyone else keep off!
More crawler ethics issues
• Is compliance with robot exclusion a matter of law?
– No! Compliance is voluntary, but if you do not comply, you may be blocked
– Someone (unsuccessfully) sued the Internet Archive over a robots.txt-related issue
• Some crawlers disguise themselves
– Using a false User-Agent
– Randomizing access frequency to look like a human/browser
– Example: click fraud for ads
More crawler ethics issues
• Servers can disguise themselves, too
– Cloaking: present different content based on User-Agent
– E.g. stuff keywords into the version of a page shown to the search engine crawler
– Search engines do not look kindly on this type of "spamdexing" and remove from their index sites that perform such abuse
• The case of bmw.de made the news
Gray areas for crawler ethics
• If you write a crawler that unwillingly follows links to ads, are you just being careless, or are you violating terms of service, or are you violating the law by defrauding advertisers?
– Is non-compliance with Google's robots.txt in this case equivalent to click fraud?
• If you write a browser extension that performs some useful service, should you comply with robot exclusion?
New developments: social, collaborative, federated crawlers
• Idea: go beyond the "one-fits-all" model of centralized search engines
• Extend the search task to anyone, and distribute the crawling task
• Each search engine is a peer agent
• Agents collaborate by routing queries and results
6S: Collaborative Peer Search
• Each peer runs its own crawler and index over local storage, seeded from the user's bookmarks
• Peers route queries and hits to one another, creating data mining & referral opportunities
• Communities of like-minded peers emerge
Basic idea: learn based on prior query/response interactions
Learning about other peers
Query routing in 6S
Emergent semantic clustering
Simulation 1: 70 peers, 7 groups
• The dynamic network of queries and results exchanged among 6S peer agents quickly forms a small world, with small diameter and high clustering (Wu et al. 2005)
Simulation 2: 500 users, ODP (dmoz.org)
• Each synthetic user is associated with a topic
Semantic similarity: peers with similar interests are more likely to talk to each other (Akavipat et al. 2006)
Quality of results
• More sophisticated learning algorithms do better
• The more interactions, the better
Download and try the free 6S prototype: http://homer.informatics.indiana.edu/~nan/6S/
• 1-click configuration of a personal crawler and setup of a search engine
• Search via a Firefox browser extension
Need crawling code?
• Reference C implementation of HTTP, HTML parsing, etc.
– w3c-libwww package from the World Wide Web Consortium: www.w3c.org/Library/
• LWP (Perl)
– http://www.oreilly.com/catalog/perllwp/
– http://search.cpan.org/~gaas/libwww-perl-5.804/
• Open source crawlers/search engines
– Nutch: http://www.nutch.org/ (Jakarta Lucene: jakarta.apache.org/lucene/)
– Heritrix: http://crawler.archive.org/
– WIRE: http://www.cwr.cl/projects/WIRE/
– Terrier: http://ir.dcs.gla.ac.uk/terrier/
• Open source topical crawlers, Best-First-N (Java)
– http://informatics.indiana.edu/fil/IS/JavaCrawlers/
• Evaluation framework for topical crawlers (Perl)
– http://informatics.indiana.edu/fil/IS/Framework/

Motivation for crawlers
• Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
• Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
• Business intelligence: keep track of potential competitors, partners
• Monitor Web sites of interest
• Evil: harvest emails for spamming, phishing…
• … Can you think of some others?
A crawler within a search engine
• The crawler (e.g. googlebot) fetches pages from the Web into a page repository
• Text & link analysis builds the text index and PageRank
• The ranker combines these to answer queries with ranked hits
One taxonomy of crawlers
• Crawlers divide into universal crawlers and preferential crawlers
• Preferential crawlers include focused and topical crawlers
• Topical crawlers may be static (best-first, PageRank, etc.) or adaptive (evolutionary crawlers, reinforcement learning crawlers, etc.)
• Many other criteria could be used:
– Incremental, interactive, concurrent, etc.
Basic crawlers
• This is a sequential crawler
• Seeds can be any list of starting URLs
• Order of page visits is determined by the frontier data structure
• Stop criterion can be anything
Graph traversal (BFS or DFS?)
• Breadth-First Search
– Implemented with a QUEUE (FIFO)
– Finds pages along shortest paths
– If we start with "good" pages, this keeps us close; maybe other good stuff…
• Depth-First Search
– Implemented with a STACK (LIFO)
– Wander away ("lost in cyberspace")
A basic crawler in Perl
• Queue: a FIFO list (shift and push)

my @frontier = read_seeds($file);
while (@frontier && $tot < $max) {
    my $next_link = shift @frontier;
    my $page = fetch($next_link);
    add_to_index($page);
    my @links = extract_links($page, $next_link);
    push @frontier, process(@links);
}
Implementation issues
• Don't want to fetch the same page twice!
– Keep a lookup table (hash) of visited pages
– What if a page is not visited but already in the frontier?
• The frontier grows very fast!
– May need to prioritize for large crawls
• Fetcher must be robust!
– Don't crash if a download fails
– Timeout mechanism
• Determine file type to skip unwanted files
– Can try using extensions, but not reliable
– Can issue 'HEAD' HTTP commands to get Content-Type (MIME) headers, but at the overhead of extra Internet requests
More implementation issues
• Fetching
– Get only the first 10-100 KB per page
– Take care to detect and break redirection loops
– Soft fail for timeout, server not responding, file not found, and other errors
More implementation issues: Parsing
• HTML has the structure of a DOM (Document Object Model) tree
• Unfortunately, actual HTML is often incorrect in a strict syntactic sense
• Crawlers, like browsers, must be robust/forgiving
• Fortunately there are tools that can help
– E.g. tidy.sourceforge.net
• Must pay attention to HTML entities and Unicode in text
• What to do with a growing number of other formats?
– Flash, SVG, RSS, AJAX…
More implementation issues
• Stop words
– Noise words that do not carry meaning should be eliminated ("stopped") before they are indexed
– E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc…
– Typically syntactic markers
– Typically the most common terms
– Typically kept in a negative dictionary
• 10–1,000 elements
• E.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
– The parser can detect these right away and disregard them
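A negative dictionary is just a set lookup at parse time (a sketch with a tiny sample stop list; real lists run from tens to about a thousand entries):

```python
STOP_WORDS = {"and", "the", "a", "at", "or", "on", "for"}  # tiny sample list

def remove_stop_words(tokens):
    """Drop tokens found in the negative dictionary before indexing."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```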
More implementation issues: Conflation and thesauri
• Idea: improve recall by merging words with the same meaning
1. We want to ignore superficial morphological features, thus merge semantically similar tokens
– {student, study, studying, studious} => studi
2. We can also conflate synonyms into a single form using a thesaurus
– 30-50% smaller index
– Doing this in both pages and queries allows us to retrieve pages about 'automobile' when the user asks for 'car'
– The thesaurus can be implemented as a hash table
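The hash-table thesaurus, applied to both pages and queries, can be sketched as follows (the mappings shown are illustrative):

```python
THESAURUS = {"car": "automobile", "auto": "automobile"}  # synonym -> canonical form

def conflate(tokens):
    """Map each token to its canonical form, so 'car' in a query
    matches 'automobile' in a page (and vice versa)."""
    return [THESAURUS.get(t.lower(), t.lower()) for t in tokens]
```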
  • 19.
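As the slide suggests, the thesaurus can be a hash table mapping each synonym to a canonical form, applied to both pages and queries. The entries below are made-up illustrations, not a real thesaurus:

```python
# Thesaurus as a hash table: synonym -> canonical form (illustrative entries).
THESAURUS = {"car": "automobile", "auto": "automobile", "automobile": "automobile"}

def conflate(tokens):
    """Map each token to its canonical form if the thesaurus knows it."""
    return [THESAURUS.get(t.lower(), t.lower()) for t in tokens]
```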
More implementation issues
• Stemming
– Morphological conflation based on rewrite rules
– Language dependent!
– The Porter stemmer is very popular for English
• http://www.tartarus.org/~martin/PorterStemmer/
• Context-sensitive grammar rules, e.g.: "IES" except ("EIES" or "AIES") --> "Y"
• Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.
– Porter has also developed Snowball, a language to create stemming algorithms in any language
• http://snowball.tartarus.org/
• Ex. Perl modules: Lingua::Stem and Lingua::Stem::Snowball
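The single context-sensitive rule quoted above can be sketched as follows. A real stemmer such as Porter's applies many such rules in ordered steps; this illustrates only this one rule:

```python
# One context-sensitive rewrite rule: "IES" except ("EIES" or "AIES") --> "Y".
# Real stemmers chain many such rules; this is a one-rule illustration.
def apply_ies_rule(word):
    w = word.lower()
    if w.endswith("ies") and not (w.endswith("eies") or w.endswith("aies")):
        return w[:-3] + "y"
    return w
```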
More implementation issues
• Static vs. dynamic pages
– Is it worth trying to eliminate dynamic pages and only index static pages?
– Examples:
• http://www.census.gov/cgi-bin/gazetteer
• http://informatics.indiana.edu/research/colloquia.asp
• http://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452
• http://www.imdb.com/Name?Menczer,+Erico
• http://www.imdb.com/name/nm0578801/
– Why or why not? How can we tell if a page is dynamic? What about 'spider traps'?
– What do Google and other search engines do?
More implementation issues
• Relative vs. Absolute URLs
– The crawler must translate relative URLs into absolute URLs
– Need to obtain the Base URL from the HTTP header, or an HTML Meta tag, or else the current page path by default
– Examples
• Base: http://www.cnn.com/linkto/
• Relative URL: intl.html
• Absolute URL: http://www.cnn.com/linkto/intl.html
• Relative URL: /US/
• Absolute URL: http://www.cnn.com/US/
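The slide's examples can be reproduced with the standard library's URL resolution, which implements the relative-reference rules used here:

```python
# Resolving relative URLs against a base URL, as in the slide's examples.
from urllib.parse import urljoin

base = "http://www.cnn.com/linkto/"
print(urljoin(base, "intl.html"))  # http://www.cnn.com/linkto/intl.html
print(urljoin(base, "/US/"))       # http://www.cnn.com/US/
```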
More implementation issues
• URL canonicalization
– All of these:
• http://www.cnn.com/TECH
• http://WWW.CNN.COM/TECH/
• http://www.cnn.com:80/TECH/
• http://www.cnn.com/bogus/../TECH/
– Are really equivalent to this canonical form:
• http://www.cnn.com/TECH/
– In order to avoid duplication, the crawler must transform all URLs into canonical form
– The definition of "canonical" is arbitrary, e.g.:
• Could always include the port
• Or only include the port when it is not the default :80
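A sketch of one possible canonicalization, using only the standard library. As the slide notes, the choice of canonical form is arbitrary; this version lowercases scheme and host, drops the default port :80, resolves dot-segments, unquotes percent-escapes, drops fragments, and (unlike the slide's example form) drops trailing slashes:

```python
# One arbitrary choice of canonical form: lowercase scheme/host, drop the
# default port, resolve "." and ".." segments, unquote escapes, drop the
# fragment and trailing slash.
from urllib.parse import urlsplit, urlunsplit, unquote
import posixpath

def canonicalize(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    host = netloc.lower()
    if scheme == "http" and host.endswith(":80"):
        host = host[:-3]  # drop default port
    path = posixpath.normpath(unquote(path)) if path else "/"
    return urlunsplit((scheme, host, path, query, ""))
```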
More on Canonical URLs
• Some transformations are trivial, for example:
– http://informatics.indiana.edu → http://informatics.indiana.edu/
– http://informatics.indiana.edu/index.html#fragment → http://informatics.indiana.edu/index.html
– http://informatics.indiana.edu/dir1/./../dir2/ → http://informatics.indiana.edu/dir2/
– http://informatics.indiana.edu/%7Efil/ → http://informatics.indiana.edu/~fil/
– http://INFORMATICS.INDIANA.EDU/fil/ → http://informatics.indiana.edu/fil/
More on Canonical URLs
Other transformations require heuristic assumptions about the intentions of the author or the configuration of the Web server:
1. Removing the default file name
– http://informatics.indiana.edu/fil/index.html → http://informatics.indiana.edu/fil/
– This is reasonable in general, but would be wrong in this case because the default happens to be 'default.asp' instead of 'index.html'
2. Adding a trailing slash to a directory
– http://informatics.indiana.edu/fil → http://informatics.indiana.edu/fil/
– This is correct in this case, but how can we be sure in general that there isn't a file named 'fil' in the root dir?
More implementation issues
• Spider traps
– Misleading sites: an indefinite number of pages dynamically generated by CGI scripts
– Paths of arbitrary depth created using soft directory links and path-rewriting features in the HTTP server
– Only heuristic defensive measures:
• Check URL length; assume a spider trap above some threshold, for example 128 characters
• Watch for sites with a very large number of URLs
• Eliminate URLs with non-textual data types
• May disable crawling of dynamic pages, if they can be detected
More implementation issues
• Page repository
– Naïve: store each page as a separate file
• Can map a URL to a unique filename using a hashing function, e.g. MD5
• This generates a huge number of files, which is inefficient from the storage perspective
– Better: combine many pages into a single large file, using some XML markup to separate and identify them
• Must map URL to {filename, page_id}
– Database options
• Any RDBMS -- large overhead
• Light-weight, embedded databases such as Berkeley DB
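The naïve URL-to-filename mapping mentioned above can be sketched with an MD5 digest; the ".html" suffix is just an illustrative convention:

```python
# Naive page repository: map each URL to a unique filename via MD5.
import hashlib

def page_filename(url):
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
```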
Concurrency
• A crawler incurs several delays:
– Resolving the host name in the URL to an IP address using DNS
– Connecting a socket to the server and sending the request
– Receiving the requested page in response
• Solution: overlap the above delays by fetching many pages concurrently
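A minimal sketch of overlapping those delays with a thread pool; `fetch` stands in for any function that downloads one URL and returns its content:

```python
# Overlap DNS, connection, and transfer delays by fetching many pages
# at once. `fetch` is a caller-supplied download function.
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))  # results in input order
```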
Architecture of a concurrent crawler (figure)
Concurrent crawlers
• Can use multi-processing or multi-threading
• Each process or thread works like a sequential crawler, except they share data structures: frontier and repository
• Shared data structures must be synchronized (locked for concurrent writes)
• Speedups by a factor of 5-10 are easy to obtain this way
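A sketch of a shared frontier whose writes are synchronized with a lock, as described above; the duplicate check via a `seen` set is a common convention, not something the slide specifies:

```python
# Shared URL frontier for concurrent workers; a lock guards all updates.
import threading
from collections import deque

class Frontier:
    def __init__(self, seeds=()):
        self._lock = threading.Lock()
        self._queue = deque(seeds)
        self._seen = set(seeds)  # avoid re-enqueueing (an added convention)

    def add(self, url):
        with self._lock:
            if url not in self._seen:
                self._seen.add(url)
                self._queue.append(url)

    def next(self):
        with self._lock:
            return self._queue.popleft() if self._queue else None
```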
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers
Universal crawlers
• Support universal search engines
• Large-scale
• The huge cost (network bandwidth) of a crawl is amortized over many queries from users
• Incremental updates to the existing index and other data repositories
Large-scale universal crawlers
• Two major issues:
1. Performance
• Need to scale up to billions of pages
2. Policy
• Need to trade off coverage, freshness, and bias (e.g. toward "important" pages)
Large-scale crawlers: scalability
• Need to minimize the overhead of DNS lookups
• Need to optimize utilization of network bandwidth and disk throughput (I/O is the bottleneck)
• Use asynchronous sockets
– Multi-processing or multi-threading do not scale up to billions of pages
– Non-blocking: hundreds of network connections open simultaneously
– Polling sockets to monitor completion of network transfers
High-level architecture of a scalable universal crawler (diagram annotations):
• Several parallel queues to spread load across servers (keep connections alive)
• DNS server using UDP (less overhead than TCP), large persistent in-memory cache, and prefetching
• Optimize use of network bandwidth
• Optimize disk I/O throughput
• Huge farm of crawl machines
Universal crawlers: Policy
• Coverage
– New pages get added all the time
– Can the crawler find every page?
• Freshness
– Pages change over time, get removed, etc.
– How frequently can a crawler revisit?
• Trade-off!
– Focus on the most "important" pages (crawler bias)?
– "Importance" is subjective
Web coverage by search engine crawlers
(Chart: estimated coverage of roughly 35% in 1997, 34% in 1998, 16% in 1999, and 50% in 2000.)
This assumes we know the size of the entire Web. Do we? Can you define "the size of the Web"?
Maintaining a "fresh" collection
• Universal crawlers are never "done"
• High variance in the rate and amount of page changes
• HTTP headers are notoriously unreliable
– Last-modified
– Expires
• Solution
– Estimate the probability that a previously visited page has changed in the meanwhile
– Prioritize by this probability estimate
Estimating page change rates
• Algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
– Brewington & Cybenko; Cho, Garcia-Molina & Page
• Assumption: the recent past predicts the future (Ntoulas, Cho & Olston 2004)
– Frequency of change is not a good predictor
– Degree of change is a better predictor
Do we need to crawl the entire Web?
• If we cover too much, it will get stale
• There is an abundance of pages on the Web
• For PageRank, pages with very low prestige are largely useless
• What is the goal?
– General search engines: pages with high prestige
– News portals: pages that change often
– Vertical portals: pages on some topic
• What are appropriate priority measures in these cases? Approximations?
Breadth-first crawlers
• A BF crawler tends to crawl high-PageRank pages very early
• Therefore, a BF crawler is a good baseline to gauge other crawlers
• But why is this so? (Najork and Wiener 2001)
Bias of breadth-first crawlers
• The structure of the Web graph is very different from a random network
• Power-law distribution of in-degree
• Therefore there are hub pages with very high PageRank and many incoming links
• These are attractors: you cannot avoid them!
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers
Preferential crawlers
• Assume we can estimate for each page an importance measure, I(p)
• Want to visit pages in order of decreasing I(p)
• Maintain the frontier as a priority queue sorted by I(p)
• Possible figures of merit:
– Precision ~ |{p : crawled(p) ∧ I(p) > threshold}| / |{p : crawled(p)}|
– Recall ~ |{p : crawled(p) ∧ I(p) > threshold}| / |{p : I(p) > threshold}|
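The two figures of merit can be computed directly from a crawl and an importance map; `I` here is a toy dictionary standing in for the importance measure:

```python
# Precision and recall over an importance threshold, as defined above.
# `I` maps every page to its importance; `crawled` is the crawl set.
def crawl_precision(crawled, I, threshold):
    hits = [p for p in crawled if I[p] > threshold]
    return len(hits) / len(crawled)

def crawl_recall(crawled, I, threshold):
    hits = [p for p in crawled if I[p] > threshold]
    relevant = [p for p in I if I[p] > threshold]
    return len(hits) / len(relevant)
```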
Preferential crawlers
• Selective bias toward some pages, e.g. most "relevant"/topical, closest to seeds, most popular/largest PageRank, unknown servers, highest rate/amount of change, etc.
• Focused crawlers
– Supervised learning: classifier based on labeled examples
• Topical crawlers
– Best-first search based on similarity(topic, parent)
– Adaptive crawlers
• Reinforcement learning
• Evolutionary algorithms/artificial life
Preferential crawling algorithms: Examples
• Breadth-First
– Exhaustively visit all links in the order encountered
• Best-N-First
– Priority queue sorted by similarity, explore top N at a time
– Variants: DOM context, hub scores
• PageRank
– Priority queue sorted by keywords, PageRank
• SharkSearch
– Priority queue sorted by a combination of similarity, anchor text, similarity of the parent, etc. (powerful cousin of FishSearch)
• InfoSpiders
– Adaptive distributed algorithm using an evolving population of learning agents
Preferential crawlers: Examples
• For I(p) = PageRank (estimated based on pages crawled so far), we can find high-PR pages faster than a breadth-first crawler (Cho, Garcia-Molina & Page 1998)
• (Figure: recall vs. crawl size.)
Focused crawlers: Basic idea
• Naïve-Bayes classifier based on example pages in the desired topic, c*
• Score(p) = Pr(c*|p)
– Soft focus: the frontier is a priority queue using the page score
– Hard focus:
• Find the best leaf ĉ for p
• If an ancestor c' of ĉ is in c* then add links from p to the frontier, else discard
– Soft and hard focus work equally well empirically
• Example: Open Directory
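A minimal sketch of the classifier idea, assuming a bag-of-words Naïve-Bayes with add-one smoothing and a uniform class prior; the training documents below are toy examples, and a real focused crawler would train on labeled ODP pages:

```python
# Toy Naive-Bayes page scorer: log Pr(tokens | class) with add-one
# smoothing and a uniform prior (assumptions for this sketch).
import math
from collections import Counter

def train_nb(docs_by_class):
    """docs_by_class: {class_label: [token_list, ...]} -> model."""
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(tok for d in docs for tok in d)
        model[c] = (counts, sum(counts.values()))
    return model

def score(model, tokens, c, vocab_size):
    counts, total = model[c]
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
```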
Focused crawlers
• Can have multiple topics with as many classifiers, with scores appropriately combined (Chakrabarti et al. 1999)
• Can use a distiller to find topical hubs periodically, and add these to the frontier
• Can accelerate with the use of a critic (Chakrabarti et al. 2002)
• Can use alternative classifier algorithms to naïve-Bayes, e.g. SVMs and neural nets have reportedly performed better (Pant & Srinivasan 2005)
Context-focused crawlers
• Same idea, but multiple classes (and classifiers) based on link distance from relevant targets
– ℓ=0: topic of interest
– ℓ=1: pages linking to the topic of interest
– Etc.
• Initially needs a back-crawl from seeds (or known targets) to train classifiers to estimate distance
• Links in the frontier are prioritized based on estimated distance from targets
• Outperforms the standard focused crawler empirically
• (Figure: context graph.)
Topical crawlers
• All we have is a topic (query, description, keywords) and a set of seed pages (not necessarily relevant)
• No labeled examples
• Must predict the relevance of unvisited links to prioritize
• Original idea: Menczer 1997, Menczer & Belew 1998
Example: myspiders.informatics.indiana.edu
Topical locality
• Topical locality is a necessary condition for a topical crawler to work, and for surfing to be a worthwhile activity for humans
• Links must encode semantic information, i.e. say something about neighbor pages, not be random
• It is also a sufficient condition if we start from "good" seed pages
• Indeed we know that Web topical locality is strong:
– Indirectly (crawlers work and people surf the Web)
– From direct measurements (Davison 2000; Menczer 2004, 2005)
Quantifying topical locality
• Different ways to pose the question:
– How quickly does semantic locality decay?
– How fast is topic drift?
– How quickly does content change as we surf away from a starting page?
• To answer these questions, let us consider exhaustive breadth-first crawls from 100 topic pages
The "link-cluster" conjecture
• Connection between semantic topology (relevance) and link topology (hypertext)
– G = Pr[rel(p)] ~ fraction of relevant/topical pages (topic generality)
– R = Pr[rel(p) | rel(q) AND link(q,p)] ~ conditional probability given a neighbor on topic
• Related nodes are clustered if R > G
– Necessary and sufficient condition for a random crawler to find pages related to the start points
– Example: 2 topical clusters with stronger modularity within each cluster than outside
• (Figure: example network with G = 5/15 and R = 3/6 = 2/4 within C = 2 clusters.)
Link-cluster conjecture
• Stationary hit rate for a random crawler:
η(t+1) = η(t)·R + (1 − η(t))·G ≥ η(t)
η(t) → η* = G / (1 − (R − G)) as t → ∞
• Conjecture: η* > G ⇔ R > G
• Value added of links: η*/G − 1 = (R − G) / (1 − (R − G))
Link-cluster conjecture
• Preservation of semantics (meaning) across links
• 1000 times more likely to be on topic if near an on-topic page!
R(q,δ) / G(q) ≡ Pr[rel(p) | rel(q) ∧ |path(q,p)| ≤ δ] / Pr[rel(p)]
L(q,δ) ≡ Σ_{p : |path(q,p)| ≤ δ} |path(q,p)| / |{p : |path(q,p)| ≤ δ}|
The "link-content" conjecture
• Correlation of lexical (content) and linkage topology
• L(δ): average link distance
• S(δ): average content similarity to the start (topic) page from pages up to distance δ
• Correlation ρ(L,S) = −0.76
S(q,δ) ≡ Σ_{p : |path(q,p)| ≤ δ} sim(q,p) / |{p : |path(q,p)| ≤ δ}|
Heterogeneity of link-content correlation
• Fit: S = c + (1 − c)·e^(a·L^b)
• (Figure: separate fits for edu, net, gov, org, and com domains; significant differences in a only, or in both a and b, at α = 0.05.)
• .com has more drift
Topical locality-inspired tricks for topical crawlers
• Co-citation (a.k.a. sibling locality): A and C are good hubs, thus A and D should be given high priority
• Co-reference (a.k.a. bibliographic coupling): E and G are good authorities, thus E and H should be given high priority
Correlations between different similarity measures
• Semantic similarity measured from ODP, correlated with:
– Content similarity: TF or TF-IDF vector cosine
– Link similarity: Jaccard coefficient of (in+out) link neighborhoods
• Correlation overall is significant but weak
• Much stronger topical locality in some topics, e.g.:
– Links very informative in news sources
– Text very informative in recipes
Naïve Best-First
Simplest topical crawler: the frontier is a priority queue based on text similarity between the topic and the parent page.

BestFirst(topic, seed_urls) {
  foreach link (seed_urls) {
    enqueue(frontier, link);
  }
  while (#frontier > 0 and visited < MAX_PAGES) {
    link := dequeue_link_with_max_score(frontier);
    doc := fetch_new_document(link);
    score := sim(topic, doc);
    foreach outlink (extract_links(doc)) {
      if (#frontier >= MAX_BUFFER) {
        dequeue_link_with_min_score(frontier);
      }
      enqueue(frontier, outlink, score);
    }
  }
}
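The pseudocode above can be sketched as a runnable Python function, assuming caller-supplied `fetch(url)`, `extract_links(doc)`, and `sim(topic, doc)` functions and similarity scores in [0, 1]:

```python
# Runnable sketch of Naive Best-First: the frontier is a priority queue
# (heapq is a min-heap, so scores are negated) with a bounded buffer.
import heapq

def best_first(topic, seed_urls, fetch, extract_links, sim,
               max_pages=100, max_buffer=1000):
    frontier = [(-1.0, url) for url in seed_urls]  # seeds get top priority
    heapq.heapify(frontier)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.append(url)
        doc = fetch(url)
        score = sim(topic, doc)  # outlinks inherit the parent's score
        for outlink in extract_links(doc):
            if len(frontier) >= max_buffer:
                frontier.remove(max(frontier))  # drop the min-score link
                heapq.heapify(frontier)
            heapq.heappush(frontier, (-score, outlink))
    return visited
```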
Best-first variations
• Many in the literature, mostly stemming from different ways to score unvisited URLs. E.g.:
– Giving more importance to certain HTML markup in the parent page
– Extending the text representation of the parent page with anchor text from "grandparent" pages (SharkSearch)
– Limiting the link context to less than the entire page
– Exploiting topical locality (co-citation)
– Exploration vs. exploitation: relax priorities
• Any of these can be (and many have been) combined
Link context based on text neighborhood
• Often consider a fixed-size window, e.g. 50 words around the anchor
• Can weigh links based on their distance from topic keywords within the document (InfoSpiders, Clever)
• Anchor text deserves extra importance
Link context based on DOM tree
• Consider the DOM subtree rooted at the parent node of the link's <a> tag
• Or can go further up in the tree (Naïve Best-First is the special case of the entire document body)
• Trade-off between noise due to too small or too large a context tree (Pant 2003)
DOM context
• Link score = linear combination of page-based and context-based similarity scores
Co-citation: hub scores
• Link score_hub = linear combination of link and hub scores
• Hub score: number of seeds linked from a page
Combining DOM context and hub scores
• Experiment based on 159 ODP topics (Pant & Menczer 2003)
• Split ODP URLs between seeds and targets
• Add the 10 best hubs to the seeds for 94 topics
Exploration vs. Exploitation
• Best-N-First (or BFSN)
• Rather than re-sorting the frontier every time you add links, be lazy and sort only every N pages visited
• Empirically, being less greedy helps crawler performance significantly: escape "local topical traps" by exploring more (Pant et al. 2002)
InfoSpiders
• A series of intelligent multi-agent topical crawling algorithms employing various adaptive techniques:
– Evolutionary bias of exploration/exploitation
– Selective query expansion
– (Connectionist) reinforcement learning
• Menczer & Belew 1998, 2000; Menczer et al. 2004
Link scoring and selection by each crawling agent
• Each link l is scored by the agent's neural net and chosen by a stochastic selector:
Pr[l] = e^(β·λ_l) / Σ_{l'} e^(β·λ_{l'})
λ_l = net(in_1, ..., in_N)   (agent's neural net)
in_k = Σ_{ω ∈ Δ} δ(k_i, ω) / dist(ω, l)   (sum of matches of keyword k_i to instances around link l, with inverse-distance weighting)
Artificial life-inspired Evolutionary Local Selection Algorithm (ELSA)

Foreach agent thread:
  Pick & follow link from local frontier
  Evaluate new links, merge frontier
  Adjust link estimator                  (reinforcement learning)
  E := E + payoff - cost
  If E < 0:
    Die
  Elsif E > Selection_Threshold:
    Clone offspring
    Split energy with offspring
    Split frontier with offspring
    Mutate offspring

(Annotations: selection matches the resource bias; mutation of offspring performs selective query expansion.)
Adaptation in InfoSpiders
• Unsupervised population evolution
– Select agents to match the resource bias
– Mutate internal queries: selective query expansion
– Mutate weights
• Unsupervised individual adaptation
– Q-learning: adjust neural net weights to predict relevance locally
InfoSpiders evolutionary bias: an agent in a relevant area will spawn other agents to exploit/explore that neighborhood.
(Figure: an agent with a keyword vector, neural net, and local frontier spawning offspring.)
Multithreaded InfoSpiders (MySpiders)
• Different ways to compute the cost of visiting a document:
– Constant: cost_const = E0·p0 / Tmax
– Proportional to download time: cost_time = f(cost_const·t / timeout)
• The latter is of course more efficient (faster crawling), but it also yields better quality pages!
Selective Query Expansion in InfoSpiders: internalization of local text features
• When a new agent is spawned, it picks up a common term from the current page (here 'th')
• (Figure: keyword vector with terms POLIT, CONSTITUT, TH, SYSTEM, GOVERN and associated weights.)
Reinforcement Learning
• In general, reward function R: S × A → ℜ
• Learn a policy (π: S → A) to maximize reward over time, typically discounted in the future:
V = Σ_t γ^t·r(t),  0 ≤ γ < 1
• Q-learning: optimal policy
π*(s) = argmax_a Q(s,a) = argmax_a [R(s,a) + γ·V*(s')]
where V*(s') is the value of following the optimal policy in the future
(Figure: state s with actions a1, a2 leading to states s1, s2.)
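A minimal tabular sketch of the Q-learning update behind these formulas. Note this is an illustration only: InfoSpiders approximate Q with a neural net rather than a table, but the learning signal (reward plus the discounted value of the best next action) has the same shape:

```python
# Tabular Q-learning update (sketch; InfoSpiders use a neural net instead).
# Q is a dict keyed by (state, action); alpha is the learning rate.
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q[(s, a)]
```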
Q-learning in InfoSpiders
• Use neural nets to estimate Q scores
• Compare the estimated relevance of a visited page with the Q score of the link estimated from the parent page to obtain a feedback signal
• Learn neural net weights using back-propagation of error with teaching input: E(D) + γ·max_{l∈D} λ_l
Other Reinforcement Learning Crawlers
• Rennie & McCallum (1999):
– Naïve-Bayes classifier trained on text near links in pre-labeled examples to estimate Q values
– Immediate reward R=1 for "on-topic" pages (with desired CS papers for the CORA repository)
– All RL algorithms outperform Breadth-First Search
• Future discounting: "For spidering, it is always better to choose immediate over delayed rewards" -- Or is it?
– But we cannot possibly cover the entire search space, and recall that by being greedy we can be trapped in local topical clusters and fail to discover better ones
– Need to explore!
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers
Evaluation of topical crawlers
• Goal: build "better" crawlers to support applications (Srinivasan et al. 2005)
• Build an unbiased evaluation framework
– Define common tasks of measurable difficulty
– Identify topics, relevant targets
– Identify appropriate performance measures
• Effectiveness: quality of crawled pages, order, etc.
• Efficiency: separate the CPU & memory of crawler algorithms from bandwidth & common utilities
Evaluation corpus = ODP + Web
• Automate evaluation using edited directories
• Different sources of relevance assessments
Topics and Targets
(Figure: topic level ~ specificity; depth ~ generality.)
Tasks
• Start from seeds, find targets and/or pages similar to target descriptions (d=2, d=3)
• Back-crawl from targets to get seeds
Target based performance measures
Q: What assumption are we making? …
A: Independence!
Performance matrix (by target depth d = 0, 1, 2; S_c^t = set of crawled pages, T_d = target pages, D_d = target descriptions, σ_c = similarity):

                      "recall"                          "precision"
target pages          |S_c^t ∩ T_d| / |T_d|             |S_c^t ∩ T_d| / |S_c^t|
target descriptions   Σ_{p ∈ S_c^t} σ_c(p, D_d)         Σ_{p ∈ S_c^t} σ_c(p, D_d) / |S_c^t|
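The target-page cells of the matrix can be computed directly from the crawled and target sets; the sets below are toy illustrations:

```python
# Target-page recall and precision from the performance matrix:
# S is the set of crawled pages, T the set of target pages.
def target_recall(S, T):
    return len(S & T) / len(T)

def target_precision(S, T):
    return len(S & T) / len(S)
```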
Crawling evaluation framework
(Diagram: a main module holds keywords, seed URLs, and common data structures shared by N crawler logic modules; each crawler keeps private data structures and a limited-resource frontier, and reaches the Web through concurrent fetch/parse/stem modules over HTTP.)
Using the framework to compare crawler performance
(Figure: average target-page recall vs. pages crawled.)
Efficiency & scalability
(Figure: performance/cost vs. link frontier size.)
    Slides © 2007Filippo Menczer, Indiana University School of InformaticsBing Liu: Web Data Mining. Springer, 2007Ch. 8 Web Crawling by Filippo MenczerTopical crawlerperformance dependson topic characteristics++++InfoSpiders+++BFS-256+++BFS-1++++BreadthFirstLPACLPACCrawlerTarget descriptionsTarget pagesC = target link cohesivenessA = target authoritativenessP = popularity (topic kw generality)L = seed-target similarity
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers
Crawler ethics and conflicts
• Crawlers can cause trouble, even unwillingly, if not properly designed to be "polite" and "ethical"
• For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
  – Server administrator and users will be upset
  – Crawler developer/admin IP address may be blacklisted
Crawler etiquette (important!)
• Identify yourself
  – Use the 'User-Agent' HTTP header to identify the crawler, pointing to a website with a description of the crawler and contact information for its developer
  – Use the 'From' HTTP header to specify the crawler developer's email
  – Do not disguise the crawler as a browser by using a browser's 'User-Agent' string
• Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem
• Pay attention to anything that may lead to too many requests to any one server, even unwillingly, e.g.:
  – redirection loops
  – spider traps
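A minimal Python sketch of these rules, using only the standard library. The bot name, info URL, and contact address are made-up placeholders, not real services:

```python
import urllib.error
import urllib.request

# Hypothetical crawler identity: the bot name, info page, and contact
# email below are placeholders for your own.
HEADERS = {
    "User-Agent": "ExampleBot/1.0 (+http://example.org/bot.html)",
    "From": "crawler-admin@example.org",
}

def polite_fetch(url, timeout=10):
    """Fetch one page, identifying the crawler and checking the result."""
    req = urllib.request.Request(url, headers=HEADERS)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        # Use the error code to decide what to do: e.g. 404 -> drop the
        # URL, 429/503 -> back off before contacting this server again.
        return e.code, None
    except urllib.error.URLError:
        return None, None   # DNS failure, refused connection, timeout, ...
```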
Crawler etiquette (important!)
• Spread the load; do not overwhelm a server
  – Make sure no more than some maximum number of requests go to any single server per unit time, say < 1/second
• Honor the Robot Exclusion Protocol
  – A server can specify which parts of its document tree any crawler is or is not allowed to crawl via a file named 'robots.txt' placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt
  – A crawler should always check, parse, and obey this file before sending any requests to a server
  – More info at:
    • http://www.google.com/robots.txt
    • http://www.robotstxt.org/wc/exclusion.html
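The per-server rate limit can be sketched as a small throttle keyed by hostname. The class name and the 1-second default are illustrative:

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Enforce a minimum delay between requests to any single host
    (e.g. at most one request per second per server)."""
    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_request = {}   # host -> time of last request to it

    def wait(self, url):
        """Block until it is polite to contact this URL's host again."""
        host = urlparse(url).netloc
        now = time.monotonic()
        elapsed = now - self.last_request.get(host, float("-inf"))
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[host] = time.monotonic()
```

Calling `throttle.wait(url)` before every fetch sleeps only when the same host was contacted less than `min_delay` seconds ago; requests to different hosts proceed without delay.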
More on robot exclusion
• Make sure URLs are canonical before checking them against robots.txt
• Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler
• Let's look at some examples to understand the protocol…
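A sketch of this caching idea with the standard library's robots.txt parser; the class and its one-parser-per-host policy are illustrative, not a reference implementation:

```python
import urllib.robotparser
from urllib.parse import urlparse

class RobotsCache:
    """Cache each server's robots.txt policy instead of re-fetching it
    for every request to that server."""
    def __init__(self, agent):
        self.agent = agent
        self.parsers = {}   # host -> RobotFileParser for that host

    def allowed(self, url):
        host = urlparse(url).netloc
        rp = self.parsers.get(host)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()                  # one fetch per server, then cached
            self.parsers[host] = rp
        return rp.can_fetch(self.agent, url)
```

In production the `read()` call performs the single fetch per host; a fuller version would also expire cached policies after some time and re-fetch them.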
www.apple.com/robots.txt

  # robots.txt for http://www.apple.com/
  User-agent: *
  Disallow:

All crawlers can go anywhere!
www.microsoft.com/robots.txt

  # Robots.txt file for http://www.microsoft.com
  User-agent: *
  Disallow: /canada/Library/mnp/2/aspx/
  Disallow: /communities/bin.aspx
  Disallow: /communities/eventdetails.mspx
  Disallow: /communities/blogs/PortalResults.mspx
  Disallow: /communities/rss.aspx
  Disallow: /downloads/Browse.aspx
  Disallow: /downloads/info.aspx
  Disallow: /france/formation/centres/planning.asp
  Disallow: /france/mnp_utility.mspx
  Disallow: /germany/library/images/mnp/
  Disallow: /germany/mnp_utility.mspx
  Disallow: /ie/ie40/
  Disallow: /info/customerror.htm
  Disallow: /info/smart404.asp
  Disallow: /intlkb/
  Disallow: /isapi/
  # etc…

All crawlers are not allowed in these paths.
www.springer.com/robots.txt

  # Robots.txt for http://www.springer.com (fragment)
  User-agent: Googlebot
  Disallow: /chl/*
  Disallow: /uk/*
  Disallow: /italy/*
  Disallow: /france/*

  User-agent: slurp
  Disallow:
  Crawl-delay: 2

  User-agent: MSNBot
  Disallow:
  Crawl-delay: 2

  User-agent: scooter
  Disallow:

  # all others
  User-agent: *
  Disallow: /

• The Google crawler is allowed everywhere except these paths
• Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down
• AltaVista (scooter) has no limits
• Everyone else: keep off!
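Python's standard library parser can evaluate per-agent rules like these. Below is a sketch using a reduced version of the fragment above; the wildcard paths are omitted because `urllib.robotparser` matches paths literally and does not implement the nonstandard `*` path extension:

```python
import urllib.robotparser

# Reduced robots.txt in the spirit of the Springer example above.
rules = """\
User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("slurp", "http://www.springer.com/uk/"))        # slurp may crawl
print(rp.crawl_delay("slurp"))                                     # ... waiting 2s between requests
print(rp.can_fetch("SomeOtherBot", "http://www.springer.com/uk/")) # everyone else: keep off
```

An empty `Disallow:` means "everything allowed" for that agent, while the final `User-agent: *` / `Disallow: /` entry shuts out every crawler not named earlier.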
More crawler ethics issues
• Is compliance with robot exclusion a matter of law?
  – No! Compliance is voluntary, but if you do not comply, you may be blocked
  – Someone (unsuccessfully) sued Internet Archive over a robots.txt-related issue
• Some crawlers disguise themselves
  – Using a false User-Agent
  – Randomizing access frequency to look like a human/browser
  – Example: click fraud for ads
More crawler ethics issues
• Servers can disguise themselves, too
  – Cloaking: present different content based on User-Agent
  – E.g. stuff keywords into the version of a page shown to the search engine crawler
  – Search engines do not look kindly on this type of "spamdexing" and remove from their index sites that perform such abuse
    • The case of bmw.de made the news
Gray areas for crawler ethics
• If you write a crawler that unwillingly follows links to ads, are you just being careless, are you violating terms of service, or are you violating the law by defrauding advertisers?
  – Is non-compliance with Google's robots.txt in this case equivalent to click fraud?
• If you write a browser extension that performs some useful service, should you comply with robot exclusion?
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments
New developments: social, collaborative, federated crawlers
• Idea: go beyond the "one-fits-all" model of centralized search engines
• Extend the search task to anyone, and distribute the crawling task
• Each search engine is a peer agent
• Agents collaborate by routing queries and results
6S: Collaborative Peer Search
• Each peer runs its own crawler and index over local storage and bookmarks, crawling the WWW
• Peers (A, B, C, …) exchange queries and hits with one another
• This creates data mining & referral opportunities and emerging communities
Basic idea: learn based on prior query/response interactions
Learning about other peers
Query routing in 6S
Emergent semantic clustering
Simulation 1: 70 peers, 7 groups
• The dynamic network of queries and results exchanged among 6S peer agents quickly forms a small world, with small diameter and high clustering (Wu & al. 2005)
Simulation 2: 500 users
• Topics from the ODP (dmoz.org)
• Each synthetic user is associated with a topic
Semantic similarity
• Peers with similar interests are more likely to talk to each other (Akavipat & al. 2006)
Quality of results
• More sophisticated learning algorithms do better
• The more interactions, the better
Download and try the free 6S prototype: http://homer.informatics.indiana.edu/~nan/6S/
• 1-click configuration of a personal crawler and setup of a search engine
Download and try the free 6S prototype: http://homer.informatics.indiana.edu/~nan/6S/
• Search via a Firefox browser extension
Need crawling code?
• Reference C implementation of HTTP, HTML parsing, etc.
  – w3c-libwww package from the World Wide Web Consortium: www.w3c.org/Library/
• LWP (Perl)
  – http://www.oreilly.com/catalog/perllwp/
  – http://search.cpan.org/~gaas/libwww-perl-5.804/
• Open source crawlers/search engines
  – Nutch: http://www.nutch.org/ (Jakarta Lucene: jakarta.apache.org/lucene/)
  – Heritrix: http://crawler.archive.org/
  – WIRE: http://www.cwr.cl/projects/WIRE/
  – Terrier: http://ir.dcs.gla.ac.uk/terrier/
• Open source topical crawlers, Best-First-N (Java)
  – http://informatics.indiana.edu/fil/IS/JavaCrawlers/
• Evaluation framework for topical crawlers (Perl)
  – http://informatics.indiana.edu/fil/IS/Framework/

Editor's Notes

  • #81: A "good" crawler would make "good" choices and hence retrieve "good" pages early. How do we measure the "goodness" of crawlers? Some work on evaluation has been done, but it is often limited in the number of topics, or the same set of measures is used both to crawl and to evaluate.
