This document discusses web crawling techniques. It begins with an outline of topics covered, including the motivation and taxonomy of crawlers, basic crawlers and implementation issues, universal crawlers, preferential crawlers, crawler evaluation, ethics, and new developments. It then covers basic crawlers and their implementation, including graph traversal techniques, a basic crawler code example in Perl, and various implementation issues around fetching, parsing, indexing text, dealing with dynamic content, relative URLs, and URL canonicalization.

Ch. 8: Web Crawling
By Filippo Menczer
Indiana University School of Informatics
In: Web Data Mining by Bing Liu, Springer, 2007
Slides © 2007 Filippo Menczer, Indiana University School of Informatics. Bing Liu: Web Data Mining, Springer, 2007.

Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
Crawler: basic idea
• Start from a set of starting pages (seeds) and follow links to discover new pages.
Many names
• Crawler
• Spider
• Robot (or bot)
• Web agent
• Wanderer, worm, …
• And famous instances: googlebot, scooter, slurp, msnbot, …
Googlebot & you
Motivation for crawlers
• Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
• Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
• Business intelligence: keep track of potential competitors, partners
• Monitor Web sites of interest
• Evil: harvest emails for spamming, phishing…
• … Can you think of some others?
A crawler within a search engine
(Diagram: the crawler, e.g. googlebot, fetches pages from the Web into a page repository; text & link analysis produce the text index and PageRank, which the ranker combines to return hits for a query.)
One taxonomy of crawlers
• Crawlers
– Universal crawlers
– Preferential crawlers
• Focused crawlers
• Topical crawlers
– Static crawlers: best-first, PageRank, etc.
– Adaptive topical crawlers: evolutionary crawlers, reinforcement learning crawlers, etc.
• Many other criteria could be used:
– Incremental, interactive, concurrent, etc.
Basic crawlers
• This is a sequential crawler
• Seeds can be any list of starting URLs
• Order of page visits is determined by the frontier data structure
• Stop criterion can be anything
Graph traversal (BFS or DFS?)
• Breadth First Search
– Implemented with a QUEUE (FIFO)
– Finds pages along shortest paths
– If we start with “good” pages, this keeps us close; maybe other good stuff…
• Depth First Search
– Implemented with a STACK (LIFO)
– Wander away (“lost in cyberspace”)
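The two traversal orders differ only in how the frontier is used. A minimal Python sketch over a toy in-memory link graph (a real crawler would fetch and parse pages instead):

```python
from collections import deque

def crawl_order(graph, seed, dfs=False):
    """Return the visit order over an in-memory link graph.
    BFS uses the frontier as a FIFO queue (popleft); DFS pops from
    the same end it pushes to, i.e. treats it as a LIFO stack."""
    frontier = deque([seed])
    visited = []
    seen = {seed}
    while frontier:
        url = frontier.pop() if dfs else frontier.popleft()
        visited.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

web = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
print(crawl_order(web, "a"))            # BFS: ['a', 'b', 'c', 'd', 'e']
print(crawl_order(web, "a", dfs=True))  # DFS: ['a', 'c', 'e', 'b', 'd']
```

Swapping a single line (queue vs. stack discipline) changes the whole character of the crawl, which is the point the slide makes.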
A basic crawler in Perl
• Queue: a FIFO list (shift and push)

my @frontier = read_seeds($file);
while (@frontier && $tot < $max) {
    my $next_link = shift @frontier;
    my $page = fetch($next_link);
    add_to_index($page);
    my @links = extract_links($page, $next_link);
    push @frontier, process(@links);
}
Implementation issues
• Don’t want to fetch same page twice!
– Keep lookup table (hash) of visited pages
– What if not visited but in frontier already?
• The frontier grows very fast!
– May need to prioritize for large crawls
• Fetcher must be robust!
– Don’t crash if download fails
– Timeout mechanism
• Determine file type to skip unwanted files
– Can try using extensions, but not reliable
– Can issue ‘HEAD’ HTTP commands to get Content-Type (MIME) headers, but overhead of extra Internet requests
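The first two bullets (already visited vs. already in the frontier) can be covered by a single "seen" set that records every URL ever enqueued; a sketch with assumed names:

```python
def enqueue_if_new(frontier, seen, url):
    """Add url to the frontier only if it is neither visited nor
    already waiting in the frontier: one set answers both questions,
    because a URL enters 'seen' the moment it is enqueued."""
    if url not in seen:
        seen.add(url)
        frontier.append(url)

frontier, seen = [], set()
for u in ["http://a/", "http://b/", "http://a/"]:
    enqueue_if_new(frontier, seen, u)
print(frontier)  # ['http://a/', 'http://b/'] (duplicate dropped)
```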
More implementation issues
• Fetching
– Get only the first 10-100 KB per page
– Take care to detect and break redirection loops
– Soft fail for timeout, server not responding, file not found, and other errors
More implementation issues: Parsing
• HTML has the structure of a DOM (Document Object Model) tree
• Unfortunately actual HTML is often incorrect in a strict syntactic sense
• Crawlers, like browsers, must be robust/forgiving
• Fortunately there are tools that can help
– E.g. tidy.sourceforge.net
• Must pay attention to HTML entities and unicode in text
• What to do with a growing number of other formats?
– Flash, SVG, RSS, AJAX…
More implementation issues
• Stop words
– Noise words that do not carry meaning should be eliminated (“stopped”) before they are indexed
– E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc…
– Typically syntactic markers
– Typically the most common terms
– Typically kept in a negative dictionary
• 10–1,000 elements
• E.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
– Parser can detect these right away and disregard them
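Filtering against a negative dictionary at parse time is a one-line set lookup; a tiny sketch (the stop list here is just the slide's examples, not a real dictionary):

```python
STOP_WORDS = {"and", "the", "a", "at", "or", "on", "for"}  # toy negative dictionary

def index_terms(text):
    """Tokenize naively and drop stop words before indexing."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(index_terms("The crawler waits for a page"))  # ['crawler', 'waits', 'page']
```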
More implementation issues: Conflation and thesauri
• Idea: improve recall by merging words with the same meaning
1. We want to ignore superficial morphological features, thus merge semantically similar tokens
– {student, study, studying, studious} => studi
2. We can also conflate synonyms into a single form using a thesaurus
– 30-50% smaller index
– Doing this in both pages and queries allows us to retrieve pages about ‘automobile’ when the user asks for ‘car’
– Thesaurus can be implemented as a hash table
More implementation issues
• Stemming
– Morphological conflation based on rewrite rules
– Language dependent!
– Porter stemmer very popular for English
• http://www.tartarus.org/~martin/PorterStemmer/
• Context-sensitive grammar rules, e.g.: “IES” except (“EIES” or “AIES”) --> “Y”
• Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.
– Porter has also developed Snowball, a language to create stemming algorithms in any language
• http://snowball.tartarus.org/
• Ex. Perl modules: Lingua::Stem and Lingua::Stem::Snowball
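The single quoted rule can be written as a toy illustration (just this one rule, not the full Porter stemmer):

```python
def ies_rule(word):
    """Apply only the rule quoted above: 'IES' -> 'Y', except when the
    ending is 'EIES' or 'AIES'. A toy fragment, not a full stemmer."""
    w = word.lower()
    if w.endswith("ies") and not (w.endswith("eies") or w.endswith("aies")):
        return w[:-3] + "y"
    return w

print(ies_rule("ponies"))  # pony
print(ies_rule("cities"))  # city
```

The exception clauses are what makes the rule "context-sensitive": the rewrite depends on characters beyond the suffix itself.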
More implementation issues
• Static vs. dynamic pages
– Is it worth trying to eliminate dynamic pages and only index static pages?
– Examples:
• http://www.census.gov/cgi-bin/gazetteer
• http://informatics.indiana.edu/research/colloquia.asp
• http://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452
• http://www.imdb.com/Name?Menczer,+Erico
• http://www.imdb.com/name/nm0578801/
– Why or why not? How can we tell if a page is dynamic? What about ‘spider traps’?
– What do Google and other search engines do?
More implementation issues
• Relative vs. absolute URLs
– Crawler must translate relative URLs into absolute URLs
– Need to obtain the Base URL from the HTTP header, or an HTML Meta tag, or else the current page path by default
– Examples:
• Base: http://www.cnn.com/linkto/
• Relative URL: intl.html
→ Absolute URL: http://www.cnn.com/linkto/intl.html
• Relative URL: /US/
→ Absolute URL: http://www.cnn.com/US/
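Python's standard library implements exactly this resolution; the slide's two examples can be reproduced with urllib.parse.urljoin:

```python
from urllib.parse import urljoin

base = "http://www.cnn.com/linkto/"
# A relative path is resolved against the base URL's directory
print(urljoin(base, "intl.html"))  # http://www.cnn.com/linkto/intl.html
# A path starting with '/' is resolved against the site root
print(urljoin(base, "/US/"))       # http://www.cnn.com/US/
```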
More implementation issues
• URL canonicalization
– All of these:
• http://www.cnn.com/TECH
• http://WWW.CNN.COM/TECH/
• http://www.cnn.com:80/TECH/
• http://www.cnn.com/bogus/../TECH/
– Are really equivalent to this canonical form:
• http://www.cnn.com/TECH/
– In order to avoid duplication, the crawler must transform all URLs into canonical form
– Definition of “canonical” is arbitrary, e.g.:
• Could always include the port
• Or only include the port when it is not the default :80
More on Canonical URLs
• Some transformations are trivial, for example:
– http://informatics.indiana.edu
→ http://informatics.indiana.edu/
– http://informatics.indiana.edu/index.html#fragment
→ http://informatics.indiana.edu/index.html
– http://informatics.indiana.edu/dir1/./../dir2/
→ http://informatics.indiana.edu/dir2/
– http://informatics.indiana.edu/%7Efil/
→ http://informatics.indiana.edu/~fil/
– http://INFORMATICS.INDIANA.EDU/fil/
→ http://informatics.indiana.edu/fil/
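A minimal canonicalizer covering several of the trivial transformations (lowercase host, drop the default :80 port and the fragment, resolve '.'/'..' segments); one possible sketch, not a complete implementation (e.g. it does not percent-decode %7E to ~):

```python
from urllib.parse import urlsplit, urlunsplit, urljoin

def canonicalize(url):
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme default (:80 for http)
    if parts.port and not (parts.scheme == "http" and parts.port == 80):
        host = f"{host}:{parts.port}"
    # Re-joining the path against the bare host resolves '.' and '..' segments
    path = urlsplit(urljoin(f"{parts.scheme}://{host}/", parts.path)).path or "/"
    # Rebuild without the fragment
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

print(canonicalize("http://WWW.CNN.COM:80/bogus/../TECH/"))  # http://www.cnn.com/TECH/
```

Note that the heuristic choices on the next slide (default file names, trailing slashes) are deliberately left out, since they cannot be decided from the URL alone.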
More on Canonical URLs
Other transformations require heuristic assumptions about the intentions of the author or the configuration of the Web server:
1. Removing the default file name
– http://informatics.indiana.edu/fil/index.html
→ http://informatics.indiana.edu/fil/
– This is reasonable in general but would be wrong in this case because the default happens to be ‘default.asp’ instead of ‘index.html’
2. Trailing directory
– http://informatics.indiana.edu/fil
→ http://informatics.indiana.edu/fil/
– This is correct in this case, but how can we be sure in general that there isn’t a file named ‘fil’ in the root dir?
More implementation issues
• Spider traps
– Misleading sites: indefinite number of pages dynamically generated by CGI scripts
– Paths of arbitrary depth created using soft directory links and path rewriting features in the HTTP server
– Only heuristic defensive measures:
• Check URL length; assume spider trap above some threshold, for example 128 characters
• Watch for sites with a very large number of URLs
• Eliminate URLs with non-textual data types
• May disable crawling of dynamic pages, if they can be detected
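The heuristic defenses above are simple predicates over the URL and per-site counters; a sketch (the extension list and the per-site limit are illustrative assumptions, only the 128-character threshold comes from the slide):

```python
MAX_URL_LEN = 128  # threshold from the slide
SKIP_EXTENSIONS = (".jpg", ".png", ".zip", ".exe")  # assumed non-textual types

def looks_like_trap(url, urls_seen_per_site, max_per_site=10000):
    """Heuristics only: very long URLs, non-textual extensions, and
    sites that have already emitted a huge number of URLs are skipped."""
    if len(url) > MAX_URL_LEN:
        return True
    if url.lower().endswith(SKIP_EXTENSIONS):
        return True
    site = url.split("/")[2] if "//" in url else url
    return urls_seen_per_site.get(site, 0) > max_per_site

print(looks_like_trap("http://example.com/" + "a/" * 100, {}))  # True (too long)
```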
More implementation issues
• Page repository
– Naïve: store each page as a separate file
• Can map URL to unique filename using a hashing function, e.g. MD5
• This generates a huge number of files, which is inefficient from the storage perspective
– Better: combine many pages into a single large file, using some XML markup to separate and identify them
• Must map URL to {filename, page_id}
– Database options
• Any RDBMS -- large overhead
• Light-weight, embedded databases such as Berkeley DB
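The naïve URL-to-filename mapping is one hash call; a sketch of that scheme:

```python
import hashlib

def page_filename(url):
    """Map a URL to a unique, fixed-length filename via MD5,
    the naive one-file-per-page scheme described above."""
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"

print(page_filename("http://www.cnn.com/TECH/"))
```

Every filename is 32 hex digits plus the extension, regardless of URL length, which is exactly why hashing is used instead of escaping the URL itself.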
Concurrency
• A crawler incurs several delays:
– Resolving the host name in the URL to an IP address using DNS
– Connecting a socket to the server and sending the request
– Receiving the requested page in response
• Solution: overlap the above delays by fetching many pages concurrently
Architecture of a concurrent crawler
Concurrent crawlers
• Can use multi-processing or multi-threading
• Each process or thread works like a sequential crawler, except they share data structures: frontier and repository
• Shared data structures must be synchronized (locked for concurrent writes)
• Speedups by a factor of 5-10 are easy to obtain this way
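A thread-based sketch of the idea using Python's concurrent.futures; fetch here is a stand-in (a real crawler would issue HTTP requests, and the lock shows the synchronization point on the shared repository):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

lock = threading.Lock()
repository = {}

def fetch(url):
    # Stand-in for a network fetch; the overlapped delay would normally be I/O wait
    return f"<html>page at {url}</html>"

def worker(url):
    page = fetch(url)
    with lock:  # shared data structures must be locked for concurrent writes
        repository[url] = page

urls = [f"http://example.com/{i}" for i in range(20)]
with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(worker, urls))
print(len(repository))  # 20
```

Five workers each behave like a small sequential crawler; only the writes to the shared repository (and, in a full crawler, the frontier) need the lock.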
Universal crawlers
• Support universal search engines
• Large-scale
• Huge cost (network bandwidth) of crawl is amortized over many queries from users
• Incremental updates to existing index and other data repositories
Large-scale universal crawlers
• Two major issues:
1. Performance
• Need to scale up to billions of pages
2. Policy
• Need to trade off coverage, freshness, and bias (e.g. toward “important” pages)
Large-scale crawlers: scalability
• Need to minimize overhead of DNS lookups
• Need to optimize utilization of network bandwidth and disk throughput (I/O is the bottleneck)
• Use asynchronous sockets
– Multi-processing or multi-threading do not scale up to billions of pages
– Non-blocking: hundreds of network connections open simultaneously
– Polling sockets to monitor completion of network transfers
High-level architecture of a scalable universal crawler
(Diagram annotations:)
• Several parallel queues to spread load across servers (keep connections alive)
• DNS server using UDP (less overhead than TCP), large persistent in-memory cache, and prefetching
• Optimize use of network bandwidth
• Optimize disk I/O throughput
• Huge farm of crawl machines
Universal crawlers: Policy
• Coverage
– New pages get added all the time
– Can the crawler find every page?
• Freshness
– Pages change over time, get removed, etc.
– How frequently can a crawler revisit?
• Trade-off!
– Focus on most “important” pages (crawler bias)?
– “Importance” is subjective
Web coverage by search engine crawlers
(Chart: estimated coverage of roughly 35% in 1997, 34% in 1998, 16% in 1999, and 50% in 2000.)
This assumes we know the size of the entire Web. Do we? Can you define “the size of the Web”?
Maintaining a “fresh” collection
• Universal crawlers are never “done”
• High variance in rate and amount of page changes
• HTTP headers are notoriously unreliable
– Last-modified
– Expires
• Solution
– Estimate the probability that a previously visited page has changed in the meanwhile
– Prioritize by this probability estimate
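One common way to sketch such an estimate (an assumed model, not prescribed by the slides) is a Poisson change process: if a page changes at an estimated rate λ per day, the probability it has changed d days after the last visit is 1 − e^(−λd), and pages can be revisited in decreasing order of that probability:

```python
import math

def change_probability(rate_per_day, days_since_visit):
    """P(at least one change since last visit) under a Poisson
    change process with the given rate."""
    return 1.0 - math.exp(-rate_per_day * days_since_visit)

# (url, estimated changes/day, days since last visit) -- made-up numbers
pages = [("a.html", 0.5, 3), ("b.html", 0.01, 3), ("c.html", 0.2, 10)]
revisit_order = sorted(pages, key=lambda p: -change_probability(p[1], p[2]))
print([url for url, _, _ in revisit_order])  # ['c.html', 'a.html', 'b.html']
```

Note the ordering is not simply by rate: the slow-changing page that has not been visited for a long time outranks a faster-changing page visited recently.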
Estimating page change rates
• Algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
– Brewington & Cybenko; Cho, Garcia-Molina & Page
• Assumption: recent past predicts the future (Ntoulas, Cho & Olston 2004)
– Frequency of change is not a good predictor
– Degree of change is a better predictor
Do we need to crawl the entire Web?
• If we cover too much, it will get stale
• There is an abundance of pages on the Web
• For PageRank, pages with very low prestige are largely useless
• What is the goal?
– General search engines: pages with high prestige
– News portals: pages that change often
– Vertical portals: pages on some topic
• What are appropriate priority measures in these cases? Approximations?
Breadth-first crawlers
• A BF crawler tends to crawl high-PageRank pages very early
• Therefore, a BF crawler is a good baseline to gauge other crawlers
• But why is this so? (Najork and Wiener 2001)
Bias of breadth-first crawlers
• The structure of the Web graph is very different from a random network
• Power-law distribution of in-degree
• Therefore there are hub pages with very high PR and many incoming links
• These are attractors: you cannot avoid them!
Preferential crawlers
• Assume we can estimate for each page an importance measure, I(p)
• Want to visit pages in order of decreasing I(p)
• Maintain the frontier as a priority queue sorted by I(p)
• Possible figures of merit:
– Precision ~ | p: crawled(p) & I(p) > threshold | / | p: crawled(p) |
– Recall ~ | p: crawled(p) & I(p) > threshold | / | p: I(p) > threshold |
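The priority-queue frontier is a small wrapper around a binary heap; a sketch using Python's heapq (a min-heap, so the key is negated to dequeue the highest I(p) first; the counter breaks ties stably):

```python
import heapq
import itertools

class Frontier:
    """Priority queue keyed on -I(p), so pop() returns the
    highest-importance URL first."""
    def __init__(self):
        self._heap = []
        self._count = itertools.count()  # tie-breaker for equal scores
    def push(self, url, importance):
        heapq.heappush(self._heap, (-importance, next(self._count), url))
    def pop(self):
        return heapq.heappop(self._heap)[2]
    def __len__(self):
        return len(self._heap)

f = Frontier()
for url, imp in [("a", 0.1), ("b", 0.9), ("c", 0.5)]:
    f.push(url, imp)
print(f.pop(), f.pop(), f.pop())  # b c a
```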
Preferential crawlers
• Selective bias toward some pages, e.g. most “relevant”/topical, closest to seeds, most popular/largest PageRank, unknown servers, highest rate/amount of change, etc…
• Focused crawlers
– Supervised learning: classifier based on labeled examples
• Topical crawlers
– Best-first search based on similarity(topic, parent)
– Adaptive crawlers
• Reinforcement learning
• Evolutionary algorithms/artificial life
Preferential crawling algorithms: Examples
• Breadth-First
– Exhaustively visit all links in order encountered
• Best-N-First
– Priority queue sorted by similarity, explore top N at a time
– Variants: DOM context, hub scores
• PageRank
– Priority queue sorted by keywords, PageRank
• SharkSearch
– Priority queue sorted by combination of similarity, anchor text, similarity of parent, etc. (powerful cousin of FishSearch)
• InfoSpiders
– Adaptive distributed algorithm using an evolving population of learning agents
Preferential crawlers: Examples
• For I(p) = PageRank (estimated based on pages crawled so far), we can find high-PR pages faster than a breadth-first crawler (Cho, Garcia-Molina & Page 1998)
(Plot: recall vs. crawl size.)
Focused crawlers: Basic idea
• Naïve-Bayes classifier based on example pages in desired topic, c*
• Score(p) = Pr(c*|p)
– Soft focus: frontier is a priority queue using page score
– Hard focus:
• Find best leaf ĉ for p
• If an ancestor c’ of ĉ is in c* then add links from p to frontier, else discard
– Soft and hard focus work equally well empirically
• Example: Open Directory
Focused crawlers
• Can have multiple topics with as many classifiers, with scores appropriately combined (Chakrabarti et al. 1999)
• Can use a distiller to find topical hubs periodically, and add these to the frontier
• Can accelerate with the use of a critic (Chakrabarti et al. 2002)
• Can use alternative classifier algorithms to naïve-Bayes, e.g. SVM and neural nets have reportedly performed better (Pant & Srinivasan 2005)
Context-focused crawlers
• Same idea, but multiple classes (and classifiers) based on link distance from relevant targets
– ℓ=0 is the topic of interest
– ℓ=1 links to the topic of interest
– Etc.
• Initially needs a back-crawl from seeds (or known targets) to train classifiers to estimate distance
• Links in frontier prioritized based on estimated distance from targets
• Outperforms standard focused crawler empirically
(Diagram: context graph.)
Topical crawlers
• All we have is a topic (query, description, keywords) and a set of seed pages (not necessarily relevant)
• No labeled examples
• Must predict relevance of unvisited links to prioritize
• Original idea: Menczer 1997, Menczer & Belew 1998
Example: myspiders.informatics.indiana.edu
Topical locality
• Topical locality is a necessary condition for a topical crawler to work, and for surfing to be a worthwhile activity for humans
• Links must encode semantic information, i.e. say something about neighbor pages, not be random
• It is also a sufficient condition if we start from “good” seed pages
• Indeed we know that Web topical locality is strong:
– Indirectly (crawlers work and people surf the Web)
– From direct measurements (Davison 2000; Menczer 2004, 2005)
Quantifying topical locality
• Different ways to pose the question:
– How quickly does semantic locality decay?
– How fast is topic drift?
– How quickly does content change as we surf away from a starting page?
• To answer these questions, let us consider exhaustive breadth-first crawls from 100 topic pages
The “link-cluster” conjecture
• Connection between semantic topology (relevance) and link topology (hypertext)
– G = Pr[rel(p)] ~ fraction of relevant/topical pages (topic generality)
– R = Pr[rel(p) | rel(q) AND link(q,p)] ~ conditional probability given a neighbor on topic
• Related nodes are clustered if R > G
– Necessary and sufficient condition for a random crawler to find pages related to start points
– Example: 2 topical clusters with stronger modularity within each cluster than outside (in the figure: C = 2 clusters, G = 5/15, R = 3/6 and 2/4)
Link-cluster conjecture
• Stationary hit rate for a random crawler:
η(t+1) = η(t)·R + (1 − η(t))·G ≥ η(t)
η(t) → η* = G / (1 − (R − G)) as t → ∞
• Conjecture: η* > G ⇔ R > G
• Value added of links:
η*/G − 1 = (R − G) / (1 − (R − G))
Link-cluster conjecture
• Preservation of semantics (meaning) across links
• 1000 times more likely to be on topic if near an on-topic page!
• Generalizing to a neighborhood of link radius δ:
R(q,δ) / G(q) ≡ Pr[ rel(p) | rel(q) ∧ path(q,p) ≤ δ ] / Pr[rel(p)]
L(q,δ) ≡ Σ_{p: path(q,p) ≤ δ} path(q,p) / |{p: path(q,p) ≤ δ}|
The “link-content” conjecture
• Correlation of lexical (content) and linkage topology
• L(δ): average link distance
• S(δ): average content similarity to the start (topic) page from pages up to distance δ:
S(q,δ) ≡ Σ_{p: path(q,p) ≤ δ} sim(q,p) / |{p: path(q,p) ≤ δ}|
• Correlation ρ(L,S) = –0.76
Heterogeneity of link-content correlation
• Decay fit: S = c + (1 − c)·e^(−a·L^b)
(Plot: fitted decay curves for edu, net, gov, org, and com pages; some domains differ significantly in a only, others in both a and b, at α=0.05.)
• .com has more drift
Topical locality-inspired tricks for topical crawlers
• Co-citation (a.k.a. sibling locality): A and C are good hubs, thus A and D should be given high priority
• Co-reference (a.k.a. bibliographic coupling): E and G are good authorities, thus E and H should be given high priority
Correlations between different similarity measures
• Semantic similarity measured from ODP, correlated with:
– Content similarity: TF or TF-IDF vector cosine
– Link similarity: Jaccard coefficient of (in+out) link neighborhoods
• Correlation overall is significant but weak
• Much stronger topical locality in some topics, e.g.:
– Links very informative in news sources
– Text very informative in recipes
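The link-similarity measure mentioned above is a plain Jaccard coefficient over the pages' combined in- and out-link neighborhoods; a minimal sketch:

```python
def jaccard(neighbors_p, neighbors_q):
    """Jaccard coefficient of the (in+out) link neighborhoods of two
    pages: |intersection| / |union| of the neighbor sets."""
    p, q = set(neighbors_p), set(neighbors_q)
    union = p | q
    return len(p & q) / len(union) if union else 0.0

# Two pages sharing 2 of 4 distinct neighbors
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```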
Naïve Best-First
Simplest topical crawler: the frontier is a priority queue based on text similarity between topic and parent page.

BestFirst(topic, seed_urls) {
    foreach link (seed_urls) {
        enqueue(frontier, link);
    }
    while (#frontier > 0 and visited < MAX_PAGES) {
        link := dequeue_link_with_max_score(frontier);
        doc := fetch_new_document(link);
        score := sim(topic, doc);
        foreach outlink (extract_links(doc)) {
            if (#frontier >= MAX_BUFFER) {
                dequeue_link_with_min_score(frontier);
            }
            enqueue(frontier, outlink, score);
        }
    }
}
Best-first variations
• Many in the literature, mostly stemming from different ways to score unvisited URLs. E.g.:
– Giving more importance to certain HTML markup in parent page
– Extending text representation of parent page with anchor text from “grandparent” pages (SharkSearch)
– Limiting link context to less than the entire page
– Exploiting topical locality (co-citation)
– Exploration vs exploitation: relax priorities
• Any of these can be (and many have been) combined
Link context based on text neighborhood
• Often consider a fixed-size window, e.g. 50 words around the anchor
• Can weigh links based on their distance from topic keywords within the document (InfoSpiders, Clever)
• Anchor text deserves extra importance
Link context based on DOM tree
• Consider the DOM subtree rooted at the parent node of the link’s <a> tag
• Or can go further up in the tree (Naïve Best-First is the special case of the entire document body)
• Trade-off between noise due to too small or too large a context tree (Pant 2003)
DOM context
• Link score = linear combination between page-based and context-based similarity scores
Co-citation: hub scores
• Link score with hubs = linear combination between link and hub score
• Hub score = number of seeds linked from the page
Combining DOM context and hub scores
• Experiment based on 159 ODP topics (Pant & Menczer 2003)
• Split ODP URLs between seeds and targets
• Add the 10 best hubs to seeds for 94 topics
Exploration vs Exploitation
• Best-N-First (or BFSN)
• Rather than re-sorting the frontier every time you add links, be lazy and sort only every N pages visited
• Empirically, being less greedy helps crawler performance significantly: escape “local topical traps” by exploring more (Pant et al. 2002)
InfoSpiders
• A series of intelligent multi-agent topical crawling algorithms employing various adaptive techniques:
– Evolutionary bias of exploration/exploitation
– Selective query expansion
– (Connectionist) reinforcement learning
• Menczer & Belew 1998, 2000; Menczer et al. 2004
Link scoring and selection by each crawling agent
(Diagram: for each link l, instances of each keyword k_i near the link feed the agent’s neural net; a stochastic selector then picks the next link. Reconstructed formulas, assuming the standard InfoSpiders design:)
• Input for keyword k_i around link l: sum of matches with inverse-distance weighting
in_i = Σ_{ω ∈ Δ} δ(k_i, ω) / dist(ω, l)
• Stochastic link selection over the neural net’s link scores λ:
Pr[l] = e^(β·λ_l) / Σ_{l′} e^(β·λ_l′)
Artificial life-inspired Evolutionary Local Selection Algorithm (ELSA):

For each agent thread:
  Pick & follow link from local frontier
  Evaluate new links, merge frontier   (match vs. resource bias)
  Adjust link estimator                (reinforcement learning)
  E := E + payoff - cost
  If E < 0:
    Die
  Elsif E > Selection_Threshold:
    Clone offspring                    (selective query expansion)
    Split energy with offspring
    Split frontier with offspring
    Mutate offspring
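The energy dynamics above can be illustrated with a toy update step (a simplified sketch; the agent representation and the even/odd frontier split are illustrative choices, not the original implementation):

```python
def elsa_step(agent, payoff, cost, threshold=2.0):
    """One ELSA energy update for an agent dict with keys 'E' and 'frontier'.
    Returns (agent, offspring): agent is None if it died; offspring is None
    unless the agent's energy exceeded the selection threshold."""
    agent["E"] += payoff - cost
    if agent["E"] < 0:
        return None, None                              # agent dies
    if agent["E"] > threshold:
        child = {"E": agent["E"] / 2,                  # split energy
                 "frontier": agent["frontier"][1::2]}  # split frontier
        agent["E"] /= 2
        agent["frontier"] = agent["frontier"][0::2]
        return agent, child                            # clone offspring
    return agent, None
```

Mutation of the offspring's keywords and weights would follow the clone step.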
Adaptation in InfoSpiders
• Unsupervised population evolution
– Select agents to match resource bias
– Mutate internal queries: selective query expansion
– Mutate weights
• Unsupervised individual adaptation
– Q-learning: adjust neural net weights to predict relevance locally
InfoSpiders evolutionary bias: an agent in a relevant area will spawn other agents to exploit/explore that neighborhood. Each agent carries a keyword vector, a neural net, and a local frontier, which it shares with its offspring.
Multithreaded InfoSpiders (MySpiders)
• Different ways to compute the cost of visiting a document:
– Constant: cost_const = E0 · p0 / T_max
– Proportional to download time: cost_time = f(cost_const · t / timeout)
• The latter is of course more efficient (faster crawling), but it also yields better-quality pages!
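The two cost schemes can be written out directly (a sketch; taking f to be linear and the parameter values are illustrative assumptions):

```python
def cost_const(E0, p0, T_max):
    """Constant cost per page: start-up energy spread over T_max visits."""
    return E0 * p0 / T_max

def cost_time(E0, p0, T_max, t, timeout):
    """Cost proportional to download time t (with f taken as linear):
    fast servers are cheap to visit, slow ones drain the agent's energy."""
    return cost_const(E0, p0, T_max) * t / timeout
```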
Selective Query Expansion in InfoSpiders: internalization of local text features. When a new agent is spawned, it picks up a common term from the current page (here the stem 'th') and adds it, with a weight, to its keyword vector. [Figure: weighted keyword vector with stems POLIT, CONSTITUT, TH, SYSTEM, GOVERN]
Reinforcement Learning
• In general, a reward function R: S × A → ℝ
• Learn a policy π: S → A to maximize reward over time, typically discounted in the future:
  V = Σ_t γ^t r(t),  0 ≤ γ < 1
• Q-learning: the optimal policy picks the action whose value includes that of following the optimal policy in the future:
  π*(s) = argmax_a Q(s,a) = argmax_a [ R(s,a) + γ V*(s') ]
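The discounted objective and the greedy policy can be checked concretely (a toy sketch with a hypothetical tabular Q; real InfoSpiders agents estimate Q with a neural net):

```python
def discounted_return(rewards, gamma=0.9):
    """V = sum over t of gamma^t * r(t), with 0 <= gamma < 1."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def greedy_policy(Q, state):
    """pi*(s) = argmax over a of Q(s, a), for a dict-of-dicts Q-table."""
    actions = Q[state]
    return max(actions, key=actions.get)
```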
Q-learning in InfoSpiders
• Use neural nets to estimate Q scores
• Compare the estimated relevance of a visited page with the Q score of its link (estimated from the parent page) to obtain a feedback signal
• Learn neural net weights using back-propagation of error, with teaching input E(D) + γ · max_{l ∈ D} λ_l
Other Reinforcement Learning Crawlers
• Rennie & McCallum (1999):
– Naïve-Bayes classifier trained on text near links in pre-labeled examples to estimate Q values
– Immediate reward R = 1 for "on-topic" pages (with desired CS papers for the CORA repository)
– All RL algorithms outperform Breadth-First Search
• Future discounting: "For spidering, it is always better to choose immediate over delayed rewards" -- or is it?
– We cannot possibly cover the entire search space, and recall that by being greedy we can be trapped in local topical clusters and fail to discover better ones
– Need to explore!
Evaluation of topical crawlers
• Goal: build "better" crawlers to support applications (Srinivasan et al. 2005)
• Build an unbiased evaluation framework
– Define common tasks of measurable difficulty
– Identify topics and relevant targets
– Identify appropriate performance measures
• Effectiveness: quality of crawled pages, order, etc.
• Efficiency: separate the CPU & memory of crawler algorithms from bandwidth & common utilities
Evaluation corpus = ODP + Web
• Automate evaluation using edited directories
• Different sources of relevance assessments
Topics and Targets
• Topic level corresponds to specificity; depth corresponds to generality
Tasks
• Start from seeds, find targets and/or pages similar to target descriptions (at distance d=2 or d=3 from the targets)
• Back-crawl from the targets to get the seeds
Target-based performance measures
Q: What assumption are we making?
A: Independence!
Performance matrix (S_c^t = set of pages crawled by crawler c at time t; T_d = target pages at depth d; D_d = target descriptions at depth d, for d = 0, 1, 2; σ = similarity):

                        "recall"                        "precision"
Target pages            |S_c^t ∩ T_d| / |T_d|           |S_c^t ∩ T_d| / |S_c^t|
Target descriptions     Σ_{p ∈ S_c^t} σ(p, D_d)         Σ_{p ∈ S_c^t} σ(p, D_d) / |S_c^t|
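The four cells of this matrix translate directly into code (a sketch; `sigma` stands for any page-to-description similarity function, e.g. cosine similarity):

```python
def target_recall(crawled, targets):
    """|S ∩ T_d| / |T_d|: fraction of the known targets found."""
    return len(crawled & targets) / len(targets)

def target_precision(crawled, targets):
    """|S ∩ T_d| / |S|: fraction of crawled pages that are targets."""
    return len(crawled & targets) / len(crawled)

def description_recall(crawled, descriptions, sigma):
    """Sum over crawled pages of similarity to the target descriptions."""
    return sum(sigma(p, descriptions) for p in crawled)

def description_precision(crawled, descriptions, sigma):
    """The same sum, averaged over the number of crawled pages."""
    return description_recall(crawled, descriptions, sigma) / len(crawled)
```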
Crawling evaluation framework
• A main process supplies keywords and seed URLs to N crawlers
• Each crawler's logic keeps private data structures (a limited resource) and shares common data structures, including the URL store
• Concurrent fetch/parse/stem modules issue the HTTP requests to the Web
Using the framework to compare crawler performance: plot average target-page recall against the number of pages crawled.
Efficiency & scalability: plot performance/cost against link frontier size.
Topical crawler performance depends on topic characteristics:
• L = seed-target similarity
• P = popularity (topic keyword generality)
• A = target authoritativeness
• C = target link cohesiveness
[Table: how these characteristics relate to the performance of BreadthFirst, BFS-1, BFS-256, and InfoSpiders, on both target pages and target descriptions]
Crawler ethics and conflicts
• Crawlers can cause trouble, even unwillingly, if not properly designed to be "polite" and "ethical"
• For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
– Server administrator and users will be upset
– The crawler developer/admin IP address may be blacklisted
Crawler etiquette (important!)
• Identify yourself
– Use the 'User-Agent' HTTP header to identify the crawler, with a website describing the crawler and contact information for its developer
– Use the 'From' HTTP header to specify the crawler developer's email
– Do not disguise the crawler as a browser by using a browser's 'User-Agent' string
• Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem
• Pay attention to anything that may lead to too many requests to any one server, even unwillingly, e.g.:
– redirection loops
– spider traps
Crawler etiquette (important!)
• Spread the load, do not overwhelm a server
– Make sure to send no more than some maximum number of requests to any single server per unit time, say < 1/second
• Honor the Robot Exclusion Protocol
– A server can specify which parts of its document tree any crawler is or is not allowed to crawl in a file named 'robots.txt' placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt
– A crawler should always check, parse, and obey this file before sending any requests to a server
– More info at:
• http://www.google.com/robots.txt
• http://www.robotstxt.org/wc/exclusion.html
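The "spread the load" rule can be sketched as a minimal per-host rate limiter (a toy sketch; production crawlers interleave requests across hosts instead of sleeping):

```python
import time
from urllib.parse import urlparse

_last_request = {}  # host -> monotonic time of the last request to it

def polite_wait(url, min_delay=1.0):
    """Block until at least min_delay seconds have passed since the
    previous request to the same host, then record this request."""
    host = urlparse(url).netloc
    now = time.monotonic()
    wait = _last_request.get(host, float("-inf")) + min_delay - now
    if wait > 0:
        time.sleep(wait)
    _last_request[host] = time.monotonic()
```

Requests to different hosts proceed without waiting on each other.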
More on robot exclusion
• Make sure URLs are canonical before checking against robots.txt
• Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler
• Let's look at some examples to understand the protocol…
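The caching advice can be sketched with the Python standard library's robots.txt parser (a sketch; `fetch_robots` is a hypothetical function that would download http://host/robots.txt):

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

_robots_cache = {}  # host -> parsed robots.txt policy

def allowed(url, agent, fetch_robots):
    """Check url against its host's robots.txt, fetching and parsing
    the file only on the first request to that host."""
    host = urlsplit(url).netloc
    if host not in _robots_cache:
        rp = RobotFileParser()
        rp.parse(fetch_robots(host).splitlines())
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(agent, url)
```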
www.apple.com/robots.txt

# robots.txt for http://www.apple.com/
User-agent: *
Disallow:

All crawlers can go anywhere!
www.microsoft.com/robots.txt

# Robots.txt file for http://www.microsoft.com
User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
#etc…

All crawlers are disallowed from these paths.
www.springer.com/robots.txt

# Robots.txt for http://www.springer.com (fragment)
User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

The Google crawler is allowed everywhere except these paths; Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down; AltaVista (scooter) has no limits; everyone else keep off!
More crawler ethics issues
• Is compliance with robot exclusion a matter of law?
– No! Compliance is voluntary, but if you do not comply, you may be blocked
– Someone (unsuccessfully) sued the Internet Archive over a robots.txt-related issue
• Some crawlers disguise themselves
– Using a false User-Agent
– Randomizing access frequency to look like a human/browser
– Example: click fraud for ads
More crawler ethics issues
• Servers can disguise themselves, too
– Cloaking: present different content based on User-Agent
– E.g. stuff keywords into the version of a page shown to the search engine crawler
– Search engines do not look kindly on this type of "spamdexing" and remove from their index sites that perform such abuse
• The case of bmw.de made the news
Gray areas for crawler ethics
• If you write a crawler that unwillingly follows links to ads, are you just being careless, or are you violating terms of service, or are you violating the law by defrauding advertisers?
– Is non-compliance with Google's robots.txt in this case equivalent to click fraud?
• If you write a browser extension that performs some useful service, should you comply with robot exclusion?
New developments: social, collaborative, federated crawlers
• Idea: go beyond the "one-fits-all" model of centralized search engines
• Extend the search task to anyone, and distribute the crawling task
• Each search engine is a peer agent
• Agents collaborate by routing queries and results
6S: Collaborative Peer Search
• Each peer runs its own crawler and index over local storage, seeded from the user's bookmarks
• Peers route queries and hits to one another, creating data mining & referral opportunities
• Communities of like-minded peers emerge
Basic idea: learn based on prior query/response interactions
Learning about other peers
Query routing in 6S
Emergent semantic clustering
Simulation 1: 70 peers, 7 groups
• The dynamic network of queries and results exchanged among 6S peer agents quickly forms a small world, with small diameter and high clustering (Wu et al. 2005)
Simulation 2: 500 users, ODP (dmoz.org)
• Each synthetic user is associated with a topic
Semantic similarity: peers with similar interests are more likely to talk to each other (Akavipat et al. 2006)
Quality of results
• More sophisticated learning algorithms do better
• The more interactions, the better
Download and try the free 6S prototype: http://homer.informatics.indiana.edu/~nan/6S/
• 1-click configuration of a personal crawler and setup of a search engine
• Search via a Firefox browser extension
Need crawling code?
• Reference C implementation of HTTP, HTML parsing, etc.
– w3c-libwww package from the World Wide Web Consortium: www.w3c.org/Library/
• LWP (Perl)
– http://www.oreilly.com/catalog/perllwp/
– http://search.cpan.org/~gaas/libwww-perl-5.804/
• Open source crawlers/search engines
– Nutch: http://www.nutch.org/ (Jakarta Lucene: jakarta.apache.org/lucene/)
– Heritrix: http://crawler.archive.org/
– WIRE: http://www.cwr.cl/projects/WIRE/
– Terrier: http://ir.dcs.gla.ac.uk/terrier/
• Open source topical crawlers, Best-First-N (Java)
– http://informatics.indiana.edu/fil/IS/JavaCrawlers/
• Evaluation framework for topical crawlers (Perl)
– http://informatics.indiana.edu/fil/IS/Framework/

Motivation for crawlers
• Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
• Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
• Business intelligence: keep track of potential competitors, partners
• Monitor Web sites of interest
• Evil: harvest emails for spamming, phishing…
• … Can you think of some others?
A crawler within a search engine
• The crawler (e.g. googlebot) fetches pages from the Web into a page repository
• Text & link analysis builds the text index and PageRank
• The ranker combines these to answer queries with ranked hits
One taxonomy of crawlers
• Crawlers divide into universal crawlers and preferential crawlers
• Preferential crawlers include focused and topical crawlers
• Topical crawlers may be static (best-first, PageRank, etc.) or adaptive (evolutionary crawlers, reinforcement learning crawlers, etc.)
• Many other criteria could be used:
– Incremental, interactive, concurrent, etc.
Basic crawlers
• This is a sequential crawler
• Seeds can be any list of starting URLs
• Order of page visits is determined by the frontier data structure
• Stop criterion can be anything
Graph traversal (BFS or DFS?)
• Breadth-First Search
– Implemented with a QUEUE (FIFO)
– Finds pages along shortest paths
– If we start with "good" pages, this keeps us close; maybe other good stuff…
• Depth-First Search
– Implemented with a STACK (LIFO)
– Wander away ("lost in cyberspace")
A basic crawler in Perl
• Queue: a FIFO list (shift and push)

my @frontier = read_seeds($file);
while (@frontier && $tot < $max) {
    my $next_link = shift @frontier;
    my $page = fetch($next_link);
    add_to_index($page);
    my @links = extract_links($page, $next_link);
    push @frontier, process(@links);
}
Implementation issues
• Don't want to fetch the same page twice!
– Keep a lookup table (hash) of visited pages
– What if a page is not visited but already in the frontier?
• The frontier grows very fast!
– May need to prioritize for large crawls
• Fetcher must be robust!
– Don't crash if a download fails
– Timeout mechanism
• Determine file type to skip unwanted files
– Can try using extensions, but not reliable
– Can issue 'HEAD' HTTP commands to get Content-Type (MIME) headers, but at the overhead of extra Internet requests
More implementation issues
• Fetching
– Get only the first 10-100 KB per page
– Take care to detect and break redirection loops
– Soft fail for timeout, server not responding, file not found, and other errors
More implementation issues: Parsing
• HTML has the structure of a DOM (Document Object Model) tree
• Unfortunately, actual HTML is often incorrect in a strict syntactic sense
• Crawlers, like browsers, must be robust/forgiving
• Fortunately there are tools that can help
– E.g. tidy.sourceforge.net
• Must pay attention to HTML entities and Unicode in text
• What to do with a growing number of other formats?
– Flash, SVG, RSS, AJAX…
More implementation issues
• Stop words
– Noise words that do not carry meaning should be eliminated ("stopped") before they are indexed
– E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc…
– Typically syntactic markers
– Typically the most common terms
– Typically kept in a negative dictionary
• 10–1,000 elements
• E.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
– The parser can detect these right away and disregard them
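A negative dictionary is just a set lookup at parse time (a sketch with a tiny sample stop list; real lists run from tens to about a thousand entries):

```python
STOP_WORDS = {"and", "the", "a", "at", "or", "on", "for"}  # tiny sample list

def remove_stop_words(tokens):
    """Drop tokens found in the negative dictionary before indexing."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```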
More implementation issues: Conflation and thesauri
• Idea: improve recall by merging words with the same meaning
1. We want to ignore superficial morphological features, thus merge semantically similar tokens
– {student, study, studying, studious} => studi
2. We can also conflate synonyms into a single form using a thesaurus
– 30-50% smaller index
– Doing this in both pages and queries allows us to retrieve pages about 'automobile' when the user asks for 'car'
– The thesaurus can be implemented as a hash table
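The hash-table thesaurus, applied to both pages and queries, can be sketched as follows (the mappings shown are illustrative):

```python
THESAURUS = {"car": "automobile", "auto": "automobile"}  # synonym -> canonical form

def conflate(tokens):
    """Map each token to its canonical form, so 'car' in a query
    matches 'automobile' in a page (and vice versa)."""
    return [THESAURUS.get(t.lower(), t.lower()) for t in tokens]
```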
  • 19.
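As the slide suggests, the thesaurus can be a hash table mapping each synonym to a canonical form, applied to both pages and queries. The entries below are made-up illustrations, not a real thesaurus:

```python
# Thesaurus as a hash table: synonym -> canonical form (illustrative entries).
THESAURUS = {"car": "automobile", "auto": "automobile", "automobile": "automobile"}

def conflate(tokens):
    """Map each token to its canonical form if the thesaurus knows it."""
    return [THESAURUS.get(t.lower(), t.lower()) for t in tokens]
```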
More implementation issues
• Stemming
– Morphological conflation based on rewrite rules
– Language dependent!
– The Porter stemmer is very popular for English
• http://www.tartarus.org/~martin/PorterStemmer/
• Context-sensitive grammar rules, e.g.: "IES" except ("EIES" or "AIES") --> "Y"
• Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.
– Porter has also developed Snowball, a language to create stemming algorithms in any language
• http://snowball.tartarus.org/
• Ex. Perl modules: Lingua::Stem and Lingua::Stem::Snowball
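The single context-sensitive rule quoted above can be sketched as follows. A real stemmer such as Porter's applies many such rules in ordered steps; this illustrates only this one rule:

```python
# One context-sensitive rewrite rule: "IES" except ("EIES" or "AIES") --> "Y".
# Real stemmers chain many such rules; this is a one-rule illustration.
def apply_ies_rule(word):
    w = word.lower()
    if w.endswith("ies") and not (w.endswith("eies") or w.endswith("aies")):
        return w[:-3] + "y"
    return w
```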
More implementation issues
• Static vs. dynamic pages
– Is it worth trying to eliminate dynamic pages and only index static pages?
– Examples:
• http://www.census.gov/cgi-bin/gazetteer
• http://informatics.indiana.edu/research/colloquia.asp
• http://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452
• http://www.imdb.com/Name?Menczer,+Erico
• http://www.imdb.com/name/nm0578801/
– Why or why not? How can we tell if a page is dynamic? What about 'spider traps'?
– What do Google and other search engines do?
More implementation issues
• Relative vs. Absolute URLs
– The crawler must translate relative URLs into absolute URLs
– Need to obtain the Base URL from the HTTP header, or an HTML Meta tag, or else the current page path by default
– Examples
• Base: http://www.cnn.com/linkto/
• Relative URL: intl.html
• Absolute URL: http://www.cnn.com/linkto/intl.html
• Relative URL: /US/
• Absolute URL: http://www.cnn.com/US/
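The slide's examples can be reproduced with the standard library's URL resolution, which implements the relative-reference rules used here:

```python
# Resolving relative URLs against a base URL, as in the slide's examples.
from urllib.parse import urljoin

base = "http://www.cnn.com/linkto/"
print(urljoin(base, "intl.html"))  # http://www.cnn.com/linkto/intl.html
print(urljoin(base, "/US/"))       # http://www.cnn.com/US/
```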
More implementation issues
• URL canonicalization
– All of these:
• http://www.cnn.com/TECH
• http://WWW.CNN.COM/TECH/
• http://www.cnn.com:80/TECH/
• http://www.cnn.com/bogus/../TECH/
– Are really equivalent to this canonical form:
• http://www.cnn.com/TECH/
– In order to avoid duplication, the crawler must transform all URLs into canonical form
– The definition of "canonical" is arbitrary, e.g.:
• Could always include the port
• Or only include the port when it is not the default :80
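A sketch of one possible canonicalization, using only the standard library. As the slide notes, the choice of canonical form is arbitrary; this version lowercases scheme and host, drops the default port :80, resolves dot-segments, unquotes percent-escapes, drops fragments, and (unlike the slide's example form) drops trailing slashes:

```python
# One arbitrary choice of canonical form: lowercase scheme/host, drop the
# default port, resolve "." and ".." segments, unquote escapes, drop the
# fragment and trailing slash.
from urllib.parse import urlsplit, urlunsplit, unquote
import posixpath

def canonicalize(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    host = netloc.lower()
    if scheme == "http" and host.endswith(":80"):
        host = host[:-3]  # drop default port
    path = posixpath.normpath(unquote(path)) if path else "/"
    return urlunsplit((scheme, host, path, query, ""))
```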
More on Canonical URLs
• Some transformations are trivial, for example:
– http://informatics.indiana.edu → http://informatics.indiana.edu/
– http://informatics.indiana.edu/index.html#fragment → http://informatics.indiana.edu/index.html
– http://informatics.indiana.edu/dir1/./../dir2/ → http://informatics.indiana.edu/dir2/
– http://informatics.indiana.edu/%7Efil/ → http://informatics.indiana.edu/~fil/
– http://INFORMATICS.INDIANA.EDU/fil/ → http://informatics.indiana.edu/fil/
More on Canonical URLs
Other transformations require heuristic assumptions about the intentions of the author or the configuration of the Web server:
1. Removing the default file name
– http://informatics.indiana.edu/fil/index.html → http://informatics.indiana.edu/fil/
– This is reasonable in general, but would be wrong in this case because the default happens to be 'default.asp' instead of 'index.html'
2. Adding a trailing slash to a directory
– http://informatics.indiana.edu/fil → http://informatics.indiana.edu/fil/
– This is correct in this case, but how can we be sure in general that there isn't a file named 'fil' in the root dir?
More implementation issues
• Spider traps
– Misleading sites: an indefinite number of pages dynamically generated by CGI scripts
– Paths of arbitrary depth created using soft directory links and path-rewriting features in the HTTP server
– Only heuristic defensive measures:
• Check URL length; assume a spider trap above some threshold, for example 128 characters
• Watch for sites with a very large number of URLs
• Eliminate URLs with non-textual data types
• May disable crawling of dynamic pages, if they can be detected
More implementation issues
• Page repository
– Naïve: store each page as a separate file
• Can map a URL to a unique filename using a hashing function, e.g. MD5
• This generates a huge number of files, which is inefficient from the storage perspective
– Better: combine many pages into a single large file, using some XML markup to separate and identify them
• Must map URL to {filename, page_id}
– Database options
• Any RDBMS -- large overhead
• Light-weight, embedded databases such as Berkeley DB
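The naïve URL-to-filename mapping mentioned above can be sketched with an MD5 digest; the ".html" suffix is just an illustrative convention:

```python
# Naive page repository: map each URL to a unique filename via MD5.
import hashlib

def page_filename(url):
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
```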
Concurrency
• A crawler incurs several delays:
– Resolving the host name in the URL to an IP address using DNS
– Connecting a socket to the server and sending the request
– Receiving the requested page in response
• Solution: overlap the above delays by fetching many pages concurrently
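A minimal sketch of overlapping those delays with a thread pool; `fetch` stands in for any function that downloads one URL and returns its content:

```python
# Overlap DNS, connection, and transfer delays by fetching many pages
# at once. `fetch` is a caller-supplied download function.
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))  # results in input order
```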
Architecture of a concurrent crawler (figure)
Concurrent crawlers
• Can use multi-processing or multi-threading
• Each process or thread works like a sequential crawler, except they share data structures: frontier and repository
• Shared data structures must be synchronized (locked for concurrent writes)
• Speedups by a factor of 5-10 are easy to obtain this way
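A sketch of a shared frontier whose writes are synchronized with a lock, as described above; the duplicate check via a `seen` set is a common convention, not something the slide specifies:

```python
# Shared URL frontier for concurrent workers; a lock guards all updates.
import threading
from collections import deque

class Frontier:
    def __init__(self, seeds=()):
        self._lock = threading.Lock()
        self._queue = deque(seeds)
        self._seen = set(seeds)  # avoid re-enqueueing (an added convention)

    def add(self, url):
        with self._lock:
            if url not in self._seen:
                self._seen.add(url)
                self._queue.append(url)

    def next(self):
        with self._lock:
            return self._queue.popleft() if self._queue else None
```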
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers
Universal crawlers
• Support universal search engines
• Large-scale
• The huge cost (network bandwidth) of a crawl is amortized over many queries from users
• Incremental updates to the existing index and other data repositories
Large-scale universal crawlers
• Two major issues:
1. Performance
• Need to scale up to billions of pages
2. Policy
• Need to trade off coverage, freshness, and bias (e.g. toward "important" pages)
Large-scale crawlers: scalability
• Need to minimize the overhead of DNS lookups
• Need to optimize utilization of network bandwidth and disk throughput (I/O is the bottleneck)
• Use asynchronous sockets
– Multi-processing or multi-threading do not scale up to billions of pages
– Non-blocking: hundreds of network connections open simultaneously
– Polling sockets to monitor completion of network transfers
High-level architecture of a scalable universal crawler (diagram annotations):
• Several parallel queues to spread load across servers (keep connections alive)
• DNS server using UDP (less overhead than TCP), large persistent in-memory cache, and prefetching
• Optimize use of network bandwidth
• Optimize disk I/O throughput
• Huge farm of crawl machines
Universal crawlers: Policy
• Coverage
– New pages get added all the time
– Can the crawler find every page?
• Freshness
– Pages change over time, get removed, etc.
– How frequently can a crawler revisit?
• Trade-off!
– Focus on the most "important" pages (crawler bias)?
– "Importance" is subjective
Web coverage by search engine crawlers
(Chart: estimated coverage of roughly 35% in 1997, 34% in 1998, 16% in 1999, and 50% in 2000.)
This assumes we know the size of the entire Web. Do we? Can you define "the size of the Web"?
Maintaining a "fresh" collection
• Universal crawlers are never "done"
• High variance in the rate and amount of page changes
• HTTP headers are notoriously unreliable
– Last-modified
– Expires
• Solution
– Estimate the probability that a previously visited page has changed in the meanwhile
– Prioritize by this probability estimate
Estimating page change rates
• Algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
– Brewington & Cybenko; Cho, Garcia-Molina & Page
• Assumption: the recent past predicts the future (Ntoulas, Cho & Olston 2004)
– Frequency of change is not a good predictor
– Degree of change is a better predictor
Do we need to crawl the entire Web?
• If we cover too much, it will get stale
• There is an abundance of pages on the Web
• For PageRank, pages with very low prestige are largely useless
• What is the goal?
– General search engines: pages with high prestige
– News portals: pages that change often
– Vertical portals: pages on some topic
• What are appropriate priority measures in these cases? Approximations?
Breadth-first crawlers
• A BF crawler tends to crawl high-PageRank pages very early
• Therefore, a BF crawler is a good baseline to gauge other crawlers
• But why is this so? (Najork and Wiener 2001)
Bias of breadth-first crawlers
• The structure of the Web graph is very different from a random network
• Power-law distribution of in-degree
• Therefore there are hub pages with very high PageRank and many incoming links
• These are attractors: you cannot avoid them!
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers
Preferential crawlers
• Assume we can estimate for each page an importance measure, I(p)
• Want to visit pages in order of decreasing I(p)
• Maintain the frontier as a priority queue sorted by I(p)
• Possible figures of merit:
– Precision ~ |{p : crawled(p) ∧ I(p) > threshold}| / |{p : crawled(p)}|
– Recall ~ |{p : crawled(p) ∧ I(p) > threshold}| / |{p : I(p) > threshold}|
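The two figures of merit can be computed directly from a crawl and an importance map; `I` here is a toy dictionary standing in for the importance measure:

```python
# Precision and recall over an importance threshold, as defined above.
# `I` maps every page to its importance; `crawled` is the crawl set.
def crawl_precision(crawled, I, threshold):
    hits = [p for p in crawled if I[p] > threshold]
    return len(hits) / len(crawled)

def crawl_recall(crawled, I, threshold):
    hits = [p for p in crawled if I[p] > threshold]
    relevant = [p for p in I if I[p] > threshold]
    return len(hits) / len(relevant)
```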
Preferential crawlers
• Selective bias toward some pages, e.g. most "relevant"/topical, closest to seeds, most popular/largest PageRank, unknown servers, highest rate/amount of change, etc.
• Focused crawlers
– Supervised learning: classifier based on labeled examples
• Topical crawlers
– Best-first search based on similarity(topic, parent)
– Adaptive crawlers
• Reinforcement learning
• Evolutionary algorithms/artificial life
Preferential crawling algorithms: Examples
• Breadth-First
– Exhaustively visit all links in the order encountered
• Best-N-First
– Priority queue sorted by similarity, explore top N at a time
– Variants: DOM context, hub scores
• PageRank
– Priority queue sorted by keywords, PageRank
• SharkSearch
– Priority queue sorted by a combination of similarity, anchor text, similarity of the parent, etc. (powerful cousin of FishSearch)
• InfoSpiders
– Adaptive distributed algorithm using an evolving population of learning agents
Preferential crawlers: Examples
• For I(p) = PageRank (estimated based on pages crawled so far), we can find high-PR pages faster than a breadth-first crawler (Cho, Garcia-Molina & Page 1998)
• (Figure: recall vs. crawl size.)
Focused crawlers: Basic idea
• Naïve-Bayes classifier based on example pages in the desired topic, c*
• Score(p) = Pr(c*|p)
– Soft focus: the frontier is a priority queue using the page score
– Hard focus:
• Find the best leaf ĉ for p
• If an ancestor c' of ĉ is in c* then add links from p to the frontier, else discard
– Soft and hard focus work equally well empirically
• Example: Open Directory
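A minimal sketch of the classifier idea, assuming a bag-of-words Naïve-Bayes with add-one smoothing and a uniform class prior; the training documents below are toy examples, and a real focused crawler would train on labeled ODP pages:

```python
# Toy Naive-Bayes page scorer: log Pr(tokens | class) with add-one
# smoothing and a uniform prior (assumptions for this sketch).
import math
from collections import Counter

def train_nb(docs_by_class):
    """docs_by_class: {class_label: [token_list, ...]} -> model."""
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(tok for d in docs for tok in d)
        model[c] = (counts, sum(counts.values()))
    return model

def score(model, tokens, c, vocab_size):
    counts, total = model[c]
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
```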
Focused crawlers
• Can have multiple topics with as many classifiers, with scores appropriately combined (Chakrabarti et al. 1999)
• Can use a distiller to find topical hubs periodically, and add these to the frontier
• Can accelerate with the use of a critic (Chakrabarti et al. 2002)
• Can use alternative classifier algorithms to naïve-Bayes, e.g. SVMs and neural nets have reportedly performed better (Pant & Srinivasan 2005)
Context-focused crawlers
• Same idea, but multiple classes (and classifiers) based on link distance from relevant targets
– ℓ=0: topic of interest
– ℓ=1: pages linking to the topic of interest
– Etc.
• Initially needs a back-crawl from seeds (or known targets) to train classifiers to estimate distance
• Links in the frontier are prioritized based on estimated distance from targets
• Outperforms the standard focused crawler empirically
• (Figure: context graph.)
Topical crawlers
• All we have is a topic (query, description, keywords) and a set of seed pages (not necessarily relevant)
• No labeled examples
• Must predict the relevance of unvisited links to prioritize
• Original idea: Menczer 1997, Menczer & Belew 1998
Example: myspiders.informatics.indiana.edu
Topical locality
• Topical locality is a necessary condition for a topical crawler to work, and for surfing to be a worthwhile activity for humans
• Links must encode semantic information, i.e. say something about neighbor pages, not be random
• It is also a sufficient condition if we start from "good" seed pages
• Indeed we know that Web topical locality is strong:
– Indirectly (crawlers work and people surf the Web)
– From direct measurements (Davison 2000; Menczer 2004, 2005)
Quantifying topical locality
• Different ways to pose the question:
– How quickly does semantic locality decay?
– How fast is topic drift?
– How quickly does content change as we surf away from a starting page?
• To answer these questions, let us consider exhaustive breadth-first crawls from 100 topic pages
The "link-cluster" conjecture
• Connection between semantic topology (relevance) and link topology (hypertext)
– G = Pr[rel(p)] ~ fraction of relevant/topical pages (topic generality)
– R = Pr[rel(p) | rel(q) AND link(q,p)] ~ conditional probability given a neighbor on topic
• Related nodes are clustered if R > G
– Necessary and sufficient condition for a random crawler to find pages related to the start points
– Example: 2 topical clusters with stronger modularity within each cluster than outside
• (Figure: example network with G = 5/15 and R = 3/6 = 2/4 within C = 2 clusters.)
Link-cluster conjecture
• Stationary hit rate for a random crawler:
η(t+1) = η(t)·R + (1 − η(t))·G ≥ η(t)
η(t) → η* = G / (1 − (R − G)) as t → ∞
• Conjecture: η* > G ⇔ R > G
• Value added of links: η*/G − 1 = (R − G) / (1 − (R − G))
Link-cluster conjecture
• Preservation of semantics (meaning) across links
• 1000 times more likely to be on topic if near an on-topic page!
R(q,δ) / G(q) ≡ Pr[rel(p) | rel(q) ∧ |path(q,p)| ≤ δ] / Pr[rel(p)]
L(q,δ) ≡ Σ_{p : |path(q,p)| ≤ δ} |path(q,p)| / |{p : |path(q,p)| ≤ δ}|
The "link-content" conjecture
• Correlation of lexical (content) and linkage topology
• L(δ): average link distance
• S(δ): average content similarity to the start (topic) page from pages up to distance δ
• Correlation ρ(L,S) = −0.76
S(q,δ) ≡ Σ_{p : |path(q,p)| ≤ δ} sim(q,p) / |{p : |path(q,p)| ≤ δ}|
Heterogeneity of link-content correlation
• Fit: S = c + (1 − c)·e^(a·L^b)
• (Figure: separate fits for edu, net, gov, org, and com domains; significant differences in a only, or in both a and b, at α = 0.05.)
• .com has more drift
Topical locality-inspired tricks for topical crawlers
• Co-citation (a.k.a. sibling locality): A and C are good hubs, thus A and D should be given high priority
• Co-reference (a.k.a. bibliographic coupling): E and G are good authorities, thus E and H should be given high priority
Correlations between different similarity measures
• Semantic similarity measured from ODP, correlated with:
– Content similarity: TF or TF-IDF vector cosine
– Link similarity: Jaccard coefficient of (in+out) link neighborhoods
• Correlation overall is significant but weak
• Much stronger topical locality in some topics, e.g.:
– Links very informative in news sources
– Text very informative in recipes
Naïve Best-First
Simplest topical crawler: the frontier is a priority queue based on text similarity between the topic and the parent page.

BestFirst(topic, seed_urls) {
  foreach link (seed_urls) {
    enqueue(frontier, link);
  }
  while (#frontier > 0 and visited < MAX_PAGES) {
    link := dequeue_link_with_max_score(frontier);
    doc := fetch_new_document(link);
    score := sim(topic, doc);
    foreach outlink (extract_links(doc)) {
      if (#frontier >= MAX_BUFFER) {
        dequeue_link_with_min_score(frontier);
      }
      enqueue(frontier, outlink, score);
    }
  }
}
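The pseudocode above can be sketched as a runnable Python function, assuming caller-supplied `fetch(url)`, `extract_links(doc)`, and `sim(topic, doc)` functions and similarity scores in [0, 1]:

```python
# Runnable sketch of Naive Best-First: the frontier is a priority queue
# (heapq is a min-heap, so scores are negated) with a bounded buffer.
import heapq

def best_first(topic, seed_urls, fetch, extract_links, sim,
               max_pages=100, max_buffer=1000):
    frontier = [(-1.0, url) for url in seed_urls]  # seeds get top priority
    heapq.heapify(frontier)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.append(url)
        doc = fetch(url)
        score = sim(topic, doc)  # outlinks inherit the parent's score
        for outlink in extract_links(doc):
            if len(frontier) >= max_buffer:
                frontier.remove(max(frontier))  # drop the min-score link
                heapq.heapify(frontier)
            heapq.heappush(frontier, (-score, outlink))
    return visited
```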
Best-first variations
• Many in the literature, mostly stemming from different ways to score unvisited URLs. E.g.:
– Giving more importance to certain HTML markup in the parent page
– Extending the text representation of the parent page with anchor text from "grandparent" pages (SharkSearch)
– Limiting the link context to less than the entire page
– Exploiting topical locality (co-citation)
– Exploration vs. exploitation: relax priorities
• Any of these can be (and many have been) combined
Link context based on text neighborhood
• Often consider a fixed-size window, e.g. 50 words around the anchor
• Can weigh links based on their distance from topic keywords within the document (InfoSpiders, Clever)
• Anchor text deserves extra importance
Link context based on DOM tree
• Consider the DOM subtree rooted at the parent node of the link's <a> tag
• Or can go further up in the tree (Naïve Best-First is the special case of the entire document body)
• Trade-off between noise due to too small or too large a context tree (Pant 2003)
DOM context
• Link score = linear combination of page-based and context-based similarity scores
Co-citation: hub scores
• Link score_hub = linear combination of link and hub scores
• Hub score: number of seeds linked from a page
Combining DOM context and hub scores
• Experiment based on 159 ODP topics (Pant & Menczer 2003)
• Split ODP URLs between seeds and targets
• Add the 10 best hubs to the seeds for 94 topics
Exploration vs. Exploitation
• Best-N-First (or BFSN)
• Rather than re-sorting the frontier every time you add links, be lazy and sort only every N pages visited
• Empirically, being less greedy helps crawler performance significantly: escape "local topical traps" by exploring more (Pant et al. 2002)
InfoSpiders
• A series of intelligent multi-agent topical crawling algorithms employing various adaptive techniques:
– Evolutionary bias of exploration/exploitation
– Selective query expansion
– (Connectionist) reinforcement learning
• Menczer & Belew 1998, 2000; Menczer et al. 2004
Link scoring and selection by each crawling agent
• Each link l is scored by the agent's neural net and chosen by a stochastic selector:
Pr[l] = e^(β·λ_l) / Σ_{l'} e^(β·λ_{l'})
λ_l = net(in_1, ..., in_N)   (agent's neural net)
in_k = Σ_{ω ∈ Δ} δ(k_i, ω) / dist(ω, l)   (sum of matches of keyword k_i to instances around link l, with inverse-distance weighting)
Artificial life-inspired Evolutionary Local Selection Algorithm (ELSA)

Foreach agent thread:
  Pick & follow link from local frontier
  Evaluate new links, merge frontier
  Adjust link estimator                  (reinforcement learning)
  E := E + payoff - cost
  If E < 0:
    Die
  Elsif E > Selection_Threshold:
    Clone offspring
    Split energy with offspring
    Split frontier with offspring
    Mutate offspring

(Annotations: selection matches the resource bias; mutation of offspring performs selective query expansion.)
Adaptation in InfoSpiders
• Unsupervised population evolution
– Select agents to match the resource bias
– Mutate internal queries: selective query expansion
– Mutate weights
• Unsupervised individual adaptation
– Q-learning: adjust neural net weights to predict relevance locally
InfoSpiders evolutionary bias: an agent in a relevant area will spawn other agents to exploit/explore that neighborhood.
(Figure: an agent with a keyword vector, neural net, and local frontier spawning offspring.)
Multithreaded InfoSpiders (MySpiders)
• Different ways to compute the cost of visiting a document:
– Constant: cost_const = E0·p0 / Tmax
– Proportional to download time: cost_time = f(cost_const·t / timeout)
• The latter is of course more efficient (faster crawling), but it also yields better quality pages!
Selective Query Expansion in InfoSpiders: internalization of local text features
• When a new agent is spawned, it picks up a common term from the current page (here 'th')
• (Figure: keyword vector with terms POLIT, CONSTITUT, TH, SYSTEM, GOVERN and associated weights.)
Reinforcement Learning
• In general, reward function R: S × A → ℜ
• Learn a policy (π: S → A) to maximize reward over time, typically discounted in the future:
V = Σ_t γ^t·r(t),  0 ≤ γ < 1
• Q-learning: optimal policy
π*(s) = argmax_a Q(s,a) = argmax_a [R(s,a) + γ·V*(s')]
where V*(s') is the value of following the optimal policy in the future
(Figure: state s with actions a1, a2 leading to states s1, s2.)
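A minimal tabular sketch of the Q-learning update behind these formulas. Note this is an illustration only: InfoSpiders approximate Q with a neural net rather than a table, but the learning signal (reward plus the discounted value of the best next action) has the same shape:

```python
# Tabular Q-learning update (sketch; InfoSpiders use a neural net instead).
# Q is a dict keyed by (state, action); alpha is the learning rate.
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q[(s, a)]
```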
Q-learning in InfoSpiders
• Use neural nets to estimate Q scores
• Compare the estimated relevance of a visited page with the Q score of the link estimated from the parent page to obtain a feedback signal
• Learn neural net weights using back-propagation of error with teaching input: E(D) + γ·max_{l∈D} λ_l
Other Reinforcement Learning Crawlers
• Rennie & McCallum (1999):
– Naïve-Bayes classifier trained on text near links in pre-labeled examples to estimate Q values
– Immediate reward R=1 for "on-topic" pages (with desired CS papers for the CORA repository)
– All RL algorithms outperform Breadth-First Search
• Future discounting: "For spidering, it is always better to choose immediate over delayed rewards" -- Or is it?
– But we cannot possibly cover the entire search space, and recall that by being greedy we can be trapped in local topical clusters and fail to discover better ones
– Need to explore!
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers
Evaluation of topical crawlers
• Goal: build "better" crawlers to support applications (Srinivasan et al. 2005)
• Build an unbiased evaluation framework
– Define common tasks of measurable difficulty
– Identify topics, relevant targets
– Identify appropriate performance measures
• Effectiveness: quality of crawled pages, order, etc.
• Efficiency: separate the CPU & memory of crawler algorithms from bandwidth & common utilities
Evaluation corpus = ODP + Web
• Automate evaluation using edited directories
• Different sources of relevance assessments
Topics and Targets
(Figure: topic level ~ specificity; depth ~ generality.)
Tasks
• Start from seeds, find targets and/or pages similar to target descriptions (d=2, d=3)
• Back-crawl from targets to get seeds
Target based performance measures
Q: What assumption are we making? …
A: Independence!
Performance matrix (by target depth d = 0, 1, 2; S_c^t = set of crawled pages, T_d = target pages, D_d = target descriptions, σ_c = similarity):

                      "recall"                          "precision"
target pages          |S_c^t ∩ T_d| / |T_d|             |S_c^t ∩ T_d| / |S_c^t|
target descriptions   Σ_{p ∈ S_c^t} σ_c(p, D_d)         Σ_{p ∈ S_c^t} σ_c(p, D_d) / |S_c^t|
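The target-page cells of the matrix can be computed directly from the crawled and target sets; the sets below are toy illustrations:

```python
# Target-page recall and precision from the performance matrix:
# S is the set of crawled pages, T the set of target pages.
def target_recall(S, T):
    return len(S & T) / len(T)

def target_precision(S, T):
    return len(S & T) / len(S)
```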
Crawling evaluation framework
(Diagram: a main module holds keywords, seed URLs, and common data structures shared by N crawler logic modules; each crawler keeps private data structures and a limited-resource frontier, and reaches the Web through concurrent fetch/parse/stem modules over HTTP.)
Using the framework to compare crawler performance
(Figure: average target-page recall vs. pages crawled.)
Efficiency & scalability
(Figure: performance/cost vs. link frontier size.)
    Slides © 2007Filippo Menczer, Indiana University School of InformaticsBing Liu: Web Data Mining. Springer, 2007Ch. 8 Web Crawling by Filippo MenczerTopical crawlerperformance dependson topic characteristics++++InfoSpiders+++BFS-256+++BFS-1++++BreadthFirstLPACLPACCrawlerTarget descriptionsTarget pagesC = target link cohesivenessA = target authoritativenessP = popularity (topic kw generality)L = seed-target similarity
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers
Crawler ethics and conflicts
• Crawlers can cause trouble, even unwillingly, if not properly designed to be "polite" and "ethical"
• For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
  – Server administrator and users will be upset
  – Crawler developer/admin IP address may be blacklisted
Crawler etiquette (important!)
• Identify yourself
  – Use the 'User-Agent' HTTP header to identify the crawler, pointing to a website with a description of the crawler and contact information for its developer
  – Use the 'From' HTTP header to specify the crawler developer's email
  – Do not disguise the crawler as a browser by using a browser's 'User-Agent' string
• Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem
• Pay attention to anything that may lead to too many requests to any one server, even unwillingly, e.g.:
  – redirection loops
  – spider traps
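A minimal Python sketch of these rules, using only the standard library. The bot name, info URL, and contact address are made-up placeholders, not real services:

```python
import urllib.error
import urllib.request

# Hypothetical crawler identity: the bot name, info page, and contact
# email below are placeholders for your own.
HEADERS = {
    "User-Agent": "ExampleBot/1.0 (+http://example.org/bot.html)",
    "From": "crawler-admin@example.org",
}

def polite_fetch(url, timeout=10):
    """Fetch one page, identifying the crawler and checking the result."""
    req = urllib.request.Request(url, headers=HEADERS)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        # Use the error code to decide what to do: e.g. 404 -> drop the
        # URL, 429/503 -> back off before contacting this server again.
        return e.code, None
    except urllib.error.URLError:
        return None, None   # DNS failure, refused connection, timeout, ...
```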
Crawler etiquette (important!)
• Spread the load; do not overwhelm a server
  – Make sure no more than some maximum number of requests go to any single server per unit time, say < 1/second
• Honor the Robot Exclusion Protocol
  – A server can specify which parts of its document tree any crawler is or is not allowed to crawl via a file named 'robots.txt' placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt
  – A crawler should always check, parse, and obey this file before sending any requests to a server
  – More info at:
    • http://www.google.com/robots.txt
    • http://www.robotstxt.org/wc/exclusion.html
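The per-server rate limit can be sketched as a small throttle keyed by hostname. The class name and the 1-second default are illustrative:

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Enforce a minimum delay between requests to any single host
    (e.g. at most one request per second per server)."""
    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_request = {}   # host -> time of last request to it

    def wait(self, url):
        """Block until it is polite to contact this URL's host again."""
        host = urlparse(url).netloc
        now = time.monotonic()
        elapsed = now - self.last_request.get(host, float("-inf"))
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[host] = time.monotonic()
```

Calling `throttle.wait(url)` before every fetch sleeps only when the same host was contacted less than `min_delay` seconds ago; requests to different hosts proceed without delay.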
More on robot exclusion
• Make sure URLs are canonical before checking them against robots.txt
• Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler
• Let's look at some examples to understand the protocol…
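A sketch of this caching idea with the standard library's robots.txt parser; the class and its one-parser-per-host policy are illustrative, not a reference implementation:

```python
import urllib.robotparser
from urllib.parse import urlparse

class RobotsCache:
    """Cache each server's robots.txt policy instead of re-fetching it
    for every request to that server."""
    def __init__(self, agent):
        self.agent = agent
        self.parsers = {}   # host -> RobotFileParser for that host

    def allowed(self, url):
        host = urlparse(url).netloc
        rp = self.parsers.get(host)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()                  # one fetch per server, then cached
            self.parsers[host] = rp
        return rp.can_fetch(self.agent, url)
```

In production the `read()` call performs the single fetch per host; a fuller version would also expire cached policies after some time and re-fetch them.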
www.apple.com/robots.txt

  # robots.txt for http://www.apple.com/
  User-agent: *
  Disallow:

All crawlers can go anywhere!
www.microsoft.com/robots.txt

  # Robots.txt file for http://www.microsoft.com
  User-agent: *
  Disallow: /canada/Library/mnp/2/aspx/
  Disallow: /communities/bin.aspx
  Disallow: /communities/eventdetails.mspx
  Disallow: /communities/blogs/PortalResults.mspx
  Disallow: /communities/rss.aspx
  Disallow: /downloads/Browse.aspx
  Disallow: /downloads/info.aspx
  Disallow: /france/formation/centres/planning.asp
  Disallow: /france/mnp_utility.mspx
  Disallow: /germany/library/images/mnp/
  Disallow: /germany/mnp_utility.mspx
  Disallow: /ie/ie40/
  Disallow: /info/customerror.htm
  Disallow: /info/smart404.asp
  Disallow: /intlkb/
  Disallow: /isapi/
  # etc…

All crawlers are not allowed in these paths.
www.springer.com/robots.txt

  # Robots.txt for http://www.springer.com (fragment)
  User-agent: Googlebot
  Disallow: /chl/*
  Disallow: /uk/*
  Disallow: /italy/*
  Disallow: /france/*

  User-agent: slurp
  Disallow:
  Crawl-delay: 2

  User-agent: MSNBot
  Disallow:
  Crawl-delay: 2

  User-agent: scooter
  Disallow:

  # all others
  User-agent: *
  Disallow: /

• The Google crawler is allowed everywhere except these paths
• Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down
• AltaVista (scooter) has no limits
• Everyone else: keep off!
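Python's standard library parser can evaluate per-agent rules like these. Below is a sketch using a reduced version of the fragment above; the wildcard paths are omitted because `urllib.robotparser` matches paths literally and does not implement the nonstandard `*` path extension:

```python
import urllib.robotparser

# Reduced robots.txt in the spirit of the Springer example above.
rules = """\
User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("slurp", "http://www.springer.com/uk/"))        # slurp may crawl
print(rp.crawl_delay("slurp"))                                     # ... waiting 2s between requests
print(rp.can_fetch("SomeOtherBot", "http://www.springer.com/uk/")) # everyone else: keep off
```

An empty `Disallow:` means "everything allowed" for that agent, while the final `User-agent: *` / `Disallow: /` entry shuts out every crawler not named earlier.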
More crawler ethics issues
• Is compliance with robot exclusion a matter of law?
  – No! Compliance is voluntary, but if you do not comply, you may be blocked
  – Someone (unsuccessfully) sued Internet Archive over a robots.txt-related issue
• Some crawlers disguise themselves
  – Using a false User-Agent
  – Randomizing access frequency to look like a human/browser
  – Example: click fraud for ads
More crawler ethics issues
• Servers can disguise themselves, too
  – Cloaking: present different content based on User-Agent
  – E.g. stuff keywords into the version of a page shown to the search engine crawler
  – Search engines do not look kindly on this type of "spamdexing" and remove from their index sites that perform such abuse
    • The case of bmw.de made the news
Gray areas for crawler ethics
• If you write a crawler that unwillingly follows links to ads, are you just being careless, are you violating terms of service, or are you violating the law by defrauding advertisers?
  – Is non-compliance with Google's robots.txt in this case equivalent to click fraud?
• If you write a browser extension that performs some useful service, should you comply with robot exclusion?
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments
New developments: social, collaborative, federated crawlers
• Idea: go beyond the "one-fits-all" model of centralized search engines
• Extend the search task to anyone, and distribute the crawling task
• Each search engine is a peer agent
• Agents collaborate by routing queries and results
6S: Collaborative Peer Search
• Each peer runs its own crawler and index over local storage and bookmarks, crawling the WWW
• Peers (A, B, C, …) exchange queries and hits with one another
• This creates data mining & referral opportunities and emerging communities
Basic idea: learn based on prior query/response interactions
Learning about other peers
Query routing in 6S
Emergent semantic clustering
Simulation 1: 70 peers, 7 groups
• The dynamic network of queries and results exchanged among 6S peer agents quickly forms a small world, with small diameter and high clustering (Wu & al. 2005)
Simulation 2: 500 users
• Topics from the ODP (dmoz.org)
• Each synthetic user is associated with a topic
Semantic similarity
• Peers with similar interests are more likely to talk to each other (Akavipat & al. 2006)
Quality of results
• More sophisticated learning algorithms do better
• The more interactions, the better
Download and try the free 6S prototype: http://homer.informatics.indiana.edu/~nan/6S/
• 1-click configuration of a personal crawler and setup of a search engine
Download and try the free 6S prototype: http://homer.informatics.indiana.edu/~nan/6S/
• Search via a Firefox browser extension
Need crawling code?
• Reference C implementation of HTTP, HTML parsing, etc.
  – w3c-libwww package from the World Wide Web Consortium: www.w3c.org/Library/
• LWP (Perl)
  – http://www.oreilly.com/catalog/perllwp/
  – http://search.cpan.org/~gaas/libwww-perl-5.804/
• Open source crawlers/search engines
  – Nutch: http://www.nutch.org/ (Jakarta Lucene: jakarta.apache.org/lucene/)
  – Heritrix: http://crawler.archive.org/
  – WIRE: http://www.cwr.cl/projects/WIRE/
  – Terrier: http://ir.dcs.gla.ac.uk/terrier/
• Open source topical crawlers, Best-First-N (Java)
  – http://informatics.indiana.edu/fil/IS/JavaCrawlers/
• Evaluation framework for topical crawlers (Perl)
  – http://informatics.indiana.edu/fil/IS/Framework/

Editor's Notes

  • #81: A "good" crawler would make "good" choices and hence retrieve "good" pages early. How do we measure the "goodness" of crawlers? Some work on evaluation has been done, but it is often limited in the number of topics, or the same set of measures is used both to crawl and to evaluate.
