Web scraping

From Wikipedia, the free encyclopedia

For broader coverage of this topic, see Data scraping.

Method of extracting data from websites

"Web scraper" redirects here. For websites that scrape content, see Scraper site.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Scraping a web page involves fetching it and then extracting data from it. Fetching is the downloading of a page (which a browser does when a user views a page). Web crawling is therefore a main component of web scraping, used to fetch pages for later processing. Once a page has been fetched, extraction can take place. The content of a page may be parsed, searched and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page in order to use it for another purpose elsewhere. An example would be finding and copying names and telephone numbers, companies and their URLs, or e-mail addresses to a list (contact scraping). Another example is collecting competitors' product prices for marketing purposes.
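The fetch-then-extract sequence described above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the names `fetch`, `TextExtractor`, `extract_emails` and `SAMPLE_HTML` are hypothetical, the e-mail pattern is deliberately simplistic, and the sample markup is invented so that the extraction step can be run without network access.

```python
import re
import urllib.request
from html.parser import HTMLParser


def fetch(url: str, timeout: float = 10.0) -> str:
    """Fetching step: download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script and style blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Assumption: a simple pattern that covers common addresses only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html: str) -> list[str]:
    """Extraction step: parse the markup, then search the text for e-mails."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    unique: list[str] = []
    for address in EMAIL_RE.findall(text):
        if address not in unique:  # keep first-seen order, drop repeats
            unique.append(address)
    return unique


# Invented sample page standing in for the result of fetch(url).
SAMPLE_HTML = """
<html><body>
<script>var hidden = "bot@example.com";</script>
<p>Sales: <a href="mailto:sales@example.com">sales@example.com</a></p>
<p>Support: support@example.org, sales@example.com</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_emails(SAMPLE_HTML))
```

In a real run, the string passed to `extract_emails` would come from `fetch(url)`; the resulting list could then be written to a spreadsheet or database, as the paragraph above describes.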

Beyond contact scraping, web scraping is used as a component of applications for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashup, and web data integration.

Web pages are built using text-based markup languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. As a result, specialized tools and software have been developed to facilitate the scraping of web pages. Web scraping applications include market research, price comparison, content monitoring and artificial intelligence. Businesses rely on web scraping services to efficiently gather and utilize this data.

Newer forms of web scraping involve monitoring data feeds from web servers. For example, JSON is commonly used as a transport mechanism between the client and the web server.
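As a rough sketch of why JSON feeds are convenient to scrape, the following Python snippet parses a payload directly into structured data, with no HTML parsing involved. The payload shape, the field names (`products`, `name`, `price`) and the function `parse_price_feed` are assumptions made for illustration; an actual feed would define its own schema.

```python
import json

# Hypothetical payload, shaped like what a page's background request
# to its web server might return.
SAMPLE_FEED = """
{"products": [
  {"name": "Widget", "price": "19.99", "currency": "USD"},
  {"name": "Gadget", "price": "5.00", "currency": "USD"}
]}
"""


def parse_price_feed(payload: str) -> dict[str, float]:
    """Map each product name to its price, parsed from a JSON document."""
    data = json.loads(payload)
    return {
        item["name"]: float(item["price"])
        for item in data.get("products", [])
    }


if __name__ == "__main__":
    for name, price in parse_price_feed(SAMPLE_FEED).items():
        print(f"{name}: {price:.2f}")
```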

Some websites use methods to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages. In response, web scraping systems use techniques involving DOM parsing, computer vision and natural language processing to simulate human-like browsing, enabling them to gather web page content for offline parsing.
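One widely used way for a site to declare which pages bots may crawl is a `robots.txt` file. It is not named in the paragraph above and is advisory rather than enforced, so the sketch below is only an assumption about how a well-behaved scraper might check such rules, using Python's standard `urllib.robotparser` module. The rules, the user-agent names and the `build_rules` helper are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Invented rules: every bot is kept out of /private/,
# and a bot called "BadBot" is kept out of the whole site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


if __name__ == "__main__":
    rules = build_rules(SAMPLE_ROBOTS_TXT)
    for agent, url in [
        ("ExampleScraper", "https://example.com/index.html"),
        ("ExampleScraper", "https://example.com/private/data.html"),
        ("BadBot", "https://example.com/index.html"),
    ]:
        print(agent, url, rules.can_fetch(agent, url))
```

A scraper that honours these declarations would call `can_fetch` before requesting each URL and skip any page for which it returns `False`.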

History

Techniques

Human copy-and-paste

Text pattern matching

HTTP programming

HTML parsing

DOM parsing

Vertical aggregation

Semantic annotation recognizing

Computer vision web-page analysis

Legal issues

United States

European Union

Australia

India

Methods to prevent web scraping

See also

References