Web scraping
Web scraping (sometimes called harvesting) describes any of various means of extracting content from a website over HTTP in order to transform that content into another format suitable for use in another context. Those who scrape websites may wish to store the information in their own databases or manipulate the data within a spreadsheet, although a spreadsheet can often hold only a fraction of the data scraped. Others may use data extraction techniques as a means of obtaining the most recent data possible, particularly when working with information subject to frequent changes. Investors analyzing stock prices, realtors researching home listings, meteorologists studying weather, and insurance salespeople following insurance prices are a few of the people who might fit this category of users of frequently updated data.
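As a minimal sketch of "transforming that content into another format", the following Python example converts an HTML table into CSV text that a spreadsheet could open. The `TableExtractor` class, the `html_table_to_csv` function and the sample page are illustrative assumptions, not part of any particular scraping tool, and the page is supplied as a string rather than fetched over HTTP.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical page content, used only for illustration.
SAMPLE_PAGE = """
<table>
  <tr><th>Ticker</th><th>Price</th></tr>
  <tr><td>ACME</td><td>12.50</td></tr>
  <tr><td>GLOBEX</td><td>7.25</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_PAGE), end="")
```

The sketch handles only simple, non-nested tables; real pages would usually need more tolerant handling of markup.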
Access to certain information may also give users a strategic advantage in business. Attorneys might wish to scrape arrest records from county courthouses in search of potential clients. Businesses that know the locations of competitors can make better decisions about where to focus further growth. Another common but controversial use of information taken from websites is reposting scraped data to other sites.
A typical example application for web scraping is a web crawler that copies content from one or more existing websites in order to generate a scraper site. The result can range from fair-use excerpts or reproductions of text and other content to plagiarized content. In some instances, plagiarized content may be used as an illicit means to increase traffic and advertising revenue. The typical scraper website generates revenue using Google AdSense, hence the term "Made for AdSense" or MFA website.
Web scraping differs from screen scraping in that a website is not really a visual screen but live HTML/JavaScript-based content with a graphical interface in front of it. Web scraping therefore does not work at the visual interface, as screen scraping does, but on the underlying object structure (the Document Object Model) of the HTML and JavaScript.
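To illustrate working on the object structure rather than on the rendered view, the sketch below parses a page fragment into an element tree and selects nodes by tag and attribute. It assumes a well-formed (XHTML-like) fragment so that Python's standard `xml.etree.ElementTree` can parse it; the sample markup, the `id` and `class` values and the `extract_listing` function are hypothetical. Real pages are often not well-formed and would need a tolerant HTML parser, and nothing here executes JavaScript.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="listing">
      <h2>Two-bedroom house</h2>
      <span class="price">250000</span>
    </div>
    <div id="footer"><span class="price">ignore me</span></div>
  </body>
</html>
"""


def extract_listing(markup):
    """Read the title and price from the element tree of `markup`.

    Returns None when the expected nodes are not present.
    """
    root = ET.fromstring(markup)
    # Locate the node by its position and attributes in the tree,
    # not by where it would appear on a rendered screen.
    listing = root.find(".//div[@id='listing']")
    if listing is None:
        return None
    title = listing.findtext("h2")
    price = listing.findtext("span[@class='price']")
    if title is None or price is None:
        return None
    return {"title": title, "price": int(price)}


if __name__ == "__main__":
    print(extract_listing(SAMPLE_PAGE))
```

Because the lookup is scoped to the `listing` element, the similarly marked `span` in the footer is not picked up, which is the kind of precision that structural access allows.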
Web scraping also differs from screen scraping in that screen scraping typically occurs many times from the same dynamic screen "page", whereas web scraping occurs only once per web page over many different static web pages. Recursive web scraping, which follows links to other pages across many websites, is called "web harvesting". Web harvesting is necessarily performed by software called a bot, "webbot", "crawler", "harvester" or "spider", with similar arachnological analogies used to refer to other creepy-crawly aspects of its function. Web harvesters are typically demonised, while "webbots" are often typecast as benevolent.
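The recursive link-following that characterises web harvesting can be sketched as a breadth-first traversal that visits each page once. In this hedged example the `harvest` function, the `LinkParser` class and the `FAKE_WEB` dictionary are illustrative assumptions; the caller supplies a `fetch` function, and an in-memory dictionary stands in for real HTTP requests so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(start_url, fetch, max_pages=50):
    """Breadth-first crawl from start_url; returns {url: html} for visited pages.

    `fetch` is a caller-supplied function mapping a URL to its HTML text,
    or to None when the page cannot be retrieved.
    """
    seen = {start_url}          # URLs already queued, so each is fetched once
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Hypothetical in-memory "web" standing in for real HTTP responses.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "http://example.test/b": "<p>No links here.</p>",
}

if __name__ == "__main__":
    for url in harvest("http://example.test/", FAKE_WEB.get):
        print(url)
```

The `max_pages` limit and the `seen` set keep the traversal bounded and prevent it from looping on pages that link back to each other.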
There are legal web scraping sites that provide free content and are commonly used by webmasters looking to populate a hastily made site with web content, often to profit in some way from the traffic the articles are hoped to bring. This content does not help the ranking of the site in search engine results because the content is not original to that page.[1] Original content is a priority of search engines.[2] Use of free articles usually requires a link back to the free article site, as well as to any links provided by the author. A link back does not always suffice, however: some sites that provide free articles also have a clause in their terms of service that forbids copying content, link back or not. Wikipedia.org (particularly the English Wikipedia) is a common target for web scraping.1
Although scraping is against the terms of use of some websites, the enforceability of these terms is unclear.2 While outright duplication of original expression will in many cases be illegal, the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. In a February 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) also found that systematic crawling, indexing and deep linking by the portal site ofir.dk of the real estate site Home.dk did not conflict with Danish law or the database directive of the European Union.3
U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels,45 which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. However, to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels.6
In Australia, the 2003 Spam Act outlaws some forms of web harvesting.78
A webmaster can use various measures to stop or slow a bot. Some techniques include: