A web crawler is an Internet bot that systematically browses the World Wide Web so that its pages can be indexed. Also known as an automatic indexer, web spider, or ant, a web crawler runs spidering software that keeps an index up to date, in some cases refreshing it almost daily. It copies the pages it visits so that a search engine can process and index them later, which lets search results be returned faster. Web crawlers can also validate hyperlinks and HTML code, and they can be used for web scraping.
A web crawler works methodically. It starts with a list of URLs, often called seeds. It visits each URL and adds every hyperlink found on that page to its list of URLs to visit. That list can quickly grow to millions of hyperlinks, which is why web crawlers follow a set of policies for prioritizing pages:
- A selection policy tells the crawler which pages to download. It is based on the quality of a page's content, the page's popularity (measured by visits and by links pointing to it), and the relevance of the URL.
- A re-visit policy tells the crawler how often to check a page for changes.
- A politeness policy keeps the crawler from overloading, and possibly crashing, a website.
- A parallelization policy coordinates distributed web crawlers.
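The crawl loop and the politeness policy described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the `crawl` function, its `fetch` callable (URL in, HTML string or `None` out), the `delay` parameter, and the in-memory `PAGES` dictionary standing in for the web are all assumptions made for the example. The selection policy is reduced to a simple page limit, and the re-visit and parallelization policies are left out.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100, delay=0.0):
    """Breadth-first crawl; returns URLs in the order they were fetched.

    fetch is a hypothetical callable: URL -> HTML string, or None on failure.
    delay is the minimum number of seconds between requests to one host.
    """
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid duplicates
    last_hit = {}             # host -> time of the most recent request
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        host = urlsplit(url).netloc
        # Politeness: wait until `delay` seconds have passed for this host.
        wait = last_hit.get(host, float("-inf")) + delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.monotonic()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# A tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c#top">C</a>',
}

order = crawl(["http://example.test/"], PAGES.get)
print(order)
```

Swapping the `deque` for a priority queue ordered by some page score would be one way to turn the page limit into a fuller selection policy.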
A website can tell web crawlers which pages may and may not be crawled or indexed, either site-wide through a robots.txt file or page by page through a robots meta tag in the page's HTML. The two best-known web crawlers are Bingbot and Googlebot, although there are many lesser-known web crawlers as well.
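One common way for a site to publish such crawl rules is a robots.txt file, and Python's standard library can evaluate one. The sketch below is illustrative only: the rules in `ROBOTS_TXT`, the `ExampleBot` user agent, and the `allowed` helper are made up for the example, and a real crawler would download the file from the site rather than embed it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this file
# from the root of the site it is about to crawl.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if the parsed rules permit `agent` to fetch `url`."""
    return rules.can_fetch(agent, url)


print(allowed("http://example.test/index.html"))
print(allowed("http://example.test/private/report.html"))
```

A crawler would typically run a check like this before each fetch and skip any URL the rules disallow.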
Web crawlers have two additional uses: (1) automating maintenance on a website, such as checking that links are still valid, and (2) gathering information from web pages. The second use is often abused to harvest email addresses, mostly for spam, and that practice is unethical.
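The link-checking use can be illustrated with a short sketch. It assumes a hypothetical `status_of` callable that returns the HTTP status code for a URL; here a small dictionary plays that role so the example runs offline. The `broken_links` function and the sample page are likewise invented for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def broken_links(page_url, html, status_of):
    """Return (url, status) pairs for links whose status is 400 or higher.

    status_of is a hypothetical callable: URL -> HTTP status code.
    """
    collector = HrefCollector()
    collector.feed(html)
    broken = []
    for href in collector.hrefs:
        target = urljoin(page_url, href)  # resolve relative links
        status = status_of(target)
        if status >= 400:
            broken.append((target, status))
    return broken


# Stand-in for real HTTP requests: unknown URLs are treated as 404.
STATUS = {"http://example.test/ok": 200, "http://example.test/gone": 404}
PAGE = '<a href="/ok">fine</a> <a href="/gone">dead</a>'

report = broken_links("http://example.test/", PAGE, lambda u: STATUS.get(u, 404))
print(report)
```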