Common Crawl is both a project and a non-profit organization, dedicated since 2011 to crawling the World Wide Web and publishing an open, accessible archive so that any person or company can obtain a complete copy of the Web conveniently and free of charge. It is something like what Google, Bing and the rest of the search engines have, but in a version for everyone. According to their blog, the latest version occupies 280 terabytes and contains 2.7 billion pages. Good figures: huge, but not unmanageable.
Among the projects already carried out with this data are:
- Domain popularity analysis
- Extraction of job offers
- Categorization tests
- Internet advertising analysis
- Tag search
- Recognition of sites publishing RSS feeds
- Analysis of the impact of news on the markets
- … and dozens of others
Technically, the project uses a bot called CCBot, which is based on Apache Nutch. It behaves like any other crawler, and any webmaster can use the robots.txt exclusion protocol to keep it from indexing their pages or to slow it down (see the FAQ). The crawl is automatic and runs periodically, apparently at least once a month. Although I have not seen it stated explicitly, it gives the impression that only the text of the pages is extracted, not the images or videos, which would undoubtedly take up far more space.
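According to the project's FAQ, CCBot honors the standard robots.txt directives, including `Crawl-delay`. A minimal sketch of what a site's robots.txt might contain (the 10-second delay is just an illustrative value, not a recommendation from the project):

```
# Ask CCBot to wait ~10 seconds between requests
User-agent: CCBot
Crawl-delay: 10

# Or, to exclude it from the site entirely, use this instead:
# User-agent: CCBot
# Disallow: /
```

Note that the two rules are alternatives: a `Disallow: /` makes the crawl-delay moot, so a real file would contain one or the other.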
The average web page contains around 35 hyperlinks, so the 2.7 billion pages of the last crawl imply on the order of 100 billion links across the portion of the Web they were able to crawl.
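The back-of-the-envelope arithmetic behind that estimate, using the figures quoted above (the 35-links-per-page average is the article's own rough assumption), is simply:

```python
# Figures from the post: 2.7 billion pages, ~35 links per page on average
pages = 2.7e9
links_per_page = 35

total_links = pages * links_per_page
print(f"{total_links:.2e}")  # 9.45e+10, i.e. on the order of 100 billion links
```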