doc: add public data (#168)

Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
Greg Lindahl
2026-01-19 11:28:05 -08:00
committed by GitHub
parent 2b3e2e24ac
commit b8bf24e051
+12
@@ -27,6 +27,7 @@ Web archiving is the process of collecting portions of the World Wide Web to ens
* [Web Archiving Service Providers](#web-archiving-service-providers)
* [Self-hostable, Open Source](#self-hostable-open-source)
* [Hosted, Closed Source](#hosted-closed-source)
* [Public Data](#public-data)
## Training/Documentation
@@ -275,3 +276,14 @@ The intention is that we only list services that allow web archives to be export
* [MirrorWeb](https://www.mirrorweb.com/solutions/capabilities/website-archiving)
* [PageFreezer](https://www.pagefreezer.com/)
* [Smarsh](https://www.smarsh.com/platform/compliance-management/web-archive)
## Public Data
This is a list of publicly available web archive data: WARC files, Wayback Machine instances, CDX API endpoints, and other indexes.
* [Common Crawl files](https://data.commoncrawl.org/) - WARCs, CDX files, Parquet URL index, Parquet host index, etc.
* [Common Crawl CDX API](https://index.commoncrawl.org/)
* [End of Term Archive](https://eotarchive.org/) - WARCs, CDX files, Parquet URL index
* [Internet Archive Wayback](https://web.archive.org/)
* [Webrecorder US GovArchive](https://govarchive.us/) - high-fidelity replay
* [UK Government Web Archive](https://www.nationalarchives.gov.uk/webarchive/) - Wayback