doc: add public data (#168)

Co-authored-by: Greg Lindahl <greg@commoncrawl.org>
2026-05-26 15:05:32 +00:00 · 2026-01-19 11:28:05 -08:00
parent 2b3e2e24ac
commit b8bf24e051
1 changed files with 12 additions and 0 deletions
@@ -27,6 +27,7 @@ Web archiving is the process of collecting portions of the World Wide Web to ens
 * [Web Archiving Service Providers](#web-archiving-service-providers)
  * [Self-hostable, Open Source](#self-hostable-open-source)
  * [Hosted, Closed Source](#hosted-closed-source)
+* [Public Data](#public-data)

 ## Training/Documentation

@@ -275,3 +276,14 @@ The intention is that we only list services that allow web archives to be export
 *	[MirrorWeb](https://www.mirrorweb.com/solutions/capabilities/website-archiving)
 *	[PageFreezer](https://www.pagefreezer.com/)
 *	[Smarsh](https://www.smarsh.com/platform/compliance-management/web-archive)
+
+## Public Data
+
+Publicly available web archive data: WARC files, Wayback Machine instances, CDX API endpoints, and other indexes.
+
+* [Common Crawl files](https://data.commoncrawl.org/) - WARCs, CDX files, Parquet URL index, Parquet host index, etc.
+* [Common Crawl CDX API](https://index.commoncrawl.org/)
+* [End of Term Archive](https://eotarchive.org/) - WARCs, CDX files, Parquet URL index
+* [Internet Archive Wayback Machine](https://web.archive.org/)
+* [Webrecorder US GovArchive](https://govarchive.us/) - high-fidelity replay
+* [UK Government Web Archive](https://www.nationalarchives.gov.uk/webarchive/) - Wayback