mirror of https://github.com/iipc/awesome-web-archiving.git (synced 2026-05-26 15:05:32 +00:00)
doc: add public data (#168)
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
@@ -27,6 +27,7 @@ Web archiving is the process of collecting portions of the World Wide Web to ens
* [Web Archiving Service Providers](#web-archiving-service-providers)
* [Self-hostable, Open Source](#self-hostable-open-source)
* [Hosted, Closed Source](#hosted-closed-source)
* [Public Data](#public-data)

## Training/Documentation

@@ -275,3 +276,14 @@ The intention is that we only list services that allow web archives to be export
* [MirrorWeb](https://www.mirrorweb.com/solutions/capabilities/website-archiving)
* [PageFreezer](https://www.pagefreezer.com/)
* [Smarsh](https://www.smarsh.com/platform/compliance-management/web-archive)

## Public Data

This is a list of publicly available WARCs, Wayback Machines, CDX API endpoints, other indexes, and so on.

* [Common Crawl files](https://data.commoncrawl.org/) - WARCs, CDX files, parquet url index, parquet host index, etc.
* [Common Crawl CDX API](https://index.commoncrawl.org/)
* [End of Term Archive](https://eotarchive.org/) - WARCs, CDX files, parquet url index
* [Internet Archive Wayback](https://web.archive.org/)
* [Webrecorder US GovArchive](https://govarchive.us/) - high-fidelity replay
* [UK Government Web Archive](https://www.nationalarchives.gov.uk/webarchive/) - Wayback