162 Commits

Author SHA1 Message Date
Ed Summers db689ad1e7 Add duckdb_warc (#178) 2026-04-27 12:36:28 -04:00
Natanael Arndt d9ca358fb5 Add subsections for training material (#177)
* Add subsections for training material to make the syntax more consistent across sections

* fix some linter issues

* Some more linting
2026-04-22 06:38:30 -04:00
Greg Lindahl 303c558027 more whirlwinds 🌪️ (#176) 2026-04-21 11:24:20 -04:00
Natanael Arndt 58be7236f6 fix syntax of stable and in development annotations (#173) 2026-03-18 13:31:19 -04:00
Alex Osborne a19a9466ee Add bagnabit2warc utility (#172) 2026-03-17 10:02:34 -04:00
Michael Lip 6a9d393783 Update Chrome Web Store URLs to new format (#169)
- Updated 5 Chrome Web Store URLs from chrome.google.com/webstore
  to chromewebstore.google.com to reflect Google's URL migration

Fixes 5 dead/redirecting links in Quality Assurance section.
2026-03-04 06:03:40 -05:00
Greg Lindahl b8bf24e051 doc: add public data (#168)
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
2026-01-19 14:28:05 -05:00
Ed Summers 2b3e2e24ac Adding two tools (#166)
* duckdb-web-archive-cdx
* warc (Rust library)
2026-01-08 12:05:26 -05:00
Greg Lindahl 7d65f20ae0 feat: common crawl discord and twitter (#164)
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
2026-01-07 14:51:49 -05:00
Ed Summers 4b42fd3de4 Add warcbench tool to README links (#163) 2025-10-31 14:52:19 -04:00
Natanael Arndt 304a530b78 Update some links for moved repos (#162)
* Update some links for moved repos

* Remove whitespace
2025-09-23 15:38:36 -04:00
Natanael Arndt 6915bc5487 Remove indention (#161)
I assume this indention was not intentional.
2025-04-09 17:49:05 -04:00
Ross Spencer af8c5bbc19 Add Community Archive (Twitter Archive and API) (#160)
Co-authored-by: Gabriel Chartier <gabriel@chartier.link>
2025-02-11 18:43:13 +00:00
Benjamin Ooghe-Tabanou cf4504832d Update README.md (add hyphe tool from médialab) (#159)
* Update README.md (add hyphe tool from médialab)

* Update README.md

---------

Co-authored-by: Nick Ruest <ruestn@gmail.com>
2025-01-29 07:34:53 -05:00
nruest a48cb0da7a Fix b193d5411a 2025-01-28 13:41:12 -05:00
Guillaume Levrier b193d5411a Update README.md (#158)
* Update README.md

Adding PANDORÆ

* Update README.md

move it before lint line
2025-01-28 10:20:32 -05:00
Ed Summers 1aad9d46c9 Added warcat-rs (#157)
which appears to still be in development...
2025-01-03 19:37:35 -05:00
Martin Hoppenheit 670bcba445 Update link to Stanford Libraries' Archivability pages, closes #154 (#156)
* Update link to Stanford Libraries' Archivability pages, closes #154

* Satisfy the linter
2024-12-27 10:58:50 -05:00
Martin Hoppenheit 129d1fa2ff Update link for The Unarchiver (#155)
... and remove the entry for The Archive Browser because it redirects to The Unarchiver as well.
2024-12-27 10:16:53 -05:00
Henry Wilkinson 49282b06dd Updates Webrecorder's website links (#153)
* Markdown syntax - blank lines around lists

* Update Webrecorder website links
2024-11-05 21:20:58 -05:00
Greg Lindahl 5bff7a0d46 Add a few new Common Crawl resources (#152)
* Add a few new Common Crawl resources
2024-11-05 08:42:39 -05:00
Mat Kelly 952e4d34dd Update URI for SiteStory (#151)
Closes #150
2024-10-17 15:21:36 -04:00
IIPC 1953151aae Update README.md 2024-09-10 10:26:04 -04:00
Natanael Arndt 168526a62c Fix the jwat link(s) according to answers in the #os-sos@iipc.slack.com channel (#149) 2024-05-08 08:44:42 -04:00
lasztoth 99241ae461 Added warc-safe to list (#148) 2024-05-06 08:26:07 -04:00
Henry Wilkinson 8e713a4388 Update list with current Webrecorder related URLs (#147)
* Update list with current Webrecorder URLs

A few terms have changed! These should all be the most current. Conifer is notably duplicated; could one of them be removed?

* Remove dupe Conifer link, updates Webrecorder tools

- Update PYWB link
- Update "ReplayWeb.page" casing

* Add stable tag to ReplayWeb.page

* Update README.md

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-25 09:43:26 +01:00
Andy Jackson 101ee998d9 Adding a Web Archive Services section to list hosted and self-hostable web archiving options. (#144)
* Add Services section
* Add TOC headings
* Update Node version for linting
  * Node 12 is very old and linting is failing. So trying the most recent LTS version.
* Fix up linting problems
2024-01-18 10:57:01 -05:00
kokomo123 86c769597d Add IA Library to Utilities (#143) 2023-12-20 08:53:11 -05:00
Ed Summers f0b7cdbae0 Added warcdb (#142) 2023-10-16 12:21:54 -04:00
Ross Spencer 4b12cc7b32 Update the details around HTTPreserve.info (#141) 2023-08-30 07:14:42 -04:00
Ed Summers 034582f3aa Adjusted jwarc description (#140) 2023-08-01 11:55:09 -04:00
IIPC d6ca8af2c0 Update README.md (#139)
* Update README.md

* Update README.md

* Update README.md

* Update README.md
2023-07-14 07:31:45 -04:00
Greg Lindahl 5d41023b2b add cc analysis (#138)
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
Co-authored-by: Nick Ruest <ruestn@gmail.com>
2023-07-04 12:54:21 -04:00
Greg Lindahl d4673d008e add cdx-toolkit (#135)
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
Co-authored-by: Andy Jackson <Andrew.Jackson@bl.uk>
2023-07-04 09:37:05 +01:00
Greg Lindahl d395bb1b44 add common crawl mailing list (#136)
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
Co-authored-by: Andy Jackson <Andrew.Jackson@bl.uk>
2023-07-04 09:36:05 +01:00
Greg Lindahl bf9664ff45 add web data commons (#137)
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
Co-authored-by: Andy Jackson <Andrew.Jackson@bl.uk>
2023-07-04 09:34:33 +01:00
Greg Lindahl 54110410bf warcio was stable a long time ago (#134)
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
2023-07-04 09:33:09 +01:00
Greg Lindahl 4c04474998 this link works (#131) 2023-06-28 11:38:31 +01:00
Nick Ruest 11fee57dcb Fix linter error: Ignore double IA Wayback link. (#129) 2023-06-01 15:58:29 -04:00
Rustem Kamalov 232966c4cb Add gogetcrawl (#128)
* Add `gogetcrawl`
2023-06-01 15:33:15 -04:00
Nick Ruest d8631ddf05 Add crau. (#127)
- Resolves #95
2023-04-30 20:05:45 -04:00
Matteo Cargnelutti 4ecc363191 Adding @harvard-lil/scoop (#126) 2023-04-26 16:56:25 -04:00
Ed Summers 46dc9518e4 added warcdedupe (#125) 2023-04-18 20:28:39 -04:00
Andy Jackson b309687f88 Update runs-on
Was pointing at defunct base image.
2023-04-13 14:46:14 +01:00
Andy Jackson 6bdb3373cb Add two tools that can do WARC deduplication (#124) 2023-04-12 11:00:52 -04:00
Hendursaga fc1a73d22d Rename 22120 to DiskerNet (#123) 2023-01-20 07:59:41 -05:00
Andy Jackson 248f9dc42e Update README.md (#122) 2022-10-17 19:47:33 -04:00
Mat Kelly 0104c202c8 Fix typo (#121) 2022-09-27 10:46:37 -04:00
Andy Jackson 6b7a3372d4 Add the Bellingcat Auto Archiver (#120) 2022-09-23 22:38:51 -04:00
IIPC f1a10b71b1 Update README.md 2022-08-23 12:10:15 -04:00