Ed Summers
db689ad1e7
Add duckdb_warc ( #178 )
2026-04-27 12:36:28 -04:00
Natanael Arndt
d9ca358fb5
Add subsections for training material ( #177 )
...
* Add subsections for training material to make the syntax more consistent across sections
* fix some linter issues
* Some more linting
2026-04-22 06:38:30 -04:00
Greg Lindahl
303c558027
more whirlwinds 🌪️ ( #176 )
2026-04-21 11:24:20 -04:00
Natanael Arndt
58be7236f6
fix syntax of stable and in development annotations ( #173 )
2026-03-18 13:31:19 -04:00
Alex Osborne
a19a9466ee
Add bagnabit2warc utility ( #172 )
2026-03-17 10:02:34 -04:00
Michael Lip
6a9d393783
Update Chrome Web Store URLs to new format ( #169 )
...
- Updated 5 Chrome Web Store URLs from chrome.google.com/webstore
to chromewebstore.google.com to reflect Google's URL migration
Fixes 5 dead/redirecting links in Quality Assurance section.
2026-03-04 06:03:40 -05:00
Greg Lindahl
b8bf24e051
doc: add public data ( #168 )
...
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
2026-01-19 14:28:05 -05:00
Ed Summers
2b3e2e24ac
Adding two tools ( #166 )
...
* duckdb-web-archive-cdx
* warc (Rust library)
2026-01-08 12:05:26 -05:00
Greg Lindahl
7d65f20ae0
feat: common crawl discord and twitter ( #164 )
...
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
2026-01-07 14:51:49 -05:00
Ed Summers
4b42fd3de4
Add warcbench tool to README links ( #163 )
2025-10-31 14:52:19 -04:00
Natanael Arndt
304a530b78
Update some links for moved repos ( #162 )
...
* Update some links for moved repos
* Remove whitespace
2025-09-23 15:38:36 -04:00
Natanael Arndt
6915bc5487
Remove indention ( #161 )
...
I assume this indention was not intentional.
2025-04-09 17:49:05 -04:00
Ross Spencer
af8c5bbc19
Add Community Archive (Twitter Archive and API) ( #160 )
...
Co-authored-by: Gabriel Chartier <gabriel@chartier.link>
2025-02-11 18:43:13 +00:00
Benjamin Ooghe-Tabanou
cf4504832d
Update README.md (add hyphe tool from médialab) ( #159 )
...
* Update README.md (add hyphe tool from médialab)
* Update README.md
---------
Co-authored-by: Nick Ruest <ruestn@gmail.com>
2025-01-29 07:34:53 -05:00
nruest
a48cb0da7a
Fix b193d5411a
2025-01-28 13:41:12 -05:00
Guillaume Levrier
b193d5411a
Update README.md ( #158 )
...
* Update README.md
Adding PANDORÆ
* Update README.md
move it before lint line
2025-01-28 10:20:32 -05:00
Ed Summers
1aad9d46c9
Added warcat-rs ( #157 )
...
which appears to still be in development...
2025-01-03 19:37:35 -05:00
Martin Hoppenheit
670bcba445
Update link to Stanford Libraries' Archivability pages, closes #154 ( #156 )
...
* Update link to Stanford Libraries' Archivability pages, closes #154
* Satisfy the linter
2024-12-27 10:58:50 -05:00
Martin Hoppenheit
129d1fa2ff
Update link for The Unarchiver ( #155 )
...
... and remove the entry for The Archive Browser because it redirects to The Unarchiver as well.
2024-12-27 10:16:53 -05:00
Henry Wilkinson
49282b06dd
Updates Webrecorder's website links ( #153 )
...
* Markdown syntax - blank lines around lists
* Update Webrecorder website links
2024-11-05 21:20:58 -05:00
Greg Lindahl
5bff7a0d46
Add a few new Common Crawl resources ( #152 )
...
* Add a few new Common Crawl resources
2024-11-05 08:42:39 -05:00
Mat Kelly
952e4d34dd
Update URI for SiteStory ( #151 )
...
Closes #150
2024-10-17 15:21:36 -04:00
IIPC
1953151aae
Update README.md
2024-09-10 10:26:04 -04:00
Natanael Arndt
168526a62c
Fix the jwat link(s) according to answers in the #os-sos@iipc.slack.com channel ( #149 )
2024-05-08 08:44:42 -04:00
lasztoth
99241ae461
Added warc-safe to list ( #148 )
2024-05-06 08:26:07 -04:00
Henry Wilkinson
8e713a4388
Update list with current Webrecorder related URLs ( #147 )
...
* Update list with current Webrecorder URLs
A few terms have changed! These should all be the most current, Conifer is notably duped, could remove one of them?
* Remove dupe Conifer link, updates Webrecorder tools
- Update PYWB link
- Update "ReplayWeb.page" casing
* Add stable tag to ReplayWeb.page
* Update README.md
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-25 09:43:26 +01:00
Andy Jackson
101ee998d9
Adding a Web Archive Services section to list hosted and self-hostable web archiving options. ( #144 )
...
* Add Services section
* Add TOC headings
* Update Node version for linting
* Node 12 is very old and linting is failing. So trying the most recent LTS version.
* Fix up linting problems
2024-01-18 10:57:01 -05:00
kokomo123
86c769597d
Add IA Library to Utilities ( #143 )
2023-12-20 08:53:11 -05:00
Ed Summers
f0b7cdbae0
Added warcdb ( #142 )
2023-10-16 12:21:54 -04:00
Ross Spencer
4b12cc7b32
Update the details around HTTPreserve.info ( #141 )
2023-08-30 07:14:42 -04:00
Ed Summers
034582f3aa
Adjusted jwarc description ( #140 )
2023-08-01 11:55:09 -04:00
IIPC
d6ca8af2c0
Update README.md ( #139 )
...
* Update README.md
* Update README.md
* Update README.md
* Update README.md
2023-07-14 07:31:45 -04:00
Greg Lindahl
5d41023b2b
add cc analysis ( #138 )
...
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
Co-authored-by: Nick Ruest <ruestn@gmail.com>
2023-07-04 12:54:21 -04:00
Greg Lindahl
d4673d008e
add cdx-toolkit ( #135 )
...
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
Co-authored-by: Andy Jackson <Andrew.Jackson@bl.uk>
2023-07-04 09:37:05 +01:00
Greg Lindahl
d395bb1b44
add common crawl mailing list ( #136 )
...
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
Co-authored-by: Andy Jackson <Andrew.Jackson@bl.uk>
2023-07-04 09:36:05 +01:00
Greg Lindahl
bf9664ff45
add web data commons ( #137 )
...
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
Co-authored-by: Andy Jackson <Andrew.Jackson@bl.uk>
2023-07-04 09:34:33 +01:00
Greg Lindahl
54110410bf
warcio was stable a long time ago ( #134 )
...
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
2023-07-04 09:33:09 +01:00
Greg Lindahl
4c04474998
this link works ( #131 )
2023-06-28 11:38:31 +01:00
Nick Ruest
11fee57dcb
Fix linter error: Ignore double IA Wayback link. ( #129 )
2023-06-01 15:58:29 -04:00
Rustem Kamalov
232966c4cb
Add gogetcrawl ( #128 )
...
* Add `gogetcrawl`
2023-06-01 15:33:15 -04:00
Nick Ruest
d8631ddf05
Add crau. ( #127 )
...
- Resolves #95
2023-04-30 20:05:45 -04:00
Matteo Cargnelutti
4ecc363191
Adding @harvard-lil/scoop ( #126 )
2023-04-26 16:56:25 -04:00
Ed Summers
46dc9518e4
added warcdedupe ( #125 )
2023-04-18 20:28:39 -04:00
Andy Jackson
b309687f88
Update runs-on
...
Was pointing at defunct base image.
2023-04-13 14:46:14 +01:00
Andy Jackson
6bdb3373cb
Add two tools that can do WARC deduplication ( #124 )
2023-04-12 11:00:52 -04:00
Hendursaga
fc1a73d22d
Rename 22120 to DiskerNet ( #123 )
2023-01-20 07:59:41 -05:00
Andy Jackson
248f9dc42e
Update README.md ( #122 )
2022-10-17 19:47:33 -04:00
Mat Kelly
0104c202c8
Fix typo ( #121 )
2022-09-27 10:46:37 -04:00
Andy Jackson
6b7a3372d4
Add the Bellingcat Auto Archiver ( #120 )
2022-09-23 22:38:51 -04:00
IIPC
f1a10b71b1
Update README.md
2022-08-23 12:10:15 -04:00