Archiving my Reddit Data before the impending API changes
I've been archiving my Reddit data slowly over the past couple of weeks while the API is still freely accessible. It has been very time-consuming because getting correct/complete data out of Reddit is awful. You'll probably run into some of these problems if you're attempting your own archive as well.
TL;DR
- The GDPR export is extremely slow. It has most data you would want though.
- The API export is limited to the last 1,000 items of each type. It also has most data you would want, and more. However, if a subreddit is privated while you query the API, it will not appear in your subreddits and you will not see any comments or posts from it; the data will silently be missing.
- Reddit chat archiving via the API is currently flaky.
- Reddit thread archiving was pretty simple. archive.org also has almost every post you would want already archived: 29 of the 30 I tested from my dataset were found.
GDPR Export
From what I gather, Reddit's GDPR export includes the following[1]:
- Saved posts and comments (id and permalink)
- Votes for posts and comments (id, permalink, and direction)
- Your own comments and posts (id, permalink, date, ip, subreddit, body, etc...)
- Message and chat history
- IP log
- Quite a few other smaller pieces of data
The GDPR export is also extremely slow. I can't fully confirm the above because I'm still waiting for mine >:(. It's been more than 7 days[2].
API Export
I used rexport by karlicoss in the past when experimenting with promnesia, so I ran that against all of my Reddit accounts to download their data. The API limits queries to the last 1,000 of each item type (posts, comments, etc.), but this was not a problem for my accounts.
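Roughly, an invocation looks like this (a sketch from my reading of the rexport README, so double-check the flags against the repo; secrets.py is a file holding your Reddit script-app credentials):
python3 -m rexport.export --secrets ./secrets.py > ~/archive/reddit/export-user-2023-06-11.json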
I only realized later that I was still missing some data. Apparently privated subreddits are silently omitted from a lot of the API returns, a problem exacerbated by the subreddit blackout[3]. Running the API export again now (6/29/2023), I get a lot more data.
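One way to spot the gap is to diff the sets of subreddits seen in two exports. A sketch, assuming each item in the rexport JSON carries the .subreddit string the Reddit API returns, and that the newer run was saved as export-user-2023-06-29.json:
diff <(jq -r '[.submissions[], .comments[] | .subreddit] | unique | .[]' ~/archive/reddit/export-user-2023-06-11.json) <(jq -r '[.submissions[], .comments[] | .subreddit] | unique | .[]' ~/archive/reddit/export-user-2023-06-29.json)
Subreddits prefixed with > in the output only showed up once they were un-privated.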
Chat Export
Most of my Reddit chats were in the old chat system (SendBird). This seems hard to export, or even find in the UI anymore[4], especially with Reddit actively messing with the surrounding infrastructure at the moment. I found reddit-chat-archiver but wasn't able to get it to return anything, even with tweaks. Some of the endpoints are currently returning 404s.
For the new Matrix-based chat, I found Rexit, but I didn't have the dependencies to run and test it. Supposedly chat is to be migrated to the new platform, so in the future Rexit should work. Only one of my chats has been migrated so far though, so who knows...
Thread Archiving
Neither the GDPR nor the API export will give you the entire thread of a post, including all comments. I used a modified version of RedditArchiver-standalone by Ailothaen for this. By default you get a minimal but functional .html file of the archived thread. I also added my own quick .json export for future use.
I'm no good with CLI, but I hacked together a jq and xargs invocation to archive all my upvoted, submitted, saved, and commented posts from the rexport output.
jq -r '[.submissions[], .saved[], .upvoted[], .comments[] | if .link_id then .link_id[3:] else .id end] | unique | .[]' ~/archive/reddit/export-user-2023-06-11.json | xargs -i python ~/RedditArchiver.py -c ./config-user.yml -i {} -o ~/archive/reddit/
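Briefly: anything with a .link_id (comments, including saved ones) points at its parent thread with a "t3_" kind prefix, so the [3:] slice strips that prefix; everything else is a post whose thread ID is just .id. unique dedupes the thread IDs, jq's -r flag emits them as raw strings, and xargs -i substitutes each one into a RedditArchiver invocation.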
archive.org
Some of the threads I wanted to archive are still privated. I heard ArchiveTeam has been rapidly collecting web crawl snapshots and uploading them to archive.org, and wanted to check out how that was doing.
I took all the links from my API export and used jq to form a lookup URL onto archive.org. Apparently it is important to use old.reddit.com instead of www.reddit.com when looking up the archived URLs on archive.org, otherwise you won't see much[5].
jq -r '[.submissions[], .saved[], .upvoted[], .comments[] | "https://web.archive.org/web/2/http://old.reddit.com" + .permalink] | unique | .[]' ~/archive/reddit/export-user-2023-06-11.json
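The /web/2/ piece is a partial timestamp: the Wayback Machine resolves it to the capture closest to that timestamp, so each of these URLs redirects straight to a snapshot of the thread if one exists.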
I tried 30 different threads, including some from pretty obscure subreddits, and I was amazed that only 1 thread was missing.
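If you'd rather not click through each link by hand, archive.org also has an availability API that can be scripted. A rough sketch (no rate limiting or error handling beyond a polite sleep):
jq -r '[.submissions[], .saved[], .upvoted[], .comments[] | .permalink] | unique | .[]' ~/archive/reddit/export-user-2023-06-11.json |
while read -r permalink; do
  # the API returns {"archived_snapshots": {"closest": {...}}} when a capture exists
  curl -s "https://archive.org/wayback/available?url=old.reddit.com${permalink}" |
    jq -r --arg p "$permalink" 'if .archived_snapshots.closest then "FOUND \($p)" else "MISSING \($p)" end'
  sleep 1  # be polite to the API
done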
Addendum
I'm happy I have most of my data off of the platform; now it's just an impatient wait for my GDPR export. I'm still not quite sure how my usage of Reddit will change yet, though I know I won't be using BaconReader[6] anymore. My use will most likely be restricted to the web view, as a tool for finding human-discussion-esque content through search engines. Still hoping my development plans pan out and I end up with a better data storage/retrieval system to work around all of this mess.
[1]: Thanks to David Brownman for this information!
[2]: Other comments I read said it normally takes about 4-ish days, so the blackout might be really straining whatever automated or manual system is doing the exports.
[3]: I did all my first exports on 6/11/2023. Even though the coordinated blackout officially started on 6/12/2023, quite a few subreddits were already privated.
[4]: Intermittently, this URL https://www.reddit.com/messages/messages/ would show the legacy chat on a 404 page, but I haven't had success with it lately.
[5]: Originally I was using reddit.com in the URLs, but most posts displayed with no content, like this example. I asked in the ArchiveTeam IRC and was told to use old.reddit.com. After running the ArchiveTeam Warrior VM myself, it seems like they're archiving both, but old.reddit.com has the best chance of being properly viewed after archival.
[6]: BaconReader, the third-party client an ex showed me in college 8 years ago, is shutting down on June 30, 2023. What I didn't realize until recently is that OneLouder describes itself as "Mobile Advertising & Insights Powered By First-Party Data". Yikes. It's also owned by PinsightMedia, which is owned by InMobi. Crazy to think I was putting so much data into some salesperson's hands. It is bittersweet to see it go.