March 8, 2017
After CS50x, I didn’t know what to do next. Thankfully some lunchtime conversation with coworkers turned into a new project idea. Unthankfully, that project would turn out to be a disaster. What follows is a cautionary tale of how not to develop a good reddit bot. This blog is not going to be very informative, but if you stick with it you might get a good laugh.
Or a bad laugh.
Or no laughs.
Step One: Have a Bad Idea
I like the internet, but the internet is full of people trolling each other. One particular brand of troll that gets my goat is the grammar nazi. I used to be a grammar nazi, and maybe that’s why they in particular drive me crazy. Over lunch one day, one of my friends joked about replying to grammar nazis by incorrectly correcting their grammar corrections.
I had just finished this intro to computer science course, so the first thing I said was “I can probably automate that.” Of course everyone at the table thought it would be hilarious, and that only encouraged me to actually look into it. As it turns out, there is a Python wrapper for Reddit's API that exists specifically so people can write bots in Python. How convenient!
Step Two: Learn The Reddit Python API
This was my first attempt at learning a library straight from the documentation. Regular readers may be thinking “But you said this blogging app was written using Django and it was your final project for CS50? How did you learn Django?” Answer: I tried to write a Reddit bot as a final project, and it was such a failure that I made a blogging app to write about it. Hence: bot first, Django second. In a way, I owe having this blog to my bot failure.
Anyway, the Python Reddit API Wrapper (PRAW) documentation is really well-written and easy to understand. Kudos to the folks at Reddit for making it so easy for people like me to fool around using their life’s work as a digital playground. If anyone has recently picked up Python and is looking for a project, PRAW makes it easy to turn ideas of varying quality into reality. For better or worse.
Step Three: Execute Your Bad Idea
I decided quickly (and wisely) that trying to learn natural language processing for the sake of writing a troll bot might be too much of a time investment. Instead I opted to search for an asterisk next to any form of the words "their" or "your" (e.g. *there or you're*). My plan was to respond with a randomly chosen different form of the word, also marked with an asterisk. E.g. responding to *there with *their, regardless of which was correct.
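For the curious, the asterisk search can be sketched in a few lines of Python. This is a reconstruction under assumptions, not the original script: the word groups, `find_correction`, and `wrong_correction` are all made-up names, and only straight apostrophes are handled.

```python
import random
import re

# Hypothetical word groups; the post only names forms of "their" and "your".
WORD_GROUPS = [
    ["their", "there", "they're"],
    ["your", "you're"],
]

# Map each word to the other members of its group.
ALTERNATIVES = {
    word: [other for other in group if other != word]
    for group in WORD_GROUPS
    for word in group
}

# Match an asterisk directly before or directly after one of the words.
_WORDS = "|".join(re.escape(word) for word in ALTERNATIVES)
CORRECTION_RE = re.compile(
    rf"(?:\*({_WORDS})\b|\b({_WORDS})\*)", re.IGNORECASE
)


def find_correction(comment_text):
    """Return the lowercased corrected word in a comment, or None."""
    match = CORRECTION_RE.search(comment_text)
    if match is None:
        return None
    return (match.group(1) or match.group(2)).lower()


def wrong_correction(comment_text, rng=random):
    """Return a reply using a different form of the word, right or wrong."""
    word = find_correction(comment_text)
    if word is None:
        return None
    return "*" + rng.choice(ALTERNATIVES[word])
```

Calling `wrong_correction("*there")` returns either `*their` or `*they're`, chosen at random, and returns `None` for comments without a matching correction.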
It was only after getting everything working in a text console that I thought to myself “Maybe this is a stupid idea.” I didn't want to let this comment parsing script I'd written go to waste, so I started trying to think of other ways to auto-respond to simple things that would be less... pointless.
Step Four: Have a Worse Idea
This is where things get really good. Another one of my pet peeves is when people use acronyms or jargon without explaining what they mean. Like if someone were to say that they were reading the WSJ at JFK while waiting on the MTA to take them to NoHo for a PBJ, that would really grind my gears. I thought I could use my comment-parser to look up and expand acronyms to help people out.
My wife insisted that people didn’t need most acronyms explained to them, but I’m as stubborn as I am inquisitive. Both came in handy during CS50—neither could save this bot. I forged ahead and started looking at how to load a big dictionary of acronyms into memory for searching. That’s when I found Webopedia’s and Netlingo’s giant lists of acronyms. If only I knew some way to get information from a web page into a Python dictionary...
Step Five: Learn How to Write an HTML Scraper
Luckily enough for me, Python is a really popular programming language that has a lot of great tools already written. One of those tools is BeautifulSoup, a library for web scraping. If you don't know, web scraping is the process of looking through a website's raw HTML and trying to pick out whatever information you’re looking for based on what HTML tags are used in what order. To get an idea what that would be like you can right-click and “View Page Source” in Chrome and think about how you'd tell a computer to find a needle in that haystack.
BeautifulSoup made it easy enough that with some new-fashioned object-oriented thinking combined with old-fashioned trial and error I got my acronym dictionaries loaded in no time. No time being like a week's worth of evenings. It was at this point that fatigue started to creep in and I was antsy to just get this thing done.
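To give a flavor of what that looks like, here is a minimal sketch of pulling acronyms out of a table with BeautifulSoup. The HTML below is invented for illustration and the real pages were almost certainly laid out differently; `SAMPLE_HTML` and `parse_acronyms` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Acronym</th><th>Meaning</th></tr>
  <tr><td>AFAIK</td><td>As far as I know</td></tr>
  <tr><td>IMO</td><td>In my opinion</td></tr>
  <tr><td>broken row</td></tr>
</table>
"""


def parse_acronyms(html):
    """Build an {ACRONYM: meaning} dictionary from two-column table rows."""
    soup = BeautifulSoup(html, "html.parser")
    acronyms = {}
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Skip the header row (it uses <th>) and any malformed rows.
        if len(cells) != 2:
            continue
        key = cells[0].get_text(strip=True).upper()
        meaning = cells[1].get_text(strip=True)
        if key and meaning:
            acronyms[key] = meaning
    return acronyms


acronyms = parse_acronyms(SAMPLE_HTML)
```

Most of the trial and error in a scraper like this goes into working out which tags hold the data and which rows to skip.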
Step Six: Putting The Pieces Together
Somehow, this was the first time I'd ever split a Python script into multiple files. Before I did it I thought this process was going to be hard. One import line later, the process was complete and my comment-searcher could access the dictionaries built by my HTML scraper. Victory!
Step Seven: Testing and Refinement
At this point, it’s important to note that Reddit requires any script that accesses its API to wait 2 minutes in between API calls to avoid overloading their servers. For testing purposes I disabled the part of my comment-searcher that waited two minutes in between posts so I could get results faster. I used these results to remove some acronyms from my dictionaries that were too commonly used to need explaining (like lol or wtf).
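The lookup itself, with the too-common acronyms filtered out, might look roughly like the sketch below. The dictionary contents, the skip set, and the `expand_acronyms` name are assumptions for illustration, not the actual data or code.

```python
import re

# Tiny stand-in for the scraped dictionaries.
ACRONYMS = {
    "AFAIK": "as far as I know",
    "LOL": "laughing out loud",
    "TIL": "today I learned",
}

# Acronyms judged too common to need explaining.
TOO_COMMON = {"LOL", "WTF"}

# Any run of two or more letters counts as a candidate token.
TOKEN_RE = re.compile(r"\b[A-Za-z]{2,}\b")


def expand_acronyms(comment_text, acronyms=ACRONYMS, skip=TOO_COMMON):
    """Return {ACRONYM: meaning} for known, not-too-common acronyms."""
    found = {}
    for token in TOKEN_RE.findall(comment_text):
        key = token.upper()
        if key in acronyms and key not in skip:
            found[key] = acronyms[key]
    return found
```

Because the match ignores case, an ordinary word that happens to spell an acronym (say, "til") would be "explained" too, which hints at why a bot like this gets annoying fast.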
I noticed that my script was taking a really long time to return results, and that it was only finding zero or one comment on each post. This seemed off. After some digging I realized that instead of having my bot search posts that already existed, I was having it intercept new posts as they were submitted. A brand-new post has barely any comments yet, so the bot only found comments that were made before it got to the post. Once I fixed that I went from catching 1–2 acronyms a minute to catching hundreds. This seemed like a good thing at the time. Finally, I thought, my bot was working as intended.
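The fix amounts to reading from a listing of existing posts rather than a live stream of new ones. A rough sketch, assuming a PRAW 4-style `subreddit` object (as returned by `reddit.subreddit(name)`); the function name is made up, and the original bot may have used a different listing than `hot`:

```python
def comments_from_existing_posts(subreddit, post_limit=10):
    """Yield comments from posts that have had time to collect replies.

    `subreddit` is assumed to behave like a PRAW Subreddit. Iterating
    `subreddit.stream.submissions()` instead would yield brand-new posts,
    which usually have few or no comments yet.
    """
    for submission in subreddit.hot(limit=post_limit):
        # Drop the "load more comments" placeholders rather than fetching them.
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            yield comment
```

Each yielded comment can then be checked by reading its `body` text.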
Step Eight: Run The Bot
Once my bot was catching hundreds of acronyms a minute, I thought it was ready to deploy. All deployment took was changing a print statement into a comment.reply statement, and it was off. I executed the script and went to the bathroom. While I was in the bathroom I realized that I hadn't re-enabled the line of code that told my script to wait two minutes in between actions. That was a mistake.
When I got back to my computer I was equal parts horrified and amused to find a message on my console stating my access to Reddit’s API had been cut off. Over the course of the next couple of hours my inbox was flooded with hate. My bot had replied to over 350 comments in under 2 minutes, and had been banned for spam in over 20 subreddits while I was in the bathroom. Reddit doesn’t delete accounts, so if you’re interested, check out my bot account for a log of everything that it replied to.
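In hindsight, a wait that lives in a single line you comment out for testing is easy to forget. One possible safeguard, sketched here with hypothetical names (`Throttle`, `post_replies`, and the `send` callback are not from the original bot), is to build the delay into the only code path that can post, and to make dry-run the default:

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between actions."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous action, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def post_replies(pending, send, throttle, dry_run=True):
    """Send (target, text) replies through the throttle; return how many were sent.

    With dry_run=True (the default) nothing is sent, only printed.
    """
    sent = 0
    for target, text in pending:
        if dry_run:
            print(f"[dry run] would reply: {text}")
            continue
        throttle.wait()
        send(target, text)
        sent += 1
    return sent
```

With something like `Throttle(120)` and a `send` function that calls `comment.reply(text)`, testing never needs the wait switched off, because a dry run never touches the API in the first place.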
- Have A Good Idea
- Test Thoroughly
- Do Not Deploy Late At Night While Tired
- Follow API Access Rules
- Share Your Work Early So Other People Can Tell You If It’s Stupid
If you liked this and want to see more posts about my less successful endeavors, let me know!