Web Scraping Reddit



There’s a subreddit for everything.

Hi everyone, having worked in the web scraping industry for a few years I know how easily troublesome it can be to write, maintain and even begin web scraping. One year ago, I wrote a web-scraping guide that was really loved by the community. Reddit post, article. It was actually my first and only gilded post here 😊. Scrape posts from Reddit Wednesday, December 04, 2019. Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, like Airbnb.com, we strongly recommend Advanced Mode to start your data extraction project. Welcome to the most interesting (and fun!) blog post on web scraping for dummies. Mind you, this is not a typical web scraping tutorial. You will learn the whys and hows of data scraping along with a few interesting use-cases and fun facts.

No matter what your interests are, you will most likely find a subreddit with a thriving community for each of them.

This also means that the information on some subreddits can be quite valuable. Either for marketing analysis, sentimental analysis or just for archival purposes.

Reddit and Web Scraping

Today, we will walk through the process of using a web scraper to extract all kinds of information from any subreddit. This includes links, comments, images, usernames and more.

To achieve this, we will use ParseHub, a powerful and free web scraper that can deal with any sort of dynamic website.

Want to learn more about web scraping? Check out our guide on web scraping and what it is used for.

Reddit and Web Scraping

For this example, we will scrape the r/deals subreddit. We will assume that we want to scrape these into a simple spreadsheet for us to analyze.

Additionally, we will scrape using the old reddit layout. Mainly because the layout allows for easier scraping due to how links work on the page.

Getting Started

  1. Make sure you download and open ParseHub, this will be the web scraper we will use for our project.
  2. In ParseHub, click on New Project and submit the URL of the subreddit you’d like to scrape. In this case, the r/deals subreddit. Make sure you are using the old.reddit.com version of the site.
PythonR web scraping

Scraping The Subreddit’s Front page

  1. Once submitted, the URL will render inside ParseHub and you will be able to make your first selection.
  1. Start by clicking on the title of the first post on the page. It will be highlighted in green to indicate that it has been selected.
  1. The rest of the post titles on the page will also be highlighted in yellow. Click on the second post title on the page to select them all. They should all now turn green.
  1. On the left sidebar, rename your selection to posts. We have now told ParseHub to extract both the title and link URL for every post on the page.
  2. Now, use the PLUS (+) sign next to the post selection and select the Relative Select command.
  1. Using Relative Select, click on the title of the first post on the page and then on the timestamp for the post. An arrow will appear to show the selection. Rename this new selection to date.
Web Scraping Reddit

R Web Scraping

  1. You will notice that this new selection is pulling the relative timestamp (“2 hours ago”) and not the actual time and date on which the post was made. To change this, go to the left sidebar, expand your date selection and click on the extract command. Here, use the drop down menu to change the extract command to “title Attribute”.
  1. Now, repeat step 5 to create new Relative Select commands to extract the posts’ usernames, flairs, number of comments and number of votes.
  2. Your final project should look like this:

Downloading Reddit Images

You might be interested in scraping data from and image-focused subreddit. The method below will be able to extract the URL for each image post.

You can then follow the steps on our guide “How to Scrape and Download images from any Website” to download the images to your hard drive.

Scraping

Adding Navigation

ParseHub is now setup to scrape the first page of posts of the subreddit we’ve chosen. But we might want to scrape more than just the first page. Now we will tell ParseHub to navigate to the next couple of pages and scrape more posts.

  1. Click the PLUS(+) sign next to your page selection and choose the Select command.
  1. Using the Select command, click on the “next” link at the bottom of the subreddit page. Rename your selection to next.
  1. Expand the next selection and remove the 2 Extract commands created by default.
  1. Now click on the PLUS(+) sign on the next selection and choose the Click command.
  1. A pop-up will appear asking you if this a “next page' button. Click “Yes” and enter the number of times you’d like ParseHub to click on it. In this case, we will input 2, which equals to 3 full pages of posts scraped. Finally, click on “Repeat Current Template” to confirm.

Reddit Scraper

Scraping Reddit Comments

Scraping reddit comments works in a very similar way. First, we will choose a specific posts we’d like to scrape.

In this case, we will choose a thread with a lot of comments. In this case, we will scrape comments from this thread on r/technology which is currently at the top of the subreddit with over 1000 comments.

  1. First, start a new project on ParseHub and enter the URL you will be scraping comments from. (Note: ParseHub will only be able to scrape comments that are actually displayed on the page).
  1. Once the site is rendered, you can use the Select command to click on the first username from the comment thread. Rename your selection to user.

Reddit Image Scraper

  1. Click on the PLUS(+) symbol next to the user selection and select the Relative Select command.

Web Scraping Reddit Jobs

  1. Similarly to Step 5 in the first section of this post, use Relative Select to extract the comment text, points and date.
  1. Your final project should look like this:
Web Scraping Reddit

Running and Exporting your Project

Once your project is fully setup, it’s time to run your scrape.

Start by clicking on the Get Data button on the left sidebar and then choose “Run”.

Web Scraping Reddit Download

Pro Tip: For longer scrape jobs, we recommend running a Test Run first to verify that the data will be extracted correctly.

Web Scraping Reddit Free

You will now be able to download the data you’ve scraped from reddit as an excel spreadsheet or a JSON file.

Web Scraping Freelance Reddit

Which subreddit will you scrape first?