
Web Scraping using Python Scrapy framework with Example

In today's world, the internet holds an overwhelming amount of data, and this same data is used by different stakeholders for different services. Collecting such data sets manually would be an enormous amount of work.

Web Scraping is the process of extracting data from websites automatically. Scrapy is a Python framework designed for large-scale web scraping. Scrapy's architecture is built around "Spiders", which are self-contained crawlers. Spiders are Python classes that the framework uses to extract data from the website(s).

As an example, in this post we will scrape the Hot Deals section of the popular Canadian website RedFlagDeals to extract information such as the Deal Title, Vote Count and the Link to the deal. This information will be dumped into a .csv file, which can then be used to sort the deals by vote count.

To start with, we will install Scrapy and create a new project:

pip install scrapy

scrapy startproject rfd

Our new project will have the following file structure

rfd/
  scrapy.cfg
  rfd/
    __init__.py
    items.py           #project items definition file
    middlewares.py     #project middlewares file
    pipelines.py       #project pipelines file
    settings.py        #project settings file
    spiders/           #directory to store spiders for this project
       __init__.py

We will be creating our spiders (or web crawlers) inside the spiders folder. The rest of the files are for advanced projects and customizations, which are out of scope for this post.

Let's create a new file "hotdeals.py" inside the spiders folder.

The code starts by importing the scrapy module and then defining the name of the Spider using the class variable name, which identifies the Spider uniquely.
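As a rough sketch, the top of hotdeals.py might look like this (the class name HotDealsSpider is an assumption; only the name attribute matters to the framework):

import scrapy


class HotDealsSpider(scrapy.Spider):
    # Unique name used to identify (and run) this Spider
    name = "hotdeals"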

The start_requests method must return an iterable of Requests, each of which specifies a callback function. Consider this the starting point of the scrape: it tells Scrapy where to begin crawling and which callback function to call to handle each response.
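Continuing the HotDealsSpider sketch, a start_requests method could look like this (the Hot Deals URL is an assumption, not taken from the original code):

    def start_requests(self):
        # Starting URL for the Hot Deals section (assumed)
        urls = ["https://forums.redflagdeals.com/hot-deals-f9/"]
        for url in urls:
            # parse() is the callback that will handle each downloaded response
            yield scrapy.Request(url=url, callback=self.parse)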

Note: Instead of defining a start_requests method, we can use the start_urls class variable with the list of URLs that we want to crawl. This would be helpful if we only wanted to scrape, say, the first 5 pages. In this post, we will instead navigate through all the pages using the Next button from the pagination.
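For completeness, the start_urls alternative could be sketched like this (the page URLs are hypothetical and only illustrate scraping a fixed set of pages):

class HotDealsSpider(scrapy.Spider):
    name = "hotdeals"
    # Scrapy generates the initial Requests from this list and sends the
    # responses to parse() by default; the page URL pattern here is hypothetical
    start_urls = [
        "https://forums.redflagdeals.com/hot-deals-f9/",
        "https://forums.redflagdeals.com/hot-deals-f9/2/",
        "https://forums.redflagdeals.com/hot-deals-f9/3/",
    ]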

The parse method is called to handle the response downloaded for each request; the response parameter (an instance of TextResponse) holds the page contents and provides methods to access the data. Scrapy offers different ways to extract data from the response: CSS Selectors, XPath queries, Regular Expressions, etc. You can check my earlier posts on CSS Selectors & XPath queries for a quick reference.
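A minimal sketch of the parse method is shown below; the XPath expressions are illustrative placeholders, since the site's actual markup is not reproduced in this post:

    def parse(self, response):
        # Each deal listing on the page (illustrative XPath, adjust to the real markup)
        for deal in response.xpath('//li[contains(@class, "topic")]'):
            yield {
                "title": deal.xpath('.//a[contains(@class, "topic_title")]/text()').extract_first(default="").strip(),
                "votes": deal.xpath('.//dl[contains(@class, "post_voting")]/dt/text()').extract_first(),
                "link": response.urljoin(
                    deal.xpath('.//a[contains(@class, "topic_title")]/@href').extract_first()
                ),
            }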

This example code uses XPath queries to extract the Deal Title, Vote Count and Link to the post itself. An XPath query is also used to find the Next button element in the pagination; if it exists, its href link is extracted (the relative link is joined with the page URL). XPath queries return list-like objects (SelectorList), and we use the extract() or extract_first() methods to pull the data out of these lists.
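Continuing inside parse(), the pagination handling could be sketched as follows (again, the XPath for the Next link is an assumption):

        # Follow the Next button if the page has one (illustrative XPath)
        next_href = response.xpath('//a[contains(@class, "pagination_next")]/@href').extract_first()
        if next_href:
            # Join the relative href with the current page URL and keep crawling
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)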

Scrapy offers a powerful feature, "scrapy shell <URL>", to explore the response interactively and try different methods to extract the desired information.
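For example, a quick exploratory session might look like this (the URL and selectors are assumptions):

scrapy shell "https://forums.redflagdeals.com/hot-deals-f9/"
>>> response.xpath('//a[contains(@class, "topic_title")]/text()').extract_first()
>>> response.css('a::attr(href)').extract_first()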

To run our web crawler example, change to the rfd folder (the top-level project directory) and use the following command. This will create a .csv file with the extracted data from all the pages.

scrapy crawl rfd -o output.csv -t csv
