Cloud-Web-Scraping

An autonomous cloud-based web scraper

This project is maintained by Kabuswe


Introduction

This article will help you build a web scraper and deploy it to run autonomously on the cloud. Before we deploy a web scraper to the cloud, it’s important to understand what web scraping is. Web scraping, as per Wikipedia, is the process of extracting data from websites. Why would one need to extract data from a website? Should I be interested? What’s the point? The answer to these questions depends on the use case. One might scrape a website for analytical purposes or just for personal use, seeing that the information is publicly available. You can read more about web scraping in the Wikipedia article here: https://en.wikipedia.org/wiki/Web_scraping. It’ll give you sufficient information to satisfy your curiosity.

There are several web scraping techniques, and they can be implemented in several programming languages; you’re free to choose your preferred tool to get the work done. In this article we’ll be working with the Python programming language. No need to worry: the syntax is fairly simple and easy to understand, and of course I’ll be explaining every step so you don’t get confused. If you already understand basic Python syntax, this will be a breeze. Hang in there and let’s get it done. So what will our web scraper do? Our web scraper will be assigned the task of extracting news articles from a news site, because the main reason one would want an autonomous web scraper is to extract data that is constantly being updated.

Disclaimer: before scraping any website, be sure to read its terms and conditions. Some sites may take legal action if you don’t follow their usage guidelines.

Platforms and services

This section lists and briefly explains the platforms and services we’ll use for our cloud-based web scraper example: Python (with the Unirest and Beautiful Soup libraries), the PhantomJS cloud rendering service, Docker and Docker Hub for containerisation and image hosting, IBM Cloud Functions for running the scraper, and Cloudant as the cloud-based database.

Build flow

It always helps to know the end goal of a project to better understand what features to implement. Below is a list of steps we’ll take to reach our end goal:

1. Build a web scraper in Python

Python offers several libraries for your web scraping needs, ranging from the requests library for making HTTP requests to Selenium for browser automation and testing. These libraries are very useful if you’re scraping from your own machine, but for our use case we want the scraper to work on the cloud, independent of our machines and of the network latency we are subjected to. You can read about Selenium browser automation here: https://www.seleniumhq.org/.

In our example we’ll be using PhantomJS Cloud, which will be responsible for rendering the website on the cloud, and the Unirest library to make an HTTP request to PhantomJS Cloud. Unirest is an open-source library that allows us to make HTTP requests to plain HTML websites as well as JavaScript-driven websites. You can read more about Unirest here: http://unirest.io/python.html. Once the request is made, a response is returned in the form of HTML. To be able to navigate through the HTML tags, we have to properly parse the response. But how? The answer is Beautiful Soup, another Python library that allows us to parse the HTML response. Once it’s parsed, we can easily navigate through the HTML tags programmatically with Beautiful Soup.
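To make the flow concrete, here’s a minimal sketch of what the scraping step in app.py could look like. The API key, the target news page URL, the exact PhantomJS Cloud request format and the CSS selectors are all illustrative assumptions, not values from this project.

import json
import unirest                 # HTTP client installed via requirements.txt
from bs4 import BeautifulSoup  # HTML parser installed via requirements.txt

# Illustrative placeholders -- replace with your own key and target site.
PHANTOMJS_CLOUD_KEY = "your-api-key"
TARGET_URL = "https://example-news-site.com/latest"

def scrape_headlines():
    # Ask the cloud rendering service to fetch and render the page.
    # The endpoint and payload format below are assumptions; check the
    # PhantomJS Cloud documentation for the exact API.
    endpoint = "https://phantomjscloud.com/api/browser/v2/%s/" % PHANTOMJS_CLOUD_KEY
    payload = json.dumps({"url": TARGET_URL, "renderType": "html"})
    response = unirest.get(endpoint, params={"request-payload": payload})

    # Parse the rendered HTML and pull out the article headlines.
    # The tag and class names are hypothetical.
    soup = BeautifulSoup(response.raw_body, "lxml")
    return [h.get_text(strip=True) for h in soup.find_all("h2", class_="article-title")]

if __name__ == "__main__":
    for headline in scrape_headlines():
        print(headline)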

2. Containerise the Python web scraper using Docker

Containers let us bundle our project together with the environment it needs to run, including all of its external packages. This allows the project to work anywhere, and not only on the machine we used to create it. There are several containerisation options offered by several platforms; in this example we’ll be using the Docker platform to containerise our project.

In order to use Docker you’ll need to download and install Docker Desktop for your specific system type and operating system. For older machines it’s recommended to use Docker Toolbox. In this example we’ll be continuing with Docker Toolbox; here’s a link to the download page: https://docs.docker.com/toolbox/toolbox_install_windows/. Make sure to enable virtualisation in your BIOS before using Docker Toolbox or Docker Desktop.

After successfully installing Docker Toolbox, you can use Docker by running the Docker Quickstart Terminal. In order to containerise the web scraper, we first need to change our current directory to the directory containing the web scraper. That directory must contain a Dockerfile: a text file that contains all the commands needed to assemble a Docker image. In this example our Dockerfile will use the Python 2.7 runtime as the parent image of the web scraper. It also contains a command that uses pip to install all the packages needed to run the web scraper. The code snippets below break down the contents of the Dockerfile necessary for the web scraper to work.

First things first, we need a Python environment to run our Python-based web scraper. For this we set a Python runtime as the parent image of the web scraper. The referenced image is an official Python image pulled directly from Docker Hub.

FROM python:2.7-slim

Next we set the working directory to /app, which will contain all the files the web scraper needs to run and to build the Docker image.

WORKDIR /app

Once the working directory is established we need to copy all the files in our current project directory to the working directory.

COPY . /app

After the files have been copied, all the necessary dependencies have to be installed using pip.

RUN pip install --trusted-host pypi.python.org -r requirements.txt

The contents of requirements.txt:

unirest
lxml
urllib3
bs4
Cloudant

Once all the dependencies are installed, the final instruction tells Docker which command to run when the container starts, which launches the web scraper.

CMD ["python", "app.py"]

Once we have our Dockerfile all set up, we can finally build the Docker image for our web scraper by executing the following command in the Docker terminal.

docker build --tag=web_scraper .

3. Upload the Docker image of the web scraper to Docker Hub

Assuming we successfully built the Docker image in the previous section, Docker Hub allows us to host our project’s image, much like GitHub hosts repositories. In this section we’ll upload the freshly built image to Docker Hub.

In order to upload to Docker Hub, you first need to create a free account (if you don’t already have one) using the following link: https://hub.docker.com/signup

Once signed up, take note of your credentials; we’ll need them in the next step. Open the Docker terminal and connect to your newly created account using the docker login command.

docker login

After logging in, we tag the image with the Docker Hub account that is currently logged in. This defines the repository that will host the image on Docker Hub.

docker tag web_scraper user_name/web-scraper:v1
$ docker image ls
REPOSITORY                 TAG                 IMAGE ID            CREATED             SIZE
user_name/web-scraper      v1                  c4b87238b670        2 hours ago         152MB
web_scraper                latest              c4b87238b670        2 hours ago         152MB

Once the image has been tagged on your machine, we can push it to Docker Hub.

docker push user_name/web-scraper:v1

After the push has completed, you can verify the newly created repository by logging in to your Docker Hub account.

4. Deploy the container to IBM Cloud

Now that the web scraper has been containerised and its image hosted on Docker Hub, we can deploy it as a cloud function on IBM Cloud. Once it’s deployed as a cloud function, IBM Cloud pulls the hosted image in order to execute the web scraper on the cloud.

In order to deploy the container to IBM Cloud, you’ll need to download and install the IBM Cloud CLI. Here’s the link: https://cloud.ibm.com/docs/cli/reference/bluemix_cli?topic=cloud-cli-install-ibmcloud-cli#install-ibmcloud-cli

After successfully installing the IBM Cloud CLI, open the Docker terminal to install the Cloud Functions plugin. To install the plugin, run the command below in the Docker terminal:

ibmcloud plugin install cloud-functions

Once the plugin is installed, we can log in to our IBM Cloud account in order to create a cloud function from the hosted web scraper container. The IBM Cloud login command must include the Cloud Foundry org and Cloud Foundry space. If you haven’t created a Cloud Foundry org, the default org is the email address you used to sign up to IBM Cloud. To confirm these details, type “functions” in the search bar of the IBM Cloud website and, on the Functions page, click the Start Creating button. If you are presented with a “No Cloud Foundry Space” error, simply close it and change the region in the region section to the region your account is based in. Once the right region has been selected, you can clearly see your Cloud Foundry org and Cloud Foundry space details. To log in using these details, execute the command below:

ibmcloud login -a cloud.ibm.com -o "cloud_foundry_org" -s "cloud_foundry_space"

After successfully logging in to the IBM Cloud account, we also need to log in to the Docker Hub account hosting the web scraper image.

docker login

Once logged in to both accounts, we can deploy the web scraper container as a cloud function on IBM Cloud. To do this, we create a new action that will pull the image from Docker Hub.

ibmcloud fn action create cloud_webscraper --docker <username>/web-scraper:v1

After executing the above command, the action is created and can be viewed in the Actions tab of the IBM Cloud Functions page. Open the created action by clicking its name. On the action’s page, select Runtime in order to change the default runtime timeout from 60 seconds to 300 seconds. This is done to give the web scraper enough time to run without being interrupted: it takes roughly 60 seconds for the image to be pulled by IBM Cloud, so a 60-second timeout wouldn’t leave the web scraper any time to run.
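If you prefer the command line over the console, the same timeout change and a quick test invocation can be done from the Docker terminal. The sketch below assumes the action name used earlier and mirrors the standard OpenWhisk CLI flags; the timeout is given in milliseconds.

ibmcloud fn action update cloud_webscraper --timeout 300000
ibmcloud fn action invoke cloud_webscraper --blocking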

If you’d rather use the console, the screenshots below show how to find the created action on IBM Cloud and how to modify the runtime timeout:

Accessing the actions from the IBM Cloud Functions page:

Action1

Selecting our desired Action from the actions list:

Action2

Accessing the runtime tab:

Action3

Modifying the runtime timeout:

Action4

5. Create time-controlled triggers on the IBM Cloud Functions service

Triggers are very useful in completing the autonomous experience: they let the cloud function execute without you having to run it explicitly. There are several types of triggers we could attach to an action, but for this example we’ll be using a time-controlled (periodic) trigger. This means that on the selected days and at the selected times the action will be triggered on the cloud and will work fully autonomously.

Our example involves getting news articles from a news website, so our trigger will execute the action, which runs the web scraper to scrape news articles from the site every morning when the site is updated. Each run adds the latest articles to the cloud-based database, and this sums up the entire project: an autonomous cloud-based web scraper.
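The screenshots that follow create the trigger through the console. If you prefer the CLI, a periodic trigger can also be created with the built-in alarms package and then connected to the action with a rule. The trigger name, rule name and cron schedule below (every day at 06:00 UTC) are illustrative assumptions.

ibmcloud fn trigger create morning_scrape --feed /whisk.system/alarms/alarm --param cron "0 6 * * *"
ibmcloud fn rule create scrape_rule morning_scrape cloud_webscraper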

The screenshots below demonstrate how to add a time-controlled trigger to the web scraper’s action:

Accessing the triggers from the IBM Cloud Functions page:

Trigger1

Start the trigger creation process:

Trigger2

Selecting the periodic trigger type:

Trigger3

Configuring the trigger:

Trigger4

Connecting an action to the trigger:

Trigger5

Adding an existing action to the connection:

Trigger6

Viewing the connected action:

Trigger7