Web Scraping Has Become Easy

Introduction

A few years ago I wanted to scrape some information from a web page, and I recall the effort being quite significant. Recently I wanted to do it again but kept procrastinating because of that old memory. As it happens, I have been learning more JavaScript recently too, so eventually I decided to try web scraping with JavaScript on Node.js, and I was very pleasantly surprised: web scraping has become easy! This article will take you step by step through how you can scrape the bitcoin price using JavaScript, Node.js and a clever HTML-parsing module called Cheerio.

If you haven’t yet installed Node.js or Cheerio, please see the links at the end of the article for details on how to do that.

Approach

We will scrape the current bitcoin price from the webpage located at cryptowat.ch. The first thing we must be able to do is grab that page so that we can scrape it. Fortunately, the request module makes this a piece of cake. Before we do anything else, we tell Node.js about the modules that we want to use: an HTTP client called request and, of course, Cheerio. Neither is built into Node.js, so both must be installed with npm (note that request has since been deprecated):

let request = require('request');
let cheerio = require('cheerio');

With the request library now available, we can use its functionality to pull the HTTP response from cryptowat.ch like this:

    request({
      method: 'GET',
      url: 'https://cryptowat.ch/'
    }, (err, res, body) => {
      if (err) return console.error(err);

Within the request callback, we now have the raw HTML of the whole page as a string in our variable called body. So what can we do with that? To make it useful, we ask Cheerio to parse it into a Document Object Model (DOM) for us:

      let $ = cheerio.load(body);

So far, so good, but what will Cheerio do for us? One of its most useful features is a jQuery-like search that returns the parts of the DOM we are interested in. Let’s do a quick example:

      let title = $('title');
      console.log(title.text());
    });

Note that, now that we have something remotely useful, I have closed the request callback so that we can execute our code. Add all the code so far to a file called scrapeBTCPrice.js and, in the same directory, execute it using node like this:

node scrapeBTCPrice

If everything goes as planned, you should see ‘Cryptowatch | Your Trading Terminal’ printed on the console.
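
The request module used above has since been deprecated, so a reader following along today may prefer a built-in alternative. The sketch below is one possible substitute, not the method used in this article: it assumes Node.js 18 or later, where fetch is available globally, and fetchHtml is a hypothetical helper name chosen for illustration.

```javascript
// Sketch: fetch a page's raw HTML with the fetch API built into Node.js 18+.
// fetchHtml is a hypothetical helper name, not part of request or Cheerio.
async function fetchHtml(url) {
  const res = await fetch(url);

  // fetch only rejects on network failures, so check the HTTP status ourselves.
  if (!res.ok) {
    throw new Error('Request failed with status ' + res.status);
  }

  // Resolve to the response body as a string, like `body` in the request callback.
  return res.text();
}

// Example usage (left commented out so that nothing is fetched on load):
// fetchHtml('https://cryptowat.ch/')
//   .then((body) => console.log(body.length + ' characters of HTML received'))
//   .catch((err) => console.error(err));
```

The string that fetchHtml resolves to can be handed to cheerio.load in the same way as body.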

Getting to the Price

The bitcoin price we want to scrape is shown in the screenshot below.

[Screenshot: Bitcoin price on cryptowatch]

To figure out how to pull the price out of the webpage, we need to take a closer look at it. To do this, navigate to cryptowat.ch and pull up your browser’s developer tools. I am using the Brave browser and I do this by clicking on the menu symbol (three horizontal bars) at the top right and selecting More Tools -> Developer Tools. This will bring up a new pane showing all the underlying HTML used to create the page. What’s more, in Brave, as you mouse over different HTML elements, the corresponding part of the webpage is highlighted, which makes picking out the HTML we need really easy, as you can see in this screenshot:

[Screenshot: Picking out the price HTML]

If you drill down to where the price is located in the page you will see that it is contained in this link:

<a class="_1roDdymkPS2zplXEDcBm0L _3z3AqahoD2pN2R7vFue-0o pointer" title="Bitcoin" href="/assets/btc" data-testid="list-row">
  <div class="text-center _2yv_NtK1R_FBVWqrvRdgcN _2jRRJJvarKXJGP9oRP-Bv0 _2eU06SRnF8jtz1L2K41BsV">2</div>
  <div class="text-left _2yv_NtK1R_FBVWqrvRdgcN _2jRRJJvarKXJGP9oRP-Bv0 _1TuQ_Cac70IaRi6hBmwL9">
    <i class="crypton sym-default-s sym-btc-s _3fjTNyT8S3oN5Xib_Ru5mn"></i>BTC</div>
  <div class="text-left _2yv_NtK1R_FBVWqrvRdgcN _2jRRJJvarKXJGP9oRP-Bv0 _2eU06SRnF8jtz1L2K41BsV">
    <i class="crypton sym-default-s sym-btc-s"></i>Bitcoin</div>
  <div class="text-right _2yv_NtK1R_FBVWqrvRdgcN _2jRRJJvarKXJGP9oRP-Bv0 _1TuQ_Cac70IaRi6hBmwL9">
    <span class="price">9082.93</span>
  </div>
  <div class="text-right _2yv_NtK1R_FBVWqrvRdgcN _2jRRJJvarKXJGP9oRP-Bv0 _1TuQ_Cac70IaRi6hBmwL9">2.727B</div>

  ...

</a>

Now we know where our price is embedded, but how do we grab it? The answer is that Cheerio has a fantastic mechanism for pulling HTML elements out of the DOM. It is modeled after jQuery, but as I know nothing about jQuery that doesn’t really help me much! Nevertheless, it isn’t difficult to grasp how it works. It is best explained with an example.

Remember that in the code above we loaded the entire DOM into a variable called $ using Cheerio’s load function. We apply a search function to $ simply by giving it a selector to match against. There is a specified set of selectors to use (you can see a list used in jQuery here). The result is a collection that can be empty, hold a single element or hold many elements, depending on what the selector matches within the DOM. We know that the bitcoin price we want is embedded in the link shown above, so let’s start by pulling the links out of the DOM using the appropriate selector like this:

    const allLinks = $("a");

    console.log("number of links in DOM = " + allLinks.length);

If you add those lines to scrapeBTCPrice.js and run it with node your output should look something like this:

    Cryptowatch | Your Trading Terminal
    number of links in DOM = 125

This is great and we could start looping through our list of links looking for the price but Cheerio can do even better than this. Instead of querying for all links we can note that the particular link of interest has an attribute called title which has the value Bitcoin. Let’s use this attribute to pull just our link out of the DOM:

    const bitcoinPriceLink = $("a[title='Bitcoin']");

This tells Cheerio exactly which link we want. From here, we can pull the actual price out by noting that it lives in a span element with a CSS class called price. This is how we use Cheerio and these characteristics to get the price:

    console.log("the current bitcoin price is: " + $(".price", $(bitcoinPriceLink)).text());

This needs a little bit of unpacking. “.price” is the selector and tells Cheerio to search for elements with a CSS class of this name. This time, however, we are not applying the selector to the whole DOM; instead, we tell it to search only within bitcoinPriceLink. Once Cheerio has pulled out the HTML element with that CSS class, we simply pull out the text of the element, which is the current bitcoin price. Hey presto! We successfully scraped the price!
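
One detail worth noting is that text() returns a string, and an empty string when the selector matched nothing, so a script that wants to do arithmetic on the price needs to convert and validate it first. The following is a minimal sketch of that step; parsePrice is a hypothetical helper, and the idea that the price text might carry thousands separators or a currency symbol is my assumption rather than something the page is known to do.

```javascript
// Sketch: turn scraped price text into a number, or null if it is unusable.
// parsePrice is a hypothetical helper name introduced for illustration.
function parsePrice(text) {
  // Drop surrounding whitespace, thousands separators and a leading
  // currency symbol (assumed formats, not confirmed for cryptowat.ch).
  const cleaned = String(text).trim().replace(/,/g, '').replace(/^[$€£]/, '');

  // Accept only plain decimal numbers such as "9082.93".
  if (!/^\d+(\.\d+)?$/.test(cleaned)) {
    return null;
  }
  return Number(cleaned);
}

console.log(parsePrice('9082.93'));    // 9082.93
console.log(parsePrice(' 9,082.93 ')); // 9082.93
console.log(parsePrice(''));           // null
```

In scrapeBTCPrice.js this could be called as parsePrice($(".price", $(bitcoinPriceLink)).text()), with a null result suggesting that the selector no longer matches the page.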

If you enjoyed this, you can find me on Twitter at the handle below. You can also find the code used in this article in this repository

Installing Node.js and Cheerio

Node is a runtime environment that allows you to execute JavaScript code outside the browser. It is based on V8, the JavaScript engine from the Google Chrome browser. Installation instructions can be found here.

You can test your installation by running

node -v

which will display the version of node that has been installed. Node also comes with its own package manager, npm. Check which version of this has been installed by running

npm -v

Cheerio is a Node module for parsing and querying HTML. You will need npm to install it as described here.

Michael
