Write your first Web Scraper

In this tutorial, we will write a web scraper to query the Hacker News homepage for a list of the latest articles with their URLs. We will produce a JSON string containing our results based on the scraped data. We will also cover this with some unit tests!

This is a good tutorial to follow along with, especially if you’ve been meaning to sink your teeth into the Dart language and tooling.

Get the source code

What is a Web Scraper?

A web scraper is a script that extracts data from websites. This usually happens by performing a GET request to the web page and then parsing the HTML response to retrieve the desired content.

1. Generate a console project

Create a directory for your project:

$ mkdir hacker_news_scraper && cd hacker_news_scraper

Use the stagehand package to generate a console application:

$ pub global activate stagehand # If you don't have it installed
$ stagehand console-full

Add the http and html dependencies to the pubspec.yaml file:

dependencies:
  html: ^0.13.3+3
  http: ^0.12.0

The http package provides a Future-based API for making requests. The html package contains helpers to parse HTML5 strings using a DOM-inspired API. It’s a port of html5lib from Python.

And install the added dependencies:

$ pub get

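Following these instructions should give you a project that contains, among other generated files, pubspec.yaml, bin/main.dart, lib/hacker_news_scraper.dart and test/hacker_news_scraper_test.dart.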

2. Implement the script

Empty the contents of lib/hacker_news_scraper.dart, for we shall start from scratch☝️

Import our installed dependencies:

import 'dart:convert'; // Contains the JSON encoder

import 'package:http/http.dart'; // Contains a client for making API calls
import 'package:html/parser.dart'; // Contains HTML parsers to generate a Document object
import 'package:html/dom.dart'; // Contains DOM related classes for extracting data from elements

Create a function after our imports to contain our logic:

initiate() async {}

The http package contains a Client class for making HTTP calls. Create an instance and perform a GET request to the Hacker News homepage:

Future initiate() async {
  var client = Client();
  Response response = await client.get('https://news.ycombinator.com');

  print(response.body);
}

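To test this out, go to bin/main.dart and invoke the initiate function. Since initiate does not return anything yet, the outer print call will also output null after the response body: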

import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;

void main(List<String> arguments) async {
  print(await hacker_news_scraper.initiate());
}

Run this file:

$ dart bin/main.dart

Below is an extract of the response:

<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"> 
...
...
<table border="0" cellpadding="0" cellspacing="0" class="itemlist">
  <tr class='athing' id='18678314'>
    <td align="right" valign="top" class="title">
      <span class="rank">1.</span></td>      
      <td valign="top" class="votelinks">
      <center><a id='up_18678314' href='vote?id=18678314&how=up&goto=news'><div class='votearrow' title='upvote'></div></a></center></td>
      <td class="title">
      <a href="http://vmls-book.stanford.edu/" class="storylink">
        Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares
      </a><span class="sitebit comhead"> (<a href="from?site=stanford.edu"><span class="sitestr">stanford.edu</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">
      <span class="score" id="score_18678314">381 points</span> by <a href="user?id=yarapavan" class="hnuser">yarapavan</a> 
      <span class="age"><a href="item?id=18678314">8 hours ago</a></span> <span id="unv_18678314"></span> | <a href="hide?id=18678314&goto=news">hide</a> | 
      <a href="item?id=18678314">37 comments</a>              
    </td>
  </tr>
...
...

To know what to look for, we need to work out how to select the links on the page.

In the extract above, each story link sits in a table cell with the class “title” and itself has the class “storylink”. This means that we can select those links with this CSS selector: td.title > a.storylink
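To see what the child combinator (>) buys us, here is a small sketch that runs two selectors against a trimmed-down copy of the markup. The matchedTexts helper and the inline HTML are hypothetical and exist only for this illustration.

```dart
import 'package:html/parser.dart' show parse;

/// Returns the trimmed text of every element in [html] matched by [selector].
/// Hypothetical helper, used only for this illustration.
List<String> matchedTexts(String html, String selector) {
  final document = parse(html);
  return document
      .querySelectorAll(selector)
      .map((element) => element.text.trim())
      .toList();
}

void main() {
  // Hypothetical markup modelled on the Hacker News extract.
  const html = '''
    <table><tr>
      <td class="title">
        <a class="storylink" href="http://vmls-book.stanford.edu/">Applied Linear Algebra</a>
        <span class="sitebit comhead">(<a href="from?site=stanford.edu">stanford.edu</a>)</span>
      </td>
    </tr></table>
  ''';

  // Only anchors that are direct children of td.title and carry the class.
  print(matchedTexts(html, 'td.title > a.storylink'));

  // Any anchor nested anywhere inside td.title, including the site link.
  print(matchedTexts(html, 'td.title a'));
}
```

The narrower selector should pick up the story link only, while the looser one also catches the site link nested inside the span.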

In lib/hacker_news_scraper.dart, rather than printing the response body in the initiate function, let’s parse the body and select our elements using the helpers from the html package.

Future initiate() async {
  var client = Client();
  Response response = await client.get('https://news.ycombinator.com');
  // Use html parser and query selector
  var document = parse(response.body);
  List<Element> links = document.querySelectorAll('td.title > a.storylink');
}

At this point we now have a list of Elements where each element is an a.storylink item. The Element type provides an API similar to the DOM.
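As a rough illustration of that DOM-like API, the sketch below reads a few properties from a single matched element. The describeFirstStoryLink helper and the inline markup are assumptions made for this example, not part of the tutorial's script.

```dart
import 'package:html/parser.dart' show parse;

/// Collects a few DOM-style properties of the first a.storylink in [html].
/// Hypothetical helper; returns an empty map when nothing matches.
Map<String, dynamic> describeFirstStoryLink(String html) {
  final link = parse(html).querySelector('a.storylink');
  if (link == null) return {};

  return {
    'tag': link.localName, // element name
    'text': link.text, // concatenated text content
    'href': link.attributes['href'], // attribute lookup by name
    'hasClass': link.classes.contains('storylink'), // class list check
    'parentTag': link.parent?.localName, // walk up to the enclosing element
  };
}

void main() {
  const html = '<table><tr><td class="title">'
      '<a class="storylink" href="https://dartlang.org">Get started with Dart</a>'
      '</td></tr></table>';

  print(describeFirstStoryLink(html));
}
```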

With a for-in loop we can traverse the collection:

List<Map<String, dynamic>> linkMap = [];

for (var link in links) {
  linkMap.add({
    'title': link.text,
    'href': link.attributes['href'],
  });
}
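The loop stores each href exactly as it appears in the markup. The extract earlier contains relative links such as item?id=18678314, so if a story link ever turns out to be relative you may want to resolve it against the page URL first. A minimal sketch using Uri.resolve from dart:core follows; absoluteHref is a hypothetical helper, not something the scraper above defines.

```dart
/// Resolves [href] against [pageUrl] so that relative links become absolute.
/// Hypothetical helper; absolute links come back unchanged.
String absoluteHref(String pageUrl, String href) {
  return Uri.parse(pageUrl).resolve(href).toString();
}

void main() {
  const page = 'https://news.ycombinator.com/';

  print(absoluteHref(page, 'item?id=18678314'));
  print(absoluteHref(page, 'http://vmls-book.stanford.edu/'));
}
```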

And return a JSON-encoded output:

import 'dart:convert'; // Do this at the top of the file

Future initiate() async {
  ...
  ...
  return json.encode(linkMap);
}

Here's the full script so far:

// lib/hacker_news_scraper.dart
import 'dart:convert';

import 'package:http/http.dart';
import 'package:html/parser.dart';
import 'package:html/dom.dart';

Future initiate() async {
  // Make a GET request to the Hacker News homepage
  var client = Client();
  Response response = await client.get('https://news.ycombinator.com');

  // Use html parser
  var document = parse(response.body);
  List<Element> links = document.querySelectorAll('td.title > a.storylink');
  List<Map<String, dynamic>> linkMap = [];

  for (var link in links) {
    linkMap.add({
      'title': link.text,
      'href': link.attributes['href'],
    });
  }

  return json.encode(linkMap);
}

Running this should return JSON output similar to the one below:

[
  {
    "title":"Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares",
    "href":"http://vmls-book.stanford.edu/"
  },
  {
    "title":"Write Your Own Virtual Machine",
    "href":"https://justinmeiners.github.io/lc3-vm/"
  },
  {
    "title":"Verizon signals its Yahoo and AOL divisions are almost worthless",
    "href":"https://www.nbcnews.com/tech/tech-news/verizon-signals-its-yahoo-aol-divisions-are-almost-worthless-n946846"
  },
  ...
  ...
]
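Note that json.encode returns a compact, single-line string; the listing above is laid out over several lines for readability. If you would like indented output from the script itself, one option is JsonEncoder.withIndent from dart:convert. The prettyJson helper below is a hypothetical name used only for this sketch.

```dart
import 'dart:convert';

/// Encodes [links] as indented JSON rather than the compact default.
/// Hypothetical helper, not part of the tutorial's script.
String prettyJson(List<Map<String, dynamic>> links) {
  return const JsonEncoder.withIndent('  ').convert(links);
}

void main() {
  final links = [
    {
      'title': 'Write Your Own Virtual Machine',
      'href': 'https://justinmeiners.github.io/lc3-vm/',
    },
  ];

  print(prettyJson(links));
}
```

Because the result is still valid JSON, json.decode reads it back the same way as the compact form.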

3. Write the unit tests

Our tests will go in test/hacker_news_scraper_test.dart. Replace its contents with the following:

import 'dart:convert';

import 'package:test/test.dart';
import 'package:http/http.dart';
import 'package:http/testing.dart';
import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;

void main() {
  // Our tests will go here
}

This is what our first test looks like so far:

void main() {
  test('calling initiate() returns a list of storylinks', () async {
    var response = await hacker_news_scraper.initiate();
    expect(response, equals('/* JSON string to match against */'));
  });
}

We need to refactor our solution slightly for our tests. As written, the tests would be flaky because they make real calls to the Hacker News website.

If Hacker News isn’t available, we do not have an internet connection, or the story listings change (and they will), our tests will fail.

Let’s refactor our initiate() method call to expect a client parameter and remove the var client = Client(); declaration:

// lib/hacker_news_scraper.dart
Future initiate(BaseClient client) async {
  // var client = Client(); // <- Remove this line
  ...
}

The concrete clients in the http package extend the BaseClient type. This is useful because the same package provides another BaseClient subclass called MockClient for mocking HTTP calls, which is exactly what our unit tests need!

Return to bin/main.dart and ensure the Client is passed in:

import 'package:http/http.dart'; // Import the package first!
import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;

void main(List<String> arguments) async {
  print(await hacker_news_scraper.initiate(Client()));
}

Ok, back to our unit tests.

This is the first test that uses our MockClient:

void main() {
  MockClient client = null;

  test('calling initiate(client) returns a list of storylinks', () async {
    // Arrange
    client = MockClient((req) => Future(() => Response('''
      <body>
        <table><tbody><tr>
        <td class="title">
          <a class="storylink" href="https://dartlang.org">Get started with Dart</a>
        </td>
        </tr></tbody></table>
      </body>
    ''', 200)));

    // Act
    var response = await hacker_news_scraper.initiate(client);

    // Assert
    expect(
        response,
        equals(json.encode([
          {
            'title': 'Get started with Dart',
            'href': 'https://dartlang.org',
          }
        ])));
  });
}

The MockClient constructor takes a closure as its parameter. This closure receives a request object which we can inspect if needed. A Future is expected to be returned from this closure, which is what we're doing here: we return a Response containing an HTML string, and that is what the await client.get(...) call in initiate() receives.

The Response constructor takes the body as its first parameter and an integer representing the status code as its second. In this case it's a 200 OK.

We then proceed to make our initiate() call, passing in our MockClient. This means that our test is now predictable and we can confidently perform assertions on the response.
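The request object passed to the closure can also drive the mock's behaviour. As a sketch, the hypothetical homepageOnlyClient helper below serves the given HTML only for GET requests to news.ycombinator.com and answers anything else with a 404. Uri.parse is used for the URLs on the assumption that it works with both the http version pinned earlier and newer releases.

```dart
import 'package:http/http.dart';
import 'package:http/testing.dart';

/// Builds a mock that serves [html] for the Hacker News host only.
/// Hypothetical helper, used only for this illustration.
MockClient homepageOnlyClient(String html) {
  return MockClient((Request request) async {
    if (request.method == 'GET' &&
        request.url.host == 'news.ycombinator.com') {
      return Response(html, 200);
    }
    // Any other request is treated as unknown.
    return Response('Not found', 404);
  });
}

void main() async {
  final client = homepageOnlyClient('<a class="storylink" href="#">Hi</a>');

  final hit = await client.get(Uri.parse('https://news.ycombinator.com'));
  final miss = await client.get(Uri.parse('https://example.com'));

  print('${hit.statusCode} ${miss.statusCode}');
}
```

A mock like this makes a test fail loudly if the code under test starts requesting a URL you did not expect.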

The expect and equals top-level functions come as part of the test package by the Dart team. We installed this earlier on and it is listed under dev_dependencies: in our pubspec.yaml file.

We are using the json.encode() method because an encoded JSON string is what we expect from the operation.

We can run this test by doing:

$ pub run test

Here's the second test to address the failure scenario:

void main() {
  ...
  ...
  test('calling initiate(client) should silently fail', () async {
    // Arrange
    client = MockClient((req) => Future(() => Response('Failed', 400)));

    // Act
    var response = await hacker_news_scraper.initiate(client);

    // Assert
    expect(response, equals('Failed'));
  });
}

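Run pub run test again. This will fail: initiate() currently ignores the status code, parses the body 'Failed' as HTML, finds no story links and returns '[]' instead of 'Failed'.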

Let’s make this pass. In our initiate() method, let’s add this condition below our GET call:

if (response.statusCode != 200) return response.body;

Run the test again. And it's a pass!
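For reference, here is a sketch of how lib/hacker_news_scraper.dart might look once the client parameter and the status check are both in place. The Future<String> return type and the Uri.parse call are my additions (the latter on the assumption that it works with both the pinned http version and newer ones), and main() with its two mock clients exists only so the sketch can run offline.

```dart
import 'dart:convert';

import 'package:http/http.dart';
import 'package:http/testing.dart';
import 'package:html/parser.dart';

/// Scrapes story links using [client] and returns them as a JSON string.
/// On a non-200 response the body is returned unchanged.
Future<String> initiate(BaseClient client) async {
  final response = await client.get(Uri.parse('https://news.ycombinator.com'));

  // Bail out early on failure instead of parsing an error page.
  if (response.statusCode != 200) return response.body;

  final document = parse(response.body);
  final links = document.querySelectorAll('td.title > a.storylink');

  final linkMap = <Map<String, dynamic>>[];
  for (var link in links) {
    linkMap.add({
      'title': link.text,
      'href': link.attributes['href'],
    });
  }

  return json.encode(linkMap);
}

// Demonstration only: exercises both paths without touching the network.
void main() async {
  final ok = MockClient((request) async => Response(
      '<table><tr><td class="title">'
      '<a class="storylink" href="https://dartlang.org">Get started with Dart</a>'
      '</td></tr></table>',
      200));
  final failing = MockClient((request) async => Response('Failed', 400));

  print(await initiate(ok));
  print(await initiate(failing));
}
```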

Conclusion

To sum things up, we have built a scraping tool to pull in the latest feed from the Hacker News website using the http and html packages provided by the Dart team. We then covered our backs by writing some unit tests.

In reality, though, it may serve you better to use the Hacker News API for this. That being said, you will still need this approach for websites that do not have an official API for traversing their content.

I hope this has been insightful, especially in the area of writing tests in Dart.

Further reading

  1. http: A composable, Future-based library for making HTTP requests

  2. html: HTML5 parser in Dart

  3. Free Dart screencasts on Egghead.io