In this tutorial, we will write a web scraper to query the Hacker News homepage for a list of the latest articles with their URLs. We will produce a JSON string containing our results based on the scraped data. We will also cover this with some unit tests!
This will be great to follow along, especially if you’ve been meaning to sink your teeth into the Dart language and tooling.
What is a Web Scraper?
A web scraper is a script that extracts data from websites. It usually does this by performing a GET request to a web page and then parsing the HTML response to retrieve the desired content.
1. Generate a console project
Create a directory for your project:
$ mkdir hacker_news_scraper && cd hacker_news_scraper
Use the stagehand package to generate a console application:
$ pub global activate stagehand # If you don't have it installed
$ stagehand console-full
Add the http and html dependencies to the pubspec.yaml file:
dependencies:
  html: ^0.13.3+3
  http: ^0.12.0
The http package provides a Future-based API for making requests. The html package contains helpers to parse HTML5 strings using a DOM-inspired API. It’s a port of html5lib from Python.
And install the added dependencies:
$ pub get
Following these instructions correctly should give you a project that includes pubspec.yaml, bin/main.dart, lib/hacker_news_scraper.dart and test/hacker_news_scraper_test.dart.
2. Implement the script
Empty the contents of lib/hacker_news_scraper.dart, for we shall start from scratch ☝️
Import our installed dependencies:
import 'dart:convert'; // Contains the JSON encoder
import 'package:http/http.dart'; // Contains a client for making API calls
import 'package:html/parser.dart'; // Contains HTML parsers to generate a Document object
import 'package:html/dom.dart'; // Contains DOM related classes for extracting data from elements
Create a function after our imports to contain our logic:
initiate() async {}
The http package contains a Client class for making HTTP calls. Create an instance and perform a GET request to the Hacker News homepage:
Future initiate() async {
  var client = Client();
  Response response = await client.get('https://news.ycombinator.com');
  print(response.body);
}
To test this out, go to bin/main.dart and invoke the initiate function:
import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;
void main(List<String> arguments) async {
  print(await hacker_news_scraper.initiate());
}
Run this file:
$ dart bin/main.dart
Below is an extract of the response:
<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0">
...
...
<table border="0" cellpadding="0" cellspacing="0" class="itemlist">
<tr class='athing' id='18678314'>
<td align="right" valign="top" class="title">
<span class="rank">1.</span></td>
<td valign="top" class="votelinks">
<center><a id='up_18678314' href='vote?id=18678314&how=up&goto=news'><div class='votearrow' title='upvote'></div></a></center></td>
<td class="title">
<a href="http://vmls-book.stanford.edu/" class="storylink">
Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares
</a><span class="sitebit comhead"> (<a href="from?site=stanford.edu"><span class="sitestr">stanford.edu</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">
<span class="score" id="score_18678314">381 points</span> by <a href="user?id=yarapavan" class="hnuser">yarapavan</a>
<span class="age"><a href="item?id=18678314">8 hours ago</a></span> <span id="unv_18678314"></span> | <a href="hide?id=18678314&goto=news">hide</a> |
<a href="item?id=18678314">37 comments</a>
</td>
</tr>
...
...
To know what to look for, we need to work out how to select the links on the page.
It appears that each link sits in a table cell with the class “title” and itself has the class “storylink”. This means that we can use this CSS selector to select them: td.title > a.storylink
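To see why the selector is this specific, here is a minimal sketch that runs it against a made-up row. The markup and the storyTitles helper are assumptions for illustration only. The row contains three links, but only the direct child of td.title with the storylink class should match.

```dart
import 'package:html/parser.dart' show parse;

/// Hypothetical helper: returns the text of every story link found in [html].
List<String> storyTitles(String html) {
  final document = parse(html);
  return document
      .querySelectorAll('td.title > a.storylink')
      .map((link) => link.text.trim())
      .toList();
}

void main() {
  // Made-up markup that mimics a single Hacker News row.
  const sample = '''
<table><tbody><tr>
  <td class="votelinks"><a href="vote?id=1">upvote</a></td>
  <td class="title">
    <a class="storylink" href="https://example.com/story">An example story</a>
    <span class="sitebit"><a href="from?site=example.com">example.com</a></span>
  </td>
</tr></tbody></table>
''';

  // The vote link and the site link are skipped; only the story link matches.
  print(storyTitles(sample)); // [An example story]
}
```

The vote link lives in a different cell, and the site link is nested inside a span, so neither satisfies the child combinator.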
In lib/hacker_news_scraper.dart, rather than printing the response body in the initiate function, let’s parse the body and select our elements using the helpers from the html package.
Future initiate() async {
  var client = Client();
  Response response = await client.get('https://news.ycombinator.com');
  // Use html parser and query selector
  var document = parse(response.body);
  List<Element> links = document.querySelectorAll('td.title > a.storylink');
}
At this point we have a list of Elements, where each element is an a.storylink item. The Element type provides an API similar to the DOM.
With a for-in loop we can traverse the collection:
List<Map<String, dynamic>> linkMap = [];
for (var link in links) {
  linkMap.add({
    'title': link.text,
    'href': link.attributes['href'],
  });
}
And return a JSON-encoded output:
import 'dart:convert'; // Do this at the top of the file
Future initiate() async {
  ...
  ...
  return json.encode(linkMap);
}
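If you want to see what json.encode does with a list of maps on its own, here is a small standalone sketch. The encodeLinks helper and the sample entries are made up for illustration. The second entry shows what happens when an anchor has no href attribute, since a missing attribute comes back as null.

```dart
import 'dart:convert';

/// Hypothetical helper: encodes a list of link maps as a JSON string.
String encodeLinks(List<Map<String, dynamic>> links) => json.encode(links);

void main() {
  final links = <Map<String, dynamic>>[
    {'title': 'Example story', 'href': 'https://example.com'},
    {'title': 'Story without a link', 'href': null},
  ];

  final encoded = encodeLinks(links);
  print(encoded);
  // [{"title":"Example story","href":"https://example.com"},{"title":"Story without a link","href":null}]

  // Decoding gives the data back as plain lists and maps.
  final decoded = json.decode(encoded) as List<dynamic>;
  print(decoded.first['title']); // Example story
}
```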
Here's the full script so far:
// lib/hacker_news_scraper.dart
import 'dart:convert';
import 'package:http/http.dart';
import 'package:html/parser.dart';
import 'package:html/dom.dart';
Future initiate() async {
  // Make a GET request to the Hacker News homepage
  var client = Client();
  Response response = await client.get('https://news.ycombinator.com');
  // Use html parser
  var document = parse(response.body);
  List<Element> links = document.querySelectorAll('td.title > a.storylink');
  List<Map<String, dynamic>> linkMap = [];
  for (var link in links) {
    linkMap.add({
      'title': link.text,
      'href': link.attributes['href'],
    });
  }
  return json.encode(linkMap);
}
Running this should print JSON output similar to the following:
[
{
"title":"Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares",
"href":"http://vmls-book.stanford.edu/"
},
{
"title":"Write Your Own Virtual Machine",
"href":"https://justinmeiners.github.io/lc3-vm/"
},
{
"title":"Verizon signals its Yahoo and AOL divisions are almost worthless",
"href":"https://www.nbcnews.com/tech/tech-news/verizon-signals-its-yahoo-aol-divisions-are-almost-worthless-n946846"
},
...
...
]
3. Write the unit tests
Our tests will go in test/hacker_news_scraper_test.dart. Replace its contents with the below:
import 'dart:convert';
import 'package:test/test.dart';
import 'package:http/http.dart';
import 'package:http/testing.dart';
import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;
void main() {
  // Our tests will go here
}
This is what our first test looks like so far:
void main() {
  test('calling initiate() returns a list of storylinks', () async {
    var response = await hacker_news_scraper.initiate();
    expect(response, equals('/* JSON string to match against */'));
  });
}
We need to refactor our solution slightly for our tests. As it stands, the tests would be flaky because they make actual calls to the Hacker News website.
If Hacker News isn’t available, we do not have an internet connection, or the story listings change (and they will), our tests will fail.
Let’s refactor our initiate() function to expect a client parameter and remove the var client = Client(); declaration:
// lib/hacker_news_scraper.dart
Future initiate(BaseClient client) async {
  // var client = Client(); // <- Remove this line
  ...
}
The client classes in the http package extend a BaseClient type. This is useful because the same package provides another subclass called MockClient for mocking HTTP calls, which is just what we need for our unit tests!
Return to bin/main.dart and ensure the Client is passed in:
import 'package:http/http.dart'; // Import the package first!
import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;
void main(List<String> arguments) async {
  print(await hacker_news_scraper.initiate(Client()));
}
Ok, back to our unit tests.
This is the first test that uses our MockClient:
void main() {
  MockClient client = null;
  test('calling initiate(client) returns a list of storylinks', () async {
    // Arrange
    client = MockClient((req) => Future(() => Response('''
      <body>
        <table><tbody><tr>
          <td class="title">
            <a class="storylink" href="https://dartlang.org">Get started with Dart</a>
          </td>
        </tr></tbody></table>
      </body>
    ''', 200)));
    // Act
    var response = await hacker_news_scraper.initiate(client);
    // Assert
    expect(
        response,
        equals(json.encode([
          {
            'title': 'Get started with Dart',
            'href': 'https://dartlang.org',
          }
        ])));
  });
}
The MockClient constructor takes a closure as its parameter. This closure provides a request object which we can inspect if needed. A Future is expected to be returned from this closure, which is what we're doing here: it completes with a Response wrapping our HTML string, and that is what our await client.get(...) call receives.
The Response constructor takes a second parameter, an integer representing the response status code. In this case it's a 200 OK.
We then proceed to make our initiate() call, passing in our MockClient. This means that our test is now predictable and we can confidently perform assertions on the response.
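The request object handed to the closure can also be used to vary the response. Below is a minimal sketch of that idea, assuming a hypothetical stubHandler function that answers GET requests for the Hacker News host and returns a 404 for anything else. The URLs are passed as Uri objects here, which recent versions of the http package require.

```dart
import 'package:http/http.dart';
import 'package:http/testing.dart';

/// Hypothetical handler: serves a stub page for the Hacker News host
/// and a 404 for every other request.
Future<Response> stubHandler(Request request) async {
  if (request.method == 'GET' && request.url.host == 'news.ycombinator.com') {
    return Response('<table></table>', 200);
  }
  return Response('Not found', 404);
}

Future<void> main() async {
  final client = MockClient(stubHandler);

  final ok = await client.get(Uri.parse('https://news.ycombinator.com'));
  final missing = await client.get(Uri.parse('https://example.com'));

  print(ok.statusCode); // 200
  print(missing.statusCode); // 404
}
```

No network traffic happens in either call, since the handler runs in place of a real request.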
The expect and equals top-level functions come as part of the test package by the Dart team. We installed this earlier on and it is listed under dev_dependencies: in our pubspec.yaml file.
We are using the json.encode() method as it's an encoded JSON string we expect from the operation.
We can run this test by doing:
$ pub run test
Here's the second test to address the failure scenario:
void main() {
  ...
  ...
  test('calling initiate(client) should silently fail', () async {
    // Arrange
    client = MockClient((req) => Future(() => Response('Failed', 400)));
    // Act
    var response = await hacker_news_scraper.initiate(client);
    // Assert
    expect(response, equals('Failed'));
  });
}
Run pub run test again. This will fail.
Let’s make this pass. In our initiate() function, let’s add this condition below our GET call:
if (response.statusCode != 200) return response.body;
Run the test again. And it's a pass!
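For reference, here is a sketch of how lib/hacker_news_scraper.dart might look once the steps above are assembled. It is one way to put the pieces together, not the only one. The Future<String> return type and the Uri.parse call are assumptions added here; the latter keeps the call working on versions of the http package that no longer accept a plain string URL.

```dart
import 'dart:convert';

import 'package:html/parser.dart' show parse;
import 'package:http/http.dart';

Future<String> initiate(BaseClient client) async {
  final response = await client.get(Uri.parse('https://news.ycombinator.com'));

  // Hand back the raw body on any non-200 status, as the second test expects.
  if (response.statusCode != 200) return response.body;

  // Parse the page and collect the title and URL of each story link.
  final document = parse(response.body);
  final links = document.querySelectorAll('td.title > a.storylink');
  final linkMap = <Map<String, dynamic>>[];
  for (final link in links) {
    linkMap.add({
      'title': link.text,
      'href': link.attributes['href'],
    });
  }
  return json.encode(linkMap);
}
```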
Conclusion
To sum things up, we have built a scraping tool to pull in the latest feed from the Hacker News website using the http and html packages provided by the Dart team. We then covered our backs by writing some unit tests.
In reality, though, it may serve you better to use the Hacker News API for this. That being said, you will still need this approach for websites that do not have an official API for traversing their content.
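As a rough idea of what the API route looks like: the official Hacker News API serves each story as a JSON object at https://hacker-news.firebaseio.com/v0/item/<id>.json, with title and url among its fields. The sketch below maps such a payload onto the same shape our scraper produces. The linkFromItem helper and the sample payload are made up for illustration, and no network call is made.

```dart
import 'dart:convert';

/// Hypothetical helper: converts one item payload into the
/// {title, href} shape used by the scraper.
Map<String, dynamic> linkFromItem(String itemJson) {
  final item = json.decode(itemJson) as Map<String, dynamic>;
  return {'title': item['title'], 'href': item['url']};
}

void main() {
  // Made-up payload shaped like a story item; real items carry more fields.
  const sample =
      '{"id": 1, "type": "story", "title": "Example story", "url": "https://example.com"}';

  print(json.encode([linkFromItem(sample)]));
  // [{"title":"Example story","href":"https://example.com"}]
}
```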
I hope this has been insightful, especially in the area of writing tests in Dart.