Web Scraping with Rust
In this article, we will learn web scraping through Rust. Now, many of you might not have heard about this language but in this tutorial, we will learn about Rust step by step. This tutorial will focus on web scraping and then I will talk about the advantages and disadvantages of using Rust.
We will scrape this website using two popular libraries of Rust reqwest
and scraper
. We will talk about these libraries in a bit. At the end of this tutorial, you will have a basic idea of how Rust works and how it can be used for web scraping.
What is Rust?
Rust is a high-level programming language designed by Mozilla itself. It is built with a main focus on software building. It works great when it comes to low-level memory manipulation like pointers
in C and C++. Concurrent connections are also quite stable when it comes to Rust. Multiple components of the software can run independently and at the same time too without putting too much stress on the server.
Error handling is also top-notch because concurrency errors are compile-time errors instead of run-time errors. This saves time and a proper message will be shown about the error.
This language is also used in game development and blockchain-based tools. Many big companies like AWS, Microsoft, etc already use this language in their architecture.
Setting up the prerequisites
I am assuming that you have already installed Rust and Cargo(package manager of Rust) on your machine and if not then you can refer to this guide for further instructions on installing Rust. First, we have to create a rust project.
cargo new rust_tutorial
Then we have to install two Rust libraries which will be used in the course of this tutorial.
reqwest
: It will be used for making an HTTP connection with the host website.scraper
: It will be used for selecting DOM elements and parsing HTML.
Both of these libraries can be installed by adding them to your cargo.toml
file.
[dependencies]
reqwest = "0.10.8"
scraper = "0.12.0"
Both 0.10.8
and 0.12.0
are the latest versions of the libraries. Now finally you can access them in your main project file src/main.rs
.
What are we scraping?
It is always better to decide what you want to scrape. We will scrape titles and the prices of the individual books from the page.
The process will be pretty straightforward. First, we will inspect chrome to identify the exact location of these elements in the DOM, and then we will use scraper
library to parse them out.
Scrape Individual Book Data
Let’s scrape book titles and prices in a step-by-step manner. First, you have to identify the DOM element location.
As you can see in the above book title is stored inside the title
attribute of a
the tag. Now let’s see where is the price stored.
Price is stored under the p tag
with class price_color
. Now, let’s code it in rust and extract this data.
The first step would be to import all the relevant libraries in the main file src/main.rs
.
use reqwest::Client;
use scraper::{Html, Selector};
Using reqwest
we are going to make an HTTP connection to the host website and using scraper
library we are going to parse the HTML content that we are going to receive by making the GET request through reqwest
library.
Now, we have to create a client which can be used for sending connection requests using reqwest
.
let client = Client::new();
Then finally we are going to send the GET request to our target URL using the client
we just created above.
let mut res = client.get("http://books.toscrape.com/")
.send()
.unwrap();
Here we have used mut
modifier to bind the value to the variable. This improves code readability and once you change this value in the future you might have to change other parts of the code as well.
So, once the request is sent you will get a response in HTML format. But you have to extract that HTML string from res
variable using .text().unwrap()
. .unwrap()
is like a try-catch thing where it asks the program to deliver the results and if there is any error it asks the program to stop the execution asap.
let body = res.text().unwrap();
Here res.text().unwrap()
will return an HTML string and we are storing that string in the body variable.
Now, we have a string through which we can extract all the data we want. Before we use the scraper
library we have to convert this string into an scraper::Html
object using Html::parse_document
.
let document = Html::parse_document(&body);
Now, this object can be used for selecting elements and navigating to the desired element.
First, let’s create a selector for the book title. We are going to use the Selector::parse
function to create a scraper::Selector
object.
let book_title_selector = Selector::parse("h3 > a").unwrap();
Now this object can be used for selecting elements from the HTML document. We have passed h3 > a
as arguments to the parse function
. That is a CSS selector for the elements we are interested in. h3 > a
means it is going to select all the a tags
which are the children of h3 tags
.
As you can see in the image that the target a tag
is the child of h3 tag
. Due to this, we have used h3 > a
in the above code.
Since there are so many books we are going to iterate over all of them using the for
loop.
for book_title in document.select(&book_title_selector) {
let title = book_title.text().collect::<Vec<_>>();
println!("Title: {}", title[0]);
}
select
method will provide us with a list of elements that matches the selector book_title_selector
. Then we are iterating over that list to find the title attribute
and finally print it.
Here Vec<_>>
represents a dynamically sized array. It is a vector where you can access any element by its position in the vector.
The next and final step is to extract the price.
let book_price_selector = Selector::parse(".price_color").unwrap();
Again we have used Selector::parse
function to create the scraper::Selector
object. As discussed above price is stored under the price_color
class. So, we have passed this as a CSS selector to the parse function.
Then again we are going to use for loop like we did above to iterate over all the price elements.
for book_price in document.select(&book_price_selector) {
let price = book_price.text().collect::<Vec<_>>();
println!("Price: {}", price[0]);
}
Once you find the match of the selector it will get the text and print it on the console.
Finally, we have completed the code which can extract the title and the price from the target URL. Now, once you save this and run the code using cargo run
you will get output that looks something like this.
Title: A Light in the Attic
Price: £51.77
Title: Tipping the Velvet
Price: £53.74
Title: Soumission
Price: £50.10
Title: Sharp Objects
Price: £47.82
Title: Sapiens: A Brief History of Humankind
Price: £54.23
Title: The Requiem Red
Price: £22.65
Title: The Dirty Little Secrets of Getting Your Dream Job
Price: £33.34
Title: The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
Price: £17.93
Title: The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
Price: £22.60
Title: The Black Maria
Price: £52.15
Complete Code
You can make more changes to the code to extract other information like star ratings, book images, etc. You can use the same technique of first inspecting and finding the location of the element and then extracting them using the Selector function.
But for now, the code will look like this.
use reqwest::Client;
use scraper::{Html, Selector};
// Create a new client
let client = Client::new();
// Send a GET request to the website
let mut res = client.get("http://books.toscrape.com/")
.send()
.unwrap();
// Extract the HTML from the response
let body = res.text().unwrap();
// Parse the HTML into a document
let document = Html::parse_document(&body);
// Create a selector for the book titles
let book_title_selector = Selector::parse("h3 > a").unwrap();
// Iterate over the book titles
for book_title in document.select(&book_title_selector) {
let title = book_title.text().collect::<Vec<_>>();
println!("Title: {}", title[0]);
}
// Create a selector for the book prices
let book_price_selector = Selector::parse(".price_color").unwrap();
// Iterate over the book prices
for book_price in document.select(&book_price_selector) {
let price = book_price.text().collect::<Vec<_>>();
println!("Price: {}", price[0]);
}
Advantages of using Rust
- Rust is an efficient programming language like C++. You can build heavy-duty games and Software using it.
- It can handle a high volume of concurrent calls, unlike python.
- Rust can even interact with languages like C and Python.
Disadvantages of using Rust
- Rust is a new language if we compare it to Nodejs and Python. Due to the small community, it becomes very difficult for a beginner to resolve even a small error.
- Rust syntax is not that easy to understand as compared to Python or Nodejs. So, it becomes very difficult to read and understand the code.
Conclusion
We learned how Rust can be used for web scraping purposes. Using Rust you can scrape many other dynamic websites as well. Even in the above code, you can make a few more changes to scrape images and ratings. This will surely improve your web scraping skills with Rust.
I hope you like this little tutorial and if you do then please do not forget to share it with your friends and on your social media.
Additional Resources
Here are a few additional resources that you may find helpful during your web scraping journey: