Web Scraping with Playwright and Node.js

Scrapingdog
8 min read · Dec 1, 2023

Playwright is a popular open-source browser automation library for Node.js. It makes browser automation simple: it is fast, lightweight, and needs only a few lines of code to implement a typical web scraping task. It supports all the major browser engines, including Chromium, Firefox, and WebKit.

Playwright can scrape websites that load data through JavaScript execution, take screenshots, record videos, and fill in login forms to get past an auth wall. It is a complete library and is quickly gaining popularity among developers. Since it supports both headless and headful modes, it is a strong choice for web scraping and data mining.

In this tutorial, we are going to focus on how Playwright can be used for scraping websites, taking screenshots, and scraping pages that require a login. If you are not yet familiar with scraping ordinary static websites, do read up on web scraping with Node.js first.

Before we begin with the coding part, let’s first understand what a headless browser is, because you are going to hear this term many times throughout this tutorial.

What is a headless browser?

A headless browser is a web browser without a graphical user interface (GUI). Since there is no GUI, the browser consumes far fewer resources when scraping a website at scale, and you can run multiple instances without worrying about CPU usage.

Now, why would you use a headless browser for scraping instead of making a traditional XHR request? Well, many modern websites use JavaScript to load and display content dynamically. Headless browsers can execute JavaScript, allowing them to render and process pages just like a regular browser. If a website heavily relies on JavaScript to load content, a headless browser ensures that you capture the fully rendered page.

Some websites also use anti-scraping measures that can detect and block requests coming from automated scripts. Headless browsers can emulate human-like behavior, making it harder for websites to distinguish between scraping bots and genuine users. This is another advantage of using a headless browser for scraping.

Web Scraping with Playwright

Requirements and Installation

Let’s create a small scraper using Playwright and understand how it works. For this example, we are going to scrape the price and description of a product from an e-commerce website.

But before we start coding, let's set up the environment. First, create a folder for the Playwright Node.js script and initialize an npm project inside it.

mkdir play
cd play
npm init

Now, let’s install Playwright.

npm i playwright cheerio --save

Along with Playwright, I have also installed Cheerio. Cheerio is a Node.js library for parsing HTML. We will use it to extract the price and other details from the raw HTML of the target page.

I have created a js file inside the play folder. You can name it anything you like; I have named it right.js.

Scraping the product

Let us first write the basic code through which we can open the target page and close it.

// right.js

const playwright = require('playwright');
const cheerio = require('cheerio');

async function test() {
  const browser = await playwright.chromium.launch({ headless: false });

  const page = await browser.newPage();
  await page.goto('https://www.myntra.com/trousers/the+indian+garage+co/the-indian-garage-co-men-black-slim-fit-solid-joggers/9922235/buy');

  await browser.close();
}

test();

Let me explain the code step by step.

  • First, we import the Playwright library.
  • Inside the test() function, the first line launches a Chromium browser. The headless: false option means the browser will have a graphical user interface (GUI), so you'll be able to see the browser window. If you set headless to true, the browser runs in the background without a visible window.
  • The next line opens a new tab in the browser.
  • Then, using the goto() function, Playwright tells the browser to open the target page.
  • Finally, using the close() function, we close the browser window. This also frees up your resources.

The basic setup is ready. But before we dive in let’s first examine the page and check which information we are going to scrape and where it is located inside the DOM.

What will we scrape?

For this tutorial, we are going to scrape the name of the product and the price of the product.

As you can see in the above image the name of the product is located inside the h1 tag with class pdp-title.

The price of the product is stored inside the span tag with class pdp-price.

Downloading and parsing the data

// right.js

const playwright = require('playwright');
const cheerio = require('cheerio');

async function test() {
  const browser = await playwright.chromium.launch({ headless: false });

  const page = await browser.newPage();
  await page.goto('https://www.myntra.com/trousers/the+indian+garage+co/the-indian-garage-co-men-black-slim-fit-solid-joggers/9922235/buy');
  let html = await page.content();
  console.log(html);
  await browser.close();
}

test();

test()

We have added two new lines of code to our right.js file. The content() function returns the raw HTML of the target product page, and we print it to the console. With this, the downloading part is done: we have successfully scraped and downloaded the data from myntra.com. Now, let’s parse the title and price using Cheerio.

// right.js

const playwright = require('playwright');
const cheerio = require('cheerio');

let obj = {};
let arr = [];

async function test() {
  const browser = await playwright.chromium.launch({ headless: false });

  const page = await browser.newPage();
  await page.goto('https://www.myntra.com/trousers/the+indian+garage+co/the-indian-garage-co-men-black-slim-fit-solid-joggers/9922235/buy');
  let html = await page.content();

  const $ = cheerio.load(html);
  obj['name'] = $('h1.pdp-title').text();
  obj['price'] = $('span.pdp-price').text();

  arr.push(obj);
  console.log(arr);
  await browser.close();
}

test();

Using the load() function, we create a Cheerio instance from the downloaded HTML.

obj["name"] = $('h1.pdp-title').text();

This line selects the text content inside the <h1> element with the class pdp-title and assigns it to the name property of an object (obj).

obj["price"] = $('span.pdp-price').text();

This line is similar to the previous one but extracts the text content inside a <span> element with the class pdp-price.

Once you run this code, you should get this output.

[ { name: 'The Indian Garage Co', price: '₹692' } ]

With this, we have managed to scrape a website using Playwright.

How to take Screenshots with Playwright?

Along with scraping you can also take screenshots with Playwright. Taking screenshots at key points in your automation script can help you visually inspect the state of the web page at that moment. This is especially useful when you’re trying to figure out why a certain step is failing.

With just a single line we can take a screenshot of any page. For this tutorial, we are going to take a screenshot of the same product page.

const playwright = require('playwright');

async function test() {
  const browser = await playwright.chromium.launch({ headless: false });

  const page = await browser.newPage();
  await page.goto('https://www.myntra.com/trousers/the+indian+garage+co/the-indian-garage-co-men-black-slim-fit-solid-joggers/9922235/buy');
  await page.screenshot({ path: 'myntra.png' });
  await browser.close();
}

test();

The screenshot() function takes a screenshot and saves it in the current folder as myntra.png. Once you run this code, you should get a screenshot like this.

How to take a screenshot of the full page?

const playwright = require('playwright');

async function test() {
  const browser = await playwright.chromium.launch({ headless: false });

  const page = await browser.newPage();
  await page.goto('https://www.myntra.com/trousers/the+indian+garage+co/the-indian-garage-co-men-black-slim-fit-solid-joggers/9922235/buy');
  await page.screenshot({ path: 'myntra.png', fullPage: true });
  await browser.close();
}

test();

To take a full-page screenshot, pass the fullPage option to the screenshot() function. You should get a screenshot like this.

Scraping behind the authentication wall

Web scraping is not just about making GET requests to a website; sometimes you have to get past an auth wall too. Playwright makes this task easy. For this example, we are going to scrape the home page of github.com after logging in.

As you can see in the above image, the login form has two input fields and one button. We are going to fill in the input fields and then click the Sign in button.

const playwright = require('playwright');

async function test() {
  const browser = await playwright.chromium.launch({ headless: false });

  const page = await browser.newPage();
  await page.goto('https://github.com/login');

  await page.fill('input[name="login"]', 'your-username');
  await page.fill('input[name="password"]', 'your-password');

  // Start waiting for the navigation before clicking, so the navigation
  // triggered by the click is not missed
  await Promise.all([
    page.waitForNavigation(),
    page.click('input[type="submit"]'),
  ]);

  // scrape anything from here
  await browser.close();
}

test();

The .fill method is used to set the value of an input field. It takes two parameters:

  • selector: A selector that identifies the input field.
  • value: The value to be filled into the input field.

The .click method is used to simulate a click on an element. It takes one parameter:

  • selector: A selector that identifies the element to be clicked.

Then we wait for the navigation triggered by the login to finish using the waitForNavigation() function.

Once you are logged in, you can scrape any page you like.

Puppeteer vs Playwright

Puppeteer and Playwright are both loved by developers, and both libraries have great documentation. Still, there are certain differences between the two.

Puppeteer has a larger community than Playwright, but the Playwright community is growing rapidly, and you can see this in the graph provided here. We have written a detailed piece on web scraping with Puppeteer; do read that too.

If you need more stats on Puppeteer vs Playwright, do read this. And if you want to understand which will be the better choice for your use case, read this blog to gain more insights.

Conclusion

Playwright is a powerful browser automation library that is gaining popularity faster than Puppeteer. It gives you a strong base for scraping virtually any website.

If you liked this article, do not forget to share it on your social media pages.


Scrapingdog

I usually talk about web scraping, and yes, web scraping only. You can find a web scraping API at www.scrapingdog.com