Node.js Puppeteer: Setup, Web Scraping & Testing

Nikita Jaiswal
5 min read · May 11, 2024


What is Puppeteer?

Puppeteer is a Node.js library maintained by Google that lets us control Chrome or Chromium and interact with the HTML elements of any website.

It is not limited to reading information from a page, such as a movie title; we can actively interact with the page as well. This includes clicking buttons, submitting forms, extracting data from specific elements, taking screenshots, and much more.
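
For a first taste, here is a minimal sketch that opens a page and saves a screenshot; the URL example.com is just a stand-in:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' }); // save a screenshot of the page
  await browser.close();
})();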

Setup

We will create a Node.js project using the following commands:

mkdir puppeteer
cd puppeteer
npm init

This will create a “package.json” file.
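
For a project named “puppeteer”, the generated file will look something like this (the exact contents depend on your answers to the npm init prompts):

{
  "name": "puppeteer",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}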

Next, we install the “puppeteer” library by executing the following command. By default, this also downloads a compatible build of Chromium for Puppeteer to control.

npm install puppeteer

Next, our goal is to extract product data from Amazon.com, including handling pagination. This involves navigating through multiple pages by clicking the “Next” button until it becomes disabled.

Furthermore, we aim to store the collected data in an Excel spreadsheet. To accomplish this, we will utilise the “xlsx” npm package.
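
Since “xlsx” is not installed yet, add it to the project as well:

npm install xlsx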

Let’s get started 🙂

To begin, let’s create a JavaScript file named “index.js”.

Now let’s import the required packages:

const puppeteer = require('puppeteer');
const XLSX = require('xlsx');

Next, we will launch Puppeteer and navigate to an Amazon search results page to fetch product data:

(async () => {
  // Launch the browser and open a new blank page
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null
  });
  const page = await browser.newPage();

  // Navigate the page to a URL
  await page.goto('https://www.amazon.com/s?k=gifts+for+mom&_encoding=UTF8&content-id=amzn1.sym.b1718f3d-ff3b-4ea9-94ad-229b681f4963&pd_rd_r=bacf8c18-88bb-4f64-8b2c-078343c8b07b&pd_rd_w=WOwv3&pd_rd_wg=GtzYP&pf_rd_p=b1718f3d-ff3b-4ea9-94ad-229b681f4963&pf_rd_r=0Z61C8QBS7N9V6P832FQ&qid=1715233337&xpid=k0Az372LpDs7Z&ref=sr_pg_1');

  await browser.close();
})();

To launch Puppeteer, we pass two options:

headless: The headless option determines whether the browser is launched in headless mode. When set to true, Puppeteer launches the browser without a visible user interface (UI), so it runs in the background. We set it to false here so a browser window opens and we can watch the scraper work.

defaultViewport: The defaultViewport option specifies the initial viewport size and scale factor of the page. When set to null, Puppeteer does not apply its default 800x600 viewport and the page uses the size of the browser window instead.

Next, we need to select every product card on the results page. We will achieve this using Puppeteer’s “$$” method.

$$: The method runs document.querySelectorAll within the page. If no elements match the selector, the return value resolves to [].
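
For instance, this generic call (not tied to the Amazon selectors below) collects an ElementHandle for every anchor tag on the page:

const anchors = await page.$$('a');
console.log(anchors.length); // 0 if nothing matched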

We will navigate through each page and collect the data in an array until the “Next” pagination button becomes disabled. Note that the product handles must be re-queried on every page, because handles from the previous page go stale after navigation:

let items = []; // We will store each product's title, price, and image in this array
let btnDisabled = false;
while (!btnDisabled) {
  // Re-select the products on every page; handles from the previous page go stale after navigation
  const products = await page.$$('div.s-main-slot > .s-result-item');
  for (const product of products) {
    try {
      const title = await page.evaluate(el => el.querySelector("h2 > a > span").textContent, product); // textContent returns the element's text
      const price = await page.evaluate(el => el.querySelector(".a-price > .a-offscreen").textContent, product);
      const image = await page.evaluate(el => el.querySelector(".s-image").getAttribute('src'), product); // we need the image src, so we use getAttribute
      items.push({
        title, price, image
      });
    } catch (error) {
      // Some result items (ads, banners) lack these selectors; skip them
    }
  }
  await page.waitForSelector('.s-pagination-next', { visible: true });
  btnDisabled = await page.$('.s-pagination-next.s-pagination-disabled') !== null;
  if (!btnDisabled) {
    // Click "Next" and wait for the new page to finish loading before scraping again
    await Promise.all([
      page.waitForNavigation(),
      page.click('.s-pagination-next')
    ]);
  }
}

Now we will store the collected items in an .xlsx file using the “xlsx” package we installed earlier:

const workSheet = XLSX.utils.json_to_sheet(items); // convert the array of objects into a worksheet
const workBook = XLSX.utils.book_new(); // create an empty workbook
XLSX.utils.book_append_sheet(workBook, workSheet, "Sheet 1"); // add the worksheet to the workbook
XLSX.writeFile(workBook, "./sample.xlsx"); // write the workbook to disk

Here is the final code:

const puppeteer = require('puppeteer');
const XLSX = require('xlsx');

(async () => {
  // Launch the browser and open a new blank page
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null
  });
  const page = await browser.newPage();

  // Navigate the page to a URL
  await page.goto('https://www.amazon.com/s?k=gifts+for+mom&_encoding=UTF8&content-id=amzn1.sym.b1718f3d-ff3b-4ea9-94ad-229b681f4963&pd_rd_r=bacf8c18-88bb-4f64-8b2c-078343c8b07b&pd_rd_w=WOwv3&pd_rd_wg=GtzYP&pf_rd_p=b1718f3d-ff3b-4ea9-94ad-229b681f4963&pf_rd_r=0Z61C8QBS7N9V6P832FQ&qid=1715233337&xpid=k0Az372LpDs7Z&ref=sr_pg_1');

  let items = []; // Collected product data
  let btnDisabled = false;
  while (!btnDisabled) {
    // Re-select the products on every page; old handles go stale after navigation
    const products = await page.$$('div.s-main-slot > .s-result-item');
    for (const product of products) {
      try {
        const title = await page.evaluate(el => el.querySelector("h2 > a > span").textContent, product);
        const price = await page.evaluate(el => el.querySelector(".a-price > .a-offscreen").textContent, product);
        const image = await page.evaluate(el => el.querySelector(".s-image").getAttribute('src'), product);
        items.push({
          title, price, image
        });
      } catch (error) {
        // Skip result items (ads, banners) that lack these selectors
      }
    }
    await page.waitForSelector('.s-pagination-next', { visible: true });
    btnDisabled = await page.$('.s-pagination-next.s-pagination-disabled') !== null;
    console.log({ btnDisabled });
    if (!btnDisabled) {
      // Click "Next" and wait for the new page to finish loading
      await Promise.all([
        page.waitForNavigation(),
        page.click('.s-pagination-next')
      ]);
    }
  }

  const workSheet = XLSX.utils.json_to_sheet(items);
  const workBook = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(workBook, workSheet, "Sheet 1");
  XLSX.writeFile(workBook, "./sample.xlsx");

  await browser.close();
})();

Now, run the following command in the terminal:

node index.js

It will generate a “sample.xlsx” file.
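
If you want to sanity-check the result, you can read the file back with the same “xlsx” package; a quick, optional sketch:

const XLSX = require('xlsx');

const workBook = XLSX.readFile('./sample.xlsx');
const rows = XLSX.utils.sheet_to_json(workBook.Sheets['Sheet 1']);
console.log(rows.slice(0, 3)); // print the first three scraped products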

Conclusion

Web scraping with Puppeteer in Node.js opens up a world of possibilities for automating data extraction from the web. Whether you’re gathering market data, monitoring competitors, or conducting research, Puppeteer gives you the tools you need to navigate today’s complex websites and extract the data you’re looking for.

However, it’s important to use web scraping responsibly and ethically.

Always respect website terms of service, robots.txt files, and rate limits to avoid legal issues and server overload. With great power comes great responsibility, so wield Puppeteer wisely in your web scraping endeavors.
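
For example, one simple courtesy is to pause between page loads. A hypothetical sleep helper you could add to the pagination loop above:

// A small helper to pause execution and avoid hammering the server
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// ...inside the pagination loop, before clicking "Next":
await sleep(2000); // wait two seconds between pages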
