4 Best HTML Parsing Libraries – JavaScript

In this article, we are going to look at different JavaScript HTML parsing libraries and use them to extract important data from HTML. This parsing step usually comes right after your web scraper downloads a page: first you fetch the HTML, then you parse it.

We are going to test the 4 best HTML parsers in JavaScript and weigh the advantages and disadvantages of each.

First, we are going to download a page using any HTTP client and then use these libraries to parse the data. At the end of this tutorial, you will be able to make a clear choice for your next web scraping project as to which JavaScript parsing library you can use.

Scraping a Page

The first step would be to download the HTML code of any website. For this tutorial, we are going to use this page.

We are going to use the unirest library to download data from the target page. First, let's install it in our coding environment.

npm i unirest

Once this is done, we can write the code to download the HTML data. The code is pretty simple and straightforward.

const unirest = require('unirest');

async function scraper(scraping_url){
  let res;
  try{
    res = await unirest.get(scraping_url)
    return {body:res.body,status:200}
  }catch(err){
    return {body:'Something went wrong',status:400}
  }
}

scraper('https://books.toscrape.com/').then((res) => {
  console.log(res.body)
}).catch((err) => {
  console.log(err)
})

Let me explain this code step by step.

  1. const unirest = require('unirest');: This line imports the unirest library, which is a simplified HTTP client for making requests to web servers.
  2. async function scraper(scraping_url) { ... }: This is an async function named scraper that takes a single parameter, scraping_url, which represents the URL to be scraped.
  3. let res;: This initializes a variable res that will be used to store the response from the HTTP request.
  4. try { ... } catch (err) { ... }: This is a try-catch block that wraps the code responsible for making the HTTP request. res = await unirest.get(scraping_url): This line makes an asynchronous HTTP GET request to the specified scraping_url using the unirest library. The await keyword is used to wait for the response before proceeding. The response is stored in the res variable.
  5. return {body: res.body, status: 200}: If the HTTP request is successful (no errors are thrown), this line returns an object containing the response body (res.body) and an HTTP status code (status) of 200 (indicating success).
  6. return {body: 'Something went wrong', status: 400}: If an error is caught during the HTTP request (inside the catch block), this line returns an object with a generic error message ('Something went wrong') and an HTTP status code of 400 (indicating a client error).
  7. scraper('https://books.toscrape.com/')...: This line calls the scraper function with the URL 'https://books.toscrape.com/' and then uses the .then() and .catch() methods to handle the result or any errors.
  8. .then((res) => { console.log(res.body) }): If the promise returned by the scraper function is fulfilled (resolved successfully), this callback runs and logs the response body to the console. .catch((err) => { console.log(err) }): If the promise is rejected (an error occurs), this callback runs and logs the error to the console.
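
Note that unirest is just one choice of HTTP client here; any client that returns the raw HTML will do. For comparison, here is a minimal sketch of the same download using the fetch API that ships with Node.js 18+ (no extra package needed):

// Same download using the fetch API built into Node 18+, with no external HTTP client required.
async function scraper(scraping_url) {
  try {
    const res = await fetch(scraping_url);
    return { body: await res.text(), status: res.status };
  } catch (err) {
    return { body: 'Something went wrong', status: 400 };
  }
}

scraper('https://books.toscrape.com/').then((res) => {
  console.log(res.status); // 200 on success
});

Either client hands you the same HTML string, so everything that follows works unchanged.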

This is the point where parsing comes in: we will now use different techniques to extract the important data from the downloaded HTML.

JavaScript HTML Parsers

Let's first decide exactly what we are going to extract, and then we will use and test the different JavaScript parsing libraries on that task.

We are going to scrape:

  1. Name of the book
  2. Price of the book

The JavaScript HTML parsing libraries we are going to cover in this article:

  • Cheerio
  • Parse5
  • htmlparser2
  • DOMParser

Cheerio

Cheerio is by far the most popular library when it comes to HTML parsing in JavaScript. If you are familiar with jQuery, this library is extremely simple to use.

Since it is a third-party library, you have to install it before you can use it.

npm install cheerio

Let's now parse the book titles and their prices using Cheerio. Note that both unirest and cheerio have to be required at the top of the script:

const unirest = require('unirest');
const cheerio = require('cheerio');

async function scraper(scraping_url){
  let res;
  try{
    res = await unirest.get(scraping_url)
    return {body:res.body,status:200}
  }catch(err){
    return {body:'Something went wrong',status:400}
  }
}
scraper('https://books.toscrape.com/').then((res) => {
  const $ = cheerio.load(res.body)

  const books = [];

  $('.product_pod').each((index, element) => {
    const title = $(element).find('h3 > a').attr('title');
    const price = $(element).find('.price_color').text();

    books.push({ title, price });
  });

  console.log(books);
}).catch((err) => {
  console.log(err)
})

This code uses Unirest to make an HTTP GET request to the given URL and fetches the HTML content of the page. Then, Cheerio is used to parse the HTML and extract the book titles and prices using appropriate CSS selectors. The extracted data is stored in an array of objects, each representing a book with its title and price. Finally, the code prints out the extracted book data.
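
Beyond the .find(), .attr(), and .text() calls used above, Cheerio ships many other jQuery-style helpers. Here is a small self-contained sketch (the sample HTML is made up purely for illustration):

const cheerio = require('cheerio');

const $ = cheerio.load('<ul><li class="first">One</li><li>Two</li></ul>');

// .map() plus .get() turns a selection into a plain JavaScript array
const items = $('li').map((i, el) => $(el).text()).get();
console.log(items); // [ 'One', 'Two' ]

// .first() and .hasClass() behave just like their jQuery counterparts
console.log($('li').first().hasClass('first')); // true

// $.html() serializes the (possibly modified) document back to a string
$('li').last().text('Two (edited)');
console.log($.html());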

You can read Web Scraping with Node.js to understand how Cheerio can be used to scrape valuable data from the internet.

Advantages

  • Since this JavaScript parsing library runs on the backend, it is considerably faster than solutions built for browser use.
  • It supports CSS selectors.
  • Error handling is quite easy in Cheerio.

Disadvantages

  • Developers who are not familiar with jQuery might experience a steep learning curve.

I found just this one disadvantage of using Cheerio, and I had to work through it myself because I was not familiar with jQuery. Once you learn how it works, you will never look for another parsing alternative.

htmlparser2

htmlparser2 is another popular choice among JavaScript developers for parsing HTML and XML documents. Also, do not get confused by its name; it is a totally separate project from htmlparser.

This is how you can install it:

npm i htmlparser2

Let's use this JavaScript library to parse the data. On this page, each book's title lives in the title attribute of the <a> tag inside an <h3>, and its price is the text of a <p class="price_color"> tag, so we track a little state while the parser streams through the document.

const unirest = require('unirest');
const htmlparser2 = require('htmlparser2');

const url = 'https://books.toscrape.com/';

unirest.get(url).end(response => {
  if (response.error) {
    console.error('Error:', response.error);
    return;
  }

  const books = [];
  let currentBook = {};    // the book currently being processed
  let insideH3 = false;    // true while we are inside a book's <h3> heading
  let insidePrice = false; // true while we are inside <p class="price_color">

  const parser = new htmlparser2.Parser({
    onopentag(name, attributes) {
      if (name === 'h3') {
        insideH3 = true;
      }
      // The full book title lives in the title attribute of the <a>
      // nested inside each <h3>
      if (insideH3 && name === 'a' && attributes.title) {
        currentBook.title = attributes.title;
      }
      if (name === 'p' && attributes.class === 'price_color') {
        insidePrice = true;
      }
    },
    ontext(text) {
      // Collect the price text while the price flag is set
      if (insidePrice) {
        currentBook.price = (currentBook.price || '') + text.trim();
      }
    },
    onclosetag(name) {
      if (name === 'h3') {
        insideH3 = false;
      }
      if (name === 'p' && insidePrice) {
        insidePrice = false;
        // The price follows the title on this page, so the book is complete
        books.push(currentBook);
        currentBook = {};
      }
    }
  }, { decodeEntities: true });

  parser.write(response.body);
  parser.end();

  console.log('Books:');
  books.forEach((book, index) => {
    console.log(`${index + 1}. Title: ${book.title}, Price: ${book.price}`);
  });
});

This library might be new to many readers, so let me explain the code.

  1. Unirest and htmlparser2 were imported.
  2. Target URL was set for making the HTTP request.
  3. Created an empty array books to store information about each book. We also initialize an empty object currentBook to temporarily store data about the book currently being processed.
  4. We create a new instance of htmlparser2.Parser, which allows us to define event handlers for different HTML elements encountered during parsing.
  5. When an opening HTML tag is encountered, we first track whether we are inside an <h3> tag. The full book title lives in the title attribute of the <a> nested inside each <h3>, so when we see that <a>, we read the title straight from its attribute. If we encounter a <p> tag with the class “price_color”, we set the insidePrice flag to indicate that we’re currently processing the price.
  6. When text content inside an element is encountered while insidePrice is set, we append it to currentBook.price.
  7. When a closing </p> tag is encountered while insidePrice is set, the book's data is complete: we push currentBook into the books array and reset it for the next book. A closing </h3> simply clears the insideH3 flag.
  8. We use parser.write(response.body) to start the parsing process using the HTML content from the response. After parsing is complete, we call parser.end() to finalize the process.
  9. Finally, we loop through the books array and print out the extracted book titles and prices.
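
One detail from step 8 is worth a closer look: parser.write() does not need the whole document at once. You can feed htmlparser2 HTML chunk by chunk, even with a tag split across chunks, which is what makes it practical for very large documents and HTTP streams. A minimal sketch:

const htmlparser2 = require('htmlparser2');

const parser = new htmlparser2.Parser({
  onopentag(name) {
    if (name === 'title') console.log('entered <title>');
  },
  ontext(text) {
    if (text.trim()) console.log('text:', text.trim());
  },
});

// The document arrives in arbitrary slices; even a split in the middle of a tag is handled.
parser.write('<html><head><ti');
parser.write('tle>Hello, chunked world</title></head></html>');
parser.end();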

Advantages

  • This JavaScript library can parse large HTML documents chunk by chunk (as in the sketch above), without loading the whole document into memory.
  • It comes with cross-browser compatibility.
  • This library allows you to define your own event handlers and logic, making it highly customizable for your parsing needs.

Disadvantages

  • It could be a little challenging for someone new to scraping.
  • No methods for DOM manipulation.

Parse5

Parse5 works both on the backend and in browsers. It is extremely fast and parses HTML according to the WHATWG HTML5 specification, so even modern HTML5 documents are parsed accurately. In fact, parse5 sits underneath several other tools: both cheerio and jsdom rely on it, and you can combine it with them for more complex parsing jobs.

Parse5 exposes the parsed document as a plain tree of nodes (with childNodes, tagName, attrs, and so on) rather than a browser-style DOM, so we walk the tree ourselves with a few small helper functions:

const unirest = require('unirest');
const parse5 = require('parse5');

const url = 'http://books.toscrape.com/';

unirest.get(url).end(response => {
  if (response.error) {
    console.error('Error:', response.error);
    return;
  }

  const books = [];
  const document = parse5.parse(response.body);

  // Read an attribute from a parse5 node (attributes live in node.attrs)
  function getAttr(node, name) {
    const attr = node.attrs && node.attrs.find(a => a.name === name);
    return attr ? attr.value : null;
  }

  function hasClass(node, className) {
    const cls = getAttr(node, 'class');
    return cls ? cls.split(' ').includes(className) : false;
  }

  // Concatenate the text of all descendant text nodes
  function getText(node) {
    if (node.nodeName === '#text') return node.value;
    return (node.childNodes || []).map(getText).join('');
  }

  // Depth-first search for the first descendant matching a predicate
  function findNode(node, predicate) {
    if (predicate(node)) return node;
    for (const child of node.childNodes || []) {
      const found = findNode(child, predicate);
      if (found) return found;
    }
    return null;
  }

  function extractBooksInfo(node) {
    // Each book sits inside an <article class="product_pod">
    if (node.tagName === 'article' && hasClass(node, 'product_pod')) {
      const titleLink = findNode(node, n => n.tagName === 'a' && getAttr(n, 'title'));
      const priceNode = findNode(node, n => n.tagName === 'p' && hasClass(n, 'price_color'));
      if (titleLink && priceNode) {
        books.push({
          title: getAttr(titleLink, 'title'),
          price: getText(priceNode).trim()
        });
      }
      return; // product_pods are not nested, so no need to recurse further
    }
    (node.childNodes || []).forEach(extractBooksInfo);
  }

  extractBooksInfo(document);

  console.log('Books:');
  books.forEach((book, index) => {
    console.log(`${index + 1}. Title: ${book.title}, Price: ${book.price}`);
  });
});

Advantages

  • You can serialize a parsed document back to an HTML string (see the sketch below). This can help you create new HTML content.
  • For very large documents, parse5's companion SAX-style streaming parser can process HTML as a stream instead of holding the whole tree in memory at once.
  • It has great community support, which keeps this library fast and robust.
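
The first advantage deserves a quick illustration: parse5.serialize() turns a parsed tree back into an HTML string, so you can parse a document, modify the tree, and re-emit it. A minimal sketch:

const parse5 = require('parse5');

// Parse a fragment into a full document tree...
const document = parse5.parse('<ul><li>One</li></ul>');

// ...and serialize it back out. Note how parse5 fills in the implied
// <html>, <head>, and <body> elements, exactly as a browser would.
console.log(parse5.serialize(document));
// <html><head></head><body><ul><li>One</li></ul></body></html>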

Disadvantages

  • The documentation is very confusing. When you open the API reference, you will find little more than method names; if you are new, you will be lost and might end up picking an alternative library.

DOMParser

This is a parser for HTML and XML that is built into browsers; almost all browsers support it. Keep in mind that it is a browser API: it is not available in Node.js out of the box, although jsdom provides a compatible implementation on the server.
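
In a browser, using it takes only a couple of lines; you can paste this sketch straight into the DevTools console:

// Runs as-is in any modern browser console, no imports needed
const doc = new DOMParser().parseFromString('<p class="greeting">Hello</p>', 'text/html');
console.log(doc.querySelector('.greeting').textContent); // "Hello"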

Let's write code to parse the book titles and prices using DOMParser. Since our scraper runs in Node.js rather than a browser, the example below borrows a DOMParser implementation from jsdom (npm i jsdom); in a browser you would simply call new DOMParser().

const unirest = require('unirest');
// DOMParser is a browser API; in Node.js we get a working implementation from jsdom
const { JSDOM } = require('jsdom');
const parser = new (new JSDOM().window.DOMParser)();

async function scraper(scraping_url){
  let res;
  try{
    res = await unirest.get(scraping_url)
    return {body:res.body,status:200}
  }catch(err){
    return {body:'Something went wrong',status:400}
  }
}

scraper('https://books.toscrape.com/').then((res) => {
  // Parse the downloaded HTML content
  const doc = parser.parseFromString(res.body, "text/html");

  // Extract book titles and prices
  const bookElements = doc.querySelectorAll(".product_pod");
  const books = [];
  bookElements.forEach((bookElement) => {
    const title = bookElement.querySelector("h3 > a").getAttribute("title");
    const price = bookElement.querySelector(".price_color").textContent;
    books.push({ title, price });
  });

  // Print the extracted book titles and prices
  books.forEach((book, index) => {
    console.log(`Book ${index + 1}:`);
    console.log(`Title: ${book.title}`);
    console.log(`Price: ${book.price}`);
    console.log("----------------------");
  });
}).catch((err) => {
  console.log(err)
})

Advantages

  • DOMParser is built into the browser, so there is no need to download any external package to use it on the client side.
  • You can even create a new DOM document using DOMParser. Of course, some changes to the code above would be needed.
  • It comes with cross-browser compatibility.

Disadvantages

  • If the HTML document is too large, DOMParser will consume a lot of memory, which can slow down your server and API performance.
  • Error handling is not robust compared to other third-party libraries like Cheerio.

Here's a concise table summarizing the pros and cons of each JavaScript HTML parsing library covered in this article:

Library | Pros | Cons
Cheerio | Fast server-side parsing; jQuery-style CSS selectors; easy error handling | Learning curve for developers unfamiliar with jQuery
htmlparser2 | Fast and forgiving; streaming (chunk-by-chunk) parsing; fully customizable event handlers | Lower-level API that can challenge newcomers; no DOM manipulation methods
Parse5 | Spec-compliant HTML5 parsing; can serialize a tree back to an HTML string; strong community support | Sparse, hard-to-navigate documentation
DOMParser | Built into browsers, nothing to install; can create new DOM documents; cross-browser | Heavy memory use on large documents; weaker error handling than third-party libraries

Conclusion

There are many options for you to choose from but, to be honest, only a few hold up when you dig a little deeper. My personal favorite is Cheerio, and I have been using it for about four years now.

In this article, I have tried to present the positives and negatives of the top parsing libraries, which I am sure will help you figure out the best one for your project.

Of course, there are more than just four libraries out there, but I think these four are the best ones.

I would advise you to use a Web Scraping API like Scrapingdog while scraping any website. It also integrates very easily with all of these libraries.
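
For example, here is a sketch of pairing such an API with Cheerio. The endpoint and parameters below follow Scrapingdog's documented pattern, but treat them as assumptions and check the current docs before relying on them:

const unirest = require('unirest');
const cheerio = require('cheerio');

const apiKey = 'YOUR_API_KEY'; // placeholder, use your own key
const target = 'https://books.toscrape.com/';

// The API fetches the page for you (handling proxies and blocking),
// and you parse the returned HTML exactly as before.
unirest
  .get('https://api.scrapingdog.com/scrape')
  .query({ api_key: apiKey, url: target })
  .end((response) => {
    const $ = cheerio.load(response.body);
    console.log($('.product_pod h3 > a').first().attr('title'));
  });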

I hope you liked this little tutorial. If you did, please do not forget to share it with your friends and on social media.

Additional Resources

Web Scraping with Scrapingdog
