4 Best Python HTML Parsers

Published Date Aug 23, 2024
Read 10 min
TL;DR

  • Compares 4 Python HTML parsers: BeautifulSoup, lxml, html5lib, PyQuery.

  • BS4: easiest; handles messy HTML. lxml: fast with XPath / XSLT; great for XML.

  • html5lib: HTML5-correct but slow / high memory. PyQuery: jQuery-style selectors on lxml.

  • Takeaway: default to BS4 for messy pages, lxml for XML; others are situational; Scrapy / requests-html also noted.

There is a lot of data available on the internet, and much of it is useful. You can analyze it, make better decisions, and even try to predict changes in the stock market. But there is a gap between this raw data and your decision-making graphs, and HTML parsing is what fills it.

If you want to use this data for your personal or business needs, you must scrape and clean it.

Raw HTML is not human-readable, so you need a mechanism to extract the useful parts and turn them into a readable, structured form. This technique is called HTML parsing.

In this blog, we will talk about the best Python HTML parsing libraries available. I have also included a table at the very end that compares all of them.

Many new coders get confused while choosing a suitable parsing library. 

Python is supported by a very large community and therefore it comes with multiple options for parsing HTML.

Here are the criteria I used to select the HTML parsing libraries for this blog.

  • Ease of Use and Readability

  • Performance and Efficiency

  • Error Handling and Robustness

  • Community and Support

  • Documentation and Learning Resources

4 Python HTML Parsing Libraries

BeautifulSoup

BeautifulSoup is the most popular of the HTML parsing libraries. It can help you parse HTML and XML documents with ease. Once you read the documentation, you will find it very easy to create parse trees and extract useful data from them.

Since it is a third-party package, you have to install it in your project environment with pip install beautifulsoup4. Let’s understand how we can use it in Python with a small example.

The first step is to import it into your Python script. Of course, you first have to scrape the data from the target website, but for this blog we are going to focus only on the parsing part.

You can refer to web scraping with Python to learn more about the scraping part.

Example

Let’s say we have the following simple HTML document as a string.

<!DOCTYPE html>
<html>
<head>
    <title>Sample HTML Page</title>
</head>
<body>
    <h1>Welcome to BeautifulSoup Example</h1>
    <p>This is a paragraph of text.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
</body>
</html>

Here’s a Python code example using BeautifulSoup.

from bs4 import BeautifulSoup

# Sample HTML content
html = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample HTML Page</title>
</head>
<body>
    <h1>Welcome to BeautifulSoup Example</h1>
    <p>This is a paragraph of text.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
</body>
</html>
"""

# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Accessing Elements
print("Title of the Page:", soup.title.text)  # Access the title element
print("Heading:", soup.h1.text)  # Access the heading element
print("Paragraph Text:", soup.p.text)  # Access the paragraph element's text

# Accessing List Items
ul = soup.ul  # Access the unordered list element
items = ul.find_all('li')  # Find all list items within the ul
print("List Items:")
for item in items:
    print("- " + item.text)

Let me explain the code step by step:

  1. We import the BeautifulSoup class from the bs4 library and create an instance of it by passing our HTML content and the parser to use (in this case, 'html.parser').

  2. We access specific elements in the HTML using the BeautifulSoup object. For example, we access the title, heading (h1), and paragraph (p) elements using the .text attribute to extract their text content.

  3. We access the unordered list (ul) element and then use .find_all('li') to find all list items (li) within it. We iterate through these list items and print their text.

Once you run this code you will get the following output.

Title of the Page: Sample HTML Page
Heading: Welcome to BeautifulSoup Example
Paragraph Text: This is a paragraph of text.
List Items:
- Item 1
- Item 2
- Item 3

You can adapt similar techniques for more complex web scraping and data extraction tasks. If you want to learn more about BeautifulSoup, you should read web scraping with BeautifulSoup.
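As one example of adapting the technique, here is a minimal sketch that uses CSS selectors through select() and reads an attribute with get(). The markup, class names, and URLs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical product listing; class names and URLs are invented for this sketch.
html = """
<div class="products">
  <a class="product" href="/items/1">Keyboard</a>
  <a class="product" href="/items/2">Mouse</a>
  <a class="ad" href="/promo">Sponsored link</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
products = soup.select("div.products a.product")

# Pair each link's visible text with its href attribute.
links = [(a.get_text(strip=True), a.get("href")) for a in products]

for name, href in links:
    print(name, "->", href)
```

The selector skips the link with class "ad", so only the two product links are printed.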

lxml

lxml is considered to be one of the fastest parsing libraries available. It receives regular updates (at the time of writing, the latest release was in July 2023). Its ElementTree-compatible API gives you Pythonic access to libxml2 and libxslt, the C toolkits it uses for parsing HTML and XML. It has great documentation and community support.

BeautifulSoup also supports lxml as its underlying parser. You can use it by passing "lxml" as the second argument to the BeautifulSoup constructor.
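A minimal sketch of that, assuming both beautifulsoup4 and lxml are installed (the HTML string is just a placeholder):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1><p>Parsed with lxml</p></body></html>"

# Same BeautifulSoup API as before; only the parser name changes.
# If lxml is not installed, bs4 raises a FeatureNotFound error here.
soup = BeautifulSoup(html, "lxml")

heading = soup.h1.text
paragraph = soup.p.text
print(heading, "|", paragraph)
```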

lxml can parse both HTML and XML documents with high speed and efficiency. It follows standards closely and provides excellent support for XML namespaces, XPath, and CSS selectors.
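For the HTML side, a short sketch using the lxml.html module together with XPath might look like this. The page content and the "menu" id are invented for the example.

```python
from lxml import html as lxml_html

# Hypothetical page; the id and URLs are made up for this sketch.
page = """
<html>
  <head><title>lxml HTML Example</title></head>
  <body>
    <ul id="menu">
      <li><a href="/home">Home</a></li>
      <li><a href="/blog">Blog</a></li>
    </ul>
  </body>
</html>
"""

# lxml.html.fromstring() uses lxml's lenient HTML parser and returns the root element.
doc = lxml_html.fromstring(page)

title = doc.findtext(".//title")

# XPath: text and href of every link inside the list with id="menu".
link_texts = doc.xpath('//ul[@id="menu"]/li/a/text()')
link_urls = doc.xpath('//ul[@id="menu"]/li/a/@href')

print(title)
print(list(zip(link_texts, link_urls)))
```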

In my experience, you should always prefer BS4 when dealing with messy HTML and use lxml when you are dealing with XML documents.

Like BeautifulSoup, this is a third-party package that needs to be installed before you can use it in your script. You can do that with pip install lxml.

Let me show you how it can be used with a small example.

Example

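<bookstore>
    <book>
        <title>Python Programming</title>
        <author>Manthan Koolwal</author>
        <price>36</price>
    </book>
    <book>
        <title>Web Development with Python</title>
        <author>John Smith</author>
        <price>34</price>
    </book>
</bookstore>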

Our objective is to extract this text using lxml.

from lxml import etree

# Sample XML content
xml = """
<bookstore>
    <book>
        <title>Python Programming</title>
        <author>Manthan Koolwal</author>
        <price>36</price>
    </book>
    <book>
        <title>Web Development with Python</title>
        <author>John Smith</author>
        <price>34</price>
    </book>
</bookstore>
"""

# Parse the XML string; etree.XML() returns the root <bookstore> element
tree = etree.XML(xml)

# Accessing Elements
for book in tree.findall("book"):
    title = book.find("title").text
    author = book.find("author").text
    price = book.find("price").text
    print("Title:", title)
    print("Author:", author)
    print("Price:", price)
    print("---")

Let me explain the above code step by step.

  1. We import the etree module from the lxml library and parse our XML content with etree.XML(), which returns the root <bookstore> element.

  2. We access specific elements in the XML using the find() and findall() methods. For example, we find all <book> elements within the <bookstore> using tree.findall("book").

  3. Inside the loop, we access the <title>, <author>, and <price> elements within each <book> element using book.find("element_name").text.

The output will look like this.

Title: Python Programming
Author: Manthan Koolwal
Price: 36
---
Title: Web Development with Python
Author: John Smith
Price: 34
---
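The same bookstore document can also be queried with XPath, which lxml exposes through the xpath() method. This is a small sketch; the price threshold of 35 is an arbitrary value chosen for illustration.

```python
from lxml import etree

xml = """
<bookstore>
    <book>
        <title>Python Programming</title>
        <author>Manthan Koolwal</author>
        <price>36</price>
    </book>
    <book>
        <title>Web Development with Python</title>
        <author>John Smith</author>
        <price>34</price>
    </book>
</bookstore>
"""

root = etree.XML(xml)

# All book titles, returned as a list of strings.
all_titles = root.xpath("//book/title/text()")

# Only books whose <price> is greater than 35 (arbitrary threshold).
expensive = root.xpath("//book[price > 35]/title/text()")

print(all_titles)
print(expensive)
```

XPath lets you push the filtering into the query itself instead of looping over every element in Python.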

If you want to learn more about this library then you should definitely check out our guide Web Scraping with Xpath and Python.

html5lib

html5lib is another great contender on this list, and it works well for parsing the latest HTML5. It is a pure-Python library designed specifically for HTML rather than XML, and it parses pages the same way modern browsers do.

It can parse documents even when they contain missing or improperly closed tags, making it valuable for web scraping tasks where the quality of HTML varies. html5lib produces a DOM-like tree structure, allowing you to navigate and manipulate the parsed document easily, similar to how you would interact with the Document Object Model (DOM) in a web browser.

Whether you’re working with modern web pages and HTML5 documents or need a parsing library capable of handling the latest web standards, html5lib is a reliable choice to consider.

Again this needs to be installed before you start using it. You can simply do it by pip install html5lib. After this step, you can directly import this library inside your Python script.

Example

import html5lib

# Sample HTML5 content
html5 = """
<!DOCTYPE html>
<html>
<head>
    <title>HTML5lib Example</title>
</head>
<body>
    <h1>Welcome to HTML5lib</h1>
    <p>This is a paragraph of text.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
</body>
</html>
"""

# Parse the HTML5 document into an ElementTree-style tree.
# namespaceHTMLElements=False keeps tag names plain ("title" instead of
# "{http://www.w3.org/1999/xhtml}title"), so simple paths work.
parser = html5lib.HTMLParser(
    tree=html5lib.getTreeBuilder("etree"),
    namespaceHTMLElements=False,
)
tree = parser.parse(html5)

# Accessing Elements (".//" searches the whole tree, not only direct children)
title = tree.find(".//title").text
heading = tree.find(".//h1").text
paragraph = tree.find(".//p").text
list_items = tree.findall(".//ul/li")

print("Title:", title)
print("Heading:", heading)
print("Paragraph Text:", paragraph)
print("List Items:")
for item in list_items:
    print("- " + item.text)

Explanation of the code:

  1. We import the html5lib library, which provides the HTML5 parsing capabilities we need.

  2. We define the HTML5 content as a string in the html5 variable.

  3. We create an HTML5 parser using html5lib.HTMLParser and specify the "etree" tree builder, which produces an ElementTree-style tree that supports find() and findall(). The "dom" tree builder would return an xml.dom.minidom document instead, which does not have these methods. We also pass namespaceHTMLElements=False so that tag names are not prefixed with the XHTML namespace.

  4. We parse the HTML5 document using the created parser, resulting in a parse tree.

  5. We access specific elements in the parse tree using the find() and findall() methods. For example, we find the <title>, <h1>, <p>, and <li> elements and their text content.

Once you run this code you will get this.

Title: HTML5lib Example
Heading: Welcome to HTML5lib
Paragraph Text: This is a paragraph of text.
List Items:
- Item 1
- Item 2
- Item 3
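To see the tolerance for missing closing tags mentioned earlier, here is a minimal sketch that feeds html5lib a deliberately broken fragment. The fragment is invented for illustration; none of its <p> or <li> tags are closed.

```python
import html5lib

# Deliberately broken markup: no closing </p> or </li> tags, no <html>/<body>.
broken = "<p>First paragraph<p>Second paragraph<ul><li>Item 1<li>Item 2</ul>"

# html5lib.parse() uses the "etree" tree builder by default and returns the
# root <html> element; namespaceHTMLElements=False keeps tag names plain.
root = html5lib.parse(broken, namespaceHTMLElements=False)

paragraphs = [p.text for p in root.findall(".//p")]
items = [li.text for li in root.findall(".//li")]

print(paragraphs)
print(items)
```

The parser applies the HTML5 rules for implied end tags, so the paragraphs and list items come out as separate sibling elements, the way a browser would build them.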

You can refer to its documentation if you want to learn more about this library.

PyQuery

With PyQuery you can use jQuery-style syntax to parse HTML and XML documents. So, if you are already familiar with jQuery, PyQuery will be a piece of cake for you. Behind the scenes, it uses lxml for parsing and manipulation.

Its application is similar to BeautifulSoup and lxml. With PyQuery, you can easily navigate and manipulate documents, select specific elements, extract text or attribute values, and perform various operations on the parsed content.

This library receives regular updates and has growing community support. PyQuery supports CSS selectors, allowing you to select and manipulate elements in a document using familiar CSS selector expressions. Like the other libraries, it is a third-party package that you can install with pip install pyquery.

Example

from pyquery import PyQuery as pq

# Sample HTML content
html = """
<html>
<head>
    <title>PyQuery Example</title>
</head>
<body>
    <h1>Welcome to PyQuery</h1>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
</body>
</html>
"""

# Create a PyQuery object
doc = pq(html)

# Accessing Elements
title = doc("title").text()
heading = doc("h1").text()
list_items = doc("ul li")

print("Title:", title)
print("Heading:", heading)
print("List Items:")
for item in list_items:
    # Iterating yields raw lxml elements, so wrap each one in pq() again
    print("- " + pq(item).text())

Understand the above code:

  1. We import the PyQuery class from the pyquery library.

  2. We define the HTML content as a string in the html variable.

  3. We create a PyQuery object doc by passing the HTML content.

  4. We use PyQuery’s CSS selector syntax to select specific elements in the document. For example, doc("title") selects the <title> element.

  5. We extract text content from selected elements using the text() method.

Once you run this code you will get this.

Title: PyQuery Example
Heading: Welcome to PyQuery
List Items:
- Item 1
- Item 2
- Item 3

I have listed the pros and cons of each library to help you choose one.

| Library | Pros | Cons |
| --- | --- | --- |
| BeautifulSoup | User-friendly; handles poorly formed HTML; supports multiple parsers; extensive community support | Slower performance; requires additional parsers for optimal speed |
| lxml | High performance; supports XPath and XSLT; robust error handling; parses both HTML and XML | Complex installation; less intuitive API for beginners |
| html5lib | Fully implements HTML5 parsing; handles all edge cases; produces browser-like parse tree | Very slow; high memory usage; not suitable for large-scale parsing |
| pyquery | jQuery-like syntax; supports CSS selectors; built on lxml for good performance | Limited community support; may not handle malformed HTML as gracefully |
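The speed differences are easy to check on your own machine. Below is a rough sketch, not a rigorous benchmark: it times BeautifulSoup with three different underlying parsers on a synthetic 500-item page. The page, the run count of 10, and the helper name parse_with are all choices made for this illustration, and it assumes lxml and html5lib are installed alongside beautifulsoup4.

```python
import timeit

from bs4 import BeautifulSoup

# Synthetic test page with 500 list items (made up for this sketch).
html = (
    "<html><body><ul>"
    + "".join(f"<li>Item {i}</li>" for i in range(500))
    + "</ul></body></html>"
)


def parse_with(parser_name):
    """Parse the page with the given parser and count the <li> elements."""
    soup = BeautifulSoup(html, parser_name)
    return len(soup.find_all("li"))


timings = {}
for parser_name in ("html.parser", "lxml", "html5lib"):
    # Time 10 full parses with each parser.
    timings[parser_name] = timeit.timeit(lambda: parse_with(parser_name), number=10)

# Print the fastest parser first.
for parser_name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{parser_name:12s} {seconds:.3f}s for 10 runs")
```

Absolute numbers depend on your hardware and on the pages you parse, so treat the result as a relative comparison only.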

Conclusion

I hope things are pretty clear now. You have multiple options for parsing, but if you dig deeper you will realize that very few of them can be used in production. If you want to mass-scrape websites, BeautifulSoup should be your go-to choice, and if you want to parse XML, lxml should be your choice.

Of course, the list does not end here. There are other options like requests-html, Scrapy, etc., but the community support received by BeautifulSoup and lxml is next level.

You should also try these libraries on a live website. Scrape some websites and use one of these libraries to parse out the data so you can draw your own conclusion. If you want to crawl a complete website, Scrapy is a great choice. We have also explained web crawling in Python; it’s a great tutorial that you should read.

I hope you like this tutorial and if you do then please do not forget to share it with your friends and on your social media.
