Data parsing is like extracting metal from a pile of garbage: when we scrape the web, we receive a large amount of raw data that is mostly useless to us. An HTML parser is what lets us extract the useful pieces from the raw data we get from the target website.
In this tutorial, we will talk about some of the most popular C# HTML parsers. We will discuss them one by one, and in the end you will have a clear picture of which library you should use while parsing data in C#.
Html Agility Pack (HAP)
HTML Agility Pack, aka HAP, is the most widely used HTML parser in the C# community. It is used for loading, parsing, and manipulating HTML documents, and it can parse HTML from a file, a string, or even a URL. It comes with XPath support that helps you find specific HTML elements within the DOM, which is why it is so popular in web scraping projects.
Features
- HAP can help you remove dangerous elements from HTML documents.
- Within the .NET environment, you can manipulate HTML documents.
- It comes with a low memory footprint, which makes it friendly for large projects and ultimately reduces cost as well.
- Its built-in XPath support makes it a first choice for many developers.
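As a quick sketch of those loading options, the snippet below (the file name is a placeholder) writes a small HTML file and loads it with HtmlDocument.Load; HtmlWeb.Load does the same for a live URL.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

// Write a tiny HTML file so the example is self-contained.
string path = Path.Combine(Path.GetTempPath(), "sample.html");
File.WriteAllText(path, "<html><body><p class='title'>Harry Potter</p></body></html>");

// Load from a file...
var doc = new HtmlDocument();
doc.Load(path);
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//p").InnerText);

// ...or straight from a URL (requires network access):
// var web = new HtmlWeb();
// var liveDoc = web.Load("https://www.scrapingdog.com/");
```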
Example
Let’s see how we can use HAP to parse HTML and extract the title text from the sample HTML given below.
<p class="title">Harry Potter</p>
We will use SelectSingleNode to find the p tag inside this raw HTML.
using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<p class='title'>Harry Potter</p>");
HtmlNode title = doc.DocumentNode.SelectSingleNode("//p[@class='title']");
if (title != null)
{
Console.WriteLine(title.InnerText);
}
The output will be Harry Potter. Obviously, this is just a small example of parsing. This library can be used for heavy parsing too.
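To give a feel for heavier parsing, here is a small sketch (the HTML is made up) that pulls every link out of a document with SelectNodes, which returns all matches for an XPath expression (or null if there are none):

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<ul>" +
    "<li><a href='/page1'>Page 1</a></li>" +
    "<li><a href='/page2'>Page 2</a></li>" +
    "</ul>");

// SelectNodes matches every anchor that carries an href attribute.
var links = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var link in links)
{
    Console.WriteLine($"{link.InnerText} -> {link.GetAttributeValue("href", "")}");
}
```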
Advantages
- The API is pretty simple for parsing HTML; even a beginner can use it without much trouble. It is a developer-friendly library.
- Since it supports multiple encoding options, parsing HTML becomes even simpler.
- Thanks to the large community, solving errors is easy for beginners.
Disadvantages
- It cannot parse JavaScript.
- It still has very limited support for HTML5.
- Error handling is quite dated; the community needs to focus on this issue.
- It is designed purely for parsing HTML documents, so if you are thinking of parsing XML you will have to pick another library.
AngleSharp
It’s a lightweight .NET-based HTML and CSS parsing library. It comes with clean documentation, which makes it popular among developers. AngleSharp helps you by providing an interactive DOM while scraping any website.
Features
- It comes with a CSS selector feature, which makes data extraction through HTML & CSS extremely easy.
- Using custom classes, you can handle any specific type of element.
- Built-in support for HTML5 and CSS3 keeps it compatible with new technology.
- It is compatible with the .NET Framework too, which opens many gates for compatibility with various libraries.
Example
Let’s see how AngleSharp works on the same HTML code used above.
using AngleSharp.Html.Parser;
var parser = new HtmlParser();
var document = parser.ParseDocument("<p class='title'>Harry Potter</p>");
var title = document.QuerySelector("p.title").TextContent;
Console.WriteLine(title); // Output: Harry Potter
Here, at first, we used HtmlParser to parse the HTML string into an AngleSharp.Dom.IHtmlDocument object. With QuerySelector, we selected the p tag with the class title. And finally, using TextContent, we extracted the text.
Advantages
- It has a better error-handling mechanism than HAP.
- It is faster compared to other libraries like HAP.
- It comes with built-in support for JavaScript parsing.
- It supports new technologies like HTML5 and CSS3.
Disadvantages
- It has a smaller community than HAP, which makes it difficult for beginners to overcome challenges they might face while using AngleSharp.
- It lacks support for XPath.
- You cannot parse and manipulate HTML forms using AngleSharp.
- It is not a good choice for parsing XML documents.
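Before moving on, one more AngleSharp sketch (the HTML is made up) showing QuerySelectorAll, which grabs several elements at once with a CSS selector:

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser;

var parser = new HtmlParser();
var document = parser.ParseDocument(
    "<ul><li class='book'>Harry Potter</li><li class='book'>The Hobbit</li></ul>");

// QuerySelectorAll returns every element matching the CSS selector.
var books = document.QuerySelectorAll("li.book").Select(e => e.TextContent).ToList();
foreach (var book in books)
{
    Console.WriteLine(book);
}
```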
Awesomium
Awesomium can be used to render any website. By creating an instance you can navigate to a website, and by using the DOM API you can interact with the page as well. It is built on the Chromium Embedded Framework (CEF) and provides a great API for interacting with the webpage.
Features
- API is straightforward which makes interaction with the page dead simple.
- Browser functionalities like notifications and dialog boxes are also supported by this library.
- Works on Mac, Linux, and Windows.
Example
Awesomium is a web automation engine rather than a parsing library, so we will write code that displays www.scrapingdog.com using it.
using Awesomium.Core;
using Awesomium.Windows.Forms;
using System.Windows.Forms;
namespace DisplayScrapingdog
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
// Initialize the WebCore
WebCore.Initialize(new WebConfig()
{
// Add any configuration options here
});
// Create a new WebControl
var webView = new WebControl();
// Add the WebControl to the form
this.Controls.Add(webView);
// Navigate to the website
webView.Source = new Uri("https://www.scrapingdog.com/");
}
}
}
Advantages
- It is a lightweight library with low memory usage.
- It is compatible with HTML5, CSS3, and JavaScript, which makes it popular among developers.
- It is an independent library that does not require any extra dependency to extract raw data.
- You can scrape dynamic JavaScript websites using Awesomium, but you will need an additional library to parse the important data out of the raw data.
Disadvantages
- It comes with limited community support; solving bugs without community help can make it very hard for developers to use in their web scraping projects.
- It does not support all browsers. Hence, scraping certain websites might not be possible.
- It is not open source, so you might end up paying for a license.
Fizzler
Fizzler is another parsing library that is built on top of HAP. The syntax is small and pretty self-explanatory. It uses namespaces for the unique identification of objects. It is a .NET library that does not get active support from the community.
Features
- Using a CSS selector, you can filter and extract elements from any HTML document.
- Since it has almost no external dependencies, it is quite lightweight.
- Fizzler provides CSS selector facilities as well; you can easily search by id, class, type, etc.
Example
Since it is built on top of HAP the syntax will look somewhat similar to it.
using System;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;
var html = "<p class='title'>Harry Potter</p>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var title = doc.DocumentNode.QuerySelector(".title").InnerText;
Console.WriteLine(title);
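To illustrate the id, class, and type selectors mentioned in the features list, here is a small sketch (the HTML is made up) using Fizzler's QuerySelectorAll extension:

```csharp
using System;
using System.Linq;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div id='shelf'>" +
    "<p class='title'>Harry Potter</p>" +
    "<p class='title'>The Hobbit</p>" +
    "</div>");

// QuerySelectorAll accepts the usual CSS forms: #id, .class, element type.
var titles = doc.DocumentNode.QuerySelectorAll("#shelf p.title").ToList();
foreach (var p in titles)
{
    Console.WriteLine(p.InnerText);
}
```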
Advantages
- Unlike Awesomium, it is a free, open-source package; you won’t have to pay for anything.
- CSS selectors can help you parse any website, even a dynamic JavaScript website (once its HTML has been rendered).
- Its fast performance reduces server latency.
Disadvantages
- It might not work as well as other libraries do with large HTML documents.
- Support resources and tutorials on Fizzler are scarce.
Selenium WebDriver
I think you already know what Selenium is capable of. It is the most popular web automation tool, and it works with almost any programming language (C#, Python, NodeJS, etc.). It can run on any browser, including Chrome, Firefox, Safari, etc.
It provides an integration facility for testing frameworks like TestNG and JUnit.
Example
Again, just like Awesomium, it is a web automation tool. So, we will open a website and print its title using Selenium.
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
class Program
{
static void Main(string[] args)
{
IWebDriver driver = new ChromeDriver();
driver.Navigate().GoToUrl("http://books.toscrape.com/");
Console.WriteLine(driver.Title);
driver.Quit();
}
}
Here we are using the ChromeDriver constructor to open the Chrome browser. Then, using GoToUrl(), we navigate to the target website. driver.Title prints the title of the website, and finally driver.Quit() closes the browser.
Features
- You can record videos, take screenshots and you can even log console messages. It’s a complete web automation testing tool.
- It supports almost all browsers and almost all programming languages.
- You can click buttons, fill out forms and navigate between multiple pages.
Advantages
- A clear advantage is its capability to work with almost all browsers and programming languages.
- You can run it in a headless mode which ultimately reduces resource costs and promotes faster execution.
- CSS selectors and XPath both can work with Selenium.
- The community is very large so even a beginner can learn and create a web scraper in no time.
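As a sketch of the headless mode and dual selector support mentioned above (the target page and its h1 element are assumptions, and the Selenium.WebDriver package plus a matching chromedriver are required), this runs Chrome without a visible window and locates the same element by CSS selector and by XPath:

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Headless Chrome: no visible window, lower resource cost.
var options = new ChromeOptions();
options.AddArgument("--headless=new");

IWebDriver driver = new ChromeDriver(options);
driver.Navigate().GoToUrl("http://books.toscrape.com/");

// The same element can be located with a CSS selector or with XPath.
var byCss = driver.FindElement(By.CssSelector("h1"));
var byXPath = driver.FindElement(By.XPath("//h1"));
Console.WriteLine(byCss.Text);
Console.WriteLine(byXPath.Text);

driver.Quit();
```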
Disadvantages
- It does not support mobile application testing, although there are alternatives for that (such as Appium).
- It has limited built-in handling for SSL certificate errors, so testing sites with strict certificate requirements may need extra configuration.
- It requires a separate driver if you want to run it on multiple different browser instances.
Here’s a concise table summarizing the pros and cons of each C# HTML parsing library mentioned in the article:
| Library | Pros | Cons |
| --- | --- | --- |
| Html Agility Pack (HAP) | – Widely used in the C# community – Supports parsing from files, strings, or URLs – Built-in XPath support – Low memory footprint | – Limited support for HTML5 – Cannot parse JavaScript – Error handling is outdated |
| AngleSharp | – Supports HTML5 and CSS3 – Provides an interactive DOM – Built-in JavaScript parsing – Compatible with .NET Framework | – Smaller community compared to HAP – Lacks XPath support – Not suitable for parsing XML |
| Awesomium | – Renders websites and interacts with the DOM – Suitable for dynamic content – Provides a browser-like environment | – Heavier and slower compared to other libraries – Limited community support – Requires more resources |
| Fizzler | – Lightweight and fast – CSS selector support – Easy to use – Built on top of Html Agility Pack | – Limited features compared to HAP – Less community support – Not suitable for complex parsing tasks |
| Selenium WebDriver | – Automates web browsers – Suitable for dynamic content – Supports multiple browsers – Can handle JavaScript-heavy websites | – Slower performance – Requires a web driver – More resource-intensive – Overkill for simple parsing tasks |
Conclusion
Today, Selenium WebDriver is in general the most used web automation tool due to its compatibility with almost all programming languages, but it can be slow because it uses real browsers.
Awesomium and Fizzler are both great options: Awesomium offers fast website-rendering APIs, while Fizzler can handle small web scraping tasks but is not as fully equipped as Selenium. Personally, I prefer the combination of Selenium and Fizzler.
I hope this article has given you an insight into the most popular web scraping and HTML parsing tools/libraries available in C#. I know it can be a bit confusing while selecting the right library for your project but you have to find the right fit by trying them one by one.
I hope you like this little tutorial and if you do then please do not forget to share it with your friends and on your social media.
Additional Resources
Here are a few additional resources that you may find helpful during your web scraping journey: