5 Best C# HTML Parsers

Scrapingdog
7 min read · Jan 28, 2023

Data parsing is like extracting metal from a pile of scrap. When we scrape the web, we receive a large amount of raw data, most of which is useless to us. An HTML parser lets us extract the useful parts from the raw HTML returned by the target website.

In this tutorial, we will look at some of the most popular C# HTML parsers one by one, and after that, you can draw your own conclusions. By the end, you should have a clear picture of which library to use when parsing data in C#.

HTML Agility Pack (HAP)

HTML Agility Pack, also known as HAP, is the most widely used HTML parser in the C# community. It is used for loading, parsing, and manipulating HTML documents, and it can read HTML from a file, a string, or a URL. It comes with XPath support, which helps you find specific HTML elements within the DOM. For this reason, it is quite popular in web scraping projects.
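As a rough sketch of those three loading paths, the snippet below reads the same markup from a string and from a file, and wraps the URL case in a small helper. The temp file name and the LoadFromUrl helper are made up for this illustration; the helper is not called here because it needs network access.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

const string html = "<div class='test1'><p class='title'>Harry Potter</p></div>";

// 1. From a string
var fromString = new HtmlDocument();
fromString.LoadHtml(html);

// 2. From a file (a temp file is written here only for the demo)
string path = Path.Combine(Path.GetTempPath(), "hap-sample.html");
File.WriteAllText(path, html);
var fromFile = new HtmlDocument();
fromFile.Load(path);

// 3. From a URL (hypothetical helper; HtmlWeb downloads and parses the page)
static HtmlDocument LoadFromUrl(string url) => new HtmlWeb().Load(url);

Console.WriteLine(fromString.DocumentNode.SelectSingleNode("//p").InnerText);
Console.WriteLine(fromFile.DocumentNode.SelectSingleNode("//p").InnerText);
```

Whichever way the document is loaded, you end up with the same HtmlDocument type, so the extraction code does not need to change.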

Features

  • HAP can help you remove dangerous elements from HTML documents.
  • Within the .NET environment, you can manipulate HTML documents.
  • It comes with a low memory footprint which makes it friendly for large projects. This ultimately reduces cost as well.
  • Its built-in support for XPath makes it the first choice for developers.
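To illustrate the first point, here is a minimal sketch that strips script and iframe elements from a document. The sample markup is invented, and this is only a starting point: a real sanitizer would also need to handle things like inline event-handler attributes.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div><script>alert(1)</script><p>Hello</p><iframe src='x'></iframe></div>");

// SelectNodes returns null when nothing matches, so guard before looping
var unsafeNodes = doc.DocumentNode.SelectNodes("//script | //iframe");
if (unsafeNodes != null)
{
    // Copy to a list first so we never modify a collection while enumerating it
    foreach (HtmlNode node in unsafeNodes.ToList())
    {
        node.Remove();
    }
}

Console.WriteLine(doc.DocumentNode.OuterHtml);
```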

Example

Let’s see how we can use HAP to parse HTML and extract the text of the p tag with the class title from the sample HTML given below.

<div class="test1"><p class="title">Harry Potter</p></div>

We will use SelectSingleNode to find the p tag inside of this raw HTML.

using System;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<div class='test1'><p class='title'>Harry Potter</p></div>");

// XPath: any p element whose class attribute is exactly "title"
HtmlNode title = doc.DocumentNode.SelectSingleNode("//p[@class='title']");

// SelectSingleNode returns null when nothing matches
if (title != null)
{
    Console.WriteLine(title.InnerText);
}

The output will be Harry Potter. Obviously, this is just a small example of parsing. This library can be used for heavy parsing too.
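For slightly heavier parsing, you will usually want several nodes at once. The sketch below uses SelectNodes on an invented list of books and reads both the link text and the href attribute. Note that SelectNodes returns null, not an empty collection, when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical sample markup with several books
var html = @"<ul>
  <li class='book'><a href='/hp1'>Harry Potter</a></li>
  <li class='book'><a href='/lotr'>The Lord of the Rings</a></li>
</ul>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var books = new List<(string Title, string Link)>();
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//li[@class='book']/a");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        // GetAttributeValue falls back to the second argument if href is missing
        books.Add((link.InnerText.Trim(), link.GetAttributeValue("href", "")));
    }
}

foreach (var book in books)
{
    Console.WriteLine($"{book.Title} -> {book.Link}");
}
```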

Advantages

  • The API is simple enough that even a beginner can parse HTML without much trouble. It is a developer-friendly library.
  • Since it supports multiple encoding options, parsing HTML becomes even simpler.
  • Thanks to its large community, beginners can usually find help when they run into errors.

Disadvantages

  • It cannot execute JavaScript, so content that is rendered in the browser is not available to it.
  • It still has very limited support for HTML5.
  • Its error handling is quite old-fashioned. The community needs to focus on this issue.
  • It is purely designed for parsing HTML documents. So, if you are thinking of parsing XML then you have to pick another library for that.

AngleSharp

It’s a lightweight .NET-based HTML and CSS parsing library. It comes with clean documentation, which makes it popular among developers. AngleSharp gives you an interactive DOM to work with while scraping any website.

Features

  • It comes with a CSS selector feature. This makes data extraction through HTML & CSS extremely easy.
  • Using custom classes you can handle any specific type of element.
  • Built-in support for HTML5 and CSS3. With this, it becomes compatible with new technology.
  • It is compatible with the .NET Framework too, which opens the door to many other libraries.

Example

Let’s see how AngleSharp works on the same HTML code used above.

using AngleSharp.Html.Parser;

var parser = new HtmlParser();
var document = parser.ParseDocument("<div class='test1'><p class='title'>Harry Potter</p></div>");
var title = document.QuerySelector("p.title").TextContent;
Console.WriteLine(title); // Output: Harry Potter

Here, we first used HtmlParser to parse the HTML string into an AngleSharp.Html.Dom.IHtmlDocument object. Then, with the help of QuerySelector, we selected the p tag with the class title. Finally, we extracted the text using TextContent.
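When more than one element matches, QuerySelectorAll returns all of them. The following is a small sketch on invented markup, assuming AngleSharp 0.10 or later, where the parsing method on HtmlParser is named ParseDocument. The data-year attribute is made up for the example.

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser;

// Hypothetical sample markup
var html = @"<ul>
  <li class='book' data-year='1997'>Harry Potter</li>
  <li class='book' data-year='1937'>The Hobbit</li>
  <li class='note'>Not a book</li>
</ul>";

var parser = new HtmlParser();
var document = parser.ParseDocument(html);

// CSS selector: only li elements that carry the class "book"
var books = document.QuerySelectorAll("li.book")
    .Select(li => (Title: li.TextContent.Trim(), Year: li.GetAttribute("data-year")))
    .ToList();

foreach (var book in books)
{
    Console.WriteLine($"{book.Title} ({book.Year})");
}
```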

Advantages

  • It has a better error-handling mechanism than HAP.
  • It is faster than other libraries like HAP.
  • It can work with JavaScript through the separate AngleSharp.Js extension.
  • It supports new technologies like HTML5 and CSS3.

Disadvantages

  • It has a smaller community than HAP, which makes it harder for beginners to overcome the challenges they might face while using AngleSharp.
  • It has no built-in XPath support. XPath queries require the separate AngleSharp.XPath extension package.
  • You cannot parse and manipulate HTML forms using AngleSharp.
  • Not a good choice for parsing XML documents.

Awesomium

Awesomium can be used to render any website. By creating an instance, you can navigate to a website and interact with the page through its DOM API. It is built on Chromium and provides a convenient API for interacting with the webpage.

Features

  • API is straightforward which makes interaction with the page dead simple.
  • Browser functionalities like notifications and dialog boxes are also supported by this library.
  • Works on Mac, Linux, and Windows.

Example

Awesomium is a web automation engine and not a parsing library, so we will write code that displays www.scrapingdog.com using it.

using System;
using Awesomium.Core;
using Awesomium.Windows.Forms;
using System.Windows.Forms;

namespace DisplayScrapingdog
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();

            // Initialize the WebCore
            WebCore.Initialize(new WebConfig()
            {
                // Add any configuration options here
            });

            // Create a new WebControl
            var webView = new WebControl();

            // Add the WebControl to the form
            this.Controls.Add(webView);

            // Navigate to the website
            webView.Source = new Uri("https://www.scrapingdog.com/");
        }
    }
}

Advantages

  • It is a lightweight library with low memory usage.
  • It is compatible with HTML5, CSS3, and JavaScript, which makes it popular among developers.
  • It is an independent library that does not require any extra dependency to fetch raw data.
  • You can scrape dynamic JavaScript websites using Awesomium, but you will need an additional library to parse the important data out of the raw HTML.

Disadvantages

  • It comes with limited community support. Solving bugs without community help can be very hard in a web scraping project.
  • It renders pages only with its own embedded engine and does not support other browsers. Hence, scraping certain websites might not be possible.
  • It is not open source, so you might end up paying for a license.

Fizzler

Fizzler is another parsing library, built on top of HAP, that adds CSS selector support. The syntax is short and largely self-explanatory. It uses namespaces for the unique identification of objects. It is a .NET library that does not get much active support from the community.

Features

  • Using CSS selectors, you can filter and extract elements from any HTML document.
  • Apart from HAP, it has no external dependencies, so it is quite lightweight.
  • Its selectors let you easily search by ID, class, type, etc.

Example

Since it is built on top of HAP the syntax will look somewhat similar to it.

using System;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;

var html = "<div class='test1'><p class='title'>Harry Potter</p></div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);

var title = doc.DocumentNode.QuerySelector(".title").InnerText;
Console.WriteLine(title);
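The same extension namespace also provides QuerySelectorAll, which returns every match instead of just the first. Below is a small sketch on invented markup that combines an ID, a child combinator, and a class selector. The shelf ID and the sample data are assumptions made for this example.

```csharp
using System;
using System.Linq;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;

// Hypothetical sample markup
var html = @"<div id='shelf'>
  <p class='title'>Harry Potter</p>
  <p class='title'>The Hobbit</p>
  <p class='author'>J. K. Rowling</p>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// "#shelf > p.title": p elements with class "title" that are direct children of #shelf
var titles = doc.DocumentNode
    .QuerySelectorAll("#shelf > p.title")
    .Select(p => p.InnerText)
    .ToList();

foreach (string title in titles)
{
    Console.WriteLine(title);
}
```

When a selector may not match anything, it is worth checking the result of QuerySelector for null before reading InnerText.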

Advantages

  • Unlike Awesomium, it is a free, open-source package, so you won’t have to pay for anything.
  • CSS selectors can help you parse any website, including a dynamic JavaScript website once its rendered HTML has been fetched with a browser automation tool.
  • Its fast performance reduces server latency.

Disadvantages

  • It might not work as well as other libraries do with large HTML documents.
  • Support resources and tutorials on Fizzler are scarce.

Selenium WebDriver

I think you already know what Selenium is capable of. It is the most popular web automation tool and works with almost any programming language (C#, Python, NodeJS, etc.). It can drive all major browsers, including Chrome, Firefox, and Safari.

It also integrates with testing frameworks such as TestNG and JUnit in Java or NUnit in C#.

Example

Again, just like Awesomium, it is a web automation tool. So, we will open http://books.toscrape.com/ using Selenium and print the page title.

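using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Program
{
    static void Main(string[] args)
    {
        // Launch a Chrome browser controlled by Selenium
        IWebDriver driver = new ChromeDriver();

        // Load the target page and print its title
        driver.Navigate().GoToUrl("http://books.toscrape.com/");
        Console.WriteLine(driver.Title);

        // Close the browser and end the session
        driver.Quit();
    }
}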

Here we are using the ChromeDriver constructor to open the Chrome browser. Then, using GoToUrl(), we navigate to the target website. driver.Title gives us the title of the page, which we print, and then driver.Quit() closes the browser.

Features

  • You can record videos, take screenshots and you can even log console messages. It’s a complete web automation testing tool.
  • Supports almost all browsers and almost all programming languages.
  • You can click buttons, fill out forms and navigate between multiple pages.

Advantages

  • A clear advantage is its capability to work with almost all browsers and programming languages.
  • You can run it in a headless mode which ultimately reduces resource costs and promotes faster execution.
  • CSS selectors and XPath both can work with Selenium.
  • The community is very large so even a beginner can learn and create a web scraper in no time.
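Two of these points, headless mode and support for both CSS selectors and XPath, can be sketched together. The example assumes Chrome and a matching driver are available on the machine. The selectors are a guess at the markup of http://books.toscrape.com/ and may need adjusting if the page differs.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
options.AddArgument("--headless"); // run Chrome without a visible window

// Disposing the driver at the end of the program also closes the browser
using IWebDriver driver = new ChromeDriver(options);
driver.Navigate().GoToUrl("http://books.toscrape.com/");

// The first book link, located once with a CSS selector and once with XPath
// (both selectors are assumptions about the page's structure)
IWebElement byCss = driver.FindElement(By.CssSelector("article.product_pod h3 a"));
IWebElement byXPath = driver.FindElement(By.XPath("//article[@class='product_pod']//h3/a"));

Console.WriteLine(byCss.Text);
Console.WriteLine(byXPath.Text);
```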

Disadvantages

  • It does not support mobile application testing on its own, although there are alternatives for that, such as Appium.
  • It has no SSL handling of its own. HTTPS connections and certificate checks are left to the browser it drives, so sites with certificate problems need extra browser configuration.
  • Each browser needs its own driver (for example, ChromeDriver for Chrome), which adds setup work if you want to run on several different browsers.

Conclusion

Today, Selenium WebDriver is the most used web automation tool in general, thanks to its compatibility with almost all programming languages, but it can be slow because it uses real browsers.

Awesomium and Fizzler are both great libraries, but they do different jobs: Awesomium offers fast website rendering APIs, while Fizzler handles the parsing. Fizzler can be used for small web scraping tasks, but it is not as fully equipped as Selenium. Personally, I prefer the combination of Selenium and Fizzler.
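One way to combine Selenium and Fizzler is to let Selenium render the page and hand its HTML to Fizzler for extraction. This is only a sketch: the ExtractTitles helper is a name made up for this example, and the selector is an assumption about the markup of http://books.toscrape.com/.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Parsing step (hypothetical helper): plain HAP + Fizzler, no browser involved
static List<string> ExtractTitles(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    return doc.DocumentNode
        .QuerySelectorAll("article.product_pod h3 a")
        // Prefer the title attribute, fall back to the link text
        .Select(a => a.GetAttributeValue("title", a.InnerText))
        .ToList();
}

// Rendering step: Selenium loads the page in a real browser
using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("http://books.toscrape.com/");

    // PageSource is the HTML of the page as the browser currently sees it
    foreach (string title in ExtractTitles(driver.PageSource))
    {
        Console.WriteLine(title);
    }
}
```

Keeping the parsing step in its own function means it can be tried on a saved HTML string without starting a browser.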

I hope this article has given you an insight into the most popular web scraping and HTML parsing tools/libraries available in C#. I know it can be a bit confusing while selecting the right library for your project but you have to find the right fit by trying them one by one.

I hope you like this little tutorial and if you do then please do not forget to share it with your friends and on your social media.

I usually talk about web scraping and yes web scraping only. You can find a web scraping API at www.scrapingdog.com