How to easily parse HTML without RegEx
May 6th, 2008
I recently discovered an absolutely amazing HTML parsing library for .NET called HtmlAgilityPack. It completely takes away the pain of parsing complicated HTML with regular expressions.
Here’s a very simple example of what you can do with it. I’m just extracting the inner HTML of every element in an HTML file whose class attribute is exactly “scrape”:
using System;
using HtmlAgilityPack;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // filePath is the virtual path of the HTML file to parse (defined elsewhere).
        HtmlDocument doc = new HtmlDocument();
        doc.Load(Server.MapPath(filePath));
        Parse(doc.DocumentNode);
    }

    // Walks the tree recursively and writes out the inner HTML of every
    // node whose class attribute is exactly "scrape".
    private void Parse(HtmlNode n)
    {
        foreach (HtmlAttribute atr in n.Attributes)
        {
            if (atr.Name == "class" && atr.Value == "scrape")
            {
                Response.Write(n.InnerHtml);
            }
        }

        if (n.HasChildNodes)
        {
            foreach (HtmlNode cn in n.ChildNodes)
            {
                Parse(cn);
            }
        }
    }
}
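The recursive walk above can probably be shortened with the library's XPath support. Here is a minimal sketch, assuming the HTML is already in a string; the `ScrapeExample` class and `ExtractScraped` method are names I made up for illustration, and the null check reflects how `SelectNodes` behaves in the versions I know of (it returns null rather than an empty collection when nothing matches):

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ScrapeExample
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the inner HTML
    // of every element whose class attribute is exactly "scrape".
    public static List<string> ExtractScraped(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();

        // XPath: any element, anywhere in the tree, with class="scrape".
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//*[@class='scrape']");

        // Assumption: SelectNodes returns null when there are no matches.
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            results.Add(node.InnerHtml);
        }
        return results;
    }
}
```

Like the loop version, this only matches elements whose class attribute is exactly “scrape”, so an element with class="scrape big" would be skipped in both.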
That’s just a very small part of what it can do. I’ll expand on this and post a few more examples in the future showing some other interesting things you can do with it.








