How to easily parse HTML without RegEx

May 6th, 2008

I recently discovered an excellent HTML parsing library for .NET called HtmlAgilityPack. It takes away the pain of parsing complicated HTML with regular expressions.

Here’s a very simple example of what you can do with it: I’m extracting the inner HTML of every element in an HTML file that has the CSS class “scrape” assigned to it:

using System;
using HtmlAgilityPack;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // filePath is assumed to be defined elsewhere as the virtual path of the HTML file.
        HtmlDocument doc = new HtmlDocument();
        doc.Load(Server.MapPath(filePath));
        Parse(doc.DocumentNode);
    }

    // Walks the node tree recursively and writes out the inner HTML of each matching element.
    private void Parse(HtmlNode n)
    {
        foreach (HtmlAttribute atr in n.Attributes)
        {
            if (atr.Name == "class" && atr.Value == "scrape")
            {
                Response.Write(n.InnerHtml);
            }
        }

        if (n.HasChildNodes)
        {
            foreach (HtmlNode cn in n.ChildNodes)
            {
                Parse(cn);
            }
        }
    }
}
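
The recursive walk above can also be written as a single XPath query, since HtmlAgilityPack nodes support SelectNodes. This is only a rough sketch of that alternative, not a replacement for the code above: the ScrapeExample class and the ExtractScrapeBlocks method are names I made up for illustration, and it takes the HTML as a string instead of loading a file.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class, named here only for illustration.
public static class ScrapeExample
{
    // Returns the inner HTML of every element whose class attribute is exactly "scrape".
    public static List<string> ExtractScrapeBlocks(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> results = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*[@class='scrape']");
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            results.Add(node.InnerHtml);
        }
        return results;
    }
}
```

Like the attribute loop, this only matches elements whose class attribute is exactly “scrape”, so an element with several classes would be skipped by both versions.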

That’s just a very small part of what it can do. I’ll expand on this and post a few more examples in the future showing some other interesting things you can do with it.
