On a recent project, I needed to parse some HTML to extract data from it. Over the last few years, I have often used regular expressions for this kind of chore, but I know that isn't really the right way to do it; plus, writing all of that code is a waste of time. This particular project is an application that looks at entries on Blogger and grabs the text, title, and timestamp of each one so you can import them into another CMS. Here are some of the things I learned along the way; I hope this knowledge is helpful the next time you need to parse HTML.
To parse the HTML, I used the WebBrowser control, which is a .NET wrapper around the Internet Explorer ActiveX control. With this component, Internet Explorer does all of the heavy lifting in terms of parsing the Web page and exposing the properties — I just needed to know how to get them. Unfortunately, the .NET wrapper didn't expose all of the functionality I needed, which presented some additional challenges along the way.
Instantiating the control was easy: I just called the default constructor. I was doing the processing behind the scenes, so I did not need to show a browser to the user. To point the control at a page, you can either set the Url property or call the Navigate method. This is where things get tricky. The control does everything asynchronously, so these calls do not block. If you try to access the document right away, it probably won't be ready, and you'll see an empty document. To complicate matters, you can't just spin on the ReadyState property either. The control depends on the Windows message loop to make progress, so you need to call Application.DoEvents inside that loop; otherwise, the ReadyState property will never flip to Complete. Here's the code I used:
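var browser = new WebBrowser();
browser.ScriptErrorsSuppressed = true;  // keep script error dialogs from popping up
browser.AllowNavigation = true;
browser.Navigate(url);  // url is a placeholder for the address of the page to parse
while (browser.ReadyState != WebBrowserReadyState.Complete)
{
    Application.DoEvents();  // pump messages so the control can finish loading
}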
When this loop exits, the document is fully loaded and ready to use. Your project will need a reference to mshtml.tlb, and your code will need a using (C#) or Imports (VB.NET) statement for mshtml.
The WebBrowser control exposes only the "most common" attributes of elements as properties. It's too bad they did not consider the class attribute common enough to include. This was important because Blogspot does not give an id attribute to all of the elements I was looking for; many of them were uniquely identified by class instead. To get at the class attribute (and anything else not wrapped in the .NET class), we need to drop down to the COM level. Some of the browser component's objects have a DomElement property that can be cast to the correct interface to reach the full set of properties. For example, on the elements within the document, you can cast the DomElement property to IHTMLElement, which then gives you full access to the element. In fact, it seems like they only exposed attributes that all elements can have, so you will get familiar with this technique very quickly.
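To make the cast concrete, here is a minimal sketch of looking up elements by class through DomElement. It is not the code from the application; the ClassLookup type and its HasClass and FindByClass methods are names made up for this illustration, and it assumes the project references System.Windows.Forms and the mshtml interop assembly.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper, not part of the original application.
static class ClassLookup
{
    // True when the whitespace-separated class attribute contains the wanted class.
    public static bool HasClass(string classAttribute, string wanted)
    {
        if (string.IsNullOrEmpty(classAttribute)) return false;
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (var name in classAttribute.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (string.Equals(name, wanted, StringComparison.Ordinal)) return true;
        }
        return false;
    }

    // Yields every element with the given tag name whose class list contains the wanted class.
    public static IEnumerable<HtmlElement> FindByClass(HtmlDocument document, string tagName, string wanted)
    {
        foreach (HtmlElement element in document.GetElementsByTagName(tagName))
        {
            // DomElement is typed as object; the COM interface exposes className.
            var raw = element.DomElement as mshtml.IHTMLElement;
            if (raw != null && HasClass(raw.className, wanted))
            {
                yield return element;
            }
        }
    }
}
```

Once the document has loaded, a call such as ClassLookup.FindByClass(browser.Document, "div", "post-body") would return the matching div elements; "post-body" is only a placeholder class name here, so substitute whatever class the page you are parsing actually uses.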
Another little issue I found is that the documentation for the Style property of the HtmlElement class is wrong. The styles are separated by semicolons (as they would be in CSS), not by commas as the documentation says. Keep this in mind when manipulating them.
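If you need to work with individual styles, a small helper that splits the string on semicolons is enough for simple cases. The sketch below is an illustration rather than the application's code; InlineStyle, Parse, and Format are hypothetical names, and the split is deliberately naive, so a value that itself contains a semicolon (a data URI, for instance) would be cut in the wrong place.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original application.
static class InlineStyle
{
    // Splits a string such as "color: red; font-size: 12px" into name/value pairs.
    public static Dictionary<string, string> Parse(string style)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        if (string.IsNullOrWhiteSpace(style)) return result;

        foreach (var declaration in style.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int colon = declaration.IndexOf(':');
            if (colon <= 0) continue;  // skip fragments without a property name

            string name = declaration.Substring(0, colon).Trim();
            string value = declaration.Substring(colon + 1).Trim();
            if (name.Length > 0) result[name] = value;  // a later duplicate wins
        }
        return result;
    }

    // Joins the pairs back into a semicolon-separated string.
    public static string Format(IDictionary<string, string> styles)
    {
        return string.Join("; ", styles.Select(pair => pair.Key + ": " + pair.Value));
    }
}
```

With this in place, you could read element.Style, call InlineStyle.Parse on it, change one entry, and assign the result of InlineStyle.Format back to element.Style. The dictionary compares names case-insensitively so that "COLOR" and "color" are treated as the same property.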
My application was short and simple (and, yes, I did end up with a few regexes here). If you want to take a look at the full application and source code, it is available under the MIT License (the source code will be installed in a directory under the install point).
J.Ja
Disclosure of Justin's industry affiliations: Justin James has a contract with Spiceworks to write product buying guides; he has a contract with OpenAmplify, which is owned by Hapax, to write a series of blogs, tutorials, and articles; and he has a contract with OutSystems to write articles, sample code, etc.
Justin James is the Lead Architect for Conigent.