
Parse and process HTML with WebBrowser

Justin James describes how he used the WebBrowser control to parse HTML to extract data from it. Here are some of the issues he faced on this simple project.

 

On a recent project, I needed to parse some HTML to extract data from it. Over the last few years, I often used regular expressions for this kind of chore, but I know it really isn't the right way to do it; plus, it's a waste of time to write all of that code. This particular project is an application that looks at entries on Blogger and grabs the actual text, title, and timestamp so you can import them into another CMS. Here are some of the things I learned along the way on this project; I hope this knowledge is helpful the next time you need to parse HTML.

To parse the HTML, I used the Windows Forms WebBrowser control (System.Windows.Forms.WebBrowser), which is a .NET wrapper around the Internet Explorer ActiveX control. With this component, Internet Explorer does all of the heavy lifting in terms of parsing the Web page and exposing the properties -- I just needed to know how to get them. Unfortunately, the .NET wrapper didn't expose all of the functionality I needed, which presented some additional challenges along the way.

Instantiating the control was easy -- I just called the default constructor. I was doing the processing behind the scenes, and I did not need to show a browser to the user. To point the control at a page, you can either set the Url property or call the Navigate method. This is where things get tricky. The control does everything asynchronously, so these calls do not block. If you try to access the document, it probably won't be ready, and you'll see an empty document. But to complicate matters, you can't just spin on the ReadyState property either -- you need to put a call into Application.DoEvents in that loop; otherwise, the ReadyState property will never flip to Complete. Here's the code I used:

    var browser = new WebBrowser();
    browser.ScriptErrorsSuppressed = true;
    browser.AllowNavigation = true;
    browser.Navigate(url);

    // Navigate returns immediately; pump the message loop until loading finishes.
    while (browser.ReadyState != WebBrowserReadyState.Complete)
    {
        Application.DoEvents();
    }

When this loop exits, the document is fully loaded and ready to use. To work with the underlying COM objects described below, your project will need to reference mshtml.tlb, and your code will need a using (C#) or Imports (VB.NET) statement for mshtml.
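One weakness of the polling loop is that it spins forever if the page never finishes loading. As a minimal sketch (not code from the original application), the same approach can be wrapped in a helper with a timeout. The PageLoader class, its Load method, and the 10 ms sleep are illustrative assumptions; the sketch also assumes a freshly created control and a caller running on an STA thread, which Windows Forms requires for ActiveX controls.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of the original application.
static class PageLoader
{
    // Navigates to url and polls ReadyState until the document is complete
    // or the timeout elapses. Call this from an STA thread.
    public static HtmlDocument Load(WebBrowser browser, string url, TimeSpan timeout)
    {
        var timer = Stopwatch.StartNew();
        browser.Navigate(url);

        while (browser.ReadyState != WebBrowserReadyState.Complete)
        {
            if (timer.Elapsed > timeout)
                throw new TimeoutException("Timed out loading " + url);

            Application.DoEvents(); // pump messages so ReadyState can advance
            Thread.Sleep(10);       // assumed pause to avoid pegging a CPU core
        }

        return browser.Document;
    }
}
```

A caller might write something like PageLoader.Load(browser, url, TimeSpan.FromSeconds(30)) and treat a TimeoutException as a failed download.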

The WebBrowser control exposes the "most common" attributes of elements. It's too bad they did not consider the class attribute common enough to provide; this was important because Blogspot does not give an id attribute to all of the elements I was looking for -- many of the elements were uniquely identified by class instead. To get at the class attribute (and anything else not wrapped in the .NET class), you need to drop down to the COM level. Some of the browser component's wrapper classes have a DomElement property that can be cast to the correct interface to access the full set of properties. For example, on the elements within the document, you can cast the DomElement property to IHTMLElement, which then gives you full access to the element. In fact, it seems like they only exposed attributes that all elements can have, so you will get pretty familiar with this very quickly.
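As a rough sketch of that cast, the helper below walks the elements with a given tag name, casts each DomElement to IHTMLElement, and checks its className. The ClassLookup class and FindByClass method are hypothetical names for illustration, not code from the original application, and the sketch assumes the mshtml reference is already in place.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;
using mshtml; // COM interop namespace; needs the mshtml reference

// Hypothetical helper, not part of the original application.
static class ClassLookup
{
    // Returns every element with the given tag whose class attribute
    // contains className as one of its whitespace-separated tokens.
    public static List<HtmlElement> FindByClass(HtmlDocument document, string tagName, string className)
    {
        var matches = new List<HtmlElement>();
        var separators = new[] { ' ', '\t', '\r', '\n' };

        foreach (HtmlElement element in document.GetElementsByTagName(tagName))
        {
            // DomElement is typed as object; cast it to the COM interface.
            var raw = element.DomElement as IHTMLElement;
            if (raw == null || raw.className == null)
                continue;

            string[] tokens = raw.className.Split(separators, StringSplitOptions.RemoveEmptyEntries);
            if (Array.IndexOf(tokens, className) >= 0)
                matches.Add(element);
        }

        return matches;
    }
}
```

For example, FindByClass(browser.Document, "div", "post-body") would collect the matching div elements, where "post-body" is only an assumed class name; the text of each match is then available through its InnerText property.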

Another little issue I found was that the Style property of the HtmlElement class has a bug in the documentation. The styles are separated by a semicolon (as they would be in CSS), not a comma like the documentation says. Keep this in mind when manipulating them.
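To illustrate the semicolon-separated format, here is a small sketch that splits a Style string into property/value pairs. StyleParser and Parse are hypothetical names, not from the original application, and the naive split assumes no semicolons appear inside quoted strings or url(...) values. Property names are compared case-insensitively, on the assumption that the casing returned by the control may vary.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original application.
static class StyleParser
{
    // Splits "name: value; name: value" into a case-insensitive dictionary.
    public static Dictionary<string, string> Parse(string style)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        if (string.IsNullOrEmpty(style))
            return result;

        foreach (string declaration in style.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int colon = declaration.IndexOf(':');
            if (colon <= 0)
                continue; // skip fragments that have no property name

            string name = declaration.Substring(0, colon).Trim();
            string value = declaration.Substring(colon + 1).Trim();
            if (name.Length > 0)
                result[name] = value; // a later declaration overrides an earlier one
        }

        return result;
    }
}
```

Given an HtmlElement named element, StyleParser.Parse(element.Style) would return the declarations as a dictionary, so a lookup such as styles["color"] works regardless of casing.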

My application was short and simple (and, yes, I did end up with a few regexes here). If you want to take a look at the full application and source code, it is available under the MIT License (the source code will be installed in a directory under the install point).

J.Ja

Disclosure of Justin's industry affiliations: Justin James has a contract with Spiceworks to write product buying guides; he has a contract with OpenAmplify, which is owned by Hapax, to write a series of blogs, tutorials, and articles; and he has a contract with OutSystems to write articles, sample code, etc.


About

Justin James is the Lead Architect for Conigent.

30 comments
ojemuyiwa

should've used html agility pack. no need to re-invent the wheel, this has been done many a time and is called screen scraping.. nice approach though. httprequest class would have returned the response / web page markup u required as well...

mr85

I didn't find the code, is it a joke? pedrito rod mr85@servidor.unam.mx

?/\/\?|???\/???

More than one way to skin a cat, it would seem... WebBrowser.DocumentCompleted Event @ http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.documentcompleted.aspx "Handle the DocumentCompleted event to receive notification when the new document finishes loading. When the DocumentCompleted event occurs, the new document is fully loaded, which means you can access its contents through the Document, DocumentText, or DocumentStream property." WebBrowserReadyState Enumeration @ http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowserreadystate.aspx "Complete The control has finished loading the new document and all its contents"

dingjing0105

Which WebBrowser are you talking about? System.Windows.Forms.WebBrowser or System.Windows.Controls.WebBrowser? They are quite different.

duke.url

Has anyone tried TextPipe Pro for this purpose?

lcdata

I thought the article title was Parse and Process HTML?!!! So, where's the code?? All you say here is that it can be done. Very helpful. What a waste of time!!!!!

JackOfAllTech

Using IE's COM Objects makes it child's play to extract anything from an HTML page, including navigating links to other pages. Plus, you don't have to waste disk space for the .NET framework, don't have to worry about Big Brother M$ watching over your shoulder. http://www.autoitscript.com/

gharlow

This all worked great for me until.... GRRRRRRR! frame cross domain security issues. I never could get past this no matter what I did to security settings. The web browser control DOES provide a nice dom model, which works on most sites. Frames with different domains??? If any of you have a solution to this, I am still interested in pursuing...

Justin James

Using the WebBrowser control worked well for me, but it might not be ideal for everyone since it loads all of IE in the process. If you have a different way of doing this, I'd love to hear about it! J.Ja

Justin James

I've used it on some other projects. For folks who do not want to use a 3rd party component, or need things like JavaScript parsing and the full DOM model, WebBrowser works very well. It's also a bit better documented, I think. More importantly, many people have written an awful lot of code designed to work with IE (even if it is not written to the HTML spec), and I feel like it would be mighty hard to replicate that. Indeed, if you want to be sure that something will parse the HTML you find on the Internet well, you are best off leveraging an existing browser engine (IE's Trident, Firefox's Gecko, etc.) rather than using a third-party one, because the third party ones don't have the piles of quirks that HTML authors write to. I may write an article about Html Agility Pack in the future though, because I agree that it is a useful tool. J.Ja

Justin James

If you follow the link in the article, you'll find the actual application which was written, and if you install the application, you'll get the full source code in a folder in the install directory. J.Ja

Justin James

You are right that I could have handled DocumentCompleted instead. Personally, I try to avoid event driven async style code whenever possible, because I find that it adds a layer of abstraction and indirection that is not conducive to debugging. Purely personal preference, driven more from where/when/how I learned programming than anything else! For someone who prefers that style of code, handling DocumentCompleted is the way to do it. :) J.Ja

?/\/\?|???\/???

... System.Windows.Controls.WebBrowser doesn't appear to have a ReadyState property...

Realvdude

Thanks Justin for not pasting in a bunch of code snippets for anyone interested to scrape.

Justin James

The AutoIt item looks interesting in and of itself, but your arguments in favor of it don't really hold much water. The .NET WebBrowser control is the IE COM object wrapped in .NET, and there are methods to access it fully in COM too. Wasting disk space for .NET? Hardly relevant, seeing as every Windows PC has .NET to begin with. And the Big Brother comment... not sure what you mean by that, but neither IE nor .NET report anything back to Microsoft... J.Ja

mattohare

I had created the pages with my old xml writer (not MS-XMLWriter). I found a few problems on it, but by and large I got everything quite well. I had the ability to get any attribute of any element. All the content too. I still have the stuff set up in case I need to crawl my old site (or any other site) again.

Kruppster

These are the subroutines and some comment sections; it should work with some tweaking. The delay and check-to-see-page-done subroutines were the keys to knowing that the page had loaded. Just paste into a VB module in a new spreadsheet and have fun. The URLs are from years ago and you need to put in your own.

    Dim MyBrowser As SHDocVw.InternetExplorer
    Dim webpage As HTMLDocument
    'Dim B As MSHTML.HTMLBody
    Dim B As MSHTML.HTMLFrameElement
    Dim browser2 As MSHTML.HTMLDocument
    Dim Z As MSHTML.HTMLGenericElement
    Dim Wtime As Integer
    Dim AUTOLOG As Hyperlink
    Dim A As MSHTML.HTMLAnchorElement

    Sub GetPage()
        'Obtain Web Information for processing
        'Paste hyperlink in worksheet for autologin
        'With Worksheets(1)
        '    .Hyperlinks.Add Anchor:=.Range("Z1"), _
        '        Address:="http://www.neco.navy.mil/biz_opps/search_edi.cfm", _
        '        ScreenTip:="NECO Web Site Open", _
        '        TextToDisplay:="NECO"
        '    'assign hyperlink to variable autolog
        '    Set AUTOLOG = Range("Z1").Hyperlinks(1)
        '    Columns("A:A").ColumnWidth = 20
        '    Columns("B:B").ColumnWidth = 11
        'End With

        'Open web page
        GetWeb

        ' Delay for remote server error (document needs time to fully load)
        ' Replace all web delay cycles with Browser event download complete
        Wtime = 3
        Delay
        'Do Until Left(peekat2, 4) = "Done"
        'peekat2 = MyBrowser.StatusText
        'Loop
        capture
        '*******
    End Sub

    Sub capture()
        pageaddress = LocationURL
        Set webpage = MyBrowser.Document
        For Each A In webpage.all
            Count = Count + 1
            'If A.innerHTML = "LINK" Then
            Range("A" & Count) = A.tagName
            Range("B" & Count) = A.innerHTML
            Range("C" & Count) = A.innerText
            Range("D" & Count) = A.href
            Range("E" & Count) = A.ID
            'For Each B In webpage.all
            'counter = counter + 1
            ' Range("F" & Count) = B.tagName
            ' Range("G" & Count) = B.outerHTML
            ' Range("H" & Count) = B.innerHTML
            ' 'Range("I" & Count) = B.Link
            ' Range("J" & Count) = B.childNodes
            ' 'Range("K" & Count) = B.children
            ' Range("L" & Count) = B.childNodes
            ' Range("M" & Count) = B.Document
            ' Next B
            ' End If
        Next A
        'Set browser2 = MyBrowser.Document
        'look = browser2.frames.Length
        'For p = 1 To look
        'Set webpage = browser2.frames(p)
        'Range("A" & Count) = ("FRAME # " & p)
        'Count = Count + 1
        ' For Each A In webpage.all
        ' Count = Count + 1
        ' 'If A.innerHTML = "LINK" Then
        ' Range("A" & Count) = A.tagName
        ' Range("B" & Count) = A.innerHTML
        ' Range("C" & Count) = A.innerText
        ' Range("D" & Count) = A.href
        ' Range("E" & Count) = A.ID
        'Next A
        'Next p
    End Sub

    Private Sub MyBrowser_DocumentComplete()
        Dim pDisp As Double
        Dim objDoc As HTMLDocument
        Set MyBrowser = New SHDocVw.InternetExplorer
        With MyBrowser
            .Navigate URL:="http://www.alittlesomethingcatering.com"
            .Top = 50 'set the browser in the top
            .Left = 100 'left of the user's screen
            .StatusBar = True 'display the status bar
            .Visible = True
        End With
        pDisp = MyBrowser.HWND
        MsgBox pDisp & " " & MyBrowser.Name
        Set objDoc = MyBrowser.Document
        ' Display all of the HTML for the active Web page.
        MsgBox "Entire HTML: " & vbCrLf & vbCrLf & objDoc.activeElement
        With objDoc.body
            MsgBox " outer HTML: " & vbCrLf & vbCrLf & .outerHTML
            MsgBox " inner HTML: " & vbCrLf & vbCrLf & .innerHTML
        End With
    End Sub

    Private Sub GetWeb()
        ' both statements work to open new explorer window
        Set MyBrowser = New SHDocVw.InternetExplorer
        'Set MyBrowser = CreateObject("InternetExplorer.Application")
        With MyBrowser
            'SiteAddress = "http://phoenix.ilsmart.com/FormsLogin.asp?/subscriber/default.asp"
            SiteAddress = InputBox("Enter Url after the www.", "URL w/o header and .com trailer")
            'ThePage = ("http://www." & SiteAddress & ".com")
            .Navigate URL:=SiteAddress
            .Top = 20 'set the browser in the top
            .Left = 50 'left of the user's screen
            .StatusBar = True 'display the status bar
            .Visible = True
            .Width = 500
            .Height = 500
        End With
        Do Until Left(peekat2, 4) = "Done"
            peekat2 = MyBrowser.StatusText
        Loop
        If peekat2 <> "Done" Then MsgBox ("Page didn't Load")
        peekat2 = ""
    End Sub

    Function Delay()
        newHour = Hour(Now())
        newMinute = Minute(Now())
        newSecond = Second(Now()) + Wtime
        waitTime = TimeSerial(newHour, newMinute, newSecond)
        Application.Wait waitTime
    End Function

XDotNet

Justin, I am wondering why regex is not a good way to do this. I'm not disagreeing, just looking to improve. I've been doing the following. 1. Download web page with webclient. 2. HTML goes into string. 3. Run HTML through regex class to get rid of tags. 4. Find my matches with regex. Any input is appreciated, is it slow, are regexs inflexible? They are a pain to write but there are some regex developers that help. Just looking to improve. Thanks!

Justin James

The other is a WPF item for viewing HTML in a page. Probably wraps the same control, though. J.Ja

JackOfAllTech

I find the AutoIt COM interface much easier and intuitive than any other and the fact that it 'compiles' into a relatively small, standalone executable is another advantage. My PCs do NOT have .Net on them. I refuse to have anything to do with it. I have yet to find anything I can't do in straight C or, if in a hurry, with AutoIt. How do you KNOW that the framework doesn't call home? You can't even install it without allowing it to connect to the Internet.

Justin James

... with the way the average person writes HTML. :( J.Ja

Justin James

I've used that as well, on another project, to rapidly strip a page of HTML where using WebBrowser would not be appropriate. J.Ja

Justin James

Eric - Good question. Here are the problems with using regex:
* Writing a regex that covers all the basics is a pain in the neck.
* People often write HTML in a way that defies logic and even the HTML spec itself, but continues to work in browsers; I'd rather let Microsoft (or Mozilla, or whoever) worry about those parsing rules that try to handle them all!
* Difficult to extract attributes with regex.
By the by, wondering if you are the same Eric Lamey that I went to high school (JMHS) with? J.Ja

Justin James

You are right that the .NET installer is huge... however, only a portion of it actually makes it to disk. That's because the install package rolls up binaries across a variety of platforms. This link has additional details: http://www.hanselman.com/blog/SmallestDotNetOnTheSizeOfTheNETFramework.aspx The .NET Framework itself is not significantly larger than the full JDK, which is a good comparison. .NET applications themselves tend to be quite small (because they leverage the Framework), which actually makes them, pound for pound, more effective than native code apps once you have a lot installed (because they all refer to the same Framework, instead of installing a million DLLs, or creating "DLL hell" by dumping them into C:\Windows\System32). In terms of "encouraging poor programming practices"? First of all, the .NET Framework itself does not encourage poor programming practices. Secondly, C# is very similar to Java in terms of programming practices, and by extension, VB.NET is as well. If you don't mind the way Java gets written, I have no idea why you'd disagree with how C# gets written. Thirdly, VB.NET and C# are both MUCH more rigorous than classic VB ever was (now *there* was a system that encouraged poor coding practices!). I am sure that if you aren't using any .NET apps, you probably have a few VB apps running around. Fourthly, I'd rather deal with the "poor programming practices" of the typical .NET (or Java) developer than the massive security holes that even the best developers can inadvertently put into C/C++ apps. I think that if you really examine the facts of reality, it becomes fairly hard to forbid an entire swath of applications written in a particular set of languages. I've seen plenty of ugly Perl code (talk about "encouraging poor programming habits"?!?!) yet every *Nix system out there is filled with critical system utilities that rely on Perl. I wouldn't stop using my FreeBSD server just because you can't have it up and running for 10 minutes post-install without Perl ending up on it as a requirement. J.Ja

seanferd

But isn't the Framework more properly compared with the JRE? Both are large, diskwise. If you mean the size of .NET applications, I can't comment, as I don't use them or have the Framework installed, either.

JackOfAllTech

It wouldn't be my first choice for anything but I always install it (the non-M$ version) on a PC. Although I honestly don't like Microsoft's attitude, I have made a good living supporting its products. My main objection is that .net encourages poor programming practices. I just don't understand why the framework takes up Hundreds of MBs when a C (even C++, even Java!) program that does exactly the same thing would only be a few MBs in size. Even with TB drives and multi-GB RAM, that seems wasteful to me and indicative of sloppy coding and sloppy reasoning.

Justin James

I would be incredibly shocked if you have Windows PCs without at least one version of the .NET framework. For one thing, .NET comes with all post-XP Windows versions (http://blogs.msdn.com/astebner/archive/2007/03/14/mailbag-what-version-of-the-net-framework-is-included-in-what-version-of-the-os.aspx). Secondly, huge swaths of applications install it, from printer drivers to little utilities. In terms of "calling home", you absolutely CAN install it without being connected to the Internet. You just need to get the big package which doesn't download anything. Furthermore, the .NET Framework is something that anti-Microsoft folks love to pick at. Believe me, if it was "calling home", someone would have noticed by now and there would have been a massive outrage over it. If you have proof that .NET "calls home" I would love to see it, because I'll be the first to blast Microsoft for that. I find it very ironic that you are afraid of .NET "calling home" while using Windows (which *does* "call home") and IE. I'm going to guess that you are also opposed to the Java platform for the same reasons. Between no Java and no .NET, there are an awful lot of apps that won't install on your system. I really wonder what the advantage there is in a system that you are ignoring a huge number of applications on, due to a prejudice based on speculation. J.Ja

mattohare

(I know), even my code had issues. I think I spent a day and a half sorting out the rogue code issues. (It was the same time we were talking about the movie Airplane, if you were wondering why I was that brand of loopy at the time.) That said, If the code coming in is sound, XmlReader was absolutely dead on!
