Discussion on:

30
Comments

Using the WebBrowser control worked well for me, but it might not be ideal for everyone since it loads all of IE in the process. If you have a different way of doing this, I'd love to hear about it!

J.Ja
Justin, I am wondering why regex is not a good way to do this. I'm not disagreeing, just looking to improve.

I've been doing the following.
1. Download web page with webclient.
2. HTML goes into string.
3. Run HTML through regex class to get rid of tags.
4. Find my matches with regex.

Any input is appreciated. Is it slow? Are regexes inflexible? They are a pain to write, but there are some regex developers that help. Just looking to improve. Thanks!
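
For reference, the four steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than .NET, and the sample HTML string and the price pattern are made up for the example; a real run would download the page first (for instance with urllib.request.urlopen).

```python
import re

# Steps 1-2: stand-in for the downloaded page held in a string.
# This sample markup is invented for illustration.
html = "<html><body><h1>Price List</h1><p>Widget: <b>$19.99</b></p><p>Gadget: $5.00</p></body></html>"

# Step 3: strip anything that looks like a tag. This naive pattern is the
# fragile part: it breaks on '>' inside attribute values, comments and scripts.
text = re.sub(r"<[^>]+>", " ", html)
text = re.sub(r"\s+", " ", text).strip()

# Step 4: find the matches in the remaining text.
prices = re.findall(r"\$\d+\.\d{2}", text)

print(text)
print(prices)
```

On tidy markup like this the approach works; the trouble starts when the input stops being tidy.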
Eric -

Good question. Here are the problems with using regex:

* Writing a regex that covers all the basics is a pain in the neck.

* People often write HTML in a way that defies logic and even the HTML spec itself, but continues to work in browsers; I'd rather let Microsoft (or Mozilla, or whoever) worry about those parsing rules that try to handle them all!

* Difficult to extract attributes with regex.
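
To make the attribute point concrete, here is a small sketch (Python standard library, not the WebBrowser control) that compares a naive regex with a tolerant parser on deliberately sloppy markup. The sample HTML and the LinkCollector class are invented for the example.

```python
import re
from html.parser import HTMLParser

# Made-up sloppy markup: unquoted and single-quoted attributes, uppercase
# tags, an unclosed <p>, spaces around '=', and a '>' inside an attribute.
html = """<P>Links:
<a href="page0.html">zero</a>
<A HREF=page1.html>one</A>
<a class='x' href='page2.html' title="a > b">two</a>
<a href = "page3.html">three"""

# Naive regex: only matches the tidy <a href="..."> form.
regex_links = re.findall(r'<a href="([^"]*)"', html)


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(html)
collector.close()

print(regex_links)      # the regex finds only the tidy link
print(collector.links)  # the parser finds all four
```

The regex could be extended to cover each of these cases, but every new quirk means another pattern, which is the maintenance burden described above.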

By the by, wondering if you are the same Eric Lamey that I went to high school (JMHS) with?

J.Ja
These are the subroutines and some comment sections; they should work with some tweaking. The delay and check-to-see-page-done subroutines were the keys to knowing that the page had loaded. Just paste into a VB module in a new spreadsheet and have fun. The URLs are from years ago, so you need to put in your own.

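' Requires references to "Microsoft Internet Controls" (SHDocVw)
' and "Microsoft HTML Object Library" (MSHTML)
Dim MyBrowser As SHDocVw.InternetExplorer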
Dim webpage As HTMLDocument
'Dim B As MSHTML.HTMLBody
Dim B As MSHTML.HTMLFrameElement


Dim browser2 As MSHTML.HTMLDocument

Dim Z As MSHTML.HTMLGenericElement

Dim Wtime As Integer


Dim AUTOLOG As Hyperlink
Dim A As MSHTML.HTMLAnchorElement



Sub GetPage()


'Obtain Web Information for processing
'Paste hyperlink in worksheet for autologin
'With Worksheets(1)
' .Hyperlinks.Add Anchor:=.Range("Z1"), _
' Address:="http://www.neco.navy.mil/biz_opps/search_edi.cfm", _
' ScreenTip:="NECO Web Site Open", _
' TextToDisplay:="NECO"
' 'assign hyperlink to variable autolog
' Set AUTOLOG = Range("Z1").Hyperlinks(1)
' Columns("A:A").ColumnWidth = 20
' Columns("B:B").ColumnWidth = 11
'End With

'Open web page
GetWeb
' Delay for remote server error (document needs time to fully load)
' Replace all web delay cycles with Browser event download complete

Wtime = 3
Delay


'Do Until Left(peekat2, 4) = "Done"
'peekat2 = MyBrowser.StatusText
'Loop


capture
'*******
End Sub

Sub capture()
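' LocationURL is a property of the browser object
pageaddress = MyBrowser.LocationURL
Set webpage = MyBrowser.Document
' A is declared as HTMLAnchorElement, so loop over anchor tags only;
' webpage.all also returns non-anchor elements, which have no href
For Each A In webpage.getElementsByTagName("a")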
Count = Count + 1
'If A.innerHTML = "LINK" Then
Range("A" & Count) = A.tagName
Range("B" & Count) = A.innerHTML
Range("C" & Count) = A.innerText
Range("D" & Count) = A.href
Range("E" & Count) = A.ID



'For Each B In webpage.all
'counter = counter + 1


' Range("F" & Count) = B.tagName

' Range("G" & Count) = B.outerHTML
' Range("H" & Count) = B.innerHTML
' 'Range("I" & Count) = B.Link
' Range("J" & Count) = B.childNodes
' 'Range("K" & Count) = B.children
' Range("L" & Count) = B.childNodes
' Range("M" & Count) = B.Document


' Next B
' End If


Next A

'Set browser2 = MyBrowser.Document
'look = browser2.frames.Length
'For p = 1 To look
'Set webpage = browser2.frames(p)

'Range("A" & Count) = ("FRAME # " & p)
'Count = Count + 1
' For Each A In webpage.all



' Count = Count + 1
' 'If A.innerHTML = "LINK" Then
' Range("A" & Count) = A.tagName
' Range("B" & Count) = A.innerHTML
' Range("C" & Count) = A.innerText
' Range("D" & Count) = A.href
' Range("E" & Count) = A.ID
'Next A
'Next p



End Sub

Private Sub MyBrowser_DocumentComplete()
Dim pDisp As Double
Dim objDoc As HTMLDocument
Set MyBrowser = New SHDocVw.InternetExplorer

With MyBrowser
.Navigate URL:="http://www.alittlesomethingcatering.com"
.Top = 50 'set the browser in the top
.Left = 100 'left of the user's screen
.StatusBar = True 'display the status bar
.Visible = True
End With

pDisp = MyBrowser.HWND
MsgBox pDisp & " " & MyBrowser.Name

Set objDoc = MyBrowser.Document

' Display all of the HTML for the active Web page.


MsgBox "Entire HTML: " & vbCrLf & vbCrLf & objDoc.activeElement

With objDoc.body
MsgBox " outer HTML: " & vbCrLf & vbCrLf & .outerHTML
MsgBox " inner HTML: " & vbCrLf & vbCrLf & .innerHTML
End With


End Sub


Private Sub GetWeb()
' both statements work to open new explorer window
Set MyBrowser = New SHDocVw.InternetExplorer
'Set MyBrowser = CreateObject("InternetExplorer.Application")


With MyBrowser
'SiteAddress = "http://phoenix.ilsmart.com/FormsLogin.asp?/subscriber/default.asp"
SiteAddress = InputBox("Enter Url after the www.", "URL w/o header and .com trailer")
'ThePage = ("http://www." & SiteAddress & ".com")
.Navigate URL:=SiteAddress

.Top = 20 'set the browser in the top
.Left = 50 'left of the user's screen
.StatusBar = True 'display the status bar
.Visible = True
.Width = 500
.Height = 500



End With


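' Poll the status bar text until IE reports "Done";
' DoEvents keeps Excel responsive while waiting
Do Until Left(peekat2, 4) = "Done"
DoEvents
peekat2 = MyBrowser.StatusText
Loop
If peekat2 <> "Done" Then MsgBox ("Page didn't Load")
peekat2 = ""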

End Sub

Function Delay()
newHour = Hour(Now())
newMinute = Minute(Now())
newSecond = Second(Now()) + Wtime
waitTime = TimeSerial(newHour, newMinute, newSecond)
Application.Wait waitTime
End Function
I've used HTML Agility Pack (http://www.codeplex.com/htmlagilitypack) with success.
I've used that as well, on another project, to rapidly strip a page of HTML where using WebBrowser would not be appropriate.

J.Ja
I had created the pages with my old xml writer (not MS-XMLWriter). I found a few problems on it, but by and large I got everything quite well. I had the ability to get any attribute of any element. All the content too.

I still have the stuff set up in case I need to crawl my old site (or any other site) again.
... with the way the average person writes HTML. :(

J.Ja
Ano...
mattohare@... 25th Feb 2010
(I know), even my code had issues. I think I spent a day and a half sorting out the rogue code issues. (It was the same time we were talking about the movie Airplane, if you were wondering why I was that brand of loopy at the time.)

That said, if the code coming in is sound, XmlReader was absolutely dead on!
This all worked great for me until....

GRRRRRRR!

Frame cross-domain security issues. I never could get past this no matter what I did to security settings.

The web browser control DOES provide a nice DOM model, which works on most sites. But frames with different domains???

If any of you have a solution to this, I am still interested in pursuing...
Using IE's COM objects makes it child's play to extract anything from an HTML page, including navigating links to other pages. Plus, you don't have to waste disk space on the .NET framework, and you don't have to worry about Big Brother M$ watching over your shoulder.

http://www.autoitscript.com/
AutoIt looks interesting in and of itself, but your arguments in favor of it don't really hold much water. The .NET WebBrowser control is the IE COM object wrapped in .NET, and there are methods to access it fully in COM too. Wasting disk space for .NET? Hardly relevant, seeing as every Windows PC has .NET to begin with. And the Big Brother comment... not sure what you mean by that, but neither IE nor .NET report anything back to Microsoft...

J.Ja
.NET
JackOfAllTech 17th Feb 2010
I find the AutoIt COM interface much easier and more intuitive than any other, and the fact that it 'compiles' into a relatively small, standalone executable is another advantage.

My PCs do NOT have .Net on them. I refuse to have anything to do with it. I have yet to find anything I can't do in straight C or, if in a hurry, with AutoIt.

How do you KNOW that the framework doesn't call home? You can't even install it without allowing it to connect to the Internet.
.NET install
Justin James 18th Feb 2010
I would be incredibly shocked if you have Windows PCs without at least one version of the .NET framework. For one thing, .NET comes with all post-XP Windows versions (http://blogs.msdn.com/astebner/archive/2007/03/14/mailbag-what-version-of-the-net-framework-is-included-in-what-version-of-the-os.aspx). Secondly, huge swaths of applications install it, from printer drivers to little utilities.

In terms of "calling home", you absolutely CAN install it without being connected to the Internet. You just need to get the big package which doesn't download anything. Furthermore, the .NET Framework is something that anti-Microsoft folks love to pick at. Believe me, if it was "calling home", someone would have noticed by now and there would have been a massive outrage over it. If you have proof that .NET "calls home" I would love to see it, because I'll be the first to blast Microsoft for that. I find it very ironic that you are afraid of .NET "calling home" while using Windows (which *does* "call home") and IE.

I'm going to guess that you are also opposed to the Java platform for the same reasons. Between no Java and no .NET, there are an awful lot of apps that won't install on your system. I really wonder what the advantage there is in a system that you are ignoring a huge number of applications on, due to a prejudice based on speculation.

J.Ja
It wouldn't be my first choice for anything but I always install it (the non-M$ version) on a PC.

Although I honestly don't like Microsoft's attitude, I have made a good living supporting its products. My main objection is that .NET encourages poor programming practices. I just don't understand why the framework takes up hundreds of MBs when a C (even C++, even Java!) program that does exactly the same thing would only be a few MBs in size. Even with TB drives and multi-GB RAM, that seems wasteful to me and indicative of sloppy coding and sloppy reasoning.
OK
seanferd 18th Feb 2010
But isn't the Framework more properly compared with the JRE? Both are large, diskwise.

If you mean the size of .NET applications, I can't comment, as I don't use them or have the Framework installed, either.
You are right that the .NET installer is huge... however, only a portion of it actually makes it to disk. That's because the install package rolls up binaries across a variety of platforms. This link has additional details: http://www.hanselman.com/blog/SmallestDotNetOnTheSizeOfTheNETFramework.aspx The .NET Framework itself is not significantly larger than the full JDK, which is a good comparison. .NET applications themselves tend to be quite small (because they leverage the Framework), which actually makes them, pound for pound, more effective than native code apps once you have a lot installed (because they all refer to the same Framework, instead of installing a million DLLs, or creating "DLL hell" by dumping them into C:\Windows\System32).

In terms of "encouraging poor programming practices"? First of all, the .NET Framework itself does not encourage poor programming practices. Secondly, C# is very similar to Java in terms of programming practices, and by extension, VB.NET is as well. If you don't mind the way Java gets written, I have no idea why you'd disagree with how C# gets written. Thirdly, VB.NET and C# are both MUCH more rigorous than classic VB ever was (now *there* was a system that encouraged poor coding practices!). I am sure that if you aren't using any .NET apps, you probably have a few VB apps running around. Fourthly, I'd rather deal with the "poor programming practices" of the typical .NET (or Java) developer than the massive security holes that even the best developers can inadvertently put into C/C++ apps.

I think that if you really examine the facts of reality, it becomes fairly hard to forbid an entire swath of applications written in a particular set of languages. I've seen plenty of ugly Perl code (talk about "encouraging poor programming habits?!?!") yet every *Nix system out there is filled with critical system utilities that rely on Perl. I wouldn't stop using my FreeBSD server just because you can't have it up and running for 10 minutes post-install without Perl ending up on it as a requirement.

J.Ja
I thought the article title was Parse and Process HTML?!!! So, where's the code?? All you say here is that it can be done. Very helpful. What a waste of time!!!!!
Huh.
seanferd 18th Feb 2010
You aren't able to find the WebBrowser control?

Let me take care of that for you.
http://msdn.microsoft.com/en-us/library/aa752041(VS.85).aspx

http://msdn.microsoft.com/en-us/library/w290k23d.aspx

http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.aspx

I can't imagine why you'd want the specific parsing code for J.J.'s specific project, if that's what you mean.

Now you've wasted my time. Satisfied?
Thanks Justin for not pasting in a bunch of code snippets for anyone interested to scrape.
TextPipe Pro
duke.url@... 23rd Feb 2010
Has anyone tried TextPipe Pro for this purpose?
Which WebBrowser are you talking about? System.Windows.Forms.WebBrowser or System.Windows.Controls.WebBrowser? They are quite different.
... System.Windows.Controls.WebBrowser doesn't appear to have a ReadyState property...
The other is a WPF item for viewing HTML in a page. Probably wraps the same control, though.

J.Ja
More than one way to skin a cat, it would seem...

WebBrowser.DocumentCompleted Event @ http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.documentcompleted.aspx
"Handle the DocumentCompleted event to receive notification when the new document finishes loading. When the DocumentCompleted event occurs, the new document is fully loaded, which means you can access its contents through the Document, DocumentText, or DocumentStream property."

WebBrowserReadyState Enumeration @ http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowserreadystate.aspx
"Complete The control has finished loading the new document and all its contents"
You are right that I could have handled DocumentCompleted instead. Personally, I try to avoid event-driven, async-style code whenever possible, because I find that it adds a layer of abstraction and indirection that is not conducive to debugging. Purely personal preference, driven more by where/when/how I learned programming than anything else! For someone who prefers that style of code, handling DocumentCompleted is the way to do it. :)
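
As a rough, language-neutral illustration of the two styles being compared here, the sketch below uses a hypothetical FakePage class in Python. It is not the WebBrowser API; ready_state, document_completed and pump are invented names that only mimic the idea of polling a ready state versus handling a completion event.

```python
class FakePage:
    """Hypothetical stand-in for a browser control; not a real API."""

    def __init__(self, ticks_to_load):
        self._remaining = ticks_to_load
        self.ready_state = "loading"
        self.document_completed = []  # registered callbacks

    def pump(self):
        """Advance the fake load by one tick (stands in for message pumping)."""
        if self.ready_state == "complete":
            return
        self._remaining -= 1
        if self._remaining <= 0:
            self.ready_state = "complete"
            for callback in self.document_completed:
                callback(self)


# Style 1: poll the ready state in a loop, then carry on in a straight line.
page = FakePage(ticks_to_load=3)
polls = 0
while page.ready_state != "complete":
    page.pump()
    polls += 1
polled_result = f"parsed after {polls} polls"

# Style 2: register a handler and let the control call back when it is done.
events = []
page2 = FakePage(ticks_to_load=3)
page2.document_completed.append(lambda p: events.append("parsed in handler"))
for _ in range(5):
    page2.pump()

print(polled_result)
print(events)
```

In the polling version the parsing code sits right after the loop, which is easy to step through in a debugger; in the event version it lives in the handler, which is the indirection mentioned above.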

J.Ja
I didn't find the code, is it a joke?
pedrito rod
mr85@servidor.unam.mx
If you follow the link in the article, you'll find the actual application which was written, and if you install the application, you'll get the full source code in a folder in the install directory.

J.Ja
Should've used HTML Agility Pack. No need to reinvent the wheel; this has been done many a time and is called screen scraping. Nice approach, though. The httprequest class would have returned the response / web page markup you required as well...
I've used it on some other projects. For folks who do not want to use a 3rd party component, or need things like JavaScript parsing and the full DOM model, WebBrowser works very well. It's also a bit better documented, I think. More importantly, many people have written an awful lot of code designed to work with IE (even if it is not written to the HTML spec), and I feel like it would be mighty hard to replicate that. Indeed, if you want to be sure that something will parse the HTML you find on the Internet well, you are best off leveraging an existing browser engine (IE's Trident, Firefox's Gecko, etc.) rather than using a third-party one, because the third party ones don't have the piles of quirks that HTML authors write to.

I may write an article about Html Agility Pack in the future though, because I agree that it is a useful tool.

J.Ja