I have been spending a bit of time working on a small utility application. It is designed to solve a problem that I deal with constantly, and one that I am sure many other people face too: it analyzes a delimited file and determines the type and maximum width of each field. All too frequently, a customer will dump a delimited file on me without providing any metadata (I consider myself lucky if I get field names; I often have to guess which field is which). To ensure accurate use of the data, it is crucial that I create a database table whose columns are not too narrow and do not have incorrect field types. On the other hand, making fields too wide or using more generic field types makes the database slow and uses storage inefficiently. After all, which is faster to perform a JOIN on: an integer column or a varchar? How quickly will I fill up my storage space if I use 200-character columns when I really only need 50 characters?
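To make the task concrete, here is a minimal sketch of that kind of analysis. It is not the actual utility: the module name, the sample rows, and the "integer-only" check are assumptions made up for illustration, and a real run would read lines from a file and test for more types than this.

```vbnet
Imports System

Module FieldProfileSketch
    Sub Main()
        ' Made-up sample rows; a real run would read these from the file.
        Dim aLines() As String = {"1,Smith,10.50", "22,Jones,7", "333,,12.25"}
        Dim cDelimiters() As Char = ",".ToCharArray()

        ' One slot per column (indexes 0 to 2); this sketch assumes three columns.
        Dim aMaxWidth(2) As Integer
        Dim aAllIntegers() As Boolean = {True, True, True}

        For Each sLine As String In aLines
            Dim aFields() As String = sLine.Split(cDelimiters)

            For iColumn As Integer = 0 To aFields.Length - 1
                Dim sValue As String = aFields(iColumn)
                Dim iParsed As Integer

                ' Track the widest value seen so far in this column.
                If sValue.Length > aMaxWidth(iColumn) Then
                    aMaxWidth(iColumn) = sValue.Length
                End If

                ' A non-empty value that is not an integer rules out an integer column.
                If sValue.Length > 0 AndAlso Not Integer.TryParse(sValue, iParsed) Then
                    aAllIntegers(iColumn) = False
                End If
            Next
        Next

        For iColumn As Integer = 0 To aMaxWidth.Length - 1
            Console.WriteLine("Column {0}: width {1}, integer-only: {2}", _
                              iColumn, aMaxWidth(iColumn), aAllIntegers(iColumn))
        Next
    End Sub
End Module
```

The widths tell me how wide each column has to be, and the flags tell me which columns can safely be integers instead of varchars.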

So I wrote the application in VB.Net, a quick and dirty language for a quick and dirty job. Sadly, the performance just outright stunk. Once again, my low expectations of .Net were met. My first thought was that Perl would have run through the test file (about 30 MB) as fast as my hard drive could dish it out. I was about to rewrite the program in Perl when I decided to see whether reworking the VB.Net code could yield any speed benefits.

With about five minutes' worth of code editing, I reduced execution time to approximately 1% of what it had been. In other words, the program is roughly 100 times faster now. In fact, it now executes about as fast as I would expect equivalent Perl code to execute. I did not bother to rewrite it in Perl, because it is more than fast enough for my needs. Sadly, any performance numbers I would have gotten from it could not be published anyway, thanks to MSDN’s license agreement. (Correction 6/29/06: after re-reviewing the MSDN EULA, this is incorrect. I may publish benchmarks on the .Net Framework; it is one of the few exceptions to the general “no benchmarks allowed” rules in the EULA.)

What changes did I make to the VB.Net code to achieve this performance miracle?

I dumped as much of the object oriented code as I could.

Here is an example (clarification 6/26/06: this is not the real code, just some similar code for demonstration purposes):

“Proper” Object Oriented Code

For iRowCounter = 0 To cStringList.Length - 1
      ' Split the row; ToCharArray builds a new array on every pass
      aSplitString = cStringList(iRowCounter).Split(sDelimiter.ToCharArray)
 
      ' The column count is looked up on the object for every row
      For iColumnCounter = 0 To aFieldNames.Length - 1
            If aSplitString(iColumnCounter).Length = 0 Then
                  sOutput = sOutput & "Empty"
            Else
                  sOutput = sOutput & aSplitString(iColumnCounter).Length.ToString
            End If
 
            sOutput = sOutput & " "
      Next
 
      sOutput = sOutput & vbNewLine
Next
 

Procedural Style Code

' Look up everything that cannot change once, before the loops start
iNumberOfRows = cStringList.Length - 1
iNumberOfColumns = aFieldNames.Length - 1
cDelimiters = sDelimiter.ToCharArray
 
For iRowCounter = 0 To iNumberOfRows
      aSplitString = cStringList(iRowCounter).Split(cDelimiters)
 
      For iColumnCounter = 0 To iNumberOfColumns
            If aSplitString(iColumnCounter).Length = 0 Then
                  sOutput = sOutput & "Empty"
            Else
                  sOutput = sOutput & aSplitString(iColumnCounter).Length.ToString
            End If
 
            sOutput = sOutput & " "
      Next
 
      sOutput = sOutput & vbNewLine
Next

 

These examples are quite short, but you can see the difference. By making these kinds of small changes, I took execution time from about 7.5 minutes to under 10 seconds.

Unfortunately, as an example of clean OO code, the program now looks a little too much like procedural code for the comfort of some people. After all, I am using an OO language, right? So why use all of that procedural code? Well, I think the results speak for themselves.

This example highlights an interesting thing: the less elegant code wins out. Is it easier to maintain code like this? Absolutely not. I have always said that each line of code is a potential point of failure; reduce your lines of code, and you reduce your errors and debugging time. Furthermore, what is the point of using an OO language if you are going to strip away many of its OO aspects as soon as you can? And why take up memory duplicating all of that information when it is already available in the objects?

I have read through a lot of source code in my life, and many programmers do indeed choose to use the OO style as much as possible. The code does look better, and it retains an element of elegance. But as my experience has shown time and time again, the OO style executes extremely slowly. The reason lies in the very nature of OO code. For each of those length lookups, the system does not cache the value, even if it has not changed. Even if the length is stored as a property (as opposed to being computed on demand), that information still has to be dug up. This means burrowing through the object tree to get at the underlying data, and then bubbling it back up each time execution reaches that For statement. In the procedural style, that information is already on hand in the final type needed.

As a rule of thumb, if a property (or method return) will be used more than once before it changes, it is nearly always faster to store that information in a temporary variable. Of course, there are exceptions to this. If you are going to be working with an insane number of temporary variables at once, the memory used to store them may go up so quickly that you bury the needle on RAM usage. That can happen if the items in the property are absolutely huge (say, 100 MB of data in each one) or if the application is highly multithreaded. In a situation like that, you need to decide between memory usage and CPU utilization. But for your average utility loops, it is a sure bet that storing those values in temporary space (especially things used in a condition) is going to reap huge performance benefits.
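As a minimal sketch of that rule of thumb (again, not my real code; CountFields, the module name, and the sample line are made up for illustration), the two loops below do the same work. The first calls a method in a Do While condition, which is tested again on every pass; the second stores the result in a temporary variable first.

```vbnet
Imports System

Module TempVariableSketch
    ' Hypothetical helper: re-splits the line every time it is called.
    Function CountFields(ByVal sLine As String, ByVal cDelimiters() As Char) As Integer
        Return sLine.Split(cDelimiters).Length
    End Function

    Sub Main()
        Dim sLine As String = "id,name,,zip"
        Dim cDelimiters() As Char = ",".ToCharArray()
        Dim iColumn As Integer

        ' Uncached: CountFields runs again each time the condition is tested.
        iColumn = 0
        Do While iColumn < CountFields(sLine, cDelimiters)
            iColumn += 1
        Loop

        ' Cached: the value cannot change inside the loop, so fetch it once.
        Dim iFieldCount As Integer = CountFields(sLine, cDelimiters)
        iColumn = 0
        Do While iColumn < iFieldCount
            iColumn += 1
        Loop

        Console.WriteLine(iFieldCount)   ' prints 4
    End Sub
End Module
```

The cached loop costs one extra Integer of memory and saves a Split call on every pass, which is the trade I am describing.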

So choose your poison: ugly, harder to maintain, less elegant procedural-esque code, or slower to execute, pure OO code. Regular readers know my emphasis on end user satisfaction; slow code makes no one happy. I prefer to break the OO paradigm in order to reap speed benefits any day of the week.

J.Ja