
CSV Reader is a lifesaver for parsing CSV files

Justin James describes how a great CSV parser on CodeProject saved him lots of coding and debugging.

There are a lot of ways to parse a CSV file in .NET; indeed, it is easy enough to write your own CSV parser. Recently, though, I encountered a CSV file that was a bit more complex than simply being "comma delimited": it had fields containing newline characters, which takes a good deal more work to parse. In fact, this file was "ugly" enough that SQL Server's BULK INSERT command and SSIS could not load it into the database, which was my end goal. Instead of writing my own parser for it, I hoped that someone out there had already written a great one, and luckily, I found it over at CodeProject.

I was able to quickly assemble an application to run through the CSV file and turn it into something that could be imported, and it was really easy. I used CSV Reader to parse the import file, then went through it field by field, replacing newline characters with <br /> tags (the data was to be displayed on Web pages) and stripping out goofy characters (anything with an ASCII code below 32 or above 127, with the exception of 9, the tab). After that, I wrote it back out in standard tab delimited format: a tab to separate each field, and a newline to separate records.

Here is the code I used:

using System;
using System.IO;
using LumenWorks.Framework.IO.Csv; // the CodeProject Fast CSV Reader's namespace

static void Main(string[] args)
{
    var fileName = @"C:\Users\jjames\Downloads\importfile.csv"; // Default file location
    if (args.Length > 0)
    {
        fileName = args[0];
    }

    var file = new StreamReader(fileName);
    var csvReader = new CsvReader(file, true); // true = first record is a header row
    var fieldCount = csvReader.FieldCount;
    var output = string.Empty;
    byte byteValue;

    while (csvReader.ReadNextRecord())
    {
        // Rebuild the record as tab delimited, swapping newlines for <br /> tags
        for (var index = 0; index < fieldCount; index++)
        {
            output += csvReader[index].Replace("\n", "<br />").Replace("\t", " ").Trim();
            if (index != fieldCount - 1)
            {
                output += "\t";
            }
        }

        // Write only printable ASCII (32-126), plus tab (9)
        foreach (var character in output.ToCharArray())
        {
            byteValue = (byte)character;
            if ((byteValue >= 32 && byteValue < 127) || byteValue == 9)
            {
                Console.Write(character);
            }
        }
        Console.WriteLine();
        output = string.Empty;
    }

    file.Dispose();
}

As you can see, it was a snap using CSV Reader. All I had to do was give it a StreamReader opened on the CSV file, and then I could loop through the records with the ReadNextRecord() method. There are a pile of options available, but the defaults worked great for me.
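If the defaults do not fit (say, a semicolon-delimited file), the reader can be told about it up front. The following is just a sketch, assuming the constructor overload that takes a delimiter character and the GetFieldHeaders() method documented on the CodeProject page; the file path is made up:

// A sketch only: assumes the (TextReader, bool hasHeaders, char delimiter)
// constructor overload and GetFieldHeaders() from the CodeProject documentation.
using (var reader = new StreamReader(@"C:\data\semicolons.csv")) // hypothetical file
{
    var csv = new CsvReader(reader, true, ';'); // semicolon delimiter, first row is headers
    var headers = csv.GetFieldHeaders();        // column names from the header record
    while (csv.ReadNextRecord())
    {
        Console.WriteLine("{0}: {1}", headers[0], csv[0]);
    }
}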

The speed was poor, but that is due to the character-by-character filtering. On a laptop with a decent CPU and RAM and a typical laptop I/O system (the real bottleneck for this application), it ran through a nearly 2 GB file, while writing out one of about the same size, in roughly 10 minutes, which was fine with me. After the file was parsed and saved in tab delimited format, SQL Server happily loaded it with the BULK INSERT command, and I was on my merry way.
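For completeness, that final load can be kicked off from code as well. The table name, file path, and connection string below are placeholders rather than the real ones; the point is just that the BULK INSERT options line up with the tab/newline format produced above:

// Placeholder names throughout; shown only to illustrate how the tab delimited
// output maps onto BULK INSERT's FIELDTERMINATOR and ROWTERMINATOR options.
using System.Data.SqlClient;

using (var connection = new SqlConnection("Server=.;Database=ImportDb;Integrated Security=true"))
{
    connection.Open();
    var sql = @"BULK INSERT dbo.ImportTable
                FROM 'C:\Users\jjames\Downloads\importfile.txt'
                WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n')";
    using (var command = new SqlCommand(sql, connection))
    {
        command.ExecuteNonQuery();
    }
}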

CSV Reader saved my day. This would have taken me a while to write, debug, and get right!

J.Ja


About

Justin James is the Lead Architect for Conigent.

3 comments
mattohare

It was nice for some rather huge CSVs in some GTFS packages.

jimlonero

To speed things up, you might have used a StringBuilder instead of a string for output; it is much quicker to append to than a string. Then, for the second (foreach) loop, get the finished string out of the StringBuilder and proceed as before.
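A sketch of that change (not from the article) could be a drop-in replacement for the while loop above, reusing the same csvReader and fieldCount variables; StringBuilder lives in System.Text:

// Sketch of the suggested change: build each record in a StringBuilder,
// then filter the finished string character by character as before.
using System.Text;

var output = new StringBuilder();
while (csvReader.ReadNextRecord())
{
    for (var index = 0; index < fieldCount; index++)
    {
        output.Append(csvReader[index].Replace("\n", "<br />").Replace("\t", " ").Trim());
        if (index != fieldCount - 1)
        {
            output.Append('\t');
        }
    }
    foreach (var character in output.ToString())
    {
        var byteValue = (byte)character;
        if ((byteValue >= 32 && byteValue < 127) || byteValue == 9)
        {
            Console.Write(character);
        }
    }
    Console.WriteLine();
    output.Clear(); // reuse the same builder for the next record
}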

sparent

That might have been a poor choice of word. Those characters above ASCII 127 happen to contain characters that let me type my name correctly, with the accented letter. I certainly don't think they are goofy.