Apps

Parsing CSV by feeding BNF to Haskell's Parsec module

Chip Camden shows how easy it is to translate Backus-Naur Form rules into Haskell code that will parse a comma-separated file.

In a previous post, I provided a general overview of Backus-Naur Form (BNF) and an example set of BNF rules for comma-separated values (CSVs). Today we'll see just how easy it is to translate those rules into Haskell code that will parse a CSV file.

To work this bit o' magic, we first need a proper spellbook. So, we go and grab Parsec off the shelf. Parsec is a parsing module that comes distributed with the Glasgow Haskell Compiler (GHC). In order to make use of this module, we have to import it:

import Text.ParserCombinators.Parsec

Now, we'll proceed to convert each line of the CSV BNF we presented earlier into a Haskell function. Don't worry, Haskell functions are often one-liners. First, the definition of a CSV file. Our BNF for this was:

csv-file = { row }

which indicates that a csv-file is composed of many rows. We can state that rule in Haskell as:

csv_file = many row

Easy, huh? The many function is provided by Parsec, and means exactly the same thing as the braces in Extended Backus-Naur Form (EBNF).

Next comes the definition of row. First the BNF:

row = field-list, eol

Now the Haskell:


row = do result <- field_list

eol

return result

This one is a little more involved, because the rule contains two elements, and we only want to return the first one (the list of fields) because we don't need to return the end-of-line character. So, we use a do, obtain our desired result as the list of fields (which we'll define in another rule), consume the end-of-line (which we'll also define later), then return the desired result. We can see from this example that although BNF defines the components of a grammar, it doesn't tell you which parts you're looking for. That's all we've really added here.

The rule for field_list is even more involved:

field-list = field, [ ",", field-list ]

Now, we could write the Haskell code for this to follow the above definition exactly. It would look something like this:


field_list = do first_cell <- field

comma <- optionMaybe (string ",")

remaining <- case comma of

Just "," -> field_list

Nothing -> return []

return (first_cell:remaining)

We obtain the first field, then look for an optional comma. The optionMaybe function returns a Maybe type, which is either Just the syntactic element we're looking for, or Nothing if that element isn't provided. So, we do a case on what we got and construct the remainder of the field list as either another field_list or an empty list, respectively. Then we return the first field followed by the remainder.

That's a literal translation, but Parsec provides a function that will make this a lot simpler. This rule fits a common case in parsing of elements separated by other elements. We can rephrase the rule as:

field_list = field `sepBy` (char ',')

This will parse fields, separated by a single comma. I could have invoked the sepBy function with prefix notation (sepBy field (char ',')), which would have saved me the back-ticks, but in that case the word "By" is followed by "field", which doesn't read right to me.

This rule is going to return a list of whatever field returns.

Okay, on to the definition of a field:

field = [ whitespace ], field-value, [ whitespace ]

In this case, we're going to ignore the optional whitespace, and just return the field-value, so:


field = do optional whitespace

result <- field_value

optional whitespace

return result

This looks a lot like the rule for row, except that the whitespace is optional. Parsec's optional function will consume what we specify and ignore it, but it won't complain if it doesn't find it.

Next, we encounter our first rule with an either/or choice:

field-value = quoted-string | bare-string

That translates word for word into Haskell as:

field_value = quoted_string <|> bare_string

The <|> operator is provided by Parsec. Parsec can't use just plain "|" because that's a reserved token in Haskell, used in pattern matching.

Next comes the rule for quoted strings:

quoted-string = '"', quoted-content, '"'

We don't care to keep the quotes, so we'll just extract and return what's between them:


quoted_string = do char '"'

result <- quoted_content

char '"'

return result

The rule for the quoted content looks a lot like our rule for csv-file:

quoted-content = { quoted-char }

And it translates much the same way:

quoted_content = many quoted_char

The BNF rule for a quoted character shows a remaining limitation even of EBNF:

quoted-char = (any char except '"' or eol)

It resorts to a call-out in English, because BNF doesn't have a good rule for "anything except", but Parsec does:

quoted_char = noneOf "\"\n"

We just have a few more rules to translate. They don't tell us anything new about Parsec, so I'll just provide them here all at once:


bare_string = many bare_char

bare_char = noneOf ",\n"

whitespace = many space_char

space_char = char ' ' <|> char '\t'

eol = string "\n"

That's it! To use these rules, we'll add a little main function:


main = do c <- getContents

case parse csv_file "<stdin>" c of

Left e -> do putStr "Error: "

print e

Right rslt -> mapM_ print rslt

This reads the content from stdin using getContents, then parses it by calling the Parsec function parse passing it our csvfile function. The parse function returns an Either type: either a Left with the error that occurred, or a Right with the successful result. If you follow through all of the rules we've defined, a successful parse will result in a list of lists of Strings, each of which is the unquoted content of a cell. If we needed to get fancier, we could have provided a type for the cells, or even the rows, but we'll save that complexity as an exercise for the type-hungry reader.

We could have also allowed quotation marks within a quoted string to be escaped in some way, as in this version (which also defines eol as either a newline or a carriage-return/newline pair). And we could have collapsed the above rules into fewer functions, but I wanted to demonstrate how you can translate BNF to Haskell rule for rule. This insures that if your BNF is correct, your Haskell code will parse the target grammar correctly.

About

Chip Camden has been programming since 1978, and he's still not done. An independent consultant since 1991, Chip specializes in software development tools, languages, and migration to new technology. Besides writing for TechRepublic's IT Consultant b...

22 comments
lisaroy
lisaroy

Its totally above my head, i m not able to understand the whitespace field and how we can set up perfectly. My Blog @[url=http://www.smsbuds.com/]hindi sms[/url]

ed34222
ed34222

This actually seems more complicated than the same logic in php, cfml, perl or jsp.

aikimark
aikimark

Chip, Nice follow-up. How are the Haskell statements 'packaged' relative to the main routine? I did a double-take in the middle of the article, thinking that there was a typo in the BNF statements. However, my confusion derived from the rapid switching between BNF and Haskell. (hyphen vs. underscore) You are probably limited in your choices, but if you have input into the development process, suggest adding the ability to change the background color of code windows.

oldbaritone
oldbaritone

It's nice to see a serious tech article once in a while!

apotheon
apotheon

You're probably not writing a parser when you implement the same idea; instead, you probably use regexen to do the work. The truth is that the logic within that regex is much more complicated, but hidden behind special characters that represent string matching rules.

aikimark
aikimark

Chip, I finally found the source and see the how the Haskell statements are positioned in the file. I suspected that they would all be in a single file.

apotheon
apotheon

As another contributor for TechRepublic, I think I speak for Chip as well when I say that I wish we had some real input into site styles and development. If you use Firefox, though, I can at least offer a work-around. Chip created a Greasemonkey script called tr-rectify that overrides a lot of the styles on the site, and I've contributed some code to it (including some stuff related to code sample display in articles and comments). You can get the Greasemonkey extension (if you do not already have it) from Mozilla's extension archives. You can get tr-rectify from Chip's tr-rectify Mercurial repository or as a tarball download from the tr-rectify announcement at Chip's Tips.

Sterling chip Camden
Sterling chip Camden

I like a no-compromises functional language like Haskell. I just wish I got to use it for more real projects.

apotheon
apotheon

Unfortunately, my understanding is not quite as complete as the two-part lesson. I suspect that'll be resolved when I get to the Haskell chapter in Seven Languages in Seven Weeks, which is taking me a lot longer than seven weeks because I keep taking a break from it to read other stuff after each "day". I finally got to the Prolog chapter, though, and have made some real headway through it.

Sterling chip Camden
Sterling chip Camden

But it's also more rigorous. It's difficult to insure that a programmatic or Regex solution is 100% compliant for all edge cases, but you can be reasonably certain that a Parsec solution will conform 100% to a BNF specification.

Sterling chip Camden
Sterling chip Camden

I put it all in one file for simplicity, but all you have to do is declare a module that exports what you want to be public, e.g.: module Csv(csv_file) where Name the file the same as the name of the module, with a .hs extension. Then, in the main module, do: import Csv The Haskell compiler will search its include path for the file Csv.hs. By default, ghc includes . in that path. You can also create subordinate namespaces by placing the file in a subdirectory, and include the subdirectory in the import statement, using . as a separator.

aikimark
aikimark

Sorry, but I'm a Chrome user.

Sterling chip Camden
Sterling chip Camden

I'll have to learn it in more depth some day. For me and Haskell, getting past the IO Monad was the big hurdle.

aikimark
aikimark

mostly Mac. He only uses Windows when he 'has to'.

apotheon
apotheon

FreeBSD has development tools that are lacking in Debian; apt-cache search snobol returns no results. Gone are the days when everything I wanted I found in the Debian APT archives. Is it any wonder I prefer FreeBSD?

Sterling chip Camden
Sterling chip Camden

is in the FreeBSD ports system as lang/snobol4. What OS does your friend use?

aikimark
aikimark

Chip This might be a bit off-topic, but I have a research colleague that loves SNOBOL. For years, I've kept my eyes peeled for a decent SNOBOL environment for him. I've also looked at several alternative tools. Do you have some opinions or recommendations?

apotheon
apotheon

Maybe I'll see if I can create something similar for Chromium some day, but at the moment I do not find Chromium useful enough (thanks to its emasculated extension system) to use it very often. I wish I had a better answer for you.

Sterling chip Camden
Sterling chip Camden

except for its hamstrung extension system. I rely too much on Firefox extensions like Greasemonkey.

Editor's Picks