Parsing CSV by feeding BNF to Haskell's Parsec module

Chip Camden shows how easy it is to translate Backus-Naur Form rules into Haskell code that will parse a comma-separated file.

In a previous post, I provided a general overview of Backus-Naur Form (BNF) and an example set of BNF rules for comma-separated values (CSVs). Today we'll see just how easy it is to translate those rules into Haskell code that will parse a CSV file.

To work this bit o' magic, we first need a proper spellbook. So, we go and grab Parsec off the shelf. Parsec is a parsing module that comes distributed with the Glasgow Haskell Compiler (GHC). In order to make use of this module, we have to import it:

import Text.ParserCombinators.Parsec

Now, we'll proceed to convert each line of the CSV BNF we presented earlier into a Haskell function. Don't worry, Haskell functions are often one-liners. First, the definition of a CSV file. Our BNF for this was:

csv-file = { row }

which indicates that a csv-file is composed of many rows. We can state that rule in Haskell as:

csv_file = many row

Easy, huh? The many function is provided by Parsec, and means exactly the same thing as the braces in Extended Backus-Naur Form (EBNF).

Next comes the definition of row. First the BNF:

row = field-list, eol

Now the Haskell:

row = do result <- field_list


return result

This one is a little more involved, because the rule contains two elements, and we only want to return the first one (the list of fields) because we don't need to return the end-of-line character. So, we use a do, obtain our desired result as the list of fields (which we'll define in another rule), consume the end-of-line (which we'll also define later), then return the desired result. We can see from this example that although BNF defines the components of a grammar, it doesn't tell you which parts you're looking for. That's all we've really added here.

The rule for field_list is even more involved:

field-list = field, [ ",", field-list ]

Now, we could write the Haskell code for this to follow the above definition exactly. It would look something like this:

field_list = do first_cell <- field

comma <- optionMaybe (string ",")

remaining <- case comma of

Just "," -> field_list

Nothing -> return []

return (first_cell:remaining)

We obtain the first field, then look for an optional comma. The optionMaybe function returns a Maybe type, which is either Just the syntactic element we're looking for, or Nothing if that element isn't provided. So, we do a case on what we got and construct the remainder of the field list as either another field_list or an empty list, respectively. Then we return the first field followed by the remainder.

That's a literal translation, but Parsec provides a function that will make this a lot simpler. This rule fits a common case in parsing of elements separated by other elements. We can rephrase the rule as:

field_list = field `sepBy` (char ',')

This will parse fields, separated by a single comma. I could have invoked the sepBy function with prefix notation (sepBy field (char ',')), which would have saved me the back-ticks, but in that case the word "By" is followed by "field", which doesn't read right to me.

This rule is going to return a list of whatever field returns.

Okay, on to the definition of a field:

field = [ whitespace ], field-value, [ whitespace ]

In this case, we're going to ignore the optional whitespace, and just return the field-value, so:

field = do optional whitespace

result <- field_value

optional whitespace

return result

This looks a lot like the rule for row, except that the whitespace is optional. Parsec's optional function will consume what we specify and ignore it, but it won't complain if it doesn't find it.

Next, we encounter our first rule with an either/or choice:

field-value = quoted-string | bare-string

That translates word for word into Haskell as:

field_value = quoted_string <|> bare_string

The <|> operator is provided by Parsec. Parsec can't use just plain "|" because that's a reserved token in Haskell, used in pattern matching.

Next comes the rule for quoted strings:

quoted-string = '"', quoted-content, '"'

We don't care to keep the quotes, so we'll just extract and return what's between them:

quoted_string = do char '"'

result <- quoted_content

char '"'

return result

The rule for the quoted content looks a lot like our rule for csv-file:

quoted-content = { quoted-char }

And it translates much the same way:

quoted_content = many quoted_char

The BNF rule for a quoted character shows a remaining limitation even of EBNF:

quoted-char = (any char except '"' or eol)

It resorts to a call-out in English, because BNF doesn't have a good rule for "anything except", but Parsec does:

quoted_char = noneOf "\"\n"

We just have a few more rules to translate. They don't tell us anything new about Parsec, so I'll just provide them here all at once:

bare_string = many bare_char

bare_char = noneOf ",\n"

whitespace = many space_char

space_char = char ' ' <|> char '\t'

eol = string "\n"

That's it! To use these rules, we'll add a little main function:

main = do c <- getContents

case parse csv_file "<stdin>" c of

Left e -> do putStr "Error: "

print e

Right rslt -> mapM_ print rslt

This reads the content from stdin using getContents, then parses it by calling the Parsec function parse passing it our csvfile function. The parse function returns an Either type: either a Left with the error that occurred, or a Right with the successful result. If you follow through all of the rules we've defined, a successful parse will result in a list of lists of Strings, each of which is the unquoted content of a cell. If we needed to get fancier, we could have provided a type for the cells, or even the rows, but we'll save that complexity as an exercise for the type-hungry reader.

We could have also allowed quotation marks within a quoted string to be escaped in some way, as in this version (which also defines eol as either a newline or a carriage-return/newline pair). And we could have collapsed the above rules into fewer functions, but I wanted to demonstrate how you can translate BNF to Haskell rule for rule. This insures that if your BNF is correct, your Haskell code will parse the target grammar correctly.