Software Development

What is BNF, and why should developers care?

Chip Camden explains what the Backus-Naur Form syntax is and how it works. Take a look at his examples of this abstract programming concept.

Backus-Naur Form (BNF) is a syntax for describing syntax. It's used to write a formal representation of a context-free grammar. If that doesn't sound abstract enough for you, a grammar doesn't have to be a programming language or even a human language -- it can be any syntax that you want to describe.

I first encountered BNF when working on the 1992 ANSI standard for DIBOL, which contained a complete BNF syntactic specification for that language (it was 11 pages long). The exercise of defining the language in BNF helped the committee rigorously work out the details of its syntax. One of the language implementers also fed this BNF into Yet Another Compiler Compiler (yacc) to help build their compiler for their next version, so they could guarantee that it complied syntactically with the specification. That's where BNF really shines: Once you have a solid syntactic definition in BNF, you can use tools to rigorously parse the syntax it describes, without having to invent code that implements the specification, or rely on your mastery of Regular Expressions.

BNF uses a declarative syntax that allows you to define terms via "production rules." Each rule contains terms that each have more concrete rules until you get down to "terminals," which are terms that we can only describe as specific characters (literal values). Thus, for instance, if we wanted to describe the syntax for calling a shell command named "fred" that takes one required non-negative numeric value as its argument, we could express it like this:


<call-fred> ::= fred <whitespace> <number> <eol>

<whitespace> ::= <space-char> | <space-char> <whitespace>

<space-char> ::= \s | \t

<number> ::= <digit> | <digit> <number>

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

<eol> ::= \n | <whitespace> \n

Each of the identifiers in angle-brackets is the name of a term. A line beginning with the term name, followed by "::=" and the components that make up the term forms the term's production rule. The terminals in the above are the literal values that aren't enclosed in angle-brackets. The "|" operator provides a logical "or." As in functional programming languages, we use recursion to describe variable-length lists like <number> and <whitespace>.

Many variants to the above syntax have emerged. Extended Backus-Naur Form (EBNF) provides a number of enhancements and simplifications. By requiring quotation marks around literal values, we can dispense with a lot of other punctuation. We can therefore express our example above in a form that seems more simple and familiar to programmers:


call-fred = "fred", whitespace, number, eol

whitespace = space-char, { space-char }

space-char = " " | "\t"

number = digit, { digit }

digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

eol = [ whitespace ], "\n"

Note the use of braces to indicate "zero or more," and square brackets for optional items. Some variants also use the Regex-like asterisk "*" trailing a term to indicate zero or more, as well as "+" for one or more.

For a more useful example, let's describe the syntax of comma-separated values (CSVs).


csv-file = { row }

row = field-list, eol

field-list = field, [ ",", field-list ]

field = [ whitespace ], field-value, [ whitespace ]

field-value = quoted-string | bare-string

quoted-string = '"', quoted-content, '"'

quoted-content = { quoted-char }

quoted-char = (any char except '"' or eol)

bare-string = { bare-char }

bare-char = (any char except ',' or eol)

whitespace = space-char, { space-char }

space-char = " " | "\t"

eol = "\n"

In the first rule, we define a CSV file as a series of zero or more rows. A row is a list of fields, followed by an end-of-line character. The list of fields has at least one field (even though that field can have zero length). A comma and another list of fields can optionally follow (note the recursion). An individual field may have whitespace on either end, and the field value may be either a quoted string or a bare string. A quoted string may not contain a quote, and a bare string may not contain a comma.

In a future article, we'll see how we can transform this definition into code that parses a CSV file.

About

Chip Camden has been programming since 1978, and he's still not done. An independent consultant since 1991, Chip specializes in software development tools, languages, and migration to new technology. Besides writing for TechRepublic's IT Consultant b...

12 comments
guy
guy

Enjoyed reading it. Made me thing of xml and their schemas, which my limited exposure, perform a similar task...but perhaps they lean more towards defining the structure of the document, than the formats of teh strings within...

apotheon
apotheon

Thanks. I was looking forward to this article, and it was much more clearly stated than I expected.

Mark Miller
Mark Miller

...I found in using yacc was keeping in mind that when I'm classifying the input, I'm dealing with a state machine (in fact, two state machines: a tokenizer, and the parser). I wonder if it would've been better to draw out each of the rules in a state diagram until I got used to the idea. Many years ago I was trying to convert a C-coded language parser into BNF, and I ended up throwing out some language features, because I was running into shift-reduce (one rule allows a token to be shifted onto the stack, another allows a reduce action to a non-terminal with the same token) and reduce-reduce conflicts (two or more rules reduce to a non-terminal on the same token, and the parser generator can't decide which to choose), and I couldn't figure out how to resolve them. Looking back on it now, I think the problem was caused by how I classified the input, and probably my unsophisticated use of regex's (using lex). With most expressions it's easy to express the beginning and the end of the expression you want to match, but sometimes the specification for a part of the language is vague enough that a beginner can fall into traps. A really frustrating thing for me used to be that I'd give a regex that matched what I wanted, but it also matched stuff that came after it. So I had to tell the tokenizing state machine when to stop! You see that in one of Chip's examples (though he was dealing with the parser state machine). The language I dealt with described reports. One of the original features was plain text that could be inserted as a block among data elements. The syntax caused problems for me, because each line of a block began with a "+", and was followed by a line of text, like a comment in code. I don't remember why, but it was interfering with the design of the rest of my BNF spec. I might've tried to simplify and optimize too soon. Hey, reminds me of Chad's recent post. :) Another challenge was in trying to put out clear error messages. If someone fed some syntactically incorrect text into the parser, the best I could put out was "syntax error." I might've been able to give the line number, but it was really difficult (given my limited knowledge of how yacc worked) to get the parser to be clearer about what was wrong with the line. There was a special "error" token I used, but it didn't seem like I could do a lot with it in the BNF spec.

Tony Hopkinson
Tony Hopkinson

off a kitchen sink, with a pair of grips or an adjustable wrench. You know laid on your back, water dribbling on your head, blood from barked knuckles dripping down your wrist, your missus asking you if she should ring the plumber, been under there half an hour and you think you might have tightened the thing by at least a quarter of a turn. I mean it's only a tap ffs... If you had a tap spanner, job done in minutes (as long as you remember to isolate the thing :p ) Not used often, but nigh on indispensable.

Scott.Geiger
Scott.Geiger

First let me say that this was a very insightful article. I have seen something like this before, but never really looked into it; so thanks for the great intro. However, after reading I am still asking "why should I (a developer) care?" I was hoping that would be answered, perhaps in a future article?

aikimark
aikimark

Chip, Thanks for this article. I'm looking forward to the next one(s) in the series. I inherited some code for a commercial application and the biggest aid in understanding the application and communicating with other developers was to create a specification document. It wasn't BNF, but it forces everyone to agree to the command formats and meanings.

Justin James
Justin James

... that I've heard a number of times, but never looked into. Thanks! J.Ja

seanferd
seanferd

You'll be a wee bit closer, then. Even if they aren't sort of the same thing. But I think I see the parallels you are seeing. edit: Current SGML, because some XML requirements had to be back-ported into SGML.

Tony Hopkinson
Tony Hopkinson

unless you've had already tried to develop something where it would have come in damn handy. Have a wee go at say developing something that can parse simple formulae C = A + B C = A + -B C = A + (-B) etc BNF gives you a formal way of describing all the legal combinations, even more valuable in my opinion, using BNF can highlight syntactic and semantic ambiguities in waht you are trying to parse, so it's invaluable for designing things you will want to write a parser for. Then you don't have to piss about parsing context sensitive stuff like = for assignment versus = for equality, which is a major PIA when you get the to nitty gritty of coding it.

Sterling chip Camden
Sterling chip Camden

I like the way it converges with functional languages, which helps me understand both better.

Ternarybit
Ternarybit

Agreed, though in the modern programming context it seems highly academic. Still a very good article, even if the vast majority of programmers will never implement a language or interpreter outside of university.

Tony Hopkinson
Tony Hopkinson

Mr Camden threw in parsing a csv file because it does have a language, it's not a huge complex one, and it's fairly free of ambiguity. How often have you seen an import fail because say quote char was in a field though. :( Most of the situations we run into where BNF would be a good tool are so trivial we can and do brute force past them. Parsing command line arguments for instance, wouldn't it be nice if you had a useful set of rules, may be even a standard across your executables for that, instead of 5 - 10 flavours and at least two completely seperate routines in every one for processing it's own and launching others. Or if the language you use supports optional arguments in a function call , how many times have you seen someone go over the top on that. Or some fwit who passes a variable length array into a function but doesn't use the upper and lower bounds because it's currently constant. The concepts that gave rise to BNF appear in many places, it's just more obvious at the extremes.

Editor's Picks