Regular expressions for everyone: The basics

Vincent Danen goes over some basic regular expressions. They are handy for developers and programmers, of course, and can even be employed for Google searching.

As a Linux user or administrator, you probably run into regular expressions fairly often, and if you don't, you're missing out. Many command-line programs such as grep, and scripts written in perl, Python, or PHP, make use of, or can make use of, regular expressions.

So what is a regular expression?

A regular expression (also known as a regex or regexp) is a way to match strings of text by characters, words, or patterns of characters. There are three primary types of regular expressions: POSIX regexps, perl-based regexps, and simple regexps. The basics are largely the same among them, however. Also, perl-based regexps are used in a number of programming languages besides perl: they are also used by (or adapted with slight variations in) Python, Ruby, Java, JavaScript, and PCRE, to name a few.

Here is a list of basic perl-based regular expressions and what they do:

  • . Matches any single character
  • * Matches 0 or more of the preceding character
  • + Matches 1 or more of the preceding character
  • ? Matches 0 or 1 occurrences of the preceding character (the preceding character is optional)
  • \d Matches a single digit ('[[:digit:]]' in POSIX)
  • \w Matches any word character (alphanumeric or underscore; POSIX has no named class for this, but '[[:alnum:]_]' is equivalent)
  • [ABC] Matches any single character from the class (i.e. 'A' or 'B' or 'C')
  • [ABC]+ Matches 1 or more characters from the class
  • $ Matches the end of the string
  • ^ Matches the beginning of the string
  • | Matches on the expression either before or after '|'
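
To see the list above in action, here is a minimal sketch using Python's re module, which follows the perl-based syntax. The sample strings are made up for illustration, and re.fullmatch is used so that each pattern has to account for the whole string.

```python
import re

# Each entry pairs a pattern from the list above with a sample string
# (the sample strings are invented for this illustration).
samples = [
    (r"c.t", "cat"),        # . matches any single character
    (r"ab*c", "ac"),        # * allows zero occurrences of "b"
    (r"ab+c", "abbc"),      # + requires at least one "b"
    (r"colou?r", "color"),  # ? makes the "u" optional
    (r"\d\d", "42"),        # \d matches one digit
    (r"\w+", "foo_1"),      # \w matches letters, digits, and underscore
    (r"[ABC]+", "CAB"),     # one or more characters from the class
    (r"^foo$", "foo"),      # ^ and $ anchor both ends of the string
    (r"foo|bar", "bar"),    # | accepts either alternative
]

# Map each pattern to whether it matched its entire sample string.
results = {pattern: bool(re.fullmatch(pattern, text)) for pattern, text in samples}

for pattern, matched in results.items():
    print(f"{pattern!r:12} -> {matched}")
```

Every sample here is chosen to match; swapping "abbc" for "ac" in the "+" row is a quick way to watch a match fail.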

The syntax isn't pretty, and regular expressions can become messy-looking and difficult to grasp. However, they are very powerful. Here are some examples of regular expressions in action:

foo|bar

The above matches either "foo" or "bar".

https?://(www.)?foo.com

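The above matches https://www.foo.com, https://foo.com, http://foo.com, or http://www.foo.com. Because an unescaped "." matches any character, it also matches strings such as http://fooXcom; write "\." to match only a literal dot.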
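
As a rough illustration in Python's re module, the sketch below compares the pattern as written with a variant whose dots are escaped. The test URLs, including the deliberately odd "http://fooXcom", are made up for this example.

```python
import re

loose = re.compile(r"https?://(www.)?foo.com")     # pattern as written above
strict = re.compile(r"https?://(www\.)?foo\.com")  # same pattern, dots escaped

# Hypothetical inputs: two real-looking URLs and one that only "." could accept.
urls = ["http://foo.com", "https://www.foo.com", "http://fooXcom"]

loose_hits = [u for u in urls if loose.fullmatch(u)]
strict_hits = [u for u in urls if strict.fullmatch(u)]

print(loose_hits)   # all three, because "." accepts the "X"
print(strict_hits)  # only the first two
```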

 [fb]?oo

The above matches "foo", "boo", or "oo".

 [fb]+oo

The above matches "foo", "boo", "fboo", "ffoo", and so on, but not "oo".
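
The difference between "?" and "+" in the last two examples can be checked with a short Python sketch. The word list is simply the set of strings mentioned above, and re.fullmatch is an assumption made here so that the whole word must match.

```python
import re

words = ["foo", "boo", "oo", "fboo", "ffoo"]

# [fb]?oo: the leading "f" or "b" is optional, but at most one is allowed.
optional = [w for w in words if re.fullmatch(r"[fb]?oo", w)]

# [fb]+oo: at least one "f" or "b" is required, and several may appear.
required = [w for w in words if re.fullmatch(r"[fb]+oo", w)]

print(optional)  # ['foo', 'boo', 'oo']
print(required)  # ['foo', 'boo', 'fboo', 'ffoo']
```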

Knowing regular expressions is, obviously, most useful when programming, but it can be very useful with command-line tools as well. grep, when called as egrep (or as grep -E), uses POSIX extended regular expressions, which elevates grep to a whole new level of convenience. The find command also supports using regular expressions to find files, and the awk and sed tools support them as well.

For system administrators, many programs, such as Apache, can use regular expressions in configuration files. The "*Match" directives in Apache (e.g., <DirectoryMatch> and RedirectMatch) support regular expressions, as do rewrite rules.

Knowing regular expressions isn't just for the "elite" or for sysadmins. They are useful for many people, even if you only learn basic regular expressions such as those noted above; even Google supports regular expressions in search queries! They make searching for things much easier, and can reduce multiple commands or directives to a single one, which can ultimately enhance productivity.

About

Vincent Danen works on the Red Hat Security Response Team and lives in Canada. He has been writing about and developing on Linux for over 10 years and is a veteran Mac user.

16 comments
spudman

I see an easy way to do this OR that (this|that) but what about this AND that? I also gather that it makes a difference which occurs first. IOW this AND that is different than that AND this.

Jaqui

Perl Compatible Regular Expressions is a language? :D I use --enable-pcre when compiling things like bash, grep, ... so that all the regex on the system is as capable as the perl regex.

chris_thamm

What do parentheses do? From your example, it appears that they delimit a string to be given as a unit to an operator. What is a word character? Are all operators except "|" postfix? Given that "|" is infix, how can I know when the parser is switching from postfix to infix? Can I specify my own delimiting, or is it always implied? How can I override default delimiting? I am unclear on how to use the beginning- or end-of-string operators. Does ?? match the null string and all 1- and 2-character strings? How do I match a question mark?

Justin James

Something missing (unless I overlooked it) is information about "greedy" vs. "lazy" matching. It is critical to know which one the regex engine is doing or else things won't work the way you expect them to. In a greedy system, the wildcards match through the final instance. In a lazy system, the wildcards match through the first instance. For example: regex = "a.*z" text = "abczdez" A greedy system would match the entire string ("abczdez") while a lazy system would match only through the first "z" ("abcz"). J.Ja
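
The example in this comment can be tried directly in Python's re module. In perl-style engines such as this one, the choice is made per quantifier rather than per system: quantifiers are greedy by default, and a trailing "?" makes one lazy.

```python
import re

text = "abczdez"

# Greedy: .* takes as much as it can, so the match runs to the last "z".
greedy = re.search(r"a.*z", text).group()

# Lazy: .*? takes as little as it can, so the match stops at the first "z".
lazy = re.search(r"a.*?z", text).group()

print(greedy)  # abczdez
print(lazy)    # abcz
```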

spudman

(this.that)|(that.this) of course if just the exact words are to be checked then more is needed. Regular expressions have to be practiced and there isn't enough opportunity in the MS world -- or am I overlooking something? Thanks,
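
One way to sketch "this AND that" in Python's re module is shown below. It assumes ".*" between the words so that any amount of text may separate them, and it adds a lookahead variant as an alternative technique, not something from the thread. The sample lines are made up.

```python
import re

# Alternation: spell out both orders, allowing any text in between.
either_order = re.compile(r"this.*that|that.*this")

# Lookahead variant: each (?=...) checks for a word without consuming text,
# so the order of the two words does not matter.
lookahead = re.compile(r"^(?=.*this)(?=.*that)")

lines = ["this and that", "that and this", "only this"]

alt_hits = [s for s in lines if either_order.search(s)]
look_hits = [s for s in lines if lookahead.search(s)]

print(alt_hits)   # ['this and that', 'that and this']
print(look_hits)  # ['this and that', 'that and this']
```

Neither pattern checks word boundaries, so "thistle" would count as "this"; wrapping each word in \b is one way to tighten that.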

Justin James

"What do parentheses do?" They perform grouping for references. For example: regex = "(a.c)def" text = "1234abcdefgabcwqa" In a replacement operation (or a back reference within the same regex), I can refer to group #1 and get the text "abc". For example: regex = "\w*?@(.*)" text = "emailaddress@example.com" replacement = "newaddress@" + regex.match(1) would give me "newaddress@example.com" as my replacement text.

"What is a word character?" A character that is allowed in a word, usually a-z, A-Z, 0-9 and a couple of special characters. Check your documentation for specifics.

"Are all operators except "|" postfix?" There are no operators in regex match patterns per se. "|" simply means "or". It means "match this or match that".

"I am unclear on how to use the beginning- or end-of-string operators." regex = "^abc" text1 = "123abc" text2 = "abc123" text1 will match, text2 will not.

"Does ?? match the null string and all 1- and 2-character strings?" It will match nullstring and any 1 character string. Its behavior on a 2 character string can be tricky. In some systems it may mean "match any single character, then match any single character" which will match any two character string. On other systems it will mean "match any single character zero or one time", performing a lazy evaluation of the "?". It's the same reason why with ".*?", the "?" forces the regex to lazy evaluation.

"How do I match a question mark?" By escaping it with a backslash ("How tall are you\?").

That being said, your questions are VERY basic. I get the feeling that you haven't tried very hard to get answers to them from the documentation. You really should do that; few folks are going to be willing to nicely answer these levels of questions like I did, and most will tell you to read your docs. J.Ja
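
The grouping and escaping answers above can be sketched in Python's re module, where group(1) is the first parenthesized group and group(0) is the whole match. The variable names are made up for illustration.

```python
import re

# Capturing group: the parentheses remember the text after the "@".
m = re.search(r"\w*?@(.*)", "emailaddress@example.com")
whole_match = m.group(0)   # everything the pattern matched
domain = m.group(1)        # only the text captured by the first group
new_address = "newaddress@" + domain

# A backslash makes "?" a literal character instead of a quantifier.
has_question = bool(re.search(r"How tall are you\?", "How tall are you?"))

print(whole_match)   # emailaddress@example.com
print(new_address)   # newaddress@example.com
print(has_question)  # True
```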

john_heidelberg

Check out the freeware program Expresso. You can play with it to learn much of the ins and outs of regular expressions. I use it to create the find/replace syntax in Notepad++ for complex replacements.

Justin James

... but people do not use them nearly as much because they are so inconvenient. In *Nix systems, there are a zillion command line tools that use them. Perl puts them up front (they are an operator just like "=" and "+" are) so they quickly become your first choice for everything. In .NET, though... well, it takes 4 lines of code to do the same thing, and there really are few tools that focus on them, other than the occasional search/replace command in a text editor. J.Ja

tr

regex = "^abc" text1 = "123abc" test2 = "abc123" text1 will match, text2 will not.

chris_thamm

I very much appreciate your taking the time to answer. I haven't done much programming (little more than scripting, really) since the 386 days, choosing instead to focus on networking and building a small business servicing PCs for small businesses. It is now my third week looking at Linux for the first time. Unfortunately, until recently, I had never had a need to look at anything other than Microsoft OSes, and have never used regexps except in their (extremely) limited form on the command line. I have begun reading the reams of available online documentation, and the one thing that stands out to me is that every flavour of Linux (UNIX) does things a little differently (as opposed to adhering to a set of OS-wide standards). It is not surprising to me that Microsoft has dominated the end-user market. (Few of the people whose computers I service would understand what it means to mount a volume, for example.) Still, I very much like what I've seen so far, and can imagine why people swear by such a lean and powerful OS. It has been very eye-opening for me to look at computers through different glasses than the ones Microsoft would have you wear. So ^ and $ are prefix, | is infix, * and ? are postfix. I'll have to read more on the grouping that the parentheses do; however, the examples that you provided are clear to me. It is a humbling experience being the newbie again. I'm doing a lot of reading, asking a lot of (very basic) questions, and appreciate everyone who takes the time to answer them. Hopefully others will also benefit from them.

zenoscope

I've been learning using an online regex program called regexr (http://gskinner.com/RegExr/). It lets you enter your data and then add the regex to check it, which it does live, so you can see what different things are doing.

Justin James

It's weird, the need to use them is there same as anywhere else, it's just that they are not exposed by interfaces nearly as much. I really do not know why so few apps in Windows compared to *Nix allow the user to use a regex where appropriate. I would love them for searching through Outlook, for example, but they are available in Word... J.Ja

spudman

Justin, your reply totally backs up my statement about not much opportunity in MS World. :)

Justin James

It should read that text1 will NOT match and text2 will, sorry! J.Ja
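
A quick check of the anchor example in Python's re module, with "abc$" added here for comparison:

```python
import re

text1 = "123abc"
text2 = "abc123"

# ^abc only matches when "abc" sits at the start of the string.
starts1 = bool(re.search(r"^abc", text1))  # False
starts2 = bool(re.search(r"^abc", text2))  # True

# abc$ only matches when "abc" sits at the end of the string.
ends1 = bool(re.search(r"abc$", text1))    # True
ends2 = bool(re.search(r"abc$", text2))    # False

print(starts1, starts2, ends1, ends2)
```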

Jaqui

you got it right. if you used a postfix operator it would be reversed. if you used an infix operator both would match

Jaqui

Actually, the core tools of all GNU/Linux distros and Unix systems are the same. It's a standard set of names, even if the program itself is different. It's a standard method of defining arguments [ switches / options ] to the commands. The differences are only when you get above the core, or base system: which software management backend and frontend, which of the 12 GUIs, etc. But the stuff needed for administering the system below that level is all the same. perl is perl, bash is bash, sh is sh, korn is korn, grep is grep, sed is sed, and it doesn't matter if you use a BSD sed or GNU sed; calling it and passing it arguments is the same, just like the rest of them. I would strongly recommend reading the Linux From Scratch book, even if you don't build a system from sources. It will help you see where the line is between common to all and chosen variable options.