Introducing regular expressions

Note: This article is a slightly modified version of Chapter 6 from Volume 2 of my Linux self-study course, “Using and Administering Linux: Zero to SysAdmin,” due out from Apress in late 2019.

We have all used file globbing with wildcard characters like * and ? as a means to select specific files or lines of data from a data stream. These tools are powerful and I use them many times a day. Yet there are things that cannot be done with wildcards.

Regular expressions (REGEXes or REs) provide us with more complex and flexible pattern matching capabilities. Just as certain characters take on special meaning when using file globbing, REs also have special characters. There are two main types of regular expressions: basic regular expressions (BREs) and extended regular expressions (EREs).

The first thing we need is some definitions. There are many definitions for the term “regular expressions,” but many are dry and uninformative. Here are mine.

  • Regular Expressions are strings of literal and metacharacters that can be used as patterns by various Linux utilities to match strings of ASCII plain text data in a data stream. When a match occurs it can be used to extract or eliminate a line of data from the stream or to modify the matched string in some way.
  • Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE) are not significantly different in terms of functionality. The primary difference is in the syntax used and how metacharacters are specified. In basic regular expressions the metacharacters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ lose their special meaning; instead, it is necessary to use the backslashed versions ‘\?’, ‘\+’, ‘\{’, ‘\|’, ‘\(’, and ‘\)’. The ERE syntax is believed by many to be easier to use.

Regular expressions (REs) take the concept of using metacharacters to match patterns in data streams much further than file globbing and give us even more control over the items we select from a data stream. REs are used by various tools to parse a data stream to match patterns of characters in order to perform some transformation on the data.

Regular expressions have a reputation for being obscure and arcane incantations that only those with special wizardly SysAdmin powers use. Figure 1 would seem to confirm this. The command pipeline appears to be an intractable sequence of meaningless gibberish to anyone without the knowledge of regex. It certainly seemed that way to me the first time I encountered something similar early in my career. As you will see, it is actually relatively simple once it is all explained.

cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/\]//g" -e "s/)//g" | awk '{print $1" "$2" <"$3">"}' > addresses.txt

Figure 1: A real world sample of the use of regular expressions. It is actually a single line that I used to transform a file that was sent to me into a usable form.

We can only begin to touch upon all of the possibilities opened to us by regular expressions in a single article. There are entire books devoted exclusively to regular expressions so we will explore the basics here – just enough to get started with tasks common to SysAdmins.

Getting started

Now we need a real world example to use as a learning tool. Here is one I encountered several years ago.

The mailing list

This example highlights the power and flexibility of the Linux command line, especially regular expressions, and their ability to automate common tasks. I have administered several listservs during my career and still do. People send me lists of email addresses to add to those lists. In more than one case, I have received a list of names and email addresses in a Word document that were to be added to one of the lists.

The list itself was not really very long but it was very inconsistent in its formatting. An abbreviated version of that list, with name and domain changes, is shown in Figure 2. The original list has extra lines, characters like brackets and parentheses that need to be deleted, whitespace such as spaces and tabs, and some empty lines. The format required to add these emails to the list is, first last <email@example.com>. Our task is to transform this list into a format usable by the mailing list software.

Team 1 Apr 3
Leader Virginia Jones vjones88@example.com
Frank Brown FBrown398@example.com
Cindy Williams cinwill@example.com
Marge smith msmith21@example.com
[Fred Mack] edd@example.com

Team 2 March 14
leader Alice Wonder Wonder1@example.com
John broth bros34@example.com
Ray Clarkson Ray.Clarks@example.com
Kim West kimwest@example.com
[JoAnne Blank] jblank@example.com

Team 3 Apr 1
Leader Steve Jones sjones23876@example.com
Bullwinkle Moose bmoose@example.com
Rocket Squirrel RJSquirrel@example.com
Julie Lisbon julielisbon234@example.com
[Mary Lastware) mary@example.com

Figure 2: A partial, modified listing of the document of email addresses to add to a listserv.

It was obvious that I needed to manipulate the data in order to mangle it into an acceptable format for inputting to the list. It is possible to use a text editor or a word processor such as LibreOffice Writer to make the necessary changes to this small file. However, people send me files like this quite often so it becomes a chore to use a word processor to make these changes. Despite the fact that Writer has a good search and replace function, each character or string must be replaced singly and there is no way to save previous searches. Writer does have a very powerful macro feature, but I am not familiar with either of its two languages, LibreOffice Basic or Python. I do know Bash shell programming.

Experiment 1

I did what comes naturally to a SysAdmin – I automated the task. The first thing I did was to copy the address data to a text file so I could work on it using command line tools. After a few minutes of work, I developed the Bash command line program in Figure 1 that produced the desired output as the file, addresses.txt. I used my normal approach to writing command line programs like this by building up the pipeline one command at a time.

Let’s break this pipeline down into its component parts to see how it works and fits together. All of the experiments in this article should be performed as a non-privileged user. I also did this on a VM that I created for testing, studentvm1.

First, download the sample file, Experiment_6-1.txt, using the wget command shown below. Let’s do all of this work in a new directory, so we will create that too.

[student@studentvm1 ~]$ mkdir testing ; cd testing
[student@studentvm1 testing]$ wget http://www.linux-databook.info/downloads/Experiment_6-1.txt

Now we just take a look at the file and see what we need to do.

[student@studentvm1 testing]$ cat Experiment_6-1.txt 
 Team 1  Apr 3  
 Leader  Virginia Jones  vjones88@example.com
 Frank Brown  FBrown398@example.com
 Cindy Williams  cinwill@example.com
 Marge smith   msmith21@example.com  
  [Fred Mack]   edd@example.com   
 Team 2  March 14
 leader  Alice Wonder  Wonder1@example.com
 John broth  bros34@example.com   
 Ray Clarkson  Ray.Clarks@example.com
 Kim West    kimwest@example.com  
 [JoAnne Blank]  jblank@example.com
 Team 3  Apr 1  
 Leader  Steve Jones  sjones23876@example.com
 Bullwinkle Moose bmoose@example.com
 Rocket Squirrel RJSquirrel@example.com   
 Julie Lisbon  julielisbon234@example.com
 [Mary Lastware) mary@example.com
 [student@studentvm1 testing]$ 

The first things I see that can be done are a couple of easy ones. Since the Team names and dates are on lines by themselves, we can use the following to remove those lines that contain the word “Team”. I place the end-of-sentence period outside the quotes to ensure that only the intended string is inside them.

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team

I won’t reproduce the results of each stage of building this Bash program but you should be able to see the changes in the data stream as it shows up on STDOUT, the terminal session. We won’t save it in a file until the end.

In this first step in transforming the data stream into one that is usable, we use the grep command with a simple literal pattern, “Team”. Literals are the most basic type of pattern we can use as a regular expression, because there is only a single possible match in the data stream being searched: the string “Team”.

We need to discard empty lines so we can use another grep statement to eliminate them. I find that enclosing the regular expression for the second grep command in quotes ensures that it gets interpreted properly.

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$"
Leader  Virginia Jones  vjones88@example.com
Frank Brown  FBrown398@example.com
Cindy Williams  cinwill@example.com
Marge smith   msmith21@example.com  
 [Fred Mack]   edd@example.com   
leader  Alice Wonder  Wonder1@example.com
John broth  bros34@example.com   
Ray Clarkson  Ray.Clarks@example.com
Kim West    kimwest@example.com  
[JoAnne Blank]  jblank@example.com
Leader  Steve Jones  sjones23876@example.com
Bullwinkle Moose bmoose@example.com
Rocket Squirrel RJSquirrel@example.com   
Julie Lisbon  julielisbon234@example.com
[Mary Lastware) mary@example.com
[student@studentvm1 testing]$ 

The expression “^\s*$” illustrates anchors and using the backslash (\) as an escape character to change the meaning of a literal, “s” in this case, to a metacharacter that means any whitespace character, such as spaces, tabs, or other characters that are unprintable. We cannot see these characters in the file, but it does contain some of them. The asterisk, aka splat (*), specifies that we are to match zero or more of the whitespace characters. This would match multiple tabs or multiple spaces or any combination of those in an otherwise empty line.
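A quick sketch of how this pattern behaves on a tiny, throwaway data stream (the file name here is arbitrary):

```shell
# Three lines: real text, a whitespace-only line, and a truly empty line.
printf 'hello\n \t \n\n' > /tmp/ws-demo.txt

# "^\s*$" matches both of the blank-ish lines.
grep -c "^\s*$" /tmp/ws-demo.txt     # counts 2 matching lines

# So grep -v keeps only the line with real content.
grep -v "^\s*$" /tmp/ws-demo.txt     # prints only "hello"
```

Note that \s is a GNU grep extension; the strictly POSIX equivalent would be [[:space:]].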

I configured my Vim editor to display whitespace using visible characters. Do this by adding the following line to your own ~/.vimrc or to the global /etc/vimrc configuration file. Then start – or restart – Vim.

set listchars=eol:$,nbsp:_,tab:<->,trail:~,extends:>,space:+

I have found a lot of bad, incomplete, and contradictory information on the Internet in my searches for how to do this. The built-in Vim help has the best information, and the configuration line above, which I derived from it, works for me.

The result, before any operation on the file, is shown in Figure 3. Regular spaces are shown as +; tabs are shown as <, <>, or <->, and fill the length of the space that the tab covers. The end of line (EOL) character is shown as $.

Team+1<>Apr+3~$
Leader++Virginia+Jones++vjones88@example.com<-->$
Frank+Brown++FBrown398@example.com<---->$
Cindy+Williams++cinwill@example.com<--->$
Marge+smith+++msmith21@example.com~$
+[Fred+Mack]+++edd@example.com<>$
$
Team+2<>March+14$
leader++Alice+Wonder++Wonder1@example.com<----->$
John+broth++bros34@example.com<>$
Ray+Clarkson++Ray.Clarks@example.com<-->$
Kim+West++++kimwest@example.com>$
[JoAnne+Blank]++jblank@example.com<---->$
$
Team+3<>Apr+1~$
Leader++Steve+Jones++sjones23876@example.com<-->$
Bullwinkle+Moose+bmoose@example.com<--->$
Rocket+Squirrel+RJSquirrel@example.com<>$
Julie+Lisbon++julielisbon234@example.com<------>$
[Mary+Lastware)+mary@example.com$

Figure 3: The Experiment_6-1.txt file showing all of the embedded whitespace.

You can see that there are a lot of whitespace characters that need to be removed from our file. We also need to get rid of the word “leader” which appears twice and is capitalized once. Let’s get rid of “leader” first. This time we will use sed (stream editor) to perform this task by substituting a new string – or a null string in our case – for the pattern it matches. Adding sed -e "s/[Ll]eader//" to the pipeline does this.

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//"

In this sed command, -e means that the quote-enclosed expression is a script that produces a desired result. In the expression, the s means that this is a substitution. The basic form of a substitution is s/regex/replacement string/. So /[Ll]eader/ is our search string. The set [Ll] matches L or l, so [Ll]eader matches leader or Leader. In this case the replacement string is null, because it looks like this – // – a double forward slash with no characters or whitespace between the two slashes.
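To see the substitution in isolation, here is a minimal sketch on a single made-up line:

```shell
# The set [Ll] makes the first letter of the match case-tolerant,
# and the null replacement // simply deletes the matched word.
echo "Leader Virginia" | sed -e "s/[Ll]eader//"
# → " Virginia" (the leftover leading space is stripped later by awk)

echo "leader Alice" | sed -e "s/[Ll]eader//"
# → " Alice"
```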

Now let’s get rid of some of the extraneous characters like []() that will not be needed.

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g"

We have added four new expressions to the sed statement. Each one removes a single character. The first of these additional expressions is a bit different. Because the left square brace [ character can mark the beginning of a set, we need to escape it to ensure that sed interprets it correctly as a regular character and not a special one.
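A minimal sketch of the escaping at work, on a fabricated line:

```shell
# \[ is escaped so sed treats it as a literal character; the bare ]
# needs no escape because no opening [ precedes it in the pattern.
echo "a[b]c" | sed -e "s/\[//g" -e "s/]//g"
# → abc
```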

We could use sed to remove the leading spaces from some of the lines, but the awk command can do that as well as reorder the fields if necessary, and add the <> characters around the email address.

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'

The awk utility is actually a very powerful programming language that can accept data streams on its STDIN. This makes it extremely useful in command line programs and scripts.

The awk utility works on data fields, and the default field separator is whitespace, in any amount. The data stream we have created so far has three fields separated by whitespace: first, last, and email. This little program, awk '{print $1" "$2" <"$3">"}', takes each of the three fields, $1, $2, and $3, and extracts them without leading or trailing whitespace. It then prints them in sequence, adding a single space between each as well as the <> characters needed to enclose the email address.
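A quick sketch of awk’s whitespace handling, using a fabricated record with messy spacing:

```shell
# awk splits on runs of whitespace, so extra spaces and tabs vanish
# and leading/trailing whitespace is ignored.
echo "  Frank   Brown    FBrown398@example.com  " | \
    awk '{print $1" "$2" <"$3">"}'
# → Frank Brown <FBrown398@example.com>
```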

The last step here would be to redirect the output data stream to a file, but that is trivial, so I leave it to you to perform. It is not really necessary that you do so.

I saved the Bash program in an executable file and now I can run this program any time I receive a new list. Some of those lists are fairly short, as is the one in Figure 3, but others have been quite long, sometimes containing up to several hundred addresses and many lines of “stuff” that do not contain addresses to be added to the list.

Experiment 2

Now that we have a working solution, developed as a step-by-step exploration of the tools we are using, we can do quite a bit more to perform the same task with a more compact and optimized command line program.

In this experiment we explore ways in which we can shorten and simplify our command line program. The final result of Experiment 1 was the following CLI program.

cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'

Let’s start near the beginning and combine the two grep statements. The result is shorter and more succinct. It also means faster execution because grep only needs to parse the data stream once.

Tip: When the STDOUT from grep is not piped through another utility, and when using a terminal emulator that supports color, the regex matches are highlighted in the output data stream.

In the revised command, grep -vE "Team|^\s*$", we add the -E option, which specifies extended regex. According to the grep man page, “In GNU grep there is no difference in available functionality between basic and extended syntaxes.” This statement is not strictly true, because our new combined expression fails without the -E option. Run the following to see the results.

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -vE "Team|^\s*$"

Try it without the -E option. The grep tool can also read data from a file, so we can eliminate the cat command.

[student@studentvm1 testing]$ grep -vE "Team|^\s*$" Experiment_6-1.txt
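The functional difference between BRE and ERE alternation can also be seen with a quick, throwaway sketch:

```shell
# ERE: the unescaped | is the alternation operator.
echo "Team" | grep -E "Team|^\s*$"        # matches, prints "Team"

# BRE: the same | is a literal character, so nothing matches here...
echo "Team" | grep "Team|^\s*$" || echo "no BRE match"

# ...unless we escape it as \| (the BRE form of alternation).
echo "Team" | grep "Team\|^\s*$"          # matches again
```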

This leaves us with the following, somewhat simplified CLI program.

grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'

We can also simplify the sed command and we will do so after we learn more about regular expressions.

It is important to realize that this solution is not the only one. There are different methods in Bash for producing the same output, and other languages like Python and Perl can also be used. And, of course, there are always LibreOffice Writer macros. But I can always count on Bash as part of any Linux distribution. I can perform these tasks using Bash programs on any Linux computer, even one without a GUI desktop or one that does not have LibreOffice installed.

grep

Because GNU grep is one of the tools I use the most, and it provides a more or less standardized implementation of regular expressions, I will use that set of expressions as the basis for the next part of this article. We will then look again at sed, another tool that uses regular expressions. There are many details that are important to understanding some of the complexity of regex implementations and how they work.

Data flow

All implementations of regular expressions are line-based. A pattern created by a combination of one or more expressions is compared against each line of a data stream. When a match is made, an action is taken on that line as prescribed by the tool being used. For example, when a pattern match occurs with grep, the usual action is to pass that line on to STDOUT, and lines that do not match the pattern are discarded. As we have seen, the -v option reverses those actions so that the lines with matches are discarded.

Each line of the data stream is evaluated on its own, and the results of matching the expressions in the pattern with the data from previous lines are not carried over. It might be helpful to think of each line of a data stream as a record, where the tools that use regexes process one record at a time. When a match is made, an action defined by the tool in use is taken on the line that contains the matching string.
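A tiny sketch of this record-at-a-time behavior, on made-up input:

```shell
# Each line is tested independently; -v inverts the per-line decision,
# so any line containing "alpha" is discarded on its own merits.
printf 'alpha\nbeta\nalpha beta\n' | grep -v alpha
# → beta
```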

regex building blocks

Figure 4 contains a list of the basic building block expressions and metacharacters implemented by the GNU grep command and their descriptions. When used in a pattern, each of these expressions or metacharacters matches a single character in the data stream being parsed.

Expression Description
Literals (A-Z, a-z, 0-9) All alphanumeric and some punctuation characters are considered literals. Thus the letter “a” in a regex will always match the letter “a” in the data stream being parsed. There is no ambiguity for these characters. Each literal character matches one and only one character.
. (dot) The dot (.) metacharacter is the most basic form of expression. It matches any single character in the position it is encountered in a pattern. So the pattern b.g would match big, bigger, bag, baguette, and bog, but not dog, blog, hug, lag, gag, or leg, etc.
[list of characters] (bracket expression) GNU grep calls this a bracket expression, and it is the same as a set for the Bash shell. The brackets enclose a list of characters to match for a single character location in the pattern. [abcdABCD] matches the letters a, b, c, or d in either upper or lower case. [a-dA-D] specifies a range of characters that creates the same match. [a-zA-Z] matches the alphabet in upper and lower case.
[:class name:] (character class) This is a POSIX attempt at regex standardization. The class names are supposed to be obvious. For example, the [:alnum:] class matches all alphanumeric characters. Other classes are [:digit:], which matches any one digit 0-9, [:alpha:], [:space:], and so on. Note that there may be issues due to differences in the sorting sequences in different locales. Read the grep man page for details.
^ and $ (anchors) These two metacharacters match the beginning and end of a line, respectively. They are said to anchor the rest of the pattern to either the beginning or end of a line. The expression ^b.g would only match big, bigger, bag, etc., as shown above, if they occur at the beginning of the line being parsed. The pattern b.g$ would match big or bag only if they occur at the end of the line, but not bigger.

Figure 4: These expressions and metacharacters are implemented by grep and most other regex implementations.
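These building blocks can be tried out directly on made-up strings. In this sketch, -o (a GNU grep option) prints only the matched text rather than the whole line:

```shell
# The dot matches any single character: big, bag, and bog all match b.g.
echo "big bag bog dog" | grep -o "b.g"

# A bracket expression limits that position to a listed set of characters.
echo "big bag bog dog" | grep -o "b[ia]g"
# → big and bag only

# Anchors tie the pattern to the start of the line.
printf 'bag\nhandbag\n' | grep "^bag"     # prints only "bag"
```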

Let’s explore these building blocks before continuing on with some of the modifiers. The text file we will use for Experiment 3 is from a lab project I created for a Linux class I wrote and used to teach. It was originally in a LibreOffice Writer ODT file but I saved it to an ASCII text file. Most of the formatting of things like tables was removed but the result is a long ASCII text file that we can use for this series of experiments.

Experiment 3

We must download the sample file, Experiment_6-3.txt. If the directory ~/testing is not the PWD, make it so.

[student@studentvm1 testing]$ wget http://www.linux-databook.info/downloads/Experiment_6-3.txt

To begin, just use the less command to look at and explore the Experiment_6-3.txt file for a few minutes so you have an idea of its content.

Now we will use some simple expressions in grep to extract lines from the input data stream. The Table of Contents (TOC) contains a list of projects and their respective page numbers in the PDF document. Let’s extract the TOC starting with lines ending in two digits.

[student@studentvm1 testing]$ grep "[0-9][0-9]$" Experiment_6-3.txt

That is not really what we want. It displays all lines that end in two digits and misses TOC entries with only one digit. We will look at how to deal with an expression for one or more digits in a later experiment. Looking at the whole file in less, we could do something like this.

[student@studentvm1 testing]$ grep "^Lab Project" Experiment_6-3.txt | grep "[0-9]$"

This is much closer to what we want but it is not quite there. We get some lines from later in the document that also match these expressions. If you study the extra lines and look at those in the complete document you can see why they match while not being part of the TOC. This also misses TOC entries that do not start with “Lab Project.” Sometimes this is the best you can do, but it does give a better look at the TOC than we had before. We will look at how to combine these two grep instances into a single one in a later experiment in this article.

Now let’s modify this a bit and use the POSIX character class. Notice the double square brackets around the POSIX expression; a single pair of brackets generates an error message.

[student@studentvm1 testing]$ grep "^Lab Project" Experiment_6-3.txt | grep "[[:digit:]]$"

This gives the same results as the previous attempt. Let’s look for something different.

[student@studentvm1 testing]$ grep systemd Experiment_6-3.txt

This lists all occurrences of “systemd” in the file. Try using the -i option to ensure that you get all instances, including those that start with uppercase. Or you could just change the literal expression to Systemd. Count the number of lines with the string systemd contained in them. I always use -i to ensure that all instances of the search expression are found regardless of case.

[student@studentvm1 testing]$ grep -i systemd Experiment_6-3.txt | wc
 20     478    3098

As you can see, I have 20 lines, and you should have the same number.

Here is an example of matching a metacharacter, the left bracket ([). First let’s try it without doing anything special.

[student@studentvm1 testing]$ grep -i "[" Experiment_6-3.txt 
grep: Invalid regular expression

This occurs because [ is interpreted as a metacharacter. We need to “escape” this character with a backslash so that it is interpreted as a literal character and not as a metacharacter.

[student@studentvm1 testing]$ grep -i "\[" Experiment_6-3.txt

Most metacharacters lose their special meaning when used inside bracket expressions. To include a literal ] place it first in the list. To include a literal ^ place it anywhere but first. To include a literal [ place it last.
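A sketch of those placement rules, using GNU grep’s -o option to print just the matched characters from a fabricated string:

```shell
# The set "[]^[]" contains ] (placed first), ^ (not first), and [ (last),
# so all three are matched as literal characters.
echo "a]b^c[d" | grep -o "[]^[]"
# prints ], ^, and [ on separate lines
```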

Repetition

Regular expressions may be modified using some operators that allow specification of zero, one, or more repetitions of a character or expression. These repetition operators are placed immediately following the literal character or metacharacter used in the pattern.

Operator Description
? In regexes the ? means zero or one occurrence of the preceding character. So, for example, "drives?" matches drive and drives but not driver. Using "drive" for the expression would match drive, drives, and driver. This is a bit different from the behavior of ? in a glob.
* The character preceding the * will be matched zero or more times without limit. In this example, "drives*" matches drive, drives, and drivesss but not driver. Again, this is a bit different from the behavior of * in a glob.
+ The character preceding the + will be matched one or more times. The character must exist in the line at least once for a match to occur. As one example, "drives+" matches drives and drivesss but not drive or driver.
{n} This operator matches the preceding character exactly n times. The expression "drives{2}" matches drivess but not drive or drives. However, because drivesssss contains the string drivess, a match occurs on that string, so grep would match the line.
{n,} This operator matches the preceding character n or more times. The expression "drives{2,}" matches drivess, drivesss, and any longer run of trailing "s" characters, but not drive or drives.
{,m} This operator matches the preceding character no more than m times. The expression "drives{,2}" matches drive, drives, and drivess, but not drivesss. Once again, because drivesssss contains the string drivess, a match occurs.
{n,m} This operator matches the preceding character at least n times but no more than m times. The expression "drives{1,3}" matches drives, drivess, and drivesss, but not drivessss. Once again, because drivesssss contains a matching string, a match occurs.

Figure 5: Meta-character modifiers that specify repetition.
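Anchoring the pattern with ^ and $ sidesteps the substring caveat noted in the table, which makes the repetition operators easier to observe. A short sketch on made-up words:

```shell
# With ^ and $ the whole line must match, so substring effects disappear.
for word in drive drives drivess drivesss; do
    if echo "$word" | grep -Eq "^drives{1,2}$"; then
        echo "$word: match"
    else
        echo "$word: no match"
    fi
done
# drives and drivess match; drive and drivesss do not
```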

Experiment 4

Run each of the following commands and examine the results carefully so that you understand what is happening.

[student@studentvm1 testing]$ grep -E "files?" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives*" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives+" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{2}" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{2,}" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{,2}" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{2,3}" Experiment_6-3.txt

Be sure to experiment with these modifiers on other text in the sample file.

Other metacharacters

There are still some interesting and important modifiers that we need to explore.

Modifier Description
\< This special expression matches the empty string at the beginning of a word. The expression "\<fun" would match "fun" and "function" but not "refund".
\> This special expression matches the empty string at the end of a word, so the match may be followed by a space or by punctuation that typically appears at the end of a word. So "environment\>" matches "environment", "environment,", and "environment." but not "environments" or "environmental".
^ In a character class expression, this operator negates the list of characters. Thus, while the class [a-c] matches a, b, or c in that position of the pattern, the class [^a-c] matches anything but a, b, or c.
| When used in a regex, the | metacharacter is a logical “or” operator. It is officially called the “infix” or “alternation” operator. We have already encountered this in Experiment 2, where we saw that the regex "Team|^\s*$" means, “a line with ‘Team’ or ( | ) an empty line, including one that has zero, one, or more whitespace characters such as spaces, tabs, and other unprintable characters.”
( and ) The parentheses ( and ) allow us to ensure a specific sequence of pattern comparison like might be used for logical comparisons in a programming language.

Figure 6: Meta-character modifiers.
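The negated class and the word-boundary expressions can be sketched on throwaway input:

```shell
# [^a-c] matches any single character except a, b, or c.
echo "abcd" | grep -o "[^a-c]"
# → d

# Word boundaries: \<tar\> matches "tar" only as a whole word,
# so "start" and "target" are rejected.
printf 'tar\nstart\ntarget\n' | grep "\<tar\>"
# → tar
```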

We now have a way to specify word boundaries with the \< and \> metacharacters. This means we can now be even more explicit with our patterns. We can also use some logic in more complex patterns.

Experiment 5

Start with a couple of simple patterns. This first one selects all instances of drives but not drive, drivess, or longer runs of trailing “s” characters.

[student@studentvm1 testing]$ grep -Ei "\<drives\>" Experiment_6-3.txt

Now let’s build up a search pattern to locate references to tar, the tape archive command. The first two iterations display more than just tar-related lines.

[student@studentvm1 testing]$ grep -Ei "tar" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "\<tar" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ein "\<tar\>" Experiment_6-3.txt

The -n option in the last command above displays the line numbers of each line in which a match occurred. This can assist in locating specific instances of the search pattern.

Tip: Matching lines of data can extend beyond a single screen, especially when searching a large file. You can pipe the resulting data stream through the less utility and then use the less search facility which implements regexes, too, to highlight the occurrences of matches to the search pattern. The search argument in less is: \<tar\>

This next pattern searches for “shell script” or “shell program” or “shell variable” or “shell environment” or “shell prompt” in our test document. The parentheses alter the logical order in which the pattern comparisons are resolved.

[student@studentvm1 testing]$ grep -Eni "\<shell (script|program|variable|environment|prompt)" Experiment_6-3.txt

Remove the parentheses from the preceding command and run it again to see the difference.

Although we have now explored the basic building blocks of regular expressions in grep, there are an infinite variety of ways in which they can be combined to create complex yet elegant search patterns. However, grep is a search tool and does not provide any direct capability to edit or modify the contents of a line of text in the data stream when a match is made.

sed

The sed utility not only allows searching for text that matches a regex pattern, it can also modify, delete, or replace the matched text. I use sed at the command line and in Bash shell scripts as a fast and easy way to locate text and alter it in some way. The name sed stands for stream editor because it operates on data streams in the same manner as other tools that can transform a data stream. Most of those changes simply involve selecting specific lines from the data stream and passing them on to another transformer program.

We have already seen sed in action but now, with an understanding of regular expressions, we can better analyze and understand our earlier usage.

Experiment 6

In Experiment 2 we simplified the CLI program we used to transform a list of names and email addresses into a form that can be used as input to a listserv. That CLI program looks like this after some simplification.

grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'

It is possible to combine four of the five expressions used in the sed command into a single expression. The sed command now has two expressions instead of five.

sed -e "s/[Ll]eader//" -e "s/[]()\[]//g"

This compactness makes the more complex expression a bit more difficult to understand. Note that no matter how many expressions a single sed command contains, the data stream is only parsed once to match all of the expressions.
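A quick sanity check on a single illustrative line (the name and address are made up) shows that the combined set produces the same result as the four separate expressions it replaces.

```shell
# The combined bracket set removes all four characters in one pass.
echo "Jones [Jim] (jim.jones@example.com)" | sed -e "s/[]()\[]//g"

# It is equivalent to the four separate expressions from the original program.
echo "Jones [Jim] (jim.jones@example.com)" | \
    sed -e "s/\[//g" -e "s/]//g" -e "s/(//g" -e "s/)//g"
```

Both commands print the line with the square braces and parentheses stripped out.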

Let's examine the revised expression, -e "s/[]()\[]//g", more closely. By default, sed interprets a [ character as the beginning of a set and the last ] character as the end of that set. A ] that immediately follows the opening [ is treated as a literal member of the set rather than as its end, and the intervening ( and ) characters are not interpreted as metacharacters either. Since we also need to match [ as a literal character in order to remove it from the data stream, and sed normally interprets that as a metacharacter, we need to escape it so that it is interpreted as a literal [. So the only metacharacters in this expression are the [ that opens the set, the ] that closes it, and the backslash that escapes the literal [. Let's plug this into the CLI script and test it.

[student@studentvm1 testing]$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[]()\[]//g"

I know that you are asking, "Why not place the \[ after the [ that opens the set and before the ] character?" Try it as I did.

[student@studentvm1 testing]$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[\[]()]//g"

I thought that should work, but it does not. Little unexpected results like this make it clear that we must be careful and test each regex to ensure that it actually does what we intend. After some experimentation of my own, I discovered that the escaped left square bracket \[ works fine in every position of the expression except the first one. This behavior is noted in the grep man page, which I probably should have read first. However, I find that experimentation reinforces what I read, and I usually discover more interesting things than the ones I was looking for.
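For what it is worth, POSIX bracket expressions also offer a way to sidestep the escaping question entirely: place the literal ] immediately after the opening [, and leave the [ unescaped anywhere later in the set, where it is treated as a literal. A small sketch on a made-up string:

```shell
# ] is literal when it comes first in the set; [ needs no escape
# when it is not in the first position.
echo "a[b]c(d)e" | sed -e "s/[][()]//g"
```

This variant removes the same four characters as the escaped form, so which one to use comes down to taste and readability.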

Adding back the last component, the awk statement, gives us our optimized program, and the results are exactly what we want.

[student@studentvm1 testing]$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[]()\[]//g" | awk '{print $1" "$2" <"$3">"}'
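To see what the awk stage contributes on its own, here is its effect on one cleaned-up line; the name and address are illustrative.

```shell
# awk splits each line into whitespace-separated fields $1, $2, $3
# and reassembles them with the email address wrapped in angle brackets.
echo "Jones Jim jim.jones@example.com" | awk '{print $1" "$2" <"$3">"}'
```

This is the standard listserv-style "First Last <address>" format the whole pipeline is working toward.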

Other tools that implement regular expressions

Many Linux tools implement regular expressions. Most of those implementations are very similar to those of awk, grep, and sed, so the differences should be easy to learn. Although we have not looked in detail at awk, it is a powerful text processing language that also implements regexes.

Most of the more advanced text editors use regexes. Vim, gVim, Kate, and GNU Emacs are no exceptions. The less utility implements regexes as does the search and replace facility of LibreOffice Writer.

Programming languages like Perl, awk, and Python also contain implementations of regexes which makes them well suited to writing tools for text manipulation.

Resources

I have found some excellent resources for learning about regular expressions. There are more than I have listed here but these are the ones I have found to be particularly useful.

The grep man page provides a good reference but is not appropriate for learning about regular expressions. The O'Reilly book, Mastering Regular Expressions7, is a very good tutorial and reference for regular expressions. I recommend it for anyone who is or wants to be a Linux SysAdmin, because you will use regular expressions. Another good O'Reilly book is sed & awk8, which covers both of these powerful tools and also has an excellent discussion of regular expressions.

There are also some good web sites that can help you learn about regular expressions and that provide interesting and useful cookbook-style regex examples. Some of them ask for money in return for their use. Jason Baker, my technical reviewer for Volumes 1 and 2 of my Using and Administering Linux course, suggests https://regexcrossword.com/ as a good learning tool.

Summary

This article has provided a very brief introduction to the complex world of regular expressions. We have explored the regex implementations in the grep and sed utilities in just enough depth to give you an idea of some of the amazing things that can be accomplished with regexes. We have also looked at several Linux tools and programming languages that also implement regexes.

But make no mistake! We have only scratched the surface of these tools and regular expressions. There is much more to learn and there are some excellent resources for doing so.


1See the grep info page in Section 3.6 Basic vs Extended Regular Expressions

2When I talk about regular expressions in a general sense, I usually mean to include both basic and extended regular expressions. If there is a differentiation to be made, I will use the acronyms BRE for basic regular expressions or ERE for extended regular expressions.

3One general meaning of parse is to examine something by studying its component parts. For our purposes we parse a data stream to locate sequences of characters that match a specified pattern.

4Wikipedia, POSIX, https://en.wikipedia.org/wiki/POSIX

5The official form of systemd is all lowercase.

6Many people call tools like grep "filter" programs because they filter unwanted lines out of the data stream. I prefer the term "transformers" because ones such as sed and awk do more than just filter. They can test the content for various string combinations and alter the matching content in many different ways. Tools like sort, head, tail, uniq, fmt, and more all transform the data stream in some way.

7Friedl, Jeffrey E. F., Mastering Regular Expressions, O’Reilly, 2012, Paperback ISBN-13: 978-0596528126

8Robbins, Arnold, and Dougherty, Dale, sed & awk: UNIX Power Tools (Nutshell Handbooks), O’Reilly, 2012, ISBN-13: 978-1565922259