{"id":5533,"date":"2019-11-01T08:40:25","date_gmt":"2019-11-01T12:40:25","guid":{"rendered":"http:\/\/www.linux-databook.info\/?page_id=5533"},"modified":"2019-10-24T13:54:53","modified_gmt":"2019-10-24T17:54:53","slug":"introducing-regular-expressions","status":"publish","type":"page","link":"http:\/\/www.linux-databook.info\/?page_id=5533","title":{"rendered":"Introducing regular expressions"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><em>Note: This article is a slightly modified version of Chapter 6\nfrom Volume 2 of my Linux self-study course, \u201cUsing and\nAdministering Linux: Zero to SysAdmin,\u201d due out from Apress in late\n2019.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We have all used file globbing with wildcard characters like * and\n? as a means to select specific files or lines of data from a data\nstream. These tools are powerful and I use them many times a day. Yet\nthere are things that cannot be done with wildcards.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"> Regular expressions (REGEXes or REs) provide us with more complex\nand flexible pattern matching capabilities. Just as certain\ncharacters take on special meaning when using file globbing, REs also\nhave special characters. There are two main types of regular\nexpressions (REs), Basic Regular Expressions (BREs) and Extended\nRegular Expressions (EREs).  \n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The first thing we need are some definitions. There are many\ndefinitions for the term \u201cregular expressions\u201d but many are dry\nand uninformative. Here are mine.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Regular Expressions are strings of literal and metacharacters\n\tthat can be used as patterns by various Linux utilities to match\n\tstrings of ASCII plain text data in a data stream. 
When a match\n\toccurs it can be used to extract  or eliminate a line of data from\n\tthe stream or to modify the matched string in some way.\n\t<\/li><li>Basic Regular Expressions (BRE) and Extended Regular\n\tExpressions (ERE) are not significantly different in terms of\n\tfunctionality<a href=\"#sdfootnote1sym\"><sup>1<\/sup><\/a>.\n\tThe primary difference is in the syntax used and how metacharacters\n\tare specified. In basic regular expressions the meta-characters \u2018?\u2019,\n\t\u2018+\u2019, \u2018{\u2019, \u2018|\u2019, \u2018(\u2019, and \u2018)\u2019 lose their special\n\tmeaning; instead, it is necessary to use the backslashed versions\n\t\u2018\\?\u2019, \u2018\\+\u2019, \u2018\\{\u2019, \u2018\\|\u2019, \u2018\\(\u2019, and \u2018\\)\u2019. The\n\tERE syntax is believed by many to be easier to use.\n<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Regular expressions (REs)<a href=\"#sdfootnote2sym\"><sup>2<\/sup><\/a>\ntake the concept of using metacharacters to match patterns in data\nstreams much further than file globbing and give us even more control\nover the items we select from a data stream. REs are used by various\ntools to parse<a href=\"#sdfootnote3sym\"><sup>3<\/sup><\/a>\na data stream to match patterns of characters in order to perform\nsome transformation on the data. \n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Regular expressions have a reputation for being obscure and arcane\nincantations that only those with special wizardly SysAdmin powers\nuse. Figure 1 would seem to confirm this. The command pipeline\nappears to be an intractable sequence of meaningless gibberish to\nanyone without the knowledge of regex. It certainly seemed that way\nto me the first time I encountered something similar early in my\ncareer. 
As you will see, it is actually relatively simple once it is\nall explained.<\/p>\n\n\n\n<table class=\"wp-block-table has-subtle-light-gray-background-color has-background\"><tbody><tr><td><code>cat Experiment_6-1.txt | grep -v Team | grep -v \"^\\s*$\" | sed -e \"s\/[Ll]eader\/\/\" -e \"s\/\\[\/\/g\" -e \"s\/\\]\/\/g\" -e \"s\/)\/\/g\" | awk '{print $1\" \"$2\" &lt;\"$3\"&gt;\"}' &gt; addresses.txt <\/code><\/td><\/tr><\/tbody><\/table>\n\n\n\n<p class=\"wp-block-paragraph\">\n<em>Figure\n1: A real world sample of the use of regular expressions. It is\nactually a single line that I used to transform a file that was sent\nto me into a usable form.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We can only begin to touch upon all of the possibilities opened to\nus by regular expressions in a single article. There are entire books\ndevoted exclusively to regular expressions so we will explore the\nbasics here \u2013 just enough to get started with tasks common to\nSysAdmins. \n<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Getting started<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Now we need a real world example to use as a learning tool. Here\nis one I encountered several years ago.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The mailing list<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">This example highlights the power and flexibility of the Linux\ncommand line, especially regular expressions, for their ability to\nautomate common tasks. I have administered several listservs during\nmy career and still do. People send me lists of email addresses to\nadd to those lists. In more than one case I have received a list of\nnames and email addresses in a Word format that were to be added to\none of the lists. \n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The list itself was not really very long but it was very\ninconsistent in its formatting. An abbreviated version of that list,\nwith name and domain changes, is shown in Figure 2. 
The original list\nhas extra lines, characters like brackets and parentheses that need\nto be deleted, whitespace such as spaces and tabs, and some empty\nlines. The format required to add these emails to the list is, first\nlast &lt;email@example.com&gt;. Our task is to transform this list\ninto a format usable by the mailing list software.<\/p>\n\n\n\n<table class=\"wp-block-table has-subtle-light-gray-background-color has-background\"><tbody><tr><td><code>  Team 1 Apr 3  <br> Leader  Virginia Jones  vjones88@example.com <br> Frank Brown  FBrown398@example.com <br> Cindy Williams  cinwill@example.com <br> Marge smith   msmith21@example.com  <br>  [Fred Mack]   edd@example.com <br> <br> Team 2 March 14<br> leader  Alice Wonder  Wonder1@example.com <br> John broth  bros34@example.com <br> Ray Clarkson  Ray.Clarks@example.com <br> Kim West    kimwest@example.com <br> [JoAnne Blank]  jblank@example.com <br> <br> Team 3 Apr 1  <br> Leader  Steve Jones  sjones23876@example.com <br> Bullwinkle Moose bmoose@example.com <br> Rocket Squirrel RJSquirrel@example.com <br> Julie Lisbon  julielisbon234@example.com <br> [Mary Lastware) mary@example.com <\/code><\/td><\/tr><\/tbody><\/table>\n\n\n\n<p class=\"wp-block-paragraph\">\n<em>Figure\n2: A partial, modified listing of the document of email addresses to\nadd to a listserv.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It was obvious that I needed to manipulate the data in order to\nmangle it into an acceptable format for inputting to the list. It is\npossible to use a text editor or a word processor such as LibreOffice\nWriter to make the necessary changes to this small file. However,\npeople send me files like this quite often so it becomes a chore to\nuse a word processor to make these changes. Despite the fact that\nWriter has a good search and replace function, each character or\nstring must be replaced singly and there is no way to save previous\nsearches. 
Writer does have a very powerful macro feature, but I am\nnot familiar with either of its two languages, LibreOffice Basic or\nPython. I do know Bash shell programming.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Experiment 1<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">I did what comes naturally to a SysAdmin \u2013 I automated the task.\nThe first thing I did was to copy the address data to a text file so\nI could work on it using command line tools. After a few minutes of\nwork, I developed the Bash command line program in Figure 1 that\nproduced the desired output as the file, addresses.txt. I used my\nnormal approach to writing command line programs like this by\nbuilding up the pipeline one command at a time.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s break this pipeline down into its component parts to see\nhow it works and fits together. All of the experiments in this\narticle should be performed as a non-privileged user. I also did this\non a VM that I created for testing, studentvm1.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, download the sample file, Experiment_6-1.txt, with the wget\ncommand shown below. 
Let\u2019s do all of this work in a new directory so we will\ncreate that too.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 ~]$ <strong>mkdir testing ; cd testing<\/strong>\n[student@studentvm1 testing]$ <strong>wget <\/strong><a href=\"http:\/\/www.linux-databook.info\/downloads\/Experiment_6-1.txt\">http:\/\/www.linux-databook.info\/downloads\/Experiment_6-1.txt<\/a><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now we just take a look at the file and see what we need to do.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>cat Experiment_6-1.txt<\/strong> \n Team 1  Apr 3  \n Leader  Virginia Jones  vjones88@example.com\n Frank Brown  FBrown398@example.com\n Cindy Williams  cinwill@example.com\n Marge smith   msmith21@example.com  \n  [Fred Mack]   edd@example.com   \n Team 2  March 14\n leader  Alice Wonder  Wonder1@example.com\n John broth  bros34@example.com   \n Ray Clarkson  Ray.Clarks@example.com\n Kim West    kimwest@example.com  \n [JoAnne Blank]  jblank@example.com\n Team 3  Apr 1  \n Leader  Steve Jones  sjones23876@example.com\n Bullwinkle Moose bmoose@example.com\n Rocket Squirrel RJSquirrel@example.com   \n Julie Lisbon  julielisbon234@example.com\n [Mary Lastware) mary@example.com\n [student@studentvm1 testing]$ <\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The first things I see that can be done are a couple easy ones.\nSince the Team names and dates are on lines by themselves we can use\nthe following to remove those lines that have the word \u201cTeam\u201d. I\nplace the end of sentence period outside the quotes for clarity to\nensure that only the intended string is inside the quotes. 
\n<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>cat Experiment_6-1.txt | grep -v Team<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I won\u2019t reproduce the results of each stage of building this\nBash program but you should be able to see the changes in the data\nstream as it shows up on STDOUT, the terminal session. We won\u2019t\nsave it in a file until the end. \n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this first step in transforming the data stream into one that\nis usable, we use the <strong>grep<\/strong>\ncommand with a simple literal pattern, \u201cTeam.\u201d Literals are the\nmost basic type of pattern we can use as a regular expression because\nthere is only a single possible match in the data stream being\nsearched, and that is the string \u201cTeam\u201d.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We need to discard empty lines so we can use another grep\nstatement to eliminate them. I find that enclosing the regular\nexpression for the second grep command in quotes ensures that it gets\ninterpreted properly. 
\n<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>cat Experiment_6-1.txt | grep -v Team | grep -v \"^\\s*$\"<\/strong>\nLeader  Virginia Jones  vjones88@example.com\nFrank Brown  FBrown398@example.com\nCindy Williams  cinwill@example.com\nMarge smith   msmith21@example.com  \n [Fred Mack]   edd@example.com   \nleader  Alice Wonder  Wonder1@example.com\nJohn broth  bros34@example.com   \nRay Clarkson  Ray.Clarks@example.com\nKim West    kimwest@example.com  \n[JoAnne Blank]  jblank@example.com\nLeader  Steve Jones  sjones23876@example.com\nBullwinkle Moose bmoose@example.com\nRocket Squirrel RJSquirrel@example.com   \nJulie Lisbon  julielisbon234@example.com\n[Mary Lastware) mary@example.com\n[student@studentvm1 testing]$ <\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The expression <strong>&#8220;^\\s*$&#8221;<\/strong> illustrates anchors and using the backslash (<strong>\\<\/strong>) as an escape character to change the meaning of a literal, \u201c<strong>s<\/strong>\u201d in this case, to a metacharacter that means any whitespace such as spaces, tabs, or other characters that are unprintable. We cannot see these characters in the file, but it does contain some of them. The asterisk, aka splat (<strong>*<\/strong>), specifies that we are to match zero or more of the whitespace characters. This would match multiple tabs or multiple spaces or any combination of those in an otherwise empty line.  <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I configured my Vim editor to display whitespace using visible characters. Do this by adding the following line to your own ~\/.vimrc or the global \/etc\/vimrc configuration file. The list option must also be enabled (set list) for Vim to display these characters. 
Then start \u2013 or restart \u2013 Vim.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">set listchars=eol:$,nbsp:_,tab:&lt;-&gt;,trail:~,extends:&gt;,space:+<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I have found a lot of bad, very incomplete, and contradictory\ninformation on the Internet in my searches for how to do this. The\nbuilt-in Vim help has the best information and the data line I have\ncreated from that here is one that works for me. \n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The result, before any operation on the file, is shown in Figure\n3. Regular spaces are shown as +; tabs are shown as &lt;, &lt;&gt;,\nor &lt;&#8211;&gt;, and fill the length of the space that the tab covers.\nThe end of line (EOL) character is shown as $.<\/p>\n\n\n\n<table class=\"wp-block-table has-subtle-light-gray-background-color has-background\"><tbody><tr><td><code>Team+1&lt;&gt;Apr+3~$<br>Leader++Virginia+Jones++vjones88@example.com&lt;--&gt;$<br>Frank+Brown++FBrown398@example.com&lt;----&gt;$<br>Cindy+Williams++cinwill@example.com&lt;---&gt;$<br>Marge+smith+++msmith21@example.com~$<br>+[Fred+Mack]+++edd@example.com&lt;&gt;$<br>$<br>Team+2&lt;&gt;March+14$<br>leader++Alice+Wonder++Wonder1@example.com&lt;-----&gt;$<br>John+broth++bros34@example.com&lt;&gt;$<br>Ray+Clarkson++Ray.Clarks@example.com&lt;--&gt;$<br>Kim+West++++kimwest@example.com&gt;$<br>[JoAnne+Blank]++jblank@example.com&lt;----&gt;$<br>$<br>Team+3&lt;&gt;Apr+1~$<br>Leader++Steve+Jones++sjones23876@example.com&lt;--&gt;$<br>Bullwinkle+Moose+bmoose@example.com&lt;---&gt;$<br>Rocket+Squirrel+RJSquirrel@example.com&lt;&gt;$<br>Julie+Lisbon++julielisbon234@example.com&lt;------&gt;$<br>[Mary+Lastware)+mary@example.com$ <\/code><\/td><\/tr><\/tbody><\/table>\n\n\n\n<p class=\"wp-block-paragraph\">\n<em>Figure\n3: The Experiment_6-1.txt file showing all of the embedded\nwhitespace.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can see that there are a lot of whitespace characters that\nneed to be 
removed from our file. We also need to get rid of the word\n\u201cleader\u201d which appears twice and is capitalized once. Let\u2019s get\nrid of \u201cleader\u201d first. This time we will use <strong>sed<\/strong>\n(stream editor) to perform this task by substituting a new string \u2013\nor a null string in our case \u2013 for the pattern it matches. Adding\n<strong>sed\n-e &#8220;s\/[Ll]eader\/\/&#8221;<\/strong>\nto the pipeline does this.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>cat Experiment_6-1.txt | grep -v Team | grep -v \"^\\s*$\" | sed -e \"s\/[Ll]eader\/\/\"<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">In this <strong>sed<\/strong>\ncommand, -e means that the quote enclosed expression is a script that\nproduces a desired result. In the expression the s means that this is\na substitution. The basic form of a substitution is\ns\/regex\/replacement string\/. So \/[Ll]eader\/ is our search string. The\nset [Ll] matches L or l so [Ll]eader matches leader or Leader. In\nthis case the replacement string is null because it looks like this &#8211;\n\/\/ &#8211; a double forward slash with no characters or whitespace between\nthe two slashes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now let\u2019s get rid of some of the extraneous characters like []()\nthat will not be needed.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>cat Experiment_6-1.txt | grep -v Team | grep -v \"^\\s*$\" | sed -e \"s\/[Ll]eader\/\/\" -e \"s\/\\[\/\/g\" -e \"s\/]\/\/g\" -e \"s\/)\/\/g\" -e \"s\/(\/\/g\"<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We have added four new expressions to the <strong>sed<\/strong> statement. Each one removes a single character. The first of these additional expressions is a bit different. 
Because the left square brace [ character can mark the beginning of a set, we need to escape it to ensure that <strong>sed<\/strong> interprets it correctly as a regular character and not a special one.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We could use <strong>sed<\/strong> to remove the leading spaces from some of the lines, but the <strong>awk<\/strong> command can do that as well as reorder the fields if necessary, and add the &lt;&gt; characters around the email address.  <\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>cat Experiment_6-1.txt | grep -v Team | grep -v \"^\\s*$\" | sed -e \"s\/[Ll]eader\/\/\" -e \"s\/\\[\/\/g\" -e \"s\/]\/\/g\" -e \"s\/)\/\/g\" -e \"s\/(\/\/g\" | awk '{print $1\" \"$2\" &lt;\"$3\"&gt;\"}'<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>awk<\/strong>\nutility is actually a very powerful programming language that can\naccept data streams on its STDIN. This makes it extremely useful in\ncommand line programs and scripts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>awk<\/strong>\nutility works on data fields and the default field separator is\nspaces \u2013 any amount of white space. The data stream we have created\nso far has three fields separated by whitespace, first, last, and\nemail. This little program <strong>awk\n&#8216;{print $1&#8243; &#8220;$2&#8221; &lt;&#8220;$3&#8243;&gt;&#8221;}&#8217;<\/strong>\ntakes each of the three fields, $1, $2, and $3 and extracts them\nwithout leading or trailing whitespace. It then prints them in\nsequence adding a single space between each as well as the &lt;&gt;\ncharacters needed to enclose the email address.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The last step here would be to redirect the output data stream to\na file but that is trivial so I leave it with you to perform that\nstep. 
It is not really necessary that you do so.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I saved the Bash program in an executable file and now I can run\nthis program any time I receive a new list. Some of those lists are\nfairly short, as is the one in Figure 3, but others have been quite\nlong, sometimes containing up to several hundred addresses and many\nlines of \u201cstuff\u201d that do not contain addresses to be added to the\nlist. \n<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Experiment 2<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Now that we have a working solution, one that is a\nstep-by-step exploration of the tools we are using, we can do quite a\nbit more to perform the same task in a more compact and optimized\ncommand line program.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this experiment we explore ways in which we can shorten and\nsimplify our command line program. The final result of Experiment 1\nwas the following CLI program.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">cat Experiment_6-1.txt | grep -v Team | grep -v \"^\\s*$\" | sed -e \"s\/[Ll]eader\/\/\" -e \"s\/\\[\/\/g\" -e \"s\/]\/\/g\" -e \"s\/)\/\/g\" -e \"s\/(\/\/g\" | awk '{print $1\" \"$2\" &lt;\"$3\"&gt;\"}'<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s start near the beginning and combine the two <strong>grep<\/strong>\nstatements. The result is shorter and more succinct. It also means\nfaster execution because <strong>grep<\/strong>\nonly needs to parse the data stream once. \n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em><strong>Tip:<\/strong> When the STDOUT from grep is not piped through\nanother utility, and when using a terminal emulator that supports\ncolor, the regex matches are highlighted in the output data stream. <\/em>\n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the revised command, <strong>grep\n-vE &#8220;Team|^\\s*$&#8221;<\/strong>, we\nadd the E option which specifies extended regex. 
According to the\n<strong>grep<\/strong>\nman page, \u201cIn GNU grep there is no difference in available\nfunctionality between basic and extended syntaxes.\u201d The syntax does\ndiffer, however: our new combined expression fails without the E\noption because basic syntax treats | as a literal character and\nrequires \\| for alternation. Run the following to see the results.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>cat Experiment_6-1.txt | grep -vE \"Team|^\\s*$\"<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Try it without the E option. The <strong>grep<\/strong>\ntool can also read data from a file so we eliminate the <strong>cat<\/strong>\ncommand. \n<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep -vE \"Team|^\\s*$\" Experiment_6-1.txt<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This leaves us with the following, somewhat simplified CLI\nprogram.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">grep -vE \"Team|^\\s*$\" Experiment_6-1.txt | sed -e \"s\/[Ll]eader\/\/\" -e \"s\/\\[\/\/g\" -e \"s\/]\/\/g\" -e \"s\/)\/\/g\" -e \"s\/(\/\/g\" | awk '{print $1\" \"$2\" &lt;\"$3\"&gt;\"}'<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We can also simplify the <strong>sed<\/strong>\ncommand and we will do so after we learn more about regular\nexpressions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It is important to realize that this solution is not the only one.\nThere are different methods in Bash for producing the same output,\nand there are other languages like Python and Perl that can also be used.\nAnd, of course, there are always LibreOffice Writer macros. But I can\nalways count on Bash as part of any Linux distribution. 
I can perform\nthese tasks using Bash programs on any Linux computer, even one\nwithout a GUI desktop or that does not have LibreOffice installed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">grep<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Because GNU <strong>grep<\/strong> is one of the tools I use most, and it provides a more or less standardized implementation of regular expressions, I will use that set of expressions as the basis for the next part of this article. We will then look again at <strong>sed<\/strong>, another tool that uses regular expressions. There are many details that are important to understanding some of the complexity of regex implementations and how they work.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data flow<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">All implementations of regular expressions are line-based. A\npattern created by a combination of one or more expressions is\ncompared against each line of a data stream. When a match is made, an\naction is taken on that line as prescribed by the tool being used.\nFor example, when a pattern match occurs with grep, the usual action\nis to pass that line on to STDOUT and lines that do not match the\npattern are discarded. As we have seen, the -v option reverses those\nactions so that the lines with matches are discarded.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Each line of the data stream is evaluated on its own and the\nresults of matching the expressions in the pattern with the data from\nprevious lines are not carried over. It might be helpful to think of\neach line of a data stream as a record and that the tools that use\nregexes process one record at a time. 
When a match is made, an\naction defined by the tool in use is taken on the line that contains\nthe matching string.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">regex building blocks<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Figure 4 contains a list of the basic building block expressions\nand metacharacters implemented by the GNU grep command and their\ndescriptions. When used in a pattern, each of these expressions or\nmetacharacters matches a single character in the data stream being\nparsed.  \n<\/p>\n\n\n\n<table class=\"wp-block-table has-subtle-light-gray-background-color has-background\"><thead><tr><td>\n\t\t\t\t\t<strong>Expression<\/strong>\n\t\t\t\t<\/td><td>\n\t\t\t\t\t<strong>Description<\/strong>\n\t\t\t\t<\/td><\/tr><\/thead><tbody><tr><td>\n\t\t\t\t\tAlphanumeric\n\t\t\t\t\tcharacters\n\t\t\t\t\t\n\t\t\t\t\tLiterals\n\t\t\t\t\t\n\t\t\t\t\tA-Z,a-z,0-9\n\t\t\t\t<\/td><td>\n\t\t\t\t\tAll\n\t\t\t\t\talphanumeric and some punctuation characters are considered as\n\t\t\t\t\tliterals. Thus the letter \u201ca\u201d in a regex will always match\n\t\t\t\t\tthe letter \u201ca\u201d in the data stream being parsed. There is no\n\t\t\t\t\tambiguity for these characters. Each literal character matches\n\t\t\t\t\tone and only one character.\n\t\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t\t.\n\t\t\t\t\t(dot)\n\t\t\t\t<\/td><td>\n\t\t\t\t\tThe\n\t\t\t\t\tdot (.) metacharacter is the most basic form of expression. It\n\t\t\t\t\tmatches any single character in the position it is encountered\n\t\t\t\t\tin a pattern. So the pattern b.g would match big, bigger, bag,\n\t\t\t\t\tbaguette, and bog, but not dog, blog, hug, lag, gag, or leg,\n\t\t\t\t\tetc. \n\t\t\t\t\t\n\t\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t\tBracket\n\t\t\t\t\texpression\n\t\t\t\t\t\n\t\t\t\t\t[list\n\t\t\t\t\tof characters]\n\t\t\t\t\t\n\t\t\t\t\t<br>\n\n\t\t\t\t\t\n\t\t\t\t<\/td><td>\n\t\t\t\t\tGNU\n\t\t\t\t\tgrep calls this a bracket expression and it is the same as a set\n\t\t\t\t\tfor the Bash shell. 
The brackets enclose a list of characters to\n\t\t\t\t\tmatch for a single character location in the pattern. [abcdABCD]\n\t\t\t\t\tmatches the letters a, b, c, or d in either upper or lower case.\n\t\t\t\t\t[a-dA-D] specifies a range of characters that creates the same\n\t\t\t\t\tmatch. [a-zA-Z] matches the alphabet in upper and lower case. \n\t\t\t\t\t\n\t\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t\t[:class\n\t\t\t\t\tname:]\n\t\t\t\t\t\n\t\t\t\t\tCharacter\n\t\t\t\t\tclasses\n\t\t\t\t<\/td><td>\n\t\t\t\t\tThis\n\t\t\t\t\tis a POSIX<a href=\"#sdfootnote4sym\"><sup>4<\/sup><\/a>\n\t\t\t\t\tattempt at regex standardization. The class names are supposed\n\t\t\t\t\tto be obvious. For example the [:alnum:] class matches all\n\t\t\t\t\talphanumeric characters. Other classes are [:digit:] which\n\t\t\t\t\tmatches any one digit 0-9, [:alpha:], [:space:], and so on. Note\n\t\t\t\t\tthat there may be issues due to differences in the sorting\n\t\t\t\t\tsequences in different locales. Read the grep man page for\n\t\t\t\t\tdetails.\n\t\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t\t^\n\t\t\t\t\tand $\n\t\t\t\t\t\n\t\t\t\t\tAnchors\n\t\t\t\t<\/td><td>\n\t\t\t\t\tThese\n\t\t\t\t\ttwo metacharacters match the beginning and ending of a line,\n\t\t\t\t\trespectively. They are said to anchor the rest of the pattern to\n\t\t\t\t\teither the beginning or ending of a line. The expression ^b.g\n\t\t\t\t\twould only match big, bigger, bag, etc., as shown above, if they\n\t\t\t\t\toccur at the beginning of the line being parsed. The pattern\n\t\t\t\t\tb.g$ would match big or bag only if they occur at the end of the\n\t\t\t\t\tline, but not bigger.\n\t\t\t\t<\/td><\/tr><\/tbody><\/table>\n\n\n\n<p class=\"wp-block-paragraph\">\n<em>Figure\n4: These expressions and metacharacters are implemented by grep and\nmost other regex implementations.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s explore these building blocks before continuing on with some of the modifiers. 
The text file we will use for Experiment 3 is from a lab project I created for a Linux class I wrote and used to teach. It was originally in a LibreOffice Writer ODT file but I saved it to an ASCII text file. Most of the formatting of things like tables was removed but the result is a long ASCII text file that we can use for this series of experiments.  <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Experiment 3<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We must download the sample file from the Apress GitHub website.\nIf the directory  ~\/testing is not the PWD, make it so. \n<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>wget <\/strong><a href=\"http:\/\/www.linux-databook.info\/downloads\/Experiment_6-3.txt\">http:\/\/www.linux-databook.info\/downloads\/Experiment_6-3.txt<\/a><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To begin, just use the less command to look at and explore the\nExperiment_6-3.txt file for a few minutes so you have an idea of its\ncontent. \n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now we will use some simple expressions in grep to extract lines\nfrom the input data stream. The Table of Contents (TOC) contains a\nlist of projects and their respective page numbers in the PDF\ndocument. Let\u2019s extract the TOC starting with lines ending in two\ndigits.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep [0-9][0-9]$ Experiment_6-3.txt<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">That is not really what we want. It displays all lines that end in\ntwo digits and misses TOC entries with only one digit. We will look\nat how to deal with an expression for one or more digits in a later\nexperiment. 
Looking at the whole file in <strong>less<\/strong>,\nwe could do something like this.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep \"^Lab Project\" Experiment_6-3.txt | grep \"[0-9]$\"<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This is much closer to what we want but it is not quite there. We\nget some lines from later in the document that also match these\nexpressions. If you study the extra lines and look at those in the\ncomplete document you can see why they match while not being part of\nthe TOC. This also misses TOC entries that do not start with \u201cLab\nProject.\u201d Sometimes this is the best you can do, but it does give a\nbetter look at the TOC than we had before. We will look at how to\ncombine these two grep instances into a single one in a later\nexperiment in this article.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now let\u2019s modify this a bit and use the POSIX expression. Notice\nthe double square braces around the POSIX expression. Single braces\ngenerate an error message.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep \"^Lab Project\" Experiment_6-3.txt | grep \"[[:digit:]]$\"<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This gives the same results as the previous attempt. Let\u2019s look\nfor something different.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep systemd Experiment_6-3.txt<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This lists all occurrences of \u201csystemd\u201d in the file. Try using\nthe -i option to ensure that you get all instances including those\nthat start with uppercase<a href=\"#sdfootnote5sym\"><sup>5<\/sup><\/a>.\nOr you could just change the literal expression to Systemd. Count the\nnumber of lines with the string systemd contained in them. 
I always\nuse -i to ensure that all instances of the search expression are\nfound regardless of case.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep -i systemd Experiment_6-3.txt | wc<\/strong>\n<code> 20     478    3098<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">As you can see, I have 20 lines and you should have the same\nnumber.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is an example of matching a metacharacter, the left bracket\n([). First let\u2019s try it without doing anything special.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep -i \"[\" Experiment_6-3.txt<\/strong> \ngrep: Invalid regular expression<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This occurs because [ is interpreted as a metacharacter. We need\nto \u201cescape\u201d this character with a backslash so that it is\ninterpreted as a literal character and not as a metacharacter.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep -i \"\\[\" Experiment_6-3.txt<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Most metacharacters lose their special meaning when used inside\nbracket expressions. To include a literal ] place it first in the\nlist. To include a literal ^ place it anywhere but first. To include\na literal - place it last.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Repetition<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Regular expressions may be modified using some operators that\nallow specification of zero, one, or more repetitions of a character\nor expression. 
These repetition operators are placed immediately\nfollowing the literal character or metacharacter used in the pattern.<\/p>\n\n\n\n<table class=\"wp-block-table has-subtle-light-gray-background-color has-background is-style-regular\"><thead><tr><td>\n\t\t\t\t<strong>Operator<\/strong>\n\t\t\t<\/td><td>\n\t\t\t\t<strong>Description<\/strong>\n\t\t\t<\/td><\/tr><\/thead><tbody><tr><td>\n\t\t\t\t?\n\t\t\t<\/td><td>\n\t\t\t\tIn\n\t\t\t\tregexes the ? means zero or one occurrence at most of the\n\t\t\t\tpreceding character. So for example, &#8220;drives?&#8221; matches\n\t\t\t\tdrive, and drives but not driver. Using \u201cdrive\u201d for the\n\t\t\t\texpression would match drive, drives, and driver. This is a bit\n\t\t\t\tdifferent from the behavior of ? in a glob.\n\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t*\n\t\t\t<\/td><td>\n\t\t\t\tThe\n\t\t\t\tcharacter preceding the * will be matched zero or more times\n\t\t\t\twithout limit. In this example, &#8220;drives*&#8221; matches\n\t\t\t\tdrive, drives, and drivesss but not driver. Again this is a bit\n\t\t\t\tdifferent from the behavior of * in a glob.\n\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t+\n\t\t\t<\/td><td>\n\t\t\t\tThe\n\t\t\t\tcharacter preceding the + will be matched one or more times. The\n\t\t\t\tcharacter must exist in the line at least once for a match to\n\t\t\t\toccur. As one example, &#8220;drives+&#8221; matches drives, and\n\t\t\t\tdrivesss but not drive or driver. \n\t\t\t\t\n\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t{n}\n\t\t\t<\/td><td>\n\t\t\t\tThis\n\t\t\t\toperator matches the preceding character exactly n times. The\n\t\t\t\texpression \u201cdrives{2}\u201d matches drivess but not drive, drives,\n\t\t\t\tdrivesss, or any number of trailing \u201cs\u201d characters. 
However,\n\t\t\t\tbecause drivesssss contains the string drivess, a match occurs on\n\t\t\t\tthat string so grep would still match the line.\n\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t{n,}\n\t\t\t<\/td><td>\n\t\t\t\tThis\n\t\t\t\toperator matches the preceding character n or more times. The\n\t\t\t\texpression \u201cdrives{2,}\u201d matches drivess, drivesss, and any\n\t\t\t\tlonger run of trailing \u201cs\u201d characters, but not drive or\n\t\t\t\tdrives. Because drivesssss contains the string drivess, a\n\t\t\t\tmatch occurs.\n\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t{,m}\n\t\t\t<\/td><td>\n\t\t\t\tThis\n\t\t\t\toperator matches the preceding character no more than m times.\n\t\t\t\tThe expression \u201cdrives{,2}\u201d matches drive, drives, and\n\t\t\t\tdrivess, but not drivesss, or any number of trailing \u201cs\u201d\n\t\t\t\tcharacters. Once again, because drivesssss contains the string\n\t\t\t\tdrivess, a match occurs.\n\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t{n,m}\n\t\t\t<\/td><td>\n\t\t\t\tThis\n\t\t\t\toperator matches the preceding character at least n times but no\n\t\t\t\tmore than m times. The expression \u201cdrives{1,3}\u201d matches\n\t\t\t\tdrives, drivess, and drivesss, but not drivessss or any number of\n\t\t\t\ttrailing \u201cs\u201d characters. Once again, because drivesssss\n\t\t\t\tcontains a matching string, a match occurs.\n\t\t\t<\/td><\/tr><\/tbody><\/table>\n\n\n\n<p class=\"wp-block-paragraph\">\n<em>Figure\n5: Meta-character modifiers that specify repetition. <\/em>\n<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Experiment 4<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Run each of the following commands and examine the results\ncarefully so that you understand what is happening.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>[student@studentvm1 testing]$ <\/code><strong><code>grep -E \"files?\" 
Experiment_6-3.txt<\/code><\/strong>\n<code>[student@studentvm1 testing]$ <\/code><strong><code>grep -Ei \"drives*\" Experiment_6-3.txt<\/code><\/strong>\n<code>[student@studentvm1 testing]$ <\/code><strong><code>grep -Ei \"drives+\" Experiment_6-3.txt<\/code><\/strong>\n<code>[student@studentvm1 testing]$ <\/code><strong><code>grep -Ei \"drives{2}\" Experiment_6-3.txt<\/code><\/strong>\n<code>[student@studentvm1 testing]$ <\/code><strong><code>grep -Ei \"drives{2,}\" Experiment_6-3.txt<\/code><\/strong>\n<code>[student@studentvm1 testing]$ <\/code><strong><code>grep -Ei \"drives{,2}\" Experiment_6-3.txt<\/code><\/strong>\n<code>[student@studentvm1 testing]$ <\/code><strong><code>grep -Ei \"drives{2,3}\" Experiment_6-3.txt<\/code><\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Be sure to experiment with these modifiers on other text in the\nsample file.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Other metacharacters<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">There are still some interesting and important modifiers that we\nneed to explore.<\/p>\n\n\n\n<table class=\"wp-block-table has-subtle-light-gray-background-color has-background\"><thead><tr><td>\n\t\t\t\t<strong>Modifier<\/strong>\n\t\t\t<\/td><td>\n\t\t\t\t<strong>Description<\/strong>\n\t\t\t<\/td><\/tr><\/thead><tbody><tr><td>\n\t\t\t\t\\&lt;\n\t\t\t<\/td><td>\n\t\t\t\tThis\n\t\t\t\tspecial expression matches the empty string at the beginning of a\n\t\t\t\tword. 
The expression &#8220;\\&lt;fun&#8221; would match on \u201c fun\u201d\n\t\t\t\tand, when the -i option is used, \u201cFunction\u201d but not \u201crefund\u201d.\n\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t\\&gt;\n\t\t\t<\/td><td>\n\t\t\t\tThis\n\t\t\t\tspecial expression matches the empty string at the end of a\n\t\t\t\tword, so the word may be followed by a space, by punctuation, or\n\t\t\t\tby the end of the line, but not by another word character. So\n\t\t\t\t\u201cenvironment\\&gt;\u201d matches \u201cenvironment\u201d, \u201cenvironment,\u201d,\n\t\t\t\tand \u201cenvironment.\u201d but not environments or environmental. \n\t\t\t\t\n\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t^\n\t\t\t<\/td><td>\n\t\t\t\tIn\n\t\t\t\ta character class expression, this operator negates the list of\n\t\t\t\tcharacters. Thus, while the class [a-c] matches a, b, or c in\n\t\t\t\tthat position of the pattern, the class [^a-c] matches anything\n\t\t\t\tbut a, b, or c.\n\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t|\n\t\t\t<\/td><td>\n\t\t\t\tWhen\n\t\t\t\tused in a regex, the | metacharacter is a logical \u201cor\u201d\n\t\t\t\toperator. It is officially called the \u201cinfix\u201d or\n\t\t\t\t\u201calternation\u201d operator. We have already encountered this in\n\t\t\t\tExperiment 2, where we saw that the regex &#8220;Team|^\\s*$&#8221;\n\t\t\t\tmeans, \u201ca line with \u2018Team\u2019 or ( | ) an empty line\n\t\t\t\tincluding one that has zero, one, or more whitespace characters\n\t\t\t\tsuch as spaces, tabs, and other unprintable characters.\u201d\n\t\t\t<\/td><\/tr><tr><td>\n\t\t\t\t(\n\t\t\t\tand )\n\t\t\t<\/td><td>\n\t\t\t\tThe\n\t\t\t\tparentheses ( and ) allow us to group parts of a pattern so that\n\t\t\t\tthey are evaluated in a specific sequence, much as parentheses\n\t\t\t\tare used for logical comparisons in a programming language.\n\t\t\t<\/td><\/tr><\/tbody><\/table>\n\n\n\n<p class=\"wp-block-paragraph\">\n<em>Figure\n6: Meta-character modifiers. 
<\/em>\n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We now have a way to specify word boundaries with the \\&lt; and \\&gt;\nmetacharacters. This means we can now be even more explicit with our\npatterns. We can also use some logic in more complex patterns.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Experiment 5<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Start with a couple of simple patterns. This first one selects all\ninstances of drives but not drive, drivess, or drives with additional\ntrailing \u201cs\u201d characters. \n<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep -Ei \"\\&lt;drives\\&gt;\" Experiment_6-3.txt<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now let\u2019s build up a search pattern to locate references to tar,\nthe tape archive command, and related references. The first two\niterations display more than just tar-related lines.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep -Ei \"tar\" Experiment_6-3.txt<\/strong>\n[student@studentvm1 testing]$ <strong>grep -Ei \"\\&lt;tar\" Experiment_6-3.txt<\/strong>\n[student@studentvm1 testing]$ <strong>grep -Ein \"\\&lt;tar\\&gt;\" Experiment_6-3.txt<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The -n option in the last command above displays the line number\nof each line in which a match occurred. This can assist in locating\nspecific instances of the search pattern.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em><strong>Tip: <\/strong>Matching lines of data can extend beyond a single\nscreen, especially when searching a large file. You can pipe the\nresulting data stream through the less utility and then use the less\nsearch facility, which also implements regexes, to highlight the\noccurrences of matches to the search pattern. 
The search argument in\nless is: \\&lt;tar\\&gt;<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This next pattern searches for \u201cshell script\u201d or \u201cshell\nprogram\u201d or \u201cshell variable\u201d or \u201cshell environment\u201d or\n\u201cshell prompt\u201d in our test document. The parentheses alter the\nlogical order in which the pattern comparisons are resolved. \n<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep -Eni \"\\&lt;shell (script|program|variable|environment|prompt)\" Experiment_6-3.txt<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Remove the parentheses from the preceding command and run it again\nto see the difference. \n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Although we have now explored the basic building blocks of regular\nexpressions in grep, there is an infinite variety of ways in which\nthey can be combined to create complex yet elegant search patterns.\nHowever, grep is a search tool and does not provide any direct\ncapability to edit or modify the contents of a line of text in the\ndata stream when a match is made.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">sed<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>sed<\/strong>\nutility not only allows searching for text that matches a regex\npattern, it can also modify, delete, or replace the matched text. I\nuse <strong>sed<\/strong>\nat the command line and in Bash shell scripts as a fast and easy way\nto locate text and alter it in some way. The name <strong>sed<\/strong>\nstands for stream editor because it operates on data streams in the\nsame manner as other tools that can transform a data stream. 
Most of\nthose changes simply involve selecting specific lines from the data\nstream and passing them on to another transformer<a href=\"#sdfootnote6sym\"><sup>6<\/sup><\/a>\nprogram.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We have already seen sed in action but now, with an understanding\nof regular expressions, we can better analyze and understand our\nearlier usage.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Experiment 6<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In Experiment 2 we simplified the CLI program we used to transform\na list of names and email addresses into a form that can be used as\ninput to a listserv. That CLI program looks like this after some\nsimplification.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">grep -vE \"Team|^\\s*$\" Experiment_6-1.txt | sed -e \"s\/[Ll]eader\/\/\" -e \"s\/\\[\/\/g\" -e \"s\/]\/\/g\" -e \"s\/)\/\/g\" -e \"s\/(\/\/g\" | awk '{print $1\" \"$2\" &lt;\"$3\"&gt;\"}'<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">It is possible to combine four of the five expressions used in the <strong>sed<\/strong> command into a single expression. The <strong>sed<\/strong> command now has two expressions instead of five.  <\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sed -e \"s\/[Ll]eader\/\/\" -e \"s\/[]()\\[]\/\/g\"<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This makes it a bit difficult to understand the more complex expression. Note that no matter how many expressions a single <strong>sed<\/strong> command contains, the data stream is only parsed once to match all of the expressions.  <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s examine the revised expression, <strong>-e\n&#8220;s\/[]()\\[]\/\/g&#8221;<\/strong>, more\nclosely. By default, <strong>sed<\/strong>\ninterprets all [ characters as the beginning of a set and the last ]\ncharacter as the end of that set. 
-e &#8220;s\/<strong>[<\/strong>]()\\[<strong>]<\/strong>\/\/g&#8221;\n The intervening ] character is not interpreted as a metacharacter.\nSince we need to match [ as a literal character in order to remove it\nfrom the data stream and <strong>sed<\/strong>\nnormally interprets that as a metacharacter, we need to escape it so\nthat it is interpreted as a literal [. -e\n&#8220;s\/<strong>[<\/strong>]()<strong>\\<\/strong>[<strong>]<\/strong>\/\/g&#8221;\nSo now all of the metacharacters in this expression are highlighted.\nLet\u2019s plug this into the CLI script and test it.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep -vE \"Team|^\\s*$\" Experiment_6-1.txt | sed -e \"s\/[Ll]eader\/\/\" -e \"s\/[]()\\[]\/\/g\"<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I know that you are asking, \u201cWhy not place the \\[ after the [\nthat opens the set and before the ] character?\u201d Try it as I did.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep -vE \"Team|^\\s*$\" Experiment_6-1.txt | sed -e \"s\/[Ll]eader\/\/\" -e \"s\/[\\[]()]\/\/g\"<\/strong><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I think that should work but it does not. Little unexpected\nresults like this make it clear that we must test each\nregex carefully to ensure that it actually does what we intend. After\nsome experimentation of my own, I discovered that the escaped left\nsquare brace \\[ works fine in all positions of the expression except\nfor the first one. This behavior is noted in the grep man page which\nI probably should have read first. However, I find that\nexperimentation reinforces the things I read and I usually discover\nmore interesting things than the one for which I was looking. 
\n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Adding the last component, the <strong>awk<\/strong>\nstatement, our optimized program looks like this and the results are\nexactly what we want.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[student@studentvm1 testing]$ <strong>grep -vE \"Team|^\\s*$\" Experiment_6-1.txt | sed -e \"s\/[Ll]eader\/\/\" -e \"s\/[]()\\[]\/\/g\" | awk '{print $1\" \"$2\" &lt;\"$3\"&gt;\"}'<\/strong><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Other tools that implement regular expressions<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Many Linux tools implement regular expressions. Most of those\nimplementations are very similar to that of <strong>awk<\/strong>,\n<strong>grep<\/strong>,\nand <strong>sed<\/strong>\nso that it should be easy to learn the differences. Although we have\nnot looked in detail at <strong>awk<\/strong>,\nit is a powerful text processing language that also implements\nregexes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most of the more advanced text editors use regexes. Vim, gVim,\nKate, and GNU Emacs  are no exceptions. The <strong>less<\/strong>\nutility implements regexes as does the search and replace facility of\nLibreOffice Writer.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Programming languages like Perl, awk, and Python also contain\nimplementations of regexes which makes them well suited to writing\ntools for text manipulation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Resources<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">I have found some excellent resources for learning about regular\nexpressions. There are more than I have listed here but these are the\nones I have found to be particularly useful.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>grep<\/strong>\nman page has a good reference but is not appropriate for learning\nabout regular expressions. 
The O\u2019Reilly book, <em>Mastering Regular\nExpressions<\/em><a href=\"#sdfootnote7sym\"><sup>7<\/sup><\/a>,\nis a very good tutorial and reference for regular expressions. I\nrecommend it for anyone who is or wants to be a Linux SysAdmin\nbecause you will  use regular expressions. Another good O\u2019Reilly\nbook is <em>sed and awk<\/em><a href=\"#sdfootnote8sym\"><sup>8<\/sup><\/a>\nwhich covers both of these powerful tools and it also has an\nexcellent discussion of regular expressions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There are also some good web sites that can help you learn about\nregular expressions and which provide interesting and useful cookbook\nstyle regex examples. There are some that ask for money in return for\nusing them. Jason Baker, my Technical Reviewer for Volumes 1 and 2 of\nmy <em>Using and Administering Linux<\/em>\ncourse suggests https:\/\/regexcrossword.com\/ as a good learning tool.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This article has provided a very brief introduction to the complex world of regular expressions. We have explored the regex implementations in the <strong>grep<\/strong> and <strong>sed<\/strong> utilities in just enough depth to give you an idea of some of the amazing things that can be accomplished with regexes. We have also looked at several Linux tools and programming languages that also implement regexes.  <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But make no mistake! We have only scratched the surface of these tools and regular expressions. There is much more to learn and there are some excellent resources for doing so.  
<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"#sdfootnote1anc\">1<\/a>See\n\tthe grep info page in Section 3.6 Basic vs Extended Regular\n\tExpressions<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"#sdfootnote2anc\">2<\/a>When\n\tI talk about regular expressions, in a general sense I usually mean\n\tto include both basic and extended regular expressions. If there is\n\ta differentiation to be made I will use the acronyms BRE for basic\n\tregular expressions or ERE for extended regular expression.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"#sdfootnote3anc\">3<\/a>One\n\tgeneral meaning of parse is to examine something by studying its\n\tcomponent parts. For our purposes we parse a data stream to locate\n\tsequences of characters that match a specified pattern.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"#sdfootnote4anc\">4<\/a>Wikipedia,\n\tPOSIX, <a href=\"https:\/\/en.wikipedia.org\/wiki\/POSIX\">https:\/\/en.wikipedia.org\/wiki\/POSIX<\/a>\n\t\t<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"#sdfootnote5anc\">5<\/a>The\n\tofficial form of systemd is all lowercase. \n\t<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"#sdfootnote6anc\">6<\/a>Many\n\tpeople call tools like grep, \u201cfilter\u201d programs because they\n\tfilter unwanted lines out of the data stream. I prefer the term\n\t\u201ctransformers\u201d because ones such as sed and awk, do more than\n\tjust filter. They can test the content for various string\n\tcombinations and alter the matching content in many different ways.\n\tTools like sort, head, tail, uniq, fmt, and more, all transform the\n\tdata stream in some way.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"#sdfootnote7anc\">7<\/a>Friedl,\n\tJeffrey E. 
F., Mastering Regular Expressions, O\u2019Reilly, 2012,\n\tPaperback ISBN-13: 978-0596528126<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"#sdfootnote8anc\">8<\/a>Robbins,\n\tArnold, and Dougherty, Dale, sed &amp; awk: UNIX Power Tools\n\t(Nutshell Handbooks), O\u2019Reilly, 2012, ISBN-13: 978-1565922259<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Note: This article is a slightly modified version of Chapter 6 from Volume 2 of my Linux self-study course, \u201cUsing and Administering Linux: Zero to SysAdmin,\u201d due out from Apress in late 2019. We have all used file globbing with&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"http:\/\/www.linux-databook.info\/?page_id=5533\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"parent":677,"menu_order":7,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-5533","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/www.linux-databook.info\/index.php?rest_route=\/wp\/v2\/pages\/5533","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.linux-databook.info\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/www.linux-databook.info\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/www.linux-databook.info\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.linux-databook.info\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5533"}],"version-history":[{"count":17,"href":"http:\/\/www.linux-databook.info\/index.php?rest_route=\/wp\/v2\/pages\/5533\/revisions"}],"predecessor-version":[{"id":5647,"href":"http:\/\/www.linux-databook.info\/index.php?rest_route=\/wp\/v2\/pages\/5533\/revisions\/5647"}],"up":[{"embeddable":true,"href":"http:\/\/www.linux-databook.info\/index.php?rest_route=\/wp\/v2\/pages\/677"}],"wp:attachment":[{"href":"http:\/\/www.linux-databoo
k.info\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5533"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}