Thursday, 24 March 2016

Regular Expressions in Linux

The following post is a section of the book 'Just Enough Linux'. The entire book can be downloaded in pdf format for free from Leanpub or you can read it online here.
Since this post is a snapshot in time, I recommend that you download a copy of the book, which is updated frequently to improve and expand the content.
---------------------------------------

Regular expressions (also called ‘regex’) are a pattern matching system that uses sequences of characters constructed according to pre-defined syntax rules to find desired strings in text. The topic of regular expressions is a book in itself and I heartily recommend further reading for those who find the need to use them in anger.
The command grep (where the re in grep stands for regular expression) is an essential tool for anyone using Linux, allowing regular expressions to be used in file searches or command outputs, although the use of regular expressions is widespread across many other facets of computing.
For example, if we wanted to search the file dmesg, which is in the /var/log directory, and show each line that contained the string of characters CPU, we would use the grep command as follows;
grep CPU /var/log/dmesg
The output from the command will appear similar to the following;
pi@raspberrypi ~ $ grep CPU /var/log/dmesg
[    0.000000] Booting Linux on physical CPU 0xf00
[    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.004116] CPU: Testing write buffer coherency: ok
[    0.053503] CPU0: update cpu_capacity 1024
[    0.053577] CPU0: thread -1, cpu 0, socket 15, mpidr 80000f00
[    0.113791] CPU1: Booted secondary processor
[    0.113851] CPU1: update cpu_capacity 1024
[    0.113860] CPU1: thread -1, cpu 1, socket 15, mpidr 80000f01
[    0.133710] CPU2: Booted secondary processor
[    0.133746] CPU2: update cpu_capacity 1024
[    0.133755] CPU2: thread -1, cpu 2, socket 15, mpidr 80000f02
[    0.153750] CPU3: Booted secondary processor
[    0.153788] CPU3: update cpu_capacity 1024
[    0.153797] CPU3: thread -1, cpu 3, socket 15, mpidr 80000f03
[    0.153891] Brought up 4 CPUs
[    0.154045] CPU: All CPU(s) started in SVC mode.
[    2.406902] ledtrig-cpu: registered to indicate activity on CPUs
This is a basic example that matches on a simple string and shouldn’t necessarily be regarded as a great use of regular expressions. However, if we wanted to limit the returned results to instances where the text CPU was followed by the number 0, 1, 2 or 3, we could use a regular expression that specifies a range of acceptable characters. This is accomplished by using the square brackets [] with the specified range inside.
In our case we want the text CPU and it must be immediately followed by a number in the range 0 to 3. This can be designated by the regular expression CPU[0-3].
This means that our search as follows;
grep CPU[0-3] /var/log/dmesg
… will result in;
pi@raspberrypi ~ $ grep CPU[0-3] /var/log/dmesg
[    0.053503] CPU0: update cpu_capacity 1024
[    0.053577] CPU0: thread -1, cpu 0, socket 15, mpidr 80000f00
[    0.113791] CPU1: Booted secondary processor
[    0.113851] CPU1: update cpu_capacity 1024
[    0.113860] CPU1: thread -1, cpu 1, socket 15, mpidr 80000f01
[    0.133710] CPU2: Booted secondary processor
[    0.133746] CPU2: update cpu_capacity 1024
[    0.133755] CPU2: thread -1, cpu 2, socket 15, mpidr 80000f02
[    0.153750] CPU3: Booted secondary processor
[    0.153788] CPU3: update cpu_capacity 1024
[    0.153797] CPU3: thread -1, cpu 3, socket 15, mpidr 80000f03
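One caution when trying this at home: the shell also treats square brackets as a filename wildcard, so an unquoted CPU[0-3] could be rewritten by the shell before grep sees it if a file named something like CPU1 happened to exist in the current directory. A minimal sketch of the safer, quoted form follows. The sample lines are a hand-picked (and shortened) stand-in for the real dmesg file, so treat them as illustrative only;

```shell
# Illustrative lines standing in for /var/log/dmesg
sample='[    0.000000] CPU: ARMv7 Processor
[    0.053503] CPU0: update cpu_capacity 1024
[    0.113791] CPU1: Booted secondary processor
[    0.153891] Brought up 4 CPUs'

# Single quotes pass the pattern to grep untouched by the shell
cpu_lines=$(printf '%s\n' "$sample" | grep 'CPU[0-3]')
printf '%s\n' "$cpu_lines"
```

Only the CPU0 and CPU1 lines should be printed, since the other two have a colon or an s after CPU.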
The square brackets are ‘metacharacters’ and it is the use of these metacharacters that provides regular expressions with the foundation of their strength.
The following are some of the most commonly used metacharacters and a very short description of their effect (we will show examples further on);
  • [ ] : Match anything inside the square brackets for ONE character
  • ^ (circumflex or caret) : Matches only at the beginning of the target string (when not used inside square brackets (where it has a different meaning))
  • $ : Matches only at the end of the target string
  • . (period or full-stop) : Matches any single character
  • ? : Matches when the preceding character occurs 0 or 1 times only
  • * : Matches when the preceding character occurs 0 or more times
  • + : Matches when the preceding character occurs 1 or more times
  • ( ) : Can be used to group parts of our search expression together
  • | (vertical bar or pipe) : Allows us to find the left hand or right values

Match a defined single character with square brackets ([])

As demonstrated at the start of this section, the use of square brackets will allow us to match any single character. The example we used above employed the dash (or minus) character as a range signifier to specify that the possible characters were 0, 1, 2 or 3.
We could also have simply put each character in the square brackets as follows;
grep CPU[0123] /var/log/dmesg
In either case it should be noted that only a single character is matched for the entries in the square brackets.
We can specify more than one range and we can also distinguish between upper case and lower case characters. Therefore the following ranges will have the corresponding results;
  • [a-z] : Match any single character from a to z.
  • [A-Z] : Match any single character from A to Z.
  • [0-9] : Match any single character from 0 to 9.
  • [a-zA-Z0-9] : Match any single character that is either a to z, A to Z or 0 to 9.
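To see these ranges in action, here is a small sketch that picks the individual matching characters out of a made-up test string (Pi3). It relies on grep's -o option (print only the matched part, available in GNU and BSD grep) and sets LC_ALL=C so that the ranges are treated as plain ASCII; both of those are choices made for this sketch rather than anything the examples above depend on;

```shell
word='Pi3'   # made-up test string

# Each search prints every single character that the range matches
lower=$(printf '%s\n' "$word" | LC_ALL=C grep -o '[a-z]')
upper=$(printf '%s\n' "$word" | LC_ALL=C grep -o '[A-Z]')
digit=$(printf '%s\n' "$word" | LC_ALL=C grep -o '[0-9]')
any=$(printf '%s\n' "$word" | LC_ALL=C grep -o '[a-zA-Z0-9]')

printf 'lower: %s\nupper: %s\ndigit: %s\n' "$lower" "$upper" "$digit"
printf '%s\n' "$any"
```

The combined range reports P, i and 3 as three separate matches, which underlines that a bracket expression only ever matches one character at a time.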
Within square brackets we can also use the circumflex or caret character (^) to negate the character selection. I.e. with a caret we can say search for lines with the text CPU that must be immediately followed by a character that is not in the range 0 to 3. This is done as follows;
grep CPU[^0-3] /var/log/dmesg
Which would result in an output similar to the following;
pi@raspberrypi ~ $ grep CPU[^0-3] /var/log/dmesg
[    0.000000] Booting Linux on physical CPU 0xf00
[    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing
[    0.000000] PERCPU: Embedded 11 pages/cpu @ba05d000 s12864 r8192 d24000
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.004116] CPU: Testing write buffer coherency: ok
[    0.153891] Brought up 4 CPUs
[    0.154045] CPU: All CPU(s) started in SVC mode.
[    2.406902] ledtrig-cpu: registered to indicate activity on CPUs
Note that none of the previous lines with CPU0, CPU1, CPU2 or CPU3 have been listed.
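One subtlety worth a quick sketch is that a negated set still has to match a character. A line that simply ends in CPU has nothing after it, so it is not returned. The sample lines below are shortened, made-up stand-ins for the dmesg output;

```shell
# Illustrative lines; the last one ends immediately after CPU
sample='Booting Linux on physical CPU 0xf00
CPU0: update cpu_capacity 1024
CPU3: Booted secondary processor
Brought up 4 CPUs
CPU'

# CPU followed by any one character that is NOT 0, 1, 2 or 3
not_numbered=$(printf '%s\n' "$sample" | grep 'CPU[^0-3]')
printf '%s\n' "$not_numbered"
```

This should return the first line (CPU followed by a space) and the fourth (CPU followed by s), but not the bare CPU line.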

Match at the beginning of a string (^)

We can use the circumflex or caret character (^) to match lines of text that begin with a specific set of characters.
Given a text file named foo.txt with the following contents;
First line with something
Second line with something else
Third line still going
Fourth Line but Second last
Last line. Goodbye!
If we run the grep command looking for the string ‘Second’ as follows;
grep Second foo.txt
We should have two lines returned as below;
pi@raspberrypi ~ $ grep Second foo.txt
Second line with something else
Fourth Line but Second last
But if we use the caret character to designate that we are only looking for lines that start with our string as follows;
grep ^Second foo.txt
… we will get the following output where only the second line is returned;
pi@raspberrypi ~ $ grep ^Second foo.txt
Second line with something else
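The same comparison can be tried without creating a file at all. This sketch holds the contents of foo.txt in a shell variable (the variable names are just for illustration) and pipes them into grep;

```shell
# Contents of foo.txt held in a variable for convenience
foo='First line with something
Second line with something else
Third line still going
Fourth Line but Second last
Last line. Goodbye!'

# -c counts matching lines: Second appears on two of them
anywhere=$(printf '%s\n' "$foo" | grep -c 'Second')

# Anchored with ^, only the line that starts with Second matches
at_start=$(printf '%s\n' "$foo" | grep '^Second')

printf '%s\n' "$anywhere" "$at_start"
```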

Match at the end of a string ($)

We can use the dollar sign character ($) to match lines of text that finish with a specific character or set of characters.
For example, given a text file named foo.txt with the following contents;
First line with something
Second line with something else
Third line still going
Fourth Line but Second last
Last line. Goodbye!
If we use the dollar sign character to search for all lines that end in ‘ing’ as follows;
grep ing$ foo.txt
… we will get the following output where only the first and third lines are returned;
pi@raspberrypi ~ $ grep ing$ foo.txt
First line with something
Third line still going
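As a rough sketch of the difference the anchor makes, the following compares an unanchored search for ing with the anchored one. The pattern is wrapped in single quotes here as a precaution, since the dollar sign is also special to the shell;

```shell
foo='First line with something
Second line with something else
Third line still going
Fourth Line but Second last
Last line. Goodbye!'

# Without the anchor, three lines contain ing somewhere
ing_anywhere=$(printf '%s\n' "$foo" | grep -c 'ing')

# With the anchor, only lines that finish with ing are returned
ing_at_end=$(printf '%s\n' "$foo" | grep 'ing$')

printf '%s\n' "$ing_anywhere" "$ing_at_end"
```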

Match any single character (.)

The . (period) character will allow us to match any single character in this position.
For example, given a text file named foo.txt with the following contents;
First line with something
Second line with something else
Third line still going
Fourth Line but Second last
Last line. Goodbye!
… if we wanted to return all lines where the characters ing were in the middle of the line (not at the end) we could run the following grep command;
grep ing. foo.txt
This would produce an output similar to the following;
pi@raspberrypi ~ $ grep ing. foo.txt
Second line with something else
While there are two other lines with ing in them (the first and third lines), both of them end with ‘ing’ and as a result there is no character after it. The only line where ‘ing’ has a character following it is the second line.
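Because the period matches anything, it needs to be escaped with a backslash when we actually want to find a literal full-stop. That detail goes a little beyond the example above, so treat the following as a hedged sketch, again using the foo.txt contents held in a variable;

```shell
foo='First line with something
Second line with something else
Third line still going
Fourth Line but Second last
Last line. Goodbye!'

# ing followed by any one character
ing_mid=$(printf '%s\n' "$foo" | grep 'ing.')

# line followed by a literal full-stop (the backslash removes the special meaning)
literal_dot=$(printf '%s\n' "$foo" | grep 'line\.')

printf '%s\n' "$ing_mid" "$literal_dot"
```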

Match when the preceding character occurs 0 or 1 times only (?)

It may be difficult to think of a situation where we would want to match against something that occurs 0 or 1 times, but the best example comes from the world of language. In American spelling the word ‘color’ differs from the British spelling by the omission of the letter ‘u’ (‘colour’). We can write a regular expression that will match either spelling as follows;
colou?r
This way the question mark denotes that for a match to occur, the preceding character must either not be present or must occur once. The additional characters (‘colo’ and the ‘r’) are literals in the sense that they must be present exactly as stated. The only variable in the expression is the ‘u’.
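A minimal sketch of the spelling example with grep follows. The -E option is used so that the question mark is treated as a metacharacter, and the word list is made up for the test;

```shell
# Made-up word list: two valid spellings and two that should not match
spellings=$(printf '%s\n' 'color' 'colour' 'colouur' 'colr' | grep -E 'colou?r')
printf '%s\n' "$spellings"
```

Only color and colour should be returned, since colouur has one u too many and colr is missing the second o.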

Match when the preceding character occurs 0 or more times (*)

The asterisk metacharacter in regular expressions can be one of the most confusing options to use, but this is mainly because its real strength is applied when matched with other metacharacters.
For example, a regular expression such as q*w will match w, qw and qqqw. However, if we use a period and an asterisk together (.*) we gain a function that will match zero or more of any series of characters.
In this case we can use a regular expression such as …
pa.*y
… to find any combination of characters that starts with pa and ends with y and has any number of characters (including none) in between. These would include the following;
pacify
painfully
paisley
palmistry
palpably
pay
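When searching with grep it helps to remember that a pattern can match anywhere in a line, so words that merely contain pa followed later by a y will be returned as well. The sketch below (with a made-up word list) shows the plain pattern alongside a version anchored with ^ and $ so that the whole word has to fit;

```shell
words='pacify
pay
paying
repay
pa'

# Unanchored: any line containing pa ... y somewhere
loose=$(printf '%s\n' "$words" | grep 'pa.*y')

# Anchored: the line must start with pa and end with y
strict=$(printf '%s\n' "$words" | grep '^pa.*y$')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose search returns pacify, pay, paying and repay, while the strict one returns only pacify and pay.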

Match when the preceding character occurs 1 or more times (+)

The use of the + character to allow one or more instances of a character is similar to that of the asterisk. Where the * metacharacter might return the following matches from the regular expression fe*d;
fd
fed
feed
The use of fe+d would result in;
fed
feed
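The two results can be reproduced with a short sketch. It assumes grep's -E option (so that + is treated as a metacharacter) and -x (so that the whole line has to match), and the word list is made up;

```shell
words='fd
fed
feed
fad'

# Zero or more e characters between f and d
star=$(printf '%s\n' "$words" | grep -E -x 'fe*d')

# One or more e characters between f and d
plus=$(printf '%s\n' "$words" | grep -E -x 'fe+d')

printf 'fe*d:\n%s\nfe+d:\n%s\n' "$star" "$plus"
```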

Group parts of a search expression together (())

Regular expressions can be combined into subgroups that can be operated on as separate entities by enclosing those entities in parentheses. For example, if we wanted to return a match if we saw the word ‘monkey’ or ‘banana’ we would use the or metacharacter | (the pipe) to try to match one string or another as follows;
(monkey|banana)
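As a small illustration of a group being operated on as a single entity, the sketch below applies the + metacharacter to the group (na) rather than to a single character. The pattern and word list are made up for this example, and -E and -x are used so that the parentheses are special and the whole line must match;

```shell
words='ba
bana
banana
banan'

# ba followed by one or more repetitions of the group na
grouped=$(printf '%s\n' "$words" | grep -E -x 'ba(na)+')
printf '%s\n' "$grouped"
```

This should return bana and banana, but not ba (no na at all) or banan (a stray n after the last complete group).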

Find one group of values or another (|)

The pipe metacharacter allows us to apply a logical ‘or’ operator to our pattern matching. For example, if we wanted to return a match if we saw the word ‘monkey’ or ‘banana’ we would use the words encapsulated in parentheses and the pipe metacharacter to try to match one string or another as follows;
(monkey|banana)
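A minimal sketch of this search with grep is shown below, using -E so that the parentheses and pipe are treated as metacharacters. The three sentences are made up for the test;

```shell
sentences='The monkey climbed the tree
A ripe banana fell down
An apple a day'

# Return any line containing either word
either=$(printf '%s\n' "$sentences" | grep -E '(monkey|banana)')
printf '%s\n' "$either"
```

Note that the pattern is quoted. Left unquoted, the shell would interpret the pipe and parentheses itself instead of passing them to grep.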

Extended Regular Expressions

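In basic regular expressions the meta-characters ?, +, {, |, (, and ) are not regarded as special and instead we need to use the backslashed versions \?, \+, \{, \|, \(, and \). With grep, the -E option selects extended regular expressions, where these characters are special without the backslash.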
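The difference is easiest to see by running the same pattern both ways. In this sketch (with a made-up list that deliberately includes the odd string colou?r) plain grep treats the question mark as an ordinary character, while grep -E treats it as a metacharacter;

```shell
words='color
colour
colou?r'

# Basic: ? is a literal question mark
basic=$(printf '%s\n' "$words" | grep 'colou?r')

# Extended: ? means the preceding u is optional
extended=$(printf '%s\n' "$words" | grep -E 'colou?r')

printf 'basic:\n%s\nextended:\n%s\n' "$basic" "$extended"
```

The basic search returns only the literal colou?r line, while the extended search returns color and colour.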


