Since this post is a snapshot in time, I recommend that you download a copy of the book, which is updated frequently to improve and expand the content.
---------------------------------------
Regular expressions (also called ‘regex’) are a pattern matching system that uses sequences of characters constructed according to pre-defined syntax rules to find desired strings in text. The topic of regular expressions is a book in itself and I heartily recommend further reading for those who find the need to use them in anger.
The command grep (where the re in grep stands for regular expression) is an essential tool for anyone using Linux, allowing regular expressions to be used in file searches or on command outputs. The use of regular expressions is, however, widespread across many other facets of computing.
For example, if we wanted to search the file dmesg, which is in the /var/log directory, and show each line that contained the string of characters CPU, we would use the grep command as follows;
grep CPU /var/log/dmesg
The output from the command will appear similar to the following;
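As a minimal sketch of that search, the block below pipes a handful of invented lines into grep rather than reading the real /var/log/dmesg. The variable name dmesg_sample and its contents are assumptions for illustration only, so real kernel output will look different.

```shell
# Invented lines standing in for /var/log/dmesg (assumption, not real kernel output).
dmesg_sample='kernel: CPU0 ready
kernel: CPU1 ready
kernel: CPU2 ready
kernel: CPU3 ready
kernel: CPU: all cores online
kernel: usb device attached'

# Keep only the lines that contain the literal string CPU.
matches=$(printf '%s\n' "$dmesg_sample" | grep CPU)
printf '%s\n' "$matches"
```

Every sample line except the last contains CPU, so five lines are printed.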
This is a basic example that utilises a simple string to match on and shouldn’t necessarily be regarded as a great use of regular expressions. However, if we wanted to limit the returned results to instances where the string was the text CPU followed by the number 0, 1, 2 or 3, we could use a regular expression with a facility that included a range of options. This is accomplished by using the square brackets [] with the specified range inside.
In our case we want the text CPU and it must be immediately followed by a number in the range 0 to 3. This can be designated by the regular expression CPU[0-3].
Which means that running our search as follows;
grep 'CPU[0-3]' /var/log/dmesg
… will result in;
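A minimal sketch of the range search, again using invented lines in place of the real log (dmesg_sample is a hypothetical name). The pattern is quoted so that the shell passes the square brackets to grep untouched.

```shell
# Invented lines standing in for /var/log/dmesg (assumption, not real kernel output).
dmesg_sample='kernel: CPU0 ready
kernel: CPU1 ready
kernel: CPU2 ready
kernel: CPU3 ready
kernel: CPU: all cores online
kernel: usb device attached'

# CPU must be followed immediately by one character in the range 0 to 3.
matches=$(printf '%s\n' "$dmesg_sample" | grep 'CPU[0-3]')
printf '%s\n' "$matches"
```

Only the four lines naming CPU0 to CPU3 are printed; the line containing "CPU:" is dropped because a colon is not in the range.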
The square brackets are ‘metacharacters’, and it is the use of these metacharacters that provides regular expressions with the foundation of their strength.
The following are some of the most commonly used metacharacters and a very short description of their effect (we will show examples further on);
[ ] | Match anything inside the square brackets for ONE character |
^ | (circumflex or caret) Matches only at the beginning of the target string (when not used inside square brackets, where it has a different meaning) |
$ | Matches only at the end of the target string |
. | (period or full-stop) Matches any single character |
? | Matches when the preceding character occurs 0 or 1 times only |
* | Matches when the preceding character occurs 0 or more times |
+ | Matches when the preceding character occurs 1 or more times |
( ) | Can be used to group parts of our search expression together |
| | (vertical bar or pipe) Allows us to find either the left hand or the right hand values |
Match a defined single character with square brackets ([])
As demonstrated at the start of this section, square brackets allow us to match any single character from a defined set. The example we used earlier employed the dash (or minus) character as a range signifier to indicate that the possible characters were 0, 1, 2, or 3.
We could also have simply put each character in the square brackets as follows;
CPU[0123]
In either case it should be noted that only a single character is matched for the entries in the square brackets.
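To illustrate that a range and an explicit list behave the same way, the sketch below runs both forms against the same invented lines (dmesg_sample is a hypothetical stand-in for the real log).

```shell
# Invented lines standing in for /var/log/dmesg (assumption, not real kernel output).
dmesg_sample='kernel: CPU0 ready
kernel: CPU3 ready
kernel: CPU7 ready
kernel: CPU: all cores online'

# A range inside the brackets...
by_range=$(printf '%s\n' "$dmesg_sample" | grep 'CPU[0-3]')
# ...and the same set of characters written out one by one.
by_list=$(printf '%s\n' "$dmesg_sample" | grep 'CPU[0123]')

printf 'range:\n%s\n\nlist:\n%s\n' "$by_range" "$by_list"
```

Both searches print the CPU0 and CPU3 lines and nothing else.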
We can specify more than one range and we can also distinguish between upper case and lower case characters. Therefore the following ranges will have the corresponding results;
[a-z]: Match any single character from a to z.
[A-Z]: Match any single character from A to Z.
[0-9]: Match any single character from 0 to 9.
[a-zA-Z0-9]: Match any single character that is either a to z, A to Z or 0 to 9.
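The ranges listed above can be tried out with a quick sketch like the one below. Each input line holds a single invented character, and LC_ALL=C is set as a precaution, on the assumption that some locales order letters in a way that blurs the upper and lower case ranges.

```shell
# One character per line: a lower case letter, an upper case letter, a digit and a dash.
chars=$(printf '%s\n' 'a' 'B' '7' '-')

lower=$(printf '%s\n' "$chars" | LC_ALL=C grep '[a-z]')
upper=$(printf '%s\n' "$chars" | LC_ALL=C grep '[A-Z]')
digit=$(printf '%s\n' "$chars" | LC_ALL=C grep '[0-9]')
alnum=$(printf '%s\n' "$chars" | LC_ALL=C grep '[a-zA-Z0-9]')

printf 'lower: %s\nupper: %s\ndigit: %s\n' "$lower" "$upper" "$digit"
printf 'letters and digits:\n%s\n' "$alnum"
```

The dash is not part of any of these ranges, so it never appears in the results.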
Within square brackets we can also use the circumflex or caret character (^) to negate the character selection. For example, with a caret we can search for lines where the text CPU is immediately followed by a character that is not in the range 0 to 3. This is done as follows;
grep 'CPU[^0-3]' /var/log/dmesg
Which would result in an output similar to the following;
Note that none of the previous lines with CPU0, CPU1, CPU2 or CPU3 have been listed.
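A minimal sketch of the negated search, using invented lines in place of the real log (dmesg_sample is a hypothetical name):

```shell
# Invented lines standing in for /var/log/dmesg (assumption, not real kernel output).
dmesg_sample='kernel: CPU0 ready
kernel: CPU1 ready
kernel: CPU2 ready
kernel: CPU3 ready
kernel: CPU: all cores online
kernel: usb device attached'

# CPU must be followed by one character that is NOT in the range 0 to 3.
matches=$(printf '%s\n' "$dmesg_sample" | grep 'CPU[^0-3]')
printf '%s\n' "$matches"
```

Only the "CPU: all cores online" line is printed. A negated set still has to match one character, so a line that ended directly after CPU would not be returned either.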
Match at the beginning of a string (^)
We can use the circumflex or caret character (^) to match lines of text that begin with a specific set of characters.
Given a text file named foo.txt with the following contents;
If we run the grep command looking for the string ‘Second’ as follows;
grep Second foo.txt
We should have two lines returned as below;
But if we use the caret character to designate that we are only looking for lines that start with our string as follows;
grep '^Second' foo.txt
… we will get the following output where only the second line is returned;
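The difference between the two searches can be sketched as follows. The contents written to foo.txt here are invented for illustration and are not the original file; the temporary directory is only there to keep the example self-contained.

```shell
# Invented contents for foo.txt (assumption; the real file may differ).
workdir=$(mktemp -d)
printf '%s\n' 'First line' 'Second line' 'Third line, after the Second' > "$workdir/foo.txt"

# Unanchored: matches Second anywhere on the line.
anywhere=$(grep 'Second' "$workdir/foo.txt")
# Anchored: matches only when the line begins with Second.
at_start=$(grep '^Second' "$workdir/foo.txt")

printf 'anywhere:\n%s\n\nat start:\n%s\n' "$anywhere" "$at_start"
rm -rf "$workdir"
```

The unanchored search returns the second and third lines, while the anchored search returns the second line alone.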
Match at the end of a string ($)
We can use the dollar sign character ($) to match lines of text that finish with a specific character or set of characters.
For example, given a text file named foo.txt with the following contents;
If we use the dollar sign character to search for all lines that end in ‘ing’ as follows;
grep 'ing$' foo.txt
… we will get the following output where only the second line is returned;
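A minimal sketch of the end-of-line match. The three lines written to foo.txt are invented for illustration, and the pattern is single-quoted so that the shell does not try to interpret the dollar sign.

```shell
# Invented contents for foo.txt (assumption; the real file may differ).
workdir=$(mktemp -d)
printf '%s\n' 'Start here' 'Keep going' 'Almost done' > "$workdir/foo.txt"

# Match only lines whose final three characters are ing.
matches=$(grep 'ing$' "$workdir/foo.txt")
printf '%s\n' "$matches"
rm -rf "$workdir"
```

Only "Keep going" ends in ‘ing’, so it is the only line printed.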
Match any single character (.)
The . (period) character will allow us to match any single character in this position.
For example, given a text file named foo.txt with the following contents;
… if we wanted to return all lines where the characters ing were in the middle of the line (not at the end) we could run the following grep command;
grep 'ing.' foo.txt
This would produce an output similar to the following;
While there are two other lines with ing in them (the first and third lines), both of them end with ‘ing’ and as a result there is no character after them. The only one where there is ‘ing’ with a character following it is the second line.
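The same behaviour can be sketched with a small invented file. The contents of foo.txt below are assumptions chosen to mirror the description above: the first and third lines end in ‘ing’, and only the second has a character after it.

```shell
# Invented contents for foo.txt (assumption; the real file may differ).
workdir=$(mktemp -d)
printf '%s\n' 'Keep walking' 'walking the dog' 'Start running' > "$workdir/foo.txt"

# ing must be followed by at least one more character on the same line.
matches=$(grep 'ing.' "$workdir/foo.txt")
printf '%s\n' "$matches"
rm -rf "$workdir"
```

Here the period matches the space after "walking", so only "walking the dog" is printed.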
Match when the preceding character occurs 0 or 1 times only (?)
It may be difficult to think of a situation where we would want to match against something that occurs 0 or 1 times, but the best example comes from the world of language. In American spelling the word ‘color’ differs from the British spelling by the omission of the letter ‘u’ (‘colour’). We can write a regular expression that will match either spelling as follows;
colou?r
This way the question mark denotes that for a match to occur, the preceding character must either not be present or must occur once. The additional characters (‘colo’ and the ‘r’) are literals in the sense that they must be present exactly as stated. The only variable in the expression is the ‘u’.
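A minimal sketch of the optional ‘u’, using an invented word list. It assumes grep is run with the -E option so that the question mark is treated as a metacharacter rather than a literal character.

```shell
# Invented word list: two valid spellings and two that should not match.
words=$(printf '%s\n' 'color' 'colour' 'colouur' 'colr')

# u? allows zero or one u between colo and r.
matches=$(printf '%s\n' "$words" | grep -E 'colou?r')
printf '%s\n' "$matches"
```

Both ‘color’ and ‘colour’ are printed. ‘colouur’ has too many instances of ‘u’ and ‘colr’ is missing the literal ‘o’, so neither matches.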
Match when the preceding character occurs 0 or more times (*)
The asterisk metacharacter in regular expressions can be one of the most confusing options to use, but this is mainly because its real strength is applied when matched with other metacharacters.
For example, a regular expression such as q*w will match w, qw and qqqw. However, if we use a period and an asterisk together (.*) we gain a pattern that will match any series of zero or more characters.
In this case we can use a regular expression such as …
pa.*y
… to find any combination of characters that start with pa and end with y and have any number of characters (including none) in between. These would include the following;
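As a rough illustration, the sketch below tests pa.*y against an invented word list; the words are assumptions picked to show both matches and misses.

```shell
# Invented word list (assumption, for illustration only).
words=$(printf '%s\n' 'pay' 'party' 'parsley' 'pain' 'play')

# pa, then any run of characters (possibly empty), then y.
matches=$(printf '%s\n' "$words" | grep 'pa.*y')
printf '%s\n' "$matches"
```

‘pay’ matches with nothing in between, while ‘party’ and ‘parsley’ match with several characters in between. ‘pain’ has no y and ‘play’ never contains the pair pa, so both are skipped.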
Match when the preceding character occurs 1 or more times (+)
The use of the + character to allow one or more instances of a character is similar to that of the asterisk. Where the * metacharacter might return the following matches from the regular expression fe*d;
The use of fe+d would result in;
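The contrast between the two metacharacters can be sketched with an invented word list. The + form is run with grep -E here, on the assumption that the plus sign should act as a metacharacter.

```shell
# Invented word list (assumption, for illustration only).
words=$(printf '%s\n' 'fd' 'fed' 'feed' 'food')

# e* accepts zero or more e characters between f and d.
star=$(printf '%s\n' "$words" | grep 'fe*d')
# e+ requires at least one e between f and d.
plus=$(printf '%s\n' "$words" | grep -E 'fe+d')

printf 'fe*d:\n%s\n\nfe+d:\n%s\n' "$star" "$plus"
```

fe*d returns fd, fed and feed, whereas fe+d drops fd because it contains no ‘e’ at all.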
Group parts of a search expression together (())
Regular expressions can be combined into subgroups that can be operated on as separate entities by enclosing those entities in parentheses. For example, if we wanted to return a match if we saw the word ‘monkey’ or ‘banana’ we would use the or metacharacter | (the pipe) to try to match one string or another as follows;
(monkey|banana)
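To show why the grouping matters once other text surrounds the alternatives, here is a small sketch with invented lines. It assumes grep -E so that the parentheses and pipe act as metacharacters; the word ‘big’ is added purely for illustration.

```shell
# Invented lines (assumption, for illustration only).
lines=$(printf '%s\n' 'big monkey' 'big banana' 'small banana')

# Grouped: big, a space, then either monkey or banana.
grouped=$(printf '%s\n' "$lines" | grep -E 'big (monkey|banana)')
# Ungrouped: either the whole string "big monkey", or banana anywhere.
ungrouped=$(printf '%s\n' "$lines" | grep -E 'big monkey|banana')

printf 'grouped:\n%s\n\nungrouped:\n%s\n' "$grouped" "$ungrouped"
```

The grouped pattern returns only the two ‘big’ lines. Without the parentheses the pipe splits the whole expression in two, so ‘small banana’ is returned as well.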
Find one group of values or another (|)
The pipe metacharacter allows us to apply a logical ‘or’ operator to our pattern matching. For example, if we wanted to return a match if we saw the word ‘monkey’ or ‘banana’ we would use the words enclosed in parentheses and separated by the pipe metacharacter to try to match one string or another as follows;
(monkey|banana)
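A minimal sketch of the ‘or’ match, using invented lines and assuming grep -E so that the pipe and parentheses are treated as metacharacters:

```shell
# Invented lines (assumption, for illustration only).
lines=$(printf '%s\n' 'a monkey climbs' 'one banana left' 'an apple a day')

# Return any line containing either monkey or banana.
matches=$(printf '%s\n' "$lines" | grep -E '(monkey|banana)')
printf '%s\n' "$matches"
```

The first two lines are printed; the third contains neither word.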
Extended Regular Expressions
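In basic regular expressions the metacharacters ?, +, {, |, (, and ) are not regarded as special, and instead we need to use the backslashed versions \?, \+, \{, \|, \(, and \). By default grep interprets patterns as basic regular expressions; the -E option tells it to use extended regular expressions, in which these characters are special without the backslash.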
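The difference can be sketched with the earlier ‘colour’ pattern. This assumes GNU grep, where the backslashed forms are available in basic mode; other grep implementations may not treat \? this way. The word lists are invented for illustration.

```shell
# Invented word list (assumption, for illustration only).
words=$(printf '%s\n' 'color' 'colour' 'colr')

# Basic mode (the default): the question mark needs a backslash to be special.
basic=$(printf '%s\n' "$words" | grep 'colou\?r')
# Extended mode (-E): the bare question mark is special.
extended=$(printf '%s\n' "$words" | grep -E 'colou?r')
# Basic mode with a bare question mark: it is matched as a literal character.
literal=$(printf '%s\n' 'colou?r' 'colour' | grep 'colou?r')

printf 'basic:\n%s\n\nextended:\n%s\n\nliteral:\n%s\n' "$basic" "$extended" "$literal"
```

The basic and extended searches both return ‘color’ and ‘colour’, while the last search only finds the line that contains an actual question mark.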