RegexComparisonTerminal - SdPhd

Regular Expressions Comparison - Terminal

I often reach a point where I need a reference like this, comparing the regular expression capabilities of different tools in a terminal, but I cannot find anything similar online. So I thought I might post one myself; and as I had no idea where to post it, I thought: why not here :)

Here are some regex snippets comparing how `grep`, `sed`, Perl and Python operate in a terminal. The idea is to compare the regex queries when the input string is piped via stdin to the application, so that for the same input, all applications generate the same output (note that in the case of `grep`, the ANSI color-code characters around matches are also included here).

Notes:

  • Examples are done in a `bash` shell/terminal
  • When several commands produce the same output, the output is given only once, after the last of them; empty output is explicitly indicated with a shell prompt `$`
  • Note that `grep` can only do searching (matching), not replacement - `sed`, Perl and Python can do both
  • To indicate where `grep` found a match:
    • Add `--color=always | cat -v`; where `-v` for `cat` is "use ^ and M- notation, except for LFD and TAB"
    • Use:
      • `-o` "--only-matching Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line." ;
      • `-n` "Prefix each line of output with the 1-based line number within its input file." ;
      • `-b` "--byte-offset Print the 0-based byte offset within the input file before each line of output. If -o (--only-matching) is specified, print the offset of the matching part itself."
  • To simulate `grep` default operation (output line on match):
    • `sed` by default prints all lines, so we use its `-n` switch ("suppress automatic printing of pattern space"), then match `//` statements, and then we add a `p` ("Print the current pattern space.") command at end, as in `.../p`
    • Perl needs `-ne` switches, match `//` statements and a `print $_`
    • Python: in the Python 2.7 used here, `print` is a statement and cannot be used inside an expression, so output goes through `sys.stdout.write`; one-liners need a `[]` (list comprehension) block to loop over stdin; a match is returned as an object, which only has numbered groups if regex grouping was selected (by using parentheses in the regex query); and `findall()` and `match()` behave differently
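
The `-o`, `-n` and `-b` switches listed above can be roughly imitated in Python with `re.finditer`. This is only a sketch, assuming Python 3 (the transcripts on this page use Python 2.7); `match_offsets` is a made-up helper name, the pattern uses Python's regex syntax rather than `grep`'s, and the offsets are character offsets, which agree with `grep`'s byte offsets only for single-byte text such as ASCII:

```python
import re

def match_offsets(pattern, text):
    """Return (line_number, offset, matched_text) for every non-empty match.

    line_number is 1-based (like grep -n); offset is 0-based and counted
    from the start of the whole text (like grep -b combined with -o).
    """
    results = []
    line_start = 0  # offset of the current line within the whole text
    for number, line in enumerate(text.splitlines(keepends=True), start=1):
        for m in re.finditer(pattern, line):
            if m.group(0):  # skip empty matches, as grep -o does
                results.append((number, line_start + m.start(), m.group(0)))
        line_start += len(line)
    return results

for number, offset, found in match_offsets(r"mine|yours", "line mine\nline yours\n"):
    print(f"{number}:{offset}:{found}")
```

Working per line keeps the line number and the running offset easy to track, at the cost of never matching across a line break (which `grep` does not do either).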


Note about `echo`

`echo` by default outputs a newline:

$ STR="abc"
$ echo $STR | hexdump -C
00000000  61 62 63 0a                                       |abc.|
00000004

To suppress it, add `-n` "do not output the trailing newline" to `echo`:

$ echo -n $STR | hexdump -C
00000000  61 62 63                                          |abc|
00000003

... and otherwise add `-e` "enable interpretation of backslash escapes" to insert line endings (etc) as desired:

$ echo -n -e "a\nb\nc" | hexdump -C
00000000  61 0a 62 0a 63                                    |a.b.c|
00000005
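
For comparison, the same three byte sequences can be inspected from Python. This is only a sketch, assuming Python 3 (the transcripts on this page use Python 2.7); `hex_bytes` is a made-up helper name, not part of any library:

```python
def hex_bytes(text):
    """Return the bytes of `text` as space-separated hex pairs, hexdump style."""
    return " ".join(f"{byte:02x}" for byte in text.encode("ascii"))

print(hex_bytes("abc\n"))    # the bytes sent by: echo $STR
print(hex_bytes("abc"))      # the bytes sent by: echo -n $STR
print(hex_bytes("a\nb\nc"))  # the bytes sent by the echo -n -e example
```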


Note about `python` one-liners

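In one-liner constructs under Python 2, `sys.stdout.write` can be used inside the square brackets, while `print` cannot, because `print` is a statement there and not an expression. The square brackets must be present: they form a "list comprehension" (see my SO:2043453). A `print` can still be used as a separate statement at the end of the command line, but not inside the list comprehension: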

$ STR=abc
$ echo $STR | python -c "import sys,re; [sys.stdout.write(line) for line in sys.stdin]"
abc
$ echo $STR | python -c "import sys,re; [print(line) for line in sys.stdin]"
  File "<string>", line 1
    import sys,re; [print(line) for line in sys.stdin]
                        ^
SyntaxError: invalid syntax
$ echo $STR | python -c "import sys,re; sys.stdout.write(line) for line in sys.stdin"
  File "<string>", line 1
    import sys,re; sys.stdout.write(line) for line in sys.stdin
                                            ^
SyntaxError: invalid syntax
$ echo $STR | python -c "import sys,re; a=[sys.stdout.write(line) for line in sys.stdin]; b=[sys.stdout.write(str(x)) for x in range(2)] ; print a ; print b"
abc
01[None]
[None, None]

To add a condition inside the list comprehension, use a Python conditional expression (`X if C else Y`); the `else` branch is mandatory, so each branch gets its own `write`:

$ echo $STR | python -c "import sys,re; [(sys.stdout.write(line) if False) for line in sys.stdin]"
  File "<string>", line 1
    import sys,re; [(sys.stdout.write(line) if False) for line in sys.stdin]
                                                    ^
SyntaxError: invalid syntax
$ echo $STR | python -c "import sys,re; [(sys.stdout.write(line) if False else sys.stdout.write('XX')) for line in sys.stdin]"
XX$

... or put the conditional expression directly inside the `.write()` argument (note that `.write()` does not add a newline at the end of the string, as `print` without a trailing comma usually does):

$ echo $STR | python -c "import sys,re; [sys.stdout.write(line if False else 'XX') for line in sys.stdin]"
XX$
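
For readers on Python 3 (an assumption; everything above was run on Python 2.7), the restriction on `print` goes away, because `print` is a function there. A minimal sketch, with `io.StringIO` standing in for `sys.stdin` and `pick` being a made-up helper name:

```python
import io
import sys

lines = io.StringIO("abc\n")  # stands in for sys.stdin

# Python 3: print is a function, so it may appear inside a list comprehension.
# end="" stops print from adding a second newline after the line's own one.
results = [print(line, end="") for line in lines]

def pick(line, keep):
    """Conditional expression: both the 'if' and the 'else' branch are required."""
    return line if keep else "XX"

sys.stdout.write(pick("abc\n", True))
sys.stdout.write(pick("abc\n", False) + "\n")
```

Note also that `sys.stdout.write` returns the number of characters written in Python 3, so a list built from it holds integers instead of the `None` values shown above.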


Examples below done on:

$ bash --version | head -n 1
GNU bash, version 4.2.8(1)-release (i686-pc-linux-gnu)
$ grep --version | head -n 1
GNU grep 2.6.3
$ sed --version | head -n 1
GNU sed version 4.2.1
$ perl --version | sed -n '2p'      # head -n 2 | tail -n 1
This is perl, v5.10.1 (*) built for i686-linux-gnu-thread-multi
$ python --version | sed -n '1p'
Python 2.7.1+
$ uname -a
Linux ljutntcol 2.6.38-16-generic #67-Ubuntu SMP Thu Sep 6 18:00:43 UTC 2012 i686 i686 i386 GNU/Linux
$ lsb_release -sa | tr '\n' ' '     # column -s '\n'
Ubuntu Ubuntu 11.04 11.04 natty



Output line if match: One given word at start of line

 STR="This is a test string"
 # lookfor="This"

For all cases, the word alone is enough as the search pattern; a caret `^` can be added to anchor it to the start of the line:

Grep

$ echo $STR | grep 'This' --color=always | cat -v
$ echo $STR | grep '^This' --color=always | cat -v
$ echo -n $STR | grep 'This' --color=always | cat -v
$ echo -n $STR | grep '^This' --color=always | cat -v
^[[01;31m^[[KThis^[[m^[[K is a test string


Sed

$ echo $STR | sed -n '/This/p'
$ echo $STR | sed -n '/^This/p'
This is a test string


Perl

Here we have to explicitly check for a match, and output the original line if there's a match.

Note that here, a line without line ending is output as such by Perl's `print`:

$ echo $STR | perl -ne '/This/ && print $_'
$ echo $STR | perl -ne 'm/This/ && print $_'
$ echo $STR | perl -ne '/^This/ && print $_'
$ echo $STR | perl -ne 'm/^This/ && print $_'
This is a test string

$ echo -n $STR | perl -ne '/This/ && print $_'
This is a test string$ echo -n $STR | perl -ne '/^This/ && print $_'
This is a test string$


Python

Here we have to explicitly check for a match, and output the original line if there's a match.

Note that here, a line without a line ending is output as such by Python's `sys.stdout.write()`. (Python 2's `print` statement outputs "\n" unless its argument list ends with a comma, but we cannot use it inside a one-liner list comprehension.)

$ echo $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'This', line) else '') for line in sys.stdin]"
$ echo $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'^This', line) else '') for line in sys.stdin]"
This is a test string

$ echo -n $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'This', line) else '') for line in sys.stdin]"
This is a test string$ echo -n $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'^This', line) else '') for line in sys.stdin]"
This is a test string$
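
The same "output the line if its start matches" logic can be written as a small generator instead of a one-liner. This is a sketch only, assuming Python 3; `matching_lines` is a made-up name and `io.StringIO` stands in for `sys.stdin`:

```python
import io
import re

def matching_lines(pattern, stream):
    """Yield every line of `stream` whose start matches `pattern`, unchanged.

    Lines keep whatever terminator they arrived with, so a final line
    without a newline is yielded without one.
    """
    regex = re.compile(pattern)
    for line in stream:
        if regex.match(line):
            yield line

with_newline = list(matching_lines(r"^This", io.StringIO("This is a test string\n")))
without_newline = list(matching_lines(r"This", io.StringIO("This is a test string")))
print(repr(with_newline))
print(repr(without_newline))
```

Printing the `repr` makes the presence or absence of the trailing newline visible, which plays the role of the `$` prompt glued to the output in the transcripts above.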



Output line if match: One given word anywhere in line

 STR="This is a test string"
 # lookfor="test"

For all cases: we obviously do not want to specify caret `^` for start of line anymore:

Grep

$ echo $STR | grep 'test' --color=always | cat -v
This is a ^[[01;31m^[[Ktest^[[m^[[K string


Sed

$ echo $STR | sed -n '/test/p'
This is a test string


Perl

$ echo $STR | perl -ne '/test/ && print $_'
$ echo $STR | perl -ne 'm/test/ && print $_'
This is a test string


Python

Not so easy here:

$ echo $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'test', line) else '') for line in sys.stdin]"
$

The problem is that `match()` only tries to match at the very beginning of the string; it does not scan forward. So if we want to match a word anywhere in the line with `match()`, we have to use `.*` to consume the leading part of the string that we aren't looking for, until we reach the part that we are looking for:

$ echo $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'.*test', line) else '') for line in sys.stdin]"
This is a test string

This is explained in 7.2. re — 7.2.5.3. search() vs. match(): "Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default)."

Either that, or we can use `search()` as the quoted documentation suggests, or `findall()` (which returns a list of strings, not a "Match" object like `match()`). For either case (`match()` or `findall()`), we can also use parentheses for regex grouping with no change in effect (only the `.*` matters for `match()`):

$ echo $STR | python -c "import sys,re; [sys.stdout.write(line if re.findall(r'test', line) else '') for line in sys.stdin]"
$ echo $STR | python -c "import sys,re; [sys.stdout.write(line if re.findall(r'(test)', line) else '') for line in sys.stdin]"
This is a test string
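
The difference between the three calls can be checked side by side without a shell. A small sketch, assuming Python 3 (the `re` functions used here behave the same way in the Python 2.7 of the transcripts); the variable names are just for illustration:

```python
import re

line = "This is a test string\n"

at_start = re.match(r"test", line)          # None: "test" is not at position 0
with_prefix = re.match(r".*test", line)     # matches: .* consumes "This is a "
anywhere = re.search(r"test", line)         # matches: search() scans the whole line
found = re.findall(r"test", line)           # list of matched strings
found_group = re.findall(r"(test)", line)   # with one group, list of group contents

print(at_start)
print(with_prefix.group(0))
print(anywhere.start())
print(found, found_group)
```

Since an empty list and `None` are both false in a condition, any of these results can be dropped into the `line if ... else ''` expression used in the one-liners.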



Output line if match: One of two given words anywhere in line

 STR="line mine\nline yours\nline you min theirs"
 # lookfor="mine" or "yours"

For all cases: we use the pipe character `|` as regex "or" logical operator:

Grep

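Note that in `grep`'s default (basic) regex syntax, the pipe `|` as regex "or" operator (aka "infix operator") must be escaped with a backslash, even if the query is in single quotes; unescaped, it is matched as a literal `|` character. (With the `-E` switch for extended regex, no backslash is needed.)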

$ echo -e $STR | grep 'mine|yours' --color=always | cat -v
$
$ echo -e $STR | grep 'mine\|yours' --color=always | cat -v
line ^[[01;31m^[[Kmine^[[m^[[K
line ^[[01;31m^[[Kyours^[[m^[[K


Sed

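Note that in `sed`'s default (basic) regex syntax, the pipe `|` as regex "or" operator must be escaped with a backslash, even if the query is in single quotes; unescaped, it is matched as a literal `|` character. (With GNU `sed`'s `-r` switch for extended regex, no backslash is needed.)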

$ echo -e $STR | sed -n '/mine|yours/p'
$
$ echo -e $STR | sed -n '/mine\|yours/p'
line mine
line yours


Perl

No escape character is needed for the pipe `|` as regex "or" operator in Perl:

$ echo -e $STR | perl -ne 'm/mine|yours/ && print $_'
line mine
line yours


Python

No escape character is needed for the pipe `|` as regex "or" operator in Python.

Again, we can use `findall()` to search anywhere in the line with the same query as in Grep/Perl, or `match()` if we prepend `.*` to the query to take the start of the line into account. Since `|` has the lowest precedence in a regex, this forces us either to put regex group parentheses around the alternatives, or to specify `.*` as a starter for each of them; otherwise the `.*` applies only to the first alternative:

$ echo -e $STR | python -c "import sys,re; [sys.stdout.write(line if re.findall(r'mine|yours', line) else '') for line in sys.stdin]"
line mine
line yours

$ echo -e $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'.*mine|yours', line) else '') for line in sys.stdin]"
line mine
$ echo -e $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'.*mine|.*yours', line) else '') for line in sys.stdin]"
line mine
line yours
$ echo -e $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'.*(mine|yours)', line) else '') for line in sys.stdin]"
line mine
line yours
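
The precedence effect is easier to see when the four queries are run over the same lines in one script. A sketch, assuming Python 3; `kept` is a made-up helper that returns the lines a given query would print:

```python
import re

lines = ["line mine\n", "line yours\n", "line you min theirs\n"]

def kept(pattern, matcher=re.match):
    """Return the lines for which matcher(pattern, line) finds something."""
    return [line for line in lines if matcher(pattern, line)]

# "|" splits the whole pattern: ".*mine" OR "yours" (the latter only at position 0)
print(kept(r".*mine|yours"))
# every alternative carries its own ".*"
print(kept(r".*mine|.*yours"))
# the group limits the alternation, so ".*" applies to both words
print(kept(r".*(mine|yours)"))
# search() needs no ".*" at all
print(kept(r"mine|yours", re.search))
```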



Output line if match: Both of two given words anywhere in line

 STR="line mine\nline yours\nline yours theirs mine"
 # lookfor="mine" and "yours"

For all cases: there is no logical AND operator in regex. We either have to use positive lookahead assertions, or perform the logical AND operation outside of the regex, or use OR to "enumerate all possible permutations with a standard regexp":


Grep

There are no lookaheads in `grep`'s default regex syntax (although it has a `--perl-regexp` switch). Typically, we can daisy-chain two `grep`s, one piped into the other, to do a logical AND; this also highlights each word separately:

$ echo -e $STR | grep 'mine' --color=always | grep 'yours' --color=always | cat -v
line ^[[01;31m^[[Kyours^[[m^[[K theirs ^[[01;31m^[[Kmine^[[m^[[K

With permutations and OR - this has one match, containing both words, and anything between the two words:

$ echo -e $STR | grep 'mine.*yours\|yours.*mine' --color=always | cat -v
line ^[[01;31m^[[Kyours theirs mine^[[m^[[K

There is no need to daisychain two calls for Perl or Python.


Sed

There are no lookaheads with `sed`: also here, we could pipe/daisychain two separate `sed`s, as in `grep`'s case.

With permutations and OR:

$ echo -e $STR | sed -n '/mine.*yours\|yours.*mine/p'
line yours theirs mine

Using `sed`-specific commands: on a line that matches (contains) "mine", begin a block of commands and do another check on the same line; if it also matches "yours", print the pattern space (the input line):

$ echo -e $STR | sed -n '/mine/ { /yours/p }'
line yours theirs mine


Perl

With permutations and OR:

$ echo -e $STR | perl -ne 'm/mine.*yours|yours.*mine/ && print $_'
line yours theirs mine

With positive lookahead:

$ echo -e $STR | perl -ne 'm/(?=.*mine)(?=.*yours)/ && print $_'
line yours theirs mine

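This query will not work if the leading `.*` is left out of the lookahead expressions, as in 'm/(?=mine)(?=yours)/': both lookaheads are then tested at the same position, and no position in the line can start with both "mine" and "yours".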

With Perl language logical `and` operator:

$ echo -e $STR | perl -ne 'm/mine/ and m/yours/ && print $_'
line yours theirs mine


Python

With permutations and OR (again, mind the difference in behaviour between `match()` and `findall()`):

$ echo -e $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'.*mine.*yours|.*yours.*mine', line) else '') for line in sys.stdin]"
$ echo -e $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'.*(mine.*yours|yours.*mine)', line) else '') for line in sys.stdin]"
$ echo -e $STR | python -c "import sys,re; [sys.stdout.write(line if re.findall(r'mine.*yours|yours.*mine', line) else '') for line in sys.stdin]"
line yours theirs mine

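With positive lookahead. Note that here the same regex query works for both `match()` and `findall()`: each lookahead carries its own `.*`, so the whole query already succeeds at the start of the line, which is the only place `match()` tries: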

$ echo -e $STR | python -c "import sys,re; [sys.stdout.write(line if re.match(r'(?=.*mine)(?=.*yours)', line) else '') for line in sys.stdin]"
$ echo -e $STR | python -c "import sys,re; [sys.stdout.write(line if re.findall(r'(?=.*mine)(?=.*yours)', line) else '') for line in sys.stdin]"
line yours theirs mine

Neither `match()` nor `findall()` will work if the leading `.*` is left out of the lookahead expressions, as in '(?=mine)(?=yours)', because no single position in the line can start with both words.

With Python language logical `and` operator:

$ echo -e $STR | python -c "import sys,re; [sys.stdout.write(line if re.findall(r'mine', line) and re.findall('yours', line) else '') for line in sys.stdin]"
line yours theirs mine
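
Both approaches generalize to any number of words. A sketch, assuming Python 3; `has_all` and `lookahead_pattern` are made-up helper names, and `re.escape` is used so that a word containing regex metacharacters is still taken literally:

```python
import re

lines = ["line mine\n", "line yours\n", "line yours theirs mine\n"]

def has_all(words, line):
    """Logical AND done outside the regex: every word must occur in the line."""
    return all(re.search(re.escape(word), line) for word in words)

def lookahead_pattern(words):
    """Build one query with a positive lookahead per word."""
    return "".join(f"(?=.*{re.escape(word)})" for word in words)

words = ["mine", "yours"]
pattern = lookahead_pattern(words)
print(pattern)
print([line for line in lines if has_all(words, line)])
print([line for line in lines if re.match(pattern, line)])
# Without the leading .* in each lookahead, nothing can match:
print([line for line in lines if re.search(r"(?=mine)(?=yours)", line)])
```

The version with `all()` is usually easier to read; the lookahead version is handy when only a single pattern string can be passed along.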



Output nth line: from a set of lines

 STR="a\nb\nc\n"
 #lookfor = second line

This is useful for picking one line out of version output and the like, instead of doing ` | head -n 2 | tail -n 1`:

Sed

$ echo -e "a\nb\nc\n" | sed -n '2p'
b
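
Only `sed` is shown above; a Python counterpart could look like the following sketch (assuming Python 3, with `nth_line` as a made-up helper name and `io.StringIO` standing in for `sys.stdin`). `itertools.islice` stops reading as soon as the requested line has been seen:

```python
import io
from itertools import islice

def nth_line(stream, n):
    """Return the n-th line (1-based) of `stream`, or None if it has fewer lines."""
    return next(islice(stream, n - 1, n), None)

# Same input as the echo -e command above (echo adds one more newline at the end)
second = nth_line(io.StringIO("a\nb\nc\n\n"), 2)
print(repr(second))
```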


