Wednesday 10 April 2019

Introduction to RegEx

RegEx is used often when working with strings, paths, and configurations. Here is a short breakdown of commonly used RegEx expressions. I will be adding examples I come across in my daily work, and further down I'll add explanations of how to interpret them.

RegEx examples:


RegEx Special Characters (Metacharacters)

\d = matches any single digit; in most regex flavors it is equivalent to [0-9]
\t = matches a tab character

[ ] = Square brackets define a character set, which is a set of characters we want to match. [aeiou] matches any single lowercase vowel.

Some characters have special meanings inside square brackets:
  • caret (^) before any other character inside square brackets means we’re negating the characters that follow the caret. We’re telling the regex engine not to match those characters. [^aeiou] matches all characters apart from vowels.
  • hyphen (-) between two characters inside square brackets means range. [a-z] means "match all characters between a and z, inclusive"
If we want to match square brackets, we need to escape them with backslash. \[|\] means "match [ or ]".
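
To make the character set rules above concrete, here is a minimal sketch. It assumes Python's built-in re module (the notes themselves are not tied to any language), and the sample strings are made up for illustration.

```python
import re

text = "regex [demo] 42"  # hypothetical sample string

# [aeiou] matches one vowel at a time
vowels = re.findall(r"[aeiou]", text)

# [^aeiou] matches any single character that is NOT a vowel
non_vowels = re.findall(r"[^aeiou]", "audio-x")

# [a-z] is a range: any lowercase letter from a to z, inclusive
lowercase = re.findall(r"[a-z]", "aB3z")

# \[|\] matches a literal opening or closing square bracket
brackets = re.findall(r"\[|\]", text)

print(vowels)      # ['e', 'e', 'e', 'o']
print(non_vowels)  # ['d', '-', 'x']
print(lowercase)   # ['a', 'z']
print(brackets)    # ['[', ']']
```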

RegEx Expressions & Interpretation:

. (dot) matches any single character except the newline character, by default.
If the s flag ("dotAll") is set, it also matches newline characters. If we want to match a dot character itself, we need to escape it: \.
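
A short sketch of the dot behaviour, assuming Python's re module, where the "dotAll" flag is spelled re.DOTALL (other languages name it differently). The sample strings are illustrative only.

```python
import re

two_lines = "a\nb"

# By default the dot does not match a newline, so there is no match here
default_mode = re.findall(r"a.b", two_lines)

# With DOTALL the dot also matches the newline
dotall_mode = re.findall(r"a.b", two_lines, flags=re.DOTALL)

# An unescaped dot matches every character; an escaped dot only a literal '.'
any_char = re.findall(r".", "1.2")
literal_dot = re.findall(r"\.", "v1.2.3")

print(default_mode)  # []
print(dotall_mode)   # ['a\nb']
print(any_char)      # ['1', '.', '2']
print(literal_dot)   # ['.', '.']
```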

* (asterisk): this quantifier matches the preceding expression 0 or more (unlimited) times, as many times as possible, giving back as needed (greedy).

Example: Find any text between two digits OR a single digit:


In string LeadingText-1-TrailingText found pattern 1
In string LeadingText-12-TrailingText found pattern 12
In string LeadingText-1.2-TrailingText found pattern 1.2
In string LeadingText-11.2-TrailingText found pattern 11.2
In string LeadingText-1.22-TrailingText found pattern 1.22
In string LeadingText-11.22-TrailingText found pattern 11.22
In string LeadingText-1234-TrailingText found pattern 1234 
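
The pattern that produced the output above is not shown here. As an assumption, one pattern that reproduces the same results is \d.*\d|\d (a digit, anything, then a digit; otherwise a single digit). The sketch below uses Python's re module and is not necessarily the pattern originally used.

```python
import re

# Hypothetical pattern: NOT taken from the original post
PATTERN = re.compile(r"\d.*\d|\d")

samples = ["1", "12", "1.2", "11.2", "1.22", "11.22", "1234"]

found = []
for sample in samples:
    s = f"LeadingText-{sample}-TrailingText"
    m = PATTERN.search(s)
    found.append(m.group())
    print(f"In string {s} found pattern {m.group()}")
```

The greedy .* first runs to the end of the string and then gives characters back until the last digit is found, which is why the whole number is returned.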

.* matches any characters greedily, that is, as many characters as possible.

1.*1 in 101000001 will match 101000001

? matches the preceding expression 0 or 1 times.
If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the fewest possible characters), as opposed to the default, which is greedy (matching as many characters as possible).

.*? matches any characters in non-greedy mode, that is, as few characters as needed to match the pattern.

1.*?1 in 101000001 will match 101
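
The difference between the greedy and the non-greedy form can be checked with a few lines. This is a sketch assuming Python's re module; the tag string is a made-up example.

```python
import re

digits = "101000001"

# Greedy: .* takes as much as it can, so the match runs to the last 1
greedy = re.search(r"1.*1", digits).group()

# Non-greedy: .*? takes as little as it can, so the match stops at the first closing 1
lazy = re.search(r"1.*?1", digits).group()

# A non-greedy pattern can return several short matches within one string
tags = "<a><b>"
greedy_tags = re.findall(r"<.*>", tags)
lazy_tags = re.findall(r"<.*?>", tags)

print(greedy)       # 101000001
print(lazy)         # 101
print(greedy_tags)  # ['<a><b>']
print(lazy_tags)    # ['<a>', '<b>']
```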

What is the difference between .*? and .* regular expressions?
(this answer also contains a nice explanation of backtracking and of how a non-greedy expression can return multiple matches within a string)

+ (plus): this quantifier matches the preceding expression 1 or more (unlimited) times, as many times as possible, giving back as needed (greedy).
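
A quick side-by-side of the three quantifiers (0 or more, 1 or more, 0 or 1), sketched with Python's re module on made-up input:

```python
import re

text = "a ab abbb"

# b* : zero or more b's after a, so a bare 'a' also matches
zero_or_more = re.findall(r"ab*", text)

# b+ : at least one b is required after a
one_or_more = re.findall(r"ab+", text)

# u? : the u is optional (0 or 1 times)
optional = re.findall(r"colou?r", "color colour")

print(zero_or_more)  # ['a', 'ab', 'abbb']
print(one_or_more)   # ['ab', 'abbb']
print(optional)      # ['color', 'colour']
```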

\ (backslash): a backslash that precedes a non-special character indicates that the next character is special and is not to be interpreted literally.
A backslash that precedes a special character indicates that the next character is not special and should be interpreted literally (this is called escaping).
Example: \. matches the character . literally (case sensitive)

\\ : the first backslash escapes the one after it, so the expression searches for a single literal backslash.
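
Both roles of the backslash can be seen in a small sketch. It assumes Python's re module, with raw strings (r"...") so that Python itself does not interpret the backslashes; the Windows-style path is a made-up example.

```python
import re

# Backslash before a non-special character makes it special: \d means "a digit"
digits = re.findall(r"\d", "a1b2")

# Backslash before a special character makes it literal: \. means "a dot"
dots = re.findall(r"\.", "a.b")

# \\ matches one literal backslash
path = r"C:\temp\file"  # hypothetical path containing two backslashes
backslashes = re.findall(r"\\", path)

print(digits)            # ['1', '2']
print(dots)              # ['.']
print(len(backslashes))  # 2
```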

  • ^ means "not the following" when inside and at the start of [], so [^...].
  • When it's inside [] but not at the start, it means the actual ^ character.
  • When it's escaped (\^), it also means the actual ^ character.
  • In all other cases it means start of the string or line (which one is language / setting dependent). If a caret (^) is at the beginning of the entire regular expression, it matches the beginning of a line.
  • [^abc] -> not a, b or c
  • [ab^cd] -> a, b, ^ (character), c or d
  • \^ -> a ^ character
  • Anywhere else -> start of string / line.
  • ^[ \t]* = matches any run of spaces and tabs at the beginning of a line
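
The different meanings of the caret listed above, as a minimal sketch in Python's re module (sample strings are made up):

```python
import re

# Caret first inside [] negates the set: anything except a, b or c
not_abc = re.findall(r"[^abc]", "abcd^")

# Caret elsewhere inside [] is just the ^ character
literal_in_set = re.findall(r"[ab^cd]", "a^e")

# Escaped caret is also the ^ character
escaped = re.findall(r"\^", "2^3")

# Anywhere else it anchors to the start: 'b' is not at the start of "ab"
anchored = re.search(r"^b", "ab")

# ^[ \t]* grabs the leading spaces and tabs
indent = re.match(r"^[ \t]*", "  \tcode").group()

print(not_abc)         # ['d', '^']
print(literal_in_set)  # ['a', '^']
print(escaped)         # ['^']
print(anchored)        # None
print(repr(indent))    # '  \t'
```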

^[b-d]t$ means:
  • Start of line
  • b/c/d character
  • t character
  • End of line
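
To check that reading of ^[b-d]t$, here is a small sketch in Python's re module with a handful of made-up candidate strings:

```python
import re

pattern = re.compile(r"^[b-d]t$")

candidates = ["bt", "ct", "dt", "at", "btt", "xbt"]

# True when the whole string is exactly one of b/c/d followed by t
results = {word: bool(pattern.search(word)) for word in candidates}

for word, ok in results.items():
    print(word, ok)
```

Only bt, ct and dt pass: at starts with a character outside b-d, btt has an extra character before the end, and xbt has an extra character after the start.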

Carets in Regular Expressions


If a dollar sign ($) is at the end of the entire regular expression, it matches the end of a line.

If an entire regular expression is enclosed by a caret and dollar sign (^lorem ipsum$), it matches an entire line.
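
Whether ^ and $ refer to the whole string or to each line depends on the language and its settings. As an illustration, assuming Python's re module: by default they anchor to the string, and the re.MULTILINE flag switches them to per-line anchors. The text below is a made-up sample.

```python
import re

text = "lorem ipsum\ndolor sit\nlorem ipsum"

# Default: ^ is the start of the string and $ the end, so a 3-line string does not match
whole_string = re.findall(r"^lorem ipsum$", text)

# MULTILINE: ^ and $ match at the start and end of every line
per_line = re.findall(r"^lorem ipsum$", text, flags=re.MULTILINE)

print(whole_string)  # []
print(per_line)      # ['lorem ipsum', 'lorem ipsum']
```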

Capturing Groups

Part of a pattern can be enclosed in parentheses (...). This is called a capturing group.
Multiple characters in that group are treated as a single unit that we want to match.
It lets us get a part of the match as a separate item in the result array.
If we put a quantifier after the parentheses, it applies to the parentheses as a whole.
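
A minimal sketch of both points, assuming Python's re module, where the "result array" corresponds to the groups of a match object. The input strings are made up.

```python
import re

# Each (...) becomes a separate item next to the full match
m = re.search(r"(\d+)-(\d+)", "range 10-20")
full = m.group(0)    # the whole match
parts = m.groups()   # the captured groups

# A quantifier after the parentheses repeats the whole group
group_repeated = re.fullmatch(r"(ab)+", "ababab") is not None

# Without parentheses the + applies only to the b
only_b_repeated = re.fullmatch(r"ab+", "ababab") is not None

print(full)             # 10-20
print(parts)            # ('10', '20')
print(group_repeated)   # True
print(only_b_repeated)  # False
```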

String: abababa 
Goal: find all matches of sequence ab
Result: There are 3 matches. 
Regex: (ab)

String: ab123cd345ef785
Goal: find all sequences of numbers
Result: 123, 345, 785
Regex: (\d+)
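
These two examples can be reproduced with re.findall, assuming Python's re module. When the pattern has exactly one capturing group, findall returns the text of that group for each match.

```python
import re

ab_matches = re.findall(r"(ab)", "abababa")
numbers = re.findall(r"(\d+)", "ab123cd345ef785")

print(ab_matches)  # ['ab', 'ab', 'ab']
print(numbers)     # ['123', '345', '785']
```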

String: abc345-1.23.456.7890+whatever.ext
Goal: extract only the numbers that form a valid version number (greedy: M.m.r.b, M.m.r or M.m)
Result: 1.23.456.7890

Regex (\d+) returns 5 groups: 345, 1, 23, 456, 7890
Regex (\d+)\. returns all groups of numbers that are followed by dot. There are 3 such groups: 1, 23 and 456.
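
Both results are easy to verify with a sketch in Python's re module:

```python
import re

s = "abc345-1.23.456.7890+whatever.ext"

# Every run of digits
all_numbers = re.findall(r"(\d+)", s)

# Only runs of digits that are immediately followed by a dot
before_dot = re.findall(r"(\d+)\.", s)

print(all_numbers)  # ['345', '1', '23', '456', '7890']
print(before_dot)   # ['1', '23', '456']
```

The dot is part of the match but sits outside the parentheses, so it is not part of the captured group.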

Let's look at some examples:


We can see that all version numbers:
  • start with a sequence of 1 or more digits which are followed by dot: \d+\.
  • end with a sequence of 1 or more digits: \d+
So far we have: \d+\.\d+
Regex \d+\.\d+ returns 2 matches: 1.23 and 456.7890

Between these two sequences there can be 0 or more (at most 2, but let's ignore this) sequences of 1 or more digits followed by a dot: \d+\.
This sequence is optional, so let's put it in parentheses to form a group and append * to it:
Regex \d+\.(\d+\.)*\d+ does match 1.23.456.7890, but as (...) is a capturing group, it is captured and the result is a single group holding its last repetition, including the trailing dot: 456.
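
The behaviour of the capturing group can be observed directly. This sketch assumes Python's re module; other tools may present the full match and the group differently.

```python
import re

s = "abc345-1.23.456.7890+whatever.ext"

# Two digit runs joined by one dot: two separate matches
pairs = re.findall(r"\d+\.\d+", s)

# With the repeated capturing group the full match is the whole version...
m = re.search(r"\d+\.(\d+\.)*\d+", s)
whole = m.group(0)
# ...but the group keeps only its last repetition
last_repetition = m.group(1)

# findall returns the group instead of the full match when a group is present
only_group = re.findall(r"\d+\.(\d+\.)*\d+", s)

print(pairs)            # ['1.23', '456.7890']
print(whole)            # 1.23.456.7890
print(last_repetition)  # 456.
print(only_group)       # ['456.']
```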

Here we just want the regex to match this group but not to capture it (not to return it in the results). We want it to be a non-capturing group, and there is a special syntax for that: (?: ... ).

Regex \d+\.(?:\d+\.)*\d+ fully matches 1.23.456.7890
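
A final check of the non-capturing version, sketched in Python's re module. The extra version strings are made up to cover the M.m, M.m.r and M.m.r.b shapes.

```python
import re

VERSION = re.compile(r"\d+\.(?:\d+\.)*\d+")

s = "abc345-1.23.456.7890+whatever.ext"

# No capturing group, so findall returns the full match
found = VERSION.findall(s)

# Which strings are, as a whole, a version number?
candidates = ["1.2", "1.2.3", "1.2.3.4", "1", "1."]
shapes = {v: bool(VERSION.fullmatch(v)) for v in candidates}

print(found)   # ['1.23.456.7890']
print(shapes)
```

Because the upper limit of two middle parts is ignored here, a string such as 1.2.3.4.5 would be accepted as well.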

Online Regex tools
