Introduction to RegEx

Wednesday, 10 April 2019

RegEx is used often when working with strings, paths, configurations, etc., so here is a short breakdown of commonly used regular expressions. I will be adding examples I come across in my daily work, and further down I'll add explanations of how to interpret them.

RegEx examples:

'/.*?\\.(test|spec)\\.js$'
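To try the example above, here is a minimal Python sketch. It assumes the doubled backslashes in the quoted pattern come from string escaping, so the regex itself contains \. (a literal dot). The file paths are made up for illustration.

```python
import re

# Assumption: "\\." in the quoted example is a string-escaped "\.",
# so in a Python raw string the pattern uses single backslashes.
pattern = re.compile(r'/.*?\.(test|spec)\.js$')

# Hypothetical paths, not taken from a real project.
paths = ['/src/app.test.js', '/src/util.spec.js', '/src/app.js', '/src/app.test.jsx']

# Keep only the paths that end in .test.js or .spec.js
matching = [p for p in paths if pattern.search(p)]
print(matching)  # ['/src/app.test.js', '/src/util.spec.js']
```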

RegEx Special Characters (Metacharacters)

. (dot) = any character apart from new line

\d = matches any single digit in most regex grammar styles and is equivalent to [0-9]

\t = match tab

\s = any whitespace character (space/blank, tab \t, carriage return \r, newline \n and, in most flavors, form feed \f and vertical tab \v)

[ ] = Square brackets define a character set: a set of characters, any one of which can match at that position.

[aeiou] matches any single lowercase vowel.

Some characters have special meanings inside square brackets:

A caret (^) as the first character inside square brackets negates the set: the regex engine matches any single character that is not listed after the caret.

[^aeiou] matches any single character apart from the lowercase vowels.

A hyphen (-) between two characters inside square brackets means a range. [a-z] means "match any character between a and z, inclusive".

If we want to match square brackets, we need to escape them with backslash. \[|\] means "match [ or ]".
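The metacharacters and character sets listed above can be tried out with a small Python sketch. The sample strings are invented, and note one Python-specific detail: without the re.ASCII flag, Python's \d also accepts non-ASCII decimal digits, so the flag is used here to get the [0-9] behaviour described above.

```python
import re

# \d: a single digit. re.ASCII restricts it to [0-9].
digits = re.findall(r'\d', 'a1b22', flags=re.ASCII)
print(digits)       # ['1', '2', '2']

# \s: whitespace (space, tab, newline, ...), used here to split words.
words = re.split(r'\s+', 'one \t two\nthree')
print(words)        # ['one', 'two', 'three']

# Character sets: each match is ONE character from (or not from) the set.
vowels = re.findall(r'[aeiou]', 'regex rules')
print(vowels)       # ['e', 'e', 'u', 'e']

non_vowels = re.findall(r'[^aeiou]', 'audio')
print(non_vowels)   # ['d']

a_to_c = re.findall(r'[a-c]', 'abcdef')
print(a_to_c)       # ['a', 'b', 'c']

# Escaped square brackets match the bracket characters themselves.
brackets = re.findall(r'\[|\]', 'x[0]')
print(brackets)     # ['[', ']']
```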

RegEx Expressions & Interpretation:

.
Dot matches any single character except the newline character, by default.
If the s flag ("dotAll") is set, it also matches newline characters. If we want to match a dot character itself, we need to escape it: \.
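A short Python sketch of the dot behaviour described above. In Python the equivalent of the s ("dotAll") flag is re.DOTALL; the sample text is invented.

```python
import re

text = 'a\nb a-b'

# By default the dot does not match a newline, so only "a-b" is found.
default_dot = re.findall(r'a.b', text)
print(default_dot)     # ['a-b']

# With re.DOTALL the dot matches a newline as well.
dotall = re.findall(r'a.b', text, flags=re.DOTALL)
print(dotall)          # ['a\nb', 'a-b']

# An escaped dot matches only a literal "." character.
literal_dots = re.findall(r'\.', 'a.b.c')
print(literal_dots)    # ['.', '.']
```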

*
This quantifier (asterisk) matches the preceding expression 0 or more (unlimited) times, as many times as possible, giving back as needed (greedy)

Example: find a single digit, or everything from the first digit to the last digit in the string:

"\\d(.*\\d)*"

In string LeadingText-1-TrailingText found pattern 1
In string LeadingText-12-TrailingText found pattern 12
In string LeadingText-1.2-TrailingText found pattern 1.2
In string LeadingText-11.2-TrailingText found pattern 11.2
In string LeadingText-1.22-TrailingText found pattern 1.22
In string LeadingText-11.22-TrailingText found pattern 11.22
In string LeadingText-1234-TrailingText found pattern 1234
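The results above can be reproduced with the following Python sketch. It assumes the doubled backslashes in the quoted pattern are string escaping, so the regex itself is \d(.*\d)*.

```python
import re

# Raw string, so "\d" reaches the regex engine unchanged.
pattern = re.compile(r'\d(.*\d)*')

samples = [
    'LeadingText-1-TrailingText',
    'LeadingText-12-TrailingText',
    'LeadingText-1.2-TrailingText',
    'LeadingText-11.2-TrailingText',
    'LeadingText-1.22-TrailingText',
    'LeadingText-11.22-TrailingText',
    'LeadingText-1234-TrailingText',
]

# group(0) is the whole match: from the first digit to the last digit.
found = [pattern.search(s).group(0) for s in samples]

for s, f in zip(samples, found):
    print(f'In string {s} found pattern {f}')
```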

.*
Matches any sequence of characters (newlines excluded by default) greedily: as many characters as possible.

Example:
1.*1 in 101000001 will match 101000001

?
Matches the preceding expression 0 or 1 time.
If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the fewest possible characters), as opposed to the default, which is greedy (matching as many characters as possible).
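Both roles of ? can be seen in a small Python sketch (the sample strings are invented): first as "0 or 1 time", then as the modifier that makes another quantifier non-greedy.

```python
import re

# "u?" means the u is optional: 0 or 1 occurrence.
optional = re.findall(r'colou?r', 'color colour colouur')
print(optional)       # ['color', 'colour']

# \d+ is greedy and takes all the digits ...
greedy_plus = re.search(r'\d+', '12345').group(0)
print(greedy_plus)    # 12345

# ... while \d+? stops as soon as the pattern is satisfied.
lazy_plus = re.search(r'\d+?', '12345').group(0)
print(lazy_plus)      # 1

# The same modifier works after {}: take the minimum of the range.
lazy_range = re.search(r'\d{2,4}?', '12345').group(0)
print(lazy_range)     # 12
```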

.*?
Matches any sequence of characters in non-greedy (lazy) mode: as few characters as are needed for the rest of the pattern to match.

Example:
1.*?1 in 101000001 will match 101
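A side-by-side comparison of the greedy and non-greedy forms, as a minimal Python sketch. The second input string (1011001) is my own addition, chosen to show that the non-greedy form can yield several matches where the greedy form yields one.

```python
import re

# Greedy: .* grabs as much as possible, then gives back until the final 1 fits.
greedy = re.search(r'1.*1', '101000001').group(0)
print(greedy)        # 101000001

# Non-greedy: .*? grows one character at a time until a 1 follows.
lazy = re.search(r'1.*?1', '101000001').group(0)
print(lazy)          # 101

# On a string with several 1s the difference also shows in the match count.
greedy_all = re.findall(r'1.*1', '1011001')
print(greedy_all)    # ['1011001']

lazy_all = re.findall(r'1.*?1', '1011001')
print(lazy_all)      # ['101', '1001']
```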

Further reading: "What is the difference between .*? and .* regular expressions?" (the answer there also contains a nice explanation of backtracking and of how a non-greedy expression can return multiple matches within a string).

+
This quantifier matches the preceding expression 1 or more (unlimited) times, as many times as possible, giving back as needed (greedy)
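The difference between + (1 or more) and * (0 or more) shows up when the preceding expression is absent. A small Python sketch with an invented sample string:

```python
import re

text = 'a ab abb'

# b+ needs at least one b, so the lone "a" is not matched.
plus = re.findall(r'ab+', text)
print(plus)   # ['ab', 'abb']

# b* also accepts zero b's, so the lone "a" is matched too.
star = re.findall(r'ab*', text)
print(star)   # ['a', 'ab', 'abb']
```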

\
A backslash that precedes a non-special character indicates that the next character is special and is not to be interpreted literally.
A backslash that precedes a special character indicates that the next character is not special and should be interpreted literally (this is called escaping).

Example: \. matches the character . literally

\\
The first backslash escapes the one after it, so the expression searches for a single literal backslash.
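Here is a minimal Python sketch of the backslash rules above. The sample strings are invented. Python raw strings (r'...') are used so that each backslash is passed to the regex engine as written; in an ordinary string literal every backslash would have to be doubled once more.

```python
import re

# Backslash before a non-special character makes it special: d vs \d.
plain_d = re.findall(r'd', 'd1')
print(plain_d)       # ['d']
digit = re.findall(r'\d', 'd1')
print(digit)         # ['1']

# Backslash before a special character makes it literal: \$ is a dollar sign.
dollar = re.findall(r'\$', 'cost: $5')
print(dollar)        # ['$']

# \\ in the regex matches one literal backslash.
path = 'C:\\dir\\file.txt'   # the text C:\dir\file.txt
backslashes = re.findall(r'\\', path)
print(len(backslashes))      # 2

# The raw string r'\\' and the ordinary string '\\\\' are the same pattern.
same_pattern = (r'\\' == '\\\\')
print(same_pattern)  # True
```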

^
Caret.

^ means "not the following" when inside and at the start of [], so [^...].
When it's inside [] but not at the start, it means the actual ^ character.
When it's escaped (\^), it also means the actual ^ character.
In all other cases it means start of the string or line (which one is language / setting dependent). If a caret (^) is at the beginning of the entire regular expression, it matches the beginning of a line.

Examples:

[^abc] = not a, b or c
[ab^cd] = a, b, ^ (character), c or d
\^ = a ^ character
Anywhere else -> start of string / line. For example:

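^.*".ds-metrics-apm.* matches, for example, a line that starts with space characters followed by ".ds-metrics-apm (note that the unescaped dots in this pattern match any character, not only a literal dot):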

" ".ds-metrics-apm.app.game_api-default-2023.02.02-000014","

^[ \t]* = match all spaces and tabs at the beginning of line
^\s* = match all whitespace characters at the beginning of line
^\n = match empty lines (lines which contain only the \n character); in most engines this needs multiline mode so that ^ matches at the start of every line
^[b-d]t$ means:

Start of line
b/c/d character
t character
End of line
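The caret cases listed above can be checked with a small Python sketch. The sample strings are invented. In Python, ^ outside brackets means "start of the string" unless the re.MULTILINE flag is passed, in which case it means "start of each line"; other languages and tools may default differently.

```python
import re

# Inside [] and first: negation.
not_abc = re.findall(r'[^abc]', 'abcd')
print(not_abc)          # ['d']

# Inside [] but not first: a literal ^ character.
literal_in_set = re.findall(r'[ab^cd]', 'x^a')
print(literal_in_set)   # ['^', 'a']

# Escaped: a literal ^ character.
escaped = re.findall(r'\^', '2^3')
print(escaped)          # ['^']

# Anywhere else: an anchor. Strip leading spaces/tabs from every line.
text = '  a\n\tb\n\nc\n'
stripped = re.sub(r'^[ \t]*', '', text, flags=re.MULTILINE)
print(repr(stripped))   # 'a\nb\n\nc\n'

# Remove empty lines (lines consisting of just \n).
no_empty = re.sub(r'^\n', '', stripped, flags=re.MULTILINE)
print(repr(no_empty))   # 'a\nb\nc\n'

# ^[b-d]t$ : a whole line made of b/c/d followed by t.
two_chars = re.findall(r'^[b-d]t$', 'bt\nat\nct\ndtx', flags=re.MULTILINE)
print(two_chars)        # ['bt', 'ct']
```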

Carets in Regular Expressions

If a dollar sign ($) is at the end of the entire regular expression, it matches the end of a line.

If an entire regular expression is enclosed by a caret and dollar sign (^lorem ipsum$), it matches an entire line.

Caret and dollar are known as anchors because they denote the beginning and the end of the string (or line). A string which matches the "^$" regex is an empty string.
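A minimal Python sketch of the two anchors working together. The inputs are invented, and re.MULTILINE is again what switches Python from "whole string" to "each line".

```python
import re

# Both anchors: the pattern must cover the entire string.
whole = re.search(r'^lorem ipsum$', 'lorem ipsum') is not None
print(whole)         # True

partial = re.search(r'^lorem ipsum$', 'lorem ipsum dolor') is not None
print(partial)       # False

# ^$ leaves no room for any character between start and end.
empty = re.search(r'^$', '') is not None
print(empty)         # True

non_empty = re.search(r'^$', 'x') is not None
print(non_empty)     # False

# In multiline mode the anchors apply to every line of the text.
text = 'lorem ipsum\nlorem ipsum dolor\nlorem ipsum'
full_lines = re.findall(r'^lorem ipsum$', text, flags=re.MULTILINE)
print(full_lines)    # ['lorem ipsum', 'lorem ipsum']
```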

Capturing Groups

Part of a pattern can be enclosed in parentheses (...). This is called a capturing group.

Multiple characters in that group are treated as a single unit that we want to match.

It lets us get a part of the match as a separate item in the result array.

If we put a quantifier after the parentheses, it applies to the parentheses as a whole.

String: abababa

Goal: find all matches of sequence ab.

Result: There are 3 matches.

Regex: (ab)

String: ab123cd345ef785

Goal: find all sequences of numbers

Result: 123, 345, 785

Regex: (\d+)
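The two examples above, plus the "quantifier applies to the whole group" rule, as a minimal Python sketch. The id=42 input is an invented illustration of getting part of a match as a separate item.

```python
import re

# Three non-overlapping occurrences of "ab" in abababa.
ab_matches = re.findall(r'(ab)', 'abababa')
print(ab_matches)    # ['ab', 'ab', 'ab']

# All sequences of digits.
numbers = re.findall(r'(\d+)', 'ab123cd345ef785')
print(numbers)       # ['123', '345', '785']

# A quantifier after the parentheses repeats the whole group:
# (ab)+a matches "ab" three times followed by the final "a".
repeated = re.fullmatch(r'(ab)+a', 'abababa')
print(repeated is not None)   # True

# group(0) is the whole match, group(1) is the captured part.
m = re.search(r'id=(\d+)', 'id=42')
whole_match = m.group(0)
captured = m.group(1)
print(whole_match, captured)  # id=42 42
```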

String: abc345-1.23.456.7890+whatever.ext

Goal: extract only numbers which form a valid version number (greedy - M.m.r.b or M.m.r or M.m )

Result: 1.23.456.7890

Regex (\d+) returns 5 groups: 345, 1, 23, 456, 7890

Regex (\d+)\. returns all groups of numbers that are followed by dot. There are 3 such groups: 1, 23 and 456.

Let's look at some examples:

1.23

11.23

123.45

1.23.456

1.23.456.7890

We can see that all version numbers:

start with a sequence of 1 or more digits which are followed by dot: \d+\.
end with a sequence of 1 or more digits: \d+

So far we have: \d+\.\d+

Regex \d+\.\d+ returns 2 matches: 1.23 and 456.7890

Between these two sequences can be 0 or more (max 2 but let's ignore this) sequences of 1 or more digits that are followed by dot: \d+\.

This sequence is optional, so let's put it in parentheses to form a group and append * to it:

Regex \d+\.(\d+\.)*\d+ does match 1.23.456.7890, but because (...) is a capturing group, the result contains a single captured group which holds only the group's last repetition, "456." (including the trailing dot).

Here we just want the regex to match this group but not to capture it (not to return it in the results and not to store it in a backreference such as $1). We want this group to be a non-capturing group, and there is a special syntax for it: (?: ... ).

Regex \d+\.(?:\d+\.)*\d+ fully matches 1.23.456.7890
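The steps of the version-number walkthrough can be replayed with the Python sketch below. One Python-specific point to keep in mind: re.findall returns the captured groups instead of the whole match whenever the pattern contains a capturing group, which makes the capturing vs. non-capturing difference easy to see.

```python
import re

s = 'abc345-1.23.456.7890+whatever.ext'

all_numbers = re.findall(r'(\d+)', s)
print(all_numbers)     # ['345', '1', '23', '456', '7890']

before_dot = re.findall(r'(\d+)\.', s)
print(before_dot)      # ['1', '23', '456']

pairs = re.findall(r'\d+\.\d+', s)
print(pairs)           # ['1.23', '456.7890']

# Capturing group: the whole match is right, but group 1 keeps
# only the last repetition of (\d+\.).
m = re.search(r'\d+\.(\d+\.)*\d+', s)
whole = m.group(0)
last_repetition = m.group(1)
print(whole)           # 1.23.456.7890
print(last_repetition) # 456.

capturing = re.findall(r'\d+\.(\d+\.)*\d+', s)
print(capturing)       # ['456.']

# Non-capturing group: nothing is captured, so the full match is returned.
non_capturing = re.findall(r'\d+\.(?:\d+\.)*\d+', s)
print(non_capturing)   # ['1.23.456.7890']
```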

Capturing group can be used to match characters between delimiters but excluding delimiters.

E.g. extract file name from: https://raw.example.com/aws/path/to/karpenter.sh_nodeclaims.yaml

/[^/]*$ matches the last /, followed by any character apart from / (0 or more times), up to the end of the string:

/karpenter.sh_nodeclaims.yaml

If we use capturing group, we can capture anything between the last / and the end of the string:

/([^/]*)$ matches /karpenter.sh_nodeclaims.yaml, and its capturing group holds karpenter.sh_nodeclaims.yaml. The content of the capturing group is the desired file name.

^(.*)$ <-- capturing group captures entire string; entire string will be in the backreference $1
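The file-name extraction above, as a minimal Python sketch. In Python the backreference $1 corresponds to group(1) on the match object.

```python
import re

url = 'https://raw.example.com/aws/path/to/karpenter.sh_nodeclaims.yaml'

m = re.search(r'/([^/]*)$', url)

# group(0): the whole match, including the delimiter "/".
with_slash = m.group(0)
print(with_slash)   # /karpenter.sh_nodeclaims.yaml

# group(1): only what the parentheses enclosed, without the delimiter.
file_name = m.group(1)
print(file_name)    # karpenter.sh_nodeclaims.yaml

# ^(.*)$ on a single-line string: group 1 is the entire string.
entire = re.search(r'^(.*)$', url).group(1)
print(entire == url)  # True
```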