RegEx is used often when working with strings, paths, and configurations. Here is a short breakdown of commonly used RegEx expressions. I will keep adding examples I come across in my daily work, and further down I explain how to interpret them.
RegEx examples:
RegEx Special Characters (Metacharacters)
\d = matches any single digit in most regex grammar styles and is equivalent to [0-9]
\t = match tab
[ ] = Square brackets are characters that help define a character set, which is a set of characters we want to match.
- [aeiou] matches all vowels.
Some characters have special meanings inside square brackets:
- caret (^) before any other character inside square brackets means we’re negating the characters that follow the caret. We’re telling the regex engine not to match those characters.
- [^aeiou] matches all characters apart from vowels.
- hyphen (-) between two characters inside square brackets means range. [a-z] means "match all characters between a and z, inclusive"
If we want to match square brackets, we need to escape them with backslash. \[|\] means "match [ or ]".
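The character-set rules above can be tried out with a minimal sketch in Python's `re` module. The sample string and the variable names are made up for illustration only.

```python
import re

# Hypothetical sample text, chosen only to exercise each rule.
text = "regex [demo]"

vowels = re.findall(r"[aeiou]", text)       # every vowel
non_vowels = re.findall(r"[^aeiou]", text)  # everything except vowels
lowercase = re.findall(r"[a-z]", text)      # any character in the range a..z
brackets = re.findall(r"\[|\]", text)       # a literal [ or ]

print(vowels)      # ['e', 'e', 'e', 'o']
print(non_vowels)  # ['r', 'g', 'x', ' ', '[', 'd', 'm', ']']
print(brackets)    # ['[', ']']
```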
RegEx Expressions & Interpretation:
. = Dot matches any single character except the newline character, by default.
If the s flag ("dotAll") is set, it also matches newline characters. If we want to match a dot character itself, we need to escape it: \.
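A small sketch of the dot behaviour, assuming Python, where the counterpart of the s ("dotAll") flag is `re.DOTALL`. The input strings are invented for the example.

```python
import re

text = "a\nb"  # 'a', a newline, 'b'

# By default the dot does not match the newline, so there is no match.
default_match = re.search(r"a.b", text)

# With DOTALL the dot also matches the newline.
dotall_match = re.search(r"a.b", text, re.DOTALL)

# An escaped dot matches only a literal dot.
literal_dots = re.findall(r"\.", "1.2.3")

print(default_match)               # None
print(repr(dotall_match.group()))  # 'a\nb'
print(literal_dots)                # ['.', '.']
```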
* = This quantifier (asterisk) matches the preceding expression 0 or more (unlimited) times, as many times as possible, giving back as needed (greedy).
Example: Find any text between two digits OR a single digit:
In string LeadingText-1-TrailingText found pattern 1
In string LeadingText-12-TrailingText found pattern 12
In string LeadingText-1.2-TrailingText found pattern 1.2
In string LeadingText-11.2-TrailingText found pattern 11.2
In string LeadingText-1.22-TrailingText found pattern 1.22
In string LeadingText-11.22-TrailingText found pattern 11.22
In string LeadingText-1234-TrailingText found pattern 1234
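The pattern that produced the output above is not shown here. One pattern that fits the description ("text between two digits OR a single digit") is `\d.*\d|\d`. This is an assumption, not necessarily the expression originally used. A sketch that reproduces the listed results with it:

```python
import re

# Assumed pattern: a digit, anything, a digit (greedy), or else a single digit.
pattern = re.compile(r"\d.*\d|\d")

samples = [
    "LeadingText-1-TrailingText",
    "LeadingText-12-TrailingText",
    "LeadingText-1.2-TrailingText",
    "LeadingText-11.2-TrailingText",
    "LeadingText-1.22-TrailingText",
    "LeadingText-11.22-TrailingText",
    "LeadingText-1234-TrailingText",
]

found = [pattern.search(s).group() for s in samples]

for sample, match in zip(samples, found):
    print(f"In string {sample} found pattern {match}")
```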
.* = Matches any character greedily - as many characters as possible.
1.*1 in 101000001 will match 101000001
? = Matches the preceding expression 0 or 1 time.
If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the fewest possible characters), as opposed to the default, which is greedy (matching as many characters as possible).
.*? = Matches any character in non-greedy mode - as few characters as needed to match the pattern.
1.*?1 in 101000001 will match 101
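The greedy and non-greedy results can be compared side by side. This is a minimal Python sketch; the second input string is invented to show a non-greedy pattern producing more than one match.

```python
import re

text = "101000001"

greedy = re.search(r"1.*1", text).group()  # runs to the last 1
lazy = re.search(r"1.*?1", text).group()   # stops at the first possible 1

# Hypothetical second input: the lazy pattern finds two separate matches,
# while the greedy one swallows everything between the first and last 1.
lazy_all = re.findall(r"1.*?1", "1x1 1y1")
greedy_all = re.findall(r"1.*1", "1x1 1y1")

print(greedy)      # 101000001
print(lazy)        # 101
print(lazy_all)    # ['1x1', '1y1']
print(greedy_all)  # ['1x1 1y1']
```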
See also: What is the difference between .*? and .* regular expressions?
(that answer also contains a nice explanation of backtracking and of how a non-greedy expression can return multiple matches within a string)
+ = This quantifier matches the preceding expression 1 or more (unlimited) times, as many times as possible, giving back as needed (greedy).
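To see how the quantifiers *, + and ? differ, here is a short Python sketch. The word list and the patterns `ab*c`, `ab+c` and `ab?c` are made up for illustration.

```python
import re

# Hypothetical inputs with zero, one and two 'b' characters.
words = ["ac", "abc", "abbc"]

star = [w for w in words if re.fullmatch(r"ab*c", w)]      # 0 or more b
plus = [w for w in words if re.fullmatch(r"ab+c", w)]      # 1 or more b
optional = [w for w in words if re.fullmatch(r"ab?c", w)]  # 0 or 1 b

print(star)      # ['ac', 'abc', 'abbc']
print(plus)      # ['abc', 'abbc']
print(optional)  # ['ac', 'abc']
```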
\ = A backslash that precedes a non-special character indicates that the next character is special and is not to be interpreted literally (for example, \d turns d into "any digit").
A backslash that precedes a special character indicates that the next character is not special and should be interpreted literally (this is called escaping).
Example: \. matches the character . literally
\\ = The first backslash escapes the one after it, so the expression searches for a single literal backslash.
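Both roles of the backslash can be checked with a small Python sketch. Raw strings (`r"..."`) are used so that Python itself does not interpret the backslashes. The sample inputs are invented.

```python
import re

# Backslash before a non-special character makes it special: \d = any digit.
digits = re.findall(r"\d", "a1b2")

# Backslash before a special character makes it literal: \. = a real dot.
dot_any = re.findall(r"a.c", "abc a.c")       # unescaped dot matches b and .
dot_literal = re.findall(r"a\.c", "abc a.c")  # escaped dot matches only .

# \\ in the pattern matches one literal backslash.
path = "C:\\temp"  # this Python string contains a single backslash
backslashes = re.findall(r"\\", path)

print(digits)            # ['1', '2']
print(dot_any)           # ['abc', 'a.c']
print(dot_literal)       # ['a.c']
print(len(backslashes))  # 1
```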
- ^ means "not the following" when inside and at the start of [], so [^...].
- When it's inside [] but not at the start, it means the actual ^ character.
- When it's escaped (\^), it also means the actual ^ character.
- In all other cases it means start of the string or line (which one is language / setting dependent). If a caret (^) is at the beginning of the entire regular expression, it matches the beginning of a line.
- [^abc] -> not a, b or c
- [ab^cd] -> a, b, ^ (character), c or d
- \^ -> a ^ character
- Anywhere else -> start of string / line.
- ^[ \t]* = match all spaces and tabs at the beginning of line
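The different meanings of the caret listed above can be verified with a minimal Python sketch. The input strings are made up for the example.

```python
import re

negated = re.findall(r"[^ab]", "abc^")         # not a or b
literal_in_set = re.findall(r"[ab^]", "abc^")  # a, b or a literal ^
escaped = re.findall(r"\^", "abc^")            # a literal ^
anchored = re.findall(r"^a", "aaa")            # only the a at the start

# Leading spaces and tabs at the beginning of the string.
leading_ws = re.match(r"^[ \t]*", " \t  indented").group()

print(negated)           # ['c', '^']
print(literal_in_set)    # ['a', 'b', '^']
print(escaped)           # ['^']
print(anchored)          # ['a']
print(repr(leading_ws))  # ' \t  '
```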
^[b-d]t$ means:
- Start of line
- b/c/d character
- t character
- End of line
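As a quick check of this breakdown, the sketch below runs ^[b-d]t$ against a few hypothetical candidate strings in Python.

```python
import re

pattern = re.compile(r"^[b-d]t$")

# Hypothetical candidates: only two-character strings b/c/d + t should pass.
candidates = ["bt", "ct", "dt", "at", "et", "btt", "xbt"]

accepted = [c for c in candidates if pattern.search(c)]

print(accepted)  # ['bt', 'ct', 'dt']
```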
Carets in Regular Expressions
If a dollar sign ($) is at the end of the entire regular expression, it matches the end of a line.
If an entire regular expression is enclosed by a caret and dollar sign (^lorem ipsum$), it matches an entire line.
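Whether ^ and $ anchor to lines or to the whole string depends on the engine and its settings. In Python, for instance, they anchor to the whole string unless `re.MULTILINE` is passed. A minimal sketch with an invented three-line input:

```python
import re

text = "lorem ipsum\ndolor sit\nlorem ipsum"

# Without MULTILINE, ^ and $ refer to the start and end of the whole string,
# so a pattern for a single line does not match a three-line string.
whole_string = re.findall(r"^lorem ipsum$", text)

# With MULTILINE, ^ and $ match at the start and end of every line.
per_line = re.findall(r"^lorem ipsum$", text, re.MULTILINE)

# $ alone: the last word of each line.
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(whole_string)  # []
print(per_line)      # ['lorem ipsum', 'lorem ipsum']
print(line_ends)     # ['ipsum', 'sit', 'ipsum']
```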
Capturing Groups
Part of a pattern can be enclosed in parentheses (...). This is called a capturing group.
Multiple characters in that group are treated as a single unit that we want to match.
It lets us get a part of the match as a separate item in the result array.
If we put a quantifier after the parentheses, it applies to the parentheses as a whole.
String: abababa
Goal: find all matches of sequence ab.
Result: There are 3 matches.
Regex: (ab)
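The same search can be reproduced in Python. The second part of the sketch adds a + after the group (a variation not in the example above) to show that a quantifier applies to the group as a whole.

```python
import re

text = "abababa"

# Every occurrence of the group (ab).
matches = re.findall(r"(ab)", text)

# Variation: the + applies to the whole group, so this is one long match.
repeated = re.search(r"(ab)+", text)

print(matches)            # ['ab', 'ab', 'ab']
print(len(matches))       # 3
print(repeated.group(0))  # ababab
print(repeated.group(1))  # ab (the last repetition of the group)
```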
String: ab123cd345ef785
Goal: find all sequences of numbers
Result: 123, 345, 785
Regex: (\d+)
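A short Python sketch of this example, using `re.finditer` so that the position of each captured number is visible as well:

```python
import re

text = "ab123cd345ef785"

# Collect the captured group and its (start, end) span for each match.
numbers = [(m.group(1), m.span(1)) for m in re.finditer(r"(\d+)", text)]

for value, span in numbers:
    print(value, span)
# 123 (2, 5)
# 345 (7, 10)
# 785 (12, 15)
```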
String: abc345-1.23.456.7890+whatever.ext
Goal: extract only numbers which form a valid version number (greedy - M.m.r.b or M.m.r or M.m )
Result: 1.23.456.7890
Regex (\d+) returns 5 groups: 345, 1, 23, 456, 7890
Regex (\d+)\. returns all groups of numbers that are followed by dot. There are 3 such groups: 1, 23 and 456.
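Both results can be confirmed with a minimal Python sketch. `re.findall` returns the content of the capturing group for each match.

```python
import re

text = "abc345-1.23.456.7890+whatever.ext"

all_numbers = re.findall(r"(\d+)", text)   # every run of digits
before_dot = re.findall(r"(\d+)\.", text)  # only runs followed by a dot

print(all_numbers)  # ['345', '1', '23', '456', '7890']
print(before_dot)   # ['1', '23', '456']
```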
Let's look at some examples:
We can see that all version numbers:
- start with a sequence of 1 or more digits which are followed by dot: \d+\.
- end with a sequence of 1 or more digits: \d+
So far we have: \d+\.\d+
Regex \d+\.\d+ returns 2 matches: 1.23 and 456.7890
Between these two sequences can be 0 or more (max 2 but let's ignore this) sequences of 1 or more digits that are followed by dot: \d+\.
This sequence is optional, so let's wrap it in parentheses to form a group and append * to it: (\d+\.)*
Regex \d+\.(\d+\.)*\d+ does match 1.23.456.7890 as a whole, but because (...) is a capturing group, tools that return captured groups report only the group's last repetition, which is 456. including the trailing dot.
Here we just want regex to match this group but not to capture it (not to return it in results). We want this group to be a non-capturing group and there is a special syntax for it: (?: ... ).
Regex \d+\.(?:\d+\.)*\d+ fully matches 1.23.456.7890
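The steps of this derivation can be replayed in Python. This is a sketch using `re.findall`, which returns the group content when the pattern contains a capturing group and the whole match otherwise.

```python
import re

text = "abc345-1.23.456.7890+whatever.ext"

# Step 1: digits, dot, digits.
two_parts = re.findall(r"\d+\.\d+", text)

# Step 2: optional middle parts in a capturing group.
# findall now returns only the last repetition of the group.
capturing = re.findall(r"\d+\.(\d+\.)*\d+", text)

# Step 3: the same group made non-capturing returns the whole match.
non_capturing = re.findall(r"\d+\.(?:\d+\.)*\d+", text)

print(two_parts)      # ['1.23', '456.7890']
print(capturing)      # ['456.']
print(non_capturing)  # ['1.23.456.7890']
```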
A capturing group can be used to match characters between delimiters while excluding the delimiters themselves.
E.g. extract file name from:
/[^/]*$ matches a /, followed by any character apart from / (0 or more times), up to the end of the string, so the match includes the leading /:
If we use a capturing group, we can capture anything between the last / and the end of the string:
/([^/]*)$ matches /karpenter.sh_nodeclaims.yaml and captures the group karpenter.sh_nodeclaims.yaml. The captured group is the desired file name.
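A minimal Python sketch of the file name extraction follows. The directory part of the path is hypothetical, since only the file name is known from the text above.

```python
import re

# Hypothetical path: only the final component comes from the example.
path = "/hypothetical/dir/karpenter.sh_nodeclaims.yaml"

# Without a group, the match includes the leading slash.
with_slash = re.search(r"/[^/]*$", path).group(0)

# With a capturing group, group 1 is the file name alone.
file_name = re.search(r"/([^/]*)$", path).group(1)

print(with_slash)  # /karpenter.sh_nodeclaims.yaml
print(file_name)   # karpenter.sh_nodeclaims.yaml
```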