Wednesday, 10 April 2019

Introduction to RegEx

RegEx is used often when working with strings, paths, configurations, etc., so here is a short breakdown of commonly used regular expressions. I will keep adding examples I come across in my daily work, with explanations further down on how to interpret them.

RegEx examples:


'/.*?\\.(test|spec)\\.js$'
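A minimal Python sketch of how this pattern behaves. It assumes that the doubled backslashes are only string-level escaping (so the engine sees \. rather than \\.) and that the leading / is a literal path separator. The helper name is_test_file and the sample paths are made up for illustration.

```python
import re

# The pattern as the regex engine sees it, once the string-level
# escaping (doubled backslashes) is removed. Assumption: the leading
# "/" is a literal path separator.
TEST_FILE = re.compile(r"/.*?\.(test|spec)\.js$")


def is_test_file(path: str) -> bool:
    """Return True if the path ends in .test.js or .spec.js (hypothetical helper)."""
    return TEST_FILE.search(path) is not None


print(is_test_file("src/utils/date.test.js"))   # True
print(is_test_file("src/utils/date.spec.js"))   # True
print(is_test_file("src/utils/date.js"))        # False
print(is_test_file("src/utils/date.test.jsx"))  # False
```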


RegEx Special Characters


\d = matches any single digit; in most regex flavours it is equivalent to [0-9]


RegEx Expressions & Interpretation:


.
By default, the dot matches any single character except the newline character.
If the s flag ("dotAll") is set, it also matches newline characters.
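A small Python sketch of the dot with and without the flag. In Python's re module the flag is spelled re.DOTALL; the sample text is made up.

```python
import re

text = "a\nb"

# By default the dot does not match a newline, so "a.b" finds nothing.
default_match = re.search(r"a.b", text)
print(default_match)  # None

# With re.DOTALL (the "s" flag) the dot matches the newline as well.
dotall_match = re.search(r"a.b", text, flags=re.DOTALL)
print(dotall_match.group(0) == text)  # True
```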

*
This quantifier (asterisk) matches the preceding expression 0 or more (unlimited) times, as many times as possible, giving back as needed (greedy)

Example: Find any text between two digits OR a single digit:

"\\d(.*\\d)*"

In string LeadingText-1-TrailingText found pattern 1
In string LeadingText-12-TrailingText found pattern 12
In string LeadingText-1.2-TrailingText found pattern 1.2
In string LeadingText-11.2-TrailingText found pattern 11.2
In string LeadingText-1.22-TrailingText found pattern 1.22
In string LeadingText-11.22-TrailingText found pattern 11.22
In string LeadingText-1234-TrailingText found pattern 1234 
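The results above can be reproduced with a short Python sketch. It assumes the doubled backslashes in the pattern are string-level escaping, so a raw string with single backslashes is used here.

```python
import re

# First digit, then (optionally, repeated) anything that ends in a digit.
PATTERN = re.compile(r"\d(.*\d)*")

samples = [
    "LeadingText-1-TrailingText",
    "LeadingText-1.2-TrailingText",
    "LeadingText-11.22-TrailingText",
    "LeadingText-1234-TrailingText",
]

found = [PATTERN.search(s).group(0) for s in samples]
for sample, match in zip(samples, found):
    print(f"In string {sample} found pattern {match}")
```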

.*
Matches any sequence of characters (zero or more) greedily: as many characters as possible.

Example:
1.*1 in 101000001 will match 101000001

?
Matches the preceding expression 0 or 1 time.
If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the fewest possible characters), as opposed to the default, which is greedy (matching as many characters as possible).

.*?
Matches any sequence of characters in non-greedy (lazy) mode: as few characters as are needed for the pattern to match.

Example:
1.*?1 in 101000001 will match 101
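To compare the greedy and non-greedy forms side by side, here is a short Python sketch. The second input string ("1a1 1b1") is made up to show how the non-greedy form can produce several matches where the greedy one produces a single match.

```python
import re

text = "101000001"

# Greedy: .* grabs as much as possible, then gives back until a 1 fits.
greedy = re.search(r"1.*1", text).group(0)
print(greedy)  # 101000001

# Non-greedy: .*? grabs as little as possible before the closing 1.
lazy = re.search(r"1.*?1", text).group(0)
print(lazy)  # 101

# On a string with several candidates the difference shows up in findall.
greedy_all = re.findall(r"1.*1", "1a1 1b1")
lazy_all = re.findall(r"1.*?1", "1a1 1b1")
print(greedy_all)  # ['1a1 1b1']
print(lazy_all)    # ['1a1', '1b1']
```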

What is the difference between .*? and .* regular expressions?
(this answer also contains a nice explanation of backtracking and of how a non-greedy expression can return multiple matches within a string)


+
This quantifier matches the preceding expression 1 or more (unlimited) times, as many times as possible, giving back as needed (greedy)
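A quick Python sketch of the difference between * (0 or more) and + (1 or more); the patterns and inputs are made up for illustration.

```python
import re

# ab* accepts "a" because zero occurrences of "b" are allowed.
star_on_a = re.fullmatch(r"ab*", "a") is not None
print(star_on_a)  # True

# ab+ rejects "a" because at least one "b" is required.
plus_on_a = re.fullmatch(r"ab+", "a") is not None
print(plus_on_a)  # False

# Both are greedy: given "abbb" followed by other text, + takes all the b's.
plus_greedy = re.match(r"ab+", "abbbc").group(0)
print(plus_greedy)  # abbb
```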

\
A backslash that precedes a non-special character indicates that the next character is special and is not to be interpreted literally. Example: \d matches a digit, not the letter d.
A backslash that precedes a special character indicates that the next character is not special and should be interpreted literally (this is called escaping).
Example: \. matches the character . literally

\\
The first backslash escapes the one after it, so the expression searches for a single literal backslash.
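The backslash rules can be checked with a short Python sketch. Raw strings (r"...") are used so that Python itself does not consume the backslashes before the regex engine sees them; the sample inputs are made up.

```python
import re

# Unescaped dot is special: it matches any character.
any_char = re.findall(r"a.c", "abc a.c")
print(any_char)  # ['abc', 'a.c']

# Escaped dot is literal: it matches only ".".
literal_dot = re.findall(r"a\.c", "abc a.c")
print(literal_dot)  # ['a.c']

# Backslash before a non-special character makes it special: \d is a digit.
digits = re.findall(r"\d", "d1")
print(digits)  # ['1']

# \\ in the pattern matches one literal backslash in the input.
path = "C:\\temp"  # the 7-character string C:\temp
backslashes = re.findall(r"\\", path)
print(len(backslashes))  # 1
```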

^
Caret.

  • ^ means "not the following" when inside and at the start of [], so [^...].
  • When it's inside [] but not at the start, it means the actual ^ character.
  • When it's escaped (\^), it also means the actual ^ character.
  • In all other cases it means start of the string / line (which one is language / setting dependent).

So in short:

  • [^abc] -> not a, b or c
  • [ab^cd] -> a, b, ^ (character), c or d
  • \^ -> a ^ character
  • Anywhere else -> start of string / line.


So ^[b-d]t$ means:

  • Start of line
  • b/c/d character
  • t character
  • End of line
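Here is a Python sketch of the caret cases listed above. The sample inputs are made up. In Python, ^ means start of string by default and start of each line only when re.MULTILINE is set, which is one instance of the language / setting dependence mentioned earlier.

```python
import re

# [^abc]: any character that is not a, b or c.
negated = re.findall(r"[^abc]", "abcd^")
print(negated)  # ['d', '^']

# [ab^cd]: caret not at the start of the class is a plain ^ character.
inside = re.findall(r"[ab^cd]", "x^ya")
print(inside)  # ['^', 'a']

# \^: an escaped caret is a plain ^ character.
escaped = re.findall(r"\^", "2^3")
print(escaped)  # ['^']

# ^[b-d]t$: start, one of b/c/d, then t, then end.
whole_ok = re.search(r"^[b-d]t$", "ct") is not None
whole_bad = re.search(r"^[b-d]t$", "at") is not None
print(whole_ok, whole_bad)  # True False

# Without MULTILINE, ^ anchors only at the start of the whole string.
lines = "bt\nat\ndt"
single = re.findall(r"^[b-d]t$", lines)
multi = re.findall(r"^[b-d]t$", lines, flags=re.MULTILINE)
print(single)  # []
print(multi)   # ['bt', 'dt']
```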

Carets in Regular Expressions


Capturing Groups


Part of a pattern can be enclosed in parentheses (...). This is called a capturing group.
The characters in the group are treated as a single unit to match.
It lets us get a part of the match as a separate item in the result array.
If we put a quantifier after the parentheses, it applies to the parentheses as a whole.


String: abababa 
Goal: find all matches of sequence ab
Result: There are 3 matches. 
Regex: (ab)

String: ab123cd345ef785
Goal: find all sequences of numbers
Result: 123, 345, 785
Regex: (\d+)
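The two examples above, sketched in Python. re.findall returns the captured group for each match when the pattern contains exactly one group. The last part uses (ab)+, a pattern not taken from the examples, to show a quantifier applying to the group as a whole.

```python
import re

# Every occurrence of the sequence "ab".
ab_matches = re.findall(r"(ab)", "abababa")
print(ab_matches)  # ['ab', 'ab', 'ab']

# Every run of digits.
numbers = re.findall(r"(\d+)", "ab123cd345ef785")
print(numbers)  # ['123', '345', '785']

# A quantifier after the parentheses repeats the whole group.
m = re.match(r"(ab)+", "abababa")
whole = m.group(0)  # the full match
last = m.group(1)   # the group holds only its last repetition
print(whole, last)  # ababab ab
```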

String: abc345-1.23.456.7890+whatever.ext
Goal: extract only the numbers which form a valid version number (greedy: M.m.r.b or M.m.r or M.m)
Result: 1.23.456.7890

Regex (\d+) returns 5 groups: 345, 1, 23, 456, 7890
Regex (\d+)\. returns all groups of numbers that are followed by dot. There are 3 such groups: 1, 23 and 456.

Let's look at some examples:

1.23
11.23
123.45
1.23.456
1.23.456.7890

We can see that all version numbers:
  • start with a sequence of 1 or more digits which are followed by dot: \d+\.
  • end with a sequence of 1 or more digits: \d+
So far we have: \d+\.\d+
Regex \d+\.\d+ returns 2 matches: 1.23 and 456.7890

Between these two sequences there can be 0 or more (at most 2, but let's ignore that) sequences of 1 or more digits followed by a dot: \d+\.
This sequence is optional, so let's put it in parentheses to form a group and append * to it:
Regex \d+\.(\d+\.)*\d+ does match 1.23.456.7890, but because (...) is a capturing group, the reported result is a single group holding only its last repetition, 456. (including the trailing dot).

Here we just want regex to match this group but not to capture it (not to return it in results). We want this group to be a non-capturing group and there is a special syntax for it: (?: ... ).

Regex \d+\.(?:\d+\.)*\d+ fully matches 1.23.456.7890
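The whole walkthrough can be replayed with a Python sketch. Note that re.findall reports the captured group instead of the full match whenever the pattern contains a capturing group, which is why the capturing variant yields only the last repetition of the group.

```python
import re

text = "abc345-1.23.456.7890+whatever.ext"

# Step 1: every run of digits.
all_numbers = re.findall(r"(\d+)", text)
print(all_numbers)  # ['345', '1', '23', '456', '7890']

# Step 2: runs of digits that are followed by a dot.
before_dot = re.findall(r"(\d+)\.", text)
print(before_dot)  # ['1', '23', '456']

# Step 3: two numbers separated by a dot.
pairs = re.findall(r"\d+\.\d+", text)
print(pairs)  # ['1.23', '456.7890']

# Step 4: capturing group in the middle; findall returns the group only.
capturing = re.findall(r"\d+\.(\d+\.)*\d+", text)
print(capturing)  # ['456.']

# Step 5: non-capturing group; findall returns the full match.
version = re.findall(r"\d+\.(?:\d+\.)*\d+", text)
print(version)  # ['1.23.456.7890']
```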




Online Regex tools



