Demystifying Regular Expressions: A Beginner’s Guide

People usually fear Regex a lot and tend to avoid it. Fortunately, a hero comes to save us… ChatGPT!

Since its release, ChatGPT has often been used to write Regex.

However, it is far from perfect and can give poor results. So, is it time to panic? Is Regex really that terrifying?

I don’t think so. Regex is just a bunch of easy rules that give you a powerful tool if you know how to use and implement them.

WARNING: A significant drawback of Regex is that it is hard to read at first glance. This problem can be tackled if you adhere to good naming conventions. Helpful practices include giving variables and constants descriptive names and encapsulating validation within functions with meaningful names, which enhances readability.
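As a small illustration of that advice, here is a minimal sketch in Python. The names FOUR_DIGIT_YEAR and is_four_digit_year are made up for this example; the point is only that a reader can tell what the check does without decoding the pattern.

```python
import re

# Hypothetical constant: the name says what the pattern is for.
FOUR_DIGIT_YEAR = re.compile(r"[0-9]{4}")

def is_four_digit_year(text):
    """Return True when the whole text is exactly four digits."""
    return FOUR_DIGIT_YEAR.fullmatch(text) is not None

print(is_four_digit_year("2024"))  # True
print(is_four_digit_year("24"))    # False
```

The calling code now reads like plain English, and the pattern lives in one place if it ever needs to change.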

But what exactly are they?

They are just a way to create an expression that matches a string following specific rules. For example, we can quickly check whether a string has the structure of an email:

identifier@domain/subdomain

A domain or subdomain has its own pattern:

name.top-level

Note: With top level, I’m referring to top-level domains, such as com, net, org, and more. You can learn more about it here.
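To make that structure concrete, here is a rough sketch in Python. It is deliberately simplified and is not a complete email validator: real addresses allow many more forms. The names SIMPLE_EMAIL and looks_like_email are invented for this example, and the symbols used in the pattern are explained in the sections that follow.

```python
import re

# Simplified on purpose: identifier, an @ sign, one or more names
# separated by dots, and a top-level domain made of letters.
SIMPLE_EMAIL = re.compile(
    r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*\.[A-Za-z]+"
)

def looks_like_email(text):
    """Return True when the whole text follows the simplified structure."""
    return SIMPLE_EMAIL.fullmatch(text) is not None

print(looks_like_email("ana@example.com"))       # True
print(looks_like_email("ana@mail.example.com"))  # True
print(looks_like_email("ana@example"))           # False
```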

Character classes

Character classes are a way to group a set of characters or rules and check whether a given string contains them.

These classes can be built-in, such as letters or digits, or handmade to check a specific pattern you care about.

Your handmade classes

Classes are a way to match a character against a given set. You create them using square brackets, listing the characters inside with no separators, like this: [aeiou]. A comma inside the brackets would be treated as one more character to match. These classes are case sensitive, so if you want to match any vowel in lower or upper case, the class would be [aeiouAEIOU]
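Here is a quick sketch in Python of how a handmade class behaves. It also shows why you should not put commas between the characters: inside square brackets, a comma is just another character to match. The sample text is made up for the example.

```python
import re

text = "Regex, made easy"

# Only lowercase vowels are in the class.
vowels_only = re.findall(r"[aeiou]", text)

# With commas inside the brackets, the comma in the text matches too.
with_commas = re.findall(r"[a,e,i,o,u]", text)

# Adding the uppercase letters makes the class cover both cases.
any_case = re.findall(r"[aeiouAEIOU]", "Open IDE")

print(vowels_only)  # ['e', 'e', 'a', 'e', 'e', 'a']
print(with_commas)  # ['e', 'e', ',', 'a', 'e', 'e', 'a']
print(any_case)     # ['O', 'e', 'I', 'E']
```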

Listing your expected characters one by one is handy, but you can also use ranges. If you want to match any digit, you can write [0-9]

Another nice thing you can do is require the same pattern to occur a certain number of times using curly braces. Say you want to check whether the user entered a credit card number; usually, credit cards have 14 to 16 digits.

You can check it with the following pattern: [0-9]{14,16}
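A minimal sketch of that check in Python could look like this. The names CARD_DIGITS and has_card_length are invented for the example, and it only checks the length and that every character is a digit, nothing more.

```python
import re

# 14 to 16 characters, each one a digit from 0 to 9.
CARD_DIGITS = re.compile(r"[0-9]{14,16}")

def has_card_length(text):
    """Return True when the whole text is 14 to 16 digits."""
    return CARD_DIGITS.fullmatch(text) is not None

print(has_card_length("1" * 14))  # True
print(has_card_length("1" * 16))  # True
print(has_card_length("1" * 13))  # False
print(has_card_length("1" * 17))  # False
```

The sketch uses fullmatch so the whole string has to fit the pattern. A plain search would happily find 16 digits inside a 17-digit string and report a match.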

Ready to use

These classes come built in, and you will use them all the time when you write Regex, so pay attention to them.

Note: these classes and rules can change depending on where you run your Regex, so if something doesn’t work as expected, please check your programming language’s documentation.

  • Digits

  • Character

  • Alphanumeric

  • White space

  • Null characters
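The list above does not show the symbols themselves, so as an assumption, here is a short sketch using three of Python's shorthand classes: \d for digits, \w for letters, digits, and the underscore, and \s for white space. The sample string is made up for the example.

```python
import re

sample = "Room 42, floor_3\tEast"

digits = re.findall(r"\d", sample)      # digit characters
word_chars = re.findall(r"\w", sample)  # letters, digits, and underscore
spaces = re.findall(r"\s", sample)      # spaces, tabs, line breaks

print(digits)           # ['4', '2', '3']
print(len(word_chars))  # 17
print(spaces)           # [' ', ' ', '\t']
```

Each of them has an uppercase opposite (\D, \W, \S) that matches everything the lowercase version does not.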

Groups

Sometimes, you want to group your matches to distinguish them easily. That is the function of groups created with parentheses ().

For example, separating country codes from phone numbers is a common problem that Regex can solve.

This Regex depends on how you store your phone numbers; here we assume the following conditions:

  1. The only blank space allowed is the one that separates the country code from the rest of the phone number

  2. The string can only contain digits, blank spaces, and a + sign

  3. The country code begins with the + sign

To do that, you can use the following regex: ^\+(\d{1,3})\s(\d{1,12})

But what does it mean?

Let’s break it down:

  • ^: this means the beginning of the line

  • \+: this checks that there is a + sign at the beginning. The \ escapes the + sign, since + has a special meaning in Regex that we will discuss later

  • (\d{1,3}): this is the part that captures the country code, the first 1 to 3 digits

  • \s: this matches the blank space that separates the country code from the rest of the number

  • (\d{1,12}): this matches the rest of the phone number, from 1 to 12 digits

How you get a specific group depends on the programming language, but if you are using Python, you can get it with the following code

import re

def extract_country_code_and_number(phone_number):
    # Define the regex pattern
    pattern = r'^\+(\d{1,3})\s(\d{1,12})'

    # Use re.match to find the match object
    match = re.match(pattern, phone_number)

    if match:
        # Extract country code and number using groups
        country_code = match.group(1)
        phone_number = match.group(2)
        return country_code, phone_number
    else:
        return None, None

def show_country_code_and_number(phone_number):
    # Extracting country code and phone number
    country_code, number = extract_country_code_and_number(phone_number)

    # Printing the extracted groups
    if country_code and number:
        print("Country Code:", country_code)
        print("Phone Number:", number)
    else:
        print("Phone number format is not valid.")

phone_number = "+1 234567890"
show_country_code_and_number(phone_number)

'''
OUTCOME

Country Code: 1
Phone Number: 234567890
'''
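If numbered groups feel hard to follow, Python also lets you give each group a name with the (?P<name>...) syntax. This is a small sketch of the same pattern with named groups; the names country_code and number are my own choice for the example.

```python
import re

# Same pattern as before, but each group carries a name.
PHONE = re.compile(r"^\+(?P<country_code>\d{1,3})\s(?P<number>\d{1,12})")

match = PHONE.match("+1 234567890")
parts = match.groupdict() if match else {}

print(parts)  # {'country_code': '1', 'number': '234567890'}
```

This ties in with the earlier advice about naming: match.group("country_code") tells the reader more than match.group(1).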

Wild card

Sometimes there are parts of the string where you don’t care which character appears; you only care that some character is there as part of the pattern.

You can do a trick with a handmade class like [\d\D]: \d matches all digits, and \D matches all non-digit characters, so together they match any character.

It’s a good option, but there is an easier one: the dot sign. When you use this sign in Regex, it matches any single character (by default, everything except a line break).

So, let’s go back to the previous example of phone numbers. The separator between the country code and the rest of the number is not always a blank space, so you can use the dot sign instead.

The new Regex will be ^\+(\d{1,3}).(\d{1,12})

The new code will be

import re

def extract_country_code_and_number(phone_number):
    # Define the regex pattern
    pattern = r'^\+(\d{1,3}).(\d{1,12})'

    # Use re.match to find the match object
    match = re.match(pattern, phone_number)

    if match:
        # Extract country code and number using groups
        country_code = match.group(1)
        phone_number = match.group(2)
        return country_code, phone_number
    else:
        return None, None

def show_country_code_and_number(phone_number):
    # Extracting country code and phone number
    country_code, number = extract_country_code_and_number(phone_number)

    # Printing the extracted groups
    if country_code and number:
        print("Country Code:", country_code)
        print("Phone Number:", number)
    else:
        print("Phone number format is not valid.")

Now let’s try it with different values for the phone_number parameter

show_country_code_and_number('+1-234567890')
show_country_code_and_number('+1?234567890')
show_country_code_and_number('+1^234567890')
show_country_code_and_number('+1 234567890')
''' 
In all cases, the OUTPUT is:
    - Country Code: 1
    - Phone Number: 234567890
'''
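One detail worth knowing about the dot: in Python it does not match a line break unless you pass the re.DOTALL flag, and if you want a literal dot you have to escape it. A tiny sketch with made-up strings:

```python
import re

dot_matches_dash = re.match(r"a.c", "a-c") is not None
dot_matches_newline = re.match(r"a.c", "a\nc") is not None
dotall_matches_newline = re.match(r"a.c", "a\nc", re.DOTALL) is not None
escaped_dot_matches_dash = re.match(r"a\.c", "a-c") is not None

print(dot_matches_dash)          # True
print(dot_matches_newline)       # False
print(dotall_matches_newline)    # True
print(escaped_dot_matches_dash)  # False
```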

Zero or one

There will be times when you wonder if a character will be there. In these cases, you can use the question mark sign (?).

This operator means the pattern to its left may or may not be there.

In our phone number example, the + sign may or may not be present. Also, there may or may not be a separator between the country code and the phone number.

The new regex will be ^\+?(\d{1,3}).?(\d{1,12})

The new code will be

import re

def extract_country_code_and_number(phone_number):
    # Define the regex pattern
    pattern = r'^\+?(\d{1,3}).?(\d{1,12})'

    # Use re.match to find the match object
    match = re.match(pattern, phone_number)

    if match:
        # Extract country code and number using groups
        country_code = match.group(1)
        phone_number = match.group(2)
        return country_code, phone_number
    else:
        return None, None

def show_country_code_and_number(phone_number):
    # Extracting country code and phone number
    country_code, number = extract_country_code_and_number(phone_number)

    # Printing the extracted groups
    if country_code and number:
        print("Country Code:", country_code)
        print("Phone Number:", number)
    else:
        print("Phone number format is not valid.")
# Example phone number with country code
show_country_code_and_number('+121?234567890')
'''
OUTPUT
Country Code: 121
Phone Number: 234567890
'''
show_country_code_and_number('+31^234567890')
'''
OUTPUT
Country Code: 31
Phone Number: 234567890
'''
show_country_code_and_number('+123234567890')
'''
OUTPUT
Country Code: 123
Phone Number: 34567890
'''
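Notice that in the last example the first digit of the phone number went missing: the optional dot matches any character, so it consumed the 2. One possible way around it, shown here only as a sketch, is to make the optional separator a non-digit with \D. The helper name split_phone is made up for this example.

```python
import re

def split_phone(phone_number):
    """Return (country_code, number), or None when there is no match."""
    # \D? is an optional separator that can never be a digit.
    match = re.match(r"^\+?(\d{1,3})\D?(\d{1,12})", phone_number)
    return match.groups() if match else None

print(split_phone("+123234567890"))  # ('123', '234567890')
print(split_phone("+31^234567890"))  # ('31', '234567890')
print(split_phone("121?234567890"))  # ('121', '234567890')
```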

One or more operator

Sometimes, a pattern is repeated one or more times, and you don’t care how many times. This is where the “one or more” operator, the plus sign (+), comes in.

This operator is handy when you have to match something between two patterns and are unsure how long it will be.

Coming back to the phone number example: maybe you don’t know how many digits the phone number will have, and you don’t care.

The new Regex will be ^\+?(\d{1,3}).+(\d+)

And the new code will be

import re

def extract_country_code_and_number(phone_number):
    # Define the regex pattern
    pattern = r'^\+?(\d{1,3}).+(\d+)'

    # Use re.match to find the match object
    match = re.match(pattern, phone_number)

    if match:
        # Extract country code and number using groups
        country_code = match.group(1)
        phone_number = match.group(2)
        return country_code, phone_number
    else:
        return None, None

def show_country_code_and_number(phone_number):
    # Extracting country code and phone number
    country_code, number = extract_country_code_and_number(phone_number)

    # Printing the extracted groups
    if country_code and number:
        print("Country Code:", country_code)
        print("Phone Number:", number)
    else:
        print("Phone number format is not valid.")

# Example phone numbers with country code
show_country_code_and_number('+121?23')
'''
OUTPUT
Country Code: 121
Phone Number: 3
'''
show_country_code_and_number('+31^23456789012131232131')
'''
OUTPUT
Country Code: 31
Phone Number: 1
'''
show_country_code_and_number('+123234567890')
'''
OUTPUT
Country Code: 123
Phone Number: 0
'''
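Why does the phone number end up being a single digit? The + operator is greedy: .+ grabs as many characters as it can and only gives back the one digit that (\d+) needs. Adding a ? right after the operator makes it lazy, so it takes as few characters as possible. The sketch below compares both; the names GREEDY, LAZY, and split_phone are invented for the example.

```python
import re

GREEDY = r"^\+?(\d{1,3}).+(\d+)"
LAZY = r"^\+?(\d{1,3}).+?(\d+)"

def split_phone(pattern, phone_number):
    """Return (country_code, number), or None when there is no match."""
    match = re.match(pattern, phone_number)
    return match.groups() if match else None

phone = "+31^23456789012131232131"
print(split_phone(GREEDY, phone))  # ('31', '1')
print(split_phone(LAZY, phone))    # ('31', '23456789012131232131')
```

Keep in mind that .+ still needs at least one character, so with no separator the lazy version uses the first digit of the number as the "separator".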

Zero or more operator

Sometimes a pattern may or may not be there, and when it is, it may appear more than once: zero or more times. For these occasions, we have the asterisk (*).

Back to the phone number problem: maybe we don’t know whether there will be a separator between the country code and the rest of the phone number, or whether there will be more than one separator.

The new regex will be ^\+?(\d{1,3}).*?(\d+)

The ? right after the asterisk makes it lazy: it takes as few characters as possible, leaving the digits for the last group.

The new code is

import re

def extract_country_code_and_number(phone_number):
    # Define the regex pattern
    pattern = r'^\+?(\d{1,3}).*?(\d+)'

    # Use re.match to find the match object
    match = re.match(pattern, phone_number)

    if match:
        # Extract country code and number using groups
        country_code = match.group(1)
        phone_number = match.group(2)
        return country_code, phone_number
    else:
        return None, None

def show_country_code_and_number(phone_number):
    # Extracting country code and phone number
    country_code, number = extract_country_code_and_number(phone_number)

    # Printing the extracted groups
    if country_code and number:
        print("Country Code:", country_code)
        print("Phone Number:", number)
    else:
        print("Phone number format is not valid.")

show_country_code_and_number('+121  23')
'''
OUTPUT

Country Code: 121
Phone Number: 23
'''
show_country_code_and_number('+31^&@23456789012131232131')
'''
OUTPUT

Country Code: 31
Phone Number: 23456789012131232131
'''
show_country_code_and_number('+123234567890')
'''
OUTPUT

Country Code: 123
Phone Number: 234567890
'''
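To wrap up the three operators, here is a small side-by-side sketch in Python. It uses a toy pattern where only the letter b is optional or repeated, and checks which of three made-up strings each operator accepts.

```python
import re

samples = ["ac", "abc", "abbbc"]
patterns = {
    "?": r"ab?c",  # zero or one b
    "+": r"ab+c",  # one or more b
    "*": r"ab*c",  # zero or more b
}

results = {}
for operator, pattern in patterns.items():
    # Keep the samples that the pattern matches from start to end.
    results[operator] = [s for s in samples if re.fullmatch(pattern, s)]
    print(operator, results[operator])

# ? ['ac', 'abc']
# + ['abc', 'abbbc']
# * ['ac', 'abc', 'abbbc']
```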