The point of RegexExtract is to bring the power that regular expressions provide to precisely pulling the data you want out of text that follows a pattern. It is not consistent in a way that would allow you to use a "substring" approach, which uses start and end positions you define to pull data out of text, whereas RegexExtract uses pattern to pull text.
When using this in calculated columns in Grow, it helps you transform unusable data into data that can be used for your metrics. While RegexExtract is difficult to use, it also makes your data very dynamic.
Here, we will break down how to use it in Grow's environment.
This is the basic syntax of a regexextract function:RegexExtract(/regular expression/, {column to extract from})
However, telling the function what pattern to find and what to extract is a little more complicated than it looks. The next section will give you a little more insight on how to build your function.
The function
At the beginning of every function you always start with RegexExtract(
. This tells Grow what function you are using.
Where and what to extract
This part is often the trickiest to write, so we will break it down a little bit. This will cover the /regular expression/ portion shown in the syntax above. This is what it looks like broken down:
RegexExtract(/pattern(captured group)pattern/,
Patterns: Where to look
First, use backslashes //
to indicate boundaries. When it sees the first backslash, it will know that what comes next is the pattern to look for as well as the values you want to extract. This is standard regular expression syntax and is used in many programming languages and implementations.
For example, say you have a column where the data all comes in as 123_a_321, 123_b_321, 123_c_321, etc. You want to pull out the letter in between the underscores. To write the pattern portion of the expression, you would write it like this: RegexExtract(/123_(captured group)_321/
.
We see that the captured group, the part we want to extract, falls in between 123_
and _321
, so we tell the function to look for that same pattern in each cell of that column.
The captured group: what to extract
After you enter where you want it to look, you will put what string you want to extract in parentheses ()
. This is called the captured group and is what is returned by the RegexExtract function.
If it falls in between two pieces of your pattern, you will want to close the parentheses and finish writing out the pattern, like this:
/pattern(captured group)pattern/
If there is only one piece of text in front of the captured group, it will look like this: /pattern(captured group)/
In writing the expression, Regex recognizes both symbols as well as text. This goes for both writing the pattern as well as the captured group.
Here is a list of some of the symbols you can use and what they mean.
-
\w
-- This means any letters and numbers. -
\s
-- This means 0 to infinity spaces. -
\d
-- This usually indicates a digit [0-9] -
+
-- The plus means one or more of the character that precedes it. -
.
-- The dot matches any character. -
*
-- The asterisk tells the engine to match the preceding token zero or more times. -
?
-- The question mark means 0 or 1 of the things that precede it.
This is not a comprehensive list. In addition to the list above, Grow has access to the full regex implementation provided by JavaScript.
The difference between +
and *
is that when you use +
, it finds one or more times of a character. If you use the *
, it is zero or more times. This means it can include or skip characters that may or may not repeat.
Often, you will see a combination that looks like this: .*?
. When these three symbols are combined, it creates a wildcard. (Specifically, a lazy wildcard.)
After you have written out where to look and what to extract, make sure to put another backslash in: /pattern(captured group)pattern/
You will then want a comma between the backslash and the name of the column.
Column Name
Once you have written the rest of the function, put the column name in curly brackets {}
.
Make sure to close your expression with parentheses )
.
Examples
Below are some examples to help you get an idea of how RegexExtract works in Grow.
Example 1
- Text string:BankrateP_auction_conversions_us_refi_prospecting_dnf_na_na_na_homeowners_us_allstates_35+_allgender_na_na_impressions_man_cpm_standard_image_urla_brfb0002-2_aa_ba_ca_lm
-
Regex Function:
RegexExtract(/urla_(.*?)_aa/, {ad_name})
This is telling Grow to look between the portion urla_
and _aa
in the column titled ad_name. (Both the pattern and the captured group are bolded in the text string above.)
- Result: brfb0002-2
This function would look in the same place for each text string in the column.
Example 2
- Text string: E542 Doe, Jane
With this example, we can see that it might not always be long strings of text. But if I were building a metric of employee data, I would not want to have an employee code before each name.
So if I wanted to have it display Jane Doe
instead of E542 Doe, Jane
, I would use two different expressions in calculated columns:
-
Regex Functions:
RegexExtract(/\w+\s+(\w+)/, {Employee List})
andRegexExtract(/.*?,\s+(\w+)/, {Employee List})
The first function is looking first for \w+
, so any number and combination of letters and numbers. This matches with the E542
. Next, it looks for any number of spaces \s+
, then finally extracts what comes after that (\w+)
.
The only difference in the second function is that it begins with .*?
, which searches for any combination of really anything. However, notice that immediately after this there is a comma ,
. This is searching for a literal comma in the text string. So the result is that it would search from the beginning up to the comma: E542 Doe,
.
After that, the \s+
catches the space between Doe,
and Jane
. The function is told to extract any text after that space `(\w+).
-
Result:
Doe
andJane
Note: You can use another expression to format it as [First Name Last Name] in a new column in calculated columns.
Example 3
-
Text String:
Javascript is.the greatest.thing ever
In this example, we want to extract each word. The problem is that is.the
and greatest.thing
have periods in between them, not spaces.
Let's start with getting Javascript is
.
-
Regex Function:
RegexExtract(/(?:.*?\.){0}(.*?)(\.|$)/)
This would return Javascript is
. In the first set of parentheses, we see ?:.*?\.
. The first five symbols create a wildcard. However, the backslash and period \.
tells the function to find an actual period. The backslash \
is a type of neutralizer if you want to look for an actual symbol (in this case, the period .
). This is referred to as "escaping" a special character in Regex. So that whole pattern tells the function to look for everything in front of a period.
Next, we see a 0 in curly brackets {0}
. This tells the Regex function how many times to match the previous group. We want to match the pattern (?:.*?\.)
0 times before matching out the next pattern, (.*?)(\.|$)
. This first pattern will look for everything in front of a period, but it will match it 0 times.
In the last group of parentheses, we have this group of symbols: \.|$
. When you use a pipe |
, it functions as OR. So the pattern (\.|$)
tells the function to look for either a period after the captured group or the end of the line or text, which is indicated by the dollar sign $
.
Now let's get the string the greatest
.
-
Regex Function:
RegexExtract(/(?:.*?\.){1}(.*?)(\.|$)/)
Now that we have a 1 in curly brackets {1}
, it matches the pattern (?:.*?\.)
one time before looking for the captured group (.*?)
and the other part of the pattern (\.|$)
. This is because we now have a pattern with a period at the end (Javascript is
) in front of the group we want to capture (the greatest
), so we want to match it 1 time. So this will return the text the greatest
from our text string.
Finally, we can get the text thing ever
from our string by using the following function:
-
Regex Function:
RegexExtract(/(?:.*?\.){2}(.*?)(\.|$)/)
This time, the {2}
tells the function to match the pattern twice before looking for the captured group. So the function will look for a group that ends in a period (Javascript is
), then do it again a second time (the greatest
) before looking for the captured group with (.*?)
. Even though there is no period after the word ever
, we account for the end of the string by using $
. This returns the text thing ever
.
You can use multiple calculated columns to put this all together.
Example 4
Let's say you have a string with a combination of numbers, text, and periods, such as the following:
-
Text string:
Something.5403 Something else. 9000 The last stuff
But you just want The last stuff
.
Here's the regex function you would use:
-
Regex Function:
RegexExtract(/.*\d+\s+(.*)/)
Here's how to break it down. First, after our forward slash /
, we have .*
This tells the function to match any character (that's the period .
) zero or more times (that's the asterisk *
).
Next, the \d+
tells the function to match any digits (\d
) one or more times (+
). After that comes \s+
, which indicates that there is a space (\s
), one or more times (+
).
So we are left with the captured group, which is (.*)
. This captured group would match any characters after that space(s) zero or more times. The function finishes with the other forward slash /
, indicating that there is no more to the pattern after the captured group. This would return what is very last in the text string. In our example, this would be The last stuff
.
Regex in Find & Replace
In Find & Replace we've taken some complexity out of the transform so you don't need to know or type out the whole function. However, because regular expression is so unique, you'll still need to add regular expression per transform.
The examples above, although applied to regex_extract can still give you a great understanding of how regex works in Grow. Any portion that you would be extracting above would just be the value that you "Find" to then "Replace" with another string.
Within the Regex type in the Find & Replace transform, there is one other field you have the option to implement. This is the Flags field. This is an optional field that can be used for Regex Find & Replace. More than one flag can be selected. See the table below for each of the flags, and their meanings.
Regular Expression Flags
Option | Description |
---|---|
b |
rest of RE is a BRE |
c |
case-sensitive matching (overrides operator type) |
e |
rest of RE is an ERE |
g |
returns all matches, not just the first one. |
i |
case-insensitive matching |
m |
historical synonym for n
|
n |
newline-sensitive matching |
p |
partial newline-sensitive matching |
q |
rest of RE is a literal (“quoted”) string, all ordinary characters |
s |
non-newline-sensitive matching (default) |
t |
tight syntax (default; see below) |
w |
inverse partial newline-sensitive (“weird”) matching |
x |
expanded syntax |
For more information on pattern matching in Postgres, see their documentation here.