Pattern Matching Strategy

danielkcchan

I am using a regular expression (regex) in the column split function of Paxata. My conjecture is that it returns the shortest match rather than the longest. Let me explain with an example.

Regex: .*(AAABBB|BBB).*

Intention: to match either pattern "AAABBB" or pattern "BBB" anywhere in a string (i.e. in the values of a column).

Outcome: a value containing "AAABBB" is matched as "BBB", and "BBB" is returned in the new split column.

I was hoping that a longer match would take priority over a shorter one. I even tried placing the patterns in descending order of length within the regex, but it did not help.

Is this the expected behavior, or is there some control somewhere that I missed?

1 Reply
calamari
DataRobot Employee

Hi @danielkcchan !

Based on the parentheses in your regex, I think you're also using "Capture Groups" (so I'm assuming you have that box ticked). Capture groups in Paxata are "greedy", meaning you can't switch to "ungreedy" matching with the U flag.
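To illustrate the effect outside Paxata, here is a minimal sketch using Python's `re` module. It assumes Paxata's regex engine backtracks the way Python's does, which is not confirmed here, and the sample value `xxAAABBByy` is made up. The leading `.*` grabs as much of the string as it can, so the group is left with the shortest alternative that still lets the whole pattern match.

```python
import re

value = "xxAAABBByy"  # hypothetical column value, not from the original post

# Greedy leading .* consumes as much as possible, including the "AAA",
# so the capture group can only take "BBB".
greedy = re.search(r".*(AAABBB|BBB).*", value).group(1)

# A lazy leading .*? consumes as little as possible, so the group is
# tried at the earliest position and can take the full "AAABBB".
lazy = re.search(r".*?(AAABBB|BBB).*", value).group(1)

print(greedy)
print(lazy)
```

In Python the greedy version yields "BBB" and the lazy version yields "AAABBB". Whether Paxata accepts the lazy `.*?` form is something you would need to try; the workaround below does not depend on it.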

There is a way to achieve what you're after, though! Where the values you want to capture are substrings of another value you're capturing, my suggestion is to use a separate capture group for them and then post-process the result, either through a depivot or a compute column (the latter would be my preferred way).

See the screenshot below, where I have added another value, ABC, to demonstrate how to capture additional groups. The regex is:

.*(AAABBB|ABC)|(BBB).*

calamari_0-1618886529242.png

If you wanted to put it back in a single column afterwards you'd create a compute column:

 FIRSTNONBLANK(@AAABBB or ABC@ , @BBB@ )

calamari_1-1618886739944.png

Resulting in this:

calamari_2-1618886766831.png
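For anyone who wants to try the same two-step idea outside Paxata, here is a rough Python sketch. It is only an approximation: `first_non_blank` is a hypothetical stand-in for the FIRSTNONBLANK compute column, the sample values are invented, and `re.search` is assumed to be close enough to how the column split applies the regex.

```python
import re

# Same regex as above: group 1 captures AAABBB or ABC, group 2 captures BBB.
PATTERN = re.compile(r".*(AAABBB|ABC)|(BBB).*")


def first_non_blank(*values):
    """Return the first value that is neither None nor empty.

    Hypothetical stand-in for Paxata's FIRSTNONBLANK.
    """
    for v in values:
        if v:
            return v
    return None


def split_and_merge(value):
    """Apply the two-group regex, then merge the groups into one result."""
    m = PATTERN.search(value)
    if m is None:
        return None
    return first_non_blank(m.group(1), m.group(2))


# Invented sample values for illustration only.
samples = ["xxAAABBByy", "xxABCyy", "xxBBByy", "nothing here"]
results = [split_and_merge(s) for s in samples]
print(results)
```

With these samples the merged results are "AAABBB", "ABC", "BBB", and no match for the last value, which mirrors the single column you get after the compute step.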

 

Cheers,

Calamari