Keyword Analyzer

Keyword Analyzer

Hi Team,

 

Current DR is supporting word and char analyzer for text processing. Is there a keyword analyzer support ?

 

I am trying to address a problem, where I am attempting to treat one whole sentence as one word. Such that the spaces or period is neglected and considered one whole word.

 

 

eg: AAAA_BBB_CCCC_DDD_CCCC  this would be tokenized and analyzed as 

"AAAA_BBB_CCCC_DDD_CCCC" instead of "AAAA".

 

MRK_0-1672252822372.png

 

 

Thanks & Regards,

Kusuma

 

 

 

 

Labels (2)
2 Solutions

Accepted Solutions
jas0n
Data Scientist
Data Scientist

Hi Kusuma,

Great question. So, out of the box you can set the number of tokens to consider in many text/NLP-enabled BP's using an Advanced Tuning parameter which will be named "ngram" or similar. However, your case is a little different since you want 1 entire sentence to be considered a token.

Is it just that 1 specific sentence or is it all sentences that should be tokens?

If it's just 1, then I would probably handle it by preprocessing. I would just replace the spaces and punctuation, probably with underscores as in your example. 

OTOH if you want all sentences to be  tokens I think we can tackle that using either a Blueprint (BP) with spaCy or a custom BP. I think you reached out to schedule some time with me already, so I will give you an example at that time if that would be useful.

Thanks,
Jason

View solution in original post

0 Kudos

Hi Kusuma!

May I ask you why you need one sentence to be encoded in first place? Are there more than one sentence in your text? How frequent are same sentences in your data? Do this sentence have humanreadable meaning? If answer to this questions is no - please consider converting variable type to categorical. This might improve performance.
Also if order of words in 'sentence' is essential you may try to split it into separate features.
Otherwise your approach with '_' should work as you expect, DataRobot shouldn't split that into smaller words.

Feel free to ask if more explanations are needed. 

View solution in original post

0 Kudos
4 Replies
jas0n
Data Scientist
Data Scientist

Hi Kusuma,

Great question. So, out of the box you can set the number of tokens to consider in many text/NLP-enabled BP's using an Advanced Tuning parameter which will be named "ngram" or similar. However, your case is a little different since you want 1 entire sentence to be considered a token.

Is it just that 1 specific sentence or is it all sentences that should be tokens?

If it's just 1, then I would probably handle it by preprocessing. I would just replace the spaces and punctuation, probably with underscores as in your example. 

OTOH if you want all sentences to be  tokens I think we can tackle that using either a Blueprint (BP) with spaCy or a custom BP. I think you reached out to schedule some time with me already, so I will give you an example at that time if that would be useful.

Thanks,
Jason

0 Kudos

Hi Kusuma!

May I ask you why you need one sentence to be encoded in first place? Are there more than one sentence in your text? How frequent are same sentences in your data? Do this sentence have humanreadable meaning? If answer to this questions is no - please consider converting variable type to categorical. This might improve performance.
Also if order of words in 'sentence' is essential you may try to split it into separate features.
Otherwise your approach with '_' should work as you expect, DataRobot shouldn't split that into smaller words.

Feel free to ask if more explanations are needed. 

0 Kudos

Thank you very much for getting back.

 

Basically I have a hierarchical structure, which is now reduced to linear dimension. I achieved it so by treating the data as Text.

 

p1_p2_c1_c1; p1_p2_c1_c2; p1_p2_c2_c3; p1_p3_c1_c1 etc.

 

Here:  "p1_p2_c1_c1" is considered as one sentence.  The data does not have stop words or language sensing restrictions.  

 

Each level has 3K+ nodes. Hence ruled out converting the data points into Categorical. 

I have limited resource to venture into graph DB, hence wanted to utilize clustering and NLP features of DR to drive initial insights. 

 

MRK_0-1672276054867.png

 

0 Kudos

Thanks for your clarification! Your use case is unusual indeed, and your approach seems to be innovative and good fit for clustering.

As an idea I may purpose (not yet tested) you may try to target this problem as sentences of n-gram sentences of a branch for example for this hierarchy 2-gram  representation will be:
"Device1_Networking2 Device1_Computer2 Networking2_Router3 Networking2_Firewall3 Networking2_Switch3". This will enable text processing to compare parts of your data and create some similarity estimations between 2 records (notice that I add level to record, you may skip it if it is not that essential.)

1_PsV9Y5RxPtAKEWjFsYUHeQ

0 Kudos