DataRobot (DR) currently supports word and char analyzers for text processing. Is there support for a keyword analyzer?
I am trying to treat one whole sentence as a single word, so that spaces and periods are ignored and the sentence is considered one token.
E.g. AAAA_BBB_CCCC_DDD_CCCC would be tokenized and analyzed as
"AAAA_BBB_CCCC_DDD_CCCC" instead of "AAAA".
Thanks & Regards,
Thanks for the clarification! Your use case is indeed unusual, and your approach seems innovative and a good fit for clustering.
One idea I can propose (not yet tested): represent each record as a "sentence" made of the n-grams along a branch of the hierarchy. For example, for this hierarchy the 2-gram representation would be:
"Device1_Networking2 Device1_Computer2 Networking2_Router3 Networking2_Firewall3 Networking2_Switch3". This lets text processing compare parts of your data and estimate similarity between two records. (Note that I appended the level number to each node name; you may skip that if it is not essential.)
Thank you very much for getting back.
Basically I have a hierarchical structure, which is now reduced to a linear dimension. I achieved this by treating the data as text.
p1_p2_c1_c1; p1_p2_c1_c2; p1_p2_c2_c3; p1_p3_c1_c1 etc.
Here: "p1_p2_c1_c1" is considered as one sentence. The data does not have stop words or language sensing restrictions.
Each level has 3K+ nodes, hence I ruled out converting the data points to categorical.
I have limited resources to venture into a graph DB, hence I wanted to use the clustering and NLP features of DR to drive initial insights.
May I ask why you need one sentence to be encoded in the first place? Is there more than one sentence in your text? How frequent are identical sentences in your data? Do these sentences have a human-readable meaning? If the answer to these questions is no, please consider converting the variable type to categorical. This might improve performance.
Also, if the order of words in the 'sentence' is essential, you may try to split it into separate features.
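For the split-into-features option, a rough sketch of what that could look like before upload. The helper `split_levels` and the `level1`, `level2`, ... column names are made up for illustration; it assumes the levels are separated by '_'.

```python
def split_levels(path, sep="_", prefix="level"):
    """Map one hierarchy path to a dict with one entry per level, in order."""
    return {
        f"{prefix}{depth}": node
        for depth, node in enumerate(path.split(sep), start=1)
    }


row = split_levels("p1_p2_c1_c2")
print(row)
# {'level1': 'p1', 'level2': 'p2', 'level3': 'c1', 'level4': 'c2'}
```

Each key would become its own column, so the position of a node in the path is kept explicitly.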
Otherwise your approach with '_' should work as you expect; DataRobot shouldn't split that into smaller words.
Feel free to ask if more explanations are needed.
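To illustrate why the '_' approach tends to hold together: many word-level tokenizers are built on a `\w+`-style regular expression, and `\w` matches underscores as well as letters and digits. The sketch below uses plain Python `re`; that DataRobot's word analyzer behaves the same way is an assumption here, not something verified.

```python
import re

# \w matches letters, digits and the underscore, so an underscore-joined
# string is never split by a \w+ word pattern.
WORD_PATTERN = re.compile(r"\w+")


def word_tokens(text):
    """Return the word-level tokens that a \\w+ pattern would produce."""
    return WORD_PATTERN.findall(text)


print(word_tokens("AAAA_BBB_CCCC_DDD_CCCC"))
# ['AAAA_BBB_CCCC_DDD_CCCC']
print(word_tokens("AAAA BBB. CCCC"))
# ['AAAA', 'BBB', 'CCCC']
```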
Great question. So, out of the box you can set the number of tokens to consider in many text/NLP-enabled BPs using an Advanced Tuning parameter, which will be named "ngram" or similar. However, your case is a little different since you want 1 entire sentence to be considered a token.
Is it just that 1 specific sentence or is it all sentences that should be tokens?
If it's just 1, then I would probably handle it by preprocessing. I would just replace the spaces and punctuation, probably with underscores as in your example.
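Something like this would do that preprocessing before the data is uploaded. It is only a sketch: `sentence_to_token` is a hypothetical helper name, and collapsing every run of spaces and punctuation into a single underscore is one choice among several.

```python
import re


def sentence_to_token(sentence):
    """Collapse spaces and punctuation into underscores so the sentence is one token."""
    # Replace each run of non-word characters (spaces, periods, commas, ...)
    # with a single underscore.
    joined = re.sub(r"[^\w]+", "_", sentence.strip())
    # Drop underscores left at either end, e.g. from a trailing period.
    return joined.strip("_")


print(sentence_to_token("AAAA BBB. CCCC DDD"))
# AAAA_BBB_CCCC_DDD
print(sentence_to_token("Hello, world."))
# Hello_world
```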
OTOH if you want all sentences to be tokens I think we can tackle that using either a Blueprint (BP) with spaCy or a custom BP. I think you reached out to schedule some time with me already, so I will give you an example at that time if that would be useful.