Excel Data Cleaning with grouped range

I'm in the process of cleaning an Excel dataset, specifically the Average Age feature.

The current dataset has codes in the corresponding field, which are taken from a reference table.

For example, Avg Age, 1

1 means 18-28 years

I can replace each cell that contains 1 with "18 - 28 years", which makes it readable when importing to DataRobot. However, my question is: does this make the feature categorical instead of numerical? What is the best recommendation in this scenario so that DR can best analyze this feature as numerical?


Accepted Solutions

To answer your first question: if the cells in "Avg Age" contain age ranges, then the feature will most likely be treated as categorical in DataRobot. However, always double-check the data type in DataRobot to make sure that it is not labelled as "Text" or something else.
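
If you want to see the effect before uploading, here is a minimal pandas sketch of the replacement step. It only shows what happens to the column type in pandas; it does not say how DataRobot will detect the type, so the check in DataRobot still applies. The reference table is hypothetical: only the code 1 = 18-28 years comes from the question, and the second entry is an assumption for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical reference table. Only code 1 -> "18-28 years" is from the
# question; code 2 is an assumed entry added for illustration.
AGE_LABELS = {1: "18-28 years", 2: "29-39 years"}

df = pd.DataFrame({"Avg Age": [1, 2, 1]})

# Replace each code with its readable label.
df["Avg Age Label"] = df["Avg Age"].map(AGE_LABELS)

# The original codes are numeric; the labels are strings, not numbers.
codes_numeric = is_numeric_dtype(df["Avg Age"])
labels_numeric = is_numeric_dtype(df["Avg Age Label"])
print(codes_numeric, labels_numeric)  # True False
```

Once the cells hold text such as "18-28 years", the column is no longer numeric, which is why it would be expected to show up as categorical (or text) after import.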

Regarding the second question: I believe that modelling the data with the feature as a categorical value is the easiest and wisest approach. The only loss I can think of is that the algorithm will treat the groups as unordered, so each group is weighed equally against every other. This means that it will not take into account that the group "18-28" is closer to the group "29-39" than to "59-69". Depending on your case, this will probably not have a large effect. Taking the 'closeness' of the groups into account would require either custom ML models or highly specific preprocessing of the data (such as Similarity Encoding), which might do the trick.
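
If the ordering does turn out to matter, one simple preprocessing option is to replace each code with the midpoint of its age range, so the column stays numeric and keeps the distances between groups. To be clear, this is not Similarity Encoding, just a plain numeric stand-in, and it is only a sketch. The reference table and the helper name `range_midpoint` are hypothetical: code 1 = 18-28 is from the question, while the codes assigned to 29-39 and 59-69 are assumptions.

```python
import pandas as pd

# Hypothetical reference table of (low, high) age bounds per code.
# Only code 1 -> 18-28 is from the question; codes 2 and 3 are assumed.
AGE_RANGES = {1: (18, 28), 2: (29, 39), 3: (59, 69)}


def range_midpoint(code):
    """Return the midpoint of the age range for a code (hypothetical helper)."""
    low, high = AGE_RANGES[code]
    return (low + high) / 2


df = pd.DataFrame({"Avg Age": [1, 2, 3]})

# Numeric column that preserves both the order and the spacing of the groups.
df["Avg Age Midpoint"] = df["Avg Age"].map(range_midpoint)

midpoints = df["Avg Age Midpoint"].tolist()
print(midpoints)  # [23.0, 34.0, 64.0]
```

With midpoints, 18-28 (23.0) sits closer to 29-39 (34.0) than to 59-69 (64.0), which is the 'closeness' that the categorical version loses. A code missing from the reference table raises a KeyError here, so unknown codes would need handling first.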
