You may have heard that clustering, as an unsupervised learning approach, is now available in DataRobot. But what does clustering mean? Why would we use it? What does it really tell us about the patterns and relationships in our data?
With clustering, you are looking for structures within your data without a specific outcome to be predicted (as in supervised learning approaches, which you may be more familiar with in DataRobot). Clustering algorithms divide data points into a number of groups with similar traits.
How is clustering used?
How do I do this in DataRobot?
It’s easy to get started: just load your data and select “No target?”. Clustering is available in Manual mode, if you want to pick particular blueprints, or in Comprehensive mode, which runs Autopilot using all available clustering algorithms.
From here, it’s on to the Leaderboard. DataRobot uses a metric called silhouette score, which measures how far apart the clusters are from one another and how tightly the points within each cluster are packed together.
Scores range from -1 to +1. A score near +1 means clusters are dense and well separated from one another. A value near 0 indicates overlapping clusters, which could be a sign that the particular approach is not a good one for segmenting your data points. Negative silhouette scores mean rows might have been assigned to the wrong cluster. This score is not a perfect indicator of how “clean” your clusters are, but it can be a useful proxy for assessing how well each model segmented the data points.
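To make the score ranges above more concrete, here is a minimal sketch of the textbook silhouette calculation in plain Python. It is not DataRobot's implementation, and the function names, the toy points, and the Euclidean distance choice are all assumptions made for illustration. For each row, a is the mean distance to the other rows in its own cluster, b is the smallest mean distance to the rows of any other cluster, and the row's score is (b - a) / max(a, b).

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_scores(points, labels):
    """Return one silhouette value per point (textbook definition)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


def mean_silhouette(points, labels):
    """Average the per-point values into a single score for the clustering."""
    values = silhouette_scores(points, labels)
    return sum(values) / len(values)


# Toy data (hypothetical): two tight groups that sit far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
well_separated = [0, 0, 0, 1, 1, 1]
one_misassigned = [0, 0, 0, 1, 1, 0]  # last row placed in the wrong cluster

print("well separated:", mean_silhouette(points, well_separated))
print("one misassigned:", mean_silhouette(points, one_misassigned))
print("misassigned row:", silhouette_scores(points, one_misassigned)[5])
```

In this toy case the clean assignment scores close to +1, while the single misassigned row gets a negative value of its own and drags the overall average down, which matches the interpretation described above.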
Cluster Insights: What’s in there anyway?
Cluster Insights is where you can start exploring patterns in clusters, and how data points may be related by features (these can be numeric, categorical, text, images, or geospatial).
Download cluster insights as a CSV file to do some more analysis on your own.
Are your clusters looking curious? Maybe there are features in there that don’t seem relevant, or that may be driving the division of data points in a way that doesn’t quite make sense? Try creating new feature lists and re-running clustering. Instead of using every feature, try different combinations or remove the ones you know are less relevant.
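As a rough illustration of why trimming a feature list can help, the sketch below scores the same cluster assignment twice with a textbook silhouette calculation: once with every column, and once without a column that is unrelated to the grouping. Everything here is hypothetical (the feature names, the values, the helper functions), and it simplifies things by keeping the labels fixed, whereas re-running clustering in DataRobot would produce new cluster assignments.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean_silhouette(rows, labels):
    """Mean silhouette; assumes at least 2 clusters, each with at least 2 rows."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    total = 0.0
    for row, label in zip(rows, labels):
        own = groups[label]
        # The sum includes the row itself (distance 0), so divide by n - 1.
        a = sum(dist(row, other) for other in own) / (len(own) - 1)
        b = min(
            sum(dist(row, other) for other in members) / len(members)
            for lab, members in groups.items()
            if lab != label
        )
        total += (b - a) / max(a, b)
    return total / len(rows)


def select_features(rows, names, keep):
    """Build a 'feature list' by keeping only the named columns."""
    idx = [names.index(name) for name in keep]
    return [tuple(row[i] for i in idx) for row in rows]


# Hypothetical data: two meaningful features plus one unrelated, wide-ranging column.
feature_names = ["spend", "visits", "row_id_noise"]
rows = [
    (1.0, 2.0, 90.0),
    (1.5, 1.8, 10.0),
    (1.2, 2.2, 55.0),
    (8.0, 9.0, 15.0),
    (8.5, 9.5, 80.0),
    (7.8, 8.7, 45.0),
]
labels = [0, 0, 0, 1, 1, 1]

all_score = mean_silhouette(rows, labels)
trimmed_score = mean_silhouette(
    select_features(rows, feature_names, ["spend", "visits"]), labels
)
print("all features:", all_score)
print("without noise column:", trimmed_score)
```

With the unrelated column included, distances are dominated by noise and the grouping looks muddled; without it, the same two groups come out clearly separated.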
Clustering is not for the faint of heart - it can be challenging to understand all that’s going on under the hood in how clusters are assigned and what they may be telling you about your data. But DataRobot has made these models easier to build and interpret, so embrace the ambiguity! You may discover unexpected relationships with this approach and new ways of understanding the features in your data that could lead to actionable insights.
Let me know in the comments how you are using clustering! Looking forward to hearing about your use cases.
I've an inquiry about this:
What does this mean: "80% Frozen parameter settings were applied to subsequent sample sizes to increase processing speed for larger datasets"?
Awaiting your reply
Thank you so much
@Doctor Youness I'm looking for the place where you saw that statement to make sure I give you the right answer. With clustering, as with any project that uses a large dataset (over 1.5GB), DataRobot uses "frozen runs" to freeze parameter settings, which saves time and cost.
If that doesn't answer the question, point me to where you read this and I'll dig deeper...jen