Clustering is here! Now what?

clb
Data Scientist

You may have heard that clustering, as an unsupervised learning approach, is now available in DataRobot. But what does clustering mean? Why would we use it? What does it really tell us about the patterns and relationships in our data? 

 

With clustering, you are looking for structures within your data without a specific outcome to be predicted (as in supervised learning approaches, which you may be more familiar with in DataRobot). Clustering algorithms divide data points into a number of groups with similar traits.
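
To make "dividing data points into groups with similar traits" concrete, here is a minimal sketch of one classic clustering algorithm, k-means, in plain Python. This is only an illustration: it is not how DataRobot's blueprints are implemented, and the `kmeans` helper and the customer data (age, monthly spend) are made up for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny k-means sketch: returns (labels, centroids)."""
    # Deterministic start: use the first k points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the grouping is stable
        centroids = new_centroids
    return labels, centroids

# Hypothetical customers: (age, monthly spend). There is no target column.
customers = [(22, 30), (25, 35), (24, 28), (58, 210), (61, 190), (55, 205)]
labels, centroids = kmeans(customers, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Notice that nothing told the algorithm which customers belong together; the two groups (younger, low spend and older, high spend) fall out of the distances alone.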

 

How is clustering used?

  • Clustering is a common approach in customer segmentation, helping marketing teams more accurately segment customers by useful features, like spending behavior and age. The result is a targeted marketing approach that reaches more specific audiences…and less wasted time and energy for marketing teams. 
  • It’s also used in medical applications such as genetic analysis, and in insurance for fraud detection.
  • Sports teams use it for prospect research.
  • Completely labeled image datasets can be incredibly time-consuming to create, especially in areas where labeling requires extensive subject matter expertise (think: pathology datasets). Kickstart your image labeling with clustering using DataRobot’s image embeddings visualization.
  • Clustering can even help you get to a supervised learning approach, where clusters serve as a proxy for possible classes in a classification task.

How do I do this in DataRobot?

It’s easy to get started: just load your data and select “No target?”. Clustering is available in Manual mode, if you want to pick particular blueprints, or in Comprehensive mode, which runs Autopilot using all available clustering algorithms.

From here, it’s on to the Leaderboard. DataRobot uses a metric called silhouette score, which measures how far apart clusters are from one another and how tightly packed the points within each cluster are.

Scores range from -1 to +1. A score of 1 would mean clusters are dense and well-separated from others. A value of 0 represents overlapping clusters, which could be a sign that the particular approach is not a good one for segmenting data points. Negative silhouette scores mean rows might have been assigned to the wrong cluster. This score is not a perfect indicator of how “clean” your clusters are, but can be a useful proxy for assessing how well each model segmented data points.
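
If you'd like to see where those numbers come from, here is a small sketch of the standard silhouette definition in plain Python. For each row, `a` is the mean distance to the other rows in its own cluster and `b` is the mean distance to the rows in the nearest other cluster; the row's value is `(b - a) / max(a, b)`. The function names and sample points are invented for illustration, and this is not DataRobot's code.

```python
import math

def silhouette_samples(points, labels):
    """Return one silhouette value per row, each between -1 and +1."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # common convention for single-row clusters
            continue
        # a: mean distance to the other rows in the same cluster
        # (the row's distance to itself is 0, so divide by len - 1).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the rows of the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in rows) / len(rows)
            for other, rows in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

def silhouette_score(points, labels):
    """Mean silhouette value over all rows."""
    values = silhouette_samples(points, labels)
    return sum(values) / len(values)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clean_labels = [0, 0, 0, 1, 1, 1]   # two tight, well-separated groups
one_misplaced = [0, 0, 0, 1, 1, 0]  # last row put in the far-away cluster

clean_score = silhouette_score(points, clean_labels)
misplaced_values = silhouette_samples(points, one_misplaced)
```

With the clean labels the score lands close to 1. In the second labeling, the last row sits much nearer to the other cluster than to its own, so its individual value is negative, which is the "assigned to the wrong cluster" case described above.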

 

Cluster Insights: What’s in there anyway?

Cluster Insights is where you can start exploring patterns in clusters, and how data points may be related by features (these can be numeric, categorical, text, images, or geospatial).

Download cluster insights as a CSV file to do some more analysis on your own.

Are your clusters looking curious? Maybe there are features in there that don’t seem relevant, or that may be driving the division of data points in a way that doesn’t quite make sense? Try creating new feature lists and re-running clustering. Instead of using every feature, try different combinations, or remove the ones you know are less relevant.
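
Here is a toy sketch of why a feature list can matter so much for distance-based grouping. The rows, the `account_number` column, and the `nearest_row` helper are all hypothetical, and the example uses raw, unscaled values; it says nothing about how DataRobot preprocesses features. It only shows how one irrelevant column can decide which rows look "similar".

```python
import math

# Hypothetical rows; "account_number" is an identifier with no real meaning.
rows = [
    {"age": 22, "monthly_spend": 30, "account_number": 9001},
    {"age": 25, "monthly_spend": 35, "account_number": 1002},
    {"age": 58, "monthly_spend": 210, "account_number": 9003},
    {"age": 61, "monthly_spend": 190, "account_number": 1004},
]

def nearest_row(rows, index, features):
    """Index of the row closest to rows[index], using only `features`."""
    def vector(row):
        return tuple(row[name] for name in features)

    target = vector(rows[index])
    others = [i for i in range(len(rows)) if i != index]
    return min(others, key=lambda i: math.dist(target, vector(rows[i])))

all_features = ["age", "monthly_spend", "account_number"]
trimmed_features = ["age", "monthly_spend"]

with_id = nearest_row(rows, 0, all_features)         # 2
without_id = nearest_row(rows, 0, trimmed_features)  # 1
```

With the identifier included, the 22-year-old low spender looks closest to a 58-year-old high spender simply because their account numbers are near each other. Drop that column and the closest match becomes the 25-year-old with similar spending, which is the grouping you would probably expect.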

 

Clustering is not for the faint of heart - it can be challenging to understand all that’s going on under the hood in how clusters are assigned and what they may be telling you about your data. But DataRobot has made these models easier to build and interpret, so embrace the ambiguity! You may discover unexpected relationships with this approach and new ways of understanding the features in your data that could lead to actionable insights.

 

Let me know in the comments how you are using clustering! Looking forward to hearing about your use cases.

@Sylvain @ivanpyzow 

3 Replies

Hi;

I've an inquiry about this:

What does this mean: 80 %Frozen parameter setting were applied to subsequent sample sizes to increase processing speed for larger datasets

Awaiting your reply

Thank you so much

Youness

jenD
DataRobot Employee

@Doctor Youness I'm looking for the place where you saw that statement to make sure I give you the right answer. With clustering, as with any project that uses a large dataset (over 1.5GB), DataRobot uses "frozen runs", which freeze parameter settings to save time and cost.

 

If that doesn't answer the question, point me to where you read this and I'll dig deeper...jen
