Hi, our team is trying to replicate an XGBoost model built by DataRobot using the Python/R xgboost packages, for validation purposes. For some reason, even though we start with the same data and the same parameters, we end up with different performance. Does anybody have similar experience and some insights on this?
And just to cover the bases: most xgboost blueprints in DataRobot can be exported to Java Scoring Code (assuming your license includes this). These downloadable Java executables replicate the exact DataRobot model and can be run anywhere Java is available. If your goal is to productionize DataRobot models outside DataRobot, this is an option. It sounds like you're just trying to validate the model's predictions, though, so I have some more ideas below.
It is possible to correctly match all the pre-processing steps and hyperparameters for an xgboost algorithm between R/Python and DataRobot, and still not get the same predictions. There is randomness built into these types of algorithms, for example random choices that select which rows and which features to include in each tree. In that case, the differences between the two model outputs would come down to random deviations within a margin of error. In some rows, that still may lead to noticeable prediction differences. That said, here are some steps I've taken to make sure the only differences are random:
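To make that randomness concrete, here is a minimal stdlib sketch of the per-tree row/column subsampling that gradient-boosting libraries perform internally (the `subsample` and `colsample_bytree` names are borrowed from xgboost; the sampling logic here is illustrative, not xgboost's actual implementation). The point is that two runs only agree when the seed and the sampling procedure both match:

```python
import random

def sample_rows_and_features(n_rows, feature_names, subsample=0.8,
                             colsample_bytree=0.8, seed=0):
    """Illustrative per-tree subsampling: pick a random subset of rows
    and features, as boosting libraries do when subsample/colsample < 1."""
    rng = random.Random(seed)
    rows = sorted(rng.sample(range(n_rows), int(n_rows * subsample)))
    cols = sorted(rng.sample(feature_names, int(len(feature_names) * colsample_bytree)))
    return rows, cols

features = ["f1", "f2", "f3", "f4", "f5"]
a = sample_rows_and_features(10, features, seed=42)
b = sample_rows_and_features(10, features, seed=42)
assert a == b  # identical choices under a fixed seed; a different seed generally differs
```

Even with identical seeds on both sides, two implementations will not draw the same samples unless their random number generators and sampling order also match, which is why matched hyperparameters can still yield predictions that differ within a margin of error.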
Confirm same training data and same xgboost hyperparameters (as you noted). This AI Accelerator about hyperparameter tuning provides some helpful code to inspect blueprints, their tasks, and their tasks' hyperparameter values.
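However you extract the hyperparameters (the accelerator's code, the GUI, or the API), a simple dict diff catches mismatches quickly. The parameter names and values below are hypothetical, just to show the shape of the comparison:

```python
def diff_hyperparameters(datarobot_params, local_params):
    """Return {name: (datarobot_value, local_value)} for every parameter
    that differs between the two configurations (including params present
    on only one side, which show up paired with None)."""
    keys = set(datarobot_params) | set(local_params)
    return {k: (datarobot_params.get(k), local_params.get(k))
            for k in keys
            if datarobot_params.get(k) != local_params.get(k)}

# Hypothetical values for illustration only.
dr_params = {"learning_rate": 0.05, "n_estimators": 500, "max_depth": 5,
             "subsample": 1.0, "colsample_bytree": 0.3}
my_params = {"learning_rate": 0.05, "n_estimators": 500, "max_depth": 6,
             "subsample": 1.0, "colsample_bytree": 0.3}
print(diff_hyperparameters(dr_params, my_params))  # {'max_depth': (5, 6)}
```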
Confirm that you are performing the pre-processing steps in the same way DataRobot does. For example, xgboost blueprints typically use an ordinal encoding for categorical features. By default, I believe the levels are sorted lexicographically, so you would need to sort the same way before converting to integers. For numeric missing-value imputation, confirm you're using the same median (or other) imputation values and that you are creating a missing-value indicator column (if the blueprint task does the same).
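Here is a stdlib sketch of those two steps, assuming lexicographic level ordering and a `-1` code for missing/unseen categories (both assumptions; verify the exact conventions against your blueprint's documentation):

```python
from statistics import median

def ordinal_encode(values):
    """Map categorical levels to integers after sorting the levels
    lexicographically (assumed ordering; verify against your blueprint).
    Missing/unseen values get -1, which is also an assumption."""
    levels = sorted({v for v in values if v is not None})
    mapping = {lvl: i for i, lvl in enumerate(levels)}
    return [mapping.get(v, -1) for v in values], mapping

def impute_median_with_indicator(values):
    """Median-impute a numeric column and emit a missing-value indicator."""
    med = median(v for v in values if v is not None)
    imputed = [med if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator

codes, mapping = ordinal_encode(["red", "blue", None, "green", "blue"])
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
print(codes)    # [2, 0, -1, 1, 0]
print(impute_median_with_indicator([1.0, None, 3.0, 5.0]))
# ([1.0, 3.0, 3.0, 5.0], [0, 1, 0, 0])
```

If your encoding sorts levels by frequency, or by order of appearance, instead of lexicographically, every category will map to a different integer than DataRobot's, and the trees will split differently even with identical hyperparameters.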
Confirm you're using the same optimization metric.
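Metric names don't match one-to-one between the two tools, so it's easy to compare against the wrong thing. A small lookup like the one below (left side: DataRobot leaderboard metric names; right side: xgboost `eval_metric` values) makes the translation explicit; treat it as an illustrative starting point and verify each pair against both tools' documentation:

```python
# Illustrative DataRobot -> xgboost metric name mapping (verify before use).
METRIC_MAP = {
    "LogLoss": "logloss",
    "AUC": "auc",
    "RMSE": "rmse",
    "MAE": "mae",
}

def xgb_eval_metric(datarobot_metric):
    """Translate a DataRobot metric name to an xgboost eval_metric value,
    failing loudly rather than silently comparing different metrics."""
    try:
        return METRIC_MAP[datarobot_metric]
    except KeyError:
        raise ValueError(f"No known xgboost equivalent for {datarobot_metric!r}")

print(xgb_eval_metric("LogLoss"))  # logloss
```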
If you are comparing out-of-sample predictions, like those produced when downloading training predictions from the model leaderboard, you may not be able to exactly replicate the internal cross-validation used to produce out-of-sample predictions for your 80% and 100% sample-size DataRobot models. At the very least, you'd need to make sure the training, cross-validation, and holdout partitions are all set the same way. To do this, export the DataRobot training predictions (through the GUI or the API) and get the row-level partition assignments from that output.
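Once you have those row-level assignments, the leave-one-fold-out loop looks like the sketch below. The `"Partition"` column name is hypothetical (match it to your export), and the mean-of-target "model" is just a stand-in for fitting your matched xgboost model on each training slice:

```python
def out_of_sample_predictions(rows, partition_key="Partition"):
    """For each fold, fit on all other folds and predict the held-out fold,
    mirroring how out-of-sample training predictions are produced.
    `rows` is a list of dicts carrying the exported partition label and a
    numeric target `y`; the mean predictor is a placeholder for xgboost."""
    folds = sorted({r[partition_key] for r in rows})
    preds = {}
    for fold in folds:
        train = [r for r in rows if r[partition_key] != fold]
        fit = sum(r["y"] for r in train) / len(train)  # placeholder model fit
        for i, r in enumerate(rows):
            if r[partition_key] == fold:
                preds[i] = fit
    return preds

rows = [
    {"Partition": "0", "y": 1.0}, {"Partition": "0", "y": 3.0},
    {"Partition": "1", "y": 2.0}, {"Partition": "1", "y": 4.0},
]
print(out_of_sample_predictions(rows))
# fold "0" rows are scored by a model fit only on fold "1", and vice versa
```

If your folds are assigned by your own random split rather than the exported labels, every row lands in a different training slice than it did inside DataRobot, and the out-of-sample predictions will disagree for reasons that have nothing to do with the model itself.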