From the Python API, how does one specify the optional features for a prediction run?
I mean the field in the web GUI labeled "Optional features (0 of 5)".
which says, essentially, that it absolutely cannot be done. Please do not tell me that the DataRobot interface is that stupid. What I need is an API call that allows me to specify the optional features because I am not in a position to rely on the order that DataRobot and SQL happen to think the data should be in this time. Do I have to have the code pop up the DR Web GUI and say "please human enter this code"? That would be enough for me to vote against renewal of the service contract. FWIW.
I see your concern about getting features back along with the predictions. The reason for not returning features is to reduce the amount of data transmitted over the internet and to reduce latency. You already have all the data in your code, so it should be easy to add one more column holding the predictions. If you are concerned about row order, you can check it with the "rowId" returned along with the predictions and join on it.
Let me know if you need support in coding that.
The excuse that DataRobot is saving on transmission bandwidth is lame. Even just one extra column with a single integer that I control would be enough - and without that information it makes things rather hard. And that should be my choice, not DataRobot's. I could change the query so that I am absolutely certain that it has a very specific order - but then how can I be sure that DR will respect that order when it downloads it for the predictions? SQL relations have no inherent order except on output.
Also, you can do it through the Web GUI, so again - this is a lame excuse. I am only asking to do through the API what I can already do through the GUI.
But, I take it that you are actually saying that DR is that lame. Wow!
@dustin.burke can you confirm this limitation on the DR API ?
As I mentioned above, that extra column with a single integer is "rowId", which preserves your order. This column applies to any data and any project, so it does not need to be specified separately.
So, I should upload the data into DR for prediction and then DR will attach rowid. And then DR will link the predictions to the data by rowid. So, I download the predictions and download the data, both of which have the DR row number, and then join on rowid.
That sounds plausible.
Is that your advice?
JFTR - That does not seem to lead to any saving on bandwidth.
I need to know the API you are using to give concrete advice.
If you are using the real-time Python API (the most popular one), then you create a pandas data frame for prediction, get the predictions, and simply set a new column equal to the predictions (because DataRobot keeps the order).
If you request the predictions and receive them in different environments, then before predicting you save the data, with all the columns you need, together with a row index generated from 0 with step 1. When you read the predictions, you join them to the data you saved by "rowId".
In both situations, DataRobot doesn't need to send column values back to your environment.
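For the second case, a minimal pandas sketch might look like the following. It only illustrates the join itself: the data frames are made-up stand-ins, `record_key` and `feature_a` are hypothetical column names, and it assumes the predictions come back with a column literally named `rowId` that counts rows from 0, as described above. No DataRobot call is shown.

```python
import pandas as pd

# Hypothetical scoring data; in practice this would be the query result.
scoring = pd.DataFrame({
    "record_key": [1042, 1007, 1099],  # caller-controlled key (hypothetical name)
    "feature_a": [0.3, 1.2, 0.7],
})

# Before predicting: store an explicit row index starting at 0 with step 1.
scoring = scoring.reset_index(drop=True)
scoring["rowId"] = range(len(scoring))

# Stand-in for the predictions read back later, deliberately out of order.
# The column name "rowId" is an assumption about the returned data.
predictions = pd.DataFrame({
    "rowId": [2, 0, 1],
    "prediction": [0.91, 0.12, 0.55],
})

# Join on rowId rather than relying on positional order.
# validate="one_to_one" raises if either side has duplicate rowId values.
joined = scoring.merge(predictions, on="rowId", how="left", validate="one_to_one")
```

Joining on the stored index means the order in which the predictions arrive no longer matters, as long as the row numbering on both sides refers to the same rows.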
I was using that basic approach, when I uploaded from a csv file.
But, I am now using a dynamic query from Snowflake.
So, my concern is this: previous developers at my place of work found that, since SQL can return results in indeterminate order, associating rows by their order is unreliable. This was presented to me, when I started, as a big problem that was fixed by adding a de facto key to the prediction explanations data. The very key that I cannot (apparently) set using the API. Which is still a weird oversight that should be fixed - regardless of whether there is a workaround.
I have modified the SQL queries so that they should return a deterministic order. But, it is unclear to me that DataRobot will respect that order, and that is certainly not something that I am comfortable taking the word of DataRobot about. I would much rather be able to cross check that by having my own key exist in the data.
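To make the cross check concrete, here is a rough sketch of what it could look like if the key did come back with the predictions. That is an assumption: nothing in this thread confirms an API option that returns it. The function name `find_order_mismatches` and the column name `record_key` are hypothetical, and the data frames are invented examples.

```python
import pandas as pd

def find_order_mismatches(saved, returned, key="record_key"):
    """Return rows where the rowId association disagrees with the caller's key."""
    merged = saved.merge(
        returned,
        on="rowId",
        how="outer",  # keep rows that exist on only one side
        suffixes=("_saved", "_returned"),
        indicator=True,  # adds a "_merge" column: both / left_only / right_only
    )
    bad = (merged["_merge"] != "both") | (
        merged[f"{key}_saved"] != merged[f"{key}_returned"]
    )
    return merged[bad]

# What was saved before prediction: own key plus a 0-based row index.
saved = pd.DataFrame({"rowId": [0, 1, 2], "record_key": [10, 20, 30]})

# Hypothetical returned data where the order was respected.
in_order = pd.DataFrame({
    "rowId": [0, 1, 2],
    "record_key": [10, 20, 30],
    "prediction": [0.1, 0.2, 0.3],
})

# Hypothetical returned data where two rows were swapped.
swapped = pd.DataFrame({
    "rowId": [0, 1, 2],
    "record_key": [10, 30, 20],
    "prediction": [0.1, 0.3, 0.2],
})

ok = find_order_mismatches(saved, in_order)
problems = find_order_mismatches(saved, swapped)
```

An empty result means every rowId maps to the same key on both sides. Any rows returned are ones where the order was not preserved or a row is missing on one side.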
I am using the only Python library I am aware of for this purpose: "datarobot".
Did you try the Snowflake manual on integration with DataRobot? We have put a lot of effort into integrating with the Snowflake environment to make your experience easy and code-free.
Please let us know if there are any issues with ensuring the order of predictions returning to Snowflake or any troubles with your use-case using it.