From the Python API, how does one specify the optional features for a prediction run?
The item in the field on the web GUI "Optional features (0 of 5)".
which says, essentially, that it absolutely cannot be done. Please do not tell me that the DataRobot interface is that stupid. What I need is an API call that allows me to specify the optional features because I am not in a position to rely on the order that DataRobot and SQL happen to think the data should be in this time. Do I have to have the code pop up the DR Web GUI and say "please human enter this code"? That would be enough for me to vote against renewal of the service contract. FWIW.
I see your concern about getting features back along with the predictions. The reason for not returning features is to reduce the amount of data transmitted over the internet and to reduce latency. Since you already have all the data in your code, it should be easy to add one more column holding the predictions. If you are concerned about ordering problems, you can check the order with the "rowId" provided along with the predictions and join on it.
Let me know if you need support in coding that.
The excuse that Datarobot is saving on transmission bandwidth is lame. Even just one extra column with a single integer that I control would be enough - and without that information it makes things rather hard. And that should be my choice, not Datarobot's. I could change the query so that I am absolutely certain that it has a very specific order - but then how can I be sure that DR will respect that order when it downloads it for the predictions? SQL relations have no inherent order except on output.
Also, you can do it through the Web GUI, so again - this is a lame excuse. I am only asking to do through the API what I can already do through the GUI.
But, I take it that you are actually saying that DR is that lame. Wow!
@dustin.burke can you confirm this limitation on the DR API ?
As I mentioned above, that extra column with a single integer is "rowId", which ensures your order. This column applies to any data and any project, so it doesn't need to be specified separately.
So, I should upload the data into DR for prediction and then DR will attach rowid. And then DR will link the predictions to the data by rowid. So, I download the predictions and download the data, both of which have the DR row number, and then join on rowid.
That sounds plausible.
Is that your advice?
JFTR - That does not seem to lead to any saving on bandwidth.
I need to know the API you are using to give concrete advice.
If you are using the real-time Python API (the most popular one), then you create a pandas data frame for prediction, get the predictions, and just set a new column equal to the predictions (because DataRobot keeps the order).
If you request predictions and receive them in different environments, then before prediction you save the data, with all the columns you need, together with a row index that starts at 0 and increases by 1. When you read the predictions, you join them to the data you saved on "rowId".
In both situations, DataRobot doesn't need to send column values back to your environment.
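The join described above can be sketched in pandas roughly as follows. This is only a minimal illustration, not DataRobot's own code: the frames, the "customer_key" column and the prediction values are all invented, and the predictions frame is a hand-built stand-in for what a prediction job returns. The column is called "row_id" here, as in the Python client output shown later in this thread.

```python
import pandas as pd

# Hypothetical scoring data; "customer_key" stands in for a key you control.
scoring = pd.DataFrame(
    {"customer_key": ["a17", "b42", "c03"], "age": [34, 51, 28]}
).reset_index(drop=True)  # index is now 0, 1, 2
scoring.index.name = "row_id"

# Stand-in for the frame returned by a prediction job (values invented,
# rows deliberately shuffled to show that the join does not rely on order).
preds = pd.DataFrame({"row_id": [2, 0, 1], "prediction": [0.91, 0.12, 0.55]})

# Join on row_id; validate= raises if either side has duplicate row ids.
joined = scoring.reset_index().merge(
    preds, on="row_id", how="left", validate="one_to_one"
)

# Fail loudly if any saved row did not receive a prediction.
assert not joined["prediction"].isna().any()
print(joined)
```

The join only guarantees that each row_id is matched once. It still assumes that row_id reflects the row order of the data that was uploaded.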
I was using that basic approach, when I uploaded from a csv file.
But, I am now using a dynamic query from snowflake.
So, my concern is that previous developers at my place of work have found that, since SQL can return results in indeterminate order, the association using the order of the rows is unreliable. This was presented to me, when I started, as a big problem that was fixed by us adding a de facto key to the prediction explanations data. The very key that I cannot (apparently) set using the API. Which is still a weird oversight that should be fixed, regardless of whether there is a workaround.
I have modified the SQL queries so that they should return a deterministic order. But, it is unclear to me that Datarobot will respect that order, and that is certainly not something that I am comfortable taking the word of Datarobot about. I would much rather be able to cross check that by having my own key exist in the data.
I am using the only Python library I am aware of for this purpose: "datarobot".
Did you try the Snowflake manual on integration with DataRobot? We have put a lot of effort into integrating with the Snowflake environment to make your experience easy and code-free.
Please let us know if there are any issues with ensuring the order of predictions returning to Snowflake or any troubles with your use-case using it.
I will look into this.
But why would I want my "experience" to be code free? Code is easy. It's dealing with GUIs that is a problem.
"Best code is no code at all" (c) Jeff Atwood, co-founder of Stack Overflow and Stack Exchange.
We tried to solve as many problems and edge cases as possible on our side, along with ensuring stability and reliability.
@Bogdan Tsal-Tsalko "best code is no code" is an oxymoron. It is not a deep concept. It just means the person who said it is no good at code. How about "The best GUI is no GUI". Fashions change but what you are telling me is that the DR API is intended to discourage people from using it at all.
@Eu Jin This was definitely not the message I got in my first meeting about the functionality of the DR API.
DataRobot is built around automation, so DataRobot is made to reduce code, and the DR API is made to give you access to all the in-app operations through code. Uploading predictions is a routine we automated for our users. You had a question about trouble aligning predictions back; now you have two options: write the code the way you want it, or use what is automated.
Hey @Bruce, I decided to come up with a solution https://github.com/calamarif/datarobot_gui_to_code (please read on before you click on it), because when I presented this to a talented colleague (@Lukas), he said "that's great, but hold my beer" (slight misrepresentation of his words), and asked why I didn't just do this:
    import datarobot as dr
    import pandas as pd

    p = dr.Project.get('6242f741974efc8f60e26fcf')
    prediction_data = pd.read_excel('/Users/lukas.innig/DataRobot/Datasets/10k_diabetes.xlsx')[:100]
    ds = p.upload_dataset(prediction_data)
    # get_models() returns a list of models, so pick one to score with
    m = p.get_models()[0]
    passthrough_cols = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id']
    pred_job = m.request_predictions(ds.id)
    preds = pred_job.get_result_when_complete()
    # attach the passthrough columns to the predictions by row position
    preds.set_index('row_id').join(prediction_data[passthrough_cols])
Very straightforward and pretty much what I tried first.
If you try to join on row_id you have the problem that SQL is not deterministic about order, so if you call the query twice and just join on row number, then the predictions can get attached to the wrong rows. At best you have to download to CSV first, and attach to that. But, then that forces me to use a CSV and handle the ensuing datatype problems. I did implement this, but it was unpopular at my work place. My source is a Snowflake database. I need to work directly through that.
But, the predictions download does not include data fields except those listed by hand in the Web GUI, and even then -- only when the download is done through the GUI, and not when done through the API. Apart from this being my experience - I also got official confirmation of this through my Datarobot contact, who has said that they will put in a request for a modification to the API that will allow me to do this through the API.
And, admittedly, I just don't like the idea of joining the two tables back together using row numbers. Any number of things could go wrong on the Datarobot end and cause massive havoc. Even just knowing that DR uses the same order as the query in any case at all is problematic (I could not find an assurance of this in the documentation). I would rather have a linking field that I personally supplied.
I do acknowledge that I, myself, have downloaded the predictions later rather than performing a blocking wait for the job, but another developer who mentioned the problem to me was using code that did wait for the job. So, I assume that that call produces the same download (without the extra fields).
Addendum - I just tried it, and can confirm that the "extra" fields were not supplied in the download using get_result_when_complete().
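For what it is worth, the double-query hazard described above can be avoided without going through a CSV by running the query exactly once, with an explicit ORDER BY on a unique key, and keeping that one result in memory for both the upload and the join. The sketch below is only an illustration under stated assumptions: an in-memory sqlite3 database stands in for Snowflake, the table and the "patient_key" column are invented, and score() is a hypothetical placeholder for the upload and prediction step, not a DataRobot call.

```python
import sqlite3

# In-memory sqlite3 database as a stand-in for the Snowflake source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_key TEXT PRIMARY KEY, age INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)", [("p3", 61), ("p1", 45), ("p2", 30)]
)

# Run the query exactly once, ordered by a unique key, and keep the result.
snapshot = conn.execute(
    "SELECT patient_key, age FROM patients ORDER BY patient_key"
).fetchall()


def score(rows):
    """Hypothetical stand-in for upload + prediction.

    Returns (row_id, prediction) pairs, where row_id is the row position.
    """
    return [(i, round(age / 100, 2)) for i, (_, age) in enumerate(rows)]


predictions = dict(score(snapshot))

# Every row of the snapshot must have exactly one prediction.
assert sorted(predictions) == list(range(len(snapshot)))

# Map each prediction back to the key taken from the same snapshot.
keyed = {snapshot[row_id][0]: pred for row_id, pred in predictions.items()}
print(keyed)
```

This removes only the mismatch caused by querying twice. It still relies on the returned row ids matching the order of the rows that were uploaded, which is exactly the assurance in question here.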
Ah got it, thanks for the explanation @Bruce.
I made some changes to the solution I posted on GitHub to use the AI Catalog (instead of a local file), which I think will solve your problem - https://github.com/calamarif/datarobot_gui_to_code
Please let me know how you go, would be keen to get your feedback.