How to add columns from CSV-data (followup question)
Hi community.

A couple of weeks ago I posted this question "What's the API endpoint for Compute Predictions?" [1]. Basically the answer was to use the endpoints described in [2] and avoid the "internal APIs".

Since then a colleague has pointed out that it's possible to add columns from the original CSV file to the prediction download, much like the Prediction API's passthroughColumns query parameter (not the model worker's prediction API). But nothing similar seems to be mentioned in [2], so how can we get predictions based on holdout data, including columns from the CSV file? Using the Chrome debugger I can see that a POST is sent to /user_id_column, but it looks like an "internal API" endpoint.


PS: Why do the HTML formatting options only appear in "Edit message", i.e. after the message has been posted to the forum?




7 Replies

Thanks @Linda, we'll keep an eye on the release notes.

@jlee - An enhancement request was created (thank you @doyouevendata !). We do not currently provide a way for users to view and track enhancement requests.

You can work with your account representative to get insight on your request. Also, the product release notes provide a great way for you to learn about new and changed features - after the release.

Hope this helps!


@doyouevendata Yes, I'd appreciate it if you could make an enhancement request. Is it possible to monitor its progress or see its priority somewhere?

Yes; it looks like you are poking around an internal API there. The links I referenced were the full public raw API as well as the Python SDK that wraps it and makes it easier to use. The functionality you are looking to use programmatically is currently only available through a deployment. I can put in an enhancement request to expose this via the Modeling API for an undeployed model on the leaderboard.

Hi @doyouevendata 

Yes, I'm looking for ways to compute and download predictions based on training data (the CSV file that was uploaded for model training). As part of that, I'd like to use the optional feature of adding columns from the training data.

I think you know what I'm looking for, but just so we're on the same page: I mean the UI feature where, in any model (in any project) -> "Predict" tab -> "Make Predictions" pane -> "Prediction Datasets" -> "Training Data", there's a drop-down next to "Compute Predictions" where I can choose All, Validation, or Holdout. Above the "Compute Predictions" button one can then add optional features, such as up to 5 columns from the original CSV data to be joined with the prediction scores, as well as a "Prediction Threshold".

In the Chrome debugger I can see that a POST to /user_id_column with request body { user_id_column: "name_of_column" } is sent to the server when adding a column via this optional feature. But I can't find it in the documentation, so I'm wondering if it's an internal API.

I have noticed that the row number is returned, so we could do the data join on our side. I've done this before, but it would still be preferable to have the system do it for us.
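For what it's worth, here is a minimal sketch of the client-side join using pandas, assuming the prediction download carries the row number back as a 0-based "row_id" column that refers to rows of the uploaded CSV. The column names and the small in-memory DataFrames (stand-ins for `pd.read_csv` on the real files) are hypothetical:

```python
import pandas as pd

# Stand-ins for the uploaded training CSV and the prediction download.
# In practice these would come from pd.read_csv; column names are hypothetical.
training = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3"],
    "region": ["north", "south", "east"],
    "target": [0, 1, 0],
})
predictions = pd.DataFrame({
    "row_id": [0, 1, 2],            # row number returned with the predictions
    "prediction": [0.12, 0.87, 0.33],
})

# Join the chosen training columns onto the scores by row order:
# the training data's positional index doubles as the join key.
passthrough = ["customer_id", "region"]
joined = predictions.merge(
    training[passthrough].reset_index().rename(columns={"index": "row_id"}),
    on="row_id",
    how="left",
)
print(joined)
```

This relies on the prediction output preserving (or reporting) the original row order; if the real download names the row column differently, adjust `"row_id"` accordingly.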


(@jlee for your use, this seems to be: "Neither did I see it in the raw API documentation for constructing this type of job here," where "here" is: )

One can request passthroughColumns on a real-time request directly to the Prediction API, or as part of the Batch Prediction API that wraps it and is used to process larger file/stream-based CSVs. These options are available after the model has been deployed.
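As a rough illustration, a real-time Prediction API request with passthroughColumns might be assembled like the sketch below. The host, deployment ID, API key, column names, and even the exact URL path are placeholders/assumptions that vary by installation, so only the URL is built here rather than sent:

```python
from urllib.parse import urlencode

# Placeholders: substitute your prediction server and deployment ID.
BASE = "https://example.orm.datarobot.com/predApi/v1.0"
DEPLOYMENT_ID = "YOUR_DEPLOYMENT_ID"

# passthroughColumns asks the prediction server to echo these input
# columns back alongside the scores (column names here are hypothetical;
# doseq=True repeats the key once per column).
query = urlencode(
    {"passthroughColumns": ["customer_id", "region"]},
    doseq=True,
)
url = f"{BASE}/deployments/{DEPLOYMENT_ID}/predictions?{query}"

headers = {
    "Content-Type": "text/csv; charset=UTF-8",
    "Authorization": "Bearer YOUR_API_KEY",  # placeholder credential
}
body = "customer_id,region,feature_1\na1,north,0.5\n"  # scoring rows as CSV

print(url)
```

A POST of `body` with those `headers` to `url` (e.g. via the requests library) would then return the scores with the requested columns passed through; check your installation's Prediction API docs for the exact path and parameter encoding.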

If I understand your question correctly, you are seeking to keep columns when scoring through an asynchronous batch file scoring request directly to a model on your leaderboard, before deployment, as can be done in the GUI with up to 5 columns. There is a Python (and R) library that wraps the raw API; however, I did not see a way to do this in the Python SDK here. Neither did I see it in the raw API documentation for constructing this type of job here.

Note that the data will be returned in row order, so even without something like a surrogate ID column to pass through and join on, the uploaded scoring dataset can be joined on row order and fields can be pulled through to create the desired result set.