Suppose I measure M independent variables as functions of time. I make these measurements for N trials, so I have N trials of M variables. On each trial, the output is a single-valued scalar response, ranging from (say) zero to 100.
I want to find the relationship between the time-series data (including all possible lags) and the single-valued response.
Can someone please point me to the best similar example or tutorial so I can learn how to organize and process data?
I find tutorials on how to predict time-series from time-series, but I am looking for single valued output from time-series. I am not trying to predict the next point in the time series. Thanks
Hi @tepig ,
It sounds like you have a series-based problem (in this case a time series) with M series and N observations. It is quite common for these questions to leave it unclear whether the problem is simply a 'time-ordered regression' or a true 'time series' problem. To help determine that, please try to answer the following questions:
1. Do you want to predict N(t+1) for all M series? Or do you want to predict N(t+1) for 1-series, and have all M-1 other series as covariates that help predict the single value? I.e. How many different things are you trying to predict for one moment in time?
2. How much information from the other M-series do you know for the point in time you are predicting? Will none of the values of the other M-1 series be known for that timestamp/row? Will all be known except your target? A mixture? Will you only know the last values observed for all M-series, and just want to predict the next value for one of them?
3. Do you know or think that lags are important or relevant in this situation (are recent trends in the data for each of the M series important)? I.e. can you treat a time-stamp independently from others in the series and still have the problem make sense?
4. Is the spacing of your observations/rows regular and predictable? Is it something like a daily-spaced record, or is the time-based spacing between observations unknown/irregular?
Thank you for trying to help. I'm afraid you may have misread the original post: I do not want to predict the point t+1. I am not predicting a point in time. I am predicting an animal's response. I will give a more concrete example:
I measure M variables as functions of time. In this example, M = 3:
(1) the animal's distance from the origin as a function of time
(2) the animal's velocity as a function of time.
(3) the animal's breathing rate as a function of time
From the relationships among those three variables, and perhaps their lagged relationships to each other, I want to predict how many grams of cereal the animal will eat at the end of the day.
I run this experiment for N days, so I have N trials of M variables. For each of the N days, I have an output (grams eaten).
So I have three time varying signals, and I want to predict a single-valued response from those three variables, their relationships to each other along with their lags.
The value I want to predict is not an extrapolation of any of the M variables.
I have found several examples of predicting time series. That is not what I need to do.
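To make the shape of the data concrete, here is a minimal sketch in Python (the trial count, sample count, and random values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5    # trials (days)
M = 3    # time-varying signals: distance, velocity, breathing rate
T = 100  # time samples per trial

# One array per trial, shape (T, M): the three signals sampled over the day
trials = [rng.normal(size=(T, M)) for _ in range(N)]

# One scalar response per trial: grams of cereal eaten that day
targets = rng.uniform(0, 100, size=N)
```

So the inputs are N matrices of shape (T, M), and the target is a length-N vector, not a continuation of any of the series.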
Got it. Thank you for the additional details.
It sounds like you are using higher-frequency data (location, movement speed, biological measures) to predict something that happens over the same period of time (grams eaten by the end of the day). If so, you have raw recordings throughout each day, but you will aggregate them up to the daily level so you can also predict the target at the daily level. The nuance here is that, at the row level, you already have those measurements as rates that can be aggregated over some period of time. You could structure this at a lower level of aggregation if you had higher-frequency target records (eating events), but it sounds like you want a daily-level total.
In this case, framing this as an 'OTV' problem is the best approach if you don't care about lagged values of those aggregated data (location, movement speed, biological measures). This approach creates a time-ordered regression problem, and can be structured as follows:
1. Aggregate the dataset to the daily-level so that each row has a target value.
2. Upload data to DataRobot, set target and primary date/time features.
3. Select 'Out of Time Validation' (OTV) on the next step.
4. Now you can configure your 'backtests', which control the time-based partitioning. For this type of problem, you should assume the time order matters. For example, a mouse may grow over time and consume more food. If you randomly sort the rows of your dataset into training/validation partitions, then information from the 'future' of a fat mouse might accidentally leak to the model during training and let it know what will happen in the future.
5. Configure any other project settings, and hit 'Start'
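As a sketch of step 1, the daily aggregation might look like this in pandas (the column names, recording frequency, and target values here are hypothetical, not from your data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical raw recordings: one row per timestamp, many rows per day
raw = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=96, freq="h"),
    "distance": rng.uniform(0, 10, 96),
    "velocity": rng.uniform(0, 2, 96),
    "breathing_rate": rng.uniform(30, 80, 96),
})

# Daily target measured once per day (grams eaten) -- made-up values
target = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=4, freq="D"),
    "grams_eaten": [42.0, 55.5, 38.2, 61.0],
})

# Aggregate the raw signals up to the daily level
daily = (
    raw.assign(date=raw["timestamp"].dt.floor("D"))
       .groupby("date")[["distance", "velocity", "breathing_rate"]]
       .agg(["mean", "std", "max"])
)
daily.columns = ["_".join(c) for c in daily.columns]
daily = daily.reset_index()

# Join the daily target so each row has both features and a target value
dataset = daily.merge(target, on="date")
print(dataset.shape)  # one row per day, aggregated features + target
```

Which summary statistics to compute (mean, std, max, etc.) is a modeling choice; the point is simply that each row of the uploaded file is one day with one target value.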
But you say you do care about lags, so you may want to frame this as a 'Time Series Nowcasting' problem instead. This will instruct DataRobot to generate lagged features, while making sure not to cheat by 'leaking' the target value, or any target-derived features, from the same row.
To do this:
1. (Same as above) Aggregate the dataset to the daily-level so that each row has a target value.
2. (Same as above) Upload data to DataRobot, set target and primary date/time features.
3. Select 'Time Series modeling'.
4. Then, configure a Forecast Window of 0, 0. This tells DataRobot to predict 'now', knowing the actual values of all the other features, their lags, and the historical lags of the target value. You can set the Feature Derivation Window to control how far into the past the lags and time-based feature engineering reach. For your first project, just use the default Feature Derivation Window settings.
5. (Same as above) Configure the backtests.
6. Configure anything else about the project. DataRobot Automated Time Series projects have a few different advanced options.
7. Hit 'Start'
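DataRobot derives the lagged features automatically in this setup, but to build intuition for what a Feature Derivation Window produces, here is a hand-rolled sketch (column names and values are hypothetical) of a 2-day window over the daily-aggregated data:

```python
import pandas as pd

# Daily-aggregated dataset (hypothetical columns and values)
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=6, freq="D"),
    "distance_mean": [3.1, 2.8, 3.5, 4.0, 3.7, 3.9],
    "grams_eaten":   [42.0, 55.5, 38.2, 61.0, 50.3, 47.8],
})

# A 2-day feature-derivation window: lags of the covariate AND of past targets.
# The target only appears at lags >= 1 -- never from the row being predicted,
# which is exactly the 'leak' nowcasting guards against.
for lag in (1, 2):
    df[f"distance_mean_lag{lag}"] = df["distance_mean"].shift(lag)
    df[f"grams_eaten_lag{lag}"] = df["grams_eaten"].shift(lag)

# Rows inside the warm-up period lack full history and would be dropped
model_ready = df.dropna().reset_index(drop=True)
```

With a Forecast Window of 0, 0, each row is predicted from its own same-day features plus these historical lags, so no extrapolation of the series themselves is required.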
In both cases ('OTV' or 'Time Series Nowcasting') you will be building time-ordered models, and you will get time-based insights showing how accurate the model is over time. You can compare high-level performance by looking at the same metric across both approaches. If you configure the backtests to cover the same period of time (they should with the default settings), then you can directly compare the performance of the two approaches.
Does that make sense as two approaches for problem framing? Any additional questions on how to test this out?
I just wanted to thank you for this extensive answer. It's a lot of information to process and it's going to take me a while to sort through, so I may not be able to respond for a bit. I really appreciate the time and thoughtfulness you put into understanding the question and then coming up with several alternatives, depending on whether or not I want to include lags. Thank you!