cancel
Showing results for 
Search instead for 
Did you mean: 

Filter to keep most recent record per account

Filter to keep most recent record per account

I have data with multiple rows per Account (column 1) that have different dates (column 2). I wish to filter the Paxata answerset/project so that I keep only the most recent record per account each time the answerset is refreshed with new data.            I cannot seem to find any information by searching the community.            Any help would be appreciated.                                             Here is an example where I want to keep the most recent record, the first one, and remove the rows with older records.                          Column 1, Column 2; A123, 2020-07-02T13:34:50.423;  A123, 2020-07-02T11:00:01.507 ; A123, 2020-05-02T10:30:50.123;                                  Thank you.

Labels (1)
6 Replies
shyam
DataRobot Alumni

Are there other columns other than the key column and date column in your dataset? If so, what values should be picked for them? Is it the values from the row that has the most recent date? 

Thanks for answering.  There are 10 other columns with data that I must keep with the row that I am keeping in the answerset.  Yes, there is a primary key as well, separate from the account and date.  I wanted to remove all rows except for the most recent row and keep all the 10 columns for that row.   Thank you.

@marrkr 

Thank you for providing the requested details. One approach to performing this would be to identify the latest timestamp for each primary key and then performing an inner join with the same dataset. Below are the set of steps that I followed.

shyam_0-1599229521567.png

The trick is to add a Shape sub-step in the Join step between the Import sub-step and Join sub-step. 

Of course, this could be done as a separate project as well. 

Hope that helps. Please let me know if you have further questions. 

@shyam

Thank you for the info.  I am trying that out. I think I need a bit more coaching/info for how to do the shaping.  The aggregate option defaults to Count.  what should it be?  And to confirm, I shape on the account and date columns right?  I wasn't sure if you meant to shape on the primary key?  I want the most recent date per account not the most recent date per record.  

marrkr_0-1599234931226.png

 

@marrkr 

Thanks for sharing more details. As I understand, your objective is to get the latest information about the account. To accomplish that, in the group by you need to select the Account as the column, and to get the latest date, you need the MAX of date as shown below. 

 

shyam_0-1599240733648.png

And once this step is done, in the join, the columns you need to match with are Account and Date in both datasets. 

Please reach out to your customer success representative or send an email to support@datarobot.com if you are not able to arrive at the solution.

@shyam   Thank you.