We are trying to predict who will win or lose a competition, where the competition is a bid event.
For context, think of this as bids within a school district. There are 10 schools, and each school has multiple bid events per year. Each bid event has multiple bidders. We have several years of bid results: who bid, and who won or lost. We have 'feature' data about the schools and about the bidders. Let's say there are 50 bidders who participated in 1000 bid events for the 10 schools over the past 3 years. We have all of the bids (the wins and the losses) for every bid event. Every event has exactly one winner and at least one loser (and maybe more).
What we are trying to predict is which bidder will win a bid event that has been announced. We know who the bidders are. We know everything but who will win the bid. Naturally we don't have the exact bid amount, which is not a feature of our model anyway, because we'll never know the exact bid amount before the bids are submitted on a given bid event.
Our challenge, I think, is related to the logic of correctly framing our question. I think our question is 'on this next bid event with 4 bidders, who will be the winner'.
1. Since the competition is known to be those 4 bidders, is the target column 'bidder', which is a binary yes/no answer? How do we teach the model that only 1 of the 4 can be a 'yes'?
2. Do we need to create a model for just those 4 bidders, or can our training model include the whole dataset with all 50 bidders?
3. How many records do we need to have, with those 4 bidders involved, to have a legit model? Of the 1000 bid events that have happened, are 100 records enough? 50?
I suspect that our questions reflect a lack of some fundamental understanding, and I appreciate any assistance.
Hey @ml-noob, can you clarify a few things for me?
Thanks for any extra information!
Q1 How would the prediction (of which bidder is most likely to win) be used/implemented?
A1 If I'm a bidder, I will know who the other bidders are. Bidding takes time and money. I will use the prediction to get a sense of my chance to win, so that I can make a bid / no bid decision against the competitors for this school. If I decide to bid, I will know which factors are influential in my future win/loss prediction, and I can try to influence those factors.
Q2 What type of decision might be made because of a particular prediction?
A2 To bid, or not to bid, that is the question.
Q3 Historically, do you know all of the people who might have made a bid, or just all of the people who did make a bid? Related, would you have that same information at the time of a new prediction?
A3 Historically, we will only know the bidders that did actually bid. Yes, we will have this same information at the time of prediction.
@ml-noob Okay, I think I understand the situation better. I will assume you're bidding with an estimated cost/price, which will have some effect on whether your bid wins/you get selected. Of course, if you underbid for whatever reason (get the business at a loss to gain name value, etc.), that will improve your chances of winning the bid. Let me know if I've misunderstood something here!
So there are some caveats.
If you take into account such caveats, you still may be able to produce a model that is useful.
With that out of the way, I have a simple idea to take into account how many bidders there are (including you). In a given bid where there were N bidders in total, you can impose a "null" assumption that the probability that any given bidder wins is 1/N. Then treat every bidder on every bid as a single observation with a binary outcome (win/loss). Either they won the bid or didn't. Set the "offset" to ln(1/N) for all observations/rows for a particular bid. This means that one bid may have 4 observations all with offsets of ln(1/4), and another bid may have 10 observations all with offsets of ln(1/10).
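To make the row layout concrete, here is a minimal sketch in plain Python of turning bid events into one row per bidder with a shared offset. The event records, field names, and the `to_rows` helper are made up for illustration and are not from your data. One caveat worth checking against whatever modeling tool you use: in a logistic model the offset is added on the log-odds scale, so an offset of ln(1/N) with nothing else in the model corresponds to a baseline of 1/(N+1), while ln(1/(N-1)) would give exactly 1/N. With a fitted intercept the difference may be small, but it is worth knowing.

```python
import math

# Hypothetical toy history: each bid event lists its bidders and its single winner.
events = [
    {"event_id": "E1", "bidders": ["A", "B", "C", "D"], "winner": "B"},
    {"event_id": "E2", "bidders": ["A", "C"], "winner": "A"},
]

def to_rows(events):
    """Expand each bid event into one binary win/loss row per bidder."""
    rows = []
    for ev in events:
        n = len(ev["bidders"])
        offset = math.log(1.0 / n)  # same offset for every row of this event
        for bidder in ev["bidders"]:
            rows.append({
                "event_id": ev["event_id"],
                "bidder": bidder,
                "n_bidders": n,
                "offset": offset,
                "won": int(bidder == ev["winner"]),  # binary target
            })
    return rows

rows = to_rows(events)
for r in rows:
    print(r)
```

Each event contributes exactly one row with `won == 1`, and the `offset` column is what you would hand to any logistic-regression implementation that accepts an offset term. Bidder and school features would be joined onto these rows as extra columns.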
This approach essentially controls for the number of bidders in each historical bid. For example, a particular bidder may not have won very often historically, but may also have bid only on longshot bids. I gather you're trying to derive some predictive information about which bidders win a lot, and maybe you want to avoid those bidders when possible (again, correct me if I'm wrong). The probability outputs for a particular bid event aren't guaranteed to add to 100% with this approach (because we're treating each bidder's bid as an individual binary outcome), but you can scale the resulting probability predictions to sum to 100% for each bid event on the back end.
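The back-end scaling step can be sketched like this. The raw probabilities are invented numbers, `normalize_by_event` is a hypothetical helper name, and the sketch assumes every event has at least one positive raw probability.

```python
from collections import defaultdict

def normalize_by_event(preds):
    """Rescale raw win probabilities so they sum to 1 within each bid event.

    preds: list of (event_id, bidder, raw_probability) tuples.
    """
    totals = defaultdict(float)
    for event_id, _, p in preds:
        totals[event_id] += p
    return [(event_id, bidder, p / totals[event_id]) for event_id, bidder, p in preds]

# Hypothetical raw model outputs for one announced bid event with 4 bidders.
raw = [
    ("E1", "A", 0.30),
    ("E1", "B", 0.45),
    ("E1", "C", 0.10),
    ("E1", "D", 0.05),
]

scaled = normalize_by_event(raw)
for event_id, bidder, p in scaled:
    print(event_id, bidder, round(p, 3))
```

Here the raw outputs sum to 0.90, so each one is divided by 0.90. The ranking of bidders within an event does not change, only the scale.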
Bid amount is the #1 determining factor of a win or loss on every bid. In other words, the low bid almost always wins. We do have the actual bid amounts for the historic bids. We did feature engineering to put the bids in a range, i.e., the feature is a category of $10,000 to $20,000 rather than $12,502. We will know the likely bid range during the new bid event, whereas we do not know the bid amount for the current bid event.
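For reference, the range feature described above could be built with something like the sketch below. The fixed $10,000 bucket width and the `bid_range` name are assumptions for illustration; our actual buckets may be defined differently.

```python
def bid_range(amount, width=10_000):
    """Map an exact bid amount to a fixed-width range label (width is an assumption)."""
    lower = int(amount // width) * width
    return f"${lower:,} to ${lower + width:,}"

print(bid_range(12_502))  # $10,000 to $20,000
```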
I think that the underlying assumption that historic bids were made by each bidder using the same approach (e.g., estimate costs, target X% return) is a good assumption. What we are predicting, I think, is that given the set of circumstances for this bid, such as the school and type of work to be performed and other recent criteria, which organization is most likely to win.
Bidders don't necessarily change their businesses that much over a period of a few years. It isn't clear how we'd set a meaningful time boundary on what is 'recent' data. We are currently planning on using training data from a 3-year time period. Is there another concept to apply that would allow us to limit the data in a meaningful way?
Can you help me understand the final point a little bit better? The way I'm reading it, I'm concerned that our whole use case may not be a fit for ML. That's a concern I'd like to validate a bit more. How do we know if this is a good ML application at all?
@ml-noob I assume you're referring to the final bullet point about "strength of bidder", meaning how likely a particular bidder is to win a typical bid. I actually think this is a good use case for machine learning (ML). First, some theory...
Let's say a common bidder has made 100 bids in the last 3 years (and, for simplicity, that all bids had 10 bidders). You'd expect this bidder to have won about 10 bids (1/10 * 100). If this bidder won 20 bids, then from a statistical standpoint, there's some signal that this bidder is relatively "strong". Getting into the weeds a bit, assuming a 10% chance to win each bid, there is a <1% chance this bidder would have randomly won 20 or more bids (binomial distribution). So there's signal that this is a strong bidder.
But let's say a bidder had only made 10 such bids and won 2. That's still twice as many as we would expect, but from a statistical perspective, there is a ~26% chance that this bidder randomly won 2 or more (with 10 bidders per bid). There is less signal here.
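If you want to check these two tail probabilities yourself, here is a small sketch using only the Python standard library. It assumes, as in the example, that every bid has 10 bidders and that each bid is an independent 10% chance under the "null". The `binom_tail` helper name is just for illustration.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 100 bids, 20 or more wins, 10% chance per bid: well under 1%.
p_100 = binom_tail(100, 20, 0.1)

# 10 bids, 2 or more wins, 10% chance per bid: roughly a quarter.
p_10 = binom_tail(10, 2, 0.1)

print(f"P(>=20 wins in 100 bids) = {p_100:.4f}")
print(f"P(>=2 wins in 10 bids)   = {p_10:.4f}")
```

The same win rate (20%) is strong evidence with 100 bids and weak evidence with 10, which is the point of the comparison.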
There's no right answer for "how much data is enough", but hopefully that example helps show how much more information we could get from 100 historical bids per bidder vs. just 10. If you're trying to figure out how much data to use--like how many years to look back--DataRobot can create a wide range of features for you that look back 1, 2, 3, 5, 10, etc. years into the past to derive information. Then the ML algorithms can determine which ones matter most. E.g., maybe the last 3 years is a really good indicator of how a bidder will behave in the near future, but 10 years introduces too much noise. AutoML is a good tool I think to uncover how many years you should look into the past for meaningful information.
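To give a feel for what a "lookback" feature is, here is a hand-rolled sketch of a bidder's win rate over several windows. This is not how DataRobot builds its features internally; the history records, the `win_rate` helper, and the 365-day approximation of a year are all assumptions for illustration.

```python
from datetime import date, timedelta

# Hypothetical history for one bidder: (bidder, event_date, won)
history = [
    ("A", date(2015, 5, 5), 0),
    ("A", date(2021, 3, 1), 1),
    ("A", date(2022, 6, 15), 0),
    ("A", date(2023, 2, 10), 1),
]

def win_rate(history, bidder, as_of, years):
    """Win rate over the `years` before `as_of`, or None if there were no bids.

    Only events strictly before `as_of` are used, so the feature never
    includes information from the event being predicted.
    """
    cutoff = as_of - timedelta(days=365 * years)  # a year approximated as 365 days
    outcomes = [won for b, d, won in history if b == bidder and cutoff <= d < as_of]
    return sum(outcomes) / len(outcomes) if outcomes else None

as_of = date(2023, 6, 1)  # date of the announced bid event
features = {f"win_rate_{k}y": win_rate(history, "A", as_of, k) for k in (1, 3, 10)}
print(features)
```

Feeding several windows like these into the model lets the algorithm decide which lookback carries signal, rather than you picking a single cutoff up front.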
Regarding bid amount, it would definitely be helpful to know how much your bid win probability increases as you lower your bid. Without seeing your data, I'm not sure exactly how you're doing this. I would just want to make sure that the information about bid amount range is something you would know at the time you want to make your bid. Likewise, if you want to build ML models that include information about bid amounts, make sure those values would have been known at the time of each training observation.