Scientists conducting clinical or laboratory research often look to perfectly controlled experiments as a gold standard. Such experiments allow them to bypass statistical modeling as a method for eliminating heterogeneity (“noise”) when estimating treatment effects. Often, however, a true experiment is impossible: legal, ethical, financial, logistic, or other pragmatic roadblocks may preclude it. For instance, a researcher who wishes to explore the effects of smoking on reproduction cannot randomly assign half of a sample of individuals to begin smoking or, for that matter, to attempt reproduction.
In such situations researchers rely on observational data and employ statistical models to remove the effect of heterogeneity between “treatment” and “control” units, which is the domain of quasi-experimentation. Estimating treatment effects by quasi-experimental methods is a form of causal inference. Using statistical models this way raises a number of challenges, including the potential for bias. The people fitting the models can observe how the estimated treatment effects change as they alter the model specification, and they may (consciously or unconsciously) favor specifications that yield results matching their desired outcomes or preconceptions.
A popular solution to this dilemma is Propensity Score Matching (PSM), first proposed by Rosenbaum and Rubin in 1983 [1]. PSM is a two-stage process in which statistical modeling is conducted in the first stage only, and the dependent variable is a binary (indicator) variable representing treatment rather than the ultimate target of interest. Because the true target variable is not part of the model in any form, the modeler cannot see how specification choices move the estimated treatment effect, which removes that opportunity for bias during modeling. The first stage model instead assigns each unit a probability that it received treatment; this is the Propensity Score. In the second stage, Propensity Scores are used to match pairs of treatment and control units that were about equally likely to have received treatment, allowing for an “apples-to-apples” rather than “apples-to-oranges” comparison.
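As a rough illustration of the first stage, here is a minimal sketch in plain Python. It fits a logistic regression of the treatment indicator on a single covariate by gradient descent and returns a Propensity Score per unit. The function names, the synthetic data, and the learning-rate and epoch settings are all assumptions made for this example; they are not taken from the notebooks mentioned later in this post, and a real analysis would normally use an established modeling library.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_propensity_model(X, treated, lr=0.1, epochs=500):
    """Fit a logistic regression of the treatment indicator on covariates X
    using batch gradient descent. The outcome variable is never used here."""
    n, k = len(X), len(X[0])
    w = [0.0] * k
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * k
        grad_b = 0.0
        for xi, ti in zip(X, treated):
            # Prediction error for this unit under the current coefficients.
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - ti
            grad_b += err
            for j in range(k):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

def propensity_scores(X, w, b):
    # Estimated probability of treatment for every unit.
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

# Synthetic data (hypothetical): one covariate that raises the chance of treatment.
random.seed(0)
X = [[random.gauss(0, 1)] for _ in range(200)]
treated = [1 if random.random() < sigmoid(1.5 * x[0]) else 0 for x in X]

w, b = fit_propensity_model(X, treated)
scores = propensity_scores(X, w, b)
```

Note that the outcome of interest never appears in this code: only the covariates and the treatment indicator are used, which is the property that keeps outcome-driven specification choices out of the modeling stage.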
In some variations of PSM, Propensity Scores are also used as statistical weights. Units that have no close match are pruned from the sample after matching. Finally, experimental statistics of interest, such as the Average Effect of Treatment on the Treated (ATT), are estimated on the cleaned sample.
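The matching, pruning, and estimation steps can be sketched as follows. This is only one possible variant, chosen for brevity: greedy 1:1 nearest-neighbor matching without replacement, with a caliper of 0.05 on the Propensity Score. The function names, the caliper value, and the six-unit toy data set are assumptions for illustration.

```python
def match_with_caliper(scores, treated, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity score,
    without replacement. Treated units with no control inside the
    caliper are left unmatched (pruned)."""
    available = [i for i, t in enumerate(treated) if t == 0]
    pairs = []
    for i, t in enumerate(treated):
        if t != 1 or not available:
            continue
        # Closest remaining control unit by propensity score.
        j = min(available, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs

def att(pairs, outcome):
    """Average outcome difference (treated minus control) over matched pairs."""
    return sum(outcome[i] - outcome[j] for i, j in pairs) / len(pairs)

# Toy data (hypothetical): propensity score, treatment flag, and outcome per unit.
scores  = [0.80, 0.78, 0.30, 0.33, 0.95, 0.10]
treated = [1,    0,    1,    0,    1,    0]
outcome = [12.0, 10.0, 7.0,  6.0,  20.0, 3.0]

pairs = match_with_caliper(scores, treated)
print(pairs)                # [(0, 1), (2, 3)]
print(att(pairs, outcome))  # 1.5
```

Unit 4 (score 0.95) has no control within the caliper and unit 5 (score 0.10) is never selected, so both drop out of the cleaned sample. The ATT estimate is then the mean of the two within-pair differences, (2.0 + 1.0) / 2 = 1.5.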
Traditionally, a simple logistic regression is used for first stage modeling in PSM. Often no particular performance metrics are evaluated on that model; at most, the matched sample of treatment and control units is checked for covariate balance [2-4].
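One widely used balance diagnostic is the standardized mean difference (SMD) of each covariate between the treatment and control groups. The sketch below computes it for a single covariate before and after matching. The helper name and the age values are made up for illustration, and the 0.1 threshold mentioned in the comments is a common rule of thumb rather than a fixed standard.

```python
import math

def standardized_mean_difference(x_treated, x_control):
    """Difference in group means divided by the pooled standard deviation."""
    def mean(v):
        return sum(v) / len(v)

    def sample_var(v):
        m = mean(v)
        return sum((a - m) ** 2 for a in v) / (len(v) - 1)

    pooled_sd = math.sqrt((sample_var(x_treated) + sample_var(x_control)) / 2)
    return (mean(x_treated) - mean(x_control)) / pooled_sd

# Hypothetical covariate (age) for treated units, all controls, and matched controls.
age_treated = [50, 60, 70]
age_controls_all = [30, 40, 50]
age_controls_matched = [52, 58, 69]

smd_before = standardized_mean_difference(age_treated, age_controls_all)
smd_after = standardized_mean_difference(age_treated, age_controls_matched)

print(smd_before)  # 2.0, a large imbalance before matching
print(smd_after)   # roughly 0.04, below the 0.1 rule of thumb
```

In practice this check would be repeated for every covariate in the first stage model, since a good match on the Propensity Score does not guarantee balance on each individual variable.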
In recent years, researchers have found that better performance (i.e., better matching, which results in more accurate treatment effect estimates) can be obtained by replacing the first stage regression with machine learning (ML) algorithms and their best practices [5-13]. DataRobot makes it possible to iteratively test, select, and scale state-of-the-art ML models in a fraction of the development time traditionally required, which can help researchers estimate treatment effects more accurately than a default logistic regression allows.
PSM is ubiquitous. It is heavily used in medical and healthcare research, including the search for COVID-19 treatments [14-20].
Other types of organizations, including universities, social scientists, businesses, and nonprofits, have also come to rely on PSM for quasi-experimentally estimating the effects (sometimes referred to as ROI) of any number of programs, marketing campaigns, or policies. For instance, a restaurant chain may wish to measure the effects of alternate seating arrangements on sales. Store locations differ in many other ways (e.g., neighborhood characteristics, promotions, staff quality), so a direct comparison between them is “apples-to-oranges” and a perfectly controlled experiment is impossible. The effects can still be measured quasi-experimentally, using PSM to create an “apples-to-apples” comparison.
I've created two end-to-end demos of using DataRobot for PSM; these are available as Zepl notebooks (listed below) written in Python and R. The first provides a basic example and overview. The second compares the performance of DataRobot ML models to logistic regression in PSM, including the financial implications of DataRobot’s improved accuracy.
Note: PDF copies of the notebooks and exports of the code are also available on our Community GitHub.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41-55.
Austin PC. Assessing covariate balance when using the generalized propensity score with quantitative or continuous exposures. Stat Methods Med Res. 2019;28(5):1365-1377. doi:10/ghgjp2
Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity‐score matched samples. Stat Med. 2009;28(25):3083-3107.
Austin PC, Grootendorst P, Anderson GM. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study. Stat Med. 2007;26(4):734-753. doi:10/cm49f4
Balanescu DV, Monlezun DJ, Donisan T, et al. A cancer paradox: machine-learning backed propensity-score analysis of coronary angiography findings in cardio-oncology. J Invasive Cardiol. 2019;31(1):21-26.
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W. Double/debiased/neyman machine learning of treatment effects. Am Econ Rev. 2017;107(5):261-265.
Goller D, Lechner M, Moczall A, Wolff J. Does the estimation of the propensity score by machine learning improve matching estimation? The case of Germany’s programmes for long term unemployed. Labour Econ. Published online 2020:101855. doi:10/gg37kq
Ferri-García R, Rueda M del M. Propensity score adjustment using machine learning classification algorithms to control selection bias in online surveys. PloS One. 2020;15(4):e0231500. doi:10/gg37kj
Linden A, Yarnold PR. Using machine learning to assess covariate balance in matching studies. J Eval Clin Pract. 2016;22(6):848-854.
Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J Clin Epidemiol. 2010;63(8):826-833. doi:10/c93bhc
Westreich D, Lessler J, Funk MJ. Propensity score estimation: machine learning and classification methods as alternatives to logistic regression. J Clin Epidemiol. 2010;63(8):826. doi:10/c93bhc
Zhang Y, Chen R, Wang J, et al. Anaesthetic management and clinical outcomes of parturients with COVID-19: a multicentre, retrospective, propensity score matched cohort study. medRxiv. Published online 2020.
Biran N, Ip A, Ahn J, et al. Tocilizumab among patients with COVID-19 in the intensive care unit: a multicentre observational study. Lancet Rheumatol. Published online 2020.
Freedberg DE, Conigliaro J, Wang TC, et al. Famotidine use is associated with improved clinical outcomes in hospitalized COVID-19 patients: A propensity score matched retrospective cohort study. Gastroenterology. Published online 2020.
Geleris J, Sun Y, Platt J, et al. Observational study of hydroxychloroquine in hospitalized patients with Covid-19. N Engl J Med. Published online 2020.
Tremblay D, van Gerwen M, Alsen M, et al. Impact of anticoagulation prior to COVID-19 infection: a propensity score–matched cohort study. Blood J Am Soc Hematol. 2020;136(1):144-147.
Yuan M, Xu X, Xia D, et al. Effects of corticosteroid treatment for non-severe COVID-19 pneumonia: a propensity score-based analysis. Shock. 2020;54(5):638-643.
Rodríguez-Baño J, Pachón J, Carratalà J, et al. Treatment with tocilizumab or corticosteroids for COVID-19 patients with hyperinflammatory state: a multicentre cohort study (SAM-COVID-19). Clin Microbiol Infect. Published online 2020.