Background: Real-world evidence is increasingly used to generate evidence of effectiveness and is maturing to support regulatory decisions. It is recognised that clinical trial populations differ from patients receiving treatment in real-world settings, which can be problematic when comparing data from a clinical trial with a real-world comparator arm. Although restricting the real-world comparator arm can make it more comparable to the trial arm, residual confounding might remain. Propensity score (PS) methods are increasingly being used to account for differences between non-randomised treatment and control arms, particularly because they provide statistical efficiency for endpoint adjustment in oncology trials. A variety of PS methods are available to researchers, and the choice of method is often specified a priori.
Objectives: The objective of this paper is to outline some of the PS methods available for endpoint adjustment when using real-world data to build external control arms and, using an example clinical trial dataset and real-world data, to demonstrate the variation in point estimates for overall survival (OS).
Methods: Patient data from the Flatiron advanced NSCLC dataset were compared with Project Data Sphere clinical trial data for study NCT00457392 (sunitinib plus erlotinib versus erlotinib alone) to compare overall survival on erlotinib. Propensity scores and weights were generated with the R MatchIt and WeightIt packages, using age, sex, race, number of prior treatments and prior bevacizumab as covariates. OS was estimated using Kaplan-Meier and Cox proportional hazards methods, stratified by dataset.
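As an illustration of how the two weighting schemes named in this abstract differ, the following minimal Python sketch computes weights from already-estimated propensity scores. It is not the authors' R code. It assumes that the propensity score is the estimated probability of belonging to the clinical trial arm, that "weighting by odds" leaves trial patients at weight 1 and weights external patients by ps / (1 - ps), and that IPTW uses 1 / ps and 1 / (1 - ps). The function names and the example values are hypothetical.

```python
import numpy as np


def iptw_weights(ps, in_trial):
    """Inverse probability of treatment weights.

    Assumption: ps is the estimated probability of being in the trial arm.
    Trial patients get 1 / ps, external patients get 1 / (1 - ps).
    """
    ps = np.asarray(ps, dtype=float)
    in_trial = np.asarray(in_trial, dtype=bool)
    return np.where(in_trial, 1.0 / ps, 1.0 / (1.0 - ps))


def odds_weights(ps, in_trial):
    """Weighting by odds.

    Assumption: trial patients keep weight 1, external patients are
    weighted by the odds ps / (1 - ps) of being in the trial arm.
    """
    ps = np.asarray(ps, dtype=float)
    in_trial = np.asarray(in_trial, dtype=bool)
    return np.where(in_trial, 1.0, ps / (1.0 - ps))


# Hypothetical propensity scores: one trial patient, two external patients.
ps = np.array([0.8, 0.5, 0.2])
in_trial = np.array([True, False, False])

print(iptw_weights(ps, in_trial))
print(odds_weights(ps, in_trial))
```

Under these assumptions, an external patient who looks unlike the trial population (low propensity score) is down-weighted by odds weighting but not by IPTW, which is one reason the two schemes can give different OS estimates.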
Results: Data were extracted for 477 patients who formed the control arm of the clinical trial and for 929 patients from Flatiron. After restricting the Flatiron cohort by stage, ECOG, and starting erlotinib prior to 2013, 155 patients were eligible. Median OS was 251 days (95% CI: 224, 291) in the clinical trial arm and 216 days (188, 347) in the unadjusted Flatiron arm. Median OS in the Flatiron arm was 216 days (188, 347) after PS matching using nearest neighbour without replacement, and 195 days (140, 482) after PS matching using nearest neighbour with a caliper of 0.5 and replacement. PS weighting of the Flatiron arm resulted in a median OS of 281 days (176, 809) using weighting by odds and 275 days (176, 640) using IPTW.
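To make the matching variant with a caliper and replacement concrete, the sketch below implements a simple nearest-neighbour match on the propensity score in Python. It is an illustration only and does not reproduce the MatchIt implementation. It assumes that the caliper is expressed in standard deviations of the pooled propensity scores and that each trial patient is matched to the single closest external patient. The function name and the example scores are hypothetical.

```python
import numpy as np


def match_nearest(ps_trial, ps_external, caliper_sd=0.5):
    """Nearest-neighbour PS matching with replacement and a caliper.

    Returns, for each trial patient, the index of the closest external
    patient, or None if no external patient lies within the caliper.
    Assumption: the caliper is caliper_sd standard deviations of the
    pooled propensity scores.
    """
    ps_trial = np.asarray(ps_trial, dtype=float)
    ps_external = np.asarray(ps_external, dtype=float)
    caliper = caliper_sd * np.std(np.concatenate([ps_trial, ps_external]))

    matches = []
    for p in ps_trial:
        dist = np.abs(ps_external - p)
        j = int(np.argmin(dist))  # with replacement: j may be reused
        matches.append(j if dist[j] <= caliper else None)
    return matches


# Hypothetical scores: the second trial patient has no close external match.
print(match_nearest([0.30, 0.90], [0.28, 0.35, 0.50]))
```

Because unmatched patients are dropped and external patients can be reused, the matched sample changes with the caliper and replacement settings.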
Conclusions: There is no single approach to adjusting for differences between single-arm trials and external control arms. PS methods can give inconsistent results and are sensitive to the variables used to estimate the scores. Researchers should provide extensive sensitivity analyses in which a variety of PS methods are applied.