Background: SEE is a new method for predicting the duration of pharmacological prescriptions in secondary data sources when information on time in treatment is missing or incomplete.
Objectives: To describe SEE and its accuracy, sensitivity, specificity, positive and negative predictive values (PPV/NPV), balanced accuracy, and precision in predicting the duration of pharmacological prescriptions and exposure status.
Methods: Synthetic data were generated to simulate the medication refill histories of 1000 patients, each with an initial fill of 30 days and at least one refill over an observation window of 2 years. Refill durations of 1, 2, or 3 months were generated randomly for each subsequent refill. Exposure status was known for all patients over the observation window. The simulation accounted for carry-over within and before the observation window and mimicked six treatment patterns observed for chronic diseases: patients whose adherence was 1) 95%, 2) between 50% and 90%, 3) gradually declining over time, or 4) intermittent, as well as 5) delayed and 6) early discontinuers. Using these data, SEE first computed the empirical cumulative distribution function (ECDF) of the temporal distances (TDs) between consecutive redeemed prescriptions (RPs). The lower 80% of the ECDF was retained to exclude long TDs introduced by discontinuers. For each patient, SEE selected two random consecutive RPs in the observation window and computed the logarithm of their TD in days. The log-transformed TDs were clustered with K-means clustering (an unsupervised machine learning method), with the number of clusters selected by silhouette analysis. For each cluster, SEE built the probability density function (PDF) of the TDs and split it into deciles. The fifth decile of the PDF was selected and exponentiated; this value was taken as the median duration of the TDs in the cluster and was added to the dataset. Finally, the predicted end of supply for each RP was computed from its predicted median duration. To test prediction accuracy, a random date in the observation window was selected for each patient, and the real exposure status on that date was compared with the status predicted by SEE, allowing for a gap time of 90 days. The metrics described in the objectives were computed from the resulting confusion matrix.
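The core estimation steps described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`retain_ecdf`, `kmeans_1d`, `silhouette`, `see_durations`), the quantile-based K-means initialisation, the `max_k` search limit, and the use of the sample median of log-TDs in place of the fifth decile of an estimated PDF are all choices made here for illustration. The input is assumed to be one TD in days per patient, already sampled from a random pair of consecutive RPs.

```python
import math
from statistics import median

def retain_ecdf(tds, keep=0.80):
    """Keep the shortest `keep` fraction of TDs (the lower part of the ECDF)."""
    ordered = sorted(tds)
    return ordered[:max(1, round(keep * len(ordered)))]

def kmeans_1d(xs, k, iters=100):
    """Lloyd's K-means on scalars; centres start at evenly spaced quantiles
    (an assumption of this sketch, chosen to keep the result deterministic)."""
    ordered = sorted(xs)
    centers = [ordered[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        new = []
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            new.append(sum(members) / len(members) if members else centers[j])
        if new == centers:  # assignments are stable
            break
        centers = new
    return labels

def silhouette(xs, labels):
    """Mean silhouette width for scalar data (O(n^2), acceptable for a sketch)."""
    clusters = {}
    for x, lab in zip(xs, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for x, lab in zip(xs, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        b = min(sum(abs(x - y) for y in c) / len(c)
                for other, c in clusters.items() if other != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(xs)

def see_durations(tds_days, max_k=5, keep=0.80):
    """Return {cluster label: predicted median duration in days}."""
    logs = [math.log(t) for t in retain_ecdf(tds_days, keep) if t > 0]
    best_labels, best_score = [0] * len(logs), -1.0
    for k in range(2, min(max_k, len(set(logs))) + 1):
        labels = kmeans_1d(logs, k)
        score = silhouette(logs, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    durations = {}
    for lab in set(best_labels):
        members = [x for x, m in zip(logs, best_labels) if m == lab]
        # Sample median of log-TDs stands in for the fifth decile; exponentiate back.
        durations[lab] = math.exp(median(members))
    return durations

# Illustrative input only (not the study data): 30-day and 90-day refills
# plus long gaps of the kind discontinuers would introduce.
example_tds = [28, 29, 30, 31, 32] * 8 + [88, 89, 90, 91, 92] * 8 + [400] * 20
example_durations = see_durations(example_tds)
```

Adding each cluster's predicted duration to the fill date of an RP assigned to that cluster then gives the predicted end of supply for that RP.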
Results: SEE showed an accuracy of 96% (95% CI 95%-97%), a sensitivity of 96%, a specificity of 99%, an NPV of 65%, a PPV of 99%, and a balanced accuracy of 97% in predicting exposure status. The median absolute misclassification of exposure was 17 days (interquartile range: 17-43 days).
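All of these metrics derive from the same 2x2 confusion matrix of real versus predicted exposure status. The sketch below shows the standard definitions, with "exposed" as the positive class; the function name `exposure_metrics` and the counts passed to it are hypothetical and are not the study's confusion matrix. Because NPV depends on how many patients are truly unexposed, it can be much lower than specificity when unexposed patients are rare.

```python
def exposure_metrics(tp, fp, tn, fn):
    """Standard 2x2 classification metrics, with 'exposed' as the positive class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),  # also called precision
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical counts for illustration only.
example_metrics = exposure_metrics(tp=90, fp=2, tn=18, fn=10)
```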
Conclusions: SEE emerged as a highly accurate data-driven method for predicting the duration of RPs.