OBJECTIVE: Before integrating new machine learning (ML) into clinical practice, algorithms must undergo validation. Validation studies require sample size estimates. Unlike hypothesis testing studies seeking a p-value, the goal of validating predictive models is obtaining estimates of model performance. Our aim was to provide a standardized, data distribution- and model-agnostic approach to sample size calculations for validation studies of predictive ML models. MATERIALS AND METHODS: Sample Size Analysis for Machine Learning (SSAML) was tested in three previously published models: brain age to predict mortality (Cox Proportional Hazard), COVID hospitalization risk prediction (ordinal regression), and seizure risk forecasting (deep learning). The SSAML steps are: 1) Specify performance metrics for model discrimination and calibration. For discrimination, we use area under the receiver operating curve (AUC) for classification and Harrell's C-statistic for survival models. For calibration, we employ calibration slope and calibration-in-the-large. 2) Specify the required precision and accuracy (<=0.5 normalized confidence interval width and +/-5% accuracy). 3) Specify the required coverage probability (95%). 4) For increasing sample sizes, calculate the expected precision and bias that is achievable. 5) Choose the minimum sample size that meets all requirements. RESULTS: Minimum sample sizes were obtained in each dataset using standardized criteria. DISCUSSION: SSAML provides a formal expectation of precision and accuracy at a desired confidence level. CONCLUSION: SSAML is open-source and agnostics to data type and ML model. It can be used for clinical validation studies of ML models.
- Downloaded 87 times
- Download rankings, all-time:
- Site-wide: 167,168
- In health informatics: 789
- Year to date:
- Site-wide: None
- Since beginning of last month:
- Site-wide: None
Downloads over time
Distribution of downloads per paper, site-wide
- 27 Nov 2020: The website and API now include results pulled from medRxiv as well as bioRxiv.
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!