Machine learning is a powerful set of techniques that has enhanced the ability of neuroscientists to interpret information collected through EEG, fMRI, and MEG. With these techniques comes the danger of overfitting hyper-parameters, which can render results invalid and cause a failure to generalize beyond the dataset. We refer to this problem as ‘over-hyping’ and show that it is pernicious despite commonly used precautions. In particular, over-hyping occurs when an analysis is run repeatedly with slightly different analysis parameters and one set of results is selected based on the outcomes of those runs. When this is done, the resulting method is unlikely to generalize to a new dataset, making it a partially, or perhaps even completely, spurious result that will not be valid outside of the data used in the original analysis. While it is commonly assumed that cross-validation is an effective protection against such spurious results generated through overfitting or over-hyping, this is not actually true. In this article, we show that both one-shot and iterative optimization of an analysis are prone to over-hyping, despite the use of cross-validation. We demonstrate that non-generalizable results can be obtained even on non-informative (i.e., random) data by modifying hyper-parameters in seemingly innocuous ways. We recommend a number of techniques for limiting over-hyping, such as lock-boxes, blind analyses, pre-registrations, and nested cross-validation. These techniques are common in other fields that use machine learning, including computer science and physics. Adopting similar safeguards is critical for ensuring the robustness of machine-learning techniques in the neurosciences.
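The effect described above can be illustrated with a minimal sketch. It is not the authors' analysis: the toy nearest-class-mean classifier, the sample sizes, and every function and variable name below are assumptions chosen for illustration. The sketch draws purely random data, treats the choice of input feature as the hyper-parameter, selects the setting with the best cross-validated accuracy, and then scores that setting once on a lock-box that played no part in the selection.

```python
import random
from statistics import mean


def make_random_data(rng, n_samples, n_features):
    """Return (features, label) rows with balanced labels that carry no signal."""
    labels = [i % 2 for i in range(n_samples)]
    rng.shuffle(labels)
    return [([rng.gauss(0.0, 1.0) for _ in range(n_features)], y) for y in labels]


def nearest_mean_accuracy(train, test, feature):
    """Fit a nearest-class-mean rule on one feature and return test accuracy."""
    m0 = mean(x[feature] for x, y in train if y == 0)
    m1 = mean(x[feature] for x, y in train if y == 1)
    hits = 0
    for x, y in test:
        predicted = 1 if abs(x[feature] - m1) < abs(x[feature] - m0) else 0
        hits += predicted == y
    return hits / len(test)


def cv_accuracy(data, feature, k=5):
    """Plain k-fold cross-validated accuracy for one hyper-parameter setting."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(nearest_mean_accuracy(train, folds[i], feature))
    return mean(scores)


rng = random.Random(0)
n_features = 200  # number of hyper-parameter settings tried (assumed)
analysis = make_random_data(rng, 60, n_features)   # data used for tuning
lockbox = make_random_data(rng, 300, n_features)   # set aside, opened once

# Over-hyping: try every setting and keep the one that cross-validates best.
best_feature = max(range(n_features), key=lambda f: cv_accuracy(analysis, f))
best_cv = cv_accuracy(analysis, best_feature)

# Lock-box check: evaluate the chosen setting once on the untouched data.
lockbox_acc = nearest_mean_accuracy(analysis, lockbox, best_feature)

print(f"best cross-validated accuracy: {best_cv:.2f}")
print(f"lock-box accuracy:             {lockbox_acc:.2f}")
```

Because the labels are random, any cross-validated accuracy well above 0.5 for the selected setting comes from the selection itself, and the lock-box score would be expected to fall back toward chance.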