We evaluate the performance of a previously validated artificial neural network in a much larger, subsequent, consecutive cohort. There is a belief in the community that, given infinite training data, an AI system could in theory be trained to handle all possible data and thus generalise to all environments. Applied to the prostate, this would mean a “one size fits all” AI system that analyses any set of MR images and produces an optimal probability for the presence of clinically significant prostate cancer (sPC).
AI systems may handle previously unseen data in unfavourable ways; thus, it is important to validate them for each clinical environment in which they will be used. Once validated, one could argue, the AI system should do what it was trained for. However, in our opinion, AI systems should be subject to regular quality control like any other medical device. In our work, we demonstrate that it is well worthwhile to subject a validated system to further, continuous scrutiny, for two reasons: the system may slowly drift over time, and validations are nowadays typically done on rather small test sets because most data are reserved for training. Small test sets, however, are prone to blurring differences between AI and gold standard performance.
We put forward a concept in which CNN training optimises information extraction, while calibration of the operating points adjusts the thresholds to the clinically desired levels. We propose a dynamic threshold adjustment scheme that can both (1) fix AI performance to clinical radiologist PI-RADS performance and (2) ensure stable performance at a predetermined sensitivity or negative predictive value, independent of clinical assessment.
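The idea of dynamic threshold adjustment can be illustrated with a minimal sketch. It is not the calibration procedure used in the study: the function names, the sliding window of recent cases, and the default window size of 200 are assumptions made purely for illustration, and only the sensitivity target is shown (a negative predictive value target would follow the same pattern). The threshold is re-derived from the most recent cases so that the network's sensitivity matches either the radiologists' PI-RADS ≥ 4 sensitivity on those cases or a fixed, predetermined value.

```python
import math


def sensitivity(calls, labels):
    """Fraction of sPC-positive cases (label 1) that were called positive."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("no positive cases available")
    n_tp = sum(1 for c, y in zip(calls, labels) if c and y == 1)
    return n_tp / n_pos


def threshold_for_sensitivity(scores, labels, target):
    """Highest threshold t such that calling `score >= t` positive
    detects at least `target` of the positive cases."""
    pos_scores = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos_scores:
        raise ValueError("no positive cases available")
    # Number of positives that must be detected; the small epsilon guards
    # against floating-point overshoot in target * n (e.g. 0.7 * 10).
    k = max(1, math.ceil(target * len(pos_scores) - 1e-9))
    return pos_scores[k - 1]


def dynamic_threshold(scores, labels, pirads=None, fixed_sensitivity=None, window=200):
    """Recalibrate the operating point on the `window` most recent cases.

    Mode 1 (fixed_sensitivity is None): match the sensitivity of the
    radiologists' PI-RADS >= 4 calls on the same cases.
    Mode 2: hold a predetermined sensitivity, independent of PI-RADS.
    The window size is an illustrative assumption, not a study parameter.
    """
    recent_scores = scores[-window:]
    recent_labels = labels[-window:]
    if fixed_sensitivity is not None:
        target = fixed_sensitivity
    else:
        if pirads is None:
            raise ValueError("pirads is required when no fixed target is given")
        calls = [p >= 4 for p in pirads[-window:]]
        target = sensitivity(calls, recent_labels)
    return threshold_for_sensitivity(recent_scores, recent_labels, target)


# Toy data (invented for illustration): network probabilities, biopsy labels,
# and PI-RADS categories for eight consecutive cases.
scores = [0.9, 0.2, 0.8, 0.4, 0.6, 0.1, 0.3, 0.7]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
pirads = [5, 2, 4, 3, 4, 2, 3, 4]

t_match = dynamic_threshold(scores, labels, pirads=pirads)
t_fixed = dynamic_threshold(scores, labels, fixed_sensitivity=1.0)
```

In a deployment, such a function would be re-run as new cases with histopathological ground truth accumulate, so that a slowly drifting system is pulled back to the intended operating point. How often to recalibrate and how many cases to include are design choices that this sketch leaves open.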
- U-Net maintained similar diagnostic performance compared to radiological assessment of PI-RADS ≥ 4 when applied in a simulated clinical deployment.
- Application of our proposed prospective dynamic calibration method successfully adjusted U-Net performance within acceptable limits of the PI-RADS reference over time, while not being limited to PI-RADS as a reference.
- Simultaneous detection by U-Net and radiological assessment significantly improved the positive predictive value on a per-patient and per-lesion basis, while the negative predictive value remained unchanged.
Authors: Patrick Schelb, Xianfeng Wang, Jan Philipp Radtke, Manuel Wiesenfarth, Philipp Kickingereder, Albrecht Stenzinger, Markus Hohenfellner, Heinz-Peter Schlemmer, Klaus H. Maier-Hein & David Bonekamp