Healthcare study casts doubt on ability of machine learning to generalize

The application of machine learning, and particularly convolutional neural networks (CNNs), to medical image diagnosis has generated as much excitement in the AI field as any use case because of its enormous potential across the whole of healthcare. This optimism is not entirely misplaced, but as we have argued before, it jumps the gun because a lot of basic due diligence has yet to be done, and at best these systems should be used as aids rather than for primary diagnosis at present. CNN-based analysis should be subject to something more like the rigorous scrutiny applied to emerging therapies or drugs, which can take 10 years from conception to completion of final-phase clinical trials.

A paper, just published by the Icahn School of Medicine at Mount Sinai in New York, has demonstrated that the diagnostic capability of a CNN model used to identify people with pneumonia from chest X-rays depends significantly on the institution where the images were made, as well as the scanning system used.

The wider conclusion is that the strength of medical image diagnostics, its ability to incorporate many parameters, is also its weakness because it makes it hard to identify the specific variables driving predictions. It could even lead to false positives and negatives, because a given person’s scan taken by one machine might appear more indicative of pneumonia than if it had been derived from another.

More broadly, it casts doubt on how readily a model can be generalized to data not incorporated in the original training, which has implications not just across healthcare but also other sectors where large numbers of parameters might be involved, including many applications of predictive analytics.

In this case, the CNN models were applied to 158,000 chest X-rays from three US medical institutions to identify pneumonia: the National Institutes of Health, the Mount Sinai Hospital and Indiana University Hospital. In each case, detailed 3D images of the lungs and surrounding pulmonary system were constructed from X-rays taken from different angles using computerized tomography (CT), which builds up the picture in cross-sectional 2D slices. It is these CT scans that the CNN models work on.

In three out of five comparisons, the CNN’s performance in diagnosing diseases on X-rays from hospitals outside its own network was significantly worse than on X-rays from the original health system. At the same time, the CNNs were able to detect with high accuracy the hospital system where an X-ray was acquired.
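The gap between internal and external performance can be illustrated with a toy simulation. The sketch below is not the study's method: it swaps the CNN for a small logistic regression on two made-up features, and every number in it (the site prevalences, the `artifact` feature standing in for a scanner or site signature, the sample sizes) is an assumption chosen purely for illustration. It trains on pooled data from two hypothetical hospitals with very different pneumonia rates, then scores the model on fresh data from the same two hospitals and on a third hospital where the site signature says nothing about disease.

```python
import math
import random


def make_site(n, prevalence, artifact_mean, artifact_sd, rng):
    """Simulate one hypothetical hospital (all parameters are illustrative)."""
    features, labels = [], []
    for _ in range(n):
        label = 1 if rng.random() < prevalence else 0
        clinical = rng.gauss(1.0 if label else 0.0, 1.0)  # weak true disease signal
        artifact = rng.gauss(artifact_mean, artifact_sd)  # site/scanner signature
        features.append((clinical, artifact))
        labels.append(label)
    return features, labels


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, epochs=300, lr=0.5):
    """Plain full-batch gradient descent on the logistic loss."""
    w0 = w1 = b = 0.0
    n = len(features)
    for _ in range(epochs):
        g0 = g1 = gb = 0.0
        for (x0, x1), y in zip(features, labels):
            err = sigmoid(w0 * x0 + w1 * x1 + b) - y
            g0 += err * x0
            g1 += err * x1
            gb += err
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
        b -= lr * gb / n
    return w0, w1, b


def auc(scores, labels):
    """Area under the ROC curve: share of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def evaluate(model, features, labels):
    w0, w1, b = model
    scores = [w0 * x0 + w1 * x1 + b for x0, x1 in features]
    return auc(scores, labels)


rng = random.Random(0)


def pooled(n):
    """Pool hypothetical sites A (high prevalence) and B (low prevalence)."""
    xa, ya = make_site(n, 0.35, 1.0, 0.3, rng)
    xb, yb = make_site(n, 0.02, 0.0, 0.3, rng)
    return xa + xb, ya + yb


train_x, train_y = pooled(1000)
internal_x, internal_y = pooled(1000)  # fresh patients, same two sites
# Hypothetical external site C: its signature is unrelated to disease.
external_x, external_y = make_site(2000, 0.20, 0.5, 0.5, rng)

model = train_logistic(train_x, train_y)
internal_auc = evaluate(model, internal_x, internal_y)
external_auc = evaluate(model, external_x, external_y)

print(f"weight on site artifact: {model[1]:.2f}")
print(f"internal AUC: {internal_auc:.3f}")
print(f"external AUC: {external_auc:.3f}")
```

On this synthetic data the internal score is inflated because the model puts weight on the site signature, which tracks prevalence at the training hospitals. At the external site that shortcut carries no information about disease, so only the weaker clinical signal is left to do the work.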

The researchers naturally concluded that the use of a massive number of parameters in deep learning models applied to medicine made it challenging to identify specific variables driving predictions. The type of CT scanners used at a hospital and the resolution of the images not only affected the predictions but were hard to separate from underlying clinical indicators, such as characteristic internal inflammation in the lungs.

It is important to emphasize that pinpointing the institution, X-ray scanner, or even CT process used is itself a strength, not a weakness, of ML, highlighting how the diverse factors shaping a data set leave a signature that can then be identified. Experience so far shows that CNNs applied to medical diagnostics based on analysis of blood samples rather than images, again with multiple parameters, can indicate what the patient has just had for dinner as well as, say, evidence of type 2 diabetes. That has not been considered a weakness there. But it does highlight the challenge of applying machine learning in the medical context and of generalizing it.

The paper’s authors correctly argue their findings should give pause to those considering rapid deployment of AI platforms in healthcare without rigorously assessing their performance in real-world clinical settings that reflect where they are being deployed. They point out that deep learning models trained to perform medical diagnosis can generalize well, but that crucially this cannot be taken for granted because patient populations and imaging techniques vary significantly between institutions.

We would contend that they also vary due to other factors, such as demography, geography and income, since these will also leave their signatures in the data sets. The implication may then be that, rather than being worried about as sources of “noise” in the data, these should simply be treated as factors to be filtered out in order to home in on the salient clinical indicators.

This work is important because it is not just an isolated study but builds on papers published earlier this year in the journals Radiology and Nature Medicine, which laid the ground for applying computer vision and deep learning techniques.