New guidelines have been published on clinical trials for AI health solutions. The SPIRIT-AI extension relates to clinical trial protocols, and the CONSORT-AI extension is a new reporting guideline for clinical trial reports.

These new guidelines aim to address some of the chronic problems with the evidence base for AI in healthcare, including reliance on retrospective datasets, poor standards of reporting and a lack of transparency. We’ve described these issues below:

1. Retrospective reports do not represent the reality of the clinical environment

  • ‘Clinical reports’ of AI technology have typically been in the form of ‘in silico’ assessments of datasets to determine how well a machine learning model performs a clinical task, compared to a small number of physicians.
  • Evidence drawn from such a closed environment and small sample size should not be relied on, yet most of the regulatory approvals for algorithms by the US Food and Drug Administration rely on this kind of preliminary evidence.
  • Some argue that pitting clinicians against machines is the antithesis of clinical practice, and that we should not rely on AI alone for critical life-or-death decisions about a patient.
  • AI reports generally work from a clean and annotated dataset. In contrast, the real world of medicine contains a great deal of unstructured and missing data, which the AI may not be able to account for.

2. Poor standards of reporting in AI reports

  • A study that reviewed 82 AI reports found that reporting was poor on critical aspects, which resulted in missing data.
  • The machine learning models were also seldom compared with the combined approach of both the algorithm and healthcare professionals assessing the same datasets.
  • The reports also lacked external validation, that is, the use of out-of-sample data to assess the effectiveness of the AI technology.
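
To illustrate what external validation involves, the sketch below scores the same model's predictions on an internal test set and on an out-of-sample (external) set. It is a minimal illustration only: the function name and all labels and predictions are made-up toy values, not data from any study discussed here.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = disease present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical toy values: cases drawn from the same source as the training data...
internal_true = [1, 1, 1, 1, 0, 0, 0, 0]
internal_pred = [1, 1, 1, 1, 0, 0, 0, 1]

# ...and cases drawn from a different, unseen source (e.g. another hospital).
external_true = [1, 1, 1, 1, 0, 0, 0, 0]
external_pred = [1, 1, 0, 0, 0, 0, 1, 1]

internal_scores = sensitivity_specificity(internal_true, internal_pred)
external_scores = sensitivity_specificity(external_true, external_pred)
print("internal:", internal_scores)
print("external:", external_scores)
```

Reporting both sets of figures side by side is what lets a reader judge whether performance carries over to data the model has not seen.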

3. Lack of transparency

  • The lack of transparency in AI reports is attributed to the limited availability of code and datasets needed to determine reproducibility, the small number of clinicians used to assess algorithmic performance, and hyperbolic conclusions.
  • Private companies rarely publish the retrospective datasets used to create their algorithms, which adversely affects the clinical community that intends to use those algorithms for direct patient care.
  • Transparency in AI could help detect biases in datasets early on – if an algorithm or dataset has embedded bias or misrepresented data, this could lead to serious diagnostic or predictive inaccuracy. 

The aims of the new guidelines and how they work (CONSORT-AI and SPIRIT-AI)

The guidelines promote prospective and randomised trials that are representative of patient care:

  • A prospective study observes patients over a prolonged period to follow the development of a disease, and considers suspected risk or protective factors. This is a much more accurate depiction of how patients are diagnosed in reality.
  • For example, when a dermatologist is evaluating a skin lesion, they would not only analyse the photograph of the lesion, but also take into account the patient’s history and physical exam.
  • AI technology tends to perform worse in prospective studies than in retrospective studies.
  • Randomised trials should also be carried out to compare the clinical effectiveness of i) AI, ii) clinicians, and iii) clinicians using AI.
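
As a rough illustration of how participants might be allocated across the three comparison groups described above, the sketch below uses simple block randomisation. The guidelines do not prescribe this method; the function name, arm labels, participant IDs and seed are all assumptions made for the example.

```python
import random
from collections import Counter

# Arm labels are our own shorthand for the three groups compared in the text.
ARMS = ["AI alone", "clinicians alone", "clinicians using AI"]

def block_randomise(participant_ids, arms=ARMS, seed=None):
    """Allocate participants in shuffled blocks so that arm sizes stay balanced."""
    rng = random.Random(seed)
    allocation = {}
    for start in range(0, len(participant_ids), len(arms)):
        block = list(arms)
        rng.shuffle(block)  # random arm order within each block
        for pid, arm in zip(participant_ids[start:start + len(arms)], block):
            allocation[pid] = arm
    return allocation

# Hypothetical participant IDs, for illustration only.
participants = [f"P{i:03d}" for i in range(1, 31)]
allocation = block_randomise(participants, seed=2020)
arm_counts = Counter(allocation.values())
print(arm_counts)
```

With 30 participants and blocks of three, each arm receives exactly 10 participants, while the order of allocation within each block is left to chance.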

The guidelines also aim to improve transparency and standards of AI reports. In simple terms, machine learning models/algorithms consist of inputs (data, such as an image) and outputs (an interpretation or diagnosis). The new standard requires that an AI clinical report should include information on:

  • Input data: the scope of data used and any exclusions, how representative the data are of the clinical question at hand, and the quality and source of the data.
  • Output data: how the output data is specified and how it can contribute to decision making.
  • Algorithm: the version of the algorithm, changes that occurred during testing and internal validation, and the fit of the model (i.e. a narrow analysis that is extrapolated to the broader, unrestricted world of the clinical environment should be avoided).
  • Human – AI Interface: the report must specify the level of expertise required by users to interact with the AI or any planned user training required.
  • Ground truths: AI that relies on supervised machine learning models is built by establishing certain ground truths. However, the ground truths on which algorithms are built may not be factually correct, and the new guidelines require that the details of these be set out.
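
One way a study team might track these reporting items internally is as a simple structured record that flags anything left blank. The sketch below is only an illustration: the class, field names and example entries are our own shorthand and hypothetical values, not the official item wording of CONSORT-AI or SPIRIT-AI.

```python
from dataclasses import dataclass, fields

@dataclass
class AIReportItems:
    """Hypothetical record of the five reporting areas listed above."""
    input_data: str = ""          # scope, exclusions, representativeness, quality, source
    output_data: str = ""         # how output is specified and feeds into decisions
    algorithm: str = ""           # version, changes during testing, model fit
    human_ai_interface: str = ""  # user expertise and planned training
    ground_truths: str = ""       # how the reference labels were established

def missing_items(report):
    """Return the names of reporting areas that have not been filled in."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]

# Made-up draft entries, for illustration only.
draft = AIReportItems(
    input_data="Images from two sites; ungradable images excluded",
    algorithm="Version 1.2, unchanged during the trial",
)
gaps = missing_items(draft)
print(gaps)
```

Here the draft would be flagged as still missing its output data, human-AI interface and ground truth descriptions.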

Jaspreet is a Senior Associate, and advises clients on complex issues at the intersection of healthcare, data and technology. Her practice has a particular focus on accessing and using patient data, innovative collaborations with hospitals, and the use and regulation of AI in the healthcare space.