Task overview

The UNICORN challenge includes eight language benchmarks designed to evaluate the capability of large language models (LLMs) to analyze and interpret pathology and radiology reports. Each task focuses on a specific type of prediction, covering classification, regression, and named entity recognition (NER).

  • Classification
    • T12: Predicting histopathology sample origin. This task aims to classify the type of histopathology material described in the report as a biopsy, resection, or excision.
    • T13: Classifying pulmonary nodule presence. This task aims to determine whether a pulmonary nodule is mentioned in the radiology report.
    • T14: Classifying kidney abnormality. This task aims to identify the presence of significant kidney abnormalities based on the radiology report.
    • T15: Predicting Hip Kellgren-Lawrence score. This task aims to classify the radiology report by the Hip Kellgren-Lawrence score, ranging from 0 to 4.
    • T16: Classifying colon histopathology diagnosis. This task requires predicting both the sample type (biopsy or polypectomy) and the diagnosis type, with options including hyperplastic polyps, low-grade dysplasia, high-grade dysplasia, cancer, serrated polyps, and non-informative.
  • Regression
    • T17: Predicting lesion size measurements. The goal of this task is to predict the size of all lesions described in the report in a standardized format.
    • T18: Predicting prostate volume, PSA, and PSA density. The goal of this task is to predict the PSA level, the prostate volume, and the PSA density based on the radiology report (a regex-based extraction sketch follows this list).
  • Named Entity Recognition
    • T19: Anonymizing report. This task requires identifying and tagging personally identifiable information (PII) within reports, including dates, personal identifiers, report identifiers, locations, clinical trial names, times, and ages (see the tagging sketch after this list).
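
For the regression tasks, one common pattern is to let the LLM generate free text and then recover numeric values with pattern matching. Below is a minimal sketch for T18, assuming reports state PSA and prostate volume in a recognizable phrasing; the regular expressions, units, and example sentence are illustrative assumptions, not the challenge's actual report format.

```python
import re
from typing import Optional

# Illustrative patterns only; real reports vary in phrasing, units,
# and decimal separators (e.g. a decimal comma instead of a point).
PSA_PATTERN = re.compile(r"PSA[^\d]{0,10}(\d+(?:[.,]\d+)?)", re.IGNORECASE)
VOLUME_PATTERN = re.compile(
    r"volume[^\d]{0,10}(\d+(?:[.,]\d+)?)\s*(?:cc|ml|cm3)", re.IGNORECASE
)

def _to_float(raw: str) -> float:
    """Normalize a decimal comma to a decimal point before parsing."""
    return float(raw.replace(",", "."))

def extract_psa_density(report: str) -> Optional[float]:
    """Derive PSA density (PSA / prostate volume) when the report
    states PSA and volume but not the density itself."""
    psa = PSA_PATTERN.search(report)
    volume = VOLUME_PATTERN.search(report)
    if psa and volume:
        return round(_to_float(psa.group(1)) / _to_float(volume.group(1)), 2)
    return None

print(extract_psa_density("PSA: 7,2 ng/ml. Prostate volume: 48 cc."))  # 0.15
```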
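For T19, a rule-based pass can complement the LLM by tagging high-precision PII patterns directly. A small sketch follows; the tag names, patterns, and example sentence are all hypothetical, and the challenge's actual PII categories and expected output format may differ.

```python
import re

# Hypothetical tag set covering two of the PII categories listed above.
PII_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b"),
    "AGE": re.compile(r"\b\d{1,3}(?=[ -]year[- ]old)", re.IGNORECASE),
}

def tag_pii(report: str) -> str:
    """Replace matched spans with placeholder tags, category by category."""
    for label, pattern in PII_PATTERNS.items():
        report = pattern.sub(f"<{label}>", report)
    return report

print(tag_pii("Seen on 12/03/2023, patient is a 67-year-old male."))
# Seen on <DATE>, patient is a <AGE>-year-old male.
```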

Note: Additional language tasks may be introduced in the future to further assess the capability of LLMs across diverse medical contexts.

Submission to the UNICORN leaderboard

Submissions to the UNICORN challenge for language tasks follow a two-step process designed to assess the performance of large language models (LLMs) in medical report analysis. Participants submit a Docker container with their LLM, which processes the input reports and generates responses, enabling predictions for classification, regression, and named entity recognition (NER) tasks.

Step 1: Response Generation. Participants first upload a Docker container with their pre-trained LLM and are provided access to a set of few-shot examples to help guide the model's responses. Participants can choose how to use these examples to maximize their utility. The LLM then processes each report in the evaluation set, generating an initial response for each prompt.
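
One minimal way to use the provided few-shot examples is to fold them into the prompt before generation. The sketch below assumes a Hugging Face text-generation model and a simple report/answer example format; the model name, prompt layout, and example reports are illustrative assumptions, since the challenge does not prescribe a particular interface.

```python
from transformers import pipeline

# Placeholder model for illustration; participants would load their own LLM.
generator = pipeline("text-generation", model="gpt2")

def build_prompt(task_instruction, few_shot_examples, report):
    """Prepend the provided few-shot examples to the report being evaluated."""
    shots = "\n\n".join(
        f"Report: {ex['report']}\nAnswer: {ex['answer']}" for ex in few_shot_examples
    )
    return f"{task_instruction}\n\n{shots}\n\nReport: {report}\nAnswer:"

# Made-up examples in the spirit of T13 (pulmonary nodule presence).
few_shot = [
    {"report": "CT thorax: a 6 mm nodule in the right upper lobe.", "answer": "yes"},
    {"report": "Chest X-ray: clear lungs, no focal lesions.", "answer": "no"},
]
prompt = build_prompt(
    "Does the report mention a pulmonary nodule? Answer yes or no.",
    few_shot,
    "CT: solid nodule of 4 mm in the left lower lobe.",
)
raw_response = generator(prompt, max_new_tokens=10)[0]["generated_text"]
```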

Step 2: Post-Processing and Prediction. In the second step, participants can apply post-processing to transform each generated response into a final prediction tailored to the task. This step ensures that the model's raw outputs are converted into structured predictions, ready for evaluation within the challenge.
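
As a concrete illustration of this step, the sketch below maps a free-text generation to a binary label for T13. The label names, matching rules, and fallback behavior are assumptions for illustration, not the challenge's required output format.

```python
import re

def postprocess_nodule_answer(raw_response: str) -> str:
    """Map a free-text generation to a binary label for T13."""
    answer = raw_response.strip().lower()
    if re.search(r"\byes\b", answer):
        return "yes"
    if re.search(r"\bno\b", answer):
        return "no"
    return "no"  # conservative fallback when the model's output is ambiguous

print(postprocess_nodule_answer("Yes, a 6 mm nodule is described."))  # yes
```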