Submission to leaderboards


This is a code execution challenge. Rather than submitting your predicted labels, you'll package everything needed to run inference and submit it for containerized execution. Depending on the task modality (vision, language, or vision-language), the submission pipeline differs in structure and output format. More details are provided below.

Few-Shot Supervision

For all tasks (except vision-language), participants are provided with a set of 48 few-shot examples alongside the evaluation dataset. These examples consist of labeled input-output pairs (e.g., images and labels, clinical reports and targets). Participants are expected to use these few-shot examples to adapt their models in a lightweight manner, for example through kNN probing, linear probing, or prompt-based adaptation. Fine-tuning large models is not permitted. The goal is to evaluate the generalization and adaptability of foundation models under minimal supervision.
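As an illustration of what "lightweight" means here, the sketch below performs kNN probing on top of frozen features; the file names and the use of scikit-learn are assumptions for the example, not part of the challenge interface.

```python
# Minimal sketch of kNN probing on frozen features (illustrative only).
# The file names below are hypothetical; features are assumed to have been
# extracted beforehand by a frozen foundation model.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

few_shot_features = np.load("few_shot_features.npy")  # shape (48, D)
few_shot_labels = np.load("few_shot_labels.npy")      # shape (48,)
eval_features = np.load("eval_features.npy")          # shape (N, D)

# Fit a k-nearest-neighbour classifier on the 48 labeled few-shot examples;
# the encoder itself stays frozen (no fine-tuning).
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(few_shot_features, few_shot_labels)
predictions = knn.predict(eval_features)
```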

Submission to the vision leaderboards

Vision tasks include classification, segmentation, detection, and regression on medical images (e.g., WSI, CT, MRI). Submissions to vision tasks follow a two-step pipeline designed to assess how foundation models can be applied to diverse medical imaging modalities and task types. Participants submit a Docker container that generates features from the input images and, at submission time, specify the adaptor method to be used in the second step to turn those features into predictions.

Step 1: Algorithm Docker (Encoder)

In this first step, participants submit a Docker container that includes their pre-trained vision foundation model, which serves as the vision encoder. The Docker container does not have internet access, so all necessary dependencies must be included within the image to ensure full offline execution. This container is responsible for:
  • Processing both the evaluation images and the 48 few-shot examples.
  • Running all required pre-processing (e.g., tiling, normalization).
  • Extracting feature representations at either the patch or image level.
  • Saving features using one of the following Grand Challenge interfaces: patch-neural-representation.json for detection or segmentation tasks, and image-neural-representation.json for classification or regression tasks.
The patch-level neural representation includes, for each case, a list of patch-level entries, each containing the spatial coordinates and a corresponding feature vector (a list of floats). The image-level neural representation stores a single feature vector per case, representing the entire image or slide. All feature representations must be saved as JSON files following the required schema for each interface (a sketch of both layouts is shown below).
This container processes one case at a time and must respect task-specific time limits. Time constraints for each task can be found here.
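For orientation, the sketch below shows the general shape of the two outputs: a single feature vector per case at the image level, and coordinates plus a feature vector per patch at the patch level. The field names and output paths are illustrative assumptions; the Grand Challenge interface schemas remain the authoritative reference.

```python
# Illustrative sketch of saving extracted features as JSON.
# Field names and paths are hypothetical; follow the official interface
# schemas (patch-neural-representation / image-neural-representation).
import json
from pathlib import Path

OUTPUT_DIR = Path("/output")  # assumed output location inside the container

def save_image_level(feature_vector):
    """One feature vector for the whole image or slide (classification/regression)."""
    payload = {"features": [float(x) for x in feature_vector]}
    with open(OUTPUT_DIR / "image-neural-representation.json", "w") as f:
        json.dump(payload, f)

def save_patch_level(patches):
    """A list of patch entries, each with spatial coordinates and a feature
    vector (detection/segmentation). `patches` is assumed to be an iterable
    of dicts like {"coordinates": [x, y], "features": [...]}."""
    payload = {"patches": [
        {"coordinates": list(p["coordinates"]),
         "features": [float(x) for x in p["features"]]}
        for p in patches
    ]}
    with open(OUTPUT_DIR / "patch-neural-representation.json", "w") as f:
        json.dump(payload, f)
```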

Step 2: Adaptation and Evaluation (Adaptor)

Once the algorithm Docker has completed feature extraction, the adaptation and evaluation Docker is triggered. This container implements the logic to go from features to predictions, which are then compared to the ground-truth labels for metric computation. It takes as input the features extracted from the evaluation cases, as well as the features of the few-shot examples and their corresponding labels. Its purpose is to generate task-specific predictions, such as classification labels, regression outputs, segmentation masks, or detection coordinates.
Participants are expected to use the few-shot examples in a lightweight manner to guide adaptation toward each downstream task. At submission, participants select an adaptation method from the UNICORN evaluation repository, which provides default, lightweight adaptation methods that leverage the few-shot samples to generate predictions. Custom adaptation methods can also be submitted via pull request; upon approval, these are added to the repository and made selectable for all participants. To maintain the challenge's focus on generalizability, only lightweight adapters are allowed: in this phase, the use of pretrained models or large-scale methods is not permitted.
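To make the adaptor step concrete, here is a minimal sketch of one possible lightweight method, a linear probe fitted on the few-shot features; it only illustrates the kind of logic involved and is not the evaluation repository's actual implementation.

```python
# Sketch of a lightweight linear-probing adaptor (illustrative only; the
# selectable adaptors live in the UNICORN evaluation repository).
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(few_shot_features, few_shot_labels, eval_features):
    """Fit a logistic-regression probe on the few-shot set and predict labels
    for the evaluation cases; no pretrained weights or large-scale training,
    in line with the lightweight-adaptor requirement."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(np.asarray(few_shot_features), np.asarray(few_shot_labels))
    return probe.predict(np.asarray(eval_features))
```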

Submission to the language leaderboards

Language tasks involve processing clinical reports using a Docker container that includes a pre-trained large language model (LLM). The Docker container does not have internet access, so all necessary dependencies must be included within the image to ensure full offline execution. The Docker container should:
  • Process each report in the evaluation set.
  • Optionally leverage the few-shot examples to perform lightweight adaptation or prompting, enhancing the model's predictions for the task at hand. The way these examples are leveraged is left to the participants' discretion and can be adapted to best suit their strategy.
  • Generate task-specific predictions, which may include classification labels, regression values, or named entity recognition (NER) results.
Model predictions are expected to be saved in a JSON file named nlp-predictions-dataset.json in the format defined by the NLP Predictions Dataset interface. This file should contain the case IDs and the task-specific prediction for each clinical report.
Evaluation is handled through a separate evaluation Docker container provided by the organizers. This container includes the official evaluation metrics and must not be altered by participants. It will take the predictions from the algorithm container and compute the relevant metrics for each task.
This container processes all reports at once and must respect task-specific time limits. Time constraints for each task can be found here.
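As a rough illustration of both the optional few-shot prompting and the expected output file, the sketch below builds an in-context prompt from the few-shot examples and writes one prediction per report; all keys except the file name are assumptions, and the NLP Predictions Dataset interface defines the exact schema.

```python
# Illustrative sketch for the language tasks: few-shot prompting plus writing
# nlp-predictions-dataset.json. Dictionary keys are hypothetical; follow the
# NLP Predictions Dataset interface for the exact schema.
import json

def build_few_shot_prompt(few_shot_examples, report):
    """Prepend labeled few-shot pairs to the report as an in-context prompt
    (one possible way to leverage the few-shot examples)."""
    shots = "\n\n".join(
        f"Report: {ex['report']}\nAnswer: {ex['label']}" for ex in few_shot_examples
    )
    return f"{shots}\n\nReport: {report}\nAnswer:"

def write_predictions(cases, predict_fn,
                      output_path="/output/nlp-predictions-dataset.json"):
    """`cases` is assumed to be a list of dicts holding a case identifier and
    the report text; `predict_fn` maps a report to a task-specific prediction."""
    predictions = [
        {"case_id": case["case_id"], "prediction": predict_fn(case["report"])}
        for case in cases
    ]
    with open(output_path, "w") as f:
        json.dump(predictions, f)
```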

Submission to the vision-language leaderboard

Participants must submit a single Docker container that includes a pre-trained vision-language model capable of processing WSIs and generating captions. The Docker container does not have internet access, so all necessary dependencies must be included within the image to ensure full offline execution.
Few-shot examples are not provided for this modality. This container processes one case at a time and must respect task-specific time limits. Time constraints for each task can be found here.
The model should produce a JSON file named nlp-predictions-dataset.json in the format defined by the NLP Predictions Dataset interface. This file should contain the case IDs and the generated caption for each WSI.
Evaluation is handled through a separate evaluation Docker container provided by the organizers. This container includes the official evaluation metrics and must not be altered by participants. It will take the generated captions from the algorithm container and compute the relevant NLP metrics.