This information may change during the first days after launch, as we continue to update and refine the details.
Submission to leaderboards
This is a code execution challenge. Rather than submitting your predicted labels, you'll package everything needed to run inference and submit it for containerized execution. Depending on the task modality (vision, language, or vision-language), the submission pipeline differs in structure and output format. More details are provided below.
Few-Shot Supervision
For all tasks (except vision-language), participants are provided with a set of 48 few-shot examples alongside the evaluation dataset. These examples consist of labeled input-output pairs (e.g., images and labels, clinical reports and targets). Participants are expected to use these few-shot examples to adapt their models in a lightweight manner, such as kNN probing, linear probing, or prompt-based adaptation. Fine-tuning large models is not permitted. The goal is to evaluate the generalization and adaptability of foundation models under minimal supervision.
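For illustration only, the sketch below shows what such lightweight adaptation could look like for a classification task: a kNN probe fitted on frozen few-shot features. The array names, shapes, and dummy data are placeholders, not part of the challenge interface.

```python
# Minimal sketch of lightweight few-shot adaptation via kNN probing.
# The arrays below are placeholders standing in for features produced by a
# frozen foundation model; only the small probe is fitted, never the model.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
fewshot_features = rng.normal(size=(48, 768))   # embeddings of the 48 labeled few-shot examples
fewshot_labels = rng.integers(0, 2, size=48)    # their labels
eval_features = rng.normal(size=(200, 768))     # embeddings of the evaluation cases

probe = KNeighborsClassifier(n_neighbors=5)
probe.fit(fewshot_features, fewshot_labels)
predictions = probe.predict(eval_features)
```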
Submission to the vision leaderboards
Vision tasks include classification, segmentation, detection, and regression on medical images (e.g., WSI, CT, MRI). Submissions to vision tasks follow a two-step pipeline designed to assess how foundation models can be applied to diverse medical imaging modalities and task types. Participants submit a Docker container that generates features from the input images and, at submission time, specify the adaptor method to be used in the second step to go from features to predictions.
Step 1: Algorithm Docker (Encoder)
In this first step, participants submit a Docker container that includes their pre-trained vision foundation model, which serves as the vision encoder. The Docker container does not have internet access, so all necessary dependencies must be included within the image to ensure full offline execution. This container is responsible for:
- Processing both the evaluation images and the 48 few-shot examples.
- Running all required pre-processing (e.g., tiling, normalization).
- Extracting feature representations at either the patch or image level.
- Saving features using one of the following Grand Challenge interfaces: patch-neural-representation.json for detection or segmentation tasks, and image-neural-representation.json for classification or regression tasks. The patch-level neural representation includes, for each case, a list of patch-level entries, each containing the spatial coordinates and a corresponding feature vector (a list of floats). The image-level neural representation stores a single feature vector per case, representing the entire image or slide. All feature representations must be saved as JSON files following the required schema for each interface (see the illustrative sketch after this list).
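As a rough illustration of these outputs (the authoritative schema is the one defined by the Grand Challenge interfaces; the field names, output path, and values below are assumptions), writing the two representations could look like this:

```python
# Illustrative sketch of saving feature representations as JSON.
# Field names ("title", "patches", "coordinates", "features") and the
# /output path are assumptions; follow the official interface schema.
import json
from pathlib import Path

output_dir = Path("/output")  # assumed output location inside the container

# Image-level representation: one feature vector for the whole case/slide.
image_representation = {
    "title": "case_001",
    "features": [0.12, -0.58, 1.03],  # the full embedding in practice
}
(output_dir / "image-neural-representation.json").write_text(
    json.dumps(image_representation)
)

# Patch-level representation: spatial coordinates plus a feature vector per patch.
patch_representation = {
    "title": "case_001",
    "patches": [
        {"coordinates": [0, 0], "features": [0.40, 0.10, -0.70]},
        {"coordinates": [0, 256], "features": [0.20, -0.30, 0.90]},
    ],
}
(output_dir / "patch-neural-representation.json").write_text(
    json.dumps(patch_representation)
)
```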
This container processes one case at a time and must respect task-specific time limits. Time constraints for each task can be found here.
Step 2: Adaptation and Evaluation (Adaptor)
Once the algorithm Docker has completed feature extraction, the adaptation and evaluation Docker is triggered. This container implements the logic to go from features to predictions, which are then compared to the ground-truth labels for metric computation. It takes as input the features extracted from the evaluation cases, as well as the features of the few-shot examples and their corresponding labels. Its purpose is to generate task-specific predictions, such as classification labels, regression outputs, segmentation masks, or detection coordinates.
Participants are expected to use the few-shot examples in a lightweight manner to guide adaptation toward each downstream task. At submission, participants select an adaptation method from the UNICORN evaluation repository, which provides default, lightweight adaptation methods that leverage the few-shot samples to generate predictions. Custom adaptation methods can also be submitted via pull request; upon approval, they will be added to the repository and made selectable for all participants. To maintain the challenge's focus on generalizability, only lightweight adaptors are allowed. In this phase, the use of pre-trained models or large-scale methods is not permitted.
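To make the adaptor step concrete, here is a minimal sketch of a lightweight adaptor in the spirit of linear probing: it fits a linear classifier on the few-shot features and labels, then predicts the evaluation cases. Variable names and shapes are placeholders; the actual selectable adaptors are those in the UNICORN evaluation repository.

```python
# Sketch of a lightweight adaptor: linear probing from features to predictions.
# The inputs are placeholders standing in for the features produced in Step 1;
# the real adaptors are the ones provided in the UNICORN evaluation repository.
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_adaptor(fewshot_features, fewshot_labels, eval_features):
    """Fit a linear probe on the few-shot set and predict the evaluation cases."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(fewshot_features, fewshot_labels)
    return probe.predict(eval_features)

# Example call with dummy data of plausible shapes.
rng = np.random.default_rng(42)
predictions = linear_probe_adaptor(
    fewshot_features=rng.normal(size=(48, 768)),
    fewshot_labels=rng.integers(0, 3, size=48),
    eval_features=rng.normal(size=(100, 768)),
)
```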
Submission to the language leaderboards
Language tasks involve processing clinical reports using a Docker container that includes a pre-trained large language model (LLM). The Docker container does not have internet access, so all necessary dependencies must be included within the image to ensure full offline execution. The Docker container should:
- Process each report in the evaluation set.
- Optionally leverage the few-shot examples to perform lightweight adaptation or prompting, enhancing the model's predictions for the task at hand. How these examples are used is left to the participants' discretion and can be tailored to best suit their strategy (see the illustrative sketch after this list).
- Generate task-specific predictions, which may include classification labels, regression values, or named entity recognition (NER) results.
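As one possible strategy (purely illustrative; the reports, labels, and instruction text below are invented placeholders), few-shot prompting for a classification-style language task could assemble a prompt like this before passing it to the LLM packaged inside the container:

```python
# Sketch of few-shot prompting for a report classification task.
# The reports, labels, and instruction text are invented placeholders;
# how the few-shot examples are used is entirely up to the participant.

def build_fewshot_prompt(fewshot_examples, report):
    """Assemble a prompt that shows labeled examples before the target report."""
    lines = ["Classify each clinical report. Answer with the label only.", ""]
    for example_report, label in fewshot_examples:
        lines.append(f"Report: {example_report}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Report: {report}")
    lines.append("Label:")
    return "\n".join(lines)

fewshot_examples = [
    ("No suspicious lesions identified.", "benign"),
    ("Irregular mass with spiculated margins.", "malignant"),
]
prompt = build_fewshot_prompt(
    fewshot_examples, "Well-circumscribed nodule, unchanged since prior exam."
)
# `prompt` would then be passed to the LLM packaged inside the container.
```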