T1: Classifying H&E-stained prostate biopsies into ISUP scores


Objective:
Develop a model to classify H&E-stained prostate biopsy slides into the International Society of Urological Pathology (ISUP) grade groups. These grades reflect prostate cancer aggressiveness and are crucial for treatment decisions.

Patient Population:
Patients with prostate cancer from two cohorts:

  • 165 cases from Radboudumc (2012–2017), mainly from screening programs.
  • 113 cases from six hospitals, collected via a social media call to capture the diversity of routine clinical practice.

Imaging Data:
Whole slide images (.tif format) at a resolution of 0.5 microns per pixel. Each slide is accompanied by a binary tissue mask:

  • Label 0: Background
  • Label 1: Tissue region

Each case contains a single biopsy: the most representative one for that case, selected and cropped by an expert pathologist.
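As an illustration of how the tissue mask can restrict processing to tissue regions, the sketch below selects tile coordinates whose tissue fraction exceeds a threshold. It assumes the mask has already been loaded as a 2D NumPy array at the same resolution as the slide; the function names, tile size, and threshold are hypothetical choices, not part of the task specification.

```python
import numpy as np

MICRONS_PER_PIXEL = 0.5  # resolution stated in the task description


def tissue_tiles(mask, tile_size=256, min_tissue_fraction=0.5):
    """Return (x, y) top-left corners of tiles that are mostly tissue.

    mask: 2D array with 0 = background and 1 = tissue.
    tile_size and min_tissue_fraction are illustrative defaults.
    """
    mask = np.asarray(mask)
    height, width = mask.shape
    coords = []
    # Walk a non-overlapping grid; partial tiles at the border are skipped.
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            tile = mask[y:y + tile_size, x:x + tile_size]
            if (tile == 1).mean() >= min_tissue_fraction:
                coords.append((x, y))
    return coords


def tile_size_in_microns(tile_size, mpp=MICRONS_PER_PIXEL):
    """Physical edge length of a square tile, in microns."""
    return tile_size * mpp
```

At 0.5 microns per pixel, a 256-pixel tile covers 128 microns of tissue along each edge.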

Test Data:
Unlabeled whole slide images and tissue masks in the same format as the training data. Participants must predict the ISUP grade for each slide.

Reference Standard:

  • Radboudumc cohort: Graded independently by three expert uropathologists using ISUP 2014 guidelines. Disagreements were resolved in a three-round process including consensus meetings.
  • Six-hospital cohort: Clinically graded by contributing pathologists, with biopsy selection overseen by an expert.

Pathologists showed strong agreement:

  • Pairwise quadratic kappa: 0.926
  • Average vs. majority vote: 0.878 (range: 0.847–0.914)
  • Pairwise agreement: average 0.858 (range: 0.777–0.916)

Evaluation Metric:
Cohen’s quadratic weighted kappa between predicted and reference ISUP grades (1–5). Because the weights grow with the squared distance between grades, a prediction is penalized more heavily the further it lies from the reference grade.
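A minimal sketch of the metric, assuming integer grades from 1 to 5 and the standard definition of quadratic weighted kappa. This is not the challenge's official evaluation code, and the function name is illustrative.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, min_grade=1, max_grade=5):
    """Cohen's kappa with quadratic weights for ordinal grades."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    n = max_grade - min_grade + 1

    # Observed confusion matrix: rows = reference, columns = prediction.
    observed = np.zeros((n, n), dtype=float)
    for t, p in zip(y_true, y_pred):
        observed[t - min_grade, p - min_grade] += 1

    # Expected matrix under chance agreement, from the two marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic weights: 0 on the diagonal, 1 for the largest possible error.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    denominator = (weights * expected).sum()
    if denominator == 0:
        # Undefined when both raters use a single identical grade.
        return float("nan")
    return 1.0 - (weights * observed).sum() / denominator
```

For example, with reference grades [1, 2, 3, 4, 5], predicting [1, 2, 3, 4, 4] (one error of a single grade) scores higher than predicting [1, 2, 3, 4, 1] (one error of four grades).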

Relation to Existing Challenges:
Task 1 is partly adapted from the PANDA challenge.

Differences from PANDA:

  • We provide pre-computed tissue segmentation masks
  • We do not provide pre-computed Gleason pattern segmentation masks
  • Not all the data comes from PANDA
  • PANDA focused on developing models from a large, locally labeled dataset; this task instead emphasizes few-shot learning, with training conducted on-platform