Innovative Approaches to Prostate Cancer Diagnosis Using AI
Understanding Prostate Cancer and Diagnostic Challenges
With over a million new cases diagnosed each year, prostate cancer (PCa) ranks as the second most prevalent cancer among men globally, leading to more than 350,000 fatalities annually. Enhancing diagnostic accuracy is essential for reducing mortality rates.
Source: Kaggle
In the context of Kaggle’s prostate cancer competition, I was struck by the innovative machine learning solutions presented. My background in digital pathology allowed me to appreciate the intricacies involved in this challenge.
The primary objective was to predict the ISUP grade from whole-slide images (WSIs). These pathology images are extremely high-resolution, which makes them hard to process. The ISUP scale expresses cancer risk, ranging from 0 (no cancer) to 5 (high risk).
Evaluating Success: Quadratic Weighted Kappa
The competition was evaluated with Quadratic Weighted Kappa (QWK). QWK measures the agreement between two sets of ratings, here the predicted and the actual ISUP grades: 1 signifies perfect agreement, 0 indicates agreement no better than chance, and a negative value indicates less agreement than chance would produce. The calculation builds an N x N histogram matrix of actual versus predicted grades and compares it against the matrix expected by chance, using weights that grow with the squared distance between grades.
Source: Kaggle
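The QWK calculation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the competition's scoring code; the function name and the `n_classes=6` default (ISUP grades 0 to 5) are assumptions made for the example.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=6):
    """Quadratic weighted kappa between two integer ratings in [0, n_classes)."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed N x N histogram: observed[i, j] counts samples with
    # actual grade i and predicted grade j.
    observed = np.zeros((n_classes, n_classes), dtype=float)
    np.add.at(observed, (y_true, y_pred), 1)

    # Quadratic disagreement weights: w[i, j] = (i - j)^2 / (N - 1)^2.
    grades = np.arange(n_classes)
    weights = (grades[:, None] - grades[None, :]) ** 2 / (n_classes - 1) ** 2

    # Histogram expected by chance: outer product of the marginals,
    # scaled to the same total count as the observed matrix.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Because the weights grow with the squared distance between grades, predicting grade 5 for a grade 0 slide is penalized far more heavily than predicting grade 1.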
Data Preprocessing: The Initial Challenge
Processing WSIs is notoriously labor-intensive. Resizing the images may seem appealing, but it often results in significant information loss. For my work, I employed OpenSlide to partition the slides into manageable tiles. The winning solution implemented a method called Concatenate Tile Pooling (CTP).
CTP selects N tiles from each image based on their tissue pixel content and passes each tile through the convolutional layers individually. The resulting feature maps are then concatenated into a single map before pooling and being passed to a fully connected head.
Source: Kaggle
Source: Github (CC license)
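The tile selection step can be sketched as follows. This is a simplified illustration rather than the winning team's code: the function name, the tile size, the tile count, and the assumption of a white slide background (so that darker tiles hold more tissue) are all choices made for the example.

```python
import numpy as np

def select_tiles(image, tile_size=128, n_tiles=16):
    """Split an RGB image into square tiles and keep the n_tiles with the most tissue.

    Assumes a white (255) background, so darker tiles contain more tissue.
    """
    h, w, c = image.shape
    # Pad with white so both dimensions are multiples of tile_size.
    pad_h = (tile_size - h % tile_size) % tile_size
    pad_w = (tile_size - w % tile_size) % tile_size
    image = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), constant_values=255)

    # Reshape into (num_tiles, tile_size, tile_size, channels).
    rows, cols = image.shape[0] // tile_size, image.shape[1] // tile_size
    tiles = image.reshape(rows, tile_size, cols, tile_size, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, tile_size, tile_size, c)

    # If the image yields fewer tiles than requested, top up with blank tiles.
    if len(tiles) < n_tiles:
        blank = np.full((n_tiles - len(tiles), tile_size, tile_size, c), 255,
                        dtype=image.dtype)
        tiles = np.concatenate([tiles, blank])

    # Lowest pixel sum = darkest tile = most tissue.
    order = np.argsort(tiles.reshape(len(tiles), -1).sum(axis=1, dtype=np.int64))
    return tiles[order[:n_tiles]]
```

The selected tiles would then be fed through a shared convolutional backbone one by one, with the per-tile feature maps concatenated before pooling, as described above.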
Addressing Dataset Noise with Image Hashing
The dataset included noise and duplicates, complicating the task. Imagehash, a library for generating hash values based on an image's visual content, was utilized to identify and remove duplicate images. The following code snippet demonstrates the process:
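    import cv2
    import imagehash
    import numpy as np
    from PIL import Image
    from tqdm import tqdm

    # Different hashing types: four 64-bit hashes, 256 bits per image
    funcs = [imagehash.average_hash, imagehash.phash, imagehash.dhash, imagehash.whash]

    # `paths` is assumed to be a list of image file paths defined earlier
    hashes = []
    for path in tqdm(paths, total=len(paths)):
        image = cv2.imread(path)
        image = Image.fromarray(image)
        hashes.append(np.array([f(image).hash for f in funcs]).reshape(256))
    hashes = np.array(hashes)

    # Calculate similarity scores: fraction of the 256 hash bits that match
    sims = np.array([(hashes[i] == hashes).sum(axis=1) / 256 for i in range(hashes.shape[0])])
    threshold = 0.96
    # Keep only pairs with i < j so self-matches and mirrored pairs are excluded
    duplicates = np.where(np.triu(sims > threshold, k=1))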
Source: Kaggle
Modeling Techniques: EfficientNet and Beyond
The winning solution employed three different EfficientNet models and utilized Cross-Entropy Loss. EfficientNet has gained popularity for supervised image classification tasks, including its successful application in the Melanoma competition.
While many competitors relied on ensembles of two networks, this approach included an additional network for label cleaning, which addressed the noisy dataset, one of the major challenges in the competition. The idea resembles the pseudo-labeling technique used in other competitions.
Dynamic Learning Rate Management
With a cosine annealing scheduler, the learning rate varies throughout training: it starts high, decays toward zero along a cosine curve, and then rises again at each restart, producing a repeating wave.
Source: Wikipedia commons
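As a rough sketch of that schedule, the learning rate at any training step can be computed directly from the cosine formula. The function name, the default rates, and the fixed cycle length are illustrative assumptions; the article does not state the values the team used.

```python
import math

def cosine_annealing_lr(step, cycle_length, lr_max=1e-3, lr_min=0.0):
    """Learning rate at a given step for cosine annealing with restarts."""
    # Position inside the current cycle; the schedule restarts
    # every cycle_length steps.
    t = step % cycle_length
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_length))
```

Calling it for steps 0 through 2 * cycle_length traces two full waves: the rate falls from lr_max toward lr_min, then jumps back to lr_max when the next cycle begins.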
Enhancing Model Performance with Data Augmentation
Data augmentation has become a standard practice in leading Kaggle solutions. The team incorporated techniques such as cutout and mixup to improve generalization. Cutout masks out random sections of the input images, producing partially obscured versions that enrich the dataset. Mixup blends pairs of images, together with their labels, into weighted combinations.
Source: arXiv
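Both augmentations are simple enough to sketch in NumPy. The version below is a minimal illustration, assuming float images of shape (H, W, C) and one-hot label vectors; the function names, the zero fill value, and the `alpha=0.4` default are assumptions for the example, not settings reported by the team.

```python
import numpy as np

def cutout(image, size, rng):
    """Return a copy of image with one randomly placed size x size square zeroed out."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out = image.copy()
    out[top:top + size, left:left + size] = 0
    return out

def mixup(image_a, label_a, image_b, label_b, rng, alpha=0.4):
    """Blend two images and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.beta(alpha, alpha)
    image = lam * image_a + (1 - lam) * image_b
    label = lam * label_a + (1 - lam) * label_b
    return image, label
```

A generator created with `np.random.default_rng(seed)` can be passed as `rng` to make the augmentations reproducible.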
Data Processing Insights
The effectiveness of data processing techniques cannot be overstated. Analyzing other solutions reveals that data cleaning was pivotal for this competition's winning entry.
To summarize the solution approach:
- Group images by similarity and eliminate duplicates.
- Train with noisy labels.
- Mitigate noise using prediction and original label discrepancies.
- Retrain the model without noise.
- Combine models for final predictions.
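The noise-mitigation step in the list above can be sketched as a simple filter on the gap between each label and the model's prediction for that sample. This is an illustrative sketch only: the function name and the `max_gap=1.5` threshold are hypothetical, and it assumes the predictions are continuous grade estimates made on held-out folds, so that the model has not simply memorized the noisy labels.

```python
import numpy as np

def filter_noisy_labels(labels, predictions, max_gap=1.5):
    """Return a boolean mask that is True for samples to keep.

    A sample is treated as noisy when its prediction differs from the
    original label by more than max_gap (a hypothetical threshold).
    """
    labels = np.asarray(labels, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return np.abs(labels - predictions) <= max_gap
```

The mask can then be used to index the training set before retraining, which corresponds to the "retrain the model without noise" step.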
Conclusion: Lessons from Kaggle Competitions
The insights gained from top-performing solutions in Kaggle competitions extend beyond merely selecting the right model. As demonstrated, numerous complexities must be navigated to achieve success.
Chapter 2: Engaging with Prostate Cancer Insights
Explore the following informative videos to deepen your understanding of prostate cancer challenges and solutions.
The first video discusses the Kaggle meetup focused on the Prostate Cancer Grade Assessment (PANDA) Challenge, providing further insights into the AI solutions developed for this competition.
The second video features a Q&A on Prostate Cancer with experts Mark Moyad, MD, MPH, and Mark Scholz, MD, offering the latest updates and insights for 2024.