Conventional knowledge distillation, designed for model compression, fails on long-tailed distributions because the teacher model tends to be biased toward head classes and provides limited supervision for tail classes.
Overview of standard KD on long-tailed distributions.
We propose Long-Tailed Knowledge Distillation (LTKD), a novel framework that reformulates the conventional objective into two components: a cross-group loss, capturing mismatches in prediction distributions across class groups (head, medium, and tail), and a within-group loss, capturing discrepancies within each group's distribution. This decomposition reveals the specific sources of the teacher's bias. To mitigate the inherited bias, LTKD introduces (1) a rebalanced cross-group loss that calibrates the teacher's group-level predictions and (2) a reweighted within-group loss that ensures equal contribution from all groups.
Overview of the proposed Long-Tailed Knowledge Distillation (LTKD).
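To make the decomposition concrete, below is a minimal PyTorch sketch of one plausible form of the two losses. The head/medium/tail partition (`groups`), the class-prior-based rebalancing of the teacher's group-level predictions (`class_prior`), the equal per-group weights, and the temperature `tau` are illustrative assumptions for this sketch, not the paper's exact formulation or released code.

```python
import torch
import torch.nn.functional as F


def ltkd_loss(student_logits, teacher_logits, groups, class_prior, tau=4.0):
    """Sketch of an LTKD-style objective (assumed form).

    student_logits, teacher_logits: (batch, num_classes)
    groups: list of LongTensors holding the class indices of the
            head / medium / tail groups (a partition of all classes)
    class_prior: (num_classes,) training-set class frequencies
    tau: distillation temperature
    """
    p_s = F.softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)

    # Rebalanced cross-group loss: compare group-level probability mass
    # after dividing out the (assumed) class-prior bias of the teacher.
    p_t_rebal = p_t / class_prior.unsqueeze(0)
    p_t_rebal = p_t_rebal / p_t_rebal.sum(dim=1, keepdim=True)

    s_group = torch.stack([p_s[:, g].sum(dim=1) for g in groups], dim=1)
    t_group = torch.stack([p_t_rebal[:, g].sum(dim=1) for g in groups], dim=1)
    cross = F.kl_div(s_group.clamp_min(1e-8).log(), t_group,
                     reduction="batchmean")

    # Reweighted within-group loss: KL between the distributions restricted
    # to each group, averaged so every group contributes equally.
    within = 0.0
    for g in groups:
        s_in = p_s[:, g] / p_s[:, g].sum(dim=1, keepdim=True)
        t_in = p_t[:, g] / p_t[:, g].sum(dim=1, keepdim=True)
        within = within + F.kl_div(s_in.clamp_min(1e-8).log(), t_in,
                                   reduction="batchmean")
    within = within / len(groups)

    return (tau ** 2) * (cross + within)
```

The key design point this sketch illustrates is that the group-level comparison and the within-group comparisons are handled separately, so the teacher's head-class bias can be corrected at the group level without discarding its fine-grained knowledge inside each group.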
Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT demonstrate that LTKD significantly outperforms existing methods in both overall and tail-class accuracy, thereby showing its ability to distill balanced knowledge from a biased teacher for real-world applications.
Accuracy (%) on tail and overall classes for CIFAR-100-LT.
Accuracy (%) on tail and overall classes for TinyImageNet-LT.
Accuracy (%) on tail and overall classes for ImageNet-LT.
@InProceedings{kim2026distilling,
title={Distilling Balanced Knowledge from a Biased Teacher},
author={Kim, Seonghak},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}