Knowledge Distillation
Implement the knowledge distillation loss function. Knowledge distillation is a model compression technique in which a smaller student model learns to mimic the behavior of a larger, pre-trained teacher model. The student is trained on a combination of the teacher's soft probability distributions and the hard ground-truth labels.
Formula:
L = alpha * KL(soft_teacher || soft_student) + (1 - alpha) * CE(hard_labels, student)
Simplified (soft target distillation only):
L = MSE(teacher_output, student_output)
= mean((teacher_output - student_output)^2)
Where:
teacher_output = output logits or probabilities from the teacher model
student_output = output logits or probabilities from the student model
alpha = weight balancing soft and hard loss components
T (temperature) = scaling factor applied to the logits to produce softer distributions
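The combined loss above can be sketched as follows. This is a minimal NumPy sketch, not a reference implementation: the function names, the default `alpha=0.5` and `T=2.0`, and the `T**2` factor on the KL term (the scaling used in Hinton et al.'s formulation to keep gradient magnitudes comparable across temperatures) are assumptions for illustration.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax: larger T yields a softer distribution.
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, hard_label,
                      alpha=0.5, T=2.0):
    # Soft targets from the teacher and soft predictions from the
    # student, both computed at temperature T.
    p = softmax(teacher_logits, T)  # teacher distribution
    q = softmax(student_logits, T)  # student distribution
    # KL(p || q), scaled by T**2 (an assumed convention, per Hinton et al.)
    kl = float(np.sum(p * (np.log(p) - np.log(q)))) * T ** 2
    # Cross-entropy of the student (at T = 1) against the hard label.
    ce = -float(np.log(softmax(student_logits, 1.0)[hard_label]))
    return alpha * kl + (1 - alpha) * ce
```

When teacher and student logits are identical, the KL term vanishes and the loss reduces to `(1 - alpha)` times the student's cross-entropy.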
Example:
Input: teacher_output = [0.1, 0.2], student_output = [0.3, 0.4]
diff = [0.1-0.3, 0.2-0.4] = [-0.2, -0.2]
squared = [0.04, 0.04]
MSE = mean([0.04, 0.04]) = 0.04
Output: 0.04
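The simplified MSE form of the loss, following the worked example above, can be sketched in plain Python (the function name is an assumption):

```python
def mse_distillation_loss(teacher_output, student_output):
    # Mean squared error between teacher and student outputs.
    assert len(teacher_output) == len(student_output)
    squared = [(t - s) ** 2 for t, s in zip(teacher_output, student_output)]
    return sum(squared) / len(squared)

# Reproduces the worked example: result is approximately 0.04.
mse_distillation_loss([0.1, 0.2], [0.3, 0.4])
```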
The loss measures how closely the student model's outputs match the teacher model's outputs. By training on the teacher's soft probability distributions (produced with temperature scaling), the student captures richer information about inter-class relationships than it would from hard labels alone.
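The softening effect of temperature can be seen directly. The sketch below uses hypothetical teacher logits; at higher T, probability mass shifts from the top class toward the others, exposing the inter-class relationships mentioned above:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    # Softmax with temperature T; larger T flattens the distribution.
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for a 3-class problem.
logits = [5.0, 2.0, 1.0]
sharp = softmax_with_temperature(logits, T=1.0)  # peaked on class 0
soft = softmax_with_temperature(logits, T=4.0)   # flatter distribution
```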
Test Cases
Test 1: input = [[0.1, 0.2], [0.3, 0.4]], expected output = 0.005
Test 2: input = [[0.5, 0.6], [0.7, 0.8]], expected output = 0.02