
147. Knowledge Distillation

Difficulty: Hard

Implement the Knowledge Distillation loss function, a model compression technique where a smaller student model learns to mimic the behavior of a larger, pre-trained teacher model. The student is trained using a combination of the soft probability distributions from the teacher and the hard ground-truth labels.

Formula:

L = alpha * KL(soft_teacher || soft_student) + (1 - alpha) * CE(hard_labels, student)

Simplified (soft target distillation only):

L = MSE(teacher_output, student_output)

= mean((teacher_output - student_output)^2)

Where:

teacher_output = output logits or probabilities from the teacher model

student_output = output logits or probabilities from the student model

alpha = weight balancing soft and hard loss components

T (temperature) = scaling factor applied to logits to produce softer distributions
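Putting these definitions together, the full loss can be sketched in NumPy. This is a minimal illustration, not a reference solution: the helper names are made up, and the common convention of multiplying the KL term by T^2 is omitted here to match the formula given above.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields a softer distribution
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(teacher_logits, student_logits, hard_label, alpha=0.5, T=2.0):
    # Soft targets from the teacher, soft predictions from the student
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) = sum p_t * (log p_t - log p_s)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    # Cross-entropy against the hard ground-truth label (no temperature)
    q_student = softmax(student_logits, 1.0)
    ce = -np.log(q_student[hard_label])
    return alpha * kl + (1 - alpha) * ce
```

With alpha = 1 and identical teacher/student logits the KL term vanishes, and with alpha = 0 the loss reduces to ordinary cross-entropy on the hard label.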

Example:

Input: teacher_output = [0.1, 0.2], student_output = [0.3, 0.4]

diff = [0.1-0.3, 0.2-0.4] = [-0.2, -0.2]

squared = [0.04, 0.04]

MSE = mean([0.04, 0.04]) = 0.04

Output: 0.04
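The worked example above can be reproduced with a short NumPy function (a sketch of the simplified soft-target loss; the name `distillation_loss_mse` is illustrative):

```python
import numpy as np

def distillation_loss_mse(teacher_output, student_output):
    """Simplified distillation loss: MSE between teacher and student outputs."""
    teacher = np.asarray(teacher_output, dtype=float)
    student = np.asarray(student_output, dtype=float)
    return float(np.mean((teacher - student) ** 2))

# Worked example: diff = [-0.2, -0.2], squared = [0.04, 0.04]
print(round(distillation_loss_mse([0.1, 0.2], [0.3, 0.4]), 6))  # -> 0.04
```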

The loss measures how closely the student model's outputs match the teacher model's outputs. By training on the teacher's soft probability distributions (produced with temperature scaling), the student captures richer information about inter-class relationships than it would from hard labels alone.
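To see why soft targets carry richer information, compare a teacher's softmax output at T = 1 against a higher temperature (a small illustration; the logits are made up):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax (max subtracted for numerical stability)
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.2])

print(np.round(softmax(logits, T=1.0), 3))  # sharp: mass concentrated on class 0
print(np.round(softmax(logits, T=4.0), 3))  # softer: inter-class structure visible
```

At higher temperature the distribution flattens, exposing the relative similarities between classes that a one-hot hard label hides.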

Constraints:

  • Both `teacher_output` and `student_output` are 1D NumPy arrays of equal length.
  • Use mean squared error (MSE) for the distillation loss computation.
  • Use NumPy for all operations.
Test Cases:

  Test Case 1
  Input: [[0.1,0.2],[0.3,0.4]]
  Expected: 0.005

  Test Case 2
  Input: [[0.5,0.6],[0.7,0.8]]
  Expected: 0.02

  + 3 hidden test cases