How we evaluate our LLM judge
The article describes how ForUs evaluates its LLM judge with a perturbation-based approach. By systematically introducing controlled variations (perturbations) to test inputs, the team measures how consistent and reliable the judge's verdicts are. The technique shows how robust the judge is to subtle changes in phrasing or context, and it does not require human-labeled ground-truth data.
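To make the idea concrete, here is a minimal sketch of a perturbation-based consistency check in Python. It is not ForUs's implementation: the summary does not say which perturbations or metrics they use, so the `perturb` variants, the `consistency` score, and both stand-in judges (`toy_judge`, `robust_judge`) are hypothetical. A real setup would replace the stand-in judges with calls to an actual LLM.

```python
from typing import Callable, List


def perturb(text: str) -> List[str]:
    """Return variants of `text` that should not change its meaning.

    These four perturbations are illustrative assumptions, not the
    ones used in the article.
    """
    return [
        text.lower(),              # case change
        text.upper(),              # case change
        "  " + text + "  ",        # surrounding whitespace
        text.rstrip(".") + ".",    # normalized trailing period
    ]


def consistency(judge: Callable[[str], str], text: str) -> float:
    """Fraction of perturbed inputs whose verdict matches the verdict
    on the unperturbed input. No human label is needed: the judge is
    only compared with itself."""
    baseline = judge(text)
    variants = perturb(text)
    agreeing = sum(judge(variant) == baseline for variant in variants)
    return agreeing / len(variants)


def toy_judge(answer: str) -> str:
    """Hypothetical stand-in for an LLM judge. It is deliberately
    case-sensitive so that some perturbations flip its verdict."""
    return "pass" if "Paris" in answer else "fail"


def robust_judge(answer: str) -> str:
    """Hypothetical stand-in that ignores case and is therefore stable
    under the perturbations above."""
    return "pass" if "paris" in answer.lower() else "fail"


if __name__ == "__main__":
    sample = "The capital of France is Paris."
    print(consistency(toy_judge, sample))
    print(consistency(robust_judge, sample))
```

A score of 1.0 means the verdict never changed under perturbation, while lower scores flag sensitivity to surface form. The score measures self-consistency only, so a judge can be perfectly consistent and still be wrong.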