Prompt eval cues predicted refusal shifts across 32k LLM rollouts
An analysis of 32,000 LLM rollouts found that prompt evaluation cues, rather than reasoning traces, were the main predictors of shifts in refusal behavior. This suggests that models may rely more on surface-level prompt features than on internal reasoning when making safety decisions.
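
To make "predictor of a refusal shift" concrete, here is a minimal sketch of one way such a comparison could be computed. It is not the analysis's actual method. It assumes that an evaluation cue is a prompt-level signal that the model is being evaluated, and that each rollout is a record with boolean fields. The field names `has_eval_cue`, `trace_mentions_eval`, and `refused` are hypothetical, and the four records are hand-made toy data, not results from the 32,000 rollouts.

```python
from typing import Mapping, Sequence

Rollout = Mapping[str, bool]


def refusal_rate(rollouts: Sequence[Rollout]) -> float:
    """Fraction of rollouts in which the model refused."""
    if not rollouts:
        raise ValueError("cannot compute a refusal rate over zero rollouts")
    return sum(r["refused"] for r in rollouts) / len(rollouts)


def refusal_shift(rollouts: Sequence[Rollout], feature: str) -> float:
    """Refusal rate with the feature present minus the rate with it absent."""
    present = [r for r in rollouts if r[feature]]
    absent = [r for r in rollouts if not r[feature]]
    return refusal_rate(present) - refusal_rate(absent)


# Hand-made toy records (hypothetical field names, illustrative values only).
toy_rollouts = [
    {"has_eval_cue": True, "trace_mentions_eval": True, "refused": True},
    {"has_eval_cue": True, "trace_mentions_eval": False, "refused": True},
    {"has_eval_cue": False, "trace_mentions_eval": True, "refused": False},
    {"has_eval_cue": False, "trace_mentions_eval": False, "refused": False},
]

cue_shift = refusal_shift(toy_rollouts, "has_eval_cue")
trace_shift = refusal_shift(toy_rollouts, "trace_mentions_eval")
print(f"shift by prompt cue: {cue_shift:+.2f}")
print(f"shift by trace feature: {trace_shift:+.2f}")
```

In this toy set the refusal rate moves with the prompt cue and not with the trace feature, which is the pattern the summary describes. A real analysis would also need to account for confounding between the two features and for sampling noise.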