译文语言

提示评估线索预测了32K次LLM输出中的拒绝转向

一项针对32,000次大语言模型（LLM）运行输出的研究发现，提示评估（prompt eval）中的特定线索能够预测模型从“拒绝回答”到“给出回应”的行为转变。这表明模型的拒绝行为并非完全由推理痕迹（reasoning trace）决定，而是更多地受到提示中隐含的评估信号（eval awareness）影响。该研究揭示了LLM在安全对齐中的行为模式，为理解模型决策机制提供了新视角。

提示评估线索预测了32K次LLM输出中的拒绝转向

相关报道

RT Lukasz Olejnik: A 2005 state-designed worm designed to corrupt physics simulations sat undetected on VirusTotal for nearly a decade. Fast16, interc...

Each Y Combinator batch I ask the startups what percent of their code is written by AI. It passed 75% at least a year ago, maybe two.

This is the aspect of climate change that I worry most about — when instead of seeing gradual degradation, we cross an irreversible line.

Software horror: litellm PyPI supply chain attack. Simple `pip install litellm` was enough to exfiltrate SSH keys, AWS/GCP/Azure creds, Kubernetes con...

New supply chain attack this time for npm axios, the most popular HTTP client library with 300M weekly downloads. Scanning my system I found a use imp...

提示评估线索预测了32K次LLM输出中的拒绝转向

相关报道

RT Lukasz Olejnik: A 2005 state-designed worm designed to corrupt physics simulations sat undetected on VirusTotal for nearly a decade. Fast16, interc...

Each Y Combinator batch I ask the startups what percent of their code is written by AI. It passed 75% at least a year ago, maybe two.

This is the aspect of climate change that I worry most about — when instead of seeing gradual degradation, we cross an irreversible line.

Software horror: litellm PyPI supply chain attack. Simple `pip install litellm` was enough to exfiltrate SSH keys, AWS/GCP/Azure creds, Kubernetes con...

New supply chain attack this time for npm axios, the most popular HTTP client library with 300M weekly downloads. Scanning my system I found a use imp...