Prompt eval cues predict refusal shifts across 32K LLM outputs
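A study of 32,000 large language model (LLM) run outputs found that specific cues in the prompt eval predict when a model shifts from refusing to answer to giving a response. This suggests that refusal behavior is not determined entirely by the reasoning trace, but is influenced more by evaluation signals implicit in the prompt (eval awareness). The study sheds light on how LLMs behave under safety alignment and offers a new perspective on how models make decisions.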
A state-designed worm from 2005 called Fast16 sat undetected on VirusTotal for nearly a decade. It intercepted executable files at the kernel level and silently altered floating-point calculations in high-precision engineering software like LS-DYNA, which was used in Iran's nuclear weapons research. Unlike Stuxnet, Fast16 received little public attention for over twenty years.
Paul Graham reports that more than 75% of the code at Y Combinator startups is now written by AI, a threshold crossed at least one to two years ago. Google underwent a similar shift, with AI-written code going from 0% to 75% in about two years.
Scientists are increasingly concerned about the potential collapse of the Atlantic Meridional Overturning Circulation (AMOC), a critical ocean current system. Such a collapse could have severe consequences for North America and Europe.
A compromised version of the LiteLLM Python package (version 1.82.8) was briefly available on PyPI, capable of exfiltrating sensitive credentials like SSH keys and cloud secrets. The malicious package affected any project that depended on LiteLLM, though it was only available for about an hour before discovery.
A supply chain attack has compromised the popular npm axios HTTP client library with 300 million weekly downloads. Malicious versions install a remote access trojan, though some users may have avoided infection through version pinning or older installations. Security experts warn this is a live compromise affecting one of npm's most depended-on packages.