TAG · #BENCHMARK

30 items

HOTNESS

TaxCalcBench: An open source eval for testing if AI can file taxes
5.0
TaxCalcBench is an open-source evaluation benchmark designed to test whether AI systems can accurately prepare and file tax returns, providing a standardized way to assess AI tax filing capabilities.
hn · Jul 8, 2026 · #Tech
Cursorbench: Grok 4.5 better than GPT-5.5, at ~half the cost
2.0
Cursorbench, a new coding benchmark, ranks Grok 4.5 above GPT-5.5 in performance while costing roughly half as much, according to results posted on cursor.com.
hn · Jul 8, 2026 · #Tech
A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation
4.0
The paper proposes a deterministic method to replace LLM-as-Judge for evaluating stateful agents, aiming to improve reliability and reproducibility in agent assessment by removing the stochastic nature of using large language models as evaluators.
hn · Jul 3, 2026 · #Tech
CursorBench 3.1
0.0
CursorBench 3.1 is a benchmark introduced by Cursor to evaluate AI-assisted coding agents on real-world software engineering tasks, covering goals like code generation, debugging, and refactoring. It aims to provide a standardized metric for measuring model and tool performance in practical development scenarios.
hn · Jul 2, 2026 · #Tech
OpenAI Gym (2016)
8.0
The paper introduces OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. It provides a standard set of environments, a common interface, and a platform for benchmarking algorithm performance across diverse tasks.
hn · Jul 1, 2026 · #Tech
Show HN: A reproducible React data grid benchmark with raw browser samples
2.0
A new open-source benchmark lets developers test React data grid performance using raw browser samples, providing reproducible results for comparing speed and responsiveness across different grid libraries.
hn · Jul 1, 2026 · #Tech
ZCode: GLM-5.2's own harness is officially live
1.0
ZCode, the official evaluation harness for the GLM-5.2 model, has been released and is now live.
hn · Jul 1, 2026 · #Tech
A Multilingual Auditor-Judge Safety Benchmark for Emotional-Support Chatbots
5.0
This multilingual benchmark evaluates the safety of emotional-support chatbots by combining an auditor and judge framework. It assesses responses for inappropriate or harmful content across several languages, aiming to improve safety protocols in empathetic AI systems.
hn · Jul 1, 2026 · #Tech
OpenAI: GeneBench-Pro
6.0
OpenAI introduced GeneBench-Pro, a new benchmark designed to evaluate AI models on genetic and genomic tasks. The benchmark aims to advance AI's ability to understand and predict biological functions from DNA sequences.
hn · Jun 30, 2026 · #Science
Benchmarks and Obscurantism: A "red" line that should not be crossed
4.0
ClickHouse criticizes Databricks for using non-reproducible benchmarks in its "Redshift 8x faster" claim, arguing that lack of full transparency—code, data, and configurations—misleads the industry and erodes trust.
hn · Jun 30, 2026 · #Tech
Benchmarking Hardwood 1.0 on a Threadripper 9980X
2.5
The article presents benchmarks of Hardwood 1.0, a distributed log system, running on a Threadripper 9980X processor. It measures throughput and latency under various workloads, comparing performance across different configurations and hardware setups.
hn · Jun 30, 2026 · #Tech
How much better is Strix 1.0? Results from a small rerun
2.0
The article presents results from a small rerun comparing Strix 1.0 to its predecessor and breaks down the metrics and gains observed in the updated version. The results indicate measurable performance improvements, though the small sample size calls for cautious interpretation.
hn · Jun 30, 2026 · #Tech
Benchmark agent configs with a simple CLI tool
2.0
Clawmark is a CLI tool that allows users to benchmark and compare different agent configurations, enabling performance evaluation through simple command-line operations.
hn · Jun 30, 2026 · #Tech
Show HN: Is grep enough? A transparent benchmark for agentic code navigation
3.0
The author created a transparent benchmark comparing tree-sitter, grep, and bash tools for agentic code navigation. The benchmark ran 150 isolated tests across 10 large codebases (including Bitcoin, Django, Rails, and Redis) at five complexity levels. All scripts, Docker images, and transcripts are publicly shared.
hn · Jun 30, 2026 · #Tech
SocOCRbench – An OCR benchmark for social science documents
2.0
SocOCRbench is a new benchmark designed to evaluate OCR systems on social science documents, addressing the unique challenges of historical texts, tables, and non-standard layouts often found in social science research materials.
hn · Jun 30, 2026 · #Science
GLM5.2 vs. Opus 4.8
2.0
The video compares the performance of GLM5.2 and Opus 4.8, likely two AI or software models, across various benchmarks or tasks, highlighting differences in their capabilities and outputs.
hn · Jun 29, 2026 · #Tech
Show HN: An open source benchmark for prompt-injection detectors
5.0
A new open-source benchmark for evaluating prompt-injection detectors has been released. The tool allows developers to test and compare the effectiveness of different detection systems against prompt injection attacks.
hn · Jun 29, 2026 · #Tech
PCB-Bench: Benchmarking LLMs for PCB Placement and Routing (ICLR 2026)
3.0
PCB-Bench is a benchmark designed to evaluate large language models on printed circuit board placement and routing tasks, accepted at ICLR 2026. It provides a standardized dataset and metrics for assessing LLM performance in electronic design automation.
hn · Jun 29, 2026 · #Tech
AI Agent Triggers Nuclear Strike After Getting Outmaneuvered in Civilization VI
3.5
In a benchmark test, an AI agent playing Civilization VI ordered a nuclear strike on Gandhi after being outmaneuvered. The incident highlights how AI systems can adopt aggressive strategies in complex game environments when faced with losing positions, raising questions about AI decision-making under pressure.
hn · Jun 28, 2026 · #Tech
TOP500 at ISC'26: We Have a New Number 1 – By George Cozma
6.0
The TOP500 list at ISC 2026 features a new number one supercomputer, marking a significant shift in high-performance computing rankings. The article details the new system's performance metrics and its implications for the HPC landscape.
hn · Jun 28, 2026 · #Tech
Benchmarking real-time voice translation
3.0
StartPinch has released a benchmark evaluating real-time voice translation systems, testing accuracy, latency, and language coverage across multiple models and language pairs. The benchmark aims to provide a standardized way to compare speech translation tools for developers and businesses.
hn · Jun 28, 2026 · #Tech
Show HN: A benchmark for the failure modes of agent memory
6.0
Agent Memory Bench is a new benchmark designed to test and identify failure modes in agent memory systems. It provides a standardized way to evaluate how well AI agents can retain and utilize information across interactions.
hn · Jun 27, 2026 · #Tech
Show HN: Tested – AI Tools Scored by a Panel of LLMs (Claude, GPT, Gemini, Grok)
2.0
Tested is a platform where AI tools are rated by a panel of large language models including Claude, GPT, Gemini, and Grok, providing aggregated scores based on LLM evaluations rather than human reviews.
hn · Jun 27, 2026 · #Tech
Human-bench: an eval for "human shaped" agents
3.0
Human-Bench is a new evaluation benchmark designed to measure how closely AI agents perform like humans across a variety of real-world tasks. The leaderboard tracks and compares the "human-shaped" behavior of different AI systems.
hn · Jun 26, 2026 · #Tech
Show HN: Vectordb benchmark – cost (e.g., turbopuffer vs. Zilliz vs. Pinecone)
3.0
Zilliz has launched VDbbench Leaderboard v2, a cost-aware benchmarking tool comparing vector databases like Turbopuffer, Zilliz, and Pinecone. The methodology and open-source code are available on GitHub, enabling users to evaluate performance relative to cost.
hn · Jun 25, 2026 · #Tech
Show HN: mlx-chronos - benchmark MLX inference engines on Apple Silicon
4.0
mlx-chronos is a benchmarking tool for comparing MLX inference engines on Apple Silicon, helping developers evaluate performance across different MLX implementations.
hn · Jun 25, 2026 · #Tech
Benchmark unlimited Claude.md files against each other
0.0
Clawmark is a tool that benchmarks unlimited Claude.md files against each other, allowing users to compare performance across different configurations or versions.
hn · Jun 25, 2026 · #Tech
Claude Opus 4.5 vs. GLM-5.2
2.0
A comparison between Claude Opus 4.5 and GLM-5.2 is presented, analyzing their respective capabilities in natural language processing, reasoning, and multimodal tasks. The article evaluates performance benchmarks, speed, and usability to determine which model offers superior performance for various AI applications.
hn · Jun 25, 2026 · #Tech
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
6.0
IatroBench provides pre-registered evidence that AI safety measures can cause iatrogenic harm—unintended negative effects that increase risks instead of reducing them. The study systematically documents such backfires, stressing the need for empirical testing of safety techniques.
hn · Jun 25, 2026 · #Tech
Kebab Benchmark for LLMs
2.0
The Kebab Benchmark is introduced as a new evaluation method for large language models, focusing on kebab-related knowledge and reasoning tasks.
hn · Jun 24, 2026 · #Tech
