TAG · #DISTRIBUTED-SYSTEMS

30 items

Nondeterminism's Not the Problem
2.0
The article argues that nondeterminism in computing is not a fundamental problem, and that the real challenge lies in how we model and reason about systems. It suggests that nondeterministic behaviors can be effectively managed with proper abstractions and tools, shifting the focus away from eliminating nondeterminism itself.
hn · May 29, 2026 · #Tech
Show HN: DRD – "Git for distributed consensus failures"
3.0
DRD (Deterministic Replay Debugger) is a tool for debugging distributed systems by capturing, replaying, and explaining why consensus diverged. It is described as "Git for distributed consensus failures."
hn · May 29, 2026 · #Tech
Idempotent and Moving Window is simply a reduction (2021)
1.0
The article explains that idempotent and moving window operations in stream processing can be understood as specific cases of a reduction. It demonstrates how these seemingly distinct patterns are unified under the concept of reduction, simplifying their implementation and reasoning.
hn · May 28, 2026 · #Tech
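One possible reading of the summary above, sketched in Python: both a moving window and a duplicate-ignoring (idempotent) accumulation can be written as a fold over the stream with a small state. The function names `moving_window` and `dedupe` are hypothetical and are not taken from the linked article, which may define these patterns differently.

```python
from functools import reduce


def moving_window(items, size):
    """Fold a stream into its last `size` items (assumes size >= 1)."""
    def step(window, item):
        # Append the new item, then keep only the newest `size` entries.
        return (window + (item,))[-size:]
    return reduce(step, items, ())


def dedupe(items):
    """Fold a stream so that re-applying an already seen item changes nothing."""
    def step(state, item):
        seen, ordered = state
        if item in seen:
            # Idempotent case: a repeated item leaves the state untouched.
            return state
        return (seen | {item}, ordered + (item,))
    _, ordered = reduce(step, items, (frozenset(), ()))
    return ordered
```

In both cases the only thing that differs is the step function and the initial state, which is the sense in which the two patterns share one shape.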
Durable Execution the Hard Way
2.0
This guide demonstrates how to build a durable execution engine from scratch, covering core concepts like state persistence, retries, and failure recovery without relying on existing frameworks.
hn · May 28, 2026 · #Tech
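To make the concepts named in the summary above concrete (state persistence, retries, failure recovery), here is a minimal sketch in Python. It is not the guide's implementation: the `Journal` class, its `step` method, and the `run_workflow` example are all hypothetical, and step results are assumed to be JSON-serializable.

```python
import json
import os
import tempfile


class Journal:
    """Hypothetical step journal: records each completed step on disk."""

    def __init__(self, path):
        self.path = path
        self.completed = {}
        if os.path.exists(path):
            with open(path) as f:
                self.completed = json.load(f)

    def _save(self):
        # Write to a temporary file, then rename, so a crash mid-write
        # does not leave a truncated journal behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.completed, f)
        os.replace(tmp, self.path)

    def step(self, name, fn, attempts=3):
        """Run fn once per workflow; replay the stored result afterwards."""
        if name in self.completed:
            return self.completed[name]
        for attempt in range(1, attempts + 1):
            try:
                result = fn()
                break
            except Exception:
                if attempt == attempts:
                    raise  # retries exhausted, surface the failure
        self.completed[name] = result
        self._save()
        return result


def run_workflow(journal, calls):
    """Example workflow with two steps; `calls` records real executions."""
    def reserve():
        calls.append("reserve")
        return {"reservation": 1}

    def charge():
        calls.append("charge")
        return {"charged": True}

    return journal.step("reserve", reserve), journal.step("charge", charge)
```

Running `run_workflow` a second time against the same journal file returns the recorded results without executing the steps again, which is the recovery behavior a durable execution engine relies on after a restart.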
The Orchestration Tax
3.0
The article discusses the "orchestration tax": the overhead and complexity introduced by adding an orchestrator to manage workflows or microservices. It argues that orchestrators provide benefits but also carry costs in cognitive load, debugging difficulty, and infrastructure complexity, which teams should weigh before adopting one.
hn · May 28, 2026 · #Tech
Workload isolation using shuffle-sharding
3.0
Shuffle sharding is a technique that AWS uses to improve workload isolation by distributing customer workloads across a larger number of smaller shards, ensuring that a failure in one shard affects only a small subset of customers. This approach reduces the blast radius of failures, increases availability, and allows for more granular capacity management compared to traditional sharding methods.
hn · May 28, 2026 · #Tech
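The idea in the summary above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not AWS's implementation: `shuffle_shard` and `fully_affected` are hypothetical names, and seeding a pseudo-random generator from a hash of the customer ID is just one way to get a stable assignment.

```python
import hashlib
import random


def shuffle_shard(customer_id, workers, shard_size):
    """Pick a stable pseudo-random subset of workers for one customer."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # same customer ID -> same subset
    return frozenset(rng.sample(workers, shard_size))


def fully_affected(failed_workers, shards):
    """Customers that lose every worker when `failed_workers` go down."""
    return [c for c, shard in shards.items() if shard <= failed_workers]
```

With 8 workers and a shard size of 2 there are 28 possible worker pairs, so two customers share their entire shard only if they happen to draw the same pair. A customer whose shard merely overlaps a failed one still has a healthy worker left.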
Cheers.fan #1 – Founding Infrastructure Engineers
0.0
Cheers.fan #1 is seeking founding infrastructure engineers for building large-scale synchronized interaction systems for entertainment and live events. Focus areas include distributed systems, low-latency infrastructure, real-time synchronization, and edge systems.
hn · May 26, 2026 · #Tech
Distributing LLM Inference in DwarfStar
3.0
Salvatore Sanfilippo introduces DwarfStar, a proof-of-concept for distributing LLM inference across machines using a protocol over Unix sockets, stdout, and HTTP, enabling models larger than a single GPU's VRAM via pipeline parallelism.
hn · May 26, 2026 · #Tech
Show HN: Raft in Rust
2.5
A developer implemented the Raft consensus protocol in Rust from scratch over six months, using a sans-I/O design and Claude as an instructor and code reviewer. The project includes a simple key-value store with potentially stale reads, and linearizable reads are planned as a future addition.
hn · May 26, 2026 · #Tech
Raft Consensus with a Minority of Nodes
1.0
Raft, a consensus algorithm typically requiring a majority, can be extended to function with a minority of nodes by introducing a novel "loose consensus" mode. This approach uses quorum leases and view-change protocols to maintain safety and liveness even when fewer than half the nodes are alive, useful for edge cases like network partitions.
hn · May 26, 2026 · #Tech
Step in Tine: Forking Processes in Functional Choreographies
2.0
This paper presents a formal framework for functional choreographies that supports forking processes. It introduces a calculus combining multiparty session types with process forking, enabling parallel execution and dynamic creation of new communication participants while preserving type safety and deadlock freedom.
hn · May 25, 2026 · #Tech
Distributing LLM Inference in DwarfStar
3.0
Antirez introduces DwarfStar, a prototype system for distributing LLM inference across multiple nodes. The approach partitions the model's layers across machines, using a fan-out pattern where a coordinator sends tokens to all layer groups in parallel during prefill, reducing inference latency. DwarfStar is designed to run on low-end hardware, aiming for cost-effective, decentralized inference.
hn · May 25, 2026 · #Tech
Distributed Development
2.0
Reticulum Network's Distributed Development manual describes how its decentralized architecture allows for peer-to-peer communication without central servers, enabling resilient and autonomous network operations through cryptographic identities and distributed routing.
hn · May 24, 2026 · #Tech
The principles of extreme fault tolerance
3.0
Extreme fault tolerance is a design philosophy for building systems that survive severe failures without manual intervention. Core principles include designing for inevitable chaos, embracing redundancy, avoiding single points of failure, and practicing constant failure testing (chaos engineering) to create self-healing systems.
hn · May 24, 2026 · #Tech
Oxia ― Metadata Store and Coordination System
4.0
Oxia is an open-source metadata store and coordination system designed for large-scale distributed systems. It provides strong consistency, high availability, and low-latency access to metadata, serving as a replacement for systems like ZooKeeper and etcd.
hn · May 24, 2026 · #Tech
Temporal Primer – Building Long-Running Systems
0.5
This article introduces Temporal, a workflow orchestration platform for building reliable, long-running distributed systems. It explains core concepts like workflows, activities, and how Temporal handles failures, retries, and state persistence to simplify complex application logic.
hn · May 24, 2026 · #Tech
White Rabbit – sub-nanosecond synchronization for large distributed systems
2.0
White Rabbit is an open-source project providing sub-nanosecond synchronization for large distributed systems, developed at CERN. It extends Ethernet with precise time transfer for deterministic data delivery across thousands of nodes.
hn · May 23, 2026 · #Tech
ParadeDB (YC S23) Is Hiring Distributed Systems/Platform Engineers
1.0
ParadeDB (YC S23), a Postgres extension for full-text and vector search, is hiring a distributed systems/platform engineer to help build a managed cloud service. The ideal candidate has experience with Kubernetes, Go, and Postgres. The open-source company has 10 people across the US.
hn · May 21, 2026 · #Tech
Temporal Primer – Building Long-Running Systems
2.0
Temporal is a workflow orchestration platform that helps build reliable, long-running distributed systems by managing state, retries, and timeouts. It decouples business logic from infrastructure concerns, enabling developers to write resilient code that handles failures gracefully. The primer covers core concepts like workflows, activities, and task queues to get started with Temporal.
hn · May 20, 2026 · #Tech
Testing distributed systems with AI agents
2.0
This repository explores using AI agents to test distributed systems, including approaches like agent-based test generation, fault injection, and anomaly detection to improve system reliability.
hn · May 20, 2026 · #Tech
The Importance of Being Idempotent
3.0
The article explains idempotency as a critical design principle for reliable software agents. It describes how idempotent operations (those producing the same result regardless of how many times they are executed) prevent duplicate side effects, simplify error recovery, and enable safe retries in distributed systems and AI agent workflows.
hn · May 20, 2026 · #Tech
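A common way to get the safe-retry property described in the summary above is an idempotency key supplied by the caller. The sketch below is a minimal in-memory illustration in Python; `ChargeService`, `charge`, and the key format are hypothetical and not taken from the linked article.

```python
class ChargeService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored result
        self.executed = 0   # how many times the side effect really ran

    def charge(self, idempotency_key, amount_cents):
        # A retry with a known key returns the stored result and
        # performs no new side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.executed += 1
        result = {"charge_id": self.executed, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result
```

A client, or an agent retrying after a timeout, can then resend `charge("order-42", 500)` as often as needed and the charge happens once. A production version would also need to persist the key table and handle concurrent requests carrying the same key, which this sketch leaves out.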
The question Erlang answered in 1986 is back, one level up
3.0
The article re-examines the 1986-era question Erlang addressed about building reliable, concurrent systems, arguing that the same fundamental challenge has resurfaced at a higher level of abstraction in modern distributed software. It suggests that current architectural patterns must revisit Erlang's principles to handle today's complexity.
hn · May 19, 2026 · #Tech
Retrospective on DDIA
2.0
A retrospective review of the book "Designing Data-Intensive Applications" (DDIA) examines its enduring relevance, core concepts around data systems, and practical insights for engineers building and maintaining scalable, reliable software.
hn · May 19, 2026 · #Tech
Skills for Testing Distributed Systems
2.0
This resource provides skills and testing methodologies for distributed systems, covering fault injection, network partition simulation, and consistency validation techniques to ensure system reliability.
hn · May 18, 2026 · #Tech
Protocols for transactional usage of object storage
2.0
The article discusses protocols for using object storage in transactional workloads, addressing challenges like consistency and atomicity that arise when adapting object storage systems, originally designed for static data, to handle transactional operations. It explores various approaches and trade-offs involved in making object storage suitable for such use cases.
hn · May 18, 2026 · #Tech
Book Club: Designing Data-Intensive Applications, 2nd Edition
1.0
The Heltweg.org community launched a book club for the second edition of "Designing Data-Intensive Applications" by Martin Kleppmann. The post outlines the schedule, chapters covered per session, and how to join the discussions. The book club aims to collaboratively explore distributed systems, databases, and data processing concepts.
hn · May 18, 2026 · #Tech
Curated list of resources on testing distributed systems
4.0
A curated collection of resources focused on testing distributed systems, covering topics such as fault tolerance, chaos engineering, and simulation testing.
hn · May 18, 2026 · #Tech
Turing Award Winner: Abstraction, Dijkstra, Distributed Systems – Barbara Liskov [video]
4.0
Barbara Liskov, a Turing Award winner, discusses key computer science concepts including abstraction, programming languages, and the influence of Edsger Dijkstra. She shares insights from her work on distributed systems and the Liskov Substitution Principle, reflecting on her career and the evolution of computing.
hn · May 17, 2026 · #Tech
Distributed Systems aren't just about scaling
2.0
The article argues that distributed systems engineering is often narrowly associated with scaling, but in practice it also covers fault tolerance, latency, consistency, and operational complexity, making it a broader discipline than handling growth.
hn · May 16, 2026 · #Tech
An introduction to TLA+ and its use in parties (2023)
1.0
This article introduces TLA+, a formal specification language used for modeling and verifying system designs. It explains TLA+ concepts through the playful analogy of planning a party, covering topics like state machines, invariants, and temporal logic to demonstrate how the language can help catch design flaws early.
hn · May 16, 2026 · #Tech