The article argues that nondeterminism in computing is not a fundamental problem, and that the real challenge lies in how we model and reason about systems. It suggests that nondeterministic behaviors can be effectively managed with proper abstractions and tools, shifting the focus away from eliminating nondeterminism itself.
#distributed-systems
30 items
DRD (Deterministic Replay Debugger) is a tool for debugging distributed systems by capturing, replaying, and explaining why consensus diverged. It is described as "Git for distributed consensus failures."
The article explains that idempotent and moving window operations in stream processing can be understood as specific cases of a reduction. It demonstrates how these seemingly distinct patterns are unified under the concept of reduction, simplifying their implementation and reasoning.
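One way to read that unification, sketched in Python: both patterns become a fold over the stream, differing only in the state they carry and the step function applied. The names `dedup_step` and `window_step` and the state shapes are illustrative assumptions, not taken from the article.

```python
from collections import deque
from functools import reduce

def dedup_step(state, event):
    """Idempotent processing as a reduction step.

    State is (seen_ids, running_total); an event whose id was already
    seen leaves the state unchanged.
    """
    seen, total = state
    event_id, amount = event
    if event_id in seen:
        return state
    return (seen | {event_id}, total + amount)

def window_step(size):
    """Moving window as a reduction step.

    State is (last `size` values, list of window sums emitted so far).
    """
    def step(state, value):
        window, sums = state
        window = deque(window, maxlen=size)  # copy, so the step stays pure
        window.append(value)
        return (window, sums + [sum(window)])
    return step

# The third event is a redelivery of "a" and must not be counted twice.
events = [("a", 10), ("b", 5), ("a", 10)]
_, total = reduce(dedup_step, events, (frozenset(), 0))

# Sum over a sliding window of the three most recent values.
_, sums = reduce(window_step(3), [1, 2, 3, 4, 5], (deque(maxlen=3), []))
```

Here `total` ends up as 15 and `sums` as `[1, 3, 6, 9, 12]`; the same `reduce` call drives both patterns.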
This guide demonstrates how to build a durable execution engine from scratch, covering core concepts like state persistence, retries, and failure recovery without relying on existing frameworks.
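A minimal sketch of the core idea, assuming a JSON file as the persistence layer: each completed step is journaled, so a restarted workflow replays recorded results instead of re-running side effects. The `Journal` class, its `step` method, and the step names are hypothetical and are not the guide's actual design.

```python
import json
import os
import tempfile

class Journal:
    """Persists the result of each named step to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            with open(path) as f:
                self.entries = json.load(f)

    def step(self, name, fn, retries=3):
        # Replay: a step that already completed is never executed again.
        if name in self.entries:
            return self.entries[name]
        last_error = None
        for _ in range(retries):
            try:
                result = fn()
                break
            except Exception as exc:  # retry on any failure (sketch only)
                last_error = exc
        else:
            raise last_error
        # Persist before returning so a crash after this point is safe.
        self.entries[name] = result
        with open(self.path, "w") as f:
            json.dump(self.entries, f)
        return result

calls = []  # records which side effects actually ran

def workflow(journal):
    a = journal.step("reserve", lambda: calls.append("reserve") or 41)
    return journal.step("confirm", lambda: calls.append("confirm") or a + 1)

path = os.path.join(tempfile.mkdtemp(), "journal.json")
first = workflow(Journal(path))
second = workflow(Journal(path))  # simulated restart: replays from the journal
```

Both runs return 42, but `calls` contains each step only once, because the second run reads the results from the journal. A real engine would also need atomic writes and deterministic workflow code, which this sketch leaves out.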
The article discusses the "orchestration tax": the overhead and complexity introduced when adding an orchestrator to manage workflows or microservices. It argues that while orchestrators can provide benefits, they also carry costs in cognitive load, debugging difficulty, and infrastructure complexity that teams should evaluate carefully before adopting one.
Shuffle sharding is a technique that AWS uses to improve workload isolation by distributing customer workloads across a larger number of smaller shards, ensuring that a failure in one shard affects only a small subset of customers. This approach reduces the blast radius of failures, increases availability, and allows for more granular capacity management compared to traditional sharding methods.
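The assignment step can be sketched as follows. This is a generic illustration of shuffle sharding rather than AWS's implementation; the function `shuffle_shard`, the worker names, and the use of a hash-seeded random sample are assumptions made for the example.

```python
import hashlib
import random

def shuffle_shard(customer_id, workers, shard_size):
    """Pick a stable pseudo-random subset of workers for one customer.

    The customer id seeds the random generator, so the same customer
    always maps to the same shard without any stored assignment table.
    """
    digest = hashlib.sha256(customer_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(workers, shard_size))

workers = [f"worker-{i}" for i in range(8)]
shard_a = shuffle_shard("customer-a", workers, 2)
shard_b = shuffle_shard("customer-b", workers, 2)
```

With 8 workers and shards of size 2 there are 28 possible shards, so under uniform assignment two customers land on exactly the same pair of workers about 1 time in 28. A customer whose traffic takes down its own shard therefore fully overlaps with only a small fraction of other customers.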
Cheers.fan #1 is seeking founding infrastructure engineers to build large-scale synchronized interaction systems for entertainment and live events. Focus areas include distributed systems, low-latency infrastructure, real-time synchronization, and edge systems.
Salvatore Sanfilippo introduces DwarfStar, a proof-of-concept for distributing LLM inference across machines using a protocol over Unix sockets, stdout, and HTTP, enabling models larger than a single GPU's VRAM via pipeline parallelism.
A developer implemented the Raft consensus protocol in Rust from scratch over six months, using a sans-I/O design and Claude as an instructor and code reviewer. The project includes a simple key-value store with potentially stale reads, and linearizable reads are planned as a future addition.
Raft, a consensus algorithm that typically requires a majority, can be extended to function with a minority of nodes by introducing a novel "loose consensus" mode. This approach uses quorum leases and view-change protocols to maintain safety and liveness even when fewer than half the nodes are alive, which is useful in situations such as network partitions.
This paper presents a formal framework for functional choreographies that supports forking processes. It introduces a calculus combining multiparty session types with process forking, enabling parallel execution and dynamic creation of new communication participants while preserving type safety and deadlock freedom.
Antirez introduces DwarfStar, a prototype system for distributing LLM inference across multiple nodes. The approach partitions the model's layers across machines, using a fan-out pattern where a coordinator sends tokens to all layer groups in parallel during prefill, reducing inference latency. DwarfStar is designed to run on low-end hardware, aiming for cost-effective, decentralized inference.
Reticulum Network's Distributed Development manual describes how its decentralized architecture allows for peer-to-peer communication without central servers, enabling resilient and autonomous network operations through cryptographic identities and distributed routing.
Extreme fault tolerance is a design philosophy for building systems that survive severe failures without manual intervention. Core principles include designing for inevitable chaos, embracing redundancy, avoiding single points of failure, and practicing constant failure testing (chaos engineering) to create self-healing systems.
Oxia is an open-source metadata store and coordination system designed for large-scale distributed systems. It provides strong consistency, high availability, and low-latency access to metadata, serving as a replacement for systems like ZooKeeper and etcd.
This article introduces Temporal, a workflow orchestration platform for building reliable, long-running distributed systems. It explains core concepts like workflows, activities, and how Temporal handles failures, retries, and state persistence to simplify complex application logic.
White Rabbit is an open-source project providing sub-nanosecond synchronization for large distributed systems, developed at CERN. It extends Ethernet with precise time transfer for deterministic data delivery across thousands of nodes.
ParadeDB (YC S23), a Postgres extension for full-text and vector search, is hiring a distributed systems/platform engineer to help build a managed cloud service. The ideal candidate has experience with Kubernetes, Go, and Postgres. The open-source company has 10 people across the US.
Temporal is a workflow orchestration platform that helps build reliable, long-running distributed systems by managing state, retries, and timeouts. It decouples business logic from infrastructure concerns, enabling developers to write resilient code that handles failures gracefully. The primer covers core concepts like workflows, activities, and task queues to get started with Temporal.
This repository explores using AI agents to test distributed systems, including approaches like agent-based test generation, fault injection, and anomaly detection to improve system reliability.
The article explains idempotency as a critical design principle for reliable software agents. It describes how idempotent operations (those producing the same result regardless of how many times they are executed) prevent duplicate side effects, simplify error recovery, and enable safe retries in distributed systems and AI agent workflows.
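A common way to get this property is an idempotency key supplied by the caller. The sketch below assumes an in-memory store; `PaymentService`, `charge`, and the key format are hypothetical names chosen for illustration, not an API from the article.

```python
class PaymentService:
    """Toy service whose `charge` is safe to retry with the same key."""

    def __init__(self):
        self.results = {}   # idempotency key -> stored receipt
        self.charges = []   # side effects that actually happened

    def charge(self, idempotency_key, account, amount):
        # A repeated key returns the stored receipt and performs no new charge.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges.append((account, amount))
        receipt = {
            "account": account,
            "amount": amount,
            "charge_no": len(self.charges),
        }
        self.results[idempotency_key] = receipt
        return receipt

service = PaymentService()
first = service.charge("req-123", "acct-1", 50)
retry = service.charge("req-123", "acct-1", 50)  # e.g. after a timeout
```

The retry returns the same receipt as the first call, and `service.charges` holds a single entry, so an agent or client that is unsure whether its request succeeded can resend it safely.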
The article re-examines the 1986-era question Erlang addressed about building reliable, concurrent systems, arguing that the same fundamental challenge has resurfaced at a higher level of abstraction in modern distributed software. It suggests that current architectural patterns must revisit Erlang's principles to handle today's complexity.
A retrospective review of the book "Designing Data-Intensive Applications" (DDIA) examines its enduring relevance, core concepts around data systems, and practical insights for engineers building and maintaining scalable, reliable software.
This resource provides skills and testing methodologies for distributed systems, covering fault injection, network partition simulation, and consistency validation techniques to ensure system reliability.
The article discusses protocols for using object storage in transactional workloads, addressing challenges like consistency and atomicity that arise when adapting object storage systems, originally designed for static data, to handle transactional operations. It explores various approaches and trade-offs involved in making object storage suitable for such use cases.
The Heltweg.org community launched a book club for the second edition of "Designing Data-Intensive Applications" by Martin Kleppmann. The post outlines the schedule, chapters covered per session, and how to join the discussions. The book club aims to collaboratively explore distributed systems, databases, and data processing concepts.
A curated collection of resources focused on testing distributed systems, covering topics such as fault tolerance, chaos engineering, and simulation testing.
Barbara Liskov, a Turing Award winner, discusses key computer science concepts including abstraction, programming languages, and the influence of Edsger Dijkstra. She shares insights from her work on distributed systems and the Liskov Substitution Principle, reflecting on her career and the evolution of computing.
The article argues that distributed systems engineering is often narrowly associated with scaling, but in practice it also covers fault tolerance, latency, consistency, and operational complexity, making it a broader discipline than handling growth.
This article introduces TLA+, a formal specification language used for modeling and verifying system designs. It explains TLA+ concepts through the playful analogy of planning a party, covering topics like state machines, invariants, and temporal logic to demonstrate how the language can help catch design flaws early.