Wave is a proposed universal GPU Instruction Set Architecture (ISA) designed to abstract hardware differences and enable cross-platform GPU programming, aiming to reduce fragmentation across vendors like NVIDIA, AMD, and Intel.
#gpu
30 items
This guide introduces torch.profiler, a tool for measuring the performance and resource usage of PyTorch models. It covers how to use the profiler to capture operations, view GPU utilization, and identify bottlenecks, helping beginners optimize model training and inference.
Nvidia Dynamo Snapshot is a tool that reduces startup times for inference workloads on Kubernetes by enabling fast snapshot-based restoration of containers, improving efficiency and scalability for AI model serving.
A new project demonstrates running one million Python interpreters in parallel on a GPU to parallelize arbitrary Python code, showcasing a novel approach to leveraging GPU hardware for general-purpose Python computation.
Nvidia Dynamo Snapshot is a tool that speeds up the startup time of AI inference workloads on Kubernetes by capturing and restoring the pre-initialized state of containers. It eliminates redundant initialization steps, enabling faster scaling and reduced latency for GPU-accelerated inference deployments.
The article announces the open-sourcing of FastVideo Dreamverse, a framework enabling real-time video vibe directing on a single Nvidia B200 GPU. It allows users to generate and iteratively refine AI videos with consistent style and composition, including text-to-video and image-to-video capabilities.
cuSBF is a GPU-accelerated Bloom filter implementation designed for faster processing of sequence data, offering improved performance over CPU-based alternatives for tasks like k-mer membership queries in bioinformatics and genomics.
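For readers new to the data structure, the k-mer membership query mentioned above can be sketched on the CPU in a few lines. This is a minimal illustration of a generic Bloom filter, not cuSBF's implementation or API: the class name `BloomFilter`, the helper `kmers`, the double-hashing scheme, and the sizes chosen are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Small CPU-side Bloom filter (illustrative; not cuSBF's design)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all probe positions from one 128-bit digest.
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force an odd stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )


def kmers(seq: str, k: int) -> list[str]:
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Hypothetical reference sequence, used only for the demonstration.
reference = "ACGTACGTTGCA"
bf = BloomFilter(num_bits=1 << 16, num_hashes=4)
for kmer in kmers(reference, 5):
    bf.add(kmer)
```

A Bloom filter never reports a false negative, so every inserted k-mer tests as present; a k-mer that was never inserted may still occasionally test as present, at a rate set by the bit-array size and the number of hashes.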
DoubleAI has introduced WarpSpeed, a high-performance computing framework for NVIDIA's Blackwell architecture, claiming it approaches the "speed of light", meaning the hardware's theoretical peak throughput. The framework optimizes memory bandwidth use and parallel compute to come close to those theoretical limits on Blackwell GPUs, targeting AI and scientific workloads.
Nvidia is retiring its classic Control Panel after 20 years, moving driver update features exclusively to the new Nvidia App. The shift consolidates the separate Control Panel and GeForce Experience into a single modern interface for driver management and settings.
Nvidia has officially discontinued its GeForce Control Panel app, a tool that had been available for 20 years, signaling a transition to newer software solutions like the Nvidia App for managing graphics settings.
Wave is a proposed universal GPU instruction set architecture designed to enable portability across different GPU hardware, aiming to reduce fragmentation in GPU programming and eliminate the need for vendor-specific code paths.
The article discusses how floating-point denormals (subnormals) affect floor and ceil operations on both CPU and GPU. It explains that denormals can cause performance degradation by triggering slow paths, and examines differences in handling denormals between CPU architectures and GPU hardware. The author provides code examples and benchmarks to illustrate these performance impacts.
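As a small illustration of why subnormals matter for floor and ceil in particular, the sketch below shows that a positive subnormal rounds up to 1 under normal IEEE 754 handling but to 0 if it is first flushed to zero. It is not taken from the article: the helper `is_subnormal` is hypothetical, the flush-to-zero step is emulated in plain Python, and whether a given GPU or CPU actually flushes depends on the hardware and compiler settings. It assumes Python 3.9 or later for `math.ulp`.

```python
import math
import sys

# Smallest positive normal and subnormal doubles.
smallest_normal = sys.float_info.min      # 2**-1022
smallest_subnormal = math.ulp(0.0)        # 2**-1074


def is_subnormal(x: float) -> bool:
    """True if x is nonzero but smaller in magnitude than any normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min


# A value squarely in the subnormal range.
tiny = smallest_normal / 4

# With full IEEE 754 subnormal support, tiny is a positive number.
floor_tiny = math.floor(tiny)   # rounds down to 0
ceil_tiny = math.ceil(tiny)     # rounds up to 1

# Emulated flush-to-zero: treat the subnormal input as exactly 0.0.
flushed = 0.0 if is_subnormal(tiny) else tiny
ceil_flushed = math.ceil(flushed)  # ceil(0.0) is 0, not 1
```

The same input therefore gives a different ceil result depending on the denormal mode, on top of any performance cost.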
Auto GPU Kernel is an open-source tool that autonomously explores and optimizes GPU kernel implementations, automatically discovering efficient kernel configurations without manual tuning.
AMD discusses how agentic AI—autonomous AI systems that plan and execute tasks—is shifting the balance between CPU and GPU workloads. Unlike traditional AI inference, agentic workflows require more frequent, low-latency CPU processing for orchestration and decision-making, alongside GPU compute. This evolution calls for a more balanced system architecture rather than GPU-only dominance.
The open-source RADV Vulkan driver for AMD Radeon GPUs has merged support for the VK_KHR_shader_fma extension. This extension standardizes fused multiply-add operations across Vulkan implementations, enabling better performance and precision for shaders that rely on FMA instructions.
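The precision benefit of fused multiply-add comes from rounding once instead of twice. The sketch below emulates that in plain Python with exact rational arithmetic; it is not Vulkan or RADV code, and the function names `fused_multiply_add` and `unfused_multiply_add` are made up for the example.

```python
from fractions import Fraction


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply, round to double, then add and round again (two roundings)."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round to double once (emulated FMA)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Inputs chosen so the exact product, 1 - 2**-60, is not representable
# as a double and rounds to 1.0 when the multiply is rounded on its own.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = unfused_multiply_add(a, b, c)  # the small term is lost
fused = fused_multiply_add(a, b, c)      # the small term survives
```

Here the unfused result is exactly 0.0, while the single-rounding result keeps the true value of -2**-60.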
gpucheck is a pytest plugin designed for testing GPU kernels, providing tools to validate and verify the correctness of GPU code within the pytest framework.
MetalBench is a benchmarking tool for Apple Silicon GPUs that measures compute throughput across a range of kernels and operations written in Metal Shading Language (MSL).
Tom's Hardware tested Advanced Shader Delivery on the AMD Radeon RX 9070 XT, finding it can reduce game load times by up to 95% compared to standard shader compilation methods. The technology pre-processes and caches shaders, significantly speeding up initial loads and reducing stutter in supported titles.
Lupine is an open-source GPU-over-IP bridge that allows applications to access remote GPU resources over a network, enabling hardware-accelerated rendering on servers and edge devices without requiring a local GPU.
A new finding shows that matrix multiplications on GPUs execute significantly faster when the input data has predictable, regular patterns compared to random data. This performance difference arises from how GPUs handle memory access patterns and cache behavior, with structured data enabling more efficient parallel processing.
The article explains how to accelerate deep learning training from first principles, covering GPU memory hierarchy, kernel fusion, parallelization strategies, and practical techniques to maximize hardware utilization, ultimately showing that understanding these fundamentals can lead to order-of-magnitude speed improvements.
Nvidia has removed "gaming" as a separate revenue category in its financial reports, instead merging it with other segments under "Compute & Networking." The change reflects the company's shift toward data center and AI markets, which now generate far more revenue than its traditional gaming business.
Reiner Pope discusses chip design from the ground up, explaining how basic logic gates lead to the distinct architectures of GPUs, TPUs, FPGAs, and the human brain.
Nvidia's CFO claims the company is on track to become the world's leading CPU supplier, leveraging its Grace architecture and expanding beyond GPUs into the broader processor market.
Nvidia has open-sourced its GPU function platform NVCF, allowing developers to run serverless GPU-accelerated workloads. The platform was previously only available as a managed service but is now accessible for self-hosting and customization, expanding options for GPU computing in cloud and on-premises environments.
The v1.2.1 update adds four new validators and brings GPU acceleration to Aether, the AI agent framework, on the QuantumAi Blockchain network, enhancing performance and decentralized AI capabilities.
MLX Vulkan Back End is an open-source project that provides Vulkan support for MLX, Apple's machine learning framework, enabling GPU acceleration on devices with Vulkan-compatible hardware.
Opal Pathtracer is a physically-based GPU pathtracer written in C++ and CUDA. It features a structure-aware BVH builder, full light transport via unidirectional path tracing, and various emitter sampling methods, serving as a research project exploring GPU path tracing and spatial data structures.
IgniteMS is a high-performance batch text embedding tool achieving 253,000 messages per second on 8x A100 GPUs, designed for efficient large-scale text processing.
The paper presents StepStone, an LLM-based fuzzing framework for GPU kernel drivers that leverages user-space libraries to generate test inputs. The approach aims to find bugs in GPU kernel drivers by automatically synthesizing kernel API calls through the analysis of library usage patterns and LLM guidance.