# VOID

High-performance block-sparse matrix operations for GPUs.
VOID achieves 3-15x speedups over cuSPARSE and dense baselines by combining:
- Block tiling — Fixed-size dense tiles (32x32) for Tensor Core utilization
- Morton ordering — Z-curve spatial locality for cache efficiency (see the sketch after this list)
- Triton kernels — Custom fused kernels with FP32 accumulation
- Software pipelining — Async memory prefetching for latency hiding
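For reference, a minimal sketch of the Z-curve key computation the Morton-ordering bullet refers to; `_spread_bits` and `morton_key` are illustrative helpers, not part of the VOID API:

```python
def _spread_bits(x: int) -> int:
    """Spread the low 16 bits of x into the even bit positions."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    return (x | (x << 1)) & 0x55555555

def morton_key(tile_row: int, tile_col: int) -> int:
    """Interleave row/col bits so tiles that are close in 2D get close keys."""
    return (_spread_bits(tile_row) << 1) | _spread_bits(tile_col)

# Storing tiles in this order places 2D neighbours next to each other in memory,
# which is where the cache-locality benefit comes from.
tiles = sorted([(2, 3), (0, 0), (1, 1), (3, 2)], key=lambda rc: morton_key(*rc))
```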
Representative speedups:

| Use Case | Speedup | Comparison |
|---|---|---|
| Sparse Attention (Dilated) | 15x | vs Dense |
| SpMM (Block-sparse) | 3.2x | vs cuSPARSE BSR |
| 2:4 Structured Sparsity | 7x | vs cuSPARSE |
| Fused SpMM+GELU | 1.5x | vs Unfused |
## Features

- SpMM/SpMV — Sparse-dense matrix multiplication with autotuning
- Sparse Attention — FlashAttention-style block-sparse patterns
- Autograd Support — Full backward pass for training
- Stream-K Load Balancing — Handles power-law row distributions
- 2:4 Structured Sparsity — NVIDIA Sparse Tensor Core support
- INT8/INT4 Quantization — Per-block quantized inference
- Dynamic Sparsity — Runtime-adaptive patterns (top-k attention)
- Multi-GPU Distribution — Row/column/block sharding strategies
- FP8 Support — E4M3/E5M2 formats with automatic scaling
- Operation Fusion — SpMM + activation in single kernel
- Pipelined Kernels — 3-5 stage async memory prefetching
## Installation

```bash
# Using uv (recommended)
uv pip install -e .
# Using pip
pip install -e .
# Development
uv pip install -e ".[dev]"
```

## Quick Start

### Sparse Matrix Multiplication

```python
import torch
import scipy.sparse as sp
from void import csr_to_void, void_spmm
# Convert scipy sparse matrix to VOID format
sparse_np = sp.random(512, 512, density=0.1, format='csr')
void_tensor = csr_to_void(sparse_np, tile_size=32).cuda()
# Sparse-dense matrix multiplication
B = torch.randn(512, 128, device='cuda')
C = void_spmm(void_tensor, B)
```

### Sparse Attention

```python
from void import local_attention
q = torch.randn(2, 8, 2048, 64, device='cuda')
k = torch.randn(2, 8, 2048, 64, device='cuda')
v = torch.randn(2, 8, 2048, 64, device='cuda')
# Sliding window attention (9.7x faster than dense at seq_len=2048; see benchmarks below)
out = local_attention(q, k, v, window_size=256)
```

### 2:4 Structured Sparsity

```python
from void import prune_to_2_4, void_to_structured

weights = torch.randn(512, 512, device='cuda')  # dense weight matrix to prune
# Prune weights to 2:4 pattern (50% sparsity)
pruned = prune_to_2_4(weights)
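
# For intuition, a reference of the 2:4 pattern itself (sketch, not VOID's
# implementation): keep the 2 largest-magnitude values in every group of 4
# and zero the rest, giving 50% sparsity.
groups = weights.reshape(-1, 4)
keep = groups.abs().topk(2, dim=-1).indices
mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
reference_2_4 = (groups * mask).reshape(weights.shape)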
# Convert for Tensor Core acceleration
structured = void_to_structured(void_tensor)
```

### INT8 Quantization

```python
from void import quantize_void_tensor, void_spmm_int8, IntQuantConfig, IntFormat
# Quantize sparse matrix to INT8
config = IntQuantConfig(format=IntFormat.INT8, symmetric=True)
int8_tensor = quantize_void_tensor(void_tensor, config)
# INT8 SpMM (2-4x faster for inference)
C = void_spmm_int8(int8_tensor, B_int8, B_scale)
```

### Dynamic Sparsity

```python
from void import dynamic_topk_attention
# Top-k sparse attention (keeps only k highest scores per query)
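
# Conceptually (reference sketch reusing q, k, v from the attention example,
# not the library's kernel): mask all but the k highest scores per query before
# the softmax. Note this reference materializes the full score matrix.
scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
kth = scores.topk(256, dim=-1).values[..., -1:]   # 256th-largest score per query
reference = scores.masked_fill(scores < kth, float('-inf')).softmax(dim=-1) @ v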
out = dynamic_topk_attention(q, k, v, top_k=256)
```

### Multi-GPU Distribution

```python
from void import shard_void_tensor, distributed_void_spmm, ShardingStrategy
# Shard across GPUs
dist_tensor = shard_void_tensor(void_tensor, ShardingStrategy.ROW_WISE)
# Distributed SpMM with automatic all-reduce
C = distributed_void_spmm(dist_tensor, B)
```

### Operation Fusion

```python
from void import void_spmm_gelu, FusedSparseLinear
# Single kernel: C = GELU(A @ B)
C = void_spmm_gelu(void_tensor, B)
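
# For comparison, the unfused two-step equivalent (same math, one extra kernel
# launch and intermediate tensor; results may differ slightly depending on the
# GELU approximation). Reuses void_spmm, void_tensor and B from the quick start.
C_unfused = torch.nn.functional.gelu(void_spmm(void_tensor, B))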
# Drop-in replacement for nn.Linear
layer = FusedSparseLinear(512, 256, void_tensor, activation="gelu")
```

## Benchmarks

### SpMM vs cuSPARSE BSR (runtime, lower is better)

| Block Sparsity | VOID | cuSPARSE BSR | Speedup |
|---|---|---|---|
| 70% | 0.28ms | 0.91ms | 3.3x |
| 80% | 0.19ms | 0.61ms | 3.3x |
| 90% | 0.12ms | 0.32ms | 2.6x |
| 95% | 0.06ms | 0.18ms | 2.9x |
| 98% | 0.04ms | 0.16ms | 3.8x |
*4096x4096 sparse matrix, dense operand with N=512 columns, RTX GPU.*
### Attention Speedup vs Dense

| Pattern | Speedup (seq = 1024) | Speedup (seq = 2048) |
|---|---|---|
| Dilated | 3.9x | 15.0x |
| Sliding-256 | 4.4x | 9.7x |
| Causal (GPT) | 2.9x | 4.6x |
| BigBird | 1.9x | 2.9x |
### Memory Bandwidth by Access Pattern

| Access Pattern | Effective Bandwidth | vs Scalar CSR |
|---|---|---|
| Scalar CSR | 54 GB/s | 1x |
| Block 32x32 | 850 GB/s | 16x |
| Block 64x64 | 1759 GB/s | 32x |
## Architecture

```
┌─────────────┐    ┌──────────────┐    ┌────────────────┐
│ Input       │    │ VOID Format  │    │ Kernel Select  │
│ CSR/Dense   │ -> │ Morton Order │ -> │ Autotuning     │
│ Attention   │    │ TC Alignment │    │ Stream-K       │
└─────────────┘    └──────────────┘    └────────────────┘
       |                  |                    |
       v                  v                    v
┌─────────────┐    ┌──────────────┐    ┌────────────────┐
│ Core Ops    │    │ Features     │    │ Output         │
│ void_spmm() │ <- │ 2:4 / INT8   │ -> │ Dense Result   │
│ attention() │    │ Dynamic/Dist │    │ Gradients      │
└─────────────┘    └──────────────┘    └────────────────┘
```
VOID stores sparse matrices as collections of dense tiles with Morton (Z-curve) ordering for optimal cache locality.
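For intuition, here is a minimal sketch of that layout, assuming a plain dict-of-tiles container and simple magnitude thresholding; `to_tiles` is illustrative and is not the `VOIDTensor` implementation:

```python
import numpy as np

def to_tiles(dense, tile_size=32, threshold=0.0):
    """Split `dense` into tile_size x tile_size blocks, drop empty ones,
    and return the survivors in Morton (Z-curve) order."""
    def z_key(r, c, bits=16):  # interleave row/col bits -> Z-curve index
        return sum(((r >> i) & 1) << (2 * i + 1) | ((c >> i) & 1) << (2 * i)
                   for i in range(bits))

    tiles = {}
    for tr in range(dense.shape[0] // tile_size):
        for tc in range(dense.shape[1] // tile_size):
            block = dense[tr * tile_size:(tr + 1) * tile_size,
                          tc * tile_size:(tc + 1) * tile_size]
            if np.abs(block).max() > threshold:   # keep only non-empty tiles
                tiles[(tr, tc)] = block
    # Z-curve order keeps 2D-neighbouring tiles close together in memory
    return {rc: tiles[rc] for rc in sorted(tiles, key=lambda rc: z_key(*rc))}
```

Only the surviving tiles are stored and multiplied, so work scales with the number of non-empty blocks rather than with the full matrix size.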
## Running Benchmarks

```bash
# Core validation
uv run python benchmarks/validation.py
# BSR comparison (primary benchmark)
uv run python benchmarks/bsr_comparison.py
# Attention patterns
uv run python benchmarks/attention_benchmark.py
# Generate visualizations
uv run python benchmarks/visualize_results.py
```

## Testing

```bash
# All tests (requires GPU)
uv run pytest tests/ -v
# Quick check
uv run pytest tests/ -v --tb=short
```

## API Reference

### Core Format

```python
from void import csr_to_void, dense_to_void, VOIDTensor
# From scipy CSR
void_tensor = csr_to_void(csr_matrix, tile_size=32)
# From dense with threshold
void_tensor = dense_to_void(dense, tile_size=32, threshold=0.01)
# Dtype conversion
void_fp16 = void_tensor.half()
void_bf16 = void_tensor.bfloat16()
```

### SpMM Operations

```python
from void import void_spmm, void_spmm_autotuned, void_spmm_pipelined
C = void_spmm(A, B) # Basic SpMM
C = void_spmm_autotuned(A, B) # With autotuning
C = void_spmm_pipelined(A, B)  # Async pipelining
```

### Attention

```python
from void import (
sparse_attention, local_attention, block_sparse_attention,
create_local_attention_mask, create_causal_local_mask
)
out = sparse_attention(q, k, v, mask)
out = local_attention(q, k, v, window_size=256)
out = block_sparse_attention(q, k, v, sparsity=0.9)
```

### Quantization

```python
from void import (
quantize_void_tensor, void_spmm_int8,
void_tensor_to_fp8, void_spmm_fp8
)
# INT8
int8_tensor = quantize_void_tensor(void_tensor, config)
C = void_spmm_int8(int8_tensor, B, scale)
# FP8
fp8_tensor = void_tensor_to_fp8(void_tensor)
C = void_spmm_fp8(fp8_tensor, B)
```

### Neural Networks

```python
from void import SparseLinear, FusedSparseLinear, VOIDSpMM
# Drop-in nn.Linear replacement
layer = SparseLinear(512, 256, void_tensor, bias=True)
# With fused activation
layer = FusedSparseLinear(512, 256, void_tensor, activation="gelu")
```

## Citation

```bibtex
@software{void2025,
title = {VOID: High-Performance Block-Sparse Matrix Operations for GPUs},
author = {Khushiyant},
year = {2025},
url = {https://github.com/khushiyant/void}
}
```

## License

MIT License — see LICENSE for details.

## Acknowledgements
- Triton — GPU programming framework
- FlashAttention — Tiled attention inspiration
- cuSPARSE — Baseline comparison