Python-first CUDA inference backend exposing low-level, quantization-aware GPU execution for Unsloth fine-tuned models.
NEW IN V2.0: Native Tensor API with custom CUDA kernels, T4-optimized builds, FlashAttention, and direct Unsloth integration.
llcuda v2.0 is a production-ready inference backend designed exclusively for the Tesla T4 GPU (the standard Google Colab GPU), with:
- Native Tensor API: PyTorch-style GPU operations with custom CUDA kernels
- Tensor Core Optimization: SM 7.5 targeting for maximum performance
- FlashAttention Support: 2-3x faster attention for long contexts
- CUDA Graphs: Reduced kernel launch overhead
- Unsloth Integration: Direct loading of NF4-quantized fine-tuned models
- GGUF Support: Compatible with llama.cpp model format
llcuda v2.0 provides two complementary APIs:
- V1.x HTTP Server API (GGUF models, OpenAI-compatible)
- V2.0 Native Tensor API (custom operations, quantization-aware)
This dual design enables both easy deployment (HTTP) and low-level control (native) in a single package.
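As a rough sketch of how the two entry points sit side by side (every name used here appears later in this README):

```python
import llcuda
from llcuda.core import Tensor, DType

# Deployment path: OpenAI-compatible HTTP server around a GGUF model.
engine = llcuda.InferenceEngine()

# Low-level path: quantization-aware tensors and kernels on the T4.
x = Tensor.zeros([8, 8], dtype=DType.Float16, device=0)
```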
!pip install llcuda

Requirements:
- Python 3.11+
- Google Colab Tesla T4 GPU (SM 7.5)
- CUDA 12.x runtime
On first import, llcuda will:
- Verify your GPU is Tesla T4 (SM 7.5)
- Download T4-optimized binaries (264 MB, one-time)
- Extract them to ~/.cache/llcuda/
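If you want to confirm what was cached, a quick way to list the extracted files (the exact directory layout inside ~/.cache/llcuda/ may differ between releases):

```python
from pathlib import Path

# List whatever the bootstrap has cached so far.
cache_dir = Path.home() / ".cache" / "llcuda"
if cache_dir.exists():
    for f in sorted(p for p in cache_dir.rglob("*") if p.is_file()):
        print(f"{f.relative_to(cache_dir)}  ({f.stat().st_size / 1024**2:.1f} MB)")
else:
    print("Nothing cached yet; binaries are fetched on the first `import llcuda`.")
```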
import llcuda
# Initialize inference engine
engine = llcuda.InferenceEngine()
# Load GGUF model from HuggingFace
engine.load_model("gemma-3-1b-Q4_K_M", silent=True)
# Run inference
result = engine.infer("What is artificial intelligence?", max_tokens=100)
print(result.text)
print(f"Speed: {result.tokens_per_sec:.1f} tokens/sec")from llcuda.core import Tensor, DType, matmul, get_device_count
# Check GPU
print(f"GPUs available: {get_device_count()}")
# Create tensors on GPU 0
A = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
B = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
# Matrix multiplication (cuBLAS with Tensor Cores)
C = A @ B
print(f"Result shape: {C.shape}")
print(f"Memory: {C.nbytes() / 1024**2:.2f} MB")llcuda v2.0 is optimized exclusively for Tesla T4:
- Tensor Cores: INT8/FP16 acceleration
- FlashAttention: Full support
- Memory: 16 GB VRAM
- Platform: Google Colab standard GPU
Performance on T4:
- Gemma 3-1B: 45 tok/s
- Llama 3.2-3B: 30 tok/s
- Qwen 2.5-7B: 18 tok/s
llcuda v2.0 is Tesla T4-only. For other GPUs, use llcuda v1.2.2.
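If you are unsure which GPU your Colab session was assigned, a quick check using the device API documented below (note that llcuda also performs this check itself on first import):

```python
from llcuda.core import get_device_properties

# Tesla T4 reports compute capability 7.5 (SM 7.5).
props = get_device_properties(0)
cc = (props.compute_capability_major, props.compute_capability_minor)
if cc == (7, 5):
    print(f"{props.name}: SM {cc[0]}.{cc[1]} -- supported by llcuda v2.0")
else:
    print(f"{props.name}: SM {cc[0]}.{cc[1]} -- use llcuda v1.2.2 instead")
```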
Downloaded automatically on first import:
- llama-server (6.5 MB) - HTTP inference server
- libggml-cuda.so (219 MB) - CUDA kernels with FlashAttention
- Supporting libraries - cuBLAS, CUDA runtime wrappers
Features:
- FlashAttention (GGML_CUDA_FA=ON)
- CUDA Graphs (GGML_CUDA_GRAPHS=ON)
- All quantization types (GGML_CUDA_FA_ALL_QUANTS=ON)
- SM 7.5 code generation
The native core module (llcuda.core), built from source or downloaded pre-compiled, provides:
- Device management
- Tensor operations (zeros, copy, transfer)
- cuBLAS matrix multiplication
from llcuda.core import Device, get_device_count, get_device_properties
# Get GPU count
num_gpus = get_device_count()
# Get device info
props = get_device_properties(0)
print(f"Device: {props.name}")
print(f"Compute: SM {props.compute_capability_major}.{props.compute_capability_minor}")
print(f"Memory: {props.total_memory / 1024**3:.2f} GB")from llcuda.core import Tensor, DType
from llcuda.core import Tensor, DType

# Create tensor (uninitialized)
t = Tensor([1024, 1024], dtype=DType.Float32, device=0)
# Create zero-filled tensor
zeros = Tensor.zeros([512, 512], dtype=DType.Float16, device=0)
# Properties
print(f"Shape: {t.shape}")
print(f"Elements: {t.numel()}")
print(f"Bytes: {t.nbytes()}")
print(f"Device: {t.device}")from llcuda.core import matmul
from llcuda.core import matmul

# Matrix multiplication
A = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
B = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
# Using @ operator
C = A @ B  # Automatically uses Tensor Cores for FP16 on Tesla T4
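To see why Tensor Core dispatch matters here, a back-of-the-envelope count of the work in the 2048x2048x2048 FP16 matmul above; the ~65 TFLOPS FP16 Tensor Core peak for the T4 is an approximate published figure, not something llcuda reports:

```python
# Rough arithmetic: FLOPs in the matmul above vs. the T4's approximate FP16 peak.
M = N = K = 2048
flops = 2 * M * N * K              # one multiply + one add per inner-product term
t4_fp16_peak = 65e12               # ~65 TFLOPS with Tensor Cores (approximate)
print(f"{flops / 1e9:.1f} GFLOP -> at least {flops / t4_fp16_peak * 1e6:.0f} µs at peak")
```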
# Cell 1: Install
!pip install llcuda
# Cell 2: Verify GPU
import llcuda
from llcuda.core import get_device_properties
props = get_device_properties(0)
print(f"GPU: {props.name}")
print(f"Compute: SM {props.compute_capability_major}.{props.compute_capability_minor}")
# Cell 3: Test Tensor API
from llcuda.core import Tensor, DType
A = Tensor.zeros([1024, 1024], dtype=DType.Float16, device=0)
B = Tensor.zeros([1024, 1024], dtype=DType.Float16, device=0)
C = A @ B
print("✅ Tensor API works!")
# Cell 4: Test HTTP Server
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", silent=True)
result = engine.infer("Hello, world!", max_tokens=20)
print(result.text)

First run will download 264 MB T4 binaries (5-10 minutes on Colab). Subsequent runs will use cached binaries (instant).
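If a download was interrupted or you want to force a fresh fetch, clearing the cache directory is enough (assuming, as described above, that llcuda re-downloads whenever ~/.cache/llcuda/ is missing):

```python
import shutil
from pathlib import Path

# Remove the cached binaries so the next `import llcuda` re-downloads them.
cache_dir = Path.home() / ".cache" / "llcuda"
if cache_dir.exists():
    shutil.rmtree(cache_dir)
    print("Cache cleared.")
else:
    print("No cache to clear.")
```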
| Model | VRAM | Speed (tok/s) | Latency | Context |
|---|---|---|---|---|
| Gemma 3-1B Q4_K_M | 1.2 GB | 45 | 22ms | 2048 |
| Llama 3.2-3B Q4_K_M | 2.0 GB | 30 | 33ms | 4096 |
| Qwen 2.5-7B Q4_K_M | 5.0 GB | 18 | 56ms | 8192 |
| Llama 3.1-8B Q4_K_M | 5.5 GB | 15 | 67ms | 8192 |
FlashAttention Impact: 2-3x faster for contexts > 2048 tokens
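The table numbers are easy to sanity-check in your own session; a small sketch that averages the reported decode speed over a few runs (actual figures vary with prompt length and context size):

```python
import llcuda

# Average the reported decode speed over a few generations.
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", silent=True)

speeds = []
for _ in range(3):
    result = engine.infer("Summarize the theory of relativity.", max_tokens=128)
    speeds.append(result.tokens_per_sec)
print(f"Mean decode speed: {sum(speeds) / len(speeds):.1f} tok/s")
```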
Parse GGUF model files with zero-copy memory mapping:
from llcuda.gguf_parser import GGUFReader
with GGUFReader("model.gguf") as reader:
    # Metadata
    print(f"Model: {reader.metadata.get('general.name', 'unknown')}")

    # Tensors
    print(f"Tensors: {len(reader.tensors)}")
    for name, info in list(reader.tensors.items())[:5]:
        print(f"  {name}: {info.shape} ({info.ggml_type.name})")
engine = llcuda.InferenceEngine()

# Load from HuggingFace with custom settings
engine.load_model(
    model_name="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
    gpu_layers=35,
    context_size=8192,
    silent=True
)
result = engine.infer("Explain quantum computing", max_tokens=200)# All tests
python3.11 -m pytest tests/ -v
# Tensor API tests
python3.11 -m pytest tests/test_tensor_api.py -v
# GGUF parser tests
python3.11 -m pytest tests/test_gguf_parser.py -v

- CUDA device management
- Tensor operations
- cuBLAS matmul
- Bootstrap refactor for T4-only
- GGUF parser implementation
- Model loader for GGUF → Tensor
- Custom FA2 CUDA kernels
- Long context optimization
- NF4 quantization kernels
- Direct Unsloth model loading
- model.save_pretrained_llcuda() export
❌ INCOMPATIBLE GPU DETECTED
Your GPU is not Tesla T4
Required: Tesla T4 (SM 7.5)
llcuda v2.0 requires a Tesla T4 GPU.
Compatible environment: Google Colab
Solution: Use Google Colab with Tesla T4
# Download T4 binaries manually
wget https://github.com/waqasm86/llcuda/releases/download/v2.0.0/llcuda-binaries-cuda12-t4.tar.gz
mkdir -p ~/.cache/llcuda
tar -xzf llcuda-binaries-cuda12-t4.tar.gz -C ~/.cache/llcuda/
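The same manual fallback from Python, for sessions where shell access is awkward (same URL and target directory as the commands above):

```python
import tarfile
import urllib.request
from pathlib import Path

# Fetch and unpack the T4 binaries into the cache directory llcuda expects.
url = ("https://github.com/waqasm86/llcuda/releases/download/"
       "v2.0.0/llcuda-binaries-cuda12-t4.tar.gz")
cache_dir = Path.home() / ".cache" / "llcuda"
cache_dir.mkdir(parents=True, exist_ok=True)
archive = cache_dir / "llcuda-binaries-cuda12-t4.tar.gz"
urllib.request.urlretrieve(url, archive)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(cache_dir)
print("Binaries extracted to", cache_dir)
```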
- Colab Build Notebook: notebooks/build_llcuda_v2_t4_colab.ipynb
- GitHub Issues: https://github.com/waqasm86/llcuda/issues
MIT License - see LICENSE file for details.
- Built on llama.cpp
- FlashAttention from Dao et al.
- Designed for Unsloth integration
- PyPI: https://pypi.org/project/llcuda/
- GitHub: https://github.com/waqasm86/llcuda
- Unsloth: https://github.com/unslothai/unsloth
Version: 2.0.0
Target GPU: Tesla T4 ONLY (SM 7.5)
Platform: Google Colab
License: MIT