CUDA-accelerated LLM inference for Python. PyTorch-style, zero-configuration, works on Colab/Kaggle/local GPUs.


llcuda v2.0: CUDA Inference Backend for Unsloth


Python-first CUDA inference backend exposing low-level, quantization-aware GPU execution for Unsloth fine-tuned models.

NEW IN V2.0: Native Tensor API with custom CUDA kernels, T4-optimized builds, FlashAttention, and direct Unsloth integration.

🎯 What is llcuda v2.0?

llcuda v2.0 is a production-ready inference backend designed exclusively for the Tesla T4 GPU (Google Colab's standard accelerator), with:

  • Native Tensor API: PyTorch-style GPU operations with custom CUDA kernels
  • Tensor Core Optimization: SM 7.5 targeting for maximum performance
  • FlashAttention Support: 2-3x faster attention for long contexts
  • CUDA Graphs: Reduced kernel launch overhead
  • Unsloth Integration: Direct loading of NF4-quantized fine-tuned models
  • GGUF Support: Compatible with llama.cpp model format

Architecture

llcuda v2.0 provides two complementary APIs:

  1. V1.x HTTP Server API (GGUF models, OpenAI-compatible)
  2. V2.0 Native Tensor API (custom operations, quantization-aware)

This dual design enables both easy deployment (HTTP) and low-level control (native) in a single package.


🚀 Quick Start (Google Colab)

Installation

!pip install llcuda

Requirements:

  • Python 3.11+
  • Google Colab Tesla T4 GPU (SM 7.5)
  • CUDA 12.x runtime

On first import, llcuda will:

  1. Verify your GPU is Tesla T4 (SM 7.5)
  2. Download T4-optimized binaries (264 MB, one-time)
  3. Extract to ~/.cache/llcuda/
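
To confirm the one-time setup completed, you can inspect the cache directory directly. A minimal sketch using only the standard library (the path is the documented ~/.cache/llcuda/):

from pathlib import Path

# The documented cache location for the downloaded T4 binaries
cache = Path.home() / ".cache" / "llcuda"
if cache.exists():
    size_mb = sum(f.stat().st_size for f in cache.rglob("*") if f.is_file()) / 1024**2
    print(f"Binaries cached at {cache} ({size_mb:.0f} MB)")
else:
    print("Cache empty - binaries will download on first import")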

V1.x HTTP Server API (GGUF Models)

import llcuda

# Initialize inference engine
engine = llcuda.InferenceEngine()

# Load GGUF model from HuggingFace
engine.load_model("gemma-3-1b-Q4_K_M", silent=True)

# Run inference
result = engine.infer("What is artificial intelligence?", max_tokens=100)
print(result.text)
print(f"Speed: {result.tokens_per_sec:.1f} tokens/sec")

V2.0 Native Tensor API (Custom Operations)

from llcuda.core import Tensor, DType, matmul, get_device_count

# Check GPU
print(f"GPUs available: {get_device_count()}")

# Create tensors on GPU 0
A = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
B = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)

# Matrix multiplication (cuBLAS with Tensor Cores)
C = A @ B

print(f"Result shape: {C.shape}")
print(f"Memory: {C.nbytes() / 1024**2:.2f} MB")

🎮 GPU Requirements

Tesla T4 (SM 7.5) - ONLY Supported GPU ✅

llcuda v2.0 is optimized exclusively for Tesla T4:

  • Tensor Cores: INT8/FP16 acceleration
  • FlashAttention: Full support
  • Memory: 16 GB VRAM
  • Platform: Google Colab standard GPU

Performance on T4:

  • Gemma 3-1B: 45 tok/s
  • Llama 3.2-3B: 30 tok/s
  • Qwen 2.5-7B: 18 tok/s

Other GPUs NOT Supported

llcuda v2.0 is Tesla T4-only. For other GPUs, use llcuda v1.2.2.
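
To check compatibility programmatically before downloading anything, here is a minimal sketch using the get_device_properties API from the reference below; the SM 7.5 comparison mirrors llcuda's own first-import check:

from llcuda.core import get_device_properties

# Query the first GPU and compare its compute capability against SM 7.5 (Tesla T4)
props = get_device_properties(0)
sm = (props.compute_capability_major, props.compute_capability_minor)
print(f"{props.name}: SM {sm[0]}.{sm[1]}")
if sm == (7, 5):
    print("Supported by llcuda v2.0")
else:
    print("Not supported by v2.0 - use llcuda v1.2.2 instead")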


📦 What's Included

T4 Binary Package (264 MB)

Downloaded automatically on first import:

  • llama-server (6.5 MB) - HTTP inference server
  • libggml-cuda.so (219 MB) - CUDA kernels with FlashAttention
  • Supporting libraries - cuBLAS, CUDA runtime wrappers

Features:

  • FlashAttention (GGML_CUDA_FA=ON)
  • CUDA Graphs (GGML_CUDA_GRAPHS=ON)
  • FlashAttention for all quantization types (GGML_CUDA_FA_ALL_QUANTS=ON)
  • SM 7.5 code generation

Native Extension (llcuda_cpp.so)

Built from source or downloaded pre-compiled:

  • Device management
  • Tensor operations (zeros, copy, transfer)
  • cuBLAS matrix multiplication

🛠️ V2.0 Native API Reference

Device Management

from llcuda.core import Device, get_device_count, get_device_properties

# Get GPU count
num_gpus = get_device_count()

# Get device info
props = get_device_properties(0)
print(f"Device: {props.name}")
print(f"Compute: SM {props.compute_capability_major}.{props.compute_capability_minor}")
print(f"Memory: {props.total_memory / 1024**3:.2f} GB")

Tensor Creation

from llcuda.core import Tensor, DType

# Create tensor (uninitialized)
t = Tensor([1024, 1024], dtype=DType.Float32, device=0)

# Create zero-filled tensor
zeros = Tensor.zeros([512, 512], dtype=DType.Float16, device=0)

# Properties
print(f"Shape: {t.shape}")
print(f"Elements: {t.numel()}")
print(f"Bytes: {t.nbytes()}")
print(f"Device: {t.device}")

Matrix Operations

from llcuda.core import Tensor, DType, matmul

# Matrix multiplication
A = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
B = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)

# Using the @ operator
C = A @ B  # Automatically uses Tensor Cores for FP16 on Tesla T4

# Equivalent explicit call
C = matmul(A, B)

🌐 Google Colab Complete Example

# Cell 1: Install
!pip install llcuda

# Cell 2: Verify GPU
import llcuda
from llcuda.core import get_device_properties

props = get_device_properties(0)
print(f"GPU: {props.name}")
print(f"Compute: SM {props.compute_capability_major}.{props.compute_capability_minor}")

# Cell 3: Test Tensor API
from llcuda.core import Tensor, DType

A = Tensor.zeros([1024, 1024], dtype=DType.Float16, device=0)
B = Tensor.zeros([1024, 1024], dtype=DType.Float16, device=0)
C = A @ B
print("✅ Tensor API works!")

# Cell 4: Test HTTP Server
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", silent=True)
result = engine.infer("Hello, world!", max_tokens=20)
print(result.text)

The first run downloads the 264 MB T4 binary package (5-10 minutes on Colab); subsequent runs use the cached binaries and start instantly.


📊 Performance Benchmarks (Tesla T4)

| Model | VRAM | Speed (tok/s) | Latency (per token) | Context |
|-------|------|---------------|---------------------|---------|
| Gemma 3-1B Q4_K_M | 1.2 GB | 45 | 22 ms | 2048 |
| Llama 3.2-3B Q4_K_M | 2.0 GB | 30 | 33 ms | 4096 |
| Qwen 2.5-7B Q4_K_M | 5.0 GB | 18 | 56 ms | 8192 |
| Llama 3.1-8B Q4_K_M | 5.5 GB | 15 | 67 ms | 8192 |
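
Latency here is per generated token, i.e. the inverse of throughput: at 45 tok/s, 1000 / 45 ≈ 22 ms per token.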

FlashAttention Impact: 2-3x faster for contexts > 2048 tokens


🔬 Advanced Features

GGUF Parser

Parse GGUF model files with zero-copy memory mapping:

from llcuda.gguf_parser import GGUFReader

with GGUFReader("model.gguf") as reader:
    # Metadata
    print(f"Model: {reader.metadata.get('general.name', 'unknown')}")
    
    # Tensors
    print(f"Tensors: {len(reader.tensors)}")
    for name, info in list(reader.tensors.items())[:5]:
        print(f"  {name}: {info.shape} ({info.ggml_type.name})")

Custom Model Loading

engine = llcuda.InferenceEngine()

# Load from HuggingFace with custom settings
engine.load_model(
    model_name="TheBloke/Llama-2-7B-GGUF",  # HuggingFace repo ID
    filename="llama-2-7b.Q4_K_M.gguf",      # specific GGUF file within the repo
    gpu_layers=35,                          # transformer layers offloaded to the GPU
    context_size=8192,                      # context window in tokens
    silent=True                             # suppress log output during loading
)

result = engine.infer("Explain quantum computing", max_tokens=200)

🧪 Testing

# All tests
python3.11 -m pytest tests/ -v

# Tensor API tests
python3.11 -m pytest tests/test_tensor_api.py -v

# GGUF parser tests
python3.11 -m pytest tests/test_gguf_parser.py -v

🗺️ Roadmap

Phase 1: Core Tensor API ✅ (Complete)

  • CUDA device management
  • Tensor operations
  • cuBLAS matmul

Phase 2: GGUF Integration 🚧 (In Progress)

  • Bootstrap refactor for T4-only
  • GGUF parser implementation
  • Model loader for GGUF → Tensor

Phase 3: Flash Attention 📅 (Planned)

  • Custom FA2 CUDA kernels
  • Long context optimization

Phase 4: Unsloth Integration 📅 (Planned)

  • NF4 quantization kernels
  • Direct Unsloth model loading
  • model.save_pretrained_llcuda() export

🔧 Troubleshooting

GPU Not Compatible

❌ INCOMPATIBLE GPU DETECTED

Your GPU is not Tesla T4
Required: Tesla T4 (SM 7.5)

llcuda v2.0 requires Tesla T4 GPU.
Compatible environment: Google Colab

Solution: Use Google Colab with a Tesla T4 GPU runtime

Binary Download Failed

# Download T4 binaries manually
wget https://github.com/waqasm86/llcuda/releases/download/v2.0.0/llcuda-binaries-cuda12-t4.tar.gz
mkdir -p ~/.cache/llcuda
tar -xzf llcuda-binaries-cuda12-t4.tar.gz -C ~/.cache/llcuda/

📄 License

MIT License - see LICENSE file for details.


Version: 2.0.0
Target GPU: Tesla T4 ONLY (SM 7.5)
Platform: Google Colab
License: MIT