
A fork of convert_to_quant that adds QuIP quantization for INT‑8 models.


ThunderFun/convert_to_quant_QuIP_INT8

 
 


convert_to_quant

Convert safetensors weights to quantized formats (INT8, FP16) with learned rounding optimization for ComfyUI inference.

Python 3.9+ License: MIT


Warning

Experimental State: This project is a fork, currently in a rough state, and has not been extensively tested. It might not be actively maintained. Use with caution.


Installation

Important

PyTorch must be installed first with the correct CUDA version for your GPU. This package does not install PyTorch automatically to avoid conflicts with your existing setup.

Step 1: Install PyTorch (GPU-specific)

Visit pytorch.org to get the correct install command for your system.

Examples:

# CUDA 12.8 (stable)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# CPU only (no GPU acceleration)
pip install torch --index-url https://download.pytorch.org/whl/cpu

Step 2: Install convert_to_quant

# Install from source
git clone https://github.com/silveroxides/convert_to_quant.git
cd convert_to_quant
pip install -e .

Optional: Triton (for GPU-accelerated INT8 kernels; quantization falls back to PyTorch if Triton is unavailable)

On Linux
pip install -U triton

On Windows
# for torch>=2.6
pip install -U "triton-windows<3.3"

Quick Start

Recommended: QuIP for near-lossless INT8 quantization (highest weight fidelity, best for LoRA compatibility)

Note: QuIP is optimized for transformer architectures (text encoders, diffusion transformers, etc.). For other model types, the default learned rounding optimizer may be more suitable.

convert_to_quant -i model.safetensors --optimizer quip --comfy_quant

Default Format: QuIP uses tensor-wise scaling by default (--quip-requant-scheme tensor), which produces the standard int8_tensorwise format. This provides:

  • Maximum compatibility with ComfyUI and standard inference pipelines
  • Single global scale per weight matrix (scalar)

For potentially higher precision (at the cost of compatibility): Use --quip-requant-scheme block to enable block-wise re-quantization.

QuIP with SmoothQuant for maximum accuracy (best for models with activation outliers, may reduce LoRA compatibility)

convert_to_quant -i model.safetensors --optimizer quip --smoothquant --comfy_quant

Basic INT8 quantization with ComfyUI metadata (default optimizer)

convert_to_quant -i model.safetensors --comfy_quant

Low VRAM / Memory-efficient mode

convert_to_quant -i model.safetensors --comfy_quant --low-memory

Streaming mode (memory-efficient with GPU acceleration)

convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=balanced

Use --streaming-mode=balanced for faster quantization while still saving RAM. Unlike --low-memory, this keeps calculations on GPU.

Aggressive streaming mode (faster with more VRAM)

convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=aggressive

Use --streaming-mode=aggressive for 10-20% faster processing on GPUs with 12GB+ VRAM. Increases GPU computation thresholds by 2x while still avoiding OOM.

With custom learning rate (adaptive schedule by default)

convert_to_quant -i model.safetensors --comfy_quant --lr 0.01

Load the output .safetensors file in ComfyUI like any other model.


Supported Quantization Formats

| Format | Flag | Hardware | Notes |
|---|---|---|---|
| INT8 (tensor-wise) | --optimizer quip (default) | Any GPU / CPU | Standard W8A8, maximum compatibility |
| INT8 (block-wise) | --scaling-mode block | Any GPU / CPU | Good balance of quality/speed |
| INT8 (axis-wise) | --scaling-mode axis | Any GPU / CPU | Per-row scaling |
| FP16 | --fp16 | Any GPU / CPU | High precision fallback |

Recommended: Use QuIP with tensor-wise scaling (--optimizer quip, default) for best compatibility and near-lossless quality.


Model-Specific Presets

| Model | Flag | Notes |
|---|---|---|
| Flux.2 | --flux2 | Keep modulation/guidance/time/final high-precision |
| LoRA | --lora | Skip alpha/scale, quantize lora_up/down |
| T5-XXL Text Encoder | --t5xxl | Decoder removed, skip norms/biases |
| Mistral Text Encoder | --mistral | Norms/biases excluded |
| Visual Encoder | --visual | MLP layers excluded |
| Hunyuan Video | --hunyuan | Attention norms and vision_in excluded |
| WAN Video | --wan | Embeddings, encoders, and head excluded |
| Qwen Image | --qwen | Image layers and added norms excluded |
| Z-Image | --zimage | cap_embedder/norms excluded |
| Z-Image Refiner | --zimage_refiner | Context/noise refiner high-precision |
| Chroma/Distilled (Large) | --distillation_large | Keep distilled_guidance, final, img/txt_in high-precision |
| Chroma/Distilled (Small) | --distillation_small | Keep only distilled_guidance high-precision |
| NeRF (Large) | --nerf_large | Keep nerf_blocks, distilled_guidance, txt_in high-precision |
| NeRF (Small) | --nerf_small | Keep nerf_blocks, distilled_guidance high-precision |
| Radiance | --radiance | Keep img_in_patch, nerf_final_layer high-precision |

Documentation


Project Structure

convert_to_quant/
├── convert_to_quant/            # Main package
│   ├── cli/                     # CLI entry point & argument parsing
│   ├── comfy/                   # ComfyUI integration components & kernels
│   ├── config/                  # Layer configuration & templates
│   ├── converters/              # Core quantization logic (INT8, GPTQ, SmoothQuant)
│   ├── utils/                   # Shared utilities (tensor, memory, metrics)
│   ├── constants.py             # Model Filter Registry & constants
│   ├── quantization.py          # Simplified INT8 entry point
│   └── convert_to_quant.py      # Backward-compatibility wrapper
├── pyproject.toml               # Package configuration
├── MANUAL.md                    # User documentation
└── ...

Key Features

  • Unified Safetensors Loader: Memory-efficient streaming loader with two modes:
    • Standard mode: Preloads all tensors for maximum speed
    • Low-memory mode: Streams tensors on-demand, loading only one tensor at a time into RAM
  • QuIP (Quantization with Incoherence Processing): Near-lossless INT8 quantization using randomized Hadamard transforms to eliminate outliers and make weights more quantization-friendly. Defaults to standard int8_tensorwise format for maximum compatibility and inference performance (a minimal sketch of the transform idea appears after this list).
  • Learned Rounding: SVD-based optimization that minimizes quantization error along the weight's principal directions
  • GPTQ Optimizer: Sequential layer-wise optimization with error compensation
  • SmoothQuant: Preprocessing to migrate quantization difficulty from activations to weights
  • LoRA-Informed Calibration: Use existing LoRA tensors (--calibration-lora) to guide the quantization process for better compatibility
  • Multiple Optimizers: QuIP, GPTQ, AdamW, RAdam, and the original adaptive LR optimizer
  • Bias Correction: Automatic bias adjustment using synthetic calibration data
  • Model-Specific Support: Exclusion lists for sensitive layers (norms, embeddings, distillation)
  • Triton Kernels: GPU-accelerated quantization/dequantization with fallback to PyTorch
  • Layer Config JSON: Fine-grained per-layer control with regex pattern matching
  • LR Schedules: Adaptive, exponential, and plateau learning rate scheduling
  • Quality Metrics: MSE and SQNR reporting for validation
  • BF16 Compute Mode: Half-precision computation on Ampere+ GPUs for 2× memory savings
  • Checkpointed Quantization: Extreme memory savings (75-90%) for large layers
  • Streaming Modes: Configurable CPU/GPU offloading with auto-detection based on VRAM
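
A minimal sketch of the incoherence-processing idea referenced in the QuIP bullet above: random sign flips plus a Walsh-Hadamard rotation spread outliers across the whole weight matrix. Function names here are illustrative, not the package's internal API, and the transform assumes power-of-two dimensions.

import math
import torch

def fwht(x):
    # Iterative fast Walsh-Hadamard transform along the last dimension.
    # The length of that dimension must be a power of two.
    n = x.shape[-1]
    h = 1
    y = x.clone()
    while h < n:
        y = y.reshape(*y.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2)
        y = y.reshape(*y.shape[:-3], n)
        h *= 2
    return y / math.sqrt(n)   # orthonormal scaling

def incoherence_transform(w, seed=0):
    # Random sign flips followed by the Hadamard rotation flatten outlier
    # weights across all coordinates, making round-to-nearest INT8 far less lossy.
    g = torch.Generator().manual_seed(seed)
    signs = (torch.randint(0, 2, (w.shape[-1],), generator=g) * 2 - 1).to(w.dtype)
    return fwht(w * signs), signs   # keep the signs so the transform can be undone

Because the transform is orthogonal it can be inverted after quantization, which is how the default mode can store weights back in original space as plain int8_tensorwise.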

Optimizations

The following performance optimizations are implemented:

1. Triton Kernels

GPU-accelerated INT8 quantization/dequantization with autotuning for optimal block sizes.

  • int8_gemm, int8_addmm - Optimized matrix multiplication
  • act_quant, weight_quant, act_dequant, weight_dequant - Fast quantization ops
  • Automatic fallback to PyTorch when Triton is unavailable
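
For reference, a minimal sketch of what the PyTorch fallback conceptually computes for symmetric tensor-wise INT8 weights; the names are illustrative, not the actual weight_quant/weight_dequant signatures.

import torch

def quantize_weight_tensorwise(w):
    # One global scale so the largest magnitude maps to 127.
    scale = w.abs().max().clamp(min=1e-12) / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_weight_tensorwise(q, scale, dtype=torch.float32):
    return q.to(dtype) * scale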

2. QuIP Matrix Multiplication

Specialized matrix multiplication for QuIP-quantized weights with Hadamard transform support.

  • quip_int8_matmul() - Optimized for QuIP's transformed weight format
  • Supports sign vectors (s_u, s_v) and inverse transforms

3. Tensor Buffer Pool

Efficient buffer reuse during quantization to reduce memory allocations.

  • TensorBufferPool class with LRU eviction
  • Configurable max_buffers limit
  • Reduces GC pressure during iterative optimization
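
A minimal sketch of the pattern, assuming buffers keyed by shape/dtype/device with least-recently-used eviction; this is not the actual TensorBufferPool code.

from collections import OrderedDict
import torch

class SimpleBufferPool:
    def __init__(self, max_buffers=8):
        self.max_buffers = max_buffers
        self._buffers = OrderedDict()   # (shape, dtype, device) -> reusable tensor

    def get(self, shape, dtype=torch.float32, device="cpu"):
        key = (tuple(shape), dtype, str(device))
        if key in self._buffers:
            self._buffers.move_to_end(key)        # mark as most recently used
            return self._buffers[key]
        buf = torch.empty(shape, dtype=dtype, device=device)
        self._buffers[key] = buf
        if len(self._buffers) > self.max_buffers:
            self._buffers.popitem(last=False)     # evict the least recently used buffer
        return buf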

4. CUDA Graphs

Capture and replay CUDA graphs to eliminate CPU launch overhead.

  • CudaGraphRunner class for repeatable kernel sequences
  • Warmup iterations before capture
  • Ideal for batched inference scenarios
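
A minimal sketch of the capture-and-replay pattern using PyTorch's CUDA graph API; the actual CudaGraphRunner interface may differ.

import torch

def make_graphed(fn, example_input, warmup=3):
    # Warm up on a side stream so lazy initialization is not captured.
    static_in = example_input.clone()
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(warmup):
            fn(static_in)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = fn(static_in)     # capture one full kernel sequence

    def run(new_input):
        static_in.copy_(new_input)     # write into the captured input buffer
        graph.replay()                 # replay without per-kernel CPU launch overhead
        return static_out
    return run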

5. Parallel Processing (convert_to_quant/quantization.py)

Thread pool for I/O-bound layer-wise operations.

  • ParallelProcessor with ThreadPoolExecutor
  • Configurable max_workers
  • Falls back to sequential for single items
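
A minimal sketch of that pattern (illustrative, not the actual ParallelProcessor class):

from concurrent.futures import ThreadPoolExecutor

def process_layers(layer_names, process_one, max_workers=4):
    # I/O-bound per-layer work benefits from threads; a single item stays sequential.
    if len(layer_names) <= 1:
        return [process_one(name) for name in layer_names]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, layer_names))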

6. Lazy Logging

Defer expensive string formatting until messages are actually logged.

  • LazyString - On-demand string evaluation
  • LazyFormat - Dynamic value evaluation
  • Reduces overhead when log level filters messages
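
A minimal sketch of the idea (LazyString/LazyFormat are the package's names; this stand-in is only illustrative): the expensive string is built inside __str__, so it never runs when the log level filters the message.

import logging

class Lazy:
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args
    def __str__(self):
        return str(self.fn(*self.args))   # evaluated only if the record is actually emitted

logger = logging.getLogger(__name__)
expensive = Lazy(lambda: ", ".join(f"layer_{i}" for i in range(10_000)))
logger.debug("processed: %s", expensive)  # the join never runs unless DEBUG is enabled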

7. Memory-Mapped Loading

Zero-copy tensor loading via memory mapping.

  • MemoryMappedTensorLoader - Direct file mapping
  • UnifiedSafetensorsLoader - Unified interface with optional mmap
  • ~50% memory reduction for large models
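
For context, the safetensors library's safe_open API already provides this kind of on-demand, memory-mapped access; a minimal usage sketch of the underlying behaviour:

from safetensors import safe_open

with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)   # only this tensor is materialized in RAM
        # ... quantize and write it out, then drop the reference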

8. BF16 Compute Mode (convert_to_quant/constants.py)

Half-precision computation on Ampere+ GPUs (RTX 30 series, A100, etc.).

  • 2× memory savings for large tensor operations
  • Automatic detection of BF16 support
  • Per-operation threshold configuration
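
A minimal sketch of how such auto-detection typically looks; the helper name and default threshold are illustrative assumptions, not the package's actual configuration.

import torch

BF16_THRESHOLD = 1_000_000   # elements; adjustable per operation

def pick_compute_dtype(t):
    # BF16 only on GPUs with native support (Ampere or newer), and only for
    # tensors large enough that the 2x memory saving matters.
    if t.is_cuda and torch.cuda.is_bf16_supported() and t.numel() >= BF16_THRESHOLD:
        return torch.bfloat16
    return torch.float32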

Advanced Usage

Layer Config JSON

Define per-layer quantization settings with regex patterns:

# Generate a template from your model
convert_to_quant -i model.safetensors --dry-run create-template

# Apply custom layer config
convert_to_quant -i model.safetensors --layer-config layers.json --comfy_quant

Scaling Modes (Non-QuIP Optimizers)

For optimizers other than QuIP:

# Block-wise scaling (default for learned rounding/GPTQ)
convert_to_quant -i model.safetensors --scaling-mode block --block_size 128 --comfy_quant

# Axis-wise (per-row) scaling
convert_to_quant -i model.safetensors --scaling-mode axis --comfy_quant

# Tensor-wise scaling
convert_to_quant -i model.safetensors --scaling-mode tensor --comfy_quant

Note: For the QuIP optimizer, use --quip-requant-scheme {tensor,block} instead (see QuIP Storage Options below).
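
To make the difference between the three granularities concrete, here is a minimal sketch of how the scales are typically computed for a 2-D weight; this is illustrative, not the converter's internal code.

import torch

def compute_scales(w, block_size=128):
    tensor_scale = w.abs().max() / 127.0                            # one scalar per weight matrix
    axis_scale = w.abs().amax(dim=1, keepdim=True) / 127.0          # one scale per output row
    blocks = w.reshape(w.shape[0], -1, block_size)                  # assumes columns divide evenly
    block_scale = blocks.abs().amax(dim=-1, keepdim=True) / 127.0   # one scale per block
    return tensor_scale, axis_scale, block_scale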

QuIP Storage Options

By default, QuIP stores weights in original space with a single global scale (tensor-wise) for maximum compatibility:

# Default: Standard int8_tensorwise format with tensor-wise scaling (recommended)
convert_to_quant -i model.safetensors --optimizer quip --comfy_quant

Benefits of tensor-wise (default):

  • Maximum compatibility with Z-Image, ComfyUI, and standard inference pipelines
  • Single scalar scale per weight matrix
  • No special loader requirements
  • Uses optimized hardware INT8 kernels

Alternative: Block-wise re-quantization

For potentially higher precision (at the cost of producing int8_blockwise format):

# Block-wise re-quantization (may improve precision, less compatible)
convert_to_quant -i model.safetensors --optimizer quip --comfy_quant --quip-requant-scheme block

Comparison:

| Scheme | Format | Scale Type | Compatibility | Use Case |
|---|---|---|---|---|
| tensor (default) | int8_tensorwise | Scalar (1 value) | Maximum | Recommended for all standard inference |
| block | int8_blockwise | 3D blocks | Limited | Potentially higher precision |

Experimental: Transformed space storage

For advanced use cases with custom loaders:

# Store in transformed space (requires custom loader with Hadamard support)
convert_to_quant -i model.safetensors --optimizer quip --comfy_quant --quip-store-transformed

Note: Not compatible with standard inference pipelines.

Memory-Efficient Loading

The UnifiedSafetensorsLoader provides a single interface for loading safetensors files, with optional streaming support:

# Low-memory streaming mode - loads tensors on-demand
convert_to_quant -i model.safetensors --comfy_quant --low-memory

# Standard mode (default) - preloads all tensors for faster processing
convert_to_quant -i model.safetensors --comfy_quant

When to use low-memory mode:

  • Quantizing very large models that don't fit in system RAM
  • Running on machines with limited memory
  • Processing models alongside other memory-intensive applications

Mode comparison:

| Mode | Memory Usage | Speed | Use Case |
|---|---|---|---|
| Standard | ~2x model size | Fast | Recommended for most users |
| Low-memory | ~1x model size + 1 tensor | Slower | Limited RAM environments |

Streaming Modes

The quantizer provides several streaming modes for memory-efficient processing:

# Auto-detect based on GPU VRAM (recommended)
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=auto

# Balanced approach (default)
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=balanced

# Aggressive CPU offloading (maximum memory safety)
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=aggressive

# Minimal offloading (for 12-16GB VRAM)
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=minimal

# Disable streaming (requires 24GB+ VRAM)
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=off

Streaming Mode Comparison:

| Mode | Hadamard Threshold | Behavior | Use Case |
|---|---|---|---|
| off | ∞ (infinity) | Never offload to CPU | Workstations with 24GB+ VRAM |
| minimal | 100M elements (~400MB) | Conservative offloading | 12-16GB VRAM |
| balanced | 50M elements (~200MB) | Moderate offloading | 8-12GB VRAM |
| aggressive | 25M elements (~100MB) | Aggressive offloading | <8GB VRAM or maximum safety |
| auto | Adaptive | Detects based on VRAM | Recommended default |
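
Conceptually, each mode just sets the element-count threshold above which Hadamard-related work is pushed to the CPU; a hedged sketch of that decision (thresholds mirror the table above, the helper itself is illustrative):

STREAMING_THRESHOLDS = {
    "off": float("inf"),
    "minimal": 100_000_000,
    "balanced": 50_000_000,
    "aggressive": 25_000_000,
}

def pick_device(tensor, mode="balanced"):
    # Tensors above the threshold are processed on CPU to avoid exhausting VRAM.
    return "cpu" if tensor.numel() > STREAMING_THRESHOLDS[mode] else "cuda"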

BF16 Compute Mode

Enable BF16 compute on Ampere+ GPUs (RTX 30 series, A100, etc.) for 2× memory savings:

# Auto-enable for large tensors (default)
convert_to_quant -i model.safetensors --comfy_quant --bf16-compute=auto

# Force BF16 for all supported operations
convert_to_quant -i model.safetensors --comfy_quant --bf16-compute=on

# Disable BF16, use FP32 only
convert_to_quant -i model.safetensors --comfy_quant --bf16-compute=off

BF16 with Custom Thresholds:

# Adjust tensor size thresholds for BF16 (in elements)
convert_to_quant -i model.safetensors --comfy_quant \
    --bf16-compute=auto \
    --bf16-threshold 1000000 \
    --bf16-hadamard-threshold 500000 \
    --bf16-hessian-threshold 1000000

Recommended for:

  • 8GB GPUs: --streaming-mode=aggressive --bf16-compute=on
  • 12GB GPUs: --streaming-mode=balanced --bf16-compute=on
  • 16GB+ GPUs: --streaming-mode=auto --bf16-compute=on

Checkpointed Quantization

For extreme memory savings on large layers (75-90% reduction):

# Enable checkpointed quantization with default settings
convert_to_quant -i model.safetensors --comfy_quant --optimizer quip --quip-checkpointed

# Custom threshold and segments
convert_to_quant -i model.safetensors --comfy_quant --optimizer quip \
    --quip-checkpointed \
    --quip-checkpoint-threshold 8192 \
    --quip-checkpoint-segments 4

Options:

  • --quip-checkpointed - Enable checkpointed LDLQ quantization
  • --quip-checkpoint-threshold - Dimension threshold (default: 8192)
  • --quip-checkpoint-segments - Number of segments (default: 4, higher = more memory savings but slower)

No Memory Limits Mode

Disable all memory safety checks for maximum performance (use with caution):

# Maximum speed, no safety checks (requires 24GB+ VRAM)
convert_to_quant -i model.safetensors --comfy_quant --no-memory-limits

Warning: This disables ALL memory protection:

  • Pre-emptive memory checking (OOMGuard)
  • Adaptive threshold adjustments
  • Automatic CPU fallback when VRAM is low
  • OOM recovery and learning from OOM events

Only use when:

  • You have abundant VRAM (24GB+) where OOM is unlikely
  • Performance is critical and CPU fallback is unacceptable
  • Debugging to isolate OOM handling slowdowns

Quality Reporting & Calibration

# INT8 with SmoothQuant, GPTQ, and internal calibration
convert_to_quant -i model.safetensors --smoothquant --optimizer gptq --report-quality --comfy_quant

# With LoRA-informed calibration for best results
convert_to_quant -i model.safetensors --optimizer quip --calibration-lora my_lora.safetensors --comfy_quant
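
For reference, the reported metrics follow the standard definitions; a minimal sketch of per-tensor MSE and SQNR (in dB) between original and dequantized weights, not necessarily the exact reporting code:

import torch

def quant_quality(original, dequantized):
    err = original - dequantized
    mse = err.pow(2).mean()
    sqnr_db = 10.0 * torch.log10(original.pow(2).mean() / mse.clamp(min=1e-20))
    return mse.item(), sqnr_db.item()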

LoRA Merging

Merge LoRA weights directly into the base model before quantization for a single unified quantized file:

# Merge single LoRA
convert_to_quant -i model.safetensors --merge-lora my_lora.safetensors --comfy_quant

# Merge multiple LoRAs with automatic dampening
convert_to_quant -i model.safetensors --merge-loras lora1.safetensors lora2.safetensors --comfy_quant

# Adjust merge scale (default: 1.0)
convert_to_quant -i model.safetensors --merge-lora my_lora.safetensors --merge-lora-scale 0.8 --comfy_quant

Benefits:

  • Single file deployment - No separate LoRA loading at inference time
  • Faster inference - No runtime LoRA computation overhead
  • Better quantization quality - Optimizers can work with the merged weights
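
Conceptually, merging folds each LoRA delta into the base weight before quantization. A minimal sketch of the usual formulation, assuming the standard up/down factorization with alpha/rank scaling (the package's exact key handling may differ):

import torch

def merge_lora_into_weight(weight, lora_up, lora_down, alpha, scale=1.0):
    # W' = W + scale * (alpha / rank) * (up @ down)
    rank = lora_down.shape[0]
    delta = (lora_up.float() @ lora_down.float()) * (alpha / rank) * scale
    return (weight.float() + delta).to(weight.dtype)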

LoRA Compatibility

For the best results when using LoRAs with quantized models:

  • Use QuIP without SmoothQuant: QuIP alone gives the best LoRA compatibility. It preserves the highest weight fidelity and avoids the activation-to-weight transformations that SmoothQuant applies, which matters for LoRAs trained on the original base model.
  • LoRA-Informed Calibration: If you have a specific LoRA you want to optimize for, use the --calibration-lora flag. This uses the LoRA's weight directions to inform the quantization process for that specific LoRA.

Requirements

  • Python 3.9+
  • PyTorch 2.1+ (with CUDA for GPU acceleration)
  • safetensors >= 0.4.2
  • tqdm
  • (Optional) triton >= 2.1.0 for INT8 kernels

Acknowledgements

Original Project (Pre-Fork)

Current Project

  • silveroxides – For ongoing support and providing the main code for this project.
  • dxqb – For providing the axis-wise implementation (originally from OneTrainer PR #1034).

License

MIT License
