Convert safetensors weights to quantized formats (INT8, FP16) with learned rounding optimization for ComfyUI inference.
Warning
Experimental State: This project is a fork, currently in a rough state, and has not been extensively tested. It might not be actively maintained. Use with caution.
Important
PyTorch must be installed first with the correct CUDA version for your GPU. This package does not install PyTorch automatically to avoid conflicts with your existing setup.
Visit pytorch.org to get the correct install command for your system.
Examples:

```bash
# CUDA 12.8 (stable)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# CPU only (no GPU acceleration)
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

Install from source:

```bash
git clone https://github.com/silveroxides/convert_to_quant.git
cd convert_to_quant
pip install -e .
```

On Linux:

```bash
pip install -U triton
```

On Windows:

```bash
# for torch>=2.6
pip install -U "triton-windows<3.3"
```

Recommended: QuIP for near-lossless INT8 quantization (highest weight fidelity, best for LoRA compatibility)
Note: QuIP is optimized for transformer architectures (text encoders, diffusion transformers, etc.). For other model types, the default learned rounding optimizer may be more suitable.
```bash
convert_to_quant -i model.safetensors --optimizer quip --comfy_quant
```

Default format: QuIP uses tensor-wise scaling by default (`--quip-requant-scheme tensor`), which produces the standard `int8_tensorwise` format. This provides:
- Maximum compatibility with ComfyUI and standard inference pipelines
- Single global scale per weight matrix (scalar)
For potentially higher precision (at the cost of compatibility), use `--quip-requant-scheme block` to enable block-wise re-quantization.
QuIP with SmoothQuant for maximum accuracy (best for models with activation outliers, may reduce LoRA compatibility)
```bash
convert_to_quant -i model.safetensors --optimizer quip --smoothquant --comfy_quant
```

Basic INT8 quantization with ComfyUI metadata (default optimizer)

```bash
convert_to_quant -i model.safetensors --comfy_quant
```

Low VRAM / Memory-efficient mode

```bash
convert_to_quant -i model.safetensors --comfy_quant --low-memory
```

Streaming mode (memory-efficient with GPU acceleration)

```bash
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=balanced
```

Use `--streaming-mode=balanced` for faster quantization while still saving RAM. Unlike `--low-memory`, this keeps calculations on the GPU.

Aggressive streaming mode (faster with more VRAM)

```bash
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=aggressive
```

Use `--streaming-mode=aggressive` for 10-20% faster processing on GPUs with 12GB+ VRAM. It increases GPU computation thresholds by 2x while still avoiding OOM.

With custom learning rate (adaptive schedule by default)

```bash
convert_to_quant -i model.safetensors --comfy_quant --lr 0.01
```

Load the output .safetensors file in ComfyUI like any other model.
| Format | Flag | Hardware | Notes |
|---|---|---|---|
| INT8 (tensor-wise) | `--optimizer quip` (default) | Any GPU / CPU | Standard W8A8, maximum compatibility |
| INT8 (block-wise) | `--scaling-mode block` | Any GPU / CPU | Good balance of quality/speed |
| INT8 (axis-wise) | `--scaling-mode axis` | Any GPU / CPU | Per-row scaling |
| FP16 | `--fp16` | Any GPU / CPU | High precision fallback |
Recommended: Use QuIP with tensor-wise scaling (--optimizer quip, default) for best compatibility and near-lossless quality.
| Model | Flag | Notes |
|---|---|---|
| Flux.2 | `--flux2` | Keep modulation/guidance/time/final high-precision |
| LoRA | `--lora` | Skip alpha/scale, quantize lora_up/down |
| T5-XXL Text Encoder | `--t5xxl` | Decoder removed, skip norms/biases |
| Mistral Text Encoder | `--mistral` | Norms/biases excluded |
| Visual Encoder | `--visual` | MLP layers excluded |
| Hunyuan Video | `--hunyuan` | Attention norms and vision_in excluded |
| WAN Video | `--wan` | Embeddings, encoders, and head excluded |
| Qwen Image | `--qwen` | Image layers and added norms excluded |
| Z-Image | `--zimage` | cap_embedder/norms excluded |
| Z-Image Refiner | `--zimage_refiner` | Context/noise refiner high-precision |
| Chroma/Distilled (Large) | `--distillation_large` | Keep distilled_guidance, final, img/txt_in high-precision |
| Chroma/Distilled (Small) | `--distillation_small` | Keep only distilled_guidance high-precision |
| NeRF (Large) | `--nerf_large` | Keep nerf_blocks, distilled_guidance, txt_in high-precision |
| NeRF (Small) | `--nerf_small` | Keep nerf_blocks, distilled_guidance high-precision |
| Radiance | `--radiance` | Keep img_in_patch, nerf_final_layer high-precision |
- 📖 MANUAL.md - Complete usage guide with examples and troubleshooting
- 🔗 quantization.examples.md - ComfyUI integration patterns
- 📚 docs/API.md - Python API reference for developers
```
convert_to_quant/
├── convert_to_quant/            # Main package
│   ├── cli/                     # CLI entry point & argument parsing
│   ├── comfy/                   # ComfyUI integration components & kernels
│   ├── config/                  # Layer configuration & templates
│   ├── converters/              # Core quantization logic (INT8, GPTQ, SmoothQuant)
│   ├── utils/                   # Shared utilities (tensor, memory, metrics)
│   ├── constants.py             # Model Filter Registry & constants
│   ├── quantization.py          # Simplified INT8 entry point
│   └── convert_to_quant.py      # Backward-compatibility wrapper
├── pyproject.toml               # Package configuration
├── MANUAL.md                    # User documentation
└── ...
```
- Unified Safetensors Loader: Memory-efficient streaming loader with two modes:
- Standard mode: Preloads all tensors for maximum speed
- Low-memory mode: Streams tensors on-demand, loading only one tensor at a time into RAM
- QuIP (Quantization with Incoherence Processing): Near-lossless INT8 quantization using randomized Hadamard transforms to eliminate outliers and make weights more quantization-friendly (see the sketch after this list). Defaults to the standard `int8_tensorwise` format for maximum compatibility and inference performance.
- Learned Rounding: SVD-based optimization minimizes quantization error along the weight's principal directions
- GPTQ Optimizer: Sequential layer-wise optimization with error compensation
- SmoothQuant: Preprocessing to migrate quantization difficulty from activations to weights
- LoRA-Informed Calibration: Use existing LoRA tensors (`--calibration-lora`) to guide the quantization process for better compatibility
- Multiple Optimizers: QuIP, GPTQ, AdamW, RAdam, and the original adaptive LR optimizer
- Bias Correction: Automatic bias adjustment using synthetic calibration data
- Model-Specific Support: Exclusion lists for sensitive layers (norms, embeddings, distillation)
- Triton Kernels: GPU-accelerated quantization/dequantization with fallback to PyTorch
- Layer Config JSON: Fine-grained per-layer control with regex pattern matching
- LR Schedules: Adaptive, exponential, and plateau learning rate scheduling
- Quality Metrics: MSE and SQNR reporting for validation
- BF16 Compute Mode: Half-precision computation on Ampere+ GPUs for 2× memory savings
- Checkpointed Quantization: Extreme memory savings (75-90%) for large layers
- Streaming Modes: Configurable CPU/GPU offloading with auto-detection based on VRAM
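For intuition about the incoherence-processing step, here is a small, self-contained sketch of the idea: flip row signs at random and rotate with a Walsh-Hadamard transform so outliers get spread across the whole matrix before rounding. This is illustrative only and not the package's implementation (which also handles padding/blocking and the INT8 step itself).

```python
# Illustrative sketch of sign-randomized Hadamard "incoherence processing".
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal fast Walsh-Hadamard transform along dim 0 (power-of-two length)."""
    n = x.shape[0]
    out, h = x.clone(), 1
    while h < n:
        out = out.reshape(n // (2 * h), 2, h, *x.shape[1:])
        a, b = out[:, 0], out[:, 1]
        out = torch.stack((a + b, a - b), dim=1).reshape(n, *x.shape[1:])
        h *= 2
    return out / n**0.5  # orthonormal scaling, so applying fwht twice is the identity

def incoherence_transform(w: torch.Tensor, seed: int = 0):
    """Random sign flip per row, then Hadamard rotation; returns (transformed, signs)."""
    g = torch.Generator().manual_seed(seed)
    signs = torch.randint(0, 2, (w.shape[0],), generator=g).to(w.dtype) * 2 - 1
    return fwht(w * signs[:, None]), signs

# Round-trip check: undo the rotation, then the sign flips.
w = torch.randn(256, 128)
w_t, s = incoherence_transform(w)
w_back = fwht(w_t) * s[:, None]
assert torch.allclose(w, w_back, atol=1e-4)
```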
The following performance optimizations are implemented:
1. Triton Kernels (convert_to_quant/comfy/int8_kernels.py)
GPU-accelerated INT8 quantization/dequantization with autotuning for optimal block sizes.
- `int8_gemm`, `int8_addmm` - Optimized matrix multiplication
- `act_quant`, `weight_quant`, `act_dequant`, `weight_dequant` - Fast quantization ops
- Automatic fallback to PyTorch when Triton is unavailable
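Conceptually, the pure-PyTorch fallback path for tensor-wise INT8 amounts to absmax scaling. The snippet below is an illustrative stand-in, not the actual kernel code or its signatures.

```python
# Illustrative PyTorch fallback for tensor-wise INT8 quantize/dequantize.
import torch

def weight_quant_tensorwise(w: torch.Tensor):
    """One global scale per weight matrix: q = round(w / scale), scale = absmax / 127."""
    scale = w.abs().amax().clamp(min=1e-12) / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def weight_dequant_tensorwise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original weights: w ≈ q * scale."""
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = weight_quant_tensorwise(w)
err = (weight_dequant_tensorwise(q, s) - w).pow(2).mean()
print(f"tensor-wise INT8 MSE: {err.item():.3e}")
```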
2. QuIP Triton Matmul (convert_to_quant/comfy/int8_kernels.py)
Specialized matrix multiplication for QuIP-quantized weights with Hadamard transform support.
- `quip_int8_matmul()` - Optimized for QuIP's transformed weight format
- Supports sign vectors (s_u, s_v) and inverse transforms
3. Tensor Buffer Pool (convert_to_quant/converters/quip_int8.py)
Efficient buffer reuse during quantization to reduce memory allocations.
- `TensorBufferPool` class with LRU eviction
- Configurable `max_buffers` limit
- Reduces GC pressure during iterative optimization
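The reuse pattern behind such a pool looks roughly like this (simplified sketch, not the package's actual class):

```python
# Illustrative LRU-keyed buffer pool: reuse tensors of the same shape/dtype/device.
from collections import OrderedDict
import torch

class SimpleBufferPool:
    def __init__(self, max_buffers: int = 8):
        self.max_buffers = max_buffers
        self._buffers: OrderedDict = OrderedDict()

    def get(self, shape, dtype=torch.float32, device="cpu") -> torch.Tensor:
        key = (tuple(shape), dtype, str(device))
        if key in self._buffers:
            self._buffers.move_to_end(key)          # mark as most recently used
            return self._buffers[key]
        buf = torch.empty(shape, dtype=dtype, device=device)
        self._buffers[key] = buf
        if len(self._buffers) > self.max_buffers:   # evict the least recently used buffer
            self._buffers.popitem(last=False)
        return buf
```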
4. CUDA Graph Support (convert_to_quant/converters/base_converter.py)
Capture and replay CUDA graphs to eliminate CPU launch overhead.
- `CudaGraphRunner` class for repeatable kernel sequences
- Warmup iterations before capture
- Ideal for batched inference scenarios
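The capture/replay pattern follows the standard `torch.cuda.CUDAGraph` recipe: warm up on a side stream, capture once, then replay on fixed memory addresses. A minimal sketch (not the package's `CudaGraphRunner` itself):

```python
# Illustrative CUDA graph capture/replay helper.
import torch

def graph_capture(fn, static_input: torch.Tensor, warmup: int = 3):
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(warmup):                 # warmup iterations before capture
            fn(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = fn(static_input)        # recorded into the graph

    def run(new_input: torch.Tensor) -> torch.Tensor:
        static_input.copy_(new_input)           # graphs replay on fixed addresses
        graph.replay()
        return static_output
    return run
```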
5. Parallel Processing (convert_to_quant/quantization.py)
Thread pool for I/O-bound layer-wise operations.
- `ParallelProcessor` with `ThreadPoolExecutor`
- Configurable `max_workers`
- Falls back to sequential for single items
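A minimal sketch of the pattern (illustrative, not the package's `ParallelProcessor`):

```python
# Thread pool for I/O-bound per-layer work, with a sequential fallback.
from concurrent.futures import ThreadPoolExecutor

def process_layers(layer_names, work_fn, max_workers: int = 4):
    if len(layer_names) <= 1:                       # sequential fallback for single items
        return [work_fn(name) for name in layer_names]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work_fn, layer_names))
```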
6. Lazy Logging (convert_to_quant/utils/logging.py)
Defer expensive string formatting until messages are actually logged.
- `LazyString` - On-demand string evaluation
- `LazyFormat` - Dynamic value evaluation
- Reduces overhead when log level filters messages
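The deferral trick can be sketched in a few lines (illustrative, not the actual `LazyString`/`LazyFormat` classes):

```python
# Defer expensive formatting until the logger actually emits the record.
import logging

class LazyStr:
    def __init__(self, fn):
        self._fn = fn              # callable that builds the message

    def __str__(self):
        return self._fn()          # evaluated only when the record is formatted

log = logging.getLogger("quant")
# The norm below is never computed unless DEBUG logging is enabled:
# log.debug("weight norm: %s", LazyStr(lambda: f"{tensor.norm():.4f}"))
```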
7. Memory-Mapped Loading (convert_to_quant/utils/memory_efficient_loader.py)
Zero-copy tensor loading via memory mapping.
- `MemoryMappedTensorLoader` - Direct file mapping
- `UnifiedSafetensorsLoader` - Unified interface with optional mmap
- ~50% memory reduction for large models
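The underlying pattern is lazy, per-tensor access via safetensors' `safe_open`; a minimal sketch (the package's loader classes wrap this with their own interface):

```python
# Iterate tensors one at a time instead of loading the whole checkpoint up front.
from safetensors import safe_open

def iter_tensors(path: str):
    # safe_open gives lazy access, so each tensor is materialized only when
    # get_tensor() is called for it.
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in f.keys():
            yield name, f.get_tensor(name)
```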
8. BF16 Compute Mode (convert_to_quant/constants.py)
Half-precision computation on Ampere+ GPUs (RTX 30 series, A100, etc.).
- 2× memory savings for large tensor operations
- Automatic detection of BF16 support
- Per-operation threshold configuration
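A sketch of what the auto mode's detection logic implies: use BF16 only on GPUs that support it natively and only for tensors above a size threshold (the threshold value here is illustrative, not the package default).

```python
# Illustrative BF16 compute-mode selection.
import torch

def pick_compute_dtype(numel: int, threshold: int = 1_000_000) -> torch.dtype:
    if (
        torch.cuda.is_available()
        and torch.cuda.is_bf16_supported()   # Ampere (RTX 30 series / A100) and newer
        and numel >= threshold
    ):
        return torch.bfloat16                # roughly halves memory vs. FP32 intermediates
    return torch.float32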
Define per-layer quantization settings with regex patterns:
```bash
# Generate a template from your model
convert_to_quant -i model.safetensors --dry-run create-template

# Apply custom layer config
convert_to_quant -i model.safetensors --layer-config layers.json --comfy_quant
```

For optimizers other than QuIP:
```bash
# Block-wise scaling (default for learned rounding/GPTQ)
convert_to_quant -i model.safetensors --scaling-mode block --block_size 128 --comfy_quant

# Axis-wise (per-row) scaling
convert_to_quant -i model.safetensors --scaling-mode axis --comfy_quant

# Tensor-wise scaling
convert_to_quant -i model.safetensors --scaling-mode tensor --comfy_quant
```

Note: For the QuIP optimizer, use `--quip-requant-scheme {tensor,block}` instead (see QuIP Storage Options above).
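To make the difference between the modes concrete, here is an illustrative sketch of the scale granularity each mode implies for a 2D weight (the package's exact block layout may differ):

```python
# Scale shapes for tensor-wise, axis-wise (per-row), and block-wise scaling.
import torch

def scales(w: torch.Tensor, block_size: int = 128):
    tensor_scale = w.abs().amax() / 127.0                   # one scalar for the whole matrix
    axis_scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # one scale per output row
    blocks = w.reshape(w.shape[0], -1, block_size)          # assumes in_features % block_size == 0
    block_scale = blocks.abs().amax(dim=-1) / 127.0         # one scale per (row, block)
    return tensor_scale, axis_scale, block_scale

t, a, b = scales(torch.randn(4096, 4096))
print(t.shape, a.shape, b.shape)   # torch.Size([]) torch.Size([4096, 1]) torch.Size([4096, 32])
```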
By default, QuIP stores weights in original space with a single global scale (tensor-wise) for maximum compatibility:
```bash
# Default: Standard int8_tensorwise format with tensor-wise scaling (recommended)
convert_to_quant -i model.safetensors --optimizer quip --comfy_quant
```

Benefits of tensor-wise (default):
- Maximum compatibility with Z-Image, ComfyUI, and standard inference pipelines
- Single scalar scale per weight matrix
- No special loader requirements
- Uses optimized hardware INT8 kernels
Alternative: Block-wise re-quantization
For potentially higher precision (at the cost of producing the `int8_blockwise` format):
```bash
# Block-wise re-quantization (may improve precision, less compatible)
convert_to_quant -i model.safetensors --optimizer quip --comfy_quant --quip-requant-scheme block
```

Comparison:
| Scheme | Format | Scale Type | Compatibility | Use Case |
|---|---|---|---|---|
| `tensor` (default) | `int8_tensorwise` | Scalar (1 value) | Maximum | Recommended for all standard inference |
| `block` | `int8_blockwise` | 3D blocks | Limited | Potentially higher precision |
Experimental: Transformed space storage
For advanced use cases with custom loaders:
```bash
# Store in transformed space (requires custom loader with Hadamard support)
convert_to_quant -i model.safetensors --optimizer quip --comfy_quant --quip-store-transformed
```

Note: Not compatible with standard inference pipelines.
The UnifiedSafetensorsLoader provides a unified interface for loading safetensors files with optional streaming support:
```bash
# Low-memory streaming mode - loads tensors on-demand
convert_to_quant -i model.safetensors --comfy_quant --low-memory

# Standard mode (default) - preloads all tensors for faster processing
convert_to_quant -i model.safetensors --comfy_quant
```

When to use low-memory mode:
- Quantizing very large models that don't fit in system RAM
- Running on machines with limited memory
- Processing models alongside other memory-intensive applications
Mode comparison:
| Mode | Memory Usage | Speed | Use Case |
|---|---|---|---|
| Standard | ~2x model size | Fast | Recommended for most users |
| Low-memory | ~1x model size + 1 tensor | Slower | Limited RAM environments |
The quantizer provides several streaming modes for memory-efficient processing:
```bash
# Auto-detect based on GPU VRAM (recommended)
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=auto

# Balanced approach (default)
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=balanced

# Aggressive CPU offloading (maximum memory safety)
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=aggressive

# Minimal offloading (for 12-16GB VRAM)
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=minimal

# Disable streaming (requires 24GB+ VRAM)
convert_to_quant -i model.safetensors --comfy_quant --streaming-mode=off
```

Streaming Mode Comparison:
| Mode | Hadamard Threshold | Behavior | Use Case |
|---|---|---|---|
| `off` | ∞ (infinity) | Never offload to CPU | Workstations with 24GB+ VRAM |
| `minimal` | 100M elements (~400MB) | Conservative offloading | 12-16GB VRAM |
| `balanced` | 50M elements (~200MB) | Moderate offloading | 8-12GB VRAM |
| `aggressive` | 25M elements (~100MB) | Aggressive offloading | <8GB VRAM or maximum safety |
| `auto` | Adaptive | Detects based on VRAM | Recommended default |
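As a rough sketch, auto-detection of this kind maps available VRAM onto the modes above (the breakpoints mirror the table; the package's actual heuristic may differ):

```python
# Illustrative --streaming-mode=auto style selection based on total VRAM.
import torch

def auto_streaming_mode() -> str:
    if not torch.cuda.is_available():
        return "aggressive"                       # CPU-only: offload everything
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 24:
        return "off"
    if vram_gb >= 12:
        return "minimal"
    if vram_gb >= 8:
        return "balanced"
    return "aggressive"
```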
Enable BF16 compute on Ampere+ GPUs (RTX 30 series, A100, etc.) for 2× memory savings:
```bash
# Auto-enable for large tensors (default)
convert_to_quant -i model.safetensors --comfy_quant --bf16-compute=auto

# Force BF16 for all supported operations
convert_to_quant -i model.safetensors --comfy_quant --bf16-compute=on

# Disable BF16, use FP32 only
convert_to_quant -i model.safetensors --comfy_quant --bf16-compute=off
```

BF16 with Custom Thresholds:
```bash
# Adjust tensor size thresholds for BF16 (in elements)
convert_to_quant -i model.safetensors --comfy_quant \
    --bf16-compute=auto \
    --bf16-threshold 1000000 \
    --bf16-hadamard-threshold 500000 \
    --bf16-hessian-threshold 1000000
```

Recommended for:
- 8GB GPUs: `--streaming-mode=aggressive --bf16-compute=on`
- 12GB GPUs: `--streaming-mode=balanced --bf16-compute=on`
- 16GB+ GPUs: `--streaming-mode=auto --bf16-compute=on`
For extreme memory savings on large layers (75-90% reduction):
```bash
# Enable checkpointed quantization with default settings
convert_to_quant -i model.safetensors --comfy_quant --optimizer quip --quip-checkpointed

# Custom threshold and segments
convert_to_quant -i model.safetensors --comfy_quant --optimizer quip \
    --quip-checkpointed \
    --quip-checkpoint-threshold 8192 \
    --quip-checkpoint-segments 4
```

Options:
- `--quip-checkpointed` - Enable checkpointed LDLQ quantization
- `--quip-checkpoint-threshold` - Dimension threshold (default: 8192)
- `--quip-checkpoint-segments` - Number of segments (default: 4, higher = more memory savings but slower)
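The memory saving comes from processing a large weight in segments rather than all at once, so only one segment's worth of intermediate buffers is alive at a time. A heavily simplified sketch of that idea follows; the real checkpointed LDLQ path is considerably more involved than this.

```python
# Illustrative segment-by-segment processing to bound peak memory.
import torch

def quantize_in_segments(w: torch.Tensor, segments: int = 4):
    q_parts, scale_parts = [], []
    for chunk in torch.chunk(w, segments, dim=1):   # split along the input dimension
        scale = chunk.abs().amax().clamp(min=1e-12) / 127.0
        q_parts.append(torch.round(chunk / scale).clamp(-127, 127).to(torch.int8))
        scale_parts.append(scale)
        # intermediates for this chunk can be freed before the next chunk is built
    return torch.cat(q_parts, dim=1), torch.stack(scale_parts)
```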
Disable all memory safety checks for maximum performance (use with caution):
```bash
# Maximum speed, no safety checks (requires 24GB+ VRAM)
convert_to_quant -i model.safetensors --comfy_quant --no-memory-limits
```

Warning: This disables ALL memory protection:
- Pre-emptive memory checking (OOMGuard)
- Adaptive threshold adjustments
- Automatic CPU fallback when VRAM is low
- OOM recovery and learning from OOM events
Only use when:
- You have abundant VRAM (24GB+) where OOM is unlikely
- Performance is critical and CPU fallback is unacceptable
- Debugging to isolate OOM handling slowdowns
```bash
# INT8 with SmoothQuant, GPTQ, and internal calibration
convert_to_quant -i model.safetensors --smoothquant --optimizer gptq --report-quality --comfy_quant

# With LoRA-informed calibration for best results
convert_to_quant -i model.safetensors --optimizer quip --calibration-lora my_lora.safetensors --comfy_quant
```

Merge LoRA weights directly into the base model before quantization to produce a single unified quantized file (a sketch of the merge arithmetic follows the benefits list below):
```bash
# Merge single LoRA
convert_to_quant -i model.safetensors --merge-lora my_lora.safetensors --comfy_quant

# Merge multiple LoRAs with automatic dampening
convert_to_quant -i model.safetensors --merge-loras lora1.safetensors lora2.safetensors --comfy_quant

# Adjust merge scale (default: 1.0)
convert_to_quant -i model.safetensors --merge-lora my_lora.safetensors --merge-lora-scale 0.8 --comfy_quant
```

Benefits:
- Single file deployment - No separate LoRA loading at inference time
- Faster inference - No runtime LoRA computation overhead
- Better quantization quality - Optimizers can work with the merged weights
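The merge arithmetic itself is the usual low-rank folding, W' = W + merge_scale · (alpha / rank) · (lora_up @ lora_down). Key naming and the alpha convention vary between LoRA formats, so treat the sketch below as illustrative rather than the package's exact merge routine.

```python
# Illustrative LoRA merge: fold the low-rank update back into the base weight.
import torch

def merge_lora(base: torch.Tensor, lora_up: torch.Tensor, lora_down: torch.Tensor,
               alpha: float, merge_scale: float = 1.0) -> torch.Tensor:
    rank = lora_down.shape[0]                       # lora_down: (rank, in_features)
    delta = (lora_up @ lora_down) * (alpha / rank)  # lora_up: (out_features, rank)
    return base + merge_scale * delta.to(base.dtype)
```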
For the best results when using LoRAs with quantized models:
- Use QuIP without SmoothQuant: Non-SmoothQuant QuIP runs provide the best LoRA compatibility. The QuIP optimizer delivers the highest weight fidelity without the activation-to-weight transformations that SmoothQuant applies, which is crucial for maintaining compatibility with LoRAs trained on the original base model.
- LoRA-Informed Calibration: If you have a specific LoRA you want to optimize for, use the `--calibration-lora` flag. This uses the LoRA's weight directions to inform the quantization process for that specific LoRA.
- Python 3.9+
- PyTorch 2.1+ (with CUDA for GPU acceleration)
- safetensors >= 0.4.2
- tqdm
- (Optional) triton >= 2.1.0 for INT8 kernels
- Clybius – For inspiring the project and the Learned-Rounding repository.
- lyogavin – For ComfyUI PR #10864 adding `int8_blockwise` format support and int8 kernels.
- silveroxides – For ongoing support and providing the main code for this project.
- dxqb – For providing the axis-wise implementation (originally from OneTrainer PR #1034).
MIT License