Forensic PDF analysis and CTF toolkit for security researchers, forensic analysts, and penetration testers.
pip install pdfscalpelpip install pdfscalpel[full]git clone https://github.com/ridpath/pdfscalpel
cd pdfscalpel
pip install -e .[full,dev]For enhanced functionality, install external tools:
Windows:
# Install via package manager or manually download
# - Ghostscript: https://ghostscript.com/download/gsdnld.html
# - QPDF: http://qpdf.sourceforge.net/
# - John the Ripper: https://www.openwall.com/john/Linux/WSL:
sudo apt install ghostscript qpdf john hashcat tesseract-ocr imagemagickCheck dependencies:
pdfscalpel check-depsEnable tab completion for faster command usage:
Bash:
pdfscalpel completion bash > ~/.pdfscalpel-complete.bash
echo "source ~/.pdfscalpel-complete.bash" >> ~/.bashrc
source ~/.bashrcZsh:
pdfscalpel completion zsh > ~/.pdfscalpel-complete.zsh
echo "source ~/.pdfscalpel-complete.zsh" >> ~/.zshrc
source ~/.zshrcFish:
pdfscalpel completion fish > ~/.config/fish/completions/pdfscalpel.fishQuick reference for common commands:
pdfscalpel commands # Show all common commands by category
pdfscalpel commands --search malware # Search for specific commandsList all available commands:
pdfscalpel list-commands # List all commands
pdfscalpel list-commands --group analyze # Show only analyze commands
pdfscalpel list-commands --format json # JSON output for scriptingpdfscalpel analyze structure document.pdf
pdfscalpel analyze metadata document.pdf
pdfscalpel analyze encryption encrypted.pdf --check-exploits# Malware and exploit detection
pdfscalpel analyze malware suspicious.pdf
pdfscalpel analyze malware file.pdf --yara-rules ./rules --output report.json
# Digital signature validation
pdfscalpel analyze signatures signed.pdf
pdfscalpel analyze signatures file.pdf --output report.json --format json
# Form security analysis
pdfscalpel analyze form-security form.pdf
pdfscalpel analyze form-security file.pdf --output report.json
# Anti-forensics detection
pdfscalpel analyze anti-forensics document.pdf
pdfscalpel analyze anti-forensics sanitized.pdf --format json
# Advanced steganography detection
pdfscalpel analyze advanced-stego file.pdf
pdfscalpel analyze advanced-stego suspicious.pdf --deep
# PDF repair and damage assessment
pdfscalpel solve repair corrupted.pdf --assess-only
pdfscalpel solve repair broken.pdf --output fixed.pdf --report report.jsonpdfscalpel extract text document.pdf -o output.txt
pdfscalpel extract images document.pdf -o images/
pdfscalpel extract javascript malicious.pdf -o scripts/
pdfscalpel extract revisions modified.pdf -o revisions/Download paginated content from web APIs and compile into PDFs. Perfect for documentation archives, writeup systems, and API-based content extraction.
# Simple extraction with URL and page range
pdfscalpel extract web --url "https://api.example.com/page" --pages "1-20" -o output.pdf
# Auto-discover available pages
pdfscalpel extract web --url "https://api.example.com/page" --auto-discover -o output.pdf
# Advanced configuration file
pdfscalpel extract web --config web-scrape-example.tomlConfiguration File Example (web-scrape-example.toml):
[web_extraction]
base_url = "https://api.example.com/infoiwant"
pages = "1-50"
title = "Documentation Archive"
auto_discover = true
cookies_from_browser = "firefox"
[retry]
max_retries = 5
exponential_backoff = true
[rate_limit]
base_delay_ms = 2000
jitter_ms = 1000Features:
- Auto-discovery: Automatically find available pages
- Browser cookies: Load authentication cookies from Firefox/Chrome/Edge
- Retry logic: Automatic retry with exponential backoff
- Rate limiting: Configurable delays with jitter to avoid rate limits
- Progress tracking: Real-time download progress and statistics
- Resume support: Resume from cached downloads (planned)
# CTF mode (requires challenge ID for audit trail)
pdfscalpel solve password encrypted.pdf --ctf-mode --challenge-id ctf-2024-001
# Dictionary attack
pdfscalpel solve password encrypted.pdf --wordlist rockyou.txt --ctf-mode --challenge-id test
# Brute force
pdfscalpel solve password encrypted.pdf --brute --length 6 --ctf-mode --challenge-id test# Detect watermark
pdfscalpel analyze watermark watermarked.pdf
# Remove watermark (auto-detect method)
pdfscalpel mutate watermark watermarked.pdf clean.pdf --remove auto
# Try all removal methods
pdfscalpel mutate watermark watermarked.pdf clean.pdf --remove-all# Hunt for flags with built-in patterns
pdfscalpel solve flag-hunt challenge.pdf --patterns ctf,flag,md5
# Custom pattern
pdfscalpel solve flag-hunt challenge.pdf --custom-pattern "FLAG\{[a-f0-9]+\}"
# Search all layers (text, metadata, streams, revisions)
pdfscalpel solve flag-hunt challenge.pdf --deep --report flag_report.txtpdfscalpel solve auto challenge.pdf --ctf-mode --challenge-id ctf-2024-pdf --report solution.txt# Password-protected challenge
pdfscalpel generate challenge output.pdf --flag "CTF{test_flag}" --difficulty medium --type password
# Steganography challenge
pdfscalpel generate challenge output.pdf --flag "CTF{hidden}" --type stego --difficulty hard
# Multi-stage challenge
pdfscalpel generate challenge output.pdf --flag "CTF{final}" --type multi --stages 3# Generate DOT graph
pdfscalpel analyze graph document.pdf -o graph.dot
# Generate PNG (requires graphviz installed)
pdfscalpel analyze graph document.pdf -o graph.png --format png
# With entropy analysis
pdfscalpel analyze entropy document.pdf --heatmap -o entropy.png# Merge PDFs
pdfscalpel mutate merge file1.pdf file2.pdf -o merged.pdf
# Extract page range
pdfscalpel mutate extract-pages document.pdf -o output.pdf --pages 1-5,10,15-20
# Delete specific pages
pdfscalpel mutate delete-pages document.pdf -o output.pdf --pages 3,7-9
# Rotate pages
pdfscalpel mutate rotate-pages document.pdf -o output.pdf --pages 1-3 --rotation 90
# Add password
pdfscalpel mutate encrypt input.pdf output.pdf --password secret123 --algorithm aes256
# Remove password (requires current password)
pdfscalpel mutate decrypt encrypted.pdf output.pdf --password secret123
# Redact text patterns
pdfscalpel mutate redact document.pdf output.pdf --pattern "\d{3}-\d{2}-\d{4}"# Make scanned PDF searchable (basic mode - skips pages with existing text)
pdfscalpel mutate ocr scanned_document.pdf searchable.pdf
# Force OCR on all pages (image-based PDFs)
pdfscalpel mutate ocr scanned_document.pdf searchable.pdf --force-ocr
# Multi-language OCR
pdfscalpel mutate ocr document.pdf output.pdf --lang eng+spa --force-ocr
# Check available languages
pdfscalpel mutate ocr --list-languages
# Advanced options
pdfscalpel mutate ocr input.pdf output.pdf \
--force-ocr \
--lang eng \
--jobs 8 \
--no-deskew \
--output-type pdfa| Command | Description |
|---|---|
structure |
Analyze PDF structure, detect anomalies |
metadata |
Extract metadata (Info dict, XMP) |
encryption |
Analyze encryption parameters, assess crackability |
malware |
Detect malware, exploits, and malicious JavaScript (20+ CVEs) |
signatures |
Validate digital signatures and detect forgery attacks |
form-security |
Analyze PDF forms for XXE, SSRF, and injection vulnerabilities |
anti-forensics |
Detect sanitization tools and anti-forensic manipulation |
advanced-stego |
Detect advanced steganography beyond LSB |
watermark |
Detect and classify watermarks |
graph |
Generate object graph visualization |
entropy |
Entropy analysis for malware/stego detection |
intelligence |
Generate intelligence report with recommendations |
compliance |
Check PDF/A, PDF/X, PDF/E, PDF/UA compliance |
render-diff |
Analyze reader-specific rendering differences |
| Command | Description |
|---|---|
text |
Extract text with layout preservation |
images |
Extract all embedded images |
javascript |
Extract and deobfuscate JavaScript |
attachments |
Extract embedded files |
forms |
Extract AcroForm/XFA form data |
streams |
Extract and decompress object streams |
objects |
Dump specific PDF objects by ID |
hidden |
Find invisible/hidden content |
revisions |
Extract PDF revision history |
web |
Extract paginated web content to PDF |
| Command | Description |
|---|---|
merge |
Merge multiple PDFs into one |
extract-pages |
Extract specific pages from PDF |
delete-pages |
Delete specific pages from PDF |
rotate-pages |
Rotate pages (90, 180, 270 degrees) |
watermark |
Add or remove watermarks |
encrypt |
Add password protection and encryption |
decrypt |
Remove password protection and encryption |
add-bookmarks |
Add bookmarks/table of contents |
remove-bookmarks |
Remove all bookmarks |
redact |
Redact text matching patterns |
ocr |
Add searchable text layer via OCR (requires tesseract) |
optimize |
Compress, remove unused objects, linearize |
| Command | Description |
|---|---|
password |
Crack PDF passwords (requires --ctf-mode) |
flag-hunt |
Hunt for flags across all PDF layers |
stego |
Detect and extract steganography |
auto |
Automatically analyze and solve challenges |
repair |
Assess PDF damage and attempt repair/recovery |
| Command | Description |
|---|---|
challenge |
Generate CTF challenges |
corrupted |
Generate intentionally broken PDFs |
polyglot |
Create PDF polyglots (PDF+ZIP, PDF+HTML) |
stego-data |
Generate steganographic test data |
watermark-samples |
Generate watermark test samples |
| Command | Description |
|---|---|
check-deps |
Check all dependencies and show installation instructions |
commands |
Quick reference for common commands (searchable) |
list-commands |
List all available commands (filterable by group) |
completion |
Generate shell completion scripts (bash/zsh/fish) |
plugin |
Plugin management (list, enable, disable) |
perf |
Performance profiling and benchmarking |
CTF mode enforces ethical use with audit trails:
pdfscalpel solve password encrypted.pdf \
--ctf-mode \
--challenge-id "defcon-2024-pdf-01" \
--output-audit audit.jsonCTF Mode Requirements:
- Challenge ID must be provided
- All operations are logged
- Generates signed provenance file
- Prevents destructive operations without audit
Audit Log Format:
{
"mode": "ctf",
"challenge_id": "defcon-2024-pdf-01",
"timestamp": "2024-01-27T12:00:00Z",
"operations": ["password_crack", "flag_hunt"],
"results": {...},
"hash": "sha256:..."
}Create pdfscalpel.toml in your project directory or ~/.pdfscalpel.toml:
# PDFScalpel Configuration
[ocr]
enabled = true
language = "eng"
jobs = 4
deskew = true
[watermark]
font_size = 72
opacity = 0.3
rotation = 45
[password]
wordlists = ["rockyou.txt", "common-passwords.txt"]
max_brute_length = 6
timeout = 3600 # seconds
[plugins]
enabled = true
directories = ["plugins", "~/.pdfscalpel/plugins"]Load configuration:
pdfscalpel --config pdfscalpel.toml analyze structure document.pdfCreate custom plugins by extending base classes:
# plugins/my_analyzer.py
from pdfscalpel.plugins.base import AnalyzerPlugin
class MyAnalyzer(AnalyzerPlugin):
name = "my-analyzer"
description = "Custom PDF analysis"
version = "1.0.0"
def analyze(self, pdf_path, **kwargs):
# Your analysis logic
return {"status": "analyzed"}Register plugin:
pdfscalpel plugin list
pdfscalpel plugin enable my-analyzer# 1. Initial triage
pdfscalpel analyze structure suspicious.pdf
pdfscalpel analyze metadata suspicious.pdf
# 2. Malware analysis
pdfscalpel analyze malware suspicious.pdf --output malware_report.json
pdfscalpel extract javascript suspicious.pdf -o js/
# 3. Signature validation (if signed)
pdfscalpel analyze signatures suspicious.pdf
# 4. Deep analysis
pdfscalpel analyze entropy suspicious.pdf --output entropy.png
pdfscalpel analyze anti-forensics suspicious.pdf
pdfscalpel analyze advanced-stego suspicious.pdf --deep
# 5. Intelligence report
pdfscalpel analyze intelligence suspicious.pdf --report report.txt
# 6. Extract timeline
pdfscalpel extract revisions suspicious.pdf -o revisions/# 1. Auto-solve attempt
pdfscalpel solve auto challenge.pdf --ctf-mode --challenge-id ctf-001 --report solution.txt
# 2. Manual solving
pdfscalpel solve password challenge.pdf --ctf-mode --challenge-id ctf-001
pdfscalpel solve flag-hunt challenge.pdf --patterns ctf
pdfscalpel solve stego challenge.pdf -o extracted/# 1. Analyze watermark
pdfscalpel analyze watermark document.pdf --verbose
# 2. Remove using recommended method
pdfscalpel mutate watermark document.pdf clean.pdf --remove content-stream
# 3. Try all methods if first fails
pdfscalpel mutate watermark document.pdf clean.pdf --remove-allDigital Forensics:
- Malware detection (JavaScript exploits, CVE fingerprinting, 20+ exploit patterns)
- Digital signature validation and forgery detection
- Document authenticity verification
- Timeline reconstruction from revisions
- Metadata analysis and tool fingerprinting
- Anti-forensics detection (sanitization tool identification)
CTF Competitions:
- Password cracking (RC4/AES, all key lengths)
- Advanced steganography detection (stream operators, object ordering, whitespace)
- Flag hunting across all PDF layers
- Challenge creation and testing
- PDF damage assessment and repair
Penetration Testing:
- Form exploitation (XFA XXE, SSRF, JavaScript injection - CVE-2025-54988)
- Encryption weakness detection
- Reader-specific exploit analysis
- Polyglot file detection
Security Research:
- PDF standard compliance testing
- Rendering difference analysis
- Sanitization tool fingerprinting (ExifTool, MAT2, QPDF, Ghostscript)
- Attack indicator detection (USF, SWA, ISA signature attacks)
Benchmarks (Windows 11, i7-12700K):
- Structure analysis: 1000-page PDF in <3 seconds
- Password cracking: RC4-40 at 50,000+ passwords/sec
- Object graph generation: 5000 objects in <2 seconds
- Image extraction: 100 images in <1 second
WSL/Linux Performance:
- GPU-accelerated cracking with Hashcat: 100x+ improvement
- Parallel processing: 4-8x speedup on multi-core CPUs
- External tool integration: QPDF, John, Ghostscript
Dependencies missing:
pdfscalpel check-depsEncrypted PDF without password:
pdfscalpel analyze encryption document.pdf --check-exploits
# Review crackability assessment before attemptingWatermark won't remove:
# Try all methods and compare results
pdfscalpel mutate watermark input.pdf output.pdf --remove-all --compareLarge PDF performance:
# Use streaming mode for large files
pdfscalpel analyze structure huge.pdf --streaming- Fork repository
- Create feature branch
- Add tests for new features
- Run test suite:
pytest - Run linter:
ruff check pdfscalpel/ - Run type checker:
mypy pdfscalpel/ - Submit pull request
Authorized Use Cases:
- CTF competitions with
--ctf-mode - Authorized penetration testing engagements
- Digital forensics investigations
- Security research and education
- Defensive security analysis
Always obtain proper authorization before using password cracking or exploitation features.
MIT License - See LICENSE file for details
Built on: pikepdf, pdfplumber, PyPDF, QPDF, Ghostscript, John the Ripper, Hashcat
PDFScalpel Contributors