PDFScalpel

Forensic PDF analysis and CTF toolkit for security researchers, forensic analysts, and penetration testers.

Installation

Basic Installation

pip install pdfscalpel

Full Installation (with all features)

pip install pdfscalpel[full]

Development Installation

git clone https://github.com/ridpath/pdfscalpel
cd pdfscalpel
pip install -e .[full,dev]

External Tools (Optional)

For enhanced functionality, install external tools:

Windows:

# Install via package manager or manually download
# - Ghostscript: https://ghostscript.com/download/gsdnld.html
# - QPDF: http://qpdf.sourceforge.net/
# - John the Ripper: https://www.openwall.com/john/

Linux/WSL:

sudo apt install ghostscript qpdf john hashcat tesseract-ocr imagemagick

Check dependencies:

pdfscalpel check-deps
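
A check like this usually comes down to looking for each tool's executable on PATH. The sketch below is not PDFScalpel's implementation. The executable names (`gs`, `qpdf`, `john`, `hashcat`, `tesseract`) are the usual ones on Linux and are assumptions here, as is the hypothetical `find_tools` helper.

```python
import shutil

# Executable names commonly used by the optional tools.
# These are assumptions and may differ per platform (e.g. gswin64c on Windows).
TOOLS = {
    "Ghostscript": "gs",
    "QPDF": "qpdf",
    "John the Ripper": "john",
    "Hashcat": "hashcat",
    "Tesseract": "tesseract",
}


def find_tools(tools=TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(exe) for name, exe in tools.items()}


if __name__ == "__main__":
    for name, path in find_tools().items():
        status = f"found at {path}" if path else "missing"
        print(f"{name:16} {status}")
```

Running the script prints one line per tool, which can help decide what to install before relying on the optional features.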

Shell Completion (Tab Autocomplete)

Enable tab completion for faster command usage:

Bash:

pdfscalpel completion bash > ~/.pdfscalpel-complete.bash
echo "source ~/.pdfscalpel-complete.bash" >> ~/.bashrc
source ~/.bashrc

Zsh:

pdfscalpel completion zsh > ~/.pdfscalpel-complete.zsh
echo "source ~/.pdfscalpel-complete.zsh" >> ~/.zshrc
source ~/.zshrc

Fish:

pdfscalpel completion fish > ~/.config/fish/completions/pdfscalpel.fish

Command Discovery

Quick reference for common commands:

pdfscalpel commands                    # Show all common commands by category
pdfscalpel commands --search malware   # Search for specific commands

List all available commands:

pdfscalpel list-commands               # List all commands
pdfscalpel list-commands --group analyze   # Show only analyze commands
pdfscalpel list-commands --format json     # JSON output for scripting

Quick Start

Analyze PDF Structure

pdfscalpel analyze structure document.pdf
pdfscalpel analyze metadata document.pdf
pdfscalpel analyze encryption encrypted.pdf --check-exploits

Forensic Analysis (Malware, Signatures, Forms)

# Malware and exploit detection
pdfscalpel analyze malware suspicious.pdf
pdfscalpel analyze malware file.pdf --yara-rules ./rules --output report.json

# Digital signature validation
pdfscalpel analyze signatures signed.pdf
pdfscalpel analyze signatures file.pdf --output report.json --format json

# Form security analysis
pdfscalpel analyze form-security form.pdf
pdfscalpel analyze form-security file.pdf --output report.json

# Anti-forensics detection
pdfscalpel analyze anti-forensics document.pdf
pdfscalpel analyze anti-forensics sanitized.pdf --format json

# Advanced steganography detection
pdfscalpel analyze advanced-stego file.pdf
pdfscalpel analyze advanced-stego suspicious.pdf --deep

# PDF repair and damage assessment
pdfscalpel solve repair corrupted.pdf --assess-only
pdfscalpel solve repair broken.pdf --output fixed.pdf --report report.json

Extract Data

pdfscalpel extract text document.pdf -o output.txt
pdfscalpel extract images document.pdf -o images/
pdfscalpel extract javascript malicious.pdf -o scripts/
pdfscalpel extract revisions modified.pdf -o revisions/
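
For background on revision extraction: a PDF saved with incremental updates appends each new revision to the end of the file, and every revision ends with an `%%EOF` marker. The sketch below splits a file at those markers. It is a simplified illustration, not PDFScalpel's method; `revision_offsets` and `split_revisions` are hypothetical names, and linearized PDFs can contain an extra `%%EOF` that does not correspond to a separate revision.

```python
def revision_offsets(data: bytes) -> list[int]:
    """Return the byte offset just past each %%EOF marker in the file."""
    marker = b"%%EOF"
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos + len(marker))
        pos = data.find(marker, pos + 1)
    return offsets


def split_revisions(data: bytes) -> list[bytes]:
    """Return one cumulative prefix of the file per %%EOF marker.

    Each prefix is the document as it stood after that revision,
    because incremental updates only ever append to the file.
    """
    return [data[:end] for end in revision_offsets(data)]
```

Given the bytes of `modified.pdf`, `len(split_revisions(data))` gives a rough count of how many times the file was saved incrementally.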

Web Content Extraction

Download paginated content from web APIs and compile it into a single PDF. Useful for documentation archives, writeup collections, and other content served page by page through an API.

# Simple extraction with URL and page range
pdfscalpel extract web --url "https://api.example.com/page" --pages "1-20" -o output.pdf

# Auto-discover available pages
pdfscalpel extract web --url "https://api.example.com/page" --auto-discover -o output.pdf

# Advanced configuration file
pdfscalpel extract web --config web-scrape-example.toml

Configuration File Example (web-scrape-example.toml):

[web_extraction]
base_url = "https://api.example.com/infoiwant"
pages = "1-50"
title = "Documentation Archive"
auto_discover = true
cookies_from_browser = "firefox"

[retry]
max_retries = 5
exponential_backoff = true

[rate_limit]
base_delay_ms = 2000
jitter_ms = 1000

Features:

Auto-discovery: Automatically find available pages
Browser cookies: Load authentication cookies from Firefox/Chrome/Edge
Retry logic: Automatic retry with exponential backoff
Rate limiting: Configurable delays with jitter to avoid rate limits
Progress tracking: Real-time download progress and statistics
Resume support: Resume from cached downloads (planned)
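
To make the retry and rate-limit settings concrete, here is a minimal sketch of exponential backoff with jitter using the values from the example configuration (`max_retries = 5`, `base_delay_ms = 2000`, `jitter_ms = 1000`). How PDFScalpel actually combines these values is not documented here, so the doubling rule and the `retry_delays` helper are assumptions.

```python
import random


def retry_delays(max_retries=5, base_delay_ms=2000, jitter_ms=1000, rng=random):
    """Yield one delay in milliseconds per retry attempt.

    Assumed rule: the base delay doubles on every attempt, and a uniform
    random jitter in [0, jitter_ms] is added on top.
    """
    for attempt in range(max_retries):
        backoff = base_delay_ms * (2 ** attempt)
        yield backoff + rng.uniform(0, jitter_ms)


if __name__ == "__main__":
    for attempt, delay in enumerate(retry_delays(), start=1):
        print(f"retry {attempt}: wait about {delay / 1000:.1f} s")
```

The jitter spreads requests out so that several clients retrying at once do not hit the server at the same instant.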

Password Cracking (CTF/Authorized Testing)

# CTF mode (requires challenge ID for audit trail)
pdfscalpel solve password encrypted.pdf --ctf-mode --challenge-id ctf-2024-001

# Dictionary attack
pdfscalpel solve password encrypted.pdf --wordlist rockyou.txt --ctf-mode --challenge-id test

# Brute force
pdfscalpel solve password encrypted.pdf --brute --length 6 --ctf-mode --challenge-id test

Watermark Detection & Removal

# Detect watermark
pdfscalpel analyze watermark watermarked.pdf

# Remove watermark (auto-detect method)
pdfscalpel mutate watermark watermarked.pdf clean.pdf --remove auto

# Try all removal methods
pdfscalpel mutate watermark watermarked.pdf clean.pdf --remove-all

Flag Hunting (CTF)

# Hunt for flags with built-in patterns
pdfscalpel solve flag-hunt challenge.pdf --patterns ctf,flag,md5

# Custom pattern
pdfscalpel solve flag-hunt challenge.pdf --custom-pattern "FLAG\{[a-f0-9]+\}"

# Search all layers (text, metadata, streams, revisions)
pdfscalpel solve flag-hunt challenge.pdf --deep --report flag_report.txt
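
The custom pattern above is an ordinary regular expression. The sketch below applies the same pattern to raw bytes with Python's `re` module and shows why compressed streams need decoding first. It only illustrates the idea; `find_flags` is a hypothetical helper, not part of PDFScalpel.

```python
import re
import zlib

# Same pattern as the --custom-pattern example, compiled for bytes input.
FLAG_RE = re.compile(rb"FLAG\{[a-f0-9]+\}")


def find_flags(data: bytes, pattern=FLAG_RE) -> list[str]:
    """Return every match of the pattern in data, decoded as text."""
    return [m.group(0).decode("ascii", "replace") for m in pattern.finditer(data)]


if __name__ == "__main__":
    raw = b"BT (hidden) Tj ET FLAG{deadbeef}"
    packed = zlib.compress(raw)  # stands in for a FlateDecode stream
    print(find_flags(packed))                   # compressed bytes rarely match
    print(find_flags(zlib.decompress(packed)))  # decoded content does
```

This is why a search over the file's raw bytes can miss a flag that a search over decoded streams finds.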

Auto-Solve CTF Challenges

pdfscalpel solve auto challenge.pdf --ctf-mode --challenge-id ctf-2024-pdf --report solution.txt

Generate CTF Challenges

# Password-protected challenge
pdfscalpel generate challenge output.pdf --flag "CTF{test_flag}" --difficulty medium --type password

# Steganography challenge
pdfscalpel generate challenge output.pdf --flag "CTF{hidden}" --type stego --difficulty hard

# Multi-stage challenge
pdfscalpel generate challenge output.pdf --flag "CTF{final}" --type multi --stages 3

Object Graph Visualization

# Generate DOT graph
pdfscalpel analyze graph document.pdf -o graph.dot

# Generate PNG (requires graphviz installed)
pdfscalpel analyze graph document.pdf -o graph.png --format png

# Entropy heatmap (separate analyze command)
pdfscalpel analyze entropy document.pdf --heatmap -o entropy.png
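
Entropy analysis typically rests on Shannon entropy measured in bits per byte: values close to 8 suggest compressed or encrypted data, while plain text scores much lower. The function below is a generic sketch of that measure, not PDFScalpel's code, and `shannon_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    print(shannon_entropy(b"aaaaaaaa"))        # a single repeated byte
    print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

A heatmap can be built by applying the same function to fixed-size windows or to each stream object and plotting the results.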

Modify PDFs

# Merge PDFs
pdfscalpel mutate merge file1.pdf file2.pdf -o merged.pdf

# Extract page range
pdfscalpel mutate extract-pages document.pdf -o output.pdf --pages 1-5,10,15-20

# Delete specific pages
pdfscalpel mutate delete-pages document.pdf -o output.pdf --pages 3,7-9

# Rotate pages
pdfscalpel mutate rotate-pages document.pdf -o output.pdf --pages 1-3 --rotation 90

# Add password
pdfscalpel mutate encrypt input.pdf output.pdf --password secret123 --algorithm aes256

# Remove password (requires current password)
pdfscalpel mutate decrypt encrypted.pdf output.pdf --password secret123

# Redact text patterns
pdfscalpel mutate redact document.pdf output.pdf --pattern "\d{3}-\d{2}-\d{4}"
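
The `--pages` values used above (for example `1-5,10,15-20`) combine single pages and ranges. A minimal sketch of how such a specification can be expanded follows. It assumes 1-based page numbers and inclusive ranges, which this README does not state explicitly, and `parse_pages` is a hypothetical helper.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec such as "1-5,10,15-20" into a list of page numbers.

    Assumes 1-based numbering and inclusive ranges.
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


if __name__ == "__main__":
    print(parse_pages("3,7-9"))
```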

OCR (Optical Character Recognition)

# Make scanned PDF searchable (basic mode - skips pages with existing text)
pdfscalpel mutate ocr scanned_document.pdf searchable.pdf

# Force OCR on all pages (image-based PDFs)
pdfscalpel mutate ocr scanned_document.pdf searchable.pdf --force-ocr

# Multi-language OCR
pdfscalpel mutate ocr document.pdf output.pdf --lang eng+spa --force-ocr

# Check available languages
pdfscalpel mutate ocr --list-languages

# Advanced options
pdfscalpel mutate ocr input.pdf output.pdf \
  --force-ocr \
  --lang eng \
  --jobs 8 \
  --no-deskew \
  --output-type pdfa

Command Reference

Analyze Commands

Command	Description
`structure`	Analyze PDF structure, detect anomalies
`metadata`	Extract metadata (Info dict, XMP)
`encryption`	Analyze encryption parameters, assess crackability
`malware`	Detect malware, exploits, and malicious JavaScript (20+ CVEs)
`signatures`	Validate digital signatures and detect forgery attacks
`form-security`	Analyze PDF forms for XXE, SSRF, and injection vulnerabilities
`anti-forensics`	Detect sanitization tools and anti-forensic manipulation
`advanced-stego`	Detect advanced steganography beyond LSB
`watermark`	Detect and classify watermarks
`graph`	Generate object graph visualization
`entropy`	Entropy analysis for malware/stego detection
`intelligence`	Generate intelligence report with recommendations
`compliance`	Check PDF/A, PDF/X, PDF/E, PDF/UA compliance
`render-diff`	Analyze reader-specific rendering differences

Extract Commands

Command	Description
`text`	Extract text with layout preservation
`images`	Extract all embedded images
`javascript`	Extract and deobfuscate JavaScript
`attachments`	Extract embedded files
`forms`	Extract AcroForm/XFA form data
`streams`	Extract and decompress object streams
`objects`	Dump specific PDF objects by ID
`hidden`	Find invisible/hidden content
`revisions`	Extract PDF revision history
`web`	Extract paginated web content to PDF

Mutate Commands

Command	Description
`merge`	Merge multiple PDFs into one
`extract-pages`	Extract specific pages from PDF
`delete-pages`	Delete specific pages from PDF
`rotate-pages`	Rotate pages (90, 180, 270 degrees)
`watermark`	Add or remove watermarks
`encrypt`	Add password protection and encryption
`decrypt`	Remove password protection and encryption
`add-bookmarks`	Add bookmarks/table of contents
`remove-bookmarks`	Remove all bookmarks
`redact`	Redact text matching patterns
`ocr`	Add searchable text layer via OCR (requires tesseract)
`optimize`	Compress, remove unused objects, linearize

Solve Commands (CTF/Authorized Testing)

Command	Description
`password`	Crack PDF passwords (requires --ctf-mode)
`flag-hunt`	Hunt for flags across all PDF layers
`stego`	Detect and extract steganography
`auto`	Automatically analyze and solve challenges
`repair`	Assess PDF damage and attempt repair/recovery

Generate Commands

Command	Description
`challenge`	Generate CTF challenges
`corrupted`	Generate intentionally broken PDFs
`polyglot`	Create PDF polyglots (PDF+ZIP, PDF+HTML)
`stego-data`	Generate steganographic test data
`watermark-samples`	Generate watermark test samples

Utility Commands

Command	Description
`check-deps`	Check all dependencies and show installation instructions
`commands`	Quick reference for common commands (searchable)
`list-commands`	List all available commands (filterable by group)
`completion`	Generate shell completion scripts (bash/zsh/fish)
`plugin`	Plugin management (list, enable, disable)
`perf`	Performance profiling and benchmarking

CTF Mode

CTF mode enforces ethical use with audit trails:

pdfscalpel solve password encrypted.pdf \
  --ctf-mode \
  --challenge-id "defcon-2024-pdf-01" \
  --output-audit audit.json

CTF Mode Requirements:

Challenge ID must be provided
All operations are logged
Generates signed provenance file
Prevents destructive operations without audit

Audit Log Format:

{
  "mode": "ctf",
  "challenge_id": "defcon-2024-pdf-01",
  "timestamp": "2024-01-27T12:00:00Z",
  "operations": ["password_crack", "flag_hunt"],
  "results": {...},
  "hash": "sha256:..."
}
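
One way a `hash` field like this can be produced is to serialize the record without the hash in a canonical form and take its SHA-256 digest. What PDFScalpel actually hashes is not specified here, so treat the sketch below as an assumption. It gives an integrity check only; a signed provenance file would additionally need a key. `seal_audit` and `verify_audit` are hypothetical names.

```python
import hashlib
import json


def seal_audit(record: dict) -> dict:
    """Return a copy of record with a sha256 digest of its other fields."""
    body = {k: v for k, v in record.items() if k != "hash"}
    # Sorted keys and fixed separators keep the serialization deterministic.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {**body, "hash": "sha256:" + hashlib.sha256(canonical).hexdigest()}


def verify_audit(record: dict) -> bool:
    """Recompute the digest and compare it with the stored one."""
    return seal_audit(record)["hash"] == record.get("hash")


if __name__ == "__main__":
    sealed = seal_audit({
        "mode": "ctf",
        "challenge_id": "defcon-2024-pdf-01",
        "timestamp": "2024-01-27T12:00:00Z",
        "operations": ["password_crack", "flag_hunt"],
    })
    print(sealed["hash"], verify_audit(sealed))
```

Any change to a sealed record, such as editing the challenge ID, makes `verify_audit` return `False`.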

Configuration

Create pdfscalpel.toml in your project directory or ~/.pdfscalpel.toml:

# PDFScalpel Configuration

[ocr]
enabled = true
language = "eng"
jobs = 4
deskew = true

[watermark]
font_size = 72
opacity = 0.3
rotation = 45

[password]
wordlists = ["rockyou.txt", "common-passwords.txt"]
max_brute_length = 6
timeout = 3600  # seconds

[plugins]
enabled = true
directories = ["plugins", "~/.pdfscalpel/plugins"]

Load configuration:

pdfscalpel --config pdfscalpel.toml analyze structure document.pdf

Plugin Development

Create custom plugins by extending base classes:

# plugins/my_analyzer.py
from pdfscalpel.plugins.base import AnalyzerPlugin

class MyAnalyzer(AnalyzerPlugin):
    # Plugin identifier, as used with `pdfscalpel plugin enable`
    name = "my-analyzer"
    description = "Custom PDF analysis"
    version = "1.0.0"

    def analyze(self, pdf_path, **kwargs):
        # Your analysis logic goes here; return the findings as a dict
        return {"status": "analyzed"}

List the available plugins, then enable yours:

pdfscalpel plugin list
pdfscalpel plugin enable my-analyzer

Examples

Forensic Analysis Workflow

# 1. Initial triage
pdfscalpel analyze structure suspicious.pdf
pdfscalpel analyze metadata suspicious.pdf

# 2. Malware analysis
pdfscalpel analyze malware suspicious.pdf --output malware_report.json
pdfscalpel extract javascript suspicious.pdf -o js/

# 3. Signature validation (if signed)
pdfscalpel analyze signatures suspicious.pdf

# 4. Deep analysis
pdfscalpel analyze entropy suspicious.pdf --output entropy.png
pdfscalpel analyze anti-forensics suspicious.pdf
pdfscalpel analyze advanced-stego suspicious.pdf --deep

# 5. Intelligence report
pdfscalpel analyze intelligence suspicious.pdf --report report.txt

# 6. Extract timeline
pdfscalpel extract revisions suspicious.pdf -o revisions/

CTF Challenge Solving

# 1. Auto-solve attempt
pdfscalpel solve auto challenge.pdf --ctf-mode --challenge-id ctf-001 --report solution.txt

# 2. Manual solving
pdfscalpel solve password challenge.pdf --ctf-mode --challenge-id ctf-001
pdfscalpel solve flag-hunt challenge.pdf --patterns ctf
pdfscalpel solve stego challenge.pdf -o extracted/

Watermark Removal (Authorized)

# 1. Analyze watermark
pdfscalpel analyze watermark document.pdf --verbose

# 2. Remove using recommended method
pdfscalpel mutate watermark document.pdf clean.pdf --remove content-stream

# 3. Try all methods if first fails
pdfscalpel mutate watermark document.pdf clean.pdf --remove-all

Use Cases

Digital Forensics:

Malware detection (JavaScript exploits, CVE fingerprinting, 20+ exploit patterns)
Digital signature validation and forgery detection
Document authenticity verification
Timeline reconstruction from revisions
Metadata analysis and tool fingerprinting
Anti-forensics detection (sanitization tool identification)

CTF Competitions:

Password cracking (RC4/AES, all key lengths)
Advanced steganography detection (stream operators, object ordering, whitespace)
Flag hunting across all PDF layers
Challenge creation and testing
PDF damage assessment and repair

Penetration Testing:

Form exploitation (XFA XXE, SSRF, JavaScript injection - CVE-2025-54988)
Encryption weakness detection
Reader-specific exploit analysis
Polyglot file detection

Security Research:

PDF standard compliance testing
Rendering difference analysis
Sanitization tool fingerprinting (ExifTool, MAT2, QPDF, Ghostscript)
Attack indicator detection for signature attacks: Universal Signature Forgery (USF), Signature Wrapping Attack (SWA), Incremental Saving Attack (ISA)

Performance

Benchmarks (Windows 11, i7-12700K):

Structure analysis: 1000-page PDF in <3 seconds
Password cracking: RC4-40 at 50,000+ passwords/sec
Object graph generation: 5000 objects in <2 seconds
Image extraction: 100 images in <1 second
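
The cracking rate above can be turned into a rough time estimate for a brute-force run. The sketch below computes the worst-case time to exhaust every password of a fixed length. The alphabets are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def keyspace(alphabet_size: int, length: int) -> int:
    """Number of candidate passwords of exactly the given length."""
    return alphabet_size ** length


def worst_case_seconds(alphabet_size: int, length: int, rate_per_sec: float) -> float:
    """Seconds needed to try every candidate at a constant rate."""
    return keyspace(alphabet_size, length) / rate_per_sec


if __name__ == "__main__":
    for label, size in [("lowercase", 26), ("alphanumeric", 62)]:
        secs = worst_case_seconds(size, 6, 50_000)
        print(f"{label}: {keyspace(size, 6):,} candidates, about {secs / 3600:.1f} hours")
```

At 50,000 passwords/sec, six lowercase letters (308,915,776 candidates) take roughly 1.7 hours in the worst case, which is already longer than the `timeout = 3600` seconds shown in the example configuration.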

WSL/Linux Performance:

GPU-accelerated cracking with Hashcat: 100x+ improvement
Parallel processing: 4-8x speedup on multi-core CPUs
External tool integration: QPDF, John, Ghostscript

Troubleshooting

Dependencies missing:

pdfscalpel check-deps

Encrypted PDF without password:

pdfscalpel analyze encryption document.pdf --check-exploits
# Review crackability assessment before attempting

Watermark is not removed:

# Try all methods and compare results
pdfscalpel mutate watermark input.pdf output.pdf --remove-all --compare

Large PDF performance:

# Use streaming mode for large files
pdfscalpel analyze structure huge.pdf --streaming

Contributing

Fork repository
Create feature branch
Add tests for new features
Run test suite: pytest
Run linter: ruff check pdfscalpel/
Run type checker: mypy pdfscalpel/
Submit pull request

Ethical Use

Authorized Use Cases:

CTF competitions with --ctf-mode
Authorized penetration testing engagements
Digital forensics investigations
Security research and education
Defensive security analysis

Always obtain proper authorization before using password cracking or exploitation features.

License

MIT License - See LICENSE file for details

Credits

Built on: pikepdf, pdfplumber, PyPDF, QPDF, Ghostscript, John the Ripper, Hashcat

PDFScalpel Contributors
