A tool for extracting and anonymizing credit card statements from PDFs using PaddleOCR and Presidio

jerrylasama/cc-statement-extractor

Credit Card Statement Extractor

Small tool to extract and anonymize data from credit card statements (PDFs), since my bank doesn't provide a good way to export the data and I am too lazy to do it manually. It uses PaddleOCR for the OCR and Microsoft Presidio for PII (Personally Identifiable Information) anonymization.

Features

  • OCR Extraction: Converts PDF statements into structured Markdown, ready to be used by other tools / LLMs.
  • PII Anonymization: Automatically redacts sensitive information like:
    • Names
    • Credit Card Numbers
    • Identity Numbers (NIK)
    • Locations
    • Email Addresses
    • Anything else you want to redact (just edit config.yaml)
  • Custom Deny List: Add specific terms to be redacted in config.yaml.
  • Incremental Processing: Skips already processed files unless forced.
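
The incremental behaviour can be pictured with a minimal sketch. It assumes each input PDF produces an output named after the file's stem plus an -anonymized.md suffix. The function name pending_pdfs and its parameters are hypothetical and are not taken from the project's code.

```python
from pathlib import Path


def pending_pdfs(data_dir: Path, output_dir: Path, force: bool = False) -> list[Path]:
    """Return the PDFs in data_dir that still need processing.

    Hypothetical helper: assumes `name.pdf` produces
    `name-anonymized.md` in output_dir.
    """
    pdfs = sorted(data_dir.glob("*.pdf"))
    if force:
        # Forced run: reprocess everything, even if an output already exists.
        return pdfs
    # Otherwise keep only the PDFs that have no matching output file yet.
    return [
        pdf for pdf in pdfs
        if not (output_dir / f"{pdf.stem}-anonymized.md").exists()
    ]
```

Calling pending_pdfs(Path("data"), Path("output")) would list only the statements that have not been anonymized yet, while force=True returns every PDF.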

Installation

This project uses uv for dependency management.

# Install dependencies (CPU version)
uv sync

# If you have a CUDA-capable GPU and want faster OCR
# Make sure your CUDA version matches the one in pyproject.toml's "paddle-gpu" extra
# More Info: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html
uv sync --extra gpu

Install the spaCy model

uv run -m spacy download en_core_web_sm

Usage

Place your PDF statements in the data/ directory.

# Process all PDFs in data/
uv run main

# Process a specific file
uv run main -f filename.pdf

# Force re-processing of already anonymized files
uv run main --force

Results will be saved in the output/ directory as files with an -anonymized.md suffix.

Configuration

Configuration is managed via config.yaml. Copy config.example.yaml to create your own:

cp config.example.yaml config.yaml

Configuration areas:

  • ocr: Control which labels to ignore during Markdown generation.
  • nlp: Configure the NLP engine (spaCy) and the GLiNER models.
  • pii: Define target entities and a custom deny list for redaction.

GLiNER entity mapping is defined in entity.map.json. If you use a different model, remember to adjust it.
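
To give a feel for what a custom deny list in the pii section does, here is a minimal sketch using only a regular expression. It is not how the tool or Presidio implements redaction. The function name redact_terms and the <REDACTED> placeholder are assumptions made for illustration.

```python
import re


def redact_terms(text: str, deny_list: list[str], placeholder: str = "<REDACTED>") -> str:
    """Replace every deny-listed term in text with a placeholder (case-insensitive)."""
    terms = [t for t in deny_list if t]
    if not terms:
        return text
    # Try longer terms first so "Jane Doe" is matched before "Jane".
    terms = sorted(terms, key=len, reverse=True)
    # The lookarounds stop a term from matching inside a longer word.
    pattern = re.compile(
        "|".join(rf"(?<!\w){re.escape(t)}(?!\w)" for t in terms),
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)
```

The word-boundary lookarounds are a design choice of this sketch: a deny-listed "Jane" redacts "Jane" but leaves "Janet" untouched.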

Docker Usage

You can run the extractor using Docker to avoid local environment issues.

Using CPU

# Build
docker build -t cc-extractor-cpu .

# Run all files
docker run -v $(pwd)/data:/app/data -v $(pwd)/output:/app/output -v $(pwd)/config.yaml:/app/config.yaml cc-extractor-cpu

# Pass parameters (e.g., process specific file)
docker run -v $(pwd)/data:/app/data -v $(pwd)/output:/app/output -v $(pwd)/config.yaml:/app/config.yaml cc-extractor-cpu -f filename.pdf --force

Using GPU (CUDA 13.0)

Requires NVIDIA Container Toolkit.

# Build
docker build -t cc-extractor-gpu -f Dockerfile.gpu .

# Run all files
docker run --gpus all -v $(pwd)/data:/app/data -v $(pwd)/output:/app/output -v $(pwd)/config.yaml:/app/config.yaml cc-extractor-gpu

# Pass parameters
docker run --gpus all -v $(pwd)/data:/app/data -v $(pwd)/output:/app/output -v $(pwd)/config.yaml:/app/config.yaml cc-extractor-gpu --force

Development

# Run tests
uv run pytest

# Check linting
uv run ruff check

License

MIT License - See the LICENSE file for details.
