Credit Card Statement Extractor

A small tool that extracts and anonymizes data from credit card statement PDFs. My bank doesn't provide a good way to export the data, and I am too lazy to do it manually. It uses PaddleOCR for OCR and Microsoft Presidio for PII (Personally Identifiable Information) anonymization.

Features

OCR Extraction: Converts PDF statements into structured Markdown, ready to be used by other tools / LLMs.
PII Anonymization: Automatically redacts sensitive information like:
- Names
- Credit Card Numbers
- Identity Numbers (NIK)
- Locations
- Email Addresses
- Anything else you want to redact (just edit config.yaml)
Custom Deny List: Add specific terms to be redacted in config.yaml.
Incremental Processing: Skips already processed files unless forced.
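
To illustrate the deny-list idea, here is a minimal standard-library sketch. It is not the tool's implementation (the project relies on Presidio for redaction); the function name redact_terms and the <REDACTED> placeholder are assumptions made up for this example.

```python
import re


def redact_terms(text: str, deny_list: list[str], placeholder: str = "<REDACTED>") -> str:
    """Replace every deny-listed term in text with a placeholder (case-insensitive)."""
    # Drop empty entries and sort longest first, so "Acme Corp" wins over "Acme".
    terms = sorted((t for t in deny_list if t), key=len, reverse=True)
    if not terms:
        return text
    # re.escape keeps each term literal, even if it contains regex metacharacters.
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return pattern.sub(placeholder, text)


print(redact_terms("Paid to Acme Corp via ACME", ["Acme", "Acme Corp"]))
```

Sorting the terms longest first is a deliberate choice: regex alternation takes the first alternative that matches, so a short term listed earlier would otherwise leave the tail of a longer term unredacted.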

Installation

This project uses uv for dependency management.

# Install dependencies (CPU version)
uv sync

# If you have a CUDA-capable GPU and want faster OCR
# Make sure your CUDA version matches the one in pyproject.toml's "paddle-gpu" extra
# More Info: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html
uv sync --extra gpu

Install the spaCy model

uv run -m spacy download en_core_web_sm

Usage

Place your PDF statements in the data/ directory.

# Process all PDFs in data/
uv run main

# Process a specific file
uv run main -f filename.pdf

# Force re-processing of already anonymized files
uv run main --force

Results are saved in the output/ directory as files ending in -anonymized.md.
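
The skip-unless-forced behavior can be pictured with the sketch below. It assumes each output file is named after the PDF's stem plus -anonymized.md (for example, data/jan.pdf to output/jan-anonymized.md); that naming, and the helper name pending_pdfs, are assumptions for illustration rather than the tool's actual code.

```python
from pathlib import Path


def pending_pdfs(data_dir: Path, output_dir: Path, force: bool = False) -> list[Path]:
    """Return the PDFs in data_dir that still need processing."""
    pending = []
    for pdf in sorted(data_dir.glob("*.pdf")):
        # Assumed naming scheme: data/foo.pdf -> output/foo-anonymized.md
        target = output_dir / f"{pdf.stem}-anonymized.md"
        # With force=True every PDF is returned, even if its output exists.
        if force or not target.exists():
            pending.append(pdf)
    return pending


if __name__ == "__main__":
    for pdf in pending_pdfs(Path("data"), Path("output")):
        print(f"would process {pdf}")
```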

Configuration

Configuration is managed via config.yaml. Copy config.example.yaml to create your own:

cp config.example.yaml config.yaml

Configuration areas:

ocr: Control which labels to ignore during Markdown generation.
nlp: Configure the NLP engine (spaCy) and the GLiNER models.
pii: Define target entities and a custom deny list for redaction.

GLiNER entity mapping is defined in entity.map.json. If you use a different model, don't forget to adjust it.
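
As a rough picture of what such a mapping does, the sketch below translates model labels into Presidio-style entity names. The JSON contents, the flat label-to-entity structure, and the helper name map_labels are all assumptions for illustration; check the real entity.map.json for the actual labels and layout.

```python
import json

# Hypothetical mapping; the real entity.map.json may use different labels or structure.
EXAMPLE_MAP_JSON = """
{
  "person": "PERSON",
  "credit card number": "CREDIT_CARD",
  "email": "EMAIL_ADDRESS"
}
"""


def map_labels(predictions: list[dict], label_map: dict[str, str]) -> list[dict]:
    """Translate model labels into target entity names, dropping unmapped labels."""
    mapped = []
    for pred in predictions:
        entity = label_map.get(pred["label"].lower())
        if entity is not None:
            # Copy the prediction and overwrite only its label.
            mapped.append({**pred, "label": entity})
    return mapped


label_map = json.loads(EXAMPLE_MAP_JSON)
predictions = [
    {"text": "Jane Doe", "label": "person"},
    {"text": "Jakarta", "label": "city"},  # not in the map, so it is dropped
]
print(map_labels(predictions, label_map))
```

In this sketch, a label that is missing from the map is silently dropped, which is why switching to a model with different label names calls for updating the mapping.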

Docker Usage

You can run the extractor using Docker to avoid local environment issues.

Using CPU

# Build
docker build -t cc-extractor-cpu .

# Run all files
docker run -v "$(pwd)/data:/app/data" -v "$(pwd)/output:/app/output" -v "$(pwd)/config.yaml:/app/config.yaml" cc-extractor-cpu

# Pass parameters (e.g., process a specific file)
docker run -v "$(pwd)/data:/app/data" -v "$(pwd)/output:/app/output" -v "$(pwd)/config.yaml:/app/config.yaml" cc-extractor-cpu -f filename.pdf --force

Using GPU (CUDA 13.0)

Requires NVIDIA Container Toolkit.

# Build
docker build -t cc-extractor-gpu -f Dockerfile.gpu .

# Run all files
docker run --gpus all -v "$(pwd)/data:/app/data" -v "$(pwd)/output:/app/output" -v "$(pwd)/config.yaml:/app/config.yaml" cc-extractor-gpu

# Pass parameters
docker run --gpus all -v "$(pwd)/data:/app/data" -v "$(pwd)/output:/app/output" -v "$(pwd)/config.yaml:/app/config.yaml" cc-extractor-gpu --force

Development

# Run tests
uv run pytest

# Check linting
uv run ruff check

License

MIT License - See the LICENSE file for details.

Repository: jerrylasama/cc-statement-extractor
