OSF Preprints - Modular Pipeline

This repo runs a pipeline for OSF preprints, with the goal of checking preprint references for potentially missing replication studies and notifying their authors. It does four main jobs:

  1. Ingestion: fetch preprints from the OSF API and store them in DynamoDB (the database).
  2. PDF + GROBID: download files and turn PDFs into TEI XML (GROBID is the PDF parser).
  3. TEI extraction: parse TEI XML and write references back to DynamoDB.
  4. Enrichment: look up missing DOIs using Crossref and OpenAlex.

If you are not sure what to run, follow "Quick Start" below.

Quick Start (Local dev)

  1. Fill .env (example below). For local dev, keep DYNAMO_LOCAL_URL and use fake AWS keys.
  2. Start services:
     docker compose up -d dynamodb-local redis
     docker compose up -d celery-worker celery-pdf celery-grobid celery-beat flower
  3. Create DynamoDB tables (run once): docker compose run --rm app python -c "from osf_sync.db import init_db; init_db(); print('Dynamo tables ready')"
  4. Run a small sync: docker compose run --rm app python -m osf_sync.cli sync-from-date --start 2025-07-01
  5. Queue the pipeline:
     docker compose run --rm app python -m osf_sync.cli enqueue-pdf --limit 50
     docker compose run --rm app python -m osf_sync.cli enqueue-grobid --limit 25
     docker compose run --rm app python -m osf_sync.cli enqueue-extraction --limit 200
  6. Enrich references: docker compose run --rm app python -m osf_sync.cli enrich-references --limit 400

Feature overview

  • Incremental OSF ingestion with DynamoDB-backed cursors.
  • PDF/GROBID pipeline with queue flags stored on each preprint.
  • TEI parsing + reference extraction using the same data path.
  • Enrichment via Crossref + OpenAlex using the multi-method pipeline.
  • Extensive analytics scripts (no Docker required) to inspect coverage, dump references, or compute metrics.

Requirements

  • Docker Desktop with at least 4 GB of RAM
  • Python 3.11 (for local helpers) + pip/venv (optional)
  • AWS CLI (optional, for querying AWS DynamoDB)
  • OSF API token and optional AWS credentials (for hosted DynamoDB)

Environment Configuration (.env)

# Local development only; comment out DYNAMO_LOCAL_URL to use AWS-hosted DynamoDB
DYNAMO_LOCAL_URL=http://dynamodb-local:8000
GROBID_URL=http://grobid:8070
GROBID_INCLUDE_RAW_CITATIONS=true
CELERY_BROKER_URL=redis://redis:6379/0
AWS_REGION=eu-north-1
AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>
AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
DDB_TABLE_PREPRINTS=preprints
DDB_TABLE_REFERENCES=preprint_references
DDB_TABLE_SYNCSTATE=sync_state
OPENALEX_EMAIL=<PERSONAL_EMAIL_ID>
PDF_DEST_ROOT=/data/preprints
LOG_LEVEL=INFO
OSF_INGEST_ANCHOR_DATE=YYYY-MM-DD
API_CACHE_TTL_MONTHS=6
DDB_TABLE_API_CACHE=api_cache
OSF_INGEST_SKIP_EXISTING=false
  • For AWS DynamoDB, comment out DYNAMO_LOCAL_URL and set real AWS credentials (or rely on an IAM role).
  • The table names can be overridden with DDB_TABLE_* env vars.
  • Set GROBID_INCLUDE_RAW_CITATIONS=false if you do not want raw citation sections in the TEI output.
  • When OSF_INGEST_ANCHOR_DATE is set, ingestion only keeps preprints whose original_publication_date (if present) or date_published falls within the 6-month window ending on the anchor date. If a preprint has links.doi and that DOI link is not an OSF or Zenodo link (osf.io, zenodo.org, or a Zenodo DOI like 10.5281/zenodo...), it is skipped (see the sketch after this list).
  • API_CACHE_TTL_MONTHS controls the default TTL horizon for the api_cache table (used for API response caching). DynamoDB TTL is enabled on expires_at.
  • OSF_INGEST_SKIP_EXISTING=true skips upserting records that already exist in the preprints table (adds a read-before-write check).
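
For orientation, here is a minimal sketch of the anchor-window filter described above. The field names (original_publication_date, date_published, links.doi) come from the OSF API payload; the helper name, the use of python-dateutil, and the exact date handling are illustrative assumptions, not the actual ingestion code.

from datetime import date, datetime
from dateutil.relativedelta import relativedelta

OSF_ZENODO_MARKERS = ("osf.io", "zenodo.org", "10.5281/zenodo")

def keep_preprint(attrs: dict, links: dict, anchor: date) -> bool:
    """Keep a preprint only if it falls in the 6-month anchor window and has no external DOI link."""
    raw = attrs.get("original_publication_date") or attrs.get("date_published")
    if not raw:
        return False
    published = datetime.fromisoformat(raw.replace("Z", "+00:00")).date()
    # keep only preprints published in the 6-month window ending on the anchor date
    if not (anchor - relativedelta(months=6) <= published <= anchor):
        return False
    # skip preprints whose DOI link points outside OSF/Zenodo
    doi_link = (links or {}).get("doi") or ""
    if doi_link and not any(m in doi_link for m in OSF_ZENODO_MARKERS):
        return False
    return True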

Running the ingestion stack

docker compose up -d dynamodb-local redis
docker compose up -d celery-worker celery-pdf celery-grobid celery-beat flower
  • The app service is a helper container that initializes tables and prints CLI usage before exiting. Run one-off commands via docker compose run --rm app ....
  • Source code is bind-mounted into the containers, so edits in your IDE take effect immediately.

Common commands

Action                        Command
Build images                  docker compose build
Start stack                   docker compose up -d
Inspect Celery worker logs    docker compose logs -f celery-worker
Inspect PDF/GROBID logs       docker compose logs -f celery-pdf / docker compose logs -f celery-grobid
Flower dashboard              visit http://localhost:5555

DynamoDB: Local vs AWS

Local DynamoDB (default dev workflow)

  1. docker compose up -d dynamodb-local
  2. Ensure .env still contains DYNAMO_LOCAL_URL=http://dynamodb-local:8000 and dummy AWS credentials.
  3. Initialize tables/GSIs (only needed once):
    docker compose run --rm app python -c "from osf_sync.db import init_db; init_db(); print('Dynamo tables ready')"

AWS-hosted DynamoDB

  1. Comment/remove DYNAMO_LOCAL_URL in .env.
  2. Export valid AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION (or rely on an IAM role).
  3. Run init_db() once to create tables + GSIs in AWS:
    python -c "from osf_sync.db import init_db; init_db(); print('Dynamo tables ready')"
  4. Restart services so they use the AWS endpoint.

init_db() auto-creates tables and adds any missing GSIs. It is idempotent; rerun it after deploying schema changes.
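
For reference, an idempotent table bootstrap with boto3 looks roughly like the sketch below. The table name and osf_id hash key match the configuration above; everything else (helper name, billing mode, error handling) is an assumption for illustration, not the actual osf_sync.db implementation.

import os
import boto3

def ensure_table(client, name, key_schema, attr_defs):
    """Create a table if it does not exist yet; existing tables are left untouched."""
    try:
        client.create_table(
            TableName=name,
            KeySchema=key_schema,
            AttributeDefinitions=attr_defs,
            BillingMode="PAY_PER_REQUEST",
        )
        client.get_waiter("table_exists").wait(TableName=name)
    except client.exceptions.ResourceInUseException:
        pass  # already created on a previous run -> idempotent

client = boto3.client(
    "dynamodb",
    region_name=os.getenv("AWS_REGION", "eu-north-1"),
    endpoint_url=os.getenv("DYNAMO_LOCAL_URL"),  # None -> real AWS endpoint
)
ensure_table(
    client,
    os.getenv("DDB_TABLE_PREPRINTS", "preprints"),
    [{"AttributeName": "osf_id", "KeyType": "HASH"}],
    [{"AttributeName": "osf_id", "AttributeType": "S"}],
)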


Inspecting DynamoDB Data

Helper script (hosts & containers)

Run from repo root:

# Host (env vars set and dependencies installed)
python -m osf_sync.dump_ddb --limit 5 --queues

# Container
docker compose exec app python /app/osf_sync/dump_ddb.py --limit 5 --queues

Flags:

  • --table preprints scans only that table.
  • --queues queries by_queue_pdf, by_queue_grobid, by_queue_extract GSIs (falls back to scans if missing).
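
The query-then-fall-back-to-scan behavior behind --queues looks roughly like this with boto3 (the helper name and details are assumptions; the real dump_ddb.py may differ):

import os
import boto3
from boto3.dynamodb.conditions import Attr, Key
from botocore.exceptions import ClientError

def pending_items(table_name: str, index: str, attr: str, limit: int = 20):
    """Query a queue GSI for 'pending' items, falling back to a scan if the GSI is absent."""
    table = boto3.resource(
        "dynamodb", endpoint_url=os.getenv("DYNAMO_LOCAL_URL")
    ).Table(table_name)
    try:
        resp = table.query(
            IndexName=index,
            KeyConditionExpression=Key(attr).eq("pending"),
            Limit=limit,
        )
    except ClientError:
        # GSI not created yet -> slower full scan with the same filter
        resp = table.scan(FilterExpression=Attr(attr).eq("pending"), Limit=limit)
    return resp["Items"]

# e.g. pending_items("preprints", "by_queue_pdf", "queue_pdf")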

AWS CLI snippets

Add --endpoint-url http://localhost:8000 to point commands at local DynamoDB.

# List tables
aws dynamodb list-tables --region eu-north-1

# Scan 10 preprints
aws dynamodb scan --table-name preprints --limit 10 --region eu-north-1

# Get a single preprint
aws dynamodb get-item --table-name preprints `
  --key '{"osf_id":{"S":"<OSF_ID>"}}' `
  --region eu-north-1

# Get a reference (composite key)
aws dynamodb get-item --table-name preprint_references `
  --key '{"osf_id":{"S":"<OSF_ID>"},"ref_id":{"S":"<REF_ID>"}}' `
  --region eu-north-1

# Queue queries
aws dynamodb query --table-name preprints `
  --index-name by_queue_pdf `
  --key-condition-expression "queue_pdf = :q" `
  --expression-attribute-values '{":q":{"S":"pending"}}' `
  --limit 20 --region eu-north-1

aws dynamodb query --table-name preprints `
  --index-name by_queue_grobid `
  --key-condition-expression "queue_grobid = :q" `
  --expression-attribute-values '{":q":{"S":"pending"}}' `
  --limit 20 --region eu-north-1

aws dynamodb query --table-name preprints `
  --index-name by_queue_extract `
  --key-condition-expression "queue_extract = :q" `
  --expression-attribute-values '{":q":{"S":"pending"}}' `
  --limit 20 --region eu-north-1

The AWS PowerShell module (AWS.Tools.DynamoDBv2) exposes equivalent cmdlets (Get-DDBItem, Get-DDBTableList, etc.) if you prefer native PowerShell syntax.


Task families

Ingestion (Celery tasks/CLI)

Sync OSF preprints

docker compose run --rm app python -m osf_sync.cli sync-from-date --start 2025-07-01
# Add --subject Psychology to filter

Enqueue PDF downloads

docker compose run --rm app python -m osf_sync.cli enqueue-pdf --limit 50
docker compose logs -f celery-pdf

PDF + GROBID + TEI (Celery tasks)

docker compose run --rm app python -m osf_sync.cli enqueue-grobid --limit 25
docker compose logs -f celery-grobid

Parse existing TEI XML and write references

  • After GROBID finishes, TEI files live under /data/preprints/<provider>/<osf_id>/tei.xml.
  • enqueue-extraction walks that folder, parses each TEI XML, and writes structured TEI + reference records back into DynamoDB.
# Batch extraction (reads TEI XML directly, no OSF/API calls)
docker compose run --rm app python -m osf_sync.cli enqueue-extraction --limit 200

# Inspect logs
docker compose logs -f celery-worker

Each extraction job calls osf_sync.augmentation.run_extract.extract_for_osf_id (sketched below), which:

  1. Loads the TEI XML.
  2. Uses the TEI parser to pull structured fields and references.
  3. Persists output via write_extraction, updating preprint_tei, preprint_references, and preprints.tei_extracted.
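
As a rough sketch of step 2: GROBID's TEI output uses the http://www.tei-c.org/ns/1.0 namespace and lists references as biblStruct elements, so a stripped-down parser might look like the following. The real parser extracts far more fields before write_extraction persists them.

import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

def parse_references(tei_path: str) -> list[dict]:
    """Pull title/DOI pairs out of the <listBibl> section of a GROBID TEI file."""
    root = ET.parse(tei_path).getroot()
    refs = []
    for bibl in root.findall(".//tei:listBibl/tei:biblStruct", TEI):
        title = bibl.findtext(".//tei:title", default="", namespaces=TEI)
        doi = bibl.findtext(".//tei:idno[@type='DOI']", default="", namespaces=TEI)
        refs.append({"title": title.strip(), "doi": doi.strip().lower()})
    return refs

# refs = parse_references("/data/preprints/<provider>/<osf_id>/tei.xml")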

Enrichment tasks

docker compose run --rm app python -m osf_sync.cli enrich-references --limit 400
docker compose run --rm app python -m osf_sync.cli enrich-references --limit 400 --threshold 75
# Target a single reference:
# docker compose run --rm app python -m osf_sync.cli enrich-references --osf-id <OSF_ID> --ref-id <REF_ID> --debug
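
Conceptually, enrichment matches each reference's bibliographic text against Crossref (and OpenAlex) and only accepts a candidate DOI when the title similarity clears the threshold (cf. --threshold 75). The snippet below is a minimal single-method sketch using the public Crossref REST API and stdlib fuzzy matching; it is not the project's multi-method pipeline.

import difflib
import requests

def crossref_doi(ref_title: str, threshold: int = 75) -> str | None:
    """Return the best-matching Crossref DOI if its title similarity clears the threshold."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": ref_title, "rows": 3},
        timeout=30,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        candidate = (item.get("title") or [""])[0]
        score = difflib.SequenceMatcher(None, ref_title.lower(), candidate.lower()).ratio() * 100
        if score >= threshold:
            return item.get("DOI")
    return None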

celery-beat already schedules the daily sync, PDF/GROBID queues, and enrichment tasks; adjust in osf_sync/celery_app.py.
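
If you need to change the cadence, the schedule is a standard Celery beat_schedule mapping inside osf_sync/celery_app.py; the task names and timings below are placeholders for illustration, not the shipped configuration.

from celery import Celery
from celery.schedules import crontab

# in the real project this is the existing app in osf_sync/celery_app.py
app = Celery("osf_sync", broker="redis://redis:6379/0")

app.conf.beat_schedule = {
    # task names and timings are illustrative placeholders
    "daily-osf-sync": {
        "task": "osf_sync.tasks.sync_daily",
        "schedule": crontab(hour=2, minute=0),   # once a day at 02:00
    },
    "enqueue-pdf-downloads": {
        "task": "osf_sync.tasks.enqueue_pdf",
        "schedule": crontab(minute=0),           # hourly
    },
}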


Analytics & local scripts (no Docker)

All scripts live under scripts/manual_post_grobid/. They work from the repo root after installing dependencies (pip install -r requirements.txt) and setting PYTHONPATH. Summary:

Script                        Description
run_extraction.py             Parse TEI XML from disk and write TEI/refs into DynamoDB.
doi_multi_method_lookup.py    Run multi-method DOI matching and write a CSV. Does not update DynamoDB.
run_forrt_screening.py        Run FORRT lookup + screening.
analyze_doi_sources.py        Count DOI coverage per source.
dump_missing_doi_refs.py      Dump references that still lack DOI information.
select_low_doi_coverage.py    Find OSF IDs below a DOI coverage threshold (optional ref dumps).

Run scripts with:

cd H:\fred_preprint_bot
$env:PYTHONPATH = "$PWD"    # PowerShell; use set/export on cmd/bash
python scripts/manual_post_grobid/analyze_doi_sources.py
python scripts/manual_post_grobid/dump_missing_doi_refs.py --output missing.jsonl
python scripts/manual_post_grobid/select_low_doi_coverage.py --threshold 0.2 --min-refs 30 --dump-refs-dir low_refs
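
Under the hood, select_low_doi_coverage.py boils down to computing, per OSF ID, the share of references that already carry a DOI. A rough sketch of that calculation follows (the doi attribute name and pagination details are assumptions):

import os
from collections import defaultdict

import boto3

def doi_coverage(table_name: str = "preprint_references") -> dict[str, float]:
    """Fraction of references with a DOI, keyed by osf_id (assumes a `doi` attribute)."""
    table = boto3.resource(
        "dynamodb", endpoint_url=os.getenv("DYNAMO_LOCAL_URL")
    ).Table(table_name)
    totals, with_doi = defaultdict(int), defaultdict(int)
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        for item in page["Items"]:
            totals[item["osf_id"]] += 1
            with_doi[item["osf_id"]] += bool(item.get("doi"))
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return {osf_id: with_doi[osf_id] / totals[osf_id] for osf_id in totals}

# low = {k: v for k, v in doi_coverage().items() if v < 0.2}   # cf. --threshold 0.2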

See scripts/manual_post_grobid/README.md for details on each script.


Manual scripts (no Docker required)

After configuring .env and installing dependencies locally, you can run the post-GROBID steps directly:

python scripts/manual_post_grobid/run_extraction.py --limit 200
python scripts/manual_post_grobid/doi_multi_method_lookup.py --from-db --limit 400 --output doi_multi_method.csv
python scripts/manual_post_grobid/run_forrt_screening.py --limit-lookup 200 --limit 500
python scripts/manual_post_grobid/analyze_doi_sources.py
python scripts/manual_post_grobid/dump_missing_doi_refs.py --output missing.jsonl
python scripts/manual_post_grobid/select_low_doi_coverage.py --threshold 0.2 --min-refs 30 --dump-refs-dir low_refs

Note: doi_multi_method_lookup.py writes a CSV only. To update DynamoDB, use the enrich-references task.

See scripts/manual_post_grobid/README.md for details and extra flags (--dry-run, --sleep, etc.).


One-off Celery tasks

# Download a specific PDF
docker compose run --rm app sh -lc \
  "celery -A osf_sync.celery_app.app call osf_sync.tasks.download_single_pdf --args '[\"<OSF_ID>\"]'"

# Run GROBID for a specific preprint
docker compose run --rm app sh -lc \
  "celery -A osf_sync.celery_app.app call osf_sync.tasks.grobid_single --args '[\"<OSF_ID>\"]'"

Quickstart Recap

# 1. Configure .env (local Dynamo or AWS)
# 2. Build + start services
docker compose build
docker compose up -d

# 3. Initialize DynamoDB tables/GSIs (first run)
docker compose run --rm app python -c "from osf_sync.db import init_db; init_db(); print('Dynamo tables ready')"

# 4. Sync, download PDFs, run GROBID
docker compose run --rm app python -m osf_sync.cli sync-from-date --start 2025-07-01
docker compose run --rm app python -m osf_sync.cli enqueue-pdf --limit 50
docker compose run --rm app python -m osf_sync.cli enqueue-grobid --limit 25

# 5. Parse TEI XML and enrich references
docker compose run --rm app python -m osf_sync.cli enqueue-extraction --limit 200
docker compose run --rm app python -m osf_sync.cli enrich-references --limit 400

Happy syncing!
