Try it live at decimer.ai
- Overview
- Key Features
- How It Works
- Installation
- Usage
- Model Information
- Citation
- Acknowledgements
- Authors
- Project Website
- Research Group
Unlocking decades of chemical knowledge from scientific literature!
Chemistry has accumulated vast amounts of knowledge about chemical compounds, structures, and properties across countless scientific publications. DECIMER Segmentation is the first open-source, deep learning-based tool designed to automatically recognize and extract chemical structure depictions from scientific documents.
Converting images of chemical structures into machine-readable formats (OCSR - Optical Chemical Structure Recognition) is a crucial step in digitizing chemical knowledge. But before we can recognize structures, we need to find and extract them from complex document pages!
DECIMER Segmentation uses advanced deep learning to:
- Detect chemical structure depictions in scientific publications
- Extract individual structure images with precision
- Process both modern PDFs and scanned historical documents
- Automate the entire workflow from document to segmented structures
- Built on the Mask R-CNN architecture for state-of-the-art detection accuracy
- Works with PDFs, scanned pages, and bitmap images from any publisher
- Freely available code and pre-trained models for the community
- GPU acceleration support for rapid batch processing
- Automatic mask expansion to capture complete structures
- User-friendly interface available at decimer.ai
DECIMER Segmentation uses a two-stage workflow:
Input Document → Mask R-CNN Model → Structure Masks
The deep learning model analyzes the page and creates masks indicating the location of each chemical structure.
Initial Masks → Expansion Algorithm → Complete Structures
A post-processing step expands potentially incomplete masks so that they capture the full structure.
PDF/Image File → Page Extraction → Mask R-CNN Detection → Mask Expansion → Segmented Structures
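To make the expansion stage more concrete, here is a minimal sketch of one way an incomplete mask could be grown. It is not DECIMER Segmentation's actual algorithm: it simply assumes that every dark pixel connected to the initial mask belongs to the same structure. The function name `expand_mask` and the list-of-lists inputs are hypothetical and chosen only for illustration.

```python
from collections import deque


def expand_mask(ink, mask):
    """Grow `mask` to cover every ink pixel connected to it.

    `ink` and `mask` are equally sized 2D lists of booleans (hypothetical
    inputs): ink[r][c] is True where the page has a dark pixel, and
    mask[r][c] is True where the detector placed the initial mask.
    """
    rows, cols = len(ink), len(ink[0])
    expanded = [row[:] for row in mask]
    # Seed the flood fill with the ink pixels the initial mask already covers.
    queue = deque(
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if mask[r][c] and ink[r][c]
    )
    while queue:
        r, c = queue.popleft()
        # Visit the 8-connected neighbours of the current pixel.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and ink[nr][nc] and not expanded[nr][nc]):
                    expanded[nr][nc] = True
                    queue.append((nr, nc))
    return expanded


# A horizontal stroke (columns 1-4) plus an unrelated ink pixel (column 7).
ink = [
    [False] * 9,
    [False, True, True, True, True, False, False, True, False],
    [False] * 9,
]
# The initial mask covers only the first half of the stroke.
mask = [
    [False] * 9,
    [False, True, True, False, False, False, False, False, False],
    [False] * 9,
]
expanded = expand_mask(ink, mask)
```

In this toy example the expanded mask covers the whole stroke, while the unconnected pixel in column 7 stays outside it.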
We strongly recommend using a Conda environment for seamless dependency management.
# Linux
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# macOS
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh

Option 1: Install from GitHub (Development Version)
# Clone the repository
git clone https://github.com/Kohulan/DECIMER-Image-Segmentation.git
cd DECIMER-Image-Segmentation
# Create and activate conda environment
conda create --name DECIMER_IMGSEG python=3.10
conda activate DECIMER_IMGSEG
# Install dependencies
conda install pip
python -m pip install -U pip
# Install DECIMER-Segmentation
pip install .
# Install Poppler (required for PDF processing)
conda install -c conda-forge poppler

Option 2: Install from PyPI (Stable Release)
# Create and activate conda environment
conda create --name DECIMER_IMGSEG python=3.10
conda activate DECIMER_IMGSEG
# Install from PyPI
pip install decimer-segmentation
# Install Poppler (required for PDF processing)
conda install -c conda-forge poppler

- CPU Mode: Works on any modern CPU
- GPU Mode (Recommended): CUDA-compatible GPU with appropriate drivers
  - Significantly faster processing
  - Essential for batch processing
Process entire documents with a single command:
# Segment structures from a PDF or image file
python3 segment_structures_in_document.py your_document.pdf
# Output will be saved in a folder named after your input file
# e.g., your_document/ containing all segmented structures

from decimer_segmentation import segment_chemical_structures
import cv2
# Load your scanned page
page_image = cv2.imread("path/to/scanned_page.png")
# Extract all chemical structures
segments = segment_chemical_structures(page_image, expand=True)
# segments is a list of numpy arrays, each containing a structure
for idx, structure in enumerate(segments):
    cv2.imwrite(f"structure_{idx}.png", structure)
    print(f"Saved structure {idx}")

from decimer_segmentation import segment_chemical_structures_from_file
# Process a PDF file
segments = segment_chemical_structures_from_file(
    "path/to/document.pdf",
    expand=True
)
# Process an image file
segments = segment_chemical_structures_from_file(
    "path/to/page_image.jpg",
    expand=True
)
print(f"Extracted {len(segments)} chemical structures!")

from decimer_segmentation import segment_chemical_structures_from_file
import cv2  # needed for cv2.imwrite below
from pathlib import Path

def batch_segment(input_dir, output_dir):
    """Process multiple PDF files"""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for pdf_file in Path(input_dir).glob("*.pdf"):
        print(f"Processing {pdf_file.name}...")
        segments = segment_chemical_structures_from_file(
            str(pdf_file),
            expand=True
        )
        # Save each segment in a subfolder named after the PDF
        file_output_dir = Path(output_dir) / pdf_file.stem
        file_output_dir.mkdir(exist_ok=True)
        for idx, segment in enumerate(segments):
            output_path = file_output_dir / f"structure_{idx:03d}.png"
            cv2.imwrite(str(output_path), segment)
        print(f"Extracted {len(segments)} structures from {pdf_file.name}")

# Use it
batch_segment("input_pdfs/", "output_structures/")

from decimer_segmentation import segment_chemical_structures
import cv2
# Load image
image = cv2.imread("complex_page.png")
# Segment with custom settings
segments = segment_chemical_structures(
image,
    expand=True,  # Enable mask expansion
    visualization=True  # Generate visualization (if available)
)
# Process results
for idx, segment in enumerate(segments):
    # You can now pass this to DECIMER Image Transformer
    # for structure recognition
    print(f"Structure {idx}: {segment.shape}")

For more comprehensive examples and interactive demonstrations, check out our Jupyter Notebook!
Windows-Specific Instructions
Run all commands in the Anaconda PowerShell Prompt (not the regular Command Prompt or PowerShell).
PDF processing on Windows requires Poppler. Follow these steps:
- Download Poppler
  - Visit Poppler for Windows
  - Download and extract to a location like C:\Program Files\poppler
- Specify Poppler Path in Code

from decimer_segmentation import segment_chemical_structures_from_file
segments = segment_chemical_structures_from_file(
    "document.pdf",
    expand=True,
    poppler_path=r"C:\Program Files\poppler\Library\bin"
)
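Before processing PDFs, it can help to confirm that Poppler is actually reachable. The sketch below looks for `pdftoppm`, one of the command-line tools that ship with Poppler, using only the standard library. The helper name `find_pdftoppm` is hypothetical, and treating the presence of `pdftoppm` as a sign that the PDF step will work is an assumption, not something this project documents.

```python
import shutil
from pathlib import Path


def find_pdftoppm(poppler_path=None):
    """Return the path to Poppler's pdftoppm executable, or None if not found."""
    # With `path` set, shutil.which searches that directory; otherwise it
    # searches the directories listed in the PATH environment variable.
    search_path = str(Path(poppler_path)) if poppler_path else None
    return shutil.which("pdftoppm", path=search_path)


location = find_pdftoppm()
if location is None:
    print("pdftoppm not found; install Poppler or pass poppler_path")
else:
    print(f"Poppler found: {location}")
```

On Windows, the same helper can be pointed at the extracted folder, for example `find_pdftoppm(r"C:\Program Files\poppler\Library\bin")`.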
Ensure you have:
- CUDA Toolkit installed
- cuDNN libraries configured
- Compatible GPU drivers
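A quick way to check whether the GPU setup is usable is to ask the deep learning framework which devices it can see. The sketch below assumes the model runs on TensorFlow, which is listed among the projects this tool builds upon; the helper name `visible_gpus` is hypothetical. An empty result means processing would fall back to CPU mode.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see ([] if none or TF is missing)."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
print(f"GPUs visible to TensorFlow: {gpus if gpus else 'none (CPU mode)'}")
```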
The Mask R-CNN model is publicly available and ready to use:
- Base Network: Mask R-CNN
- Training Data: Diverse chemical literature from multiple publishers
- Task: Instance segmentation of chemical structure depictions
- Performance: Manually validated on publications from various sources
The model has been rigorously evaluated on:
- Publications from multiple scientific publishers
- Documents spanning different time periods
- Both modern PDFs and scanned historical pages
- Various image qualities and layouts
If DECIMER Segmentation contributes to your research, please cite:
@article{Rajan2021,
author = {Rajan, Kohulan and Brinkhaus, Henning Otto and Sorokina, Maria and Zielesny, Achim and Steinbeck, Christoph},
title = {DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature},
journal = {Journal of Cheminformatics},
year = {2021},
volume = {13},
number = {20},
doi = {10.1186/s13321-021-00496-1}
}

Full Citation:
Rajan, K., Brinkhaus, H.O., Sorokina, M. et al. DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature. J Cheminform 13, 20 (2021). https://doi.org/10.1186/s13321-021-00496-1
This project wouldn't be possible without the support and contributions from the community and funding organizations.
- Contributors: everyone who helped improve the codebase
- Community: users providing feedback and reporting issues
- Open Source Projects we build upon: TensorFlow, Mask R-CNN
Try DECIMER.ai - Web application.
DECIMER Segmentation is part of a comprehensive chemical structure recognition pipeline:
- DECIMER Segmentation (You are here): Extract chemical structures from documents
- DECIMER Image Transformer: Convert structure images to SMILES strings
- MARCUS: Molecular Annotation and Recognition for Curating Unravelled Structures
Maintained by Kohulan @ Steinbeck Group
Natural Products Cheminformatics Research Group
Institute for Inorganic and Analytical Chemistry
Friedrich Schiller University Jena, Germany
We welcome contributions! Please feel free to submit a Pull Request.
Report Bug · Request Feature · Star this repo
© 2025 Kohulan @ Steinbeck Lab, Friedrich Schiller University Jena
Advancing Open Science in Chemistry | Digitizing Chemical Knowledge | Powered by Deep Learning

