🔬 DECIMER Image Segmentation 📄

Deep Learning for Chemical Image Recognition - Automated Structure Detection & Extraction

🌐 Try it live at decimer.ai

📚 Table of Contents

📝 Overview
✨ Key Features
🎯 How It Works
⚙️ Installation
🚀 Usage
📊 Model Information
📄 Citation
🙏 Acknowledgements
👥 Authors
🌐 Project Website
🏛️ Research Group

📝 Overview

Unlocking decades of chemical knowledge from scientific literature!

Chemistry has accumulated vast amounts of knowledge about chemical compounds, structures, and properties across countless scientific publications. DECIMER Segmentation is the first open-source, deep learning-based tool designed to automatically recognize and extract chemical structure depictions from scientific documents.

🎯 The Challenge

Optical Chemical Structure Recognition (OCSR), the conversion of chemical structure images into machine-readable formats, is a crucial step in digitizing chemical knowledge. But before structures can be recognized, they must be located and extracted from complex document pages.

💡 The Solution

DECIMER Segmentation uses advanced deep learning to:

🔍 Detect chemical structure depictions in scientific publications
✂️ Extract individual structure images with precision
📚 Process both modern PDFs and scanned historical documents
⚡ Automate the entire workflow from document to segmented structures

✨ Key Features

🤖 Deep Learning Powered: Built on the Mask R-CNN architecture for state-of-the-art detection accuracy
📖 Universal Compatibility: Works with PDFs, scanned pages, and bitmap images from any publisher
🆓 Open Source: Freely available code and pre-trained models for the community
⚡ High Performance: GPU acceleration support for rapid batch processing
🎨 Smart Post-Processing: Automatic mask expansion to capture complete structures
🌐 Web Application: User-friendly interface available at decimer.ai

🎯 How It Works

DECIMER Segmentation employs a sophisticated two-stage workflow:

1️⃣ Detection Stage

📄 Input Document → 🤖 Mask R-CNN Model → 🎭 Structure Masks

The deep learning model analyzes the page and creates precise masks indicating the location of each chemical structure.
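
To make the idea of a mask concrete, the following minimal sketch treats a mask as a 2D grid of booleans and derives the bounding box used to crop one structure from the page. It is a pure-Python illustration only: the helpers `mask_bounding_box` and `crop` are hypothetical names and not part of the DECIMER Segmentation API, and the tiny `page` grid stands in for real image data.

```python
def mask_bounding_box(mask):
    """Return (top, left, bottom, right) of the True pixels in a 2D boolean grid.

    bottom and right are exclusive. Returns None when the mask is empty.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return rows[0], min(cols), rows[-1] + 1, max(cols) + 1


def crop(page, box):
    """Cut the rectangle described by box out of a 2D page grid."""
    top, left, bottom, right = box
    return [row[left:right] for row in page[top:bottom]]


# Toy page: non-zero values play the role of a detected structure.
page = [
    [0, 0, 0, 0, 0],
    [0, 7, 8, 0, 0],
    [0, 9, 6, 0, 0],
    [0, 0, 0, 0, 0],
]
mask = [[value > 0 for value in row] for row in page]
box = mask_bounding_box(mask)
segment = crop(page, box)
```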

2️⃣ Post-Processing Stage

🎭 Initial Masks → 🔧 Expansion Algorithm → ✅ Complete Structures

An intelligent post-processing workflow ensures that potentially incomplete masks are expanded to capture the full structure.
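
The intuition behind expanding an incomplete mask can be sketched as a flood fill: start from the pixels the detector marked and grow the mask along connected ink until the whole drawing is covered. This is a simplified toy under that assumption, not the expansion procedure DECIMER Segmentation actually implements; `expand_mask` and `to_grid` are hypothetical helper names.

```python
from collections import deque


def to_grid(rows):
    """Turn strings like '#..' into a 2D grid of booleans ('#' means True)."""
    return [[char == "#" for char in row] for row in rows]


def expand_mask(foreground, mask):
    """Grow mask to cover every foreground pixel connected to it.

    foreground: 2D grid, True where the page has ink.
    mask: 2D grid, True where the detector marked a structure.
    Returns a new grid; the inputs are not modified.
    """
    height, width = len(foreground), len(foreground[0])
    expanded = [list(row) for row in mask]
    # Seed the search with mask pixels that sit on ink.
    queue = deque(
        (y, x)
        for y in range(height)
        for x in range(width)
        if mask[y][x] and foreground[y][x]
    )
    while queue:
        y, x = queue.popleft()
        # Visit the 8 neighbours and absorb any ink not yet in the mask.
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < height and 0 <= nx < width
                        and foreground[ny][nx] and not expanded[ny][nx]):
                    expanded[ny][nx] = True
                    queue.append((ny, nx))
    return expanded


# The detector only caught the left tip of the drawing; the ink in the
# last column belongs to a separate, unconnected object.
foreground = to_grid(["###.#", "..#.#"])
mask = to_grid(["#....", "....."])
expanded = expand_mask(foreground, mask)
```

In this toy case the mask grows along the connected ink in the first three columns and leaves the unconnected ink in the last column untouched.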

🎨 Visual Workflow

┌─────────────────┐
│  PDF/Image File │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Page Extraction│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Mask R-CNN      │
│ Detection       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Mask Expansion  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Segmented       │
│ Structures      │
└─────────────────┘

⚙️ Installation

🐍 Prerequisites

We strongly recommend using a Conda environment for seamless dependency management.

Install Miniconda (if not already installed)

# Linux
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# macOS (Intel; on Apple silicon use Miniconda3-latest-MacOSX-arm64.sh)
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh

📦 Installation Options

Option 1: Install from GitHub (Development Version)

# Clone the repository
git clone https://github.com/Kohulan/DECIMER-Image-Segmentation.git
cd DECIMER-Image-Segmentation

# Create and activate conda environment
conda create --name DECIMER_IMGSEG python=3.10
conda activate DECIMER_IMGSEG

# Install dependencies
conda install pip
python -m pip install -U pip

# Install DECIMER-Segmentation
pip install .

# Install Poppler (required for PDF processing)
conda install -c conda-forge poppler

Option 2: Install from PyPI (Stable Release)

# Create and activate conda environment
conda create --name DECIMER_IMGSEG python=3.10
conda activate DECIMER_IMGSEG

# Install from PyPI
pip install decimer-segmentation

# Install Poppler (required for PDF processing)
conda install -c conda-forge poppler

🖥️ Hardware Requirements

CPU Mode: Works on any modern CPU
GPU Mode (Recommended): CUDA-compatible GPU with appropriate drivers
- Significantly faster processing
- Essential for batch processing
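
Before starting a large batch it can help to confirm that a GPU is actually visible. The sketch below assumes the environment uses TensorFlow, which the acknowledgements list as a foundation of this project; `gpu_available` is a hypothetical helper, and it returns `None` when TensorFlow cannot be imported so the check never crashes a script.

```python
def gpu_available():
    """Return True/False if TensorFlow can/cannot see a GPU, None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return len(tf.config.list_physical_devices("GPU")) > 0


status = gpu_available()
if status is None:
    print("TensorFlow is not installed in this environment.")
elif status:
    print("GPU detected.")
else:
    print("No GPU detected; falling back to CPU mode.")
```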

🚀 Usage

Command Line Interface

Process entire documents with a single command:

# Segment structures from a PDF or image file
python3 segment_structures_in_document.py your_document.pdf

# Output will be saved in a folder named after your input file
# e.g., your_document/ containing all segmented structures
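
The naming rule in the comment above can be sketched with `pathlib`. This assumes the output folder sits next to the input file and simply drops the file extension; `expected_output_dir` is a hypothetical helper, and the script itself may differ in details.

```python
from pathlib import Path


def expected_output_dir(input_file):
    """Guess the output folder for an input document by dropping its extension."""
    return Path(input_file).with_suffix("")


print(expected_output_dir("papers/your_document.pdf"))
```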

Python API

🎨 Example 1: Segment from Image Array

from decimer_segmentation import segment_chemical_structures
import cv2

# Load your scanned page
page_image = cv2.imread("path/to/scanned_page.png")

# Extract all chemical structures
segments = segment_chemical_structures(page_image, expand=True)

# segments is a list of numpy arrays, each containing a structure
for idx, structure in enumerate(segments):
    cv2.imwrite(f"structure_{idx}.png", structure)
    print(f"✅ Saved structure {idx}")

📄 Example 2: Segment from File (PDF or Image)

from decimer_segmentation import segment_chemical_structures_from_file

# Process a PDF file
segments = segment_chemical_structures_from_file(
    "path/to/document.pdf",
    expand=True
)

# Process an image file
segments = segment_chemical_structures_from_file(
    "path/to/page_image.jpg",
    expand=True
)

print(f"🎉 Extracted {len(segments)} chemical structures!")

🔧 Example 3: Batch Processing

from decimer_segmentation import segment_chemical_structures_from_file
import cv2  # needed for cv2.imwrite below
from pathlib import Path

def batch_segment(input_dir, output_dir):
    """Process multiple PDF files"""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    for pdf_file in Path(input_dir).glob("*.pdf"):
        print(f"📄 Processing {pdf_file.name}...")
        
        segments = segment_chemical_structures_from_file(
            str(pdf_file),
            expand=True
        )
        
        # Save each segment
        file_output_dir = Path(output_dir) / pdf_file.stem
        file_output_dir.mkdir(exist_ok=True)
        
        for idx, segment in enumerate(segments):
            output_path = file_output_dir / f"structure_{idx:03d}.png"
            cv2.imwrite(str(output_path), segment)
        
        print(f"✅ Extracted {len(segments)} structures from {pdf_file.name}")

# Use it
batch_segment("input_pdfs/", "output_structures/")
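
In a long batch run, one unreadable file should not abort the whole job. The following sketch shows one way to collect failures separately from results. The segmentation function is passed in as a parameter, so the example runs with a stand-in; `segment_many` and `fake_segment` are hypothetical names, and in practice `segment_chemical_structures_from_file` would take the stand-in's place.

```python
from pathlib import Path


def segment_many(pdf_paths, segment_fn):
    """Run segment_fn on each path, collecting results and failures separately."""
    results, failures = {}, {}
    for path in pdf_paths:
        path = Path(path)
        try:
            results[path.name] = segment_fn(str(path))
        except Exception as error:  # keep going and report at the end
            failures[path.name] = repr(error)
    return results, failures


# Stand-in for the real segmentation function so the sketch runs
# without loading the model.
def fake_segment(path):
    if path.endswith("broken.pdf"):
        raise ValueError("unreadable file")
    return ["segment-1", "segment-2"]


results, failures = segment_many(["a.pdf", "broken.pdf"], fake_segment)
for name, reason in failures.items():
    print(f"Skipped {name}: {reason}")
```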

🎯 Example 4: Advanced Usage with Custom Parameters

from decimer_segmentation import segment_chemical_structures
import cv2

# Load image
image = cv2.imread("complex_page.png")

# Segment with custom settings
segments = segment_chemical_structures(
    image,
    expand=True,          # Enable mask expansion
    visualization=True    # Generate visualization (if available)
)

# Process results
for idx, segment in enumerate(segments):
    # You can now pass this to DECIMER Image Transformer
    # for structure recognition
    print(f"Structure {idx}: {segment.shape}")
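
Handing segments to a recognition step can be sketched as follows. Because the DECIMER Image Transformer API is not shown in this README, the recognizer is passed in as a plain callable; `recognize_all` and `fake_recognizer` are hypothetical names used only for illustration.

```python
def recognize_all(segments, recognize):
    """Apply a recognizer to every segment and pair each result with its index."""
    return [(idx, recognize(segment)) for idx, segment in enumerate(segments)]


# Stand-in recognizer: a real one would return a SMILES string per image.
def fake_recognizer(segment):
    return f"<structure with {len(segment)} rows>"


toy_segments = [[[0, 1], [1, 0]], [[1, 1, 1]]]
for idx, result in recognize_all(toy_segments, fake_recognizer):
    print(f"Structure {idx}: {result}")
```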

📓 Interactive Tutorial

For more comprehensive examples and interactive demonstrations, see the Jupyter notebook DECIMER_Segmentation_notebook.ipynb in the repository root.

🪟 Notes for Windows Users

Windows-Specific Instructions

1️⃣ Use Anaconda PowerShell Prompt

Run all commands in the Anaconda PowerShell Prompt (not regular Command Prompt or PowerShell).

2️⃣ Install Poppler for PDF Support

PDF processing on Windows requires Poppler. Follow these steps:

Download Poppler
- Visit Poppler for Windows
- Download and extract to a location like C:\Program Files\poppler

Specify Poppler Path in Code

from decimer_segmentation import segment_chemical_structures_from_file

segments = segment_chemical_structures_from_file(
    "document.pdf",
    expand=True,
    poppler_path=r"C:\Program Files\poppler\Library\bin"
)
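
A wrong `poppler_path` is an easy mistake to make, so it can be worth verifying the folder before passing it on. Poppler ships a `pdftoppm` executable in its bin folder, and the sketch below looks for it with `shutil.which`. The helper `find_poppler_bin` and the candidate folders are assumptions for illustration; adjust them to where Poppler was extracted.

```python
import shutil


def find_poppler_bin(candidates):
    """Return the first candidate folder that contains pdftoppm, else None."""
    for folder in candidates:
        if shutil.which("pdftoppm", path=str(folder)):
            return str(folder)
    return None


# Hypothetical install locations; only the first one appears in this README.
poppler_bin = find_poppler_bin([
    r"C:\Program Files\poppler\Library\bin",
    r"C:\Program Files\poppler\bin",
])
if poppler_bin is None:
    print("Poppler not found in the candidate folders.")
else:
    print(f"Using Poppler from {poppler_bin}")
```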

3️⃣ GPU Support on Windows

Ensure you have:

CUDA Toolkit installed
cuDNN libraries configured
Compatible GPU drivers

📊 Model Information

🤖 Pre-trained Model

The pre-trained Mask R-CNN model is publicly available and ready to use.

🎓 Model Architecture

Base Network: Mask R-CNN
Training Data: Diverse chemical literature from multiple publishers
Task: Instance segmentation of chemical structure depictions
Performance: Manually validated on publications from various sources

🔍 Model Performance

The model has been rigorously evaluated on:

✅ Publications from multiple scientific publishers
✅ Documents spanning different time periods
✅ Both modern PDFs and scanned historical pages
✅ Various image qualities and layouts

📄 Citation

If DECIMER Segmentation contributes to your research, please cite:

@article{Rajan2021,
  author = {Rajan, Kohulan and Brinkhaus, Henning Otto and Sorokina, Maria and Zielesny, Achim and Steinbeck, Christoph},
  title = {DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature},
  journal = {Journal of Cheminformatics},
  year = {2021},
  volume = {13},
  number = {20},
  doi = {10.1186/s13321-021-00496-1}
}

Full Citation:
Rajan, K., Brinkhaus, H.O., Sorokina, M. et al. DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature. J Cheminform 13, 20 (2021). https://doi.org/10.1186/s13321-021-00496-1

🙏 Acknowledgements

🌟 Special Thanks

This project wouldn't be possible without the support and contributions from the community and funding organizations.

Contributors
All our amazing contributors who helped improve the codebase

Community
Users providing feedback and reporting issues

Open Source
Projects we build upon: TensorFlow, Mask R-CNN

🌐 Project Website

Experience DECIMER Live!

🚀 Try DECIMER.ai - Web application.

📦 Complete DECIMER Suite

DECIMER Segmentation is part of a comprehensive chemical structure recognition pipeline:

🔍 DECIMER Segmentation (You are here)
Extract chemical structures from documents
🧠 DECIMER Image Transformer
Convert structure images to SMILES strings
🗄️ MARCUS
Molecular Annotation and Recognition for Curating Unravelled Structures

🏛️ Research Group

🎓 Maintained by Kohulan @ Steinbeck Group

Natural Products Cheminformatics Research Group
Institute for Inorganic and Analytical Chemistry
Friedrich Schiller University Jena, Germany

🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request.

📝 Report Bug · 💡 Request Feature · ⭐ Star this repo

Made with ❤️ and ☕ for the global chemistry community

🔬 Advancing Open Science in Chemistry | 🌍 Digitizing Chemical Knowledge | 🤖 Powered by Deep Learning
