Steinbeck-Lab/DECIMER-Image-Segmentation

🔬 DECIMER Image Segmentation 📄

Deep Learning for Chemical Image Recognition - Automated Structure Detection & Extraction

🌐 Try it live at decimer.ai


📝 Overview

Unlocking decades of chemical knowledge from scientific literature!

Chemistry has accumulated vast amounts of knowledge about chemical compounds, structures, and properties across countless scientific publications. DECIMER Segmentation is the first open-source, deep learning-based tool designed to automatically recognize and extract chemical structure depictions from scientific documents.

🎯 The Challenge

Optical Chemical Structure Recognition (OCSR), the conversion of images of chemical structures into machine-readable formats, is a crucial step in digitizing chemical knowledge. But before we can recognize structures, we need to find and extract them from complex document pages!

💡 The Solution

DECIMER Segmentation uses advanced deep learning to:

  • 🔍 Detect chemical structure depictions in scientific publications
  • ✂️ Extract individual structure images with precision
  • 📚 Process both modern PDFs and scanned historical documents
  • ⚡ Automate the entire workflow from document to segmented structures

✨ Key Features

🤖 Deep Learning Powered

Built on the Mask R-CNN architecture for state-of-the-art detection accuracy

📖 Universal Compatibility

Works with PDFs, scanned pages, and bitmap images from any publisher

🆓 Open Source

Freely available code and pre-trained models for the community

⚡ High Performance

GPU acceleration support for rapid batch processing

🎨 Smart Post-Processing

Automatic mask expansion to capture complete structures

🌐 Web Application

User-friendly interface available at decimer.ai


🎯 How It Works

DECIMER Segmentation employs a sophisticated two-stage workflow:

1️⃣ Detection Stage

📄 Input Document → 🤖 Mask R-CNN Model → 🎭 Structure Masks

The deep learning model analyzes the page and creates precise masks indicating the location of each chemical structure.
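To illustrate how a mask relates to an extracted image, here is a minimal pure-Python sketch. It is not the code DECIMER Segmentation runs: it assumes a mask given as rows of booleans and a page given as rows of pixel values, and the helper names `mask_bounding_box` and `crop` are hypothetical.

```python
def mask_bounding_box(mask):
    """Return (top, left, bottom, right) of the True pixels, or None if empty.

    bottom and right are exclusive, so page[top:bottom] slices the region.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return rows[0], min(cols), rows[-1] + 1, max(cols) + 1


def crop(page, box):
    """Cut the boxed region out of a page given as a list of pixel rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in page[top:bottom]]
```

Each detected structure would yield one such mask, so one page can produce several crops.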

2️⃣ Post-Processing Stage

🎭 Initial Masks → 🔧 Expansion Algorithm → ✅ Complete Structures

An intelligent post-processing workflow ensures that potentially incomplete masks are expanded to capture the full structure.
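The idea of expanding an incomplete mask can be sketched in a few lines of pure Python. This is an illustration under simplifying assumptions, not the expansion algorithm DECIMER Segmentation implements: it assumes a binarized page (True means ink) and grows a seed mask over every ink pixel that is 8-connected to it. The function name `expand_mask` and the list-of-rows data layout are hypothetical.

```python
from collections import deque


def expand_mask(ink, seed_mask):
    """Grow seed_mask over all ink pixels 8-connected to it.

    ink, seed_mask: lists of rows of booleans with identical shape.
    Returns a new mask; the inputs are not modified.
    """
    height, width = len(ink), len(ink[0])
    expanded = [row[:] for row in seed_mask]
    # Start the flood fill from ink pixels already covered by the mask.
    queue = deque(
        (y, x)
        for y in range(height)
        for x in range(width)
        if seed_mask[y][x] and ink[y][x]
    )
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (
                    0 <= ny < height
                    and 0 <= nx < width
                    and ink[ny][nx]
                    and not expanded[ny][nx]
                ):
                    expanded[ny][nx] = True
                    queue.append((ny, nx))
    return expanded
```

In this toy version, a bond line that sticks out of the initial mask is recovered because its pixels touch pixels inside the mask, while unconnected ink elsewhere on the page is left out.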

🎨 Visual Workflow

┌─────────────────┐
│  PDF/Image File │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Page Extraction│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Mask R-CNN      │
│ Detection       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Mask Expansion  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Segmented       │
│ Structures      │
└─────────────────┘

⚙️ Installation

🐍 Prerequisites

We strongly recommend using a Conda environment for seamless dependency management.

Install Miniconda (if not already installed)

# Linux
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

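# macOS (Intel; on Apple Silicon use Miniconda3-latest-MacOSX-arm64.sh)
# curl is used here because wget is not installed on macOS by default
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh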

📦 Installation Options

Option 1: Install from GitHub (Development Version)

# Clone the repository
git clone https://github.com/Kohulan/DECIMER-Image-Segmentation.git
cd DECIMER-Image-Segmentation

# Create and activate conda environment
conda create --name DECIMER_IMGSEG python=3.10
conda activate DECIMER_IMGSEG

# Install dependencies
conda install pip
python -m pip install -U pip

# Install DECIMER-Segmentation
pip install .

# Install Poppler (required for PDF processing)
conda install -c conda-forge poppler

Option 2: Install from PyPI (Stable Release)

# Create and activate conda environment
conda create --name DECIMER_IMGSEG python=3.10
conda activate DECIMER_IMGSEG

# Install from PyPI
pip install decimer-segmentation

# Install Poppler (required for PDF processing)
conda install -c conda-forge poppler
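After either option, a quick sanity check can confirm that the environment is usable. The sketch below relies only on the standard library. It assumes the package is importable as `decimer_segmentation` (the name used in the Python examples in this README) and that finding Poppler's `pdftoppm` executable on the PATH is a reasonable indicator that Poppler is installed. The helper name `check_environment` is hypothetical.

```python
import importlib.util
import shutil


def check_environment(module_name="decimer_segmentation", tool="pdftoppm"):
    """Report whether the Python package and a Poppler tool can be found."""
    return {
        "package_found": importlib.util.find_spec(module_name) is not None,
        "poppler_found": shutil.which(tool) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

Run it inside the activated DECIMER_IMGSEG environment; a "no" for either entry points to the installation step that needs repeating.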

🖥️ Hardware Requirements

  • CPU Mode: Works on any modern CPU
  • GPU Mode (Recommended): CUDA-compatible GPU with appropriate drivers
    • Significantly faster processing
    • Essential for batch processing
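To see which mode applies on a given machine, one option is to ask TensorFlow how many GPUs it can see. This is a sketch that assumes the installed release runs on TensorFlow 2.x, as the acknowledgements in this README suggest; the helper name `count_visible_gpus` is hypothetical.

```python
def count_visible_gpus():
    """Return the number of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return len(tf.config.list_physical_devices("GPU"))


if __name__ == "__main__":
    gpus = count_visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment.")
    elif gpus == 0:
        print("No GPU visible; segmentation would run on the CPU.")
    else:
        print(f"{gpus} GPU(s) visible to TensorFlow.")
```

A count of zero on a machine with a CUDA-compatible card usually means the drivers or CUDA libraries are not visible to TensorFlow.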

🚀 Usage

Command Line Interface

Process entire documents with a single command:

# Segment structures from a PDF or image file
python3 segment_structures_in_document.py your_document.pdf

# Output will be saved in a folder named after your input file
# e.g., your_document/ containing all segmented structures
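The output location described above can be inspected from Python. The sketch below assumes the folder sits next to the input file and carries the input file's name without its extension, as the comment above describes. It makes no assumption about how the individual image files are named and simply lists whatever files it finds. Both helper names are hypothetical.

```python
from pathlib import Path


def expected_output_dir(input_file):
    """Folder named after the input file, e.g. 'your_document.pdf' -> 'your_document'."""
    return Path(input_file).with_suffix("")


def list_segment_files(input_file):
    """Return the sorted files found in the expected output folder (empty if absent)."""
    out_dir = expected_output_dir(input_file)
    if not out_dir.is_dir():
        return []
    return sorted(p for p in out_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    for path in list_segment_files("your_document.pdf"):
        print(path)
```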

Python API

🎨 Example 1: Segment from Image Array

from decimer_segmentation import segment_chemical_structures
import cv2

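# Load your scanned page (cv2.imread returns None if the file cannot be read)
page_image = cv2.imread("path/to/scanned_page.png")
if page_image is None:
    raise FileNotFoundError("Could not read path/to/scanned_page.png")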

# Extract all chemical structures
segments = segment_chemical_structures(page_image, expand=True)

# segments is a list of numpy arrays, each containing a structure
for idx, structure in enumerate(segments):
    cv2.imwrite(f"structure_{idx}.png", structure)
    print(f"✅ Saved structure {idx}")

📄 Example 2: Segment from File (PDF or Image)

from decimer_segmentation import segment_chemical_structures_from_file

# Process a PDF file
segments = segment_chemical_structures_from_file(
    "path/to/document.pdf",
    expand=True
)

# Process an image file
segments = segment_chemical_structures_from_file(
    "path/to/page_image.jpg",
    expand=True
)

print(f"🎉 Extracted {len(segments)} chemical structures!")

🔧 Example 3: Batch Processing

from decimer_segmentation import segment_chemical_structures_from_file
import cv2  # needed for cv2.imwrite below
from pathlib import Path

def batch_segment(input_dir, output_dir):
    """Process multiple PDF files"""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    for pdf_file in Path(input_dir).glob("*.pdf"):
        print(f"📄 Processing {pdf_file.name}...")
        
        segments = segment_chemical_structures_from_file(
            str(pdf_file),
            expand=True
        )
        
        # Save each segment
        file_output_dir = Path(output_dir) / pdf_file.stem
        file_output_dir.mkdir(exist_ok=True)
        
        for idx, segment in enumerate(segments):
            output_path = file_output_dir / f"structure_{idx:03d}.png"
            cv2.imwrite(str(output_path), segment)
        
        print(f"✅ Extracted {len(segments)} structures from {pdf_file.name}")

# Use it
batch_segment("input_pdfs/", "output_structures/")

🎯 Example 4: Advanced Usage with Custom Parameters

from decimer_segmentation import segment_chemical_structures
import cv2

# Load image
image = cv2.imread("complex_page.png")

# Segment with custom settings
segments = segment_chemical_structures(
    image,
    expand=True,          # Enable mask expansion
    visualization=True    # Generate visualization (if available)
)

# Process results
for idx, segment in enumerate(segments):
    # You can now pass this to DECIMER Image Transformer
    # for structure recognition
    print(f"Structure {idx}: {segment.shape}")

📓 Interactive Tutorial

For more comprehensive examples and interactive demonstrations, check out our Jupyter Notebook!


🪟 Notes for Windows Users

Windows-Specific Instructions

1️⃣ Use Anaconda PowerShell Prompt

Run all commands in the Anaconda PowerShell Prompt (not the regular Command Prompt or PowerShell).

2️⃣ Install Poppler for PDF Support

PDF processing on Windows requires Poppler. Follow these steps:

  1. Download Poppler

  2. Specify Poppler Path in Code

    from decimer_segmentation import segment_chemical_structures_from_file
    
    segments = segment_chemical_structures_from_file(
        "document.pdf",
        expand=True,
        poppler_path=r"C:\Program Files\poppler\Library\bin"
    )
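Because the right `poppler_path` depends on where the Poppler archive was unpacked, it can help to check a few candidate folders before calling the segmentation function. The sketch below uses only the standard library. The helper name `first_poppler_dir` is hypothetical, the second candidate path is an invented example, and the check assumes the folder should contain `pdftoppm.exe`, one of the executables Poppler ships.

```python
from pathlib import Path


def first_poppler_dir(candidates, executable="pdftoppm.exe"):
    """Return the first candidate directory containing the executable, else None."""
    for candidate in candidates:
        if (Path(candidate) / executable).is_file():
            return str(candidate)
    return None


poppler_path = first_poppler_dir([
    r"C:\Program Files\poppler\Library\bin",  # path from the example above
    r"C:\poppler\Library\bin",                # hypothetical alternative location
])
print(poppler_path or "Poppler not found; check where the archive was unpacked.")
```

The returned string can be passed as the `poppler_path` argument shown in the example above.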

3️⃣ GPU Support on Windows

Ensure you have:

  • CUDA Toolkit installed
  • cuDNN libraries configured
  • Compatible GPU drivers

📊 Model Information

🤖 Pre-trained Model

The pre-trained Mask R-CNN model is publicly available and ready to use. It is archived under its own DOI, which is linked from the repository.

🎓 Model Architecture

  • Base Network: Mask R-CNN
  • Training Data: Diverse chemical literature from multiple publishers
  • Task: Instance segmentation of chemical structure depictions
  • Performance: Manually validated on publications from various sources

🔍 Model Performance

The model has been rigorously evaluated on:

  • ✅ Publications from multiple scientific publishers
  • ✅ Documents spanning different time periods
  • ✅ Both modern PDFs and scanned historical pages
  • ✅ Various image qualities and layouts

📄 Citation

If DECIMER Segmentation contributes to your research, please cite:

@article{Rajan2021,
  author = {Rajan, Kohulan and Brinkhaus, Henning Otto and Sorokina, Maria and Zielesny, Achim and Steinbeck, Christoph},
  title = {DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature},
  journal = {Journal of Cheminformatics},
  year = {2021},
  volume = {13},
  number = {20},
  doi = {10.1186/s13321-021-00496-1}
}

Full Citation:
Rajan, K., Brinkhaus, H.O., Sorokina, M. et al. DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature. J Cheminform 13, 20 (2021). https://doi.org/10.1186/s13321-021-00496-1


🙏 Acknowledgements

🌟 Special Thanks

This project wouldn't be possible without the support and contributions from the community and funding organizations.

  • Contributors: All our amazing contributors who helped improve the codebase
  • Community: Users providing feedback and reporting issues
  • Open Source: The projects we build upon (TensorFlow, Mask R-CNN)

🌐 Project Website

Experience DECIMER Live!

DECIMER.ai

🚀 Try DECIMER.ai - Web application.

📦 Complete DECIMER Suite

DECIMER Segmentation is part of a comprehensive chemical structure recognition pipeline:

  1. 🔍 DECIMER Segmentation (You are here)
    Extract chemical structures from documents

  2. 🧠 DECIMER Image Transformer
    Convert structure images to SMILES strings

  3. 🗄️ MARCUS
    Molecular Annotation and Recognition for Curating Unravelled Structures


🏛️ Research Group

🎓 Maintained by Kohulan @ Steinbeck Group

Cheminformatics Group

Natural Products Cheminformatics Research Group
Institute for Inorganic and Analytical Chemistry
Friedrich Schiller University Jena, Germany



🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request.

📝 Report Bug · 💡 Request Feature · ⭐ Star this repo


Made with ❤️ and ☕ for the global chemistry community

© 2025 Kohulan @ Steinbeck Lab, Friedrich Schiller University Jena


🔬 Advancing Open Science in Chemistry | 🌍 Digitizing Chemical Knowledge | 🤖 Powered by Deep Learning
