Feature/ocr preprocessing error checking #37

Open
bradleyrule wants to merge 5 commits into main from feature/ocr-preprocessing-error-checking
@bradleyrule
Collaborator

Summary

Refactored PDF text extraction to make better use of OCR by improving error checking and OCR pre-processing.

Key Changes

1. Improved error checking for text extraction

  • check_spelling()
    • Uses pyspellchecker to compute the ratio of misspelled words to total words in the extracted text
  • OCR is currently used when more than 5% of the words are misspelled
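
To illustrate the check described above, here is a minimal sketch of a misspelled-word ratio with a 5% cutoff. It is not the code from this PR: the names misspelled_ratio, needs_ocr and find_unknown are hypothetical, and the spell checker is passed in as a callable so the logic can be read independently of the library. With pyspellchecker installed, SpellChecker().unknown could be passed as find_unknown.

```python
import re
from typing import Callable, Iterable, Set

# Hypothetical helper names; the PR's check_spelling() may differ.
# find_unknown takes an iterable of words and returns the set it does not recognize.
UnknownFinder = Callable[[Iterable[str]], Set[str]]


def misspelled_ratio(text: str, find_unknown: UnknownFinder) -> float:
    """Return misspelled words / total words for the given text."""
    # Assumption: only alphabetic tokens count as words.
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        # Assumption: empty text yields 0.0; callers may prefer to treat
        # an empty page as a reason to run OCR anyway.
        return 0.0
    unknown = find_unknown(words)
    misspelled = sum(1 for word in words if word in unknown)
    return misspelled / len(words)


def needs_ocr(text: str, find_unknown: UnknownFinder, threshold: float = 0.05) -> bool:
    """True when the misspelled ratio is strictly above the threshold."""
    return misspelled_ratio(text, find_unknown) > threshold


# Possible usage with pyspellchecker (requires `pip install pyspellchecker`):
#   from spellchecker import SpellChecker
#   needs_ocr(page_text, SpellChecker().unknown)
```

Counting every occurrence of an unknown word, instead of the size of the unknown set, keeps the numerator and denominator in the same units.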

2. Reduced code redundancy

  • Moved repeated code into parse_page_embedded() and parse_page_ocr()
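
A rough sketch of how the two helpers might fit together with the spelling check: try the embedded text first, then fall back to OCR if the error ratio is too high. The function extract_page_text and all of its parameters are hypothetical; the actual signatures of parse_page_embedded() and parse_page_ocr() are not shown in this description, so they are passed in as plain callables.

```python
from typing import Any, Callable


def extract_page_text(
    page: Any,
    parse_embedded: Callable[[Any], str],   # stands in for parse_page_embedded()
    parse_ocr: Callable[[Any], str],        # stands in for parse_page_ocr()
    error_ratio: Callable[[str], float],    # stands in for check_spelling()
    threshold: float = 0.05,
) -> str:
    """Prefer the embedded text layer; use OCR when it looks unreliable."""
    text = parse_embedded(page)
    if error_ratio(text) > threshold:
        # Embedded text has too many misspellings, so re-read the page via OCR.
        return parse_ocr(page)
    return text
```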

3. Improved OCR pre-processing

  • Applies image normalization to improve image quality before OCR
  • Converts the image to grayscale
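
As an illustration of the two pre-processing steps, the sketch below converts an RGB array to grayscale and then stretches its intensity range to 0..255. It is an assumption-laden example, not the PR's implementation: the description does not say which image library or normalization method is used, so this uses plain NumPy, standard luminance weights, and min-max normalization, and the function names are hypothetical.

```python
import numpy as np


def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3+) RGB array to an (H, W) float luminance array."""
    # ITU-R BT.601 luminance weights, a common choice for grayscale conversion.
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3].astype(np.float64) @ weights


def normalize(gray: np.ndarray) -> np.ndarray:
    """Min-max stretch intensities to the full 0..255 range as uint8."""
    lo, hi = float(gray.min()), float(gray.max())
    if hi == lo:
        # A flat image has no contrast to stretch; return all zeros.
        return np.zeros(gray.shape, dtype=np.uint8)
    scaled = (gray - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


def preprocess_for_ocr(rgb: np.ndarray) -> np.ndarray:
    """Grayscale first, then normalize, before handing the image to OCR."""
    return normalize(to_grayscale(rgb))
```

Min-max stretching is sensitive to a single very dark or very bright pixel, so a percentile-based stretch may be worth considering for noisy scans.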
