LLM: Field Descriptions + Targeted Search + Restructuring #36

Open
SeanClay10 wants to merge 16 commits into main from feat/llm-improve-handling

@SeanClay10
Collaborator

Summary

Refactored the LLM extraction system to improve accuracy, modularity, and usability. Added direct PDF processing support, expanded field descriptions, targeted search of relevant sections, and diverse examples.

Key Changes

1. Modular Architecture

  • Split script into three modules:
    • models.py: Pydantic data models with validation
    • llm_text.py: Text preprocessing and intelligent section extraction
    • llm_client.py: Main CLI and LLM interaction
  • Removed code duplication by importing shared functions

2. Enhanced Data Models (models.py)

  • Expanded field descriptions with common synonyms and phrasings found in scientific papers
  • Added explicit guidance for distinguishing collection dates from publication dates
  • Improved examples covering edge cases (incomplete data, multi-year studies, minimal fields)
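As a rough illustration of what description-rich fields can look like, here is a minimal sketch. The class name `PredatorRecord` and every field name are hypothetical and are not taken from this PR's models.py; the sketch only assumes Pydantic's `BaseModel`, `Field`, and `ValidationError`.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class PredatorRecord(BaseModel):
    """Hypothetical record; names are illustrative, not the PR's schema."""

    predator_species: Optional[str] = Field(
        None,
        description=(
            "Scientific name of the predator. Papers may phrase this as "
            "'study species', 'consumer', or 'focal species'."
        ),
    )
    collection_year: Optional[int] = Field(
        None,
        ge=1800,
        le=2100,
        description=(
            "Year the samples were collected in the field, e.g. "
            "'sampled during 2014'. This is NOT the publication year."
        ),
    )
    publication_year: Optional[int] = Field(
        None,
        ge=1800,
        le=2100,
        description="Year the paper was published; kept separate on purpose.",
    )


# Incomplete data is allowed: unset fields default to None.
partial = PredatorRecord(predator_species="Gadus morhua")

# Out-of-range values are rejected by the ge/le constraints.
try:
    PredatorRecord(collection_year=1500)
    rejected = False
except ValidationError:
    rejected = True
```

Keeping the collection and publication years as two separately described fields is one way to make the distinction explicit to the model, since the descriptions end up in the schema Pydantic generates.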

3. Direct PDF Support (llm_client.py)

  • Added native PDF reading capability via the pdf_text_extraction module

4. Improved Prompts

  • Added 5 diverse examples showing complete, partial, and edge-case extractions
  • Emphasized publication vs. collection date distinction with concrete examples
  • Better handling of multi-predator studies and missing data
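One way few-shot examples like these could be assembled into a prompt is sketched below. The example texts, the field names, and the `build_prompt` helper are all hypothetical and only illustrate the idea of pairing excerpts with complete, partial, and empty extractions; they are not the prompts used in this PR.

```python
import json

# Hypothetical few-shot examples: (paper excerpt, expected extraction).
EXAMPLES = [
    # Publication year (2021) must be ignored; collection years are kept.
    ("Published 2021. Specimens were collected in 2017 and 2018.",
     {"collection_years": [2017, 2018], "predator_species": None}),
    # Missing data: nothing is guessed.
    ("No sampling dates are reported.",
     {"collection_years": [], "predator_species": None}),
]


def build_prompt(paper_text, examples=EXAMPLES):
    """Join instructions, worked examples, and the target text into one prompt."""
    parts = [
        "Extract the fields as JSON. Use null or an empty list for anything "
        "the text does not state.",
        "Report collection dates, never the publication date.",
    ]
    for excerpt, expected in examples:
        parts.append("Text: " + excerpt)
        parts.append("JSON: " + json.dumps(expected))
    # The prompt ends with an open "JSON:" slot for the model to complete.
    parts.append("Text: " + paper_text)
    parts.append("JSON:")
    return "\n".join(parts)


prompt = build_prompt("Cod were sampled in 2019.")
```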

5. Intelligent Section Extraction (llm_text.py)

  • Smart prioritization of key sections (Abstract > Results > Methods > Tables)
  • Automatic filtering of irrelevant content (References, Acknowledgements)
  • Budget-aware text extraction to fit within LLM context windows
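The prioritization, filtering, and budgeting described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in llm_text.py: the function names, the heading regex, and the character-based budget are all hypothetical.

```python
import re

# Priority order taken from the list above (lower number = kept first).
SECTION_PRIORITY = {"abstract": 0, "results": 1, "methods": 2, "tables": 3}
# Sections dropped entirely before budgeting.
SKIPPED_SECTIONS = {"references", "acknowledgements", "acknowledgments"}

# A heading is a line holding only letters/spaces, optionally numbered ("1. Methods").
HEADING_RE = re.compile(r"^\s*(?:\d+\.?\s+)?([A-Za-z][A-Za-z ]*?)\s*$")


def split_sections(text):
    """Split plain text into (heading, body) pairs on known heading lines.

    Text before the first known heading is discarded in this sketch.
    """
    known = set(SECTION_PRIORITY) | SKIPPED_SECTIONS
    sections, heading, body = [], None, []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        name = match.group(1).lower() if match else None
        if name in known:
            if heading is not None:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = name, []
        elif heading is not None:
            body.append(line)
    if heading is not None:
        sections.append((heading, "\n".join(body).strip()))
    return sections


def select_sections(text, char_budget):
    """Keep the highest-priority sections that fit within char_budget."""
    candidates = [(h, b) for h, b in split_sections(text) if h in SECTION_PRIORITY]
    candidates.sort(key=lambda pair: SECTION_PRIORITY[pair[0]])
    chosen, remaining = [], char_budget
    for heading, body in candidates:
        if remaining <= 0:
            break
        snippet = body[:remaining]  # truncate the last section that fits partially
        chosen.append((heading, snippet))
        remaining -= len(snippet)
    return chosen
```

A character budget is used here only to keep the sketch dependency-free; a token-based budget would track the LLM context window more closely.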

Contributors: Sean Clayton, Raymond Cen
