Refactor: Modernize Codebase Architecture & Implement Full-Link Async I/O #690

Closed
wchiways wants to merge 5 commits into dataabc:master from wchiways:master

Conversation

@wchiways
Contributor

@wchiways wchiways commented Feb 4, 2026

Architecture Overview

graph TD
    subgraph "Sync"
        A[Spider] -->|Blocking| B(Downloader)
        A -->|Blocking| C(Writer)
        C --> D[File/DB]
        style A fill:#f9f,stroke:#333,stroke-width:2px
    end

    subgraph "Async"
        AA[Spider] -->|Await| BB(Downloader)
        AA -->|Await| CC(Writer)
        CC -->|Non-Blocking| DD[File/DB/API]
        style AA fill:#bbf,stroke:#333,stroke-width:2px
    end

Configuration Validation Flow

sequenceDiagram
    participant User
    participant Spider
    participant SpiderConfig
    participant Pydantic
    participant DateTimeUtil

    User->>Spider: Run with config.json
    Spider->>SpiderConfig: Load Dict
    SpiderConfig->>Pydantic: Validate Fields
    Pydantic->>DateTimeUtil: is_valid_date(date_str)
    alt Invalid
        DateTimeUtil-->>Pydantic: False
        Pydantic-->>Spider: ValidationError
        Spider-->>User: Exit with Error Message
    else Valid
        DateTimeUtil-->>Pydantic: True
        Pydantic-->>SpiderConfig: Valid Config Object
        SpiderConfig-->>Spider: Ready to Scrape
    end

PR 1: Modernize Path Handling and Formatting

Title: refactor: replace os.path with pathlib and use f-strings

Description:
Improved code readability and cross-platform compatibility by modernizing file path handling and string formatting.

  • Pathlib Adoption: Replaced legacy os.path calls and string concatenation for file paths with pathlib.Path objects from the standard library. Path handles separators and joining per platform, so the same code works on Windows, macOS, and Linux.
  • f-strings: Replaced older % string formatting with Python 3.6+ f-strings for better performance and cleaner syntax.
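
As a rough illustration of the two changes above, the sketch below builds an output path with pathlib and formats a message with an f-string. The helper name build_output_path, the "weibo" directory, and the sample user ID are hypothetical and are not taken from the PR.

```python
import tempfile
from pathlib import Path


def build_output_path(base_dir: Path, user_id: str, suffix: str) -> Path:
    """Return <base_dir>/weibo/<user_id>.<suffix>, creating the parent directory.

    Hypothetical helper: the directory layout is an assumption for this sketch.
    """
    # The / operator joins path segments with the correct separator per platform,
    # replacing os.path.join or manual string concatenation.
    target = base_dir / "weibo" / f"{user_id}.{suffix}"
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = build_output_path(Path(tmp), "1234567890", "csv")
        path.write_text("id,content\n", encoding="utf-8")
        # f-string instead of "%s" % (...) formatting
        print(f"Wrote {path.name} into {path.parent.name}/")
```

The same helper written with os.path would need os.path.join, os.makedirs, and a separate string-formatting step for the file name.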

PR 2: Modernize Data Structures

Title: refactor: use dataclasses for User/Weibo and introduce SpiderConfig

Description:
Transitioned core data structures to modern Python standards to reduce boilerplate and improve maintainability.

  • Dataclasses: Refactored User and Weibo classes to use the standard library dataclasses module. This simplifies class definitions while retaining memory efficiency via slots=True (a dataclass option that requires Python 3.10 or later).
  • SpiderConfig: Introduced a structured SpiderConfig class to encapsulate configuration parameters, replacing raw dictionary usage for better type hints and code clarity.

PR 3: Modernize Configuration and Eliminate Code Duplication

Title: refactor: eliminate code duplication and enhance config validation with pydantic

Description:
This update focuses on improving the robustness and maintainability of the configuration management system.

  • Pydantic Migration: Replaced the previous dataclass implementation of SpiderConfig with a Pydantic BaseModel. This provides automatic type coercion, field-level validation, and better IDE support.
  • Centralized Validation: Eliminated code duplication by moving date validation logic into a single source of truth: weibo_spider/datetime_util.py. The redundant _is_date functions in config.py and config_util.py were removed.
  • Improved Error Handling: Updated spider.py to catch pydantic.ValidationError during startup, providing clean, human-readable error messages for configuration issues instead of generic stack traces.
  • Unit Testing: Added tests/test_config.py and tests/test_datetime_util.py to ensure the new validation logic and utilities are correct and protected against regressions.

PR 4: Full-Link Async I/O for Writers

Title: feat: implement full-link async I/O for writers

Description:
This update completes the transition to a fully asynchronous architecture by removing blocking I/O operations from the data writing pipeline.

  • Async Writer Interface: Converted the Writer abstract base class and all 8 implementations (Txt, Csv, Json, Mongo, MySql, Sqlite, Kafka, Post) to support async/await.
  • Asynchronous File I/O: Integrated the aiofiles library into TxtWriter, CsvWriter, and JsonWriter. This ensures that writing large amounts of data to disk does not block the asyncio event loop.
  • Asynchronous Network I/O: Migrated PostWriter from the synchronous requests library to aiohttp, allowing for non-blocking API notifications.
  • Spider Integration: Refactored the Spider class to await all writer operations, ensuring the scraper remains responsive even during heavy I/O tasks.
  • Dependency Update: Added aiofiles to requirements.txt.
  • Async Verification: Introduced tests/test_writers_async.py with mock-based testing to verify the new asynchronous writing logic.
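
To make the async writer interface concrete, here is a small sketch using only the standard library. It is not the PR's implementation: asyncio.to_thread stands in for aiofiles so the example runs without extra dependencies, the method name write_weibo and the MemoryWriter class are hypothetical, and running the writers concurrently with asyncio.gather is an illustrative choice rather than something the PR states.

```python
import asyncio
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Writer(ABC):
    """Async writer interface (method name is an assumption)."""

    @abstractmethod
    async def write_weibo(self, weibos: list[str]) -> None: ...


class TxtWriter(Writer):
    def __init__(self, path: Path) -> None:
        self.path = path

    def _append(self, text: str) -> None:
        # Ordinary blocking file write, executed off the event loop
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(text)

    async def write_weibo(self, weibos: list[str]) -> None:
        # Stand-in for aiofiles: run the blocking write in a worker thread
        await asyncio.to_thread(self._append, "".join(f"{w}\n" for w in weibos))


class MemoryWriter(Writer):
    """Hypothetical in-memory writer, handy for tests."""

    def __init__(self) -> None:
        self.rows: list[str] = []

    async def write_weibo(self, weibos: list[str]) -> None:
        self.rows.extend(weibos)


async def write_all(writers: list[Writer], weibos: list[str]) -> None:
    # Await every writer; gather lets them make progress concurrently
    await asyncio.gather(*(w.write_weibo(weibos) for w in writers))


async def main() -> tuple[str, list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        txt = TxtWriter(Path(tmp) / "weibo.txt")
        mem = MemoryWriter()
        await write_all([txt, mem], ["first post", "second post"])
        return txt.path.read_text(encoding="utf-8"), mem.rows


if __name__ == "__main__":
    asyncio.run(main())
```

Because every implementation exposes the same coroutine, the caller can await file, database, and HTTP writers uniformly, and a mock-style writer such as MemoryWriter can replace real I/O in tests.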

wchiways and others added 5 commits February 4, 2026 23:24
…th pydantic

- move date validation to datetime_util.is_valid_date
- implement SpiderConfig using pydantic for robust validation
- catch ValidationError in spider.py for user-friendly error reporting
- add unit tests for config and datetime utilities

- Update Writer base class to use async/await
- Integrate aiofiles for asynchronous file writing (txt, csv, json)
- Migrate PostWriter to use aiohttp for async network requests
- Update Spider to await writer methods
- Add async writer tests

Fix the error occurring during the build process
@wchiways
Contributor Author

wchiways commented Feb 5, 2026

This PR was refactored against Python 3.10 and may not be compatible with lower versions, so I have decided to close it.

@wchiways wchiways closed this Feb 5, 2026