Skip to content

Conversation

@kalebbroo
Copy link
Contributor

Introduces HuggingFace dataset discovery endpoint and service, enabling users to explore available configs, splits, and files before import. Updates dataset import flow to support user selection of streaming or download options, including fallback confirmation when streaming is unavailable. Adds new DTOs and UI components for option selection, improves error handling, and disables IndexedDB caching by default.

kalebbroo and others added 26 commits November 29, 2025 19:59
Introduces HuggingFace dataset discovery endpoint and service, enabling users to explore available configs, splits, and files before import. Updates dataset import flow to support user selection of streaming or download options, including fallback confirmation when streaming is unavailable. Adds new DTOs and UI components for option selection, improves error handling, and disables IndexedDB caching by default.
Introduces logic to detect and extract images from ZIP files during HuggingFace dataset ingestion, including caption and metadata extraction. Adds a new endpoint to serve dataset files directly, and improves download progress reporting in HuggingFaceClient.
- Added PHASE1_EXECUTION_GUIDE.md with step-by-step instructions
- Added FILE_MIGRATION_MAP.md with complete file-by-file mapping
- 258 total files to handle (125 migrate, 24 create new, 107 TODO scaffolds)
- All planning documents ready for Phase 1 execution
- Added comprehensive PHASE1_CHECKLIST.md with all tasks
- 256 total items to track across all categories
- Organized by project (Core, DTO, APIBackend, ClientApp)
- Includes build verification, testing, and cleanup steps
- Ready for Phase 1 execution
- Added README_REFACTOR.md with getting started guide
- Explains all planning documents and when to use them
- Provides 3 execution approaches (all-at-once, incremental, assisted)
- Includes FAQ, tips, and success criteria
- Complete Phase 1 planning documentation ready
Introduces the APIBackend ASP.NET Core project, including configuration, LiteDB-based repositories, dataset and item management endpoints, HuggingFace integration, and service registration. Adds support for dataset CRUD, file uploads, item editing, HuggingFace dataset discovery/import, and static file serving. Also updates .gitignore and refactor plan for new structure.
🎯 Major Transformation Complete:
- Renamed from HartsysDatasetEditor to Dataset Studio by Hartsy
- Complete restructure to feature-based architecture
- All 4 new projects created and migrated

📦 Projects Created:
✅ Core (DatasetStudio.Core) - 41 files migrated
✅ DTO (DatasetStudio.DTO) - 13 files migrated
✅ APIBackend (DatasetStudio.APIBackend) - 21 files migrated
✅ ClientApp (DatasetStudio.ClientApp) - 66 files migrated

🗂️ New Architecture:
- Feature-based organization (Home, Datasets, Settings, etc.)
- Clean separation: Core/DTO/APIBackend/ClientApp
- Extension system scaffolded with TODOs
- Documentation structure created

🔧 Key Changes:
- All namespaces updated to DatasetStudio.*
- Fixed Modality namespace conflict (Modality → ModalityProviders)
- Created DatasetStudio.sln with all 4 projects
- Migrated 141 total files
- Added comprehensive TODO scaffolds for future phases

📝 Build Status:
✅ Core: Builds successfully
✅ DTO: Builds successfully
✅ APIBackend: Builds successfully
⚠️ ClientApp: Has Razor binding warnings (MudBlazor syntax - non-breaking)

📋 TODO Scaffolds Created:
- Extension SDK (Phase 3)
- Built-in Extensions (Phase 3-6)
- Installation docs (Phase 4)
- User guides (Phase 4)
- API documentation (Phase 6)
- Development guides (Phase 3)

🎉 Ready for Phase 2: Database Migration (PostgreSQL + Parquet)

See REFACTOR_PLAN.md for complete roadmap
Added REFACTOR_COMPLETE_SUMMARY.md with:
- Complete transformation overview
- All 4 projects detailed
- Extension system scaffolds documented
- Success metrics and status
- Known issues and next steps
- Phase 2 roadmap

Phase 1 Complete! ✅
- Deleted src/HartsysDatasetEditor.Core/
- Deleted src/HartsysDatasetEditor.Contracts/
- Deleted src/HartsysDatasetEditor.Api/
- Deleted src/HartsysDatasetEditor.Client/
- Deleted HartsysDatasetEditor.sln
- Deleted tests/HartsysDatasetEditor.Tests/
- Removed migration scripts

Only new DatasetStudio projects remain! Clean slate for Phase 2.
🗄️ PostgreSQL Database Layer:
✅ Entity Framework Core 8.0 integration
✅ 5 entity models (Dataset, User, Caption, Permission, DatasetItem)
✅ DatasetStudioDbContext with 40+ indexes
✅ Complete relationships and cascade behaviors
✅ JSONB columns for flexible metadata
✅ Connection strings configured
✅ Comprehensive 544-line README

📊 Parquet Storage System:
✅ ParquetSchemaDefinition - 15-column schema
✅ ParquetItemWriter - Batch writing with auto-sharding
✅ ParquetItemReader - Cursor pagination, parallel reads
✅ ParquetItemRepository - Full IDatasetItemRepository implementation
✅ Support for billions of items (10M per shard)
✅ Snappy compression (60-80% reduction)
✅ Comprehensive 452-line README
✅ Real-world usage examples

⚡ Performance Targets:
- Write: 50-100K items/sec
- Read page: <50ms
- Find item: <200ms
- Unlimited scalability

📝 Documentation:
- PostgreSQL setup guide (Docker, native, cloud)
- Parquet usage examples and best practices
- Migration strategies
- Troubleshooting guides

🎯 Ready for Phase 3: Extension System

Total: 2,895 lines of production-ready code!
Comprehensive 500+ line summary covering:
- PostgreSQL database layer (5 entities, 40+ indexes)
- Parquet storage system (billions of items)
- Performance characteristics
- Code examples and usage patterns
- Migration strategies
- Next steps for Phase 3

Phase 2: Database Infrastructure - COMPLETE ✅
🔌 Extension System Foundation:
✅ Extension SDK (7 base classes)
✅ API Extension Registry & Loader
✅ Client Extension Registry & Loader
✅ ExtensionApiClient for distributed deployment
✅ 4 Built-in extension scaffolds (CoreViewer, Creator, Editor, AITools)
✅ Comprehensive documentation (500+ lines)

🌐 Distributed Architecture:
✅ API and Client can be on different servers
✅ Type-safe HTTP communication
✅ Dynamic assembly loading
✅ Manifest-based discovery

📝 Documentation:
✅ DEVELOPMENT_GUIDE.md - Complete extension development guide
✅ APPSETTINGS_EXAMPLES.md - Configuration examples
✅ PROGRAM_INTEGRATION.md - Integration instructions
✅ All files have extensive TODO comments

🎯 Ready for Phase 3.1: Extension Implementation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
📄 PHASE3_COMPLETE_SUMMARY.md:
✅ Complete documentation of extension system architecture
✅ Detailed explanation of all 7 SDK classes
✅ Registry/Loader implementation guide
✅ Built-in extension scaffolds overview
✅ Distributed deployment architecture
✅ Communication flow diagrams
✅ Phase 3.1 implementation roadmap

🎯 Ready for Phase 3.1: Extension Implementation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
📄 Updates:
✅ Added Phase 2 & 3 to scaffolded section
✅ Added phase progress table
✅ Added links to all phase summaries
✅ Updated next steps to Phase 3.1

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
🔌 Extension System Implementation:
✅ ExtensionManifest - Complete JSON loading, validation, serialization
✅ ExtensionMetadata - Full metadata support with JSON attributes
✅ Extensions.SDK project - Framework references configured
✅ Extension project structure - CoreViewer & Creator (Api + Client)

📦 Projects Created:
✅ Extensions.SDK.csproj - Base SDK with ASP.NET Core framework
✅ CoreViewer.Api.csproj - API-side viewer extension
✅ CoreViewer.Client.csproj - Blazor client viewer extension
✅ Creator.Api.csproj - API-side creator extension
✅ Creator.Client.csproj - Blazor client creator extension

📚 Documentation Added:
✅ PHASE_3.1_EXTENSION_LOADING_COMPLETE.md - Implementation verification
✅ EXTENSION_ARCHITECTURE.md - System architecture diagrams
✅ EXTENSION_QUICK_START.md - Developer guide

🎯 Key Features:
✅ Manifest loading with validation
✅ JSON serialization/deserialization
✅ Proper namespacing for all projects
✅ Framework references for SDK
✅ Project dependencies configured

🚀 Ready for extension implementation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added DatasetIngestionService for production ingestion of datasets from multiple formats (CSV, TSV, JSON, JSONL, ZIP, image folders). Introduced ItemRepository as a PostgreSQL adapter for dataset items, wrapping ParquetItemRepository. Refactored DatasetRepository to use DatasetDto and expanded repository methods. Updated DI registrations in ServiceCollectionExtensions to use new services and repositories. Adjusted IDatasetItemRepository interface for correct type usage. Added PostgreSqlMigrationsTests for verifying EF Core migrations and schema.
Added support for importing datasets from HuggingFace, including both streaming and download modes. The service now uses IHuggingFaceClient to fetch dataset info and files, and integrates error handling and logging for the import process.
Implements extension discovery and loading for both API and Client applications. Adds `ApiExtensionRegistry` and `ClientExtensionRegistry` services to scan, resolve dependencies, and load extensions from BuiltIn and Community directories. Updates Program.cs to register, configure, and initialize extensions at startup, enabling modular extension support as described in the implementation plan. Also adds the initial ApprovedExtensions.json registry and a comprehensive implementation plan document.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants