Gemini Subtitle Pro - Project Architecture Document
📖 Project Overview
Gemini Subtitle Pro is an AI-powered subtitle creation, translation, and polishing tool. Built with React + Vite + Electron tech stack, supporting both Web and Desktop client deployment.
- Tech Stack: React 19, Vite 6, Electron 39, TypeScript
- AI Engines: Google Gemini (Translation/Polishing), OpenAI Whisper (Speech Recognition)
See also: https://deepwiki.com/corvo007/Gemini-Subtitle-Pro
🏗️ Technology Stack Architecture
Technology Stack Layered Diagram
Dependency Version Overview
| Category | Dependency | Version | Purpose |
|---|---|---|---|
| Core Frameworks | React | 19.2 | UI Framework |
| Vite | 6.2 | Build Tool | |
| TypeScript | 5.8 | Type System | |
| Electron | 39 | Desktop Container | |
| AI SDK | @google/genai | Latest | Gemini API |
| openai | Latest | Whisper API | |
| onnxruntime-web | 1.23 | VAD Inference | |
| Audio Processing | @ricky0123/vad-web | 0.0.30 | Silero VAD Wrapper |
| fluent-ffmpeg | 2.1 | FFmpeg Control | |
| i18n | i18next | 25.7 | Internationalization Core |
| react-i18next | 16.5 | React Bindings | |
| Rendering | assjs | 0.1.4 | WYSIWYG Subtitle Rendering |
| Styling | TailwindCSS | 4.1 | Atomic CSS |
| Lucide React | 0.554 | Icon Library | |
| Utils | clsx / tw-merge | Latest | Style Merging |
📏 Code Standards & Engineering
Path Aliases
This project uses path aliases comprehensively under src and electron directories. Relative paths are prohibited (like ../../) for cross-level module references, except for same-level file references (using aliases uniformly is recommended).
@/*->src/*(Core Source Code)@components/*->src/components/*@hooks/*->src/hooks/*@services/*->src/services/*@utils/*->src/utils/*@types/*->src/types/*@lib/*->src/lib/*(New)@electron/*->electron/*(Electron Main Process Code)
Directory Organization Principles
- Co-location Principle: Utility functions or components used only within a specific module should be placed in that module's
utilsorsharedsubdirectory, rather than elevated to global.- For example,
src/components/endToEnd/wizard/utils/validation.tsserves only the wizard module.
- For example,
- Separation of Concerns:
src/utils: Global common, pure JavaScript/UI helper functions.src/services/utils: Infrastructure, logging, system-level tools.
🧱 Application Module Architecture
Module Dependency Graph
📁 Directory Structure
Gemini-Subtitle-Pro/
├── 📂 src/ # Frontend Source Code
│ ├── 📄 App.tsx # Application Main Entry
│ ├── 📄 index.tsx # React Render Entry
│ ├── 📄 index.css # Global Styles
│ ├── 📄 i18n.ts # [NEW] i18n Configuration Entry
│ │
│ ├── 📂 components/ # UI Components
│ │ ├── 📂 common/ # Common Business Components (Header, PageHeader, etc.)
│ │ ├── 📂 editor/ # Subtitle Editor & Video Preview Components
│ │ │ ├── 📄 VideoPlayerPreview.tsx # [NEW] Progressive Video Player with ASS Rendering
│ │ │ ├── 📄 RegenerateModal.tsx # [NEW] Batch Regenerate Modal
│ │ │ └── 📄 ... # SubtitleRow, Batch, etc.
│ │ ├── 📂 compression/ # [NEW] Video Compression Page Components
│ │ │ ├── 📄 EncoderSelector.tsx # Encoder Selection & Config
│ │ │ └── 📄 ...
│ │ ├── 📂 pages/ # Page-level Components (HomePage, WorkspacePage, etc.)
│ │ ├── 📂 ui/ # Base UI Component Library (Modal, Toggle, TextInput...)
│ │ ├── 📂 settings/ # Settings-related Components
│ │ │ ├── 📂 tabs/ # [NEW] Modular Settings Panels (GeneralTab, AboutTab, etc.)
│ │ │ └── 📄 SettingsModal.tsx # Settings Modal Container
│ │ ├── 📂 layout/ # Layout Containers
│ │ ├── 📂 modals/ # Business Modals (GlossaryConfirmationModal, SpeakerManagerModal, etc.)
│ │ ├── 📂 endToEnd/ # End-to-End Wizard Components
│ │ └── 📂 ... # Other feature-divided component directories
│ │
│ ├── 📂 hooks/ # React Hooks
│ │ ├── 📂 useWorkspaceLogic/ # Core Workspace Logic
│ │ ├── 📄 useVideoPreview.ts # [NEW] Video Preview & Transcoding State
│ │ └── ... # Other Feature Hooks
│ │
│ ├── 📂 locales/ # [NEW] Internationalization Resources
│ │ ├── 📂 zh-CN/ # Chinese (Simplified)
│ │ ├── 📂 en-US/ # English
│ │ └── 📂 ja-JP/ # Japanese (New in v2.13)
│ │
│ ├── 📂 services/ # Service Layer (Pure Logic)
│ │ ├── 📂 api/ # API Integration
│ │ ├── 📂 generation/ # Generation Services (Core Business Logic)
│ │ │ ├── 📂 pipeline/ # Complete Pipeline
│ │ │ │ ├── 📂 core/ # [NEW] Base step class and type definitions
│ │ │ │ └── 📂 steps/ # [NEW] Step implementations (Transcription, Refinement, Alignment, Translation, Proofread)
│ │ │ ├── 📂 extractors/ # Information Extraction
│ │ │ └── 📂 batch/ # Batch Operations (proofread.ts, regenerate.ts)
│ │ ├── 📂 audio/ # Audio Processing
│ │ ├── 📂 subtitle/ # Subtitle Parsing and Generation
│ │ │ ├── 📄 reconciler.ts # [NEW] Data Reconciler (Data Hub)
│ │ │ └── 📄 ...
│ │ ├── 📂 alignment/ # [NEW] Alignment Services
│ │ │ ├── 📂 strategies/ # Alignment Strategies (CTC)
│ │ │ └── 📄 index.ts # Strategy Factory
│ │ ├── 📂 download/ # Download Service Logic
│ │ └── 📂 utils/ # Common Service Tools
│ │
│ ├── 📂 config/ # Configuration Module
│ │ ├── 📄 index.ts # Config Export Entry
│ │ └── 📄 models.ts # Model Config
│ │
│ ├── 📂 lib/ # Common Libraries
│ │
│ ├── 📂 types/ # [NEW] Centralized Type Definitions
│ │ ├── 📄 pipeline.ts # Pipeline Shared Types
│ │ ├── 📄 alignment.ts # Alignment Types
│ │ └── 📄 ...
│ │
│ └── 📂 workers/ # Web Workers
│
├── 📂 electron/ # Electron Desktop Code
│ ├── 📄 main.ts # Main Process Entry
│ ├── 📄 preload.ts # Preload Script
│ ├── 📄 logger.ts # Unified Logging Service (with JSON View)
│ ├── 📂 utils/ # [NEW] Utility Modules
│ │ └── 📄 paths.ts # Portable path resolution
│ └── 📂 services/ # Desktop Services (Node.js Env)
│ ├── 📄 localWhisper.ts # Local Whisper Call (with GPU Detection)
│ ├── 📄 videoPreviewTranscoder.ts # Video Preview & Caching
│ ├── 📄 storage.ts # Portable storage service
│ └── ... # Other System-level Services
│
└── 📄 package.json # Project Config🔄 Core Process Diagrams
1. Complete Pipeline Concurrent Architecture
The diagram below shows the complete concurrent architecture for subtitle generation, including parallel async tasks, Semaphore control, and cross-task dependencies:
2. Dual Semaphore Concurrency Control Details
Configuration Explanation:
| Semaphore | Purpose | Default Concurrency | Config Item |
|---|---|---|---|
transcriptionSemaphore | Controls Whisper API Calls | Local: 1, Cloud: 5 | whisperConcurrency |
refinementSemaphore | Controls Gemini Flash API | 5 | concurrencyFlash |
alignmentSemaphore | Controls Alignment Tools | 2 | concurrencyAlignment |
| (Glossary Extraction Internal) | Controls Gemini Pro API | 2 | concurrencyPro |
3. Chunk Internal 5-Stage Pipeline
3.5 Pipeline Step Architecture (New in v2.13)
v2.13 introduces a class-based step architecture, modularizing Chunk processing logic:
Step Descriptions:
| Step | File | Input | Output | Purpose |
|---|---|---|---|---|
TranscriptionStep | TranscriptionStep.ts | AudioChunk | SubtitleItem[] | Whisper speech-to-text |
WaitForDepsStep | WaitForDepsStep.ts | - | Glossary + Speakers | Wait for glossary and speaker extraction |
RefinementStep | RefinementStep.ts | SubtitleItem[] | SubtitleItem[] | Timeline correction, apply glossary |
AlignmentStep | AlignmentStep.ts | SubtitleItem[] | SubtitleItem[] | CTC forced alignment |
TranslationStep | TranslationStep.ts | SubtitleItem[] | SubtitleItem[] | AI translation |
ProofreadStep | ProofreadStep.ts | SubtitleItem[] | SubtitleItem[] | Batch proofreading (optional) |
3.6 Batch Operations Comparison (New in v2.13)
v2.13 splits batch operations into two independent modes:
| Feature | Proofread | Regenerate |
|---|---|---|
| File | batch/proofread.ts | batch/regenerate.ts |
| Purpose | Polish and proofread existing translations | Completely reprocess selected segments |
| Pipeline | Gemini Pro proofreading only | Transcribe → Refine → Align → Translate (full pipeline) |
| Input | Existing SubtitleItem[] | Raw audio + time range |
| Preserved | Original timeline preserved | Everything regenerated |
| Use Cases | Improve translation quality, fix typos | Fix transcription errors, re-segment, re-run after glossary update |
| User Hints | Not supported | Supports transcription and translation hints |
| Model | Gemini 3 Pro | Whisper + Gemini Flash |
4. Data Integrity & Reconciliation (The "Data Hub")
The system employs a rigorous Data Reconciliation Strategy (src/services/subtitle/reconciler.ts) to ensure metadata persistence across the pipeline matches (Refinement, Alignment, Translation), even when the number of segments changes due to splitting or merging.
4.1 The Reconciler Logic
The reconcile(prev, curr) function acts as the "Data Hub" connecting pipeline stages. It intelligently merges prev (source) metadata into curr (newly generated) segments:
- Semantic Metadata (Always Inherited):
speaker(Speaker ID/Name)comment(User comments)- Logic: Inherited from the
prevsegment with the highest overlap ratio. Even if segments are split, they all inherit the parent's speaker.
- Internal State (Conditionally Inherited):
alignmentScore(CTC Confidence)lowConfidence(Flag)hasRegressionIssue,hasCorruptedRangeIssue(Error Flags)- Logic: Strictly inherited ONLY when a 1:1 mapping is detected. If a segment is split or merged, these internal flags are resetting to avoid false propagation (e.g., a "Perfect Alignment" score shouldn't automatically apply to two new half-segments without re-verification).
4.2 Alignment Strategy (CTC)
The system uses CTC (Connectionist Temporal Classification) for high-precision alignment:
- Engine:
ctcAligner.tsinterfacing with an externalalign.exe(MMS-300m model). - Function: Updates
startTimeandendTimebased on effective audio alignment, but never splits or merges segments. - Metadata: Adds
alignmentScoreto segments. Scores below threshold triggerlowConfidenceflag for user review.
5. Glossary Extraction and User Interaction Flow
6. Speaker Recognition Position in Pipeline
Pipeline Dependency Summary:
| Stage | Dependencies | Description |
|---|---|---|
| Transcription | transcriptionSemaphore | Independent Execution, No Blocking Dependencies |
| Wait Glossary | glossaryState.get() | Must Wait for Glossary Confirmation Complete |
| Wait Speakers | speakerProfilePromise | Must Wait for Speaker Recognition Complete |
| Refinement | refinementSemaphore + Glossary + Speakers | Merge and Use All Data |
| Alignment | alignmentSemaphore | High-precision Timestamp Alignment |
| Translation | refinementSemaphore (Shared) | Translated after Alignment |
7. Desktop Full Workflow (Download-Create-Encode)
Desktop-exclusive complete workflow, connecting from material acquisition to final output:
8. Full Auto End-to-End Mode (End-to-End Pipeline)
This is an Electron-exclusive core feature that coordinates Main Process (resource scheduling) and Renderer Process (AI computation) through IPC communication, achieving "one-click cooked content".
8.1 Cross-Process Interaction Architecture
8.2 Data Flow and State Management
All intermediate state and configuration is managed through the EndToEndWizard component, with data flow as follows:
User Configuration
- Source:
EndToEndWizardUI - Flow: Via
IPC (start-processing)-> Main ProcessEndToEndPipelineService - Content: URL, Model Selection, Translation Style, Encoding Parameters
- Source:
Media Stream
yt-dlp-> Disk Temp Directory ->ffmpeg(Extract Audio) -> Disk WAV- Disk WAV ->
IPC (read-file)-> Renderer Process Memory (ArrayBuffer) -> Web Audio API
Subtitle Data
- Renderer Process generates
SubtitleItem[]array - Passed back to Main Process via
IPC (subtitle-result) - Main Process serializes object to ASS/SRT format text and writes to file
- Renderer Process generates
Progress Feedback
- Each stage (Download/Transcribe/Encode) produces progress events
- Main Process ->
IPC (progress)-> Renderer ProcessuseEndToEndHook -> UI Progress Bar
8.3 Key IPC Channels
| Channel Name | Direction | Payload | Purpose |
|---|---|---|---|
end-to-end:start | Renderer -> Main | EndToEndConfig | Start Full Auto Task |
end-to-end:generate-subtitles | Main -> Renderer | path, config | Main Process Ready, Request Frontend to Start Generation |
end-to-end:subtitle-result | Renderer -> Main | SubtitleItem[] | Frontend Complete, Return Results |
end-to-end:progress | Main -> Renderer | stage, progress | Real-time Progress Sync |
🛰️ Custom Protocol for Media Playback
To bypass browser security restrictions (CSP, Sandbox) and support large file streaming, the desktop version implements a custom protocol:
local-video:// Protocol
- Implementation:
electron/main.ts - Privileges:
standard,secure,stream,supportFetchAPI,bypassCSP. - Key Feature: Tailing Reader: Support for reading "growing files" (transcoding in progress). It uses a polling mechanism to read new data as it is written to disk by FFmpeg.
📺 Video Preview & Caching Strategy
The system uses a fragmented MP4 (fMP4) transcoding strategy to balance compatibility and performance, enabling play-while-transcoding for immediate video preview.
Architecture Overview
The video preview system consists of three main components:
| Component | Location | Purpose |
|---|---|---|
VideoPlayerPreview | src/components/editor/ | React video player with ASS subtitle overlay |
useVideoPreview | src/hooks/ | State management for transcoding progress, video source, and playback |
videoPreviewTranscoder | electron/services/ | FFmpeg-based transcoding service with GPU acceleration and caching |
Process Flow
Key Features
| Feature | Description |
|---|---|
| Progressive Playback | Start playing before transcoding completes via fMP4 + TailingReader |
| GPU Acceleration | Auto-detects NVENC/QSV/VCE for faster transcoding |
| Format Detection | Skips transcoding for browser-compatible formats (mp4, webm, m4v) |
| WYSIWYG Subtitles | Renders ASS subtitles using assjs in sync with video |
| Floating/Docked Modes | Supports resizable floating window or docked panel |
Cache Lifecycle
- Storage: User data directory (
/preview_cache/). - Limit: Automatically enforces a total size limit (3GB default).
- Cleanup: Enforced on app startup (oldest files first) and via manual UI action.
IPC Channels
| Channel Name | Direction | Payload | Purpose |
|---|---|---|---|
transcode-for-preview | Renderer → Main | { filePath } | Request video transcoding |
transcode-start | Main → Renderer | { outputPath, duration } | Transcoding started, begin progressive playback |
transcode-progress | Main → Renderer | { percent, transcodedDuration } | Real-time progress update |
cache:get-size | Renderer → Main | - | Get preview cache size |
cache:clear | Renderer → Main | - | Clear preview cache |
🧩 Core Module Descriptions
1. Generation Services Module (src/services/generation/) [NEW]
This is the refactored core business logic module, splitting the original Gemini API logic by responsibility:
| Submodule | File/Directory | Function Description |
|---|---|---|
pipeline | index.ts | Generation Flow Orchestrator, Coordinates Transcription, Extraction, Generation Full Flow |
pipelineCore.ts | [NEW] Shared context and dependency injection | |
chunkProcessor.ts | Single Chunk Processing Logic (Transcribe -> Wait Glossary/Speaker -> Translate) | |
translation.ts | Specific Translation Execution Logic | |
glossaryHandler.ts | Glossary Application Logic | |
resultTransformers.ts | Result Transformation and Post-processing Logic | |
core/BaseStep.ts | [NEW] Base step class, defines unified interface | |
steps/*.ts | [NEW] Step implementations (Transcription, Refinement, Alignment, Translation, Proofread) | |
extractors | glossary.ts | Glossary Extractor (Gemini Pro + Search) |
speakerProfile.ts | Speaker Profile Extractor | |
batch | proofread.ts | [NEW] Batch proofreading operations |
regenerate.ts | [NEW] Batch regenerate operations (full pipeline re-run) |
2. Gemini API Core (src/services/api/gemini/core/)
Retains only the most basic API interaction capabilities:
| File | Function Description |
|---|---|
client.ts | Gemini API Client Encapsulation, Handles auth, retry, and quota |
prompts.ts | Basic Prompt Template Library |
schemas.ts | Structured Output Schema Definitions |
3. Audio Processing Module (src/services/audio/)
| File | Function Description |
|---|---|
segmenter.ts | Smart Audio Segmenter, Uses Silero VAD Model to Detect Voice Activity, Splits by Semantic Boundaries |
sampler.ts | Audio Sampling, Generates Audio Samples for AI Analysis |
decoder.ts | Audio Decoding, Supports Multiple Formats |
processor.ts | Audio Preprocessing, Normalization, etc. |
converter.ts | Audio Format Conversion |
ffmpegExtractor.ts | FFmpeg Audio Extraction (Core Logic) |
4. Subtitle Processing Module (src/services/subtitle/)
| File | Function Description |
|---|---|
parser.ts | Subtitle Parser, Supports SRT/ASS/VTT and other formats |
generator.ts | Subtitle Export, Generates Bilingual Subtitle Files |
time.ts | Timecode Processing Tool |
postCheck.ts | Subtitle Quality Post-check |
timelineValidator.ts | Subtitle Timeline Logic Validation |
reconciler.ts | [NEW] Data Reconciliation (Merges Metadata) |
5. Download Service Module (src/services/download/)
| File | Function Description |
|---|---|
download.ts | Video Download Logic Encapsulation |
utils.ts | Download-related Utility Functions |
6. Electron Desktop (electron/)
| File | Function Description |
|---|---|
main.ts | Electron Main Process, Window Management, IPC Communication |
preload.ts | Preload Script, Exposes Secure Node.js API |
logger.ts | Unified Logging System, Supports File Rotation, JSON View, and Multi-level Logs |
utils/paths.ts | [NEW] Portable path resolution, supports exe-relative storage |
services/localWhisper.ts | Local Whisper Model Call (whisper.cpp), with GPU detection |
services/ffmpegAudioExtractor.ts | FFmpeg Audio Extraction, Supports Video Files |
services/ytdlp.ts | Video Download Service (YouTube/Bilibili) |
services/videoCompressor.ts | Video Encoding Service (Supports NVENC/QSV/AMF Hardware Acceleration) |
services/videoPreviewTranscoder.ts | Video Preview Transcoding, fMP4 for progressive playback, cache management |
services/endToEndPipeline.ts | Full Auto Pipeline, Orchestrates Download-Transcribe-Encode Full Flow |
services/storage.ts | Portable storage service, config and logs stored in exe-relative directory |
7. Internationalization Module (src/locales/, src/i18n.ts) [NEW]
Full i18n support powered by i18next, enabling bilingual UI (Chinese/English):
| File/Directory | Function Description |
|---|---|
i18n.ts | i18n Configuration Entry, Initializes i18next with React bindings |
locales/ | Translation Resources Root Directory |
zh-CN/ | Chinese (Simplified) translations, 14 namespace files |
en-US/ | English translations, mirrors zh-CN structure |
ja-JP/ | Japanese translations, mirrors zh-CN structure (New in v2.13) |
Namespace Organization:
| Namespace | Content |
|---|---|
common | Shared texts (buttons, labels, errors) |
home | Home page content |
workspace | Workspace page |
editor | Subtitle editor |
settings | Settings modal |
endToEnd | End-to-end wizard |
modals | Business modals |
services | API service messages |
compression | Video compression page |
download | Download page |
progress | Progress indicators |
ui | UI components |
app | App-level texts |
8. Settings Module (src/components/settings/) [Refactored in v2.13]
v2.13 refactors the settings panel into a modular tabs structure:
| File/Directory | Function Description |
|---|---|
SettingsModal.tsx | Settings modal container, manages tab switching |
tabs/GeneralTab.tsx | General settings (language, theme, etc.) |
tabs/ServicesTab.tsx | API service config (Gemini, OpenAI keys) |
tabs/EnhanceTab.tsx | Enhancement features (glossary, speaker toggles) |
tabs/PerformanceTab.tsx | Performance settings (concurrency, cache, etc.) |
tabs/DebugTab.tsx | Debug options (mock mode, log level) |
tabs/AboutTab.tsx | [NEW] About page (version, Whisper status, GPU detection) |
AlignmentSettings.tsx | Alignment service configuration |
LocalWhisperSettings.tsx | Local Whisper configuration |
CacheManagement.tsx | Cache management UI |
About Page (AboutTab) Features:
- Display app version and build info
- Local Whisper status detection
- GPU hardware acceleration detection (NVENC/QSV/AMF)
- Log file path and viewer entry
- System information overview
🔧 Technical Highlights
Concurrency Control
// Use Semaphore to control concurrency count
const semaphore = new Semaphore(concurrency);
await mapInParallel(chunks, async (chunk) => {
await semaphore.acquire();
try {
return await processChunk(chunk);
} finally {
semaphore.release();
}
});Model Selection Strategy
Model configuration is centralized in src/config/models.ts, supporting different model selection per processing step:
| Processing Step | Default Model | Features |
|---|---|---|
refinement | Gemini 2.5 Flash | Timeline Correction (Avoids 3.0 Timestamp Bug) |
translation | Gemini 3 Flash Preview | Translation, Search Grounding |
glossaryExtraction | Gemini 3 Pro Preview | Multimodal, Term Extraction |
speakerProfile | Gemini 3 Pro Preview | Speaker Analysis |
batchProofread | Gemini 3 Pro Preview | High Quality Proofreading, Search Grounding |
Note: As of v2.13,
batchFixTimestampshas been replaced by theregenerateoperation. Regenerate re-runs the full pipeline (transcription → refinement → alignment → translation).
Each step can be independently configured:
thinkingLevel: Thinking Depth (none/low/medium/high)useSearch: Whether to Enable Google SearchmaxOutputTokens: Maximum Output Tokens
Retry Mechanism
// Automatically retry recoverable errors
async function generateContentWithRetry(
ai: GoogleGenAI,
params: any,
retries = 3,
signal?: AbortSignal,
onUsage?: (usage: TokenUsage) => void,
timeoutMs?: number
);📊 Data Flow Architecture
Main Data Flow Diagram
Data Type Conversion Chain
State Data Flow
🚀 Deployment Architecture
📝 Development Guide
Environment Requirements
- Node.js 18+
- npm or yarn
Quick Start
# Install Dependencies
yarn install
# Web Development Mode
yarn dev
# Electron Development Mode
yarn electron:dev
# Build Electron Application
yarn electron:buildEnvironment Variables
GEMINI_API_KEY=your_gemini_key # Required: Translation and Proofreading
OPENAI_API_KEY=your_openai_key # Optional: Cloud Whisper