Gemini Subtitle Pro - Project Architecture Document

中文文档 (Chinese Documentation)

📖 Project Overview

Gemini Subtitle Pro is an AI-powered tool for creating, translating, and polishing subtitles. It is built on a React + Vite + Electron stack and supports both Web and Desktop deployment.

  • Tech Stack: React 19, Vite 6, Electron 39, TypeScript
  • AI Engines: Google Gemini (Translation/Polishing), OpenAI Whisper (Speech Recognition)

See also: https://deepwiki.com/corvo007/Gemini-Subtitle-Pro


🏗️ Technology Stack Architecture

Technology Stack Layered Diagram

Dependency Version Overview

| Category | Dependency | Version | Purpose |
| --- | --- | --- | --- |
| Core Frameworks | React | 19.2 | UI Framework |
| | Vite | 6.2 | Build Tool |
| | TypeScript | 5.8 | Type System |
| | Electron | 39 | Desktop Container |
| AI SDK | @google/genai | Latest | Gemini API |
| | openai | Latest | Whisper API |
| | onnxruntime-web | 1.23 | VAD Inference |
| Audio Processing | @ricky0123/vad-web | 0.0.30 | Silero VAD Wrapper |
| | fluent-ffmpeg | 2.1 | FFmpeg Control |
| i18n | i18next | 25.7 | Internationalization Core |
| | react-i18next | 16.5 | React Bindings |
| Rendering | assjs | 0.1.4 | WYSIWYG Subtitle Rendering |
| Styling | TailwindCSS | 4.1 | Atomic CSS |
| | Lucide React | 0.554 | Icon Library |
| Utils | clsx / tw-merge | Latest | Style Merging |

📏 Code Standards & Engineering

Path Aliases

This project uses path aliases throughout the src and electron directories. Relative paths such as ../../ are prohibited for cross-level module references. Same-level file references may use relative paths, although using aliases uniformly is recommended.

  • @/* -> src/* (Core Source Code)
  • @components/* -> src/components/*
  • @hooks/* -> src/hooks/*
  • @services/* -> src/services/*
  • @utils/* -> src/utils/*
  • @types/* -> src/types/*
  • @lib/* -> src/lib/* (New)
  • @electron/* -> electron/* (Electron Main Process Code)
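To make the mapping concrete, here is a minimal sketch of how an alias specifier translates to a path in the repository. The `resolveAlias` helper and the `ALIASES` table are hypothetical and exist only for illustration. In the project itself this mapping is presumably handled by the build tooling configuration, which this document does not show.

```typescript
// Alias table copied from the list above. Every prefix ends with "/",
// and no prefix is a prefix of another, so lookup order does not matter.
const ALIASES: Record<string, string> = {
  "@/": "src/",
  "@components/": "src/components/",
  "@hooks/": "src/hooks/",
  "@services/": "src/services/",
  "@utils/": "src/utils/",
  "@types/": "src/types/",
  "@lib/": "src/lib/",
  "@electron/": "electron/",
};

// Hypothetical helper: rewrites an aliased import specifier to a repo path.
function resolveAlias(specifier: string): string {
  for (const [prefix, target] of Object.entries(ALIASES)) {
    if (specifier.startsWith(prefix)) {
      return target + specifier.slice(prefix.length);
    }
  }
  // Not an alias (e.g. "react" or "./local"): leave it untouched.
  return specifier;
}
```

For example, `resolveAlias("@services/subtitle/reconciler")` yields `src/services/subtitle/reconciler`, which is what an import written with the alias refers to.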

Directory Organization Principles

  • Co-location Principle: Utility functions or components used only within a specific module should be placed in that module's utils or shared subdirectory, rather than elevated to global.
    • For example, src/components/endToEnd/wizard/utils/validation.ts serves only the wizard module.
  • Separation of Concerns:
    • src/utils: Global common, pure JavaScript/UI helper functions.
    • src/services/utils: Infrastructure, logging, system-level tools.

🧱 Application Module Architecture

Module Dependency Graph


📁 Directory Structure

Gemini-Subtitle-Pro/
├── 📂 src/                          # Frontend Source Code
│   ├── 📄 App.tsx                   # Application Main Entry
│   ├── 📄 index.tsx                 # React Render Entry
│   ├── 📄 index.css                 # Global Styles
│   ├── 📄 i18n.ts                   # [NEW] i18n Configuration Entry
│   │
│   ├── 📂 components/               # UI Components
│   │   ├── 📂 common/               # Common Business Components (Header, PageHeader, etc.)
│   │   ├── 📂 editor/               # Subtitle Editor & Video Preview Components
│   │   │   ├── 📄 VideoPlayerPreview.tsx  # [NEW] Progressive Video Player with ASS Rendering
│   │   │   ├── 📄 RegenerateModal.tsx     # [NEW] Batch Regenerate Modal
│   │   │   └── 📄 ...               # SubtitleRow, Batch, etc.
│   │   ├── 📂 compression/          # [NEW] Video Compression Page Components
│   │   │   ├── 📄 EncoderSelector.tsx # Encoder Selection & Config
│   │   │   └── 📄 ...
│   │   ├── 📂 pages/                # Page-level Components (HomePage, WorkspacePage, etc.)
│   │   ├── 📂 ui/                   # Base UI Component Library (Modal, Toggle, TextInput...)
│   │   ├── 📂 settings/             # Settings-related Components
│   │   │   ├── 📂 tabs/             # [NEW] Modular Settings Panels (GeneralTab, AboutTab, etc.)
│   │   │   └── 📄 SettingsModal.tsx # Settings Modal Container
│   │   ├── 📂 layout/               # Layout Containers
│   │   ├── 📂 modals/               # Business Modals (GlossaryConfirmationModal, SpeakerManagerModal, etc.)
│   │   ├── 📂 endToEnd/             # End-to-End Wizard Components
│   │   └── 📂 ...                   # Other feature-divided component directories
│   │
│   ├── 📂 hooks/                    # React Hooks
│   │   ├── 📂 useWorkspaceLogic/    # Core Workspace Logic
│   │   ├── 📄 useVideoPreview.ts    # [NEW] Video Preview & Transcoding State
│   │   └── ...                      # Other Feature Hooks
│   │
│   ├── 📂 locales/                  # [NEW] Internationalization Resources
│   │   ├── 📂 zh-CN/                # Chinese (Simplified)
│   │   ├── 📂 en-US/                # English
│   │   └── 📂 ja-JP/                # Japanese (New in v2.13)
│   │
│   ├── 📂 services/                 # Service Layer (Pure Logic)
│   │   ├── 📂 api/                  # API Integration
│   │   ├── 📂 generation/           # Generation Services (Core Business Logic)
│   │   │   ├── 📂 pipeline/         # Complete Pipeline
│   │   │   │   ├── 📂 core/         # [NEW] Base step class and type definitions
│   │   │   │   └── 📂 steps/        # [NEW] Step implementations (Transcription, Refinement, Alignment, Translation, Proofread)
│   │   │   ├── 📂 extractors/       # Information Extraction
│   │   │   └── 📂 batch/            # Batch Operations (proofread.ts, regenerate.ts)
│   │   ├── 📂 audio/                # Audio Processing
│   │   ├── 📂 subtitle/             # Subtitle Parsing and Generation
│   │   │   ├── 📄 reconciler.ts     # [NEW] Data Reconciler (Data Hub)
│   │   │   └── 📄 ...
│   │   ├── 📂 alignment/            # [NEW] Alignment Services
│   │   │   ├── 📂 strategies/       # Alignment Strategies (CTC)
│   │   │   └── 📄 index.ts          # Strategy Factory
│   │   ├── 📂 download/             # Download Service Logic
│   │   └── 📂 utils/                # Common Service Tools
│   │
│   ├── 📂 config/                   # Configuration Module
│   │   ├── 📄 index.ts              # Config Export Entry
│   │   └── 📄 models.ts             # Model Config
│   │
│   ├── 📂 lib/                      # Common Libraries
│   │
│   ├── 📂 types/                    # [NEW] Centralized Type Definitions
│   │   ├── 📄 pipeline.ts           # Pipeline Shared Types
│   │   ├── 📄 alignment.ts          # Alignment Types
│   │   └── 📄 ...
│   │
│   └── 📂 workers/                  # Web Workers

├── 📂 electron/                     # Electron Desktop Code
│   ├── 📄 main.ts                   # Main Process Entry
│   ├── 📄 preload.ts                # Preload Script
│   ├── 📄 logger.ts                 # Unified Logging Service (with JSON View)
│   ├── 📂 utils/                    # [NEW] Utility Modules
│   │   └── 📄 paths.ts              # Portable path resolution
│   └── 📂 services/                 # Desktop Services (Node.js Env)
│       ├── 📄 localWhisper.ts       # Local Whisper Call (with GPU Detection)
│       ├── 📄 videoPreviewTranscoder.ts # Video Preview & Caching
│       ├── 📄 storage.ts            # Portable storage service
│       └── ...                      # Other System-level Services

└── 📄 package.json                  # Project Config

🔄 Core Process Diagrams

1. Complete Pipeline Concurrent Architecture

The diagram below shows the complete concurrent architecture for subtitle generation, including parallel async tasks, Semaphore control, and cross-task dependencies:


2. Dual Semaphore Concurrency Control Details

Configuration Explanation:

| Semaphore | Purpose | Default Concurrency | Config Item |
| --- | --- | --- | --- |
| transcriptionSemaphore | Controls Whisper API Calls | Local: 1, Cloud: 5 | whisperConcurrency |
| refinementSemaphore | Controls Gemini Flash API | 5 | concurrencyFlash |
| alignmentSemaphore | Controls Alignment Tools | 2 | concurrencyAlignment |
| (Glossary Extraction Internal) | Controls Gemini Pro API | 2 | concurrencyPro |

3. Chunk Internal 5-Stage Pipeline


3.5 Pipeline Step Architecture (New in v2.13)

v2.13 introduces a class-based step architecture, modularizing Chunk processing logic:

Step Descriptions:

| Step | File | Input | Output | Purpose |
| --- | --- | --- | --- | --- |
| TranscriptionStep | TranscriptionStep.ts | AudioChunk | SubtitleItem[] | Whisper speech-to-text |
| WaitForDepsStep | WaitForDepsStep.ts | - | Glossary + Speakers | Wait for glossary and speaker extraction |
| RefinementStep | RefinementStep.ts | SubtitleItem[] | SubtitleItem[] | Timeline correction, apply glossary |
| AlignmentStep | AlignmentStep.ts | SubtitleItem[] | SubtitleItem[] | CTC forced alignment |
| TranslationStep | TranslationStep.ts | SubtitleItem[] | SubtitleItem[] | AI translation |
| ProofreadStep | ProofreadStep.ts | SubtitleItem[] | SubtitleItem[] | Batch proofreading (optional) |
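A class-based step architecture of this kind might look roughly like the sketch below. It is not the project's actual BaseStep: the `execute`/`run` method names, the reduced `SubtitleItem` shape, and the two toy steps are assumptions made for illustration.

```typescript
// Reduced, assumed shape; the real SubtitleItem carries more fields.
interface SubtitleItem {
  startTime: number;
  endTime: number;
  original: string;
  translated?: string;
}

// Hypothetical base class: subclasses implement execute(), callers use run(),
// which adds uniform error reporting around every step.
abstract class BaseStep<I, O> {
  abstract readonly name: string;
  protected abstract execute(input: I): Promise<O>;

  async run(input: I): Promise<O> {
    try {
      return await this.execute(input);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(`${this.name} failed: ${reason}`);
    }
  }
}

// Toy stand-in for a refinement-like step: trims whitespace.
class TrimStep extends BaseStep<SubtitleItem[], SubtitleItem[]> {
  readonly name = "TrimStep";
  protected async execute(items: SubtitleItem[]): Promise<SubtitleItem[]> {
    return items.map((item) => ({ ...item, original: item.original.trim() }));
  }
}

// Toy stand-in for a translation step: no AI call, just a marker prefix.
class FakeTranslationStep extends BaseStep<SubtitleItem[], SubtitleItem[]> {
  readonly name = "FakeTranslationStep";
  protected async execute(items: SubtitleItem[]): Promise<SubtitleItem[]> {
    return items.map((item) => ({ ...item, translated: `[zh] ${item.original}` }));
  }
}

// Runs steps in order, feeding each step's output into the next.
async function runSteps(
  steps: BaseStep<SubtitleItem[], SubtitleItem[]>[],
  input: SubtitleItem[],
): Promise<SubtitleItem[]> {
  let current = input;
  for (const step of steps) current = await step.run(current);
  return current;
}
```

Because most steps in the table share the SubtitleItem[] input and output type, they can be chained in a plain loop, and an optional step such as proofreading can simply be left out of the array.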

3.6 Batch Operations Comparison (New in v2.13)

v2.13 splits batch operations into two independent modes:

| Feature | Proofread | Regenerate |
| --- | --- | --- |
| File | batch/proofread.ts | batch/regenerate.ts |
| Purpose | Polish and proofread existing translations | Completely reprocess selected segments |
| Pipeline | Gemini Pro proofreading only | Transcribe → Refine → Align → Translate (full pipeline) |
| Input | Existing SubtitleItem[] | Raw audio + time range |
| Preserved | Original timeline preserved | Everything regenerated |
| Use Cases | Improve translation quality, fix typos | Fix transcription errors, re-segment, re-run after glossary update |
| User Hints | Not supported | Supports transcription and translation hints |
| Model | Gemini 3 Pro | Whisper + Gemini Flash |

4. Data Integrity & Reconciliation (The "Data Hub")

The system employs a rigorous Data Reconciliation Strategy (src/services/subtitle/reconciler.ts) to ensure that metadata persists across the pipeline stages (Refinement, Alignment, Translation), even when the number of segments changes due to splitting or merging.

4.1 The Reconciler Logic

The reconcile(prev, curr) function acts as the "Data Hub" connecting pipeline stages. It intelligently merges prev (source) metadata into curr (newly generated) segments:

  • Semantic Metadata (Always Inherited):
    • speaker (Speaker ID/Name)
    • comment (User comments)
    • Logic: Inherited from the prev segment with the highest overlap ratio. Even if segments are split, they all inherit the parent's speaker.
  • Internal State (Conditionally Inherited):
    • alignmentScore (CTC Confidence)
    • lowConfidence (Flag)
    • hasRegressionIssue, hasCorruptedRangeIssue (Error Flags)
    • Logic: Strictly inherited ONLY when a 1:1 mapping is detected. If a segment is split or merged, these internal flags are reset to avoid false propagation (e.g., a "Perfect Alignment" score shouldn't automatically apply to two new half-segments without re-verification).
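The two inheritance rules can be sketched as follows. This is an illustrative approximation, not the code in reconciler.ts: the `Segment` shape, the use of seconds for times, and the way a 1:1 mapping is detected (one overlapping source, which in turn overlaps only this segment) are all assumptions.

```typescript
// Assumed segment shape; times are taken to be in seconds.
interface Segment {
  startTime: number;
  endTime: number;
  text: string;
  speaker?: string;
  comment?: string;
  alignmentScore?: number;
  lowConfidence?: boolean;
}

// Length of the time range shared by two segments (0 if they do not overlap).
function overlap(a: Segment, b: Segment): number {
  return Math.max(0, Math.min(a.endTime, b.endTime) - Math.max(a.startTime, b.startTime));
}

function reconcileSketch(prev: Segment[], curr: Segment[]): Segment[] {
  return curr.map((c) => {
    const sources = prev.filter((p) => overlap(p, c) > 0);
    if (sources.length === 0) return { ...c };

    // For a fixed c, the largest overlap duration is also the largest
    // overlap ratio, because the denominator (c's duration) is the same.
    const best = sources.reduce((a, b) => (overlap(b, c) > overlap(a, c) ? b : a));

    // Semantic metadata is always inherited from the best-matching source.
    const merged: Segment = { ...c, speaker: best.speaker, comment: best.comment };

    // Internal state is inherited only for a 1:1 mapping.
    const siblings = curr.filter((other) => overlap(best, other) > 0);
    const oneToOne = sources.length === 1 && siblings.length === 1;
    if (oneToOne) {
      merged.alignmentScore = best.alignmentScore;
      merged.lowConfidence = best.lowConfidence;
    } else {
      delete merged.alignmentScore;
      delete merged.lowConfidence;
    }
    return merged;
  });
}
```

With one source segment spanning 0 to 4 seconds and two new segments covering 0 to 2 and 2 to 4, both halves keep the speaker, but neither keeps the alignment score.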

4.2 Alignment Strategy (CTC)

The system uses CTC (Connectionist Temporal Classification) for high-precision alignment:

  • Engine: ctcAligner.ts interfacing with an external align.exe (MMS-300m model).
  • Function: Updates startTime and endTime based on the actual audio alignment, but never splits or merges segments.
  • Metadata: Adds alignmentScore to segments. Scores below the threshold set the lowConfidence flag for user review.

5. Glossary Extraction and User Interaction Flow


6. Speaker Recognition Position in Pipeline

Pipeline Dependency Summary:

| Stage | Dependencies | Description |
| --- | --- | --- |
| Transcription | transcriptionSemaphore | Independent Execution, No Blocking Dependencies |
| Wait Glossary | glossaryState.get() | Must Wait for Glossary Confirmation Complete |
| Wait Speakers | speakerProfilePromise | Must Wait for Speaker Recognition Complete |
| Refinement | refinementSemaphore + Glossary + Speakers | Merge and Use All Data |
| Alignment | alignmentSemaphore | High-precision Timestamp Alignment |
| Translation | refinementSemaphore (Shared) | Translated after Alignment |
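The ordering in this summary can be illustrated with a small sketch in which a chunk transcribes immediately and then blocks on both the glossary and the speaker profiles before refinement. The function name, the `Glossary` and `SpeakerProfile` types, and the log-based stand-ins for real work are hypothetical. Semaphores are left out to keep the dependency waiting visible.

```typescript
type Glossary = Record<string, string>;   // assumed: term -> translation
type SpeakerProfile = { id: string };     // assumed minimal shape

async function processChunkSketch(
  chunkId: number,
  glossaryReady: Promise<Glossary>,
  speakersReady: Promise<SpeakerProfile[]>,
  log: string[],
): Promise<void> {
  // Transcription has no blocking dependencies, so it runs right away.
  log.push(`chunk ${chunkId}: transcribed`);

  // Refinement needs both the confirmed glossary and the speaker profiles.
  const [glossary, speakers] = await Promise.all([glossaryReady, speakersReady]);
  log.push(
    `chunk ${chunkId}: refined with ${Object.keys(glossary).length} terms, ${speakers.length} speakers`,
  );

  // Alignment and translation follow in order once refinement is done.
  log.push(`chunk ${chunkId}: aligned`);
  log.push(`chunk ${chunkId}: translated`);
}
```

Since every chunk awaits the same two promises, glossary confirmation by the user delays refinement of all chunks but never delays their transcription.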

7. Desktop Full Workflow (Download-Create-Encode)

Desktop-exclusive complete workflow, connecting from material acquisition to final output:


8. Full Auto End-to-End Mode (End-to-End Pipeline)

This is an Electron-exclusive core feature. It coordinates the Main Process (resource scheduling) and the Renderer Process (AI computation) through IPC communication, producing a fully subtitled video ("cooked content") with one click.

8.1 Cross-Process Interaction Architecture

8.2 Data Flow and State Management

All intermediate state and configuration is managed through the EndToEndWizard component, with data flow as follows:

  1. User Configuration

    • Source: EndToEndWizard UI
    • Flow: Via IPC (start-processing) -> Main Process EndToEndPipeline Service
    • Content: URL, Model Selection, Translation Style, Encoding Parameters
  2. Media Stream

    • yt-dlp -> Disk Temp Directory -> ffmpeg (Extract Audio) -> Disk WAV
    • Disk WAV -> IPC (read-file) -> Renderer Process Memory (ArrayBuffer) -> Web Audio API
  3. Subtitle Data

    • Renderer Process generates SubtitleItem[] array
    • Passed back to Main Process via IPC (subtitle-result)
    • Main Process serializes object to ASS/SRT format text and writes to file
  4. Progress Feedback

    • Each stage (Download/Transcribe/Encode) produces progress events
    • Main Process -> IPC (progress) -> Renderer Process useEndToEnd Hook -> UI Progress Bar
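As an illustration of the serialization step in item 3, the sketch below turns a SubtitleItem[] array into SRT text. The reduced `SubtitleItem` shape, the use of seconds for times, and the original-then-translation line order are assumptions. The project's actual generator may differ.

```typescript
// Assumed, reduced shape; times are taken to be in seconds.
interface SubtitleItem {
  startTime: number;
  endTime: number;
  original: string;
  translated?: string;
}

// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Serializes items as numbered SRT cues separated by blank lines.
function toSrt(items: SubtitleItem[]): string {
  return items
    .map((item, i) => {
      const lines = [item.original];
      if (item.translated) lines.push(item.translated);
      const range = `${toSrtTimestamp(item.startTime)} --> ${toSrtTimestamp(item.endTime)}`;
      return `${i + 1}\n${range}\n${lines.join("\n")}\n`;
    })
    .join("\n");
}
```

The result is plain text, so the Main Process can write it to disk directly after receiving the array over IPC.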

8.3 Key IPC Channels

| Channel Name | Direction | Payload | Purpose |
| --- | --- | --- | --- |
| end-to-end:start | Renderer -> Main | EndToEndConfig | Start Full Auto Task |
| end-to-end:generate-subtitles | Main -> Renderer | path, config | Main Process Ready, Request Frontend to Start Generation |
| end-to-end:subtitle-result | Renderer -> Main | SubtitleItem[] | Frontend Complete, Return Results |
| end-to-end:progress | Main -> Renderer | stage, progress | Real-time Progress Sync |

🛰️ Custom Protocol for Media Playback

To bypass browser security restrictions (CSP, Sandbox) and support large file streaming, the desktop version implements a custom protocol:

local-video:// Protocol

  • Implementation: electron/main.ts
  • Privileges: standard, secure, stream, supportFetchAPI, bypassCSP.
  • Key Feature (Tailing Reader): Supports reading "growing files" while transcoding is still in progress. It polls the file and reads new data as FFmpeg writes it to disk.
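The polling idea behind a tailing reader can be sketched without any file system dependency. The `GrowingSource` interface and the `tail` generator below are hypothetical. A real implementation in electron/main.ts would read from the file FFmpeg is writing and would need its own way to know when transcoding has finished.

```typescript
// Abstract view of a file that is still being written (assumed interface).
interface GrowingSource {
  size(): number;                                // bytes written so far
  read(start: number, end: number): Uint8Array;  // bytes in [start, end)
  isComplete(): boolean;                         // true once the writer is done
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Yields new data as it appears and stops when the writer has finished
// and everything has been consumed.
async function* tail(source: GrowingSource, pollMs = 50): AsyncGenerator<Uint8Array> {
  let offset = 0;
  while (true) {
    const available = source.size();
    if (available > offset) {
      yield source.read(offset, available);
      offset = available;
    } else if (source.isComplete()) {
      return; // nothing new and the writer is done
    } else {
      await sleep(pollMs); // wait for more data to be flushed
    }
  }
}
```

The size check runs before the completion check on every iteration, so data written just before the writer finishes is still delivered.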

📺 Video Preview & Caching Strategy

The system uses a fragmented MP4 (fMP4) transcoding strategy to balance compatibility and performance, enabling play-while-transcoding for immediate video preview.

Architecture Overview

The video preview system consists of three main components:

| Component | Location | Purpose |
| --- | --- | --- |
| VideoPlayerPreview | src/components/editor/ | React video player with ASS subtitle overlay |
| useVideoPreview | src/hooks/ | State management for transcoding progress, video source, and playback |
| videoPreviewTranscoder | electron/services/ | FFmpeg-based transcoding service with GPU acceleration and caching |

Process Flow

Key Features

| Feature | Description |
| --- | --- |
| Progressive Playback | Start playing before transcoding completes via fMP4 + TailingReader |
| GPU Acceleration | Auto-detects NVENC/QSV/VCE for faster transcoding |
| Format Detection | Skips transcoding for browser-compatible formats (mp4, webm, m4v) |
| WYSIWYG Subtitles | Renders ASS subtitles using assjs in sync with video |
| Floating/Docked Modes | Supports resizable floating window or docked panel |

Cache Lifecycle

  • Storage: User data directory (/preview_cache/).
  • Limit: Automatically enforces a total size limit (3GB default).
  • Cleanup: Enforced on app startup (oldest files first) and via manual UI action.
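The oldest-first cleanup rule can be expressed as a pure function that picks which files to delete. The `CacheEntry` shape and the `selectEvictions` name are hypothetical, and treating 3GB as 3 × 1024³ bytes is an assumption.

```typescript
// Assumed description of one cached preview file.
interface CacheEntry {
  path: string;
  sizeBytes: number;
  modifiedMs: number; // last-modified time, epoch milliseconds
}

// 3GB default from the text; the binary interpretation is an assumption.
const DEFAULT_LIMIT_BYTES = 3 * 1024 ** 3;

// Returns the entries to delete, oldest first, until the total fits the limit.
function selectEvictions(
  entries: CacheEntry[],
  limitBytes: number = DEFAULT_LIMIT_BYTES,
): CacheEntry[] {
  let total = entries.reduce((sum, entry) => sum + entry.sizeBytes, 0);
  const oldestFirst = [...entries].sort((a, b) => a.modifiedMs - b.modifiedMs);
  const evicted: CacheEntry[] = [];
  for (const entry of oldestFirst) {
    if (total <= limitBytes) break;
    evicted.push(entry);
    total -= entry.sizeBytes;
  }
  return evicted;
}
```

Keeping the selection separate from the actual file deletion makes the policy easy to test, and the same function can serve both the startup check and a manual cleanup.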

IPC Channels

| Channel Name | Direction | Payload | Purpose |
| --- | --- | --- | --- |
| transcode-for-preview | Renderer → Main | { filePath } | Request video transcoding |
| transcode-start | Main → Renderer | { outputPath, duration } | Transcoding started, begin progressive playback |
| transcode-progress | Main → Renderer | { percent, transcodedDuration } | Real-time progress update |
| cache:get-size | Renderer → Main | - | Get preview cache size |
| cache:clear | Renderer → Main | - | Clear preview cache |

🧩 Core Module Descriptions

1. Generation Services Module (src/services/generation/) [NEW]

This is the refactored core business logic module, splitting the original Gemini API logic by responsibility:

| Submodule | File/Directory | Function Description |
| --- | --- | --- |
| pipeline | index.ts | Generation Flow Orchestrator, Coordinates Transcription, Extraction, Generation Full Flow |
| | pipelineCore.ts | [NEW] Shared context and dependency injection |
| | chunkProcessor.ts | Single Chunk Processing Logic (Transcribe -> Wait Glossary/Speaker -> Translate) |
| | translation.ts | Specific Translation Execution Logic |
| | glossaryHandler.ts | Glossary Application Logic |
| | resultTransformers.ts | Result Transformation and Post-processing Logic |
| | core/BaseStep.ts | [NEW] Base step class, defines unified interface |
| | steps/*.ts | [NEW] Step implementations (Transcription, Refinement, Alignment, Translation, Proofread) |
| extractors | glossary.ts | Glossary Extractor (Gemini Pro + Search) |
| | speakerProfile.ts | Speaker Profile Extractor |
| batch | proofread.ts | [NEW] Batch proofreading operations |
| | regenerate.ts | [NEW] Batch regenerate operations (full pipeline re-run) |

2. Gemini API Core (src/services/api/gemini/core/)

Retains only the most basic API interaction capabilities:

| File | Function Description |
| --- | --- |
| client.ts | Gemini API Client Encapsulation, Handles auth, retry, and quota |
| prompts.ts | Basic Prompt Template Library |
| schemas.ts | Structured Output Schema Definitions |

3. Audio Processing Module (src/services/audio/)

| File | Function Description |
| --- | --- |
| segmenter.ts | Smart Audio Segmenter, Uses Silero VAD Model to Detect Voice Activity, Splits by Semantic Boundaries |
| sampler.ts | Audio Sampling, Generates Audio Samples for AI Analysis |
| decoder.ts | Audio Decoding, Supports Multiple Formats |
| processor.ts | Audio Preprocessing, Normalization, etc. |
| converter.ts | Audio Format Conversion |
| ffmpegExtractor.ts | FFmpeg Audio Extraction (Core Logic) |

4. Subtitle Processing Module (src/services/subtitle/)

| File | Function Description |
| --- | --- |
| parser.ts | Subtitle Parser, Supports SRT/ASS/VTT and other formats |
| generator.ts | Subtitle Export, Generates Bilingual Subtitle Files |
| time.ts | Timecode Processing Tool |
| postCheck.ts | Subtitle Quality Post-check |
| timelineValidator.ts | Subtitle Timeline Logic Validation |
| reconciler.ts | [NEW] Data Reconciliation (Merges Metadata) |

5. Download Service Module (src/services/download/)

| File | Function Description |
| --- | --- |
| download.ts | Video Download Logic Encapsulation |
| utils.ts | Download-related Utility Functions |

6. Electron Desktop (electron/)

| File | Function Description |
| --- | --- |
| main.ts | Electron Main Process, Window Management, IPC Communication |
| preload.ts | Preload Script, Exposes Secure Node.js API |
| logger.ts | Unified Logging System, Supports File Rotation, JSON View, and Multi-level Logs |
| utils/paths.ts | [NEW] Portable path resolution, supports exe-relative storage |
| services/localWhisper.ts | Local Whisper Model Call (whisper.cpp), with GPU detection |
| services/ffmpegAudioExtractor.ts | FFmpeg Audio Extraction, Supports Video Files |
| services/ytdlp.ts | Video Download Service (YouTube/Bilibili) |
| services/videoCompressor.ts | Video Encoding Service (Supports NVENC/QSV/AMF Hardware Acceleration) |
| services/videoPreviewTranscoder.ts | Video Preview Transcoding, fMP4 for progressive playback, cache management |
| services/endToEndPipeline.ts | Full Auto Pipeline, Orchestrates Download-Transcribe-Encode Full Flow |
| services/storage.ts | Portable storage service, config and logs stored in exe-relative directory |

7. Internationalization Module (src/locales/, src/i18n.ts) [NEW]

Full i18n support powered by i18next, enabling a multilingual UI (Chinese/English/Japanese):

| File/Directory | Function Description |
| --- | --- |
| i18n.ts | i18n Configuration Entry, Initializes i18next with React bindings |
| locales/ | Translation Resources Root Directory |
| zh-CN/ | Chinese (Simplified) translations, 14 namespace files |
| en-US/ | English translations, mirrors zh-CN structure |
| ja-JP/ | Japanese translations, mirrors zh-CN structure (New in v2.13) |

Namespace Organization:

| Namespace | Content |
| --- | --- |
| common | Shared texts (buttons, labels, errors) |
| home | Home page content |
| workspace | Workspace page |
| editor | Subtitle editor |
| settings | Settings modal |
| endToEnd | End-to-end wizard |
| modals | Business modals |
| services | API service messages |
| compression | Video compression page |
| download | Download page |
| progress | Progress indicators |
| ui | UI components |
| app | App-level texts |

8. Settings Module (src/components/settings/) [Refactored in v2.13]

v2.13 refactors the settings panel into a modular tabs structure:

| File/Directory | Function Description |
| --- | --- |
| SettingsModal.tsx | Settings modal container, manages tab switching |
| tabs/GeneralTab.tsx | General settings (language, theme, etc.) |
| tabs/ServicesTab.tsx | API service config (Gemini, OpenAI keys) |
| tabs/EnhanceTab.tsx | Enhancement features (glossary, speaker toggles) |
| tabs/PerformanceTab.tsx | Performance settings (concurrency, cache, etc.) |
| tabs/DebugTab.tsx | Debug options (mock mode, log level) |
| tabs/AboutTab.tsx | [NEW] About page (version, Whisper status, GPU detection) |
| AlignmentSettings.tsx | Alignment service configuration |
| LocalWhisperSettings.tsx | Local Whisper configuration |
| CacheManagement.tsx | Cache management UI |

About Page (AboutTab) Features:

  • Display app version and build info
  • Local Whisper status detection
  • GPU hardware acceleration detection (NVENC/QSV/AMF)
  • Log file path and viewer entry
  • System information overview

🔧 Technical Highlights

Concurrency Control

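```typescript
// Use a Semaphore to cap how many chunks are processed at the same time.
// Semaphore, mapInParallel, chunks, concurrency, and processChunk come from
// the surrounding project code.
const semaphore = new Semaphore(concurrency);
await mapInParallel(chunks, async (chunk) => {
  await semaphore.acquire(); // wait for a free slot
  try {
    return await processChunk(chunk);
  } finally {
    semaphore.release(); // always free the slot, even on failure
  }
});
```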
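The snippet above relies on a Semaphore class and a mapInParallel helper that this document does not show. A minimal sketch of what they could look like follows. Both definitions are assumptions written for illustration, not the project's implementations.

```typescript
// Minimal counting semaphore (illustrative; the project's class may differ).
class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    if (permits < 1) throw new Error("permits must be >= 1");
    this.available = permits;
  }

  // Resolves once a permit is held by the caller.
  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  // Hands the permit to the next waiter, or returns it to the pool.
  release(): void {
    const next = this.waiters.shift();
    if (next) next();
    else this.available++;
  }
}

// Starts every task at once; the semaphore inside fn limits real concurrency.
async function mapInParallel<T, R>(
  items: T[],
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  return Promise.all(items.map(fn));
}
```

Passing a permit directly to the next waiter in release() avoids a window in which a newly arriving task could take the slot ahead of tasks that are already queued.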

Model Selection Strategy

Model configuration is centralized in src/config/models.ts, supporting different model selection per processing step:

| Processing Step | Default Model | Features |
| --- | --- | --- |
| refinement | Gemini 2.5 Flash | Timeline Correction (Avoids 3.0 Timestamp Bug) |
| translation | Gemini 3 Flash Preview | Translation, Search Grounding |
| glossaryExtraction | Gemini 3 Pro Preview | Multimodal, Term Extraction |
| speakerProfile | Gemini 3 Pro Preview | Speaker Analysis |
| batchProofread | Gemini 3 Pro Preview | High Quality Proofreading, Search Grounding |

Note: As of v2.13, batchFixTimestamps has been replaced by the regenerate operation. Regenerate re-runs the full pipeline (transcription → refinement → alignment → translation).

Each step can be independently configured:

  • thinkingLevel: Thinking Depth (none/low/medium/high)
  • useSearch: Whether to Enable Google Search
  • maxOutputTokens: Maximum Output Tokens
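A per-step configuration of this kind could be typed roughly as below. The type and function names are hypothetical, the model strings are the display names from the table rather than API model IDs, and the useSearch defaults are inferred from the "Search Grounding" entries. The actual contents of src/config/models.ts are not shown in this document.

```typescript
type ThinkingLevel = "none" | "low" | "medium" | "high";

// The three per-step options listed above, plus the model itself.
interface StepModelConfig {
  model: string;
  thinkingLevel?: ThinkingLevel;
  useSearch?: boolean;
  maxOutputTokens?: number;
}

type StepName =
  | "refinement"
  | "translation"
  | "glossaryExtraction"
  | "speakerProfile"
  | "batchProofread";

// Display names from the table (not API model IDs).
const STEP_DEFAULTS: Record<StepName, StepModelConfig> = {
  refinement: { model: "Gemini 2.5 Flash" },
  translation: { model: "Gemini 3 Flash Preview", useSearch: true },
  glossaryExtraction: { model: "Gemini 3 Pro Preview" },
  speakerProfile: { model: "Gemini 3 Pro Preview" },
  batchProofread: { model: "Gemini 3 Pro Preview", useSearch: true },
};

// Merges user overrides on top of the defaults for one step.
function resolveStepConfig(
  step: StepName,
  overrides: Partial<StepModelConfig> = {},
): StepModelConfig {
  return { ...STEP_DEFAULTS[step], ...overrides };
}
```

For instance, `resolveStepConfig("translation", { thinkingLevel: "low" })` keeps the default model and search setting while changing only the thinking depth.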

Retry Mechanism

```typescript
// Automatically retries recoverable errors.
// Signature only; the function body is omitted in this document.
async function generateContentWithRetry(
  ai: GoogleGenAI,
  params: any,
  retries = 3,
  signal?: AbortSignal,
  onUsage?: (usage: TokenUsage) => void,
  timeoutMs?: number
);
```
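Since only the signature is shown, the following sketch illustrates the general retry pattern it suggests: retry recoverable failures with exponential backoff and rethrow everything else. The `withRetry` helper, the `isRecoverable` predicate, and the backoff schedule are assumptions, not the behavior of generateContentWithRetry.

```typescript
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical generic helper. With retries = 3 the operation runs at most
// four times: one initial attempt plus three retries.
async function withRetry<T>(
  operation: (attempt: number) => Promise<T>,
  isRecoverable: (err: unknown) => boolean,
  retries = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      // Give up on non-recoverable errors or when retries are exhausted.
      if (attempt >= retries || !isRecoverable(err)) throw err;
      // Exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
      await delay(baseDelayMs * 2 ** attempt);
    }
  }
}
```

A caller would typically classify rate-limit and transient network failures as recoverable and let authentication or validation errors fail immediately.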

📊 Data Flow Architecture

Main Data Flow Diagram

Data Type Conversion Chain

State Data Flow


🚀 Deployment Architecture


📝 Development Guide

Environment Requirements

  • Node.js 18+
  • npm or yarn

Quick Start

```bash
# Install Dependencies
yarn install

# Web Development Mode
yarn dev

# Electron Development Mode
yarn electron:dev

# Build Electron Application
yarn electron:build
```

Environment Variables

```env
GEMINI_API_KEY=your_gemini_key    # Required: Translation and Proofreading
OPENAI_API_KEY=your_openai_key    # Optional: Cloud Whisper
```

📚 References

Released under the MIT License.