
Pipeline Flow

🔄 Complete Pipeline Concurrent Architecture

The following diagram shows the complete concurrent architecture of the subtitle generation pipeline, including parallel async tasks, semaphore-based concurrency control, and inter-task dependencies:


🔒 Dual Semaphore Concurrency Control

Configuration:

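| Semaphore | Purpose | Default Concurrency | Config Key |
|---|---|---|---|
| transcriptionSemaphore | Limits concurrent Whisper API calls | Local: 1, Cloud: 5 | whisperConcurrency |
| refinementSemaphore | Limits concurrent Gemini Flash calls | 5 | concurrencyFlash |
| alignmentSemaphore | Limits concurrent alignment jobs | 2 | concurrencyAlignment |
| (Glossary internal) | Limits concurrent Gemini Pro calls | 2 | concurrencyPro |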
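A minimal TypeScript sketch of how per-stage semaphores like these could be wired up. The Semaphore class, the ConcurrencySettings shape, createSemaphores, and the useLocalWhisper flag are hypothetical illustrations; only the config key names and default values come from the table above.

```typescript
// Hypothetical sketch: a counting semaphore that lets at most `limit` tasks run at once.
class Semaphore {
  readonly limit: number;
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(limit: number) {
    if (!Number.isInteger(limit) || limit < 1) {
      throw new Error(`concurrency limit must be a positive integer, got ${limit}`);
    }
    this.limit = limit;
    this.available = limit;
  }

  private async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--; // take a free permit immediately
      return;
    }
    // No permit free: wait until release() hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // pass the permit straight to the next waiter
    } else {
      this.available++;
    }
  }

  // Run `task` while holding a permit; the permit is returned even if the task throws.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// Config keys and defaults from the table; the object shape itself is assumed.
interface ConcurrencySettings {
  whisperConcurrency?: number;
  concurrencyFlash?: number;
  concurrencyAlignment?: number;
  concurrencyPro?: number;
}

// `useLocalWhisper` is a hypothetical flag selecting the Local vs Cloud default.
function createSemaphores(settings: ConcurrencySettings, useLocalWhisper: boolean) {
  return {
    transcription: new Semaphore(settings.whisperConcurrency ?? (useLocalWhisper ? 1 : 5)),
    refinement: new Semaphore(settings.concurrencyFlash ?? 5),
    alignment: new Semaphore(settings.concurrencyAlignment ?? 2),
    glossaryPro: new Semaphore(settings.concurrencyPro ?? 2),
  };
}
```

In this sketch each chunk would wrap its Whisper call in `transcription.run(...)`, so chunks beyond the limit queue up instead of all hitting the API at once.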

📊 6-Stage Chunk Pipeline


🏗️ Pipeline Step Architecture (v2.13)

v2.13 introduced a class-based step architecture that modularizes chunk processing logic:

Step Descriptions:

| Step | File | Input | Output | Purpose |
|---|---|---|---|---|
| TranscriptionStep | TranscriptionStep.ts | AudioChunk | SubtitleItem[] | Whisper speech-to-text |
| WaitForDepsStep | WaitForDepsStep.ts | - | Glossary + Speakers | Wait for extraction to complete |
| RefinementStep | RefinementStep.ts | SubtitleItem[] | SubtitleItem[] | Timeline correction, term application |
| AlignmentStep | AlignmentStep.ts | SubtitleItem[] | SubtitleItem[] | CTC forced alignment |
| TranslationStep | TranslationStep.ts | SubtitleItem[] | SubtitleItem[] | AI translation |
| ProofreadStep | ProofreadStep.ts | SubtitleItem[] | SubtitleItem[] | Batch proofreading (optional) |
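The step table implies a common pattern: each step consumes the previous step's output and may be optional. The following is a hedged sketch of that pattern only. PipelineStep, ChunkContext, runChunkPipeline, and the two stand-in step classes are hypothetical and are not the project's actual classes or signatures.

```typescript
// Hypothetical minimal shapes; the real SubtitleItem and chunk context carry more fields.
interface SubtitleItem {
  startTime: number; // seconds
  endTime: number; // seconds
  original: string;
  translated?: string;
}

interface ChunkContext {
  chunkIndex: number;
  items: SubtitleItem[];
  log: string[]; // which steps ran or were skipped, for illustration
}

// Assumed common contract for step classes (not the project's actual interface).
interface PipelineStep {
  readonly name: string;
  shouldSkip?(ctx: ChunkContext): boolean;
  run(ctx: ChunkContext): Promise<ChunkContext>;
}

// Stand-in for RefinementStep: real refinement would call Gemini Flash.
class TrimRefinementStep implements PipelineStep {
  readonly name = "refinement";
  async run(ctx: ChunkContext): Promise<ChunkContext> {
    const items = ctx.items.map((it) => ({ ...it, original: it.original.trim() }));
    return { ...ctx, items };
  }
}

// Stand-in for the optional ProofreadStep: skipped unless enabled.
class OptionalProofreadStep implements PipelineStep {
  readonly name = "proofread";
  private readonly enabled: boolean;
  constructor(enabled: boolean) {
    this.enabled = enabled;
  }
  shouldSkip(): boolean {
    return !this.enabled;
  }
  async run(ctx: ChunkContext): Promise<ChunkContext> {
    return ctx; // a real step would send the batch to the model here
  }
}

// Runs steps in order, threading each step's output into the next.
async function runChunkPipeline(steps: PipelineStep[], initial: ChunkContext): Promise<ChunkContext> {
  let ctx = initial;
  for (const step of steps) {
    if (step.shouldSkip?.(ctx)) {
      ctx = { ...ctx, log: [...ctx.log, `${step.name}:skipped`] };
      continue;
    }
    const next = await step.run(ctx);
    ctx = { ...next, log: [...next.log, `${step.name}:done`] };
  }
  return ctx;
}
```

Keeping each step behind one small interface is what lets optional stages such as proofreading be toggled without touching the runner.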

⚖️ Batch Operations Comparison (v2.13)

| Feature | Proofread | Regenerate |
|---|---|---|
| File | batch/proofread.ts | batch/regenerate.ts |
| Purpose | Polish existing translations | Completely reprocess segments |
| Flow | Gemini Pro only | Full pipeline (Transcribe→Refine→Align→Translate) |
| Input | Existing SubtitleItem[] | Original audio + time range |
| Preserves | Original timestamps | Nothing (regenerates all) |
| Use Case | Improve translation quality | Fix transcription errors, re-segment |
| Model | Gemini 3 Pro | Whisper + Gemini Flash |
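One way to read the comparison above is as two request shapes with different stage lists and different guarantees. This sketch encodes that reading; BatchOperation, stagesFor, and timestampsPreserved are hypothetical names, and the field names (audioPath, rangeStart, rangeEnd) are assumptions, not the actual inputs of batch/proofread.ts or batch/regenerate.ts.

```typescript
// Hypothetical minimal subtitle shape for this sketch.
interface SubtitleItem {
  startTime: number; // seconds
  endTime: number; // seconds
  original: string;
  translated?: string;
}

// The two batch operations as a discriminated union; field names are assumed.
type BatchOperation =
  | { kind: "proofread"; items: SubtitleItem[] }
  | { kind: "regenerate"; audioPath: string; rangeStart: number; rangeEnd: number };

type Stage = "transcribe" | "refine" | "align" | "translate" | "proofread";

// Stages each operation runs, following the "Flow" row of the table.
function stagesFor(op: BatchOperation): Stage[] {
  if (op.kind === "proofread") {
    return ["proofread"]; // Gemini Pro only
  }
  return ["transcribe", "refine", "align", "translate"]; // full pipeline over the time range
}

// Proofread should leave timing untouched: every start/end must match the input.
function timestampsPreserved(before: SubtitleItem[], after: SubtitleItem[]): boolean {
  return (
    before.length === after.length &&
    before.every((b, i) => b.startTime === after[i].startTime && b.endTime === after[i].endTime)
  );
}
```

A check like timestampsPreserved only makes sense for Proofread; Regenerate is expected to change both timing and segment boundaries.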

🔄 Data Integrity & Reconciliation

The system uses a strict data reconciliation strategy (src/services/subtitle/reconciler.ts) to keep metadata consistent across pipeline stages.

Reconciler Logic

The reconcile(prev, curr) function acts as the "data hub" connecting pipeline stages:

  • Semantic Metadata (always inherited):
    • speaker (Speaker ID/Name)
    • comment (User notes)
    • Logic: Inherited from the prev segment with the highest overlap. Segments produced by a split still inherit the parent's speaker.
  • Internal State (conditionally inherited):
    • alignmentScore (CTC confidence)
    • lowConfidence (Low confidence flag)
    • Logic: Inherited only when a strict 1:1 mapping is detected; reset if segments are split or merged.
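The inheritance rules above can be sketched as follows. This is not the code in reconciler.ts. It assumes that "overlap" means shared time (seconds of intersection), that a 1:1 mapping means the current segment touches exactly one previous segment and that previous segment touches no other current segment, and that a value the current stage already computed itself takes precedence over an inherited one.

```typescript
// Hypothetical minimal segment shape for this sketch.
interface Segment {
  startTime: number; // seconds
  endTime: number; // seconds
  text: string;
  speaker?: string; // semantic metadata
  comment?: string; // semantic metadata
  alignmentScore?: number; // internal state
  lowConfidence?: boolean; // internal state
}

// Assumption: "overlap" means the length of the shared time interval.
function overlap(a: Segment, b: Segment): number {
  return Math.max(0, Math.min(a.endTime, b.endTime) - Math.max(a.startTime, b.startTime));
}

function reconcile(prev: Segment[], curr: Segment[]): Segment[] {
  // How many curr segments touch each prev segment (more than 1 means it was split).
  const splitCount = prev.map((p) => curr.filter((c) => overlap(p, c) > 0).length);

  return curr.map((c) => {
    let parentIdx = -1;
    let best = 0;
    let parents = 0; // prev segments this curr segment touches (more than 1 means a merge)
    prev.forEach((p, i) => {
      const o = overlap(p, c);
      if (o > 0) parents++;
      if (o > best) {
        best = o;
        parentIdx = i;
      }
    });
    if (parentIdx < 0) return { ...c }; // no overlapping parent: nothing to inherit

    const parent = prev[parentIdx];
    // Semantic metadata: always taken from the highest-overlap parent (curr's value if parent has none).
    const out: Segment = {
      ...c,
      speaker: parent.speaker ?? c.speaker,
      comment: parent.comment ?? c.comment,
    };

    if (parents === 1 && splitCount[parentIdx] === 1) {
      // Strict 1:1 mapping: keep a freshly computed value, otherwise inherit the parent's.
      out.alignmentScore = c.alignmentScore ?? parent.alignmentScore;
      out.lowConfidence = c.lowConfidence ?? parent.lowConfidence;
    } else {
      // Split or merge: internal state no longer describes this segment, so reset it.
      delete out.alignmentScore;
      delete out.lowConfidence;
    }
    return out;
  });
}
```

Splitting one 0–2 s segment into two halves keeps the speaker on both halves but clears their alignmentScore, which matches the two bullets above.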

Alignment Strategy (CTC)

  • Engine: ctcAligner.ts calls the external align.exe executable (MMS-300m model)
  • Function: Updates startTime and endTime based on audio alignment, but never splits or merges segments
  • Metadata: Adds alignmentScore; scores below a threshold set the lowConfidence flag
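The contract described above (new timestamps, no splits or merges, a score per segment, and a low-confidence flag below a cutoff) can be sketched as a pure function over the aligner's results. AlignmentResult and applyAlignment are hypothetical, the threshold is a placeholder parameter because the real cutoff is not stated here, and the sketch does not model how align.exe is invoked or how its output is parsed.

```typescript
// Hypothetical minimal segment shape for this sketch.
interface AlignedSegment {
  startTime: number; // seconds
  endTime: number; // seconds
  text: string;
  alignmentScore?: number;
  lowConfidence?: boolean;
}

// Hypothetical per-segment result parsed from the aligner's output.
interface AlignmentResult {
  start: number;
  end: number;
  score: number;
}

// Applies aligner output one-to-one: timestamps change, segment count never does.
// `threshold` is a placeholder parameter; the real cutoff is not documented here.
function applyAlignment(
  items: AlignedSegment[],
  results: AlignmentResult[],
  threshold: number,
): AlignedSegment[] {
  if (results.length !== items.length) {
    throw new Error(`expected ${items.length} alignment results, got ${results.length}`);
  }
  return items.map((item, i) => {
    const r = results[i];
    return {
      ...item,
      startTime: r.start,
      endTime: r.end,
      alignmentScore: r.score,
      lowConfidence: r.score < threshold,
    };
  });
}
```

Rejecting a result list of the wrong length is one way to enforce the "never splits or merges" guarantee, which in turn lets the reconciler treat alignment as a strict 1:1 mapping.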

Released under the MIT License.