SPEECH-TO-TEXT · BROADCAST CAPTIONING

Real-time. Multilingual.
Speech to caption in 0.5 seconds.

Point Media Tech STT is a broadcast-grade real-time speech recognition engine. It recognizes Mandarin, Taiwanese, Hakka, and English, including mixed-language speech. End-to-end latency is under 500ms, and caption accuracy reaches up to 98%. Live news, sports, and post-production all run on one pipeline.

<500ms

End-to-end latency

98%

Caption accuracy

4

Languages mixed

24/7

Unattended captioning

WHY NOW

Moving from manual stenography and post-edit cleanup
to real-time AI captioning is a step change.

Legacy workflow: live broadcasts staff two stenographers, and VOD captions are outsourced for post-edit cleanup. A 30-minute news segment takes four hours to caption. Budgets shrink, and skilled stenographers are harder to hire each year. STT moves transcription onto GPUs: real-time inference under 500ms, mixed-language recognition (Mandarin + Taiwanese + Hakka + English), automatic punctuation, and SRT / EBU-STL output that drops straight into Playout / MAM. One system, three shifts of throughput.

LEGACY · MANUAL CAPTIONING

Stenographers, post-edit cleanup, outsourced cost

Two to three stenographers per live broadcast; VOD captioning outsourced; a 30-minute show takes four hours of cleanup; Taiwanese and Hakka cost extra; nights and weekends bill at premium; staff turnover breaks terminology continuity.

STT · AI CAPTIONING

Real-time, multilingual, automated output

GPU-based real-time inference under 500ms; mixed Mandarin / Taiwanese / Hakka / English recognition; automatic punctuation and segmentation; custom terminology and accent models; SRT / VTT / EBU-STL output drops straight into Playout / MAM; 24/7 unattended.

CAPABILITIES · THREE LAYERS

Three layers, one continuous
real-time captioning pipeline.

Multilingual recognition → real-time inference → broadcast integration. Each layer is sized independently, licensed by channel and language model, and extensible as you grow.

i.

Multilingual ASR

Mandarin (traditional / simplified), Taiwanese (Hokkien), Hakka, English, and Cantonese with mixed recognition; automatic language switching; custom broadcast terminology training; acoustic models for news, sports, variety, and drama; smart transcription of numbers, proper nouns, and acronyms.

  • Mandarin / Taiwanese / Hakka / English / Cantonese
  • Automatic language switching
  • Custom terminology + scene models
  • Accent + background-noise robustness

ii.

Real-time inference

GPU-accelerated inference with end-to-end latency under 500ms; automatic punctuation and segmentation; speaker diarization; multi-channel concurrency; node failover; 24/7 uninterrupted service.

  • < 500ms end-to-end latency
  • 98% caption accuracy
  • Speaker diarization
  • Multi-channel + automatic failover
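
One way to check the latency figure during a trial is to timestamp each segment when its audio is captured and again when its caption is emitted. The sketch below is illustrative only: the function names and the sample timestamps are assumptions, not part of the STT product or its API.

```python
LATENCY_BUDGET_MS = 500  # the end-to-end target stated on this page


def latency_ms(audio_captured_ms: int, caption_emitted_ms: int) -> int:
    """End-to-end latency for one segment, in milliseconds."""
    return caption_emitted_ms - audio_captured_ms


def within_budget_ratio(samples, budget_ms: int = LATENCY_BUDGET_MS) -> float:
    """Fraction of (captured_ms, emitted_ms) pairs that finish under budget."""
    if not samples:
        return 0.0
    on_time = sum(
        1 for captured, emitted in samples
        if latency_ms(captured, emitted) < budget_ms
    )
    return on_time / len(samples)


if __name__ == "__main__":
    # Hypothetical measurements: (audio captured, caption emitted) in ms.
    measured = [(0, 420), (1000, 1480), (2000, 2510), (3000, 3390)]
    print(f"{within_budget_ratio(measured):.0%} of segments under budget")
```

Reporting the share of segments under budget, rather than an average, keeps occasional slow segments visible.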

iii.

Broadcast output

SRT / VTT / TTML / EBU-STL output; direct delivery to Playout systems and Marquee Graphics; write-back to MAM as time-coded indexes; REST API and webhooks; compliance with broadcast caption regulations; HLS / DASH OTT platform support.

  • SRT / VTT / TTML / EBU-STL
  • Playout / Marquee / MAM integration
  • REST API + webhooks
  • Compliance with broadcast caption regulations
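
For readers unfamiliar with SRT, the simplest of the formats listed above, each caption is a numbered block: an index, a time range written as HH:MM:SS,mmm --> HH:MM:SS,mmm, the caption text, and a blank line. The sketch below shows that layout. The `Cue` class and function names are assumptions made for illustration and do not describe how STT itself produces its output.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One caption segment; times are milliseconds from stream start."""
    start_ms: int
    end_ms: int
    text: str


def srt_timestamp(ms: int) -> str:
    """Render milliseconds in the SRT timestamp form HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(cues: list[Cue]) -> str:
    """Serialize cues as SRT blocks: index, time range, text, blank line."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        time_range = f"{srt_timestamp(cue.start_ms)} --> {srt_timestamp(cue.end_ms)}"
        blocks.append(f"{index}\n{time_range}\n{cue.text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([
        Cue(0, 1800, "Good evening."),
        Cue(1800, 4250, "Here are tonight's headlines."),
    ]))
```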

WORKFLOW

Captions are a bridge between speech and screen —
audio in, captions out, end-to-end automated.

STT transcribes speech into captions in real time, accurately and traceably, and writes the results directly into the broadcast workflow.

01
Audio In
Live signal / VOD file · multi-channel
02
ASR
GPU inference · multilingual mixed recognition
03
Punctuate
Punctuation + segmentation + diarization
04
Format
SRT / VTT / EBU-STL · broadcast-spec
05
Deliver
Playout · Marquee · MAM · OTT

FIGURE 01 · REAL-TIME CAPTIONING WORKFLOW (AUDIO → ASR → PUNCTUATE → FORMAT → DELIVER)
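
The five stages in Figure 01 can be pictured as a chain of streaming steps, each consuming the previous one's output. The sketch below is a wiring illustration only. Every name in it is hypothetical, the ASR stage is a stand-in that echoes pre-supplied text instead of running GPU inference, and the format stage emits a plain text line rather than a real caption format.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class AudioChunk:
    start_ms: int
    end_ms: int
    simulated_text: str  # stands in for real audio samples in this sketch


@dataclass
class Caption:
    start_ms: int
    end_ms: int
    text: str


def asr(chunks: Iterable[AudioChunk]) -> Iterator[Caption]:
    """Stage 02 stand-in: a real engine would run recognition here."""
    for chunk in chunks:
        yield Caption(chunk.start_ms, chunk.end_ms, chunk.simulated_text)


def punctuate(captions: Iterable[Caption]) -> Iterator[Caption]:
    """Stage 03 stand-in: capitalize and add a final period if missing."""
    for cap in captions:
        text = cap.text.strip()
        if text and text[-1] not in ".?!":
            text += "."
        yield Caption(cap.start_ms, cap.end_ms, text[:1].upper() + text[1:])


def format_cue(captions: Iterable[Caption]) -> Iterator[str]:
    """Stage 04 stand-in: one plain-text line per caption."""
    for cap in captions:
        yield f"[{cap.start_ms}-{cap.end_ms} ms] {cap.text}"


def deliver(lines: Iterable[str], sink: Callable[[str], None]) -> int:
    """Stage 05 stand-in: push each line to a sink and count deliveries."""
    count = 0
    for line in lines:
        sink(line)
        count += 1
    return count


def run_pipeline(chunks: Iterable[AudioChunk], sink: Callable[[str], None]) -> int:
    """Audio In -> ASR -> Punctuate -> Format -> Deliver."""
    return deliver(format_cue(punctuate(asr(chunks))), sink)


if __name__ == "__main__":
    live_feed = [AudioChunk(0, 1500, "good evening"),
                 AudioChunk(1500, 3000, "is it raining?")]
    run_pipeline(live_feed, print)
```

Because each stage is a generator, a caption can reach the sink as soon as its chunk has passed through, without waiting for the rest of the stream.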

SPECIFICATIONS

Engineering specifications.

A full spec sheet and live engineering walkthrough are available on request.

Languages supported: Mandarin (traditional / simplified) · Taiwanese · Hakka · English · Cantonese · automatic switching · custom terminology
Recognition performance: End-to-end latency < 500ms · caption accuracy 98% (standard scenarios) · speaker diarization
Input formats: SDI / NDI / RTMP / SRT live streams · WAV / MP4 / MXF VOD files · multi-channel concurrency
Output formats: SRT · VTT · TTML · EBU-STL · CEA-608/708 · live RTMP / SRT caption tracks
System integration: Playout · Marquee Graphics · MAM · REST API · webhooks · OTT (HLS / DASH)
Hardware requirements: NVIDIA GPU (T4 / A10 / L4 entry-level) · Linux · containerized deployment · channel count scales with GPU
Licensing: Licensed per speech recognition channel and language model; monthly or annual subscription; software license + annual maintenance

HOW TO START

Three steps, from evaluation to production.

i.

Request a trial

Submit the contact form. Sales will respond within one business day to schedule a technical consultation.

ii.

Integration assessment

Our consultant reviews channel count, language model needs, and integration with your existing Playout / Marquee / MAM to size the GPU configuration and trial plan.

iii.

Deploy and go live

After GPU deployment and terminology customization, we run acceptance tests covering both live and VOD scenarios, then phase the rollout to keep your captioning workflow running.

READY?

Ready to make captions
follow the voice?

Book a demo and see what AI captioning does to live news, sports, and post-production.

FAQ

Frequently asked questions.

Which languages does STT currently support for speech recognition?

STT supports multilingual speech recognition including Mandarin, Taiwanese (Hokkien), Hakka, English, and Cantonese. Acoustic and language models can be customized for broadcast-specific terminology and accents to maintain accurate transcription even with background noise or overlapping speakers.

What concrete advantages does automated STT offer compared to manual captioning?

STT delivers end-to-end latency under 500ms and caption accuracy up to 98%, and can process multiple real-time streams concurrently. Compared to manual captioning, it significantly reduces labor costs and supports 24/7 unattended automated caption generation.

Which caption output formats does STT support?

STT exports industry-standard caption formats including SRT, VTT, TTML, and EBU-STL, and integrates directly with playout systems, media asset managers (MAM), and OTT platforms, meeting broadcast industry standards.

How is STT licensed?

STT is licensed per speech recognition channel and language model, with monthly and annual subscription options. Contact sales for detailed pricing; licensing is tailored to your actual speech recognition workload.

What system specifications are required to run STT?

STT requires GPU compute resources for real-time AI inference and runs on Linux. Exact specs depend on the number of concurrent channels and language model complexity; our technical consultants will help size the right configuration.

How do I start evaluating and trialing STT?

Submit the contact form. Sales will respond within one business day to schedule a technical consultation, assess your speech recognition requirements, and plan an appropriate trial.

What are the advantages of AI captioning over traditional manual captioning?

Point Media Tech STT converts speech to captions in real time, more than 10x faster than manual transcription, and supports mixed recognition of Mandarin, Taiwanese, Hakka, and English. AI captioning is well suited to time-critical scenarios such as live news and sports broadcasts, and can also be used for batch caption generation in post-production — reducing both labor costs and turnaround time.