1 min readfrom Machine Learning

I built a real-time pipeline that reads game subtitles and converts them into dynamic voice acting (OCR → TTS → RVC) [P]

I've been experimenting with real-time pipelines that combine OCR + TTS + voice conversion, and I ended up building a desktop app that can "voice" game subtitles dynamically.

The idea is simple: - Capture subtitles from screen (OCR) - Convert them into speech (TTS) - Transform the voice per character (RVC)

But the hard parts were: - Avoiding repeated subtitle spam (similarity filtering) - Keeping latency low (~0.3s) - Handling multiple characters with different voice models without reloading - Running everything in a smooth pipeline (no audio gaps)

One thing that helped a lot was using a two-stage pipeline: While one sentence is playing, the next one is already processed in the background.

I also experimented with: - Emotion-based voice changes - Real-time translation (EN → TR) - Audio ducking (lowering game sound during speech)

I'm curious: How would you approach reducing latency further in a multi-model setup like this? Or is there a better alternative to RVC for real-time character voice conversion?

Happy to share more technical details if anyone is interested.

submitted by /u/fqtih0
[link] [comments]

Want to read more?

Check out the full article on the original site

View original article

Tagged with

#real-time data collaboration
#real-time collaboration
#financial modeling with spreadsheets
#rows.com
#natural language processing for spreadsheets
#generative AI for data analysis
#cloud-based spreadsheet applications
#Excel alternatives for data analysis
#RVC
#real-time pipeline
#OCR
#TTS
#game subtitles
#voice conversion
#latency
#dynamic voice acting
#similarity filtering
#two-stage pipeline
#multi-model setup
#real-time translation