TL;DR: I built CATTbyCatt because Vtubers need accessible, easy-to-use, high-quality transcription software.
The Problem
A keyword search for “live-transcription software”, “transcription app”, or “real-time ASR” returns either expensive corporate solutions (ElevenLabs, Otter.ai, AWS STT) or highly technical solutions (whisper.cpp, RealtimeSTT).
Let’s address the pricing first. ElevenLabs’ STT API costs, on average, $0.70 an hour on their most popular plan for up to about 63 hours, then $0.96 for each additional hour. Google Cloud also charges $0.96 an hour. Amazon charges $1.44 (!). Before you say that doesn’t seem that bad, remember that Vtubers sometimes stream hundreds of hours per month, and some make no money at all doing so. The nature of streaming is that you do it for free until you “blow up” or build a self-sustaining community. The cost of Vtubing is, in my opinion, already prohibitively high, and it doesn’t need another layer of recurring cost on top.
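To put those rates in perspective, here is a rough TypeScript sketch of the monthly bill. It uses the per-hour figures quoted above and assumes a hypothetical 200 streaming hours a month. The function name and the simple two-tier model (a cheaper rate for the first 63 hours) are simplifications for illustration, not any provider’s official pricing formula.

```typescript
// Rough monthly cost estimate using the per-hour rates quoted above.
// Rates are kept in cents to avoid floating-point rounding surprises.
interface Pricing {
  baseCents: number;    // rate per hour inside the cheaper tier
  tierHours: number;    // hours billed at the base rate
  overageCents: number; // rate per hour once the tier is used up
}

function monthlyCostCents(hours: number, p: Pricing): number {
  const tiered = Math.min(hours, p.tierHours);
  const overage = Math.max(hours - p.tierHours, 0);
  return tiered * p.baseCents + overage * p.overageCents;
}

const elevenLabs: Pricing = { baseCents: 70, tierHours: 63, overageCents: 96 };
const googleCloud: Pricing = { baseCents: 96, tierHours: 0, overageCents: 96 };
const amazon: Pricing = { baseCents: 144, tierHours: 0, overageCents: 144 };

const assumedHours = 200; // hypothetical monthly streaming time

[
  { name: "ElevenLabs", pricing: elevenLabs },
  { name: "Google Cloud", pricing: googleCloud },
  { name: "Amazon", pricing: amazon },
].forEach(({ name, pricing }) => {
  const dollars = (monthlyCostCents(assumedHours, pricing) / 100).toFixed(2);
  console.log(`${name}: $${dollars} per month`);
});
```

Under those assumptions, 200 hours a month comes to $175.62, $192.00, and $288.00 respectively, every month, before the stream has earned anything.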
And then there’s the coding route. Most free speech-to-text tools are built to be plugged into other software, not used as standalone consumer tools, so they print their results to a terminal rather than showing them in a GUI. There are some GUI tools out there, but even those aren’t very suitable for streamers: they have no styling options and no native way to be embedded into a stream, aside from the ol’ “cutting up the window capture” trick.
Another thing is hardware. The aforementioned tools are made to run locally, which means they depend on the machine’s specs, whatever else is using its processing power, and a plethora of other factors. The worst case is a gaming session getting interrupted because the voice detection software is acting up.
All of this means that if you don’t have a programming background AND don’t have a budget for tailor-made ASR (which describes most Vtubers), you are more or less stuck with an audience that speaks your native language.
The Solution
The Captioning and Translating Tool (CATT) is a free/low-cost tool that, well, captions and translates live audio and provides styling to suit the streamer’s branding.
For transcription, it uses the Web Speech API, a free speech-to-text API that is readily available in Google Chrome, Safari, and some other browsers. The recognition itself runs on a cloud server rather than on the streamer’s machine, so it shouldn’t hog too many resources (I hope).
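For the curious, here is a minimal TypeScript sketch of what listening through the Web Speech API can look like. It is not CATT’s actual code: the helper names (`splitCaption`, `startCaptions`) and the small interfaces are made up for illustration, and the browser’s recognizer is looked up loosely because not every TypeScript setup ships typings for it.

```typescript
// Minimal shapes for the parts of a recognition result event used here.
// These are hypothetical local types, not the browser's own typings.
interface CaptionAlternative { transcript: string }
interface CaptionResult { isFinal: boolean; length: number; [index: number]: CaptionAlternative }
interface CaptionEvent {
  resultIndex: number;
  results: { length: number; [index: number]: CaptionResult };
}
interface Caption { finalText: string; interimText: string }

// Split the changed results into text the recognizer has settled on
// and text it may still revise.
function splitCaption(event: CaptionEvent): Caption {
  let finalText = "";
  let interimText = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const text = result[0].transcript; // first alternative is the top guess
    if (result.isFinal) finalText += text;
    else interimText += text;
  }
  return { finalText, interimText };
}

// Start listening. Returns a stop function, or null where the API is missing.
function startCaptions(lang: string, onCaption: (c: Caption) => void): (() => void) | null {
  const g = globalThis as any;
  const Recognition = g.SpeechRecognition ?? g.webkitSpeechRecognition;
  if (!Recognition) return null;
  const recognition = new Recognition();
  recognition.lang = lang;
  recognition.continuous = true;     // keep listening between pauses
  recognition.interimResults = true; // deliver partial text while speaking
  recognition.onresult = (event: CaptionEvent) => onCaption(splitCaption(event));
  recognition.start();
  return () => recognition.stop();
}
```

In a browser page, something like `startCaptions("en-US", c => console.log(c.finalText, c.interimText))` would log captions as you speak. Showing the interim text right away and replacing it once it turns final is what makes captions feel live.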
For translation, it uses OpenAI’s GPT-4.1 nano or the Google Translate API. This is the only part that costs money, and it is free for users willing to provide feedback for the model.
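As a rough idea of what the OpenAI path can involve, the TypeScript sketch below builds a request body for the Chat Completions API. The prompt wording, the `temperature` setting, and the function name are assumptions for illustration, not CATT’s actual prompt or code.

```typescript
// Build a Chat Completions request body that asks for a caption translation.
// The prompt text and settings are illustrative guesses, not CATT's own.
interface ChatMessage { role: "system" | "user"; content: string }
interface TranslationRequest {
  model: string;
  temperature: number;
  messages: ChatMessage[];
}

function buildTranslationRequest(caption: string, targetLanguage: string): TranslationRequest {
  return {
    model: "gpt-4.1-nano",
    temperature: 0, // keep translations as consistent as possible
    messages: [
      {
        role: "system",
        content: `Translate the user's live caption into ${targetLanguage}. Reply with the translation only.`,
      },
      { role: "user", content: caption },
    ],
  };
}

const exampleBody = JSON.stringify(buildTranslationRequest("Thanks for watching!", "Japanese"));
console.log(exampleBody);
```

That JSON would be POSTed to `https://api.openai.com/v1/chat/completions` with an `Authorization: Bearer` header, and the translated text comes back in `choices[0].message.content`. Sending it from a server rather than from the browser keeps the API key out of viewers’ reach.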
For styling, it uses CSS classes to apply styles and animations. For typography, it currently supports Google Fonts, and Adobe Fonts support is in development for logged-in users.
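Here is a small TypeScript sketch of how class-based styling and Google Fonts loading can fit together. The class names (`caption`, `caption--fade-in`, and so on) and both helper functions are hypothetical, not CATT’s real ones. The URL follows the public Google Fonts CSS2 format.

```typescript
// Build a Google Fonts CSS2 stylesheet URL for one family.
// Weights are sorted ascending, which the CSS2 endpoint expects.
function googleFontsUrl(family: string, weights: number[]): string {
  const name = family.trim().split(/\s+/).map((word) => encodeURIComponent(word)).join("+");
  const base = `https://fonts.googleapis.com/css2?family=${name}`;
  if (weights.length === 0) return `${base}&display=swap`;
  const axis = weights.slice().sort((a, b) => a - b).join(";");
  return `${base}:wght@${axis}&display=swap`;
}

// Hypothetical style options a streamer might pick in a settings panel.
interface CaptionStyle {
  animation: "fade-in" | "slide-up" | "none";
  outlined: boolean;
}

// Turn the chosen options into a class list; the matching CSS rules
// (keyframes, text stroke, and so on) would live in a stylesheet.
function captionClasses(style: CaptionStyle): string {
  const classes = ["caption"];
  if (style.animation !== "none") classes.push(`caption--${style.animation}`);
  if (style.outlined) classes.push("caption--outlined");
  return classes.join(" ");
}

console.log(googleFontsUrl("Noto Sans JP", [700, 400]));
console.log(captionClasses({ animation: "fade-in", outlined: true }));
```

The point of going class-based is that switching a caption’s look only means swapping class names on the element, while the animations themselves stay in plain CSS.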
I hope that as this app gets adopted, more and more people will get to watch their favorite Vtubers without a language barrier getting in the way.
Thank you for reading.