Hello! Small update to the app!
First, new translation model!
After a long history of underperforming, the Google Translate API is officially kicked out of the app. It's slow, expensive, and handles raw input pretty horribly. I would be lying if I said I would miss it.
And so, after kicking Google out, what would I be using next? Drum roll please... It's... GOOGLE!!! (And then OpenAI. More on that plot twist in a second.)
This time, it's a generative AI model from Google called Gemma 3 4B. Out of all the models I've tested, this one strikes a pretty good balance between cost, speed, and accuracy. As per usual with Google models, it's the slower but more reserved option compared to GPT-4.1 Nano. By that, I mean the GPT model tries to be more "free" and translate the idea behind the sentence, while Gemma tries to stick to the exact words spoken. I really look forward to hearing feedback on this.
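If you're curious what that "literal vs. free" difference can look like in practice, here's a rough sketch of how a per-model system prompt could nudge a model one way or the other. This is purely illustrative: the names and the wording are mine, not the app's actual prompts.

```ts
// Illustrative sketch only: per-model system prompts steering literal vs. free translation.
// The function name, the style labels, and the prompt wording are placeholders, not the app's real code.
type TranslationStyle = "literal" | "free";

function buildSystemPrompt(targetLang: string, style: TranslationStyle): string {
  const base = `Translate the user's message into ${targetLang}. Output only the translation.`;
  return style === "literal"
    ? `${base} Stay as close to the exact words spoken as possible.` // the Gemma-style behavior
    : `${base} Prioritize conveying the idea naturally over word-for-word accuracy.`; // the GPT-4.1 Nano-style behavior
}

// e.g. buildSystemPrompt("English", "literal")
```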
So, the craziest thing happened. The day I pushed the code for v0.3 to main, OpenAI released their new models, OSS 120B and 20B. And they absolutely destroy, and I mean destroy, Gemma on every possible metric, from cost to intelligence to latency to speed. They even mog GPT-4.1 Nano. So yeah, this is technically v0.3-oss now.
Second, some small optimizations.
I tried to clean up the code so it's more optimized and less spaghetti-like. Although, I'm pretty sure if an actual good developer took a look at the code, they would still pass out from anger and confusion. Also, I made it so that the OBS button just copies the OBS link instead of opening a new tab and annoying the user. It's an easy fix; I was just too lazy until now. Also also, I disabled the faulty chunking algorithm, but that's a topic for another day.
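For the curious, the OBS button change is basically this. A minimal sketch, assuming a browser context; the URL and the button id here are placeholders, not the app's real identifiers.

```ts
// Minimal sketch of the new OBS button behavior: copy the overlay link instead of opening a tab.
// "obs-button" and the /obs path are made up for illustration.
const obsUrl = `${window.location.origin}/obs`; // hypothetical overlay URL

document.getElementById("obs-button")?.addEventListener("click", async () => {
  // Put the link on the clipboard; no new tab, no surprise focus steal.
  await navigator.clipboard.writeText(obsUrl);
});
```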
Lastly, a brand new listening model.
This is a really exciting addition, because it was planned all the way from the beginning. The Web Speech API is a nice tool, but it has its limitations, the biggest one being its accuracy. The new model is there, first and foremost, to fix that issue. Aside from that, a really cool thing this model does is that it supports code-switching, or changing the language mid-session. If someone speaks English in a Korean session, for example, the old listening model tends to get confused. Not with this model. Although with a bit less accuracy, it will try to listen to what was said even if it's not the chosen language.

This new listening model works a little differently from the current one, in that it waits to hear the entire sentence before giving you what it heard. This creates a bit of perceived delay, but it's so, so much more accurate that I think the change is worth it. If you find the delay a bit jarring, you can delay everything else by about 500ms in OBS to compensate for it. Of course, the old Web Speech API will still be available if you choose not to tick the "Use Advanced ASR" toggle. I figure more choice couldn't hurt.
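To give a rough idea of how the toggle routes things, here's a sketch under my own naming, not the actual code. `transcribeWithAdvancedASR` and `captureUtterance` are made-up placeholder helpers; only the Web Speech API branch uses a real browser API.

```ts
// Rough sketch of how the "Use Advanced ASR" toggle might route audio in the browser.
// transcribeWithAdvancedASR and captureUtterance are placeholders, not the app's real helpers.

async function transcribeWithAdvancedASR(_audio: Blob, _lang: string): Promise<string> {
  // Placeholder: in reality this would send the clip to whatever ASR backend is configured.
  throw new Error("not implemented in this sketch");
}

async function captureUtterance(): Promise<Blob> {
  // Placeholder: record mic audio until the speaker pauses, then return the clip.
  throw new Error("not implemented in this sketch");
}

function startListening(useAdvancedASR: boolean, lang: string, onText: (text: string) => void) {
  if (useAdvancedASR) {
    // New path: wait for the whole utterance, transcribe it, then emit one complete sentence.
    // This is where the extra perceived delay comes from.
    captureUtterance()
      .then((audio) => transcribeWithAdvancedASR(audio, lang))
      .then(onText);
  } else {
    // Old path: Web Speech API, which streams interim results as you speak.
    const Recognition =
      (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;
    const recognition = new Recognition();
    recognition.lang = lang;
    recognition.interimResults = true;
    recognition.onresult = (e: any) =>
      onText(e.results[e.results.length - 1][0].transcript);
    recognition.start();
  }
}
```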
That's it for now. The app is actually becoming more and more like what I envisioned when I built the first prototype, which is very cool. We are slowly but surely moving towards the actual public 1.0 release. I am pumped.
Thank you for watching, and thank you for following the development of CATTbyCatt.
Best regards,
Catt