Yesterday I noted the study on racial disparities in automatic speech recognition and how modern ASR did worse than the estimates provided in an old patent. I also noted that humans are built to get better at just about anything they do. I happen to think about this court reporting and automatic speech recognition stuff a lot, and it finally hit me why automatic speech recognition has made so little real progress in the last 20 years: language drift. The way people speak and write English changes over time.

Great example? I’m a gamer, but I’m not entrenched in gamer culture. When someone about six years younger than me said “I’m getting bodied,” I had almost no clue what he was talking about. He was getting beat up by the other team!

The video I linked explains how English words and nomenclature have changed drastically over time. Early English, to me, sounded much more French than anything we know today. If you go back only about 650 years, you reach a point where you are unlikely to understand the English language. Giraffes used to be camelopards. “Verily” used to be a word that people used. Even worse, there was no electricity to charge our stenotypes yet.

To the chagrin of English purists, language drift appears inevitable. But this is also why we need real people studying and mastering English. It gives the rest of us a fighting chance. That’s why a computer program could never do for court reporting what Margie Wakeman Wells did. The computer would only regurgitate the same rules again and again, never reviewing or assessing new information unless a real person told it to.
What does that have to do with automatic speech recognition and court reporting? Our verbal and written languages are changing over time. That’s why literally now means figuratively, literally! ASR is based on machine learning, and it’s unlikely to ever perfect English because English is ever evolving and never perfect. Let’s say a company compiles enough data and creates an algorithm so perfect that it can accurately understand every single one of the billions of speakers on the planet today. Every single day after that moment, speech patterns would drift just a little, until someday they became unrecognizable to the system. Of course, there is not a single country or corporation on the planet allocating enough money or personnel to gather that much data in the first place!
As a secondary matter, as far as I know, a system trained to understand all English dialects is inherently less likely to work than a system trained to understand only standard English. I’ve written extensively about how badly ASR handled AAVE, with accuracy as low as 25 percent. If we train a system on data suited for AAVE, there is a high likelihood it would have worse accuracy for standard speakers. Gain ground on one type of speaker, and lose ground on the other. The main way to compensate for that would be to have a trained operator repeat the proceedings into a system tuned to that operator’s specific voice profile. Guess what? That’s voice writing, something our industry figured out two decades ago.
This is not to say we shouldn’t continue to train and be at the top of our game. But my thoughts on AI are shifting. I used to believe there was some small possibility we would be replaced. I am coming to a place where I do not see us as replaceable under the current model of ASR, not without a trained operator in every seat. And if we’re going to put a trained operator in every seat, stenography is the way to go!
Thank you to recent donors. My PayPal is open to receive donations from those who wish to contribute to the cost of running the blog. If you don’t want to give something for “nothing,” I also designed a Sad Iron Stenographer mug on Zazzle. I make about $0.90 on every sale of the cheaper one and about $10 on every sale of the more expensive one. They are both identical mugs, so buy whichever you find more appropriate. Nothing will make your Mondays happier than the sad iron stenographer, I guarantee* it.