I Figured Out Why ASR Is So Hard To Perfect

Yesterday I noted the study on racial disparities in automatic speech recognition and how modern ASR did worse than the estimates provided in an old patent. I also noted that humans are built to get better at just about anything they do. I happen to think about this court reporting and automatic speech recognition stuff a lot. It finally hit me why automatic speech recognition has made little real progress in the last 20 years: language drift. The way that people speak and write English tends to change over time. Great example? I’m a gamer, but I’m not entrenched in gamer culture. When someone about six years younger than me said “I’m getting bodied,” I had almost no clue what he was talking about. He was getting beaten up by the other team! If you took a look at the video I linked, it explains how words and nomenclature changed drastically in English. Early English, to me, sounded much more French than anything we know today. If you go back only about 650 years, you reach a point where you are unlikely to understand the English language. Giraffes used to be camelopards. “Verily” used to be a word that people used. Even worse, there was no electricity to charge our stenotypes yet. To the chagrin of English purists, language drift appears inevitable. But this is also why we need real people studying and mastering English. It gives the rest of us a fighting chance. That’s why a computer program could never do for court reporting what Margie Wakeman Wells did. The computer would only regurgitate the same rules again and again, never reviewing or assessing new information unless a real person told it to.

What does that have to do with automatic speech recognition and court reporting? Our verbal and written languages are changing over time. That’s why literally now means figuratively, literally! ASR is based on machine learning. It’s unlikely to ever perfect English because English is ever evolving and never perfect. Let’s say a company compiles enough data and creates an algorithm so perfect that it can accurately understand every single one of the billions of speakers on the planet today. Every single day after that moment, speech patterns would change just a little bit, and someday they would be unrecognizable to the system. Of course, there is not a single country or corporation on the planet allocating enough money or personnel to gather that much data in the first place!

As a secondary matter, as far as I know, a system trained to understand all English dialects is inherently less likely to work well than a system trained to understand only standard English. I’ve written extensively about how bad ASR was with AAVE, with accuracy as low as 25%. If we train a system on data suited to AAVE, there is a high likelihood that it would have worse accuracy for standard speakers. Gain ground on one type of speaker and lose ground on the other. The main way to compensate for that would be to have a trained operator use a specific voice profile to select the speaker. Guess what? That’s voice writing, something our industry figured out two decades ago.

This is not to say we shouldn’t continue to train and be at the top of our game. But my thoughts on AI are shifting from what they were. I used to believe there was some small possibility we would be replaced. I am coming to a place where I do not see us as replaceable under the current model of ASR without a trained operator in every seat. If we’re going to do that, stenography is the way to go!

Thank you to recent donors. My PayPal is open to receive donations for those that wish to contribute to the cost of running the blog. If you don’t want to give something for “nothing,” I also designed a Sad Iron Stenographer mug on Zazzle. The cheaper one, I will make about $0.90 for every sale. The more expensive one, I will make about $10 for every sale. They are both identical mugs, so buy whichever you find to be more appropriate. Nothing will make your Mondays happier than the sad iron stenographer, I guarantee* it.

*Product is not guaranteed to make Mondays happier.

What Is Realtime Voice Writing and Why Is It Better Than Digital Reporting?

In our field there are three main modalities for taking the record or captioning. There is stenography, voice writing, and digital recording. Stenography is using a chorded stenotype and computer dictionary to instantaneously take down and transcribe the spoken word. Digital recording is all about letting a microphone pick up the audio and having somebody transcribe it after the fact. Sometimes digital recording proponents insist that they can run the audio through automatic speech recognition (ASR) systems to “assist the transcriber.” I’ve been pretty open about my feelings there.

Transcribers and digital reporters can do better switching to steno.

There are also nonprofits representing each modality. NCRA is all-in for steno. NVRA admits stenographers, but in my mind is really more for voice writers, and rightfully so. AAERT is pro-recording. ATSP is pro-transcriber to the extent it has any court reporting industry presence. There are others like Global Alliance or STTI that claim to be for all three modalities, but I’ve always gotten a “jack of all trades, master of none” vibe from those types of associations.

From information available to me, I believe that NCRA is by far the largest organization and in the best position to handle the court reporter shortage, but NVRA does play an incredibly important role in certifying voice writers. One common problem in the early years of voice writing, which some New York attorneys still hold against voice writers, was that occasionally they could be heard through the mask. Even now, when there is a lot of sibilance, one can infrequently hear a voice writer through the mask. Modern certification requires that the voice writer be able to perform without being heard, and a two-strike policy is employed: the first time a writer is heard during a test, they are tapped on the shoulder. The second time they are heard, they are disqualified. Voice writing tests, like ours, give the voice writer one shot at getting their “voice notes” correct. They are not allowed to repeat or review the test audio. This kind of testing is important and represents the quality standards this industry needs. NVRA confirmed its testing policy in an 8/11/21 e-mail to me.

Most reporters know that voice writing is, at its core, speaking into a Stenomask or other voice mask and allowing automatic speech recognition to assist in the transcription of what’s said. In some settings, a voice writer may use an open mic. Some stenographic reporters may be surprised to learn that realtime voice writing is superior to digital reporting and general ASR use. In general ASR use, the microphone takes input from everyone and the computer system gives its best guess based on the training data it has. A study from last year showed that this technology’s accuracy could drop as low as 25% depending on who is speaking. Realtime voice writing, by comparison, involves a trained operator, the voice writer, often speaking into a closed microphone and utilizing ASR that has been trained to that writer’s voice. In the best of circumstances, that ASR can reliably put out highly accurate transcriptions of the voice writer’s voice, as high as 98%. Many realtime voice writers utilize Dragon by Nuance connected to their preferred CAT software. I guesstimate that Nuance has the best ASR tech, and it’s no coincidence that despite all the other ASR vendors out there, Nuance is the one Microsoft wanted to buy. This lead in technology comes from the system being trained to understand the specific user or voice writer.

One important distinction is the difference between realtime voice writers and voice writers that speak into the mask and have someone else transcribe and do the work. This is very similar to the divide in stenographic reporting where some scopists report having to fill in huge chunks of information missed by the court reporter. A realtime voice writer, like a realtime stenographer, does not have to provide realtime services, but they do maintain the equipment and capability to do so.

The knowledge and preparedness of the voice writer is integral to the integrity of the record produced. Think of all the glitches and anomalies in stenographic CAT software. Think about how reporters create macros and dictionary workarounds every day to deal with them. As an easy example, my software does not like certain punctuation marks to be together. Early in my career, I worked out that placing a backslash between the two marks and then deleting it would override the software’s programming to delete punctuation. Similarly, voice writers have to deal with the complexities of the ASR system, the CAT software, and how they interact in order to overcome word boundary and formatting issues.

The understanding and maintenance of a voice writer’s equipment is also paramount. How the computer “hears” a writer’s voice through one microphone can be vastly different from how it hears that voice through another. Different masks can be given different training configurations to enhance the ASR transcription. Voice writers are speaking into a mask, and when saliva or liquid gets into the mask, it can alter what the computer hears. The competent voice writer monitors their realtime and keeps redundant equipment in case of an equipment failure, including extra masks and multiple audio backups of their “voice notes.” As someone who keeps two stenotypes in case one decides to die mid-trial, I admire the voice writers that take the time to ensure the show goes on in the event of computer problems.

Like us, there are many briefs or triggers voice writers use. The key difference is that they must speak the “steno.” The same way we must come up with a stroke for designating a speaker, they must come up with a voice command. The same way that stenographers must differentiate the word “period” from the punctuation symbol of a period, voice writers historically had to create differentiations. For example, in years gone by, they might have had to say “peerk” for the symbol and “period” for the word. Modern ASR systems are sometimes able to differentiate the word versus the mark without any special command or input from the voice writer! Again, the experience and ability to predict how the software will interpret what is said is an important skill for the realtime voice writer.

The obvious question arises as to why this blog tends to be silent on voice writing. There’s no overt hostility there, and I have deep admiration for the people at the top of the voice writing modality of record taking. Simply put, I truly believe that stenographic reporting is better and will open more doors for students. That’s colored by my own experiences. As of today, voice writers are not allowed to work in my court and be in my civil service title. We can argue about whether they should be allowed, but the simple fact is that New York courts today tend to utilize stenographic reporting or digital recording. It’s easy to see that the qualified voice writer is a far better choice than the digital recording, but I couldn’t say to a student “get into voice writing! You’ll have the same opportunities as I do!”

I also have to present a warning to voice writers and stenographers. I have seen many of us fall into the mindset of “the enemy of my enemy is my friend.” We are much closer to each other than either modality is to digital reporting for the simple reason that we like our jobs. Digital reporting proponents have made little effort to hide that their ultimate goal is to offshore the jobs to Manila, Kenya, India, or wherever they can. Digital reporting proponents want to pay stenographers and voice writers less than half of what they’re worth. Digital reporting proponents don’t even respect their own digital reporters, which is why I’ve suggested those people join the stenographic legion.

There is a tumultuous history between stenographic court reporters and voice writers. I’ve been told by multiple NCRA members that when an effort was made to include voice writers about two decades ago, there was heavy backlash and even some harassment against those that were pro-integration. That was the climate of yesterday. While it seems unlikely that there will be a formal alliance, inclusion, or cooperation, the separation we see today is not the same violent rejection of voice writers seen in the early 2000s. The civility of NCRA’s 2021 business meeting showed that court reporters are ready to disagree without belligerence and keep our industry moving forward. This is more akin to why the North American Olive Oil Association probably doesn’t partner much with the Global Organization for EPA and DHA Omega-3s. Olive oil and fish oil are both fine oils, but every second and cent spent advocating for one could be spent advocating for the other. It doesn’t make much sense to divide the time and resources. That’s where we are today. What the future holds for tomorrow, I can only imagine.

A big thank you to everyone that made this article possible, up to and including the NVRA. One source of my information was the esteemed Tori Pittman. Trained in both stenography and voice writing, Tori gave me a full demonstration of voice writing and agreed to speak at length about voice writing. See the full interview below!

Gartner: 85% of AI Implementations Will Fail By 2022

A series of 2019 predictions by Gartner were reported on by Venture Beat on June 28, 2021. As explained in a prior post, “AI”, or machine learning, relies on datasets and algorithms. If the data is imperfect or incomplete, a computer has a chance of giving bad output. If the algorithm that tells the computer what to do with the data is imperfect, the computer has a chance of giving bad output. It’s easy to point to anecdotal cases where “AI” makes a bad call. There have been reports of discrimination in facial recognition technology, driverless cars killing people, or Amazon’s algorithm deciding to fire drivers that are doing their job. I’ve seen plenty of data on the failings of overhyped technology and commercial ASR. What I hadn’t seen prior to today was somebody willing to put a number on the percentage of AI solutions that succeed. Today, we have that number, and it’s an abysmal 15%.

Perhaps this will not come as a surprise to my readers, considering prior reports that automatic speech recognition (ASR), an example of machine learning, is only 25 to 80 percent accurate depending on who’s speaking. But it will certainly come as a surprise to investors and companies that are dumping money into these technologies. Now there’s a hard number to consider. And that 15% itself is misleading. It’s a snapshot of all AI implementations, not just ASR. ASR makes up only some fraction of the implementations out there. And ASR is so bad that some blogs are starting to claim word error rate isn’t really that important.

Judge,
I know I botched 20 percent of the words.
But word error rate really isn’t that important.
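
For anyone unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the machine's output into what was actually said, divided by the number of words actually said. Below is a minimal Python sketch of that calculation. The function name and the sample sentences are made up for illustration; they do not come from any study or vendor mentioned here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration only. It lowercases and splits on
    whitespace, and does no other normalization (punctuation stays attached).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra word in the output (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: ten words spoken, two of them botched by the machine.
spoken = "the witness said she saw the car leave at noon"
machine = "the witness said she saw the bar leave at new"
print(f"{word_error_rate(spoken, machine):.0%}")  # prints 20%
```

Two wrong words out of ten is a 20 percent word error rate, the same rate as in the excuse to the judge above. Note that both of the botched words in this made-up sentence change what the testimony means.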

That 15% is also misleading in that it’s talking about solutions that are implemented successfully. It is not talking about implementations that provide a positive return on investment (ROI). So imagine having to go to investors and say “our AI product was implemented with 100% success, but there’s still no money in this.”

The Venture Beat article goes on to describe several ways to make AI implementation a success, and I think it’s worth examining them briefly here.

1. Customizing a solution for each environment. No doubt that modeling a solution for every single business individually is bound to make that solution more successful, but it’s also going to take more staff and money. This would be almost like every court reporting company having their own personal software development staff to build their own CaseCAT or Eclipse. Why don’t they do that? It’s hopelessly expensive.
2. Using a robust and scalable platform. The word robust doesn’t really mean anything in this context. Scalability is tied to modular design: the ability to swap out parts of the program that don’t work for specific situations. For this, you need somebody bright and forward-thinking. They have to have the capability to design something that can be modified to handle situations they may not even be aware exist. With the average software engineer commanding in the ballpark of $90,000 a year and the best of them making over $1 million a year, it’s hopelessly expensive.
3. Staying on course once in production. This involves reevaluating and sticking with something that may appear to be dysfunctional. This would be almost like the court reporter coming to the job, botching the transcript, and the client going “yes, I think I’ll use that guy again so that I can get a fuller picture of my operational needs.” It’s a customer service nightmare.
4. Adding new AI use cases over time. Piggybacking on number 3, who is going to want to continue to use AI solutions to patch what the first solution fails to address? This is basically asking businesspeople to trust that it will all work out while they burn money and spend lots of time putting out the fire. It’s a customer service nightmare.

I really respect Venture Beat trying to keep positive about AI in business, even if it’s a hopelessly expensive customer service nightmare.

With some mirth, I have to point something out to those in the field who believe the stenographer shortage is an insurmountable problem: we now know machine learning in the business world has a failure rate that’s right up there with stenographic education’s failure rate. Beyond the potential of exploiting digital reporters or stealing investor money, what makes this path preferable to the one that has worked for the last hundred years? As I wrote a week ago, the competition is going to wise up. Stenographic court reporters are the sustainable business model in this field, and to continue to pretend otherwise is nothing short of fraud.