There’s a lot of conjecture when it comes to automatic speech recognition (ASR) and its ability to replace the stenographic reporter or captioner. You may also see ASR lumped in with NLP, or natural language processing, though NLP is really the broader field of getting computers to work with human language. An important piece of the puzzle is understanding the basics behind artificial intelligence and how complex problems are solved. This can be confusing for reporters because the literature on the topic is full of words and concepts that we simply have a weak grasp on. I’m going to tackle some of that today. In brief, computer programmers are problem solvers. They use datasets and algorithms to solve problems.
What is an algorithm?
An algorithm is a set of instructions that tell a computer what to do. You can also think of it as computer code for this discussion. To keep things simple, computers must have things broken down logically for them. Think of it like a recipe. For example, let’s look at a very simple algorithm written in the Python 3 language:
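The listing itself does not appear here, so what follows is a minimal sketch reconstructed from the description in the next paragraph, not the original code. The exact wording of the two replies and the `read_answer` parameter (added only so the recipe can be tried without typing at a prompt) are assumptions. “Line one” and “line two” are marked in the comments.

```python
def quiz(read_answer=input):
    print("The stenographer is _.")   # line one: put the prompt on the screen
    Stenographer = read_answer()      # line two: Stenographer is whatever you type in
    # Accept the word with a lowercase or an uppercase "a".
    if Stenographer in ("awesome", "Awesome"):
        reply = "You are right!"
    else:
        reply = "The correct answer was awesome."
    print(reply)
    return reply
```

Calling `quiz()` with nothing in the parentheses waits for you to type an answer, while `quiz(lambda: "awesome")` hands it an answer directly.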
Line one tells the computer to put the words “The stenographer is _.” on the screen. Line two creates a variable called Stenographer, and Stenographer is equal to whatever you type in. If you input the word awesome with a lowercase or uppercase “a,” the computer will tell you that you are right. If you input anything else, it will tell you the correct answer was awesome. Again, think of an algorithm like a recipe. The computer is told what to do with the information or ingredients it is given.
What is a dataset?
A dataset is a collection of information. In the context of machine learning, it is a collection that is put into the computer. An algorithm then tells the computer what to do with that information. Datasets will look very different depending on the problem that a computer programmer is trying to solve. As an example, for enhancing facial recognition, datasets may be made up of pictures. A dataset may be a wide range of photos labeled “face” or “not face.” The algorithm might tell the computer to compare millions of pictures. After doing that, the computer has a much better idea of what faces “look like.”
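As a rough illustration, a labeled dataset can be pictured as a list of examples, each paired with its label. The file names below are made up, and a real dataset would hold the image data itself and millions of entries rather than five.

```python
from collections import Counter

# Each entry pairs a (hypothetical) picture with the label a person gave it.
dataset = [
    ("photo_001.jpg", "face"),
    ("photo_002.jpg", "not face"),
    ("photo_003.jpg", "face"),
    ("photo_004.jpg", "face"),
    ("photo_005.jpg", "not face"),
]

# Count how many examples the computer gets to practice on for each label.
label_counts = Counter(label for _, label in dataset)
print(label_counts)
```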
What is machine learning?
As demonstrated above, algorithms can be very simple steps that a computer goes through. Algorithms can also be incredibly complex math equations that help a computer analyze datasets and decide what to do with similar data in the future. One issue that comes up with any complex problem is that no dataset is perfect. For example, with regard to facial recognition, there have been situations with almost 100 percent accuracy with lighter male faces and only 80 percent accuracy with darker female faces. There are two major ways this can happen. One, the algorithm may not accurately instruct the computer on how to handle the differences between a “lighter male” face and a “darker female” face. Two, the dataset may not equally represent all faces. If the dataset has more “lighter male” faces in this example, then the computer will get more practice identifying those faces, and will not be as good at identifying other faces, even if the algorithm is perfect.
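One way a gap like this gets noticed is by scoring the computer separately for each group instead of reporting one overall number. Here is a minimal sketch of that idea. The group names echo the example above, but the counts are invented purely for illustration and the function name is hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(results):
    """Return the share of correct answers for each group.

    `results` is a list of (group, was_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, was_correct in results:
        totals[group] += 1
        if was_correct:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}

# Made-up test results: 10 photos from one group, only 5 from the other.
results = (
    [("lighter male", True)] * 10
    + [("darker female", True)] * 4
    + [("darker female", False)]
)
print(accuracy_by_group(results))
```

Lumped together, these made-up results would read as 14 out of 15 correct, about 93 percent, which hides the weaker group entirely.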
Artificial intelligence / AI / voice recognition, for purposes of this discussion, are all synonymous with each other and with machine learning. The computer is not making decisions for itself like you see in the movies; it is being fed lots of data and using that to make future decisions.
Why Voice Recognition Isn’t Perfect and May Never Be
Computers “hear” sound when a microphone converts the changes in air pressure from a noise into electronic signals, which can then be stored, analyzed, or played back through a speaker. A dataset for audio recognition might look something like a clip of someone speaking paired with the words that are spoken. There are many factors that complicate this. Datasets might be focused on speakers that speak in a grammatically correct fashion. Datasets might focus on a specific demographic. Datasets might focus on a specific topic. Datasets might focus on audio that does not have background noises. Creating a dataset that accurately reflects every type of speaker in every environment, and an algorithm that tells the computer what to do with it, is very hard. “Training” the computer on imperfect datasets can result in a word error rate (roughly, the share of words the computer gets wrong) of up to 75 percent.
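Word error rate is commonly calculated as the number of word substitutions, deletions, and insertions needed to turn the computer’s transcript into the correct one, divided by the number of words actually spoken. A minimal sketch of that calculation follows. The function name and the sample sentences are made up for illustration, and real scoring tools also normalize punctuation, numbers, and so on.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # a spoken word is missing
            insertion = current[j - 1] + 1  # an extra word was added
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# Made-up example: two of the seven spoken words come out wrong.
spoken = "the witness said she saw the car"
machine = "the witness said he saw a car"
wer = word_error_rate(spoken, machine)
print(f"{wer:.0%}")
```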
This technology is not new. There is a patent from 2000 that seems to be a design for audio and stenographic transcription to be fed to a “data center.” That patent was assigned to Nuance Communications, the owner of Dragon, in 2009. From the documents, as I interpret them, it was thought that 20 to 30 hours of training could result in 92 percent accuracy. One thing is clear: as far back as 2000, 92 percent accuracy was in the realm of possibility. As recently as April 2020, the data studied from Apple, IBM, Google, Amazon, and Microsoft showed 65 to 80 percent accuracy. Assuming, from Microsoft’s intention to purchase Nuance for $20 billion, that Nuance is the best voice recognition on the market today, there’s still zero reason to believe that Nuance’s technology is comparable to court reporter accuracy. Nuance Communications was founded in 1992. Verbit was founded in 2016. If the new kid on the block seriously believes it has a chance of competing, and it seems to, that’s a pretty good indicator that Nuance’s lead is tenuous, if it exists at all. There’s a list of problems for automation of speech recognition, and even though computer programmers are brilliant people, there’s no guarantee any of them will be “perfectly solved.” Dragon trains to a person’s voice to get its high level of accuracy. It simply would not make economic sense to spend hours training software to the voice of everyone who is going to speak in court, forever until the end of time, and the process would be susceptible to sabotage or mistake if it were unmonitored and/or self-guided (AKA cheap).
This is all why legal reporting needs the human element. We are able to understand context and make decisions even when we have no prior experience with a situation. Think of all the times you’ve heard a qualified stenographer, videographer, or voice writer say “in 30 years, I’ve never seen that.” For us, it’s just something that happens, and we handle whatever the situation is. For a computer that has never been trained with the right dataset, it’s catastrophic. It’s easy, now, to see why even AI proponents like Tom Livne have said that they will not remove the human element.
Why Learning About Machine Learning Is Important For Court Reporters
Machine learning, or applications fueled by machine learning, are very likely to become part of our stenographic software. If you don’t believe me, just read this snippet about Advantage Software’s Eclipse AI Boost.
If you’ve been following along, you’ve probably figured out, and the snippet pretty much lays it out, that datasets are needed to train “AI.” There are a few somewhat technical questions that stenographic reporters will probably want answered at some point:
- Is this technology really sending your audio up to the Cloud and Google?
- Is Google’s transcription reliable?
- How securely is the information being sent?
- Is the reporter’s transcription also being sent up to the Cloud and Google?
The reasons for answering?
- The sensitive nature of some of our work may make it unsuitable for being uploaded. To the extent stuff may be confidential, privileged, or ex parte, court reporters and their clients may simply not want the audio to go anywhere.
- Again, as shown in “Racial disparities in automated speech recognition” by Allison Koenecke, et al., Google’s ASR word error rate can be as high as 30 percent. Having to fix 30 percent of a job is a frightening possibility that could be more of a hindrance than a help. I’m a pretty average reporter, and if I don’t do any defining on a job, I only have to fix 2 to 10 percent of any given job.
- If we assume that everyone is fine with the audio being sent to the cloud, we must still question the security of the information. I assume that the best encryption possible would be in use, so this would be a minor issue.
- The reporter’s transcription not only carries all the same confidential information discussed in point 1, but also would provide helpful data to make the AI better. Reporters will have to decide whether they want to help improve this technology for free. If the reporter’s transcription is not sent up with the audio, then the audio would ostensibly only be useful for improving the technology if human transcribers went through it, similar to what Facebook was caught doing two years ago. Do we want outside transcribers having access to this data?
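To put the percentages from the second point in perspective, here is a quick back-of-the-envelope calculation. The 10,000-word job length and the function name are assumptions picked only to make the arithmetic concrete.

```python
def words_to_fix(total_words, error_percent):
    """Words needing correction at a given whole-number error percentage."""
    return total_words * error_percent // 100

job_length = 10_000  # hypothetical job length in words

asr_fixes = words_to_fix(job_length, 30)       # 30 percent word error rate
reporter_low = words_to_fix(job_length, 2)     # low end of the 2 to 10 percent range
reporter_high = words_to_fix(job_length, 10)   # high end of that range
print(asr_fixes, reporter_low, reporter_high)
```

On a job that size, that is 3,000 corrections against somewhere between 200 and 1,000.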
Our technological competence changes how well we serve our clients. Nobody reading this needs to become a computer genius, but being generally aware of how these things work and some of the material out there can only benefit reporters. In one of my first posts about AI, I alluded to the fact that just because a problem is solvable does not mean it will be solved. I didn’t have any of the data I have today to assure me that my guess was correct. But I saw how tech news was demoralizing my fellow stenographers, and I called it as I saw it even though I risked looking like an idiot.
It’s my hope that reporters can similarly let go of fear and start to pick apart the truth about what’s being sold to them. Talk to each other about this stuff, pros and cons. My personal view, at this point, is that a lot of these salespeople saw a field with a large percentage of women sitting on a nice chunk of the “$30 billion” transcription industry, and assumed we’d all be too risk averse to speak out on it. Obviously, I’m not a woman, but it makes a lot of sense. Pick on the people that won’t fight back. Pick on the people that will freeze their rates for 20 or 30 years. Keep telling a lie and it will become the truth because people expect it to become the truth. Look how many reporters believe audio recording is cheaper even when that’s not necessarily true.
Here’s my assumption: a little bit of hope and we’ve won. Decades ago, a scientist named Richter did an experiment where rats were placed in the water. It took them a few minutes to drown. Another group of rats were taken out of the water just before they drowned. The next time they were submerged, they swam for hours to survive. We’re not rats, we’re reporters, but I’ve watched this work for humans too. Years ago, doctors estimated a family member would live about six more months. We all rallied around her and said “maybe they’re wrong.” She went another three years. We have a totally different situation here. We know they’re wrong. Every reporter has a choice: sit on the sideline and let other people decide what happens, or become advocates for the consumers we’ve been protecting for the last 140 years, since before the stenotype design we use today was even invented. People have been telling stenographers that their technology is outdated since before I was born, and it’s only gotten more advanced since that time. Next time somebody makes such a claim, it’s not unreasonable for you to question it, learn what you can, and let your clients know what kind of deal they’re getting with the “new tech.”
Some readers checked in about the Eclipse AI Boost, and as it was relayed to me, the agreement is that Google will not save the audio and will not be taking the stenographic transcriptions. Assuming that this is true, my current understanding of the tech is that stenographers would not be helping improve the technology by using it, unless there’s some clever wordplay going on along the lines of “we’re not saving the audio, we’re just analyzing it.” At this point, I have no reason to suspect that kind of a game. In my view, our software manufacturers tend to be honest because there’s simply no truth worth getting caught in a lie over. The worst I have seen are companies using buzzwords to try to appease everyone, and I have not seen that from Advantage.
Admittedly, I did not reach out to Advantage myself because this was meant to assist reporters with understanding the concepts as opposed to a news story. But I’m very happy people took that to heart and started asking questions.