The Potential Pitfalls of DIY Speech Analytics

By Rick Britt, VP of Artificial Intelligence at CallMiner

Organizations are building in-house data science or artificial intelligence teams to use emerging technology and techniques to harness the power of their data. With the growth of these internal data science teams, companies want greater control of all aspects of their data programs so they can be more nimble and effective. If done correctly, this provides more opportunities for creativity and experimentation with internal and external data. “Science teams” might conjure images of NASA space camp t-shirts, quotes from dystopian novels and late-night sci-fi board game parties. Still, science is inherently cool, and it’s especially cool when applied to speech analytics. However, organizations should draw the line between data science and software development, and be cautious when taking on data science projects that are not core to their business.

Scientists, by their nature, don’t inherently trust results or outcomes unless they are logically transparent or tested. The problem comes when organizations deploy in-house data scientists on projects outside their core competence. There is an allure here, because building new metadata internally on a company’s highly valuable data assets can provide new intellectual property and possibly a competitive advantage. Speech recognition systems and machine learning algorithms are readily available through various cloud computing platforms. With all this upside, what’s the risk?

When it comes to complicated software like speech analytics, many organizations don’t fully realize the complexity of building their own solution until they are highly invested, and they are often left footing a large bill for a suboptimal outcome. Here are some thoughts on the benefits of DIY speech analytics software development and some pitfalls:

Natural Language Processing (NLP) is Cool

Conversational AI and NLP are some of the coolest and most avant-garde AI research fields out there. Research in this field covers understanding human conversation and the moonshot possibility of creating a fully robotic system that can carry on a conversation with humans. The advancements in NLP and the commoditization of speech recognition transcribers, married with the power of current deep learning technologies, lower the cost of entry to a cutting-edge research field.

Problem 1: Transcription Speed and Accuracy

Don’t be fooled by one-to-one audio transcription rates of high-quality audio. Speech recognition software has come a long way in just the past few years (see Moore’s Law). There are lots of options, even free ones, that produce acceptable transcription, but not at scale. Many technologies offer great results for one-for-one transcription: one call transcribed at a time, in real time, so a 5-minute call takes 5 minutes to process, and the next one starts only when it completes. So, if an organization takes just a few calls, say 2,000 a day, that’s probably near 10,000 minutes of recorded audio. With a single one-for-one processing system, that is about 167 hours, or just shy of seven days, to complete one day’s calls. Pretty simple math to scale.
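The arithmetic above can be sketched in a few lines. This is only an illustration: the function names and the real_time_factor parameter are assumptions made for the example, and the 5-minute average call length is inferred from the figures in the text.

```python
import math

MINUTES_PER_DAY = 24 * 60


def backlog_days(audio_minutes: float, real_time_factor: float = 1.0, workers: int = 1) -> float:
    """Wall-clock days needed to transcribe `audio_minutes` of audio.

    real_time_factor is processing minutes per audio minute (1.0 = one-for-one).
    workers is the number of transcribers running in parallel.
    """
    processing_minutes = audio_minutes * real_time_factor / workers
    return processing_minutes / MINUTES_PER_DAY


def workers_to_keep_up(daily_audio_minutes: float, real_time_factor: float = 1.0) -> int:
    """Smallest number of parallel transcribers that clears one day's audio within one day."""
    return math.ceil(daily_audio_minutes * real_time_factor / MINUTES_PER_DAY)


daily_minutes = 2_000 * 5  # 2,000 calls at roughly 5 minutes each
print(round(backlog_days(daily_minutes), 2))  # 6.94
print(workers_to_keep_up(daily_minutes))      # 7
```

Under these assumptions, a one-for-one system needs at least seven transcribers running around the clock just to avoid falling behind, before any retries or peak-hour bursts are considered.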

There is a trade-off between speed and accuracy. Typical transcription accuracy rates can range dramatically because the audio quality of telephony conversations or recordings is greatly affected by a number of factors, including line quality, background noise, and compression of the signal for transmission or storage. Potentially more challenging is a mono audio signal, where the agent and the customer are recorded on a single channel, so the speaker cannot be identified automatically without the use of voice biometrics. Speech recognition must deal with all of these issues at scale, which requires high processing speeds. Top solutions have algorithms that use context to pick the next logical word quickly and fill in the conversation.

Problem 2: Finding Relevant “Needles” in the Speech “Haystack”

When companies turn on a speech analytics system, they might see tens of thousands of their calls appear fully transcribed. This is millions of words and thousands of potential insights. In that blank search command line, what do they type? Of all the possible business applications, compliance issues, product mentions, customer experiences, and disclosures, what is that first search? The answer all too often… Curse words. When analysts receive an exceptional speech analytics system, they start with a search for the humble curse word. This single-word search is what we call word spotting. Speech analytics in its simplest form starts with word spotting, which by itself doesn’t provide substantial business value.

The next step is to start finding the pieces of information in the transcript that can have a bearing on the business. Algorithms can search for specific words, but this practice of “word spotting” only shows singular instances. Through this process the data scientist will learn two natural laws of speech analytics.

Natural Law 1: Words are strong, but not reliable. This is conversation mining, not the written word. A conversation is messy. Suppose the “f-word” is found in the transcript as “this is a great f-ing company.” Is that bad? Is it sarcasm? Do they absolutely love the company? What if it’s a trucking company? Context is key, context is hard, and contextual search must be developed for word spotting to be effective. Searches that deliver all possible contexts are very difficult to build.

Natural Law 2: You can’t find just one, you must find them all. To take any real action, the business always needs to know how many times a thing was said, how often, when, and by whom. Think of all the ways a customer can express dissatisfaction. More than once a data scientist has said they would just download an NLP negative sentiment list and run that against transcripts, thus quickly identifying all dissatisfaction. This is the wrong approach because such NLP resources are developed from free data, and what data is free? Books, Wikipedia, movie reviews, really anything free on the web. It’s simply the data the developers had to work from, and it is written text rather than spoken conversation. Current NLP models are not quite right for this purpose; adjusting them takes annotation and curation, and context is key.

Algorithms must spot a word or phrase, but also identify any of its aliases (e.g., loan/lone/alone) and where they fall in the conversation (e.g., before “payment”). Even at an exceptionally high 95% accuracy, just a million words means fifty thousand incorrect words that must be addressed.
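A minimal sketch of alias-aware spotting with a positional constraint might look like the following. The alias table, the function names, and the three-token window are all hypothetical choices for illustration; a production system would need far richer alias handling and context rules.

```python
import re

# Hypothetical alias table: words a transcriber might emit in place of the target.
ALIASES = {"loan": {"loan", "lone", "alone"}}


def spot_with_context(transcript: str, target: str, following: str, window: int = 3) -> list[int]:
    """Return token positions where `target` (or one of its aliases) appears
    within `window` tokens before the word `following`."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    variants = ALIASES.get(target, {target})
    hits = []
    for i, token in enumerate(tokens):
        if token in variants and following in tokens[i + 1 : i + 1 + window]:
            hits.append(i)
    return hits


def expected_misrecognized(word_count: int, accuracy: float) -> int:
    """Rough count of wrong words at a given word-level accuracy."""
    return round(word_count * (1 - accuracy))


text = "I am calling about my lone payment. I was home alone yesterday."
print(spot_with_context(text, "loan", "payment"))  # [5]
print(expected_misrecognized(1_000_000, 0.95))     # 50000
```

Here the misrecognized “lone” just before “payment” is caught as a likely “loan,” while “home alone” is ignored because no “payment” follows it.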

Systems need the ability to build phrases together to create scoring. As a basis for prediction, how relevant is the thing that’s been discovered? A phrase by itself may say much, but relevancy scores and counts of the same topic may be an indication of a significant change in customer behavior on that type of call.
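One simple way to picture phrase-based scoring is a weighted count of curated phrases per call. The phrase list, the weights, and the score_call function below are invented for illustration and are not taken from any real system; real phrase sets require the annotation and curation described above.

```python
# Hypothetical phrases and weights for a "dissatisfaction" topic.
DISSATISFACTION_PHRASES = {
    "cancel my account": 3.0,
    "speak to a supervisor": 2.0,
    "not happy": 1.0,
}


def score_call(transcript: str, phrase_weights: dict[str, float]) -> tuple[float, dict[str, int]]:
    """Return (weighted score, per-phrase counts) for one call transcript."""
    # Lowercase and collapse whitespace so phrases match across line breaks.
    text = " ".join(transcript.lower().split())
    counts = {phrase: text.count(phrase) for phrase in phrase_weights}
    score = sum(phrase_weights[phrase] * n for phrase, n in counts.items())
    return score, counts


call = "I'm not happy. I am really not happy, I want to speak to a supervisor."
score, counts = score_call(call, DISSATISFACTION_PHRASES)
print(score)                # 4.0
print(counts["not happy"])  # 2
```

Tracking both the score and the raw counts across many calls is what lets a shift in a topic stand out, rather than a single phrase in a single call.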

Problem 3: Anomalies and Black Holes

Without a deep set of experiences (relevant data), finding the anomalies or missing things in a conversation is nearly impossible. For example, the candy “Tootsie Rolls,” among others, has a hotline. It has no required prompts, so it is effectively an endless loop. This matters if an agent who is not really working hard wants to take an unscheduled break: just dial that hotline number and sit there looking busy. Or perhaps agents are listening to phone rings for five minutes or hold music for half an hour. Organizations understandably want to uncover these behaviors with analytics.

Even more diabolical is silence. Is silence good or bad? That depends. That tiny black hole in an audio recording is speech analytics gold: highly important, and not trivial data science, either.
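As a starting point only, silence can be located from word-level timestamps, assuming the transcriber provides them. The Word structure, the silence_gaps function, and the 10-second threshold are assumptions for this sketch; deciding whether a given gap is good or bad still requires context that this code does not have.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds from the start of the call
    end: float


def silence_gaps(words: list[Word], call_length: float, min_gap: float = 10.0) -> list[tuple[float, float]]:
    """Return (start, end) spans with no recognized speech lasting at least `min_gap` seconds."""
    gaps = []
    cursor = 0.0  # end time of the latest speech seen so far
    for word in sorted(words, key=lambda w: w.start):
        if word.start - cursor >= min_gap:
            gaps.append((cursor, word.start))
        cursor = max(cursor, word.end)
    # Check for trailing silence between the last word and the end of the call.
    if call_length - cursor >= min_gap:
        gaps.append((cursor, call_length))
    return gaps


words = [
    Word("hello", 0.5, 0.9),
    Word("one", 1.0, 1.2),
    Word("moment", 1.3, 1.8),
    Word("thanks", 45.0, 45.4),
]
print(silence_gaps(words, call_length=50.0))  # [(1.8, 45.0)]
```

Note that this finds gaps in recognized speech, not true acoustic silence: hold music or ringing would show up the same way, which is part of why the problem is not trivial.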

Stop Reinventing the Wheel

In the next order of magnitude, every organization will want to do something with the uncovered insights. That creates a whole new set of challenges, as change agents require proof, and an audit trail must be developed. These can all present engineering problems the organization will have to face. There are a lot more weeds to hack through in a speech analytics project than just ramming calls through an online transcription engine. It’s a full-time job to research, build, and consistently execute a speech analytics solution. Providers of top analytics platforms have the luxury of researching the most discrete solutions and testing them in an environment with nearly one trillion words generated per year. Insights and failed hypotheses arrive quickly in such environments. Be aware that in smaller or improperly structured data sets, the “bake” time is longer, which can substantially slow or stop progress.

A software development project like speech analytics should not be expected to be completed as a side project in any reasonable amount of time. In developing a speech analytics platform, algorithms need to be tested, adjusted, weighted, and tested some more. Then they need QA, legal and ethical review, debugging, monitoring, and further adjustment just to create business intelligence and track performance improvement with consistency and stability. Companies need to be willing to commit specialized FTE hours on an ongoing basis to ensure program success, and should weigh the opportunity cost of such a venture.

Go/No-Go time

Everyone loves the satisfaction of a successful DIY project. Scientists love hypotheses, discovery and proof, but they also know the frustration and money drain of an unsuccessful or extremely delayed project that may never come to fruition. To hedge against lost time and money, a company and its team of scientists should dive into speech analytics with open eyes, or better yet, partner with a software organization that has specialized in bringing speech analytics to market at scale.

About the Author

Rick Britt is the Vice President of Artificial Intelligence at CallMiner. Rick’s passion for interaction analytics is built upon his background in behavioral economics and operations analytics. He has specialized in optimizing results over his 25-year career in financial services, business process outsourcing, retail, marketing and collections. Rick’s experience in AI solutions and predictions, as well as his global approach to intelligently attracting, retaining and motivating peers and customers to perform at the highest levels of their capability, complements CallMiner’s incredible data and innovation perfectly. He is a graduate of Penn State University with a degree in Economics and is an avid beer brewer.
