One of the many things that greatly inconvenience me on a regular basis is QA review for call recordings. You should be reviewing calls regularly during probationary periods of employees to ensure service delivery is high quality and to correct any issues or provide additional training. You should be reviewing “problem” or “complaint” calls, too. Maybe even a random sampling once in a while if a contract requires it. But what if there was a better way? A cooler way to make a computer do this tedious work for us? What if we could also throw the word “AI” in there for more clicks? I’m not above pandering to the search engine mafia, so we’re going to absolutely do that. Today we’re going to use Azure Cognitive Services to batch import recording files from our PBX (3CX in my case) to Azure and transcribe them for sentiment and speech analysis. We’ll look at the end result data in PowerBI, because no project is complete without pretty dashboards to validate your efforts. While Microsoft has a solution accelerator for this project, the documentation is confusing and contradictory if you haven’t worked in the Azure ecosystem much, so I’m going to do my best to summarize it in a more digestible way here.
I make it a point to my staff that I don’t go fishing for recordings to “gotcha” people with. That’s really not the point of QA review and if you use QA to dunk on your employees you are a dick and probably a top 5% poster on r/sysadmin. The point of QA review is to assure quality. Assuring quality is a pretty broad spectrum of ideas and tasks and all too often your subjective reality and personal biases can conflict with the objective reality of situations you’re meant to be reviewing. This is a task for a cold and logical machine.
- An Azure Subscription
- An Azure Cognitive Services Resource setup (you technically only need Language and Speech models for this exercise, but I recommend creating an AllInOne resource instead)
- Your resource keys
- An Azure SQL Database to store the output. You can use an existing SQL DB in Azure or create a new one. If you only want to use the JSON output to load into another platform you can skip this, but SQL will be required if you want to use the PowerBI templates to view results.
- Recording files and a way to upload them. You can do this manually by copying them to an Azure storage account we’ll create or you can set up an automatic job to do this. I’ll be demonstrating how to copy via PowerShell.
Background and Considerations for Azure Cognitive Services
There are two methods for getting data into Azure Cognitive Services for this application.
- Real Time Ingestion – This is exactly what it sounds like. Data/audio can be ingested in real time. There are limitations on real time speech to text and it costs more to operate. It isn’t ideal for post-call transcription like we’re looking for.
- Batch Ingestion – Aptly named, as this function ingests recordings in batches. We copy recordings to a storage account, then an Azure function polls the account and processes the audio in batches. Multiple files can be processed at once and you get the full range of features available. It costs less and schedules transcription on a “best effort” basis, meaning your items may not be processed right away.
We’ll be using batch processing to make everything easy and cost effective. I ran about 7500 recordings through the batch processor and ended up paying about $280 for that privilege, but I was also trying to generate data for historical calls. A typical MSP won’t be transcribing 7500 calls a month and shouldn’t see anywhere near that cost. The pricing for Azure Cognitive Services can be found here and as of the time of publishing of this article, you get 5 hours of free processing per month. Woo!
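If you want a rough idea of what your own volume might cost, you can scale from my numbers. This is only a back-of-envelope sketch: the function name is made up for illustration, the defaults are just my $280 and 7500 figures, and since Azure bills transcription by audio duration rather than per file, it only holds if your calls are about as long as mine on average.

```shell
# Back-of-envelope scaling from the figures above: about $280 for about 7500 recordings.
# usage: estimate_cost <monthly_recordings> [observed_total_usd] [observed_recordings]
estimate_cost() {
  awk -v n="$1" -v total="${2:-280}" -v seen="${3:-7500}" \
    'BEGIN { printf "%.2f\n", n * total / seen }'
}

estimate_cost 500   # 500 calls a month at the same average cost per call; prints 18.67
```

Check the result against the current pricing page and the free monthly allowance before you budget around it.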
You’re a madman! What about PII?!
If you’re securely copying your data to Azure, securing access to the resource group where it resides, and managing the lifecycle of the data once the cognitive services functions are done consuming it, there is not really much concern to be had here. Truthfully, knowing how most PBXs (hosted or otherwise) are legacy technologies held together with hopes and dreams, your data is probably more secure in Azure than on your PBX. But to those of you who want to go a step further and redact PII during transcription, I absolutely don’t blame you. This is one of those situations where I’m just going to present the idea to you in theory and you’re responsible for implementing it in a way that makes sense for you.
Step 1: Prepare and Deploy Azure Cognitive Resources
If you haven’t already, go ahead and create a Resource Group in Azure to hold everything we’ll be deploying. If you don’t have an Azure Cognitive Services AllInOne resource, deploy that to your RG and retrieve your keys.
Download this ARM template and keep it handy for the next step. ARM templates are instructions on how to build a set of resources in Azure, and the Cognitive Services team provides this one for batch imports.
Next, head over to Azure and in the search bar type “Template Deployment” and select the resource. On the next screen, you’ll click “Create”
Then we’ll click “Build your own template in the editor”
Click “Load File” and upload the Batch ARM Template JSON file you downloaded earlier.
Click “Save” after loading in the template.
After saving, you’ll see this screen and will be able to fill out the form to your specifications and deploy your resources.
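If you’d rather skip the portal clicks, the same template can probably be deployed from the Azure CLI. Treat this as a sketch under assumptions, not the accelerator’s documented method: the resource group name, region, and template filename below are placeholders I made up, and the DRY_RUN switch is just a convenience for previewing the commands. The CLI should prompt you for any required template parameters you don’t supply.

```shell
#!/usr/bin/env bash
# Placeholder values: change these to match your environment.
RESOURCE_GROUP="rg-call-qa"
LOCATION="eastus"
TEMPLATE_FILE="./batch-arm-template.json"

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

deploy_batch_template() {
  # Create (or reuse) the resource group, then deploy the ARM template into it.
  run az group create --name "$RESOURCE_GROUP" --location "$LOCATION" || return 1
  run az deployment group create \
    --resource-group "$RESOURCE_GROUP" \
    --template-file "$TEMPLATE_FILE"
}
```

Set DRY_RUN=1 and call deploy_batch_template to see exactly what would run, then unset it and call the function again after `az login` to deploy for real.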
At the end of deployment you should see something like this:
Background for Importing Recordings to Azure Cognitive Services
This is where you may be required to use some creativity depending on what kind of PBX is in use. I’m using 3CX hosted in Azure on a Linux appliance. As long as you can either run PowerShell on the PBX or access the PBX recordings from somewhere where PowerShell runs, you can use a very simple AzCopy script to get data into your Cognitive Services instance.
One of the resources the template created was a storage account, and inside that storage account there are several blob containers that look like this:
The ones we are concerned with are Audio-Input, Audio-Processed, and Audio-Failed as well as json-result-output.
- Audio-Input – This is the container we want to drop our Audio files in. You can either upload them manually using Azure Storage explorer or opt for an automated method.
- Audio-Processed – As the function processes audio files, they will be moved to Audio-Processed after completing transcription. You should set up some automation to clear this directory on occasion.
- Audio-Failed – If for whatever reason your audio file doesn’t transcribe, it will end up here. Typically this happens when audio files are too short for transcription.
- json-result-output – A handy dump of each recording transcript and sentiment analysis to individual JSON files. These are super handy both to back up the transcription itself but also to feed into other systems for processing. The data in these JSON files is what is loaded into the SQL database you specify in the template at creation.
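The cleanup of Audio-Processed mentioned above could be handled with AzCopy’s remove command. A minimal sketch, assuming the container is actually named audio-processed in lowercase (mirroring audio-input) and that you have a SAS token with List and Delete permissions; the function names and variables are placeholders of mine.

```shell
#!/usr/bin/env bash
# Placeholders: your storage account name and a SAS token with List + Delete rights.
STORAGE_ACCOUNT="YOURSTORAGEACCOUNTNAME"
SAS_TOKEN="${SAS_TOKEN:-}"   # expected to include the leading "?"

# Build the URL for a container in the storage account, SAS token appended.
container_url() {
  printf 'https://%s.blob.core.windows.net/%s%s' "$STORAGE_ACCOUNT" "$1" "$SAS_TOKEN"
}

# Delete everything in the processed-audio container.
purge_processed() {
  azcopy remove "$(container_url audio-processed)" --recursive
}
```

Run purge_processed by hand the first time and confirm it only touches what you expect before you schedule it. A storage lifecycle management rule is another way to get the same result without a script.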
Step 2: Import Recordings to Azure Cognitive Services Using AzCopy
I think the easiest way to import recordings is with AzCopy. AzCopy is Microsoft’s tool to copy to/from Azure Storage accounts. Think of it like RoboCopy but for the cloud. The nice thing is AzCopy has binaries for all the big operating systems, so I’ll be able to install it on my PBX easily.
To install AzCopy on my Linux PBX I’ll use the following commands in order.
#Download AzCopy
wget https://aka.ms/downloadazcopy-v10-linux
#Expand Archive
tar -xvf downloadazcopy-v10-linux
#(Optional) Remove existing AzCopy version
sudo rm /usr/bin/azcopy
#Move AzCopy to the destination you want to store it
sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
Once we have AzCopy installed, we’ll head over to our Storage Account and get a SAS key. We’ll use the SAS key to copy our recordings to our Azure Storage Account directly from the PBX.
If you’re using a hosted instance of 3CX like I am, your recordings will be in /var/lib/3cxpbx/Instance1/Data/Recordings and you’ll see a folder for each extension. In each folder you’ll see the WAV files for all the recorded calls.
We’ll go ahead and copy the recordings for one extension (106 here) using the AzCopy copy command:
azcopy cp '106/*' 'https://YOURSTORAGEACCOUNTNAME.blob.core.windows.net/audio-input<YOUR SAS KEY HERE>'
After a few minutes you should see your recordings start processing and moving to Audio-Processed. You can automate this step with cron and PowerShell, and I would encourage you to store the SAS key in Azure Key Vault (retrieving it with Get-AzKeyVaultSecret) or use a different method of authentication.
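Here is one way the automation could look. To be clear about what I’ve swapped in: this sketch uses plain bash instead of PowerShell, reads the SAS token from an environment variable rather than Key Vault, and tracks the last run in a stamp file so a scheduled job doesn’t re-upload (and re-bill) recordings that were already transcribed. Every path, variable, and function name is a placeholder, and the --include-after filter is only in recent AzCopy v10 releases, so check `azcopy copy --help` on your PBX first.

```shell
#!/usr/bin/env bash
# Cron-friendly upload sketch. Every path, name, and variable here is a placeholder.
RECORDINGS_DIR="${RECORDINGS_DIR:-/var/lib/3cxpbx/Instance1/Data/Recordings}"
DEST_URL="${DEST_URL:-https://YOURSTORAGEACCOUNTNAME.blob.core.windows.net/audio-input}"
SAS_TOKEN="${SAS_TOKEN:-}"                                   # include the leading "?"
STAMP_FILE="${STAMP_FILE:-/var/tmp/recording-upload.stamp}"  # remembers the last run

upload_new_recordings() {
  local since now dir
  now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  # First run: no stamp yet, so every recording is eligible.
  since="$(cat "$STAMP_FILE" 2>/dev/null || echo 1970-01-01T00:00:00Z)"
  # One copy per extension folder, same shape as the manual command above.
  for dir in "$RECORDINGS_DIR"/*/; do
    [ -d "$dir" ] || continue
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Preview only; the SAS token is masked so it never lands in logs.
      echo "azcopy cp ${dir}* ${DEST_URL}<SAS> --include-after $since"
    else
      azcopy cp "${dir}*" "${DEST_URL}${SAS_TOKEN}" --include-after "$since" || return 1
    fi
  done
  # Only advance the stamp after a real, fully successful run.
  [ "${DRY_RUN:-0}" = "1" ] || echo "$now" > "$STAMP_FILE"
}
```

Save it as a script that calls upload_new_recordings on its last line, try it once with DRY_RUN=1, and then schedule it with a crontab entry along the lines of `*/15 * * * * /usr/local/bin/upload-recordings.sh` (a hypothetical path).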
Step 3: Review the Azure Cognitive Output (if using SQL DB)
You’re going to want to install PowerBI Desktop if you don’t have it already and download the following two templates:
- SentimentInsights.pbit – Aggregates and displays the Sentiment Analysis data.
- SpeechInsights.pbit – Aggregates and displays the Speech Analysis data.
Keep in mind these are templates for demonstration purposes and you can get more creative with the data output however you like.
Grab your Azure SQL DB information from the portal and launch one of the templates. It will ask you to input your SQL username and password as well as the SQL server address.
Once your data has successfully loaded you’ll see the reportable metrics you can select on the tabs at the bottom.
Step 4: Marvel at the wonder of your creation
Click through the tabs on each report and you’ll see some really cool metrics. Hovering over the graphs gives you drill down information for individual calls and you can filter down to view/analyze only the data you want to.
Even though Microsoft has done most of the work for us, that was quite a bit more than I anticipated before I started this article. This is meant to be a demonstration of the power behind Azure Cognitive Services and how you can harness it to make your life just a little bit less arduous. The possibilities are endless here and you could do something like alerting on specific language or negative sentiment trends for individual agents. Because batch processing is more or less PBX agnostic (as long as you are feeding it some kind of standard audio format), this can be easily adapted to whatever call recording solution you’re using. And because we’re letting a cold unfeeling machine do all the work, we’re removing our own cognitive biases from the equation and getting a more holistic view of call QA data.