
Lessons Learned from Deploying a Speech Recognition Model on AWS

This post is the second of a two-part series. In the first part, I shared what I learned from a recent project in which I adapted an English speech recognition model to understand German. In this second part, I discuss some of our experiences with deploying this speech recognition model on Amazon Web Services (AWS) and give some recommendations for deployment.

The setting

So what exactly is it that we need to deploy when using DeepSpeech-based speech recognition?

First, there is the code that handles the API request, calls the model with the data (in our case audio data) extracted from the request payload, and returns the transcript in a properly formatted / annotated form.
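To make the three steps concrete, here is a minimal sketch of such a request handler in Python. It assumes an API Gateway proxy-style event whose body is JSON with a base64-encoded `audio` field; that payload layout and the `transcribe` placeholder are illustrative assumptions, not the code we actually deployed.

```python
import base64
import json


def transcribe(audio_bytes):
    """Placeholder for the model call; a real version would run DeepSpeech here."""
    return "<transcript of {} audio bytes>".format(len(audio_bytes))


def handler(event, context=None, transcribe_fn=transcribe):
    """Handle one API request: decode the audio, run the model, format the reply."""
    try:
        payload = json.loads(event["body"])
        # The "audio" field name is a hypothetical choice for this sketch.
        audio_bytes = base64.b64decode(payload["audio"], validate=True)
    except (KeyError, TypeError, ValueError):
        # Missing body or field, malformed JSON, or invalid base64.
        return {"statusCode": 400, "body": json.dumps({"error": "invalid request"})}

    transcript = transcribe_fn(audio_bytes)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"transcript": transcript}),
    }
```

Passing the model call in as `transcribe_fn` keeps the request handling testable without loading any model artifacts.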

Second, there is the model. In the DeepSpeech case, the model consists of the following artifacts:

  • a TensorFlow graph file, size ca. 190 MB
  • an alphabet file, size ca. 0.4 kB
  • a language model, size 1.8 GB for English (ca. 700 MB for German)
  • a trie (a kind of search tree) supplementing the language model, size 22 MB for English (60 MB for German)

The TensorFlow graph file and the alphabet file are both required for running DeepSpeech. The graph file contains the graph definition, weights, and metadata; the alphabet file lists the possible transcript characters. Though the alphabet file sounds fun to play with, it cannot be changed after the model has been trained without causing errors.

The language model contains information about existing words and their co-occurrence. This enables the speech recognition engine, for instance, to „be aware“ that the transcript „I hear you say“ is more likely than „I here you say“, even though both transcripts sound identical. Both the language model and the supporting trie are optional; without them, however, transcript quality is poor.
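As a rough illustration of the idea, the following toy bigram model prefers „I hear you say“ over „I here you say“. The counts are invented for this example, and the scoring is a deliberately simplified sketch, not how DeepSpeech scores transcripts internally.

```python
import math

# Toy counts, invented purely for illustration (not real corpus statistics).
BIGRAM_COUNTS = {
    ("i", "hear"): 50,
    ("i", "here"): 1,
    ("hear", "you"): 40,
    ("here", "you"): 30,
    ("you", "say"): 60,
}
UNIGRAM_COUNTS = {"i": 100, "hear": 60, "here": 80, "you": 120, "say": 70}
VOCAB_SIZE = len(UNIGRAM_COUNTS)


def bigram_log_prob(sentence):
    """Sum of log P(word | previous word), using add-one smoothing."""
    words = sentence.lower().split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        count = BIGRAM_COUNTS.get((prev, word), 0)
        total += math.log((count + 1) / (UNIGRAM_COUNTS.get(prev, 0) + VOCAB_SIZE))
    return total


def pick_transcript(candidates):
    """Return the candidate that the toy language model considers most likely."""
    return max(candidates, key=bigram_log_prob)
```

Calling `pick_transcript(["I here you say", "I hear you say"])` selects the second candidate, because the pair („i“, „hear“) is far more frequent in the toy counts than („i“, „here“).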

You see the catch: we have to deploy model artifacts of more than one GB, which is not commonly discussed in tutorials :).

The deployment approaches

We chose AWS as the infrastructure for deploying the speech recognition model, since our whole company IT infrastructure runs on it. Several AWS services can be used to deploy machine learning models. The ones that looked most promising to us were AWS Lambda and AWS Sagemaker.

AWS Lambda – The Cheap, the Flexible, and the Limited

AWS Lambda is a serverless compute service that lets you run custom code without provisioning or managing servers. All you need to do is package your code and requirements into one or more .zip archives, upload them to S3, configure a little, and you’re good to go.

Since Lambda functions are designed to run in response to events, you can easily invoke your code upon an API call to the AWS API Gateway. Besides, due to the free tier and „pay-per-use“ pricing, Lambda functions are very cost-effective for moderate use. In addition, scaling works out of the box: if 20 concurrent API requests come in, AWS will spin up 20 function „instances“ for you. You pay for the compute, but (within reasonable limits) you do not need to bother about scaling.

The interim assessment: cheap and very versatile, not to mention „easy-to-use“ from a deployment perspective, as the following architecture overview illustrates.

DeepSpeech deployment with AWS Lambda

But there has to be a catch – and there is: Lambda limits.

  • Max. runtime of the code: 15 minutes
  • Max. memory allocation: ca. 3 GB
  • Max. storage (code, dependencies, …): 512 MB

The last one in particular is the killer when it comes to our application. With artifact sizes of at least 1 GB, staying within the Lambda limits is out of reach.
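A quick back-of-the-envelope check makes the mismatch explicit. The sketch below simply sums the artifact sizes listed earlier and compares them with the 512 MB storage limit; treating 1.8 GB as 1800 MB is a rounding assumption.

```python
LAMBDA_STORAGE_LIMIT_MB = 512  # storage limit quoted in the list above

# Approximate artifact sizes in MB, taken from the artifact list earlier in this post.
ARTIFACT_SIZES_MB = {
    "english": {"graph": 190, "alphabet": 0.0004, "language_model": 1800, "trie": 22},
    "german": {"graph": 190, "alphabet": 0.0004, "language_model": 700, "trie": 60},
}


def total_size_mb(artifacts):
    """Total size of a set of artifacts in MB."""
    return sum(artifacts.values())


def fits_in_lambda(artifacts, limit_mb=LAMBDA_STORAGE_LIMIT_MB):
    """True if all artifacts together stay within the storage limit."""
    return total_size_mb(artifacts) <= limit_mb
```

With these numbers, neither the English nor the German artifact set fits, whereas the graph and alphabet files alone would. The language model is what breaks the limit.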

There is a workaround, however: the size of the language model can be reduced by reducing the number of „context words“. But this comes at a cost: transcription quality. For example, if you take only one context word into account, the model will not be able to figure out that „here you go“ makes more sense than „hear you go“ (according to Google ngrams, „hear you“ actually occurs more often in English books than „here you“).
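The trade-off can be sketched with a toy example. The code below counts how many distinct n-grams a model has to store up to a given order, and shows which n-grams distinguish two candidate transcripts. The corpus is made up for illustration; real language models are additionally pruned and compressed, so actual sizes behave less neatly.

```python
def ngrams(sentence, order):
    """Set of word n-grams of the given order in a sentence."""
    words = sentence.lower().split()
    return {tuple(words[i:i + order]) for i in range(len(words) - order + 1)}


def model_entries(sentences, max_order):
    """Number of distinct n-grams a model with orders 1..max_order has to store."""
    total = 0
    for order in range(1, max_order + 1):
        distinct = set()
        for sentence in sentences:
            distinct |= ngrams(sentence, order)
        total += len(distinct)
    return total


def distinguishing_ngrams(a, b, order):
    """N-grams of the given order that occur in exactly one of the two sentences."""
    return ngrams(a, order) ^ ngrams(b, order)


# Tiny made-up corpus, for illustration only.
CORPUS = ["here you go", "i hear you", "can you hear me", "here you are"]
```

On this corpus, lowering the maximum order from 3 to 2 shrinks the model from 21 to 16 entries. The price: at order 2, the only n-grams that differ between „here you go“ and „hear you go“ are („here“, „you“) and („hear“, „you“), so the word „go“ never gets a say in the decision. At order 3 it does.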

AWS Sagemaker – The Opposite

AWS Sagemaker, a fully managed machine learning service, is a different ball game. Sagemaker provides, for instance, pre-configured notebook instances with common machine learning algorithms, offers distributed training options, and more. We’re not interested in model development though, but in deployment.

For that purpose, Sagemaker provides templates for Docker containers, which can be modified to serve custom models. In our case, this meant adapting the inference function, copying our model artifacts into the container, and updating the requirements in the Dockerfile. Sagemaker then allowed us to specify the instance type we wanted our container to run on and to create an endpoint to access our speech recognition model. The approach using Sagemaker therefore is quite the opposite of the Lambda approach.

First, it is not serverless. You have to specify an instance type, and unless the instance is up and running, you can’t access your model. This means you have to pay for the instance 24/7 if you want your model to be accessible 24/7. Besides, scaling does not work out of the box. You can activate auto-scaling, but this comes at additional (and not insignificant) cost. Additionally, you need an API Gateway and a Lambda function configured if you don’t want to dig into signed requests, which means even more costs.
This deployment approach comes with a big advantage, though: you can meet whatever compute, memory, and storage requirements you have. Deploying 2 GB of artifacts, as in our case, was not an issue.
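The Lambda function in front of the endpoint can stay very small, because boto3 takes care of signing the request. Here is a minimal sketch of such a proxy; the `ENDPOINT_NAME` environment variable, its default value, the `audio` payload field, and the content type are assumptions made for this example.

```python
import base64
import json
import os


def make_runtime_client():
    """Create the SageMaker runtime client (boto3 is imported lazily)."""
    import boto3  # available by default in the AWS Lambda Python runtime
    return boto3.client("sagemaker-runtime")


def handler(event, context=None, runtime_client=None):
    """Forward the audio from an API Gateway request to a SageMaker endpoint."""
    client = runtime_client or make_runtime_client()
    # The "audio" field name is a hypothetical choice for this sketch.
    audio_bytes = base64.b64decode(json.loads(event["body"])["audio"])

    response = client.invoke_endpoint(
        EndpointName=os.environ.get("ENDPOINT_NAME", "speech-endpoint"),  # hypothetical
        ContentType="application/octet-stream",  # assumed; depends on the container
        Body=audio_bytes,
    )
    # The reply is a streaming body; its format is defined by the serving container.
    result = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": result}
```

The optional `runtime_client` argument only exists so the function can be exercised with a fake client, without an AWS account.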

Second, it is more configuration-heavy. It’s not just application code, it’s also Docker and Flask (in our case). But there’s a solution to this one: Mlflow.

Mlflow to the rescue

Mlflow is an open source platform for the machine learning lifecycle. While it also has capabilities to track machine learning experiments and to run machine learning projects reproducibly, the capability relevant to us is called Mlflow models. It provides a general format for shipping machine learning models to different deployment platforms. Using Mlflow models, we can write the code defining our inference function and the initialization routine for our model, specify the model artifacts and dependencies, and let Mlflow package everything such that it can be deployed on AWS Sagemaker, Azure ML, and others. Does it spark joy? It does!
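To give an impression of what this looks like, here is a minimal sketch of a custom Python model. It is not our production code: `load_engine` is a hypothetical placeholder for loading DeepSpeech, the artifact keys `graph` and `alphabet` are made-up names, and the fallback base class only exists so the sketch can be run without Mlflow installed.

```python
try:
    from mlflow.pyfunc import PythonModel
except ImportError:
    class PythonModel:
        """Minimal stand-in so the sketch can be run without Mlflow installed."""


def load_engine(graph_path, alphabet_path):
    """Hypothetical placeholder for loading DeepSpeech from its artifact files."""
    return lambda audio: "<transcript of {} bytes>".format(len(audio))


class SpeechRecognizer(PythonModel):
    def load_context(self, context):
        # Initialization routine: context.artifacts maps artifact names
        # to the local paths of the packaged files.
        self.engine = load_engine(
            context.artifacts["graph"], context.artifacts["alphabet"]
        )

    def predict(self, context, model_input):
        # Inference function: one transcript per audio clip.
        return [self.engine(audio) for audio in model_input]


def package_model(target_dir, artifact_paths):
    """Write the model in Mlflow's format; requires Mlflow to be installed."""
    import mlflow.pyfunc

    mlflow.pyfunc.save_model(
        path=target_dir,
        python_model=SpeechRecognizer(),
        artifacts=artifact_paths,  # maps the artifact keys used above to file paths
    )
```

The class holds the two pieces of code mentioned above, the initialization routine and the inference function, while `package_model` hands the artifacts over to Mlflow for packaging.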

The workflow / architecture resulting from the Mlflow-aided Sagemaker deployment approach is illustrated below.

DeepSpeech deployment on AWS Sagemaker using Mlflow


I should mention a few caveats before concluding.

First, calling Lambda deployment „cheap“ is of course a bit fishy. Lambda is cheap for moderate use (runtime, invocations). In our scenario, where AWS Lambda was the clear winner in terms of cost, we assumed a total of 250,000 invocations/year with an average audio length of 20 seconds per invocation. In addition, costs always need to be evaluated on a use-case basis. Depending on the use case, API Gateway costs, networking costs, … might also be non-negligible.
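For a feeling of the order of magnitude, here is a rough cost sketch for this scenario. Apart from the invocation count, every default is an assumption: we take compute time to equal audio length, use the maximum memory setting, ignore the free tier, and plug in per-GB-second and per-request prices that should be checked against the current AWS price list.

```python
def lambda_yearly_cost(
    invocations_per_year=250_000,        # from the scenario above
    seconds_per_invocation=20.0,         # assumption: compute time equals audio length
    memory_gb=3.0,                       # assumption: maximum memory setting
    price_per_gb_second=0.0000166667,    # assumed price in USD, verify before use
    price_per_million_requests=0.20,     # assumed price in USD, verify before use
):
    """Rough yearly Lambda cost in USD, ignoring the free tier and other services."""
    gb_seconds = invocations_per_year * seconds_per_invocation * memory_gb
    compute_cost = gb_seconds * price_per_gb_second
    request_cost = invocations_per_year / 1_000_000 * price_per_million_requests
    return compute_cost + request_cost
```

With these assumed defaults, the function returns roughly 250 USD per year, almost all of it compute. API Gateway, S3, and networking costs come on top and are not covered by this sketch.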

Second, „serverless“ does not mean that Lambda functions are always immediately up and running. Lambda functions can have cold start issues. In our case, upon invocation, the Lambda function downloads the model artifacts from S3 if they are not already present in the runtime. This enables us, for instance, to version code and model independently (via environment variables), a flexibility which we are quite fond of.
It implies, though, that a cold start takes significantly more time than a warm start (where the artifacts are still present when the function is triggered). Evaluations by others have shown that the probability that a Lambda function undergoes a warm start is close to one after 5 minutes of inactivity, but essentially zero after 17 minutes. Cold starts can be avoided by regularly „pinging“ the Lambda function using a CloudWatch Events invocation with a test payload, but this has to be implemented as well.
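The download-if-missing logic and the warm-up ping can be sketched as follows. The environment variable names, the bucket layout (`<version>/<file>`), the file names, and the `warmup` marker in the test payload are all assumptions for this example; only the general pattern reflects what is described above.

```python
import os

CACHE_DIR = "/tmp/model"  # /tmp is the writable directory inside a Lambda runtime


def ensure_artifacts(s3_client, bucket, version, names, cache_dir=CACHE_DIR):
    """Download artifacts that are not cached yet; return their local paths."""
    target_dir = os.path.join(cache_dir, version)
    os.makedirs(target_dir, exist_ok=True)
    paths = {}
    for name in names:
        local_path = os.path.join(target_dir, name)
        if not os.path.exists(local_path):  # cold start: nothing cached yet
            s3_client.download_file(bucket, "{}/{}".format(version, name), local_path)
        paths[name] = local_path
    return paths


def handler(event, context=None, s3_client=None):
    if s3_client is None:
        import boto3  # imported lazily so the sketch can be tested without AWS
        s3_client = boto3.client("s3")
    paths = ensure_artifacts(
        s3_client,
        bucket=os.environ.get("MODEL_BUCKET", "my-model-bucket"),  # hypothetical
        version=os.environ.get("MODEL_VERSION", "v1"),             # hypothetical
        names=["output_graph.pb", "alphabet.txt"],                 # hypothetical names
    )
    if event.get("warmup"):  # hypothetical marker set by the scheduled ping
        return {"statusCode": 200, "body": "warm"}
    # A real handler would load the model from `paths` and transcribe here.
    return {"statusCode": 200, "body": "artifacts ready: {}".format(sorted(paths))}
```

Because the cache directory includes the model version, changing the `MODEL_VERSION` environment variable makes the next invocation fetch the new artifacts without touching the code.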

Third, while we are quite fond of Mlflow, this wasn’t always the case. In version 0.9.1, artifacts were extensively copied during Sagemaker instance startup, which, given our artifact sizes, led to timeouts and out-of-memory errors. We reported this bug, and it was resolved in version 1.0.0, which makes us like the project even more :).


So what are our recommendations for deploying a speech recognition model (or any other large model) on AWS? In effect, they are the following:

  • If you can put up with its limits, consider AWS Lambda first for your machine learning deployment. Of course, it all depends on the use case, but considering the advantages of serverless for application development, you might want to give it a try.
  • If you need heavy compute / resources and are willing to pay for it, go for AWS Sagemaker. But before writing Dockerfiles, consider using a tool like Mlflow, since it really makes life easier for you.
    Of course, this ease comes at the expense of some flexibility, for example in API design, so if you need to be fully flexible, use plain Sagemaker.


The credits for the figures as well as the evaluation of AWS Lambda and Sagemaker for DeepSpeech deployment go to Alexander Miller.