This post is the first of a two-part series. In this first part, I share learnings from a recent project in which I modified an English speech recognition model to understand German. In the second part, I discuss some of our experiences with deploying this model on Amazon Web Services (AWS) and give some recommendations concerning deployment.
Some time ago, a business partner told us about a client of his struggling with a speech recognition system capable of understanding German dialects. This immediately piqued my interest: How hard would it be to tackle such a task?
Disclaimer: I won’t give a three-step recipe for solving such a challenging task in the remainder of this post. Instead, I’ll discuss some learnings from tackling a similar, though arguably more “standard” task: modifying an English speech recognition model to understand German.
To get started with my Machine Learning / Deep Learning task, I needed (at least) two ingredients – some data and some model.
As for the data, I created my own dataset from the German Common Voice dataset, the German Voxforge dataset, the Tuda corpus, and the German Spoken Wikipedia Corpus. This allowed me to construct a dataset with more than 200,000 utterances (audio + text).
As for the model, I chose the open-source model DeepSpeech. DeepSpeech, which is developed by Mozilla, is a neural-network-based speech recognition engine that takes a .wav file as input and outputs the transcription of the audio. The advantage here is that DeepSpeech i) provides a pretrained English speech recognition model, and ii) is relatively easy to modify thanks to its TensorFlow implementation.
Taken together, this made for an ideal starting point for my planned approach: transfer learning and, potentially, further fine-tuning.
In transfer learning, the weights of the last layer(s) of a neural network are replaced by randomly initialised ones and are retrained on the target dataset while the original layers are held fixed. This allows training the model on a significantly smaller dataset than the original one, since only a small part of the model is updated. The reasoning behind this is that “speech is speech”, i.e. we expect the low-level features created in the first layers of the neural network of an English speech recognition model to be likewise useful for a German speech recognition model. In contrast, the high-level features in the last layer(s) certainly depend on the target language of the model – particularly since the German alphabet has additional characters compared to the English one (ä, ö, ü, ß), which we might want to allow for.
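As a conceptual sketch of these two steps – swapping out the output layer and freezing the lower layers – consider the following plain-NumPy toy example. The layer sizes and alphabet sizes are made up for illustration and do not reflect DeepSpeech’s actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are pretrained "English" weights: the lower layers learn
# generic acoustic features, the output layer maps to the English alphabet.
W_lower = rng.normal(size=(26, 16))        # lower layers: to be frozen
W_out_english = rng.normal(size=(16, 29))  # English output layer: discarded

# Step 1: replace the output layer by a freshly initialised one sized
# for the German character set (incl. ä, ö, ü, ß).
n_german_chars = 33                        # hypothetical alphabet size
W_out = rng.normal(scale=0.1, size=(16, n_german_chars))

# Step 2: during retraining, gradient updates are applied only to the
# new output layer; the frozen lower layers are left untouched.
def train_step(W_lower, W_out, grad_lower, grad_out, lr=0.01):
    return W_lower, W_out - lr * grad_out  # grad_lower is deliberately ignored

W_lower_before = W_lower.copy()
W_out_before = W_out.copy()
W_lower, W_out = train_step(W_lower, W_out,
                            rng.normal(size=W_lower.shape),
                            rng.normal(size=W_out.shape))
```

In a real framework like TensorFlow, “freezing” is of course expressed by marking variables as non-trainable rather than by discarding gradients manually, but the effect is the same.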
Lessons learned & Remaining challenges
Did it work? It did – my model produces reasonable transcriptions. There are still ample avenues for improvement, though – I’m not satisfied with the word error rate yet. But let’s focus on the learnings instead of the contingencies.
1. Know Thy Data
While working on improving my model, I learned a lot about my data. I had to – and it taught me a lot about what I could expect from my model and what not.
Let me give you an example: In the beginning, I used a test set for my model evaluation containing utterances from all four datasets. Looking at the utterances with the largest test error according to the word error rate (WER), I noticed an intriguing pattern: the “correct” source sentence was shorter than the transcribed sentence and only contained a subset of its words. Digging deeper, I realised that the Spoken Wikipedia Corpus, from which these utterances originated, had to be considered “low quality” – the programmatic alignment of Wikipedia articles and audio recordings did not always work out. Hence, the “ground truth” transcripts did not correspond to the correct transcriptions in some cases – such as the cases I noticed in the test set. Here, the trained model was actually giving the correct transcription…
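For reference, the WER is the word-level edit distance between reference and hypothesis, normalised by the reference length. A minimal implementation (standard Levenshtein distance over words; not the exact scorer shipped with DeepSpeech) might look like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic programming table for the Levenshtein distance.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (free if words match)
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)
```

With this definition, a hypothesis that drops one word out of a four-word reference scores a WER of 0.25 – and, notably, the WER can exceed 1.0 when the hypothesis is much longer than the reference, exactly the pattern the misaligned test utterances produced.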
But there’s more to the data. Take, for instance, a visual analysis of the distributions of transcript length (number of characters in the transcript) versus feature length (length of the feature time series, a time series of Mel Frequency Cepstral Coefficients), depicted below.
As you might notice, the distributions look rather different. Not only do the transcript length distributions vary strongly; the mean number of characters per feature length (“slope”) and its spread vary as well. Such variations arise from, e.g., differences in speaking pace as well as in the number and duration of pauses.
In addition, the distribution of distinct speakers, along with their characteristics (gender, age, dialect), differs for each dataset (not shown, partly not known). This gives rise to a couple of intriguing questions:
- How should the dataset be split into train/dev/test? Surely, we want disjoint speaker sets across the splits. But how best to distribute the individual contributions, given the differing proportions of speaker contributions, not to mention their characteristics?
- Does the speaker distribution reflect the potential user distribution?
- Is the domain of the dataset (partly based on written language) appropriate?
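To make the first question concrete, here is one simple way to enforce disjoint speaker sets: split at the speaker level first, then assign each speaker’s utterances wholesale. This is a plain-Python sketch with a hypothetical utterance format (dicts carrying a speaker field), not the splitting logic I actually used:

```python
import random
from collections import defaultdict

def split_by_speaker(utterances, dev_frac=0.1, test_frac=0.1, seed=42):
    """Split utterances into train/dev/test with disjoint speaker sets."""
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt)

    # Shuffle speakers (not utterances) so each speaker lands in exactly one split.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_dev = max(1, int(len(speakers) * dev_frac))
    n_test = max(1, int(len(speakers) * test_frac))
    dev_spk = set(speakers[:n_dev])
    test_spk = set(speakers[n_dev:n_dev + n_test])

    split = {"train": [], "dev": [], "test": []}
    for spk, utts in by_speaker.items():
        key = "dev" if spk in dev_spk else "test" if spk in test_spk else "train"
        split[key].extend(utts)
    return split
```

Note that splitting by speaker count rather than utterance count means the resulting split proportions drift when speaker contributions are unbalanced – which is precisely the tension the question above points at.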
Lastly, I should not forget to mention data versioning. Thinking about this from the start will save you a good amount of time later on – “just wanting to try things out”, I learned this the hard way.
2. Know Thy Tools
This learning is strongly related to the previous one, the importance of the data, but it provides a complementary view on the matter. Put as a question, it could be: “Do you know what your tools are doing? Is it what you want them to do?”
Getting started in a new domain (be it speech recognition, image recognition, NLP, …) is not easy. Taking advantage of existing tools is more than reasonable. But my learning from a domain less blessed with (Python) open source tooling is to be cautious with seemingly useful but less established tools, particularly when it comes to data preparation. Does the provided functionality, e.g. for splitting a dataset into train/dev/test, respect the “disjoint speaker requirement”? Is the preprocessing of the different datasets consistent?
Doing these things yourself takes time and effort, but it might be the only way to get things done the way you need them.
A funny anecdote is in order. During some (explorative) preprocessing using Jupyter notebooks and pandas, I was taken aback by NaN transcripts appearing despite having made sure that no NaN transcriptions would pass my quality gate. Looking closer, I noticed that the NaNs only appeared after saving and reloading the dataframe… You guessed it – the transcript was the German word for the number zero, “null”, which triggered pandas’ built-in (but thankfully fully configurable) NaN handling when reading the file back in.
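The behaviour is easy to reproduce: pandas’ default list of missing-value tokens for `read_csv` includes the string “null”. A minimal example of both the pitfall and the fix via the `keep_default_na` parameter:

```python
import io
import pandas as pd

# A tiny transcript file: one utterance is the German word for zero.
csv_data = "transcript\nnull\nhallo welt\n"

# Default behaviour: the literal string "null" is parsed as a missing value,
# so it silently becomes NaN after a save/load round trip.
naive = pd.read_csv(io.StringIO(csv_data))

# Fix: disable the default NaN token list so text columns survive intact.
safe = pd.read_csv(io.StringIO(csv_data), keep_default_na=False)
```

If some columns do need NaN parsing, `na_values` lets you supply a custom token list instead of switching the default off entirely.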
3. Know Thy Goal
This learning is all about reflecting. It is easy to trigger yet another training job to see whether different hyperparameters lead to an improved model. It is hard to think about whether you are optimising for the right thing.
Maybe the data quality is not good enough. Maybe there is not enough data, or it is not appropriately pre-processed (audio volume, special characters). Which of these avenues to try next? It’s up to you (or me, in this case). And it requires hard thinking.
Deciding what to do next is ultimately determined by your goal. What do you want to achieve in the end? If it’s learning, you’re good to go. If it’s shipping an application, you may need to go to the trouble of collecting more data.
Understanding your goal and optimising for the right thing is consequently the first stage of CRISP-DM (the Cross-Industry Standard Process for Data Mining), called business understanding. I silently glossed over this by stipulating that “data” and “model” are all we need. They are by no means all we need.
4. Know Thy Past
This final learning is a friendly reminder (to me, to you) never to forget documentation. Experimentation, versioning, documentation and reproducibility go hand in hand. “Without the past, there is no future” – and without documentation, no sustainable progress. Setting up a documentation-by-design workflow and using appropriate tools where needed, e.g. MLflow and DVC, can be a real game-changer.