Audio samples from "WHAT DOES A NETWORK LAYER HEAR?
ANALYZING HIDDEN REPRESENTATIONS OF END-TO-END ASR THROUGH SPEECH SYNTHESIS"

Paper: arXiv

Authors: Chung-Yi Li, Pei-Chieh Yuan, and Hung-Yi Lee

Abstract: End-to-end speech recognition systems have achieved competitive results compared to traditional systems. However, the complex transformations involved between layers given highly variable acoustic signals are hard to analyze. In this paper, we present our ASR probing model, which synthesizes speech from hidden representations of end-to-end ASR to examine the information maintained after each layer's computation. Listening to the synthesized speech, we observe gradual removal of speaker variability and noise as the layers go deeper, which aligns with previous studies on how deep networks function in speech recognition. This paper is the first study analyzing the end-to-end speech recognition model by demonstrating what each layer hears. Speaker verification and speech enhancement measurements on synthesized speech are also conducted to further confirm our observations.

This page shows some audio samples in support of the paper; we suggest that the reader listen to the samples while reading the paper. All utterances were unseen during training.

1. Demo

The following examples are from models trained on the 100-hour subset of LibriSpeech. Each test utterance is a mixture of speech from the LibriSpeech test-clean set and noise from MUSAN. The reader can select any combination of 8 speech utterances and 4 kinds of noise using the two menus.
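For readers who want to build similar test mixtures, the sketch below shows one common way to add a noise clip to a speech utterance at a chosen signal-to-noise ratio. This page does not state the mixing procedure or the SNR used, so the function `mix_at_snr`, the 10 dB value, and the toy signals standing in for LibriSpeech and MUSAN audio are all assumptions made for illustration.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise ratio is snr_db (in dB).

    Hypothetical helper: the noise is looped or truncated to the length of
    the speech, then scaled so that
    20 * log10(rms(speech) / rms(scaled_noise)) == snr_db.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the noise clip as needed and cut it to the speech length.
    repeats = -(-len(speech) // len(noise))  # ceiling division
    noise = (list(noise) * repeats)[: len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(speech)  # silent noise: nothing to add
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals standing in for a speech utterance and a noise clip
# (assumed 16 kHz sampling; real data would be loaded from audio files).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
mixture = mix_at_snr(speech, noise, snr_db=10.0)
```

Scaling the noise instead of the speech keeps the speech level fixed across noise types, which makes mixtures of the same utterance easier to compare by ear.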

Choose speech:	Choose noise:	Mixture	Text
			And the old gentleman was so delighted with his success that he had to burst out into a series of short happy bits of laughter that occupied quite a space of time.

VGG-LSTM

	Noise-Robust	Baseline
Layer1 - cnn1
Layer5 - cnn2
Layer10 - cnn4
Layer11 - blstm1
Layer13 - blstm3
Layer15 - blstm5

pure-LSTM

	Noise-Robust	Baseline
Layer1 - blstm1
Layer3 - blstm2
Layer5 - blstm3
Layer7 - blstm4
Layer8 - blstm5

2. Model Architecture

Pure-LSTM ASR Model

VGG-LSTM ASR Model

Probing Model (Reconstruction Model)