**VoiceFilter-Lite** is a single-channel source separation model that runs on-device to preserve only the speech of a target user, as part of a streaming speech recognition system. In this architecture, the voice filtering model operates as a frame-by-frame frontend signal processor that enhances the features consumed by the speech recognizer, without reconstructing audio signals from those features. The key contributions are: (1) a system that performs speech separation directly on ASR input features; (2) an asymmetric loss function that penalizes over-suppression during training, so that the model is harmless across acoustic environments; and (3) an adaptive suppression strength mechanism that adapts to different noise conditions.
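The asymmetric loss in contribution (2) can be illustrated with a short NumPy sketch. The description above does not give the exact formula, so the form used here is an assumption: a squared error in which over-suppression errors (enhanced feature below the clean feature) are scaled by a penalty factor before squaring. The function name `asymmetric_l2_loss` and the parameter `alpha` are hypothetical.

```python
import numpy as np

def asymmetric_l2_loss(clean, enhanced, alpha=10.0):
    """Squared error that weights over-suppression more heavily.

    Assumed form (not taken from the text above): where the enhanced
    feature falls below the clean one (diff > 0), the error is scaled
    by alpha before squaring. alpha = 1 recovers a plain L2 loss.
    """
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    diff = clean - enhanced
    # diff > 0 means the model removed energy that belonged to the target.
    weighted = np.where(diff > 0, alpha * diff, diff)
    return float(np.sum(weighted ** 2))
```

With `alpha > 1`, removing too much of the target speech costs more than leaving the same amount of noise in, which pushes the model toward conservative filtering.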

**Cross-Covariance Attention**, or **XCA**, is an [attention mechanism](https://paperswithcode.com/methods/category/attention-mechanisms-1) which operates along the feature dimension instead of the token dimension as in [conventional transformers](https://paperswithcode.com/methods/category/transformers).

Using the definitions of queries, keys and values from conventional attention, the cross-covariance attention function is defined as:

$$
\text{XC-Attention}(Q, K, V) = V \mathcal{A}_{\mathrm{XC}}(K, Q), \quad \mathcal{A}_{\mathrm{XC}}(K, Q) = \operatorname{Softmax}\left(\hat{K}^{\top} \hat{Q} / \tau\right)
$$

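where each output token embedding is a convex combination of the $d_{v}$ features of its corresponding token embedding in $V$. Here $\hat{K}$ and $\hat{Q}$ denote the $\ell_2$-normalized key and query matrices, and $\tau$ is a learnable temperature parameter. The attention weights $\mathcal{A}_{\mathrm{XC}}$ are computed from the cross-covariance matrix $\hat{K}^{\top} \hat{Q}$, which has size $d \times d$ (features by features) rather than tokens by tokens.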
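The definition above can be sketched for a single head in NumPy. Several choices here are assumptions made for illustration: $Q$, $K$ and $V$ are stored as `(N, d)` arrays (tokens by features), the hats are taken to mean $\ell_2$-normalization of each feature column over the token axis, and the softmax is taken over the first axis of the $d \times d$ matrix so that each output feature is a convex combination of the features in $V$. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Subtract the max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def xc_attention(Q, K, V, tau=1.0, eps=1e-12):
    """Single-head cross-covariance attention sketch.

    Q, K, V: arrays of shape (N, d), N tokens with d features each.
    Returns the output of shape (N, d) and the (d, d) attention map.
    """
    # Assumption: normalize each feature column over the token axis.
    Q_hat = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + eps)
    K_hat = K / (np.linalg.norm(K, axis=0, keepdims=True) + eps)
    # (d, d) map; each column sums to 1, so V @ A mixes the features of
    # every token with non-negative weights that sum to 1.
    A = softmax(K_hat.T @ Q_hat / tau, axis=0)
    return V @ A, A

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 5, 4))  # N = 5 tokens, d = 4 features
out, A = xc_attention(Q, K, V, tau=0.5)
print(out.shape, A.shape)
```

Because the attention map is $d \times d$, its size does not grow with the number of tokens $N$, unlike the $N \times N$ map of token self-attention.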

Cross-Covariance Attention

XCiT: Cross-Covariance Image Transformers

VoiceFilter-Lite

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

**GLOW** is a flow-based generative model built around an invertible $1 \times 1$ [convolution](https://paperswithcode.com/method/convolution). It builds on the flows introduced by [NICE](https://paperswithcode.com/method/nice) and [RealNVP](https://paperswithcode.com/method/realnvp). The model consists of a series of steps of flow, combined in a multi-scale architecture. Each step of flow consists of activation normalization (ActNorm), followed by an *invertible $1 \times 1$ convolution*, followed by an [affine coupling](https://paperswithcode.com/method/affine-coupling) layer.
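The invertible $1 \times 1$ convolution can be sketched in NumPy as a channel-mixing matrix applied at every spatial position. This is a minimal illustration under stated assumptions, not a full step of flow: the input is a single `(H, W, C)` array, the weight is initialized as a random orthogonal matrix, and the function names are hypothetical. ActNorm and the affine coupling layer are omitted.

```python
import numpy as np

def init_weight(c, rng):
    # Assumption: start from a random orthogonal matrix (log|det| = 0).
    q, _ = np.linalg.qr(rng.standard_normal((c, c)))
    return q

def invconv1x1_forward(x, weight):
    """Apply the same C x C matrix to the channel vector at each pixel.

    x: array of shape (H, W, C). Returns y and the log-determinant of
    the Jacobian, which is H * W * log|det(weight)|.
    """
    h, w, _ = x.shape
    y = x @ weight.T
    _, logabsdet = np.linalg.slogdet(weight)
    return y, h * w * logabsdet

def invconv1x1_inverse(y, weight):
    # Undo the channel mixing with the inverse matrix.
    return y @ np.linalg.inv(weight).T

rng = np.random.default_rng(0)
weight = init_weight(4, rng)
x = rng.standard_normal((8, 8, 4))
y, logdet = invconv1x1_forward(x, weight)
x_rec = invconv1x1_inverse(y, weight)
print(np.allclose(x, x_rec))
```

The log-determinant term is what a flow adds to the log-likelihood for this layer; it is cheap here because only a $C \times C$ determinant is needed, regardless of the spatial size.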

Source	VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com