**ParaNet** is a non-autoregressive, attention-based architecture for text-to-speech. It is fully convolutional and converts text to a mel spectrogram. ParaNet distills the attention from an autoregressive text-to-spectrogram model and iteratively refines the alignment between text and spectrogram layer by layer. The architecture is similar to [Deep Voice 3](https://paperswithcode.com/method/deep-voice-3) (DV3) except for changes to the decoder: whereas the decoder of DV3 has multiple attention-based layers, each consisting of a [causal convolution](https://paperswithcode.com/method/causal-convolution) block followed by an attention block, ParaNet has a single attention block in the encoder.
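The description above says that ParaNet distills attention from an autoregressive teacher but does not state the loss. As a minimal sketch, assuming the distillation term is a cross-entropy between the teacher's and the student's attention distributions, it could look like the following. The function name `attention_distillation_loss`, the `eps` smoothing constant, and the toy attention matrices are all hypothetical and chosen only for illustration.

```python
import math


def attention_distillation_loss(teacher_attn, student_attn, eps=1e-8):
    """Average cross-entropy between teacher and student attention.

    Both arguments are lists of rows. Each row is one decoder step and
    holds a probability distribution over text positions.
    This loss form is an assumption, not taken from the description.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        # Cross-entropy of the student row measured against the teacher row.
        total -= sum(t * math.log(s + eps) for t, s in zip(t_row, s_row))
    return total / len(teacher_attn)


# Toy alignment: 2 decoder steps attending over 2 text positions.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned_student = [[1.0, 0.0], [0.0, 1.0]]
diffuse_student = [[0.5, 0.5], [0.5, 0.5]]

loss_aligned = attention_distillation_loss(teacher, aligned_student)
loss_diffuse = attention_distillation_loss(teacher, diffuse_student)
```

Under this assumption, a student whose attention matches the teacher gets a loss near zero, while a diffuse alignment is penalized.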

**Spatial Attention Module (SAM)** is a feature extraction module for object detection used in [ThunderNet](https://paperswithcode.com/method/thundernet).

The ThunderNet SAM explicitly re-weights the feature map over the spatial dimensions before RoI warping. The key idea of SAM is to use the knowledge from [RPN](https://paperswithcode.com/method/rpn) to refine the feature distribution of the feature map. RPN is trained to recognize foreground regions under the supervision of ground truths. Therefore, the intermediate features in RPN can be used to distinguish foreground features from background features. SAM accepts two inputs: the intermediate feature map from RPN $\mathcal{F}^{RPN}$ and the thin feature map from the [Context Enhancement Module](https://paperswithcode.com/method/context-enhancement-module) $\mathcal{F}^{CEM}$. The output of SAM $\mathcal{F}^{SAM}$ is defined as:

$$ \mathcal{F}^{SAM} = \mathcal{F}^{CEM} * \text{sigmoid}\left(\theta\left(\mathcal{F}^{RPN}\right)\right) $$

Here $\theta\left(\cdot\right)$ is a dimension transformation that matches the number of channels in the two feature maps. The sigmoid function constrains the values to $\left[0, 1\right]$. Finally, $\mathcal{F}^{CEM}$ is re-weighted by the generated feature map for a better feature distribution. For computational efficiency, a 1×1 [convolution](https://paperswithcode.com/method/convolution) is used as $\theta\left(\cdot\right)$, so the computational cost of SAM is negligible. The figure to the right shows the structure of SAM.
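The SAM formula can be illustrated with a small dependency-free sketch. It assumes feature maps stored as nested lists in `[channel][row][column]` order and treats the product in the formula as element-wise. The helper names `conv1x1` and `spatial_attention`, the weight matrix `theta_weight`, and the toy values are hypothetical and not part of ThunderNet's code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1x1(feature, weight):
    """1x1 convolution without bias.

    feature: [C_in][H][W], weight: [C_out][C_in]; returns [C_out][H][W].
    """
    c_in = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [
            [
                sum(weight[o][i] * feature[i][y][x] for i in range(c_in))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for o in range(len(weight))
    ]


def spatial_attention(f_cem, f_rpn, theta_weight):
    """F_SAM = F_CEM * sigmoid(theta(F_RPN)), applied element-wise."""
    gate = conv1x1(f_rpn, theta_weight)  # theta matches the channel counts
    return [
        [
            [f_cem[c][y][x] * sigmoid(gate[c][y][x]) for x in range(len(row))]
            for y, row in enumerate(channel)
        ]
        for c, channel in enumerate(f_cem)
    ]


# Toy inputs: F_RPN has 2 channels, F_CEM has 3 channels, both 1x2 spatially.
f_rpn = [[[0.0, 10.0]], [[0.0, -10.0]]]
f_cem = [[[2.0, 2.0]], [[4.0, 4.0]], [[6.0, 6.0]]]
theta_weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # maps 2 -> 3 channels

f_sam = spatial_attention(f_cem, f_rpn, theta_weight)
```

In this toy case, positions with a large positive RPN response keep almost all of their CEM value, positions with a large negative response are suppressed towards zero, and a zero response halves the value.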

SAM has two functions. The first is to refine the feature distribution by strengthening foreground features and suppressing background features. The second is to stabilize the training of RPN, since SAM enables extra gradient flow from the [R-CNN](https://paperswithcode.com/method/r-cnn) subnet to RPN. As a result, RPN receives additional supervision from the R-CNN subnet, which helps its training.

Spatial Attention Module (ThunderNet)

ThunderNet: Towards Real-time Generic Object Detection

ParaNet

Non-Autoregressive Neural Text-to-Speech

FastGCN is a fast improvement of the GCN model recently proposed by Kipf & Welling (2016a) for learning graph embeddings. It generalizes transductive training to an inductive manner and also addresses the memory bottleneck of GCN caused by the recursive expansion of neighborhoods. The crucial ingredient is a sampling scheme in the reformulation of the loss and the gradient, well justified through an alternative view of graph convolutions as integral transforms of embedding functions.

Description and image from: [FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling](https://arxiv.org/pdf/1801.10247.pdf)
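The sampling idea can be sketched for a single aggregation step. The code below is an illustration under stated assumptions, not the authors' implementation: it estimates the product of a normalized adjacency matrix `a_hat` and a feature matrix `h` by importance-sampling nodes with probability proportional to the squared norm of their adjacency column. All function names, the toy graph, and the sample count are hypothetical.

```python
import random


def column_sampling_probs(a_hat):
    """q(u) proportional to the squared norm of column u of a_hat."""
    n = len(a_hat)
    sq_norms = [sum(a_hat[v][u] ** 2 for v in range(n)) for u in range(n)]
    total = sum(sq_norms)
    return [s / total for s in sq_norms]


def exact_aggregate(a_hat, h):
    """Full product a_hat @ h, touching every node."""
    n, d = len(a_hat), len(h[0])
    return [
        [sum(a_hat[v][u] * h[u][k] for u in range(n)) for k in range(d)]
        for v in range(n)
    ]


def sampled_aggregate(a_hat, h, num_samples, rng):
    """Monte Carlo estimate of a_hat @ h from sampled nodes only."""
    n, d = len(a_hat), len(h[0])
    q = column_sampling_probs(a_hat)
    sampled = rng.choices(range(n), weights=q, k=num_samples)
    out = [[0.0] * d for _ in range(n)]
    for u in sampled:
        # Dividing by q[u] keeps the estimate unbiased.
        scale = 1.0 / (num_samples * q[u])
        for v in range(n):
            for k in range(d):
                out[v][k] += scale * a_hat[v][u] * h[u][k]
    return out


# Toy 3-node graph with one feature per node.
a_hat = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
h = [[1.0], [2.0], [3.0]]

q = column_sampling_probs(a_hat)
exact = exact_aggregate(a_hat, h)
estimate = sampled_aggregate(a_hat, h, 20000, random.Random(0))
```

With enough samples the estimate approaches the exact product. In practice the point of sampling is to use far fewer samples than nodes per batch, which is what avoids the recursive neighborhood expansion mentioned above.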

Source	Non-Autoregressive Neural Text-to-Speech
Year	2019
Data Source	CC BY-SA - https://paperswithcode.com