**Glow-TTS** is a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech. The model is directly trained to maximize the log-likelihood of speech with the alignment. Enforcing hard monotonic alignments helps enable robust TTS, which generalizes to long utterances, and employing flows enables fast, diverse, and controllable speech synthesis.
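
The dynamic-programming search for the most probable monotonic alignment can be illustrated with a minimal sketch. It assumes a precomputed matrix `log_lik[i][j]` holding the log-likelihood of latent frame `j` under the prior of text token `i`, and assumes that every token must cover at least one frame and that no token is skipped. The function name, the argument layout, and the toy values are illustrative assumptions, not taken from the Glow-TTS code.

```python
import math


def monotonic_alignment_search(log_lik):
    """Return (alignment, score) for the best monotonic alignment.

    log_lik[i][j] is the (assumed, precomputed) log-likelihood of latent
    frame j under text token i. alignment[j] is the token index for frame j.
    """
    n_text, n_frames = len(log_lik), len(log_lik[0])
    if n_text > n_frames:
        raise ValueError("need at least one frame per text token")

    neg_inf = -math.inf
    # q[i][j]: best cumulative log-likelihood with frame j assigned to token i.
    q = [[neg_inf] * n_frames for _ in range(n_text)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]                              # same token as previous frame
            move = q[i - 1][j - 1] if i > 0 else neg_inf    # advance to the next token
            q[i][j] = max(stay, move) + log_lik[i][j]

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return alignment, q[n_text - 1][n_frames - 1]


# Toy example: 3 text tokens, 4 latent frames (values are made up).
toy = [
    [-1, -5, -9, -9],
    [-9, -1, -1, -9],
    [-9, -9, -5, -1],
]
alignment, score = monotonic_alignment_search(toy)
print(alignment, score)  # [0, 1, 1, 2] -4
```

Each frame either stays on the previous frame's token or advances by exactly one token, which is what makes the alignment hard and monotonic. The table costs time proportional to the number of tokens times the number of frames.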

**Convolutional Hough Matching**, or **CHM**, is a geometric matching algorithm that distributes similarities of candidate matches over a geometric transformation space and evaluates them in a convolutional manner. It is cast into a trainable neural layer with a semi-isotropic high-dimensional kernel, which learns non-rigid matching with a small number of interpretable parameters.

Convolutional Hough Matching Networks for Robust and Efficient Visual Correspondence

Glow-TTS

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

**CT3D** is a two-stage 3D object detection framework that leverages a high-quality region proposal network and a Channel-wise [Transformer](https://paperswithcode.com/method/transformer) architecture. The proposed CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation for the point features within each proposal. Specifically, CT3D uses a proposal's keypoints for spatial contextual modelling and learns attention propagation in the encoding module, mapping the proposal to point embeddings. Next, a new channel-wise decoding module enriches the query-key interaction via channel-wise re-weighting to effectively merge multi-level contexts, which contributes to more accurate object predictions. 

In CT3D, the raw points are first fed into the [RPN](https://paperswithcode.com/method/rpn) to generate 3D proposals. The raw points, along with the corresponding proposals, are then processed by the channel-wise Transformer, which is composed of the proposal-to-point encoding module and the channel-wise decoding module. Specifically, the proposal-to-point encoding module modulates each point feature with global proposal-aware context information. The channel-wise decoding module then transforms the encoded point features into an effective proposal feature representation for confidence prediction and box regression.
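
As a rough illustration of what a proposal-aware point feature could look like, the sketch below encodes one point by its offsets to a proposal's keypoints. The text above does not say which keypoints are used or how they enter the embedding. This sketch assumes the box centre plus its eight corners, and an axis-aligned box with heading ignored. The function names and the feature layout are hypothetical.

```python
from itertools import product


def proposal_keypoints(center, size):
    """Return the box centre followed by its 8 corners.

    Assumption: the box is axis-aligned (heading ignored) and `size`
    is (length, width, height).
    """
    cx, cy, cz = center
    length, width, height = size
    corners = [
        (cx + sx * length / 2, cy + sy * width / 2, cz + sz * height / 2)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]
    return [tuple(center)] + corners


def proposal_aware_feature(point, center, size):
    """Concatenate the offsets from one point to every keypoint (9 x 3 = 27 values)."""
    feature = []
    for kx, ky, kz in proposal_keypoints(center, size):
        feature.extend((point[0] - kx, point[1] - ky, point[2] - kz))
    return feature


# One point inside a 2 x 2 x 2 proposal centred at the origin.
feature = proposal_aware_feature((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
print(len(feature))  # 27
```

In a full model, vectors like this would be projected to an embedding and passed through the encoding and decoding modules. That part is omitted here because the description above does not specify it.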

Source	Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com