**NVAE**, or **Nouveau VAE**, is deep, hierarchical variational autoencoder. It can be trained with the original [VAE](https://paperswithcode.com/method/vae) objective, unlike alternatives such as [VQ-VAE-2](https://paperswithcode.com/method/vq-vae-2). NVAE’s design focuses on tackling two main challenges: (i) designing expressive neural
networks specifically for VAEs, and (ii) scaling up the training to a large number of hierarchical
groups and image sizes while maintaining training stability.

To tackle long-range correlations in the data, the model employs hierarchical multi-scale modelling. The generative model starts from a small spatially arranged latent variables as $\mathbf{z}\_{1}$ and samples from the hierarchy group-by-group while gradually doubling the spatial dimensions. This multi-scale approach enables NVAE to capture global long-range correlations at the top of the hierarchy and local fine-grained dependencies at the lower groups.

Additional design choices include the use of residual cells for the generative models and the encoder, which employ a number of tricks and modules to achieve good performance, and the use of residual normal distributions to smooth optimization. See the components section for more details.

FashionCLIP is a fine-tuned CLIP model on fashion data (more than 800K pairs). It is the first foundation model for Fashion.

FashionCLIP

Contrastive language and vision learning of general fashion concepts

NVAE

NVAE: A Deep Hierarchical Variational Autoencoder

**TabNet** is a deep tabular data learning architecture that uses sequential attention to choose which features to reason from at each decision step.

The TabNet encoder is composed of a feature transformer, an attentive transformer and feature masking. A split block
divides the processed representation to be used by the attentive transformer of the subsequent step as well as for the overall output. For each step, the feature selection mask provides interpretable information about the model’s functionality, and the masks can be aggregated to obtain global feature important attribution. The TabNet decoder is composed of a feature transformer block at each step. 

In the feature transformer block, a 4-layer network is used, where 2 are shared across all decision steps and 2 are decision step-dependent. Each layer is composed of a fully-connected (FC) layer, BN and GLU nonlinearity. An attentive transformer block example – a single layer mapping is modulated with a prior scale information which aggregates how much each feature has been used before the current decision step. sparsemax is used for normalization of the coefficients, resulting in sparse selection of the salient features.

Source	NVAE: A Deep Hierarchical Variational Autoencoder
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com