**MUSIQ**, or **Multi-scale Image Quality Transformer**, is a [Transformer](https://paperswithcode.com/method/transformer)-based model for multi-scale image quality assessment. It processes native resolution images with varying sizes and aspect ratios. In MUSIQ, we construct a multi-scale image representation as input, including the native resolution image and its ARP resized variants.  Each image is split into fixed-size patches which are embedded by a patch encoding module (blue boxes). To capture 2D structure of the image and handle images of varying aspect ratios, the spatial embedding is encoded by hashing the patch position $(i,j)$ to $(t_{i},t_{j})$ within a grid of learnable embeddings (red boxes). Scale Embedding (green boxes) is introduced to capture scale information. The Transformer encoder takes the input tokens and performs multi-head self-attention. To predict the image quality, MUSIQ follows a common strategy in Transformers to add an [CLS] token to the sequence to represent the whole multi-scale input and the corresponding Transformer output is used as the final representation.

**Local Importance-based Pooling (LIP)** is a pooling layer that can enhance discriminative features during the downsampling procedure by learning adaptive importance weights based on inputs. By using a learnable network $G$ in $F$, the importance function now is not limited in hand-crafted forms and able to learn the criterion for the discriminativeness of features. Also, the window size of LIP is restricted to be not less than stride to fully utilize the feature map and avoid the issue of fixed interval sampling scheme. More specifically, the importance function in LIP is implemented by a tiny fully convolutional network, which learns to produce the importance map based on inputs in an end-to-end manner.

Local Importance-based Pooling

LIP: Local Importance-based Pooling

MUSIQ

MUSIQ: Multi-scale Image Quality Transformer

**Darknet-19** is a convolutional neural network that is used as the backbone of [YOLOv2](https://paperswithcode.com/method/yolov2).  Similar to the [VGG](https://paperswithcode.com/method/vgg) models it mostly uses $3 \times 3$ filters and doubles the number of channels after every pooling step. Following the work on Network in Network (NIN) it uses [global average pooling](https://paperswithcode.com/method/global-average-pooling) to make predictions as well as $1 \times 1$ filters to compress the feature representation between $3 \times 3$ convolutions. [Batch Normalization](https://paperswithcode.com/method/batch-normalization) is used to stabilize training, speed up convergence, and regularize the model batch.

Source	MUSIQ: Multi-scale Image Quality Transformer
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com