Florence is a computer vision foundation model aiming to learn universal visual-language representations that be adapted to various computer vision tasks, visual question answering, image captioning, video retrieval, among other tasks. Florence's workflow consists of data curation, unified learning, Transformer architectures and adaption. Florence is pre-trained in an image-label-description space, utilizing a unified image-text contrastive learning. It involves a two-tower architecture: 12-layer Transformer for the language encoder, and a Vision Transformer for the image encoder. Two linear projection layers are added on top of the image encoder and language encoder to match the dimensions of image and language features. Compared to previous methods for cross-modal shared representations, Florence expands beyond simple classification and retrieval capabilities to advanced representations that support object level, multiple modality, and videos respectively.

**GreedyNAS-C** is a convolutional neural network discovered using the [GreedyNAS](https://paperswithcode.com/method/greedynas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building blocks used are inverted residual blocks (from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2)) and squeeze-and-excitation blocks.

GreedyNAS-C

GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet

Florence

Florence: A New Foundation Model for Computer Vision

**PointRend** is a module for image segmentation tasks, such as instance and semantic segmentation, that attempts to treat segmentation as image rending problem to efficiently "render" high-quality label maps. It uses a subdivision strategy to adaptively select a non-uniform set of points at which to compute labels. PointRend can be incorporated into popular meta-architectures for both instance segmentation (e.g. [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn)) and semantic segmentation (e.g. [FCN](https://paperswithcode.com/method/fcn)). Its subdivision strategy efficiently computes high-resolution segmentation maps using an order of magnitude fewer floating-point operations than direct, dense computation.

PointRend is a general module that admits many possible implementations. Viewed abstractly, a PointRend module accepts one or more typical CNN feature maps $f\left(x\_{i}, y\_{i}\right)$ that are defined over regular grids, and outputs high-resolution predictions $p\left(x^{'}\_{i}, y^{'}\_{i}\right)$ over a finer grid. Instead of making excessive predictions over all points on the output grid, PointRend makes predictions only on carefully selected points. To make these predictions, it extracts a point-wise feature representation for the selected points by interpolating $f$, and uses a small point head subnetwork to predict output labels from the point-wise features.

Source	Florence: A New Foundation Model for Computer Vision
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com