**TrOCR** is an end-to-end [Transformer](https://paperswithcode.com/methods/category/transformers)-based OCR model for text recognition that builds on pre-trained CV and NLP models. It uses the [Transformer](https://paperswithcode.com/method/transformer) architecture for both image understanding and wordpiece-level text generation. The input text image is first resized to $384 \times 384$ and then split into a sequence of $16 \times 16$ patches, which serve as the input to the image Transformer. A standard Transformer architecture with the [self-attention mechanism](https://paperswithcode.com/method/scaled) is used in both the encoder and the decoder, and the decoder generates wordpiece units as the recognized text from the input image.
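The patch arithmetic can be checked with a short sketch. It assumes square, non-overlapping patches of $16 \times 16$ pixels, as in the TrOCR paper; the helper names `patch_grid` and `split_into_patches` are hypothetical and not part of any TrOCR codebase.

```python
def patch_grid(image_size=384, patch_size=16):
    """Return (patches_per_side, total_patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def split_into_patches(image, patch_size):
    """Split a square 2D image (a list of rows) into a row-major list of patches."""
    size = len(image)
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            # Each patch is a patch_size x patch_size block of pixels.
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


print(patch_grid())  # (24, 576)
```

Under these assumptions, a $384 \times 384$ image yields a $24 \times 24$ grid, so the encoder sees a sequence of 576 patches (before any special tokens the model may add).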

**Meta Reward Learning (MeRL)** is a meta-learning method for the problem of learning from sparse and underspecified rewards. In this setting, an agent receives a complex input, such as a natural language instruction, and needs to generate a complex response, such as an action sequence, while receiving only binary success-failure feedback. The key insight of MeRL in dealing with underspecified rewards is that spurious trajectories and programs that achieve accidental success are detrimental to the agent's generalization performance. For example, an agent might be able to solve a specific instance of a maze navigation problem. However, if it learns to perform spurious actions during training, it is likely to fail when given unseen instructions. To mitigate this issue, MeRL optimizes a more refined auxiliary reward function, which can differentiate between accidental and purposeful success based on features of the action trajectories. The auxiliary reward is optimized via meta-learning by maximizing the trained agent's performance on a held-out validation set.

MeRL

Learning to Generalize from Sparse and Underspecified Rewards

TrOCR

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

**RegionViT** uses two tokenization processes that convert an image into regional tokens (upper path) and local tokens (lower path). Each tokenization is a convolution with a different patch size: $28^2$ for regional tokens and $4^2$ for local tokens, with both projected to dimension $C$. One regional token therefore covers $7^2$ local tokens based on spatial locality, which sets the window size of a local region to $7^2$. At stage 1, the two sets of tokens are passed through the proposed regional-to-local transformer encoders. For the later stages, to balance the computational load and to obtain feature maps at different resolutions, the approach applies a downsampling process to both regional and local tokens before each stage, halving the spatial resolution while doubling the channel dimension, as in CNNs. At the end of the network, classification simply averages the remaining regional tokens to form the final embedding, while detection uses all local tokens at each stage because they provide more fine-grained location information. With this pyramid structure, the ViT can generate multi-scale features and can therefore be extended easily to more vision applications, e.g., object detection, rather than image classification only.
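The token counts implied by these patch sizes can be worked through in a short sketch. The $224 \times 224$ input size, the base channel dimension of 96, and the four-stage depth are illustrative assumptions that the description above does not state, and `regionvit_token_shapes` is a hypothetical helper name.

```python
def regionvit_token_shapes(image_size=224, regional_patch=28, local_patch=4,
                           base_channels=96, num_stages=4):
    """Return (local tokens per window side, per-stage token counts)."""
    # One regional token spans (28 / 4)^2 = 7^2 local tokens.
    window = regional_patch // local_patch
    regional_side = image_size // regional_patch
    local_side = image_size // local_patch
    channels = base_channels  # assumed value for C
    stages = []
    for stage in range(1, num_stages + 1):
        if stage > 1:
            # Downsampling: halve spatial resolution, double channels.
            regional_side //= 2
            local_side //= 2
            channels *= 2
        stages.append({
            "stage": stage,
            "regional_tokens": regional_side ** 2,
            "local_tokens": local_side ** 2,
            "channels": channels,
        })
    return window, stages


window, stages = regionvit_token_shapes()
print(window)  # 7
for info in stages:
    print(info)
```

Because both token grids are halved together, each regional token keeps covering a $7 \times 7$ window of local tokens at every stage under these assumptions.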

Source	TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com