**Seq2Edits** is an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. In this approach, each sequence-to-sequence transduction is represented as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. For text normalization, sentence fusion, sentence splitting & rephrasing, text simplification, and grammatical error correction, the approach improves explainability by associating each edit operation with a human-readable tag.

Rather than generating the target sentence as a series of tokens, the model predicts a sequence of edit operations that, when applied to the source sentence, yields the target sentence. Each edit operates on a span in the source sentence and either copies, deletes, or replaces it with one or more target tokens. Edits are generated auto-regressively from left to right using a modified [Transformer](https://paperswithcode.com/method/transformer) architecture to facilitate learning of long-range dependencies.
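The way a predicted edit sequence turns a source sentence into a target sentence can be illustrated with a minimal sketch. The `Edit` tuple, the `apply_edits` helper, and the tag names below are hypothetical and chosen for illustration only; the sketch assumes each edit carries a tag, the exclusive end position of its source span, and an optional replacement, and it does not model the Transformer that predicts the edits.

```python
from typing import List, NamedTuple, Optional


class Edit(NamedTuple):
    """One span-level edit (hypothetical representation for illustration)."""
    tag: str                          # human-readable label for the edit
    span_end: int                     # exclusive end index of the source span
    replacement: Optional[List[str]]  # None = keep span, [] = delete, tokens = replace


def apply_edits(source: List[str], edits: List[Edit]) -> List[str]:
    """Apply edits left to right; each span starts where the previous one ended."""
    target: List[str] = []
    start = 0
    for edit in edits:
        if not start <= edit.span_end <= len(source):
            raise ValueError(f"invalid span end {edit.span_end} after position {start}")
        if edit.replacement is None:
            # Keep (copy) the source span unchanged.
            target.extend(source[start:edit.span_end])
        else:
            # Replace the span; an empty replacement deletes it.
            target.extend(edit.replacement)
        start = edit.span_end
    if start != len(source):
        raise ValueError("edits must cover the whole source sentence")
    return target


source = ["She", "go", "to", "school", "yesterday", "."]
edits = [
    Edit("SELF", 1, None),           # keep "She"
    Edit("VERB_FORM", 2, ["went"]),  # replace "go" with "went"
    Edit("SELF", 6, None),           # keep the rest
]
corrected = apply_edits(source, edits)
```

Because every span begins where the previous one ended, only the end position needs to be stored per edit, and a zero-length span with a non-empty replacement acts as an insertion.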

**RegionViT** uses two tokenization processes that convert an image into regional tokens (upper path) and local tokens (lower path). Each tokenization is a convolution with a different patch size: $28^2$ for regional tokens and $4^2$ for local tokens, with both projected to dimension $C$. One regional token therefore covers $7^2$ local tokens based on spatial locality, which sets the window size of a local region to $7^2$. At stage 1, the two sets of tokens are passed through the proposed regional-to-local transformer encoders. For the later stages, to balance the computational load and to produce feature maps at different resolutions, the approach applies a downsampling process to both regional and local tokens before each stage, halving the spatial resolution while doubling the channel dimension, as in a CNN. At the end of the network, classification simply averages the remaining regional tokens to form the final embedding, while detection uses all local tokens at each stage because they provide more fine-grained location information. With this pyramid structure, the ViT generates multi-scale features and can therefore be extended to vision applications beyond image classification, such as object detection.
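The token counts implied by these patch sizes can be checked with a short calculation. This is a sketch under stated assumptions: the $224 \times 224$ input size and the four-stage default are illustrative choices not given in the text, and `token_grid_per_stage` is a hypothetical helper, not part of any RegionViT implementation.

```python
from typing import Dict, List, Union


def token_grid_per_stage(
    image_size: int = 224,    # assumed input resolution (not stated in the text)
    regional_patch: int = 28,
    local_patch: int = 4,
    num_stages: int = 4,      # assumed number of stages (not stated in the text)
) -> List[Dict[str, Union[int, str]]]:
    """Count regional and local tokens at each stage of the pyramid."""
    if image_size % regional_patch != 0 or regional_patch % local_patch != 0:
        raise ValueError("patch sizes must divide the image size evenly")
    regional_side = image_size // regional_patch
    local_side = image_size // local_patch
    channel_multiplier = 1
    stages: List[Dict[str, Union[int, str]]] = []
    for stage in range(1, num_stages + 1):
        stages.append({
            "stage": stage,
            "regional_tokens": regional_side ** 2,
            "local_tokens": local_side ** 2,
            # Local tokens covered by one regional token (the window size).
            "local_per_regional": (local_side // regional_side) ** 2,
            "channels": f"{channel_multiplier}C",
        })
        if stage < num_stages:
            if regional_side % 2 != 0 or local_side % 2 != 0:
                raise ValueError("token grid cannot be halved further")
            # Downsampling: halve the spatial resolution, double the channels.
            regional_side //= 2
            local_side //= 2
            channel_multiplier *= 2
    return stages


stages = token_grid_per_stage()
```

Under these assumptions, stage 1 has an $8 \times 8$ grid of regional tokens and a $56 \times 56$ grid of local tokens. Because both grids are halved together, each regional token keeps covering $7^2$ local tokens at every stage.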

RegionViT

RegionViT: Regional-to-Local Attention for Vision Transformers

Seq2Edits

Seq2Edits: Sequence Transduction Using Span-level Edit Operations

There are at least eight notable examples of models from the literature that can be described using the **Message Passing Neural Networks** (**MPNN**) framework. For simplicity we describe MPNNs which operate on undirected graphs $G$ with node features $x_{v}$ and edge features $e_{vw}$. It is trivial to extend the formalism to directed multigraphs. The forward pass has two phases, a message passing phase and a readout phase. The message passing phase runs for $T$ time steps and is defined in terms of message functions $M_{t}$ and vertex update functions $U_{t}$. During the message passing phase, hidden states $h_{v}^{t}$ at each node in the graph are updated based on messages $m_{v}^{t+1}$ according to
$$
m_{v}^{t+1} = \sum_{w \in N(v)} M_{t}(h_{v}^{t}, h_{w}^{t}, e_{vw})
$$
$$
h_{v}^{t+1} = U_{t}(h_{v}^{t}, m_{v}^{t+1})
$$
where in the sum, $N(v)$ denotes the neighbors of $v$ in graph $G$. The readout phase computes a feature vector for the whole graph using some readout function $R$ according to
$$
\hat{y} = R(\{ h_{v}^{T} \mid v \in G \})
$$
The message functions $M_{t}$, vertex update functions $U_{t}$, and readout function $R$ are all learned differentiable functions. $R$ operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism.
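The two phases can be traced in a minimal sketch. It rests on several assumptions: the message, update, and readout functions are fixed toy functions standing in for the learned $M_{t}$, $U_{t}$, and $R$; the same functions are reused at every time step; and messages have the same dimension as hidden states. The name `mpnn_forward` and its signature are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Vector = List[float]


def mpnn_forward(
    node_features: Dict[Hashable, Vector],
    edge_features: Dict[Tuple[Hashable, Hashable], float],
    message_fn: Callable[[Vector, Vector, float], Vector],
    update_fn: Callable[[Vector, Vector], Vector],
    readout_fn: Callable[[List[Vector]], Vector],
    num_steps: int,
) -> Vector:
    """Run T message passing steps on an undirected graph, then the readout."""
    # Build symmetric neighbor lists, since the graph is undirected.
    neighbors: Dict[Hashable, List[Tuple[Hashable, float]]] = {v: [] for v in node_features}
    for (v, w), e_vw in edge_features.items():
        neighbors[v].append((w, e_vw))
        neighbors[w].append((v, e_vw))

    h = {v: list(x) for v, x in node_features.items()}  # h_v^0 = x_v
    for _ in range(num_steps):
        new_h = {}
        for v in h:
            # m_v^{t+1} = sum over w in N(v) of M(h_v^t, h_w^t, e_vw)
            m = [0.0] * len(h[v])
            for w, e_vw in neighbors[v]:
                msg = message_fn(h[v], h[w], e_vw)
                m = [a + b for a, b in zip(m, msg)]
            # h_v^{t+1} = U(h_v^t, m_v^{t+1}); all nodes update synchronously.
            new_h[v] = update_fn(h[v], m)
        h = new_h
    return readout_fn(list(h.values()))


# Toy stand-ins for the learned functions (assumptions for illustration).
def toy_message(h_v: Vector, h_w: Vector, e_vw: float) -> Vector:
    return [e_vw * x for x in h_w]


def toy_update(h_v: Vector, m_v: Vector) -> Vector:
    return [0.5 * (a + b) for a, b in zip(h_v, m_v)]


def sum_readout(states: List[Vector]) -> Vector:
    # Summation ignores node order, so the readout is permutation invariant.
    return [sum(column) for column in zip(*states)]


# Path graph 0 - 1 - 2 with scalar node features and unit edge features.
features = {0: [1.0], 1: [2.0], 2: [3.0]}
edges = {(0, 1): 1.0, (1, 2): 1.0}
y_hat = mpnn_forward(features, edges, toy_message, toy_update, sum_readout, num_steps=1)
```

Using a sum for the readout is one simple way to satisfy the permutation-invariance requirement: relabeling or reordering the nodes leaves `y_hat` unchanged.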

Source	Seq2Edits: Sequence Transduction Using Span-level Edit Operations
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com