**PQ-Transformer**, or **PointQuad-Transformer**, is a [Transformer](https://paperswithcode.com/method/transformer)-based architecture that simultaneously predicts 3D objects and room layouts from point cloud inputs. Unlike existing methods, which estimate either layout keypoints or layout edges, it parameterizes room layouts directly as a set of quads. Alongside the quad representation, it uses a physical constraint loss that discourages object-layout interference.

Given an input 3D point cloud of $N$ points, the point cloud feature learning backbone extracts $M$ context-aware point features of $\left(3+C\right)$ dimensions through sampling and grouping. A voting module generates $K\_{1}$ object proposals, and a farthest point sampling (FPS) module generates $K\_{2}$ quad proposals. A transformer decoder then processes the proposals to further refine their features. After several feedforward layers and non-maximum suppression (NMS), the proposals become the final object bounding boxes and layout quads.
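The farthest point sampling step mentioned above is a generic greedy procedure, and a minimal sketch may make it concrete. This is not the authors' implementation: the function name `farthest_point_sampling`, the fixed `start` index, and the toy point list are assumptions made for illustration, and a real pipeline would run a batched GPU version.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart.

    points: list of (x, y, z) tuples; k: number of samples; start: index of
    the first picked point (an arbitrary choice in this sketch).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be in 1..len(points)")
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Pick the point that is farthest from the current sample set.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


# Toy cloud: two near-duplicate points, two axis points, one far outlier.
points = [(0, 0, 0), (0.1, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5)]
seeds = farthest_point_sampling(points, k=3)
```

The greedy rule spreads the samples over the cloud, so near-duplicate points are unlikely to both be selected when $K\_{2}$ is much smaller than the number of input points.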

**BezierAlign** is a feature sampling method for arbitrarily-shaped scene text recognition that exploits the parameterized nature of a compact Bezier curve bounding box. Unlike RoIAlign, the sampling grid of BezierAlign is not rectangular. Instead, each column of the arbitrarily-shaped grid is orthogonal to the Bezier curve boundary of the text. The sampling points are equidistant in width and in height, and their values are bilinearly interpolated with respect to their coordinates.

Formally, given an input feature map and Bezier curve control points, we concurrently process all the output pixels of the rectangular output feature map of size $h\_{\text {out }} \times w\_{\text {out }}$. Taking the output pixel $g\_{i}$ at position $\left(g\_{i w}, g\_{i h}\right)$ as an example, we calculate $t$ by:

$$
t=\frac{g\_{i w}}{w\_{\text {out }}}
$$

We then use $t$ to calculate the point $tp$ on the upper Bezier curve boundary and the point $bp$ on the lower Bezier curve boundary. Using $tp$ and $bp$, we can linearly interpolate the sampling point $op$ by:

$$
op=bp \cdot \frac{g\_{i h}}{h\_{\text {out }}}+tp \cdot\left(1-\frac{g\_{i h}}{h\_{\text {out }}}\right)
$$

With the position of $op$, we can easily apply bilinear interpolation to calculate the result. Comparisons among previous sampling methods and BezierAlign are shown in the Figure.
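The sampling procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the reference implementation: it assumes cubic Bezier curves with four control points per boundary, a single-channel feature map stored as a list of rows, coordinates given in feature-map pixels, and clamping at the map border. The names `bezier_point`, `bilinear`, and `bezier_align` are hypothetical.

```python
def bezier_point(ctrl, t):
    """Evaluate a cubic Bezier curve (4 control points, assumed) at t."""
    b = ((1 - t) ** 3, 3 * t * (1 - t) ** 2, 3 * t ** 2 * (1 - t), t ** 3)
    x = sum(w * p[0] for w, p in zip(b, ctrl))
    y = sum(w * p[1] for w, p in zip(b, ctrl))
    return x, y


def bilinear(feat, x, y):
    """Bilinearly interpolate a 2D feature map (list of rows) at (x, y)."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the map (sketch assumption)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = feat[y0][x0] * (1 - ax) + feat[y0][x1] * ax
    bot = feat[y1][x0] * (1 - ax) + feat[y1][x1] * ax
    return top * (1 - ay) + bot * ay


def bezier_align(feat, top_ctrl, bottom_ctrl, h_out, w_out):
    """Sample an h_out x w_out grid between two Bezier boundaries."""
    out = []
    for gh in range(h_out):
        row = []
        for gw in range(w_out):
            t = gw / w_out                    # t = g_iw / w_out
            tp = bezier_point(top_ctrl, t)    # point on the upper boundary
            bp = bezier_point(bottom_ctrl, t) # point on the lower boundary
            r = gh / h_out                    # g_ih / h_out
            # op = bp * r + tp * (1 - r), applied per coordinate
            op = (bp[0] * r + tp[0] * (1 - r), bp[1] * r + tp[1] * (1 - r))
            row.append(bilinear(feat, *op))
        out.append(row)
    return out


# Toy input: feature value x + 10 * y on a 4 x 8 map, straight boundaries.
feat = [[x + 10 * y for x in range(8)] for y in range(4)]
top = [(0, 0), (2, 0), (4, 0), (6, 0)]
bottom = [(0, 2), (2, 2), (4, 2), (6, 2)]
aligned = bezier_align(feat, top, bottom, h_out=2, w_out=3)
```

With straight, evenly spaced control points the grid degenerates to an axis-aligned rectangle, which makes the toy case easy to check by hand. Curved control points bend the sampling columns along the text boundary.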

BezierAlign was introduced in *ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network*.

PQ-Transformer was introduced in *PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds*.

**Fastformer** is a type of [Transformer](https://paperswithcode.com/method/transformer) that uses [additive attention](https://www.paperswithcode.com/method/additive-attention) as a building block. Instead of modeling the pair-wise interactions between tokens, it uses [additive attention](https://paperswithcode.com/method/additive-attention) to model global contexts, and then further transforms each token representation based on its interaction with the global context representations.
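The two steps described above, pooling a global context with additive attention and then letting each token interact with it, can be sketched as follows. This is a simplified single-head illustration, not the published model: the scoring vector `w` would be learned in practice, the interaction is assumed here to be an element-wise product, and the function names are hypothetical.

```python
import math


def additive_attention_pool(tokens, w):
    """Summarize a sequence into one global vector with additive attention.

    tokens: list of d-dimensional vectors; w: d-dimensional scoring vector
    (learned in a real model, fixed here for illustration).
    """
    d = len(w)
    scores = [sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(d) for x in tokens]
    m = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the tokens gives the global context vector.
    pooled = [sum(a * x[j] for a, x in zip(alphas, tokens)) for j in range(d)]
    return pooled, alphas


def interact_with_global(tokens, global_vec):
    """Mix each token with the global vector (element-wise product, assumed)."""
    return [[g * xj for g, xj in zip(global_vec, x)] for x in tokens]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
global_vec, alphas = additive_attention_pool(tokens, w=[0.5, -0.5])
mixed = interact_with_global(tokens, global_vec)
```

Pooling visits each token once, so the cost of this sketch grows linearly with sequence length. Scoring every pair of tokens would instead grow quadratically.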

Source	PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com