
What is: Context-aware Visual Attention-based (CoVA) webpage object detection pipeline?

Source: CoVA: Context-aware Visual Attention for Webpage Information Extraction
Year: 2000
Data Source: CC BY-SA - https://paperswithcode.com

The Context-Aware Visual Attention-based end-to-end pipeline for Webpage Object Detection (CoVA) aims to learn a function $f$ that predicts labels $y = [y_1, y_2, \ldots, y_N]$ for a webpage containing $N$ web elements. The input to CoVA consists of (a minimal representation is sketched in code after this list):

  1. a screenshot of a webpage,
  2. a list of bounding boxes [x, y, w, h] of the web elements, and
  3. neighborhood information for each element obtained from the DOM tree.
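To make these three inputs concrete, here is a minimal Python sketch of how a single webpage could be represented before it enters the pipeline. The class and field names are illustrative assumptions, not identifiers from the CoVA codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class WebElement:
    """A single web element; field names are illustrative, not from the paper."""
    bbox: Tuple[float, float, float, float]  # (x, y, w, h) in screenshot pixels
    neighbor_ids: List[int]                  # indices of DOM-derived neighbors


@dataclass
class CoVAInput:
    """One webpage: the screenshot plus its N web elements."""
    screenshot: np.ndarray      # H x W x 3 RGB screenshot
    elements: List[WebElement]  # bounding boxes + neighborhood information
```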

This information is processed in four stages:

  1. the graph representation extraction for the webpage,
  2. the Representation Network (RN),
  3. the Graph Attention Network (GAT), and
  4. a fully connected (FC) layer.

The graph representation extraction computes, for every web element $i$, its set of $K$ neighboring web elements $N_i$. The RN consists of a Convolutional Neural Network (CNN) and a positional encoder and learns a visual representation $v_i$ for each web element $i \in \{1, \ldots, N\}$. The GAT combines the visual representation $v_i$ of the web element $i$ to be classified with those of its neighbors, i.e., $v_k \ \forall k \in N_i$, to compute the contextual representation $c_i$ for web element $i$. Finally, the visual and contextual representations of the web element are concatenated and passed through the FC layer to obtain the classification output.
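To show how the four stages fit together, the following PyTorch sketch wires up a simplified forward pass. The tiny CNN backbone, the feature sizes, and the K-nearest-neighbor heuristic used here to build $N_i$ are illustrative assumptions; the paper obtains the neighborhood from the DOM tree and uses its own backbone and hyperparameters.

```python
# Minimal PyTorch sketch of the CoVA forward pass. Module names, feature sizes,
# and the K-nearest-neighbor heuristic are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align


def knn_neighbors(bboxes: torch.Tensor, k: int) -> torch.Tensor:
    """Stage 1 (illustrative): pick K neighbors per element by distance between
    bounding-box centers. bboxes is (N, 4) in (x, y, w, h) format; assumes N > k."""
    centers = bboxes[:, :2] + bboxes[:, 2:] / 2           # (N, 2)
    dist = torch.cdist(centers, centers)                  # (N, N)
    dist.fill_diagonal_(float("inf"))                     # exclude self
    return dist.topk(k, largest=False).indices            # (N, K) indices of N_i


class RepresentationNetwork(nn.Module):
    """Stage 2: CNN features pooled per bounding box plus a positional encoding."""
    def __init__(self, d_visual: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(                    # tiny stand-in CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pos_encoder = nn.Linear(4, 32)               # encodes (x, y, w, h)
        self.proj = nn.Linear(64 * 7 * 7 + 32, d_visual)

    def forward(self, screenshot: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
        # screenshot: (1, 3, H, W); bboxes: (N, 4) in (x, y, w, h) pixels.
        fmap = self.backbone(screenshot)                  # (1, 64, H/4, W/4)
        xyxy = torch.cat([bboxes[:, :2], bboxes[:, :2] + bboxes[:, 2:]], dim=1)
        roi = roi_align(fmap, [xyxy], output_size=7, spatial_scale=0.25)
        roi = roi.flatten(1)                              # (N, 64*7*7)
        pos = F.relu(self.pos_encoder(bboxes))            # (N, 32)
        return F.relu(self.proj(torch.cat([roi, pos], dim=1)))   # v_i: (N, d_visual)


class GraphAttention(nn.Module):
    """Stage 3: GAT-style attention of each element over its K neighbors."""
    def __init__(self, d_visual: int = 128, d_context: int = 128):
        super().__init__()
        self.w = nn.Linear(d_visual, d_context, bias=False)
        self.attn = nn.Linear(2 * d_context, 1, bias=False)

    def forward(self, v: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # v: (N, d_visual); neighbors: (N, K) indices of N_i.
        h = self.w(v)                                     # (N, d_context)
        h_nb = h[neighbors]                               # (N, K, d_context)
        h_i = h.unsqueeze(1).expand_as(h_nb)              # (N, K, d_context)
        scores = F.leaky_relu(self.attn(torch.cat([h_i, h_nb], dim=-1)))
        alpha = scores.softmax(dim=1)                     # attention over N_i
        return (alpha * h_nb).sum(dim=1)                  # c_i: (N, d_context)


class CoVA(nn.Module):
    """Stages 1-4 composed: v_i and c_i are concatenated and classified."""
    def __init__(self, num_classes: int, k: int = 8):
        super().__init__()
        self.k = k
        self.rn = RepresentationNetwork()
        self.gat = GraphAttention()
        self.fc = nn.Linear(128 + 128, num_classes)

    def forward(self, screenshot: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
        neighbors = knn_neighbors(bboxes, self.k)         # stage 1
        v = self.rn(screenshot, bboxes)                   # stage 2
        c = self.gat(v, neighbors)                        # stage 3
        return self.fc(torch.cat([v, c], dim=1))          # stage 4: (N, num_classes)
```

In practice, the DOM-derived neighbor indices provided in the input would be passed to the model directly in place of the spatial heuristic in `knn_neighbors`.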