
What is: Context-aware Visual Attention-based (CoVA) webpage object detection pipeline?

Source: CoVA: Context-aware Visual Attention for Webpage Information Extraction
Year: 2000
Data Source: CC BY-SA - https://paperswithcode.com

The Context-Aware Visual Attention-based end-to-end pipeline for Webpage Object Detection (CoVA) aims to learn a function $f$ that predicts labels $y = [y_1, y_2, \ldots, y_N]$ for a webpage containing $N$ web elements. The input to CoVA consists of (a minimal representation is sketched in code after this list):

  1. a screenshot of a webpage,
  2. a list of bounding boxes [x, y, w, h] of the web elements, and
  3. neighborhood information for each element obtained from the DOM tree.
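To make these three inputs concrete, here is a minimal Python sketch of how a single webpage could be represented before it enters the pipeline. The class and field names are illustrative assumptions, not identifiers from the CoVA codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class WebElement:
    """A single web element; field names are illustrative, not from the paper."""
    bbox: Tuple[float, float, float, float]  # (x, y, w, h) in screenshot pixels
    neighbor_ids: List[int]                  # indices of DOM-derived neighbors


@dataclass
class CoVAInput:
    """One webpage: the screenshot plus its N web elements."""
    screenshot: np.ndarray      # H x W x 3 RGB screenshot
    elements: List[WebElement]  # bounding boxes + neighborhood information
```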

This information is processed in four stages:

  1. the graph representation extraction for the webpage,
  2. the Representation Network (RN),
  3. the Graph Attention Network (GAT), and
  4. a fully connected (FC) layer.

The graph representation extraction computes, for every web element $i$, its set of $K$ neighboring web elements $N_i$. The RN consists of a Convolutional Neural Network (CNN) and a positional encoder and learns a visual representation $v_i$ for each web element $i \in \{1, \ldots, N\}$. The GAT combines the visual representation $v_i$ of the web element $i$ to be classified with those of its neighbors, i.e., $v_k \ \forall k \in N_i$, to compute the contextual representation $c_i$ for web element $i$. Finally, the visual and contextual representations of the web element are concatenated and passed through the FC layer to obtain the classification output.
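To show how the four stages fit together, the following PyTorch sketch wires up a simplified forward pass. The tiny CNN backbone, the feature sizes, and the K-nearest-neighbor heuristic used here to build $N_i$ are illustrative assumptions; the paper obtains the neighborhood from the DOM tree and uses its own backbone and hyperparameters.

```python
# Minimal PyTorch sketch of the CoVA forward pass. Module names, feature sizes,
# and the K-nearest-neighbor heuristic are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align


def knn_neighbors(bboxes: torch.Tensor, k: int) -> torch.Tensor:
    """Stage 1 (illustrative): pick K neighbors per element by distance between
    bounding-box centers. bboxes is (N, 4) in (x, y, w, h) format; assumes N > k."""
    centers = bboxes[:, :2] + bboxes[:, 2:] / 2           # (N, 2)
    dist = torch.cdist(centers, centers)                  # (N, N)
    dist.fill_diagonal_(float("inf"))                     # exclude self
    return dist.topk(k, largest=False).indices            # (N, K) indices of N_i


class RepresentationNetwork(nn.Module):
    """Stage 2: CNN features pooled per bounding box plus a positional encoding."""
    def __init__(self, d_visual: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(                    # tiny stand-in CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pos_encoder = nn.Linear(4, 32)               # encodes (x, y, w, h)
        self.proj = nn.Linear(64 * 7 * 7 + 32, d_visual)

    def forward(self, screenshot: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
        # screenshot: (1, 3, H, W); bboxes: (N, 4) in (x, y, w, h) pixels.
        fmap = self.backbone(screenshot)                  # (1, 64, H/4, W/4)
        xyxy = torch.cat([bboxes[:, :2], bboxes[:, :2] + bboxes[:, 2:]], dim=1)
        roi = roi_align(fmap, [xyxy], output_size=7, spatial_scale=0.25)
        roi = roi.flatten(1)                              # (N, 64*7*7)
        pos = F.relu(self.pos_encoder(bboxes))            # (N, 32)
        return F.relu(self.proj(torch.cat([roi, pos], dim=1)))   # v_i: (N, d_visual)


class GraphAttention(nn.Module):
    """Stage 3: GAT-style attention of each element over its K neighbors."""
    def __init__(self, d_visual: int = 128, d_context: int = 128):
        super().__init__()
        self.w = nn.Linear(d_visual, d_context, bias=False)
        self.attn = nn.Linear(2 * d_context, 1, bias=False)

    def forward(self, v: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # v: (N, d_visual); neighbors: (N, K) indices of N_i.
        h = self.w(v)                                     # (N, d_context)
        h_nb = h[neighbors]                               # (N, K, d_context)
        h_i = h.unsqueeze(1).expand_as(h_nb)              # (N, K, d_context)
        scores = F.leaky_relu(self.attn(torch.cat([h_i, h_nb], dim=-1)))
        alpha = scores.softmax(dim=1)                     # attention over N_i
        return (alpha * h_nb).sum(dim=1)                  # c_i: (N, d_context)


class CoVA(nn.Module):
    """Stages 1-4 composed: v_i and c_i are concatenated and classified."""
    def __init__(self, num_classes: int, k: int = 8):
        super().__init__()
        self.k = k
        self.rn = RepresentationNetwork()
        self.gat = GraphAttention()
        self.fc = nn.Linear(128 + 128, num_classes)

    def forward(self, screenshot: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
        neighbors = knn_neighbors(bboxes, self.k)         # stage 1
        v = self.rn(screenshot, bboxes)                   # stage 2
        c = self.gat(v, neighbors)                        # stage 3
        return self.fc(torch.cat([v, c], dim=1))          # stage 4: (N, num_classes)
```

In practice, the DOM-derived neighbor indices provided in the input would be passed to the model directly in place of the spatial heuristic in `knn_neighbors`.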