Knowledge Agora



Scientific Article details

Title Towards Tabular Data Extraction From Richly-Structured Documents Using Supervised and Weakly-Supervised Learning
ID_Doc 29338
Authors Chowdhury, AG; ben Ahmed, M; Atzmueller, M
Year 2022
DOI 10.1109/ETFA52439.2022.9921455
Abstract Tabular information extraction from richly structured documents is a challenging task because of rich table and document structures. Supervised approaches to document table detection include image classification and object localization methods, which typically rely on manually annotated data that is often costly to acquire, especially for domain-specific datasets. Self-supervised learning is quickly closing the gap with supervised methods in computer vision research [1]. This paper investigates the impact of a self-supervised image classifier as the primary backbone in supervised object detection for document table detection. Furthermore, we study an approach for table structure recognition based on the pix2pix Generative Adversarial Network (GAN) [2]. We propose these approaches as the basis of a machine learning pipeline for table detection and structure recognition. Our evaluation results on several publicly available datasets, as well as on a domain-specific dataset, demonstrate the efficacy of the presented approaches for tabular information extraction pipelines from richly structured documents.
Author Keywords Table Detection; Self-supervised Learning; Barlow Twins; Table Structure Recognition
Document Type Other
Open Access Open Access
Source Conference Proceedings Citation Index - Science (CPCI-S)
EID WOS:000934103900034
WoS Category Automation & Control Systems; Engineering, Industrial; Engineering, Manufacturing
Research Area Automation & Control Systems; Engineering