Data Preparation from Visually Rich Documents

Sarkhel, Ritesh

RS_Dissertation.pdf (14.19 MB)

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=osu1667174689895193

Year and Degree

2022, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.

Abstract

Modern information sources are heterogeneous in nature. They utilize a number of modalities to disseminate information effectively. Visually rich documents typify such an information source. A visually rich document refers to a physical or digital document that uses visual cues along with linguistic features to augment or highlight its semantics. Traditional data preparation solutions are inefficient at harvesting knowledge from these sources as they do not take their multimodality into account. They are also cumbersome in terms of the amount of human effort required in their end-to-end workflow. We describe algorithmic solutions for two fundamental data preparation tasks, namely information extraction and data integration, for visually rich documents. For both tasks, the core element of our solution is a fundamental machine-learning problem – how to represent heterogeneous documents with diverse layouts and/or formats in a unified way? We develop efficient solutions for both tasks on the bedrock of this representation learning problem.

In the first part of this dissertation, we describe Artemis – a machine-learning model to extract structured records from visually rich documents. It identifies named entities by representing each visual span as a multimodal feature vector and subsequently classifying it as one of the target fields to be extracted. It is a generalized information extraction method, i.e., it does not utilize any prior knowledge about the layout or format of the document in its end-to-end workflow. We describe two utility functions that aid this machine-learning model – VS2, a visual segmentation algorithm that encodes the local context, and LadderNet, a convolutional network that encodes document-specific discriminative features in a visual span representation. We establish the efficacy of our machine-learning model on a number of different datasets. We investigate the robustness of our extraction model on an extreme case of our usability spectrum. In this use case, we investigate the viability of information extraction within a bounded latency constraint. We describe MLS – a preprocessing step that facilitates this by summarizing the document within a given length budget.

In the second part of this dissertation, we consider the fact that the information contained in a visually rich document may be incomplete. Some of these documents (e.g., a leaflet or banner) may even appear in isolation. Additional information may be required to gather actionable insights from them. One way to address this issue is to retrieve relevant supplementary information from a back-end data store. We develop Polyglot – a generalized method to map a visually rich document to tuples in a relational table if they represent the same real-world object. We formulate this task as K-nearest-neighbor search on an embedding space that learns a common representation for each relational tuple in the table and each visual span in the document. Experiments on multiple datasets establish the efficacy of our method. We conclude by proposing future work that investigates the robustness of this solution on a number of extreme use cases of the usability spectrum.
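
As a rough illustration of the K-nearest-neighbor formulation mentioned in the abstract, the sketch below ranks relational tuples by their similarity to a visual span in a shared embedding space. It is not the Polyglot model: the toy vectors stand in for learned embeddings, the function names `cosine` and `k_nearest_tuples` are hypothetical, and cosine similarity is an assumed metric that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k_nearest_tuples(span_vec, tuple_vecs, k=2):
    """Return the keys of the k tuples whose embeddings are most similar to span_vec."""
    scored = [(cosine(span_vec, vec), key) for key, vec in tuple_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [key for _, key in scored[:k]]


# Toy embeddings (assumed values): one vector per relational tuple.
tuple_vecs = {
    "row_1": [0.9, 0.1, 0.0],
    "row_2": [0.1, 0.9, 0.1],
    "row_3": [0.0, 0.2, 0.9],
}

# Toy embedding of a visual span taken from a document.
span_vec = [0.8, 0.2, 0.1]

nearest = k_nearest_tuples(span_vec, tuple_vecs, k=2)
print(nearest)
```

With these toy vectors the span lies closest to `row_1`, so that tuple is ranked first. At realistic table sizes, an exhaustive scan like this one would typically be replaced by an approximate nearest-neighbor index.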

Committee

Arnab Nandi (Advisor)
Srinivasan Parthasarathy (Committee Member)
Eric Fosler-Lussier (Committee Member)
Jay Gupta (Committee Member)

Pages

157 p.

Subject Headings

Computer Science; Information Science

Keywords

visually rich document; VRD; information extraction; IE; classification; entity matching; cross-modal entity matching; multimodality; cross-modality; data preparation; cognitive data preparation; cognitive databases; machine learning

Recommended Citations

APA Style (7th edition)

Sarkhel, R. (2022). Data Preparation from Visually Rich Documents [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1667174689895193

MLA Style (8th edition)

Sarkhel, Ritesh. Data Preparation from Visually Rich Documents. 2022. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1667174689895193.

Chicago Manual of Style (17th edition)

Sarkhel, Ritesh. "Data Preparation from Visually Rich Documents." Doctoral dissertation, Ohio State University, 2022. http://rave.ohiolink.edu/etdc/view?acc_num=osu1667174689895193

Document number:

osu1667174689895193

Download Count:

323

Copyright Info

Data Preparation from Visually Rich Documents by Ritesh Sarkhel is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. Based on a work at etd.ohiolink.edu.
This open access ETD is published by The Ohio State University and OhioLINK.
