Data Preparation from Visually Rich Documents

2022, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Modern information sources are heterogeneous in nature. They utilize a number of modalities to disseminate information effectively. Visually rich documents typify such an information source. A visually rich document is a physical or digital document that uses visual cues along with linguistic features to augment or highlight its semantics. Traditional data preparation solutions are inefficient at harvesting knowledge from these sources because they do not take their multimodality into account. They are also cumbersome in terms of the amount of human effort required in their end-to-end workflow. We describe algorithmic solutions for two fundamental data preparation tasks, namely information extraction and data integration, for visually rich documents. For both tasks, the core element of our solution is a fundamental machine-learning problem: how to represent heterogeneous documents with diverse layouts and/or formats in a unified way? We develop efficient solutions for both tasks on the bedrock of this representation learning problem.
In the first part of this dissertation, we describe Artemis, a machine-learning model that extracts structured records from visually rich documents. It identifies named entities by representing each visual span as a multimodal feature vector and subsequently classifying it as one of the target fields to be extracted. It is a generalized information extraction method, i.e., it does not utilize any prior knowledge about the layout or format of the document in its end-to-end workflow. We describe two utility functions that aid this machine-learning model: VS2, a visual segmentation algorithm that encodes the local context, and LadderNet, a convolutional network that encodes document-specific discriminative features in a visual span representation. We establish the efficacy of our machine-learning model on a number of different datasets. We also investigate the robustness of our extraction model on an extreme case of our usability spectrum. In this use case, we investigate the viability of information extraction within a bounded latency constraint. We describe MLS, a preprocessing step that facilitates this by summarizing the document within a given length budget.
In the second part of this dissertation, we consider the fact that the information contained in a visually rich document may be incomplete. Some of these documents (e.g., leaflets and banners) may even appear in isolation. Additional information may be required to gather actionable insights from them. One way to address this issue is to retrieve relevant supplementary information from a back-end data store. We develop Polyglot, a generalized method to map a visually rich document to tuples in a relational table if they represent the same real-world object. We formulate this task as a K-nearest-neighbor search on an embedding space that learns a common representation for each relational tuple in the table and each visual span in the document. Experiments on multiple datasets establish the efficacy of our method. We conclude by proposing future work that investigates the robustness of this solution on a number of extreme use cases of the usability spectrum.
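As an illustration of the K-nearest-neighbor formulation mentioned in the abstract, the following is a minimal sketch and not the dissertation's implementation. It assumes that visual spans and relational tuples have already been mapped to vectors in a shared embedding space; the vectors here are hand-written placeholders, cosine similarity is an assumed choice of similarity measure, and the names `cosine_similarity`, `k_nearest_tuples`, `tuple_embeddings`, and `span_embedding` are hypothetical.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_nearest_tuples(span_vec, tuple_vecs, k=3):
    """Return the keys of the k tuples whose embeddings are most similar to span_vec."""
    scored = (
        (cosine_similarity(span_vec, vec), key)
        for key, vec in tuple_vecs.items()
    )
    # nlargest returns the k highest-scoring (score, key) pairs, best first.
    return [key for _, key in heapq.nlargest(k, scored)]


# Placeholder embeddings (assumption): in practice a learned model would produce these.
tuple_embeddings = {
    "row_1": [0.9, 0.1, 0.0],
    "row_2": [0.0, 1.0, 0.0],
    "row_3": [0.7, 0.3, 0.1],
}
span_embedding = [1.0, 0.0, 0.0]

nearest = k_nearest_tuples(span_embedding, tuple_embeddings, k=2)
print(nearest)
```

A brute-force scan like this is only practical for small tables; a larger table would likely call for an approximate nearest-neighbor index, which the abstract does not specify.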
Arnab Nandi (Advisor)
Srinivasan Parthasarathy (Committee Member)
Eric Fosler-Lussier (Committee Member)
Jay Gupta (Committee Member)
157 p.

Recommended Citations

  • Sarkhel, R. (2022). Data Preparation from Visually Rich Documents [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1667174689895193

    APA Style (7th edition)

  • Sarkhel, Ritesh. Data Preparation from Visually Rich Documents. 2022. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1667174689895193.

    MLA Style (8th edition)

  • Sarkhel, Ritesh. "Data Preparation from Visually Rich Documents." Doctoral dissertation, Ohio State University, 2022. http://rave.ohiolink.edu/etdc/view?acc_num=osu1667174689895193

    Chicago Manual of Style (17th edition)