Test-Time Adaptation for Visual Document Understanding

Sayna Ebrahimi,    Sercan O. Arik,    Tomas Pfister   

Google Cloud AI Research   

TMLR June 2023

[Paper]
[Bibtex]


Abstract

Self-supervised pretraining has been able to produce transferable representations for various visual document understanding (VDU) tasks. However, the ability of such representations to adapt to new distribution shifts at test time has not been studied yet. We propose DocTTA, a novel test-time adaptation approach for documents that leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo-labeling, to adapt models learned on a source domain to an unlabeled target domain at test time. We also introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering, where DocTTA improves the source model performance by up to 1.79% (F1 score), 3.43% (F1 score), and 17.68% (ANLS score), respectively, while drastically reducing calibration error on target data.
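
To make the recipe in the abstract concrete, below is a minimal, hedged sketch of a test-time adaptation loop that combines a masked visual-language modeling objective with confidence-thresholded pseudo-labeling, in PyTorch-style Python. The model interface (input_ids, image, bbox inputs; mvlm_logits and logits outputs), the masking ratio, and the confidence threshold are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def adapt_on_target(model, target_loader, mask_token_id, lr=1e-5,
                        mask_ratio=0.15, confidence=0.9, steps=1):
        """Adapt a source-trained VDU model on unlabeled target documents (sketch)."""
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(steps):
            for batch in target_loader:  # dict with 'input_ids', 'image', 'bbox' (assumed keys)
                input_ids = batch["input_ids"]

                # (1) Masked visual-language modeling on target data: mask a fraction
                # of text tokens and reconstruct them from image + layout context.
                mask = torch.rand_like(input_ids, dtype=torch.float) < mask_ratio
                masked_ids = input_ids.masked_fill(mask, mask_token_id)
                out = model(input_ids=masked_ids, image=batch["image"], bbox=batch["bbox"])
                mvlm_loss = F.cross_entropy(out.mvlm_logits[mask], input_ids[mask])

                # (2) Pseudo-labeling on the unmasked input: keep only predictions
                # above a confidence threshold as training targets.
                out = model(input_ids=input_ids, image=batch["image"], bbox=batch["bbox"])
                probs = out.logits.softmax(dim=-1)        # (batch, tokens, classes)
                conf, pseudo = probs.max(dim=-1)
                keep = conf > confidence
                pl_loss = F.cross_entropy(out.logits[keep], pseudo[keep]) if keep.any() else 0.0

                loss = mvlm_loss + pl_loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model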


Benchmark Datasets



To better highlight the impact of distribution shifts and to study methods that are robust to them, we introduce new benchmarks for VDU. Our benchmark datasets are constructed from existing popular and publicly available VDU datasets to mimic real-world challenges.

FUNSD-TTA

We consider the FUNSD dataset for this benchmark, a noisy form-understanding collection with 9,707 semantic entities and 31,485 words across 4 entity categories: question, answer, header, and other. Each category (except other) is split into a label for the beginning word and a label for intermediate words, so in total we have 7 class labels. We first combine the original training and test splits and then manually divide them into two groups. We set aside 149 densely filled forms for the source domain and 50 sparsely filled forms for the target domain. We randomly choose 10 out of the 149 source documents for validation and use the remaining 139 for training.
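
As a quick illustration of how the 7 class labels arise, the short Python sketch below enumerates a beginning and an intermediate label for each of the three non-other categories plus the other label; the B-/I- prefix naming is an assumption for illustration only.

    ENTITY_TYPES = ["question", "answer", "header"]
    LABELS = [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")] + ["other"]
    assert len(LABELS) == 7  # 3 categories x {beginning, intermediate} + other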

SROIE-TTA

We use the SROIE dataset with 9 classes in total. Similar to FUNSD, we first combine the original training and test splits. Then, we manually divide them into two groups based on visual appearance: the source domain contains 600 standard-looking receipts with a proper angle of view and clear black ink. We hold out 37 documents from this split as a validation set for tuning adaptation hyperparameters. Note that the validation split does not overlap with the target domain, which has 347 receipts with a slightly blurry look, rotated views, colored ink, and large empty margins.
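
For completeness, the following hypothetical sketch shows the split bookkeeping described above; the function name, random sampling of the validation set, and the document lists are placeholders rather than the exact procedure used to build the benchmark.

    import random

    def make_sroie_tta_splits(source_docs, target_docs, num_val=37, seed=0):
        """source_docs: 600 clean receipts; target_docs: 347 degraded receipts."""
        rng = random.Random(seed)
        val = rng.sample(source_docs, num_val)                 # held out for hyperparameter tuning
        train = [doc for doc in source_docs if doc not in val]
        assert not set(val) & set(target_docs)                 # validation never overlaps the target domain
        return train, val, target_docs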

DocVQA-TTA

We use DocVQA, a large-scale VQA dataset with nearly 20 different types of documents, including scientific reports, letters, notes, invoices, publications, and tables. The original training and validation splits contain questions from all of these document types. However, for the purpose of creating an adaptation benchmark, we select 4 domains of documents: i) Emails & Letters (E), ii) Tables & Lists (T), iii) Figures & Diagrams (F), and iv) Layout (L).
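
A hypothetical grouping of DocVQA samples into these four adaptation domains might look like the sketch below; the doc_type metadata field and the type strings are assumptions made only to illustrate the domain selection.

    DOMAINS = {
        "E": {"email", "letter"},     # Emails & Letters
        "T": {"table", "list"},       # Tables & Lists
        "F": {"figure", "diagram"},   # Figures & Diagrams
        "L": {"layout"},              # Layout
    }

    def split_by_domain(samples):
        """Group DocVQA samples into the four DocVQA-TTA adaptation domains."""
        buckets = {key: [] for key in DOMAINS}
        for sample in samples:  # each sample: dict with an assumed 'doc_type' field
            for key, types in DOMAINS.items():
                if sample["doc_type"] in types:
                    buckets[key].append(sample)
        return buckets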


Paper

S. Ebrahimi, S. Arik, T. Pfister
Test-Time Adaptation for Visual Document Understanding
TMLR, 2023.

[Paper] | [Bibtex]




Template cloned from here!