Test-Time Adaptation for Visual Document Understanding

Sayna Ebrahimi,    Sercan O. Arik,    Tomas Pfister   

Google Cloud AI Research   

TMLR June 2023

[Paper]
[Bibtex]


Abstract

Self-supervised pretraining has been able to produce transferable representations for various visual document understanding (VDU) tasks. However, the ability of such representations to adapt to new distribution shifts at test time has not been studied yet. We propose DocTTA, a novel test-time adaptation approach for documents that leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo-labeling, to adapt models learned on a source domain to an unlabeled target domain at test time. We also introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering, where DocTTA improves the source model performance by up to 1.79% (F1 score), 3.43% (F1 score), and 17.68% (ANLS score), respectively, while drastically reducing calibration error on target data.
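
To make the recipe in the abstract concrete, below is a minimal, hedged sketch of a test-time adaptation loop that combines a masked visual-language modeling objective with confidence-thresholded pseudo-labeling, in PyTorch-style Python. The model interface (input_ids, image, bbox inputs; mvlm_logits and logits outputs), the masking ratio, and the confidence threshold are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def adapt_on_target(model, target_loader, mask_token_id, lr=1e-5,
                        mask_ratio=0.15, confidence=0.9, steps=1):
        """Adapt a source-trained VDU model on unlabeled target documents (sketch)."""
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(steps):
            for batch in target_loader:  # dict with 'input_ids', 'image', 'bbox' (assumed keys)
                input_ids = batch["input_ids"]

                # (1) Masked visual-language modeling on target data: mask a fraction
                # of text tokens and reconstruct them from image + layout context.
                mask = torch.rand_like(input_ids, dtype=torch.float) < mask_ratio
                masked_ids = input_ids.masked_fill(mask, mask_token_id)
                out = model(input_ids=masked_ids, image=batch["image"], bbox=batch["bbox"])
                mvlm_loss = F.cross_entropy(out.mvlm_logits[mask], input_ids[mask])

                # (2) Pseudo-labeling on the unmasked input: keep only predictions
                # above a confidence threshold as training targets.
                out = model(input_ids=input_ids, image=batch["image"], bbox=batch["bbox"])
                probs = out.logits.softmax(dim=-1)        # (batch, tokens, classes)
                conf, pseudo = probs.max(dim=-1)
                keep = conf > confidence
                pl_loss = F.cross_entropy(out.logits[keep], pseudo[keep]) if keep.any() else 0.0

                loss = mvlm_loss + pl_loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model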


Benchmark Datasets



To better highlight the impact of distribution shifts and to study methods that are robust to them, we introduce new benchmarks for VDU. Our benchmark datasets are constructed from existing popular and publicly available VDU datasets to mimic real-world challenges.

FUNSD-TTA

We consider the FUNSD dataset for this benchmark, a noisy form-understanding collection with 9,707 semantic entities and 31,485 words across 4 entity categories: question, answer, header, and other. Each category (except other) is split into a label for the beginning word and a label for intermediate words, so in total we have 7 class labels. We first combine the original training and test splits and then manually divide them into two groups. We set aside 149 densely filled forms for the source domain and 50 sparsely filled forms for the target domain. We randomly choose 10 out of the 149 source documents for validation and use the remaining 139 for training.
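
As a quick illustration of how the 7 class labels arise, the short Python sketch below enumerates a beginning and an intermediate label for each of the three non-other categories plus the other label; the B-/I- prefix naming is an assumption for illustration only.

    ENTITY_TYPES = ["question", "answer", "header"]
    LABELS = [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")] + ["other"]
    assert len(LABELS) == 7  # 3 categories x {beginning, intermediate} + other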

SROIE-TTA

We use the SROIE dataset with 9 classes in total. Similar to FUNSD, we first combine the original training and test splits. Then, we manually divide them into two groups based on visual appearance: the source domain contains 600 standard-looking receipts with a proper angle of view and clear black ink. We hold out 37 documents from this split as a validation set for tuning adaptation hyperparameters. Note that the validation split does not overlap with the target domain, which has 347 receipts with a slightly blurry look, rotated views, colored ink, and large empty margins.
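
For completeness, the following hypothetical sketch shows the split bookkeeping described above; the function name, random sampling of the validation set, and the document lists are placeholders rather than the exact procedure used to build the benchmark.

    import random

    def make_sroie_tta_splits(source_docs, target_docs, num_val=37, seed=0):
        """source_docs: 600 clean receipts; target_docs: 347 degraded receipts."""
        rng = random.Random(seed)
        val = rng.sample(source_docs, num_val)                 # held out for hyperparameter tuning
        train = [doc for doc in source_docs if doc not in val]
        assert not set(val) & set(target_docs)                 # validation never overlaps the target domain
        return train, val, target_docs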

DocVQA-TTA

We use DocVQA, a large-scale VQA dataset with nearly 20 different types of documents, including scientific reports, letters, notes, invoices, publications, and tables. The original training and validation splits contain questions from all of these document types. However, for the purpose of creating an adaptation benchmark, we select 4 domains of documents: i) Emails & Letters (E), ii) Tables & Lists (T), iii) Figures & Diagrams (F), and iv) Layout (L).
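
A hypothetical grouping of DocVQA samples into these four adaptation domains might look like the sketch below; the doc_type metadata field and the type strings are assumptions made only to illustrate the domain selection.

    DOMAINS = {
        "E": {"email", "letter"},     # Emails & Letters
        "T": {"table", "list"},       # Tables & Lists
        "F": {"figure", "diagram"},   # Figures & Diagrams
        "L": {"layout"},              # Layout
    }

    def split_by_domain(samples):
        """Group DocVQA samples into the four DocVQA-TTA adaptation domains."""
        buckets = {key: [] for key in DOMAINS}
        for sample in samples:  # each sample: dict with an assumed 'doc_type' field
            for key, types in DOMAINS.items():
                if sample["doc_type"] in types:
                    buckets[key].append(sample)
        return buckets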


Paper

S. Ebrahimi, S. Arik, T. Pfister
Test-Time Adaptation for Visual Document Understanding
TMLR, 2023.

[Paper] | [Bibtex]




Template cloned from here!