Predicting the Original Appearance of Damaged Historical Documents

December 16, 2024
Authors: Zhenhua Yang, Dezhi Peng, Yongxin Shi, Yuyi Zhang, Chongyu Liu, Lianwen Jin
cs.AI

Abstract

Historical documents encompass a wealth of cultural treasures but have suffered severe damage over time, including missing characters, paper damage, and ink erosion. However, existing document processing methods focus primarily on tasks such as binarization and enhancement, neglecting the repair of this damage. To this end, we present a new task, termed Historical Document Repair (HDR), which aims to predict the original appearance of damaged historical documents. To fill the gap in this field, we propose a large-scale dataset, HDR28K, and a diffusion-based network, DiffHDR, for historical document repair. Specifically, HDR28K contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations. Moreover, DiffHDR augments the vanilla diffusion framework with semantic and spatial information and a meticulously designed character perceptual loss for contextual and visual coherence. Experimental results demonstrate that DiffHDR trained on HDR28K significantly surpasses existing approaches and performs remarkably well on real damaged documents. Notably, DiffHDR can also be extended to document editing and text block generation, showcasing its high flexibility and generalization capacity. We believe this study could pioneer a new direction in document processing and contribute to the inheritance of invaluable cultures and civilizations. The dataset and code are available at https://github.com/yeungchenwa/HDR.
