82342 | ICS RAS (ИПУ РАН)

Publication details

Publication type:

Article in a journal/collection

Title:

Near-duplicate handwritten document detection without text recognition

ISBN/ISSN:

2221-7932

DOI:

10.28995/2075-7182-2021-20-47-57

Source title:

Computational linguistics and intellectual technologies (Papers from the Annual International Conference “Dialogue” 2021)

Volume designation and number:

Issue 20

City:

Moscow

Publisher:

Russian State University for the Humanities

Year of publication:

2021

Pages:

47-57

Abstract

The paper presents a novel method for near-duplicate detection in collections of handwritten school essays. The large number of online resources offering ready-made academic essays makes it possible to cheat by reusing them during high school final exams. Despite the importance of the problem, there is currently no automatic method for near-duplicate detection in handwritten documents such as school essays. A school essay is represented as a sequence of scanned images of handwritten text. Despite advances in the recognition of handwritten and printed text, applying these methods to the current task remains a challenge. The proposed near-duplicate detection method does not require detailed text markup, which makes it applicable to a large number of information extraction tasks in a zero-shot setting, i.e. without any specific resources written in the processed language. The method is based on series analysis. The image is segmented into words, and the text is characterized by a sequence of features that are invariant to the author’s writing style: the normalized lengths of the segmented words. These features can be used for both handwritten and machine-readable texts. The computational experiment is conducted on the IAM dataset of English handwritten texts and on a dataset of real images of handwritten school essays.
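
The idea of comparing texts through normalized word lengths can be illustrated with a minimal sketch. The abstract does not say how the lengths are normalized or which series-comparison technique the authors use, so everything below is an assumption made for illustration: the function names `normalized_lengths` and `dtw_distance` are hypothetical, normalization divides by the mean word width, the comparison uses plain dynamic time warping, and the word widths are invented pixel values standing in for the output of a word segmentation step.

```python
from typing import List, Sequence


def normalized_lengths(word_widths: Sequence[float]) -> List[float]:
    """Divide each word width by the mean width (assumed normalization),
    so that the overall size of the handwriting cancels out."""
    if not word_widths:
        return []
    mean_width = sum(word_widths) / len(word_widths)
    return [w / mean_width for w in word_widths]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance with absolute-difference cost
    (an assumed stand-in for the paper's series analysis)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] matched again to an earlier b element
                cost[i][j - 1],      # b[j-1] matched again to an earlier a element
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Hypothetical word widths in pixels, as a word segmenter might return them.
essay = [120, 45, 200, 80, 150, 60]
copy_larger_hand = [w * 1.5 for w in essay]  # same words, larger handwriting
other = [60, 180, 50, 210, 40, 190]          # an unrelated text

d_copy = dtw_distance(normalized_lengths(essay), normalized_lengths(copy_larger_hand))
d_other = dtw_distance(normalized_lengths(essay), normalized_lengths(other))
print(d_copy < d_other)
```

In this toy setting the copy written in a larger hand yields a distance close to zero, while the unrelated text yields a clearly larger one. A real system would also need to handle segmentation errors and partial reuse, which this sketch does not attempt.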

Bibliographic reference:

Bakhteev O.Yu., Kuznetsova M.V., Khazov A.V., Ogaltsov A.V., Safin K.F., Gorlenko T.A., Suvorova M.A., Ivakhnenko A.A., Botov P.V., Chekhovich Yu.V., Mottl V.V. Near-duplicate handwritten document detection without text recognition // Computational linguistics and intellectual technologies (Papers from the Annual International Conference “Dialogue” 2021). 2021. Issue 20. P. 47-57.