82398 | ICS RAS

Author(s):

Publication details

Publication type:

Conference paper

Title:

Structure Extractor: Multilingual Extraction of Sections from Scientific Document

Electronic publication:

Yes

ISBN/ISSN:

2305-7254

DOI:

10.23919/fruct65909.2025.11008070

Conference name:

2025 37th Conference of Open Innovations Association (FRUCT)

Source title:

Proceedings of the 37th Conference of Open Innovations Association (FRUCT 2025)

Volume designation and number:

Vol. 37

City:

Helsinki

Publisher:

FRUCT Oy

Year of publication:

2025

Pages:

122-128

Abstract

A scientific article usually has a well-defined structure. The structure guides both readers and journal editors. It also allows differentiated assessment of text reuse occurring in different sections of the article. Considering the wide use of plagiarism detectors in scientific practice, the task of automatic structure extraction from scientific articles becomes relevant in the plagiarism detection process. Most published articles and theses consist of the following sections: title, contents, introduction, methods, results and discussion, conclusions, bibliography, and appendices. In this paper we present a method to extract the structure of scientific documents. Our solution processes formatted documents (PDF, DOC, DOCX), extracts the text layer and the layout from them, and outputs the borders of the aforementioned sections within the text layer. To identify section borders we use histogram-based gradient boosting trees. Some of the detected sections, namely introduction, methods, results and discussion, make up the well-known IMRAD organizational structure of documents. Our solution is multilingual and can be scaled to support more languages by an unsupervised approach. We also present a new custom dataset that consists of 73 documents with labeled sections in 30 languages. The solution achieves 0.87 average precision and 0.75 average recall per section on the dataset. The developed approach is used to determine the structure of articles in a production environment. It processes more than 55 pages per second on 1 CPU and is very helpful in tasks such as table extraction, annotation extraction, and machine-generated text detection.
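As an illustration of what "average precision and recall per section" can mean, the sketch below computes precision and recall for each section label and then takes their unweighted mean. It is only a minimal sketch under stated assumptions: the abstract does not say at what granularity the scores are measured, so line-level labels are assumed here, and the function names `per_section_scores` and `macro_average` are hypothetical, not taken from the paper.

```python
from collections import Counter


def per_section_scores(true_labels, pred_labels):
    """Return {section: (precision, recall)} for two aligned label lists.

    Assumption: one label per text line; the paper may score differently.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")

    true_pos, pred_count, true_count = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, pred_labels):
        true_count[truth] += 1
        pred_count[pred] += 1
        if truth == pred:
            true_pos[truth] += 1

    scores = {}
    for section in sorted(set(true_count) | set(pred_count)):
        # A section that was never predicted (or never present) scores 0.0.
        precision = true_pos[section] / pred_count[section] if pred_count[section] else 0.0
        recall = true_pos[section] / true_count[section] if true_count[section] else 0.0
        scores[section] = (precision, recall)
    return scores


def macro_average(scores):
    """Unweighted mean of precision and recall over all sections."""
    n = len(scores)
    avg_precision = sum(p for p, _ in scores.values()) / n
    avg_recall = sum(r for _, r in scores.values()) / n
    return avg_precision, avg_recall
```

Averaging per section rather than per line keeps short sections such as conclusions from being outweighed by long ones such as methods.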

Bibliographic reference:

Kopanichuk I.V., Chashchin A.V., Ochneva I.M., Grabovoy A.V., Ogaltsov A.V., Kildyakov A.S., Chekhovich Yu.V. Structure Extractor: Multilingual Extraction of Sections from Scientific Document / Proceedings of the 37th Conference of Open Innovations Association (FRUCT 2025). Helsinki: FRUCT Oy, 2025. Vol. 37. P. 122-128.