A scientific article usually follows a well-defined structure that guides both readers and journal editors. This structure also makes it possible to assess text reuse separately for each section of the article. Considering the wide use of
plagiarism detectors in scientific practice, the task of automatic structure extraction from scientific articles becomes relevant in
the plagiarism detection process. Most of the published articles and theses consist of the following sections: title, contents,
introduction, methods, results and discussion, conclusions, bibliography, and appendices. In this paper we present a method
to extract the structure of scientific documents. Our solution processes formatted documents (PDF, DOC, DOCX), extracts the text layer and the layout from them, and outputs the borders of the aforementioned sections within the text layer. To identify section borders, we use histogram-based gradient boosting trees. Some of the detected sections, namely introduction, methods, results and discussion, make up the well-known IMRAD organizational structure of documents. Our solution is multilingual and can be scaled to support more languages using an unsupervised approach.
We also present a new custom dataset that consists of 73 documents with labeled sections in 30 languages. The solution achieves 0.87 average precision and 0.75 average recall per section on this dataset. The developed approach is used in a production environment to determine the structure of articles. It processes more than 55 pages per second on a single CPU and is helpful in tasks such as table extraction, annotation extraction, and machine-generated text detection.
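
As an illustration of how precision and recall averaged per section could be computed, the following is a minimal sketch. It assumes that every text line carries exactly one section label and that the per-section scores are averaged with equal weight (macro-averaging); the abstract does not specify the exact evaluation protocol. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def per_section_scores(gold, predicted):
    """Return {section: (precision, recall)} for two equal-length label lists.

    Assumption: one section label per text line (not stated in the abstract).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_positive = Counter()
    predicted_count = Counter()
    gold_count = Counter()
    for g, p in zip(gold, predicted):
        gold_count[g] += 1
        predicted_count[p] += 1
        if g == p:
            true_positive[g] += 1

    scores = {}
    for section in sorted(set(gold) | set(predicted)):
        # A section that was never predicted (or never present) scores 0.0.
        precision = (true_positive[section] / predicted_count[section]
                     if predicted_count[section] else 0.0)
        recall = (true_positive[section] / gold_count[section]
                  if gold_count[section] else 0.0)
        scores[section] = (precision, recall)
    return scores


def macro_average(scores):
    """Average precision and recall over sections, each section weighted equally."""
    n = len(scores)
    avg_precision = sum(p for p, _ in scores.values()) / n
    avg_recall = sum(r for _, r in scores.values()) / n
    return avg_precision, avg_recall


# Toy example with hypothetical labels: one "introduction" line is
# mislabeled as "methods".
gold = ["title", "introduction", "introduction", "methods", "methods", "results"]
predicted = ["title", "introduction", "methods", "methods", "methods", "results"]

scores = per_section_scores(gold, predicted)
avg_precision, avg_recall = macro_average(scores)
```

With equal weighting, a short section such as the title counts as much as a long one such as the methods, so errors in small sections are not hidden by the large ones.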