Newspaper Segmentation API documentation

Arcanum, the company behind Arcanum Newspapers, introduces its Newspaper Segmentation application programming interface (API). Companies and institutions can make their newspapers even more accessible with Arcanum’s Newspaper Segmentation API, which automatically segments pages into articles and then divides each article into sections such as captions, body text, titles, and advertisements.

Having digitized newspapers for many years and published over 50 million pages online, Arcanum understands the challenges of newspaper digitization. Arcanum’s Newspaper Segmentation API is designed to overcome the limitations of existing OCR tools. The API offers the following solutions:

Optical Character Recognition

Most OCR tools perform poorly on newspaper pages with many columns. Our solution uses advanced OCR technology that performs very well on newspaper pages with up to 10 columns.

Reading order correction

Establishing the correct reading order on a newspaper page is extremely hard. Captions, titles, or quotations can be embedded in the text, and they need to be separated from it to restore the correct text flow. Our API also performs well on very complex layouts.

Article segmentation

A newspaper page can contain many articles on different subjects. Separating those articles is indispensable for creating an efficient search engine. Article segmentation:

  • Removes false hits that occur when the words of a query are found in separate articles on the same page.
  • Improves the precision of relevance ranking.
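
The false-hit problem can be illustrated with a minimal Python sketch. It is not Arcanum's implementation: the `Article` class, the helper functions, and the sample texts are hypothetical and only show why matching a query per article is stricter than matching it per page.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical container for one segmented article (not the API's real schema)."""
    title: str
    text: str


def words(text: str) -> set[str]:
    """Lowercase a text and split it into a set of whitespace-separated words."""
    return set(text.lower().split())


def page_matches(articles: list[Article], query: list[str]) -> bool:
    """Page-level search: the query words may come from different articles."""
    page_words: set[str] = set()
    for article in articles:
        page_words |= words(article.title + " " + article.text)
    return all(term.lower() in page_words for term in query)


def matching_articles(articles: list[Article], query: list[str]) -> list[Article]:
    """Article-level search: every query word must occur in the same article."""
    return [
        a for a in articles
        if all(term.lower() in words(a.title + " " + a.text) for term in query)
    ]


# Invented sample page with two unrelated articles.
page = [
    Article("Harvest report", "wheat prices rose sharply this autumn"),
    Article("Football results", "the local team won the cup final"),
]

print(page_matches(page, ["wheat", "cup"]))       # True: a false hit at page level
print(matching_articles(page, ["wheat", "cup"]))  # []: no single article has both words
```

With page-level matching, the query "wheat cup" hits the page even though the two words belong to unrelated articles. With article-level matching, the same query returns nothing.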

Logical section detection

Unlike other OCR tools, our API can distinguish between the logical sections of an article, such as title, caption, byline, and advertisement. By segmenting the article’s text into logical sections, you can extend the feature set of your service with one or more of the following features:

  • Finding images by searching in captions.
  • Searching for authors.
  • Searching in titles only.
  • Excluding advertisements or artifacts.
  • Giving higher relevance to selected sections (e.g., title, subtitle, lead).
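
Section-aware relevance and advertisement exclusion could be sketched as follows. This is an illustration under assumptions: the section labels, the weights, and the `score` function are hypothetical and are not taken from the API.

```python
# Hypothetical section labels and weights; the real API's label names may differ.
SECTION_WEIGHTS = {"title": 3.0, "subtitle": 2.0, "lead": 2.0, "body": 1.0, "caption": 1.0}
EXCLUDED_SECTIONS = {"advertisement"}


def score(sections: list[tuple[str, str]], term: str) -> float:
    """Sum the weighted occurrences of `term` over (label, text) sections."""
    term = term.lower()
    total = 0.0
    for label, text in sections:
        if label in EXCLUDED_SECTIONS:
            continue  # advertisements do not contribute to relevance
        hits = text.lower().split().count(term)
        total += SECTION_WEIGHTS.get(label, 1.0) * hits
    return total


# Invented sample article, given as (section label, section text) pairs.
article = [
    ("title", "Bridge opens after repairs"),
    ("body", "The bridge was closed for two years while the bridge deck was rebuilt"),
    ("advertisement", "Visit the Bridge Cafe"),
]

print(score(article, "bridge"))  # 5.0: one title hit (3.0) plus two body hits (2.0)
```

A hit in the title counts three times as much as a hit in the body, and the advertisement is ignored entirely. A production search engine would normally express the same idea through per-field boosts in its own query language.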

We have designed the API to be simple and flexible: you provide the scanned image of a page, and the system returns the structured content of that page. Our team is always ready to work with exciting and ambitious clients. If you are ready to start your creative partnership with us, get in touch.

Arcanum is an online publisher that creates massive structured databases of digitized cultural contents.
