Convert scanned PDFs to searchable PDF using cURL

PDF is the de facto file format for presenting documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. Aspose.PDF Cloud provides a number of operations that work with your existing PDF documents, allowing you to convert to and from PDF, extract document information, and manipulate PDF documents on the cloud storage of your choice.

There are two main types of PDF documents: those created electronically with PDF creation software, and those created from a scanner or other imaging equipment. PDF creation software builds a document with an internal structure that records characters, fonts, and positions, although the raw data makes little sense to the human eye. A scanned PDF is essentially a flat image of a document, so scanning a page of text produces only a picture of the words. To get text out of a scanned PDF, OCR (optical character recognition) is required so that each character can be recognized and then stored as text.

Aspose.PDF Cloud provides a built-in OCR engine that recognizes and extracts text tokens from PDF documents. With Aspose.PDF Cloud you can embed an OCR layer in a PDF document, which makes your scanned PDFs searchable and indexable.

Aspose.PDF Cloud OCR support

Aspose.PDF Cloud provides the following API for OCR support in PDF documents:

Type   Resource URL       Description
PUT    /pdf/{name}/ocr    Generate an OCR layer for images in the input PDF document

The above resource accepts the following arguments:

Parameter Name   Description
name             The PDF document to add the OCR layer to
lang             Language for the OCR engine

The lang parameter accepts the following language codes: eng (English), ara (Arabic), bel (Belarusian), ben (Bengali), bul (Bulgarian), ces (Czech), dan (Danish), deu (German), ell (Greek), fin (Finnish), fra (French), heb (Hebrew), hin (Hindi), ind (Indonesian), isl (Icelandic), ita (Italian), jpn (Japanese), kor (Korean), nld (Dutch), nor (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus (Russian), spa (Spanish), swe (Swedish), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese), chi_sim (Chinese Simplified), and chi_tra (Chinese Traditional). Codes can also be combined with commas, e.g. eng,rus.
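
Since the lang value travels as part of the request, it can be worth checking it locally before calling the API. The following is a minimal sketch and not part of Aspose.PDF Cloud: validate_lang is a hypothetical helper that accepts a comma-separated value only if every code appears in the list above.

```shell
# Language codes listed in this article for the lang parameter.
SUPPORTED_LANGS="eng ara bel ben bul ces dan deu ell fin fra heb hin ind isl ita jpn kor nld nor pol por ron rus spa swe tha tur ukr vie chi_sim chi_tra"

# validate_lang VALUE
# Hypothetical helper (an assumption, not an Aspose tool): succeeds only
# when VALUE is a non-empty, comma-separated list of supported codes.
# The subshell body "( ... )" keeps the IFS and globbing changes local.
validate_lang() (
  [ -n "$1" ] || exit 1
  set -f        # disable filename expansion while splitting
  IFS=','       # split the value on commas
  for code in $1; do
    # Reject empty items and anything other than lowercase letters or "_".
    case "$code" in
      ''|*[!a-z_]*) exit 1 ;;
    esac
    # Accept the item only if it appears as a whole word in the list.
    case " $SUPPORTED_LANGS " in
      *" $code "*) ;;
      *) exit 1 ;;
    esac
  done
  exit 0
)

# Usage:
#   validate_lang "eng,rus" && echo "lang value looks fine"
```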

Using cURL to add an OCR Layer for embedded images

For testing purposes, we are using a simple PDF with a single image on the first page.

PDF containing text as an image
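
A request of this kind might look like the sketch below. Only the PUT /pdf/{name}/ocr resource and its lang parameter come from the table above. The base URL, the bearer-token header, the ASPOSE_TOKEN variable, the file name scanned.pdf, and both helper functions are assumptions made for illustration, so adjust them to the API version and authentication scheme your account uses.

```shell
# ASSUMPTION: the base URL is a placeholder; use the one for your API version.
ASPOSE_BASE_URL="${ASPOSE_BASE_URL:-https://api.aspose.cloud/v1.1}"

# ocr_url NAME LANG
# Builds the URL for PUT /pdf/{name}/ocr. NAME is used as-is, so a file
# name containing spaces or special characters must be URL-encoded first.
ocr_url() {
  printf '%s/pdf/%s/ocr?lang=%s\n' "$ASPOSE_BASE_URL" "$1" "$2"
}

# add_ocr_layer NAME LANG
# Sends the PUT request with cURL. ASSUMPTION: ASPOSE_TOKEN is a
# hypothetical variable holding a bearer token; your account may require
# a different kind of credential.
add_ocr_layer() {
  curl --silent --show-error --fail \
    --request PUT \
    --header "Authorization: Bearer $ASPOSE_TOKEN" \
    "$(ocr_url "$1" "$2")"
}

# Usage (scanned.pdf is assumed to be in your cloud storage already):
#   export ASPOSE_TOKEN="..."
#   add_ocr_layer scanned.pdf eng
#   add_ocr_layer scanned.pdf eng,rus
```

Keeping the URL construction in its own function makes it easy to print and inspect the exact address before any request is sent.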

Read text from OCR Layer

Now that the document has an OCR layer, we can read all text items from the PDF document. The response contains the tokens recognized from our embedded image above. Please note this is a partial response.
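
Reading the text back could be sketched as follows. Everything specific here is an assumption, since this article does not name the resource: the /pdf/{name}/textItems path, the base URL, the bearer-token header, the ASPOSE_TOKEN variable, and the helper functions are all illustrative and should be checked against the API reference for your version.

```shell
# ASSUMPTION: the base URL is a placeholder; use the one for your API version.
ASPOSE_BASE_URL="${ASPOSE_BASE_URL:-https://api.aspose.cloud/v1.1}"

# text_items_url NAME
# Builds the URL of the ASSUMED text-items resource for a document.
text_items_url() {
  printf '%s/pdf/%s/textItems\n' "$ASPOSE_BASE_URL" "$1"
}

# read_text_items NAME
# Sends a GET request and prints the raw response body to standard output.
# ASSUMPTION: ASPOSE_TOKEN is a hypothetical variable holding a bearer token.
read_text_items() {
  curl --silent --show-error --fail \
    --request GET \
    --header "Authorization: Bearer $ASPOSE_TOKEN" \
    --header "Accept: application/json" \
    "$(text_items_url "$1")"
}

# Usage:
#   read_text_items scanned.pdf > text-items.json
```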

Have Any Questions?

Feel free to drop us a comment below with your thoughts about the Aspose.PDF Cloud REST API. Let us know, too, if you have any suggestions or need a particular feature from our REST API.

Try It Out

If you have not yet had a chance to try our REST API, simply start a free trial today. All you need to do is sign up at aspose.cloud. Once you’ve signed up, you’re ready to try the file processing features offered by aspose.cloud.