Image PDF to Searchable PDF

Convert Image PDF to Searchable PDF

For long-term archival of books and documents, one of the quickest approaches is to scan them as images. If you need to keep them together as a booklet, the images can be combined into a single PDF document. However, when a PDF document consists of images, it is difficult to copy or extract any text for further processing. In this article, we discuss how to OCR PDF files and convert a non-searchable PDF to a searchable PDF using the Java Cloud SDK.

OCR PDF Java SDK

To create, manipulate, and transform PDF files to a variety of supported formats, we have developed Aspose.PDF Cloud. To implement PDF OCR in a Java application, we use Aspose.PDF Cloud SDK for Java, which is a wrapper around the Cloud API. The first step is to install the SDK, so please add the following details to the pom.xml of a Maven build type project.

<repositories> 
    <repository>
        <id>aspose-cloud</id>
        <name>artifact.aspose-cloud-releases</name>
        <url>http://artifact.aspose.cloud/repo</url>
    </repository>   
</repositories>

<dependencies>
    <dependency>
        <groupId>com.aspose</groupId>
        <artifactId>aspose-pdf-cloud</artifactId>
        <version>21.11.0</version>
    </dependency>
</dependencies>

Once the SDK reference has been added, please create a free account on Aspose Cloud. Log in with the newly created account and look up or create the Client ID and Client Secret at the Cloud Dashboard. These details are required in the subsequent sections.

Scanned PDF to Searchable PDF using Java

This section explains how to convert a scanned PDF to a searchable PDF using a Java code snippet. Please note that the Java Cloud SDK supports recognition of the following languages: eng, ara, bel, ben, bul, ces, dan, deu, ell, fin, fra, heb, hin, ind, isl, ita, jpn, kor, nld, nor, pol, por, ron, rus, spa, swe, tha, tur, ukr, vie, chi_sim, chi_tra, or a combination of them, e.g. eng,rus.
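Because the language value is a plain comma-separated string, it can help to check it locally before sending a request. The following is a minimal sketch of such a check. The class OcrLanguages and its parse method are hypothetical helpers written for this article and are not part of the SDK; the set of codes is copied from the list above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of the SDK) that checks a language string before an OCR call. */
final class OcrLanguages {
    // Codes copied from the supported-language list in this section.
    static final Set<String> SUPPORTED = Set.of(
        "eng", "ara", "bel", "ben", "bul", "ces", "dan", "deu", "ell", "fin", "fra",
        "heb", "hin", "ind", "isl", "ita", "jpn", "kor", "nld", "nor", "pol", "por",
        "ron", "rus", "spa", "swe", "tha", "tur", "ukr", "vie", "chi_sim", "chi_tra");

    /** Splits a comma-separated value such as "eng,rus" and rejects unknown codes. */
    static List<String> parse(String lang) {
        List<String> codes = new ArrayList<>();
        for (String part : lang.split(",")) {
            String code = part.trim();
            if (!SUPPORTED.contains(code)) {
                throw new IllegalArgumentException("Unsupported OCR language: '" + code + "'");
            }
            codes.add(code);
        }
        return codes;
    }

    public static void main(String[] args) {
        // A combination of English and Russian, as in the example above.
        System.out.println(parse("eng,rus"));
    }
}
```

Failing fast on a typo such as "en" avoids a round trip to the service for a request that cannot succeed.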

  • The first step is to create a PdfApi object, which takes the Client ID and Client Secret as arguments
  • Secondly, create a File instance to load the scanned PDF
  • Thirdly, call the uploadFile(…) method to upload the input PDF to cloud storage
  • Since our image PDF contains English text, create a String object holding the value “eng”
  • Finally, call the putSearchableDocument(…) method, which takes the input PDF name and language code as arguments. The resultant searchable PDF is stored in the same cloud storage
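The order of the steps above can be sketched as follows. To keep the example compilable without the SDK on the classpath, it uses a hypothetical OcrApi interface as a stand-in for the two PdfApi calls named in the list. The interface, its simplified parameter lists, and the SearchablePdfWorkflow class are assumptions made for illustration; the real uploadFile(…) and putSearchableDocument(…) methods may take additional arguments.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the two PdfApi calls named in the steps above. */
interface OcrApi {
    void uploadFile(String remotePath, File localFile) throws Exception;
    void putSearchableDocument(String remoteName, String lang) throws Exception;
}

final class SearchablePdfWorkflow {
    /** Uploads the scanned PDF, then asks the service to make it searchable. */
    static void convert(OcrApi api, File scannedPdf, String remoteName, String lang)
            throws Exception {
        api.uploadFile(remoteName, scannedPdf);      // step 3: upload to cloud storage
        api.putSearchableDocument(remoteName, lang); // step 5: run OCR on the uploaded file
    }

    public static void main(String[] args) throws Exception {
        // Recording fake, so the sketch runs without credentials or network access.
        List<String> calls = new ArrayList<>();
        OcrApi fake = new OcrApi() {
            public void uploadFile(String remotePath, File localFile) {
                calls.add("upload " + remotePath);
            }
            public void putSearchableDocument(String remoteName, String lang) {
                calls.add("ocr " + remoteName + " " + lang);
            }
        };
        convert(fake, new File("BusinessReport.pdf"), "BusinessReport.pdf", "eng");
        System.out.println(calls);
    }
}
```

In a real project, an adapter implementing OcrApi would delegate to a PdfApi instance created with your Client ID and Client Secret; please take the exact method signatures from the SDK documentation.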

Image 1: Searchable PDF preview

The scanned PDF used in the above example can be downloaded from BusinessReport.pdf, and the resultant searchable PDF from Converted.pdf.

OCR Online using cURL Commands

cURL commands are a convenient way to access REST APIs from a command-line terminal. In this section, we use cURL commands to perform OCR online. As a prerequisite, we first need to generate a JWT access token (based on client credentials) by executing the following command.

curl -v "https://api.aspose.cloud/connect/token" \
-X POST \
-d "grant_type=client_credentials&client_id=bb959721-5780-4be6-be35-ff5c3a6aa4a2&client_secret=4d84d5f6584160cbd91dba1fe145db14" \
-H "Content-Type: application/x-www-form-urlencoded" \
-H "Accept: application/json"
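The same token request can also be built from Java with the standard java.net.http API (Java 11 or later). The sketch below mirrors the URL, form fields, and headers of the cURL command above. The class name TokenRequest and the placeholder credentials are illustrative assumptions, and the request is only constructed here, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Illustrative helper that mirrors the cURL token request shown above. */
final class TokenRequest {
    static final String TOKEN_URL = "https://api.aspose.cloud/connect/token";

    /** Form body for the client-credentials grant, with both values URL-encoded. */
    static String formBody(String clientId, String clientSecret) {
        return "grant_type=client_credentials"
            + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
            + "&client_secret=" + URLEncoder.encode(clientSecret, StandardCharsets.UTF_8);
    }

    /** Builds (but does not send) the POST request for a JWT access token. */
    static HttpRequest build(String clientId, String clientSecret) {
        return HttpRequest.newBuilder(URI.create(TOKEN_URL))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .header("Accept", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(formBody(clientId, clientSecret)))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder credentials; load real ones from the environment, not from source code.
        HttpRequest request = build("my-client-id", "my-client-secret");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually obtain a token, the request can be passed to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). As with other OAuth 2.0 client-credentials responses, the returned JSON is expected to carry the token in an access_token field.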

Once we have the JWT token, please execute the following command to perform OCR online and convert the image PDF to a searchable PDF document. The resultant file is then stored in cloud storage.

curl -v -X PUT "https://api.aspose.cloud/v3.0/pdf/BusinessReport.pdf/ocr?lang=eng" \
-H  "accept: application/json" \
-H  "Authorization: Bearer <JWT Token>"

Conclusion

In this article, we have discussed simple steps for converting an image PDF to a searchable PDF using the Java Cloud SDK. You can use either the Java code snippet or cURL commands to OCR a PDF. Apart from these approaches, you may consider accessing the API within a web browser through the Swagger interface. Furthermore, as our SDKs are published under an MIT license, the complete source code can be downloaded from GitHub. If you encounter any issues while using the APIs, please feel free to contact us via the product support forum.

We highly recommend visiting the following links to learn more about: