Image PDF to Searchable PDF

Convert Image PDF to Searchable PDF

In today’s data-driven world, PDF has become an indispensable format for storing and sharing documents. However, not all PDFs are searchable or editable, especially those that are image-based. With such documents, it is difficult to copy or extract textual information for further manipulation. Fortunately, Optical Character Recognition (OCR) technology lets you convert image PDFs into searchable PDFs. In this technical blog, we explore how to convert an image PDF to a searchable PDF using the Java SDK and the REST API through cURL commands. Once the OCR layer is in place, the text in your scanned documents becomes available for search and extraction.

OCR PDF using Java SDK

Aspose.PDF Cloud SDK for Java is a powerful cloud-based API that offers a wide range of features and capabilities for working with PDF documents. One of its key functionalities is the ability to perform OCR on PDFs, which can greatly simplify the process of extracting text from image-based PDFs and creating searchable PDFs. With its user-friendly interface and comprehensive documentation, this SDK makes it easy to automate the process of performing OCR on PDFs, saving time and increasing productivity.

Furthermore, this cloud-based API is designed to handle a wide variety of input formats and can even recognize handwritten text, making it an excellent choice for businesses and developers looking to streamline their document workflow. The first step is to add a reference to the SDK in your Java project by adding the following details to the pom.xml of a Maven build project.

<repositories> 
    <repository>
        <id>aspose-cloud</id>
        <name>artifact.aspose-cloud-releases</name>
        <url>http://artifact.aspose.cloud/repo</url>
    </repository>   
</repositories>

<dependencies>
    <dependency>
        <groupId>com.aspose</groupId>
        <artifactId>aspose-pdf-cloud</artifactId>
        <version>21.11.0</version>
    </dependency>
</dependencies>

If you do not have an existing account, you need to create a free account on Aspose Cloud. Log in using the newly created account and look up or create your Client ID and Client Secret on the Cloud Dashboard. These details are required in the subsequent sections.

Scanned PDF to Searchable PDF using Java

This section explains how to convert a scanned PDF to a searchable PDF using a Java code snippet. Please note that the Java Cloud SDK supports recognition of the following languages: eng, ara, bel, ben, bul, ces, dan, deu, ell, fin, fra, heb, hin, ind, isl, ita, jpn, kor, nld, nor, pol, por, ron, rus, spa, swe, tha, tur, ukr, vie, chi_sim, chi_tra, or a combination of them, e.g. eng,rus.
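To illustrate how language codes can be combined, here is a minimal sketch that checks each code against the list above and joins them with commas. The class name OcrLanguages and the method combine are hypothetical helpers written for this post; they are not part of the SDK.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class OcrLanguages {

    // Language codes copied from the list in this section.
    private static final Set<String> SUPPORTED = new LinkedHashSet<>(Arrays.asList(
        "eng", "ara", "bel", "ben", "bul", "ces", "dan", "deu", "ell", "fin", "fra",
        "heb", "hin", "ind", "isl", "ita", "jpn", "kor", "nld", "nor", "pol", "por",
        "ron", "rus", "spa", "swe", "tha", "tur", "ukr", "vie", "chi_sim", "chi_tra"));

    // Validates every code and returns them joined with commas, e.g. "eng,rus".
    public static String combine(String... codes) {
        if (codes.length == 0) {
            throw new IllegalArgumentException("At least one language code is required");
        }
        for (String code : codes) {
            if (!SUPPORTED.contains(code)) {
                throw new IllegalArgumentException("Unsupported OCR language code: " + code);
            }
        }
        return String.join(",", codes);
    }

    public static void main(String[] args) {
        System.out.println(combine("eng"));
        System.out.println(combine("eng", "rus"));
    }
}
```

Validating the codes locally lets a typo such as "en" fail fast, before any document is uploaded.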

  • First, create an object of PdfApi, passing the Client ID and Client Secret as arguments.
  • Secondly, create an instance of the File class to load the image PDF.
  • Thirdly, call the method uploadFile(…) to upload the input PDF to cloud storage.
  • As our image PDF contains English text, create a String object holding the value “eng”.
  • Finally, call the method putSearchableDocument(…), which requires the input PDF name and a language code as arguments.
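The order of these steps can be sketched as follows. Because the exact argument lists of uploadFile(…) and putSearchableDocument(…) depend on the SDK version, this sketch does not call the SDK directly. Instead it codes the workflow against a small hypothetical interface, PdfOcrClient, which a real project could implement by delegating to PdfApi. The interface, its two-argument methods, and the class name SearchablePdfWorkflow are assumptions made for illustration only.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class SearchablePdfWorkflow {

    // Hypothetical abstraction over the two SDK calls named in the steps above.
    // A real implementation would delegate to PdfApi; the SDK's actual
    // argument lists are not reproduced here.
    public interface PdfOcrClient {
        void uploadFile(String remoteName, File localFile);
        void putSearchableDocument(String remoteName, String lang);
    }

    // Runs the steps in order: upload the image PDF, then request OCR on it.
    public static void makeSearchable(PdfOcrClient client, File imagePdf, String lang) {
        String remoteName = imagePdf.getName();          // reuse the local file name in cloud storage
        client.uploadFile(remoteName, imagePdf);         // upload the input PDF
        client.putSearchableDocument(remoteName, lang);  // run OCR with the given language code
    }

    public static void main(String[] args) {
        // Stand-in client that only records calls, so the sketch runs without credentials.
        List<String> calls = new ArrayList<>();
        PdfOcrClient recorder = new PdfOcrClient() {
            public void uploadFile(String remoteName, File localFile) {
                calls.add("upload " + remoteName);
            }
            public void putSearchableDocument(String remoteName, String lang) {
                calls.add("ocr " + remoteName + " " + lang);
            }
        };
        makeSearchable(recorder, new File("BusinessReport.pdf"), "eng");
        System.out.println(calls);
    }
}
```

Keeping the SDK behind a narrow interface also makes the workflow easy to exercise in unit tests without network access.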

Once the code is successfully executed, the searchable PDF is stored in cloud storage.


Image 1: Searchable PDF preview

The scanned PDF used in the above example can be downloaded from BusinessReport.pdf, and the resultant searchable PDF from Converted.pdf.

OCR Online using cURL Commands

cURL commands are a convenient way to call REST APIs, so in this section we use them to perform OCR online. As a prerequisite, we first need to generate a JWT access token (based on the client credentials) by executing the following command.

curl -v "https://api.aspose.cloud/connect/token" \
-X POST \
-d "grant_type=client_credentials&client_id=bb959721-5780-4be6-be35-ff5c3a6aa4a2&client_secret=4d84d5f6584160cbd91dba1fe145db14" \
-H "Content-Type: application/x-www-form-urlencoded" \
-H "Accept: application/json"
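For readers who prefer to stay in Java, the same token request can be sketched with the JDK's built-in HTTP client (Java 11 or later). The URL, form fields, and headers mirror the cURL command above. The class name TokenRequest and the environment variable names CLIENT_ID and CLIENT_SECRET are assumptions made for this sketch, as is the expectation that the JSON response carries the access token.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class TokenRequest {

    static final String TOKEN_URL = "https://api.aspose.cloud/connect/token";

    // Builds the form-encoded body sent by the cURL command.
    public static String formBody(String clientId, String clientSecret) {
        return "grant_type=client_credentials"
            + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
            + "&client_secret=" + URLEncoder.encode(clientSecret, StandardCharsets.UTF_8);
    }

    // Builds the POST request without sending it.
    public static HttpRequest build(String clientId, String clientSecret) {
        return HttpRequest.newBuilder(URI.create(TOKEN_URL))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .header("Accept", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(formBody(clientId, clientSecret)))
            .build();
    }

    public static void main(String[] args) throws Exception {
        // CLIENT_ID and CLIENT_SECRET are hypothetical environment variable names.
        String id = System.getenv("CLIENT_ID");
        String secret = System.getenv("CLIENT_SECRET");
        if (id == null || secret == null) {
            System.err.println("Set CLIENT_ID and CLIENT_SECRET before running");
            return;
        }
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(build(id, secret), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body()); // JSON response, expected to contain the token
    }
}
```

Reading the credentials from the environment keeps them out of source control.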

Once we have the JWT token, please execute the following command to perform OCR online and convert the image PDF to a searchable PDF document. The resultant file is then stored in cloud storage.

curl -v -X PUT "https://api.aspose.cloud/v3.0/pdf/BusinessReport.pdf/ocr?lang=eng" \
-H "accept: application/json" \
-H "authorization: Bearer <JWT Token>"

Conclusion

Performing OCR on PDFs is a critical step in unlocking the full potential of these documents. With the help of cloud-based OCR tools like Aspose.PDF Cloud SDK for Java, this process can be simplified and automated, saving time and increasing productivity. By leveraging OCR, businesses and developers can transform image-based PDFs into searchable PDFs, making them easier to search, edit, and share. Beyond OCR, the API offers a range of features for working with PDFs. By following the steps in this technical blog, you can get started with OCR on PDFs and take your document workflow to the next level.

You may consider accessing the API within a web browser using the Swagger interface. Furthermore, as our SDKs are released under an MIT license, the complete source code can be downloaded from GitHub. If you encounter any issues while using the API, please feel free to contact us via the free product support forum.

We highly recommend visiting the following links to learn more about: