Extract PDF Images

How to Extract PDF Images using Java Cloud SDK

PDF files are widely used because they support both text and image content, and the layout is preserved no matter which platform is used to view them. Sometimes, however, we need to extract the images from a PDF. A PDF viewer application can do this, but you have to go through each page manually and save each image individually. In another scenario, if you have an image-based PDF and need to perform PDF OCR, you first have to extract all the images and then run the OCR operation. This becomes really difficult with a large set of documents, whereas a programmatic solution can be both reliable and quick. So in this article, we are going to explore the options to extract images from PDF using the Java Cloud SDK.

PDF to JPG Conversion API

In order to convert PDF to JPG or JPG to PDF in a Java application, Aspose.PDF Cloud SDK for Java is an excellent choice. It also enables you to extract images, text, and attachments from PDF, and provides many other options for PDF manipulation. To implement the feature to save PDF images in a Java application, we first need to add the Cloud SDK reference to our project. Please add the following details to the pom.xml of a Maven build type project.

<repositories> 
    <repository>
        <id>aspose-cloud</id>
        <name>artifact.aspose-cloud-releases</name>
        <url>http://artifact.aspose.cloud/repo</url>
    </repository>   
</repositories>

<dependencies>
    <dependency>
        <groupId>com.aspose</groupId>
        <artifactId>aspose-pdf-cloud</artifactId>
        <version>21.11.0</version>
    </dependency>
</dependencies>

Once the SDK reference has been added, and if you do not already have an Aspose Cloud account, please create a free account using a valid email address. Then log in using the newly created account and look up or create your Client ID and Client Secret at the Cloud Dashboard. These details are required for authentication in the following sections.

Extract PDF Images in Java

Please follow the steps given below to extract images from PDF. Once the operation is complete, the images are stored in a separate folder on Cloud storage.

  • First, we need to create a PdfApi object while providing the Client ID and Client Secret as arguments
  • Secondly, load the input PDF file using a File instance
  • Upload the input PDF to Cloud storage using the uploadFile(…) method
  • We are also going to use optional parameters to set the Height & Width of the extracted images
  • Finally, call the putImagesExtractAsJpeg(…) method, which takes the input PDF name, the page number to extract images from, the dimensions of the extracted images, and the name of the folder on Cloud storage in which to save them

Image 1: Extract PDF Images preview

The sample PDF file used in the above example can be downloaded from input.pdf.

Save PDF images using cURL Commands

Now we are going to call the API for PDF image extraction using cURL commands. As a prerequisite for this approach, we first need to generate a JWT access token (based on client credentials) by executing the following command.

curl -v "https://api.aspose.cloud/connect/token" \
-X POST \
-d "grant_type=client_credentials&client_id=bb959721-5780-4be6-be35-ff5c3a6aa4a2&client_secret=4d84d5f6584160cbd91dba1fe145db14" \
-H "Content-Type: application/x-www-form-urlencoded" \
-H "Accept: application/json"
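
If you would rather stay in Java for this step, the same token request can be sketched with the JDK's built-in java.net.http client (Java 11 or later). This is only a minimal sketch and not the SDK's own mechanism: the class name TokenRequest and the environment variable names ASPOSE_CLIENT_ID and ASPOSE_CLIENT_SECRET are assumptions made for illustration, while the endpoint, headers, and form fields are taken from the cURL command above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an assumption for this sketch.
class TokenRequest {
    // Token endpoint taken from the cURL command.
    static final String TOKEN_URL = "https://api.aspose.cloud/connect/token";

    // Builds the form-encoded body for the client-credentials grant.
    static String formBody(String clientId, String clientSecret) {
        return "grant_type=client_credentials"
                + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
                + "&client_secret=" + URLEncoder.encode(clientSecret, StandardCharsets.UTF_8);
    }

    // Builds (but does not send) the POST request with the same headers as the cURL call.
    static HttpRequest build(String clientId, String clientSecret) {
        return HttpRequest.newBuilder(URI.create(TOKEN_URL))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Accept", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(clientId, clientSecret)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed environment variable names; adjust to your own setup.
        String clientId = System.getenv("ASPOSE_CLIENT_ID");
        String clientSecret = System.getenv("ASPOSE_CLIENT_SECRET");
        if (clientId == null || clientSecret == null) {
            System.err.println("Set ASPOSE_CLIENT_ID and ASPOSE_CLIENT_SECRET first.");
            return;
        }
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(build(clientId, clientSecret), HttpResponse.BodyHandlers.ofString());
        // Print the raw JSON response; the token still has to be read from it.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Reading the credentials from environment variables keeps them out of the source code, which is generally preferable to hard-coding them in a command or a class.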

Once we have the JWT token, please execute the following command to save the PDF images in a separate folder on Cloud storage.

curl -X PUT "https://api.aspose.cloud/v3.0/pdf/input_file.pdf/pages/1/images/extract/jpeg?width=0&height=0&destFolder=NewFolder" \
-H  "accept: application/json" \
-H  "authorization: Bearer <JWT Token>"
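
The extraction call can also be issued from plain Java without the SDK. The following is a minimal sketch using the JDK's java.net.http client (Java 11 or later), assuming the input PDF is already present on Cloud storage. The class name ExtractImagesRequest, its method names, and the environment variable ASPOSE_JWT_TOKEN are assumptions made for illustration; the URL structure, query parameters, and headers mirror the cURL command above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an assumption for this sketch.
class ExtractImagesRequest {
    // Base URL taken from the cURL command.
    static final String BASE_URL = "https://api.aspose.cloud/v3.0/pdf/";

    // Percent-encodes a value so it is safe in the path or the query string.
    static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Assembles the same URL as the cURL command from its individual parts.
    static URI extractUri(String pdfName, int pageNumber, int width, int height, String destFolder) {
        return URI.create(BASE_URL + enc(pdfName) + "/pages/" + pageNumber
                + "/images/extract/jpeg?width=" + width + "&height=" + height
                + "&destFolder=" + enc(destFolder));
    }

    // Builds (but does not send) the PUT request carrying the JWT as a bearer token.
    static HttpRequest build(URI uri, String jwtToken) {
        return HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .header("Authorization", "Bearer " + jwtToken)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed environment variable holding the JWT obtained in the previous step.
        String token = System.getenv("ASPOSE_JWT_TOKEN");
        if (token == null) {
            System.err.println("Set ASPOSE_JWT_TOKEN first.");
            return;
        }
        URI uri = extractUri("input_file.pdf", 1, 0, 0, "NewFolder");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(build(uri, token), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Keeping the URL assembly in its own method makes it easy to loop over several page numbers or to change the destination folder without touching the request code.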

Conclusion

After reading this article, you have learned a simple yet reliable approach for extracting PDF images using a Java code snippet as well as through cURL commands. As we have seen, the API lets us extract images from a specified page of the PDF file, which provides more control over the extraction process. The product Documentation covers a range of topics that further explain the capabilities of this API.

Also, as all our Cloud SDKs are published under the MIT license, you may consider downloading the complete source code from GitHub and modifying it as per your requirements. In case of any issues, you may consider approaching us for a quick resolution via the free product support forum.

Please visit the following links to learn more about: