Extract Images from PDF in Python


Images are an integral part of many PDF documents, and we often need to extract them. When the images are spread across a large batch of PDF files, manual extraction is not viable, so a programmatic solution is more convenient and saves time. In this article, we discuss how to extract images from PDF files in Python.

PDF Processing API

Aspose.PDF Cloud SDK for Python enables you to create, edit, and render PDF files to other supported formats. The SDK is built on top of the Aspose.PDF Cloud API and provides all the capabilities available in the underlying REST API. With its help, we can load PDF documents and extract the images inside them in the format of our choice (such as JPEG, TIFF, GIF, or PNG). The first step is installation. The SDK is available as a free download over PIP and from its GitHub repository. Execute the following command in the terminal/command prompt to install the latest version of the SDK.

 pip install asposepdfcloud

PyCharm IDE

If you are using PyCharm IDE, you may directly add the SDK as a dependency in your project.

File -> Settings -> Project -> Python Interpreter -> asposepdfcloud

Image 1:- PyCharm settings option.
Image 2:- Aspose.Pdf Cloud Python Package.

Free Cloud Dashboard Account

After installation, the next step is to sign up for a free subscription to our cloud services via the Aspose.Cloud dashboard. The subscription ensures that only authorized users can access our file processing services. If you have a GitHub or Google account, simply sign up with it; otherwise, click the Create a new Account button and provide the required information. Then log in to the dashboard with your credentials, expand the Applications section, and scroll down to the Client Credentials section to see your Client ID and Client Secret.

Image 3:- Client Credentials on Aspose.Cloud Dashboard.

Extract Images from PDF in Python

Follow the steps below to extract images from a PDF document in JPEG format and save them to a folder in Cloud storage.

  • First, create an instance of the ApiClient class, providing the Client ID and Client Secret as arguments.
  • Second, create an instance of the PdfApi class, which takes the ApiClient object as its input argument.
  • Now call the method put_images_extract_as_jpeg(…), which takes the input PDF name, the number of the page containing the images, and an optional parameter specifying the target folder where the images are to be extracted.

The API also supports two optional parameters to specify the Width and Height of extracted images.
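The steps above can be sketched roughly as follows. This is a minimal sketch and not a definitive implementation: the class names ApiClient and PdfApi and the method put_images_extract_as_jpeg come from the steps above, while the import paths, the app_key/app_sid keyword names, and the dest_folder, width, and height keyword names are assumptions that should be checked against the installed SDK version. The helper names create_pdf_api and extract_images_as_jpeg are hypothetical.

```python
def create_pdf_api(client_id, client_secret):
    """Build a PdfApi object from dashboard credentials (steps 1 and 2)."""
    # Imported lazily so the helper below can be used without the SDK installed.
    # Assumption: these import paths match the installed asposepdfcloud package.
    from asposepdfcloud.api_client import ApiClient
    from asposepdfcloud.apis.pdf_api import PdfApi

    # Assumption: ApiClient accepts the Client Secret as app_key and the
    # Client ID as app_sid; verify against your SDK version.
    api_client = ApiClient(app_key=client_secret, app_sid=client_id)
    return PdfApi(api_client)


def extract_images_as_jpeg(pdf_api, pdf_name, page_number,
                           dest_folder=None, width=None, height=None):
    """Ask the service to extract the images on one page as JPEG (step 3)."""
    # Only pass the optional parameters the caller actually supplied.
    # Assumption: the SDK names them dest_folder, width and height.
    options = {}
    if dest_folder is not None:
        options["dest_folder"] = dest_folder
    if width is not None:
        options["width"] = width
    if height is not None:
        options["height"] = height
    return pdf_api.put_images_extract_as_jpeg(pdf_name, page_number, **options)


# Example usage (requires the SDK, valid credentials and a PDF that has
# already been uploaded to Cloud storage):
#   pdf_api = create_pdf_api("<Client ID>", "<Client Secret>")
#   extract_images_as_jpeg(pdf_api, "URL2PDF.pdf", 3, dest_folder="ExtractedImages")
```

Passing the PdfApi object into the extraction helper keeps credential handling separate from the call itself, so the same object can be reused for several pages or documents.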

Image 4:- Preview of extracted images.

In case you need to extract images in a format other than JPEG, please try using

Extract Images using cURL Command

cURL commands also provide a convenient way to access REST APIs from the command line. You can run them on Windows, Linux, macOS, or other operating systems. In this section, we use cURL commands to extract images in PNG format and save the output to Cloud storage.

Before we proceed with image extraction, we need to generate a JSON Web Token (JWT) based on the client credentials shown on your Aspose.Cloud dashboard. This step is mandatory because our APIs are accessible only to registered users. Execute the following command to generate the JWT.

curl -v "https://api.aspose.cloud/connect/token" \
-X POST \
-d "grant_type=client_credentials&client_id=bbf94a2c-6d7e-4020-b4d2-b9809741374e&client_secret=1c9379bb7d701c26cc87e741a29987bb" \
-H "Content-Type: application/x-www-form-urlencoded" \
-H "Accept: application/json"
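For readers who prefer to stay in Python, the same token request can be sketched with the standard library alone. The URL, form fields, and headers mirror the cURL command above; the helper names build_token_request and fetch_access_token are hypothetical, and reading the token from an access_token field is an assumption based on the usual OAuth 2.0 client-credentials response shape, not something stated in this article.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://api.aspose.cloud/connect/token"


def build_token_request(client_id, client_secret):
    """Prepare (but do not send) the POST request shown in the cURL command."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
    )


def fetch_access_token(client_id, client_secret):
    """Send the request and return the token string from the JSON response."""
    request = build_token_request(client_id, client_secret)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumption: the token is returned under the key "access_token".
    return payload["access_token"]
```

Splitting request construction from sending makes it easy to inspect the request before any credentials leave the machine.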

Once the JWT is generated, execute the following command to extract images from the 3rd page of a document in PNG format.

curl -v -X PUT "https://api.aspose.cloud/v3.0/pdf/URL2PDF.pdf/pages/3/images/extract/png?width=0&height=0&destFolder=ExtractedImages" \
-H  "Accept: application/json" \
-H  "authorization: Bearer <JWT Token>" \
-d "{}"
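The extraction call can likewise be sketched in Python with the standard library. The endpoint path, the query parameters width, height, and destFolder, and the headers are taken from the cURL command above; the helper name build_extract_request and its argument names are hypothetical, and the sketch assumes the PDF is already present in Cloud storage.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.aspose.cloud/v3.0/pdf"


def build_extract_request(pdf_name, page_number, jwt_token,
                          image_format="png", dest_folder="ExtractedImages",
                          width=0, height=0):
    """Prepare (but do not send) the PUT request shown in the cURL command."""
    # Percent-encode the file name so spaces or slashes cannot alter the path.
    path = "/{}/pages/{}/images/extract/{}".format(
        urllib.parse.quote(pdf_name, safe=""), int(page_number), image_format
    )
    query = urllib.parse.urlencode(
        {"width": width, "height": height, "destFolder": dest_folder}
    )
    return urllib.request.Request(
        BASE_URL + path + "?" + query,
        data=b"{}",  # same empty JSON body that the cURL command sends
        method="PUT",
        headers={
            "Accept": "application/json",
            "Authorization": "Bearer " + jwt_token,
        },
    )


# Example usage (performs a real network call, so it needs a valid JWT):
#   request = build_extract_request("URL2PDF.pdf", 3, "<JWT Token>")
#   with urllib.request.urlopen(request) as response:
#       print(response.status)
```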

The sample PDF file used in the above example can be downloaded from URL2PDF.pdf.

Conclusion

Let’s recap what this article covered. We learned a simple approach to extracting images from PDF files using the Python SDK as well as through cURL commands. We started with a free account on the cloud dashboard, which allows up to 150 document conversion/processing requests under a free license. Once you are satisfied with our services, you may opt for a license subscription, which can be as low as $0.005 / API call. In addition, the complete source code of Aspose.PDF Cloud SDK for Python is available for download from GitHub under the MIT license.

In case you encounter any issues while using the API or you have any further queries, please feel free to contact us via the Free product support forum.

Related Articles

We also recommend visiting the following links to learn more about