Extract Text from PDF in Python

Text and images are among the most widely used objects in PDF documents, and we often need to extract useful content from them. Extraction is also useful when we need to compare selected text across multiple documents, or when we have to bulk extract selected text and save it to another system or format. In this article, we discuss the steps to extract text from PDF in Python.

PDF Processing SDK

As this article focuses on the Python programming language, we are going to use Aspose.PDF Cloud SDK for Python, which is a wrapper around the Aspose.PDF Cloud API. Because Aspose.PDF Cloud follows the REST architecture, the API can be accessed from any platform (desktop, mobile, web, hybrid, etc.), so you can leverage its PDF processing capabilities on the platform of your choice. The first step in using the Python SDK is installation. The SDK is available as a free download from PIP and the GitHub repository. Execute the following command in the terminal/command prompt to install the latest version of the SDK on your system.

 pip install asposepdfcloud
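To confirm that the installation succeeded, a quick check such as the following can help. It is only a sketch: the helper name sdk_installed is made up for this illustration, and it merely asks the current interpreter whether it can locate the package.

```python
import importlib.util


def sdk_installed(package_name="asposepdfcloud"):
    """Return True if the current interpreter can locate the named package."""
    return importlib.util.find_spec(package_name) is not None


# Prints True once "pip install asposepdfcloud" has run for this interpreter.
print("asposepdfcloud installed:", sdk_installed())
```

If the check prints False inside PyCharm, the project may be using a different interpreter than the one pip installed into.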

PyCharm IDE

If you are using PyCharm IDE, you may directly add the SDK as a dependency in your project.

File -> Settings -> Project -> Python Interpreter -> asposepdfcloud

Image 1:- PyCharm settings option.
Image 2:- Aspose.Pdf Cloud Python Package.

Free Cloud Dashboard Account

After the installation, the next major step is a free subscription to our cloud services via the Aspose.Cloud dashboard. This subscription ensures that only authorized persons can access our file processing services. If you have a GitHub or Google account, simply sign up with it; otherwise, click the Create a new Account button and provide the required information. Then log in to the dashboard with your credentials, expand the Applications section, and scroll down to the Client Credentials section to see the Client ID and Client Secret details.

Image 3:- Client Credentials on Aspose.Cloud Dashboard.
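Since the Client ID and Client Secret are sensitive, you may prefer to keep them out of the source code. The following is a minimal sketch that reads them from environment variables; the variable names ASPOSE_CLIENT_ID and ASPOSE_CLIENT_SECRET and the helper load_credentials are assumptions made for this example, not names defined by the SDK.

```python
import os


def load_credentials(env=None):
    """Read the Client ID and Client Secret from environment variables.

    The variable names are hypothetical; choose whatever suits your setup.
    """
    env = os.environ if env is None else env
    client_id = env.get("ASPOSE_CLIENT_ID")
    client_secret = env.get("ASPOSE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Set ASPOSE_CLIENT_ID and ASPOSE_CLIENT_SECRET first")
    return client_id, client_secret
```

Calling load_credentials() with no arguments reads from the real environment; passing a dictionary instead makes the helper easy to try out without touching your shell settings.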

Extract Text from PDF in Python

Please follow the instructions given below to extract text from PDF documents using the Python SDK.

  • Firstly, create an instance of the ApiClient class, providing the Client ID and Client Secret as arguments
  • Secondly, create an instance of the PdfApi class, which takes the ApiClient object as an input argument
  • Now call the get_text(..) method, providing the LLX, LLY, URX, and URY coordinates, i.e. the lower-left and upper-right corners of the rectangle to read text from
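The steps above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the keyword names app_key (Client Secret) and app_sid (Client ID) and the argument order of get_text are taken from our understanding of the SDK, the helper names build_pdf_api and extract_text are made up here, and the input file is assumed to be already uploaded to your cloud storage.

```python
def build_pdf_api(client_id, client_secret):
    """Create a PdfApi instance from the dashboard credentials."""
    # Imported lazily so this module still loads where the SDK is absent.
    import asposepdfcloud
    from asposepdfcloud.apis.pdf_api import PdfApi

    # Assumption: the SDK calls the Client Secret "app_key" and the
    # Client ID "app_sid".
    api_client = asposepdfcloud.api_client.ApiClient(
        app_key=client_secret, app_sid=client_id
    )
    return PdfApi(api_client)


def extract_text(pdf_api, file_name, llx=0, lly=0, urx=800, ury=800):
    """Request the text inside the rectangle (llx, lly)-(urx, ury)."""
    # Assumption: get_text takes the file name followed by LLX, LLY, URX, URY.
    return pdf_api.get_text(file_name, llx, lly, urx, ury)


# Usage (needs valid credentials and the PDF in cloud storage):
# pdf_api = build_pdf_api("<Client ID>", "<Client Secret>")
# print(extract_text(pdf_api, "awesomeTable.pdf"))
```

Keeping the API object as a parameter of extract_text means the same instance can be reused for several documents or rectangles.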
Image 4:- Text extract preview.

In case you need to extract the text from a specific page of the document, please try using the GetPageText API, which takes pageNumber as an additional argument.
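A page-level call might look like the sketch below. It assumes that the Python SDK exposes GetPageText as a get_page_text method taking the file name, the page number, and then the same four coordinates; please verify this against the SDK version you have installed. The helper name extract_page_text is hypothetical.

```python
def extract_page_text(pdf_api, file_name, page_number,
                      llx=0, lly=0, urx=800, ury=800):
    """Request the text inside a rectangle on a single page."""
    # Assumption: get_page_text is the SDK's snake_case name for GetPageText
    # and accepts (name, page_number, llx, lly, urx, ury).
    return pdf_api.get_page_text(file_name, page_number, llx, lly, urx, ury)


# Usage, given an already configured PdfApi instance named pdf_api:
# print(extract_page_text(pdf_api, "awesomeTable.pdf", 1))
```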

Extract Text using cURL Command

Since Aspose.PDF Cloud is built on the REST architecture, it can easily be accessed via cURL commands. In this section, we are going to use cURL for text extraction. As a prerequisite, you need to generate a JSON Web Token (JWT) based on the client credentials shown on your Aspose.Cloud dashboard. This is mandatory because our APIs are only accessible to registered users. Please execute the following command to generate the JWT.

curl -v "https://api.aspose.cloud/connect/token" \
-X POST \
-d "grant_type=client_credentials&client_id=bbf94a2c-6d7e-4020-b4d2-b9809741374e&client_secret=1c9379bb7d701c26cc87e741a29987bb" \
-H "Content-Type: application/x-www-form-urlencoded" \
-H "Accept: application/json"
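For readers who prefer to stay in Python, the same token request can be sketched with the standard library alone. The function names are made up for this example, and reading the token from an "access_token" field is an assumption based on the usual shape of OAuth 2.0 client-credentials responses.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://api.aspose.cloud/connect/token"


def build_token_request(client_id, client_secret):
    """Build the same POST request as the cURL command, without sending it."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
    )


def fetch_token(client_id, client_secret):
    """Send the request and return the token string.

    Assumption: the JSON reply carries the token under "access_token".
    """
    request = build_token_request(client_id, client_secret)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Separating the request construction from the network call makes it possible to inspect the URL, body, and headers before anything is sent.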

Once we have the JWT, we can use the following command to extract text from a PDF file and save the response to a file named Extracted.txt on the local drive.

curl -v -X GET "https://api.aspose.cloud/v3.0/pdf/awesomeTable.pdf/text?splitRects=true&LLX=0&LLY=0&URX=800&URY=800" \
-H  "accept: application/json" \
-H  "authorization: Bearer <JWT Token>" \
-o Extracted.txt
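The cURL call above can also be mirrored in Python using only the standard library. This is a sketch: the helper names build_text_request and save_text_response are hypothetical, the token is assumed to have been obtained already, and the response body is written to disk unchanged, just as the -o option does.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.aspose.cloud/v3.0/pdf"


def build_text_request(file_name, token, llx=0, lly=0, urx=800, ury=800):
    """Build the GET request for the text inside the given rectangle."""
    query = urllib.parse.urlencode({
        "splitRects": "true",
        "LLX": llx,
        "LLY": lly,
        "URX": urx,
        "URY": ury,
    })
    url = f"{API_BASE}/{urllib.parse.quote(file_name)}/text?{query}"
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def save_text_response(file_name, token, out_path="Extracted.txt"):
    """Send the request and write the raw response body to out_path."""
    with urllib.request.urlopen(build_text_request(file_name, token)) as response:
        data = response.read()
    with open(out_path, "wb") as out_file:
        out_file.write(data)
    return out_path


# Usage (needs a valid token and the PDF in cloud storage):
# save_text_response("awesomeTable.pdf", "<JWT Token>")
```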

The sample used in the above example can be downloaded from awesomeTable.pdf.

Conclusion

In this article, we have discussed the steps and related details on how to extract text from a PDF file using the Python SDK and via the cURL command. Under the free cloud subscription account, you can perform up to 150 document conversion/processing requests, and once you are satisfied with our services, you may opt for a license subscription, which can be as low as $0.005 / API call.

In addition, the complete source code of Aspose.PDF Cloud SDK for Python is available for download from GitHub under the MIT license.

Related Articles

We also recommend visiting the following links to learn more about