Text and images are among the most widely used objects in PDF documents, and we often need to extract useful content from them. Extraction is also useful when we need to compare selected text across multiple documents, or to bulk extract selected text and save it to another system or format. In this article, we discuss how to extract text from a PDF in Python.
PDF to Text Conversion API
Since this article focuses on the Python programming language, we are going to use Aspose.PDF Cloud SDK for Python. The first step is installation. The SDK is available for free download from PIP and the GitHub repository. Execute the following command in the terminal/command prompt to install the latest version of the SDK.
pip install asposepdfcloud
If you are using PyCharm IDE, you may directly add the SDK as a dependency in your project.
File -> Settings -> Project -> Python Interpreter -> asposepdfcloud
After installation, the next major step is a free subscription to our cloud services via the Aspose.Cloud dashboard. If you have a GitHub or Google account, simply sign up with it; otherwise, click the Create a new Account button. Then log in to the dashboard and obtain your personalized Client ID and Client Secret.
Extract Text from PDF in Python
Please follow the instructions given below to extract Text from PDF documents using Python SDK.
- Firstly, create an instance of the ApiClient class, providing the Client ID and Client Secret as arguments
- Secondly, create an instance of the PdfApi class, which takes the ApiClient object as an input argument
- Now call the method get_text(..), providing the LLX, LLY, URX, and URY coordinates of the rectangle to extract from
In case you need to extract the text from a specific page of the document, please try using GetPageText API which takes pageNumber as an argument.
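The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names `build_pdf_api` and `extract_text` are hypothetical, the `app_key`/`app_sid` keyword names and the `text_occurrences.list` response shape are assumptions about the SDK that should be checked against the installed version, and the input file is assumed to be already uploaded to cloud storage.

```python
def build_pdf_api(client_id, client_secret):
    """Create a PdfApi instance from Aspose.Cloud dashboard credentials."""
    # Imported lazily so this module still loads when the SDK is not installed.
    from asposepdfcloud.api_client import ApiClient
    from asposepdfcloud.apis.pdf_api import PdfApi

    # Assumption: the SDK names the Client Secret "app_key"
    # and the Client ID "app_sid".
    api_client = ApiClient(app_key=client_secret, app_sid=client_id)
    return PdfApi(api_client)


def extract_text(pdf_api, file_name, llx, lly, urx, ury):
    """Return the text fragments found inside the given rectangle."""
    # LLX/LLY are the lower-left corner and URX/URY the upper-right corner.
    response = pdf_api.get_text(file_name, llx, lly, urx, ury)
    # Assumption: fragments are exposed as response.text_occurrences.list,
    # each item carrying a .text attribute.
    occurrences = response.text_occurrences.list or []
    return [item.text for item in occurrences]


# Example usage (requires real credentials and an uploaded file):
# pdf_api = build_pdf_api("<Client ID>", "<Client Secret>")
# for fragment in extract_text(pdf_api, "awesomeTable.pdf", 0, 0, 800, 800):
#     print(fragment)
```

Keeping the client construction separate from the extraction call makes it possible to reuse one PdfApi instance across several documents.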
Extract Text using cURL Command
Since Aspose.PDF Cloud is built on REST architecture, it can easily be accessed via cURL commands. In this section, we are going to use cURL to convert PDF to Text format. Please note that a prerequisite is to generate a JSON Web Token (JWT) based on your client credentials. This step is mandatory, as our APIs are only accessible to registered users. Please execute the following command to generate the JWT.
curl -v "https://api.aspose.cloud/connect/token" \
 -X POST \
 -d "grant_type=client_credentials&client_id=bbf94a2c-6d7e-4020-b4d2-b9809741374e&client_secret=1c9379bb7d701c26cc87e741a29987bb" \
 -H "Content-Type: application/x-www-form-urlencoded" \
 -H "Accept: application/json"
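For readers who prefer to stay in Python, the same token request can be sketched with the standard library alone. This is an illustrative sketch: the function names `build_token_request` and `fetch_access_token` are hypothetical, and reading the token from an `access_token` field is an assumption based on the usual OAuth 2.0 client-credentials response.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://api.aspose.cloud/connect/token"


def build_token_request(client_id, client_secret):
    """Build the POST request that the cURL command above sends."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
    )


def fetch_access_token(client_id, client_secret):
    """Send the request and return the JWT string."""
    request = build_token_request(client_id, client_secret)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumption: the JWT is returned under the "access_token" key.
    return payload["access_token"]
```

Separating request construction from sending keeps the network call in one place and lets the request be inspected without contacting the service.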
Once we have the JWT token, we can use the following command to extract text from a PDF file and save it as a plain text file on the local drive.
curl -v -X GET "https://api.aspose.cloud/v3.0/pdf/awesomeTable.pdf/text?splitRects=true&LLX=0&LLY=0&URX=800&URY=800" \
 -H "accept: application/json" \
 -H "authorization: Bearer <JWT Token>" \
 -o Extracted.txt
The sample used in the above example can be downloaded from awesomeTable.pdf.
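The text extraction request can also be issued from Python with the standard library. The sketch below mirrors the cURL call's URL, query parameters, and headers. The function names are hypothetical, and the JSON shape read by `texts_from_payload` (`TextOccurrences` containing a `List` of items with a `Text` field) is an assumption that should be verified against an actual response.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.aspose.cloud/v3.0/pdf"


def build_text_request(file_name, jwt_token, llx, lly, urx, ury, split_rects=True):
    """Build the GET request for the text inside the given rectangle."""
    query = urllib.parse.urlencode({
        "splitRects": str(split_rects).lower(),
        "LLX": llx,
        "LLY": lly,
        "URX": urx,
        "URY": ury,
    })
    url = f"{BASE_URL}/{urllib.parse.quote(file_name)}/text?{query}"
    return urllib.request.Request(url, headers={
        "Accept": "application/json",
        "Authorization": f"Bearer {jwt_token}",
    })


def texts_from_payload(payload):
    """Pull the text fragments out of the decoded JSON response."""
    # Assumption: {"TextOccurrences": {"List": [{"Text": "..."}, ...]}}
    occurrences = (payload.get("TextOccurrences") or {}).get("List") or []
    return [item.get("Text", "") for item in occurrences]


def save_extracted_text(file_name, jwt_token, out_path,
                        llx=0, lly=0, urx=800, ury=800):
    """Fetch the text and write one fragment per line to out_path."""
    request = build_text_request(file_name, jwt_token, llx, lly, urx, ury)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(texts_from_payload(payload)))
```

Because the request asks for `application/json`, the helper unpacks the fragments before writing them, so the saved file holds plain text rather than the raw JSON body.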
In this article, we have discussed how to extract text from a PDF using the Python SDK. We have also seen how to convert PDF to Text via cURL commands. The complete source code of Aspose.PDF Cloud SDK for Python is available for download under the MIT license on GitHub.