A PDF file is usually comprised of Text, Image, Heading, Annotations and other elements. And as this format preserves the document layout across platforms (Desktop / Mobile etc), so its widely used to share information over the internet. However, we may have a requirement to extract textual content of PDF document for further processing. So in this article, we are going to discuss the details on how to extract text from PDF using Java Cloud SDK. Once the operation is complete, the output is saved in TXT format.
PDF to TXT Conversion API
Aspose.PDF Cloud SDK for Java is our award winning REST API solution offering the capabilities to create, edit and convert PDF to JPG, XPS, HTML, DOCX and variety of other supported formats. Now in order to implement the pdf text recognition capabilities in Java application, please add following details in pom.xml of maven build type project.
<repositories> <repository> <id>aspose-cloud</id> <name>artifact.aspose-cloud-releases</name> <url>http://artifact.aspose.cloud/repo</url> </repository> </repositories> <dependencies> <dependency> <groupId>com.aspose</groupId> <artifactId>aspose-pdf-cloud</artifactId> <version>21.11.0</version> </dependency> </dependencies>
After the SDK installation, the next important step is the creation of a free account over Aspose Cloud. So please login using newly created account and lookup/create Client ID and Client Secret at Cloud Dashboard. These details are required in subsequent sections.
PDF to Text in Java
Please follow the steps given below to perform the PDF to Text conversion using Java Cloud SDK. So after successful conversion, the resultant TXT file is saved in cloud storage.
- First we need to create a PdfApi object while providing ClientID and Client secret as arguments
- Secondly, load the input PDF file using File instance
- Upload the input PDF to cloud storage using uploadFile(…) method
- Create Integer variable specifying page number of PDF for text extraction and Double instances indicating the rectangular region of page from which we need to extract the Textual content
- Finally call the getPageText(…) method to fetch textual content from input PDF
The sample PDF file used in above example can be downloaded from marketing.pdf and extracted.txt
Extract Text from PDF using cURL Commands
Th REST APIs can easily be accessed via cURL commands, so in this section, we are going to explore the option of how we can extract Textual content from PDF using cURL commands. So as a pre-requisite, we first need to generate a JWT access token (based on client credentials) while executing the following command.
curl -v "https://api.aspose.cloud/connect/token" \ -X POST \ -d "grant_type=client_credentials&client_id=bb959721-5780-4be6-be35-ff5c3a6aa4a2&client_secret=4d84d5f6584160cbd91dba1fe145db14" \ -H "Content-Type: application/x-www-form-urlencoded" \ -H "Accept: application/json"
Once we have the JWT token, we need to execute the following command to extract all text occurances within PDF document.
curl -v -X GET "https://api.aspose.cloud/v3.0/pdf/input.pdf/text?splitRects=true&LLX=0&LLY=0&URX=800&URY=800" \ -H "accept: application/json" \ -H "authorization: Bearer <JWT Token>"
This article has explained the details on how to convert PDF to TXT using Java Cloud SDK. At the same time, we have also explored the options for extracting Text from PDF using cURL commands. So with the flexibility of traversing between multiple PDF pages, we get a control on where to extract the content. We highly recommend you to explore the product Documentation to learn further about the other exciting features being offered by Java Cloud API. Also, as all our Cloud SDKs are published under MIT license, so you may consider downloading the complete source code from GitHub and modify it as per your requirements. In case of any issues, you may consider approaching us for a quick resolution via free product support forum.
Please visit the following links to learn more about: