
Extract Text from PDF in Java

PDF is one of the most widely used digital formats for presenting and exchanging documents reliably, independent of software, hardware, or operating system. In some scenarios, however, we only need an excerpt from a large PDF file, or we need to save the PDF content as text online. In this article, we explore how to develop a PDF to Text converter using the Java REST API.

PDF Generator API

Our REST API lets you generate PDF documents from templates or from scratch. It also enables you to edit PDF files and transform them to other supported formats. The Java Cloud SDK additionally supports extracting text from PDF, as well as decrypting and merging PDF files. To use the Aspose.PDF Cloud SDK for Java, add a reference to it in your Java application by including the following details in pom.xml (Maven build type project).

<repositories> 
    <repository>
        <id>aspose-cloud</id>
        <name>artifact.aspose-cloud-releases</name>
        <url>http://artifact.aspose.cloud/repo</url>
    </repository>   
</repositories>

<dependencies>
    <dependency>
        <groupId>com.aspose</groupId>
        <artifactId>aspose-cloud-pdf</artifactId>
        <version>21.11.0</version>
        <scope>compile</scope>
    </dependency>
</dependencies>

After adding the dependency, create a free account on the Cloud Dashboard and obtain your personalized client credentials.

Extract Text from PDF using Java

Let’s explore the details of extracting text from PDF using the Java Cloud SDK. In this example, we are going to use the following input file, PdfWithTable.pdf.


Image 1:- Input file for PDF to Text extraction.


Image 2:- Extract text from PDF preview

Now let’s walk through the key steps of the code:

PdfApi pdfApi = new PdfApi(clientSecret,clientId);

Create an instance of PdfApi while passing the personalized credentials as arguments.

File file = new File(name); 
pdfApi.uploadFile("input.pdf", file, null);

Read the input PDF using a File object and upload it to cloud storage using the uploadFile(…) method of the PdfApi class. Please note that the file is stored under the name passed to uploadFile, here input.pdf.

TextRectsResponse response = pdfApi.getText("input.pdf", LLX, LLY, URX, URY, null, null, null, null, null);    

Now call the getText(…) method, specifying the name of the input PDF file and the rectangle on the page from which to extract text: LLX and LLY are the coordinates of the lower-left corner, URX and URY those of the upper-right corner. The extracted content is returned in a TextRectsResponse object.
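To make the rectangle arguments more concrete, here is a minimal sketch of how such a region could be computed. It assumes the coordinates are expressed in PDF points (1/72 inch) with the origin at the lower-left corner of the page, which is the default PDF user space; whether the API applies exactly this convention is an assumption here. TextRegion and its methods are hypothetical helpers for illustration and are not part of the SDK.

```java
import java.math.BigDecimal;

// Hypothetical helper, not part of the Aspose SDK: models the rectangle
// passed to getText(...) as a lower-left and an upper-right corner.
class TextRegion {
    final double llx, lly, urx, ury;

    TextRegion(double llx, double lly, double urx, double ury) {
        // The upper-right corner must lie strictly above and to the right
        // of the lower-left corner, otherwise the rectangle is empty.
        if (urx <= llx || ury <= lly) {
            throw new IllegalArgumentException("empty or inverted rectangle");
        }
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    // Top half of a page: y runs from the vertical midpoint up to the top edge
    // (assumes the origin is at the lower-left corner of the page).
    static TextRegion topHalf(double pageWidth, double pageHeight) {
        return new TextRegion(0, pageHeight / 2, pageWidth, pageHeight);
    }

    // Renders the region as LLX/LLY/URX/URY query parameters.
    String toQuery() {
        return "LLX=" + fmt(llx) + "&LLY=" + fmt(lly)
             + "&URX=" + fmt(urx) + "&URY=" + fmt(ury);
    }

    // Formats whole numbers without a trailing ".0".
    private static String fmt(double v) {
        return BigDecimal.valueOf(v).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        // An A4 portrait page is roughly 595 x 842 points.
        System.out.println(topHalf(595, 842).toQuery());
    }
}
```

The four values of a region built this way can be passed as the LLX, LLY, URX and URY arguments of getText. A rectangle at least as large as the page should cover all of its text.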

response.getTextOccurrences().getList().get(counter).getText()

Finally, to print the extracted text, iterate through all TextOccurrences and display them in the console.

PDF to Text using cURL Commands

Apart from the Java code snippet, we can also perform the PDF to Text operation using cURL commands. One prerequisite for this approach is to generate a JWT access token (based on your client credentials) using the following command.

curl -v "https://api.aspose.cloud/connect/token" \
-X POST \
-d "grant_type=client_credentials&client_id=bb959721-5780-4be6-be35-ff5c3a6aa4a2&client_secret=4d84d5f6584160cbd91dba1fe145db14" \
-H "Content-Type: application/x-www-form-urlencoded" \
-H "Accept: application/json"

Once the JWT is generated, please execute the following command to extract text from the PDF file already available in cloud storage.

curl -v -X GET "https://api.aspose.cloud/v3.0/pdf/input.pdf/text?splitRects=true&LLX=0&LLY=0&URX=800&URY=800" \
-H  "accept: application/json" \
-H  "authorization: Bearer <JWT Token>"
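For readers who prefer to stay in Java without the SDK, the two cURL commands above can be approximated with the JDK's built-in java.net.http package (Java 11 or later). The following is only a sketch: AsposeRequests and its method names are hypothetical, and the code builds the requests without sending them. The endpoints, headers, and query parameters are copied from the cURL commands.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class (not part of the Aspose SDK) that mirrors
// the two cURL commands as java.net.http requests.
class AsposeRequests {
    static final String BASE = "https://api.aspose.cloud";

    // Equivalent of the token cURL command: a form-encoded POST.
    static HttpRequest tokenRequest(String clientId, String clientSecret) {
        String form = "grant_type=client_credentials"
                + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
                + "&client_secret=" + URLEncoder.encode(clientSecret, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(BASE + "/connect/token"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Accept", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
    }

    // Equivalent of the text-extraction cURL command: a GET with a bearer token.
    static HttpRequest textRequest(String fileName, String jwt,
                                   int llx, int lly, int urx, int ury) {
        // URLEncoder targets form encoding, so spaces become "+";
        // convert them to "%20" for use inside a URL path segment.
        String name = URLEncoder.encode(fileName, StandardCharsets.UTF_8).replace("+", "%20");
        String url = BASE + "/v3.0/pdf/" + name + "/text"
                + "?splitRects=true&LLX=" + llx + "&LLY=" + lly
                + "&URX=" + urx + "&URY=" + ury;
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .header("Authorization", "Bearer " + jwt)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values; real credentials come from the Cloud Dashboard.
        HttpRequest token = tokenRequest("<client id>", "<client secret>");
        HttpRequest text = textRequest("input.pdf", "<JWT Token>", 0, 0, 800, 800);
        System.out.println(token.method() + " " + token.uri());
        System.out.println(text.method() + " " + text.uri());
    }
}
```

To execute a request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). The token is then expected in the access_token field of the JSON response, as in a standard OAuth 2.0 client-credentials response; that field name is an assumption here, not something the commands above confirm.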

Quick Tip

Looking for a free PDF to Text app? Please try our PDF Parser.

Concluding Remarks

In conclusion, extracting text from PDF files using Java is a practical way to automate data processing and analysis. This guide covered the SDK setup, text extraction with the Java Cloud SDK, and the equivalent cURL commands, giving you a foundation for building your own text extraction solution, whether for data analysis, machine learning, or any other purpose.

If you are interested in exploring other features offered by the API, please see the Product Documentation. Lastly, if you encounter any issue while using the API, or have any related query, please feel free to contact us via the free Product Support Forum.

Please visit the following links to learn more about: