Parsing of image PDF uses too much memory

Hello,

We are integrating GroupDocs.Parser into our solution and are running into memory usage issues (Java, version 26.5).

When I run a simple unit test to parse a specific 52 MB PDF, memory usage spikes above 3 GB. I find it surprising that processing a 52 MB PDF file requires multiple GB of memory, even if the PDF contains mostly images. (OCR is not enabled, by the way.)

Here is the code used for this test:

@Test
void parsesDocument() throws Exception {
    String fileToParse = "/home/devlin/Documents/europe-and-you.pdf";
    loadLicence("/GroupDocs.Parser.Java.lic");

    try (InputStream inputStream = Files.newInputStream(Paths.get(fileToParse))) {
        String textContent = parseSimple(inputStream);
        System.out.println(textContent);
    }
}

public String parseSimple(InputStream fileToParse) {
    try (Parser parser = new Parser(fileToParse)) {
        Features features = parser.getFeatures();
        if (features.isText()) {
            return parser.getText().readToEnd();
        }
    }

    return "";
}

Here is the file in question: https://drive.google.com/file/d/1A5HTz0e0XvRX4fwMsxO_gZ2aOLe4KPDg/view?usp=drive_link

We may have to process PDF files larger than this one, and we cannot allocate infinite amounts of memory to our pods…
Is there anything we can do to optimise memory usage?

Hi,

Thanks for the detailed report and for sharing the file — that made it easy to reproduce. The 3+ GB spike is concentrated in three places when parsing image-heavy PDFs:

  1. Input stream copy at new Parser(...) — the parser buffers the full input into a MemoryStream so the underlying Aspose.Pdf engine can seek it. For your file this is the 52 MB you’d expect.
  2. Aspose.Pdf parsed-document model — roughly 150–250 MB for a PDF with this many image XObjects and content streams.
  3. Per-page text extraction touching image XObjects — about 50 MB per page on your file. This is the dominant term: multiplied across the 100 pages it accounts for the bulk of the 3 GB you observed. The default getText() path routes through Aspose.Pdf’s heavyweight PdfExtractor, which inspects image references on every page even when OCR is off.

Items 1 and 2 are structural — we have a couple of optimizations on our side that will bring them down, but they need library changes. Item 3 you can sidestep today by combining two existing API options:

  1. new TextOptions(true) — enables useRawModeIfPossible, which switches Aspose.Pdf to its lighter IPdfTypeExtractor path and skips the image-XObject processing.
  2. parser.getText(pageIndex, opts) — iterate page-by-page rather than calling parser.getText().readToEnd(). Per-page intermediates become GC-eligible between iterations, and there’s no big final String holding the whole document in heap.

Drop-in replacement for your parseSimple:

import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.options.IDocumentInfo;
import com.groupdocs.parser.options.TextOptions;

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public final class PdfTextStreamer {

    /**
     * Streams text from {@code pdfPath} to {@code sink} one page at a time.
     * Keeps peak heap well below that of
     * {@code parser.getText().readToEnd()} on image-heavy PDFs.
     */
    public static void extractTextStreaming(Path pdfPath, Writer sink) throws IOException {
        TextOptions opts = new TextOptions(/*useRawModeIfPossible*/ true);

        try (Parser parser = new Parser(pdfPath.toString())) {
            if (!parser.getFeatures().isText()) {
                return;
            }
            IDocumentInfo info = parser.getDocumentInfo();
            int pageCount = info == null ? 0 : info.getPageCount();

            if (pageCount <= 0) {
                // Formats without a real page count — fall back to one shot.
                try (TextReader r = parser.getText(opts)) {
                    if (r != null) pipe(r, sink);
                }
                return;
            }

            for (int p = 0; p < pageCount; p++) {
                try (TextReader r = parser.getText(p, opts)) {
                    if (r == null) continue;     // unsupported page
                    pipe(r, sink);
                }
                sink.flush();                    // hand this page's text to the sink before starting the next page
            }
        }
    }

    private static void pipe(TextReader src, Writer dst) throws IOException {
        String line;
        while ((line = src.readLine()) != null) {
            dst.write(line);
            dst.write('\n');
        }
    }
}

Usage:

try (Writer w = Files.newBufferedWriter(
        Path.of("/tmp/europe-and-you.txt"), StandardCharsets.UTF_8)) {
    PdfTextStreamer.extractTextStreaming(
        Path.of("/home/devlin/Documents/europe-and-you.pdf"), w);
}

On your europe-and-you.pdf, I measured peak heap dropping from 3+ GB to roughly 500 MB with this approach. The residual 500 MB or so comes from items 1 and 2 in the list above; client code cannot get below that floor.
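
If you want to check the peak heap on your own files, here is a minimal sketch using only the standard java.lang.management API. HeapPeakProbe is a hypothetical helper name, not part of GroupDocs.Parser, and the figure it reports is an approximation: it sums the per-pool peaks, which can overstate the true simultaneous peak, and some collectors only refresh these counters around GC cycles.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical helper (not a GroupDocs API): reports approximate peak heap around a task. */
final class HeapPeakProbe {

    /** Resets the JVM's per-pool peak counters so the next reading covers only what follows. */
    static void reset() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            pool.resetPeakUsage();
        }
    }

    /**
     * Sums the peak "used" bytes of every heap pool since the last reset.
     * Pools may peak at different moments, so treat this as an upper-bound estimate.
     */
    static long peakHeapBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                MemoryUsage peak = pool.getPeakUsage();
                if (peak != null) {
                    total += peak.getUsed();
                }
            }
        }
        return total;
    }

    /** Runs the task and returns the estimated peak heap in MiB. */
    static long measureMiB(Runnable task) {
        reset();
        task.run();
        return peakHeapBytes() / (1024L * 1024L);
    }
}
```

In a unit test you could wrap the extraction call in a Runnable and pass it to HeapPeakProbe.measureMiB(...), once for the readToEnd() version and once for the page-by-page version, and compare the two numbers.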

A couple of things worth knowing, since they look like they’d help but don’t:

  • new Parser(String filePath) vs new Parser(InputStream) — both paths internally copy the document into a MemoryStream before handing it to Aspose.Pdf. Switching constructors doesn’t reduce memory. This one is on us to fix.
  • System.gc() between pages — generally a no-op for this allocation pattern. The G1 settings below are a more reliable way to keep the pod’s RSS bounded.

For Kubernetes pods, the JVM tends to hold on to heap regions it has freed instead of returning them to the OS. To keep RSS close to actual usage (note that G1PeriodicGCInterval requires JDK 12 or later):

-XX:+UseG1GC
-XX:MaxHeapFreeRatio=20
-XX:MinHeapFreeRatio=10
-XX:G1PeriodicGCInterval=60000
-Xms256m
-Xmx2g

With those, the pod’s resident memory typically drops back below 1 GB within about a minute of finishing a parse.
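
To confirm that the flags above actually reached the JVM inside the container, a small check like the following can be logged at startup. This is only a sketch; JvmHeapSettings is a hypothetical class name, and it relies solely on standard JDK calls.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical startup check (not a GroupDocs API): summarizes heap limits, collectors and JVM flags. */
final class JvmHeapSettings {

    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long mib = 1024L * 1024L;

        // Names of the active collectors, e.g. "G1 Young Generation" when G1 is in use.
        List<String> collectors = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            collectors.add(gc.getName());
        }

        // Flags the JVM was started with (does not include the main class or program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        return "maxHeapMiB=" + rt.maxMemory() / mib
                + " committedMiB=" + rt.totalMemory() / mib
                + " usedMiB=" + (rt.totalMemory() - rt.freeMemory()) / mib
                + " collectors=" + collectors
                + " jvmArgs=" + jvmArgs;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If maxHeapMiB does not match your -Xmx value, or the collector names do not mention G1, the options are probably not being passed to the process (for example, JAVA_TOOL_OPTIONS not set in the pod spec).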

If you hit PDFs where even ~500 MB is too much (think very large books with hundreds of image-heavy pages), please let us know and we'll prioritize the internal fixes for items 1 and 2.

Thank you