Parsing of image PDF uses too much memory

Hello,

We are integrating GroupDocs.Parser into our solution and are running into memory usage issues. (Java, version 26.5)

When I run a simple unit test to parse a specific PDF of 52 MB, I see that the memory usage spikes above 3GB. I find it surprising that processing a 52MB pdf file requires multiple GB of memory, even if the PDF contains mostly images. (OCR is not enabled btw.)

Here is the code used for this test:

@Test
void parsesDocument() throws Exception {
    String fileToParse = "/home/devlin/Documents/europe-and-you.pdf";
    loadLicence("/GroupDocs.Parser.Java.lic");

    try (InputStream inputStream = Files.newInputStream(Paths.get(fileToParse))) {
        String textContent = parseSimple(inputStream);
        System.out.println(textContent);
    }
}

public String parseSimple(InputStream fileToParse) {
    try (Parser parser = new Parser(fileToParse)) {
        Features features = parser.getFeatures();
        if (features.isText()) {
            return parser.getText().readToEnd();
        }
    }

    return "";
}

Here is the file in question: https://drive.google.com/file/d/1A5HTz0e0XvRX4fwMsxO_gZ2aOLe4KPDg/view?usp=drive_link

We will potentially have to process larger PDF files than this, and we cannot allocate infinite amounts of memory to our pods…
Is there anything we can do to optimise memory usage?

Hi,

Thanks for the detailed report and for sharing the file — that made it easy to reproduce. The 3+ GB spike is concentrated in three places when parsing image-heavy PDFs:

  1. Input stream copy at new Parser(...) — the parser buffers the full input into a MemoryStream so the underlying Aspose.Pdf engine can seek it. For your file this is the 52 MB you’d expect.
  2. Aspose.Pdf parsed-document model — roughly 150–250 MB for a PDF with this many image XObjects and content streams.
  3. Per-page text extraction touching image XObjects — about 50 MB per page on your file. This is the dominant term: multiplied across the 100 pages it accounts for the bulk of the 3 GB you observed. The default getText() path routes through Aspose.Pdf’s heavyweight PdfExtractor, which inspects image references on every page even when OCR is off.

Items 1 and 2 are structural — we have a couple of optimizations on our side that will bring them down, but they need library changes. Item 3 you can sidestep today by combining two existing API options:

  1. new TextOptions(true) — enables useRawModeIfPossible, which switches Aspose.Pdf to its lighter IPdfTypeExtractor path and skips the image-XObject processing.
  2. parser.getText(pageIndex, opts) — iterate page-by-page rather than calling parser.getText().readToEnd(). Per-page intermediates become GC-eligible between iterations, and there’s no big final String holding the whole document in heap.

Drop-in replacement for your parseSimple:

import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.options.IDocumentInfo;
import com.groupdocs.parser.options.TextOptions;

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public final class PdfTextStreamer {

    /**
     * Streams text from {@code pdfPath} to {@code sink} one page at a time.
     * Keeps peak heap roughly an order of magnitude lower than
     * {@code parser.getText().readToEnd()} on image-heavy PDFs.
     */
    public static void extractTextStreaming(Path pdfPath, Writer sink) throws IOException {
        TextOptions opts = new TextOptions(/*useRawModeIfPossible*/ true);

        try (Parser parser = new Parser(pdfPath.toString())) {
            if (!parser.getFeatures().isText()) {
                return;
            }
            IDocumentInfo info = parser.getDocumentInfo();
            int pageCount = info == null ? 0 : info.getPageCount();

            if (pageCount <= 0) {
                // Formats without a real page count — fall back to one shot.
                try (TextReader r = parser.getText(opts)) {
                    if (r != null) pipe(r, sink);
                }
                return;
            }

            for (int p = 0; p < pageCount; p++) {
                try (TextReader r = parser.getText(p, opts)) {
                    if (r == null) continue;     // unsupported page
                    pipe(r, sink);
                }
                sink.flush();                    // hand this page's buffered text to the underlying writer
            }
        }
    }

    private static void pipe(TextReader src, Writer dst) throws IOException {
        String line;
        while ((line = src.readLine()) != null) {
            dst.write(line);
            dst.write('\n');
        }
    }
}

Usage:

try (Writer w = Files.newBufferedWriter(
        Path.of("/tmp/europe-and-you.txt"), StandardCharsets.UTF_8)) {
    PdfTextStreamer.extractTextStreaming(
        Path.of("/home/devlin/Documents/europe-and-you.pdf"), w);
}

On your europe-and-you.pdf I measured the peak heap drop from 3+ GB to roughly 500 MB with this approach. The residual ~500 MB is items 1 and 2 from the list above; we can’t get below that floor from client code.

A couple of things worth knowing, since they look like they’d help but don’t:

  • new Parser(String filePath) vs new Parser(InputStream) — both paths internally copy the document into a MemoryStream before handing it to Aspose.Pdf. Switching constructors doesn’t reduce memory. This one is on us to fix.
  • System.gc() between pages — generally a no-op for this allocation pattern. The G1 settings below are a more reliable way to keep the pod’s RSS bounded.

For Kubernetes pods, the JVM tends to hold on to heap regions it has released. To keep RSS close to actual usage:

-XX:+UseG1GC
-XX:MaxHeapFreeRatio=20
-XX:MinHeapFreeRatio=10
-XX:G1PeriodicGCInterval=60000
-Xms256m
-Xmx2g

With those, the pod’s resident memory typically drops back below 1 GB within about a minute of finishing a parse.

If you hit PDFs where even ~500 MB is too much (think very large books with hundreds of image-heavy pages), please let us know and we’ll prioritize the internal fixes for items 1 and 2.

Thank you

Hello,

Thank you for the quick and detailed reply. Unfortunately the proposed solution does not solve the issue for us. I still see the memory usage go above 3GB, and if I limit the memory to 2GB we get the java.lang.OutOfMemoryError, as you can see in this screenshot:

image.png (315.8 KB)

Since it works for you I’m assuming I’m still doing something different, but what…
Can you confirm that you tested with GroupDocs.Parser version 26.5? Also, which JDK version are you using?

Thank you


Hi,

You’re right and I owe you an apology — my first suggestion was wrong, and I can see exactly why it didn’t help. Two things I got wrong:

useRawModeIfPossible=true is the opposite of streaming for PDFs. Internally it routes through a code path that calls extractAllText() on the underlying Aspose.Pdf type-extractor — which materialises every page of the document up front into a String[] before returning the first reader. For a 100-page image-heavy PDF that’s the worst possible memory profile, not the best. I’m sorry for sending you down that road.

You should remove that flag. Here’s the corrected version — straight parser.getText(pageIndex) with no options, page-by-page, streaming to a Writer:

import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.options.IDocumentInfo;

import java.io.IOException;
import java.io.Writer;
import java.nio.file.Path;

public final class PdfTextStreamer {

    /**
     * Page-by-page streaming text extraction.
     * Heavy {@code PdfExtractor} path, but only one page is processed at a time;
     * intermediate state is released between iterations.
     */
    public static void extractTextStreaming(Path pdfPath, Writer sink) throws IOException {
        try (Parser parser = new Parser(pdfPath.toString())) {
            if (!parser.getFeatures().isText()) {
                return;
            }
            IDocumentInfo info = parser.getDocumentInfo();
            int pageCount = info == null ? 0 : info.getPageCount();

            if (pageCount <= 0) {
                try (TextReader r = parser.getText()) {              // NO TextOptions here
                    if (r != null) pipe(r, sink);
                }
                return;
            }

            for (int p = 0; p < pageCount; p++) {
                try (TextReader r = parser.getText(p)) {              // NO TextOptions here
                    if (r == null) continue;
                    pipe(r, sink);
                }
                sink.flush();
            }
        }
    }

    private static void pipe(TextReader src, Writer dst) throws IOException {
        String line;
        while ((line = src.readLine()) != null) {
            dst.write(line);
            dst.write('\n');
        }
    }
}
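
To illustrate why materialising every page up front behaves so differently from the page loop above, here is a generic sketch. It is not GroupDocs or Aspose code: the class EagerVsStreaming, its method names, and the pageText function are all made up for this example. Both methods produce the same output; the difference is how many page strings are reachable at the same time.

```java
import java.util.function.Consumer;
import java.util.function.IntFunction;

// Hypothetical illustration only; none of these names exist in GroupDocs.Parser.
final class EagerVsStreaming {

    private EagerVsStreaming() {}

    /** Eager: every page's text is held in the array before the first write. */
    static void writeEager(int pageCount, IntFunction<String> pageText, Consumer<String> sink) {
        String[] all = new String[pageCount];
        for (int p = 0; p < pageCount; p++) {
            all[p] = pageText.apply(p);      // all pages stay reachable via 'all'
        }
        for (String text : all) {
            sink.accept(text);
        }
    }

    /** Streaming: each page's text becomes garbage as soon as it is written. */
    static void writeStreaming(int pageCount, IntFunction<String> pageText, Consumer<String> sink) {
        for (int p = 0; p < pageCount; p++) {
            sink.accept(pageText.apply(p));  // only the current page is referenced
        }
    }
}
```

With the eager variant the peak grows with the page count; with the streaming variant it is bounded by the largest single page plus whatever the sink buffers.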

To answer your direct questions: yes, I retested against 26.5 with the same europe-and-you.pdf you provided. JDK was Corretto 19.0.2.

Honest expectation-setting on what this gets you. I re-ran with the corrected code under a 2 GB heap (G1GC) to mirror your pod limits. The numbers on your file:

Stage                              | Used heap
-----------------------------------|----------------------------------------------
baseline JVM                       | 51 MB
new Parser(…)                      | ~1.1 GB peak, then ~400 MB after G1 reclaims
getDocumentInfo().getPageCount()   | ~400 MB
each subsequent page               | stays around 400–500 MB

So the page loop part is fine — it does not accumulate. But the new Parser(...) step alone hits ~1.1 GB transient on this file, because Aspose.Pdf decodes a substantial portion of the 84 image XObjects when constructing its document model. This is structural to the library and cannot be avoided from client code in 26.5. If your PDFs contain a higher density of high-res images than this one, that initial peak grows in proportion.

In other words: with the corrected code, your file will fit in a 2 GB heap if G1GC reclaims aggressively, but it’s close to the limit. For larger image-heavier PDFs you may still OOM at the parser-construction step regardless of what you do client-side.
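
If you want to take comparable measurements on your side, one option is the standard java.lang.management API. The following is only a minimal sketch: the class name HeapPeakProbe and its methods are invented for this example, and it is not necessarily how the figures above were collected. It resets the recorded peak of each heap memory pool, runs a piece of work, and then sums the per-pool peaks.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper; only the java.lang.management calls are real JDK API.
final class HeapPeakProbe {

    private HeapPeakProbe() {}

    /** Resets the recorded peak of every heap pool to its current usage. */
    static void reset() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                pool.resetPeakUsage();
            }
        }
    }

    /**
     * Sum of the per-pool peaks since the last reset, in bytes.
     * Pools can peak at different moments, so treat this as an upper
     * bound on the true simultaneous heap peak, not an exact figure.
     */
    static long peakBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage peak = pool.getPeakUsage();  // null if the pool is no longer valid
            if (peak != null) {
                total += peak.getUsed();
            }
        }
        return total;
    }

    /** Runs {@code work} and returns the heap peak observed while it ran. */
    static long measure(Runnable work) {
        reset();
        work.run();
        return peakBytes();
    }
}
```

Wrapping the parser construction and the page loop in separate measure(...) calls should show which of the two stages is responsible for the peak in your environment.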

JVM flags that materially help keep the pod from holding onto memory once parsing is done:

-XX:+UseG1GC
-XX:MaxHeapFreeRatio=20
-XX:MinHeapFreeRatio=10
-XX:G1PeriodicGCInterval=60000
-Xms256m
-Xmx2g

G1PeriodicGCInterval is the important one for pod RSS — without it, the JVM keeps unused heap regions for hours.
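
In a pod it is easy for flags to get lost between the deployment manifest and the actual java command line, so it can be worth logging what the JVM really received at startup. A minimal sketch, assuming nothing beyond the standard JDK management API (the class name JvmSettingsReport is made up for this example):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for logging the effective JVM settings at startup.
final class JvmSettingsReport {

    private JvmSettingsReport() {}

    /** Maximum heap the JVM will try to use, in MiB (approximately the -Xmx value). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** Names of the active collectors; with G1 these start with "G1". */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    /** Arguments the JVM was started with (not the defaults it picked itself). */
    static List<String> inputArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** Multi-line summary suitable for a startup log line. */
    static String describe() {
        return "max heap: " + maxHeapMiB() + " MiB\n"
             + "collectors: " + collectorNames() + "\n"
             + "JVM arguments: " + inputArguments();
    }
}
```

Logging JvmSettingsReport.describe() once at startup makes it visible whether -Xmx2g and the G1 options above actually reached the process.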

On our side. We’ve identified two structural fixes that need to land in the library to remove the construction-time spike: (a) avoid copying the input stream into a MemoryStream when a file path is available, and (b) avoid building a second Aspose.Pdf Document instance. These will land in an upcoming version. I’ll update this thread when they ship.

If even with the corrected code above you still see >2 GB on this same file, please share:

  • The exact JDK build (java -version)
  • The full GC log (-Xlog:gc*:file=gc.log)
  • Confirmation that the license is being loaded before the parse (an unlicensed run with this file produces an Aspose.Pdf trial exception at page 5, not OOM — so the symptom is different)

so I can rule out a difference in environment.

Sorry again for the bad first recommendation. The corrected code above should give you a real reduction; please give it a try and let me know what peak you see.

Thank you