Hi,
Thanks for the detailed report and for sharing the file — that made it easy to reproduce. The 3+ GB spike is concentrated in three places when parsing image-heavy PDFs:
1. Input stream copy at new Parser(...): the parser buffers the full input into a MemoryStream so the underlying Aspose.Pdf engine can seek it. For your file this is the 52 MB you’d expect.
2. Aspose.Pdf parsed-document model: roughly 150–250 MB for a PDF with this many image XObjects and content streams.
3. Per-page text extraction touching image XObjects: about 50 MB per page on your file. This is the dominant term; multiplied across the 100 pages, it accounts for the bulk of the 3 GB you observed. The default getText() path routes through Aspose.Pdf’s heavyweight PdfExtractor, which inspects image references on every page even when OCR is off.
Items 1 and 2 are structural — we have a couple of optimizations on our side that will bring them down, but they need library changes. Item 3 you can sidestep today by combining two existing API options:
- new TextOptions(true): enables useRawModeIfPossible, which switches Aspose.Pdf to its lighter IPdfTypeExtractor path and skips the image-XObject processing.
- parser.getText(pageIndex, opts): iterate page by page rather than calling parser.getText().readToEnd(). Per-page intermediates become GC-eligible between iterations, and there’s no big final String holding the whole document in heap.
Drop-in replacement for your parseSimple:
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.options.IDocumentInfo;
import com.groupdocs.parser.options.TextOptions;

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public final class PdfTextStreamer {

    /**
     * Streams text from {@code pdfPath} to {@code sink} one page at a time.
     * Keeps peak heap roughly an order of magnitude lower than
     * {@code parser.getText().readToEnd()} on image-heavy PDFs.
     */
    public static void extractTextStreaming(Path pdfPath, Writer sink) throws IOException {
        TextOptions opts = new TextOptions(/*useRawModeIfPossible*/ true);
        try (Parser parser = new Parser(pdfPath.toString())) {
            if (!parser.getFeatures().isText()) {
                return; // text extraction is not supported for this document
            }
            IDocumentInfo info = parser.getDocumentInfo();
            int pageCount = info == null ? 0 : info.getPageCount();
            if (pageCount <= 0) {
                // Formats without a real page count: fall back to one shot.
                try (TextReader r = parser.getText(opts)) {
                    if (r != null) pipe(r, sink);
                }
                return;
            }
            for (int p = 0; p < pageCount; p++) {
                try (TextReader r = parser.getText(p, opts)) {
                    if (r == null) continue; // unsupported page
                    pipe(r, sink);
                }
                sink.flush(); // hand this page's buffered text to the sink before the next page
            }
        }
    }

    /** Copies {@code src} to {@code dst} line by line, so only one line is held at a time. */
    private static void pipe(TextReader src, Writer dst) throws IOException {
        String line;
        while ((line = src.readLine()) != null) {
            dst.write(line);
            dst.write('\n');
        }
    }
}
Usage:
try (Writer w = Files.newBufferedWriter(
        Path.of("/tmp/europe-and-you.txt"), StandardCharsets.UTF_8)) {
    PdfTextStreamer.extractTextStreaming(
            Path.of("/home/devlin/Documents/europe-and-you.pdf"), w);
}
On your europe-and-you.pdf I measured peak heap dropping from 3+ GB to roughly 500 MB with this approach. The residual 500 MB or so comes from items 1 and 2 in the list above, and we can’t get below that floor from client code.
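If you want to check the peak on your own files, here is a minimal sketch that uses only the JDK's java.lang.management API. PeakHeapProbe is a hypothetical helper name, not part of GroupDocs.Parser, and the number it reports is an approximation: it sums the per-pool peaks, which can overstate the true simultaneous peak because the pools do not necessarily peak at the same moment.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical helper (not part of GroupDocs.Parser) for comparing peak heap between runs. */
public final class PeakHeapProbe {

    private PeakHeapProbe() {}

    /** Resets the peak counter of every heap pool so the next reading covers only later work. */
    public static void reset() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                pool.resetPeakUsage();
            }
        }
    }

    /** Sums the peak "used" bytes of all heap pools since the last reset(). */
    public static long peakBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage peak = pool.getPeakUsage(); // null if the pool is no longer valid
            if (peak != null) {
                total += peak.getUsed();
            }
        }
        return total;
    }
}
```

Call PeakHeapProbe.reset() just before extractTextStreaming(...) and read PeakHeapProbe.peakBytes() right after it returns; comparing that figure between the old and new code paths is more telling than the absolute value.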
A couple of things worth knowing, since they look like they’d help but don’t:
- new Parser(String filePath) vs new Parser(InputStream): both paths internally copy the document into a MemoryStream before handing it to Aspose.Pdf. Switching constructors doesn’t reduce memory. This one is on us to fix.
- System.gc() between pages: generally a no-op for this allocation pattern. The G1 settings below are a more reliable way to keep the pod’s RSS bounded.
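For Kubernetes pods, the JVM tends to keep heap regions committed after it has freed them, rather than returning them to the OS. To keep RSS close to actual usage (note that -XX:G1PeriodicGCInterval requires JDK 12 or later):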
-XX:+UseG1GC
-XX:MaxHeapFreeRatio=20
-XX:MinHeapFreeRatio=10
-XX:G1PeriodicGCInterval=60000
-Xms256m
-Xmx2g
With those, the pod’s resident memory typically drops back below 1 GB within about a minute of finishing a parse.
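To confirm the heap really shrinks in your pod, a small sketch like the following can log used versus committed heap after each parse. HeapFootprint is a hypothetical name, and the code relies only on the JDK's MemoryMXBean. Committed heap is the part that counts toward RSS; RSS also includes metaspace, thread stacks, and native allocations, so expect the pod's figure to be somewhat higher.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Locale;

/** Hypothetical helper (not part of GroupDocs.Parser) that reports the current heap footprint. */
public final class HeapFootprint {

    private static final long MB = 1024L * 1024L;

    private HeapFootprint() {}

    /** Returns a one-line summary of used, committed, and max heap in MB. */
    public static String snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no defined maximum
        String maxText = max < 0 ? "undefined" : (max / MB) + " MB";
        return String.format(Locale.ROOT, "used=%d MB committed=%d MB max=%s",
                heap.getUsed() / MB, heap.getCommitted() / MB, maxText);
    }
}
```

Logging HeapFootprint.snapshot() right after a parse and again a minute or two later should show the committed value falling toward the used value if the flags are taking effect.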
If you hit PDFs where even ~500 MB is too much (think very large books with hundreds of image-heavy pages), please let us know and we’ll prioritize the internal fixes for items 1 and 2.
Thank you