Hi,
You’re right, and I owe you an apology: my first suggestion was wrong, and I can see exactly why it didn’t help. Here is what I got wrong:
useRawModeIfPossible=true is the opposite of streaming for PDFs. Internally it routes through a code path that calls extractAllText() on the underlying Aspose.Pdf type-extractor — which materialises every page of the document up front into a String[] before returning the first reader. For a 100-page image-heavy PDF that’s the worst possible memory profile, not the best. I’m sorry for sending you down that road.
You should remove that flag. Here’s the corrected version — straight parser.getText(pageIndex) with no options, page-by-page, streaming to a Writer:

    import com.groupdocs.parser.Parser;
    import com.groupdocs.parser.data.TextReader;
    import com.groupdocs.parser.options.IDocumentInfo;

    import java.io.IOException;
    import java.io.Writer;
    import java.nio.file.Path;

    public final class PdfTextStreamer {

        /**
         * Page-by-page streaming text extraction.
         * Heavy {@code PdfExtractor} path, but only one page is processed at a time;
         * intermediate state is released between iterations.
         */
        public static void extractTextStreaming(Path pdfPath, Writer sink) throws IOException {
            try (Parser parser = new Parser(pdfPath.toString())) {
                // Nothing to do if this document does not support text extraction.
                if (!parser.getFeatures().isText()) {
                    return;
                }

                IDocumentInfo info = parser.getDocumentInfo();
                int pageCount = info == null ? 0 : info.getPageCount();

                // Page count unavailable: fall back to whole-document extraction.
                if (pageCount <= 0) {
                    try (TextReader r = parser.getText()) { // NO TextOptions here
                        if (r != null) pipe(r, sink);
                    }
                    return;
                }

                for (int p = 0; p < pageCount; p++) {
                    try (TextReader r = parser.getText(p)) { // NO TextOptions here
                        if (r == null) continue;
                        pipe(r, sink);
                    }
                    // Flush after each page so extracted text does not pile up in the writer.
                    sink.flush();
                }
            }
        }

        /** Copies the reader to the writer line by line, ending each line with '\n'. */
        private static void pipe(TextReader src, Writer dst) throws IOException {
            String line;
            while ((line = src.readLine()) != null) {
                dst.write(line);
                dst.write('\n');
            }
        }
    }

To answer your direct questions: yes, I retested against 26.5 with the same europe-and-you.pdf you provided. JDK was Corretto 19.0.2.
Honest expectation-setting on what this gets you. I re-ran with the corrected code under a 2 GB heap (G1GC) to mirror your pod limits. The numbers on your file:
| Stage | Used heap |
| --- | --- |
| baseline JVM | 51 MB |
| new Parser(…) | ~1.1 GB peak, then ~400 MB after G1 reclaims |
| getDocumentInfo().getPageCount() | ~400 MB |
| each subsequent page | stays around 400–500 MB |
So the page loop part is fine — it does not accumulate. But the new Parser(...) step alone hits ~1.1 GB transient on this file, because Aspose.Pdf decodes a substantial portion of the 84 image XObjects when constructing its document model. This is structural to the library and cannot be avoided from client code in 26.5. If your PDFs contain a higher density of high-res images than this one, that initial peak grows in proportion.
In other words: with the corrected code, your file will fit in a 2 GB heap if G1GC reclaims aggressively, but it’s close to the limit. For larger or more image-heavy PDFs you may still hit an OOM at the parser-construction step regardless of what you do client-side.
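If you want to compare your own per-stage figures with the ones above, here is a minimal sketch that uses only the JDK's java.lang.management API. HeapProbe and its method names are hypothetical helpers for illustration and are not part of GroupDocs.Parser. The peak value is an approximation: it sums per-pool peaks, which may occur at different moments, so treat it as an upper bound.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical helper (not part of GroupDocs.Parser) for logging heap use per stage. */
final class HeapProbe {

    private static final long MB = 1024L * 1024L;

    private HeapProbe() {}

    /** Heap currently in use, in MB. Includes garbage that has not been collected yet. */
    static long usedHeapMb() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() / MB;
    }

    /**
     * Sum of the recorded peak usage of all heap pools, in MB, since JVM start
     * or the last call to resetPeaks(). Pools can peak at different times,
     * so this is an upper bound on the true simultaneous peak.
     */
    static long peakHeapMb() {
        long peakBytes = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) continue;
            MemoryUsage peak = pool.getPeakUsage();
            if (peak != null) peakBytes += peak.getUsed();
        }
        return peakBytes / MB;
    }

    /** Resets the recorded peaks so the next stage can be measured on its own. */
    static void resetPeaks() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) pool.resetPeakUsage();
        }
    }

    /** One log line for the given stage label. */
    static String report(String stage) {
        return String.format("%-24s used=%d MB, peak=%d MB", stage, usedHeapMb(), peakHeapMb());
    }
}
```

One way to use it: call HeapProbe.resetPeaks() before a stage (for example, before constructing the Parser) and print HeapProbe.report("stage name") right after it. The GC log remains the more reliable source if the two disagree.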
JVM flags that materially help keep the pod from holding onto memory once parsing is done:
-XX:+UseG1GC
-XX:MaxHeapFreeRatio=20
-XX:MinHeapFreeRatio=10
-XX:G1PeriodicGCInterval=60000
-Xms256m
-Xmx2g
G1PeriodicGCInterval (in milliseconds, available since JDK 12) is the important one for pod RSS: without it, an idle JVM can keep unused heap regions committed for hours.
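Flags set in a container entrypoint are easy to lose along the way, so it can be worth confirming that the JVM in the pod actually received them. The following is a small sketch, assuming the flags are passed exactly as written above. JvmFlagCheck is a hypothetical helper name, and the comparison is a plain string match, so an equivalent spelling such as -Xmx2048m would be reported as missing.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which of the recommended flags this JVM did not receive. */
final class JvmFlagCheck {

    /** The flags listed above; adjust if your deployment uses different values. */
    static final List<String> EXPECTED = List.of(
            "-XX:+UseG1GC",
            "-XX:MaxHeapFreeRatio=20",
            "-XX:MinHeapFreeRatio=10",
            "-XX:G1PeriodicGCInterval=60000",
            "-Xms256m",
            "-Xmx2g");

    private JvmFlagCheck() {}

    /** Returns the expected flags that do not appear verbatim in actualArgs. */
    static List<String> missing(List<String> actualArgs, List<String> expected) {
        List<String> missing = new ArrayList<>();
        for (String flag : expected) {
            if (!actualArgs.contains(flag)) missing.add(flag);
        }
        return missing;
    }

    /** Checks the arguments the running JVM was started with. */
    static List<String> missingInThisJvm() {
        return missing(ManagementFactory.getRuntimeMXBean().getInputArguments(), EXPECTED);
    }
}
```

Logging JvmFlagCheck.missingInThisJvm() once at startup is enough; an empty list means all six flags were seen.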
On our side. We’ve identified two structural fixes that need to land in the library to remove the construction-time spike: (a) avoid copying the input stream into a MemoryStream when a file path is available, and (b) avoid building a second Aspose.Pdf Document instance. These will land in an upcoming version. I’ll update this thread when they ship.
If even with the corrected code above you still see >2 GB on this same file, please share:
- The exact JDK build (java -version)
- The full GC log (-Xlog:gc*:file=gc.log)
- Confirmation that the license is being loaded before the parse (an unlicensed run with this file produces an Aspose.Pdf trial exception at page 5, not OOM — so the symptom is different)
so I can rule out a difference in environment.
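If running java -version inside the pod is awkward, roughly the same information can be captured from the application itself. This is only a sketch using standard system properties and the JDK management API; EnvReport is a hypothetical name, and it does not replace the GC log or the license confirmation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.StringJoiner;

/** Hypothetical helper: summarises the JVM build, heap limit and active collectors. */
final class EnvReport {

    private EnvReport() {}

    static String describe() {
        // Collector names reveal which GC is really in use (G1 reports "G1 Young Generation" etc.).
        StringJoiner collectors = new StringJoiner(", ");
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            collectors.add(gc.getName());
        }
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024L * 1024L);
        return "java.version=" + System.getProperty("java.version") + "\n"
                + "java.vm.vendor=" + System.getProperty("java.vm.vendor") + "\n"
                + "java.vm.name=" + System.getProperty("java.vm.name") + "\n"
                + "java.vm.version=" + System.getProperty("java.vm.version") + "\n"
                + "max heap (MB)=" + maxHeapMb + "\n"
                + "collectors=" + collectors;
    }
}
```

Printing EnvReport.describe() at startup and pasting the output into the thread covers the first item in the list above.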
Sorry again for the bad first recommendation. The corrected code above should give you a real reduction; please give it a try and let me know what peak you see.
Thank you