Parse a large PDF to HTML using Java

shockvip1331 · February 22, 2020, 4:22pm

Hi, I used GroupDocs.Parser version 20.1 to parse a large PDF file (over 50MB), but it only parsed roughly the first 10-14 pages and missed a lot of the PDF content. Please check this issue (version 18.12 doesn't have this problem).

Example PDF file (500MB): https://ia800304.us.archive.org/19/items/nasa_techdoc_19880069935/19880069935.pdf

atir.tahir · February 22, 2020, 6:10pm

@shockvip1331,

Please share the sample code or application with which the issue can be reproduced at our end.

shockvip1331 · February 22, 2020, 11:37pm

I just used basic code like the examples on GitHub:

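import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// filePDFPath and fileOutputPath are String paths defined elsewhere
try (Parser parser = new Parser(filePDFPath)) {
    // Extract the whole document text as a single string
    try (TextReader reader = parser.getText()) {
        String read = reader.readToEnd();
        // try-with-resources closes the stream even if write() fails
        try (OutputStream outputStream = new FileOutputStream(fileOutputPath)) {
            outputStream.write(read.getBytes(StandardCharsets.UTF_8));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}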

atir.tahir · February 23, 2020, 10:05am

@shockvip1331,

This issue has been reproduced at our end and logged in our internal issue tracking system with ID PARSERJAVA-110. As soon as there is any further update, you’ll be notified.

shockvip1331 · February 23, 2020, 1:28pm

While this is being fixed, I have to downgrade to version 18.12 to make sure my app works fine.

atir.tahir · February 23, 2020, 6:23pm

@shockvip1331,

Alright.