Image extracted from PDF file has worse quality than before

safetica.rad · November 24, 2023, 10:41am

Hello,

We updated GroupDocs.Parser from version 22.6.0 to 23.8.0. After this update, images extracted from PDF files have worse quality. This is a problem for us because we use third-party OCR technology to extract text from these images, and we now get worse text extraction results.

I observed this problem on several files. Here is one so you can test it yourself.
TestFile.pdf (2.5 MB)

I created a GroupDocs.Parser object from that document, extracted the PageImageArea objects with a GroupDocsParser.GetImages() call, and saved each one with PageImageArea.Save(). The code is the same in both cases, but the resulting .jpeg images differ: the one from version 22.6.0 is 2.6 MB, and the one from version 23.8.0 is 462 KB.
ExtractedImages.zip (2.9 MB)

The difference is not introduced when saving the images: the extracted PageImageArea itself is already different (its size in bytes differs).
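
To tell whether the smaller file comes from a lower resolution or only from stronger compression, the pixel dimensions of the two saved images can be compared alongside their byte sizes. Below is a minimal Python sketch using only the standard library. It assumes both files are ordinary JPEG streams; the function names `jpeg_dimensions` and `describe` and the file names in the usage comment are hypothetical and not part of GroupDocs.Parser.

```python
import struct

# Start-of-frame markers carry the frame size.
# 0xC4 (DHT), 0xC8 (JPG) and 0xCC (DAC) are not SOF markers.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def jpeg_dimensions(data):
    """Return (width, height) read from the first start-of-frame segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("unexpected byte where a marker was expected")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker, skip it
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:  # markers without a length
            pos += 2
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            if pos + 9 > len(data):
                raise ValueError("truncated start-of-frame segment")
            # Segment layout: length (2), precision (1), height (2), width (2)
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        pos += 2 + length  # jump over this segment to the next marker
    raise ValueError("no start-of-frame segment found")


def describe(path):
    """Return byte size and pixel dimensions of a JPEG file on disk."""
    with open(path, "rb") as f:
        data = f.read()
    width, height = jpeg_dimensions(data)
    return {"bytes": len(data), "width": width, "height": height}


# Usage (hypothetical file names for the two extracted images):
#   print(describe("image_22.6.0.jpeg"))
#   print(describe("image_23.8.0.jpeg"))
```

If both files report the same width and height, the size drop most likely comes from a different compression level; if the dimensions differ, the image was resampled during extraction.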

Why was this changed? I did not find it in the changelogs. Is it a bug?

Thank you

atir.tahir · November 24, 2023, 12:19pm

@safetica.rad
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PARSERNET-2207

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.