Extract text from PDF OCR in Java

john.mcqueide · January 14, 2022, 4:52pm

First of all, I would like to say that the viewer works normally for text-based PDFs. The problem appears when I do the same with an image-based PDF that has been OCR’d. Instead of the correct text, I get only runs of ‘22222’.

If you open the attached PDF document in a PDF reader, you’ll see that you can search and select the words, but GroupDocs doesn’t recognize those words.

Original document: input.pdf (203.9 KB)

OCR PDF file: ocr-output.pdf (209.6 KB)

HTML result file: p_1.zip (980.5 KB)

atir.tahir · January 14, 2022, 4:52pm

@john.mcqueide

We cannot reproduce this issue using the web app. Are you using our free web app or the back-end API?

john.mcqueide · January 14, 2022, 4:52pm

@Atir_Tahir

Both. In the web app, if you select the text and copy and paste it somewhere else, you will see strange text. If you search for the text, you won’t find any results.

In the back end I tried to render the file to HTML, and the result file contains only ‘2’ characters instead of the correct text.

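    import com.groupdocs.viewer.Viewer;
    import com.groupdocs.viewer.options.HtmlViewOptions;
    import java.io.IOException;
    import java.net.URISyntaxException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    // viewerPath and resultPath are Path fields of the test class;
    // Assert is the test framework's assertion class (neither is shown in this post).
    public void renderPdfOcrImageBased() throws IOException, URISyntaxException {
        String file = viewerPath.resolve("input.pdf").toString();
        Path outputDir = this.resultPath.resolve("ocr-output");
        Files.createDirectories(outputDir);
        // One HTML file is written per page: p_1.html, p_2.html, ...
        String pathFormat = outputDir.resolve("p_{0}.html").toString();

        HtmlViewOptions viewOptions = HtmlViewOptions.forEmbeddedResources(pathFormat);

        try (Viewer viewer = new Viewer(file)) {
            viewer.view(viewOptions);
        }

        // Fails: the rendered page contains '2' characters instead of the OCR text.
        try (Stream<String> lines = Files.lines(outputDir.resolve("p_1.html"), StandardCharsets.UTF_8)) {
            Assert.assertTrue(lines.anyMatch(line -> line.contains("Shrimp")));
        }
    }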

POM dependency
groupId: com.groupdocs
artifactId: groupdocs-viewer
version: 21.11.1
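
To tell "the word is missing" apart from "the text was replaced by runs of ‘2’", a check along the following lines could be run on the generated p_1.html. This is only a rough sketch using the Java standard library: the class name OcrTextCheck and its methods are hypothetical and not part of GroupDocs.Viewer, and the regex-based tag stripping is a crude assumption, not a real HTML parser.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a GroupDocs API): inspects rendered HTML for
// an expected word and for suspicious runs of the character '2'.
public class OcrTextCheck {
    // Drop <style>/<script> blocks so embedded CSS or font data is not counted as text.
    private static final Pattern NON_TEXT_BLOCKS =
            Pattern.compile("(?is)<(style|script)\\b.*?</\\1\\s*>");
    // Crude tag matcher; good enough for a quick diagnostic.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    // Four or more consecutive '2' characters (the threshold is an arbitrary choice).
    private static final Pattern RUN_OF_TWOS = Pattern.compile("2{4,}");

    static String visibleText(String html) {
        String withoutBlocks = NON_TEXT_BLOCKS.matcher(html).replaceAll(" ");
        return TAGS.matcher(withoutBlocks).replaceAll(" ");
    }

    // Case-insensitive search in the text that remains after stripping markup.
    static boolean containsWord(String html, String word) {
        return visibleText(html).toLowerCase(Locale.ROOT)
                .contains(word.toLowerCase(Locale.ROOT));
    }

    // Counts separate runs of '2' in the stripped text.
    static int countRunsOfTwos(String html) {
        Matcher m = RUN_OF_TWOS.matcher(visibleText(html));
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample markup, only to illustrate the two cases.
        String good = "<div><span>Grilled Shrimp</span></div>";
        String bad = "<div><span>2222222</span> <span>22222</span></div>";

        System.out.println(containsWord(good, "shrimp")); // true
        System.out.println(countRunsOfTwos(good));        // 0
        System.out.println(containsWord(bad, "shrimp"));  // false
        System.out.println(countRunsOfTwos(bad));         // 2
    }
}
```

In the test above, the page content could be read with Files.readString (Java 11+) and passed to these methods, so a failure message can report both whether "Shrimp" was found and how many runs of ‘2’ appeared.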

john.mcqueide · January 14, 2022, 4:52pm

@Atir_Tahir

Can you reproduce the error? If needed I can record a video.

john.mcqueide · January 14, 2022, 4:52pm

@Atir_Tahir

See if these images help.

html-generated-file.png - HTML file generated by the Viewer with HtmlViewOptions in the cache folder
pdf-opened-on-gd-viewer.png - No result was found for the word “shrimp”
pdf-open-on-chrome.png - I can find the word “shrimp” when I open the same file in Chrome
search-2222-gd-viewer.png - If I search for “2222” it finds occurrences, because during HTML creation a lot of “2222” were added instead of the correct words.

viewer-search-issue.zip (2.9 MB)

john.mcqueide · January 14, 2022, 4:52pm

@Atir_Tahir

Here is another scanned, OCR’d document that shows the same problem.

Do you have any updates on this topic?

ocr-image-based.pdf (15.6 KB)

vladimir.litvinchik · January 14, 2022, 6:02pm

@john.mcqueide

Thank you for attaching the source and output files. We have created an issue in our internal bug tracker for investigation. The issue ID is VIEWERJAVA-2750. We’ll let you know when we have any updates.

john.mcqueide · February 3, 2022, 2:54pm

Hello,

Is there an update on this issue?

vladimir.litvinchik · February 3, 2022, 4:19pm

@john.mcqueide

The issue is still under investigation. We’ll share the update when we have any new information.

john.mcqueide · April 8, 2022, 8:58pm

Hello,

Is there any update on this issue?

vladimir.litvinchik · April 9, 2022, 5:48am

@john.mcqueide

The issue is still under investigation. We’ll let you know in case of any updates.

john.mcqueide · July 7, 2022, 4:08pm

Hello, is there any update on this topic?

vladimir.litvinchik · July 8, 2022, 12:37pm

@john.mcqueide

Unfortunately, there is no progress on this issue. Let me contact the team and check whether there is any ETA.