Unable to extract text from non-searchable PDF in GroupDocs.Parser

Niteen_Jadhav · May 5, 2023, 7:01am

Hello,

I am extracting text from my documents using the code below:

private string ExtractTextAll(MemoryStream stream, bool formatted)
{
    stream.Seek(0, SeekOrigin.Begin);
    GroupDocs.Parser.ExtractorFactory factory = new GroupDocs.Parser.ExtractorFactory();
    GroupDocs.Parser.Extractors.Text.TextExtractor extractor = formatted
        ? factory.CreateFormattedTextExtractor(stream)
        : factory.CreateTextExtractor(stream);
    if (extractor == null)
    {
        return null;
    }
    try
    {
        return extractor.ExtractAll();
    }
    finally
    {
        extractor.Dispose();
    }
}

However, some PDFs are non-searchable, and no text is extracted from them. Sample file: non-text-searchable.pdf (68.8 KB)

I tried the same file in GroupDocs Parser online, and it could not extract the text there either.

How can I extract text from non-searchable PDFs using GroupDocs.Parser?

atir.tahir · May 5, 2023, 12:20pm

@Niteen_Jadhav
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PARSERNET-2080

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

atir.tahir · May 5, 2023, 1:25pm

@Niteen_Jadhav

The GroupDocs.Parser API supports text extraction only from ‘text’ formats. It can’t extract text from an image (the sample PDF contains an image instead of text). However, Parser has a special API for connecting third-party OCR solutions to parse text from an image. Please see the details here: Using OCR.
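
To illustrate the idea of falling back to OCR only when plain extraction finds nothing, here is a minimal sketch. It does not call any GroupDocs.Parser API: the class name TextExtractionFallback, the method ExtractWithFallback, and the ocrFallback delegate are all hypothetical names introduced for this example, and the delegate stands in for whichever OCR connector you end up using.

```csharp
using System;

// Hypothetical helper (not part of GroupDocs.Parser): chooses between
// the text returned by normal extraction and an OCR fallback.
static class TextExtractionFallback
{
    // primaryText: result of the regular text extraction (may be null or empty
    //              for image-only PDFs).
    // ocrFallback: assumed delegate that runs OCR and returns the recognized
    //              text; it is only invoked when primaryText has no visible characters.
    public static string ExtractWithFallback(string primaryText, Func<string> ocrFallback)
    {
        if (!string.IsNullOrWhiteSpace(primaryText))
        {
            return primaryText;
        }

        // No usable text was found; use OCR if a fallback was supplied.
        return ocrFallback != null ? ocrFallback() : null;
    }
}
```

With the method from the original question, this could be called as ExtractWithFallback(ExtractTextAll(stream, false), () => RunOcr(stream)), where RunOcr is a placeholder for your own OCR call. The delegate keeps the OCR step lazy, so searchable PDFs never pay the OCR cost.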

Niteen_Jadhav · May 15, 2023, 1:26pm

Thank you for your response, but I have a question: if I am using a third-party tool for OCR, why would I need the Parser API at all, since I can extract the text with the third-party tool itself?

And if I am using GroupDocs for my product, why should I buy a third-party tool for OCR?

GroupDocs.Parser API supports the text extraction only from ‘text’ formats. It can’t extract a text from image (the sample PDF has the image instead of text).

Does GroupDocs plan to provide a solution for extracting text from images, or from PDFs that contain images?

atir.tahir · May 15, 2023, 8:05pm

@Niteen_Jadhav

We’ll continue our investigation and let you know about the outcomes.

Niteen_Jadhav · December 11, 2024, 8:36am

Hello,

Do you have any updates on this?

It’s almost a year now.

Thank you

atir.tahir · December 11, 2024, 8:10pm

@Niteen_Jadhav

Please try the approach described in the article “Extract a text from images and PDFs”.

Niteen_Jadhav · December 12, 2024, 11:53am

Ok, thank you for the update, I’ll check and get back to you if required.