GroupDocs.Viewer support for text detection

witold.lojek · November 16, 2016, 5:05pm

Hi,

I have a problem with text recognition in specific PDF documents. When I try to select or find words in these documents, nothing is highlighted in the preview. When I try to copy and paste anything from them, all I get are random letters instead of the actual words. It is worth mentioning that the text layer in each document was extracted beforehand by an OCR application based on FREngine ABBYY FineReader. What is the cause of the problem, and can you suggest a solution? I’m using GroupDocs.Viewer 2.15.1.0 and 3.7.0.0. A sample document is attached.

Thanks,
Witold

usman.aziz · November 16, 2016, 7:03pm

Hi Witold,

Thanks for using GroupDocs.Viewer for .NET.

Can you please tell us whether you are using image-based or HTML-based rendering for the documents? We also recommend using the latest version of the API, which is GroupDocs.Viewer for .NET 16.10. We look forward to your response.

Warm Regards

witold.lojek · November 17, 2016, 9:47am

Unfortunately, this error occurs with both image-based and HTML-based rendering. As you suggested, I also tested with GroupDocs.Viewer for .NET 16, but the result was the same.

Regards,
Witold

usman.aziz · November 17, 2016, 1:10pm

Hi Witold,

Thanks for providing the required information.

We are investigating the issue you reported and will get back to you once we have results. We appreciate your patience and cooperation in this regard.

Have a nice day.

Warm Regards

witold.lojek · November 28, 2016, 3:08pm

Hi,
I would like to ask whether you have figured out what causes the problems with the preview of these specific documents, and how to solve them.

Regards,
Witold

usman.aziz · November 28, 2016, 5:26pm

Thanks for coming back to us.

The quality of the PDF document may affect its rendering and cause the problem. However, the issue is still under investigation, and we cannot provide any further information until we have results. We will notify you here as soon as there is an update.

Warm Regards

usman.aziz · December 12, 2016, 5:12pm

Hi Witold,

The PDF document doesn’t contain text that can be converted to HTML. Instead, it contains only a raster image, so it is presented as an image when converted to HTML. That is why you are unable to search for words, and why random letters appear instead of the actual words when you copy text from the image. This confirms that the issue is related to this particular document and not to the API.

In case of any further questions, please feel free to let us know.

Warm Regards

witold.lojek · December 13, 2016, 9:55am

Hi,

Thank you for your detailed response. However, I would like to ask another question. If this document contains no text but only a raster image, why am I able to copy, mark, and search for actual words when it is opened in Adobe Reader and similar software?

Regards,
Witold

usman.aziz · December 13, 2016, 1:21pm

Hi Witold,

Thanks for writing back to us.

You are able to copy, mark, or search for words because Adobe Reader or other software may provide text selection and search in images using OCR technology. If you have any other questions, please feel free to let us know.

Warm Regards

witold.lojek · December 13, 2016, 4:09pm

Hi,

Thank you for your quick response. I did some research on what you said and, with all respect, I don’t think the standard version of Adobe Acrobat Reader uses OCR technology to provide text selection in documents whose text layer was extracted beforehand by third-party software. The OCR functionality is only available in Adobe Acrobat Reader Pro and is used as a separate mechanism, but, as I mentioned before, I am able to select and copy text in the standard version of Adobe Acrobat Reader. In my opinion, the text layer is stored as text, not as a raster image. Do you have any other ideas about what is causing the problem?

Regards,
Witold

usman.aziz · December 14, 2016, 6:58am

Hi Witold,

Thanks for sharing the information with us.

According to our analysis and investigation, the issue is specific to this PDF document: the whole page is rendered as a single image in HTML-based rendering. Therefore, you are unable to find or select words, and random letters appear when you select and copy the whole content. We apologize for the inconvenience, but currently there is no other way to use the search or text selection features with this particular document.

Warm Regards
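
A rough way to test the disagreement in this thread (an image-only page versus a page that also carries a text layer) is to look for font and image markers in the raw bytes of the PDF. The following is a minimal Python sketch under stated assumptions. It is not part of GroupDocs.Viewer and was not posted by either participant. The function name text_layer_hints is hypothetical. The check is only a heuristic: PDFs that keep their objects in compressed object streams hide these markers from a plain byte scan, so a zero count does not prove anything.

```python
import re


def text_layer_hints(pdf_bytes: bytes) -> dict:
    """Count raw markers hinting at fonts, ToUnicode maps, and images.

    Hypothetical helper for illustration only. It scans uncompressed
    bytes, so objects stored inside compressed object streams are missed.
    """
    return {
        # Font dictionaries: "/Type /Font" but not "/Type /FontDescriptor".
        "fonts": len(re.findall(rb"/Type\s*/Font\b", pdf_bytes)),
        # ToUnicode entries map glyph codes back to characters for copy and search.
        "to_unicode": len(re.findall(rb"/ToUnicode\b", pdf_bytes)),
        # Image XObjects, such as a scanned page background.
        "images": len(re.findall(rb"/Subtype\s*/Image\b", pdf_bytes)),
    }


if __name__ == "__main__":
    # Synthetic fragment standing in for a real file; replace with
    # open(path, "rb").read() to inspect an actual document.
    sample = (
        b"1 0 obj << /Type /Font /Subtype /Type1 /ToUnicode 5 0 R >> endobj "
        b"2 0 obj << /Type /XObject /Subtype /Image >> endobj "
        b"3 0 obj << /Type /FontDescriptor >> endobj"
    )
    print(text_layer_hints(sample))
```

A nonzero fonts count suggests the file carries text operators in addition to any page image. A font count with no to_unicode entries is one possible, unconfirmed explanation for garbled copy results in general, since viewers use that mapping to turn glyph codes back into characters; nothing in this thread confirms that it applies to the attached document.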