Cannot Parse PDF form

jvymazal · January 15, 2021, 2:03pm

Hello, I am trying to parse this PDF: TEST2.pdf (574.6 KB) with a very simple .NET Core 3.1 console app using GroupDocs.Parser (version 20.12.0).
I have tried it on both Windows and Mac with exactly the same result. While it is possible to extract basic info about the document (name, size, metadata etc.), the actual content (it is an empty tax declaration form) is replaced by the following:

Please wait…
If this message is not eventually replaced by the proper contents of the document, your PDF
viewer may not be able to display this type of document.
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by
visiting http://www.adobe.com/go/reader_download.
For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other
countries.

The above result is produced by both the GetText() and GetTextAreas() methods.

Many thanks for any advice and pointers.

atir.tahir · January 15, 2021, 7:21pm

@jvymazal

We couldn’t reproduce this issue at our end. Have a look at this screenshot.PNG (87.7 KB). Could you please share a sample console application with which this issue can be reproduced?

This is how we get the basic information:

using (Parser parser = new Parser(@"D:/TEST2.pdf"))
{
    // Get the document info
    IDocumentInfo info = parser.GetDocumentInfo();
    // Print document information
    Console.WriteLine(string.Format("FileType: {0}", info.FileType));
    Console.WriteLine(string.Format("PageCount: {0}", info.PageCount));
    Console.WriteLine(string.Format("Size: {0}", info.Size));
}

Have a look at the following documentation articles:

jvymazal · January 17, 2021, 10:45am

I am sorry, but I am not sure what I should see in the screenshot. It shows some extracted PDF, but not the one I was unable to parse (it has “Lorem ipsum” text, not the contents of the file I sent). As for the other question: yes, the basic document info does work, and I am getting it the same way as you. What does not work is the text extraction. Could you please share the text output from the file I sent?

The sample code I am using is here: Sample app to parse file using GroupDocs.Parser API · GitHub

atir.tahir · January 17, 2021, 6:01pm

@jvymazal

We are investigating this issue at our end. Your investigation ticket ID is PARSERNET-1723. We’ll notify you as soon as there is an update.