Parse PDF documents using .NET Core

dpinart · April 7, 2020, 10:51am

I have a kind of PDF that doesn’t seem to be recognized by the GroupDocs.Parser library. Although I can select, copy and paste text from Acrobat Reader, when I try to parse the file with GroupDocs.Parser I only get 4 PageArea items with just a few character strings. Is there any limitation on what kinds of PDFs can be read? Maybe it is about encoding?

atir.tahir · April 7, 2020, 5:51pm

@dpinart,

Can you please share the problematic files with us? If you cannot upload them here due to the attachment size limit, you can upload them to cloud storage (e.g. Dropbox, Google Drive) and share the link here. Do you get any exception when you try to parse such files? If yes, please share it.

dpinart · April 7, 2020, 6:09pm

Sure, the file is not large at all: WS_gcompedh.pdf_297983768529296001.pdf (59.4 KB)

atir.tahir · April 7, 2020, 7:17pm

@dpinart,

Please also share the sample code/application.

dpinart · April 14, 2020, 2:11pm

Hi, I didn’t see your reply. I don’t understand why you need the code. Can you read and extract text and tables from this file?

atir.tahir · April 14, 2020, 6:08pm

@dpinart,

The GroupDocs.Parser API is available for both the .NET and Java platforms. Please mention the API variant you are using (.NET or Java). That is why we asked for the code sample.

dpinart · April 14, 2020, 6:20pm

I’m using .NET Core 3.1.

atir.tahir · April 14, 2020, 9:21pm

@dpinart,

We are not able to reproduce this issue using this application. Have a look at output.png (23.9 KB). We’d recommend that you download our open-source GitHub example application (it contains a .NET Core project as well).
However, we noticed that if no license is applied, there is no output at all because of API limitations.

dpinart · April 15, 2020, 5:44am

May I know which example from the application you used to get the output?

Thanks

atir.tahir · April 15, 2020, 6:14am

@dpinart,

We used this simple example to extract text from a document. If you clone or download this repository, it has both .NET Core and .NET Framework example projects. The good thing about this repository is that it implements all the basic and advanced API features. However, for your convenience we prepared a simple .NET Core console application with just one feature (extracting text from documents) that you can download here.

dpinart · April 15, 2020, 6:32am

I’m seeing a really weird issue. I’m instantiating Parser with a Stream instead of the file path. My app will receive emails with files attached, so I’m now writing some unit tests to check that each parser works properly (each file will be in a different format depending on the customer sending the PDF). If I debug the unit tests, the parser does not seem to work properly with this PDF (and other similar PDFs coming from the same customer), while if I run the unit tests without debugging it seems to work OK. As I said, it’s really weird because it only happens with this kind of PDF. I have tested other PDFs (2 other types) and I could debug them with no problem.

Thanks for the info

atir.tahir · April 15, 2020, 8:30am

@dpinart,

In this case, we need a simple console application to reproduce this issue at our end. We also tried to reproduce it by passing a document stream to Parser with the PDF you shared, but the output in the console window is correct. Your cooperation will be highly appreciated.
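
For readers following along, a minimal sketch of the kind of console reproduction being requested here might look like the following. It is not the application used in this thread; it assumes the GroupDocs.Parser for .NET package is referenced, and the file name `sample.pdf` is a hypothetical placeholder.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

class StreamTextExtraction
{
    static void Main(string[] args)
    {
        // Hypothetical input path; pass the real PDF as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.pdf";

        // Open the document as a Stream and hand it to Parser,
        // instead of passing the file path directly.
        using (Stream stream = File.OpenRead(path))
        using (Parser parser = new Parser(stream))
        using (TextReader reader = parser.GetText())
        {
            // GetText() returns null when text extraction is not supported
            // for the document format.
            Console.WriteLine(reader == null
                ? "Text extraction isn't supported for this document."
                : reader.ReadToEnd());
        }
    }
}
```

As noted earlier in the thread, the output can be limited or empty when no license has been applied, so it is worth checking licensing before comparing results.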

dpinart · April 15, 2020, 10:30am

I found the issue. It’s about licensing. I’m using a decorator to initialize the License whenever a test suite starts. I forgot to decorate the test class for this kind of file, so the license was only being loaded when I ran the full test suite, but not when I debugged one of the tests of this suite.

My apologies for the inconvenience.
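
One way to avoid this class of mistake is to apply the license once per test assembly rather than per test class. The sketch below is an illustration only: the thread does not say which test framework is in use, so NUnit is an assumption here, and the license file name is a hypothetical placeholder.

```csharp
using NUnit.Framework;

// Assumption: NUnit. A [SetUpFixture] declared outside any namespace
// applies to the whole test assembly, so its one-time setup also runs
// when a single test is run or debugged on its own.
[SetUpFixture]
public class LicenseSetup
{
    [OneTimeSetUp]
    public void ApplyLicense()
    {
        // Hypothetical license file location; adjust to the real path.
        var license = new GroupDocs.Parser.License();
        license.SetLicense("GroupDocs.Parser.lic");
    }
}
```

With a setup like this, individual test classes no longer need their own licensing decorator, so a forgotten attribute cannot change the parser output between a full run and a single debugged test.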

atir.tahir · April 15, 2020, 12:01pm

@dpinart,

No problem. In case of any further issues, you can post them on the forum.