GroupDocs.Parser

Hi @Niteen_Jadhav

Sorry for the delayed response. Tomorrow we are going to share an updated solution in which we have fixed the recognition quality for PDF (as I mentioned earlier, in the previous version it was recommended to use TIFF instead of PDF) and added support for multi-page documents.

Our next step is to handle scan misplacement issues.
That is the most challenging task and we cannot estimate it yet, but we are addressing it with the highest priority and are currently working on it.

@Niteen_Jadhav

We have introduced several updates and you can download the updated solution here.

This update includes the following fixes and improvements:

  1. PDF recognition has been improved, so there is no longer any need to use TIFF files instead.
  2. The “Use OCR” checkbox is now set automatically when a document is opened in the GUI.
  3. Support for multi-page PDFs.

Please note that the source code of the DocumentParser command-line example tool was updated, in particular with the following setting required for better PDF recognition:
new ParserSettings(new PagePreviewOptions(288));
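
For reference, here is roughly how this setting is passed to the parser (a minimal sketch assuming the ParserSettings overload quoted above from the demo build; the file path is a placeholder):

using GroupDocs.Parser;
using GroupDocs.Parser.Options;

// Render pages at 288 DPI before OCR (PagePreviewOptions comes from the demo build)
var settings = new ParserSettings(new PagePreviewOptions(288));
using (var parser = new Parser(@"Examples\sample.pdf", settings))
{
    // parse by template / extract text as the DocumentParser example tool does
}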

To see the parsing results and examples of parameters, feel free to use the *.bat files in the root of the demo archive.

Ok, I’ll use the same and update you once it has been tested. Any updates on dynamic template selection?

Additionally (this is not a requirement but a suggestion from our end, as we have a client with various use cases), would you be able to create an OCR engine for invoices that doesn’t require any template? We would upload the invoice documents and all the related data should be captured automatically.

I tested this on multiple templates and it is only working on one template; for the remaining templates it is returning blank Word files.

I uploaded the files to Drive: https://drive.google.com/file/d/1IFh816Ma0zcclAXYC_tTntZrufzF6IdF/view?usp=sharing

It is only working on Template 1; for the remaining templates it is not working.
I also tried with --ocr true, and it takes around 20 minutes for Template 2 to return data.

Hello @Niteen_Jadhav

Thanks for sharing your files.
We will investigate the newly shared documents and reply when we have updates.

But here are some preliminary thoughts:
Our first suggestion: we still recommend splitting documents, because GroupDocs.Parser works on the principle of one template = one document.
For example, you define a template for one single-page document, but then call parse-by-template on a combined document that contains multiple pages from different documents.
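
As a rough illustration, parsing a single, already-split document looks like this (a minimal sketch using the public GroupDocs.Parser template API; the field name, coordinates, and file path are placeholders):

using System;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Templates;

// A template describes the layout of one document type; the coordinates are hypothetical.
Template template = new Template(new TemplateItem[]
{
    new TemplateField(
        new TemplateFixedPosition(new Rectangle(new Point(35, 135), new Size(100, 10))),
        "InvoiceNumber")
});

// Parse one split document with the template (one template = one document).
using (Parser parser = new Parser(@"Examples\Template1\invoice.pdf"))
{
    DocumentData data = parser.ParseByTemplate(template);
    for (int i = 0; i < data.Count; i++)
    {
        Console.WriteLine(data[i].Name + ": " + (data[i].PageArea as PageTextArea)?.Text);
    }
}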

The next and much more complex issue is this: among the documents you shared, we see scans with more variation than in the examples shared before.

So to build a robust solution we will need to add support for every document type one by one.
And unfortunately we will need many scan examples of every document type.

Let’s take two examples that newly appeared in the samples you provided:
Template4:
Let’s compare Commercial Inv USD 2 and Commercial Inv USD 3
The text in the “Applicant” field resizes the table.
We didn’t see that possibility before, and we have to build a solution to handle such cases.
Yes, text can sometimes shift document contents, but we need to see such examples and build our algorithms to manage them.
Also, one of the scans has the fields “Container №” and “Seal №”, while the other scan doesn’t have them.
And since we have only a few examples of this document type, we can assume there may be other additional fields or unexpected resizing.
Second example:
Template2:
Here we have a new document type (pages 7-12) that you didn’t share with us before.
This document contains complex tables. Some tables have dynamic columns (additional columns can be added in some cases), which we can’t handle at the moment.

And we will not be able to build a robust solution with just a few scan examples of each document type.

Put simply, AI solutions sometimes require thousands of sample documents, even of a single type, to train them.
Our approach is a combination of AI (OCR) and algorithms for document preprocessing, and we have to address all the variations that actually occur in documents.
So I think we should add support for each document type one by one, and we will need as many scan variations of every document type as possible, together with the data the user wants to extract.

What about the solution taking around 20 minutes to fetch data? Can we do something to reduce the time? Normally it takes around 5 to 10 seconds.

And what about auto capturing of the template based on the document?

@Niteen_Jadhav

Regarding the 20 minutes - we are going to investigate the issue this week.
Regarding auto capturing - do you mean automatic template matching or capturing any invoice?

Regarding template capturing - we are currently working on scan shift compensation mechanisms, which will also allow automatic selection of a template that matches the current document.

Regarding capturing any invoice - this concept is under investigation now. The feature will most likely require integration with an LLM. We are considering automatically downloading an LLM model from Hugging Face, for example (which would imply hardware requirements such as a GPU), or integrating with OpenAI so the user can supply their own API key. But all of this is still under investigation.

Ok,

Regarding the 20 minutes - we are going to investigate the issue this week. OK.
Regarding auto capturing - do you mean automatic template matching or capturing any invoice? Yes, automatic template matching.

Any updates?

We are currently working on performance improvements and optimizations.

So far we have managed to improve performance by 4x, but there are further improvement steps which we continue to implement.

Ok, when can I expect the patch?

Next week we will investigate ways for further improvements and will know when we can provide the patch, because we don’t yet know exactly which approach we will choose. We will also need to implement and test it.

So next week we will have a better vision of our roadmap and schedule.

We would prefer not to release just the 4x performance improvement. But if you would like an intermediate release with it, please let us know and we will provide it by the end of next week.

I think 4x faster means I’ll get the output in 5 minutes, which is fine for me for now, as I just need to showcase this.

After that we can work on performance again, and if I get more templates I’ll share those too.

Hello,

Any updates?

Hello @Niteen_Jadhav

As we discussed, we are working on the new release and will provide the updated demo package by the end of this week; we now expect it on Friday.

Ok, thank you.

Hi @Niteen_Jadhav

We are now working on the release and will notify you as soon as it is finished.

Hi @Niteen_Jadhav

Thanks for your patience and understanding; we have prepared a demo package update.
Please download it here.

Please note that we placed your newly provided files in the Examples folder, in subfolders named as in your original upload.
We also added .bat files in the root of the package (DocumentParser - Template1.bat, etc.) for a quick test.

Please also note that in the GUI the DPI was set to 288 with the following setting:
settings = new ParserSettings(new PagePreviewOptions(288));

You are free to experiment with different DPI values (lower DPI = better performance = lower OCR quality), but for now the value must be the same in the GUI and in the parsing code.

Also, in case you want to rebuild the solution, note that this time it was built against a staging NuGet package.

To target it, please add a new NuGet source in Visual Studio with the URL: https://apiint.nugettest.org/v3/index.json
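
If you prefer the command line to the Visual Studio dialog, the equivalent step (using the standard .NET SDK CLI; the source name here is arbitrary) would be:

dotnet nuget add source https://apiint.nugettest.org/v3/index.json --name groupdocs-stage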

Thanks.

Ok, I want to know one more thing: has auto template selection been developed?