Groupdocs Parser

igor.zubarev · June 13, 2025, 10:16pm

Thanks for sharing the document examples.
One issue is that our GUI currently supports only one-page documents, while your document contains 16 pages.
This is not a big problem, though, and it can be fixed relatively quickly.

The bigger problem is that your cases look much more complex than we thought.
For example, take pages 12-16. Those pages are supposed to be the same document type, but the scans differ too much in geometry. This means we have to do preprocessing before we can apply a template.

In the documents you provided initially, the situation looked more promising.
If we look at Commercial Inv CIF.pdf, Page #6, the table numbers are spaced far enough apart, which gives us more flexibility and allows for some degree of scan misalignment.
But with the new document, and especially pages 12-16, we are facing a more difficult scenario that will require significant preprocessing, which cannot be achieved quickly.
Moreover, we must investigate how to do it automatically, without human involvement in the preprocessing procedure.

Anyway, we should investigate the documents first and then build our further plans.

Also, if we could have a full list of possible document types, we could implement support for them one by one, which would make the work manageable to plan and develop.

Niteen_Jadhav · June 14, 2025, 5:29am

Ok, I’ll try to share the document with you.

Can I know the estimated timeline?

Niteen_Jadhav · June 14, 2025, 5:52am

One more thing: I’ll mostly be using only the first 2 or 3 pages of the document for OCR.

igor.zubarev · June 14, 2025, 11:30am

Hi @Niteen_Jadhav,

The ETA will depend on how difficult it turns out to be to preprocess the scans and the final set of documents (specifically, how many document types we’ll need to handle and optimize for). A key factor here is the degree of misplacement and variation in the scans.

Could you help us with a few preliminary details:

How many users will be uploading the scans? If it’s a limited group, would it be possible to share a short manual or guide with them to ensure better alignment and scan quality?
Could you consider adding a pre-validation step after upload? If a scan doesn’t meet the quality and alignment criteria, the system could ask the user to rescan, similar to KYC processes where users are asked to retake a photo if it’s unclear or misaligned.
Will users be allowed to upload photos of documents (instead of scanned PDFs)? If so, that would increase the complexity significantly, as photos tend to vary more in alignment and quality.

The general idea is: the better the scans, the simpler and faster the preprocessing.
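
To make the pre-validation idea above more concrete, here is a minimal sketch in Python. It is only an illustration, not part of GroupDocs.Parser: the `ScanMetrics` type, the `prevalidate` function and both thresholds are hypothetical placeholders, since the actual quality and alignment criteria have not been defined yet. Measuring the skew angle and resolution of a scan is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import List

# Placeholder thresholds (assumptions for illustration only;
# the real criteria are still to be agreed).
MAX_SKEW_DEGREES = 2.0
MIN_DPI = 300


@dataclass(frozen=True)
class ScanMetrics:
    """Measurements of an uploaded scan, produced by some earlier step."""
    skew_degrees: float  # rotation of the page relative to upright
    dpi: int             # scan resolution


def prevalidate(metrics: ScanMetrics) -> List[str]:
    """Return a list of problems; an empty list means the scan is accepted."""
    problems = []
    if abs(metrics.skew_degrees) > MAX_SKEW_DEGREES:
        problems.append(
            f"Page is rotated by {metrics.skew_degrees:.1f} degrees; "
            f"please rescan with the page straight."
        )
    if metrics.dpi < MIN_DPI:
        problems.append(
            f"Resolution is {metrics.dpi} DPI; "
            f"please rescan at {MIN_DPI} DPI or higher."
        )
    return problems


if __name__ == "__main__":
    for message in prevalidate(ScanMetrics(skew_degrees=5.0, dpi=150)):
        print(message)
```

If the returned list is not empty, the upload form could show the messages and ask the user to rescan instead of passing the file on to template parsing.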

Niteen_Jadhav · June 14, 2025, 11:38am

Ok, so what will be the quality and alignment criteria for scanning? Also, the user base will not be limited, and users might also upload TIFF files.

igor.zubarev · June 15, 2025, 7:05pm

@Niteen_Jadhav
We are going to investigate the misalignment issues next week and discuss quality requirements.
Also we will evaluate the challenges we have to resolve and how we will approach them.

Niteen_Jadhav · June 16, 2025, 9:32am

Can we have any hot fixes that can be showcased to our client? As I already said, I’ll need just the first 2 pages. Can I split the document and use the first 2 pages when creating a template, and then, when documents come in, use only the same pages that were uploaded when creating the template? Also, if the documents are scanned better, will it work? I’m asking because I can explain this to the client, and then we can have some time to test. We already gave the client a June timeline because we thought it would be available in June.

igor.zubarev · June 16, 2025, 10:22am

@Niteen_Jadhav
Can you share the exact documents and set of scanned images that you will use for the showcase, so that we can address them in the hot fix? At the same time, we will continue investigating how to manage more complex cases.

Niteen_Jadhav · June 17, 2025, 7:23am

I already shared

igor.zubarev · June 17, 2025, 4:17pm

@Niteen_Jadhav
Thanks. We are using the documents we currently have from you and working on further improvements.

Currently, these are our priorities for your inquiry:

Support multi-page documents, because the cases where you cannot apply templates at all are mostly caused by the current single-page limitation.
Resolve misalignment issues for scans.

If you want a better experience with the current version, we recommend splitting your multi-page documents and using single-page versions.
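
As a rough illustration of this split workflow, the sketch below only plans the split: for a given page count it lists which pages to extract and what to name the single-page files. The function name and the naming scheme are hypothetical, and the actual page extraction (with whatever PDF or TIFF tool you use) is not shown. The same plan can be applied both to the template source and to incoming documents, so that the pages always match.

```python
from pathlib import PurePath
from typing import List, Optional, Tuple


def plan_single_page_split(
    source_name: str,
    page_count: int,
    keep_first: Optional[int] = None,
) -> List[Tuple[int, str]]:
    """Return (page_number, output_file_name) pairs for a single-page split.

    Page numbers are 1-based. If keep_first is given, only the first
    keep_first pages are included (for example, 2 for a 2-page showcase).
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    last_page = page_count if keep_first is None else min(keep_first, page_count)

    path = PurePath(source_name)
    width = len(str(page_count))  # zero-pad so the files sort in page order
    return [
        (page, f"{path.stem}_page_{page:0{width}d}{path.suffix}")
        for page in range(1, last_page + 1)
    ]


if __name__ == "__main__":
    for page, name in plan_single_page_split("New Commercial inv.pdf", 16, keep_first=2):
        print(page, name)
```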

We will get back to you when we have any update or question. Thanks.

Niteen_Jadhav · June 17, 2025, 5:47pm

What if I want to use first 2 pages from the document?

igor.zubarev · June 17, 2025, 7:35pm

@Niteen_Jadhav

Let’s confirm the requirements.

Taking your New Commercial inv.pdf
The document contains 16 pages, but you want to use only the first 2 pages.

The first 2 pages are actually a 2-page document, and you want to create a 2-page template.

You will also have 2-page incoming PDF scans to be parsed with this template.

Please confirm or correct me.
Thanks

Niteen_Jadhav · June 18, 2025, 9:30am

This is for the time being, until a proper solution comes from your end.

What I’m proposing is this: if I split the document and use the first 2 pages to create a template, and then, when documents come to the folder, I split them again and take the first 2 pages to process, will it work?

Niteen_Jadhav · June 20, 2025, 8:37am

Any updates?

igor.zubarev · June 20, 2025, 11:38am

Hi @Niteen_Jadhav,

We’re currently working on multi-page document support to enable processing of the first two pages as you described.
We also found recognition accuracy issues with these particular 2 pages, which we need to address.

Since this is a new solution involving significant enhancements across multiple subsystems, development is taking longer than initially expected.

However, this feature is a top priority for the GroupDocs.Parser product. We’re prioritizing it not only in response to your request but also as a key milestone for the product.

Because your request falls in the middle of our development path, we have low confidence in the ETAs: we keep running into pitfalls that we didn’t expect.
For example, we were going to provide a hotfix today, but we noticed issues with OCR accuracy, and we don’t want to share a low-quality solution.

Thank you for understanding. We will provide the updated solution as soon as we resolve the issues with those 2 pages.

Niteen_Jadhav · June 20, 2025, 11:55am

But remember, this will be only a temporary solution. The main goal is to process all the pages of a document in the GUI solution.

I want to add one more thing: what if I split all the pages by page number and try to create a template from each single page?

Niteen_Jadhav · June 20, 2025, 11:56am

Will it work like this? Because I want to show something to the client.

igor.zubarev · June 20, 2025, 12:58pm

@Niteen_Jadhav

That should work.
We have split the file for your convenience and placed it in the shared folder:

Please note that for the first 2 pages it is better to use the TIFF format for better OCR accuracy (PDFs have the issues that I described earlier today).

To compensate for scan misalignment, try making the template fields a bit larger. That way, even when a scan is misaligned, the field will still cover the value area and capture the value.
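
The idea of enlarging fields can be sketched as follows. This is a plain Python illustration, not the GroupDocs.Parser API: the `Rect` type, the `expand_field` function and the page size used in the example are hypothetical, and coordinates are assumed to be measured from the top-left corner of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A template field area; the origin is the top-left corner of the page."""
    left: float
    top: float
    width: float
    height: float


def expand_field(field: Rect, margin: float,
                 page_width: float, page_height: float) -> Rect:
    """Grow the field by `margin` on every side, clamped to the page bounds."""
    left = max(0.0, field.left - margin)
    top = max(0.0, field.top - margin)
    right = min(page_width, field.left + field.width + margin)
    bottom = min(page_height, field.top + field.height + margin)
    return Rect(left, top, right - left, bottom - top)


if __name__ == "__main__":
    # A 150 x 30 field grown by 10 units on each side of a 595 x 842 page.
    print(expand_field(Rect(100, 200, 150, 30), 10, 595, 842))
```

The margin should stay smaller than the gap to neighbouring values; otherwise the enlarged field may also capture text that belongs to another field.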

Niteen_Jadhav · June 30, 2025, 11:28am

Do we have any solution or tentative timeline for the solution?

Niteen_Jadhav · July 2, 2025, 5:16pm

Hello,

Still waiting for an update from your end.