Groupdocs Parser

igor.zubarev · July 20, 2025, 4:13pm

@Niteen_Jadhav, automatic template selection is not implemented yet.

Niteen_Jadhav · July 20, 2025, 4:59pm

What is the estimated timeline for this?

igor.zubarev · July 21, 2025, 2:38pm

This week we will investigate possible ways to implement this feature, and we should then have a better idea of the ETA.

There are some difficulties, especially in your case.
Your files are unusual in that one file contains several different documents, each of which may be a single page or multiple pages.
GroupDocs.Parser assumes that one input document = one template.
Under that assumption, GroupDocs.Parser can choose the correct template and apply it to the document, which can be implemented more effectively than handling combined documents.

May we change your approach and split your multi-document documents into individual ones?

Niteen_Jadhav · July 22, 2025, 3:57am

You can do it however you like.

Niteen_Jadhav · July 27, 2025, 7:55pm

Any estimated timeline?

igor.zubarev · July 28, 2025, 1:01pm

Hi @Niteen_Jadhav

Our investigation is still in progress. We are currently evaluating several approaches for this feature together with other features such as field misplacement and scan image offset compensation. The approaches for those features overlap, and we need time to choose one that addresses them all effectively.

Niteen_Jadhav · August 5, 2025, 9:49am

Do we have any updates on this?

igor.zubarev · August 5, 2025, 7:34pm

Hi @Niteen_Jadhav

The features are in development and will be included in the GroupDocs.Parser 25.8 release, which is expected by the end of this month if everything goes to plan.
This release will include the template matching feature and the field position shift compensation feature.
Thanks.

Niteen_Jadhav · August 6, 2025, 7:52am

Thank you, I’ll check those features

Niteen_Jadhav · August 8, 2025, 9:15am

OK, one thing just came to mind. I don’t have a requirement for it right now; I’m asking out of curiosity.

Can we capture images from the document and fetch the file bytes?

igor.zubarev · August 11, 2025, 8:45am

Hi @Niteen_Jadhav

Yes, GroupDocs.Parser provides an image extraction feature.
Please find the description in the docs article. Thanks.
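For readers who want the bytes rather than saved files, here is a minimal sketch. It assumes the `Parser.GetImages()` and `PageImageArea.GetImageStream()` members described in the GroupDocs.Parser for .NET documentation; the `ImageBytesExample` class and its method names are hypothetical, so please check the calls against the version you install.

```csharp
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

// Hypothetical helper class; only GetImages() and GetImageStream() come from the library.
public static class ImageBytesExample
{
    // Copies a stream fully into a byte array.
    public static byte[] ReadAllBytes(Stream stream)
    {
        using (var buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Returns the raw bytes of every image in the document
    // (an empty list if image extraction is not supported for the format).
    public static List<byte[]> ExtractImageBytes(string documentPath)
    {
        var result = new List<byte[]>();
        using (var parser = new Parser(documentPath))
        {
            IEnumerable<PageImageArea> images = parser.GetImages();
            if (images == null)
            {
                return result; // assumption: null means extraction is unsupported
            }

            foreach (PageImageArea image in images)
            {
                using (Stream imageStream = image.GetImageStream())
                {
                    result.Add(ReadAllBytes(imageStream));
                }
            }
        }
        return result;
    }
}
```

Each array in the returned list could then be passed to whatever upload or storage call the application already uses.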

Andrey.Golubkov · September 2, 2025, 8:27am

Hello @Niteen_Jadhav
The demo package has been updated. Please download it here.

Niteen_Jadhav · September 2, 2025, 10:31am

What is the purpose of the highlighted fields? They are not mentioned in the Google Doc that was shared.
highlighted GUI.PNG (3.8 KB)

Andrey.Golubkov · September 2, 2025, 1:22pm

The field with the number 288 is the DPI value used to generate document pages for text recognition.
If you click on this field, you can select a different DPI value.
The lower the DPI value, the faster the recognition. The higher the DPI value, the higher the recognition quality.
Visibility is a flag that shows or hides the hidden technical text fields on the page, i.e. the fields that are generated when you click the Generate Template button.
These generated fields make it possible to adjust the coordinates of the template elements when the scan is offset or scaled, and to select the most suitable template.
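As a rough illustration of the DPI trade-off, the sketch below computes the pixel size of a rendered page at a given DPI. The A4 page size and the `DpiEstimate` helper are assumptions for illustration, and the idea that recognition time follows pixel count is a plausible guess rather than something confirmed in this thread.

```csharp
using System;

// Hypothetical helper, not part of GroupDocs.Parser.
public static class DpiEstimate
{
    // Pixel size of a page rendered at the given DPI.
    // The defaults assume an A4 page (8.27 x 11.69 inches).
    public static (int Width, int Height) PagePixels(
        int dpi, double widthInches = 8.27, double heightInches = 11.69)
    {
        int width = (int)Math.Round(widthInches * dpi);
        int height = (int)Math.Round(heightInches * dpi);
        return (width, height);
    }

    // Total pixel count in megapixels for the same page.
    public static double Megapixels(int dpi)
    {
        var (width, height) = PagePixels(dpi);
        return (double)width * height / 1_000_000.0;
    }
}
```

Under these assumptions, doubling the DPI roughly quadruples the number of pixels the recognizer has to process, which fits the speed versus quality behaviour described above.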

Niteen_Jadhav · September 2, 2025, 2:16pm

How will it auto-select the template?

Andrey.Golubkov · September 2, 2025, 2:35pm

A template is selected based on the hidden technical fields present in the template. Without those fields, automatic selection is impossible.
The algorithm compares the fields on a document page with the set of hidden fields of each template and selects the most similar template from the passed collection.
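To make the "most similar template" idea concrete, here is a minimal sketch that scores templates by how many field names they share with a page. This is not the GroupDocs.Parser algorithm, whose details are not given in this thread: the Jaccard overlap score, the `TemplateMatcher` class, and the use of field names as the comparison key are all assumptions chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; the library's real matching logic is not shown in this thread.
public static class TemplateMatcher
{
    // Jaccard similarity: shared items divided by all distinct items (an assumed scoring choice).
    public static double Similarity(HashSet<string> pageFields, HashSet<string> templateFields)
    {
        int union = pageFields.Union(templateFields).Count();
        if (union == 0)
        {
            return 0.0;
        }
        int shared = pageFields.Intersect(templateFields).Count();
        return (double)shared / union;
    }

    // Returns the name of the template with the highest overlap, or null if nothing overlaps.
    public static string SelectBest(
        HashSet<string> pageFields,
        Dictionary<string, HashSet<string>> templates)
    {
        string best = null;
        double bestScore = 0.0;
        foreach (var pair in templates)
        {
            double score = Similarity(pageFields, pair.Value);
            if (score > bestScore)
            {
                bestScore = score;
                best = pair.Key;
            }
        }
        return best;
    }
}
```

A score of this kind is computed once per template from data already extracted from the page, so the document does not have to be parsed again for every template as in a try-each-template loop.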

Niteen_Jadhav · September 2, 2025, 2:40pm

Below is the code I am using to identify the correct template:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        DocProOCR.FileServer.ServiceClient fs = new DocProOCR.FileServer.ServiceClient();
        System.Net.ServicePointManager.ServerCertificateValidationCallback += (sender, certificate, chain, sslPolicyErrors) => true;

        string documentsPath = System.Configuration.ConfigurationManager.AppSettings["DocumentsPath"].ToString();
        string userId = System.Configuration.ConfigurationManager.AppSettings["UserId"].ToString();
        string invoicePathBase = Path.Combine(documentsPath, "Working", "Invoice");
        string templateFolderPath = Path.Combine(documentsPath, "Working", "template");
        string exePath = Path.Combine(documentsPath, "DocumentParser", "net8.0", "DocumentParser.exe");
        string outputPath = Path.Combine(documentsPath, "Working", "OutPut", "output.txt");

        while (true) // Infinite loop
        {
            try
            {
                var details = fs.ScanTool_GetFilesForOCR(userId);

                foreach (var detail in details)
                {
                    var fb = fs.DownloadFile(detail.FSID);
                    string pdfPath = Path.Combine(invoicePathBase, detail.FileName + detail.FileExtension);
                    File.WriteAllBytes(pdfPath, fb);

                    string bestTemplatePath = null;
                    Dictionary<string, string> bestParsedData = null;
                    int maxNonEmptyFields = -1;
                    string templateName = "";

                    foreach (var templatePath in Directory.GetFiles(templateFolderPath, "*.xml"))
                    {
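                        // Delete the previous run's output so a failed run is not scored against stale results
                        File.Delete(outputPath);
                        RunDocumentParserExe(exePath, pdfPath, templatePath, outputPath);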

                        if (File.Exists(outputPath))
                        {
                            var lines = File.ReadAllLines(outputPath);
                            var data = ParseOutput(lines);
                            int filledFieldCount = data.Count(kvp => !string.IsNullOrWhiteSpace(kvp.Value));

                            if (filledFieldCount > maxNonEmptyFields)
                            {
                                maxNonEmptyFields = filledFieldCount;
                                bestTemplatePath = templatePath;
                                bestParsedData = data;
                                templateName = Path.GetFileName(templatePath);
                            }
                        }
                    }

                    if (bestParsedData != null)
                    {
                        var sf = new DocProOCR.FileServer.SaveFile
                        {
                            Ref001 = bestParsedData.TryGetValue("Ref001", out var r1) ? r1 : "",
                            Ref002 = bestParsedData.TryGetValue("Ref002", out var r2) ? r2 : "",
                            Ref003 = bestParsedData.TryGetValue("Ref003", out var r3) ? r3 : "",
                            Ref004 = bestParsedData.TryGetValue("Ref004", out var r4) ? r4 : "",
                            Ref005 = bestParsedData.TryGetValue("Ref005", out var r5) ? r5 : "",
                            Ref006 = bestParsedData.TryGetValue("Ref006", out var r6) ? r6 : "",
                            Ref007 = bestParsedData.TryGetValue("Ref007", out var r7) ? r7 : "",
                            Ref008 = bestParsedData.TryGetValue("Ref008", out var r8) ? r8 : "",
                            Ref009 = bestParsedData.TryGetValue("Ref009", out var r9) ? r9 : "",
                            Ref010 = bestParsedData.TryGetValue("Ref010", out var r10) ? r10 : "",

                            Ref011 = bestParsedData.TryGetValue("Ref011", out var r11) ? r11 : "",
                            Ref012 = bestParsedData.TryGetValue("Ref012", out var r12) ? r12 : "",
                            Ref013 = bestParsedData.TryGetValue("Ref013", out var r13) ? r13 : "",
                            Ref014 = bestParsedData.TryGetValue("Ref014", out var r14) ? r14 : "",
                            Ref015 = bestParsedData.TryGetValue("Ref015", out var r15) ? r15 : "",
                            Ref016 = bestParsedData.TryGetValue("Ref016", out var r16) ? r16 : "",
                            Ref017 = bestParsedData.TryGetValue("Ref017", out var r17) ? r17 : "",
                            Ref018 = bestParsedData.TryGetValue("Ref018", out var r18) ? r18 : "",
                            Ref019 = bestParsedData.TryGetValue("Ref019", out var r19) ? r19 : "",
                            Ref020 = bestParsedData.TryGetValue("Ref020", out var r20) ? r20 : "",

                            Ref021 = bestParsedData.TryGetValue("Ref021", out var r21) ? r21 : "",
                            Ref022 = bestParsedData.TryGetValue("Ref022", out var r22) ? r22 : "",
                            Ref023 = bestParsedData.TryGetValue("Ref023", out var r23) ? r23 : "",
                            Ref024 = bestParsedData.TryGetValue("Ref024", out var r24) ? r24 : "",
                            Ref025 = bestParsedData.TryGetValue("Ref025", out var r25) ? r25 : "",
                            Ref026 = bestParsedData.TryGetValue("Ref026", out var r26) ? r26 : "",
                            Ref027 = bestParsedData.TryGetValue("Ref027", out var r27) ? r27 : "",
                            Ref028 = bestParsedData.TryGetValue("Ref028", out var r28) ? r28 : "",
                            Ref029 = bestParsedData.TryGetValue("Ref029", out var r29) ? r29 : "",
                            Ref030 = bestParsedData.TryGetValue("Ref030", out var r30) ? r30 : "",
                            DocumentNo = detail.DocumentNo,
                            OCRResult = detail.DocumentNo,
                            OCRTemplateName = templateName,
                            UserId = userId
                        };

                        fs.OCRIndexFieldUpdateGD(sf);
                        File.Delete(pdfPath);
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error: {ex.Message}");
                // Optionally log error to file or system
            }

            // Wait for 5 minutes before next check
            System.Threading.Thread.Sleep(300000);
        }
    }


    static void RunDocumentParserExe(string exePath, string pdfPath, string templatePath, string outputPath)
    {
        // Dispose the Process object once the run has finished
        using (var process = new Process())
        {
            process.StartInfo.FileName = exePath;
            process.StartInfo.Arguments = $"-i \"{pdfPath}\" -t \"{templatePath}\" -o \"{outputPath}\" --ocr true";
            process.StartInfo.CreateNoWindow = true;
            process.StartInfo.UseShellExecute = false;
            process.StartInfo.RedirectStandardOutput = true;
            process.StartInfo.RedirectStandardError = true;
            process.Start();
            // Drain the redirected streams; an unread pipe can fill up and block the child process
            process.BeginOutputReadLine();
            process.BeginErrorReadLine();
            process.WaitForExit();
        }
    }

    static Dictionary<string, string> ParseOutput(string[] lines)
    {
        var dict = new Dictionary<string, string>();
        var cleanLines = lines.Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => line.Trim())
            .ToList();

        for (int i = 0; i < cleanLines.Count; i++)
        {
            string currentLine = cleanLines[i];

            // Check if this line is a key (e.g., starts with "Ref")
            if (currentLine.StartsWith("Ref", StringComparison.OrdinalIgnoreCase))
            {
                string key = currentLine;
                string value = "";

                // Check if the next line exists and is NOT another key
                if (i + 1 < cleanLines.Count && !cleanLines[i + 1].StartsWith("Ref", StringComparison.OrdinalIgnoreCase))
                {
                    value = cleanLines[i + 1];
                    i++; // Skip the value line next iteration
                }

                dict[key] = value;
            }
        }
        return dict;
    }
}

Right now we check every template one by one. What do I need to change here?

Andrey.Golubkov · September 2, 2025, 4:04pm

It might be better to use the GroupDocs.Parser library directly in your application. To do this, find it on NuGet and add it to your project.
Then replace the following code block:

foreach (var templatePath in Directory.GetFiles(templateFolderPath, "*.xml"))
{
    RunDocumentParserExe(exePath, pdfPath, templatePath, outputPath);

    if (File.Exists(outputPath))
    {
        var lines = File.ReadAllLines(outputPath);
        var data = ParseOutput(lines);
        int filledFieldCount = data.Count(kvp => !string.IsNullOrWhiteSpace(kvp.Value));

        if (filledFieldCount > maxNonEmptyFields)
        {
            maxNonEmptyFields = filledFieldCount;
            bestTemplatePath = templatePath;
            bestParsedData = data;
            templateName = Path.GetFileName(templatePath);
        }
    }
}

With this code block:

// Loading collection of templates
var templateCollection = new TemplateCollection();
var templatePaths = Directory.GetFiles(templateFolderPath, "*.xml");
foreach (var templatePath in templatePaths)
{
    var template = Template.Load(templatePath);
    templateCollection.Add(template);
}

// Parsing by collection of templates
using var parser = new Parser(pdfPath);
var ocrOptions = new OcrOptions(new PagePreviewOptions(288));
var options = new ParseByTemplateOptions(
    pageIndex: 0,
    useOcr: true,
    ocrOptions: ocrOptions);
var data = parser.ParseByTemplate(templateCollection, options);
var bestTemplate = data.Template;
maxNonEmptyFields = data.Count;

// Getting parsed text for each field
foreach (var field in data)
{
    var fieldName = field.Name;
    var parsedText = field.Text;
}
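The rest of the original loop expects a `Dictionary<string, string>` in `bestParsedData`, so the parsed fields need to be collected into one. Below is a minimal sketch of that step. `ParsedField` is a hypothetical stand-in for the field type returned by the parser (the snippet above only shows that each field exposes `Name` and `Text`), and `ParsedDataMapper` is likewise an illustrative name, so the helper can be tried without the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a parsed field; only Name and Text are taken from the snippet above.
public sealed class ParsedField
{
    public ParsedField(string name, string text)
    {
        Name = name;
        Text = text;
    }

    public string Name { get; }
    public string Text { get; }
}

public static class ParsedDataMapper
{
    // Builds a case-insensitive name -> text map.
    // For duplicate names, the first non-empty value wins.
    public static Dictionary<string, string> ToDictionary(IEnumerable<ParsedField> fields)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var field in fields)
        {
            if (string.IsNullOrEmpty(field.Name))
            {
                continue; // skip unnamed fields
            }

            string text = (field.Text ?? "").Trim();
            if (!result.TryGetValue(field.Name, out var existing) || string.IsNullOrWhiteSpace(existing))
            {
                result[field.Name] = text;
            }
        }
        return result;
    }
}
```

In the loop, the dictionary built this way could take the place of the `ParseOutput` result, and the existing `TryGetValue("Ref001", ...)` lookups would keep working as long as the template's field names match the `Ref001` to `Ref030` keys.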

Niteen_Jadhav · September 2, 2025, 6:50pm

Can we also have a confidence level in future updates?

yuriy.mazurchuk · September 2, 2025, 7:43pm

Hi @Niteen_Jadhav!

We usually deploy updates on a monthly basis.

In each release, we update dependencies, fix issues reported through the forum or paid support, and improve overall stability. If a critical bug or high-priority support case arises, we also issue a hotfix outside the regular schedule.

In other words, the product is under continuous support, and we address all questions and reported issues according to our planned tasks and priorities.

So please feel free to share any feedback or issues you encounter while using the product. We are always open to your ideas and suggestions to improve it based on your experience.