Dynamic template-based OCR using GroupDocs.Parser

Niteen_Jadhav · September 2, 2022, 7:37am

Hello Team,

I want to do template-based OCR for my vendor invoices.
The template-based OCR should be dynamic. I found GroupDocs.Parser, but it looks like it only does static OCR; please see the link below for reference.

Blog

Could you also please help me with source code for dynamic template-based OCR?

atir.tahir · September 2, 2022, 4:07pm

We are investigating this scenario. Your investigation ticket ID is PARSERNET-1931.

atir.tahir · September 6, 2022, 5:39pm

@Niteen_Jadhav

To extract dynamic fields, use the TemplateRegexPosition class on its own, or the TemplateLinkedPosition and TemplateRegexPosition classes together. Please see the following examples:
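
A minimal sketch of the linked-position approach, assuming the invoice contains a "Tax" label with its value printed to the right of it. The field names, the regular expression, the 100x20 search area, and the file name are placeholders to adjust for your documents, and the constructor argument order should be checked against the API Reference for your GroupDocs.Parser version:

```csharp
using System;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Templates;

// Assumption: field names, the regex, and the area size below are hypothetical.
TemplateItem[] items = new TemplateItem[]
{
    // Find the label wherever it appears on the page by regular expression
    new TemplateField(new TemplateRegexPosition("Tax"), "Tax", null),

    // Read the text in a 100x20 area next to the "Tax" field.
    // Edges are assumed to be ordered (left, top, right, bottom);
    // "right" means the value lies to the right of the linked field.
    new TemplateField(
        new TemplateLinkedPosition(
            "Tax",
            new Size(100, 20),
            new TemplateLinkedPositionEdges(false, false, true, false)),
        "TaxValue",
        null),
};

Template template = new Template(items);

using (Parser parser = new Parser("invoice.pdf"))
{
    // Parse the document by the template
    DocumentData data = parser.ParseByTemplate(template);

    // Print every extracted field with its text
    for (int i = 0; i < data.Count; i++)
    {
        PageTextArea area = data[i].PageArea as PageTextArea;
        Console.WriteLine(data[i].Name + ": " + (area == null ? "(not a text field)" : area.Text));
    }
}
```

Because the label is located by regular expression rather than by fixed coordinates, the value is still found when the label moves from one invoice to the next, as long as the value stays on the same side of the label.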

To extract tables, use the TemplateTableParameters class. Please see this API Reference.
The following code shows how to extract a table from a page:

using System;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Templates;

// Create parameters to find a table in the rectangle of the page
TemplateTableParameters parameters = new TemplateTableParameters(new Rectangle(new Point(50, 250), new Size(600, 300)), null);

// Create an instance of Parser class
using (Parser parser = new Parser("invoice.pdf"))
{
    // Create a template
    Template template = new Template(new TemplateItem[]
    {
        new TemplateTable(parameters, "table", null),
    });

    // Parse the document by the template
    DocumentData data = parser.ParseByTemplate(template);

    // Iterate over fields
    foreach (FieldData i in data.GetFieldsByName("table"))
    {
        // Convert the page area into the table
        PageTableArea? table = i.PageArea as PageTableArea;

        if (table == null)
        {
            Console.WriteLine("Can't find table");
            return;
        }

        // Iterate over rows
        for (int row = 0; row < table.RowCount; row++)
        {
            // Iterate over columns
            for (int column = 0; column < table.ColumnCount; column++)
            {
                Console.Write(table[row, column]?.Text);
                Console.Write(' ');
            }

            Console.WriteLine();
        }
    }
}

Please note that automatic table detection works with relatively simple tables (for example, tables without empty rows). In other cases, it's recommended to use the TemplateTableLayout class.
Moreover, OCR (converting images to text) isn't supported by the API. The parse-by-template functionality works only with PDF documents that have text content, not with scanned images.
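
For tables that automatic detection cannot handle, a sketch of the TemplateTableLayout approach might look like the following. It assumes the layout is built from two lists of separator coordinates (vertical, then horizontal); the coordinate values, the "details" field name, and the file name are hypothetical and need to be measured on your own invoices:

```csharp
using System;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Templates;

// Hypothetical separator coordinates: 6 vertical lines give 5 columns,
// 4 horizontal lines give 3 rows.
TemplateTableLayout layout = new TemplateTableLayout(
    new double[] { 50, 95, 275, 415, 485, 545 },  // x positions of column borders
    new double[] { 325, 340, 365, 395 });         // y positions of row borders

Template template = new Template(new TemplateItem[]
{
    new TemplateTable(layout, "details", null),
});

using (Parser parser = new Parser("invoice.pdf"))
{
    DocumentData data = parser.ParseByTemplate(template);

    foreach (FieldData field in data.GetFieldsByName("details"))
    {
        PageTableArea table = field.PageArea as PageTableArea;
        if (table == null)
        {
            continue; // this field did not resolve to a table
        }

        // Print the table cell by cell, one row per line
        for (int row = 0; row < table.RowCount; row++)
        {
            for (int column = 0; column < table.ColumnCount; column++)
            {
                Console.Write(table[row, column]?.Text);
                Console.Write(' ');
            }

            Console.WriteLine();
        }
    }
}
```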