Hello Team,
I want to do template-based OCR for my vendor invoices.
The template-based OCR should be dynamic. I found GroupDocs.Parser, but it looks like it only does static OCR; please see the link below for reference.
Blog
Can you please help me with the source code for dynamic template-based OCR?
We are investigating this scenario. Your investigation ticket ID is PARSERNET-1931.
@Niteen_Jadhav
To extract dynamic fields, use the TemplateRegexPosition class, or a pair of TemplateLinkedPosition and TemplateRegexPosition classes. Please see the following examples:
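As a rough illustration of the linked/regex pair, here is a minimal sketch. It assumes the invoice contains a text label "Invoice Number" with its value printed to the right of it; the field names, the 100 x 15 search area, and passing null as the page index (mirroring the TemplateTable usage in this thread) are assumptions for illustration, so please verify the constructor signatures against the API Reference for your version.

```csharp
using System;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Templates;

class DynamicFieldExample
{
    static void Main()
    {
        // Anchor field: found by a regular expression, so its position may vary per invoice.
        // The label text "Invoice Number" is an assumption about the document.
        TemplateField label = new TemplateField(
            new TemplateRegexPosition("Invoice Number"),
            "InvoiceNumberLabel",
            null);

        // Value field: positioned relative to the anchor field.
        // Edges (left, top, right, bottom): only "right" is set, so the value
        // is searched to the right of the label. The size is illustrative.
        TemplateField value = new TemplateField(
            new TemplateLinkedPosition(
                "InvoiceNumberLabel",
                new Size(100, 15),
                new TemplateLinkedPositionEdges(false, false, true, false)),
            "InvoiceNumber",
            null);

        Template template = new Template(new TemplateItem[] { label, value });

        using (Parser parser = new Parser("invoice.pdf"))
        {
            DocumentData data = parser.ParseByTemplate(template);
            if (data == null)
            {
                Console.WriteLine("Parsing by template isn't supported for this document");
                return;
            }

            // Print the text of every matched "InvoiceNumber" field
            foreach (FieldData field in data.GetFieldsByName("InvoiceNumber"))
            {
                PageTextArea area = field.PageArea as PageTextArea;
                Console.WriteLine(area?.Text ?? "(no text)");
            }
        }
    }
}
```

Because the label is located by a regular expression rather than a fixed rectangle, the value can still be found when the label moves between invoices, as long as the value stays in the same place relative to the label.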
To extract tables, use the TemplateTableParameters class. Please see this API Reference.
The following code shows how to extract a table from the page:
using System;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Templates;

// Create parameters to find a table in the rectangle of the page
TemplateTableParameters parameters = new TemplateTableParameters(new Rectangle(new Point(50, 250), new Size(600, 300)), null);
// Create an instance of Parser class
using (Parser parser = new Parser("invoice.pdf"))
{
    // Create a template
    Template template = new Template(new TemplateItem[]
    {
        new TemplateTable(parameters, "table", null),
    });
    // Parse the document by the template
    DocumentData data = parser.ParseByTemplate(template);
    // Iterate over fields
    foreach (FieldData i in data.GetFieldsByName("table"))
    {
        // Convert the page area into the table
        PageTableArea? table = i.PageArea as PageTableArea;
        if (table == null)
        {
            Console.WriteLine("Can't find table");
            return;
        }
        // Iterate over rows
        for (int row = 0; row < table.RowCount; row++)
        {
            // Iterate over columns
            for (int column = 0; column < table.ColumnCount; column++)
            {
                // Print the cell text (empty cells are skipped by the null-conditional operator)
                Console.Write(table[row, column]?.Text);
                Console.Write(' ');
            }
            Console.WriteLine();
        }
    }
}
Please note that automatic table detection works only with relatively simple tables (without empty rows, for example). In other cases, it's recommended to use the TemplateTableLayout class.
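For the TemplateTableLayout case, a minimal sketch could look like the one below. The separator coordinates are made-up values, and the argument order (vertical separators first, then horizontal separators) is an assumption to check against the API Reference; the table name and file name simply reuse the ones from the example above.

```csharp
using System;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Templates;

class TableLayoutExample
{
    static void Main()
    {
        // Explicit table grid in page coordinates (illustrative values only)
        TemplateTableLayout layout = new TemplateTableLayout(
            new double[] { 50, 95, 275, 415, 485, 545 }, // vertical separators (column edges)
            new double[] { 325, 340, 365 });             // horizontal separators (row edges)

        Template template = new Template(new TemplateItem[]
        {
            new TemplateTable(layout, "table", null),
        });

        using (Parser parser = new Parser("invoice.pdf"))
        {
            DocumentData data = parser.ParseByTemplate(template);
            if (data == null)
            {
                Console.WriteLine("Parsing by template isn't supported for this document");
                return;
            }

            foreach (FieldData field in data.GetFieldsByName("table"))
            {
                PageTableArea table = field.PageArea as PageTableArea;
                if (table == null)
                {
                    continue; // not a table area, skip it
                }

                // Print the table row by row
                for (int row = 0; row < table.RowCount; row++)
                {
                    for (int column = 0; column < table.ColumnCount; column++)
                    {
                        Console.Write(table[row, column]?.Text);
                        Console.Write(' ');
                    }
                    Console.WriteLine();
                }
            }
        }
    }
}
```

With an explicit layout the grid is defined by you rather than detected, so empty rows or irregular cells don't affect how the cells are split.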
Also note that OCR (converting images to text) isn't supported by the API. The parse-by-template functionality works only with PDF documents that contain text content, not with scanned images.