Best .NET API to search text in documents

wolfgang.gogg · October 22, 2019, 10:07am

Hello,
We have a requirement to search for text inside a document and get a list of all hits with some context information. Ideally, we would like to get the page number, perhaps some position information on the page, and some surrounding text. It seems that, of the GroupDocs.Total family, GroupDocs.Parser comes closest to offering such functionality. Am I correct? Is this the best product for our use case?
Note: we only want to search in memory and do some basic text search, so building an index as in GroupDocs.Search is not required. Hence I think GroupDocs.Search would not be a fit here.

Thanks!

atir.tahir · October 22, 2019, 4:10pm

@wolfgang.gogg,

Using GroupDocs.Search for .NET, you can search text and get a list of hits with some context information. However, creating an index is mandatory. There are two types of index:

Index created in memory: cannot be saved after your program exits.
Index created on disk: can be loaded in the future to continue working.

For further details visit this article.

GroupDocs.Parser also supports searching for keywords in the document’s text. However, each search result only provides the following:

The position of the keyword in the document text.
The found text.
The left highlight.
The right highlight.

For details, please visit this documentation article.

wolfgang.gogg · October 23, 2019, 9:25am

Hello,

Feature-wise, GroupDocs.Parser would be perfect. The only feature gap for us is getting the correct page information as well, not just the position (what exactly is this? Word? Character?).
Is there any chance of getting this from Parser, or of using Parser in a paged manner, i.e. parsing one page after the other?
Thanks!

usman.aziz · October 23, 2019, 6:29pm

@wolfgang.gogg,

We have logged it in our Issue Tracking System (ID: PARSERNET-1292) to check whether it is feasible to include the page information in the search results. As for the position, it is the index of the first character of the found term in the document text.

Yes, you can also parse a document and extract text page by page. Please have a look at this documentation article for more details. Furthermore, you may also have a look at this blog article that shows how to count words and occurrences of each word in a document. You may modify or enhance the code sample given in this article to parse the document page by page.
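As a rough illustration of the page-by-page idea, here is a minimal sketch that searches a list of page texts and reports the page number, the character position within the page, and some surrounding text. It deliberately does not call any GroupDocs API: it assumes you have already extracted the text of each page into a list of strings, and the names PagedSearch, Find and context are hypothetical, chosen only for this example.

```csharp
using System;
using System.Collections.Generic;

public static class PagedSearch
{
    // Hypothetical helper: 'pages' holds the already-extracted text of each page
    // (how that text is obtained is outside this sketch).
    // Returns one tuple per hit: 1-based page number, character position within
    // the page, up to 'context' characters on each side, and the matched text.
    public static List<(int Page, int Position, string Left, string Text, string Right)> Find(
        IReadOnlyList<string> pages, string keyword, int context = 20)
    {
        var hits = new List<(int Page, int Position, string Left, string Text, string Right)>();
        if (string.IsNullOrEmpty(keyword))
        {
            return hits;
        }

        for (int p = 0; p < pages.Count; p++)
        {
            string text = pages[p] ?? string.Empty;
            int from = 0;

            while (from <= text.Length - keyword.Length)
            {
                // Case-insensitive ordinal search, starting after the previous hit
                int at = text.IndexOf(keyword, from, StringComparison.OrdinalIgnoreCase);
                if (at < 0)
                {
                    break;
                }

                int leftStart = Math.Max(0, at - context);
                int rightStart = at + keyword.Length;
                int rightLength = Math.Min(context, text.Length - rightStart);

                hits.Add((
                    p + 1,
                    at,
                    text.Substring(leftStart, at - leftStart),
                    text.Substring(at, keyword.Length),
                    text.Substring(rightStart, rightLength)));

                from = rightStart;
            }
        }

        return hits;
    }
}
```

Calling PagedSearch.Find(pageTexts, "lorem", 30) on your own list of page texts would give one entry per occurrence. Note that the position here is relative to the start of the page, not to the whole document text.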

wolfgang.gogg · October 24, 2019, 7:38am

Hello,

Thanks. I am waiting for feedback on your feasibility check for getting the page number in search results.

usman.aziz · October 24, 2019, 11:04am

@wolfgang.gogg,

Sure, we’ll let you know about the outcome as soon as possible.

usman.aziz · November 26, 2019, 6:25am

@wolfgang.gogg,

Getting page numbers from the search results is supported as of v19.11 of GroupDocs.Parser for .NET. The following code sample shows how to get the page number of the found text.

// Create an instance of Parser class
using(Parser parser = new Parser("sample.pdf"))
{
    // Search for a keyword with page numbers
    // (the last SearchOptions argument, true, turns on searching by pages)
    IEnumerable<SearchResult> sr = parser.Search("lorem", new SearchOptions(false, false, false, true));
    // Check if search is supported
    if(sr == null)
    {
        Console.WriteLine("Search isn't supported");
        return;
    }
    
    // Iterate over search results
    foreach(SearchResult s in sr)
    {
        // Print the position, page index and found text:
        Console.WriteLine(string.Format("At {0} (page {1}): {2}", s.Position, s.PageIndex, s.Text));
    }
}