Documents are not getting highlighted

Niteen_Jadhav · July 18, 2024, 3:15pm

I am using GroupDocs.Search to highlight phrases in documents (Word, Excel, PowerPoint, PDF), but the text is not highlighted in almost any of the files.

I am sharing my code and the document files for your reference.

Program p = new Program();
var storagePath = p.GetStoragePath();
string basePath = storagePath + "\\Search";
if (!Directory.Exists(basePath))
{
    Directory.CreateDirectory(basePath);
}

var fileGuidWithStoragePath = Path.Combine(storagePath, request.GUID);
var fileGuidWithSearchPath = Path.Combine(basePath, request.GUID);
if (System.IO.File.Exists(fileGuidWithSearchPath))
{
    System.IO.File.Delete(fileGuidWithSearchPath);
}
File.Copy(fileGuidWithStoragePath, fileGuidWithSearchPath);

string viewerCacheFolderPath = basePath + @"\Cache";
string indexFolder = basePath + @"\Index";
string documentsFolder = basePath;
string query = request.TextData;

// Creating an index in the specified folder
Index index = new Index(indexFolder);

// Indexing documents from the specified folder
index.Add(documentsFolder);

// Search in index
GroupDocs.Search.Options.SearchOptions searchOptions = new GroupDocs.Search.Options.SearchOptions();
searchOptions.UseCaseSensitiveSearch = false; // Adjust based on your requirement
SearchResult result = index.Search(query, searchOptions);

//Utils.TraceResult(query, result);

// Generating HTML
FoundDocument foundDocument = null;

// Iterate through the found documents

int documentCount = result.DocumentCount;
if (documentCount == 0)
{
    return new SPResponse() { ReturnStatus = "-1" };
}
for (int i = 0; i < documentCount; i++)
{
    FoundDocument document = result.GetFoundDocument(i);
    if (document.DocumentInfo.FilePath.Equals(fileGuidWithSearchPath, StringComparison.OrdinalIgnoreCase))
    {
        foundDocument = document;
        break;
    }
}


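// No found document matched the copied file path, so there is nothing to highlight
if (foundDocument == null)
{
    return new SPResponse() { ReturnStatus = "-1" };
}

var documentGuid = foundDocument.DocumentInfo.FilePath;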
var fileFolderName = Path.GetFileName(documentGuid).Replace(".", "_");

string cachePath = Path.Combine(p.GetStoragePath(), "Search");
cachePath = Path.Combine(cachePath, "Cache");
//cachePath = Path.Combine(cachePath, fileFolderName);

string fileCacheSubFolder = Path.Combine(cachePath, fileFolderName);
IViewerCache cache = new FileViewerCache(cachePath, fileCacheSubFolder);
LoadDocumentEntity loadDocumentEntity;

using (HtmlViewer htmlViewer = new HtmlViewer(documentGuid, cache, GetLoadOptions("")))
{
    loadDocumentEntity = GetLoadDocumentEntity(true, documentGuid, fileCacheSubFolder, htmlViewer, viewerCacheFolderPath);
}

IndexedFileInfo fileInfo = new IndexedFileInfo(viewerCacheFolderPath, foundDocument.DocumentInfo.FilePath);
HighlightService highlightService = new HighlightService(fileInfo, null, cache);

// Highlighting in HTML
highlightService.Highlight(foundDocument, index.Dictionaries.Alphabet, true);

My highlight function:

foreach (var page in _pages)
{
    string pageFilePath = string.Empty;
    if (isHtmlMode)
    {
        pageFilePath = _fileInfo.GetHtmlPageFilePath(page.Number);
    }

    var text = File.ReadAllText(pageFilePath);
    //HtmlDocument htmlDoc = new HtmlDocument();
    //htmlDoc.LoadHtml(text);
    //string textContent = htmlDoc.DocumentNode.InnerText;

    var result = HtmlHighlighter.Handle(
        text,
        false,
        alphabet,
        foundDocument.Terms,
        foundDocument.TermSequences);

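    // IndexOf returns -1 when the page contains no literal "<style>" tag;
    // in that case the highlight CSS rule is not added
    int index = result.IndexOf(Key);
    if (index >= 0 && index + Key.Length < result.Length)
    {
        result = result.Insert(index + Key.Length, HighlightStyle);
    }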

    File.WriteAllText(pageFilePath, result);
}

Two things I noticed while debugging:

int index = result.IndexOf(Key); // index is -1 in most cases, so the highlight style is never inserted
if (documentCount == 0) // the document count is 0 when I search for a phrase

The complete code of my highlight service:
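A note on the first observation: `IndexOf(Key)` only finds the exact string `<style>`. One possible cause of the -1 (an assumption, not verified against the attached files) is that the cached page uses a `<style>` tag with attributes, or has no style tag at all. The sketch below shows a more tolerant way to add the CSS rule. `HighlightStyleInjector` and `Inject` are hypothetical names used only for illustration; they are not part of GroupDocs.

```csharp
using System;

public static class HighlightStyleInjector
{
    // Same CSS rule as HighlightStyle in the code above.
    private const string HighlightStyle = ".highlighted-term { background-color:#ADFF2F; } ";

    // Hypothetical helper: returns the page HTML with the highlight rule added.
    public static string Inject(string html)
    {
        // Look for "<style" so that tags with attributes also match.
        int styleOpen = html.IndexOf("<style", StringComparison.OrdinalIgnoreCase);
        if (styleOpen >= 0)
        {
            int tagEnd = html.IndexOf('>', styleOpen);
            if (tagEnd >= 0)
            {
                // Insert the rule right after the opening tag.
                return html.Insert(tagEnd + 1, HighlightStyle);
            }
        }

        // No usable style tag: add a new style block before </head> if there is one.
        string block = "<style>" + HighlightStyle + "</style>";
        int headClose = html.IndexOf("</head>", StringComparison.OrdinalIgnoreCase);
        if (headClose >= 0)
        {
            return html.Insert(headClose, block);
        }

        // Last resort: append the block; browsers generally tolerate this.
        return html + block;
    }
}
```

With a helper like this, the `IndexOf`/`Insert` lines in the highlight function could be replaced by `result = HighlightStyleInjector.Inject(result);`. This only affects the styling step; it does not change which terms `HtmlHighlighter.Handle` marks.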

internal class HighlightService
{
    private const string Key = "<style>";
    private const string HighlightStyle = @".highlighted-term { background-color:#ADFF2F; } ";

    private readonly IndexedFileInfo _fileInfo;
    private readonly string _password;

    private IList<Page> _pages;

    public HighlightService(
        IndexedFileInfo fileInfo,
        string password, IViewerCache cache)
    {
        _fileInfo = fileInfo;
        _password = password;

        using (var htmlViewer = new HtmlViewer(_fileInfo, cache, _password))
        {
            _pages = htmlViewer.GetPages();
            foreach (var page in _pages)
            {
                htmlViewer.CreateCacheForPage(page.Number);
            }
        }
    }

    public void Highlight(FoundDocument foundDocument, Alphabet alphabet, bool isHtmlMode = true)
    {
        foreach (var page in _pages)
        {
            string pageFilePath = string.Empty;
            if (isHtmlMode)
            {
                pageFilePath = _fileInfo.GetHtmlPageFilePath(page.Number);
            }

            var text = File.ReadAllText(pageFilePath);
            //HtmlDocument htmlDoc = new HtmlDocument();
            //htmlDoc.LoadHtml(text);
            //string textContent = htmlDoc.DocumentNode.InnerText;

            var result = HtmlHighlighter.Handle(
                text,
                false,
                alphabet,
                foundDocument.Terms,
                foundDocument.TermSequences);

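            // IndexOf returns -1 when the page contains no literal "<style>" tag;
            // in that case the highlight CSS rule is not added
            int index = result.IndexOf(Key);
            if (index >= 0 && index + Key.Length < result.Length)
            {
                result = result.Insert(index + Key.Length, HighlightStyle);
            }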

            File.WriteAllText(pageFilePath, result);
        }
    }
}

Not Working online on Groupdocs website also (1.8 MB)

Working fine online but not working with my code (11.7 KB)

I am searching for the term “Proposal”.

atir.tahir · July 18, 2024, 10:38pm

@Niteen_Jadhav
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): SEARCHNET-3210

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

Niteen_Jadhav · July 20, 2024, 7:33am

Have you verified this on your end? Is it working on your end?

It would also help if you could provide a workaround for the time being.

atir.tahir · July 20, 2024, 10:06am

@Niteen_Jadhav

This particular feature is regarding GroupDocs.Search and we are already investigating it. You’ll be notified in case of any update.

Niteen_Jadhav · July 22, 2024, 1:12pm

I have implemented a workaround. Please let me know whether I can proceed with it.

My main API:

[HttpPost]
[Route("api/GroupDocsApi/HighlightTextInViewer")]
public async Task<SPResponse> HighlightTextInViewer(DocPro.DMS.BusinessEntities.Request.GroupDocs.GetDocumentInfoRequest request)
{
    Program p = new Program();
    var storagePath = p.GetStoragePath();
    string basePath = storagePath + "\\Search";
    if (!Directory.Exists(basePath))
    {
        Directory.CreateDirectory(basePath);
    }

    var fileGuidWithStoragePath = Path.Combine(storagePath, request.GUID);
    var fileGuidWithSearchPath = Path.Combine(basePath, request.GUID);
    if (System.IO.File.Exists(fileGuidWithSearchPath))
    {
        System.IO.File.Delete(fileGuidWithSearchPath);
    }
    File.Copy(fileGuidWithStoragePath, fileGuidWithSearchPath);

    string viewerCacheFolderPath = basePath + @"\Cache";
    string indexFolder = basePath + @"\Index";
    string documentsFolder = basePath;
    string query = request.TextData;

    // Creating an index in the specified folder
    //Index index = new Index(indexFolder);

    //// Indexing documents from the specified folder
    //index.Add(documentsFolder);

    //// Search in index
    //GroupDocs.Search.Options.SearchOptions searchOptions = new GroupDocs.Search.Options.SearchOptions();
    //searchOptions.UseCaseSensitiveSearch = false; // Adjust based on your requirement
    //SearchResult result = index.Search(query, searchOptions);

    ////Utils.TraceResult(query, result);

    //// Generating HTML
    //FoundDocument foundDocument = null;

    //// Iterate through the found documents

    //int documentCount = result.DocumentCount;
    //if (documentCount == 0)
    //{
    //    return new SPResponse() { ReturnStatus = "-1" };
    //}
    //for (int i = 0; i < documentCount; i++)
    //{
    //    FoundDocument document = result.GetFoundDocument(i);
    //    if (document.DocumentInfo.FilePath.Equals(fileGuidWithSearchPath, StringComparison.OrdinalIgnoreCase))
    //    {
    //        foundDocument = document;
    //        break;
    //    }
    //}


    var documentGuid = GetDocumentPath(request.GUID);
    var fileFolderName = Path.GetFileName(documentGuid).Replace(".", "_");

    string cachePath = Path.Combine(p.GetStoragePath(), "Search");
    cachePath = Path.Combine(cachePath, "Cache");
    //cachePath = Path.Combine(cachePath, fileFolderName);

    string fileCacheSubFolder = Path.Combine(cachePath, fileFolderName);
    IViewerCache cache = new FileViewerCache(cachePath, fileCacheSubFolder);
    LoadDocumentEntity loadDocumentEntity;

    using (HtmlViewer htmlViewer = new HtmlViewer(documentGuid, cache, GetLoadOptions("")))
    {
        loadDocumentEntity = GetLoadDocumentEntity(true, documentGuid, fileCacheSubFolder, htmlViewer, viewerCacheFolderPath);
    }

    IndexedFileInfo fileInfo = new IndexedFileInfo(viewerCacheFolderPath, documentGuid);
    HighlightService highlightService = new HighlightService(fileInfo, null, cache);

    // Highlighting in HTML
    highlightService.Highlight(request.TextData, true);

    if (!request.IsHtmlMode)
    {
        var pagesBuilder = await GD.Utilities.GetDocumentInfoByFilePathAsync(fileGuidWithSearchPath);
        var pages = pagesBuilder.ToString();
        var noOfPages = pages.Split(',').Last();
        var tempFilesPath = Path.Combine(cachePath, fileFolderName);

        for (int i = 1; i <= int.Parse(noOfPages); i++)
        {
            var tempFile = Path.Combine(tempFilesPath, $"p{i}");
            using (var converter = new Converter($"{tempFile}.html"))
            {
                // Render the highlighted HTML page as a PNG image
                var options = new ImageConvertOptions { Format = GroupDocs.Conversion.FileTypes.ImageFileType.Png };
                converter.Convert($"{tempFile}.png", options);
            }

            // Delete the source HTML only after the converter has been disposed
            File.Delete($"{tempFile}.html");
        }
    }
    File.Delete(fileGuidWithSearchPath);
    return new SPResponse() { ReturnStatus = "0" };
}

My methods inside the highlight class:

public void Highlight(string term, bool isHtmlMode = true)
{
    foreach (var page in _pages)
    {
        string pageFilePath = string.Empty;
        if (isHtmlMode)
        {
            pageFilePath = _fileInfo.GetHtmlPageFilePath(page.Number);
        }

        var text = File.ReadAllText(pageFilePath);
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(text);

        var bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
        if (bodyNode != null)
        {
            HighlightText(bodyNode, term);
        }

        // Save the modified HTML back to the file
        htmlDoc.Save(pageFilePath);


        //htmlDoc.LoadHtml(text);
        //string textContent = htmlDoc.DocumentNode.InnerText;

        //var result = HtmlHighlighter.Handle(
        //    text,
        //    false,
        //    alphabet,
        //    foundDocument.Terms,
        //    foundDocument.TermSequences);

        //int index = result.IndexOf(Key);
        //if (index > 0 && index + Key.Length < result.Length)
        //{
        //    result = result.Insert(index + Key.Length, HighlightStyle);
        //}

        //File.WriteAllText(pageFilePath, result);
    }
}

private void HighlightText(HtmlNode node, string searchText)
{
    var textBuilder = new StringBuilder();
    var nodeList = new List<HtmlNode>();

    // Step 1: Extract all text content and track positions
    ExtractText(node, textBuilder, nodeList);
    var text = textBuilder.ToString();

    // Step 2: Find all positions of the search text in the concatenated string
    var regex = new Regex(Regex.Escape(searchText), RegexOptions.IgnoreCase);
    var matches = regex.Matches(text);

    // Step 3: If no matches, exit early
    if (matches.Count == 0) return;

    // Step 4: Create highlighted text segments
    var segments = new List<(int Start, int Length)>();
    foreach (Match match in matches)
    {
        segments.Add((match.Index, match.Length));
    }

    // Step 5: Reapply the HTML structure with highlighted text
    ReapplyHighlights(nodeList, segments);
}

private void ExtractText(HtmlNode node, StringBuilder textBuilder, List<HtmlNode> nodeList)
{
    if (node.NodeType == HtmlNodeType.Text)
    {
        textBuilder.Append(node.InnerText);
        nodeList.Add(node);
    }
    else if (node.HasChildNodes)
    {
        foreach (var child in node.ChildNodes)
        {
            ExtractText(child, textBuilder, nodeList);
        }
    }
    else
    {
        nodeList.Add(node);
    }
}

private void ReapplyHighlights(List<HtmlNode> nodeList, List<(int Start, int Length)> segments)
{
    int offset = 0;
    foreach (var segment in segments)
    {
        int start = segment.Start + offset;
        int end = start + segment.Length;
        int currentPos = 0;

        foreach (var node in nodeList)
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                var originalText = node.InnerText;
                int textLength = originalText.Length;

                if (currentPos + textLength >= start && currentPos < end)
                {
                    int relativeStart = Math.Max(start - currentPos, 0);
                    int relativeEnd = Math.Min(end - currentPos, textLength);

                    var beforeHighlight = originalText.Substring(0, relativeStart);
                    var highlight = originalText.Substring(relativeStart, relativeEnd - relativeStart);
                    var afterHighlight = originalText.Substring(relativeEnd);

                    var highlightedText = $"{beforeHighlight}<span style=\"background-color: yellow;\">{highlight}</span>{afterHighlight}";

                    node.InnerHtml = highlightedText;
                    offset += highlightedText.Length - originalText.Length;
                }

                currentPos += textLength;
            }
        }
    }
}

Please note: so far in testing it seems to work fine. I only added HtmlAgilityPack to work with the HTML files.
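The offset bookkeeping in `ReapplyHighlights` is the part of this workaround that is hardest to verify, because each node is rewritten while later positions are still being computed. A minimal sketch of the same idea, with the position mapping isolated from HtmlAgilityPack so it can be checked on plain strings: it takes the text of each text node in order and returns, for every match in the concatenated text, the piece that falls inside each node. `MatchMapper` and `Map` are hypothetical names for illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchMapper
{
    // Hypothetical helper: maps matches in the concatenated text back to
    // (node index, start within node, length within node) pieces.
    public static List<(int Node, int Start, int Length)> Map(IList<string> nodeTexts, string term)
    {
        var pieces = new List<(int Node, int Start, int Length)>();
        if (string.IsNullOrEmpty(term)) return pieces;

        // Same search as in HighlightText: literal, case-insensitive.
        string all = string.Concat(nodeTexts);
        var regex = new Regex(Regex.Escape(term), RegexOptions.IgnoreCase);

        foreach (Match match in regex.Matches(all))
        {
            int matchStart = match.Index;
            int matchEnd = match.Index + match.Length; // exclusive
            int nodeStart = 0;

            for (int i = 0; i < nodeTexts.Count; i++)
            {
                int nodeEnd = nodeStart + nodeTexts[i].Length;

                // Overlap between the match and this node, in global positions.
                int from = Math.Max(matchStart, nodeStart);
                int to = Math.Min(matchEnd, nodeEnd);
                if (from < to)
                {
                    pieces.Add((i, from - nodeStart, to - from));
                }

                nodeStart = nodeEnd;
            }
        }

        return pieces;
    }
}
```

For example, `MatchMapper.Map(new List<string> { "Pro", "posal text" }, "proposal")` yields the pieces `(0, 0, 3)` and `(1, 0, 5)`, so a term split across two text nodes is highlighted in both. Because all pieces are computed from the original text before any node is changed, no running offset is needed; applying the pieces of each node from last to first keeps the earlier positions valid.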

atir.tahir · July 22, 2024, 7:04pm

@Niteen_Jadhav

You can continue with the workaround. Let us know if you face any issue in the back-end API.

Niteen_Jadhav · August 16, 2024, 7:59am

Do we have any updates on this?

atir.tahir · August 16, 2024, 7:53pm

@Niteen_Jadhav

To preserve the document structure, you need to highlight the search results in the same way as in the highlight example.

Moreover, the workaround you adopted is appropriate for now.

Niteen_Jadhav · August 17, 2024, 2:02pm

OK, will this be resolved anytime soon?

atir.tahir · August 17, 2024, 8:47pm

@Niteen_Jadhav

Actually, we have already shared a solution for this issue. Could you please be more specific if you have further questions about this scenario?

Niteen_Jadhav · August 25, 2024, 8:47am

But I did not get any solution for this.

atir.tahir · August 25, 2024, 6:34pm

@Niteen_Jadhav

Following this code, we get the word “proposal” highlighted in the output. However, there is an issue with the document/output structure that we are still working on. You’ll be notified of any update.

atir.tahir · August 28, 2024, 8:59pm

@Niteen_Jadhav

Please follow this “Highlight in HTML” code to highlight the results and preserve the document structure.

Niteen_Jadhav · August 30, 2024, 10:33am

OK, I’ll try it and let you know.

atir.tahir · August 30, 2024, 11:50am

@Niteen_Jadhav

Sure.