Highlight text upon conversion to HTML in .NET

Clemens_Pestuka · December 18, 2024, 9:27am

Hi,

I was wondering if there is any way to highlight certain words of a document when converting it to HTML with GroupDocs.Viewer.

I tried using GroupDocs.Search after the conversion with GroupDocs.Viewer, but the resulting HTML had a completely new format. We need to keep the formatting produced by Viewer.

I also wanted to give GroupDocs.Annotation a try to highlight some text, but it seems to require an absolute position, which I don’t have. Unfortunately, I only know which words I want to highlight.

What we are ultimately trying to do is highlight search results that were found by GroupDocs.Parser.
GroupDocs.Parser is searching on the output of GroupDocs.Viewer and gives us good results.
The only problem is that we don’t get any context showing where a hit was found in the HTML.
There is a position in the search results, but it does not reflect a usable position in the original HTML, as it seems to be based on some intermediate markup from GroupDocs.Parser.

I hope you have some ideas on how we could tackle this problem.

Professionalize.Discourse · December 18, 2024, 9:27am

@Clemens_Pestuka

Could you please provide more details on how you are currently obtaining the search results from GroupDocs.Parser and the specific format of the HTML output from GroupDocs.Viewer?

Clemens_Pestuka · December 18, 2024, 9:31am

This is how we’re using GroupDocs.Parser:

            using (Parser parser = new Parser(documentPath))
            {
                // The same HighlightOptions instance (constructed with 40) is passed for both highlight arguments.
                HighlightOptions options = new HighlightOptions(40);
                IEnumerable<SearchResult> sr = parser.Search("deforestation. China", new SearchOptions(false, false, false, true, options, options));
            }

Here are a few output files from Viewer that we were using:
output.zip (107.7 KB)

vladimir.litvinchik · December 18, 2024, 10:57am

@Clemens_Pestuka

Can you please describe in more detail how you are going to use the highlighted text? Since you wrote about context, it seems to me that you would like to know whether any entries were found and where they are located. Is that correct?

Clemens_Pestuka · December 18, 2024, 11:12am

@vladimir.litvinchik

Yes, if we had the exact location of a hit, we could do the highlighting ourselves.
That might be the ideal scenario, as we would have more control over the highlight.

If that’s not possible, we’d also be fine if Viewer did the highlighting for us and we could compare against the original to see where the highlight was applied.

If there are any more questions or my description wasn’t clear, please let me know.

vladimir.litvinchik · December 18, 2024, 12:37pm

@Clemens_Pestuka

Thank you for the details. We’ll look into whether we can do it in Viewer. For which file types do you need the highlight in the first place?

Clemens_Pestuka · December 18, 2024, 1:30pm

@vladimir.litvinchik

Good question.
We could either do the highlight on the original file, which could be in any format,
or we could do the highlight on the already converted file, which would always be HTML.
I have never actually tried running output from GroupDocs.Viewer through GroupDocs.Viewer again.

Clemens_Pestuka · December 18, 2024, 1:43pm

@vladimir.litvinchik

Maybe it would make more sense to get a “correct” position from GroupDocs.Parser.
Searching in the previously attached “output_viewer26.html” for “deforestation. China” gives me position 898:
image.png (23.3 KB)

If I copy and paste all characters up to “deforestation. China” into Notepad, I can see that this matches exactly:
image.png (27.2 KB)

But there are two problems with that.

1. It does not always match. It can be completely off when the document has hyperlinks.
2. The position is hard to determine programmatically, as the page source is far more complex.
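
The second problem can be illustrated with a minimal sketch. It assumes the search hit should be located in the visible text of the page rather than by trusting the reported position, and it uses only the .NET base class library. The names `HtmlTextMap`, `Build`, and `FindInSource` are hypothetical and are not part of any GroupDocs product. The sketch walks the HTML source, collects the visible characters, and remembers for each one the source index it came from, so a phrase found in the visible text can be traced back to the markup even when it spans several tags.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical helper, not part of GroupDocs: maps visible text back to HTML source offsets.
public static class HtmlTextMap
{
    // Returns the visible text and, for every visible character,
    // the index in the HTML source where that character originated.
    public static (string Text, List<int> SourceIndex) Build(string html)
    {
        var text = new StringBuilder();
        var map = new List<int>();
        int i = 0;
        while (i < html.Length)
        {
            char c = html[i];
            if (c == '<')
            {
                // Skip the whole tag; it contributes no visible text.
                int close = html.IndexOf('>', i);
                if (close < 0) break; // unterminated tag: ignore the rest
                i = close + 1;
            }
            else if (c == '&')
            {
                // Decode a character reference such as &amp; as one unit.
                int semi = html.IndexOf(';', i);
                if (semi > i && semi - i <= 10)
                {
                    string decoded = WebUtility.HtmlDecode(html.Substring(i, semi - i + 1));
                    foreach (char d in decoded) { text.Append(d); map.Add(i); }
                    i = semi + 1;
                }
                else { text.Append(c); map.Add(i); i++; }
            }
            else { text.Append(c); map.Add(i); i++; }
        }
        return (text.ToString(), map);
    }

    // Returns the source index of the first visible character of the phrase, or -1.
    public static int FindInSource(string html, string phrase)
    {
        var (text, map) = Build(html);
        int pos = text.IndexOf(phrase, StringComparison.Ordinal);
        return pos < 0 ? -1 : map[pos];
    }
}
```

For the input `<p>Stop <b>deforestation</b>. China &amp; others</p>`, `FindInSource(html, "deforestation. China")` returns 11, the source index of the `d` inside the `<b>` element. The sketch does not skip `<script>` or `<style>` content and does not normalise whitespace, so its offsets may still differ from the positions GroupDocs.Parser reports.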

vladimir.litvinchik · December 18, 2024, 6:30pm

@Clemens_Pestuka

I’ve got a response from the developer. Unfortunately, he can’t provide a correct position, as it is a different context.

We are considering adding a feature to Viewer that will highlight text in the source document and then convert it to HTML. This approach has an advantage: search is expected to work much like the native one (as in MS Word), whereas in HTML the text can in some cases be split across different blocks, which makes an entry hard to find.

To perform the analysis for this feature, we would need the list of file formats that you are processing.

Clemens_Pestuka · December 19, 2024, 3:21pm

@vladimir.litvinchik

Thank you, that would be great!
We would need this feature for all MS Office formats, PDF, emails, and text-based files (txt, log).

vladimir.litvinchik · December 19, 2024, 3:30pm

@Clemens_Pestuka

Got it, thanks. We’re going to analyze this feature and schedule its implementation. We will let you know when we have any updates.

denisgvardionov · April 29, 2025, 10:24pm

Hi @Clemens_Pestuka

We have implemented the requested feature. It is described in a separate article: Search and highlight text in the loaded document. The feature is included in GroupDocs.Viewer version 25.4, which was released today.

With best regards,
Denis Gvardionov

Clemens_Pestuka · May 16, 2025, 2:21pm

Hi @denisgvardionov,

Sorry for the late reply, but I have had very limited time lately.
I was able to give the new feature a try, following this guide.

I tried a few search terms that we had been struggling to highlight, and they worked perfectly.
So as far as my testing goes, everything is looking great!
Configurable colors and regex are also really good options!

Thanks a lot for implementing that functionality!

Best regards,
Clemens

denisgvardionov · May 16, 2025, 9:36pm

Hi @Clemens_Pestuka

Glad to hear that!

With best regards,
Denis Gvardionov