We have been seeing OutOfMemoryExceptions when indexing documents in our test environment.
We took a closer look at the cause and identified an issue with corrupted/invalid docx and/or doc files. The presence of even one such file in the directory structure being indexed leads to an OutOfMemoryException and aborts the indexing process.
Could you please help with this issue?
We would expect such files to be skipped and not indexed, rather than the whole process being aborted.
P.S.: If necessary, we can provide some of the affected sample doc/docx files.
Using the latest version of the API, 24.8, we could not reproduce this issue. However, we noticed another issue: if the invalid documents are added to the directory, the search results are not as expected, but when we remove the corrupted documents, we get the expected results.
We are investigating this issue. Your investigation ticket ID is SEARCHNET-3249.
Could you please share more details about your test environment (e.g. OS details, .NET version) and a sample console application with which the issue can be reproduced?
// 'index' is an existing GroupDocs.Search Index; documentsFolder points to the folder containing the documents to index.
var options = new IndexingOptions() { UseRawTextExtraction = false };
index.Add(documentsFolder, options);
In your case, the documents are indeed damaged. If an exception occurs while extracting data from a document, that document is skipped, and the remaining documents in the folder continue to be indexed. However, if the exception is of the OutOfMemoryException type, continuing the indexing process becomes impossible, as the system has run out of memory.
When the folder may contain documents that could interrupt the entire indexing process, it is advisable to handle indexing differently: extract data from the documents individually and then index the extracted data separately, as described here: Separate Data Extraction - GroupDocs.Search for .NET.
With this approach, it is best to extract data from the documents one by one and then index the extracted data in batches as large as possible.
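For illustration, here is a minimal sketch of what that could look like. It assumes the Extractor, Document, and ExtractedData types and the index.Add(ExtractedData[], IndexingOptions) overload shown in the Separate Data Extraction article; please verify the exact names and namespaces against that article. The folder paths, batch size, and logging are placeholders.

using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Search;
using GroupDocs.Search.Common; // assumed namespace of the extraction types; adjust as needed

class SeparateExtractionSample
{
    static void Main()
    {
        string indexFolder = @"c:\MyIndex";          // placeholder paths
        string documentsFolder = @"c:\MyDocuments";

        Index index = new Index(indexFolder);
        Extractor extractor = new Extractor();
        var batch = new List<ExtractedData>();

        foreach (string filePath in Directory.EnumerateFiles(documentsFolder, "*.*", SearchOption.AllDirectories))
        {
            try
            {
                // Extract data from one document at a time so that a damaged
                // file cannot abort the indexing of the remaining documents.
                Document document = Document.CreateFromFile(filePath);
                batch.Add(extractor.ExtractData(document));
            }
            catch (Exception ex)
            {
                // Corrupted/invalid document: log it and skip it.
                Console.WriteLine($"Skipped '{filePath}': {ex.Message}");
            }

            // Index the extracted data in batches as large as practical.
            if (batch.Count >= 100)
            {
                index.Add(batch.ToArray(), new IndexingOptions());
                batch.Clear();
            }
        }

        if (batch.Count > 0)
        {
            index.Add(batch.ToArray(), new IndexingOptions());
        }
    }
}

Indexing the extracted data in batches, as recommended above, keeps the per-call overhead low, while the per-document try/catch isolates the extraction of each file.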
And what is the recommended way to update an existing index when the document folder may contain files that could interrupt the entire process? What is the best way to skip files that have not changed since the last indexing?
We recommend extracting data from each file individually and then indexing the extracted data together as a batch. This approach allows you to handle each document separately and to decide, based on its modification date, whether it needs to be re-indexed.
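As an illustration only (the timestamp store below is our own idea, not part of GroupDocs.Search): one simple way to skip unchanged files is to persist the last write time recorded at the previous indexing run and compare it with each file's current LastWriteTimeUtc, then feed only the changed files into the extraction loop shown above.

using System;
using System.Collections.Generic;
using System.IO;

static class IndexUpdatePlanner
{
    // Returns the files under documentsFolder that are new or modified since the
    // timestamps recorded in timestampFile (a simple "path|ticks" text file).
    public static List<string> GetChangedFiles(string documentsFolder, string timestampFile)
    {
        var lastIndexed = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
        if (File.Exists(timestampFile))
        {
            foreach (string line in File.ReadAllLines(timestampFile))
            {
                int sep = line.LastIndexOf('|');
                if (sep > 0 && long.TryParse(line.Substring(sep + 1), out long ticks))
                    lastIndexed[line.Substring(0, sep)] = ticks;
            }
        }

        var changed = new List<string>();
        var records = new List<string>();
        foreach (string path in Directory.EnumerateFiles(documentsFolder, "*.*", SearchOption.AllDirectories))
        {
            long writeTicks = File.GetLastWriteTimeUtc(path).Ticks;
            if (!lastIndexed.TryGetValue(path, out long indexedTicks) || writeTicks > indexedTicks)
                changed.Add(path); // new or changed since the last indexing run
            records.Add(path + "|" + writeTicks);
        }

        File.WriteAllLines(timestampFile, records); // remember the current state for the next run
        return changed;
    }
}

In a real application you would update the recorded timestamp of a file only after it has been extracted and indexed successfully, so that skipped or failed files are retried on the next run.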
We have created a simple console application that reproduces this.
Please extract everything from the ZIP file, add a license, compile it, and run it.
It tries to index the files that are also included in the ZIP file.
Please implement the Separate Data Extraction approach, and the issue will not occur. We were, however, able to reproduce the issue when the separate data extraction approach is not used.