We have been seeing OutOfMemoryExceptions when indexing documents in our test environment.
We took a closer look at the cause and identified an issue with corrupted/invalid docx and/or doc files. The presence of even one such file in the directory structure being indexed leads to an OutOfMemoryException and aborts the indexing process.
Could you please help with this issue?
We would expect such files to be skipped and not indexed, rather than the whole process being aborted.
P.S.: If necessary, we can provide some of the affected sample doc/docx files.
Using the latest version of the API, 24.8, we could not reproduce this issue. However, we noticed another issue: if the invalid documents are added to the directory, the search results are not as expected, but when we remove the corrupted documents, we get the expected results.
We are investigating this issue. Your investigation ticket ID is SEARCHNET-3249.
Could you please share more details about your test environment (e.g. OS details, .NET version) and a sample console application with which the issue can be reproduced?
// 'index' is an existing GroupDocs.Search Index; documentsFolder points to the folder containing the documents to index.
var options = new IndexingOptions() { UseRawTextExtraction = false };
index.Add(documentsFolder, options);
In your case, the documents are indeed damaged. If an exception occurs while extracting data from a document, that document is skipped, and the remaining documents in the folder continue to be indexed. However, if the exception is of the OutOfMemoryException type, continuing the indexing process becomes impossible, as the system has run out of memory.
When the folder may contain documents that could interrupt the entire indexing process, it is advisable to handle indexing differently: extract data from the documents individually and then index the extracted data separately, as described here: Separate Data Extraction - GroupDocs.Search for .NET.
With this approach, it is best to extract data from the documents one by one and then index the extracted data in batches as large as possible.
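For illustration, here is a minimal sketch of what that could look like. It assumes the Extractor, Document, and ExtractedData types and the index.Add(ExtractedData[], IndexingOptions) overload shown in the Separate Data Extraction article; please verify the exact names and namespaces against that article. The folder paths, batch size, and logging are placeholders.

using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Search;
using GroupDocs.Search.Common; // assumed namespace of the extraction types; adjust as needed

class SeparateExtractionSample
{
    static void Main()
    {
        string indexFolder = @"c:\MyIndex";          // placeholder paths
        string documentsFolder = @"c:\MyDocuments";

        Index index = new Index(indexFolder);
        Extractor extractor = new Extractor();
        var batch = new List<ExtractedData>();

        foreach (string filePath in Directory.EnumerateFiles(documentsFolder, "*.*", SearchOption.AllDirectories))
        {
            try
            {
                // Extract data from one document at a time so that a damaged
                // file cannot abort the indexing of the remaining documents.
                Document document = Document.CreateFromFile(filePath);
                batch.Add(extractor.ExtractData(document));
            }
            catch (Exception ex)
            {
                // Corrupted/invalid document: log it and skip it.
                Console.WriteLine($"Skipped '{filePath}': {ex.Message}");
            }

            // Index the extracted data in batches as large as practical.
            if (batch.Count >= 100)
            {
                index.Add(batch.ToArray(), new IndexingOptions());
                batch.Clear();
            }
        }

        if (batch.Count > 0)
        {
            index.Add(batch.ToArray(), new IndexingOptions());
        }
    }
}

Indexing the extracted data in batches, as recommended above, keeps the per-call overhead low, while the per-document try/catch isolates the extraction of each file.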
And what is the recommended way to update an existing index when the document folder may contain files that could interrupt the entire process? What is the best way to skip files that have not changed since the last indexing?
We recommend extracting data from each file individually and then indexing the extracted data together as a batch. This approach allows you to handle each document separately and to decide, based on its modification date, whether it needs to be re-indexed.
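As an illustration only (the timestamp store below is our own idea, not part of GroupDocs.Search): one simple way to skip unchanged files is to persist the last write time recorded at the previous indexing run and compare it with each file's current LastWriteTimeUtc, then feed only the changed files into the extraction loop shown above.

using System;
using System.Collections.Generic;
using System.IO;

static class IndexUpdatePlanner
{
    // Returns the files under documentsFolder that are new or modified since the
    // timestamps recorded in timestampFile (a simple "path|ticks" text file).
    public static List<string> GetChangedFiles(string documentsFolder, string timestampFile)
    {
        var lastIndexed = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
        if (File.Exists(timestampFile))
        {
            foreach (string line in File.ReadAllLines(timestampFile))
            {
                int sep = line.LastIndexOf('|');
                if (sep > 0 && long.TryParse(line.Substring(sep + 1), out long ticks))
                    lastIndexed[line.Substring(0, sep)] = ticks;
            }
        }

        var changed = new List<string>();
        var records = new List<string>();
        foreach (string path in Directory.EnumerateFiles(documentsFolder, "*.*", SearchOption.AllDirectories))
        {
            long writeTicks = File.GetLastWriteTimeUtc(path).Ticks;
            if (!lastIndexed.TryGetValue(path, out long indexedTicks) || writeTicks > indexedTicks)
                changed.Add(path); // new or changed since the last indexing run
            records.Add(path + "|" + writeTicks);
        }

        File.WriteAllLines(timestampFile, records); // remember the current state for the next run
        return changed;
    }
}

In a real application you would update the recorded timestamp of a file only after it has been extracted and indexed successfully, so that skipped or failed files are retried on the next run.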
We have created a simple console application that reproduces this.
Please extract everything from the ZIP file, add a license, compile it, and run it.
It tries to index the files that are also included in the ZIP file.
Please implement the Separate Data Extraction approach, and the issue will not occur. We were, however, able to reproduce the issue when the separate data extraction approach is not used.