Feature request: Cancellation of file extraction after X seconds

Hello,

Our primary goal was to run your software (GroupDocs.Search) in an environment where we don’t know in advance which files will be indexed. We don’t decide that; the users of our software do.

Unfortunately, we ran into multiple problems when indexing arbitrary files we found on our company drives. And not just any problems: the ones we found are severe and lead to an extraction that never finishes (sometimes combined with 100% RAM usage).

These are some of the examples:

Some of them might be fixed by now, or you gave us workarounds.

But with the current state of your software, we expect many more of these “running forever with 100% RAM” problems on our customers’ computers, so we are requesting a general solution. I know you offer some kind of cancellation, but as far as we know it does not solve our problem.

What might be useful is a cancellation per file. If extracting data from a file takes more than 10 seconds, I want to just ignore/skip that file. I don’t care about it. That’s a lot better than realizing “Oh, the indexing was running for 10 hours with full RAM usage”…

@jamsharp

To cancel indexing in GroupDocs.Search after a specified time period, you can use the Cancellation property of the IndexingOptions class. This sets a timeout for the indexing operation as a whole; it is not a per-file limit.

Here’s a code example demonstrating how to set up a cancellation after 10 seconds (10,000 milliseconds):

C# Example

using GroupDocs.Search.Common;
using GroupDocs.Search.Options;
using System;

namespace GroupDocs.Search.Examples
{
    class IndexingWithCancellation
    {
        public static void Main()
        {
            string indexFolder = @"./IndexFolder";
            string documentFolder = @"./Documents";

            // Create an index in the specified folder
            Index index = new Index(indexFolder);

            // Create an instance of indexing options
            IndexingOptions options = new IndexingOptions();
            options.Cancellation = new Cancellation(); // Setting a cancellation object
            options.Cancellation.CancelAfter(10000); // Setting a time period of 10 seconds

            // Indexing documents from the specified folder
            index.Add(documentFolder, options);

            Console.WriteLine("Indexing completed.");
        }
    }
}

Java Example

import com.groupdocs.search.*;
import com.groupdocs.search.common.*;
import com.groupdocs.search.options.*;

public class IndexingWithCancellation {
    public static void main(String[] args) {
        String indexFolder = "./IndexFolder";
        String documentFolder = "./Documents";

        // Create an index in the specified folder
        Index index = new Index(indexFolder);

        // Create an instance of indexing options
        IndexingOptions options = new IndexingOptions();
        options.setCancellation(new Cancellation()); // Setting a cancellation object
        options.getCancellation().cancelAfter(10000); // Setting a time period of 10 seconds

        // Indexing documents from the specified folder
        index.add(documentFolder, options);

        System.out.println("Indexing completed.");
    }
}

In these examples, the CancelAfter method is used to specify the maximum time allowed for the indexing operation. If the operation exceeds this time, it will be canceled, allowing you to avoid situations where the indexing runs indefinitely and consumes excessive resources.

This approach should help mitigate the issues you’ve encountered with long-running indexing operations and high memory usage. If you have further questions or need additional assistance, feel free to ask!


Hello @Professionalize.Discourse

Unfortunately, this does not work.
The cancellation is not able to cancel an operation like the ones I mentioned above.
I just tried it out…

If a problematic file is found, the indexing runs forever with full RAM usage, even if I add this cancellation code.

The reason is probably that the check whether to cancel happens only after the extraction of a file has finished. For such a problematic file the extraction never finishes, so the cancellation is never triggered…
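To illustrate the suspected cause, here is a minimal Java sketch of a cancellation flag that is only checked between files. It is an assumption about the behavior, not GroupDocs.Search code: the names CooperativeCancellationDemo, Flag, extract and indexAll are hypothetical, and a sleep stands in for an extraction that hangs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are GroupDocs.Search APIs.
class CooperativeCancellationDemo {

    // A timeout flag that the indexing loop consults between files.
    static final class Flag {
        private final long deadlineNanos;

        Flag(long timeoutMillis) {
            this.deadlineNanos = System.nanoTime() + timeoutMillis * 1_000_000L;
        }

        boolean isCancelled() {
            return System.nanoTime() - deadlineNanos >= 0;
        }
    }

    // Stand-in for text extraction: a "bad" file blocks for badFileMillis.
    static void extract(String file, long badFileMillis) {
        if (!file.startsWith("bad")) {
            return;
        }
        try {
            Thread.sleep(badFileMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns the files whose extraction was started.
    static List<String> indexAll(List<String> files, Flag flag, long badFileMillis) {
        List<String> started = new ArrayList<>();
        for (String file : files) {
            if (flag.isCancelled()) {
                break; // the only place the flag is checked, never during extract()
            }
            started.add(file);
            extract(file, badFileMillis);
        }
        return started;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        List<String> started =
                indexAll(List.of("good.txt", "bad.pdf", "later.docx"), new Flag(50), 500);
        long elapsedMs = (System.nanoTime() - t0) / 1_000_000L;
        System.out.println("Started: " + started + ", elapsed ms: " + elapsedMs);
    }
}
```

With a 50 ms timeout and a bad file that blocks for 500 ms, the loop notices the cancellation only after the bad file returns. If the extraction never returns, the check is never reached, which would match the behavior described above.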

@jamsharp

We are further investigating this issue. Your investigation ticket ID is SEARCHNET-3287. We’ll notify you in case of any update.

@jamsharp

The latest release of GroupDocs.Search 24.10 introduces the ability to extract data in a separate process, as well as the ability to interrupt the process by timeout.
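For readers unfamiliar with the general technique, the following Java sketch shows how a task can be run in a child process and killed once a timeout expires. It uses only standard Java APIs and is not the GroupDocs.Search implementation or its API; the class TimedChildProcess, the Outcome enum and the helper javaLauncher are hypothetical names chosen for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Generic illustration with standard Java APIs; not GroupDocs.Search code.
class TimedChildProcess {

    enum Outcome { COMPLETED, TIMED_OUT, FAILED }

    // Runs the command in a child process and kills it if it exceeds the timeout.
    static Outcome run(List<String> command, long timeoutMillis) {
        Process process;
        try {
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                    .start();
        } catch (IOException e) {
            return Outcome.FAILED; // the command could not be started
        }
        try {
            if (process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return process.exitValue() == 0 ? Outcome.COMPLETED : Outcome.FAILED;
            }
            process.destroyForcibly(); // timeout: kill the child, the parent keeps running
            process.waitFor();         // wait until the child is really gone
            return Outcome.TIMED_OUT;
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }

    // Path of the java launcher that belongs to the running JVM.
    static String javaLauncher() {
        return System.getProperty("java.home") + File.separator + "bin" + File.separator + "java";
    }

    public static void main(String[] args) {
        // Harmless demo command: print the JVM version in a child process.
        System.out.println(run(List.of(javaLauncher(), "-version"), 30_000));
    }
}
```

The point of the design is that a hung or memory-hungry extraction lives in its own process, so killing it frees its memory and the parent can skip the file and continue. Note that destroyForcibly() only targets the direct child, not any processes that child may have started.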

Thanks.

We analyzed it a bit and can verify that it is now able to handle a folder with files that caused trouble before.

We’ve got a few questions:

1. We saw in procmon that you spawn the process with a call like this:
dotnet C:\Users\xy\Projects\xy\bin\x64\Debug\net8.0-windows\win-x64\MyExtractionHost.dll 52e6f711-5499dca9

And the dotnet.exe from this path is used: C:\Program Files\dotnet\dotnet.exe

Does this mean that a .NET runtime needs to be installed on the system for this to work, or are we misunderstanding?

2. When we shut down the main application, can we expect the subprocesses to be killed automatically?

3. Your IndexingOptions (and UpdateOptions) support cancellation via a Cancel() method and a Cancellation class, as described here: Feature request: Cancellation of file extraction after X seconds - #2 by Professionalize.Discourse

When we pass a Cancellation instance to the IndexingOptions and later call Cancel() on it, will that also make sure the indexing operation ends immediately?


@jamsharp

We are looking into these questions. Your ticket ID for this inquiry is SEARCHNET-3326. You will be notified of any updates as we make progress.

@jamsharp

1. Yes, .NET must be installed, but this shouldn’t be an issue since the GroupDocs library itself is based on .NET.
2. If the main process terminates while indexing is ongoing, the child process won’t automatically close. The indexing process should be canceled using an instance of the Cancellation class.
3. We’ve implemented immediate process termination with the Cancellation class when extracting data in a separate process. This fix will be included in the next release.
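Since the child process is not closed automatically, one option is to trigger the cancellation yourself when the main application shuts down. The Java sketch below shows the general pattern with a standard JVM shutdown hook. The class CancelOnShutdown and its register method are hypothetical, and the cancelAction parameter is a stand-in for whatever call cancels your running indexing operation (for example, a call on the Cancellation instance you passed to the indexing options).

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch using only standard Java APIs; not part of GroupDocs.Search.
class CancelOnShutdown {

    // Registers a JVM shutdown hook that runs cancelAction at most once
    // and returns the hook thread so the caller can remove it later.
    static Thread register(Runnable cancelAction) {
        AtomicBoolean alreadyRun = new AtomicBoolean(false);
        Thread hook = new Thread(() -> {
            if (alreadyRun.compareAndSet(false, true)) {
                cancelAction.run(); // assumption: this cancels the ongoing indexing
            }
        }, "cancel-indexing-on-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        // Placeholder action; a real application would cancel its indexing here.
        register(() -> System.out.println("Shutdown: cancelling indexing"));
        System.out.println("Main finished");
    }
}
```

Shutdown hooks run on a normal exit and on most termination signals, but not when the process is killed forcibly, so a hard kill of the main application can still leave a child process behind.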