Feature request: Cancellation of file extraction after X seconds

Hello,

Our primary goal was to run your software (GroupDocs.Search) in an environment where we don’t know which files will be indexed. We don’t decide that; the user of our software does, so we don’t know the files beforehand.

Unfortunately, we ran into multiple problems when indexing arbitrary files we found on our company drives. These are not minor problems: they are severe and lead to an infinite extraction time (sometimes combined with 100% RAM usage).

These are some of the examples:

Some of them might be fixed by now, or you gave us workarounds.

But with the current state of your software, we kind of expect many more of these “running forever with 100% RAM” problems on our customers’ computers, and we are requesting a general solution for this problem. I know you offer some kind of cancellation, but as far as we know it does not seem to solve our problem.

What might be useful is a cancellation per file. If extracting data from a file takes more than 10 seconds, I want to just ignore/skip that file. I don’t care about it. That’s a lot better than realizing “Oh, the indexing was running for 10 hours with full RAM usage”…

@jamsharp

To implement a cancellation mechanism for file extraction in GroupDocs.Search after a specified time period, you can use the Cancellation property of the IndexingOptions class. This lets you set a timeout for the indexing operation, so that the operation is canceled if it runs too long.

Here’s a code example demonstrating how to set up a cancellation after 10 seconds (10,000 milliseconds):

C# Example

using GroupDocs.Search.Common;
using GroupDocs.Search.Options;
using System;

namespace GroupDocs.Search.Examples
{
    class IndexingWithCancellation
    {
        public static void Main()
        {
            string indexFolder = @"./IndexFolder";
            string documentFolder = @"./Documents";

            // Create an index in the specified folder
            Index index = new Index(indexFolder);

            // Create an instance of indexing options
            IndexingOptions options = new IndexingOptions();
            options.Cancellation = new Cancellation(); // Setting a cancellation object
            options.Cancellation.CancelAfter(10000); // Setting a time period of 10 seconds

            // Indexing documents from the specified folder
            index.Add(documentFolder, options);

            Console.WriteLine("Indexing completed.");
        }
    }
}

Java Example

import com.groupdocs.search.*;
import com.groupdocs.search.common.*;
import com.groupdocs.search.options.*;

public class IndexingWithCancellation {
    public static void main(String[] args) {
        String indexFolder = "./IndexFolder";
        String documentFolder = "./Documents";

        // Create an index in the specified folder
        Index index = new Index(indexFolder);

        // Create an instance of indexing options
        IndexingOptions options = new IndexingOptions();
        options.setCancellation(new Cancellation()); // Setting a cancellation object
        options.getCancellation().cancelAfter(10000); // Setting a time period of 10 seconds

        // Indexing documents from the specified folder
        index.add(documentFolder, options);

        System.out.println("Indexing completed.");
    }
}

In these examples, the CancelAfter method is used to specify the maximum time allowed for the indexing operation. If the operation exceeds this time, it will be canceled, allowing you to avoid situations where the indexing runs indefinitely and consumes excessive resources.

This approach should help mitigate the issues you’ve encountered with long-running indexing operations and high memory usage. If you have further questions or need additional assistance, feel free to ask!

Hello @Professionalize.Discourse

Unfortunately, this does not work.
The cancellation is not able to cancel an operation like the ones I mentioned above.
I just tried it out…

If a problematic file is found, the indexing runs forever with full RAM usage, even if I add this cancellation code.

The reason is probably that the check whether to cancel happens after the extraction, but the extraction for such a problematic file runs forever, so the cancellation is never triggered…
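
To illustrate that guess, here is a minimal sketch in plain Java (matching the Java example above). It does not use or reproduce GroupDocs.Search internals: `extract`, `indexAll`, and the file names are hypothetical stand-ins. It assumes the cancel flag is only read between files, which is exactly the behaviour suspected above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class CooperativeCancellationDemo {

    // Hypothetical stand-in for a document extractor: it only sleeps.
    public static void extract(String file, long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Processes files one by one. The cancel flag is read only BETWEEN files,
    // so a request made during an extraction has no effect until it returns.
    public static List<String> indexAll(List<String> files, long extractMillis,
                                        AtomicBoolean cancelRequested) {
        List<String> indexed = new ArrayList<>();
        for (String file : files) {
            if (cancelRequested.get()) {
                break; // the only cancellation checkpoint
            }
            extract(file, extractMillis);
            indexed.add(file);
        }
        return indexed;
    }

    public static void main(String[] args) {
        AtomicBoolean cancelRequested = new AtomicBoolean(false);

        // Request cancellation after 50 ms, while the first extraction
        // (500 ms) is still in progress.
        Thread timer = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            cancelRequested.set(true);
        });
        timer.start();

        List<String> indexed = indexAll(List.of("a.pdf", "b.pdf"), 500, cancelRequested);
        System.out.println("Indexed before cancellation took effect: " + indexed);
    }
}
```

In this sketch the first file is still extracted in full even though cancellation was requested almost immediately. If `extract` never returned, the checkpoint would never be reached, which would match what we observe.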

@jamsharp

We are further investigating this issue. Your investigation ticket ID is SEARCHNET-3287. We’ll notify you in case of any update.

@jamsharp

The latest release of GroupDocs.Search 24.10 introduces the ability to extract data in a separate process, as well as the ability to interrupt the process by timeout.
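
For readers unfamiliar with the general technique, the idea of running work in a child process and killing it on timeout can be sketched in plain Java as follows. This is only an illustration of the pattern, not the GroupDocs.Search implementation or its API; `runWithTimeout` is a hypothetical helper, and the demo simply launches `java -version` as a harmless child process.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ProcessTimeoutDemo {

    /**
     * Starts the command as a child process and waits up to timeoutMillis.
     * Returns true if it exited in time, false if it had to be killed.
     */
    public static boolean runWithTimeout(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                // Discard output so the child cannot block on a full pipe.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();

        if (process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }

        // The child is a separate OS process, so it can be terminated
        // even if it is stuck in an endless loop.
        process.destroyForcibly();
        process.waitFor();
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Use the currently running JVM binary as a harmless demo command.
        String javaExe = ProcessHandle.current().info().command().orElse("java");
        boolean finished = runWithTimeout(List.of(javaExe, "-version"), 10_000);
        System.out.println(finished ? "finished in time" : "killed after timeout");
    }
}
```

The point of the separate process is that a hung or memory-hungry extraction can be stopped from outside, and its memory is released by the operating system when the process dies.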

Thanks.

We analyzed it a bit and can verify that it is now able to handle a folder with files that caused trouble before.

We’ve got a few questions:


We saw in procmon that you spawn the process with a call like this:
dotnet C:\Users\xy\Projects\xy\bin\x64\Debug\net8.0-windows\win-x64\MyExtractionHost.dll 52e6f711-5499dca9

And the dotnet.exe from this path is used: C:\Program Files\dotnet\dotnet.exe

Does this mean that a .NET runtime needs to be installed on the system for this to work, or are we misunderstanding?


When we shut down the main application, can we expect the sub processes to be killed automatically?


Your IndexingOptions (and UpdateOptions) support cancellation via a Cancel() method and a Cancellation class, as described in post #2 of this thread (by Professionalize.Discourse).

When we pass a Cancellation instance to the IndexingOptions and later call Cancel() on it, will it make sure that the indexing operation ends immediately, too?

@jamsharp

We are looking into these questions. Your ticket ID for this inquiry is SEARCHNET-3326. You will be notified of any updates as we make progress.

@jamsharp

Yes, .NET must be installed, but this shouldn’t be an issue since the GroupDocs library itself is based on .NET.
If the main process terminates while indexing is ongoing, the child process won’t automatically close. The indexing process should be canceled using an instance of the Cancellation class.
We’ve implemented immediate process termination with the Cancellation class when extracting data in a separate process. This fix will be included in the next release.

Thanks for answering our questions.

Yes, .NET must be installed, but this shouldn’t be an issue since the GroupDocs library itself is based on .NET.

Currently, we are focusing on delivering most of our code “self-contained” to our customers, which means that we assume .NET is not installed at all and we ship the runtime and libraries ourselves within our installation folder (see “Runtime roll forward for .NET Core self-contained app deployments” on Microsoft Learn).

Would it be possible to add one more option, “IsSelfContained” or similar, which runs just “extractorprocess.exe” instead of “dotnet run extractorprocess.exe”?
Alternatively (even better), it’s probably possible to analyze extractorprocess.exe, find out whether it is self-contained and, if so, skip the dotnet command when executing it.

@jamsharp

We will further investigate it and see if it’s even possible. You’ll be notified in case of any update.

@jamsharp

We are planning to implement the UseDotnetToStartProcess option in the next release of the API. This flag determines whether to use dotnet to start the process. The default value is true. Set the value to false if you want to start a self-contained application. Note that dotnet is not used to start applications on the old .NET Framework.
We’ll notify you once the release is available to download.
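
As a rough illustration of what such a flag changes, the sketch below builds the two command lines discussed in this thread. It is a hypothetical helper written in plain Java, not GroupDocs.Search code: `buildCommand` and the file name `MyExtractionHost.exe` are assumptions, and the second argument is simply the opaque value seen in the procmon trace earlier in the thread.

```java
import java.util.List;

public class ExtractionHostCommand {

    /**
     * Hypothetical helper: returns the command line used to start the
     * extraction host, with or without the "dotnet" launcher in front.
     */
    public static List<String> buildCommand(boolean useDotnetToStartProcess,
                                            String hostPath, String argument) {
        if (useDotnetToStartProcess) {
            // Framework-dependent deployment: "dotnet <host.dll> <argument>"
            return List.of("dotnet", hostPath, argument);
        }
        // Self-contained deployment: start the host executable directly.
        return List.of(hostPath, argument);
    }

    public static void main(String[] args) {
        System.out.println(buildCommand(true, "MyExtractionHost.dll", "52e6f711-5499dca9"));
        // prints [dotnet, MyExtractionHost.dll, 52e6f711-5499dca9]
        System.out.println(buildCommand(false, "MyExtractionHost.exe", "52e6f711-5499dca9"));
        // prints [MyExtractionHost.exe, 52e6f711-5499dca9]
    }
}
```

With the flag set to false, no dotnet installation would be needed on the target machine, provided the host application is published self-contained.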
