Inconsistent timeouts with ExtractInSeparateProcess feature

Hello.

We are experiencing inconsistent timeouts with the ExtractInSeparateProcess feature. When we index a network share, we encounter random timeouts: sometimes more, sometimes fewer, but at least one per run so far. Without the ExtractInSeparateProcess feature, indexing completes normally. Because of that, and because the number of timeouts varies, it probably isn’t related to specific files. It feels more like some kind of race condition.

When we indexed the same data on a local drive with the ExtractInSeparateProcess feature, it worked fine without timeouts. This would indicate a network-related problem. On the other hand, index creation without the ExtractInSeparateProcess feature also worked fine on the network share.

Attached you will find the test application we used.

Thank you for your assistance!

@jamsharp

It seems you are experiencing inconsistent timeouts when using the ExtractInSeparateProcess feature while indexing documents from a network share. This issue could be related to network conditions, especially since you mentioned that indexing works fine on a local drive without timeouts.

The ExtractInSeparateProcess feature is designed to handle resource-intensive operations more effectively by isolating the data extraction process. This means that if there is insufficient memory or if the time limit for indexing a document is exceeded, only the affected document’s indexing will fail, allowing the rest to continue. However, when working with network shares, various factors such as network latency, bandwidth, and file access permissions can introduce inconsistencies.

Here are a few suggestions to troubleshoot and potentially resolve the issue:

  1. Network Stability: Ensure that the network connection to the share is stable and has sufficient bandwidth. You might want to monitor network performance during the indexing process.

  2. Timeout Settings: Check if there are configurable timeout settings in your indexing options. Adjusting these settings might help mitigate the timeouts you are experiencing.

  3. Resource Allocation: Since the ExtractInSeparateProcess feature is sensitive to memory availability, ensure that the machine running the indexing process has adequate resources allocated.

  4. Logging: Implement logging in your application to capture detailed information about the timeouts, including which documents were being processed at the time. This can help identify patterns or specific conditions that lead to timeouts.

  5. Testing with Different Configurations: If possible, try varying the configuration of the ExtractInSeparateProcess feature to see if certain settings lead to more consistent results.
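As a rough way to act on suggestions 1 and 4, the sketch below times one full read of every file under a folder, so that slow reads from the share can be compared against the extraction timeout. It uses only the .NET base class library. `ShareReadProbe` and `Measure` are hypothetical names introduced for illustration and are not part of GroupDocs.Search.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of GroupDocs.Search.
public static class ShareReadProbe
{
    // Reads every file under 'root' once and returns (path, elapsed) pairs, slowest first.
    public static List<(string Path, TimeSpan Elapsed)> Measure(string root)
    {
        var results = new List<(string Path, TimeSpan Elapsed)>();
        var buffer = new byte[81920];

        foreach (var path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
        {
            var watch = Stopwatch.StartNew();
            using (var stream = File.OpenRead(path))
            {
                // Drain the stream so the whole file is actually transferred.
                while (stream.Read(buffer, 0, buffer.Length) > 0) { }
            }
            watch.Stop();
            results.Add((path, watch.Elapsed));
        }

        return results.OrderByDescending(r => r.Elapsed).ToList();
    }
}
```

Calling `ShareReadProbe.Measure(folder)` with the same folder that is passed to the indexer and printing the first few entries shows the slowest files. Client-side caching can make repeated runs look faster than the first, so the first run after a pause is probably the most representative one.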

If the problem persists, consider reaching out to the support team with detailed logs and configurations for further assistance.

For more information on configuring the ExtractInSeparateProcess feature, refer to the official documentation [1].

Sources:
[1]: Extraction in separate process | Documentation

@jamsharp

It seems the test application you mentioned is not attached to your message; could you please re-upload it so we can analyze and reproduce the issue? Additionally, to better understand the problem, could you share details about the network environment, such as any latency, bandwidth constraints, or access restrictions that might affect the network share? This information will help us investigate the issue more effectively.

Sorry.
Here is the test application
GroupDocsTest.zip (5.0 KB)

I will talk to our administrators and share details with you soon.

Thank you for your time!

could you share details about the network environment, such as any latency, bandwidth constraints, or access restrictions that might affect the network share?

Windows Server 2022 as FileServer (VM)
Storage is hosted on a Lenovo DH4000 (all SSD) connected via HBA to the ESXi hosts
Network is 10Gig Fiber for all servers - no limitations.

@jamsharp
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): SEARCHNET-3372

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

@jamsharp

If by “inconsistent timeouts” you are referring to the document indexing time, please check the logs on your server. It’s possible that the indexing time varies due to insufficient RAM, leading to out-of-memory exceptions and subsequent restarts of the child process.
Alternatively, are you referring to the timeout that terminates the child process if the application exceeds the allotted time for data extraction?
What is the current amount of RAM on the server?

@atir.tahir: By “inconsistent timeouts” we mean the termination of the child processes or, more precisely, the number of terminations. We have indexed the same network share multiple times, each time with a new in-memory index, and each time we get a different number of terminations. So far the fewest was 1 and the most 18.

The server on which the indexing application was running has 12 GB RAM.

@jamsharp

Could you please provide details on the messages recorded in the logs? Additionally, we would like to know which errors are triggering the child process to restart.

// Subscribe to the ErrorOccurred event of the 'index' object
index.Events.ErrorOccurred += (sender, args) =>
{
    // Log the error message to the console
    Console.WriteLine(args.Message);
};

The logs don’t provide much information. It’s just the random timeouts and no other errors:

PS C:\Users\Administrator\Desktop\Demo> .\GroupDocsTest.exe '\\intranet\temp\SpaceObServer\SOS-3949\api' true
ExtractInSeparateProcess: true
Data extraction timeout exceeded
Error:
Output:
PS C:\Users\Administrator\Desktop\Demo> .\GroupDocsTest.exe '\\intranet\temp\SpaceObServer\SOS-3949\api' true
ExtractInSeparateProcess: true
Data extraction timeout exceeded
Error:
Output:
Data extraction timeout exceeded
Error:
Output:
PS C:\Users\Administrator\Desktop\Demo> .\GroupDocsTest.exe '\\intranet\temp\SpaceObServer\SOS-3949\api' true
ExtractInSeparateProcess: true
Data extraction timeout exceeded
Error:
Output:
Data extraction timeout exceeded
Error:
Output:
Data extraction timeout exceeded
Output:
Error:
Output:
Error:
PS C:\Users\Administrator\Desktop\Demo> .\GroupDocsTest.exe '\\intranet\temp\SpaceObServer\SOS-3949\api' true
ExtractInSeparateProcess: true
Data extraction timeout exceeded
Error:
Output:
Data extraction timeout exceeded
Error:
Output:
Error:
Output:
PS C:\Users\Administrator\Desktop\Demo>

This is the test project we used to create the output: GroupDocsTest.zip (3.1 KB)
And this is the data we used: TestData.7z (250.0 KB)
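To compare runs like the ones above, the number of timeouts per run can be tallied from the captured console output. The following is a minimal sketch that assumes the output is available as a string and that each timeout is reported on its own line with exactly the text shown in the log. `TimeoutTally` is a hypothetical helper, not part of the test project or of GroupDocs.Search.

```csharp
using System;
using System.Linq;

// Hypothetical helper for counting timeout lines in captured console output.
public static class TimeoutTally
{
    // Message text as it appears in the console output above.
    public const string Marker = "Data extraction timeout exceeded";

    // Returns how many lines of 'consoleOutput' consist of the timeout message.
    public static int Count(string consoleOutput)
    {
        return consoleOutput
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Count(line => line.Trim() == Marker);
    }
}
```

Recording this count for each run makes the spread between runs easy to report alongside the logs.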

@jamsharp

Thanks for the additional information. We’ll continue the investigation and let you know of any progress or if further details are required.

@jamsharp

We have thoroughly reviewed the extraction process and implemented some adjustments to address the concerns you raised. Here’s an update:

  1. Issue Reproducibility:
    We were unable to reproduce the problem during our tests. The provided HTML files were indexed without any issues on our platform.

  2. Increased Timeout:
    To ensure sufficient time for extraction, we increased the extraction timeout to 10 minutes. This adjustment should help handle scenarios involving large or complex HTML files. The updated configuration is as follows:

    options.SeparateProcessOptions.ExtractInSeparateProcess = true;
    options.SeparateProcessOptions.UseDotnetToStartProcess = false;
    options.SeparateProcessOptions.AssemblyPath = assemblyPath;
    options.SeparateProcessOptions.Timeout = TimeSpan.FromMinutes(10); // Increased timeout
    
  3. Platform-Specific Considerations:
    If the issue persists, it might be related to the HTML file format on the specific platform or server configuration. In such cases, we recommend:

    • Verifying the compatibility of the HTML files with the indexing process.
    • Using plain text indexing as a fallback. We have included a custom extractor for handling HTML content:
      // Fallback extractor: indexes .html files as plain text.
      private class HtmlExtractor : IFieldExtractor
      {
          // File extensions handled by this extractor.
          public string[] Extensions => new string[] { ".html" };
      
          public DocumentField[] GetFields(string filePath)
          {
              using var stream = File.OpenRead(filePath);
              return GetFields(stream);
          }
      
          public DocumentField[] GetFields(Stream stream)
          {
              // Read the raw file text and store it unchanged in the content field.
              using var reader = new StreamReader(stream);
              return new DocumentField[]
              {
                  new DocumentField(CommonFieldNames.Content, reader.ReadToEnd()),
              };
          }
      }
      
      This ensures the content is indexed even if the file structure poses challenges.

If the above steps do not resolve the issue, please share a screencast or video demonstrating the problem, along with detailed steps to reproduce it.