Indexing a specific PDF file leads to 100% RAM usage and runs forever

Hi,

We ran into a problem when indexing the following file:
SEPA_ZvFormate_de-5.pdf (1.1 MB)

When we start indexing a single folder containing only this one file, RAM usage keeps climbing (to over 50 GB!) and the indexing seems to run forever. I had to cancel it after 10 minutes.

Our code is very simple (please ignore the empty property initializers; they don’t matter):

// Create (or open) the index on disk
var index = new GroupDocs.Search.Index(indexDirectory, new IndexSettings { }, overwriteIfExists: false);

// Index the folder that contains only the PDF above
index.Add(folderWithThatOneFileInIt, new IndexingOptions { });
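
Side note in case it helps with reproducing: instead of killing the process by hand, the run can apparently be bounded via the cancellation support on IndexingOptions. A minimal sketch, assuming the Cancellation.CancelAfter(milliseconds) API from the GroupDocs.Search documentation:

var options = new IndexingOptions();
options.Cancellation = new Cancellation();         // assumed to live in GroupDocs.Search.Common
options.Cancellation.CancelAfter(10 * 60 * 1000);  // give up after 10 minutes instead of killing the process
index.Add(folderWithThatOneFileInIt, options);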

We are using the latest version, 24.8.0.

@jamsharp
This issue is reproduced at our end. Therefore, we have opened the following new ticket(s) in our internal issue-tracking system and will deliver the fixes according to the terms mentioned in the Free Support Policies.

Issue ID(s): SEARCHNET-3241

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

@jamsharp

The issue can be resolved by configuring the following indexing option. We would recommend always using this option, as it helps ensure that the text is properly structured.

index.Add(folderWithThatOneFileInIt, new IndexingOptions { UseRawTextExtraction = false });

Hi.
We did a little bit of testing with the UseRawTextExtraction option.
In our test environment (approx. 16k files), indexing takes much longer with UseRawTextExtraction = false:

UseRawTextExtraction    Time
true                    24:56
false                   58:40

This is not what we would have expected.
Additionally, we have seen other issues with other file types, even with UseRawTextExtraction = false (I will open separate tickets for these issues).

Overall, we would be very happy to see a fix for SEARCHNET-3241 soon!

Thank you for your time!

@jamsharp

Could you please share a simple console application with which the issue can be reproduced at our end?

We will check whether we can do this with reasonable effort.
Thanks for your fast reply!

@jamsharp

You are welcome.

Could you please share a simple console application with which the issue can be reproduced at our end?

Hello,
We are not exactly sure what you want us to do.
The code is pretty simple; it is no more than this:

var index = new Index(IndexStorageDirectory, false);
index.Add(m_TargetDirectory, new IndexingOptions { UseRawTextExtraction = false });

We executed it once with UseRawTextExtraction = true and once with UseRawTextExtraction = false on a company drive with 16k files in many different formats. As the company drive contains company-internal information, we unfortunately cannot share it with you.

Is the performance gap not reproducible on your side when you run the code above on a large number of files?
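
If it helps, the whole test boils down to a console application like the following. This is a sketch with placeholder paths (our real target is the company drive); we measured the times above with a Stopwatch around the Add call:

using System;
using System.Diagnostics;
using GroupDocs.Search;
using GroupDocs.Search.Options;

class Program
{
    static void Main()
    {
        // Placeholder paths; the real target is a drive with ~16k mixed-format files
        const string indexStorageDirectory = @"C:\Temp\SearchIndex";
        const string targetDirectory = @"C:\Temp\TestFiles";

        // true = overwrite any existing index so every run starts clean
        var index = new Index(indexStorageDirectory, true);

        var stopwatch = Stopwatch.StartNew();
        index.Add(targetDirectory, new IndexingOptions { UseRawTextExtraction = false });
        stopwatch.Stop();

        Console.WriteLine($"Indexing took {stopwatch.Elapsed}");
    }
}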

@jamsharp

Are you sure that the files are valid (i.e., that there are no invalid, corrupted, or damaged files in the source directory)? You may take a look at this approach.

Are you sure that the files are valid (i.e., that there are no invalid, corrupted, or damaged files in the source directory)?

We can’t guarantee that; it’s absolutely possible. Why do you ask? Does this make it slower when UseRawTextExtraction = false?

You may take a look at this approach.

What do you mean by this? This is the exact code from above, isn’t it?
That’s where we observe the performance gap.
We don’t understand why UseRawTextExtraction = false is so much slower:

UseRawTextExtraction    Time
true                    24:56
false                   58:40

@jamsharp

It’s always better to use UseRawTextExtraction = false.
By performing data extraction and indexing separately, you can parallelize the process and thereby speed it up several times. You can even extract the data on different servers, since both the original documents and the extracted data can be serialized. Therefore, please adopt this approach: Separate data extraction.
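
A minimal sketch of that workflow, based on our reading of the Separate data extraction article. The member names used here (Extractor, ExtractedData.Serialize/Deserialize, and the ExtractedData overload of Index.Add) are assumptions and should be verified against the current API reference; this illustrates the idea rather than being copy-paste-ready:

using System.IO;
using System.Threading.Tasks;
using GroupDocs.Search;
using GroupDocs.Search.Common;
using GroupDocs.Search.Options;

class SeparateExtractionSketch
{
    static void Main()
    {
        // Placeholder folders
        string documentsFolder = @"C:\Temp\TestFiles";
        string dataFolder = @"C:\Temp\ExtractedData";
        string indexFolder = @"C:\Temp\SearchIndex";

        // Step 1: extract data from each document. Each file is independent,
        // so this loop can run in parallel, or even on different servers.
        var extractor = new Extractor(new IndexSettings()); // assumed constructor
        Parallel.ForEach(Directory.GetFiles(documentsFolder), filePath =>
        {
            Document document = Document.CreateFromFile(filePath);
            ExtractedData data = extractor.Extract(document, new ExtractionOptions());

            // The extracted data is serializable, so it can be written to disk
            // on one machine and indexed on another.
            byte[] bytes = data.Serialize(); // assumed member
            File.WriteAllBytes(Path.Combine(dataFolder, Path.GetFileName(filePath) + ".dat"), bytes);
        });

        // Step 2: index the pre-extracted data; no document parsing happens here,
        // which is what keeps the indexing step itself fast.
        var index = new Index(indexFolder);
        foreach (string dataPath in Directory.GetFiles(dataFolder))
        {
            ExtractedData data = ExtractedData.Deserialize(File.ReadAllBytes(dataPath)); // assumed member
            index.Add(new[] { data }, new IndexingOptions());
        }
    }
}

Under this split, the extra cost of structured extraction matters less: the expensive extraction step scales out across threads or machines, while the index itself only consumes ready-made data.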