Search does not find word in existing files when there are many duplicates

Hello,

We have experienced something that looks like a bug in GroupDocs search.

I attached a ZIP file at the bottom. If you unzip it, you get one folder “1000txt” with 10 subfolders, each containing 100 files. The files are all duplicates of each other!

Step 1: Create an index for the “root” folder “1000txt”. This is successful, and all 1000 files can be indexed in my case:

Number of documents in index (total): 1.000
Number of indexed documents: 1.000
Number of updated documents: 0
Number of removed documents: 0
Number of errors: 0
Segment count: 1
Total documents size: 61.467.665 bytes
Indexed documents size: 0 MB
Total terms in index: 204

Step 2: Run a search query for the word “tempor”

var query = SearchQuery.CreateFieldQuery(CommonFieldNames.Content, SearchQuery.CreateWordQuery("tempor"));

var searchResult = index.Search(query);

Expected: searchResult contains all 1.000 files, because all of them contain the word “tempor”. From our point of view, it does not make sense that only some files are returned when they are all duplicates of each other.

Actual: searchResult contains 121 elements. What’s interesting: It contains all 100 elements from the folder “Neuer Ordner” and exactly 21 elements from the folder “Neuer Ordner - Kopie”, but none of the others.

Interesting: when searching for different words, this happens:

  • searching for “soluta” => all 1.000 files
  • searching for “nonumy” => 133 files
  • searching for “dolore” => 77 files

1000txt.zip (3.7 MB)

@jamsharp
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): SEARCHNET-3334

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

@jamsharp

In this case, the small number of files found is caused by the limits on the maximum number of occurrences in the search options. By default, MaxTotalOccurrenceCount and MaxOccurrenceCountPerTerm are 500000 and 100000 respectively.
Please increase the values of these properties and it will solve the issue.

SearchOptions searchOptions = new SearchOptions();
// Raise both occurrence limits well above the defaults (500000 total, 100000 per term).
searchOptions.MaxTotalOccurrenceCount = 10000000;
searchOptions.MaxOccurrenceCountPerTerm = 10000000;
SearchResult result = index.Search("dolore", searchOptions);
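
To illustrate why an occurrence cap can hide whole documents, here is a minimal, self-contained sketch. It does not use the library and is only a simplified model: it assumes documents are taken in order and that a document is dropped once the running total of occurrences would exceed the cap. The figure of 826 occurrences per file is hypothetical (the thread does not state the real count), and `OccurrenceCapModel` and `CountReturnedDocuments` are names made up for this example.

```csharp
using System;
using System.Linq;

public static class OccurrenceCapModel
{
    // Simplified model, not library code: counts how many documents fit
    // before the running total of occurrences would exceed the cap.
    public static int CountReturnedDocuments(int[] occurrencesPerDocument, long maxOccurrences)
    {
        long total = 0;
        int returned = 0;
        foreach (int occurrences in occurrencesPerDocument)
        {
            if (total + occurrences > maxOccurrences)
            {
                break; // cap reached: all remaining documents are dropped
            }
            total += occurrences;
            returned++;
        }
        return returned;
    }

    public static void Main()
    {
        // Hypothetical input: 1000 identical files with 826 occurrences each.
        int[] files = Enumerable.Repeat(826, 1000).ToArray();

        // Cap of 100000 (the default per-term limit named above).
        Console.WriteLine(CountReturnedDocuments(files, 100_000));    // prints 121

        // Cap of 10000000 (the raised limit from the snippet above).
        Console.WriteLine(CountReturnedDocuments(files, 10_000_000)); // prints 1000
    }
}
```

Under this model, the more often a word occurs in each file, the fewer files fit under the cap, which would be consistent with different words returning different file counts.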

What is the reason for the default values of 100.000 and 500.000? And when increasing the numbers drastically, is there any consequence other than possibly having to wait longer for a search to finish?


@jamsharp

The default values of 100,000 and 500,000 were originally set to mitigate the risk of running out of memory during operations. However, we’ve found that this concern is no longer as relevant as it once was.
In response to your question about increasing these limits, we’ve decided to raise the default values to 2 billion. This change is aimed at improving performance and flexibility. Please note that while higher values allow for more extensive searches, they may also lead to longer processing times. This update will be included in the next release. We’ll notify you as soon as the release is available to download.