Cannot find umlauts in ANSI-encoded files

Hi there,

We tried this:

  1. Have a text file with ANSI encoding in a folder indexed by DataCentral. The file should contain “Mühe” and “Vielfraß”.
  2. Create an index for that file
  3. Make a search on that index for “Vielfraß” or for “Mühe”

Expected:

  • The occurrence is found

Actual:

  • Not found; it only seems to work for UTF-8 files.

We have also tried setting AutoDetectEncoding to true in the ExtractionOptions, but this seems to cause even more problems: words with umlauts then can’t be found even in UTF-8 files (if we tested this correctly)…

var extractionOptions = new ExtractionOptions { UseRawTextExtraction = false, AutoDetectEncoding = true };

@jamsharp
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): SEARCHNET-3471

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

@jamsharp

Since there are many ANSI encodings, could you please share the problematic source file with us?

This ZIP folder contains such a file:
test.zip (236 Bytes)


@jamsharp

The provided text file is in Windows-1250 encoding. It can be indexed this way:

// Subscribe to the FileIndexing event of the index
index.Events.FileIndexing += (s, e) =>
{
    // Set the encoding to Windows-1250 for files that match the specified condition
    // Uncomment the line below if you want to filter files specifically
    // if (e.DocumentFullPath.Contains("Windows_1250"))
        e.Encoding = Encodings.Windows_1250; // Specify the encoding for the document
};

// Add the directory containing the text files to the index
index.Add(@"E:\Docs");

Automatic detection of ANSI encodings can be quite complex and is not supported by the API. Therefore, it is advisable to use UTF-8 encoding for text files whenever possible.
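As an aside, the ambiguity is easy to see with plain .NET (a minimal stand-alone sketch, independent of the search API, requiring .NET 5+ for Encoding.Latin1): UTF-8 encodes an umlaut as a multi-byte sequence, whereas an ANSI code page uses a single byte that is not valid UTF-8 on its own, so a file read with the wrong encoding loses the character.

```csharp
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        string word = "Mühe";

        // UTF-8 encodes 'ü' as the two-byte sequence C3 BC.
        byte[] utf8 = Encoding.UTF8.GetBytes(word);
        // Latin-1 (like most Western ANSI code pages) uses the single byte FC.
        byte[] ansi = Encoding.Latin1.GetBytes(word);

        Console.WriteLine(utf8.Length); // 5
        Console.WriteLine(ansi.Length); // 4

        // Decoding the ANSI bytes as UTF-8 garbles the umlaut: the lone byte FC
        // is not a valid UTF-8 sequence, so the decoder substitutes U+FFFD.
        string misread = Encoding.UTF8.GetString(ansi);
        Console.WriteLine(misread.Contains('\uFFFD')); // True
    }
}
```

This is why a search over ANSI files indexed with the wrong (or an assumed UTF-8) encoding cannot match “Mühe” or “Vielfraß”.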

Thanks for your answer. Unfortunately, we don’t have any influence on the encoding of the files, as they are located on our customers’ machines.

Automatic detection of ANSI encodings can be quite complex and is not supported by the API.

We thought that AutoDetectEncoding = true was capable of doing this. So ANSI encodings are a limitation of it?

@jamsharp

Please give us some time to look further into this scenario. You’ll be notified in case of any update.

One more thing my team talked about: would it be possible for you to detect that a file is ANSI and automatically use the default ANSI code page of the current system when AutoDetectEncoding = true?

We’ll investigate this scenario and update you in case of any outcome.
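In the meantime, the fallback you describe could be approximated on your side: .NET exposes the legacy ANSI code page associated with a culture via TextInfo.ANSICodePage, and that code page could then be mapped to the encoding assigned in the FileIndexing handler (a sketch of the lookup only; whether that mapping is appropriate for your customers’ files is an assumption). Note that on .NET Core/.NET 5+ the code-page encodings themselves additionally require the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Globalization;

class SystemAnsiCodePage
{
    static void Main()
    {
        // Each culture carries the legacy "ANSI" code page that Windows would
        // use for it; a value of 0 means the culture is Unicode-only.
        int codePage = CultureInfo.CurrentCulture.TextInfo.ANSICodePage;
        Console.WriteLine(codePage);

        // The invariant culture maps to Windows-1252.
        Console.WriteLine(CultureInfo.InvariantCulture.TextInfo.ANSICodePage); // 1252
    }
}
```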

@jamsharp

We recommend using one of the NuGet libraries available for detecting the encoding of text files. One of the most effective options is UTF.Unknown, which is straightforward to use.

You can use the following code snippet:

// Subscribe to the FileIndexing event
index.Events.FileIndexing += (s, e) =>
{
    // Read all bytes from the specified document
    byte[] data = File.ReadAllBytes(e.DocumentFullPath);
    
    // Detect the character set encoding from the byte array
    UtfUnknown.DetectionResult result = UtfUnknown.CharsetDetector.DetectFromBytes(data);
    
    // Check if an encoding was detected
    if (result.Detected != null)
    {
        // Output the detected encoding name to the console
        Console.WriteLine("Encoding detected: " + result.Detected.EncodingName);
        
        // Set the detected encoding for the event
        e.Encoding = result.Detected.EncodingName;
    }
};

This code will automatically detect the encoding of your files, ensuring that the appropriate encoding is applied.