Hi there,
We tried this:
- Place a text file with ANSI encoding in a folder indexed by DataCentral. The file should contain “Mühe” and “Vielfraß”.
- Create an index for that file.
- Search that index for “Vielfraß” or for “Mühe”.
Expected:
- The file is found.
Actual:
- Not found; the search only seems to work for UTF-encoded files.
We have also tried setting AutoDetectEncoding to true in the ExtractionOptions, but this seems to cause even more problems: words with umlauts can’t be found even in UTF-8 files (if we tested this correctly)…
var extractionOptions = new ExtractionOptions { UseRawTextExtraction = false, AutoDetectEncoding = true };
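For readers wondering why an ANSI file might behave differently from a UTF-8 one: in single-byte ANSI code pages, “ü” and “ß” are each stored as one byte (0xFC and 0xDF), and those bytes do not form valid UTF-8 sequences in these words. The following is a minimal sketch using only the .NET base class library, not the indexing API. It assumes, without confirmation from this thread, that the file content ends up being decoded as UTF-8. The class and method names are hypothetical, and ISO-8859-1 stands in for the ANSI code page because it uses the same byte values for these two letters as Windows-1252 and Windows-1250.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only; not part of any indexing library.
public static class UmlautBytesDemo
{
    // Encodes the text with a single-byte code page (ISO-8859-1 as a stand-in
    // for "ANSI") and then decodes those bytes as UTF-8, which is what a
    // reader that assumes UTF-8 would do.
    public static string ReadAnsiBytesAsUtf8(string text)
    {
        byte[] ansiBytes = Encoding.GetEncoding("iso-8859-1").GetBytes(text);

        // Invalid UTF-8 bytes are replaced with U+FFFD, so the decoded word
        // no longer equals the original.
        return Encoding.UTF8.GetString(ansiBytes);
    }
}
```

Calling `UmlautBytesDemo.ReadAnsiBytesAsUtf8("M\u00FChe")` returns a string that contains the replacement character U+FFFD instead of “ü”, so a search for “Mühe” would have nothing to match if the indexed text were produced this way.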
@jamsharp
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): SEARCHNET-3471
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.
@jamsharp
Since there are many ANSI encodings, could you please share the problematic source file with us?
This ZIP folder contains such a file:
test.zip (236 Bytes)
@jamsharp
The provided text file is in Windows-1250 encoding. It can be indexed this way:
// Subscribe to the FileIndexing event of the index
index.Events.FileIndexing += (s, e) =>
{
    // Set the encoding to Windows-1250 for files that match the specified condition
    // Uncomment the line below if you want to filter files specifically
    // if (e.DocumentFullPath.Contains("Windows_1250"))
    e.Encoding = Encodings.Windows_1250; // Specify the encoding for the document
};

// Add the directory containing the text files to the index
index.Add(@"E:\Docs");
Automatic detection of ANSI encodings can be quite complex and is not supported by the API. Therefore, it is advisable to use UTF-8 encoding for text files whenever possible.
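When the encoding of the files cannot be controlled, one possible caller-side workaround is to check whether a file is valid UTF-8 before choosing an encoding for it. The following is a minimal sketch using only the .NET base class library; `EncodingProbe` and its methods are hypothetical names, not part of the API discussed here. It can only tell "valid UTF-8" from "not valid UTF-8"; it cannot identify which ANSI code page a file uses, and a UTF-16 file would also be reported as not UTF-8.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only; not part of any indexing library.
public static class EncodingProbe
{
    // Strict decoder: throws on invalid bytes instead of silently
    // substituting the replacement character U+FFFD.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true when the bytes form a valid UTF-8 sequence.
    // Pure ASCII content is valid UTF-8 and therefore also returns true.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            StrictUtf8.GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Convenience overload that reads the whole file into memory first,
    // so it is only suitable for reasonably small text files.
    public static bool FileIsValidUtf8(string path)
    {
        return IsValidUtf8(File.ReadAllBytes(path));
    }
}
```

Inside the FileIndexing handler shown above, one could then set `e.Encoding = Encodings.Windows_1250` only when `EncodingProbe.FileIsValidUtf8(e.DocumentFullPath)` returns false. This is an untested assumption about how the two pieces fit together, and it presumes that all non-UTF-8 files share one known code page.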
Thanks for your answer. Unfortunately, we don’t have any influence on the encoding of the files, as they are located on our customers’ machines.
Automatic detection of ANSI encodings can be quite complex and is not supported by the API.
We thought that AutoDetectEncoding = true was capable of doing this. So is ANSI a limitation of it?
@jamsharp
Please allow us some time to look into this scenario further. You’ll be notified of any update.