Encoding is not detected when converting TXT to HTML with Viewer in .NET

Clemens · December 19, 2022, 1:49pm

It seems that text encoding is not always detected correctly.
When converting the attached “Text normal.txt” or “Partially working.txt” to HTML without any other options, the result is not correct:
image.png (5.9 KB)
image.png (1.3 KB)

When converting to PDF with GroupDocs.Conversion, the attached “Partially working.txt” looks perfectly fine (see “Partially working.txt.pdf”).
However, “Text normal.txt” does not (see “Text normal.txt.pdf”).

When the file includes a byte order mark (BOM), the encoding is detected correctly (see “Text with BOM.txt”).

Text encoding.zip (40.7 KB)

I know that it’s possible to specify the encoding myself in the LoadOptions, but I don’t know the encoding in advance. Is it possible to improve the Viewer’s encoding detection, at least to the level of Conversion?
GroupDocs Viewer and Conversion 22.11 were used for testing.

vladimir.litvinchik · December 19, 2022, 6:43pm

@Clemens

Viewer and Conversion use different engines to process TXT files. Viewer’s default encoding is System.Text.Encoding.Default, so to make it work in most cases you can set the encoding to UTF-8 in the load options.

using System.Text;
using GroupDocs.Viewer.Options;
LoadOptions loadOptions = new LoadOptions();
loadOptions.Encoding = Encoding.UTF8;

We’ll also consider changing the default value of the Encoding property to UTF-8, as it seems to be a better option than System.Text.Encoding.Default.

Clemens · December 20, 2022, 8:05am

@vladimir.litvinchik

Thank you for the quick answer and the information.
Did I understand correctly that there is no real detection at all, and that it just uses the Default encoding?
If so, I agree that your suggestion to set it to UTF-8 using LoadOptions is a good solution.
I also gave it a quick try with different formats and it seems to work fine in all cases.

Clemens · December 20, 2022, 10:13am

@vladimir.litvinchik

Okay, I just double-checked: with the LoadOptions hardcoded to UTF-8, normal ANSI TXT files are no longer displayed correctly.
So improving the detection would be highly appreciated, as hardcoding is not an option for us.
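
Until detection improves, one caller-side workaround is to guess the encoding before setting LoadOptions.Encoding. The following is only a minimal sketch, not how Viewer or Conversion detect encodings: the EncodingSniffer class and its Guess method are hypothetical names. It checks for a BOM, then tests whether the bytes are strictly valid UTF-8, and otherwise returns a fallback supplied by the caller. This works because ANSI text containing non-ASCII bytes is rarely valid UTF-8, while pure ASCII decodes the same either way.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of GroupDocs): BOM check, then strict UTF-8
// validation, then a caller-supplied fallback such as an ANSI code page.
public static class EncodingSniffer
{
    public static Encoding Guess(byte[] bytes, Encoding fallback)
    {
        // 1. Byte order marks. UTF-32 LE is tested before UTF-16 LE because
        //    its BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;
        if (bytes.Length >= 4 && bytes[0] == 0xFF && bytes[1] == 0xFE && bytes[2] == 0x00 && bytes[3] == 0x00)
            return Encoding.UTF32;
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 LE
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 BE

        // 2. No BOM: a strict decoder throws on any invalid UTF-8 sequence.
        try
        {
            var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                              throwOnInvalidBytes: true);
            strictUtf8.GetString(bytes);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            // 3. Not valid UTF-8: assume the caller's fallback encoding.
            return fallback;
        }
    }
}
```

The result could then be assigned with something like loadOptions.Encoding = EncodingSniffer.Guess(File.ReadAllBytes(path), Encoding.Default). Note that on .NET Core and .NET 5+ Encoding.Default is always UTF-8, so there the fallback would need to be an explicit code page, for example Encoding.GetEncoding(1252) after registering CodePagesEncodingProvider.Instance. The sketch cannot tell different ANSI code pages apart; that choice stays with the caller.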

vladimir.litvinchik · December 20, 2022, 10:37am

@Clemens

Got it, we’ll take a look at what we can do here.