Chinese words are translated to the garbled codes in TXT file using Java

zhaoj · October 20, 2021, 2:58am

We are a paying customer of your company and we ran into a problem when using the Redaction package (GroupDocs.Redaction_20.11-Java.zip).

We ran some tests with a .txt file that contains Chinese words; the file is encoded in UTF-8.

The test is a replacement test: we want to replace these Chinese words with other words (Chinese or English).

We ran the tests on both Ubuntu OS and Kylin OS. Everything works on Ubuntu OS, but on Kylin OS all of the Chinese words are turned into garbled characters (like ???). The test code is exactly the same, and we don’t know why this happens only on Kylin OS.

Here is the test code:

// Open the source document from its path
Redactor redactor = new Redactor(redactionOptions.getSourcePath());
List<Redaction> redactions = getRedactions(redactionOptions);
// Apply all redactions in one pass and collect the change log
RedactorChangeLog redactorChangeLog = redactor.apply(redactions.toArray(new Redaction[]{}));

if (redactorChangeLog.getStatus() != RedactionStatus.Failed) {
    // Keep the original format instead of rasterizing the output
    RasterizationOptions rasterizationOptions = new RasterizationOptions();
    rasterizationOptions.setEnabled(false);
    redactor.save(outputStream, rasterizationOptions);

    httpServletResponse.setStatus(HttpServletResponse.SC_OK);
}

Here is the Ubuntu test environment :

Host : Ubuntu OS

uname -a
Linux c7d20f2e7f91 4.15.18-147 #1 SMP Thu May 13 17:28:25 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

Container :

java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

Here is the Kylin OS test environment :

Host : Kylin OS running on ARM platform

uname -a
Linux aecffd3d62e8 4.4.131-20190505.kylin.server-generic #kylin SMP Mon May 6 14:34:13 CST 2019 aarch64 GNU/Linux

Container :

java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~deb9u1-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

Could you please help us explain or solve this problem? We would appreciate your help. Thanks a lot.

atir.tahir · October 20, 2021, 6:48pm

@zhaoj

We are investigating this issue at our end. Your investigation ticket ID is REDACTIONJAVA-142. You’ll be notified in case of any update.

AlexanderObraztsov · October 26, 2021, 7:02pm

@zhaoj

It looks like the issue could be environment specific. Could you please check that all the fonts are installed, e.g. as described here? If it doesn’t help, let us know.

zhaoj · October 29, 2021, 7:27am

All Windows fonts have been installed on both devices, and the locale command gives the same result on both.
The Kylin device is not available right now, so it is not clear which fonts are still installed on it.

atir.tahir · October 29, 2021, 2:50pm

@zhaoj

We need to know the Kylin environment details where you are facing this issue.

We can only investigate this issue if you share installed fonts list (as this issue is environment specific).

zhaoj · November 4, 2021, 6:29am

@AlexanderObraztsov @Atir_Tahir
fc-list.7z (6.7 KB)
The fc-list output on Kylin OS and on Ubuntu OS is exactly the same.
The attachment contains that output.

atir.tahir · November 4, 2021, 5:05pm

@zhaoj

We’ll further look into this scenario.

AlexanderObraztsov · November 11, 2021, 12:20pm

@zhaoj

Could you tell us the value of “file.encoding” system property on both platforms, Ubuntu OS and Kylin OS?
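
For reference, a minimal sketch of how this value can be checked from inside the JVM on each platform. The class name `EncodingCheck` and its helper methods are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reports the encoding settings the JVM picked up at startup.
class EncodingCheck {

    // Returns the file.encoding property together with the effective default charset.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    // Resolves a charset alias (such as the one a POSIX locale reports) to its canonical JDK name.
    static String canonicalName(String alias) {
        return Charset.forName(alias).name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
        // ANSI_X3.4-1968 is an alias of plain 7-bit ASCII in the JDK.
        System.out.println(canonicalName("ANSI_X3.4-1968"));
    }
}
```

Running this inside both containers shows whether the two JVMs start with different defaults. If the launch command can be changed, passing `-Dfile.encoding=UTF-8` to the JVM or setting a UTF-8 locale in the container is a common way to change that default, although whether it is possible depends on the deployment.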

zhaoj · November 23, 2021, 5:53am

@AlexanderObraztsov
Kylin OS: file.encoding: ANSI_X3.4-1968
Ubuntu OS: file.encoding: UTF-8

image.png (9.4 KB)

When I run the above case, the following error is displayed:
com.groupdocs.redaction.exceptions.DocumentFormatException: The stream contains document, which format is not supported

AlexanderObraztsov · November 23, 2021, 12:07pm

@zhaoj

GroupDocs.Redaction uses the default system encoding; there is no way to specify it for a document. This property (file.encoding) specifies the default encoding used to read files from the media. Since your text file is encoded in UTF-8, it reads fine on Ubuntu OS, and it will read fine on Kylin OS too once that platform uses this encoding. The workaround is to read the file into a stream with explicit encoding settings.
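
To illustrate what reading with explicit encoding settings means in plain Java, here is a minimal sketch that uses only the JDK and no GroupDocs calls. The class `ExplicitDecoding` and the method `readAll` are hypothetical names; the sample text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a byte stream with a charset chosen by the caller,
// not with the platform default taken from file.encoding.
class ExplicitDecoding {

    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Two Chinese characters followed by ASCII text, stored as UTF-8 bytes.
        byte[] utf8 = "\u4e2d\u6587 test".getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the file was written in keeps the text intact.
        String good = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        // Decoding the same bytes as ASCII cannot represent the Chinese characters.
        String bad = readAll(new ByteArrayInputStream(utf8), StandardCharsets.US_ASCII);

        System.out.println(good.equals("\u4e2d\u6587 test"));
        System.out.println(bad.equals(good));
    }
}
```

Decoding UTF-8 bytes as ASCII replaces every byte outside the 7-bit range with the replacement character U+FFFD, which is one way Chinese text ends up unreadable when the default encoding is ANSI_X3.4-1968.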

As for the second issue, this exception is thrown when the Redactor class fails to detect the document format by its content for some reason. Could you please share a sample file to check on our side?

zhaoj · November 24, 2021, 10:00am

@AlexanderObraztsov

test.7z (171 Bytes)

I’ve run the test with the steps you described above and got the second issue; you can check my test code in the image.png attached above.
I have also uploaded the test file here so you can check it. Thanks a lot.

Zhaojun

AlexanderObraztsov · November 24, 2021, 1:42pm

@zhaoj

We could reproduce this issue at our end. It affects all small-sized plain text documents (less than 256 Unicode characters). As soon as there’s any update, you’ll be notified.

zhaoj · November 25, 2021, 8:52am

@AlexanderObraztsov

test.7z (747 Bytes)
test_redaction.7z (300 Bytes)
Hi there,
I did another test with a bigger .txt file (more than 256 characters) that contains both English and Chinese characters, and it still doesn’t work: all of the Chinese characters are changed to unreadable characters (like “???”), so I think there may be other issues that we don’t know about yet.

I did the test on Kylin OS running on ARM platform. 

I've uploaded two files: one is the original file and the other is the file after redaction. The test code is the same as I used before (shown in image.png above).

Thanks

AlexanderObraztsov · November 25, 2021, 6:51pm

@zhaoj

It looks like the same file.encoding issue, since the redacted file is not UTF-8, it is ANSI-encoded (e.g. ANSI_X3.4-1968). If you can’t change the system file encoding, you will have to specify UTF-8 as encoding for writing as well.
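
A minimal sketch of the writing side, again with JDK classes only and without any GroupDocs calls. `ExplicitEncoding` and `encode` are hypothetical names used for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encodes text with a charset chosen by the caller,
// not with the platform default taken from file.encoding.
class ExplicitEncoding {

    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Two Chinese characters followed by ASCII text.
        String text = "\u4e2d\u6587 ok";

        // Same charset the Kylin OS container reported as its default.
        byte[] ascii = encode(text, Charset.forName("ANSI_X3.4-1968"));
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);

        // Characters that ASCII cannot represent are written as '?'.
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));
        // The UTF-8 bytes decode back to the original text.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

Writing through an ASCII encoder substitutes `?` for each character it cannot map, which matches the “???” seen in the redacted file. Once that substitution has happened the original characters cannot be recovered from the output, so the charset has to be fixed at the point where the text is written.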

AlexanderObraztsov · December 10, 2021, 1:14am

@zhaoj

GroupDocs.Redaction for Java v21.12 that includes fix for issue REDACTIONJAVA-157 (small-sized plain text documents format recognition) has been published. You can find the new version at
