Docx encoding

martin.kallinger · July 18, 2016, 9:55am

Hi,

We want to try your Viewer Java Lib, so I downloaded the latest version and the GitHub examples today.

I tried one of our existing docx files and ran into encoding problems: ü and ä are not displayed properly in the resulting HTML. I could also reproduce the error by modifying the word.docx file shipped with the GitHub examples (adding a ü).

I tried to set

options.getWordsOptions().setEncoding(Charset.defaultCharset());

I tried both the default charset (Windows-1252) and UTF-8; neither had any effect.

Even if it worked, it would still be a problem for our use case: in our application, customers can upload their documents, and we might not know the encoding in which a document was created.

Which leads me to the real question: why do I have to specify an encoding for a docx or xlsx file? I use Apache POI in another project and do not have to care about the encoding at all.
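For context on that question: a docx file is a ZIP package of XML parts, and each XML part carries its own encoding declaration, so a reader can normally detect the encoding without being told. The following is a minimal sketch of that idea using only the JDK. The class and method names are made up for illustration, the package it builds is not a valid Word document, and nothing here reflects how GroupDocs.Viewer or Apache POI work internally.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DocxEncodingSketch {

    // Builds a tiny ZIP that mimics the layout of a docx: a single XML part
    // named word/document.xml, stored as UTF-8 with an XML declaration.
    // Illustration only; Word would not open this package.
    static byte[] buildMinimalPackage(String text) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<document><t>" + text + "</t></document>";
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
                zip.putNextEntry(new ZipEntry("word/document.xml"));
                zip.write(xml.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
            return bytes.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads word/document.xml back. No charset is passed in: the XML parser
    // takes the encoding from the declaration inside the part itself.
    static String readText(byte[] pkg) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    Document doc = DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder().parse(zip);
                    return doc.getDocumentElement().getTextContent();
                }
            }
            throw new IllegalStateException("word/document.xml not found");
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String umlauts = "\u00fc\u00e4"; // "üä", escaped to avoid source-encoding issues
        // prints true
        System.out.println(readText(buildMinimalPackage(umlauts)).equals(umlauts));
    }
}
```

The reading side never names a charset, which is why a caller-supplied encoding looks unnecessary for docx and xlsx input.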

I attached my test word file for you to reproduce the issue.

The code I am using is mostly copied from the samples.

public static void renderTestWord() {
    try {
        // Setup GroupDocs.Viewer config
        ViewerConfig config = Utilities.getConfiguration();

        // Create html handler
        ViewerHtmlHandler htmlHandler = new ViewerHtmlHandler(config);
        String guid = FILE_TEST_WORD;

        HtmlOptions htmlOptions = new HtmlOptions();
        htmlOptions.setResourcesEmbedded(true);
        // htmlOptions.setPageNumbersToConvert(Arrays.asList(2, 3));

        // Charset charset = Charset.forName("UTF-8");
        // options.getWordsOptions().setEncoding(charset);
        // options.getCellsOptions().setEncoding(charset);
        // options.getEmailOptions().setEncoding(charset);

        // Perform page reorder
        // ReorderPageOptions reorderPageOptions = new ReorderPageOptions(
        // guid, 2, 1);
        // htmlHandler.reorderPage(reorderPageOptions);

        // Render every page and save each one as a separate HTML file
        List<PageHtml> pages = htmlHandler.getPages(guid, htmlOptions);
        for (PageHtml page : pages) {
            Utilities.saveAsHtml(datePart() + "_" + page.getPageNumber()
                    + "_testdocx", page.getHtmlContent());
        }
    } catch (Exception exp) {
        System.out.println("Exception: " + exp.getMessage());
        exp.printStackTrace();
    }
}

Regards,

Martin

atir.tahir · July 18, 2016, 2:54pm

Hi Martin,

Thanks for showing your interest in GroupDocs.Viewer for Java 3.x and sharing the test file.

We apologize for the issue you are facing; we were able to reproduce it at our end as well. Please get the updated example from GitHub, as we have committed changes. You can also refer to our Save as HTML article.
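For readers who see similar symptoms in their own code: a frequent cause of broken umlauts is HTML that is written to disk as UTF-8 and later decoded with a single-byte charset, or the other way round. The sketch below only illustrates that general effect. It is an assumption about a typical cause, not a description of the change committed to the GitHub examples, and `HtmlSaveSketch` and `roundTrip` are hypothetical names that do not belong to GroupDocs.Viewer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlSaveSketch {

    // Hypothetical helper: writes the HTML to a temp file as UTF-8 (explicitly,
    // not via the platform default), then decodes the file with readCharset.
    static String roundTrip(String html, Charset readCharset) {
        try {
            Path file = Files.createTempFile("page", ".html");
            try {
                Files.write(file, html.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), readCharset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<p>\u00fc</p>"; // "<p>ü</p>"

        // Same charset on both sides: the text survives unchanged.
        System.out.println(roundTrip(html, StandardCharsets.UTF_8).equals(html));

        // Mismatch: the two UTF-8 bytes of "ü" (0xC3 0xBC) become two characters.
        // ISO-8859-1 is used here because every JVM ships it; these two bytes
        // decode to the same characters in Windows-1252.
        System.out.println(roundTrip(html, StandardCharsets.ISO_8859_1)
                .equals("<p>\u00c3\u00bc</p>"));
    }
}
```

Passing the charset explicitly on both the writing and the reading side keeps the output independent of the machine's default charset.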

As for your question, "why do I have to specify an encoding on a docx or xlsx file?", we have logged it as an improvement in our internal issue tracking system. As soon as we get an update from the concerned team, we will let you know.

Best Wishes

martin.kallinger · July 19, 2016, 8:42am

Thank you for your help. Encoding is working now.

atir.tahir · July 19, 2016, 9:07am

Hi Martin,

You are welcome, and we are pleased that your issue is resolved.

Kind Regards