PDF and EPub documents have section titles and characters of words split into separate spans

bgrimes · October 31, 2022, 2:48pm

Hello,

We have implemented viewer v22.5.0 and also tested with v22.9.0. In both versions, the content of PDF and EPUB documents is split into separate spans, in some cases to the point where each character of a word is in its own span.

Example:

4.1. Fluid Pressure/ V o lume/ T e mperature. Fluid PVT analysis is the study of the

Unfortunately, I cannot share the document because it is the property of our client. Do you have any insight into why this is happening?

We are working to create jump links within the viewer, and because the text is split up, it is difficult to extract all of the headings we want from the HTML.

Thank you

vladimir.litvinchik · October 31, 2022, 8:47pm

@bgrimes

I’ve tried to reproduce this issue with the sample PDF files that I have at hand. Could you try creating a similar file that we could use to reproduce the issue?

bgrimes · November 3, 2022, 12:33pm

Vladimir,

We have a sample that you can work with. In this sample, the section heading of 3.1 is getting split up into separate spans.

sample.pdf (91.0 KB)

vladimir.litvinchik · November 3, 2022, 2:29pm

@bgrimes

Thank you for attaching the sample file. I’ve reproduced this issue at my end. As soon as I have any new information I’ll let you know.

bgrimes · November 3, 2022, 6:24pm

Vladimir,

I found a similar issue reported in January 2021, PDF to HTML conversion issue in .NET - #24 by atir.tahir, in which words were being broken up. Could this be related?

Thanks for taking time to look into this.

vladimir.litvinchik · November 3, 2022, 7:23pm

@bgrimes

Thank you for pointing me to this issue. Yes, this one is similar.

I’ve prepared two output files for comparison: output-html-files.zip (30.2 KB). The first one was created with fixed layout enabled and the second one with fixed layout disabled. There is no FixedLayout option in our public API at the moment, but we can add it in the next public release if it works for you.

bgrimes · November 3, 2022, 7:52pm

Vladimir,

Thank you, this appears to be what we are looking for. Do you know when the next release of the viewer will be available? We are using the .NET NuGet package, version 22.9.0.

We are looking to complete our work by the end of the year.

Thanks

vladimir.litvinchik · November 3, 2022, 7:57pm

@bgrimes

The release is planned for mid-November. As soon as it is available, we’ll update you here.

bgrimes · November 3, 2022, 8:04pm

Vladimir,

Great, thank you!

vladimir.litvinchik · November 3, 2022, 8:27pm

@bgrimes

You’re welcome!

vladimir.litvinchik · December 1, 2022, 7:32pm

@bgrimes

The FixedLayout option was added in v22.11, which is already available on NuGet. See the following section in the release notes for more information: Added support for rendering PDF and EPUB documents to HTML with fluid layout.
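
For anyone who cannot upgrade to v22.11 yet, heading text can often still be recovered from the fixed-layout HTML by joining the text of the spans inside each line without adding separators. The following is only a minimal sketch in Python using the standard library; it is not part of the viewer's API. It assumes (this is a guess about the output, not a documented guarantee) that each text line sits in its own `div`, that the fragments are child `span` elements, and that real spaces are kept inside the span text. The names `LineTextCollector`, `extract_lines`, `find_headings`, and the numbered-heading pattern are hypothetical.

```python
import re
from html.parser import HTMLParser


class LineTextCollector(HTMLParser):
    """Rebuilds the text of each line by concatenating its span fragments."""

    def __init__(self, line_tag="div"):
        super().__init__(convert_charrefs=True)
        self.line_tag = line_tag  # assumed wrapper element for one text line
        self.line_depth = 0       # nesting depth inside line wrappers
        self.span_depth = 0       # nesting depth inside spans of the current line
        self.parts = []           # span text fragments of the current line
        self.lines = []           # finished lines with whitespace normalized

    def handle_starttag(self, tag, attrs):
        if tag == self.line_tag:
            self.line_depth += 1
        elif tag == "span" and self.line_depth > 0:
            self.span_depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.span_depth > 0:
            self.span_depth -= 1
        elif tag == self.line_tag and self.line_depth > 0:
            self.line_depth -= 1
            if self.line_depth == 0:
                self._flush()

    def handle_data(self, data):
        # Only text inside spans counts; whitespace between tags is ignored,
        # so per-character spans merge back into whole words.
        if self.span_depth > 0:
            self.parts.append(data)

    def _flush(self):
        text = re.sub(r"\s+", " ", "".join(self.parts)).strip()
        if text:
            self.lines.append(text)
        self.parts = []
        self.span_depth = 0


def extract_lines(html, line_tag="div"):
    """Returns the reconstructed text of every line wrapper in the HTML."""
    collector = LineTextCollector(line_tag)
    collector.feed(html)
    collector.close()
    return collector.lines


# Hypothetical pattern: a line that starts with a section number such as "4.1."
HEADING_RE = re.compile(r"^\d+(\.\d+)*\.?\s+\S")


def find_headings(lines):
    """Keeps only the lines that look like numbered section headings."""
    return [line for line in lines if HEADING_RE.match(line)]


# Made-up markup that mimics the split spans described in this thread.
SAMPLE = """
<div class="line">
  <span>4.1. Fluid Pressure/</span>
  <span>V</span><span>o</span><span>lume/</span>
  <span> T</span><span>e</span><span>mperature.</span>
</div>
<div class="line"><span>Fluid PVT analysis is the study of the</span></div>
"""

headings = find_headings(extract_lines(SAMPLE))
```

If the real output uses a different line wrapper, pass its tag name as `line_tag`. The heading pattern also matches numbered list items, so it will likely need tightening for real documents.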

bgrimes · December 2, 2022, 6:59pm

Thank you Vladimir!

vladimir.litvinchik · December 2, 2022, 6:59pm

@bgrimes

You’re welcome!