Converting the attached images to HTML results in a very large HTML file.
The input files are smaller than 110 KB, but the output is larger than 8 MB.
Without the “WrapImagesInSvg” flag, the output size is around 700 KB.
Visually, the quality increases quite a lot with the “WrapImagesInSvg” option, which is good, but not at the cost of such large output files.
I have not seen such an increase for other files so far. What makes this one different?
Can we control the output quality / file size somehow?
var options = HtmlViewOptions.ForEmbeddedResources("output_viewer{0}.html");
options.PdfOptions.WrapImagesInSvg = true;
viewer.View(options);
We have investigated the described situation with your PDF files and can confirm that the resultant HTML size is unreasonably large and that this should be fixed. We have already started working on it and logged it as “VIEWERNET-5051” in our internal tracking system. Meanwhile we can suggest two options, which may help to decrease the output HTML size:
You’re producing all-embedded HTML by using the HtmlViewOptions.ForEmbeddedResources static factory - it converts and stores all resources, including the image background, with base64 encoding, which is less compact than the source (unencoded) data. If it is possible for your use-case, try the HtmlViewOptions.ForExternalResources static factory - in that case the resultant HTML will be small, and the resources will be saved separately as distinct files instead of being combined into one piece.
If it is possible for your use-case, use GroupDocs.Viewer on the .NET 6.0+ target environment - in that case the resultant HTML file in the base64-embedded version will be 6+ MiB instead of the 8+ MiB it is on .NET Framework or .NET Core.
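To put the base64 remark in numbers, here is a small self-contained sketch. It uses only the standard `Convert.ToBase64String` call; the class name `Base64Overhead` and the 300 KB placeholder payload are made up for illustration and have nothing to do with the attached files.

```csharp
using System;

static class Base64Overhead
{
    // Length of the base64 text for rawBytes bytes: every group of
    // 3 input bytes (rounded up) becomes 4 output characters.
    public static long EncodedLength(long rawBytes)
    {
        return 4 * ((rawBytes + 2) / 3);
    }

    static void Main()
    {
        // Arbitrary placeholder payload standing in for an image resource.
        byte[] raw = new byte[300 * 1024];
        string encoded = Convert.ToBase64String(raw);

        Console.WriteLine("raw bytes:    " + raw.Length);
        Console.WriteLine("base64 chars: " + encoded.Length);
        Console.WriteLine("predicted:    " + EncodedLength(raw.Length));
    }
}
```

Base64 embedding therefore adds roughly one third to each resource. External resources avoid that overhead, but it is only one part of the size difference described above.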
We will notify you as soon as new information becomes available.
Thank you for logging the bug and especially for providing suggestions as well.
Unfortunately we cannot do that easily. We’re currently storing all converted files in a cache, which is not designed to handle multiple files per page. Also, from the GroupDocs code, I’m not really certain how this is handled. I gave it a quick try and got absolute links in the generated HTML, which are not usable for me (C:.…\someresource.svg). Such files would simply not be available when viewing the page in a browser on another machine. I’m also missing a reference between the generated HTML files and the resources each one needs. We would need to send single pages to the client-side script in the browser, and would need to know all required resources per page to avoid sending too much data.
We have quite a few packages that still depend on the old .NET 4.X, but we’re definitely planning to move to a later .NET version sooner or later.
Regarding output HTML with external resources - let me help you with that, as handling the references correctly is really not so obvious. Let’s say you have some output folder, and you want to save the HTML files with external resources in that folder with only relative links, not absolute ones. In order to do this:
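A minimal sketch of this idea, assuming the three-argument `HtmlViewOptions.ForExternalResources` overload (page file path format, resource file path format, resource URL format), where `{0}` stands for the page number and `{1}` for the resource name. The file and folder names (`input.pdf`, `output`, `resources`, `p{0}_{1}`) are hypothetical; adjust them to your environment.

```csharp
using System.IO;
using GroupDocs.Viewer;
using GroupDocs.Viewer.Options;

class ExternalResourcesSketch
{
    static void Main()
    {
        // Hypothetical input file and output folder.
        string inputPdf = "input.pdf";
        string outputDir = "output";
        Directory.CreateDirectory(Path.Combine(outputDir, "resources"));

        // Where each HTML page is written; {0} is the page number.
        string pagePathFormat = Path.Combine(outputDir, "page_{0}.html");

        // Where each resource is written on disk;
        // {0} is the page number, {1} is the resource name.
        string resourcePathFormat = Path.Combine(outputDir, "resources", "p{0}_{1}");

        // What is written into the HTML as the link to the resource.
        // It is relative to the page file, so the output folder can be moved.
        string resourceUrlFormat = "resources/p{0}_{1}";

        using (Viewer viewer = new Viewer(inputPdf))
        {
            HtmlViewOptions options = HtmlViewOptions.ForExternalResources(
                pagePathFormat, resourcePathFormat, resourceUrlFormat);
            options.PdfOptions.WrapImagesInSvg = true;
            viewer.View(options);
        }
    }
}
```

The third argument is the part that decides whether the links in the HTML are relative or absolute, since it is the string that ends up in the markup; the second argument only controls where the files land on disk.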
Thanks a lot for providing the sample code, it works perfectly.
Apologies for asking another question, but is it also somehow possible to determine which HTML pages need which resource file?
In our application, the backend has to provide the frontend only with certain pages that are currently being rendered. So I would need to provide them with only certain HTML pages and only a sub-set of the resource files at a time.
If I understood your intention correctly - yes, it is possible to group HTML resources on a per-page basis, so every single page is stored together with only those resources that are used on it, and nothing redundant.
I’ve modified the previous sample and here is an updated version:
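One way to sketch the per-page grouping, under an assumed layout: if the resource path format passed to `HtmlViewOptions.ForExternalResources` puts each page's resources into its own folder (for example a hypothetical `page_{0}/{1}` next to `page_{0}.html`), the backend can build a page-to-resources index with plain `System.IO` calls. The class name `PageResourceIndex` and the `page_` prefix are illustrative assumptions, not part of GroupDocs.Viewer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class PageResourceIndex
{
    // Assumed layout (hypothetical):
    //   <outputDir>/page_<N>.html
    //   <outputDir>/page_<N>/<resource files used by page N>
    public static Dictionary<int, List<string>> Build(string outputDir)
    {
        var index = new Dictionary<int, List<string>>();

        foreach (string dir in Directory.GetDirectories(outputDir, "page_*"))
        {
            string suffix = Path.GetFileName(dir).Substring("page_".Length);

            int pageNumber;
            if (!int.TryParse(suffix, out pageNumber))
                continue; // folder name does not end in a page number

            // File names only, sorted for a stable order.
            index[pageNumber] = Directory.GetFiles(dir)
                .Select(file => Path.GetFileName(file))
                .OrderBy(name => name, StringComparer.Ordinal)
                .ToList();
        }

        return index;
    }
}
```

With such an index, `index[2]` lists exactly the files the frontend needs alongside `page_2.html`, so only those have to be sent when page 2 is requested.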
Please note that GroupDocs.Viewer, when working in trial mode, is limited to the first two pages. Also, in order to see how this sample works, you need to specify an input PDF that has more than one page - a single-page PDF will work too, but you may not see the intention of this sample in that case. I’m attaching the sample PDF “SampleDoc1.pdf” - it has 3 pages, image resources, fonts and complex formatting, so it is a good sample for this case.
Thank you for the updated code.
It works really well with the sample, but I got a very strange result with the first other file I tested with: Invoices - 2008041.pdf (142.9 KB)
The image on page 2 got somehow messed up, resulting in almost 500 individual files, instead of a single one.
This also resulted in a large “size on disk” value and would certainly be slower to transfer as well, having to send so many files.
Other files seem to be working just fine though, so I guess it was just a bad example.
I think I’d need to do more testing, to find out how reliable this way of converting really is.
We have checked your sample file and we confirm that for the 2nd page GroupDocs.Viewer generates a lot of small images. But I’m afraid we cannot fix it easily, because this is how the algorithm that converts PDF to HTML works, and it is a kind of expected behavior. It depends on the specific PDF document - its background, formatting, styles, layout, content and so on - and if some of these “things” cannot be transferred to HTML without distortions, they are rasterized to an image and placed in the SVG background.
So in this very specific case it might be better to use the base64-embedded HTML.
You’re welcome. I’d also like to advise one thing. Some document formats are pageless by their nature; they have a flow layout. For example, HTML, XML, CSS, plain text, e-Book formats, archives and so on. When you convert such documents to HTML, and if your use-case allows it, it may be useful to turn on the HtmlViewOptions.RenderToSinglePage flag - in that case you will obtain a single resultant HTML file.
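As a short sketch of that flag, assuming it is a settable boolean property on the options object as named above. The input file `notes.txt` and the output name `output.html` are hypothetical.

```csharp
using GroupDocs.Viewer;
using GroupDocs.Viewer.Options;

class SinglePageSketch
{
    static void Main()
    {
        // Hypothetical pageless input document.
        using (Viewer viewer = new Viewer("notes.txt"))
        {
            HtmlViewOptions options =
                HtmlViewOptions.ForEmbeddedResources("output.html");

            // Render the whole flow-layout document into one HTML file
            // instead of splitting it into artificial pages.
            options.RenderToSinglePage = true;

            viewer.View(options);
        }
    }
}
```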
Thank you for yet another suggestion.
We’re using “single pages” for Excel worksheets, as it resembles the original,
but for PDF, which is already split up into pages in the original, we would rather keep it like that.