Large output file size with WrapImagesInSvg for certain PDF files in .NET

Clemens_Pestuka · October 15, 2024, 6:15am

Converting the attached images to HTML results in a very large HTML file.
The input files are smaller than 110 KB, but the output is larger than 8 MB.
Without the “WrapImagesInSvg” flag, the output size is around 700 KB.
Visual quality improves considerably with the “WrapImagesInSvg” option, which is good, but not at the cost of such large output files.
I have not seen such an increase for other files so far. What makes this one different?
Can we control the output quality / file size somehow?

var options = HtmlViewOptions.ForEmbeddedResources("output_viewer{0}.html");
options.PdfOptions.WrapImagesInSvg = true;
viewer.View(options);

Large output size.zip (181.2 KB)

vladimir.litvinchik · October 15, 2024, 7:08pm

@Clemens_Pestuka

Thank you for reporting this issue. We’ll take a look and update you.

denisgvardionov · October 15, 2024, 10:27pm

Hi @Clemens_Pestuka

We have investigated the described situation with your PDF files and can confirm that the resultant HTML size is unreasonably large and that this should be fixed. We have already started working on it and logged it as “VIEWERNET-5051” in our internal tracking system. Meanwhile, we can suggest two options that may help decrease the output HTML size:

You’re generating all-embedded HTML by using the HtmlViewOptions.ForEmbeddedResources static factory. It converts and stores all resources, including the background image, with base64 encoding, which is less efficient than the source (unencoded) data. If it is possible for your use case, try the HtmlViewOptions.ForExternalResources static factory instead: the resultant HTML will be small, and the resources will be saved separately as distinct files rather than combined into one piece.
If it is possible for your use case, use GroupDocs.Viewer on the .NET 6.0+ target environment: the resultant HTML file will then be not 8+ MiB (as it is on .NET Framework or .NET Core), but 6+ MiB in the base64-embedded version.
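
As a rough illustration of the base64 point above, here is a minimal sketch that computes how much a resource grows when it is base64-encoded. The class and method names are made up for this example and are not part of GroupDocs.Viewer. Base64 turns every 3 bytes into 4 characters, so encoding alone adds roughly one third and would not by itself explain a jump from about 700 KB to 8 MB.

```csharp
using System;

// Hypothetical helper, not a GroupDocs.Viewer API.
public static class Base64Overhead
{
    // Number of base64 characters needed for rawByteCount bytes:
    // each group of up to 3 bytes becomes 4 characters (padding included).
    public static long EncodedLength(long rawByteCount)
    {
        return 4 * ((rawByteCount + 2) / 3);
    }

    // The same figure, measured by actually encoding a zero-filled buffer.
    public static int MeasuredLength(int rawByteCount)
    {
        return Convert.ToBase64String(new byte[rawByteCount]).Length;
    }
}
```

Calling Base64Overhead.EncodedLength with the size of an image file gives the number of characters that image would occupy inside an all-embedded HTML page.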

We will notify you as soon as new information becomes available.

Sorry for the inconvenience.

With best regards,
Denis

Clemens_Pestuka · October 16, 2024, 6:47am

Hi @denisgvardionov

Thank you for logging the bug and especially for providing suggestions as well.

Unfortunately we cannot do that easily. We’re currently storing all converted files in a cache that is not designed to handle multiple files per page. Also, from the GroupDocs code, I’m not really certain how this is handled. I gave it a quick try and got absolute links in the generated HTML, which are not usable for me (C:.…\someresource.svg). Such files would simply not be available when viewing the page in a browser on another machine. I’m also missing a reference between the generated HTML files and the resources each one needs. We would need to send single pages to the client-side script in the browser and would need to know all required resources per page, to avoid sending too much data.
We have quite a few packages that still depend on the old .NET 4.X, but we’re definitely planning to move to a later .NET version sooner or later.

Best regards,
Clemens

denisgvardionov · October 16, 2024, 11:37am

@Clemens_Pestuka

Noted.

Regarding output HTML with external resources: let me help you with that, since handling the references correctly is really not that obvious. Let’s say you have an output folder and you want to save the HTML file with external resources in that folder using only relative links, not absolute ones. In order to do this:

const string filename = "AKL-Lomakkeet 1.pdf";
string inputPdfPath = Path.Combine("full/path/to/", filename);

string outputFolderPath = "full/path/to/output/folder";
string outputHtmlFilePathTemplate = Path.Combine(outputFolderPath, string.Format("{0}-page{{0}}.html", Path.GetFileNameWithoutExtension(filename)));
string resourceUrlTemplate = "resource-{1}";
string outputHtmlResourcePathTemplate = Path.Combine(outputFolderPath, resourceUrlTemplate);
HtmlViewOptions htmlOpt = HtmlViewOptions.ForExternalResources(outputHtmlFilePathTemplate, outputHtmlResourcePathTemplate, resourceUrlTemplate);
htmlOpt.PdfOptions.WrapImagesInSvg = true;

using (Viewer viewer = new Viewer(inputPdfPath))
{
    viewer.View(htmlOpt);
}

This code will generate resultant HTML with external resources, where all links will be relative:

link to external stylesheet in HTML file,
links to external fonts in CSS file,
link to external SVG in HTML file,
link to external raster image in SVG file.
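
The page file template in the sample above relies on escaped braces in string.Format: "{{0}}" survives the first formatting pass as a literal "{0}", which is then left for the viewer to fill with the page number. A minimal sketch of just that mechanism follows. The class and method names are invented for illustration, and FillPageNumber only imitates the substitution that GroupDocs.Viewer is assumed to perform internally.

```csharp
using System;
using System.IO;

// Hypothetical demo class, not a GroupDocs.Viewer API.
public static class PageTemplateDemo
{
    // First pass: insert the document name, keep "{0}" as a literal placeholder.
    public static string BuildPageTemplate(string inputFileName)
    {
        string baseName = Path.GetFileNameWithoutExtension(inputFileName);
        return string.Format("{0}-page{{0}}.html", baseName);
    }

    // Second pass: imitate what a consumer of the template would do
    // when it substitutes the page number.
    public static string FillPageNumber(string template, int pageNumber)
    {
        return string.Format(template, pageNumber);
    }
}
```

For "AKL-Lomakkeet 1.pdf", BuildPageTemplate returns "AKL-Lomakkeet 1-page{0}.html", and FillPageNumber with page 2 turns that into "AKL-Lomakkeet 1-page2.html".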

With best regards,
Denis

Clemens_Pestuka · October 16, 2024, 12:17pm

Hi @denisgvardionov

Thanks a lot for providing the sample code, it works perfectly.
Apologies for asking another question, but is it also somehow possible to determine which HTML pages need which resource file?
In our application, the backend has to provide the frontend only with certain pages that are currently being rendered. So I would need to provide them with only certain HTML pages and only a sub-set of the resource files at a time.

Best regards,
Clemens

denisgvardionov · October 16, 2024, 1:21pm

@Clemens_Pestuka

If I understood your intention correctly: yes, it is possible to group HTML resources on a per-page basis, so that every single page is stored together with only those resources that are actually used on it, and nothing redundant.

I’ve modified the previous sample and here is an updated version:

const string filename = "SampleDoc1.pdf";
const string path = "full\\correct\\path";
string inputPdfPath = Path.Combine(path, filename);
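// Fail early if the input file is missing (plain .NET check instead of a test-framework assertion).
if (!File.Exists(inputPdfPath))
{
    throw new FileNotFoundException("Input PDF not found.", inputPdfPath);
}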

string outputFolderPath = "full-path-to\\OutputFolder\\";
string fileSpecificPath = Path.Combine(outputFolderPath, string.Format("folder-{0}", Path.GetFileNameWithoutExtension(filename)));
if (Directory.Exists(fileSpecificPath))
{
    Directory.Delete(fileSpecificPath, true);
}
Directory.CreateDirectory(fileSpecificPath);

using (Viewer viewer = new Viewer(inputPdfPath))
{
    GroupDocs.Viewer.Results.ViewInfo viewInfo = viewer.GetViewInfo(ViewInfoOptions.ForHtmlView(false));
    int pagesCount = viewInfo.Pages.Count;
    for (int pageNumber = 1; pageNumber <= pagesCount; pageNumber++)
    {
        string pageSpecificPath = Path.Combine(fileSpecificPath, string.Format("page-{0}", pageNumber));
        Directory.CreateDirectory(pageSpecificPath);

        string outputHtmlFilePathTemplate = Path.Combine(pageSpecificPath, string.Format("{0}-page{{0}}.html", Path.GetFileNameWithoutExtension(filename)));
        string resourceUrlTemplate = "resource-{1}";
        string outputHtmlResourcePathTemplate = Path.Combine(pageSpecificPath, resourceUrlTemplate);

        HtmlViewOptions htmlOpt = HtmlViewOptions.ForExternalResources(outputHtmlFilePathTemplate, outputHtmlResourcePathTemplate, resourceUrlTemplate);
        htmlOpt.PdfOptions.WrapImagesInSvg = true;

        viewer.View(htmlOpt, new int[1] { pageNumber });
    }
}

Please note that GroupDocs.Viewer, when running in trial mode, is limited to the first two pages. Also, to see how this sample works you need an input PDF with more than one page. A single-page PDF works too, but the purpose of the sample will not be visible in that case. I’m attaching the sample PDF “SampleDoc1.pdf”: it has 3 pages, image resources, fonts, and complex formatting, which makes it a good sample for this case.
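
Once the sample above has run, the question of which page needs which resource can be answered from the folder layout alone. The following is a minimal sketch that assumes the "page-N" folder naming used in the sample and that every non-HTML file in a page folder is a resource of that page. PageResourceIndex is a made-up helper that uses only System.IO, not a GroupDocs.Viewer API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper built on the "page-N" folder layout from the sample.
public static class PageResourceIndex
{
    // Returns page number -> sorted resource file names found in that page's folder.
    public static Dictionary<int, List<string>> Build(string fileSpecificPath)
    {
        var index = new Dictionary<int, List<string>>();
        foreach (string dir in Directory.GetDirectories(fileSpecificPath, "page-*"))
        {
            // Folder name is expected to look like "page-3"; skip anything else.
            string suffix = Path.GetFileName(dir).Substring("page-".Length);
            int pageNumber;
            if (!int.TryParse(suffix, out pageNumber))
            {
                continue;
            }

            var resources = new List<string>();
            foreach (string file in Directory.GetFiles(dir))
            {
                string name = Path.GetFileName(file);
                // The page's own HTML file is not a resource.
                if (!name.EndsWith(".html", StringComparison.OrdinalIgnoreCase))
                {
                    resources.Add(name);
                }
            }
            resources.Sort(StringComparer.Ordinal);
            index[pageNumber] = resources;
        }
        return index;
    }
}
```

A backend could call PageResourceIndex.Build(fileSpecificPath) once after conversion and then send only the files listed for the pages the frontend is currently rendering.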

With best regards,
Denis

SampleDoc1.pdf (690.5 KB)

Clemens_Pestuka · October 17, 2024, 6:38am

Hi @denisgvardionov

Thank you for the updated code.
It works really well with the sample, but I got a very strange result with the first other file I tested with:
Invoices - 2008041.pdf (142.9 KB)

The image on page 2 got somehow messed up, resulting in almost 500 individual files, instead of a single one.
This also resulted in a large “size on disk” value and would certainly be slower to transfer as well, having to send so many files.
Other files seem to be working just fine though, so I guess it was just a bad example.
I think I’d need to do more testing, to find out how reliable this way of converting really is.

Thanks again for suggesting it.

Best regards,
Clemens

denisgvardionov · October 17, 2024, 1:57pm

Hi @Clemens_Pestuka ,

We have checked your sample file and can confirm that GroupDocs.Viewer generates a lot of small images for the 2nd page. I’m afraid we cannot fix this easily, because that is how the algorithm that converts PDF to HTML works, so it is a kind of expected behavior. It depends on the specific PDF document (its background, formatting, styles, layout, content and so on): if some of these “things” cannot be transferred to HTML without distortion, they are rasterized to an image and placed into the SVG background.

So in this very specific case it might be better to use the base64-embedded HTML.
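
One way to act on this per page is to inspect the external-resources output first and fall back to the embedded variant only where the file count gets out of hand. The sketch below covers just the measuring part. PageOutputStats and the maxFiles threshold are hypothetical names chosen for this example, and the sketch assumes one folder per page as in the earlier sample.

```csharp
using System;
using System.IO;

// Hypothetical helper, not a GroupDocs.Viewer API.
public static class PageOutputStats
{
    // Number of files written for one page (HTML plus resources).
    public static int CountFiles(string pageFolder)
    {
        return Directory.GetFiles(pageFolder).Length;
    }

    // Sum of the file sizes in the page folder, in bytes.
    public static long TotalBytes(string pageFolder)
    {
        long total = 0;
        foreach (string file in Directory.GetFiles(pageFolder))
        {
            total += new FileInfo(file).Length;
        }
        return total;
    }

    // True when the page produced more files than the caller is willing to serve;
    // maxFiles is an application-specific threshold, not a library setting.
    public static bool ShouldPreferEmbedded(string pageFolder, int maxFiles)
    {
        return CountFiles(pageFolder) > maxFiles;
    }
}
```

For a page like the one that produced almost 500 files, ShouldPreferEmbedded would return true for any reasonable threshold, and that page could then be rendered again with HtmlViewOptions.ForEmbeddedResources.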

Sorry for the inconvenience.

With best regards,
Denis

Clemens_Pestuka · October 18, 2024, 6:25am

Hi @denisgvardionov ,

Thank you for the explanation.
It was still interesting to try out this approach and I’m grateful for any suggestions.

Best regards,
Clemens

denisgvardionov · October 18, 2024, 9:51am

Hi @Clemens_Pestuka

You’re welcome. I’d also like to advise one thing. Some document formats are pageless by nature and have a flow layout, for example HTML, XML, CSS, plain text, e-Book formats, archives and so on. When you convert such documents to HTML, and if your use case allows it, it may be useful to turn on the HtmlViewOptions.RenderToSinglePage flag: in that case you will obtain a single resultant HTML file.

With best regards,
Denis

Clemens_Pestuka · October 21, 2024, 9:45am

Hi @denisgvardionov ,

Thank you for yet another suggestion.
We’re using “single pages” for Excel worksheets, as it resembles the original,
but for PDF, which is split up into pages in the original as well, we would rather keep it that way.

Best regards,
Clemens