Converting large files to images

suredrop · November 16, 2020, 1:41pm

Hi,

Background :

Our use case is to convert all supported file types to images (JPG) and show them in the client (browser). However, our internal testing shows that converting about 200 pages of a PDF file into 200 JPG images with the GroupDocs.Viewer library takes roughly 2 minutes on a Windows Server 2016 machine.
Obviously, if the user has to wait a couple of minutes after clicking on a document before it is rendered, that is a poor user experience. To get around this problem, we are thinking of converting the first ten pages of the document synchronously and the remaining pages in a background thread on the backend. However, for the frontend to display the file contents, we need placeholder images (like thumbnails) for each page.

Questions :

Is there any way of generating thumbnails in GroupDocs.Viewer (possibly showing just a page number)?
What is the most common (or recommended) way to convert large files (> 100 MB) to images?

atir.tahir · November 16, 2020, 4:23pm

@devam

Please share the following details and we’ll investigate this scenario:

Sample code
Problematic PDF
API version used for rendering

suredrop · November 17, 2020, 2:56am

@Atir_Tahir
Thanks for your response. We were testing medium-sized files like this one or this second one. The API version is 20.8.0. The code is very straightforward and is taken from the GitHub example. It looks something like this:

```
string documentPath = @"C:\sample.pdf"; // NOTE: Put the actual path to your document here
// `settings` (defined elsewhere) uses a thread-safe cache
using (Viewer viewer = new Viewer(documentPath, settings))
{
    // The output file path format, e.g. 'page{0}.png'
    string filePathFormat = @"C:\output\page{0}.png";
    var options = new PngViewOptions(filePathFormat);
    viewer.View(options);
}
```

atir.tahir · November 17, 2020, 7:46am

@devam

Thank you for the details.

We are investigating these scenarios. Your investigation ticket ID is VIEWERNET-2905. You’ll be notified as soon as there is an update.

atir.tahir · November 17, 2020, 10:36am

@devam

There are two possibilities:

First option: generate enough static thumbnails with a page number. To do so, create a file with N pages, like 500_pages.zip (21.2 KB), where each page shows its page number (see page_numbers.png (29.6 KB)), and generate thumbnails with Viewer:
```
 using(Viewer viewer = new Viewer("500_pages.docx"))
 {
     JpgViewOptions viewOptions = new JpgViewOptions("static_thumbs/thumb_{0}.jpg");
     viewOptions.Width = 256;

     viewer.View(viewOptions);
 }
```

So, you’ll have 500 thumbnails in your static files folder.

Second option: create a thumbnail at runtime. You’ll still need a template like this template.zip (337 Bytes) with a placeholder (rtf_placeholder.png (31.4 KB)). You then replace the placeholder with the page number and generate a thumbnail for a specific page, as shown in the following code snippet:

```
LoadOptions loadOptions = new LoadOptions();
loadOptions.FileType = FileType.RTF;

using (Viewer viewer = new Viewer(() => CreateThumbStream(pageNumber), () => loadOptions))
{
    JpgViewOptions viewOptions = new JpgViewOptions($"dynamic_thumbs/thumb_{pageNumber}.jpg");
    viewOptions.Width = 256;

    viewer.View(viewOptions);
}

private static Stream CreateThumbStream(int pageNumber)
{
    Encoding iso = Encoding.GetEncoding("ISO-8859-1");
    string template = File
        .ReadAllText("template.rtf", iso)
        .Replace("PAGE_NUMBER", pageNumber.ToString());

    MemoryStream templateStream = new MemoryStream(iso.GetBytes(template));
    return templateStream;
}
```

Here is a complete sample_app.zip (563.5 KB) that demonstrates these use-cases.

We believe that rendering the first N pages synchronously and the remaining pages in a background thread on the backend is a good solution for this scenario. You can also split the rendering of the remaining pages into chunks to speed up the process, e.g. pages 1-10 are handled by the first worker, pages 11-30 by the second worker, and so on. Of course, each worker needs its own CPU and memory resources, because each worker has to load the file into memory and process it, so we recommend testing different scenarios in your environment.
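The chunking idea could be sketched roughly as follows. This is a minimal sketch, not code from the sample app: `PageChunker`, `Split`, `RenderAll`, and the `renderPages` callback are hypothetical names. The callback is assumed to open its own `Viewer` and render only the given pages, for example through the `View` overload that accepts page numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper, not part of GroupDocs.Viewer.
public static class PageChunker
{
    // Splits pages firstPage..lastPage (1-based, inclusive) into
    // arrays of at most chunkSize page numbers each.
    public static List<int[]> Split(int firstPage, int lastPage, int chunkSize)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var chunks = new List<int[]>();
        for (int start = firstPage; start <= lastPage; start += chunkSize)
        {
            int count = Math.Min(chunkSize, lastPage - start + 1);
            chunks.Add(Enumerable.Range(start, count).ToArray());
        }
        return chunks;
    }

    // Renders the first syncPages pages on the calling thread, then starts
    // one background task per remaining chunk. The returned task completes
    // when all background chunks are done.
    // Assumption: renderPages creates its own Viewer for each call, e.g.
    //   using (var viewer = new Viewer(path)) viewer.View(options, pages);
    public static Task RenderAll(int totalPages, int syncPages, int chunkSize,
                                 Action<int[]> renderPages)
    {
        int syncCount = Math.Min(syncPages, totalPages);
        if (syncCount > 0)
            renderPages(Enumerable.Range(1, syncCount).ToArray());

        var workers = Split(syncCount + 1, totalPages, chunkSize)
            .Select(chunk => Task.Run(() => renderPages(chunk)));
        return Task.WhenAll(workers);
    }
}
```

One task per chunk keeps the sketch short; in practice you would probably cap the number of concurrent workers, since every worker loads the whole document into memory.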

suredrop · November 17, 2020, 11:08am

Thanks @Atir_Tahir, the second approach is quite interesting, we will definitely test it.
If I may ask, do you have any example (viewer) code to display the generated images in a lazy way?

atir.tahir · November 17, 2020, 4:44pm

@devam

We assume you are asking about lazy loading of images in the browser, right? The API doesn’t have a built-in feature for this. However, you can implement it yourself according to your use case: once the images are rendered on the backend, the browser can load them lazily. Please let us know if you need any further details.
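On the backend, one way to support this is a small helper that returns the rendered page image when it already exists and the placeholder thumbnail otherwise. The browser can then request each page only when it scrolls into view, for instance with the standard `loading="lazy"` attribute on `<img>`. The sketch below is only an illustration: `PageImageResolver`, the `page{N}.jpg` and `thumb_{N}.jpg` file names, and the directory parameters are assumptions, not part of GroupDocs.Viewer.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of GroupDocs.Viewer.
public static class PageImageResolver
{
    // Returns the path of the rendered image for pageNumber if it exists,
    // otherwise the path of the placeholder thumbnail for that page.
    // fileExists can be injected for testing; it defaults to File.Exists.
    public static string Resolve(string renderedDir, string placeholderDir,
                                 int pageNumber, Func<string, bool> fileExists = null)
    {
        if (fileExists == null) fileExists = File.Exists;

        // Assumed naming scheme for fully rendered pages.
        string rendered = Path.Combine(renderedDir, $"page{pageNumber}.jpg");
        if (fileExists(rendered))
            return rendered;

        // Assumed naming scheme for pre-generated placeholder thumbnails.
        return Path.Combine(placeholderDir, $"thumb_{pageNumber}.jpg");
    }
}
```

A page endpoint could call `Resolve` on every request, so the client picks up the real image as soon as the background worker has written it.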

suredrop · November 20, 2020, 7:24am

Hi @Atir_Tahir,
Thanks for your response. Yes, we have another question, about generating templates for different locales. Let’s say we want to have a sentence in the thumbnail with the page number, something like “Page [page-number] is still being rendered, please be patient.” We would then want 500 static pages, as you pointed out in your example, but for 6 different locales (English, French, Spanish, and so on).

In this situation, can GroupDocs help in generating these templates in some way? We thought of generating the thumbnails using watermarks but GroupDocs does not support a vertically centered watermark. Any thoughts on how we can localise these thumbnails with some generic text for each page?

Cheers,
Devam.

atir.tahir · November 20, 2020, 10:21am

@devam

It seems that a single template, template_for_localized_text.zip (415 Bytes), and a list of localized strings would serve the purpose. Please have a look at this sample application.zip (15.4 KB). You can see the generated output in the thumbs folder. Let us know if it meets your requirements.

suredrop · November 22, 2020, 10:53pm

Hi @Atir_Tahir,
We tried this approach and it didn’t work for double-byte characters, i.e. languages such as Chinese, Japanese, Thai, Korean, Arabic, etc. You can verify this by adding the following items to the dictionary in your sample code:

{"zh-TW", "我在解密文檔的頁面{0}時請稍候，以便可以查看" },
{"ja","ドキュメントのページ{0}を復号化して、表示できるようになるまでお待ちください" },

You should get the following exception -

Generating templates for zh-TW
Unhandled exception. GroupDocs.Viewer.Exceptions.GroupDocsViewerException: Could not load file. File is corrupted or damaged.
   at ?.(Object )
… …
at? . ? (LoadOptions , BaseViewOptions )
at ?(String , Stream , LoadOptions , BaseViewOptions)
at GroupDocs.Viewer.Viewer.(LoadOptions , BaseViewOptions )
at GroupDocs.Viewer.Viewer.(BaseViewOptions )
at . ?(Func`2 , ViewOptions )
at GroupDocs.Viewer.Viewer.View(ViewOptions options)

atir.tahir · November 23, 2020, 10:35am

@devam

We could not reproduce this exception; the thumbnails are generated. Please have a look at this thumbnails_localized_templates.zip (32.8 KB) application. The issue you are facing seems to be related to the source file (which is either corrupt or damaged). You may debug the application to see exactly where this exception is thrown in your code.
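If the template is an RTF file that is read and written with ISO-8859-1, as in the earlier `CreateThumbStream` snippet, characters outside that encoding cannot be stored directly, and unescaped `{` or `}` in the inserted text change the RTF group structure. Whether this is what triggers the exception above is not confirmed in this thread. As a sketch under that assumption, the localized text could be escaped with RTF `\uN?` sequences before it is inserted into the template. `RtfText.Escape` is a hypothetical helper, not a GroupDocs API.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of GroupDocs.Viewer.
public static class RtfText
{
    // Escapes text for insertion into an RTF document body:
    //  - backslash and braces are prefixed with a backslash
    //  - non-ASCII UTF-16 code units become \uN? where N is the code unit
    //    as a signed 16-bit decimal and '?' is the fallback character
    public static string Escape(string text)
    {
        var sb = new StringBuilder(text.Length * 2);
        foreach (char c in text)
        {
            if (c == '\\' || c == '{' || c == '}')
            {
                sb.Append('\\').Append(c);
            }
            else if (c < 0x80)
            {
                sb.Append(c);
            }
            else
            {
                short signed = unchecked((short)c);
                sb.Append("\\u")
                  .Append(signed.ToString(CultureInfo.InvariantCulture))
                  .Append('?');
            }
        }
        return sb.ToString();
    }
}
```

The page number should be substituted into the localized string first (for example with `string.Format`) and the result escaped afterwards; otherwise the escaped braces would interfere with the `{0}` placeholder.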

suredrop · November 27, 2020, 2:09am

Thanks @Atir_Tahir. Can you please check why converting the first 5 pages of this epub file (to PNG) takes a long time? Are there any performance improvements that can be made?

atir.tahir · November 27, 2020, 9:09am

@devam

We answered this query here: EPUB to PNG rendering performance. We’d encourage you to create a new thread for each new issue.