Rendering one page at a time to avoid multiple open MemoryStreams

Our product uses GroupDocs.Viewer for .NET (on .NET Core 3.1) to render PDF files as JPG images, one image per page. I’ve implemented an IPageStreamFactory that reuses a single Stream, and I call viewer.View() with each page number until the end of the document.

My concern is that each call to View() repeats a lot of work - can anyone confirm this? If so, is there a better way to get each rendered page as a new/separate stream without keeping all of them in memory at once?

Here’s my IPageStreamFactory:

internal class MemoryPageStreamFactory : IPageStreamFactory {

    public MemoryPageStreamFactory(MemoryStream stream) {
        Stream = stream;
    }

    public MemoryStream Stream { get; }

    public Stream CreatePageStream(int pageNumber) {
        // Reuse the same stream for every page: rewind and truncate
        // whatever the previous page left behind.
        Stream.Position = 0;
        Stream.SetLength(0);

        return Stream;
    }

    public void ReleasePageStream(int pageNumber, Stream pageStream) {
        // Nothing to release; the calling code reads the shared stream
        // after each View() call.
    }
}

Here’s what my rendering process looks like. The method returns an IEnumerable so that the calling code can iterate through the pages; because of the “yield return”, each page is rendered lazily as it is iterated:

LoadOptions loadOptions = new LoadOptions(fileType);

using (Viewer viewer = new Viewer(fileStream, loadOptions)) {
    using (MemoryStream pageStream = new MemoryStream()) {
        var pageStreamFactory = new MemoryPageStreamFactory(pageStream);
        JpgViewOptions viewOptions = new JpgViewOptions(pageStreamFactory);

        // Query the page count up front so we know how many times to call View().
        var viewInfoOptions = ViewInfoOptions.FromJpgViewOptions(viewOptions);
        var viewInfo = viewer.GetViewInfo(viewInfoOptions);

        if (maxWidth != null) {
            viewOptions.MaxWidth = maxWidth.Value;
        }

        if (maxHeight != null) {
            viewOptions.MaxHeight = maxHeight.Value;
        }

        // Render one page per iteration; the shared stream is overwritten each time.
        for (int page = 1; page <= viewInfo.Pages.Count; page++) {
            viewer.View(viewOptions, page);
            yield return pageStreamFactory.Stream;
        }
    }
}

@petecodes

It depends on the consuming code. If you’re processing the stream right after each page is rendered, then you do have to keep a stream in memory. If you’re filling a list with streams and then performing some operations on it, it may be reasonable to temporarily store the data on disk.
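
For the second case, a rough sketch of a factory that writes each page to its own temporary file instead of memory could look like this (the temp-file naming and cleanup strategy here are only illustrative, not part of the Viewer API):

using System.IO;
using GroupDocs.Viewer.Interfaces;

internal class TempFilePageStreamFactory : IPageStreamFactory
{
    // One temp file per page; the consuming code is responsible for deleting the files later.
    public Stream CreatePageStream(int pageNumber) =>
        File.Create(Path.Combine(Path.GetTempPath(), $"page_{pageNumber}.jpg"));

    public void ReleasePageStream(int pageNumber, Stream pageStream) =>
        // Closing the stream flushes the rendered page to disk.
        pageStream.Dispose();
}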

The viewer.View(viewOptions, page); call opens the file if it hasn’t been opened yet and then renders the page, but since you’re calling viewer.GetViewInfo(viewInfoOptions); first, the file is already open, so this call only renders the page.

Can you please share the consuming code so we have a complete picture?

Thanks, there’s not really much additional code to share. By temporarily inserting a forced garbage collection (GC.Collect()) into the page-processing loop, I can confirm that the memory is unreferenced after each page is rendered. Since we’re running in Kubernetes, I suspect the amount of available memory is not being reported correctly inside the container, because .NET is not performing garbage collection before the memory limit is reached.

The code that calls into the second code block above looks like this:

// Open stream to send to GroupDocs
await using var stream = await _fileStorage.OpenReadAsync(fileUri);
// Call processing code (returns IEnumerable<MemoryStream>)
var pageImages = _fileParserAdapter.ParsePagesToImages(stream, extension);
// Call the code shown in the next block, which iterates the IEnumerable (each iteration performs the per-page conversion to JPG)
var result = await CreateSystemEntriesAsync(pageImages, programId, "jpeg", pageDirectory);

The code that iterates through the IEnumerable just saves each page to our file storage and persists metadata to the database:

foreach (var ms in streams) {
    ms.Position = 0;
    // Save to file storage and do some database stuff
}
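
The temporary GC.Collect() diagnostic mentioned earlier went into this loop, roughly like so (illustrative only, we wouldn’t keep a forced collection in production code):

foreach (var ms in streams) {
    ms.Position = 0;
    // Save to file storage and do some database stuff

    // Temporary diagnostic: force a collection to confirm the previous
    // page's data is unreferenced after each iteration.
    GC.Collect();
    GC.WaitForPendingFinalizers();
}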

@petecodes

Thank you for sharing your code. Please consider the following example:

using System.IO;
using GroupDocs.Viewer;
using GroupDocs.Viewer.Options;
using GroupDocs.Viewer.Interfaces;

namespace ViewerSampleApp
{
    static class Program
    {
        static void Main()
        {
            using (Viewer viewer = new Viewer("sample.docx"))
            {
                var streamFactory = new SingleMemoryPageStreamFactory();
                var viewOptions = new JpgViewOptions(streamFactory);

                viewer.View(viewOptions);
            }
        }
    }

    internal class SingleMemoryPageStreamFactory : IPageStreamFactory
    {
        private readonly MemoryStream _stream = new MemoryStream();

        // Hand the same stream to the Viewer for every page.
        public Stream CreatePageStream(int pageNumber) => _stream;

        public void ReleasePageStream(int pageNumber, Stream _)
        {
            // Called after each page has been rendered into the shared stream.
            _stream.Position = 0;

            // Save to file storage and do some database stuff

            ResetStream();
        }

        private void ResetStream() => _stream.SetLength(0);
    }
}

This solution should ensure that you’re using a single memory stream instance while synchronously rendering the pages one by one.

I’ll try it out, but I expect problems because my “Save to file storage” code is async. Any plans to make the IPageStreamFactory methods async?
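
In the meantime I’ll probably have to block on the async call from inside the synchronous callback, something like this sketch (SaveToStorageAsync is our own storage method, shown only for illustration):

public void ReleasePageStream(int pageNumber, Stream _)
{
    _stream.Position = 0;

    // Blocking on the async save from this synchronous callback;
    // SaveToStorageAsync is our own method, not part of GroupDocs.Viewer.
    SaveToStorageAsync(pageNumber, _stream).GetAwaiter().GetResult();

    ResetStream();
}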

@petecodes

At the moment we do not have plans to make the IPageStreamFactory methods async, but we’ll consider adding async methods to the Viewer class and the related *StreamFactory types, since these types are the ones actually responsible for IO operations.