Convert a multi-paged PDF to single HTML in .NET

Hi,

I’m using GroupDocs.Viewer 19.10.0 for .NET.
I have files with multiple pages (Word, PDF). Is it possible to convert all pages into a single HTML output file?

Thanks,
Dariusz

@kamelwielki,

We are currently investigating your scenario of combining the HTML pages into a single file (logged as VIEWERNET-2219). As a workaround, you can merge the files manually, as shown in the following code sample:

using System.IO;
using GroupDocs.Viewer;
using GroupDocs.Viewer.Options;

// filePath = "sample.xml"
static void renderDocument(string filePath)
{
	string outputDirectory = "D:/output";
	outputDirectory = Path.Combine(outputDirectory, Path.GetFileNameWithoutExtension(filePath));
	if (!Directory.Exists(outputDirectory))
	{
		Directory.CreateDirectory(outputDirectory);
	}
	// The {0} will be replaced with current processing page number.
	string pageFilePathFormat = Path.Combine(outputDirectory, "page_{0}.html");

	using (Viewer viewer = new Viewer(filePath))
	{
		HtmlViewOptions htmlOptions = HtmlViewOptions
			.ForEmbeddedResources(pageFilePathFormat);
		htmlOptions.RenderResponsive = true;
		viewer.View(htmlOptions);
	}
	// Combine files into a single HTML file
	combineFiles(outputDirectory);
}
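// Appends page_1.html, page_2.html, ... into combined_output.html, in page order.
public static void combineFiles(string folderPath)
{
	string outputPath = Path.Combine(folderPath, "combined_output.html");
	using (var outputStream = File.Create(outputPath))
	{
		// Enumerate by page number instead of Directory.GetFiles: this keeps the pages
		// in numeric order (page_10 after page_9) and never reads combined_output.html
		// itself, which would fail on a second run because it is already open for writing.
		for (int page = 1; ; page++)
		{
			string inputFilePath = Path.Combine(folderPath, string.Format("page_{0}.html", page));
			if (!File.Exists(inputFilePath))
			{
				break;
			}
			using (var inputStream = File.OpenRead(inputFilePath))
			{
				inputStream.CopyTo(outputStream);
			}
		}
	}
}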

Your solution, which simply concatenates the files, is not an option: I don’t want an HTML file with many head, body, and style tags, etc.
Even if I merge the head and body tags so that the resulting file has one head and one body element, I’m still not sure that the CSS styles won’t interfere with each other.
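
For reference, the kind of merge described above (one head, one body) could be sketched roughly as follows. This is only a minimal sketch under several assumptions: HtmlPageMerger is a hypothetical helper and not part of GroupDocs.Viewer, each page file is assumed to be a complete HTML document with a single body element, and regular expressions are used instead of a real HTML parser. It does nothing about class names or selectors that collide between pages, so the CSS interference concern still applies.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of GroupDocs.Viewer.
public static class HtmlPageMerger
{
    private static readonly Regex StyleBlock = new Regex(
        @"<style\b[^>]*>.*?</style>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private static readonly Regex BodyContent = new Regex(
        @"<body\b[^>]*>(.*?)</body>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Merges complete HTML documents (one per page) into a single document
    // with exactly one <head> and one <body>.
    public static string Merge(IEnumerable<string> pages)
    {
        var styles = new StringBuilder();
        var bodies = new StringBuilder();
        var seenStyles = new HashSet<string>();
        int pageNumber = 0;

        foreach (string page in pages)
        {
            pageNumber++;

            foreach (Match style in StyleBlock.Matches(page))
            {
                // Keep only one copy of style blocks that are identical across pages.
                if (seenStyles.Add(style.Value))
                {
                    styles.AppendLine(style.Value);
                }
            }

            Match body = BodyContent.Match(page);
            // Fall back to the whole input if no <body> element is found.
            string content = body.Success ? body.Groups[1].Value : page;
            // Style blocks were already collected above, so drop them from the body.
            content = StyleBlock.Replace(content, string.Empty);
            bodies.AppendLine($"<div class=\"page\" id=\"page-{pageNumber}\">{content}</div>");
        }

        return "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n"
            + styles.ToString()
            + "</head>\n<body>\n"
            + bodies.ToString()
            + "</body>\n</html>\n";
    }
}
```

It could be called with the contents of the per-page files read in page order, for example HtmlPageMerger.Merge(pagePaths.Select(File.ReadAllText)), where pagePaths is a hypothetical, already sorted list of the page_N.html paths. Wrapping each page in its own div keeps the page boundaries addressable, but any real isolation of styles would still need scoped selectors or a similar technique.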

@kamelwielki,

GroupDocs.Viewer is designed for document viewer applications, where a document is loaded and displayed page by page. This makes it simple to navigate to a particular page, apply a watermark, or rotate or reorder pages, which is how document viewers typically work. Now imagine a large PDF or Word document rendered into a single HTML file: the result could be very large, both in file size and in content length.

Rendering into a single HTML file may be useful for small documents, such as XML files, where a single HTML document is the preferred output. We are investigating the feasibility of this feature further and will let you know of any updates.

The issues you have found earlier (filed as VIEWERNET-2219) have been fixed in this update. This message was posted using Bugs notification tool by atirtahir3

Hello, we are interested in this functionality too. We are using version 20.8 and the following code to convert a multi-page PDF into a single HTML file:

using (var viewer = new Viewer(input))
{
       var options = HtmlViewOptions.ForEmbeddedResources(output);
       options.RenderSinglePage = true;
       viewer.View(options, new int[] { 1, 2 });
}

However, it does not work: only page 2 is converted.

We need this so that we can print documents in different formats after converting them to HTML.

@eugenekr

This query is answered here.