PDF to MD conversion fails with images and URLs

Hey, I have been trying to use GroupDocs to convert PDF files to Markdown files. Sadly, when I convert PDF files that contain images and URLs, the images vanish and the URLs just stay as highlighted text in the .md document. The only way I have been able to make it work is by first converting the PDF to a DOCX file and then converting that to .md. I have also tried converting with your web app and I run into the same problem. I would really appreciate some help with this issue.
Thanks,
Dan

@dangilboa

Could you please provide more details about the specific method or API you are using for the conversion process? Additionally, what version of GroupDocs are you currently using?

I am using the latest version (25.7.0). I am using the converter from the GroupDocs.Conversion library in .NET, with PdfLoadOptions and WordProcessingConvertOptions.

@dangilboa

Could you please share the complete conversion code, including the source file and the output file? This will help us further investigate the issue.

drylab.pdf (1.3 MB)

I have attached the input file (this is just one example; I have tried many and none work). I cannot attach the output .md file because that file type is not allowed here.
I have tried converting this PDF both with my code and with your web app: Online PDF to MD converter | Free GroupDocs Apps
I get the same result with both.

Here is my implementation:

private async Task DemonstratePdfToMarkdownConversion(Stream pdfStream)
{
    // Buffer the input into memory so the converter gets a seekable stream.
    using var memoryStream = new MemoryStream();
    await pdfStream.CopyToAsync(memoryStream);
    memoryStream.Position = 0;

    var loadOptions = new PdfLoadOptions();
    var convertOptions = new WordProcessingConvertOptions
    {
        Format = WordProcessingFileType.Md
    };
    var outputPath = Path.Combine(Path.GetTempPath(), "demo_conversion.md");

    using var converter = new Converter(() => memoryStream, _ => loadOptions);
    converter.Convert(outputPath, convertOptions);

    Console.WriteLine($"Conversion completed: {outputPath}");
}

I have tried all the pdfrecognitionsmode options and markdown options (exportimagetobase64) and nothing works.
Thanks for the help,
Dan

Hello @dangilboa ,

Thank you for sharing the detailed information about your use case. We can confirm that converting directly from .pdf to .md format indeed has issues. We have logged this case in our tracking system as CONVERSIONNET-7923, and once we complete further investigation, we will get back to you with the results.

Hey Evan, we are currently using the trial licence to assess the quality of the SDK.
I wanted to know how long it will take until this issue is resolved, so that we can start using it.
We are on a tight schedule at the moment.

Hello @Shay_BH ,

It is difficult to estimate how long this might take at the moment. First, our development team needs to fully investigate the issue to identify the root cause. Only then will it be possible to provide any timeframes for its resolution. We have assigned a high priority to this task, and the developers are already investigating it. As soon as we receive any updates from them, we will share them with you right away.

Thanks for the quick response, we will keep evaluating. Our timeline is short, so I appreciate the prioritization.

@Shay_BH ,

You’re always welcome. If you have any further questions, please don’t hesitate to contact us.

@Shay_BH ,

We have received a report from our developers regarding the investigation of this issue. It will be fixed in GroupDocs.Conversion for .NET 25.8, which is planned to be released by the end of this month. If you are highly interested in this fix, we could prepare an alpha build of the library with the fix included for your testing in the coming days. Would that be useful for you?

That would be great. We will wait for the alpha version.

@Shay_BH ,

Sure. We’ll notify you as soon as it’s ready for download.

@dangilboa and @Shay_BH ,

Our development team has prepared an alpha version of GroupDocs.Conversion for .NET 25.8. You can download it here.
We look forward to your feedback on using it.

We will look into it ASAP. Thanks!

Another question: would it be possible to add an event that fires before the file is converted, so that we can handle images or URLs in a custom way? For example, in our use case we would like to extract the images and put a unique placeholder where each image was. Some vendors, such as Syncfusion, offer such a feature, which makes life easier because the converted file does not have to be parsed afterwards to extract the images.

@Shay_BH ,

We are always open to improving our library and strive to implement the functionality our users need, whenever possible. Our developers will be able to review your request if you provide a more detailed description of the functionality you require in our library. For example, it would be very helpful if you could describe in more detail what exactly Syncfusion offers and what additional features you would like to see in our library beyond that. We will do our best to investigate your request as quickly as possible and get back to you with the results.

All right, let me give you an example.
Say I want to extract all images and put a placeholder for each image in the converted file. In Syncfusion I am able to hook an event that saves the images and writes a unique ID in their place.
Example:

var imageDictionary = new Dictionary<string, byte[]>();
var imageCount = 0; // counter used to build unique placeholder names

// Hook the event to save images and store them in the image dictionary
syncfusionDoc.SaveOptions.ImageNodeVisited += (_, args) =>
{
    using var memoryStream = new MemoryStream();
    args.ImageStream.CopyTo(memoryStream);
    var imageBytes = memoryStream.ToArray();
    var imageName = $"image{imageCount}";
    imageDictionary[imageName] = imageBytes;
    imageCount++;

    // Put a placeholder for the image in the Markdown
    args.Uri = imageName;
};
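
Until a hook like this exists, the "parse it later" approach mentioned above can be approximated as a post-processing step on the generated Markdown. The following is only a minimal sketch, not part of any GroupDocs or Syncfusion API: it assumes the converter was run with base64 image export enabled, so that images appear in the output as `data:` URIs, and the class and method names (`MarkdownImagePlaceholders`, `Extract`) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any vendor library.
public static class MarkdownImagePlaceholders
{
    // Matches Markdown images whose target is a base64 data URI, e.g.
    // ![alt](data:image/png;base64,AAAA...)
    private static readonly Regex DataUriImage = new Regex(
        @"!\[(?<alt>[^\]]*)\]\(data:image/[a-zA-Z0-9.+-]+;base64,(?<data>[A-Za-z0-9+/=\s]+)\)",
        RegexOptions.Compiled);

    // Replaces each embedded image with a placeholder such as ![alt](image0)
    // and stores the decoded bytes in the dictionary under the same name.
    public static string Extract(string markdown, IDictionary<string, byte[]> images)
    {
        var imageCount = 0;
        return DataUriImage.Replace(markdown, match =>
        {
            var imageName = $"image{imageCount++}";
            images[imageName] = Convert.FromBase64String(match.Groups["data"].Value);
            return $"![{match.Groups["alt"].Value}]({imageName})";
        });
    }
}
```

Calling `MarkdownImagePlaceholders.Extract(markdown, imageDictionary)` on the converted text would fill the dictionary in the same way as the event handler above, at the cost of an extra pass over the output. Images that are linked as external files rather than embedded are left untouched by this sketch.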

Hello @Shay_BH ,

Thank you for the clarification, we truly appreciate your interest in our library. We have created a corresponding task under ID CONVERSIONNET-7934 in our tracking system, and once our developers investigate it and we have some results, we will immediately share them with you.

If possible, could you also clarify whether you are only considering products that provide conversion from .pdf to .md format, or if you are interested in a broader range of formats? We would like to share some news with you: we are planning to launch a new product called GroupDocs.Markdown in the coming weeks. It will be a more specialized product focused on converting most document formats into a single format — .md. At the same time, we intend to extend its functionality beyond simple conversion. Your request has inspired us to consider adding such functionality to this new product. Would a product like this be of interest to you, or do you perhaps require additional features for working with .md files? We would be delighted to hear your ideas, comments, or suggestions.

This is exactly what we are looking for.
We want to convert many types of documents to Markdown.
If you have an alpha version, we would be happy to try it out.
Also, thanks for taking up this feature request. I must say that being able to hook an event for images, URLs, etc. in Syncfusion makes life easier.
Please let us know how we can use this Markdown library, because we are on a tight schedule.