PDF to CSV conversion taking too long

Hi Team,

I am using GroupDocs.Conversion (23.2.0) to convert a PDF to a CSV file. My PDF is a one-page document that contains some plain text and a few tables, but the conversion from PDF to CSV takes around 5 to 6 seconds, which is too long. Could you please look into it and let me know whether this conversion time can be optimized and what needs to be done for that?

Thanks in Advance
Manika Sood

@manika

Could you please share the following details:

  • Sample conversion code
  • GroupDocs.Conversion API version
  • Source/PDF file
  • Are you evaluating the API in trial mode (without a license)?

Hi,

Please find attached the source PDF file and the generated CSV file. The sample conversion code is as follows:

Stopwatch watch1 = new Stopwatch();
watch1.Start();

// Load PDF file
var converter = new GroupDocs.Conversion.Converter("PDFFilePath");

// Set conversion parameters for CSV format
var convertOptions = converter.GetPossibleConversions()["csv"].ConvertOptions;

// Convert to CSV format
converter.Convert("CSVFilePath", convertOptions);

watch1.Stop();
Console.WriteLine("Time to convert CSV ----" + watch1.ElapsedMilliseconds.ToString());

· We are using GroupDocs.Conversion API version 23.2.0.

· For now we are using the trial version of the API for evaluation, to check whether it fulfils our requirements.

I also want to add the following issue:

  • If you check the generated CSV, the value “Senior Vice President, Operations” in the “Position” column of the “ExecutiveWorkHistory” table is split across multiple rows.

~WRD0000.jpg (357 Bytes)

image001.png (2.52 KB)

(Attachment Test-Manika.csv is missing)

Test-Manika.pdf (27.9 KB)

@manika

Please take a look at this screenshot.png (186 KB) and this output.zip (8.7 KB).

Please compress your CSV into a ZIP file and re-upload it. Secondly, the text “Senior Vice President, Operations” is on two lines in the source PDF file. Therefore, it is also split across two lines in the CSV, as you can see in the screenshot shared above.

Hi,

Thanks for your response; I’ll check the PDF for the same. Please find attached the generated CSV file. Could you also please check the issue related to the time it takes to convert the PDF to CSV?

Converted.zip (954 Bytes)

@manika

Please use the latest GroupDocs.Conversion API version, 23.3.1, in your application. Secondly, please request a temporary license.

As far as this issue is concerned, we are investigating it. Your investigation ticket ID is CONVERSIONNET-5950.

@manika
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): CONVERSIONNET-5950

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

Hi,

Thanks for your response. I will update the API version to the latest one and will check for the temporary license as well.
I have one more query about the PDF to CSV conversion. Is it possible to save the converted CSV file directly to a stream object instead of saving it physically somewhere on the local machine? If yes, then please let me know how I can do it.

Thanks in Advance.

@manika

Please use the following code to save the output as a stream:

converter.Convert(() => new MemoryStream(), (convertedStream, sourceFileName) => convertedStream.CopyTo(outputStream), converterOptions);

or

converter.Convert(() => new MemoryStream(), (convertedStream, _) => convertedStream.CopyTo(outputStream), converterOptions);

You can also explore this documentation article - Save file to stream.

Hi,

Thanks for your response. I have tried the approach explained in the article “Save file to stream” as well, but we need to pass a physical path for the converted CSV on our machine; the CSV file gets saved to that location and then has to be read back as a stream, as shown in the following screenshot:

Correct me if I am wrong, but the highlighted text is the physical location of the converted file. I don’t want to save the converted file to a physical location, as I cannot do that. Please let me know whether it is possible to convert the PDF directly to a CSV stream without referring to any physical location. I am using a stream for the input file (PDF) as well.

@manika

Below is the code to save output in MemoryStream.

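// Requires: using System.IO; using GroupDocs.Conversion.Options.Convert; using GroupDocs.Conversion.FileTypes;
MemoryStream outputStream = new MemoryStream();
using (var converter = new GroupDocs.Conversion.Converter("source file"))
{
    // Request CSV explicitly; SpreadsheetConvertOptions otherwise uses its default spreadsheet format
    var options = new SpreadsheetConvertOptions { Format = SpreadsheetFileType.Csv };
    converter.Convert(() => new MemoryStream(), (convertedStream, sourceFileName) => convertedStream.CopyTo(outputStream), options);
}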
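
One detail to keep in mind when reading the result afterwards: CopyTo leaves the target stream positioned at its end, so it usually needs to be rewound before the CSV is read. Below is a minimal sketch of that step, relying only on standard .NET stream behaviour. The `convertedStream` here is a hypothetical stand-in for the stream the converter passes to the callback, and the CSV text is made-up sample data, not real converter output.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in for the stream handed to the conversion callback.
// The CSV content is invented sample data for illustration only.
using var convertedStream = new MemoryStream(
    Encoding.UTF8.GetBytes("Name,Position\nA,Manager\n"));

// Same pattern as in the callback above: copy the converted data into our own stream.
var outputStream = new MemoryStream();
convertedStream.CopyTo(outputStream);

// CopyTo advances outputStream to its end, so rewind before reading from it.
outputStream.Position = 0;

// Read the CSV back as text without touching the file system.
using var reader = new StreamReader(outputStream, Encoding.UTF8);
string csvText = reader.ReadToEnd();
Console.WriteLine(csvText);
```

If the position is not reset, a reader created over outputStream would start at the end and return an empty string.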

Hi,

Thank you so much for the response; the given solution worked for me. I have one query regarding the conversion of PDF to CSV. I was trying to convert a multipage PDF file to CSV using the same code that I shared earlier in this mail trail and the API’s trial version (23.3.1), but it only converts the first page of the PDF to CSV.

I have tried one of the options given in the link, but the mentioned methods are not available on the “ConvertOptions” class object (SpreadsheetConvertOptions) that I have created.

Can you please let me know what needs to be done to convert the complete PDF file to CSV?

@manika

Please share that multipage PDF and the output that you are getting.

Hi,

Please find attached the sample multipage PDF file that I am using to convert to CSV. As you can see in the generated CSV, after the data of the first page of the sample PDF, it displays the data of the first page again rather than the data of the second page. Please let me know if anything else is required from my side.

Thanks in advance.

MultipageSample.pdf (30.1 KB)

Output.zip (924 Bytes)

@manika

Since you are evaluating the API in trial mode (without any license), the free trial limitations apply. However, you can request a temporary license. Here are steps to avail the temporary license.
With a license applied, you’ll get an output like this: output.zip (9.7 KB).

Hi,

Thank you for your quick response. I’ll check for the temporary license to test the complete functionality.

Can you please let me know whether it is possible to get the data of the PDF file that I have shared in one sheet? In the output file you shared, the data of the second page appears on a separate sheet.

We have a requirement where the source PDF can contain data for multiple candidates. In that case we want to split the data across separate sheets, i.e. data related to candidate A on one sheet and data for candidate B on another sheet. The PDF file that I have shared with you contains the data for a single candidate, which is why we want to get its data in a single sheet. Can you please let me know whether it is possible to achieve this functionality with the Conversion API?

Is there any update on the issue that I raised about the time the Conversion API takes to convert PDF to CSV?

@manika

Thanks for the details/clarification.

We are investigating this scenario. Your investigation ticket ID is CONVERSIONNET-5960. However, the other ticket is still under investigation.
We’ll notify you in case of any update.

Hi,

As you explained, I checked the source PDF for the text in the “Position” column of the ExecutiveWorkHistory table. As you said, this text is on two lines, which is why it is split across two lines in the CSV, and I understand that. I just want to add that, as you can see in the screenshot you shared, the text “Senior Vice President, Operations” spans multiple lines but stays in a single cell and a single row in the generated CSV file.

But that’s not the case for the rest of the values in the same column. There are more values in the “Position” column, such as “Senior Manager, Digital Operations” and “Senior Manager, Product Development”, and for these values the data gets split into separate rows and separate cells in the generated CSV. The same is the case for the Company column. Please refer to the attached screenshot for more clarification; I have highlighted the mentioned cells.

Can you please look into this and let me know how I can resolve this.

DataScreenshot.PNG (205 KB)

Test-Manika.pdf (27.9 KB)

Converted.zip (954 Bytes)

@manika

This issue is reproduced at our end. Therefore, we’ve logged it in our internal issue tracking system with ticket ID CONVERSIONNET-5967.

Hi Team,

I just want to check whether there is any update on the requirement that I shared under ticket ID CONVERSIONNET-5960.

Thanks
Manika Sood