Performance of text extraction and pdf(a) conversion in .NET

CaulDrohn · August 21, 2020, 4:37pm

Hello,

for my company, I’m researching alternatives to OIT (Oracle Outside In Technology), which we use for text extraction and document-to-PDF conversion. The best solution I have discovered so far is GroupDocs.Conversion, which can handle both use cases (all other solutions could only handle one of them) with many different file types (not just a select few MS Office / OpenOffice formats).

Subsequently, I wrote a small example program for extracting text and converting to PDF with various file formats. The implementation was really easy and straightforward, and the conversions worked like a charm without errors. However, when comparing the timings between GroupDocs and the next best solution for a given use case, I felt that the GroupDocs conversions perform rather slowly.

Regarding text extraction, I tested 20 different files. Using GroupDocs, processing all files took a bit more than a minute, about 3 seconds per file. Using TikaOnDotNet (a discontinued .NET port of Apache Tika), it took only about 8 seconds, 400 ms per file. That’s quite a big difference, with GroupDocs being nearly 8 times slower than TikaOnDotNet.

Regarding document-to-pdf, the conversion from various formats to pdf was comparable between GroupDocs and PdfTron. But for pdf-to-pdf/a, GroupDocs was again significantly slower than PdfTron, about 0.6s per file compared to 0.06s per file. I could only test three of my files for pdf/a conversion due to limitations of the unlicensed version, though.

So I wanted to ask if there are some tricks I could try out for improving the performance?
I tested with Debug build, so I could test again with Release. But the same applies to Tika / PdfTron.
I only tested unlicensed, so there could be some sort of throttling I don’t know about, but I doubt that.
Maybe I am using the converters suboptimally? I create a new converter object for every file, since the file path has to be specified via a constructor parameter. Is there a way to reuse converters so they don’t have to be created every time? It does not seem so, and it likely wouldn’t have much of an impact anyway, since I guess the Convert call takes most of the time. Could there be some other way to tweak the conversion so it runs faster?
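The guess that the Convert call dominates can be checked by timing construction and conversion separately. The following is only a minimal sketch: `PhaseTimer` and `Measure` are hypothetical names introduced here, not part of GroupDocs, and the tuple return type assumes C# 7 or later.

```csharp
using System;
using System.Diagnostics;

static class PhaseTimer
{
    // Times the creation of a disposable object and the work done with it
    // as two separate phases. Disposal time is not included in either value.
    public static (TimeSpan Create, TimeSpan Use) Measure<T>(Func<T> create, Action<T> use)
        where T : IDisposable
    {
        var sw = Stopwatch.StartNew();
        using (T instance = create())
        {
            TimeSpan created = sw.Elapsed;   // time spent in the factory
            use(instance);
            return (created, sw.Elapsed - created);
        }
    }
}
```

With the code from this post it could be called as `PhaseTimer.Measure(() => new Converter(srcFile), c => c.Convert(destFile, options))`. If the library defers loading the document until the first Convert call, the split will be only approximate.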

I’m using the following simple code for text extraction:
public void Convert(string srcFile, string destFile)
{
    using (var converter = new Converter(srcFile))
    {
        WordProcessingConvertOptions options = new WordProcessingConvertOptions();
        options.Format = WordProcessingFileType.Txt;
        converter.Convert(destFile, options);
    }
}

And an equally simple code for conversion-to-pdf:
public void Convert(string srcFile, string destFile)
{
    using (Converter converter = new Converter(srcFile))
    {
        // Request PDF/A-1b output.
        PdfConvertOptions options = new PdfConvertOptions();
        options.PdfOptions.PdfFormat = PdfFormats.PdfA_1B;
        // Note: destFile is not used here; the output is written next to the source file.
        converter.Convert(srcFile + ".pdfa_1b.pdf", options);
    }
}

GroupDocs.Conversion v20.6.0.0

Thanks in advance,
F. Exler

atir.tahir · August 21, 2020, 7:37pm

@CaulDrohn

Please request a temporary license here (select an API; at checkout you can obtain a temporary license). Let us know if it improves your results.
However, if the issue persists, we may need all the test files that you are using for the API evaluation (the files that you are converting to PDF or PDF/A).

Meanwhile, we are investigating this scenario. Your investigation ticket ID is CONVERSIONNET-4120.

CaulDrohn · August 25, 2020, 2:22pm

I cannot request a temporary license. I am told to put in a company name:

Your request for a temporary license has not been successfull because you did not enter a company name in your user profile.

We require a valid company name in order to issue a temporary license.

You can specify your company name here .

There is no company field on the linked URL, just the organization one, which I have filled out. I still get the same error. Does it have to synchronize or something like that before a change gets registered?

atir.tahir · August 25, 2020, 5:04pm

@CaulDrohn

We have created a thread on your behalf in our Purchase category. You’ll get assistance on temporary license ASAP.

CaulDrohn · August 26, 2020, 9:10am

I tried it again today, this time the site said it worked and I will get a mail.
The mail was not delivered yet, though. I’ll wait a bit more and write again if it does not show up.

atir.tahir · August 26, 2020, 9:16am

@CaulDrohn

Please share license-related information here.

CaulDrohn · August 31, 2020, 1:23pm

I did some testing with the temporary license last week and collected the timings. This time, I ran my test in release build without debugger. 40 different files were used for testing (office formats with different sizes and 10 pdf files).

For my first tests, three runs per scenario were performed and timed (see Overall timings.txt in the zip). Those tests showed some really bad run times for GroupDocs (e.g. see the comparison between GroupDocs and Tika). But I found two things that affect the run time:

1.) GroupDocs somehow has serious problems with large spreadsheets. Three of my test files are spreadsheets with 1000 rows (xls, xlsx, ods), and each took about 37 seconds to be converted to text. The 100-row variants only took about 1.5 seconds, so the run time seems to grow faster than linearly with the row count. I omitted the three spreadsheets from further testing / timing. You can find them in my zip (largeSS subfolder), along with their timings aggregated in one file (GD Large SpreadSheets.log).

2.) I noticed that the first conversion of a given source file type may take considerably longer than subsequent ones. The additional time for the first conversion is likely caused by library loading required to process the new file type. The same applied to the Tika conversions. So to rule out this loading time, I changed my test setup to perform the conversions three times in a row. Only the first test run includes the loading time, so no loading is required for the following two runs. The timings for these runs can be seen in the log files in the zip.
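The three-runs-in-a-row setup described above can be sketched roughly as follows. This is not the actual test program from the zip; `WarmupBenchmark`, `TimeRuns`, and `WarmAverage` are hypothetical names, and the conversion itself is passed in as a delegate.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class WarmupBenchmark
{
    // Runs the action `runs` times and returns each run's wall-clock time in milliseconds.
    public static List<double> TimeRuns(Action action, int runs = 3)
    {
        var timings = new List<double>(runs);
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            timings.Add(sw.Elapsed.TotalMilliseconds);
        }
        return timings;
    }

    // Average of all runs except the first (cold) one, which includes library loading.
    public static double WarmAverage(IReadOnlyList<double> timings)
    {
        if (timings.Count < 2)
            throw new ArgumentException("Need at least two runs to exclude the cold run.", nameof(timings));
        return timings.Skip(1).Average();
    }
}
```

Reporting the first run and the warm average separately keeps the one-time loading cost visible without letting it distort the per-file numbers.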

With these two effects ruled out, the timings were significantly better than before, but text extraction with Tika and pdf2pdfa conversion with PdfNet are still more than an order of magnitude faster.
Text extraction: [GD] ~1.7s / file vs. ~0.06s / file [TK]
Pdf2PdfA: [GD] ~1.23s / file vs. ~0.05s / file [PN]

The conversion from NonPdf to Pdf(A) files feels alright and performed better than PdfNet (which also does not provide support for all tested file types):
NoPdf2Pdf: [GD] ~0.7s / file vs. ~3.x [PN]
NoPdf2PdfA: [GD] ~0.9s / file vs. [Direct conversion not possible with PdfNet]

One side note about the created PDF/A files: I tried to validate them using an online PDF/A validator (PDF Tools Online - Validate PDF). It claimed the PDF/A-1a conversions from GroupDocs were not valid. The PDF/A-1b conversions were fine:

Validating file “pdf_example_from_doc.pdf.grdocs.a1a.pdf” for conformance level pdfa-1a
The key Type is required but missing.
The document does not conform to the requested standard.
The document doesn’t provide appropriate logical structure information.
The document does not conform to the PDF/A-1a standard.
Done.

I additionally validated the files with VeraPDF, which reported no issues. Some generated sample PDF/A files can be found in the ‘pdf results.zip’.

GD Evaluation.zip (2.3 MB)

atir.tahir · August 31, 2020, 7:25pm

@CaulDrohn

We could convert file_example_XLS_1000.xls in 6 to 8 seconds (the longest it took across multiple tries was 10-11 seconds). Have a look at this screenshot.PNG (22.7 KB) and this output.zip (13.8 KB). The conversion time for the other two spreadsheets was the same (6-8 seconds).

As far as PDF/A conversion is concerned, all the files in the pdfa results folder have an evaluation tag. That means the license was not applied when you performed the conversion. Please share the source DOC file with us and we’ll investigate this scenario.

CaulDrohn · September 3, 2020, 9:39am

@atirtahir3

I checked the output.txt, and it’s significantly different from my output. Your output includes all the HTML tags, whereas mine contains only the visible data (see extract.txt in my second zip). This might explain why your conversions are so much faster. I use the following code for doc2text conversion:

using (var converter = new Converter(srcFile))
{
    WordProcessingConvertOptions options = new WordProcessingConvertOptions();
    options.Format = WordProcessingFileType.Txt;
    converter.Convert(destFile, options);
}

Regardless of the actual timings, do you see the same pattern, namely that the processing time increases more than linearly with file size? As I wrote in my last message, the file with 100 rows took about 1.5 seconds for me and the one with 1000 rows took 37 seconds, where I would have expected something like 15 seconds. I put four different xls files with increasing row counts into my second zip, so you can test this on your side. I would expect a similar non-linear increase even with your lower overall processing time.
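As a rough way to quantify "more than linear", the growth exponent k in t ≈ c·n^k can be estimated from two measurements. The sketch below uses hypothetical names (`Scaling`, `Exponent`) and assumes a pure power law with no fixed start-up cost, which two data points cannot confirm.

```csharp
using System;

static class Scaling
{
    // Estimates the exponent k in t ~ n^k from two (size, time) measurements.
    // k = 1 means linear growth; k > 1 means faster than linear.
    public static double Exponent(double n1, double t1, double n2, double t2)
        => Math.Log(t2 / t1) / Math.Log(n2 / n1);
}
```

For the numbers quoted here, `Scaling.Exponent(100, 1.5, 1000, 37)` gives about 1.39, whereas the expected 15 seconds would correspond to exactly 1.0. Fitting all four row counts from the zip would give a more reliable estimate than a single pair.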

Regarding the evaluation tags: that is likely because I created the source PDFs in my initial tests, when I didn’t have the license yet. But when converting these PDFs to PDF/A, I had the license active; otherwise the conversions wouldn’t have worked due to the evaluation restrictions (“At most 4 elements (for any collection) can be viewed in evaluation mode.”). If the tags are still there, I assume the PDF/A conversion simply did not remove them.
Nevertheless, I also added the smallest document style source files I used into my second zip.

follow-up #1.zip (359.2 KB)

atir.tahir · September 3, 2020, 7:37pm

@CaulDrohn

Yes, with the provided code, the API takes a long time for conversion (depending on the number of rows in the source file). We’re investigating this scenario with ID CONVERSIONNET-4145. You’ll be notified of any further progress or updates.
We tried to convert the provided file-sample_100kB.odt file to PDF and it took 5-6 seconds and from PDF to PDFA_1A it took 6-7 seconds.
Have a look at source and output Files.zip (408.5 KB).