Hello,
For my company, I’m researching alternatives to OIT (Oracle Outside In Technology), which we use for text extraction and document-to-pdf conversion. The best solution I have discovered so far is GroupDocs.Conversion, which can handle both use cases (all other solutions could only handle one of them) with many different file types (not just a select few MS Office / OpenOffice formats).
Subsequently, I wrote a small example program for extracting text and converting various file formats to pdf. The implementation was really easy and straightforward, and the conversions worked like a charm, without errors. However, when comparing the timings between GroupDocs and the next-best solution for each use case, I noticed that the GroupDocs conversions perform rather slowly.
Regarding text extraction, I tested 20 different files. Using GroupDocs, processing all files took a bit more than a minute, about 3 seconds per file. Using TikaOnDotNet (a discontinued .NET port of Apache Tika), it only took about 8 seconds, 400 ms per file. That’s quite a significant difference, with GroupDocs being nearly 8 times slower than TikaOnDotNet.
Regarding document-to-pdf, conversion from various formats to pdf was comparable between GroupDocs and PdfTron. But for pdf-to-pdf/a, GroupDocs was again significantly slower than PdfTron: about 0.6 s per file compared to 0.06 s per file. I could only test three of my files for pdf/a conversion due to limitations of the unlicensed version, though.
So I wanted to ask: are there any tricks I could try to improve the performance?
I tested with a Debug build, so I could test again with a Release build; however, the same configuration applies to Tika / PdfTron, so the relative difference should remain.
I only tested the unlicensed version, so there could be some sort of throttling I don’t know about, but I doubt that.
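In case licensing does turn out to matter, my understanding is that a license would be applied once at startup, before creating any converters (this assumes the standard GroupDocs License API; the license file path here is hypothetical):

```csharp
// Apply the license once, before any Converter is created.
// "GroupDocs.Conversion.lic" is a hypothetical path to the license file.
var license = new GroupDocs.Conversion.License();
license.SetLicense("GroupDocs.Conversion.lic");
```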
Maybe I am using the converters suboptimally? I create a new Converter object for every file, since the source file path has to be specified via the constructor. Is there a way to recycle converters so they don’t have to be created every time? It does not seem so, and it likely wouldn’t have much of an impact anyway, since the Convert call should take most of the time, I guess. Could the conversion be tweaked in some other way so it runs faster?
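To check whether the constructor (document loading) or the Convert call dominates, the two phases could be timed separately with a Stopwatch. A minimal diagnostic sketch, using the same API calls as my code below (namespaces assumed from v20.x):

```csharp
using System;
using System.Diagnostics;
using GroupDocs.Conversion;
using GroupDocs.Conversion.Options.Convert;

public static class ConversionTiming
{
    public static void TimedConvert(string srcFile, string destFile)
    {
        var sw = Stopwatch.StartNew();
        using (var converter = new Converter(srcFile))
        {
            // Time spent loading / inspecting the source document.
            long loadMs = sw.ElapsedMilliseconds;

            var options = new PdfConvertOptions();
            converter.Convert(destFile, options);

            // Remaining time is the actual conversion work.
            long totalMs = sw.ElapsedMilliseconds;
            Console.WriteLine($"load: {loadMs} ms, convert: {totalMs - loadMs} ms");
        }
    }
}
```

If the load phase turns out to be small, recycling converters would indeed buy little.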
I’m using the following simple code for text extraction:
public void Convert(string srcFile, string destFile)
{
    using (var converter = new Converter(srcFile))
    {
        WordProcessingConvertOptions options = new WordProcessingConvertOptions();
        options.Format = WordProcessingFileType.Txt;
        converter.Convert(destFile, options);
    }
}
And equally simple code for conversion to pdf:
public void Convert(string srcFile, string destFile)
{
    using (Converter converter = new Converter(srcFile))
    {
        PdfConvertOptions options = new PdfConvertOptions();
        options.PdfOptions.PdfFormat = PdfFormats.PdfA_1B;
        converter.Convert(destFile, options);
    }
}
GroupDocs.Conversion v20.6.0.0
Thanks in advance,
F. Exler