Unable to parse pptx file on Linux

Island.io · October 6, 2022, 4:10pm

Hello,
I have recently started getting an error when trying to extract text from a pptx file on Linux (the same code works fine on macOS).

The exception stack trace is:
"exception": "|GroupDocs.Parser.Exceptions.GroupDocsParserException: The type initializer for 'Gdip' threw an exception.\n at \u0005\u0019\u0005.\u0002(Stream \u0002, LoadOptions \u0003)\n at \u000e\u0018\u0005.\u0002()\n at \u000e\u0018\u0005.\b\u0015\u0003\u0003\u0019\u0003\u0019\u0002()\n at GroupDocs.Parser.Options.DocumentInfo.get_PageCount()

It happens when calling parser.GetDocumentInfo().PageCount or parser.GetText().
I'm using the .NET version of GroupDocs.Parser with the following setup:
OS: Ubuntu 22.04.1 (Jammy)
GroupDocs.Parser package version: 22.8.0

The same code on Linux parses docx, xls and other document types successfully.
I have attached the pptx file that I am running my tests on.

I would appreciate your help on this matter. Thanks.

samplepptx.pptx.zip (395.9 KB)

atir.tahir · October 6, 2022, 7:56pm

@Island.io

Could you please share the sample code/application as well?

Island.io · October 7, 2022, 6:57pm

public string ExtractTextFromFile(Stream fileStream) {
    try {
        using var parser = new Parser(fileStream);
        if (!parser.Features.TextPage && parser.Features.Text) {
            return GetTextFromTxtFiles(parser);
        }
        return GetTextFromDocument(parser);
    }
    catch (GroupDocs.Parser.Exceptions.UnsupportedDocumentFormatException) {
        return string.Empty;
    }
}

private string GetTextFromTxtFiles(Parser parser) {
    using var reader = parser.GetText();
    if (reader == null)
        throw new Exception("group docs extractor could not get text from file");
    return reader.ReadToEnd();
}

private string GetTextFromDocument(Parser parser) {
    var text = new StringBuilder();
    var docInfo = parser.GetDocumentInfo();
    for (var pageNumber = 0; pageNumber < docInfo.PageCount; pageNumber++) {
        foreach (var textArea in GetTextAreas(parser, pageNumber)) {
            text.Append(TrimMultiSpace.Replace(textArea.Text.Trim(), " ") + " ");
        }
    }
    return text.ToString();
}

private static IEnumerable<PageTextArea> GetTextAreas(Parser parser, int pageNumber) {
    return parser.GetTextAreas(pageNumber).SelectMany(GetTextAreas);
}

private static IEnumerable<PageTextArea> GetTextAreas(PageTextArea area) {
    if (area.Areas.Count == 0) yield return area;

    foreach (var textArea in area.Areas.SelectMany(GetTextAreas)) {
        yield return textArea;
    }
}

atir.tahir · October 7, 2022, 8:56pm

@Island.io

Thanks for the details. We are investigating this issue. Your investigation ticket ID is PARSERNET-1938.

atir.tahir · October 11, 2022, 10:46am

@Island.io

To use GroupDocs.Parser for .NET on Linux, the following packages should be installed:

libgdiplus package
libc6-dev package
a package with Microsoft-compatible fonts, e.g. ttf-mscorefonts-installer (sudo apt-get install ttf-mscorefonts-installer)

After installing these packages, let us know if the issue persists.
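For readers who want to check which of the three packages are still missing before installing anything, here is a minimal sketch. It assumes a Debian/Ubuntu system with dpkg available. The script itself and the variable names PACKAGES and missing are illustrative, not part of GroupDocs.Parser. It only prints a suggested install command and does not run it.

```shell
#!/bin/sh
# Packages named in the reply above.
PACKAGES="libgdiplus libc6-dev ttf-mscorefonts-installer"

# Collect every package that dpkg does not report as fully installed.
missing=""
for pkg in $PACKAGES; do
    if ! dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null | grep -q "install ok installed"; then
        missing="$missing $pkg"
    fi
done

if [ -n "$missing" ]; then
    echo "Missing packages:$missing"
    # Printed only, so the script is safe to run without root.
    echo "Suggested command: sudo apt-get update && sudo apt-get install -y$missing"
else
    echo "All required packages are installed."
fi
```

On Ubuntu, ttf-mscorefonts-installer is normally in the multiverse component, so that component may need to be enabled first. The package also asks for a licence agreement to be accepted during installation.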

Island.io · October 11, 2022, 11:18am

That solved the issue.
Thanks

atir.tahir · October 11, 2022, 11:31am

@Island.io

Glad to know that the issue is resolved.