Hello,
I recently started encountering an error when trying to extract text from a pptx file on Linux (the same code works fine on OS X).
The exception stack trace is:
“exception”: "|GroupDocs.Parser.Exceptions.GroupDocsParserException: The type initializer for ‘Gdip’ threw an exception.\n at \u0005\u0019\u0005.\u0002(Stream \u0002, LoadOptions \u0003)\n at \u000e\u0018\u0005.\u0002()\n at \u000e\u0018\u0005.\b\u0015\u0003\u0003\u0019\u0003\u0019\u0002()\n at GroupDocs.Parser.Options.DocumentInfo.get_PageCount()
It happens when calling “parser.GetDocumentInfo().PageCount” or “parser.GetText()”.
I'm using the .NET version of GroupDocs.Parser with the following spec:
OS: Ubuntu debian Jammy 22.04.1
GroupDocs.Parser package version: 22.8.0
The same code on Linux parses docx, xls and other document types successfully.
I attached the pptx file that I am running my tests on.
I would appreciate your help on this matter, thanks.
samplepptx.pptx.zip (395.9 KB)
@Island.io
Could you please share the sample code/application as well?
public string ExtractTextFromFile(Stream fileStream) {
    try {
        using var parser = new Parser(fileStream);
        // Formats without page-level text support fall back to plain text extraction.
        if (!parser.Features.TextPage && parser.Features.Text) {
            return GetTextFromTxtFiles(parser);
        }
        return GetTextFromDocument(parser);
    }
    catch (GroupDocs.Parser.Exceptions.UnsupportedDocumentFormatException) {
        return string.Empty;
    }
}

private string GetTextFromTxtFiles(Parser parser) {
    using var reader = parser.GetText();
    if (reader == null)
        throw new Exception("group docs extractor could not get text from file");
    return reader.ReadToEnd();
}

private string GetTextFromDocument(Parser parser) {
    var text = new StringBuilder();
    var docInfo = parser.GetDocumentInfo();
    for (var pageNumber = 0; pageNumber < docInfo.PageCount; pageNumber++) {
        foreach (var textArea in GetTextAreas(parser, pageNumber)) {
            text.Append(TrimMultiSpace.Replace(textArea.Text.Trim(), " ") + " ");
        }
    }
    return text.ToString();
}

private static IEnumerable<PageTextArea> GetTextAreas(Parser parser, int pageNumber) {
    return parser.GetTextAreas(pageNumber).SelectMany(GetTextAreas);
}

// Recursively flattens nested text areas down to their leaves.
private static IEnumerable<PageTextArea> GetTextAreas(PageTextArea area) {
    if (area.Areas.Count == 0) yield return area;
    foreach (var textArea in area.Areas.SelectMany(GetTextAreas)) {
        yield return textArea;
    }
}
@Island.io
Thanks for the details. We are investigating this issue. Your investigation ticket ID is PARSERNET-1938.
@Island.io
In order to use GroupDocs.Parser for .NET on Linux, the following packages should be installed:
- libgdiplus package
- libc6-dev package
- package with Microsoft compatible fonts: ttf-mscorefonts-installer. (e.g. sudo apt-get install ttf-mscorefonts-installer)
After successfully installing these packages, let us know if the issue persists.
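For anyone landing on this thread, the package list above can be sketched as a small shell script. This is only a sketch for a Debian/Ubuntu system with apt-get and dpkg available. The function names install_packages and check_packages are made up for illustration and are not part of GroupDocs.Parser. On Ubuntu the ttf-mscorefonts-installer package may require the multiverse repository to be enabled and may prompt for a license agreement during installation.

```shell
#!/bin/sh
# Packages named in the reply above.
PACKAGES="libgdiplus libc6-dev ttf-mscorefonts-installer"

# Install all three packages (assumes sudo and apt-get are available).
install_packages() {
    sudo apt-get update
    # $PACKAGES is intentionally unquoted so it splits into separate package names.
    sudo apt-get install -y $PACKAGES
}

# Print one line per package: "ok" if dpkg reports it as installed, "missing" otherwise.
check_packages() {
    for pkg in $PACKAGES; do
        if dpkg -s "$pkg" >/dev/null 2>&1; then
            echo "$pkg: ok"
        else
            echo "$pkg: missing"
        fi
    done
}
```

Running install_packages once and then check_packages should show whether any of the three is still missing before retrying the pptx extraction.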
That solved the issue.
Thanks
@Island.io
Glad to know that the issue is resolved.