Regularly seeing Unsupported Document Format Exception errors

prominentmedia · September 12, 2022, 1:43pm

We are seeing the error below fairly often when we upload PDFs (I’d love to upload examples, but the info within is sensitive, and the only non-sensitive example is 5MB, which is over your 4MB upload size limit):

Error during text extraction from [path_to_pdf]
GroupDocs.Parser.Exceptions.UnsupportedDocumentFormatException: Exception of type 'GroupDocs.Parser.Exceptions.UnsupportedDocumentFormatException' was thrown.

The issue is that from then on, whenever we upload a new document to the same folder, we get those same errors again because the folder is reindexed, even though the new document has been indexed successfully.

We don’t want to just ignore errors coming from GroupDocs.Search, but we also don’t want to keep reporting issues that are unrelated to the current upload.

Can you advise us on the best course of action?

If you are interested in the code we are using for the indexing, I’ve included it below.

public string AddOrUpdateIndex(string indexFolderLocation, string filesToIndexFolderLocation)
{
    var settings = GetStandardIndexSettings();
    var indexIsNew = Directory.GetFiles(indexFolderLocation).Length < 1;
    var index = new GroupDocs.Search.Index(indexFolderLocation, settings);
    var errorMessage = string.Empty;
    index.Events.ErrorOccurred += (sender, args) =>
    {
        errorMessage = args.Message;
    };

    if (indexIsNew)
    {
        index.Add(filesToIndexFolderLocation);
    }
    else
    {
        UpdateOptions options = new UpdateOptions();
        options.Threads = 2;
        index.Update(options);
        index.Optimize();
    }

    if (string.IsNullOrEmpty(errorMessage))
    {
        return "Success";
    }

    return errorMessage;
}   

private IndexSettings GetStandardIndexSettings()
{
    var settings = new IndexSettings();
    settings.UseRawTextExtraction = false;
    return settings;
}

atir.tahir · September 12, 2022, 2:57pm

@prominentmedia

Please upload the file to cloud storage (e.g. Google Drive) and share the link here. Also, please specify the API version that you are using.

prominentmedia · September 12, 2022, 3:12pm

Thanks for the reply. We are using 21.8.1 and you can see a file that fails here: prominentmedia

atir.tahir · September 12, 2022, 6:59pm

@prominentmedia

prominentmedia:

Error during text extraction from [path_to_pdf]
GroupDocs.Parser.Exceptions.UnsupportedDocumentFormatException: Exception of type 'GroupDocs.Parser.Exceptions.UnsupportedDocumentFormatException' was thrown.

We don’t get any such exception. Could you please share a sample application with which the issue can be reproduced, along with a short video/screencast explaining the steps to reproduce it?

prominentmedia · September 20, 2022, 1:30pm

You can download a small source application from https://prominentmedia.com

There is a screencast of the issue at https://prominentmedia.com/pdf/groupdocs-capture.mkv

An example PDF that is giving us an error can be found at https://prominentmedia.com/pdf/15747.pdf

Thanks in advance!

atir.tahir · September 20, 2022, 5:27pm

@prominentmedia

Thanks for the details. We were able to reproduce this issue on our end and have logged it in our internal issue tracking system with ticket ID SEARCHNET-2710. You’ll be notified of any update.

atir.tahir · September 21, 2022, 8:07pm

@prominentmedia

The problem has been confirmed and is already fixed; the fix will be available soon. However, indexing of this PDF file will only work from a file (not from a stream).

prominentmedia · September 21, 2022, 8:25pm

Thank you, Tahir. That supports our use case, so I’ll be glad when the fix is released.

Will you post here when the big day comes?

atir.tahir · September 22, 2022, 6:57am

@prominentmedia

We’ll notify you here as soon as there is any further update.

aspose.notifier · October 12, 2022, 9:53am

The issues you have found earlier (filed as SEARCHNET-2710) have been fixed in this update. This message was posted using Bugs notification tool by Atir_Tahir
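
On the original question of reporting only errors that relate to the current upload: one possible approach, offered as a minimal sketch and not as an official GroupDocs recommendation, is to collect every message raised by the ErrorOccurred handler and then keep only those that mention a file from the current upload. UploadErrorFilter is a hypothetical helper, not part of GroupDocs.Search, and the sketch assumes the error message contains the path of the failing file, as "Error during text extraction from [path_to_pdf]" suggests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of GroupDocs.Search).
// Assumption: each error message contains the path of the file that failed.
public sealed class UploadErrorFilter
{
    private readonly List<string> _messages = new List<string>();

    // Call this from the ErrorOccurred handler with args.Message,
    // so that every error is kept instead of only the last one.
    public void Record(string message)
    {
        if (!string.IsNullOrWhiteSpace(message))
        {
            _messages.Add(message);
        }
    }

    // Returns only the recorded messages that mention one of the given paths.
    public IReadOnlyList<string> ErrorsFor(IEnumerable<string> uploadedPaths)
    {
        var paths = uploadedPaths.ToList();
        return _messages
            .Where(m => paths.Any(p =>
                m.IndexOf(p, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }
}
```

In AddOrUpdateIndex, the handler would call Record(args.Message) on such a filter, and after index.Update(options) the method would return the result of ErrorsFor for the newly uploaded file. That requires the caller to pass in the path of the new file, which the posted method does not currently receive.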