Regularly seeing Unsupported Document Format Exception errors

We are seeing the error below fairly often when we upload PDFs (I’d love to upload examples, but the info within is sensitive, and the only non-sensitive example is 5MB, which is over your 4MB upload size limit):

Error during text extraction from [path_to_pdf]
GroupDocs.Parser.Exceptions.UnsupportedDocumentFormatException: Exception of type 'GroupDocs.Parser.Exceptions.UnsupportedDocumentFormatException' was thrown.

The problem is that from then on, whenever we upload a new document to the same folder, the update re-indexes the folder and we get the same errors again, even though the new document has been indexed successfully.

We don’t want to simply ignore errors coming from GroupDocs.Search, but we also don’t want to keep reporting issues that are unrelated to the current upload.

Can you advise us on the best course of action?

If you are interested in the code we are using for the indexing, I’ve included it below.

public string AddOrUpdateIndex(string indexFolderLocation, string filesToIndexFolderLocation)
{
    var settings = GetStandardIndexSettings();
    // An empty index folder means the index has not been created yet.
    var indexIsNew = Directory.GetFiles(indexFolderLocation).Length < 1;
    var index = new GroupDocs.Search.Index(indexFolderLocation, settings);
    var errorMessage = string.Empty;
    // Each error overwrites the previous one, so only the last message is returned.
    index.Events.ErrorOccurred += (sender, args) =>
    {
        errorMessage = args.Message;
    };

    if (indexIsNew)
    {
        // First run: index everything in the folder.
        index.Add(filesToIndexFolderLocation);
    }
    else
    {
        // Later runs: update the existing index, then optimize it.
        UpdateOptions options = new UpdateOptions();
        options.Threads = 2;
        index.Update(options);
        index.Optimize();
    }

    if (string.IsNullOrEmpty(errorMessage))
    {
        return "Success";
    }

    return errorMessage;
}

private IndexSettings GetStandardIndexSettings()
{
    var settings = new IndexSettings();
    settings.UseRawTextExtraction = false;
    return settings;
}
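
For readers with the same reporting problem, here is a minimal sketch of one possible way to separate errors about the current upload from errors about older files. It assumes the error message contains the path of the failing file, as the "Error during text extraction from [path_to_pdf]" line above suggests. `IndexErrorFilter` and `ForCurrentUpload` are hypothetical names and are not part of GroupDocs.Search.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of GroupDocs.Search.
public static class IndexErrorFilter
{
    // Returns only the messages that mention one of the files in the current upload.
    // Assumption: each error message includes the full path of the failing file.
    public static List<string> ForCurrentUpload(
        IEnumerable<string> errorMessages,
        IEnumerable<string> uploadedFilePaths)
    {
        // Skip empty paths, since an empty string would match every message.
        var paths = uploadedFilePaths
            .Where(path => !string.IsNullOrEmpty(path))
            .ToList();

        return errorMessages
            .Where(message => paths.Any(path =>
                message.IndexOf(path, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }
}
```

To use something like this, the ErrorOccurred handler would append each args.Message to a List<string> instead of overwriting a single string, and the caller would pass that list together with the paths of the files just uploaded. Messages about other files could still be logged at a lower severity rather than discarded.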

@prominentmedia

Please upload the file to cloud storage (e.g. Google Drive) and share the link here. Also specify the API version that you are using.

Thanks for the reply. We are using 21.8.1 and you can see a file that fails here: https://prominentmedia.com/pdf/15747.pdf

@prominentmedia

We don’t get any such exception. Could you please share a sample application with which the issue can be reproduced, along with a short video/screencast explaining the steps to reproduce it?

You can download a small source application from https://prominentmedia.com/pdf/SecondPdfDemo.zip

There is a screencast of the issue at https://prominentmedia.com/pdf/groupdocs-capture.mkv

An example PDF that is giving us an error can be found at https://prominentmedia.com/pdf/15747.pdf

Thanks in advance!

@prominentmedia

Thanks for the details. This issue is reproduced at our end. Therefore, we have logged it in our internal issue tracking system with ticket ID SEARCHNET-2710. You’ll be notified in case of any update.

@prominentmedia

The problem exists and has already been fixed; the fix will be available soon. However, indexing of this PDF file will only work from a file (not from a stream).

Thank you, Tahir. That supports our use case, so I’ll be glad when the fix is released.

Will you post here when the big day comes? :)


@prominentmedia

We’ll notify you here as soon as there is any further update.

The issues you have found earlier (filed as SEARCHNET-2710) have been fixed in this update. This message was posted using Bugs notification tool by Atir_Tahir