Thousands of exceptions during the extraction of the "Corpora" dataset

jamsharp · July 1, 2026, 8:32am

Hello,

We tried extracting all of the more than 700,000 files of the “Corpora” dataset. These are files that were scraped from US .gov websites.

Besides reproducing the "OverflowException: Arithmetic operation resulted in an overflow" this way, we discovered that several thousand files could not be extracted, although I expect most of them to be valid.

Total exceptions: 4084 (3462x .doc, 545x .ppt, 34x .xls, 15x .pps, 13x .html, 12x .pdf, 3x .rtf)

The attached zip file, GroupDocsErrors.zip (57.8 KB), contains multiple Markdown files.

The file “_Zusammenfassung.md” contains an overview of the different exceptions and how often each occurs. For each of the exceptions, I added a separate md file that tells you which file(s) caused the exception.

You can reproduce the exceptions I reported… You can download the problematic files easily from here (each of the zips there contains 1000 files… If you want file 047092.pdf for example, you need to download 047.zip and it will be in there): AWS S3 Explorer
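The archive lookup described above can be sketched in a few lines of Python. This is only an illustration: the function name `archive_for` is hypothetical, and the naming rule (six-digit file names, with the archive named after the file number divided by 1000 and zero-padded to three digits) is an assumption inferred from the single 047092.pdf / 047.zip example.

```python
from pathlib import PurePosixPath

# Stated in the post: each zip contains 1000 files.
FILES_PER_ARCHIVE = 1000


def archive_for(file_name: str) -> str:
    """Return the name of the zip archive expected to hold file_name.

    Assumption (inferred from one example only): file names are a
    six-digit number plus an extension, e.g. "047092.pdf".
    """
    stem = PurePosixPath(file_name).stem  # "047092.pdf" -> "047092"
    if len(stem) != 6 or not (stem.isascii() and stem.isdigit()):
        raise ValueError(f"unexpected file name: {file_name!r}")
    # Integer division groups the files into blocks of 1000.
    return f"{int(stem) // FILES_PER_ARCHIVE:03d}.zip"


if __name__ == "__main__":
    print(archive_for("047092.pdf"))
```

Running the script on the example from the post should point to 047.zip; names that do not match the assumed pattern raise a `ValueError` instead of guessing.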

yuriy.mazurchuk · July 2, 2026, 3:27pm

Hi @jamsharp!

Thank you for your request.
I understand your concern and appreciate your help. We depend on multiple document engine libraries that also require retesting and adjustments. That is why the entire reproducing and fixing takes more time from our side.
We are working on these issues and will get back once we have any updates!
Thank you!

jamsharp · July 6, 2026, 6:27am

That is why the entire reproducing and fixing takes more time from our side.

As a software developer myself, I understand that. We don’t expect all of this to be fixed within weeks. We are fine with it happening step by step over time.

yuriy.mazurchuk · July 10, 2026, 11:08am

Hi @jamsharp!

A quick update: we are working on the GroupDocs.Search July release, which will include multiple fixes for known issues. At the moment, we are running the product on the Corpora dataset to verify the changes and confirm the fixes.
Once the release is ready, I will share the update.

Thank you!