Thousands of exceptions during the extraction of the "Corpora" dataset

Hello,

We tried extracting all of the more than 700,000 files in the “Corpora” dataset. These are files that were scraped from US .gov websites.

Apart from reproducing the “OverflowException: Arithmetic operation resulted in an overflow” this way, we discovered that several thousand files could not be extracted, although I expect most of them to be valid.

Total exceptions: 4084 (3462x .doc, 545x .ppt, 34x .xls, 15x .pps, 13x .html, 12x .pdf, 3x .rtf)

The attached zip file, GroupDocsErrors.zip (57.8 KB), contains multiple Markdown files.

The file “_Zusammenfassung.md” (German for “summary”) contains an overview of the different exceptions and how often each occurs. For each exception, I added a separate .md file that lists the file(s) that caused it.

You can reproduce the exceptions I reported. The problematic files can be downloaded easily from here (each of the zips there contains 1000 files, so if you want file 047092.pdf, for example, you need to download 047.zip and it will be in there): AWS S3 Explorer
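In case it saves some time when collecting the failing files, here is a minimal Python sketch of the naming rule described above. It assumes every file name has a six-digit, zero-padded stem and that the archive is named after the first three digits, which is what the 047092.pdf example suggests; I have not checked this against every archive. The helper names `zip_for` and `group_by_zip` are just illustrative.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def zip_for(file_name: str) -> str:
    """Return the archive expected to contain file_name.

    Assumption: stems are six ASCII digits and the archive is named
    after the first three, e.g. 047092.pdf -> 047.zip.
    """
    stem = PurePosixPath(file_name).stem
    if len(stem) != 6 or not (stem.isascii() and stem.isdigit()):
        raise ValueError(f"unexpected file name: {file_name!r}")
    return f"{stem[:3]}.zip"


def group_by_zip(file_names):
    """Group file names by the archive they should be found in."""
    groups = defaultdict(list)
    for name in file_names:
        groups[zip_for(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # 047092.pdf is the example from this post; the other two names are made up.
    print(zip_for("047092.pdf"))  # 047.zip
    print(group_by_zip(["047092.pdf", "047001.doc", "512300.ppt"]))
```

Feeding the file names from the per-exception .md files into `group_by_zip` would tell you which archives need to be downloaded to reproduce a given exception.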