List of supported files for GetText function in .NET

Clemens · August 8, 2025, 12:04pm

Hi,

I’m trying to use the “GetText” function from GroupDocs.Parser to extract text from various files.
Parser generally seems to support a lot of formats:

But “GetText” specifically seems to be more limited.
Is there a list of all file types that support text extraction?

Best regards,
Clemens

Professionalize.Discourse · August 8, 2025, 12:04pm

@Clemens

Not every format that GroupDocs.Parser for .NET can open supports text extraction through the “GetText” function. The set of formats that do is smaller than the set the parser handles overall.

Here’s a summary of the document types that support text extraction with the “GetText” function:

Supported Document Formats for Text Extraction

Word Processing Formats:
- DOC, DOCX, DOT, DOTX, ODT, RTF, TXT
PDF:
- PDF (Portable Document Format)
Spreadsheet Formats:
- XLS, XLSX, XLSM, ODS, CSV
Presentation Formats:
- PPT, PPTX, PPS, PPSX, ODP
Email Formats:
- EML, MSG, PST, OST
Ebook Formats:
- EPUB, FB2, MOBI
Markup Formats:
- HTML, XHTML, XML, Markdown
Image Formats:
- JPG, JPEG, PNG, BMP, GIF, TIFF (raster images require OCR)
Archive Formats:
- ZIP, RAR, 7Z

For a complete list of supported formats and their capabilities, refer to the official documentation on supported document formats (see [2] under Sources).

Example Code

Here’s a simple example of how to use the “GetText” function in your .NET application:

using System;
using System.IO;
using GroupDocs.Parser;

public class TextExtractionExample
{
    public static void ExtractText(string filePath)
    {
        using (Parser parser = new Parser(filePath))
        {
            // Check if text extraction is supported
            if (!parser.Features.Text)
            {
                Console.WriteLine("Text extraction isn't supported for this document.");
                return;
            }

            // Extract text
            using (TextReader reader = parser.GetText())
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }
}

This code checks if the document supports text extraction and then retrieves the text if it does.
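
Since the question is which file types actually work, the same check can be run over a handful of sample files to see what your installed version reports. The following is only a sketch under assumptions: TextSupportProbe and SupportsText are hypothetical names introduced here, and it uses just the calls already shown in this thread (parser.Features.Text and parser.GetText()). It also assumes the Parser constructor succeeds for each file; files the parser cannot open at all may throw and are not handled here.

using System;
using System.IO;
using GroupDocs.Parser;

public static class TextSupportProbe
{
    // Hypothetical helper: returns true if GetText appears usable for this file.
    public static bool SupportsText(string filePath)
    {
        using (Parser parser = new Parser(filePath))
        {
            // Feature flag reported by the parser for this document
            if (!parser.Features.Text)
            {
                return false;
            }

            // A null reader is also treated as "not supported"
            using (TextReader reader = parser.GetText())
            {
                return reader != null;
            }
        }
    }

    public static void Main(string[] args)
    {
        // Pass the sample file paths as command-line arguments
        foreach (string path in args)
        {
            string result = SupportsText(path) ? "text supported" : "text not supported";
            Console.WriteLine("{0}: {1}", path, result);
        }
    }
}

The result reflects only the package and version you run it against, so it may differ between GroupDocs.Parser and GroupDocs.Total builds.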

If you have any further questions or need assistance with specific formats, feel free to ask!

Sources

[1]: GetText | GroupDocs.Parser for .NET API Reference
[2]: Supported Document Formats - GroupDocs Documentation

Clemens · August 8, 2025, 12:11pm

The bot just quoted the same link from my request.
I still need help, please.

atir.tahir · August 9, 2025, 6:57am

@Clemens

To better assist you, could you please share:

- File formats you’ve tried where GetText works, and formats where it doesn’t work.
- Which GroupDocs.Parser API version you are currently using.
- Any sample files where text extraction didn’t work as expected.
- Sample code.

Once we have this information, we can check against the supported formats for your specific version and let you know exactly which file types are compatible with GetText.

Clemens · August 22, 2025, 11:30am

Hi @atir.tahir ,

Sorry for the late reply.
I have attempted to extract text from a TIF file with the OCR option set to true.

I was using GroupDocs.Parser from the GroupDocs.Total.NETFramework package 25.6.

Here is a sample file:
MultiPage TIF 5 pages.zip (850.8 KB)

That’s the code:


using (var parser = new Parser(documentPath))
{
    // Second argument is the OCR option (set to true here)
    TextOptions options = new TextOptions(false, true);
    using (TextReader reader = parser.GetText(options))
    {
        // GetText returned null in this test
        if (reader == null)
        {
            throw new Exception("Text extraction isn't supported");
        }
        File.WriteAllText(@"ParserOutput.txt", reader.ReadToEnd());
    }
}

“reader” was null in my test.

atir.tahir · August 22, 2025, 12:34pm

@Clemens
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): TOTALNET-280

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

vladimir.litvinchik · September 4, 2025, 12:40pm

@Clemens

I have checked this issue and found that the OCR feature was not included in the GroupDocs.Total.NETFramework package 25.6 or in previous versions.

At the moment I can see that it can be enabled in the next version of the GroupDocs.Total package (.NET 6 assembly), but since you’re on .NET Framework, it won’t help much.

Can you use GroupDocs.Parser.NETFramework for now while we’re looking for a solution?

Clemens · September 4, 2025, 12:56pm

Hi @vladimir.litvinchik ,

Thanks a lot for the update!
Yes, I think that is fine for now.
We are currently using text extraction without OCR.
It would have been nice to have, but it’s not critical for our use case.
Still, it’s good to know that it will work with GroupDocs.Parser.NETFramework.

Best regards,
Clemens

vladimir.litvinchik · November 11, 2025, 3:11pm

@Clemens

This issue was fixed in GroupDocs.Total for .NET 25.9, so you can use https://www.nuget.org/packages/GroupDocs.Total.NETFramework now to get text from TIFF and other raster formats.

Have a nice day!