List of supported files for GetText function in .NET

Hi,

I’m trying to use the “GetText” function from GroupDocs.Parser to extract text from various files.
Parser generally seems to support a lot of formats:

But “GetText” specifically seems to be more limited.
Is there a list of all file types that support text extraction?

Best regards,
Clemens

@Clemens

Not every document format that GroupDocs.Parser for .NET can open supports text extraction with the “GetText” function. The set of formats that support text extraction is narrower than the overall set the parser can handle.

Here’s a summary of the document types that support text extraction with the “GetText” function:

Supported Document Formats for Text Extraction

  1. Word Processing Formats:

    • DOC, DOCX, DOT, DOTX, ODT, RTF, TXT
  2. PDF:

    • PDF (Portable Document Format)
  3. Spreadsheet Formats:

    • XLS, XLSX, XLSM, ODS, CSV
  4. Presentation Formats:

    • PPT, PPTX, PPS, PPSX, ODP
  5. Email Formats:

    • EML, MSG, PST, OST
  6. Ebook Formats:

    • EPUB, FB2, MOBI
  7. Markup Formats:

    • HTML, XHTML, XML, Markdown
  8. Image Formats:

    • JPG, JPEG, PNG, BMP, GIF, TIFF
  9. Archive Formats:

    • ZIP, RAR, 7Z

For the complete list of supported formats and their capabilities, refer to the official documentation on supported document formats [2].

Example Code

Here’s a simple example of how to use the “GetText” function in your .NET application:

using System;
using System.IO;
using GroupDocs.Parser;

public class TextExtractionExample
{
    public static void ExtractText(string filePath)
    {
        using (Parser parser = new Parser(filePath))
        {
            // Check if text extraction is supported
            if (!parser.Features.Text)
            {
                Console.WriteLine("Text extraction isn't supported for this document.");
                return;
            }

            // Extract text
            using (TextReader reader = parser.GetText())
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }
}

This code checks if the document supports text extraction and then retrieves the text if it does.
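Since the same Features.Text check works for any file the parser can open, one way to answer the question for a specific installed version is to probe a folder of sample files and print the result for each. The following is only a minimal sketch: the class name, method name, and the samplesFolder parameter are hypothetical, and the assumption that the Parser may throw for formats it cannot open is not confirmed in this thread.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

public static class TextSupportProbe
{
    // Prints, for every file in a folder, whether text extraction is reported as supported.
    // "samplesFolder" is a hypothetical directory holding one test document per format.
    public static void Report(string samplesFolder)
    {
        foreach (string path in Directory.GetFiles(samplesFolder))
        {
            string name = Path.GetFileName(path);
            try
            {
                using (Parser parser = new Parser(path))
                {
                    // Same check as in the example above
                    Console.WriteLine($"{name}: Features.Text = {parser.Features.Text}");
                }
            }
            catch (Exception ex)
            {
                // Assumption: opening an unsupported or damaged file may throw.
                Console.WriteLine($"{name}: could not be opened ({ex.GetType().Name})");
            }
        }
    }
}
```

Calling TextSupportProbe.Report with a folder of representative files gives a per-format answer for the exact package version in use, which may differ from the list above.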

If you have any further questions or need assistance with specific formats, feel free to ask!

Sources

[1]: GetText | GroupDocs.Parser for .NET API Reference
[2]: Supported Document Formats - GroupDocs Documentation

The bot just quoted the same link from my request :smiley:
Still need help please.

@Clemens

To better assist you, could you please share:

  1. File formats you’ve tried where GetText works, and formats where it doesn’t work.
  2. Which GroupDocs.Parser API version you are currently using.
  3. Any sample files where text extraction didn’t work as expected.
  4. Sample code.

Once we have this information, we can check against the supported formats for your specific version and let you know exactly which file types are compatible with GetText.

Hi @atir.tahir ,

Sorry for the late reply.
I tried to extract text from a TIF file with the OCR option set to true.

I was using GroupDocs.Parser from the GroupDocs.Total.NETFramework package, version 25.6.

Here is a sample file:
MultiPage TIF 5 pages.zip (850.8 KB)

That’s the code:


using (var parser = new Parser(documentPath))
{
    // Raw mode off, OCR on
    TextOptions options = new TextOptions(false, true);
    using (TextReader reader = parser.GetText(options))
    {
        if (reader == null)
        {
            throw new Exception("Text extraction isn't supported");
        }
        File.WriteAllText(@"ParserOutput.txt", reader.ReadToEnd());
    }
}

“reader” was null in my test.
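To narrow down whether the null comes from the file format itself or from the OCR option, the same file can be run through the feature check and through both GetText variants. This is a diagnostic sketch only, using the calls already shown in this thread; the class and method names are hypothetical, and it does not explain why the reader is null.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

public static class OcrNullDiagnostics
{
    // Reports what the parser says about text extraction for one file,
    // with and without the OCR flag.
    public static void Diagnose(string documentPath)
    {
        using (Parser parser = new Parser(documentPath))
        {
            Console.WriteLine($"Features.Text: {parser.Features.Text}");

            // Plain extraction without any options
            using (TextReader plain = parser.GetText())
            {
                Console.WriteLine($"GetText(): {(plain == null ? "null" : "reader returned")}");
            }

            // Same flags as in the code above: raw mode off, OCR on
            using (TextReader ocr = parser.GetText(new TextOptions(false, true)))
            {
                Console.WriteLine($"GetText(OCR): {(ocr == null ? "null" : "reader returned")}");
            }
        }
    }
}
```

If both calls return null, the format itself is probably not covered by GetText in this version. If only the OCR call returns null, it may be worth checking the official OCR documentation for any additional setup the Parser instance needs; this thread does not confirm what that setup is.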

@Clemens
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): TOTALNET-280

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.
