List of supported files for GetText function in .NET

Hi,

I’m trying to use the “GetText” function from GroupDocs.Parser to extract text from various files.
Parser generally seems to support a lot of formats:

But “GetText” specifically seems to be more limited.
Is there a list of all file types that support text extraction?

Best regards,
Clemens

@Clemens

Not every document format that GroupDocs.Parser for .NET can open supports text extraction with the “GetText” function. The set of formats that support text extraction is smaller than the full set of formats the parser can handle.

Here’s a summary of the document types that support text extraction with the “GetText” function:

Supported Document Formats for Text Extraction

  1. Word Processing Formats: DOC, DOCX, DOT, DOTX, ODT, RTF, TXT
  2. PDF: PDF (Portable Document Format)
  3. Spreadsheet Formats: XLS, XLSX, XLSM, ODS, CSV
  4. Presentation Formats: PPT, PPTX, PPS, PPSX, ODP
  5. Email Formats: EML, MSG, PST, OST
  6. Ebook Formats: EPUB, FB2, MOBI
  7. Markup Formats: HTML, XHTML, XML, Markdown
  8. Image Formats: JPG, JPEG, PNG, BMP, GIF, TIFF
  9. Archive Formats: ZIP, RAR, 7Z

For a complete list of supported formats and their capabilities, you can refer to the official documentation on supported document formats here.
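
The library can also report the formats it recognizes at run time, which may be easier to compare against your installed version than a static list. The following is only a minimal sketch: it assumes that FileType.GetSupportedFileTypes() in the GroupDocs.Parser.Options namespace is available in your version, and the SupportedFormats class name is made up for illustration. Note that this lists every format the parser recognizes, not only those that work with “GetText”.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using GroupDocs.Parser.Options;

public static class SupportedFormats
{
    // Assumption: FileType.GetSupportedFileTypes() exists in the installed
    // GroupDocs.Parser version and each FileType exposes an Extension property.
    public static List<string> ListExtensions()
    {
        return FileType.GetSupportedFileTypes()
            .Select(fileType => fileType.Extension)
            .OrderBy(extension => extension, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    // Prints one extension per line.
    public static void Print()
    {
        foreach (string extension in ListExtensions())
        {
            Console.WriteLine(extension);
        }
    }
}
```

Calling SupportedFormats.Print() shows the overall list; whether text extraction works for a given file still has to be checked per document through parser.Features.Text.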

Example Code

Here’s a simple example of how to use the “GetText” function in your .NET application:

using System;
using System.IO;
using GroupDocs.Parser;

public class TextExtractionExample
{
    public static void ExtractText(string filePath)
    {
        using (Parser parser = new Parser(filePath))
        {
            // Check if text extraction is supported
            if (!parser.Features.Text)
            {
                Console.WriteLine("Text extraction isn't supported for this document.");
                return;
            }

            // Extract text
            using (TextReader reader = parser.GetText())
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }
}

This code checks if the document supports text extraction and then retrieves the text if it does.
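
Since the same Features.Text check can be run against any file, one way to find out which of your own file types work with “GetText” is to probe a folder of sample files. This is a rough sketch rather than an official method: the TextSupportProbe class, its method names, and the idea of a sample folder are all hypothetical, and it assumes that any exception thrown while opening a file simply means the file could not be handled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;

public static class TextSupportProbe
{
    // Returns parser.Features.Text for the file, or null if the file
    // could not be opened at all (assumption: any exception means "failed").
    public static bool? SupportsText(string filePath)
    {
        try
        {
            using (Parser parser = new Parser(filePath))
            {
                return parser.Features.Text;
            }
        }
        catch (Exception)
        {
            return null;
        }
    }

    // Groups distinct, lower-cased file extensions by probe result.
    // The probe is passed in so the grouping can be checked without the library.
    public static Dictionary<string, List<string>> GroupByResult(
        IEnumerable<string> filePaths, Func<string, bool?> probe)
    {
        var groups = new Dictionary<string, List<string>>
        {
            ["supported"] = new List<string>(),
            ["unsupported"] = new List<string>(),
            ["failed"] = new List<string>()
        };

        foreach (string path in filePaths)
        {
            string extension = Path.GetExtension(path).ToLowerInvariant();
            bool? result = probe(path);
            string key = result == null
                ? "failed"
                : (result.Value ? "supported" : "unsupported");

            if (!groups[key].Contains(extension))
            {
                groups[key].Add(extension);
            }
        }

        return groups;
    }

    // Probes every file in the (hypothetical) sample folder and prints a summary.
    public static void Report(string sampleFolder)
    {
        var groups = GroupByResult(Directory.EnumerateFiles(sampleFolder), SupportsText);
        foreach (var pair in groups)
        {
            Console.WriteLine($"{pair.Key}: {string.Join(", ", pair.Value)}");
        }
    }
}
```

Running TextSupportProbe.Report on a folder with one sample file per format gives a quick list of extensions for which text extraction is reported as supported, not supported, or could not be checked in your installed version.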

If you have any further questions or need assistance with specific formats, feel free to ask!

Sources

[1]: GetText | GroupDocs.Parser for .NET API Reference
[2]: Supported Document Formats - GroupDocs Documentation

The bot just quoted the same link from my request 😃
Still need help please.

@Clemens

To better assist you, could you please share:

  1. File formats you’ve tried where GetText works, and formats where it doesn’t work.
  2. Which GroupDocs.Parser API version you are currently using.
  3. Any sample files where text extraction didn’t work as expected.
  4. Sample code.

Once we have this information, we can check against the supported formats for your specific version and let you know exactly which file types are compatible with GetText.