Not every document format that GroupDocs.Parser for .NET can open supports text extraction with the “GetText” method. The formats that support text extraction are a subset of the formats the parser can handle overall.
Here’s a summary of the document types that support text extraction with the “GetText” function:
Supported Document Formats for Text Extraction
Word Processing Formats:
DOC, DOCX, DOT, DOTX, ODT, RTF, TXT
PDF:
PDF (Portable Document Format)
Spreadsheet Formats:
XLS, XLSX, XLSM, ODS, CSV
Presentation Formats:
PPT, PPTX, PPS, PPSX, ODP
Email Formats:
EML, MSG, PST, OST
Ebook Formats:
EPUB, FB2, MOBI
Markup Formats:
HTML, XHTML, XML, Markdown
Image Formats:
JPG, JPEG, PNG, BMP, GIF, TIFF
Archive Formats:
ZIP, RAR, 7Z
For the complete list of supported formats and their capabilities, refer to the supported document formats page in the official GroupDocs.Parser documentation.
Example Code
Here’s a simple example of how to use the “GetText” function in your .NET application:
using System;
using System.IO;
using GroupDocs.Parser;

public class TextExtractionExample
{
    public static void ExtractText(string filePath)
    {
        // Open the document; the parser is disposed when the block exits
        using (Parser parser = new Parser(filePath))
        {
            // Check if text extraction is supported
            if (!parser.Features.Text)
            {
                Console.WriteLine("Text extraction isn't supported for this document.");
                return;
            }

            // Extract text
            using (TextReader reader = parser.GetText())
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }
}
This code checks if the document supports text extraction and then retrieves the text if it does.
If you have any further questions or need assistance with specific formats, feel free to ask! To help us investigate, please share the following:
The file formats you’ve tried, noting where GetText works and where it doesn’t.
The GroupDocs.Parser API version you are currently using.
Any sample files where text extraction didn’t work as expected.
The sample code you are using.
Once we have this information, we can check against the supported formats for your specific version and let you know exactly which file types are compatible with GetText.
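The first two items can be gathered with a short console program. The following is only a minimal sketch, not an official GroupDocs utility. The class name TextSupportReport is made up for illustration, and the program assumes the file paths are passed as command-line arguments. It prints the version of the loaded GroupDocs.Parser assembly, which may be formatted slightly differently from the NuGet package version, and then uses the same Features.Text check as the example above for each file.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

// Hypothetical helper: reports the library version and, for each file,
// whether the parser says text extraction is supported.
public static class TextSupportReport
{
    public static void Main(string[] args)
    {
        // Version of the GroupDocs.Parser assembly loaded at run time
        // (assumption: this matches the installed package version closely enough to report)
        Version version = typeof(Parser).Assembly.GetName().Version;
        Console.WriteLine("GroupDocs.Parser assembly version: " + version);

        foreach (string filePath in args)
        {
            string name = Path.GetFileName(filePath);
            try
            {
                using (Parser parser = new Parser(filePath))
                {
                    // Same check as in the example above
                    Console.WriteLine(name + ": text extraction supported = " + parser.Features.Text);
                }
            }
            catch (Exception ex)
            {
                // The file could not be opened, e.g. because it is corrupted or unsupported
                Console.WriteLine(name + ": could not be opened (" + ex.GetType().Name + ")");
            }
        }
    }
}
```

Running it over a folder of test files gives a quick per-file overview that can be pasted into your reply along with the version line.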