Markdown Unicorn leads to Memory leak

Hello,

You are probably wondering what the title is about and assuming it was written by a drunk person…

But no…

A Markdown file that appears to be valid consumes all available RAM (60 GB in my case) when I try to index it, and the indexing process never finishes.

The attached zip file contains the MD file, which includes an ASCII-art unicorn. If you remove the unicorn, indexing works fine…

I assume it is caused by the backticks used in the unicorn, which are somehow being interpreted as Markdown…

readme.zip (1.7 KB)

@jamsharp
We have reproduced this issue on our end and opened the following ticket in our internal issue tracking system. The fix will be delivered according to the terms mentioned in our Free Support Policies.

Issue ID(s): SEARCHNET-3279

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

@jamsharp

This issue will be addressed in the next API release. In the meantime, here’s a workaround:

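// System namespaces needed by this snippet; the GroupDocs.Search namespaces
// (for IFieldExtractor, DocumentField, IndexSettings, etc.) must be imported as well
using System.Globalization;
using System.IO;

// Custom Markdown extractor implementation
// Implements the IFieldExtractor interface to handle .md (Markdown) files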
public class MdExtractor : IFieldExtractor
{
    // Define the supported file extensions for this extractor (.md in this case)
    public string[] Extensions => new string[] { ".md" };

    // Extract fields from a file path (used for .md files on disk)
    public DocumentField[] GetFields(string filePath)
    {
        // Read the content of the file
        using (var sr = File.OpenText(filePath))
        {
            // Return the extracted fields as DocumentField objects
            return new DocumentField[]
            {
                // File name (the full file path is stored here)
                new DocumentField(CommonFieldNames.FileName, filePath),
                // Set format family as WordProcessing (though it's Markdown, this ensures compatibility with expected formats)
                new DocumentField(CommonFieldNames.FormatFamily, FormatFamily.WordProcessing.ToString()),
                // Creation date of the file
                new DocumentField(CommonFieldNames.CreationDate, File.GetCreationTime(filePath).ToString(CultureInfo.InvariantCulture)),
                // Last modification date of the file
                new DocumentField(CommonFieldNames.ModificationDate, File.GetLastWriteTime(filePath).ToString(CultureInfo.InvariantCulture)),
                // Full content of the Markdown file
                new DocumentField(CommonFieldNames.Content, sr.ReadToEnd()),
            };
        }
    }

    // Extract fields from a stream (used for .md files in memory or passed as streams)
    public DocumentField[] GetFields(Stream stream)
    {
        // Read the content from the stream
        using (var sr = new StreamReader(stream))
        {
            // Return extracted fields from the stream
            return new DocumentField[]
            {
                // Set format family as WordProcessing for compatibility
                new DocumentField(CommonFieldNames.FormatFamily, FormatFamily.WordProcessing.ToString()),
                // Full content of the Markdown from the stream
                new DocumentField(CommonFieldNames.Content, sr.ReadToEnd()),
            };
        }
    }
}

// When extracting text directly: configure the extraction options to use the custom Markdown extractor
var extractionOptions = new ExtractionOptions
{
    // Disable raw text extraction (we are using custom extraction logic)
    UseRawTextExtraction = false,
    // Assign the custom extractor for Markdown
    CustomExtractor = new MdExtractor(),
};

// When indexing: register the custom Markdown extractor in the index settings
var settings = new IndexSettings();
settings.CustomExtractors.Add(new MdExtractor());

// Create an index with the custom settings (indexPath is the path to your index folder, defined elsewhere)
var index = new Index(indexPath, settings);

@atir.tahir
Thanks for the workaround. Could you please explain how it differs from the default behavior and what impact it has? That information would help us decide whether we can use the workaround or need to wait for the next API release with the fix.

@jamsharp

The proposed workaround behaves the same way the default will in the next release, since the MD format is essentially plain text.