Markdown Unicorn leads to Memory leak

jamsharp · September 12, 2024, 2:36pm

Hello,

You are probably wondering what the title is about and assuming it was written by a drunk person…

But no…

A seemingly valid Markdown file causes full RAM consumption (60 GB in my case) when I try to index it, and the indexing process never finishes.

The zip file contains the MD file with an ASCII-art unicorn. If you remove the unicorn, indexing works fine…

I assume it is caused by the backticks used in the unicorn, which are somehow interpreted as Markdown…

readme.zip (1.7 KB)

atir.tahir · September 12, 2024, 8:05pm

@jamsharp
We have reproduced this issue on our end. We have therefore opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): SEARCHNET-3279

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

atir.tahir · September 13, 2024, 12:01pm

@jamsharp

This issue will be addressed in the next API release. In the meantime, here’s a workaround:

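// Namespaces required by the workaround
using System.Globalization;
using System.IO;
using GroupDocs.Search;
using GroupDocs.Search.Common;
using GroupDocs.Search.Options;

// Custom Markdown extractor implementation
// Implements the IFieldExtractor interface to handle .md (Markdown) files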
public class MdExtractor : IFieldExtractor
{
    // Define the supported file extensions for this extractor (.md in this case)
    public string[] Extensions => new string[] { ".md" };

    // Extract fields from a file path (used for .md files on disk)
    public DocumentField[] GetFields(string filePath)
    {
        // Read the content of the file
        using (var sr = File.OpenText(filePath))
        {
            // Return the extracted fields as DocumentField objects
            return new DocumentField[]
            {
                // File name
                new DocumentField(CommonFieldNames.FileName, filePath),
                // Set format family as WordProcessing (though it's Markdown, this ensures compatibility with expected formats)
                new DocumentField(CommonFieldNames.FormatFamily, FormatFamily.WordProcessing.ToString()),
                // Creation date of the file
                new DocumentField(CommonFieldNames.CreationDate, File.GetCreationTime(filePath).ToString(CultureInfo.InvariantCulture)),
                // Last modification date of the file
                new DocumentField(CommonFieldNames.ModificationDate, File.GetLastWriteTime(filePath).ToString(CultureInfo.InvariantCulture)),
                // Full content of the Markdown file
                new DocumentField(CommonFieldNames.Content, sr.ReadToEnd()),
            };
        }
    }

    // Extract fields from a stream (used for .md files in memory or passed as streams)
    public DocumentField[] GetFields(Stream stream)
    {
        // Read the content from the stream
        using (var sr = new StreamReader(stream))
        {
            // Return extracted fields from the stream
            return new DocumentField[]
            {
                // Set format family as WordProcessing for compatibility
                new DocumentField(CommonFieldNames.FormatFamily, FormatFamily.WordProcessing.ToString()),
                // Full content of the Markdown from the stream
                new DocumentField(CommonFieldNames.Content, sr.ReadToEnd()),
            };
        }
    }
}

// Setting MdExtractor for the extractor
// Configure the extraction options to use the custom Markdown extractor
// (only needed when text is extracted separately from indexing)
var extractionOptions = new ExtractionOptions
{
    // Disable raw text extraction (we are using custom extraction logic)
    UseRawTextExtraction = false,
    // Assign the custom extractor for Markdown
    CustomExtractor = new MdExtractor(),
};

// Setting MdExtractor for the index
// Configure the index to use the custom Markdown extractor
var settings = new IndexSettings();
settings.CustomExtractors.Add(new MdExtractor());

// Create an index with the custom settings (indexPath is the folder where the index is stored)
var index = new Index(indexPath, settings);
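
For completeness, here is a minimal sketch of how an index configured this way might then be used. It assumes the MdExtractor class above is compiled into the same project. The two folder paths and the query term are purely illustrative assumptions, not values from the original report.

```csharp
using System;
using GroupDocs.Search;
using GroupDocs.Search.Results;

// Assumed, illustrative locations; substitute your own folders.
string indexPath = @"c:\MyIndex";
string documentsPath = @"c:\MyDocuments";

// Register the MdExtractor from the workaround so that .md files
// are read as plain text by the custom extractor.
var settings = new IndexSettings();
settings.CustomExtractors.Add(new MdExtractor());

// Create the index with the custom settings.
var index = new Index(indexPath, settings);

// Index the files in the folder (for example, the readme.md from the zip).
index.Add(documentsPath);

// Hypothetical query term; use a word you expect to find in your Markdown files.
SearchResult result = index.Search("unicorn");
Console.WriteLine("Documents found: " + result.DocumentCount);
```

If indexing now completes for the problematic file, the custom extractor is being picked up for the .md extension.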

jamsharp · September 16, 2024, 7:53am

@atir.tahir
Thanks for the workaround. Could you please explain how it differs from the default behavior, or what impact it has? That information would help us decide whether we can use the workaround or need to wait for the next API release with the fix.

atir.tahir · September 16, 2024, 11:44am

@jamsharp

The proposed workaround matches what will be the default behavior in the next release, since the MD format is essentially plain text.

jamsharp · October 16, 2024, 12:04pm

We are not quite sure what to do with this part of your code… The variable “extractionOptions” is defined in your code but never actually used. The indexing seems to work without it.

// Configure the extraction options to use the custom Markdown extractor
var extractionOptions = new ExtractionOptions
{
    // Disable raw text extraction (we are using custom extraction logic)
    UseRawTextExtraction = false,
    // Assign the custom extractor for Markdown
    CustomExtractor = new MdExtractor(),
};

By the way, you wrote:

"This issue will be addressed in the next API release."

When can we expect the next release?

atir.tahir · October 16, 2024, 8:26pm

@jamsharp

ExtractionOptions is only used when text is extracted separately from indexing. Please take a look at this article: Separate Data Extraction.
Example code:

string indexFolder = @"c:\MyIndex";
// A Markdown document, so that the custom MdExtractor (registered for ".md") applies
string documentPath = @"c:\MyDocuments\MyDocument.md";

// Extracting data from a document
Extractor extractor = new Extractor();
Document document = Document.CreateFromFile(documentPath);
ExtractionOptions extractionOptions = new ExtractionOptions();
extractionOptions.UseRawTextExtraction = false;
extractionOptions.CustomExtractor = new MdExtractor();
ExtractedData extractedData = extractor.Extract(document, extractionOptions);

// Serializing the data
byte[] array = extractedData.Serialize();

// Deserializing the data
ExtractedData deserializedData = ExtractedData.Deserialize(array);

// Creating an index
Index index = new Index(indexFolder);

// Indexing the data
ExtractedData[] data = new ExtractedData[]
{
    deserializedData
};
index.Add(data, new IndexingOptions());

// Searching in the index
SearchResult result = index.Search("Einstein");

The next release is expected by the end of this month. We’ll notify you as soon as it is available for download.

jamsharp · November 5, 2024, 10:40am

I can confirm that the problem is fixed in version 24.10.

Please feel free to close this topic.

atir.tahir · November 5, 2024, 11:54am