File's size increases after the remove metadata operation


#1

In my testing, I noticed when removing metadata from a JPG, the file size of the converted file actually increases. Not sure if this is a bug or not.

Attached is the JPG I used in my testing, which is 61,906 bytes. After the remove metadata operation, the size increased to 61,985 bytes. I have also included a screenshot of my Beyond Compare session showing the data added to the converted file.

Beyond_Compare.PNG (99.4 KB)
P1010002.JPG (58.8 KB)


#2

@Glority_Developer,

Thank you for using GroupDocs.Metadata for .NET.
We have forwarded your concern to the product team and logged this issue in our internal tracking system under ID “METADATANET-1924”. We will let you know as soon as we can determine whether the increase in the image’s size after metadata removal can be avoided.


#3

@Glority_Developer,

Kindly provide us with the sample code you are using to remove the metadata of the provided JPEG file.


#4

@Glority_Developer

The issue you have found earlier (filed as METADATANET-1924) has been fixed in this update.


#5

I have updated to v17.10 and verified that the file size of a JPG file now decreases after its metadata is removed. However, the file sizes still increase for GIF and PNG files that had their metadata removed.
Attached is “gif_png_increased.png”, which shows a Beyond Compare summary of the file sizes. I have also attached the GIF and PNG files I used in my testing.

In addition, I noticed that PDF and XLSX files also increase in size after a “Remove info” operation. Please refer to “pdf_xlsx_increased_sizes.png”.
pdf_xlsx_increased_sizes.PNG (92.1 KB)
gif_png_size increased.PNG (75.6 KB)
Dust Storm.GIF (157.4 KB)
test.jpg (531.7 KB)


#6

@Glority_Developer,

Thanks for sharing your issue with us. Please share the sample code you are using to clean metadata for each file so that we can investigate your issue at our end. We look forward to hearing from you.


#7

Here is the code:

using GroupDocs.Metadata;
using GroupDocs.Metadata.Formats;
using GroupDocs.Metadata.Formats.Document;
using GroupDocs.Metadata.Formats.Image;
using GroupDocs.Metadata.Xmp;
using System.IO;

namespace ConsoleApplication1
{
class Program
{
    public static string inputFilePath = @"S:\DublinCore-XmpBasic -Code doc.pdf";
    public static string outputFilePath = @"S:\DublinCore-XmpBasic -Code doc-123.pdf";

    static void Main(string[] args)
    {
        License lic = new License();
        lic.SetLicense(@"S:\GroupDocs.total.lic");

        FormatBase format = null;
        string extension = Path.GetExtension(inputFilePath).ToLower();
        switch (extension)
        {
            case ".gif":
                {
                    format = new GifFormat(inputFilePath);
                    if (format == null || !((GifFormat)format).IsSupportedXmp)
                    {
                        return;
                    }

                    XmpEditableCollection _xmpEditableCollection = ((GifFormat)format).XmpValues;
                    XmpSchemes _schemes = _xmpEditableCollection.Schemes;
                    RemoveMetaData(_schemes);
                    break;
                }
            case ".png":
                {
                    format = new PngFormat(inputFilePath);
                    if (format == null)
                    {
                        return;
                    }

                    XmpEditableCollection _xmpEditableCollection = ((PngFormat)format).XmpValues;
                    XmpSchemes _schemes = _xmpEditableCollection.Schemes;
                    RemoveMetaData(_schemes);
                    break;
                }
            case ".pdf":
                {
                    format = new PdfFormat(inputFilePath);
                    if (format == null)
                    {
                        return;
                    }

                    XmpEditableCollection _xmpEditableCollection = ((PdfFormat)format).XmpValues;
                    XmpSchemes _schemes = _xmpEditableCollection.Schemes;
                    RemoveMetaData(_schemes);

                    // Remove document properties:
                    // Author, Category, Comments, CreatedDate, Company, HyperlinkBase, Keywords, Manager, ModifiedDate, Subject, Title
                    PdfMetadata metaData = ((PdfFormat)format).DocumentProperties;
                    metaData.Title = string.Empty;
                    metaData.Author = string.Empty;
                    break;
                }
            case ".xlsx":
                {
                    format = new XlsFormat(inputFilePath);
                    if (format == null)
                    {
                        return;
                    }

                    // Remove document properties:
                    // Author, Category, Comments, CreatedDate, Company, HyperlinkBase, Keywords, Manager, ModifiedDate, Subject, Title
                    XlsMetadata metaData = ((XlsFormat)format).DocumentProperties;
                    metaData.Title = string.Empty;
                    metaData.Author = string.Empty;
                    break;
                }
            default:
                return;
        }

        format.Save(outputFilePath);
    }

    private static void RemoveMetaData(XmpSchemes schemes)
    {
        // Remove XMP DublinCorePackage metadata:
        // Contributors, Creators, Source, Subject
        schemes.DublinCore.Source = string.Empty;
        schemes.DublinCore.Subject = string.Empty;

        // Remove XMP PdfPackage metadata:
        // Keywords, Producer
        schemes.Pdf.Keywords = string.Empty;
        schemes.Pdf.Producer = string.Empty;

        // Remove XMP PhotoshopPackage metadata:
        // AuthorsPosition, CaptionWriter, City, Country, Credit, DateCreated, Headline, History, Instructions, Source, State
        schemes.Photoshop.City = string.Empty;
        schemes.Photoshop.Country = string.Empty;

        // Remove XMP XmpBasicPackage metadata:
        // BaseUrl, CreateDate, CreatorTool, Label, MetadataDate, ModifyDate, Nickname
        schemes.XmpBasic.BaseUrl = string.Empty;
        schemes.XmpBasic.Nickname = string.Empty;
    }
}

}


#8

@Glority_Developer

Thanks for your response. We investigated your issue and it was reproduced at our end. We have logged this issue in our issue tracking system under ID: “METADATANET-1976”. We will notify you in case of any updates on this issue.


#9

@ali.ahmed

Do you have any update on this issue? Is it solved in the latest version?


#10

@Glority_Developer

Thanks for following up. Your issue has not been fixed yet, and we don’t have any updates on it at the moment. However, it is in the queue, and we will share the latest updates with you as soon as they are available.

Thanks,


#11

@Glority_Developer,

The issue you found earlier (filed as METADATANET-1976) has been fixed in this update.

Please use the following code snippet to remove metadata:

using (GifFormat format = new GifFormat(@"D:\input.gif"))
{
    XmpEditableCollection xmpEditableCollection = format.XmpValues;
    XmpSchemes schemes = xmpEditableCollection.Schemes;

    schemes.DublinCore.Source = null;
    schemes.DublinCore.Subject = null;

    schemes.Pdf.Keywords = null;
    schemes.Pdf.Producer = null;

    schemes.Photoshop.City = null;
    schemes.Photoshop.Country = null;

    schemes.XmpBasic.BaseUrl = null;
    schemes.XmpBasic.Nickname = null;

    format.Save(@"D:\output.gif");
}

#12

Hi @ali.ahmed

We found that for most files the file size no longer increases.

However, for one Excel file the file size still increased. We used the v18.6 library. Here are the Excel file and a screenshot.
excelfile.zip (490.4 KB)
image.png (45.2 KB)


#13

@Glority_Developer,

Thanks for coming back to us.

We are able to reproduce your reported behavior at our end. We have logged it in our internal Issue Tracking System as METADATANET-2047. We shall keep you informed in case we have any updates.


#14

@Glority_Developer,

We have an update for you regarding the issue logged as METADATANET-2047.

The resultant Excel document is larger (although it should be smaller after metadata removal) because Microsoft Excel and GroupDocs.Metadata use different compression algorithms. Since the metadata is usually small compared to the document content, the compression algorithm has more influence on the output document’s size than the removed metadata does. That’s why a workbook produced by GroupDocs.Metadata can be bigger or smaller than the original, depending on its content.
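
To illustrate the point, here is a minimal sketch that packs the same bytes into a zip archive at different compression levels. It is not GroupDocs.Metadata code and does not reproduce what Excel or the library does internally; it only uses the standard System.IO.Compression classes. The class name CompressionDemo, the entry name, and the generated payload are made up for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class CompressionDemo
{
    // Packs the payload into an in-memory zip archive with a single entry,
    // compressed at the given level, and returns the archive size in bytes.
    public static long ArchiveSize(byte[] payload, CompressionLevel level)
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true keeps the MemoryStream usable after the archive is closed.
            using (var zip = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            {
                // Hypothetical entry name, chosen to resemble a worksheet part.
                ZipArchiveEntry entry = zip.CreateEntry("xl/worksheets/sheet1.xml", level);
                using (Stream s = entry.Open())
                {
                    s.Write(payload, 0, payload.Length);
                }
            }
            return buffer.Length;
        }
    }

    public static void Main()
    {
        // Build some repetitive XML-like text as a stand-in for worksheet content.
        var sb = new StringBuilder();
        for (int i = 0; i < 20000; i++)
        {
            sb.Append("<c r=\"A").Append(i).Append("\"><v>").Append(i).Append("</v></c>");
        }
        byte[] payload = Encoding.UTF8.GetBytes(sb.ToString());

        Console.WriteLine("{0,-15} {1:N0} bytes", "Payload:", payload.Length);
        var levels = new[] { CompressionLevel.NoCompression, CompressionLevel.Fastest, CompressionLevel.Optimal };
        foreach (CompressionLevel level in levels)
        {
            // Same content every time; only the compression level changes.
            Console.WriteLine("{0,-15} {1:N0} bytes", level + ":", ArchiveSize(payload, level));
        }
    }
}
```

The content is identical in every run, yet the archive sizes differ, so a change of a few hundred bytes of metadata can easily be outweighed by a change in how the archive is written.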

Having the input file (Excel2016_1MB_test (1).xlsx) and the output file produced after removing the metadata, please try the following:

  • Extract the actual content of both xlsx files to different folders. Since xlsx files are just zipped archives, you can use any archiver of your choice.
  • Measure the size of all files contained in each folder.
  • Compare the sizes.
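
If you prefer not to extract the files by hand, the steps above can be approximated with a short program. This is a rough sketch under a few assumptions: it reads the sizes recorded in each archive's entries instead of unpacking them to folders, the class name XlsxSizeReport is hypothetical, and on .NET Framework it needs a reference to the System.IO.Compression.FileSystem assembly for ZipFile.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class XlsxSizeReport
{
    // Total size of all entries as they would be after extraction.
    public static long UncompressedSize(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            return zip.Entries.Sum(e => e.Length);
        }
    }

    // Total size of all entries as stored (compressed) inside the archive.
    public static long CompressedSize(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            return zip.Entries.Sum(e => e.CompressedLength);
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("Usage: XlsxSizeReport <input.xlsx> <output.xlsx>");
            return;
        }

        foreach (string path in args)
        {
            Console.WriteLine(path);
            Console.WriteLine("  on disk:       {0:N0} bytes", new FileInfo(path).Length);
            Console.WriteLine("  compressed:    {0:N0} bytes", CompressedSize(path));
            Console.WriteLine("  uncompressed:  {0:N0} bytes", UncompressedSize(path));
        }
    }
}
```

Comparing the "uncompressed" figures of the two files shows whether the actual content shrank, independently of how each archive was compressed.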

We got the following results after extracting the input and the resultant output Excel documents:

  • Extracted input xlsx = 3,702,784 bytes
  • Extracted output xlsx = 3,502,080 bytes
  • See the results here

This means that the actual size of the output data is smaller than that of the input data, and the whole difference comes from the archiving algorithms, which apply different compression levels. Hence, it is not a bug but the expected behavior of the API.