Searching PDF Docs

dwayne.gould · April 22, 2015, 6:48pm

We have discovered an issue when we create PDF files. We cannot search the text within the PDF. When we create PDF files using other tools, our documents are searchable.

Is there an option when converting to PDF that we need to set?

Or is there something else we can try?

Thanks,

pavelteplitsky · April 23, 2015, 5:25am

Hello,

We are sorry to hear that you are having this issue. Unfortunately, it is not clear from your post what you are doing or how to reproduce the problem. Could you please provide more information? Please share the following:

1. Which GroupDocs library do you use and which version?

2. What project type do you have - MVC, Web Forms etc?

3. Step by step guide for how to reproduce the issue - how do you create the PDF?

Please get back to us with this information.

Thank you.

dwayne.gould · May 5, 2015, 7:11am

Attached are two PDF files: one generated by GroupDocs (not searchable) and one by the Foxit Reader PDF Printer (searchable). Both were generated from the attached file ‘Atqui electram.docx’.

To generate the PDF, I use the following code:

1) in Global.asax.cs:

    GroupdocsConversion.SetLicensePath(HostingEnvironment.ApplicationPhysicalPath + @"\App_data\Licenses\GroupDocs.Total.for.NET.lic");
    GroupdocsConversion.SetRootStoragePath(HostingEnvironment.ApplicationPhysicalPath + SiteContext.CurrentSiteName + @"\files\");
    GroupdocsConversion.Init();

2) and the method that converts to PDF:

    // Converts the input file to PDF and returns the path of the converted file,
    // or an empty string if the conversion did not complete.
    private static string ConvertCommand(string inputFilePath, string inputFileName)
    {
        string pdfFilePath = String.Empty;
        FileType fileType = FileType.Pdf;
        var conversion = GroupdocsConversion.Instance();

        // fileTempStorage is not defined in this snippet; presumably a field of the containing class.
        string outputFilePath = fileTempStorage + inputFileName + "." + fileType;
        var convertResult = conversion.Convert(inputFilePath, outputFilePath, fileType);

        if (convertResult.State == ConversionState.Completed)
        {
            pdfFilePath = convertResult.ConvertedFileName;
        }
        else if (convertResult.State == ConversionState.Failed)
        {
            // Log the failure; the method still returns an empty path.
            Exception ex = new Exception(convertResult.ErrorMessage);
            EventLogProvider.LogException(typeof(GeneratePDFHelper).Name, MethodBase.GetCurrentMethod().Name, ex);
        }

        return pdfFilePath;
    }

Thanks,

evgen.efimov · May 5, 2015, 4:45pm

Hello,

Thanks for your inquiry and for using GroupDocs.Conversion.

We tested the scenario with your files on GroupDocs.Conversion 1.9.0, but we could not reproduce the issue on our side. As you can see in these screencasts (one and two), we tried to reproduce it in GroupDocs.Viewer for .NET (2.11.0) and in Foxit Reader (and likewise in Acrobat Reader), and in all cases search worked.

Please tell us where you are testing the search function (which program and version) and how you tested it, so that we can try to reproduce the issue on our side.

Best regards
Evgen Efimov

http://groupdocs.com
Your Document Collaboration APIs
Follow us on LinkedIn, Twitter, Facebook and Google+

dwayne.gould · May 15, 2015, 8:41am

We are using Kentico, and the PDF files cannot be indexed. PDF files created with several other tools index fine, but those created with GroupDocs.Conversion cannot be indexed.

pavelteplitsky · May 15, 2015, 10:12am

Hello,

We are sorry to hear that you are having this issue. Could you please explain what you mean by indexing the PDF documents, and how you index them?

Thank you.

dwayne.gould · May 15, 2015, 12:01pm

When we upload the document to the SQL database, Kentico will index the document content, making it accessible for full-text search.

Smart Search retrieves the appropriate data from the database and stores it in an index file using an easily searchable format. When website visitors submit a search expression, the index is scanned instead of the raw data and the results are returned. An index is automatically updated whenever the corresponding website content changes.
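
As a rough illustration of why text extraction matters here: a full-text index can only contain the words that the indexer manages to extract from each document. The sketch below is a toy inverted index in Python. It is not Kentico's or Lucene's actual implementation, and the function names, file names and sample text are made up for illustration.

```python
from collections import defaultdict
import re


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical extraction results: one PDF yields text, the other yields none.
documents = {
    "foxit.pdf": "Atqui electram sample text",
    "groupdocs.pdf": "",
}
index = build_index(documents)
print(search(index, "electram"))
```

If extraction yields no usable words for a file, that file never appears in the index, so no search expression can find it even though the PDF opens and searches fine in a viewer.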

pavelteplitsky · May 15, 2015, 12:36pm

Hello,

We have checked the conversion functionality with your documents and everything works fine for us: search works in Foxit Reader and other PDF readers. We used the latest version of the Conversion library. Please make sure that you are using the latest version in your project. If that does not help, please share a sample project and a sample DB (an empty DB is enough; we just need the structure so that we can reproduce exactly the same use case).

Thank you.

dwayne.gould · May 27, 2015, 5:08pm

Here is the reply from Kentico:

Thank you for your message.

Does the search indexing work for other PDFs? If so, it looks like the tool you are using does something special when creating the documents. I do not want to play table tennis, but I do not see how this is related to Kentico. Our search uses standard .NET and the Lucene engine to index the files. I am not sure how we can control this if the PDF is generated in some possibly non-standard format.

Best regards,
Juraj Ondrus
Support Engineer

pavelteplitsky · May 28, 2015, 5:38am

Hello,

Thank you for sharing this info. Unfortunately, we need examples of your code so that we can reproduce the issue. We also need to know whether you use our Viewer plugin for Kentico or have created a custom integration with our Viewer library. Please share a full description of your integration and its full code so that we can put it into our Kentico demo and test it.

Thank you.

dwayne.gould · June 2, 2015, 1:43am

This can be reproduced with a standard Kentico installation. Upload a PDF document generated by GroupDocs and try to search for content in that document.

evgen.efimov · June 2, 2015, 3:11pm

Hello,

Thank you for posting.

We are still investigating the issue. We will notify you when we have a result.

Thanks for your patience.

Best regards
Evgen Efimov

evgen.efimov · June 5, 2015, 1:15pm

Hello,

We have checked the scenario (indexing of a .pdf file) in Kentico CMS and managed to reproduce your issue. We think that the issue is with the PDF format used to generate the documents. The format appears to cause problems only for the Lucene engine; it works fine in Adobe and other PDF readers. We have created a task for our product team, who will investigate whether it is a bug or a feature and will try to fix it.

We have added a ticket number (DC-654) to this thread. You will be notified when it is fixed.

Best regards
Evgen Efimov

dwayne.gould · July 6, 2015, 6:42pm

Do you have an estimate when this will be completed?

evgen.efimov · July 7, 2015, 6:49am

Hello,

Thank you for your inquiry.

Our product team is still working on resolving this issue. We understand the importance of this fix for you and are doing our best to resolve it soon. Unfortunately, we do not have an estimated date at the moment; you will be notified in this forum thread when the issue is resolved.

We apologize for the inconvenience.

Best regards
Evgen Efimov

pavelteplitsky · September 8, 2015, 11:30am

Hello Dwayne,

Sorry for the delay. Our Product team has completed its work on your issue and has concluded that the undesired behavior you are observing is not actually a bug in GroupDocs.Viewer, so we have closed the issue as ‘Not a Bug’. .NET itself does not support extracting text from PDF files, so some other library must be used for this purpose. It seems that Kentico CMS uses one that does not support PDF files generated by our library. However, GroupDocs.Viewer generates correct files according to the PDF specification: these files open correctly and their text is searchable in many viewers. So this appears to be a bug not in GroupDocs.Viewer but in the library that Kentico CMS uses to extract text from PDF files.

The difference is that GroupDocs.Viewer writes a separate text operator for each word, whereas other tools write one for each string. I don’t think we can change this without significant changes to the APS model.
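
To illustrate one way per-word text operators can trip up a text extractor, here is a minimal Python sketch. It assumes a deliberately naive extractor that simply concatenates the strings passed to the PDF `Tj` (show text) operator and ignores positioning operators such as `Td`. This is only an assumption for illustration, not a description of what the library used by Kentico actually does, and the two content streams are hand-written examples rather than real GroupDocs output.

```python
import re

# Matches a literal string followed by the Tj (show text) operator, e.g. "(Hello) Tj".
# Simplification: ignores escaped parentheses, hex strings, TJ arrays and font encodings.
TJ_PATTERN = re.compile(r"\(([^()]*)\)\s*Tj")


def naive_extract(content_stream):
    """Concatenate Tj strings in stream order, ignoring all positioning operators."""
    return "".join(TJ_PATTERN.findall(content_stream))


# One text operator for the whole string: the space is part of the string itself.
per_string = "BT /F1 12 Tf 72 720 Td (Atqui electram) Tj ET"

# One text operator per word: the visible gap comes only from the Td move.
per_word = "BT /F1 12 Tf 72 720 Td (Atqui) Tj 40 0 Td (electram) Tj ET"

print(naive_extract(per_string))  # Atqui electram
print(naive_extract(per_word))    # Atquielectram
```

Both streams would look much the same in a viewer, which places glyphs by position, but an extractor that does not infer spaces from positions would merge the words in the second case, and a search for either word on its own would then find nothing.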

Please try a different tool for extracting text from PDF files, if that is possible. Alternatively, ask Kentico to provide support for this problem.

Best regards.