GroupDocs: extracting section-based content

Dilip0527 · January 28, 2021, 5:23am

Hi,

We have a requirement to extract section-based content from a .docx document which is converted by GroupDocs. Can anyone guide me on how to do this?

atir.tahir · January 28, 2021, 6:22am

@Dilip0527

You can extract content on the basis of following sections:

There are many other categories; please have a look at the advanced usage documentation articles.

Could you please elaborate on this? Do you want to extract content from a Word document that was converted from a PDF or another file format using the GroupDocs.Conversion API?

Dilip0527 · January 28, 2021, 7:16am

Hi,

In our application, we upload a PDF document as the input file. In the back end (Java), we need to convert the uploaded PDF to .docx format (via any library; as of now we are using online converters).
We then need to read the converted .docx file by iterating over its body elements (such as tables and paragraphs) and extract the content section by section. For example, for a heading called TITLE PAGE, we need to extract all the content available under that heading.
I have also attached our sample document here.

Symbiance-003.pdf (689.7 KB)

I hope this is understandable; if not, please feel free to reach out to me.

Thanks

atir.tahir · January 28, 2021, 10:17am

@Dilip0527

For this purpose, you could use the GroupDocs.Conversion for Java on-premises API. You may find the following resources helpful:

Open source example project
PDF to Word conversion code
Developers guide - Documentation

We’re investigating this use-case at our end. Your investigation ticket ID is PARSERJAVA-213. You’ll be notified in case of any progress update.

atir.tahir · August 10, 2021, 5:19pm

@Dilip0527

There’s an update on PARSERJAVA-213. To iterate over the document structure, the "Extract text structure" feature is used. Here’s the code implementation. Let us know if you need any further details on this topic.
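
For readers following this thread, here is a minimal, library-independent sketch of the section-wise grouping described above: walk the body elements in document order and collect every paragraph or table under the most recent heading. This is not the GroupDocs API. The `BodyElement` and `SectionGrouper` classes and the sample texts are hypothetical stand-ins for whatever element model your parsing library exposes, and the sketch assumes the library can already tell headings apart from ordinary paragraphs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of one body element (not a GroupDocs type).
final class BodyElement {
    enum Kind { HEADING, PARAGRAPH, TABLE }

    final Kind kind;
    final String text;

    BodyElement(Kind kind, String text) {
        this.kind = kind;
        this.text = text;
    }
}

final class SectionGrouper {
    // Bucket for content that appears before the first heading.
    static final String NO_HEADING = "(no heading)";

    // Groups non-heading elements under the most recent heading,
    // preserving document order. Repeated heading texts share one bucket.
    static Map<String, List<String>> groupByHeading(List<BodyElement> body) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = NO_HEADING;
        for (BodyElement element : body) {
            if (element.kind == BodyElement.Kind.HEADING) {
                current = element.text.trim();
                // Register the heading even if no content follows it.
                sections.computeIfAbsent(current, k -> new ArrayList<>());
            } else {
                sections.computeIfAbsent(current, k -> new ArrayList<>())
                        .add(element.text);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        // Sample content is made up for illustration.
        List<BodyElement> body = List.of(
            new BodyElement(BodyElement.Kind.HEADING, "TITLE PAGE"),
            new BodyElement(BodyElement.Kind.PARAGRAPH, "Study title"),
            new BodyElement(BodyElement.Kind.TABLE, "Sponsor | Version"),
            new BodyElement(BodyElement.Kind.HEADING, "SYNOPSIS"),
            new BodyElement(BodyElement.Kind.PARAGRAPH, "Short summary"));

        groupByHeading(body).forEach((heading, content) ->
            System.out.println(heading + " -> " + content));
    }
}
```

Calling `SectionGrouper.groupByHeading(body).get("TITLE PAGE")` then returns everything collected under that heading. In a real pipeline, the list of elements would be built from the parser's output rather than by hand.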