PDF to Docx Conversion - paragraph splitting issue

Karthik_Nair · August 21, 2020, 7:15am

Hey !

I am trying to convert a PDF to Docx using GroupDocs.Conversion. The conversion looks good overall, but each line is converted into a separate paragraph (see the attached screenshots). Is there a way to group the run objects that fall within a particular frame (rectangle)? I am attaching the relevant screenshots and the PDF document for your reference. Please advise.

Thank You !
GroupDocs Output:
each_line_paragraph.png (78.3 KB)

Ideal Output (Manual):
ideal_para_split.png (81.0 KB)

PDF:
PFTF_201.pdf (492.7 KB)

atir.tahir · August 21, 2020, 10:20am

@Karthik_Nair

Please share the following details and we’ll investigate this issue:

API version (e.g. 20.2, 20.7) and variant (Java or .NET) that you are evaluating
Sample conversion code

Karthik_Nair · August 26, 2020, 6:49pm

package temp.testing;

import java.math.RoundingMode;
import java.text.DecimalFormat;
import java.util.HashMap;
import java.util.Map;

import org.json.JSONArray;
import org.json.JSONObject;

import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.PageTextArea;
import com.groupdocs.parser.data.Rectangle;

public class TestPosition {
	
	public static void main(String args[]) {
    	DecimalFormat df = new DecimalFormat("#.####");
    	df.setRoundingMode(RoundingMode.CEILING);
        try (Parser parser = new Parser(args[0])) {
            // Extract text areas
            Iterable<PageTextArea> areas = parser.getTextAreas();
            // Check if text areas extraction is supported
            JSONArray map = new JSONArray();
            if (areas == null) {
            	map.put("Error in AReas");
                System.out.println(map.toString(4));
                return;
            }
            // Iterate over page text areas
            for (PageTextArea a : areas) {
                // Collect the rectangle position, edges and size of this text area:
            	JSONObject details = new JSONObject();
            	details.put("pos_rect_x", a.getRectangle().getPosition().getX());
            	details.put("pos_rect_y", a.getRectangle().getPosition().getY());
            	details.put("x_left_edge", a.getRectangle().getLeft());
            	details.put("x_right_edge", a.getRectangle().getRight());
            	details.put("y_top_edge", a.getRectangle().getTop());
            	details.put("y_bot_edge", a.getRectangle().getBottom());
            	details.put("size_width",a.getRectangle().getSize().getWidth());
            	details.put("size_height",a.getRectangle().getSize().getHeight());
            	Map<String,JSONObject> newmap = new HashMap<String, JSONObject>();
            	
            	newmap.put(a.getText(), details);
            	map.put(newmap);
            }

            for (int i = 0; i < map.length(); i++) {
                JSONObject temp = map.getJSONObject(i);
                System.out.println(temp.toString(4));
                // Iterate through the elements of the array at index i.
                // Get their values.
                // Get the value for the first element and the value for the last element.
            }
        }
	}
}

The code above extracts the text areas, not the frames.

My requirement is to extract the frames (styles) and group the lines within each frame into paragraphs when they share the same font size/text style.
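To make the requirement concrete, here is a rough sketch of the grouping I have in mind. It does not use any GroupDocs API: `LineGrouper`, `Line` and `group` are hypothetical names, the top/bottom values are assumed to come from the text area rectangles with y increasing down the page, and line height is only an assumed stand-in for font size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of any GroupDocs API.
final class LineGrouper {

    // One extracted line of text with the edges of its bounding rectangle.
    static final class Line {
        final String text;
        final double left, top, bottom;

        Line(String text, double left, double top, double bottom) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.bottom = bottom;
        }

        // Assumption: rectangle height approximates the font size of the line.
        double height() {
            return bottom - top;
        }
    }

    // Merges consecutive lines (assumed sorted top to bottom) into paragraphs.
    // A line joins the previous one when the vertical gap between them is at most
    // maxGapRatio times the previous line's height and the two heights differ by
    // at most heightTolerance.
    static List<String> group(List<Line> lines, double maxGapRatio, double heightTolerance) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : lines) {
            boolean sameParagraph = previous != null
                    && line.top - previous.bottom <= maxGapRatio * previous.height()
                    && Math.abs(line.height() - previous.height()) <= heightTolerance;
            if (previous != null && !sameParagraph) {
                // Close the paragraph collected so far and start a new one.
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text);
            previous = line;
        }
        if (previous != null) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

Calling `LineGrouper.group(lines, 0.5, 1.0)` would merge lines whose gap is at most half a line height and whose heights differ by at most one unit. Both thresholds are guesses that would need tuning for a given document.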

I have attached the relevant screenshots, and the ideal-case screenshot is attached above as well. Please advise.
Screenshot from 2020-08-27 00-16-30.png (76.5 KB)

Thank You

atir.tahir · August 26, 2020, 7:26pm

@Karthik_Nair

Thank you for the details. We are investigating this scenario at our end under the ID CONVERSIONJAVA-1074. You’ll be notified as soon as there’s an update.

aspose.notifier · February 1, 2021, 2:45pm

The issues you have found earlier (filed as CONVERSIONJAVA-1074) have been fixed in this update. This message was posted using Bugs notification tool by Atir_Tahir