Parsing eval: compute layout metrics #3550

Open
jacopo-chevallard opened this issue Jan 22, 2025 — with Linear · 1 comment

jacopo-chevallard commented Jan 22, 2025

Note that the ground truth is based on the image PDFs, most of which have abandon areas, i.e. some special graphics in the headers and footers have been removed from the image PDFs. This means that when we compare the extracted and ground-truth layouts based on the native PDFs, we should expect some differences in the header and footer areas.

Starting from the ground-truth layout, which, for each PDF page, looks like

{
  "extra": {
    "relation": [
      {
        "relation_type": "parent_son",
        "source_anno_id": 2,
        "target_anno_id": 3
      },
      {
        "relation_type": "parent_son",
        "source_anno_id": 5,
        "target_anno_id": 8
      }
    ]
  },
  "layout_dets": [
    {
      "anno_id": 6,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "title",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            109.3333333333331,
            121.73651418039208,
            722.1022134807848,
            121.73651418039208,
            722.1022134807848,
            195.75809149176507,
            109.3333333333331,
            195.75809149176507
          ],
          "text": "国资背景基金情况"
        }
      ],
      "order": 1,
      "poly": [
        102.5999912116609,
        120.87255879760278,
        719.3118659856144,
        120.87255879760278,
        719.3118659856144,
        194.14083813380114,
        102.5999912116609,
        194.14083813380114
      ],
      "text": "国资背景基金情况"
    },
    {
      "anno_id": 4,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            99.66504579139392,
            227.6650457913944,
            1269.333333333333,
            227.6650457913944,
            1269.333333333333,
            271.3365750838786,
            99.66504579139392,
            271.3365750838786
          ],
          "text": "2022年备案基金规模小幅回升,但仍未恢复至资管新规出台前的水平"
        }
      ],
      "order": 2,
      "poly": [
        97.71487020898245,
        226.92028692633914,
        1271.9932332148471,
        226.92028692633914,
        1271.9932332148471,
        264.88925750697814,
        97.71487020898245,
        264.88925750697814
      ],
      "text": "2022年备案基金规模小幅回升,但仍未恢复至资管新规出台前的水平"
    },
    {
      "anno_id": 3,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "figure_caption",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            253.94664201855937,
            321.21295194692755,
            1076.1203813864063,
            321.21295194692755,
            1076.1203813864063,
            364.93470762745034,
            253.94664201855937,
            364.93470762745034
          ],
          "text": "2014年-2023Q3国资背景基金的备案数量及规模"
        }
      ],
      "order": 3,
      "poly": [
        246.96994018554688,
        318.7444152832031,
        1088.26025390625,
        318.7444152832031,
        1088.26025390625,
        369.0964660644531,
        246.96994018554688,
        369.0964660644531
      ],
      "text": "2014年-2023Q3国资背景基金的备案数量及规模"
    },
    {
      "anno_id": 2,
      "category_type": "figure",
      "ignore": false,
      "order": 4,
      "poly": [
        118.08102792118407,
        379.29373168945347,
        1299.4279383691976,
        379.29373168945347,
        1299.4279383691976,
        1028.2773128579047,
        118.08102792118407,
        1028.2773128579047
      ]
    },
    {
      "anno_id": 8,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "figure_caption",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            1509.6758069519938,
            324.34247361866034,
            2292.4771492866826,
            324.34247361866034,
            2292.4771492866826,
            364.8196229053426,
            1509.6758069519938,
            364.8196229053426
          ],
          "text": "2014年-2023Q3国资背景基金数量TOP10地区"
        }
      ],
      "order": 5,
      "poly": [
        1497.726318359375,
        318.7418518066406,
        2301.80224609375,
        318.7418518066406,
        2301.80224609375,
        367.1272888183594,
        1497.726318359375,
        367.1272888183594
      ],
      "text": "2014年-2023Q3国资背景基金数量TOP10地区"
    },
    {
      "anno_id": 5,
      "category_type": "figure",
      "ignore": false,
      "order": 6,
      "poly": [
        1370.0374839590943,
        424.35013794251097,
        2552.3561471143494,
        424.35013794251097,
        2552.3561471143494,
        1026.8955618700252,
        1370.0374839590943,
        1026.8955618700252
      ]
    },
    {
      "anno_id": 9,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "title",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            169.67751098302242,
            1071.225836994341,
            328.08580770628134,
            1071.225836994341,
            328.08580770628134,
            1111.655822350311,
            169.67751098302242,
            1111.655822350311
          ],
          "text": "核心发现"
        }
      ],
      "order": 7,
      "poly": [
        170.92340081387997,
        1069.7956822171332,
        326.21460986860313,
        1069.7956822171332,
        326.21460986860313,
        1111.7494049722532,
        170.92340081387997,
        1111.7494049722532
      ],
      "text": "核心发现"
    },
    {
      "anno_id": 7,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            165.603649650326,
            1150.009124125815,
            2509.333333333333,
            1150.009124125815,
            2509.333333333333,
            1198.666666666666,
            165.603649650326,
            1198.666666666666
          ],
          "text": "- 2018年4月资管新规出台后,国资背景基金备案数量增速放缓且规模骤减,受新冠疫情影响,2021年新增基金规模再次下降,虽然"
        },
        {
          "category_type": "text_span",
          "poly": [
            219.22996126565647,
            1201.1457902508969,
            2250.770752144285,
            1201.1457902508969,
            2250.770752144285,
            1243.9433217869077,
            219.22996126565647,
            1243.9433217869077
          ],
          "text": "2022年基金规模回升至1.25万亿元,但仍未恢复至资管新规出台前的水平,2023前三季度新增规模略低于2022年同期。"
        }
      ],
      "order": 8,
      "poly": [
        172.66793877059249,
        1155.2640660519091,
        2514.2408071863138,
        1155.2640660519091,
        2514.2408071863138,
        1241.6284871157177,
        172.66793877059249,
        1241.6284871157177
      ],
      "text": "- 2018年4月资管新规出台后,国资背景基金备案数量增速放缓且规模骤减,受新冠疫情影响,2021年新增基金规模再次下降,虽然 2022年基金规模回升至1.25万亿元,但仍未恢复至资管新规出台前的水平,2023前三季度新增规模略低于2022年同期。"
    },
    {
      "anno_id": 1,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            161.7899369148969,
            1278.308761376868,
            2508,
            1278.308761376868,
            2508,
            1317.333333333333,
            161.7899369148969,
            1317.333333333333
          ],
          "text": "- 截至2023Q3全国国资背景基金备案数量累计9196只,基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省,广东"
        },
        {
          "category_type": "text_span",
          "poly": [
            222.66666666666688,
            1325.3333333333335,
            1623.8331583485456,
            1325.3333333333335,
            1623.8331583485456,
            1365.333333333333,
            222.66666666666688,
            1365.333333333333
          ],
          "text": "省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的"
        },
        {
          "category_type": "equation_ignore",
          "poly": [
            1624.4165959289367,
            1327.0154193159506,
            1703.7259660435407,
            1327.0154193159506,
            1703.7259660435407,
            1363.1237504250385,
            1624.4165959289367,
            1363.1237504250385
          ],
          "text": "73%"
        },
        {
          "category_type": "text_span",
          "poly": [
            1704.6905743174548,
            1322.6134268787764,
            2053.985160092844,
            1322.6134268787764,
            2053.985160092844,
            1370.6736155849724,
            1704.6905743174548,
            1370.6736155849724
          ],
          "text": ",规模占全国总量的"
        },
        {
          "category_type": "equation_ignore",
          "poly": [
            2055.1374027302004,
            1326.3706276890023,
            2149.276980264608,
            1326.3706276890023,
            2149.276980264608,
            1365.7029169328305,
            2055.1374027302004,
            1365.7029169328305
          ],
          "text": "68%。"
        }
      ],
      "order": 9,
      "poly": [
        171.69999831539863,
        1278.820932742719,
        2512.084408886781,
        1278.820932742719,
        2512.084408886781,
        1365.690053585406,
        171.69999831539863,
        1365.690053585406
      ],
      "text": "- 截至2023Q3全国国资背景基金备案数量累计9196只,基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省,广东省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的 73% ,规模占全国总量的 68%。"
    },
    {
      "anno_id": 10,
      "category_type": "abandon",
      "ignore": false,
      "order": null,
      "poly": [
        114.12910090860571,
        1403.1676953230935,
        175.21358196554792,
        1403.1676953230935,
        175.21358196554792,
        1462.6586681785502,
        114.12910090860571,
        1462.6586681785502
      ]
    },
    {
      "anno_id": 0,
      "attribute": {
        "text_background": "white",
        "text_language": "text_en_ch_mixed",
        "text_rotate": "normal"
      },
      "category_type": "footer",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            178.18192276049803,
            1409.8767302579377,
            288.0868232114207,
            1409.8767302579377,
            288.0868232114207,
            1467.2607048296584,
            178.18192276049803,
            1467.2607048296584
          ],
          "text": "CVINFO 投中信息"
        }
      ],
      "order": null,
      "poly": [
        180.18207532211585,
        1404.2778174322868,
        289.9793827860912,
        1404.2778174322868,
        289.9793827860912,
        1462.652231000048,
        180.18207532211585,
        1462.652231000048
      ],
      "text": "CVINFO 投中信息"
    }
  ],
  "page_info": {
    "height": 1500,
    "image_path": "eastmoney_59cde7e939acc3124df9d3f2c85b5a0ec41b9da1157d5be38e098672022b47cb.pdf_11.jpg",
    "page_attribute": {
      "data_source": "PPT2PDF",
      "language": "simplified_chinese",
      "layout": "1andmore_column",
      "special_issue": [
        "watermark"
      ]
    },
    "page_no": 11,
    "width": 2667
  }
}

From the layout_dets array we extract, for each entry, the category_type together with its poly, which encodes the position as the (x, y) coordinates of the top-left, top-right, bottom-right and bottom-left corners of the bounding box, in that order.
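The extraction step above could be sketched as follows. This is a minimal sketch, not the actual implementation: the function names `poly_to_bbox` and `extract_blocks` are hypothetical, and it assumes each page has already been loaded into a Python dict with the structure shown above. Reducing the eight-value poly to an axis-aligned (x_min, y_min, x_max, y_max) tuple is also an assumption, made to simplify later comparisons.

```python
from typing import Any

BBox = tuple[float, float, float, float]


def poly_to_bbox(poly: list[float]) -> BBox:
    """Reduce an 8-value poly [x1, y1, x2, y2, x3, y3, x4, y4] to (x_min, y_min, x_max, y_max)."""
    xs = poly[0::2]  # x coordinates of the four corners
    ys = poly[1::2]  # y coordinates of the four corners
    return (min(xs), min(ys), max(xs), max(ys))


def extract_blocks(page: dict[str, Any]) -> list[tuple[str, BBox]]:
    """Return (category_type, bbox) for every entry of layout_dets (hypothetical helper)."""
    return [
        (det["category_type"], poly_to_bbox(det["poly"]))
        for det in page.get("layout_dets", [])
    ]
```

Span-level entries inside line_with_spans are deliberately not visited here, since only block-level boxes are compared.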

category_type can be one of

# Block level annotation boxes
'title',              # Title
'text_block',         # Paragraph level plain text
'figure',             # Figure type
'figure_caption',     # Figure description/title
'figure_footnote',    # Figure notes
'table',              # Table body
'table_caption',      # Table description/title
'table_footnote',     # Table notes
'equation_isolated',  # Display formula
'equation_caption',   # Formula number
'header',             # Header
'footer',             # Footer
'page_number',        # Page number
'page_footnote',      # Page notes
'abandon',            # Other discarded content (e.g. irrelevant information in middle of page)
'code_txt',           # Code block
'code_txt_caption',   # Code block description
'reference',          # References

We extract the same information from the Megaparse output, i.e. for each page we extract the element type and the bounding box, group the pages per document type, per language, per layout type, and compute:

  • fraction of correctly extracted blocks in each block category. A block is correctly extracted if
    • the normalized category matches the ground truth (we need to establish the correspondence between the Megaparse categories and those above)
    • AND the bounding box coordinates match within some errors. We can set the error to 1 pixel initially and refine (increase/decrease) the error after the first tests.
  • average fraction of correctly extracted blocks, i.e. the average of the per-category fractions (this means that each block category will contribute equally to the metric)
  • fraction of correctly extracted blocks across all categories (more numerous blocks, likely text blocks, will contribute more to the metric)
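The three metrics listed above could be computed along these lines. This is only a sketch under stated assumptions: `boxes_match` and `layout_metrics` are hypothetical names, blocks are given as (category, (x_min, y_min, x_max, y_max)) tuples with categories already normalized to the ground-truth names, and each predicted block may be matched to at most one ground-truth block (a greedy one-to-one matching, which the issue does not specify).

```python
from collections import defaultdict

BBox = tuple[float, float, float, float]
Block = tuple[str, BBox]


def boxes_match(a: BBox, b: BBox, tol: float = 1.0) -> bool:
    """True if every bounding-box coordinate agrees within tol pixels."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


def layout_metrics(gt_blocks: list[Block], pred_blocks: list[Block], tol: float = 1.0):
    """Return (per-category fractions, average over categories, fraction over all blocks)."""
    total: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    used: set[int] = set()  # indices of predicted blocks already matched
    for cat, box in gt_blocks:
        total[cat] += 1
        for i, (pred_cat, pred_box) in enumerate(pred_blocks):
            if i not in used and pred_cat == cat and boxes_match(box, pred_box, tol):
                used.add(i)
                correct[cat] += 1
                break
    per_category = {c: correct[c] / total[c] for c in total}
    # Each category contributes equally to the average over categories.
    macro = sum(per_category.values()) / len(per_category) if per_category else 0.0
    # More numerous categories weigh more in the fraction over all blocks.
    micro = sum(correct.values()) / sum(total.values()) if total else 0.0
    return per_category, macro, micro
```

The `tol` parameter corresponds to the error threshold, set to 1 pixel by default so it can be loosened or tightened after the first tests.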

We can also compute the metrics above across all document types.
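Grouping the pages before computing the metrics might look like the sketch below. It assumes that "document type", "language" and "layout type" correspond to the data_source, language and layout fields under page_info.page_attribute in the ground truth; that mapping is an assumption, as are the function names `group_key` and `group_pages`.

```python
from collections import defaultdict
from typing import Any


def group_key(page: dict[str, Any]) -> tuple[str, str, str]:
    """Build the (data_source, language, layout) key from the ground-truth page attributes."""
    attr = page["page_info"]["page_attribute"]
    return (attr["data_source"], attr["language"], attr["layout"])


def group_pages(pages: list[dict[str, Any]]) -> dict[tuple[str, str, str], list[dict[str, Any]]]:
    """Bucket pages by their group key so metrics can be computed per group."""
    groups: dict[tuple[str, str, str], list[dict[str, Any]]] = defaultdict(list)
    for page in pages:
        groups[group_key(page)].append(page)
    return dict(groups)
```

Computing the metrics across all document types then amounts to using a key that drops the first element of the tuple.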

@jacopo-chevallard jacopo-chevallard self-assigned this Jan 22, 2025
@jacopo-chevallard jacopo-chevallard changed the title Implement layout metrics Parsing: layout metrics Jan 22, 2025
@jacopo-chevallard jacopo-chevallard changed the title Parsing: layout metrics Parsing: compute layout metrics Jan 28, 2025
@jacopo-chevallard jacopo-chevallard changed the title Parsing: compute layout metrics Parsing eval: compute layout metrics Jan 28, 2025