Can structured data be extracted from a Word document? #370

Open
sangeren opened this issue Aug 18, 2024 · 3 comments

Comments

@sangeren

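For example, suppose a Word exam paper contains 10 questions, each with a question and an answer. Can the document be parsed into structured data for the 10 questions, including the question text and the answer of each one? Thanks.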

@AlexNosk

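@sangeren You can analyze the document structure. Please see our documentation to learn more about the Aspose.Words Document Object Model:
https://docs.aspose.com/words/net/aspose-words-document-object-model/

PS: The main place to get support is the Aspose.Words support forum:
https://forum.aspose.com/c/words/8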
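
For the exam example, analyzing the structure could come down to grouping paragraph texts into question/answer pairs. The following is only a minimal sketch under assumptions that are not stated in this thread: each question paragraph starts with a typed number such as "1." and each answer paragraph starts with "Answer:" (or "答案:"). `ExamItem` and `ExamParser` are hypothetical names. The paragraph texts are passed in as plain strings; with Aspose.Words they could be collected via `doc.GetChildNodes(NodeType.Paragraph, true)` and `ToString(SaveFormat.Text)`. If the numbers come from Word's automatic list numbering, they are not part of the paragraph text, so a different signal (for example `Paragraph.IsListItem`) would be needed.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed exam question.
public sealed class ExamItem
{
    public string Question = "";
    public string Answer = "";
}

public static class ExamParser
{
    // Assumed conventions: "<number>." opens a question, "Answer:" opens its answer.
    private static readonly Regex QuestionStart = new Regex(@"^\s*\d+\s*[.、]\s*");
    private static readonly Regex AnswerStart = new Regex(@"^\s*(Answer|答案)\s*[::]\s*");

    // paragraphs: plain text of each paragraph, in document order.
    public static List<ExamItem> Parse(IEnumerable<string> paragraphs)
    {
        var items = new List<ExamItem>();
        ExamItem current = null;
        bool inAnswer = false;

        foreach (string text in paragraphs)
        {
            if (string.IsNullOrWhiteSpace(text))
                continue;

            if (QuestionStart.IsMatch(text))
            {
                // A numbered paragraph starts a new item.
                current = new ExamItem { Question = QuestionStart.Replace(text, "").Trim() };
                items.Add(current);
                inAnswer = false;
            }
            else if (current == null)
            {
                continue; // text before the first question (title, instructions)
            }
            else if (AnswerStart.IsMatch(text))
            {
                current.Answer = AnswerStart.Replace(text, "").Trim();
                inAnswer = true;
            }
            else if (inAnswer)
            {
                current.Answer += "\n" + text.Trim(); // continuation of the answer
            }
            else
            {
                current.Question += "\n" + text.Trim(); // continuation of the question
            }
        }
        return items;
    }
}
```

The resulting list can then be serialized with any JSON library. The two regular expressions are the part most likely to need adjusting to the real layout of the exam paper.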

@verydemo

verydemo commented Nov 4, 2024

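I want to output structured data as JSON, preserving the Word outline levels, paragraph content, tables, and images; styles do not need to be preserved. I tried converting Word to HTML and then HTML to JSON, but the result contains only text, with no outline, tables, or images. The code I used is as follows: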

Document doc = new Document("input.docx");
doc.Save("output.html", Aspose.Words.SaveFormat.Html);
var workbook = new Workbook("output.html");
workbook.Save("output.json", Aspose.Cells.SaveFormat.Json);

@AlexNosk

AlexNosk commented Nov 4, 2024

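@verydemo There is no built-in method for converting a document to JSON. However, you can achieve this by traversing the document object model, which is described here:
https://docs.aspose.com/words/java/aspose-words-document-object-model/
The following simplified code converts a document to JSON: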

private static string GetJson(Document doc)
{
    StringBuilder sb = new StringBuilder();
    int indent = 1;
    sb.Append(OpenJson());
    sb.Append(OpenElement(doc, indent++));
    foreach (Section section in doc.Sections)
    {
        sb.Append(OpenElement(section, indent++));
        HandleContainer(sb, section.Body, ref indent);
        sb.Append(CloseElement(--indent, (section.NextSibling == null)));
    }
    sb.Append(CloseElement(--indent, true));
    sb.Append(CloseElement(0, true));
    return sb.ToString();
}

private static void HandleContainer(StringBuilder sb, CompositeNode container, ref int indent)
{
    if (!container.HasChildNodes)
        sb.Append(OpenAndCloseElement(container, indent, (container.NextSibling == null)));
    else
    {
        sb.Append(OpenElement(container, indent++));
        foreach (Node node in container.ChildNodes)
        {
            CompositeNode childContainer = node as CompositeNode;
            if (childContainer != null)
                HandleContainer(sb, childContainer, ref indent);
            else
                HandleNode(sb, node, ref indent);
        }
        sb.Append(CloseElement(--indent, (container.NextSibling == null)));
    }
}

private static void HandleNode(StringBuilder sb, Node node, ref int indent)
{
    switch (node.NodeType)
    {
        case NodeType.Run:
            sb.Append(OpenElement(node, indent++));
            Run run = node as Run;
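            // For shorter output, only the run text is written. To include font details,
            // uncomment the two lines below and remove the final WriteElement call.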
            {
                //sb.Append(WriteElement(nameof(run.Text), run.Text, indent, false));
                //HandleFont(sb, run.Font, ref indent, true);
                sb.Append(WriteElement(nameof(run.Text), run.Text, indent, true));
            }
            sb.Append(CloseElement(--indent, (node.NextSibling == null)));
            break;
        default:
            break;
    }
}

private static void HandleFont(StringBuilder sb, Font font, ref int indent, bool isLast)
{
    sb.Append(OpenElement("Font", indent++));
    sb.Append(WriteElement(nameof(font.Name), font.Name, indent, false));
    sb.Append(WriteElement(nameof(font.Size), font.Size, indent, true));
    sb.Append(CloseElement(--indent, isLast));
}

private static string OpenJson() { return "{\n"; }
private static string OpenElement(Node node, int indent) { return OpenElement(GetNodeName(node), indent); }
private static string OpenElement(string name, int indent) { return GetIndent(indent) + GetQuoted(name) + " : {\n"; }
private static string OpenAndCloseElement(Node node, int indent, bool isLast)
{
    return GetIndent(indent) + GetQuoted(GetNodeName(node)) + " : { }" + GetComma(isLast) + "\n";
}
private static string WriteElement(string name, string value, int indent, bool isLast)
{
    return GetIndent(indent) + GetQuoted(name) + " : " + GetQuoted(value) + GetComma(isLast) + "\n";
}
private static string WriteElement(string name, double value, int indent, bool isLast)
{
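    // Format with the invariant culture so the decimal separator is always '.', as JSON requires.
    return GetIndent(indent) + GetQuoted(name) + " : " + value.ToString(System.Globalization.CultureInfo.InvariantCulture) + GetComma(isLast) + "\n";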
}
private static string CloseElement(int indent, bool isLast) { return GetIndent(indent) + "}" + GetComma(isLast) + "\n"; }
private static string GetNodeName(Node node) { return node.NodeType.ToString(); }
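// Escape backslashes, quotes, and common control characters so that run text yields valid JSON strings.
private static string GetQuoted(string value)
{
    return "\"" + value.Replace("\\", "\\\\").Replace("\"", "\\\"").Replace("\r", "\\r").Replace("\n", "\\n").Replace("\t", "\\t") + "\"";
}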
private static string GetComma(bool isLastElement) { return isLastElement ? string.Empty : ","; }
private static string GetIndent(int indent) { return new string(' ', indent * 2); }

Alternatively, you can use DocumentVisitor to write the document structure into your own custom format.
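
A DocumentVisitor-based approach might look roughly like the sketch below. It is not an official Aspose sample. `StructureCollector`, `ToJson`, and the JSON field names (`type`, `outlineLevel`, `text`, `rows`, `name`, `imageType`) are hypothetical, and serialization via `System.Text.Json` is an assumption about the target runtime. The sketch records each non-empty paragraph with its outline level, each table as rows of cell text, and each image as a placeholder entry. It deliberately ignores nested tables, treats header/footer and text-box paragraphs like body paragraphs, and includes field-code runs in the text.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.Json;
using Aspose.Words;
using Aspose.Words.Drawing;
using Aspose.Words.Tables;

// Hypothetical helper: flattens a document into paragraph, table, and image blocks.
public class StructureCollector : DocumentVisitor
{
    public readonly List<object> Blocks = new List<object>();
    private readonly StringBuilder text = new StringBuilder();
    private List<List<string>> rows; // non-null only while inside a table

    public override VisitorAction VisitRun(Run run)
    {
        text.Append(run.Text);
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitParagraphEnd(Paragraph paragraph)
    {
        if (rows != null)
        {
            text.Append(' '); // inside a cell: paragraphs are joined into one cell string
            return VisitorAction.Continue;
        }
        string value = text.ToString().Trim();
        text.Clear();
        if (value.Length > 0)
            Blocks.Add(new { type = "paragraph", outlineLevel = paragraph.ParagraphFormat.OutlineLevel.ToString(), text = value });
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitTableStart(Table table)
    {
        rows = new List<List<string>>();
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitRowStart(Row row)
    {
        rows.Add(new List<string>());
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitCellStart(Cell cell)
    {
        text.Clear();
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitCellEnd(Cell cell)
    {
        rows[rows.Count - 1].Add(text.ToString().Trim());
        text.Clear();
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitTableEnd(Table table)
    {
        Blocks.Add(new { type = "table", rows });
        rows = null;
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitShapeStart(Shape shape)
    {
        // Only a placeholder is recorded; the image bytes themselves are not exported here.
        if (shape.HasImage)
            Blocks.Add(new { type = "image", name = shape.Name, imageType = shape.ImageData.ImageType.ToString() });
        return VisitorAction.Continue;
    }

    public static string ToJson(Document doc)
    {
        var collector = new StructureCollector();
        doc.Accept(collector);
        return JsonSerializer.Serialize(collector.Blocks, new JsonSerializerOptions { WriteIndented = true });
    }
}
```

Calling `StructureCollector.ToJson(new Document("input.docx"))` would then return the JSON text. Because paragraphs are added when they end, an inline image appears in the list just before the paragraph that contains it, and images inside a table appear before that table.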
