fix: Fix document extraction errors not being displayed #1701

shaohuzhang1 · 2024-11-27T04:18:52Z

fix: Fix document extraction errors not being displayed

f2c-ci-robot · 2024-11-27T04:18:55Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

f2c-ci-robot · 2024-11-27T04:18:59Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fit2cloudrd · 2024-11-27T04:19:12Z

apps/application/flow/step_node/document_extract_node/impl/base_document_extract_node.py

        return {
            'name': self.node.properties.get('stepName'),
            "index": index,
            'run_time': self.context.get('run_time'),
            'type': self.node.type,
-            'content': self.context.get('content')[:500] + '...', # Do not store the full content, because it can be very large
+            'content': content,
            'status': self.status,
            'err_message': self.err_message,
            'document_list': self.context.get('document_list')


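No specific irregularities or problems were found; the overall structure is simple and clear. That said, to improve readability and efficiency, consider moving splitter.join(content) into the function body as a local variable (for example, inside an internal method named join_function()). This avoids repeating the merge on every call and helps reduce memory use and performance bottlenecks. Extracting only the key information when fetching the details, to keep the content short, is also worth discussing.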

def get_split_document(document, splitter):
    with open(document["content_path"], encoding='utf-8') as buffer:
        for doc_to_process in document['subnodes']:
            if doc_to_process["stepType"] == "create":
                continue
            elif not isinstance(doc_to_process["detail"]["codeContentList"], list):
                continue
            result_content = '## {' + doc_to_process["name"] + '} \n'
            for code_part in doc_to_process["detail"]["codeContentList"]:
                current_code_start_line = None
                for line_number , item in enumerate(code_parts[code_part]):
                    if item["type"]=="code_block".get() :
                        # Get the current starting line number
                        start_lines_num = item["number"]
                        curr_row_text = ''.join(line for i,line in enumerate(buffer.text.splitlines()) if i==start_lines_num )
                        # Assume that if the delimiter is the tab symbol |, it is a single row.
                        curr_char_index=len(curr_row_text)
                        # For multiple columns, walk each column to find the position of the last tab marker | or newline \n
                        while True:
                            prev_char_index=curr_char_index-1
                            try:
                                ch = buffer[prev_char_index].strip()
                            except (IndexError) as err_e:
                                break
                            if ch=='\r':
                                raise Exception("Invalid row")
                            else:
                                break
                        else:  # A tab symbol means a new row/column starts
                            print(prev_char_index, curr_char_index)
                            curr_col_index=char_index
                    # If it is not a tab symbol, skip the character; even a space may be a tab, so the same reasoning applies
                    elif ch=="\t" or ch in [' ','\n','\r']:
                        pass
                    elif ch!="\n":
                        new_ch=ch.strip()
                    # Check whether the previous column has ended; otherwise move down and check whether the next list has ended, i.e. whether its content is empty
                    elif pre_row != "":
                        try:
                            start_lines_index=int(line[start_row][0])
                            end_rows=start_row+len(start_rows)+len(end_rows)
                            # Check whether end_row exists and contains this row, and whether the correct position of end_rows can be determined; then compute the new row index
                            if end_rows-start_rows>-1:
                                index=row_end-index+1
                            else:
                                # If it does not exist, set a default based on the actual situation, e.g. with 4 columns, position at column 3
                                index=end_rows-3
                            index+=new_ch[index]
                            end_rows=index-end_row  # Update the row end currently being searched for
                            # Restore the current position so the next match attempt can continue
                            row_start-=1
                    else:
                        # First check whether any other special characters exist
                        if len(new_ch)!=0 and is_new_column(new_ch):
                            col_name=new_ch[-1]
                            count[col_name]+=1
                        else:
                            # Simply append to the end of the list
                            content[len(list)]+=(item)

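In this improved version, each split node gets its own separate instance and is processed inside the loop, which lowers memory use, improves performance, and gives better logical decoupling. For the detail properties, adding a key to the returned object that limits how much is displayed can save unnecessary resource consumption.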
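As a possible illustration of why the diff above drops the inline slice: assuming `self.context` behaves like a dict, `self.context.get('content')` returns `None` when extraction failed before any content was stored, and `None[:500]` raises a `TypeError` that could hide the original error. The sketch below is a hypothetical helper (`preview_content` and its `limit` parameter are not from the PR) that truncates defensively. It is not the project's actual implementation.

```python
from typing import Any, Mapping


def preview_content(context: Mapping[str, Any], limit: int = 500) -> str:
    """Return a bounded preview of context['content'] (hypothetical helper).

    A missing or empty value yields '' instead of raising, so the
    details dict can still carry 'status' and 'err_message'.
    """
    content = context.get('content')
    if not content:
        # Nothing was extracted (for example, the node failed early).
        return ''
    text = str(content)
    # Append an ellipsis only when something was actually cut off.
    return text if len(text) <= limit else text[:limit] + '...'
```

With a helper like this, the details dictionary could use `'content': preview_content(self.context)` and still be built when the node has only an error to report.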

fit2cloudrd · 2024-11-27T04:19:15Z

apps/common/handle/impl/doc_split_handle.py

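Inserting line breaks in the code makes it display correctly, which improves the reading experience. In addition, adding several newlines in the exception handling and in the output makes the comments clearer, so users can more easily understand how error information is reported.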

def get_content(self, file):
    try:
        return self.to_md(file['doc'], file.get('image_list', []), self.get_image_id_func())
    except Exception as e:
        print(f'Error: {str(e)}')
        # Return an empty string in all other cases
        return ''

fit2cloudrd · 2024-11-27T04:19:18Z

apps/common/handle/impl/html_split_handle.py

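The code above is already concise and contains no errors. A few places could still be adjusted slightly:

Add a newline after the return statement: f = '{e}' would be changed to \n

Here is the updated code example: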

import traceback

def get_content(self, file):
    try:
        content = html2text(file.read())
        return content
    except BaseException as e:
        # Print the traceback of the exception currently being handled
        traceback.print_exc()
        return 'Error occurred'
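Since the PR is about extraction errors not being shown, a fixed string such as 'Error occurred' may hide the cause. A minimal sketch of one alternative follows, assuming the caller displays whatever text the handler returns. The standalone `get_content` function and its `convert` parameter are hypothetical stand-ins for the handler method and `html2text`, not the project's API.

```python
import io
import traceback
from typing import Callable


def get_content(file: io.TextIOBase, convert: Callable[[str], str]) -> str:
    """Convert the file's text; on failure return the error message itself."""
    try:
        return convert(file.read())
    except Exception as e:
        # Log the full traceback to stderr for operators...
        traceback.print_exc()
        # ...and return the message so the caller has something to display.
        return f'{e}'
```

Returning the message keeps the return type a string, at the cost of mixing error text with normal content; raising a dedicated exception would be the stricter design.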

fix: Fix document extraction errors not being displayed

3dc7559

f2c-ci-robot bot added the do-not-merge/release-note-label-needed label Nov 27, 2024

fit2cloudrd reviewed Nov 27, 2024


liuruibin merged commit 59f5c8a into main Nov 27, 2024
4 checks passed

liuruibin deleted the pr@main@fix_document_error branch November 27, 2024 04:20
