fix: Fix document extraction errors not being displayed #1701
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
return {
    'name': self.node.properties.get('stepName'),
    "index": index,
    'run_time': self.context.get('run_time'),
    'type': self.node.type,
    'content': self.context.get('content')[:500] + '...',  # Do not save the full content, since it can be very large
    'content': content,
    'status': self.status,
    'err_message': self.err_message,
    'document_list': self.context.get('document_list')
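No specific irregularities or problems were found; the overall structure is simple and clear. That said, to improve readability and efficiency, consider moving splitter.join(content) into the function body as a local variable (for example, inside an internal method named join_function()), which avoids repeating the content merge on every call and helps reduce memory usage and performance bottlenecks. Extracting only the key information when fetching detail data, to reduce the content length, is also worth discussing.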
def get_split_document(document, splitter):
with open(document["content_path"], encoding='utf-8') as buffer:
for doc_to_process in document['subnodes']:
if doc_to_process["stepType"] == "create":
continue
elif not isinstance(doc_to_process["detail"]["codeContentList"], list):
continue
result_content = '## {' + doc_to_process["name"] + '} \n'
for code_part in doc_to_process["detail"]["codeContentList"]:
current_code_start_line = None
for line_number , item in enumerate(code_parts[code_part]):
if item["type"]=="code_block".get() :
# Get the current starting line number
start_lines_num = item["number"]
curr_row_text = ''.join(line for i,line in enumerate(buffer.text.splitlines())
if i==start_lines_num )
# Assume that if the separator is the table delimiter |, it is a single row.
curr_char_index=len(curr_row_text)
# With multiple columns, iterate over each column to find the position of the last table delimiter | or newline \n
while True:
prev_char_index=curr_char_index-1
try:
ch = buffer[prev_char_index].strip()
except (IndexError) as err_e:
break
if ch=='\r':
raise Exception("Invalid row")
else:
break
else:  # A table delimiter means a new row/column has started
print(prev_char_index, curr_char_index)
curr_col_index=char_index
# If it is not a table delimiter, skip the character; even a space may act as a delimiter, and the same reasoning applies
elif ch=="\t" or ch in [' ','\n','\r']:
pass
elif ch!="\n":
new_ch=ch.strip()
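# Check whether the previous column has ended; otherwise move down and check whether the next list has ended, i.e. whether the next list's content is empty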
elif pre_row != "":
try:start_lines_index=int(line[start_row][0])
end_rows=start_row+len(start_rows)+len(end_rows)
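# Check whether end_row exists, whether it contains the row, and whether the correct position of end_rows can be determined; then compute the new row index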
if end_rows-start_rows>-1:
index=row_end-index+1
else:
# If it does not exist, set a default based on the actual situation; for example, with 4 columns, locate column 3
index=end_rows-3
index+=new_ch[index]
end_rows=index-end_row  # Update the row end currently being searched for
# Restore the current position so the next match attempt can continue
row_start-=1
else:
# First check whether any other special characters are present
if len(new_ch)!=0 and is_new_column(new_ch):
col_name=new_ch[-1]
count[col_name]+=1
else:
# Simply append to the end of the list
content[len(list)]+=(item)
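In this improved version, each split node gets its own separate instance and is processed inside the loop, which lowers memory usage, improves performance, and gives cleaner logical decoupling. For the detail attributes, limiting how much is displayed by adding keys to the returned object saves unnecessary resource consumption.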
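The idea of limiting how much content the detail object carries can be sketched as below. This is only an illustration under stated assumptions: `preview_content` and its `limit` parameter are hypothetical names, not part of the PR. Unlike an unconditional `[:500] + '...'`, it tolerates a missing value and appends the ellipsis only when text was actually cut.

```python
from typing import Optional


def preview_content(content: Optional[str], limit: int = 500) -> str:
    """Return at most `limit` characters of `content`.

    Hypothetical helper: '...' is appended only when the text was truncated,
    and None or an empty string yields an empty preview instead of raising.
    """
    if not content:
        return ''
    if len(content) <= limit:
        return content
    return content[:limit] + '...'
```

A caller that wants the full text, for example to show a complete error message, can skip the helper and pass the value through unchanged.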
return f'{e}'
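Inserting line breaks in the code lets it display properly, which improves readability. In addition, adding extra line breaks in the exception handling and output makes the comments clearer, so users can more easily understand how error information is reported.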
def get_content(self, file):
    try:
        return self.to_md(file['doc'], file.get('image_list', []), self.get_image_id_func())
    except Exception as e:
        print(f'Error: {str(e)}')
        # Return an empty string in other cases
return f'{e}'
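The code above is already concise and contains no errors, but a few spots could be adjusted slightly:
- Add a newline after the return statement:
f = '{e}'
change to \n
Here is the updated code example: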
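import traceback

from html2text import html2text


def get_content(self, file):
    try:
        content = html2text(file.read())
        return content
    except BaseException as e:
        # Print the traceback of the exception currently being handled
        traceback.print_exc()
        return 'Error occurred'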
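For comparison, here is a minimal sketch of the behavior the PR title describes, where the extraction error text is returned so the caller can display it rather than a generic string. The names `extract_content` and the `to_md` callable are hypothetical stand-ins for the project's own handler method and converter, and the logging setup is an assumption.

```python
import logging
import traceback
from typing import Callable

logger = logging.getLogger(__name__)


def extract_content(file: dict, to_md: Callable[[dict], str]) -> str:
    """Convert `file` with `to_md`; on failure, return the error text.

    Hypothetical sketch: the full traceback goes to the log, while the
    short message is returned so it can be shown to the user.
    """
    try:
        return to_md(file)
    except Exception as e:
        logger.error('document extraction failed: %s\n%s', e, traceback.format_exc())
        return f'{e}'
```

Catching `Exception` rather than `BaseException` leaves `KeyboardInterrupt` and `SystemExit` free to propagate.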
fix: Fix document extraction errors not being displayed