How to get rid of '\r' when extracting and printing a table from a PDF file?

Question

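The objective is to extract a table from a given PDF file and convert the whole table into a pandas DataFrame for further processing. The table contains only strings.

The code itself runs, but once the extracted table is converted into a DataFrame, every string that was wrapped onto a new line inside its table cell shows a '\r' between the words.

Example: original appearance in the cell (with a line break between the two words): "Neues Wh..."

It should look like: "Neues Wh..."

Result after conversion to df: "Neues\rWh..."

See my code below: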

import io
import os

import pandas as pd
from tabula import read_pdf
from tabulate import tabulate

pdf_template_path = os.path.join(r'H:\folder\ pdf-file')
pdf_template_path1 = pdf_template_path + '.pdf'

# Extract the tables from all pages; lattice mode relies on the cell borders
pdf_table = read_pdf(pdf_template_path1,
                     pages='all',
                     multiple_tables=True,
                     lattice=True,
                     pandas_options={'header': None})

# Transform the result into a string table format
table = tabulate(pdf_table)

# Transform the table into dataframe
df = pd.read_fwf(io.StringIO(table))

# Build the mapping only after df exists, since it reads df.columns
mapping = {df.columns[0]: 'x1',
           df.columns[1]: 'x2',
           df.columns[2]: 'x3',
           df.columns[3]: 'x4?',
           df.columns[4]: 'x5',
           df.columns[5]: 'x6',
           df.columns[6]: 'x7',
           df.columns[7]: 'x8'}

df.rename(columns=mapping, inplace=True)
# The returned Styler is not stored, so this line does not change the output below
df.style.set_properties(subset=['Beschreibung'], **{'width': '300px'})

display(df.head())
df.shape

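The result looks like this: (image: result)

As the picture shows, the carriage return sequence "\r" sometimes appears between words, e.g. 'Neues\rWh..', whereas the result should look like this: 'Neues Wh..'.

I tried methods such as replace():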

df = df.replace('\r', '', regex= True)

Edit: this did not work, because the strings in df stayed unchanged; see the result picture: (image: result after df_replace)

Thanks for your suggestions.

Answer 1

Solved. The solution is:

df = df.replace(r'\r', ' ', regex= True)

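The r prefix makes the pattern a raw string, so Python does not treat the backslash as an escape character and the pattern \r reaches the regex engine exactly as written. The replacement is a space rather than an empty string, so the two words stay separated.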
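The fix can be tried without the PDF. The following is only a minimal sketch: the small DataFrame and its column name x1 are made-up stand-ins for the table extracted above, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame built from the PDF table
df = pd.DataFrame({"x1": ["Neues\rWh...", "no break"]})

# Raw-string pattern: the regex engine receives \r and matches a carriage return;
# each match is replaced by a single space
cleaned = df.replace(r'\r', ' ', regex=True)

print(cleaned["x1"].tolist())
# ['Neues Wh...', 'no break']
```

replace() returns a new DataFrame, so the result has to be assigned (as in df = df.replace(...)) for the cleaned strings to be used later.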


python

pdf

stringio

pandas

tabulate