How to manage the special character \r in pandas dataframes

Question

Why does the \r character make pandas misread a CSV file?

Example:

import pandas as pd

# Small frame whose text column contains bare carriage returns
test = pd.DataFrame({'id': [1, 2, 3],
                     'text': ['Foo\rBar', 'Bar\rFoo', 'Foo\r\r\nBar']})
test.to_csv('temp.csv', index=False)
test2 = pd.read_csv('temp.csv')

The dataframes then look like this:

test:

    id  text
0   1   Foo\rBar
1   2   Bar\rFoo
2   3   Foo\r\r\nBar

test2:

    id      text
0   1       Foo
1   Bar     NaN
2   2       Bar
3   Foo     NaN
4   3       Foo\r\r\nBar

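Note that adding a \n to the text keeps the row from being split. Any idea what is going on, and how can I prevent this behavior?

Note that this also seems to prevent using pandas.to_pickle, since the resulting file appears corrupted. Opening it produces the following error: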

Error! ..\my_pickle.pkl is not UTF-8 encoded
Saving disabled.
See Console for more details.
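
The message above reads like what a text editor or notebook viewer prints when asked to open a binary file as text, so it may not mean the pickle itself is damaged. That reading is an assumption, since the question does not say which tool produced it. A minimal sketch to check whether the data survives a pickle round trip (the temporary directory and file name are only for illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "text": ["Foo\rBar", "Bar\rFoo", "Foo\r\r\nBar"]})

# Pickle is a binary format, so \r needs no quoting or escaping here.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_pickle.pkl")  # hypothetical file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# True if every carriage return came back unchanged
print(restored["text"].tolist() == df["text"].tolist())
```

If this prints True, the file is probably fine and should be loaded with pd.read_pickle rather than opened in an editor.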

Answer 1

Try adding the lineterminator and encoding parameters:

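import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3],
                     'text': ['Foo\rBar', 'Bar\rFoo', 'Foo\r\r\nBar']})
# pandas 1.5+ spells the to_csv keyword lineterminator; older releases use line_terminator
test.to_csv('temp.csv', index=False, lineterminator='\n', encoding='utf-8')
# Treat only \n as the end of a row so that a bare \r stays inside its field
test2 = pd.read_csv('temp.csv', lineterminator='\n', encoding='utf-8')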

test and test2:

    id  text
0   1   Foo\rBar
1   2   Bar\rFoo
2   3   Foo\r\r\nBar

It works fine for me, but it may be a Windows-only problem (I am on a MacBook). Also check this issue.
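
For a file that already contains unquoted \r characters, the reading half of the fix is the part that matters. A small sketch using an in-memory string in place of temp.csv (the sample text is made up to mirror the first two rows of the question; lineterminator is only supported by the C parser, which is the default engine):

```python
import io

import pandas as pd

# CSV text with bare, unquoted carriage returns inside the text field
raw = "id,text\n1,Foo\rBar\n2,Bar\rFoo\n"

# Default settings: a bare \r is also treated as the end of a row
naive = pd.read_csv(io.StringIO(raw))

# Only \n ends a row, so \r stays inside the field
fixed = pd.read_csv(io.StringIO(raw), lineterminator="\n")

print(len(naive), len(fixed))
print(fixed["text"].tolist())
```

Comparing the two row counts shows whether the extra rows come from the reader rather than from the data.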

Answer 2

For the CSV data to be valid, every field that contains a line break should be enclosed in double quotes.

The resulting CSV should look like this:

id  text
1   "Foo\rBar"
2   "Bar\rFoo"
3   "Foo\r\r\nBar"

Or:

id  text
1   "Foo
Bar"
2   "Bar
Foo"
3   "Foo


Bar"

If the reader treats only \n as a line break, this is enough:

id  text
1   Foo\rBar
2   Bar\rFoo
3   "Foo\r\r\nBar"

When reading the CSV data, make sure to tell the reader to parse quoted fields (this is probably the default).
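
One way to get every field quoted on the writing side is the quoting argument of to_csv, which takes the constants from Python's csv module. A hedged sketch, using an in-memory string instead of a file on disk; csv.QUOTE_ALL is one choice among several and not necessarily what this answer had in mind:

```python
import csv
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "text": ["Foo\rBar", "Bar\rFoo", "Foo\r\r\nBar"]})

# Quote every field so embedded \r and \n cannot be taken for row ends
csv_text = df.to_csv(index=False, quoting=csv.QUOTE_ALL)

# read_csv honors double quotes by default (quotechar='"')
restored = pd.read_csv(io.StringIO(csv_text))

print(restored["text"].tolist() == df["text"].tolist())
```

csv.QUOTE_NONNUMERIC would leave the id column unquoted while still quoting the text column, which keeps the file a little smaller.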

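The parser may try to auto-detect the type of line break used in the file (it could be \n, \r\n, or even \r), which may be why you get unexpected results when unquoted fields contain a mix of \r and \n.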
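
The auto-detection described above can be observed with a small made-up input that mixes all three line endings. This is only a sketch of how the default C parser in pandas appears to behave; other engines or readers may differ:

```python
import io

import pandas as pd

# Header ends with \r, first row with \r\n, second row with \n
raw = "a,b\r1,2\r\n3,4\n"

df = pd.read_csv(io.StringIO(raw))

# Each of the three endings is accepted as a row boundary
print(df.columns.tolist())
print(df.values.tolist())
```

Since all three endings are accepted at once, an unquoted \r inside a field cannot be told apart from the end of a row.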
