How do you handle column names having spaces in them when using pd.read_clipboard?

Question

This is a real problem I have been facing for a long time.

Take this dataframe:

         A         B  THRESHOLD
       NaN       NaN        NaN
 -0.041158 -0.161571   0.329038
  0.238156  0.525878   0.110370
  0.606738  0.854177  -0.095147
  0.200166  0.385453   0.166235

It is easy to copy it in with pd.read_clipboard. However, if one of the column names contains a space:

         A         B     Col #3
       NaN       NaN        NaN
 -0.041158 -0.161571   0.329038
  0.238156  0.525878   0.110370
  0.606738  0.854177  -0.095147
  0.200166  0.385453   0.166235

then it is read like this:

          A         B       Col  #3
0       NaN       NaN       NaN NaN
1 -0.041158 -0.161571  0.329038 NaN
2  0.238156  0.525878  0.110370 NaN
3  0.606738  0.854177 -0.095147 NaN
4  0.200166  0.385453  0.166235 NaN

How can I prevent this?

Answer 1

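What I do in this case is separate all columns by two or more spaces and then use sep=r'\s\s+' as the separator. That way, a column header that contains a single space, such as the third column above, is still treated as one column.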

         A         B     Col #3
       NaN       NaN        NaN
 -0.041158  -0.161571   0.329038
  0.238156   0.525878   0.110370
  0.606738   0.854177  -0.095147
  0.200166   0.385453   0.166235

df = pd.read_clipboard(sep=r'\s\s+')

You do get the warning below, but you can ignore it because the result is correct. Or, if your OCD gets the better of you, you can pass engine='python'. :)

C:\Program Files\Anaconda3\lib\site-packages\pandas\io\clipboards.py:63: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  return read_table(StringIO(text), sep=sep, **kwargs)

print(df)

          A         B    Col #3
0       NaN       NaN       NaN
1 -0.041158 -0.161571  0.329038
2  0.238156  0.525878  0.110370
3  0.606738  0.854177 -0.095147
4  0.200166  0.385453  0.166235
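The same idea can be tried without a clipboard. The sketch below is only an illustration: it assumes the copied text is held in a string (the variable names text, naive and df are made up here) and passes it through io.StringIO to pd.read_csv, first with a plain whitespace separator and then with the two-or-more-whitespace separator.

```python
import io

import pandas as pd

# Stand-in for the clipboard contents: columns are separated by at least
# two spaces, while the header "Col #3" contains a single space.
text = """         A         B     Col #3
       NaN       NaN        NaN
 -0.041158  -0.161571   0.329038
  0.238156   0.525878   0.110370
  0.606738   0.854177  -0.095147
  0.200166   0.385453   0.166235"""

# Splitting on any run of whitespace breaks the header "Col #3" in two.
naive = pd.read_csv(io.StringIO(text), sep=r"\s+")
print(naive.columns.tolist())

# Requiring two or more whitespace characters keeps "Col #3" together.
# engine="python" is passed explicitly so no ParserWarning is raised.
df = pd.read_csv(io.StringIO(text), sep=r"\s\s+", engine="python")
print(df.columns.tolist())
```

With real clipboard data, the same sep and engine arguments can be passed to pd.read_clipboard, which forwards them to the parser.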

Answer 2

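To drive home the point I made in the comments, here is a version using re, io, and pd.read_table. I copied the exact text from your post and applied a first round of re.sub to remove the leading whitespace. Then I replaced any run of spaces that follows a digit with 2 spaces. This trick is specific to the case at hand, since the column names consist mostly of letters. With all that done, I converted the resulting string to an io.StringIO object and fed it to pd.read_table. This is essentially the same as copying the text, pasting it into Sublime Text, applying the search-and-replace operations there, and then copying the resulting string and feeding it to pd.read_clipboard.

The snippet below illustrates this: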

import pandas as pd
import re
import io


text = """         A         B     Col #3
        NaN       NaN        NaN
  -0.041158 -0.161571   0.329038
   0.238156  0.525878   0.110370
   0.606738  0.854177  -0.095147
   0.200166  0.385453   0.166235"""


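# Strip the spaces at the start of the text, then widen every run of
# spaces that follows a digit to exactly two spaces.
cleaned = re.sub(r"(?<=[0-9]) +", "  ", re.sub(r"^ +", "", text))
with io.StringIO(cleaned) as fs:
    df = pd.read_table(fs, header=0, sep=r"\s{2,}", engine="python")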


#           A         B    Col #3
# 0       NaN       NaN       NaN
# 1 -0.041158 -0.161571  0.329038
# 2  0.238156  0.525878  0.110370
# 3  0.606738  0.854177 -0.095147
# 4  0.200166  0.385453  0.166235
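For repeated use, the two substitutions could be wrapped in a small function. This is a minimal sketch rather than part of the answer above: the name read_spaced_table is hypothetical, and it adds re.MULTILINE so that leading spaces are removed on every line, not just the first.

```python
import io
import re

import pandas as pd


def read_spaced_table(text: str) -> pd.DataFrame:
    """Hypothetical helper: parse space-aligned text whose headers may
    contain single spaces."""
    # re.MULTILINE makes ^ match at the start of every line.
    stripped = re.sub(r"^ +", "", text, flags=re.MULTILINE)
    # Widen any run of spaces that follows a digit to exactly two spaces.
    widened = re.sub(r"(?<=[0-9]) +", "  ", stripped)
    # Columns are now separated by two or more whitespace characters.
    return pd.read_table(io.StringIO(widened), sep=r"\s{2,}", engine="python")


sample = """         A         B     Col #3
       NaN       NaN        NaN
 -0.041158 -0.161571   0.329038
  0.238156  0.525878   0.110370"""

result = read_spaced_table(sample)
print(result)
```

The digit heuristic has the same limitation as before: a header such as "Col 3 A", where a space follows a digit, would be split into two columns.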


python

clipboard

dataframe

pandas