How do you read in a dataframe with lists using pd.read_clipboard?

Here is some data from another question:

                          positive                 negative          neutral
1   [marvel, moral, bold, destiny]                       []   [view, should]
2                      [beautiful]      [complicated, need]               []
3                      [celebrate]   [crippling, addiction]            [big]

The first thing I would do is add quotes around all the words, and then:

import ast
import pandas as pd

df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
df = df.applymap(ast.literal_eval)

Is there a smarter way to do this?

This is how I did it:

df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
# strip the brackets, then split each cell on commas
df = df.apply(lambda x: x.str.replace(r'[\[\]]', '', regex=True).str.split(r',\s*', expand=False))

PS I'm sure there must be a better way...

Lists of strings

For basic structures, you can use yaml without having to add quotes:

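import pandas as pd
import yaml

# safe_load parses flow sequences such as [a, b] without needing quotes
df = pd.read_clipboard(sep=r'\s{2,}', engine='python').applymap(yaml.safe_load)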

type(df.iloc[0, 0])
Out: list

Lists of numeric data

Under certain conditions, you can read the lists in as strings and convert them with literal_eval (or pd.eval, if they are simple lists).
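As a rough illustration of that condition, the sketch below feeds two made-up cell strings to ast.literal_eval. The variable names are only for this example; the point is that a list of numbers is a valid Python literal, while a list of bare words is not.

```python
import ast

# Cells arrive from the clipboard as plain strings, not lists.
numeric_cell = "[1, 2, 3]"
unquoted_cell = "[marvel, moral]"

# A list of numbers is a valid Python literal, so this parses directly.
parsed = ast.literal_eval(numeric_cell)
print(type(parsed), parsed)

# Bare words are names rather than literals, so literal_eval rejects them.
try:
    ast.literal_eval(unquoted_cell)
    unquoted_ok = True
except ValueError:
    unquoted_ok = False
print(unquoted_ok)
```

This is why the string case needs either quotes added to the data or a different parser.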

For example,

           A   B
0  [1, 2, 3]  11
1  [4, 5, 6]  12

First, make sure there are at least two spaces between columns, then copy your data and run the following:

import ast
import pandas as pd

df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
df['A'] = df['A'].map(ast.literal_eval)
df
    
           A   B
0  [1, 2, 3]  11
1  [4, 5, 6]  12

df.dtypes

A    object
B     int64
dtype: object
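If you want to try this without touching the clipboard, one option is to pass the same text to pd.read_csv through io.StringIO, since read_clipboard hands the clipboard text to read_csv anyway. This is only a sketch for reproducing the example above; the `text` variable is a stand-in for whatever you would have copied.

```python
import ast
import io

import pandas as pd

# The same text you would otherwise copy to the clipboard.
text = """\
           A   B
0  [1, 2, 3]  11
1  [4, 5, 6]  12
"""

# Two or more spaces separate columns, so the single spaces
# inside each list stay within one field.
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
df['A'] = df['A'].map(ast.literal_eval)
print(df.loc[0, 'A'])
```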

Notes

  • for multiple columns, use applymap in the conversion step:

    df[['A', 'B', ...]] = df[['A', 'B', ...]].applymap(ast.literal_eval)
    
  • if your columns can contain NaNs, define a function that can handle them appropriately:

    parser = lambda x: x if pd.isna(x) else ast.literal_eval(x)
    df[['A', 'B', ...]] = df[['A', 'B', ...]].applymap(parser)
    
  • if your columns contain lists of strings and you don't want to add quotes to the data manually, you will need something like yaml.safe_load (requires installing PyYAML) to parse them instead. See above.
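To make the NaN note concrete, here is a small sketch on a hand-built frame. The `parse_cell` helper and the sample data are made up for illustration, and it maps column by column with Series.map rather than applymap, which recent pandas versions have deprecated in favour of DataFrame.map.

```python
import ast

import pandas as pd

def parse_cell(value):
    """Return missing values untouched; parse everything else as a literal."""
    return value if pd.isna(value) else ast.literal_eval(value)

# Hypothetical data: column A has a missing cell.
df = pd.DataFrame({'A': ['[1, 2, 3]', None], 'B': ['[4]', '[5, 6]']})

# Apply the parser to every cell of the selected columns.
parsed = df[['A', 'B']].apply(lambda col: col.map(parse_cell))
print(parsed)
```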

Another version:

import ast
import re

df.applymap(lambda x:
            ast.literal_eval("[" + re.sub(r"[\[\]]", "'",
                                          re.sub(r"[,\s]+", "','", x)) + "]"))

Another option is

In [43]:  df.applymap(lambda x: x[1:-1].split(', '))
Out[43]: 
                         positive                negative         neutral
1  [marvel, moral, bold, destiny]                      []  [view, should]
2                     [beautiful]     [complicated, need]              []
3                     [celebrate]  [crippling, addiction]           [big]

Note that this assumes the first and last characters in each cell are [ and ]. It also assumes there is exactly one space after each comma.
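If those assumptions are too strict for your data, a slightly more tolerant splitter is easy to sketch. The `split_cell` helper below is hypothetical, not part of pandas: it accepts any amount of whitespace around commas and returns a truly empty list for `[]`, whereas the slice-and-split version returns `['']` for an empty cell.

```python
import re

def split_cell(cell):
    """Split '[a, b,c]' into ['a', 'b', 'c']; '[]' becomes []."""
    # Drop surrounding whitespace and brackets, whatever their spacing.
    inner = cell.strip().strip('[]').strip()
    if not inner:
        return []
    # Split on commas, ignoring any whitespace on either side.
    return re.split(r'\s*,\s*', inner)

print(split_cell('[marvel, moral, bold, destiny]'))
print(split_cell('[]'))
```

It can be applied to a whole frame column by column, for example with df.apply(lambda col: col.map(split_cell)).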

With help from @MaxU

df = pd.read_clipboard(sep=r'\s{2,}', engine='python')

Then:

>>> df.apply(lambda col: col.str[1:-1].str.split(', '))
                         positive                negative         neutral
1  [marvel, moral, bold, destiny]                      []  [view, should]
2                     [beautiful]     [complicated, need]              []
3                     [celebrate]  [crippling, addiction]           [big]

>>> df.apply(lambda col: col.str[1:-1].str.split(', ')).loc[3, 'negative']
['crippling', 'addiction']

As per the note from @unutbu, who proposed a similar solution:

assumes the first and last character in each cell is [ and ]. It also assumes there is exactly one space after the commas.