How to convert each word of each row of a dataframe to a numeric value

I have been given this dataframe.

The output I want, based on the dictionary, looks like this.

**Given the following dictionary:**
d = {'I': 30,'am': 45,'good': 90,'boy': 50,'We':100,'are':70,'going':110}

How can I do this in Python? I tried the following, but it failed:

dataframe['new'] = data['documents'].apply(lambda x: dictionary[x]) 

Please help me. Thanks in advance.

You can use explode to get one word per row, map each word through your dictionary, and then reshape your dataframe:

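import pandas as pd

# Sample data matching the output shown below
df = pd.DataFrame({'id': range(3), 'documents': ['I am good boy', 'We are going', 'I am going']})

MAPPING = {'I': 30, 'am': 45, 'good': 90, 'boy': 50, 'We': 100, 'are': 70, 'going': 110}

# Split into words, one word per row, map to numbers, then join back per original row
df['documents'] = (df['documents'].str.split().explode().map(MAPPING).astype(str)
                                  .groupby(level=0).agg(list).str.join(' '))
print(df)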

# Output
   id    documents
0   0  30 45 90 50
1   1   100 70 110
2   2    30 45 110

Step by step

Phase 1: Explode

# Split phrase into words
>>> out = df['documents'].str.split()
0    [I, am, good, boy]
1      [We, are, going]
2        [I, am, going]
Name: documents, dtype: object

# Explode lists into scalar values
>>> out = out.explode()
0        I
0       am
0     good
0      boy
1       We
1      are
1    going
2        I
2       am
2    going
Name: documents, dtype: object

Phase 2: Convert

# Map each word through your dict, then convert the result to string
>>> out = out.map(MAPPING).astype(str)
0     30
0     45
0     90
0     50
1    100
1     70
1    110
2     30
2     45
2    110
Name: documents, dtype: object  # <- .astype(str)

Phase 3: Reshape

# Group by index (level=0) then aggregate to a list
>>> out = out.groupby(level=0).agg(list)
0    [30, 45, 90, 50]
1      [100, 70, 110]
2       [30, 45, 110]
Name: documents, dtype: object

# Join your list of words
>>> out = out.str.join(' ')
0    30 45 90 50
1     100 70 110
2      30 45 110
Name: documents, dtype: object
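
The three phases above can be wrapped in a small helper, sketched below. The function name `encode_documents` and its `unknown` parameter are hypothetical and not part of the answer above. The sketch assumes that words missing from the mapping should get a placeholder instead of becoming NaN, which `.astype(str)` would turn into the string 'nan'.

```python
import pandas as pd

MAPPING = {'I': 30, 'am': 45, 'good': 90, 'boy': 50, 'We': 100, 'are': 70, 'going': 110}


def encode_documents(series, mapping, unknown='?'):
    """Replace each word of each row by its mapped value, joined by spaces.

    `unknown` (hypothetical parameter) stands in for words not in `mapping`.
    """
    # Phase 1: one word per row; the original row index is repeated
    words = series.str.split().explode()
    # Phase 2: look up each word, falling back to the placeholder
    codes = words.map(lambda w: str(mapping.get(w, unknown)))
    # Phase 3: collapse back to one row per original index
    return codes.groupby(level=0).agg(' '.join)


df = pd.DataFrame({'id': range(3),
                   'documents': ['I am good boy', 'We are going', 'I am unknown']})
df['encoded'] = encode_documents(df['documents'], MAPPING)
print(df['encoded'].tolist())
# ['30 45 90 50', '100 70 110', '30 45 ?']
```

Using `mapping.get` inside the lambda keeps the values as they appear in the dictionary, so integers are not converted to floats the way they can be when `map` produces NaN for a missing word.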

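Instead of looking up d[x], where x is the whole sentence, you should look up d[w] for each word w in the sentence x.

You can use .split() to split a string into a list of words. Then you can use a list comprehension or map to look up each word of the list in the dictionary: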
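
To make the difference concrete, here is a minimal sketch in plain Python, without pandas. The variable names `sentence`, `whole_lookup_failed` and `per_word` are only illustrative.

```python
d = {'I': 30, 'am': 45, 'good': 90, 'boy': 50, 'We': 100, 'are': 70, 'going': 110}
sentence = 'I am good boy'

# Looking up the whole sentence fails: no key equals 'I am good boy'
try:
    d[sentence]
    whole_lookup_failed = False
except KeyError:
    whole_lookup_failed = True

# Looking up each word separately succeeds
per_word = [d[w] for w in sentence.split()]

print(whole_lookup_failed)  # True
print(per_word)             # [30, 45, 90, 50]
```

This is what happens inside the original `apply(lambda x: dictionary[x])`: each x is a full sentence, so the lookup raises KeyError.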

import pandas as pd

df = pd.DataFrame({'id': range(3), 'documents': ['I am good boy', 'We are going', 'I am going']})

print(df)
#    id      documents
# 0   0  I am good boy
# 1   1   We are going
# 2   2     I am going

d = {'I': 30,'am': 45,'good': 90,'boy': 50,'We':100,'are':70,'going':110}

df['new'] = df['documents'].apply(lambda s: list(map(d.get, s.split())))

# or alternatively:
# df['new'] = df['documents'].apply(lambda s: [d.get(w) for w in s.split()])

print(df)
#    id      documents               new
# 0   0  I am good boy  [30, 45, 90, 50]
# 1   1   We are going    [100, 70, 110]
# 2   2     I am going     [30, 45, 110]

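Important note: I recommend using d.get(w) rather than d[w]. If w is not in the dictionary, d[w] raises a KeyError. d.get, on the other hand, accepts a default value and does not raise for a missing key. By default, d.get(w) returns None if w is not in d, but you can specify the default value yourself: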

df = pd.DataFrame({'id': range(4), 'documents': ['I am good boy', 'We are going', 'I am going', 'I am good words not going in dictionary']})

df['new'] = df['documents'].apply(lambda s: [d.get(w, 37) for w in s.split()])

print(df)
#    id                                documents                                new
# 0   0                            I am good boy                   [30, 45, 90, 50]
# 1   1                             We are going                     [100, 70, 110]
# 2   2                               I am going                      [30, 45, 110]
# 3   3  I am good words not going in dictionary  [30, 45, 90, 37, 37, 110, 37, 37]
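
For a quick side-by-side view of the three behaviours on a single sentence, here is a small sketch without pandas. The names `words`, `missing`, `with_none` and `with_default` are illustrative; the default 37 is the one used above.

```python
d = {'I': 30, 'am': 45, 'good': 90, 'boy': 50, 'We': 100, 'are': 70, 'going': 110}
words = 'I am good words'.split()

# Words for which d[w] would raise KeyError
missing = [w for w in words if w not in d]

# d.get(w) falls back to None for missing words
with_none = [d.get(w) for w in words]

# d.get(w, 37) falls back to the given default instead
with_default = [d.get(w, 37) for w in words]

print(missing)       # ['words']
print(with_none)     # [30, 45, 90, None]
print(with_default)  # [30, 45, 90, 37]
```

Checking `missing` first can be useful if you would rather know which words are absent than have them silently replaced by a default.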