How to check if words are in dictionary in pandas dataframe

Suppose I have a dataframe with a column of sentences:

     data['sentence']

0    i like to move it move it
1    i like to move ir move it
2    you like to move it
3    i liketo move it move it
4    i like to moveit move it
5    ye like to move it

I want to check which sentences contain words that are outside the dictionary, for example:

     data['sentence']                OOV

0    i like to move it move it      False
1    i like to move ir move it      False
2    you like to move it            False
3    i liketo move it move it       True
4    i like to moveit move it       True
5    ye like to move it             True

Right now I have to iterate over every row:


data['OOV'] = False  # out of vocabulary

for i, row in data.iterrows():
    # split the current row's sentence, not the whole column
    words = set(row['sentence'].split())
    for word in words:
        if word not in dictionary:
            data.at[i, 'OOV'] = True
            break

Is there a way to vectorize (or speed up) this task?

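Since I don't have the full context of the dictionary and the other details, I suggest using df.apply(operation), which is usually faster than iterating row by row.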

pandas.DataFrame.apply
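
As a rough sketch of what that could look like, assuming the dictionary is a Python set of lowercase words and the column is named `sentence` as in the question (the sample data and the helper name `has_oov` are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data and dictionary, for illustration only.
data = pd.DataFrame({"sentence": ["i like to move it", "i liketo move it"]})
dictionary = {"i", "like", "to", "move", "it"}

def has_oov(sentence: str) -> bool:
    """Return True if any whitespace-separated word is not in `dictionary`."""
    return any(word not in dictionary for word in sentence.split())

# Apply the check to each sentence in the column.
data["OOV"] = data["sentence"].apply(has_oov)
```

Note that `apply` still runs a Python-level loop over the column. Any gain over `iterrows` comes mainly from not building a Series object for every row, so it is worth timing on the real data.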

Without knowing what the dictionary contains, your requirement isn't entirely clear (I imagine it is more of a list in the Python sense).

However, assuming the reference words are "i like to move it", here is how to flag the rows whose sentence contains words outside the dictionary:

dictionary = set(['i', 'like', 'to', 'move', 'it'])
df['OOV'] = df['data'].str.split(' ').apply(lambda x: not set(x).issubset(dictionary))

# only for illustration:
df['words'] = df['data'].str.split(' ').apply(set)
df['words_outside'] = df['data'].str.split(' ').apply(lambda x: set(x).difference(dictionary))

Output:

                        data    OOV                            words words_outside
0  i like to move it move it  False          {like, to, it, i, move}            {}
1  i like to move ir move it   True      {like, to, it, i, move, ir}          {ir}
2        you like to move it   True        {move, like, to, it, you}         {you}
3   i liketo move it move it   True            {liketo, it, move, i}      {liketo}
4   i like to moveit move it   True  {like, to, it, i, move, moveit}      {moveit}
5         ye like to move it   True         {like, to, it, move, ye}          {ye}