Python: Tag keywords and create new tag columns of 1s and 0s

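I have the following code that iterates over a column of sentences, tags the keywords found in each sentence, and creates new columns of 1s and 0s for those tags. If a keyword is present, the sentence gets a 1 in the column named after that keyword's tag. If no keyword for that tag is present but a keyword for another tag is, the column gets a 0. If the sentence contains no keywords at all, the whole row is dropped.

The code below mostly works, but it still misses some keywords, it tags partial words, and it outputs 1s and 0s for blank cells (rows with no sentence). I'm not sure what is missing. How can I make sure no keywords are missed, and that partial words and empty sentences are not tagged?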

pattern = '|'.join(dict_list)
tags_id = (df['description_summary']
   .str.extractall(f'({pattern})')[0]
   .map(dict_list)
   .reset_index(name='col')
   .assign(value=1)
   .pivot_table(index=[df['issue.id'], df['description_summary']], columns='col', values='value', fill_value=0))

Here is essentially the data I'm working with, from an Excel file:

    issue.id  description_summary

0   753       Long sentence with keywords ball and hot
1   937       Long sentence with keywords cold, stick, and glove
2   
3   598       Long sentence with NO keywords
4   574       Long sentence with keywords very cold and cold 

This is the current (wrong) output:

    issue.id  description_summary                                     Toy     Temperature 

0    753       Long sentence with keywords ball and hot                1       1
1    937       Long sentence with keywords cold, stick, and glove      1       1
2                                                                      1       0
3    598       Long sentence with NO keywords but outputs 1s and 0s    0       1
4    574       Long sentence with keywords very cold and cold          1       1

This is the output I want:

    issue.id  description_summary                                     Toy     Temperature    

0    753       Long sentence with keywords ball and hot                1       1
1    937       Long sentence with keywords cold, stick, and glove      1       1
4    574       Long sentence with keywords very cold and cold          0       1

Here is the dictionary of keywords and tags ('keywords': 'tags'):

dict_list = {'Hot': 'Temperature',
 'Cold': 'Temperature',
 'Very cold': 'Temperature',
 'Ball': 'Toy',
 'Glove': 'Toy',
 'Stick': 'Toy'
 }

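How can I make sure no keywords are missed, and that partial words and empty sentences are not tagged?

I think your first problem is the map. If I roughly reconstruct what you are doing up to that point: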

>>> import re
>>> pattern = '|'.join(dict_list.keys())
>>> matches = df['description_summary'].str.extractall(f"({pattern})", flags=re.IGNORECASE)[0]
>>> matches
   match
0  0             ball
   1              hot
1  0             cold
   1            stick
   2            glove
4  0        very cold
   1             cold
Name: 0, dtype: object
>>> matches.map(dict_list)
   match
0  0        NaN
   1        NaN
1  0        NaN
   1        NaN
   2        NaN
4  0        NaN
   1        NaN
Name: 0, dtype: object

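The matched text is lowercase while the dictionary keys are capitalized, so every lookup returns NaN. Forcing both sides to lowercase gives a better result: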

>>> matches.str.lower().map({kw.lower():tag for kw, tag in dict_list.items()})
   match
0  0                Toy
   1        Temperature
1  0        Temperature
   1                Toy
   2                Toy
4  0        Temperature
   1        Temperature
Name: 0, dtype: object
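
The lowercase mapping fixes the missed keywords, but it does not stop partial-word matches: a bare `hot` in the pattern would also match inside `photo`. Here is a minimal sketch of one way to handle that with plain `re`, assuming that wrapping the alternation in `\b` word boundaries is acceptable for these keywords. The names `build_pattern`, `lowered`, and `tags_for` are hypothetical helpers for illustration, not part of the original code.

```python
import re

# Keyword -> tag mapping from the question.
dict_list = {'Hot': 'Temperature',
             'Cold': 'Temperature',
             'Very cold': 'Temperature',
             'Ball': 'Toy',
             'Glove': 'Toy',
             'Stick': 'Toy'}


def build_pattern(keywords):
    """Hypothetical helper: whole-word, case-insensitive alternation."""
    # Try longer keywords first so 'Very cold' is preferred over 'Cold'.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape guards against keywords containing regex metacharacters.
    alternation = '|'.join(re.escape(kw) for kw in ordered)
    return re.compile(rf'\b(?:{alternation})\b', flags=re.IGNORECASE)


pattern = build_pattern(dict_list)
# Lowercased lookup table, so the case of the matched text does not matter.
lowered = {kw.lower(): tag for kw, tag in dict_list.items()}


def tags_for(sentence):
    """Return the set of tags whose keywords occur as whole words."""
    return {lowered[m.lower()] for m in pattern.findall(sentence)}


print(tags_for('Long sentence with keywords ball and hot'))
print(tags_for('A photo of a football'))
```

The same `\b(...)\b` pattern string can be passed to `str.extractall` if you keep one capturing group around the alternation.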

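The second problem seems to be pivot_table: because df and matches do not have the same shape, it assigns matches to the wrong rows. Instead, we can pivot on the first index level and then use that to join with df:
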
>>> tags = matches.str.lower().map({kw.lower():tag for kw, tag in dict_list.items()})
>>> tags = tags.rename_axis(['line', 'match']).reset_index(name='tag').assign(value=1)
>>> tags.pivot_table(index='line', columns='tag', values='value', fill_value=0).join(df[['issue.id', 'description_summary']])
      Temperature  Toy  issue.id                                description_summary
line                                                                               
0               1    1     753.0           Long sentence with keywords ball and hot
1               1    1     937.0  Long sentence with keywords cold, stick, and g...
4               1    0     574.0     Long sentence with keywords very cold and cold
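
For anyone who wants to run this outside the interactive session, here is a self-contained sketch of the same idea. It rebuilds the sample frame from the question by hand, which is an assumption about how the Excel data loads. It also adds three things that are not in the steps above: `\b` word boundaries and `re.escape` in the pattern, `aggfunc='max'` with an explicit `astype(int)`, and an inner join starting from `df` so the original columns come first. The names `lowered`, `flags`, and `result` are illustrative.

```python
import re

import pandas as pd

dict_list = {'Hot': 'Temperature', 'Cold': 'Temperature',
             'Very cold': 'Temperature', 'Ball': 'Toy',
             'Glove': 'Toy', 'Stick': 'Toy'}

# Sample data rebuilt by hand from the question (row 2 is blank).
df = pd.DataFrame({
    'issue.id': [753, 937, None, 598, 574],
    'description_summary': [
        'Long sentence with keywords ball and hot',
        'Long sentence with keywords cold, stick, and glove',
        None,
        'Long sentence with NO keywords',
        'Long sentence with keywords very cold and cold',
    ],
})

# Lowercased lookup; longest keywords first in the alternation.
lowered = {kw.lower(): tag for kw, tag in dict_list.items()}
pattern = '|'.join(re.escape(kw)
                   for kw in sorted(lowered, key=len, reverse=True))

# One row per match, indexed by (original row, match number).
matches = df['description_summary'].str.extractall(
    rf'\b({pattern})\b', flags=re.IGNORECASE)[0]

tags = (matches.str.lower().map(lowered)
        .rename_axis(['line', 'match'])
        .reset_index(name='tag')
        .assign(value=1))

# One row per original line that had a match, one 0/1 column per tag.
flags = (tags.pivot_table(index='line', columns='tag', values='value',
                          aggfunc='max', fill_value=0)
         .astype(int))

# The inner join drops the blank row and the row without keywords.
result = df.join(flags, how='inner')
print(result)
```

Rows 2 and 3 never appear in `flags`, so the inner join is what removes them; no separate filtering step is needed.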

Input data:

>>> df
  issue.id                                description_summary
0      753           Long sentence with keywords ball and hot
1      937  Long sentence with keywords cold, stick, and g...
2     <NA>                                               <NA>
3      598                     Long sentence with NO keywords
4      574     Long sentence with keywords very cold and cold

>>> mapping
{'Hot': 'Temperature',
 'Cold': 'Temperature',
 'Very cold': 'Temperature',
 'Ball': 'Toy',
 'Glove': 'Toy',
 'Stick': 'Toy'}

>>> words  # words = fr"({'|'.join(mapping.keys())})".lower()
'(hot|cold|very cold|ball|glove|stick)'

I'll write some explanation later (but you can test it line by line):

out = df['description_summary'].str.lower().str.findall(words) \
                               .explode().str.capitalize() \
                               .replace(mapping) \
                               .pipe(lambda x: x.loc[x.notna()]) \
                               .str.get_dummies() \
                               .groupby(level=0) \
                               .any().astype(int)

Output:

>>> df.merge(out, left_index=True, right_index=True)
  issue.id                                description_summary  Temperature  Toy
0      753           Long sentence with keywords ball and hot            1    1
1      937  Long sentence with keywords cold, stick, and g...            1    1
4      574     Long sentence with keywords very cold and cold            1    0
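
Since the chain is easier to follow one step at a time, here is a hedged, self-contained sketch that splits it into named intermediate variables. The sample frame is rebuilt by hand (an assumption about the real data), and the intermediate names `found`, `exploded`, and `labels` are illustrative only. The pattern is kept as in this answer, so it does not guard against partial words.

```python
import pandas as pd

mapping = {'Hot': 'Temperature', 'Cold': 'Temperature',
           'Very cold': 'Temperature', 'Ball': 'Toy',
           'Glove': 'Toy', 'Stick': 'Toy'}

# Sample data rebuilt by hand (row 2 is blank).
df = pd.DataFrame({
    'issue.id': [753, 937, None, 598, 574],
    'description_summary': [
        'Long sentence with keywords ball and hot',
        'Long sentence with keywords cold, stick, and glove',
        None,
        'Long sentence with NO keywords',
        'Long sentence with keywords very cold and cold',
    ],
})

# '(hot|cold|very cold|ball|glove|stick)'
words = fr"({'|'.join(mapping.keys())})".lower()

# 1. Lowercase each sentence and collect all keyword hits as a list.
#    Missing sentences stay NaN; sentences without hits give [].
found = df['description_summary'].str.lower().str.findall(words)

# 2. One row per hit; empty lists and NaN both become a single NaN row.
exploded = found.explode()

# 3. Restore the dictionary's capitalization and swap keyword for tag.
labels = exploded.str.capitalize().replace(mapping)

# 4. Drop the rows that had no hit at all.
labels = labels.dropna()

# 5. One indicator column per tag, then collapse repeated index values
#    so each original row ends up with a single 0/1 per tag.
out = labels.str.get_dummies().groupby(level=0).any().astype(int)

# 6. The inner merge on the index keeps only rows with at least one tag.
result = df.merge(out, left_index=True, right_index=True)
print(result)
```

Printing each intermediate variable in turn is a quick way to see where a missed keyword or an unexpected row comes from.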