str.findall returns all NA's

Question

I have a DataFrame, df1, that contains many different news articles. Here is an example of one article:

'Today is Monday Aug. 17 the 230th day of 2020 . There are 136 days left in the year . On August 17 2017 a van plowed through pedestrians along a packed promenade in the Spanish city of Barcelona killing 13 people and injuring 120 . A 14th victim died later from injuries . Another man was stabbed to death in a carjacking that night as the van driver made his getaway and a woman died early the next day in a vehicle-and-knife attack in a nearby coastal town . Six by police two more died when a bomb workshop exploded . In 1915 a mob in Cobb County Georgia lynched Jewish businessman Leo Frank 31 whose death sentence for the murder of 13-year-old Mary Phagan had been commuted to life imprisonment . Frank who d maintained his innocence was pardoned by the state of Georgia in 1986 . In 1960 the newly renamed Beatles formerly the Silver Beetles began their first gig in Hamburg West Germany Teamsters union president Jimmy Hoffa was sentenced in Chicago to five years in federal prison for defrauding his union s pension fund . Hoffa was released in 1971 after President Richard Nixon commuted his sentence for this conviction and jury tampering . In 1969 Hurricane Camille slammed into the Mississippi coast as a Category 5 storm that was blamed for 256 U.S. deaths three in Cuba . In 1978 the first successful trans-Atlantic balloon flight ended as Maxie Anderson Ben Abruzzo and Larry Newman landed In 1982 the first commercially produced compact discs a recording of ABBA s The Visitors were pressed at a Philips factory near Hanover West Germany .'

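I also have df2, which holds every word from the news articles in the "Word" column and the corresponding LIWC category in the second column.

Example data: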

data = {'Word': ['killing','even','guilty','brain'], 'Category': ['Affect', 'Adverb', 'Anx','Body']}

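What I want to do: for each article in df1, count the number of words that fall into each category in df2. So I want to create one column for every category mentioned in df2["category"]. In the end it should look like this: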

 Content              | Achieve | Affiliation   | affect
article text here     | 6       | 2             | 2 
article text here     | 2       | 43            | 2
article text here     | 6       | 8             | 8 
article text here     | 2       | 13            | 7

Since everything is a string, I tried str.findall, but it returns NA for everything. This is what I tried:

from collections import Counter
liwc = df1['articles'].str.findall(fr"'({'|'.join(df2)})'") \
         .apply(lambda x: pd.Series(Counter(x), index=df2["category"].unique())) \
         .fillna(0).astype(int)

A pandas or R solution would be equally good.

Answer 1

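First flatten the df2 values into a dictionary, build a pattern with word boundaries (\b ... \b) and pass it to Series.str.extractall. Then use Series.map to turn each matched word into its category, create a DataFrame with reset_index, pass it to crosstab, and finally append the result to the original with DataFrame.join: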

df1 = pd.DataFrame({'articles':['Today is killing Aug. 17 the 230th day of 2020',
                                'Today is brain Aug. 17 the guilty day of 2020 ']})

print (df1)
                                         articles
0  Today is killing Aug. 17 the 230th day of 2020
1  Today is brain Aug. 17 the guilty day of 2020

If the values in the Word column are lists, as shown here:

data = {'Word': [['killing'],['even'],['guilty'],['brain']], 
       'Category': ['Affect', 'Adverb', 'Anx','Body']} 
df2 = pd.DataFrame(data)
print (df2)
        Word Category
0  [killing]   Affect
1     [even]   Adverb
2   [guilty]      Anx
3    [brain]     Body


d = {x: b for a, b in zip(df2['Word'], df2['Category']) for x in a}
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}

If df2 is different, with plain strings in the Word column:

data = {'Word': ['killing','even','guilty','brain'],
        'Category': ['Affect', 'Adverb', 'Anx','Body']} 
df2 = pd.DataFrame(data)
print (df2)
      Word Category
0  killing   Affect
1     even   Adverb
2   guilty      Anx
3    brain     Body
    
d = dict(zip(df2['Word'], df2['Category']))
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}

import re

# thanks to Wiktor Stribiżew for improving the solution
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))
df = df1['articles'].str.extractall(rf'({pat})')[0].map(d).reset_index(name='Category')

df = df1.join(pd.crosstab(df['level_0'], df['Category']))
print (df)
                                         articles  Affect  Anx  Body
0  Today is killing Aug. 17 the 230th day of 2020       1    0     0
1  Today is brain Aug. 17 the guilty day of 2020        0    1     1
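
One thing to watch in the output above: crosstab only creates columns for categories that matched at least once, so Adverb is missing, and an article with no match at all would get NaN after the join. Here is a minimal sketch, assuming you want every category from df2 as a column with zeros filled in. The reindex step and the third, non-matching article are my additions for illustration, not part of the original solution:

```python
import re
import pandas as pd

df1 = pd.DataFrame({'articles': ['Today is killing Aug. 17 the 230th day of 2020',
                                 'Today is brain Aug. 17 the guilty day of 2020 ',
                                 'Nothing to match here']})  # hypothetical extra row
df2 = pd.DataFrame({'Word': ['killing', 'even', 'guilty', 'brain'],
                    'Category': ['Affect', 'Adverb', 'Anx', 'Body']})

# word -> category lookup and a single alternation pattern with word boundaries
d = dict(zip(df2['Word'], df2['Category']))
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))

# one row per match, with the matched word replaced by its category
matches = (df1['articles'].str.extractall(f'({pat})')[0]
           .map(d)
           .reset_index(name='Category'))

# count matches per article and category, then force all rows and all categories
counts = (pd.crosstab(matches['level_0'], matches['Category'])
          .reindex(index=df1.index, columns=df2['Category'].unique(), fill_value=0)
          .astype(int))

out = df1.join(counts)
print(out)
```

The columns come out in the order the categories first appear in df2, and articles without any match get 0 instead of NaN.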

Answer 2

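You can build a custom regex with named capture groups and use str.extractall.

With your dictionary, the custom regex would be '(?P<Affect>\bkilling\b)|(?P<Adverb>\beven\b)|(?P<Anx>\bguilty\b)|(?P<Body>\bbrain\b)'

Then take notna of the result, groupby + max per article, convert to int and join to the original DataFrame: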

regex = '|'.join(fr'(?P<{k}>\b{v}\b)' for v,k  in zip(*data.values()))
(df1.join(df1['articles'].str.extractall(regex, flags=2) # re.IGNORECASE
             .notna().groupby(level=0).max()
             .astype(int)
         )
)

Output:

                                         articles  Affect  Adverb  Anx  Body
0  Today is killing Aug. 17 the 230th day of 2020       1       0    0     0
1  Today is brain Aug. 17 the guilty day of 2020        0       0    1     1
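
Two caveats with the named-group approach: groupby(...).max() gives a 0/1 presence flag rather than a count, and each group name can be used only once in a pattern and has to be a valid Python identifier. Below is a sketch, assuming actual counts are wanted and that a category can have several words. The extra word 'hate' and the modified article texts are made up for illustration:

```python
import re
import pandas as pd

df1 = pd.DataFrame({'articles': ['Today is killing and more killing in 2020',
                                 'Today is Brain Aug. 17 the guilty day of 2020 ']})
# 'hate' is a hypothetical second word in the Affect category
df2 = pd.DataFrame({'Word': ['killing', 'hate', 'even', 'guilty', 'brain'],
                    'Category': ['Affect', 'Affect', 'Adverb', 'Anx', 'Body']})

# collect the words of each category, keeping the categories in df2 order
words_by_cat = df2.groupby('Category', sort=False)['Word'].agg(list)

# one named group per category, each wrapping an escaped alternation of its words
parts = []
for cat, words in words_by_cat.items():
    alternation = '|'.join(re.escape(w) for w in words)
    parts.append(fr'(?P<{cat}>\b(?:{alternation})\b)')
regex = '|'.join(parts)

# every match fills exactly one group column; summing notna() counts the matches
counts = (df1['articles'].str.extractall(regex, flags=re.IGNORECASE)
          .notna()
          .groupby(level=0).sum()
          .reindex(df1.index, fill_value=0)
          .astype(int))

out = df1.join(counts)
print(out)
```

Replacing max with sum is the only change needed to go from presence flags to counts. Category names containing spaces or other non-identifier characters would have to be renamed before they can serve as group names.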

python

text-processing

pandas