Remove empty rows within a dataframe and check similarity

Question

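I am having some difficulty using a regular expression (findall) to select the non-empty fields in my dataframe while looking for words that appear in a source text: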

text = "Be careful otherwise police will capture you quickly."

I need to find the words ending in ful in my text string, and then look for words ending in ful in my dataset.

Author      DF_Text

31       Better the devil you know than the one you don't      
53       Beware the door with too many keys.      
563      Be careful what you tolerate. You are teaching people how to treat you. 
41       Fear the Greeks bearing gifts.      
539      NaN
51       The honey is sweet but the bee has a sting.      
21       Be careful what you ask for; you may get it.

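(The data comes from a csv/txt file.) I need to extract the words ending in ful from text, then find the DF_Text rows (and therefore the authors) that contain words ending in ful, and append the results to a list.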

n=0
for i in df['DF_Text']:
        print(re.findall(r"\w+ful", i))
        n=n+1
        print(n)

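My question is: how can I exclude the empty rows (NaN) and the empty results ([]) from the analysis, and report the related author IDs (e.g. 563, 21)? I am happy to provide more information if anything is unclear.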

Answer 1

Use str.findall instead of looping with re.findall:

df["found"] = df["DF_Text"].str.findall(r"(\w+ful)")

df.loc[df["found"].str.len().eq(0),"found"] = df["Author"]

print (df)

   Author                                            DF_Text      found
0      31   Better the devil you know than the one you don't         31
1      53                Beware the door with too many keys.         53
2     563  Be careful what you tolerate. You are teaching...  [careful]
3      41                     Fear the Greeks bearing gifts.         41
4     539                                                NaN        NaN
5      51        The honey is sweet but the bee has a sting.         51
6      21       Be careful what you ask for; you may get it.  [careful]
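The output above keeps every row and fills the rows without a match with the author ID. If the aim is instead to skip the NaN and empty-list rows and list only the authors whose text matched, a minimal sketch could look like the following. The names matches, has_match and authors are illustrative and not part of the answer, and the trailing \b in the pattern is an added assumption so that only words ending in ful are matched.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Author": [31, 53, 563, 41, 539, 51, 21],
    "DF_Text": [
        "Better the devil you know than the one you don't",
        "Beware the door with too many keys.",
        "Be careful what you tolerate. You are teaching people how to treat you.",
        "Fear the Greeks bearing gifts.",
        np.nan,
        "The honey is sweet but the bee has a sting.",
        "Be careful what you ask for; you may get it.",
    ],
})

# str.findall leaves NaN rows as NaN instead of raising an error
matches = df["DF_Text"].str.findall(r"\w+ful\b")

# The length of a NaN entry is NaN, so gt(0) is False for NaN rows and for []
has_match = matches.str.len().gt(0)

# Authors whose text contains at least one word ending in "ful"
authors = df.loc[has_match, "Author"].tolist()
print(authors)
```

Because the comparison treats NaN as False, no separate dropna step is needed in this variant.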

Answer 2

I would use the pandas .notna() function to remove that row from the dataframe, like this:

df = df[df['DF_Text'].notna()]

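Note that df appears twice in that line before it is overwritten, and that is correct: the inner df['DF_Text'].notna() builds a boolean mask, and the outer df[...] keeps only the rows where the mask is True.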

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.htm
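Combining the notna() filter with the original goal of the question, a rough sketch might first pull the words ending in ful out of text and then keep the rows of DF_Text that contain any of them. The names words, clean, pattern and result are hypothetical, the sample dataframe is a shortened copy of the one in the question, and case-insensitive whole-word matching is an assumption about what "similar" should mean here.

```python
import re

import numpy as np
import pandas as pd

text = "Be careful otherwise police will capture you quickly."

# Shortened copy of the data in the question
df = pd.DataFrame({
    "Author": [53, 563, 539, 21],
    "DF_Text": [
        "Beware the door with too many keys.",
        "Be careful what you tolerate. You are teaching people how to treat you.",
        np.nan,
        "Be careful what you ask for; you may get it.",
    ],
})

# Words in the source text that end in "ful"
words = re.findall(r"\b\w+ful\b", text)

# Drop the rows whose text is missing
clean = df[df["DF_Text"].notna()]

# Build one whole-word pattern from the extracted words
# (assumes words is non-empty; an empty list would need a guard)
pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"

# Keep the authors whose text contains any of those words
result = clean.loc[clean["DF_Text"].str.contains(pattern, case=False), "Author"].tolist()
print(result)
```

The non-capturing group (?:...) is used so that str.contains does not warn about match groups in the pattern.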


python

regex

text-mining

pandas