Remove empty rows within a dataframe and check similarity
I am having some difficulty selecting the non-empty fields when using a regex (findall) on my dataframe to look for words that also appear in a source text:
text = "Be careful otherwise police will capture you quickly."
I need to look for words ending in ful in my text string, and then look for words ending in ful in my dataset.
Author DF_Text
31 Better the devil you know than the one you don't
53 Beware the door with too many keys.
563 Be careful what you tolerate. You are teaching people how to treat you.
41 Fear the Greeks bearing gifts.
539 NaN
51 The honey is sweet but the bee has a sting.
21 Be careful what you ask for; you may get it.
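(read from a csv/txt file).
I need to extract the words ending in ful from text, then find the DF_Text rows (and therefore the authors) that contain words ending in ful, and append the results to a list.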
import re

n = 0
for i in df['DF_Text']:
    # print every word ending in "ful" found in this row
    print(re.findall(r"\w+ful", i))
    n = n + 1
print(n)
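My question is: how can I remove the empty rows (NaN) and the empty results ([]) from the analysis, and report the related author names (e.g. 563, 21)?
If anything is unclear, I am happy to provide more information.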
Use str.findall instead of looping with re.findall:
df["found"] = df["DF_Text"].str.findall(r"(\w+ful)")
df.loc[df["found"].str.len().eq(0),"found"] = df["Author"]
print (df)
Author DF_Text found
0 31 Better the devil you know than the one you don't 31
1 53 Beware the door with too many keys. 53
2 563 Be careful what you tolerate. You are teaching... [careful]
3 41 Fear the Greeks bearing gifts. 41
4 539 NaN NaN
5 51 The honey is sweet but the bee has a sting. 51
6 21 Be careful what you ask for; you may get it. [careful]
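If the goal is to keep only the rows that actually contain a match and to list their authors, the same str.findall result can be used as a filter. The following is a minimal sketch under some assumptions: the sample frame is rebuilt by hand from the question, the \b word boundaries are my addition so that only whole words ending in ful are matched, and the names has_match, matches and authors are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks the empty row.
df = pd.DataFrame({
    "Author": [31, 53, 563, 41, 539, 51, 21],
    "DF_Text": [
        "Better the devil you know than the one you don't",
        "Beware the door with too many keys.",
        "Be careful what you tolerate. You are teaching people how to treat you.",
        "Fear the Greeks bearing gifts.",
        np.nan,
        "The honey is sweet but the bee has a sting.",
        "Be careful what you ask for; you may get it.",
    ],
})

# Assumption: only whole words ending in "ful" are wanted, hence the \b anchors.
found = df["DF_Text"].str.findall(r"\b\w+ful\b")

# str.len() is NaN for the NaN row and 0 for an empty list;
# comparing with "> 0" is False in both cases, so both kinds of row are dropped.
has_match = found.str.len() > 0

# Keep the author and the matched words for the remaining rows.
matches = df.loc[has_match, ["Author"]].assign(found=found[has_match])
authors = matches["Author"].tolist()

print(matches)
print(authors)  # [563, 21]
```

Unlike the snippet above, this leaves df itself unchanged and does not write the author number into the found column, so the matched words and the authors stay in separate columns.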
I would use the Pandas .notna() function to drop that row from the dataframe, like this:
df = df[df['DF_Text'].notna()]
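Note that the dataframe is referenced twice on the right-hand side before it is overwritten; that is intentional. The inner expression builds a boolean mask, and the outer indexing keeps only the rows where the mask is True.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.htm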
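Putting the .notna() filter together with the comparison against text from the question could look roughly like this. It is only a sketch: the shortened sample frame, the lower-casing before matching, the \b word boundaries, and the names target_words, clean and results are my own assumptions rather than anything stated in the question.

```python
import re

import numpy as np
import pandas as pd

text = "Be careful otherwise police will capture you quickly."

# Shortened sample based on the question's data; NaN marks the empty row.
df = pd.DataFrame({
    "Author": [53, 563, 539, 21],
    "DF_Text": [
        "Beware the door with too many keys.",
        "Be careful what you tolerate. You are teaching people how to treat you.",
        np.nan,
        "Be careful what you ask for; you may get it.",
    ],
})

# Words ending in "ful" in the reference text (lower-cased, an assumption).
target_words = set(re.findall(r"\b\w+ful\b", text.lower()))

# Drop the NaN rows first so re.findall only ever receives strings.
clean = df[df["DF_Text"].notna()]

results = []
for author, sentence in zip(clean["Author"], clean["DF_Text"]):
    words = set(re.findall(r"\b\w+ful\b", sentence.lower()))
    shared = sorted(words & target_words)
    if shared:  # skip rows with no word in common with the text
        results.append((int(author), shared))

print(results)  # [(563, ['careful']), (21, ['careful'])]
```

Filtering with .notna() before the loop matters here because re.findall raises a TypeError when it is handed a float NaN instead of a string.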