How to filter strings if the first three sentences contain keywords
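I have a pandas dataframe called df. It has a column called article. The article column contains 600 strings, each of which represents a news article.
I want to keep only those articles whose first four sentences contain the keywords "COVID-19" AND ("China" OR "Chinese"), but I have not been able to find a way to do this on my own.
(In the strings, sentences are separated by \n. An example article looks like this:)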
\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......
Here:
found = []
s1 = "hello"
s2 = "good"
s3 = "great"
# article is assumed to be an iterable of strings, e.g. df['article']
for string in article:
    if s1 in string and (s2 in string or s3 in string):
        found.append(string)
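First, we define a function that returns a boolean depending on whether your keywords appear in a given piece of text:
def contains_covid_kwds(sentence):
    kw1 = 'COVID-19'  # spelled with the hyphen, as in the question and the sample article
    kw2 = 'China'
    kw3 = 'Chinese'
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)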
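Then we create a boolean Series by applying this function (using Series.apply) to the strings in the df.article column.
Note that we use a lambda function to truncate the text passed to contains_covid_kwds at the fifth occurrence of '\n', i.e. after your first four sentences (more information on how this method works here):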
series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
Then we pass the boolean Series to df.loc in order to select the rows where the Series evaluates to True:
filtered_df = df.loc[series]
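One caveat with the slicing trick above: when an article has fewer than five '\n' characters, str.find returns -1, so the slice becomes s[:-1] and silently drops the last character. A minimal sketch of a split-based alternative follows. The helper name first_sentences and the sample articles are made up for illustration, and contains_covid_kwds is redefined here (with the 'COVID-19' spelling from the question) only so the snippet runs on its own.

```python
import pandas as pd

def first_sentences(text, n=4):
    """Return the first n non-empty, newline-separated sentences of text."""
    sentences = [s for s in text.split('\n') if s.strip()]
    return '\n'.join(sentences[:n])

def contains_covid_kwds(text):
    # Same AND/OR condition as in the question.
    return 'COVID-19' in text and ('China' in text or 'Chinese' in text)

# Hypothetical sample data standing in for the real 600 articles.
df = pd.DataFrame({'article': [
    "\nChina reports new COVID-19 cases.\nSecond.\nThird.\nFourth.\nFifth.",
    "\nOne.\nTwo.\nThree.\nFour.\nCOVID-19 hits China.",
    "\nShort article on COVID-19 in China",
]})

series = df.article.apply(lambda s: contains_covid_kwds(first_sentences(s)))
filtered_df = df.loc[series]
```

The second article mentions the keywords only in its fifth sentence, so it is filtered out; the third has a single sentence and is still handled without losing a character.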
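First, I create a Series containing only the first four sentences of the original df['article'] column, converted to lower case on the assumption that the search should be case-insensitive. The leading "\n" is stripped first; otherwise the empty string in front of it would count as one of the four pieces.
articles = df['article'].apply(lambda x: "\n".join(x.strip().split("\n", maxsplit=4)[:4])).str.lower()
Then I use a simple boolean mask to keep only those rows where the keywords are found in the first four sentences.
df[(articles.str.contains("covid")) & (articles.str.contains("chinese") | articles.str.contains("china"))]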
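The same idea can probably be written without apply, using only the pandas .str accessor. This is a sketch under assumptions rather than the answerer's code: the column is taken to be named article as in the question, the sample rows are invented, and the literal keyword "covid-19" is matched with regex=False while "china|chinese" is left as a regular expression.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({'article': [
    "\nChina may be past the worst of the COVID-19 pandemic.\nSecond.\nThird.\nFourth.\nFifth.",
    "\nNothing relevant.\nSecond.\nThird.\nFourth.\nCOVID-19 in China.",
    "\nChinese officials discuss COVID-19.\nSecond.",
]})

# First four sentences of each article, lower-cased.
head = (
    df['article']
    .str.strip()        # drop the leading "\n"
    .str.split('\n')    # one list of sentences per article
    .str[:4]            # keep at most four sentences
    .str.join('\n')     # back to a single string
    .str.lower()
)

mask = head.str.contains('covid-19', regex=False) & head.str.contains('china|chinese')
result = df[mask]
```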
You can use the pandas apply method and do it the way I did.
import pandas as pd
string = "\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission."
df = pd.DataFrame({'article':[string]})
def findKeys(string):
    string_list = string.strip().lower().split('\n')
    flag=0
    keywords=['china','covid-19','wuhan']
    # Checking if the article has more than 4 sentences
    if len(string_list)>4:
        # iterating over string_list variable, which contains sentences.
        for i in range(4):
            # iterating over keywords list
            for key in keywords:
                # checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag=1
                    break
    # Else block is executed when article has less than or equal to 4 sentences
    else:
        # Iterating over string_list variable, which contains sentences
        for i in range(len(string_list)):
            # iterating over keywords list
            for key in keywords:
                # Checking if sentence contains any keyword
                if key in string_list[i]:
                    flag=1
                    break
    if flag==0:
        return False
    else:
        return True
Then call the pandas apply method on df:
df['Contains Keywords?'] = df['article'].apply(findKeys)
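Note that findKeys flags an article as soon as any one of its three keywords appears, which is looser than the condition in the question. A minimal sketch of a stricter variant, requiring "covid-19" AND ("china" OR "chinese") somewhere in the first four sentences, might look like this. The name find_keys_strict, its n_sentences parameter and the sample rows are hypothetical.

```python
import pandas as pd

def find_keys_strict(text, n_sentences=4):
    """True if the first n_sentences contain 'covid-19' and ('china' or 'chinese')."""
    # Slicing handles articles with fewer than n_sentences sentences as well.
    sentences = text.strip().lower().split('\n')[:n_sentences]
    has_covid = any('covid-19' in s for s in sentences)
    has_china = any('china' in s or 'chinese' in s for s in sentences)
    return has_covid and has_china

# Hypothetical sample data for illustration only.
df = pd.DataFrame({'article': [
    "\nChina may be past the worst of the COVID-19 pandemic.\nWorkers in Wuhan must take a test.",
    "\nWorkers in Wuhan must take a test.\nNo other news.",
]})

df['Contains Keywords?'] = df['article'].apply(find_keys_strict)
kept = df[df['Contains Keywords?']]
```

Because the slice already caps the list at four sentences, the separate branches for long and short articles are not needed here.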