Compare pandas column values with words in a text file using regular expressions
I have a DataFrame like this:
product  name   description0  description1  description2  description3
A        plane  flies         air           passengers    wings
B        car    rolls         road          NaN           NaN
C        boat   floats        sea           passengers    NaN
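What I want to do is check whether each value in the description columns appears in a text file.
Suppose my test.txt file contains:
He flies to London then crosses the sea to reach New-York.
The result should look like this: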
product  name   description0  description1  description2  description3  Match
A        plane  flies         air           passengers    wings         Match
B        car    rolls         road          NaN           NaN           No match
C        boat   floats        sea           passengers    NaN           Match
I know the basic structure, but I'm a bit lost on the rest:
import re

with open("test.txt", 'r') as searchfile:
    for line in searchfile:
        print(line)
        if re.search() in line:  # this is the part I don't know how to write
            print(match)
You can use str.find() to search the input text, since you are looking for string literals. re.search() seems like overkill here.
A quick solution using .apply(axis=1):
Data
# df as given
input_text = "He flies to London then crosses the sea to reach New-York."
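In the question the text lives in test.txt rather than in a string literal. A minimal sketch of loading it, assuming the file is small enough to read in one go and is UTF-8 encoded; `load_text` is a hypothetical helper name, not part of the original answer:

```python
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Return the whole file as a single string (hypothetical helper)."""
    return Path(path).read_text(encoding=encoding)


# Usage, assuming test.txt sits in the working directory:
# input_text = load_text("test.txt")
```

Reading the file once and searching the resulting string avoids reopening the file for every row of the DataFrame.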
Code
input_text_lower = input_text.lower()

def search(row):
    for el in row:  # description 0,1,2,3
        # skip non-string values (e.g. NaN) and stop at the first successful search
        if isinstance(el, str) and (input_text_lower.find(el.lower()) >= 0):
            return True
    return False

df["Match"] = df[[f"description{i}" for i in range(4)]].apply(search, axis=1)
Result
print(df)
  product   name description0 description1 description2 description3  Match
0       A  plane        flies          air   passengers        wings   True
1       B    car        rolls         road          NaN          NaN  False
2       C   boat       floats          sea   passengers          NaN   True
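The question asked for "Match" / "No match" labels rather than booleans. One possible way to get them is to map the boolean result through a dictionary. The sketch below rebuilds the sample data so that it runs on its own; the `desc_cols` name and the compact `any(...)` form of `search` are illustrative choices, not taken from the original answer:

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "name": ["plane", "car", "boat"],
    "description0": ["flies", "rolls", "floats"],
    "description1": ["air", "road", "sea"],
    "description2": ["passengers", np.nan, "passengers"],
    "description3": ["wings", np.nan, np.nan],
})
input_text = "He flies to London then crosses the sea to reach New-York."
input_text_lower = input_text.lower()


def search(row):
    # True if any string value in the row occurs as a substring of the text.
    return any(isinstance(el, str) and el.lower() in input_text_lower for el in row)


desc_cols = [f"description{i}" for i in range(4)]
matched = df[desc_cols].apply(search, axis=1)

# Translate the booleans into the labels requested in the question.
df["Match"] = matched.map({True: "Match", False: "No match"})
print(df)
```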
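Notes
The original question does not address word boundaries, punctuation, or hyphenation. In real-world use, an extra preprocessing step may be needed, which is beyond the scope of the question.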
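If whole-word matching does matter, this is where re.search() becomes useful. A rough sketch, assuming that "word" means what the regex `\b` boundary considers a word (so a hyphen or a full stop ends a word); `make_matcher` and the small demo frame are hypothetical and only serve to show the difference from substring search:

```python
import re

import numpy as np
import pandas as pd


def make_matcher(text):
    """Build a row function that looks for whole-word hits in `text`."""
    text_lower = text.lower()

    def has_whole_word(row):
        for el in row:
            if not isinstance(el, str):
                continue  # skip NaN and other non-string values
            # re.escape guards against regex metacharacters in the data
            pattern = rf"\b{re.escape(el.lower())}\b"
            if re.search(pattern, text_lower):
                return True
        return False

    return has_whole_word


input_text = "He flies to London then crosses the sea to reach New-York."

# "each" is a substring of "reach" but not a word on its own.
demo = pd.DataFrame({
    "description0": ["flies", "each", "York"],
    "description1": ["air", np.nan, np.nan],
})
demo["Match"] = demo.apply(make_matcher(input_text), axis=1)
print(demo)
```

With plain substring search the "each" row would count as a match because of "reach"; with the boundary pattern it does not, while "York" is still found inside "New-York" because the hyphen counts as a boundary.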