Compare pandas column values with words in a text file using regular expressions

Question

I have a DataFrame df like this:

product      name     description0 description1 description2 description3
  A          plane         flies        air      passengers     wings
  B          car           rolls        road        NaN          NaN
  C          boat          floats       sea      passengers      NaN

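What I want to do is check whether any value in the description columns appears in a txt file.

Let's say my test.txt file is:

He flies to London then crosses the sea to reach New-York.

The result would look like this: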

product      name     description0 description1 description2 description3 Match
  A          plane         flies        air      passengers     wings     Match
  B          car           rolls        road        NaN          NaN      No match
  C          boat          floats       sea      passengers      NaN      Match

I know the basic structure, but I'm a bit lost on the rest:

with open ("test.txt", 'r') as searchfile:
    for line in searchfile:
        print line
        if re.search() in line:
            print(match)

Answer 1

You can use str.find() to search the input text, since you are searching for string literals. re.search() seems like overkill here.

A quick solution using .apply(axis=1):

Data

# df as given
input_text = "He flies to London then crosses the sea to reach New-York."
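The `# df as given` placeholder can be filled in along these lines. This is a minimal sketch that assumes the missing descriptions are stored as NaN, as shown in the question's table; the variable name `description_cols` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question; missing descriptions are NaN.
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "name": ["plane", "car", "boat"],
    "description0": ["flies", "rolls", "floats"],
    "description1": ["air", "road", "sea"],
    "description2": ["passengers", np.nan, "passengers"],
    "description3": ["wings", np.nan, np.nan],
})

# Hypothetical helper name: the columns that will be searched.
description_cols = [f"description{i}" for i in range(4)]
print(df[description_cols])
```

In the asker's setting, the text would presumably be read from the file rather than hard-coded, for example with `open("test.txt").read()`.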

Code

input_text_lower = input_text.lower()

def search(row):
    for el in row:  # description 0,1,2,3
        # skip non-string values (e.g. NaN); return True on the first successful search
        if isinstance(el, str) and (input_text_lower.find(el.lower()) >= 0):
            return True
    return False

df["Match"] = df[[f"description{i}" for i in range(4)]].apply(search, axis=1)

Result

print(df)
  product   name description0 description1 description2 description3  Match
0       A  plane        flies          air   passengers        wings   True
1       B    car        rolls         road          NaN          NaN  False
2       C   boat       floats          sea   passengers          NaN   True
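The question's expected output uses the labels Match / No match rather than booleans. One possible way to convert a boolean column is `Series.map` with a dictionary. The sketch below uses a stand-alone Series named `matched` (a hypothetical name) holding the same values as the Match column above.

```python
import pandas as pd

# Stand-in for df["Match"] with the boolean results shown above.
matched = pd.Series([True, False, True], name="Match")

# Map each boolean to the label used in the question's expected output.
labels = matched.map({True: "Match", False: "No match"})
print(labels.tolist())
```

Applied to the DataFrame itself, this would be `df["Match"] = df["Match"].map({True: "Match", False: "No match"})`.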

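Notes

The original question does not take word boundaries, punctuation, or hyphens into account. In a real-world case, additional preprocessing steps may be needed. That is beyond the scope of the original question.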
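If whole-word matching is wanted, one possible variant replaces str.find() with re.search() and `\b` word boundaries. This is a sketch under assumptions, not part of the original answer: the function name `search_whole_word` is hypothetical, each term is escaped with `re.escape()`, and `\b` only behaves as expected when a term starts and ends with a word character.

```python
import re

import numpy as np
import pandas as pd

input_text = "He flies to London then crosses the sea to reach New-York."

df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "name": ["plane", "car", "boat"],
    "description0": ["flies", "rolls", "floats"],
    "description1": ["air", "road", "sea"],
    "description2": ["passengers", np.nan, "passengers"],
    "description3": ["wings", np.nan, np.nan],
})


def search_whole_word(row):
    """Return True if any string in the row occurs in input_text as a whole word."""
    for el in row:
        if not isinstance(el, str):
            continue  # skip NaN and other non-string values
        # re.escape() makes the term literal; \b anchors it at word boundaries.
        pattern = rf"\b{re.escape(el)}\b"
        if re.search(pattern, input_text, flags=re.IGNORECASE):
            return True
    return False


description_cols = [f"description{i}" for i in range(4)]
df["Match"] = df[description_cols].apply(search_whole_word, axis=1)
print(df)

# Difference from plain substring search: "on" occurs inside "London",
# so str.find() reports a hit, while the word-boundary pattern does not.
substring_hit = input_text.find("on") >= 0
whole_word_hit = re.search(r"\bon\b", input_text) is not None
print(substring_hit, whole_word_hit)
```

For this sample text the Match column is the same as with str.find(); the two approaches differ only for terms that appear inside longer words.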
