Final step for completing a search engine in Python (pandas)

Question

I have a dictionary that essentially stores every word occurring in a large DataFrame (many rows and 12 columns). The dictionary looks like this:

    vocabulary = {'hello':[3,1998,876,3888], 'beautiful':[677, 4, 56],......}

where each value is the list of DataFrame rows that contain that word.

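What I want to do is take a string (the query) as input,

    query = 'a beautiful house with big windows'

and return certain columns of the DataFrame (call them A, B, C, D), restricted to the rows that contain all the words of the input sentence. I have already preprocessed both the vocabulary and the input query (stemming, stop-word removal, punctuation removal, ...). Can anyone help me? Thanks.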

Answer 1

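If I understand correctly, you want to look up each word of the query in the vocabulary dict to find the rows it appears in, and then return the rows that are common to all the words in the query. If so, here is a solution (I have simplified your example):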

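    vocabulary = {'hello': [3, 1998, 876, 3888], 'beautiful': [677, 4, 56, 3, 876]}
    query = 'hello beautiful'
    words = set(query.split())
    # A word missing from the vocabulary matches no rows, so map it to an empty set
    # instead of raising a KeyError.
    rows = [set(vocabulary.get(w, [])) for w in words]
    # Intersect the row sets of all query words; an empty query matches nothing.
    common_rows = sorted(set.intersection(*rows)) if rows else []
    print(common_rows)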

[3, 876]

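To select those rows from the DataFrame, you only need the following (this assumes the row numbers stored in the vocabulary are index labels of df; if they are positions instead, use df.iloc[common_rows] and then pick the columns):

    df.loc[common_rows, ["A", "B", "C", "D"]]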
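
Putting the two steps together, a minimal end-to-end sketch could look like the one below. The function name `search`, the toy DataFrame, and its column names are made up for illustration and are not part of the question; the sketch also assumes the query has already been preprocessed the same way as the vocabulary, and that the vocabulary stores index labels of `df`.

```python
import pandas as pd


def search(df, vocabulary, query, columns):
    """Return `columns` for the rows of `df` that contain every word in `query`."""
    words = set(query.split())
    if not words:
        # An empty query matches nothing: return an empty selection.
        return df.loc[[], columns]
    # Words absent from the vocabulary contribute an empty set,
    # which makes the whole intersection empty.
    row_sets = [set(vocabulary.get(w, ())) for w in words]
    common_rows = sorted(set.intersection(*row_sets))
    return df.loc[common_rows, columns]


# Hypothetical toy data, only to exercise the function.
df = pd.DataFrame(
    {
        "A": ["hello beautiful", "hello", "beautiful house", "hello beautiful house"],
        "B": [10, 20, 30, 40],
    }
)
vocabulary = {"hello": [0, 1, 3], "beautiful": [0, 2, 3], "house": [2, 3]}

result = search(df, vocabulary, "hello beautiful", ["A", "B"])
print(result.index.tolist())  # [0, 3]
```

Using `vocabulary.get` rather than direct indexing means an unknown query word gives an empty result instead of an exception, which is usually what a search over "rows containing all words" should return.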

python

search-engine

pandas