Matching a list of sentences (tokens with nltk) with a column in pandas dataframe

Question

I'm new to Python and still struggling with the basics, but I need to get this sorted out, and any help would be greatly appreciated. I have a long dataframe with hundreds of rows, each holding the text of a specific PDF page extracted from a medical exam; each row is a different person.

I successfully extracted the text (using PyMuPDF), iterated over each row, cleaned the text as much as I could, and ended up with a dataframe similar to the one below, with a column of sentences obtained using nltk's sent_tokenize and multiple rows.

import pandas as pd
from nltk.tokenize import sent_tokenize

df = pd.DataFrame({"text":["hello, this is a sentence. the sun shines. the night is beautiful",
              "the sun shines.",
              "the night is beautiful. tomorrow i work"]})

df["token"] = df["text"].apply(sent_tokenize)

The last part of my task is to match specific sentences from a list of medical phrases (specific to the exam) against those in my dataframe and keep only the matches, for example in a new column. For that, I found this thread with @furas's solution, which is clean and looked like it would do the job. So, in the end, I have a pandas column of sentences (nltk tokens) and a list of medical phrases (nltk tokens as well) and need to match them.

specific_sent = "the sun shines. hello, this is a sentence."
query = sent_tokenize(''.join(specific_sent))

df["query_match"] = df["token"].str.contains(query) 
df["word"] = df["token"].str.extract('({})'.format(query))

When I run this code, I get the error "TypeError: unhashable type: 'list'", which is not uncommon and which I have some understanding of, but I'm struggling to overcome it. Any help on how to overcome this error in this particular example, and on ways to prevent it in the future, is really appreciated. Thanks!

This is an example of desired output:

text	token	query_match	word
hello, this is a sentence. the sun shines. the night is beautiful	[hello, this is a sentence., the sun shines., the night is beautiful]	True	the sun shines., hello, this is a sentence.
the sun shines.	[the sun shines.]	True	the sun shines.
the night is beautiful. tomorrow i work	[the night is beautiful., tomorrow i work.]	False	NaN

Answer 1

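Once you have tokenized both the text of each row in the DataFrame and the specific sentences, you end up with lists, from which you can find the common elements and build the column word. After that, you can fill the column query_match by checking whether the resulting list of common elements is empty.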

import pandas as pd
from nltk.tokenize import sent_tokenize

df = pd.DataFrame({"text":["hello, this is a sentence. the sun shines. the night is beautiful",
              "the sun shines.",
              "the night is beautiful. tomorrow i work"]})

specific_sent = "the sun shines. hello, this is a sentence."
query = sent_tokenize(specific_sent)

df["token"] = df["text"].apply(sent_tokenize)

# collect the sentences that each row shares with the query
df["word"] = df["token"].apply(lambda x: list(set(query).intersection(x)))

# True if the row shares at least one sentence with the query, otherwise False
df["query_match"] = df["word"].apply(bool)
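
A set intersection returns the matches in arbitrary order and only counts exact string matches, so a sentence whose final period is missing would not equal its punctuated counterpart. The sketch below is one possible variant, not part of the original answer: it keeps the matches in the order they appear in each row and compares sentences after a light normalization. The token lists are hard-coded stand-ins for whatever sent_tokenize produces, and the helper names normalize and find_matches are hypothetical.

```python
import pandas as pd

# Hypothetical token lists standing in for the output of sent_tokenize.
df = pd.DataFrame({
    "token": [
        ["hello, this is a sentence.", "the sun shines.", "the night is beautiful"],
        ["the sun shines"],  # note the missing final period
        ["the night is beautiful.", "tomorrow i work"],
    ]
})
query = ["the sun shines.", "hello, this is a sentence."]


def normalize(sentence):
    """Lowercase and drop surrounding whitespace and trailing periods.

    This normalization is an illustrative assumption; adapt it to your data.
    """
    return sentence.strip().rstrip(".").strip().lower()


def find_matches(tokens, wanted):
    """Return the tokens whose normalized form is in `wanted`, in row order."""
    return [t for t in tokens if normalize(t) in wanted]


# Normalize the query once and store it in a set for fast lookups.
wanted = {normalize(q) for q in query}

df["word"] = df["token"].apply(lambda tokens: find_matches(tokens, wanted))
df["query_match"] = df["word"].apply(bool)  # an empty list becomes False

print(df)
```

Rows without a match get an empty list in word rather than NaN; if NaN is preferred, the empty lists can be replaced afterwards.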
