Extract keywords from sentences in a pandas text column, using nltk and/or regex, and place the words in another column as groups from each sentence

A pandas dataframe of mostly structured data has 2 columns containing user-entered, free-text narratives. Some of the narratives are poorly written. I am looking to extract keywords that occur in the same sentence within each narrative. The words are sometimes bigrams (implant fractured), but often there are many non-keywords between the keywords (the implant was really fractured). They only count as a pair if they appear in the same sentence of the narrative, and a sentence may contain more than 2 keywords. Here is an example, along with my attempt.

import pandas as pd
import nltk

def get_keywords(x, y):
    # tokenize the narrative and keep only the tokens that appear in keyword list y
    tokens = nltk.tokenize.word_tokenize(x)
    keywords = [keyword for keyword in tokens if keyword in y]
    keywords_string = ', '.join(keywords)
    return keywords_string
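
For reference, on a single sentence the helper just returns the matching tokens joined by commas. A quick, self-contained check (the inline keyword list here is only for illustration):

get_keywords('fractured was the implant.', ['implant', 'plate', 'screw'])
# returns 'implant'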


text = ['after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. ', 
    'investigation took long.  upon xray the plate, which looked ok at first suffered breakage.',
    'it happend that the screws had all broken', 'it was sad.   fractured was the implant.',
    'this sentance has nothing. as does this one.  and this one too.',
    'nothing happening here though a bone was fractured. bone was broke too as was screw.']

df = pd.DataFrame(text, columns = ['Text'])

## These are the key words.  The pairs belong to separate lists--(items, modes) in 
## either order.  These lists tend to grow as more keywords are discovered.
items = ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']
modes = ['broke', 'broken', 'break', 'breaks', 'breakage' , 'fracture', 'fractured']
other = ['bone', 'femur', 'ulna']

# the apply(lambda) is slow but I don't mind it.
df['items'] = df['Text'].apply(lambda x: get_keywords(x, items))
df['F Modes'] = df['Text'].apply(lambda x: get_keywords(x, modes)) 
df['other'] = df['Text'].apply(lambda x: get_keywords(x, other)) 

## After using loc to isolate the rows of interest, go back and grab the whole
## sentence for review. That is less reading than going through everything,
## but this is the step I'm hoping to reduce.

xxx = df['Text'].str.extractall(r"([^.]*?fracture[^.]*\.)").unstack()
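
Since the keyword lists grow over time, the same sentence grab can be built from the lists instead of hard-coding a single word. A minimal sketch, assuming the items/modes/other lists above; the pattern is just an alternation over every keyword:

import re

keyword_pattern = '|'.join(map(re.escape, items + modes + other))
xxx = df['Text'].str.extractall(rf"([^.]*?(?:{keyword_pattern})[^.]*\.)").unstack()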

This takes a lot of effort and iteration. Pulling out the sentences that contain a keyword is less work than reading everything, but it is still a lot of work. Question: is it possible to look at each sentence, grab only the words of interest, keep them in order, and place them as groups in a summary column, dropping all the words between the keywords of interest? The index must be preserved, because this text data will be merged with another df on the index.

The desired df would look like this:

text = [['after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. ', 'plate fractured, broken plate, fracture femur'],
        ['investigation took long.  upon xray the plate, which looked ok at first suffered breakage.', 'plate breakage'],
        ['it happened that the screws had all broken', 'screws broken'],
        ['it was sad.   fractured was the implant.', 'fractured implant'],
        ['this sentence has nothing. as does this one.  and this one too.', ''],
        ['nothing happening here. though a bone was fractured. bone was broke too as was screw.', 'bone fractured, bone broke screw']]

df = pd.DataFrame(text, columns = ['Text', 'Summary'])
df

You can try tokenizing the texts before extracting the keywords:

import pandas as pd
import nltk
import numpy as np
from more_itertools import split_after

nltk.download('punkt')

text = ['after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. ', 
    'investigation took long.  upon xray the plate, which looked ok at first suffered breakage.',
    'it happend that the screws had all broken', 'it was sad.   fractured was the implant.',
    'this sentance has nothing. as does this one.  and this one too.',
    'nothing happening here though a bone was fractured. bone was broke too as was screw.']

def tokenize(texts):
  return [nltk.tokenize.word_tokenize(t) for t in texts]
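
On one of the example narratives this yields a flat list of tokens in which each sentence-ending period is kept as its own token, which is what the sentence split below relies on (output shown for illustration):

tokenize(['it was sad.   fractured was the implant.'])
# [['it', 'was', 'sad', '.', 'fractured', 'was', 'the', 'implant', '.']]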

Afterwards, you can extract the keywords into a new column (here I extract the keywords per sentence):

def key_word_intersection(df):
  summaries = []
  for x in tokenize(df['Text'].to_numpy()):
    # keywords from all three lists that actually occur in this narrative
    keywords = np.concatenate([
        np.intersect1d(x, ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']),
        np.intersect1d(x, ['broke', 'broken', 'break', 'breaks', 'breakage', 'fracture', 'fractured']),
        np.intersect1d(x, ['bone', 'femur', 'ulna'])])

    # split the token list into sentences after each '.' token
    dot_sep_sentences = np.array(list(split_after(x, lambda i: i == ".")), dtype=object)
    summary = []
    for s in dot_sep_sentences:
      # keep only the keyword tokens of this sentence, in their original order
      summary.append([token for token in s if token in keywords])
    # join keywords within a sentence with spaces, and sentences with commas
    summaries.append(', '.join([' '.join(words) for words in summary if words]))
  return summaries

df = pd.DataFrame(text, columns = ['Text'])
df['Summary'] = key_word_intersection(df)
|    | Text                                                                                                                | Summary                                        |
|---:|:--------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------|
|  0 | after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. | plate fractured, broken plate, fractured femur |
|  1 | investigation took long.  upon xray the plate, which looked ok at first suffered breakage.                          | plate breakage                                 |
|  2 | it happend that the screws had all broken                                                                           | screws broken                                  |
|  3 | it was sad.   fractured was the implant.                                                                            | fractured implant                              |
|  4 | this sentance has nothing. as does this one.  and this one too.                                                     |                                                |
|  5 | nothing happening here though a bone was fractured. bone was broke too as was screw.                                | bone fractured, bone broke screw               |

If you do not want the keywords sentence-separated, but still want to maintain their order, you could do the following:

def key_word_intersection(df):
  summaries = []
  for x in tokenize(df['Text'].to_numpy()):
    keywords = np.concatenate([
                                np.intersect1d(x, ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']),
                                np.intersect1d(x, ['broke', 'broken', 'break', 'breaks', 'breakage' , 'fracture', 'fractured']), 
                                np.intersect1d(x, ['bone', 'femur', 'ulna' ])])
    # keep every token that is a keyword, preserving the original order
    summaries.append(np.array(x)[[i for i, keyword in enumerate(x) if keyword in keywords]])
  return summaries

df = pd.DataFrame(text, columns = ['Text'])
df['Summary'] = key_word_intersection(df)
|    | Text                                                                                                                | Summary                                                    |
|---:|:--------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------|
|  0 | after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. | ['plate' 'fractured' 'broken' 'plate' 'fractured' 'femur'] |
|  1 | investigation took long.  upon xray the plate, which looked ok at first suffered breakage.                          | ['plate' 'breakage']                                       |
|  2 | it happend that the screws had all broken                                                                           | ['screws' 'broken']                                        |
|  3 | it was sad.   fractured was the implant.                                                                            | ['fractured' 'implant']                                    |
|  4 | this sentance has nothing. as does this one.  and this one too.                                                     | []                                                         |
|  5 | nothing happening here though a bone was fractured. bone was broke too as was screw.                                | ['bone' 'fractured' 'bone' 'broke' 'screw']                |
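
Note that this variant stores numpy arrays of tokens in the Summary column. If you prefer plain strings, as in the desired df, you can join them afterwards; a small follow-up sketch:

df['Summary'] = df['Summary'].apply(' '.join)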