Create new boolean fields based on specific bigrams appearing in a tokenized pandas dataframe

Looping over a list of bigrams to search for, I need to create a boolean field for each bigram based on whether or not it appears in a tokenized pandas Series. I'd appreciate it if you think this is a good question!

The list of bigrams:

bigrams = ['data science', 'computer science', 'bachelors degree']

The dataframe:

df = pd.DataFrame(data={'job_description': [['data', 'science', 'degree', 'expert'],
                                            ['computer', 'science', 'degree', 'masters'],
                                            ['bachelors', 'degree', 'computer', 'vision'],
                                            ['data', 'processing', 'science']]})

Desired output:

                         job_description  data science computer science bachelors degree
0        [data, science, degree, expert]          True            False            False
1   [computer, science, degree, masters]         False             True            False
2  [bachelors, degree, computer, vision]         False            False             True
3            [data, processing, science]         False            False            False

Conditions:

  1. Only exact matches should be flagged (e.g., the tokens for 'data science' should return True for 'data science', but False for 'science data' or 'data bachelors science'; see the sketch after this list)
  2. Each search term should get its own field, joined onto the original df
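
Condition 1 amounts to an adjacent-pair check: both words must appear next to each other, in the given order. A minimal pure-Python sketch of that semantics (has_bigram is an illustrative helper, not part of the original post):

def has_bigram(tokens, bigram):
    # True only when the bigram's words appear adjacently and in order.
    first, second = bigram.split()
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))

has_bigram(['data', 'science', 'degree'], 'data science')     # True
has_bigram(['science', 'data', 'degree'], 'data science')     # False
has_bigram(['data', 'bachelors', 'science'], 'data science')  # False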

What I've tried:

Fail: df = [x for x in df['job_description'] if x in bigrams]

Fail: df[bigrams] = [[any(w==term for w in lst) for term in bigrams] for lst in df['job_description']] (this compares single tokens against two-word strings, so nothing ever matches)

Fail: couldn't adapt the approach from here -> Match trigrams, bigrams, and unigrams to a text; if unigram or bigram a substring of already matched trigram, pass; python

Fail: couldn't get this one to work either ->

Fail: this approach was very close, but I couldn't adapt it to bigrams ->

Thanks for any help you can provide!

You can use a regex with extractall:

regex = '|'.join('(%s)' % b.replace(' ', r'\s+') for b in bigrams)

# extractall returns one row per match with a (row, match) MultiIndex and
# one capture group per bigram; rows with no match don't appear at all,
# hence the join back onto df with fillna(False) at the end.
matches = (df['job_description'].apply(' '.join)  # token list -> single string
           .str.extractall(regex)                 # one column per capture group
           .droplevel(1)                          # drop the 'match' level
           .notna()                               # matched group -> True
           .groupby(level=0).max()                # collapse multiple matches per row
           )
matches.columns = bigrams

out = df.join(matches).fillna(False)
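
For intuition, this is (roughly) the intermediate frame extractall produces for this data, before droplevel and groupby; note that row 3 produced no match and therefore has no row at all, which is why the final join/fillna(False) is needed:

df['job_description'].apply(' '.join).str.extractall(regex)
#                      0                 1                 2
#   match
# 0 0         data science               NaN               NaN
# 1 0                  NaN  computer science               NaN
# 2 0                  NaN               NaN  bachelors degree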

Output:

                         job_description  data science  computer science  bachelors degree
0        [data, science, degree, expert]          True             False             False
1   [computer, science, degree, masters]         False              True             False
2  [bachelors, degree, computer, vision]         False             False              True
3            [data, processing, science]         False             False             False

The generated regex:

'(data\s+science)|(computer\s+science)|(bachelors\s+degree)'
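
Note that the search terms are interpolated into the pattern as-is. If they could ever contain regex metacharacters, or if you want to rule out partial-word hits (the pattern above would also match inside 'bigdata science'), here is a more defensive sketch of the same idea (not part of the original answer):

import re

regex = '|'.join(r'\b(%s)\b' % r'\s+'.join(map(re.escape, b.split()))
                 for b in bigrams)
# generated: '\b(data\s+science)\b|\b(computer\s+science)\b|\b(bachelors\s+degree)\b'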

You can also try using numpy with nltk; it should be fast:

import pandas as pd
import numpy as np
import nltk

bigrams = ['data science', 'computer science', 'bachelors degree']
df = pd.DataFrame(data={'job_description': [['data', 'science', 'degree', 'expert'],
                                            ['computer', 'science', 'degree', 'masters'],
                                            ['bachelors', 'degree', 'computer', 'vision'],
                                            ['data', 'processing', 'science']]})

def find_bigrams(data):
    # One output row per document, one column per search bigram.
    output = np.zeros((data.shape[0], len(bigrams)), dtype=bool)
    for i, tokens in enumerate(data):
        # Only adjacent, in-order token pairs, so 'science data' does not
        # match 'data science' (condition 1).
        found = [' '.join(pair) for pair in nltk.bigrams(tokens)]
        output[i] = np.isin(bigrams, found)
    return list(output.T)

output = find_bigrams(df['job_description'].to_numpy())
df = df.assign(**dict(zip(bigrams, output)))

Output:

|    | job_description                               | data science   | computer science   | bachelors degree   |
|---:|:----------------------------------------------|:---------------|:-------------------|:-------------------|
|  0 | ['data', 'science', 'degree', 'expert']       | True           | False              | False              |
|  1 | ['computer', 'science', 'degree', 'masters']  | False          | True               | False              |
|  2 | ['bachelors', 'degree', 'computer', 'vision'] | False          | False              | True               |
|  3 | ['data', 'processing', 'science']             | False          | False              | False              |
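
nltk is only used to generate the adjacent pairs here; if you'd rather avoid the dependency, plain zip yields exactly the same pairs:

tokens = ['data', 'science', 'degree', 'expert']
list(nltk.bigrams(tokens)) == list(zip(tokens, tokens[1:]))  # True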