Create new boolean fields based on specific bigrams appearing in a tokenized pandas dataframe
Looping over a list of bigrams to search for, I need to create a boolean field for each bigram according to whether it is present in a tokenized pandas Series. I'd appreciate an upvote if you think this is a good question!
List of bigrams:
bigrams = ['data science', 'computer science', 'bachelors degree']
Dataframe:
df = pd.DataFrame(data={'job_description': [['data', 'science', 'degree', 'expert'],
                                            ['computer', 'science', 'degree', 'masters'],
                                            ['bachelors', 'degree', 'computer', 'vision'],
                                            ['data', 'processing', 'science']]})
Desired output:
                         job_description  data science  computer science  bachelors degree
0        [data, science, degree, expert]          True             False             False
1   [computer, science, degree, masters]         False              True             False
2  [bachelors, degree, computer, vision]         False             False              True
3            [data, processing, science]         False             False             False
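Criteria:
- Only exact matches should be flagged (for example, flagging for 'data science' should return True for 'data science' but False for 'science data' or 'data bachelors science')
- Each search term should get its own field and be concatenated to the original df
What I've tried:
Failed: df = [x for x in df['job_description'] if x in bigrams]
Failed: df[bigrams] = [[any(w==term for w in lst) for term in bigrams] for lst in df['job_description']]
Failed: I wasn't able to adapt the approach here -> Match trigrams, bigrams, and unigrams to a text; if unigram or bigram a substring of already matched trigram, pass; python
Failed: I couldn't adapt this one either ->
Failed: This approach came very close, but I couldn't adapt it to bigrams ->
Thanks for any help you can provide!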
You can use a regex and extractall:
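# One capture group per bigram; tokens may be separated by any whitespace
regex = '|'.join('(%s)' % b.replace(' ', r'\s+') for b in bigrams)
# Join each token list back into a string, then record which groups matched
matches = (df['job_description'].apply(' '.join)
           .str.extractall(regex).droplevel(1).notna()
           .groupby(level=0).max()
           )
matches.columns = bigrams
# Rows without any match are absent from `matches`; reindex fills them with False
out = df.join(matches.reindex(df.index, fill_value=False))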
Output:
                         job_description  data science  computer science  bachelors degree
0        [data, science, degree, expert]          True             False             False
1   [computer, science, degree, masters]         False              True             False
2  [bachelors, degree, computer, vision]         False             False              True
3            [data, processing, science]         False             False             False
Generated regex:
'(data\s+science)|(computer\s+science)|(bachelors\s+degree)'
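One caveat with a plain alternation like this: once the tokens are joined into a string, 'data science' would also be found inside 'metadata science'. If that matters for your data, a possible variant (a sketch only, not part of the answer above) is to escape each token and require whitespace or the string edge on both sides. The helper name `bigram_pattern` and the extra fifth row are made up for illustration.

```python
import re
import pandas as pd

bigrams = ['data science', 'computer science', 'bachelors degree']
df = pd.DataFrame({'job_description': [
    ['data', 'science', 'degree', 'expert'],
    ['computer', 'science', 'degree', 'masters'],
    ['bachelors', 'degree', 'computer', 'vision'],
    ['data', 'processing', 'science'],
    ['metadata', 'science'],  # hypothetical row, added to show the boundary check
]})

def bigram_pattern(bigram):
    """Hypothetical helper: regex matching the two tokens as whole, adjacent words."""
    first, second = bigram.split()
    # (?<!\S) and (?!\S) demand whitespace or the string edge around the match
    return r'(?<!\S)' + re.escape(first) + r'\s+' + re.escape(second) + r'(?!\S)'

# Join each token list into one string, then test every bigram separately
text = df['job_description'].apply(' '.join)
flags = pd.DataFrame({b: text.str.contains(bigram_pattern(b), regex=True)
                      for b in bigrams})
out = df.join(flags)
print(out)
```

Here the 'metadata science' row stays False in every column, while the first four rows give the same result as above.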
You can also try numpy and nltk, which should be quite fast:
import pandas as pd
import numpy as np
import nltk

bigrams = ['data science', 'computer science', 'bachelors degree']
df = pd.DataFrame(data={'job_description': [['data', 'science', 'degree', 'expert'],
                                            ['computer', 'science', 'degree', 'masters'],
                                            ['bachelors', 'degree', 'computer', 'vision'],
                                            ['data', 'processing', 'science']]})

def find_bigrams(data):
    # One row per job description, one column per search bigram
    output = np.zeros((data.shape[0], len(bigrams)), dtype=bool)
    for i, d in enumerate(data):
        # Adjacent token pairs in their original order only, so 'science data' does not count as 'data science'
        possible_bigrams = [' '.join(x) for x in nltk.bigrams(d)]
        indices = np.where(np.isin(bigrams, list(set(bigrams).intersection(set(possible_bigrams)))))
        output[i, indices] = True
    # Transpose so that each returned array is one column
    return list(output.T)

output = find_bigrams(df['job_description'].to_numpy())
df = df.assign(**dict(zip(bigrams, output)))
| | job_description | data science | computer science | bachelors degree |
|---:|:----------------------------------------------|:---------------|:-------------------|:-------------------|
| 0 | ['data', 'science', 'degree', 'expert'] | True | False | False |
| 1 | ['computer', 'science', 'degree', 'masters'] | False | True | False |
| 2 | ['bachelors', 'degree', 'computer', 'vision'] | False | False | True |
| 3 | ['data', 'processing', 'science'] | False | False | False |
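If you would rather not depend on nltk, the same adjacent-pair idea can be sketched with plain `zip`. This is only an illustrative alternative under the same sample data; the helper name `adjacent_pairs` is made up here.

```python
import pandas as pd

bigrams = ['data science', 'computer science', 'bachelors degree']
df = pd.DataFrame({'job_description': [
    ['data', 'science', 'degree', 'expert'],
    ['computer', 'science', 'degree', 'masters'],
    ['bachelors', 'degree', 'computer', 'vision'],
    ['data', 'processing', 'science'],
]})

def adjacent_pairs(tokens):
    """Hypothetical helper: set of 'w1 w2' strings for neighbouring tokens, in order."""
    return {' '.join(pair) for pair in zip(tokens, tokens[1:])}

# Build one row of booleans per job description
rows = []
for tokens in df['job_description']:
    pairs = adjacent_pairs(tokens)
    rows.append([b in pairs for b in bigrams])

flags = pd.DataFrame(rows, columns=bigrams, index=df.index)
out = df.join(flags)
print(out)
```

Because the pairs keep token order, 'science data' and 'data bachelors science' are not flagged as 'data science'.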