Extract keywords from a dataframe column to another column

I have a dataframe in the following format (link to the csv file):

       image_name  caption_number  caption
0  1000092795.jpg  0               Two young guys with shaggy hair look at their...
1  1000092795.jpg  1               Two young , White males are outside near many...
2  1000092795.jpg  2               Two men in green shirts are standing in a yard .
3  1000092795.jpg  3               A man in a blue shirt standing in a garden .
4  1000092795.jpg  4               Two friends enjoy time spent together .

I want to add another column, keywords, holding keywords extracted with an NLP keyword-extraction method.

This is what I tried:

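import nltk
import pandas as pd
from nltk.corpus import stopwords

# stopwords and word_tokenize need the corresponding NLTK data downloaded
df = pd.read_csv('results.csv', delimiter='|')
df.columns = ['image_name', 'caption_number', 'caption']
stop_words = stopwords.words('english')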

def get_keywords(row):
    some_text = row['caption']
    lowered = some_text.lower()
    # Tokenize the lowercased text so capitalized stop words are filtered too
    tokens = nltk.tokenize.word_tokenize(lowered)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    keywords_string = ','.join(keywords)
    return keywords_string


df['Keywords'] = df['caption'].apply(get_keywords, axis=1) 

The above returns an error: get_keywords() got an unexpected keyword argument 'axis'

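The error itself arises because df['caption'] is a Series, and Series.apply has no axis parameter: it forwards axis=1 to get_keywords as a keyword argument. Call apply on the DataFrame instead if the function needs the whole row. The caption column also has NaN values, so you need to drop them before applying the function.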
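To see both problems in isolation (the forwarded axis keyword and the missing captions), here is a minimal sketch on toy data. The Series contents and the helper shout are made up for illustration; shout merely stands in for any function that takes a single string.

```python
import pandas as pd

# Toy data standing in for the caption column (not the real CSV).
captions = pd.Series(["Two men in green shirts", None, "A man in a blue shirt"])

def shout(text):
    # Hypothetical stand-in for a keyword function that takes one string.
    return text.upper()

# Series.apply has no axis parameter; extra keyword arguments are forwarded
# to the function, which raises the TypeError seen in the question.
try:
    captions.apply(shout, axis=1)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Missing values are a separate problem: they are not strings, so string
# methods fail on them. Dropping them first avoids that.
cleaned = captions.dropna()
result = cleaned.apply(shout)
print(result.tolist())
```

Without the axis argument and with the missing value dropped, the call goes through.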

# Remove all digits from the caption strings
df['caption'] = df['caption'].str.replace(r'\d+', '', regex=True)
# Drop all rows containing NaN values
df = df.dropna()
# Use DataFrame.apply with axis=1 if the whole row must be passed to the function
df['Keywords'] = df.apply(get_keywords, axis=1)
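If only the caption is needed, an alternative is to apply a string-based function to the column itself, with no axis argument. The sketch below is self-contained under some assumptions: keywords_from_text is a hypothetical variant of get_keywords, the small STOP_WORDS set and the regex tokenizer replace NLTK's stop-word list and word_tokenize so the example runs without downloads, and the row with a missing caption is invented to show the NaN handling.

```python
import re
import pandas as pd

# Tiny illustrative stop-word set; the question uses NLTK's English list.
STOP_WORDS = {"a", "in", "are", "the", "at", "with", "near"}

def keywords_from_text(text):
    # Hypothetical string-based variant of get_keywords: lowercase the text,
    # keep alphabetic tokens, and drop stop words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return ",".join(t for t in tokens if t not in STOP_WORDS)

df = pd.DataFrame({
    "image_name": ["1000092795.jpg"] * 3,
    "caption_number": [2, 3, 4],
    "caption": [
        "Two men in green shirts are standing in a yard .",
        None,  # made-up missing caption
        "A man in a blue shirt standing in a garden .",
    ],
})

# Drop only the rows whose caption is missing, then apply to the column.
df = df.dropna(subset=["caption"]).copy()
df["Keywords"] = df["caption"].apply(keywords_from_text)
print(df["Keywords"].tolist())
```

Passing subset to dropna keeps rows that are only missing values in other columns, which a bare df.dropna() would discard.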