Extract keywords from a dataframe column to another column
I have a dataframe in the following format (link to the csv file):
       image_name  caption_number  caption
0  1000092795.jpg               0  Two young guys with shaggy hair look at their...
1  1000092795.jpg               1  Two young , White males are outside near many...
2  1000092795.jpg               2  Two men in green shirts are standing in a yard .
3  1000092795.jpg               3  A man in a blue shirt standing in a garden .
4  1000092795.jpg               4  Two friends enjoy time spent together .
I want to add another column, keywords, containing keywords extracted with an NLP keyword-extraction method.
Here is what I tried:
import nltk
import pandas as pd
from nltk.corpus import stopwords

df = pd.read_csv('results.csv', delimiter='|')
df.columns = ['image_name', 'caption_number', 'caption']
stop_words = stopwords.words('english')

def get_keywords(row):
    some_text = row['caption']
    lowered = some_text.lower()
    # tokenize the lowercased text so capitalized stop words are filtered too
    tokens = nltk.tokenize.word_tokenize(lowered)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    keywords_string = ','.join(keywords)
    return keywords_string

df['Keywords'] = df['caption'].apply(get_keywords, axis=1)
The above returns an error: get_keywords() got an unexpected keyword argument 'axis'
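That particular error occurs because Series.apply has no axis parameter and forwards unknown keyword arguments to the function; to pass the whole row, call apply on the dataframe with axis=1 instead. In addition, the caption column has NaN values, so they need to be dropped before applying the function.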
# remove all digits from the caption strings
df['caption'] = df['caption'].str.replace(r'\d+', '', regex=True)
# drop all rows that contain NaN values
df = df.dropna()
# pass the whole row to the function
df['Keywords'] = df.apply(get_keywords, axis=1)
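As a rough alternative, the extraction can also be written as a function of a single caption string, so it works per cell and tolerates NaN without dropping rows first. The sketch below is only an illustration under some assumptions: the name extract_keywords and the tiny STOP_WORDS set are hypothetical stand-ins (the question uses NLTK's English stop-word list), and a simple regex replaces nltk.tokenize.word_tokenize, so results may differ for contractions and hyphenated words.

```python
import re

# Tiny illustrative stop-word set (an assumption for this sketch); the question
# uses nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = {"a", "an", "the", "in", "at", "with", "are", "is", "their", "near", "and", "of", "on"}


def extract_keywords(text, stop_words=STOP_WORDS):
    """Return a comma-separated string of lowercase, alphabetic, non-stop-word tokens."""
    # NaN cells arrive as floats, so anything that is not a string yields no keywords.
    if not isinstance(text, str):
        return ""
    # Lowercase first so capitalized stop words are caught, then keep runs of
    # letters only; this drops digits and punctuation in the same step.
    tokens = re.findall(r"[a-z]+", text.lower())
    return ",".join(token for token in tokens if token not in stop_words)


captions = [
    "Two men in green shirts are standing in a yard .",
    "A man in a blue shirt standing in a garden .",
    float("nan"),  # stands in for a missing caption
]
keywords = [extract_keywords(caption) for caption in captions]
print(keywords)
```

With pandas, a function like this could be applied to the column directly as df['Keywords'] = df['caption'].apply(extract_keywords), with no axis argument, since Series.apply calls the function once per value.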