Create a new column based on keywords in YAKE

I am trying to use YAKE to extract keywords from a list of book summaries.

import pandas as pd
import yake

df = {'Book': [1, 2], 'Summary': ['text definition includes the original words of something written, printed, or spoken', 'example of the Lorem ipsum placeholder text on a green and white webpage']}
df = pd.DataFrame(df)

Then I tried using a loop to extract 1 keyword from each summary:

for i in df['Summary']:
  language = "en"
  max_ngram_size = 1
  deduplication_threshold = 0.9
  numOfKeywords = 1
  custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
  keywords = custom_kw_extractor.extract_keywords(i)
  for kw, w in keywords:
    print(kw)

The output is:

printed
Lorem

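However, I want to add them as a new column in the same DataFrame. The final output should be:

|   Book | Summary                                                                              | Keywords   |
|-------:|:-------------------------------------------------------------------------------------|:-----------|
|      1 | text definition includes the original words of something written, printed, or spoken | printed    |
|      2 | example of the Lorem ipsum placeholder text on a green and white webpage             | Lorem      |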

I tried assigning the keyword to a new column:

df['keywords'] = kw

But it didn't work! It has been a while since I last used Python and pandas, and I can't seem to remember how to do this!

Any help would be appreciated!

Try df.Summary.apply:

import pandas as pd
import yake

language = "en"
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 1
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)

df = {'Book': [1, 2], 'Summary': ['text definition includes the original words of something written, printed, or spoken', 'example of the Lorem ipsum placeholder text on a green and white webpage']}
df = pd.DataFrame(df)

df['Keywords'] = df.Summary.apply(lambda x: custom_kw_extractor.extract_keywords(x)[0][0])
|    |   Book | Summary                                                                              | Keywords   |
|---:|-------:|:-------------------------------------------------------------------------------------|:-----------|
|  0 |      1 | text definition includes the original words of something written, printed, or spoken | printed    |
|  1 |      2 | example of the Lorem ipsum placeholder text on a green and white webpage             | Lorem      |
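
One caveat with the `[0][0]` indexing above: it assumes the extractor returns at least one keyword for every summary. If a summary is empty or yields no candidates, the lookup would raise an `IndexError`. A minimal sketch of a guard follows; the helper name `first_keyword` and the sample scores are hypothetical and only mimic the `(keyword, score)` pairs that `extract_keywords` returns.

```python
def first_keyword(pairs, default=None):
    """Return the keyword from the first (keyword, score) pair, or `default` if there are none."""
    return pairs[0][0] if pairs else default


# Hypothetical data shaped like YAKE output; the scores are made up for illustration.
sample_pairs = [("printed", 0.16), ("text", 0.30)]

print(first_keyword(sample_pairs))    # printed
print(first_keyword([]))              # None
print(first_keyword([], default=""))  # prints an empty line
```

With the extractor defined above, this could be used as `df.Summary.apply(lambda x: first_keyword(custom_kw_extractor.extract_keywords(x)))`, which leaves `None` in rows where nothing was extracted.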

Or with to_numpy():

df['Keywords'] = [custom_kw_extractor.extract_keywords(d)[0][0] for d in df.Summary.to_numpy()]

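For a list of keywords, use to_numpy(), which is usually faster than df.apply (the output below has five keywords per row, so it assumes numOfKeywords was raised to 5):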

df['Keywords'] = [[s[0] for s in custom_kw_extractor.extract_keywords(d)] for d in df.Summary.to_numpy()]
|    |   Book | Summary                                                                              | Keywords                                               |
|---:|-------:|:-------------------------------------------------------------------------------------|:-------------------------------------------------------|
|  0 |      1 | text definition includes the original words of something written, printed, or spoken | ['printed', 'text', 'written', 'spoken', 'definition'] |
|  1 |      2 | example of the Lorem ipsum placeholder text on a green and white webpage             | ['Lorem', 'webpage', 'ipsum', 'placeholder', 'text']   |

Or, if you want a comma-separated string:

df['Keywords'] = [','.join([s[0] for s in custom_kw_extractor.extract_keywords(d)]) for d in df.Summary.to_numpy()]
|    |   Book | Summary                                                                              | Keywords                               |
|---:|-------:|:-------------------------------------------------------------------------------------|:---------------------------------------|
|  0 |      1 | text definition includes the original words of something written, printed, or spoken | printed,text,written,spoken,definition |
|  1 |      2 | example of the Lorem ipsum placeholder text on a green and white webpage             | Lorem,webpage,ipsum,placeholder,text   |
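
Whether to_numpy() really beats apply depends on your data, so it is worth measuring. The sketch below compares the two with `timeit`, using a hypothetical stand-in function (`stand_in_extract`, which is not YAKE) so it runs without the library. With the real extractor, the keyword extraction itself likely dominates the runtime, so the gap may be small.

```python
import timeit

import pandas as pd


def stand_in_extract(text):
    # Hypothetical stand-in for the YAKE call: the three longest distinct words.
    return sorted(dict.fromkeys(text.split()), key=len, reverse=True)[:3]


summaries = [
    "text definition includes the original words of something written, printed, or spoken",
    "example of the Lorem ipsum placeholder text on a green and white webpage",
] * 500
df = pd.DataFrame({"Summary": summaries})

# Both approaches call the same function once per row and give the same values.
via_apply = df["Summary"].apply(stand_in_extract).tolist()
via_numpy = [stand_in_extract(s) for s in df["Summary"].to_numpy()]
same = via_apply == via_numpy

t_apply = timeit.timeit(lambda: df["Summary"].apply(stand_in_extract), number=20)
t_numpy = timeit.timeit(
    lambda: [stand_in_extract(s) for s in df["Summary"].to_numpy()], number=20
)
print(f"apply: {t_apply:.4f}s, to_numpy: {t_numpy:.4f}s, same result: {same}")
```

The timings vary by machine and pandas version, so treat them as a rough guide only.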

Update

1 keyword was for simplicity but I'd love it if I can generalise it to multiple keywords

For multiple keywords, change numOfKeywords and the lambda function:

language = 'en'
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 3  # <- Multiple keywords
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size,
                                            dedupLim=deduplication_threshold,
                                            top=numOfKeywords, features=None)

extract_keywords = lambda x: [k[0] for k in custom_kw_extractor.extract_keywords(x)]
df['TopKeyword'] = df['Summary'].apply(extract_keywords)

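Output:

|   Book | Summary                                                                              | TopKeyword                     |
|-------:|:-------------------------------------------------------------------------------------|:-------------------------------|
|      1 | text definition includes the original words of something written, printed, or spoken | ['printed', 'text', 'written'] |
|      2 | example of the Lorem ipsum placeholder text on a green and white webpage             | ['Lorem', 'webpage', 'ipsum']  |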

To get a string instead of a list, update the lambda function:

extract_keywords = lambda x: ','.join(k[0] for k in custom_kw_extractor.extract_keywords(x))

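Output:

|   Book | Summary                                                                              | TopKeyword           |
|-------:|:-------------------------------------------------------------------------------------|:---------------------|
|      1 | text definition includes the original words of something written, printed, or spoken | printed,text,written |
|      2 | example of the Lorem ipsum placeholder text on a green and white webpage             | Lorem,webpage,ipsum  |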

Old answer

extract_keywords = lambda x: ','.join(k[0] for k in custom_kw_extractor.extract_keywords(x))
df['Keywords'] = df['Summary'].apply(extract_keywords)

There is no need for a loop here; apply can do it for you.

language = 'en'
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 1
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size,
                                            dedupLim=deduplication_threshold,
                                            top=numOfKeywords, features=None)

extract_keyword = lambda x: custom_kw_extractor.extract_keywords(x)[0][0]
df['TopKeyword'] = df['Summary'].apply(extract_keyword)

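Output:

|   Book | Summary                                                                              | TopKeyword   |
|-------:|:-------------------------------------------------------------------------------------|:-------------|
|      1 | text definition includes the original words of something written, printed, or spoken | printed      |
|      2 | example of the Lorem ipsum placeholder text on a green and white webpage             | Lorem        |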
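
As for why the original `df['keywords'] = kw` attempt did not give the expected result: inside the loop, `kw` is overwritten on every pass, so after the loop it holds only the last keyword, and assigning a single string to a column repeats it in every row. The sketch below reproduces this with a hypothetical stand-in function (`fake_top_keyword`, which just picks the longest word and is not YAKE) so it runs without the library.

```python
import pandas as pd


def fake_top_keyword(text):
    # Hypothetical stand-in for the YAKE call: returns the longest word.
    return max(text.split(), key=len)


df = pd.DataFrame({
    "Book": [1, 2],
    "Summary": [
        "text definition includes the original words of something written, printed, or spoken",
        "example of the Lorem ipsum placeholder text on a green and white webpage",
    ],
})

# Loop version: kw is replaced each time, so only the last value survives.
for summary in df["Summary"]:
    kw = fake_top_keyword(summary)

df["broadcast"] = kw  # a scalar is repeated in every row

# Collecting one value per row gives the intended column.
df["Keywords"] = [fake_top_keyword(s) for s in df["Summary"]]

print(df[["Book", "broadcast", "Keywords"]])
```

Here the `broadcast` column contains the second row's keyword twice, while `Keywords` has one value per summary, which is what apply does for you.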