Create a new column based on keywords in YAKE
I am trying to use YAKE to extract keywords from a list of book summaries.
df = {'Book': [1, 2], 'Summary': ['text definition includes the original words of something written, printed, or spoken', 'example of the Lorem ipsum placeholder text on a green and white webpage']}
df = pd.DataFrame(df)
Then I tried using a loop to extract 1 keyword from each summary:
for i in df['Summary']:
    language = "en"
    max_ngram_size = 1
    deduplication_threshold = 0.9
    numOfKeywords = 2
    custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
    keywords = custom_kw_extractor.extract_keywords(i)
    for kw, w in keywords:
        print(kw)
The output is:
printed
Lorem
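However, I would like to add them as a new column in the same dataframe. The final output should be:
| Book | Summary | Keywords |
|-------:|:-------------------------------------------------------------------------------------|:-----------|
| 1 | text definition includes the original words of something written, printed, or spoken | printed |
| 2 | example of the Lorem ipsum placeholder text on a green and white webpage | Lorem |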
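I tried assigning the result to a new column:
df['keywords'] = kw
But it didn't work! It has been a while since I last used Python and pandas, and I can't seem to remember how to do this.
Any help would be greatly appreciated!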
Try df.Summary.apply:
import pandas as pd
import yake
language = "en"
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 1
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
df = {'Book': [1, 2], 'Summary': ['text definition includes the original words of something written, printed, or spoken', 'example of the Lorem ipsum placeholder text on a green and white webpage']}
df = pd.DataFrame(df)
df['Keywords'] = df.Summary.apply(lambda x : custom_kw_extractor.extract_keywords(x)[0][0])
| | Book | Summary | Keywords |
|---:|-------:|:-------------------------------------------------------------------------------------|:-----------|
| 0 | 1 | text definition includes the original words of something written, printed, or spoken | printed |
| 1 | 2 | example of the Lorem ipsum placeholder text on a green and white webpage | Lorem |
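One caveat with indexing `[0][0]`: if `extract_keywords` returns an empty list for a row (for example an empty or missing summary), the lambda raises an IndexError. Below is a minimal sketch of a guard, assuming you would rather store a missing value for such rows. The helper name `top_keyword` is hypothetical, and `longest_word` is only a stand-in so the sketch runs without YAKE installed; it does not reproduce YAKE's ranking. In practice you would pass `custom_kw_extractor.extract_keywords` instead.

```python
import pandas as pd


def top_keyword(text, extract):
    """Return the first keyword from extract(text), or None if there is none.

    `extract` is any callable returning a list of (keyword, score) pairs,
    e.g. custom_kw_extractor.extract_keywords.
    """
    # Skip missing values and blank strings before calling the extractor.
    if not isinstance(text, str) or not text.strip():
        return None
    pairs = extract(text)
    return pairs[0][0] if pairs else None


# Hypothetical stand-in for the YAKE extractor (illustration only):
# it returns the longest word with a dummy score.
def longest_word(text):
    words = [w.strip(",.") for w in text.split()]
    return [(max(words, key=len), 0.0)] if words else []


df = pd.DataFrame({"Book": [1, 2, 3],
                   "Summary": ["printed or spoken words", "", None]})
df["Keywords"] = df["Summary"].apply(lambda s: top_keyword(s, longest_word))
print(df)
```

Rows with an empty or missing summary end up with a missing value in `Keywords` instead of stopping the whole `apply` with an exception.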
Alternatively, build the column with a list comprehension over to_numpy():
df['Keywords'] = [custom_kw_extractor.extract_keywords(d)[0][0] for d in df.Summary.to_numpy()]
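For a list of keywords, use to_numpy(), since it is usually faster than df.apply (the output below shows five keywords per row, which assumes the extractor was created with top=5 rather than 1):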
df['Keywords'] = [[s[0] for s in custom_kw_extractor.extract_keywords(d)] for d in df.Summary.to_numpy()]
| | Book | Summary | Keywords |
|---:|-------:|:-------------------------------------------------------------------------------------|:-------------------------------------------------------|
| 0 | 1 | text definition includes the original words of something written, printed, or spoken | ['printed', 'text', 'written', 'spoken', 'definition'] |
| 1 | 2 | example of the Lorem ipsum placeholder text on a green and white webpage | ['Lorem', 'webpage', 'ipsum', 'placeholder', 'text'] |
Or, if you want a comma-separated string:
df['Keywords'] = [','.join([s[0] for s in custom_kw_extractor.extract_keywords(d)]) for d in df.Summary.to_numpy()]
| | Book | Summary | Keywords |
|---:|-------:|:-------------------------------------------------------------------------------------|:---------------------------------------|
| 0 | 1 | text definition includes the original words of something written, printed, or spoken | printed,text,written,spoken,definition |
| 1 | 2 | example of the Lorem ipsum placeholder text on a green and white webpage | Lorem,webpage,ipsum,placeholder,text |
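Whether to_numpy() is actually faster than apply depends on your data, so it may be worth measuring. The sketch below times both approaches with `timeit`. The `extract` function is a hypothetical stand-in for `custom_kw_extractor.extract_keywords` so the example runs without YAKE, and the 1000-row frame is made up for the benchmark.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    {"Summary": ["example of the Lorem ipsum placeholder text"] * 1000}
)


# Hypothetical stand-in for custom_kw_extractor.extract_keywords:
# returns the longest word with a dummy score.
def extract(text):
    return [(max(text.split(), key=len), 0.0)]


def with_apply():
    return df["Summary"].apply(lambda x: extract(x)[0][0]).tolist()


def with_to_numpy():
    return [extract(d)[0][0] for d in df["Summary"].to_numpy()]


# Both approaches must produce the same values before timing them.
assert with_apply() == with_to_numpy()

print("apply   :", timeit.timeit(with_apply, number=20))
print("to_numpy:", timeit.timeit(with_to_numpy, number=20))
```

With the real extractor, the keyword extraction itself is likely to take most of the time, so the gap between the two may be small; treat the numbers from the stand-in as a rough indication only.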
Update
1 keyword was for simplicity but I'd love it if I can generalise it to multiple keywords
For multiple keywords, change numOfKeywords and the lambda function:
language = 'en'
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 3 # <- Multiple keywords
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size,
                                            dedupLim=deduplication_threshold,
                                            top=numOfKeywords, features=None)
extract_keywords = lambda x: [k[0] for k in custom_kw_extractor.extract_keywords(x)]
df['TopKeyword'] = df['Summary'].apply(extract_keywords)
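Output:
| Book | Summary | TopKeyword |
|-------:|:-------------------------------------------------------------------------------------|:-------------------------------|
| 1 | text definition includes the original words of something written, printed, or spoken | ['printed', 'text', 'written'] |
| 2 | example of the Lorem ipsum placeholder text on a green and white webpage | ['Lorem', 'webpage', 'ipsum'] |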
To get a string instead of a list, update the lambda function:
extract_keywords = lambda x: ','.join(k[0] for k in custom_kw_extractor.extract_keywords(x))
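Output:
| Book | Summary | TopKeyword |
|-------:|:-------------------------------------------------------------------------------------|:---------------------|
| 1 | text definition includes the original words of something written, printed, or spoken | printed,text,written |
| 2 | example of the Lorem ipsum placeholder text on a green and white webpage | Lorem,webpage,ipsum |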
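If you would rather have one column per keyword than a list or a joined string, the list column can be expanded into separate columns. This is a sketch under the assumption that `TopKeyword` already holds the lists shown in the list output above; the column names `Keyword1`, `Keyword2`, ... are made up for illustration.

```python
import pandas as pd

# Lists copied from the list output of the multi-keyword example.
df = pd.DataFrame({
    "Book": [1, 2],
    "TopKeyword": [["printed", "text", "written"],
                   ["Lorem", "webpage", "ipsum"]],
})

# One column per keyword position; shorter lists are padded with missing values.
expanded = pd.DataFrame(df["TopKeyword"].tolist(), index=df.index)
expanded.columns = [f"Keyword{i + 1}" for i in expanded.columns]

# Attach the new columns to the original frame (aligned on the index).
df = df.join(expanded)
print(df)
```

Passing `index=df.index` keeps the new columns aligned with the original rows even when the dataframe does not use the default 0..n-1 index.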
Old answer
extract_keywords = lambda x: ','.join(k[0] for k in custom_kw_extractor.extract_keywords(x))
df['Keywords'] = df['Summary'].apply(extract_keywords)
You don't need a loop here; apply can do it for you.
language = 'en'
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 1
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size,
                                            dedupLim=deduplication_threshold,
                                            top=numOfKeywords, features=None)
extract_keyword = lambda x: custom_kw_extractor.extract_keywords(x)[0][0]
df['TopKeyword'] = df['Summary'].apply(extract_keyword)
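Output:
| Book | Summary | TopKeyword |
|-------:|:-------------------------------------------------------------------------------------|:-----------|
| 1 | text definition includes the original words of something written, printed, or spoken | printed |
| 2 | example of the Lorem ipsum placeholder text on a green and white webpage | Lorem |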
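As for why `df['keywords'] = kw` in the question did not work: once the loop has finished, `kw` holds only the last keyword, and assigning a single string to a column repeats it in every row. If you prefer to keep an explicit loop, one option is to collect one value per row in a list and assign the whole list. The sketch below uses a hypothetical stand-in `extract_keywords` function so it runs without YAKE installed; it returns the longest word and does not reproduce YAKE's ranking, so its keywords differ from the outputs above.

```python
import pandas as pd

df = pd.DataFrame({
    "Book": [1, 2],
    "Summary": [
        "text definition includes the original words of something written, printed, or spoken",
        "example of the Lorem ipsum placeholder text on a green and white webpage",
    ],
})


# Hypothetical stand-in for custom_kw_extractor.extract_keywords.
def extract_keywords(text):
    words = [w.strip(",.") for w in text.split()]
    return [(max(words, key=len), 0.0)]


# Assigning a scalar repeats it in every row, which is what happened with `kw`.
df["broadcast"] = "Lorem"

# Collect one keyword per row, then assign the whole list at once.
collected = []
for summary in df["Summary"]:
    keywords = extract_keywords(summary)
    collected.append(keywords[0][0])
df["keywords"] = collected

print(df[["Book", "broadcast", "keywords"]])
```

The list must have exactly one entry per row, otherwise pandas raises a ValueError about mismatched lengths.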