Python 中的关键字提取 - 如何处理带连字符的复合词
Keywords extraction in Python - How to handle hyphenated compound words
我正在尝试使用 KeyBert 和 pke PositionRank Python 执行关键短语提取。您可以在下面看到我的代码的摘录。
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
import pke
text = "The life-cycle Global Warming Potential of the building resulting from the construction has been calculated for each stage in the life-cycle and is disclosed to investors and clients on demand" #text_cleaning(df_tassonomia.iloc[1077].text, sentence_adjustment, stop_words)
# Pke
extractor = pke.unsupervised.PositionRank()
extractor.load_document(text, language='en')
extractor.candidate_selection(maximum_word_number = 5)
extractor.candidate_weighting(window = 10)
keyphrases = extractor.get_n_best(n=10)
print(keyphrases)
# KeyBert
kw_model = KeyBERT(model = "all-mpnet-base-v2")
keyphrases_2 = kw_model.extract_keywords(docs=text,
vectorizer=KeyphraseCountVectorizer(),
keyphrase_ngram_range = (1,5),
top_n=10
)
print("")
print(keyphrases_2)
这里是结果:
[('cycle global warming potential', 0.44829175082921835), ('life', 0.17858359644549557), ('cycle', 0.15775994057934534), ('building', 0.09131084381406684), ('construction', 0.08860454878871142), ('investors', 0.05426710724030216), ('clients', 0.054111700289631526), ('stage', 0.045672396861507744), ('demand', 0.039158055731066406)]
[('cycle global warming potential', 0.5444), ('building', 0.4479), ('construction', 0.3476), ('investors', 0.1967), ('clients', 0.1519), ('demand', 0.1484), ('cycle', 0.1312), ('stage', 0.0931), ('life', 0.0847)]
我想处理带连字符的复合词(示例中的生命周期)被视为唯一词,但我不明白如何从单词分隔符列表中排除 -。
提前感谢您的帮助。
弗朗西斯卡
这可能是一个愚蠢的解决方法,但它可能会有所帮助:
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
import pke
text = "The life-cycle Global Warming Potential of the building
resulting from the construction has been calculated for each stage in
the life-cycle and is disclosed to investors and clients on demand"
# Pke
tokens = text.split()
orignal = set([x for x in tokens if "_" in x])
text = text.replace("-", "_")
extractor = pke.unsupervised.PositionRank()
extractor.load_document(text, language='en')
extractor.candidate_selection(maximum_word_number=5)
extractor.candidate_weighting(window=10)
keyphrases = extractor.get_n_best(n=10)
keyphrases_replaced = []
for pair in keyphrases:
if "_" in pair[0] and pair[0] not in orignal:
keyphrases_replaced.append((pair[0].replace("_","-"),pair[1]))
else:
keyphrases_replaced.append(pair)
print(keyphrases_replaced)
# KeyBert
keyphrases_2 = kw_model.extract_keywords(docs=text,
vectorizer=KeyphraseCountVectorizer(),
keyphrase_ngram_range=(1, 5),
top_n=10
)
print("")
print(keyphrases_2)
输出应如下所示:
[('life-cycle global warming potential', 0.5511001220016548), ('life-cycle', 0.20123353586644233), ('construction', 0.11945270995269436), ('building', 0.10637157845606555), ('investors', 0.06675114967366767), ('stage', 0.05503532672910801), ('clients', 0.0507262942318816), ('demand', 0.05056281895492815)]
希望对您有所帮助:)
问题已在最新的 pke 更新中修复:https://github.com/boudinfl/pke/issues/195
import pke
extractor = pke.unsupervised.TopicRank()
extractor.load_document(input='BERT is a state-of-the-art model.', language='en')
extractor.grammar_selection(grammar="NP: {<ADJ>*<NOUN|PROPN>+}")
print(extractor.candidates.keys())
现在 returns 这个输出:
dict_keys(['bert', 'state-of-the-art model'])
我正在尝试使用 KeyBert 和 pke PositionRank Python 执行关键短语提取。您可以在下面看到我的代码的摘录。
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
import pke
text = "The life-cycle Global Warming Potential of the building resulting from the construction has been calculated for each stage in the life-cycle and is disclosed to investors and clients on demand" #text_cleaning(df_tassonomia.iloc[1077].text, sentence_adjustment, stop_words)
# Pke
extractor = pke.unsupervised.PositionRank()
extractor.load_document(text, language='en')
extractor.candidate_selection(maximum_word_number = 5)
extractor.candidate_weighting(window = 10)
keyphrases = extractor.get_n_best(n=10)
print(keyphrases)
# KeyBert
kw_model = KeyBERT(model = "all-mpnet-base-v2")
keyphrases_2 = kw_model.extract_keywords(docs=text,
vectorizer=KeyphraseCountVectorizer(),
keyphrase_ngram_range = (1,5),
top_n=10
)
print("")
print(keyphrases_2)
这里是结果:
[('cycle global warming potential', 0.44829175082921835), ('life', 0.17858359644549557), ('cycle', 0.15775994057934534), ('building', 0.09131084381406684), ('construction', 0.08860454878871142), ('investors', 0.05426710724030216), ('clients', 0.054111700289631526), ('stage', 0.045672396861507744), ('demand', 0.039158055731066406)]
[('cycle global warming potential', 0.5444), ('building', 0.4479), ('construction', 0.3476), ('investors', 0.1967), ('clients', 0.1519), ('demand', 0.1484), ('cycle', 0.1312), ('stage', 0.0931), ('life', 0.0847)]
我想处理带连字符的复合词(示例中的生命周期)被视为唯一词,但我不明白如何从单词分隔符列表中排除 -。
提前感谢您的帮助。 弗朗西斯卡
这可能是一个愚蠢的解决方法,但它可能会有所帮助:
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
import pke
text = "The life-cycle Global Warming Potential of the building
resulting from the construction has been calculated for each stage in
the life-cycle and is disclosed to investors and clients on demand"
# Pke
tokens = text.split()
orignal = set([x for x in tokens if "_" in x])
text = text.replace("-", "_")
extractor = pke.unsupervised.PositionRank()
extractor.load_document(text, language='en')
extractor.candidate_selection(maximum_word_number=5)
extractor.candidate_weighting(window=10)
keyphrases = extractor.get_n_best(n=10)
keyphrases_replaced = []
for pair in keyphrases:
if "_" in pair[0] and pair[0] not in orignal:
keyphrases_replaced.append((pair[0].replace("_","-"),pair[1]))
else:
keyphrases_replaced.append(pair)
print(keyphrases_replaced)
# KeyBert
keyphrases_2 = kw_model.extract_keywords(docs=text,
vectorizer=KeyphraseCountVectorizer(),
keyphrase_ngram_range=(1, 5),
top_n=10
)
print("")
print(keyphrases_2)
输出应如下所示:
[('life-cycle global warming potential', 0.5511001220016548), ('life-cycle', 0.20123353586644233), ('construction', 0.11945270995269436), ('building', 0.10637157845606555), ('investors', 0.06675114967366767), ('stage', 0.05503532672910801), ('clients', 0.0507262942318816), ('demand', 0.05056281895492815)]
希望对您有所帮助:)
问题已在最新的 pke 更新中修复:https://github.com/boudinfl/pke/issues/195
import pke
extractor = pke.unsupervised.TopicRank()
extractor.load_document(input='BERT is a state-of-the-art model.', language='en')
extractor.grammar_selection(grammar="NP: {<ADJ>*<NOUN|PROPN>+}")
print(extractor.candidates.keys())
现在 returns 这个输出:
dict_keys(['bert', 'state-of-the-art model'])