How to remove one-letter tokens with TF-IDF Vectorizer

Question

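I'm working on a small project to compute tf-idf for a document that basically contains book titles and their summaries. So far I have only managed to remove stop words and numbers. My goal now is to keep only words with at least three letters and to lemmatize them. This is the code I have written: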

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfVectorizer(stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b')
tfidf_matrix = tf_idf.fit_transform(doc)
print(tfidf_matrix)

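If I print tf_idf.vocabulary_, I get all the words that appear in the document, plus single letters such as r, s, t and m. As for lemmatization, I don't know how to do it and I still don't understand how it works. If anyone can help me, thank you in advance.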

Answer 1

From the TfidfVectorizer documentation:

token_pattern : str, default=r"(?u)\b\w\w+\b"
Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

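To select only words with at least three letters, change your regular expression from:

tf_idf = TfidfVectorizer(stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b')

to one that uses a regex quantifier. The quantifier {n,} matches the preceding element at least n times, so [A-Za-z]{3,} requires three or more letters: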

tf_idf = TfidfVectorizer(stop_words='english', analyzer='word', token_pattern=r'(?u)\b[A-Za-z]{3,}\b')

# doc used as sample text.
doc = """Hi Lucia. How are you? It was so nice to meet you last week in Sydney at the sales meeting. How was the rest of your trip? Did you see any kangaroos? I hope you got home to Mexico City OK.
Anyway, I have the documents about the new Berlin offices. We're going to be open in three months. I moved here from London just last week. They are very nice offices, and the location is perfect.
There are lots of restaurants, cafés and banks in the area. There's also public transport; we are next to an U-Bahn (that is the name for the metro here). Maybe you can come and see them one day? I would love to show you Berlin, especially in the winter. You said you have never seen snow – you will see lots here! Here's a photo of you and me at the restaurant in Sydney. That was a very fun night! Remember the singing Englishman? Crazy! Please send me any other photos you have of that night. Good memories.
Please give me your email address and I will send you the documents. Bye for now. Mikel"""

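# Fit before reading vocabulary_; doc is a single string, so wrap it in a list.
tfidf_matrix = tf_idf.fit_transform([doc])
print(tf_idf.vocabulary_)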
{
   "lucia": 27,
   "nice": 38,
   "meet": 29,
   "week": 59,
   "sydney": 56,
   "sales": 51,
   "meeting": 30,
   "rest": 47,
   "trip": 58,
   "did": 10,
   "kangaroos": 22,
   "hope": 20,
   "got": 18,
   ...
   ...
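
To see the effect of the quantifier in isolation, the three patterns discussed above can be compared with the standard re module. This is only a sketch: the sample sentence is made up for illustration, and re.findall on lowercased text only approximates the tokenization step of the vectorizer (stop-word removal is not applied here).

```python
import re

# Hypothetical sample sentence, not taken from the question's data.
text = "r s t is a big kangaroo in Sydney 2024"

patterns = {
    # scikit-learn's documented default: 2 or more alphanumeric characters
    "default": r"(?u)\b\w\w+\b",
    # pattern from the question: letters only, any length (keeps r, s, t)
    "letters_any_length": r"(?u)\b[A-Za-z]+\b",
    # pattern from the answer: letters only, at least three of them
    "letters_min_three": r"(?u)\b[A-Za-z]{3,}\b",
}

# Lowercase first, mirroring the vectorizer's default lowercase=True.
tokens = {name: re.findall(p, text.lower()) for name, p in patterns.items()}

for name, found in tokens.items():
    print(name, found)
```

Only the last pattern drops both the single letters and the two-letter words, while still excluding the number.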

Tags: python, regex, text-mining, stop-words, tf-idf