spacy getting tokens in the form of strings instead of uint64
I'm wondering if there is a way to get the result of tokenizer(s).to_array("LOWER") in the form of strings instead of uint64.
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
s = "Lets pray for the people that can be the victim of the possible eruption of Taal Volcano keep safe everyone."
# Create nlp obj
nlp = English()
tokenizer = Tokenizer(nlp.vocab)
#Get a list of tokens through list comprehension
tokens = [word.text for word in tokenizer(s)]
#Out > ["Lets", "pray", "for", ... , "everyone"]
# An easier method that also lowercases the tokens:
tokens = tokenizer(s).to_array("LOWER")
#Out > array([565864407007422797, 10267499103039061205, ... ,13330460590412905967],dtype=uint64)
# But the result is an ndarray of dtype uint64, not strings
Is there a way in spacy to get it in string format? It would make things like removing stop words much easier:
sp = spacy.load("en_core_web_sm")
all_stop_words = sp.Defaults.stop_words
token_without_stopwords = [word for word in tokenizer(s).to_array("LOWER") if word not in all_stop_words]
# This will of course not work, since they are two different data types from what I understand.
Since the return type of Doc.to_array is ndarray, it doesn't seem possible to get a list of string tokens from to_array:
Export given token attributes to a numpy ndarray. If attr_ids is a sequence of M attributes, the output array will be of shape (N, M), where N is the length of the Doc (in tokens). If attr_ids is a single attribute, the output shape will be (N,). You can specify attributes by integer ID (e.g. spacy.attrs.LEMMA) or string name (e.g. “LEMMA” or “lemma”). The values will be 64-bit integers.
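Those 64-bit integers are hash values in spaCy's StringStore, so if you do want to keep the to_array route, one option (a minimal sketch reusing s from the question, not part of the original answer) is to map each hash back to its string via nlp.vocab.strings:

from spacy.lang.en import English

nlp = English()
doc = nlp.tokenizer(s)
# to_array("LOWER") returns uint64 hash IDs; the vocab's StringStore
# resolves each hash back to the lowercase token text
tokens = [nlp.vocab.strings[int(h)] for h in doc.to_array("LOWER")]
# ['lets', 'pray', 'for', ..., 'everyone']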
You can use
token_without_stopwords = [word for word in map(lambda x: x.text.lower(), tokenizer(s)) if word not in all_stop_words]
where map(lambda x: x.text.lower(), tokenizer(s)) gives a map object with all the token texts lowercased.
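Note that map is lazy, so if you want to inspect the lowercased tokens themselves, wrap the map object in list (reusing tokenizer and s from the question):

lowered = list(map(lambda x: x.text.lower(), tokenizer(s)))
# ['lets', 'pray', 'for', ..., 'everyone']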
You can do it like this:
sp = spacy.load("en_core_web_sm")
all_stop_words = sp.Defaults.stop_words
lower_words = [word.text.lower() for word in sp(s)]
filtered = [word for word in lower_words if word not in all_stop_words]
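As a small variant (not in the original answer), spaCy tokens also have a lower_ attribute that returns the lowercase text directly, so the two steps can be combined:

filtered = [token.lower_ for token in sp(s) if token.lower_ not in all_stop_words]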