How to get token ids using spaCy (I want to map a text sentence to a sequence of integers)
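I want to use spaCy to tokenize sentences and get a sequence of integer token ids that I can use for downstream tasks. I expect to use it something like below. Please fill in the ???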
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_lg')
# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at ")
doc = nlp(text)
idxs = ??????
print(idxs)
# Want output to be something like;
>> array([ 8045, 70727, 24304, 96127, 44091, 37596, 24524, 35224, 36253])
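Ideally, the integers would refer to some special embedding ids in en_core_web_lg.
spacy.io/usage/vectors-similarity does not give a hint as to which attribute of doc to look for.
I asked this on Cross Validated, but it was judged off-topic. The proper terms for googling or describing this problem would also be helpful.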
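spaCy hashes the text to get unique ids. Every Token object carries several forms, for the different use cases of a given Token in a Document.
If you just want the normalized form of a Token, use the .norm attribute, which is an integer representation (hash) of the text: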
>>> import spacy
>>> nlp = spacy.load('en')
>>> text = "here is some test text"
>>> doc = nlp(text)
>>> [token.norm for token in doc]
[411390626470654571, 3411606890003347522, 7000492816108906599, 1618900948208871284, 15099781594404091470]
You can also use other attributes, such as the lowercase integer attribute .lower, among many others. Use help() on the Document or a Token to get more information.
>>> help(doc[0])
Help on Token object:
class Token(builtins.object)
| An individual token – i.e. a word, punctuation symbol, whitespace,
| etc.
|
...
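The hashes are not arbitrary: the vocabulary's string store can map each one back to the string it came from. A minimal sketch of that round trip, assuming a blank English pipeline (spacy.blank("en"), tokenizer and vocabulary only) rather than the model loaded above, so the hash values it produces are not shown here:

```python
import spacy

# Assumption: a blank English pipeline is enough here, since only the
# tokenizer and the vocabulary are needed (no trained model is loaded).
nlp = spacy.blank("en")
doc = nlp("here is some test text")

# Integer hash of each token's normalized form.
hashes = [token.norm for token in doc]

# The StringStore maps a hash back to the string it was computed from.
recovered = [nlp.vocab.strings[h] for h in hashes]
print(recovered)
```

The same lookup should work for the other integer attributes such as .lower and .orth, and the underscore variants (.norm_, .lower_) return the string directly.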
Solution:
import spacy
nlp = spacy.load('en_core_web_md')
text = (u"When Sebastian Thrun started working on self-driving cars at ")
doc = nlp(text)
ids = []
for token in doc:
    if token.has_vector:
        id = nlp.vocab.vectors.key2row[token.norm]
    else:
        id = None
    ids.append(id)
print([token for token in doc])
print(ids)
#>> [When, Sebastian, Thrun, started, working, on, self, -, driving, cars, at]
#>> [71, 19994, None, 369, 422, 19, 587, 32, 1169, 1153, 41]
Breaking this down:
# A Vocabulary whose __getitem__ takes a chunk of text and returns a Lexeme carrying its hashes
nlp.vocab
# >> <spacy.vocab.Vocab at 0x12bcdce48>
nlp.vocab['hello'].norm # hash
# >> 5983625672228268878
# The tensor holding the word-vector
nlp.vocab.vectors.data.shape
# >> (20000, 300)
# A dict mapping hash -> row in this array
nlp.vocab.vectors.key2row
# >> {12646065887601541794: 0,
# >> 2593208677638477497: 1,
# >> ...}
# So to get int id of 'earth';
i = nlp.vocab.vectors.key2row[nlp.vocab['earth'].norm]
nlp.vocab.vectors.data[i]
# Note that tokens have hashes but may not have vector
# (Hence no entry in .key2row)
nlp.vocab['Thrun'].has_vector
# >> False
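For a downstream task you usually want a dense list of integers with no None entries. A minimal sketch of one way to do that, using a plain dict with made-up hash values as a stand-in for nlp.vocab.vectors.key2row; the helper name hashes_to_rows and the choice to reserve one extra row index for tokens without a vector are assumptions for illustration, not something spaCy does itself:

```python
from typing import Dict, Iterable, List


def hashes_to_rows(hashes: Iterable[int], key2row: Dict[int, int], oov_row: int) -> List[int]:
    """Map token hashes to row indices, sending unknown hashes to oov_row."""
    # dict.get avoids the KeyError that key2row[h] raises for a missing hash.
    return [key2row.get(h, oov_row) for h in hashes]


# Hypothetical stand-in for nlp.vocab.vectors.key2row (hash -> row).
key2row = {101: 0, 202: 1, 303: 2}
# Assumed convention: one extra index, past the last real row, for unknown tokens.
oov_row = len(key2row)

# Hypothetical stand-in for [token.norm for token in doc]; 999 has no vector.
token_hashes = [202, 999, 101]
ids = hashes_to_rows(token_hashes, key2row, oov_row)
print(ids)  # [1, 3, 0]
```

With a real pipeline, token_hashes would be [token.norm for token in doc] and key2row would be nlp.vocab.vectors.key2row; an embedding layer fed by these ids would then need one more row than the vector table to hold the out-of-vocabulary entry.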