How to encode words in a list in Python
I have a dictionary that maps each word (the key) to an integer value, for example:
{'me': 41, 'are': 21, 'the': 0}
I also have a data frame with a column containing lists of already tokenized words, for example:
['I', 'liked', 'the', 'color', 'of', 'this', 'top']
['Just', 'grabbed', 'this', 'today', 'great', 'find']
How can I encode each of these words as its corresponding value in the dictionary? For example:
[56, 78, 5, 1197, 556, 991, 40]
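Using a dictionary and a list
The following uses a dictionary (final_dictionary) to look up the id of each word. This works well if you already have a preset dictionary of ids.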
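# Sample data from the question; substitute your own dictionary and tokens.
final_dictionary = {'me': 41, 'are': 21, 'the': 0}
tokens = ['I', 'liked', 'the', 'color', 'of', 'this', 'top']

def encode_tokens(tokens):
    encoded_tokens = tokens[:]
    for i, token in enumerate(tokens):
        # Words missing from the dictionary are kept as-is.
        if token in final_dictionary:
            encoded_tokens[i] = final_dictionary[token]
    return encoded_tokens

print(encode_tokens(tokens))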
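Since the question has one token list per data frame row, the same lookup can be applied row by row. Here is a minimal sketch, assuming the column has been pulled out as a plain list of lists. The names rows and encode_row are made up for illustration, and the dictionary is the sample from the question. With pandas, passing a function like this to Series.apply should give the same per-row result.

```python
# Sample dictionary from the question; substitute your own.
final_dictionary = {'me': 41, 'are': 21, 'the': 0}

# Stand-in for the data frame column: one token list per row (illustrative).
rows = [
    ['I', 'liked', 'the', 'color', 'of', 'this', 'top'],
    ['Just', 'grabbed', 'this', 'today', 'great', 'find'],
]

def encode_row(tokens, mapping):
    # Replace known words with their id; unknown words are kept as-is.
    return [mapping.get(token, token) for token in tokens]

# Build new lists, leaving the original rows unmodified.
encoded_rows = [encode_row(row, final_dictionary) for row in rows]
print(encoded_rows)
```

Passing the mapping as an argument, rather than reading a global, makes it easier to reuse the function with a different dictionary.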
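Adding and maintaining ids
If you are assigning ids dynamically, I would implement a class to do so (see below). If you already have a dictionary of ids defined ahead of time, you can pass it in as the keyword argument di: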
token_words_1 = ['I', 'liked', 'the', 'color', 'of', 'this', 'top']
token_words_2 = ['I', 'liked', 'to', 'test', 'repeat', 'words']

class AutoId:
    def __init__(self, **kwargs):
        self.di = kwargs.get("di", {})
        # Start past the largest preset id so new ids never collide with it.
        self.loc = max(self.di.values(), default=-1) + 1

    def get(self, value):
        # Assign the next free id the first time a value is seen.
        if value not in self.di:
            self.di[value] = self.loc
            self.loc += 1
        return self.di[value]

    def get_list(self, li):
        return [*map(self.get, li)]

encoding = AutoId()
print(encoding.get_list(token_words_1))
print(encoding.get_list(token_words_2))
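When ids are preset and new words should still get fresh ids, the counter has to start above the largest preset id. Otherwise a new word could receive an id that is already in use. Here is a compact sketch of that idea, using collections.defaultdict and itertools.count from the standard library instead of a class. The names preset, next_id and word_ids are illustrative, and the sketch assumes the preset ids are integers.

```python
from collections import defaultdict
from itertools import count

# Preset ids (sample dictionary from the question).
preset = {'me': 41, 'are': 21, 'the': 0}

# Counter that begins just above the largest preset id (0 if there are none).
next_id = count(max(preset.values(), default=-1) + 1)

# Looking up a missing word calls the factory and stores the new id.
word_ids = defaultdict(lambda: next(next_id), preset)

tokens = ['I', 'liked', 'the', 'color']
encoded = [word_ids[token] for token in tokens]
print(encoded)  # [42, 43, 0, 44]
```

Note that a defaultdict adds an entry on any missing-key lookup, so use word_ids.get(word) when you only want to check for a word without assigning it an id.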
How about this:
word2key = {'me': 41, 'are': 21, 'the': 0}
words = ['Just', 'grabbed', 'this', 'today', 'great', 'find']
default = 'unknown'
output = [word2key.get(x, default) for x in words]
If you want 'Just' and 'just' to map to the same value, you may want to use x.lower().
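A minimal sketch of the case-insensitive lookup, assuming every key in the dictionary is already lowercase. UNKNOWN_ID is a made-up placeholder for words that are not in the dictionary; pick any value that cannot clash with a real id.

```python
word2key = {'me': 41, 'are': 21, 'the': 0}
words = ['The', 'ME', 'Are', 'today']

# Hypothetical marker for out-of-vocabulary words; not part of the original answer.
UNKNOWN_ID = -1

# Lowercase each word before the lookup so 'The' and 'the' share one id.
encoded = [word2key.get(word.lower(), UNKNOWN_ID) for word in words]
print(encoded)  # [0, 41, 21, -1]
```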
Assuming your dictionary is in a variable named d and your list is named l:
d = {'me': 41, 'are': 21, 'the': 0}
l = ['I', 'liked', 'the', 'color', 'of', 'this', 'top']
print(l)
c = 0
while c < len(l):
    try:
        l[c] = d[l[c]]
    except KeyError:
        # The word is not in the dictionary.
        l[c] = None
    c += 1
print(l)
from itertools import chain
import numpy as np
# d = {'me': 41, 'are': 21, 'the': 0}
l1 = ['I', 'liked', 'the', 'color', 'of', 'this', 'top']
l2 = ['Just', 'grabbed', 'this', 'today', 'great', 'find']
# This is just for data generation for the sake of a complete example.
# Use your already given d here instead.
d = {k: np.random.randint(10) for k in chain(l1, l2)}
print(d)
l1_d = [d.get(k, 0) for k in l1] # <- this is the actual command you need
print(l1_d)
l2_d = [d.get(k, 0) for k in l2]
print(l2_d)
Result (the values are random, so your output will differ):
{'I': 3, 'liked': 3, 'the': 8, 'color': 7, 'of': 3, 'this': 5,
'top': 3, 'Just': 6, 'grabbed': 0, 'today': 0, 'great': 7, 'find': 0}
[3, 3, 8, 7, 3, 5, 3]
[6, 0, 5, 0, 7, 0]