Python One-hot encode every 2 characters in a text

Question

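I have a custom alphabet, given as a dictionary with 2-letter keys and corresponding decimal values. I basically want to use this alphabet to encode every 2 characters in a text. The text cannot go outside the given alphabet, so it is safe to define it by hand. This is what I have done so far.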

values = {'00' : 0.0, '01': 1.0, '02':2.0, '03':3.0, '04':4.0, '05':5.0, '06':6.0, '07':7.0, '08':8.0, '09':9.0, '0a':10, '0b':11, '0c':12, '0d':13, '0e':14}

sample = '000a'
indexes = [values[ch:ch+2] for ch in range(0,len(sample),2)]

The output should be 0.010 (that is, 0.0 for '00' followed by 10 for '0a').

However, I get a TypeError: unhashable type: 'slice' when I run this.

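Is there another way to iterate over every two items in a text and replace them with the values from the dictionary? Or what is the best way to do this for text files larger than 20 GB?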

Answer 1

This does what you describe in your example:

values = {'00' : 0.0, '01': 1.0, '02':2.0, '03':3.0, '04':4.0, '05':5.0, '06':6.0, '07':7.0, '08':8.0, '09':9.0, '0a':10, '0b':11, '0c':12, '0d':13, '0e':14}

sample = '000a'
indexes = ''.join(str(values[sample[ch:ch+2]]) for ch in range(0,len(sample),2))

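I think you left out sample[...] when building the key for values. In your version, values[ch:ch+2] passes a slice object to the dictionary as the key, which is what triggers the error; sample[ch:ch+2] is the two-character substring you actually want to look up.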
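
For the 20 GB+ part of the question, one possible approach is to read the file in fixed-size chunks and yield the values lazily, so the whole text never has to sit in memory. The following is only a minimal sketch under some assumptions: the helper name encode_stream and the chunk_chars parameter are made up for illustration, and the input is assumed to contain only the 2-character codes, with no newlines or other separators.

```python
import io

# Same mapping as in the question.
values = {'00': 0.0, '01': 1.0, '02': 2.0, '03': 3.0, '04': 4.0,
          '05': 5.0, '06': 6.0, '07': 7.0, '08': 8.0, '09': 9.0,
          '0a': 10, '0b': 11, '0c': 12, '0d': 13, '0e': 14}


def encode_stream(stream, table, chunk_chars=1 << 20):
    """Yield table[pair] for each 2-character pair read from a text stream.

    Hypothetical helper: assumes the stream holds only 2-character codes,
    with no newlines or whitespace between them.
    """
    leftover = ""
    while True:
        chunk = stream.read(chunk_chars)
        if not chunk:
            break
        # Prepend a character left over from the previous chunk, if any,
        # so that pairs are never split across chunk boundaries.
        chunk = leftover + chunk
        usable = len(chunk) - (len(chunk) % 2)
        for i in range(0, usable, 2):
            yield table[chunk[i:i + 2]]
        leftover = chunk[usable:]
    if leftover:
        raise ValueError("input has an odd number of characters")


# Small in-memory demonstration; a deliberately tiny, odd chunk size
# exercises the carry-over logic.
demo = list(encode_stream(io.StringIO('000a0e'), values, chunk_chars=3))
```

With a real file you would pass an open file object instead of io.StringIO, for example the result of open() on your data file in text mode, and consume the generator in a loop or write the values out as you go. A code that is not in the dictionary raises KeyError, which matches the assumption that the text never goes outside the alphabet.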

python

text

dictionary

slice

one-hot-encoding