Tokenizing Chinese text with keras.preprocessing.text.Tokenizer
keras.preprocessing.text.Tokenizer does not handle Chinese text correctly. How can I modify it so that it works on Chinese text?
from keras.preprocessing.text import Tokenizer

def fit_get_tokenizer(data, max_words):
    tokenizer = Tokenizer(num_words=max_words, filters='!"#%&()*+,-./:;<=>?@[\]^_`{|}~\t\n')
    tokenizer.fit_on_texts(data)
    return tokenizer

tokenizer = fit_get_tokenizer(df.sentence, max_words=150000)
print('Total number of words: ', len(tokenizer.word_index))

# Build the reverse mapping: index -> word
vocabulary_inv = {}
for word in tokenizer.word_index:
    vocabulary_inv[tokenizer.word_index[word]] = word
print(vocabulary_inv)
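For reference, here is a minimal sketch (with a made-up sentence) of what I mean: the Tokenizer only splits on whitespace, so an unsegmented Chinese sentence comes back as a single token:

from keras.preprocessing.text import Tokenizer

# Made-up, unsegmented Chinese sentence (no spaces between words).
sample = ['我喜欢自然语言处理']

t = Tokenizer(num_words=100)
t.fit_on_texts(sample)
print(t.word_index)
# The whole sentence is treated as one "word": {'我喜欢自然语言处理': 1}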
Since I can't post Chinese text on SO, I'll demonstrate with English sentences, but the same applies to Chinese:
import tensorflow as tf

text = ['This is a chinese sentence',
        'This is another chinese sentence']

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=50, char_level=False)
tokenizer.fit_on_texts(text)
print(tokenizer.word_index)
{'this': 1, 'is': 2, 'chinese': 3, 'sentence': 4, 'a': 5, 'another': 6}
Make sure you have a list of space-separated Chinese sentences and it should work fine. Passing a list of lists will lead to unexpected behavior.
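For example, here is a minimal sketch of the same idea with already-segmented (space-separated) Chinese sentences; the sentences below are placeholders, not from the original post:

import tensorflow as tf

# Space-separated (already segmented) Chinese sentences.
zh_text = ['这 是 一个 中文 句子',
           '这 是 另 一个 中文 句子']

zh_tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=50)
zh_tokenizer.fit_on_texts(zh_text)
print(zh_tokenizer.word_index)
# Expected output along the lines of:
# {'这': 1, '是': 2, '一个': 3, '中文': 4, '句子': 5, '另': 6}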
In case anyone is working through Chinese text segmentation: I used a regular expression to extract sentences from each Chinese paragraph, and then used jieba (instead of NLTK) to get proper word tokens and make the text ready for the Keras Tokenizer.

import re
import jieba
from keras.preprocessing.text import Tokenizer

def fit_get_tokenizer(data, max_words):
    c = []
    for i in range(len(data)):
        a = []
        # Split each paragraph into sentences on sentence-ending
        # punctuation (ASCII and full-width).
        text_tokens = re.findall(r'(.*?[?.!。!?])\s?', data[i])
        for j in text_tokens:
            # Segment each sentence into space-separated words with jieba.
            seg_list = jieba.lcut(j, cut_all=False)
            sen = " ".join(seg_list)
            a.append(sen)
        c.extend(a)
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(c)
    return tokenizer
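As a rough usage sketch (the paragraphs below are made-up examples, and fit_get_tokenizer is the function defined above):

# Hypothetical input: a couple of short Chinese paragraphs.
paragraphs = ['今天天气很好。我们一起去公园散步!',
              '他喜欢读书。她喜欢画画。']

tokenizer = fit_get_tokenizer(paragraphs, max_words=150000)
print('Total number of words: ', len(tokenizer.word_index))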