keras pad_sequence and Tokenizer
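I am working through a Kaggle dataset (Here) to practice NLP. When I tokenize the tweets and then pad them, I get an error. I have searched for a solution but have not found an answer.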
# Get the max number of words in tweets
texts = df['text']
LENGTH = texts.apply(lambda p: len(p.split()))
x = df['text']
y = df['target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=41)
tokenize = Tokenizer()
tokenize.fit_on_texts(x)
x = tokenize.texts_to_sequences(x)
print('start padding ...')
# Padding Tweets To Be The Same Length
x = pad_sequences(x ,maxlen=LENGTH)
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_34/2607522322.py in <module>
8
9 # Padding Tweets To Be The Same Length
---> 10 x = pad_sequences(x ,maxlen=LENGTH)
/opt/conda/lib/python3.7/site-packages/keras/preprocessing/sequence.py in pad_sequences(sequences, maxlen, dtype, padding, truncating, value)
152 return sequence.pad_sequences(
153 sequences, maxlen=maxlen, dtype=dtype,
--> 154 padding=padding, truncating=truncating, value=value)
155
156 keras_export(
/opt/conda/lib/python3.7/site-packages/keras_preprocessing/sequence.py in pad_sequences(sequences, maxlen, dtype, padding, truncating, value)
83 .format(dtype, type(value)))
84
---> 85 x = np.full((num_samples, maxlen) + sample_shape, value, dtype=dtype)
86 for idx, s in enumerate(sequences):
87 if not len(s):
/opt/conda/lib/python3.7/site-packages/numpy/core/numeric.py in full(shape, fill_value, dtype, order, like)
340 fill_value = asarray(fill_value)
341 dtype = fill_value.dtype
--> 342 a = empty(shape, dtype, order)
343 multiarray.copyto(a, fill_value, casting='unsafe')
344 return a
TypeError: 'Series' object cannot be interpreted as an integer
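The problem is that LENGTH is not an integer but a Pandas Series: texts.apply(...) returns one word count per tweet, while maxlen expects a single number. Try something like this: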
from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf
df = pd.DataFrame({'text': ['is upset that he cant update his Facebook by texting it... and might cry as a result School today also. Blah!',
                            '@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds',
                            'my whole body feels itchy and like its on fire',
                            '@nationwideclass no, its not behaving at all. im mad. why am i here? because I cant see you all over there.',
                            '@Kwesidei not the whole crew'],
                   'target': [0, 1, 0, 0, 1]})
x = df['text'].values
y = df['target'].values
max_length = max([len(d.split()) for d in x])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=41)
tokenize = tf.keras.preprocessing.text.Tokenizer()
tokenize.fit_on_texts(x)
x = tokenize.texts_to_sequences(x)
print('start padding ...')
x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length)
print(x)
start padding ...
[[ 9 10 11 12 3 13 14 15 16 17 18 4 19 20 21 22 23 24 25 26 27]
[ 0 0 0 28 1 29 30 31 32 2 33 34 35 36 37 2 38 39 40 41 42]
[ 0 0 0 0 0 0 0 0 0 0 0 43 5 44 45 46 4 47 6 48 49]
[50 51 6 7 52 53 8 54 55 56 57 1 58 59 1 3 60 61 8 62 63]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 64 7 2 5 65]]
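The same fix can also be applied directly to the Series from the question, without the list comprehension. This is a minimal sketch that assumes a small made-up DataFrame in place of the Kaggle data; `word_counts` is an illustrative name, not something from the original code.

```python
import pandas as pd

# Made-up stand-in for the Kaggle tweets (an assumption, not the real data)
df = pd.DataFrame({'text': ['my whole body feels itchy',
                            'not the whole crew',
                            'why am i here']})

# Per-tweet word counts: still a Series, like LENGTH in the question
word_counts = df['text'].apply(lambda p: len(p.split()))

# Reduce the Series to one plain Python int before using it as maxlen
max_length = int(word_counts.max())
print(type(word_counts).__name__, max_length)
```

Passing `max_length` rather than `word_counts` to `pad_sequences` should avoid the TypeError, since `maxlen` is then a single integer.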
If you want post-padding instead, replace the pad_sequences call above with:
x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length, padding='post')
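To make the difference between the two padding modes concrete, here is a small hand-written illustration in plain Python. It is not the Keras implementation: the `pad` helper is hypothetical and only mimics the default behaviour (zeros added at the front, truncation from the front) on two short token lists taken from the output above.

```python
def pad(sequences, maxlen, padding='pre', value=0):
    """Illustrative pre/post padding; assumes maxlen >= 1."""
    padded = []
    for seq in sequences:
        # Keep the last maxlen items, mirroring Keras' default truncating='pre'
        seq = list(seq)[-maxlen:]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values in front, 'post' puts them at the end
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded


sequences = [[64, 7, 2, 5, 65], [43, 5, 44]]

# Length of the longest tokenized sequence, always a plain int
maxlen = max(len(s) for s in sequences)

pre = pad(sequences, maxlen)                   # zeros on the left
post = pad(sequences, maxlen, padding='post')  # zeros on the right
print(pre)
print(post)
```

Taking `maxlen` from the tokenized sequences, as in this sketch, is an alternative to counting whitespace-separated words; the two counts can differ because the Tokenizer filters punctuation by default.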