Keras pad_sequences and Tokenizer


I'm practicing NLP on the Kaggle dataset Here. I get an error when I tokenize the tweets and then try to pad them. I searched for a solution but couldn't find an answer.
# Get the max number of words in the tweets
texts = df['text']
LENGTH = texts.apply(lambda p:len(p.split()))

x = df['text']
y = df['target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=41)


tokenize = Tokenizer()
tokenize.fit_on_texts(x)
x = tokenize.texts_to_sequences(x)

print('start padding ...')

# Padding Tweets To Be The Same Length
x = pad_sequences(x ,maxlen=LENGTH)

I got this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_34/2607522322.py in <module>
      8 
      9 # Padding Tweets To Be The Same Length
---> 10 x = pad_sequences(x ,maxlen=LENGTH)

/opt/conda/lib/python3.7/site-packages/keras/preprocessing/sequence.py in pad_sequences(sequences, maxlen, dtype, padding, truncating, value)
    152   return sequence.pad_sequences(
    153       sequences, maxlen=maxlen, dtype=dtype,
--> 154       padding=padding, truncating=truncating, value=value)
    155 
    156 keras_export(

/opt/conda/lib/python3.7/site-packages/keras_preprocessing/sequence.py in pad_sequences(sequences, maxlen, dtype, padding, truncating, value)
     83                          .format(dtype, type(value)))
     84 
---> 85     x = np.full((num_samples, maxlen) + sample_shape, value, dtype=dtype)
     86     for idx, s in enumerate(sequences):
     87         if not len(s):

/opt/conda/lib/python3.7/site-packages/numpy/core/numeric.py in full(shape, fill_value, dtype, order, like)
    340         fill_value = asarray(fill_value)
    341         dtype = fill_value.dtype
--> 342     a = empty(shape, dtype, order)
    343     multiarray.copyto(a, fill_value, casting='unsafe')
    344     return a

TypeError: 'Series' object cannot be interpreted as an integer

The problem is that LENGTH is not an integer but a pandas Series, and pad_sequences expects maxlen to be an int. Try something like this:

from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf 

df = pd.DataFrame({'text': ['is upset that he cant update his Facebook by texting it... and might cry as a result  School today also. Blah!',
                                    '@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds',
                                    'my whole body feels itchy and like its on fire', 
                                    '@nationwideclass no, its not behaving at all. im mad. why am i here? because I cant see you all over there.',
                                    '@Kwesidei not the whole crew'],
                          'target': [0, 1, 0, 0, 1]})
x = df['text'].values
y = df['target'].values

max_length = max([len(d.split()) for d in x])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=41)

tokenize = tf.keras.preprocessing.text.Tokenizer()
tokenize.fit_on_texts(x)
x = tokenize.texts_to_sequences(x)

print('start padding ...')

x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length)
print(x)
start padding ...
[[ 9 10 11 12  3 13 14 15 16 17 18  4 19 20 21 22 23 24 25 26 27]
 [ 0  0  0 28  1 29 30 31 32  2 33 34 35 36 37  2 38 39 40 41 42]
 [ 0  0  0  0  0  0  0  0  0  0  0 43  5 44 45 46  4 47  6 48 49]
 [50 51  6  7 52 53  8 54 55 56 57  1 58 59  1  3 60 61  8 62 63]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 64  7  2  5 65]]
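Alternatively, if you want to keep your original LENGTH computation, a minimal sketch of the fix (with made-up example texts) is to reduce the Series to a single plain int with max() before passing it to pad_sequences:

```python
import pandas as pd

# Hypothetical stand-in for df['text']
texts = pd.Series(['is upset that he cant update', 'my whole body feels itchy'])

LENGTH = texts.apply(lambda p: len(p.split()))
print(type(LENGTH))         # a pandas Series, not an int — this is what broke pad_sequences

maxlen = int(LENGTH.max())  # reduce the Series to one plain int
print(maxlen)
```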

If you want to use post-padding, run:

x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length, padding='post')
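For comparison, here is a small sketch (with made-up sequences) showing the difference between the default pre-padding and padding='post':

```python
import tensorflow as tf

seqs = [[1, 2, 3], [4, 5]]

# Default is padding='pre': zeros are added at the front of short sequences
pre = tf.keras.preprocessing.sequence.pad_sequences(seqs, maxlen=4)

# padding='post' appends the zeros at the end instead
post = tf.keras.preprocessing.sequence.pad_sequences(seqs, maxlen=4, padding='post')

print(pre.tolist())   # [[0, 1, 2, 3], [0, 0, 4, 5]]
print(post.tolist())  # [[1, 2, 3, 0], [4, 5, 0, 0]]
```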