Keras pad_sequences and Tokenizer


I'm practicing NLP on the Kaggle dataset Here. I get an error when I tokenize the tweets and then try to pad them. I searched for a solution but couldn't find an answer.
# Get the max number of words in the tweets
texts = df['text']
LENGTH = texts.apply(lambda p:len(p.split()))

x = df['text']
y = df['target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=41)


tokenize = Tokenizer()
tokenize.fit_on_texts(x)
x = tokenize.texts_to_sequences(x)

print('start padding ...')

# Padding Tweets To Be The Same Length
x = pad_sequences(x ,maxlen=LENGTH)

I got this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_34/2607522322.py in <module>
      8 
      9 # Padding Tweets To Be The Same Length
---> 10 x = pad_sequences(x ,maxlen=LENGTH)

/opt/conda/lib/python3.7/site-packages/keras/preprocessing/sequence.py in pad_sequences(sequences, maxlen, dtype, padding, truncating, value)
    152   return sequence.pad_sequences(
    153       sequences, maxlen=maxlen, dtype=dtype,
--> 154       padding=padding, truncating=truncating, value=value)
    155 
    156 keras_export(

/opt/conda/lib/python3.7/site-packages/keras_preprocessing/sequence.py in pad_sequences(sequences, maxlen, dtype, padding, truncating, value)
     83                          .format(dtype, type(value)))
     84 
---> 85     x = np.full((num_samples, maxlen) + sample_shape, value, dtype=dtype)
     86     for idx, s in enumerate(sequences):
     87         if not len(s):

/opt/conda/lib/python3.7/site-packages/numpy/core/numeric.py in full(shape, fill_value, dtype, order, like)
    340         fill_value = asarray(fill_value)
    341         dtype = fill_value.dtype
--> 342     a = empty(shape, dtype, order)
    343     multiarray.copyto(a, fill_value, casting='unsafe')
    344     return a

TypeError: 'Series' object cannot be interpreted as an integer

The problem is that LENGTH is not an integer but a pandas Series, and pad_sequences expects maxlen to be an int. Try something like this:

from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf 

df = pd.DataFrame({'text': ['is upset that he cant update his Facebook by texting it... and might cry as a result  School today also. Blah!',
                                    '@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds',
                                    'my whole body feels itchy and like its on fire', 
                                    '@nationwideclass no, its not behaving at all. im mad. why am i here? because I cant see you all over there.',
                                    '@Kwesidei not the whole crew'],
                          'target': [0, 1, 0, 0, 1]})
x = df['text'].values
y = df['target'].values

max_length = max([len(d.split()) for d in x])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=41)

tokenize = tf.keras.preprocessing.text.Tokenizer()
tokenize.fit_on_texts(x)
x = tokenize.texts_to_sequences(x)

print('start padding ...')

x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length)
print(x)
start padding ...
[[ 9 10 11 12  3 13 14 15 16 17 18  4 19 20 21 22 23 24 25 26 27]
 [ 0  0  0 28  1 29 30 31 32  2 33 34 35 36 37  2 38 39 40 41 42]
 [ 0  0  0  0  0  0  0  0  0  0  0 43  5 44 45 46  4 47  6 48 49]
 [50 51  6  7 52 53  8 54 55 56 57  1 58 59  1  3 60 61  8 62 63]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 64  7  2  5 65]]
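Alternatively, if you want to keep your original LENGTH computation, a minimal sketch of the fix (with made-up example texts) is to reduce the Series to a single plain int with max() before passing it to pad_sequences:

```python
import pandas as pd

# Hypothetical stand-in for df['text']
texts = pd.Series(['is upset that he cant update', 'my whole body feels itchy'])

LENGTH = texts.apply(lambda p: len(p.split()))
print(type(LENGTH))         # a pandas Series, not an int — this is what broke pad_sequences

maxlen = int(LENGTH.max())  # reduce the Series to one plain int
print(maxlen)
```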

If you want to use post-padding, run:

x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length, padding='post')
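For comparison, here is a small sketch (with made-up sequences) showing the difference between the default pre-padding and padding='post':

```python
import tensorflow as tf

seqs = [[1, 2, 3], [4, 5]]

# Default is padding='pre': zeros are added at the front of short sequences
pre = tf.keras.preprocessing.sequence.pad_sequences(seqs, maxlen=4)

# padding='post' appends the zeros at the end instead
post = tf.keras.preprocessing.sequence.pad_sequences(seqs, maxlen=4, padding='post')

print(pre.tolist())   # [[0, 1, 2, 3], [0, 0, 4, 5]]
print(post.tolist())  # [[1, 2, 3, 0], [4, 5, 0, 0]]
```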