How to convert multiple pandas text columns into tensors?

Hello, I am working on the Key Point Analysis task shared by IBM; here is the link. The given dataset has more than one column of text. Can anyone please tell me how I can convert the text columns into tensors and assign them back to the same DataFrame, since it also contains other columns of data?

Problem

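The problem I am facing is that I have never seen this kind of data before, with multiple text columns. How do I convert all of these columns into tensors and then apply a model? Most of the time the data has one text column and the other columns are labels, for example movie reviews or toxic comment classification.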

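# REPLACE_BY_SPACE_RE, BAD_SYMBOLS_RE and STOPWORDS are assumed to be
# defined elsewhere; their definitions are not shown in this post.
def clean_text(text):
    """
    text: a string

    return: modified initial string
    """
    text = text.lower()  # lowercase the text
    # replace the symbols matched by REPLACE_BY_SPACE_RE with a space
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    # delete the symbols matched by BAD_SYMBOLS_RE
    text = BAD_SYMBOLS_RE.sub('', text)
    text = text.replace('x', '')  # note: this removes every 'x' character
    # text = re.sub(r'\W+', '', text)
    # drop stopwords
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text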

If I understood your question correctly, you would do something like this:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# replace each string in the "args" column with its list of token IDs
DF["args"] = DF["args"].apply(lambda x: tokenizer(x)["input_ids"])

This converts each sentence into an array of token IDs.
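
To extend the same idea to several text columns, one option is to encode each column to a fixed length, store the IDs back in the same DataFrame, and stack each column into a 2-D array. The following is only a minimal sketch: the column names (`argument`, `key_point`, `label`), the toy data, and the whitespace tokenizer (`build_vocab`, `encode`) are made up for illustration so that the example runs without downloading a model. They are not part of the IBM dataset or of any library.

```python
import numpy as np
import pandas as pd


def build_vocab(texts):
    """Map each whitespace-separated word to an integer ID (0 = padding, 1 = unknown)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_len):
    """Turn one string into a list of exactly max_len token IDs (truncate, then pad with 0)."""
    ids = [vocab.get(word, 1) for word in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))


# Toy data: two text columns plus a label column (hypothetical names and values).
df = pd.DataFrame({
    "argument": ["School uniforms reduce bullying", "Uniforms are expensive"],
    "key_point": ["Uniforms reduce bullying", "Uniforms limit self expression"],
    "label": [1, 0],
})

text_cols = ["argument", "key_point"]
max_len = 6

# One shared vocabulary built over all text columns.
vocab = build_vocab(pd.concat([df[col] for col in text_cols]))

# Keep the original columns and add one "<name>_ids" column per text column.
for col in text_cols:
    df[col + "_ids"] = df[col].apply(lambda t: encode(t, vocab, max_len))

# Stack each ID column into a (num_rows, max_len) integer array.
arg_ids = np.array(df["argument_ids"].tolist())
kp_ids = np.array(df["key_point_ids"].tolist())
labels = df["label"].to_numpy()
```

With a Hugging Face tokenizer, `encode` could be swapped for a call such as `tokenizer(t, padding="max_length", truncation=True, max_length=max_len)["input_ids"]`, and the resulting arrays can be passed to `torch.tensor(...)` before feeding a model. Padding to a common length is what makes the stacking step possible, since rows of different lengths cannot form a rectangular tensor.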