Only select columns [no rows] in Python notebooks

Question

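I'm doing some analysis on unstructured data in a notebook; it takes up one column of information. I want to pull out this single column and run natural language processing on it to see the most frequent keywords after tokenization.

This is the text I want to analyze when I apply my tokenizer to the user reviews column: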

text = df.loc[:, "User Reviews"]

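The row numbers are included with the text "User Reviews" column.

Because some user reviews contain numbers that are the same as the row numbers, this confuses the analysis, especially since I'm applying tokenization and looking at term frequency. In the example below, the row number starts at 1, then 2 for the next row, then 3, and so on for 10K user reviews.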

['1', 'great', 'cat', 'waiting', 'on', 'me', 'home', 'to', 'feed', 'love', 'fancy', 'feast',
 '2', 'my', '3', 'dogs', 'love', 'this', '3', 'So', 'bad', 'my', '4', 'dogs', 'threw', 'up', ...]

Is there a way to do this? Do I need to use text.drop to remove the row numbers? I looked at a few sources here:

https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/

https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c

But I'm still struggling.

                                            User Reviews  
0  i think my puppy likes this. She seemed to keep...  
1  Its Great! My cat waiting on me to feed her. Fa...  
2  My 3 dogs love this so much. Wanted to get more...
3  All of my 4 dogs threw this up. Wouldnt ever re...  
4  I think she likes it. I gave it to her yesterda...  
5  Do not trust this brand, dog died 3 yrs ago aft...  
6  Tried and true dog food, never has issues with ...

Answer 1

The row numbers are included with the text "User Reviews" column.

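A pd.Series object includes an array of values and an associated index. The index, if it has not been changed by some operation, may coincide with the "row numbers", but this is not guaranteed.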
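As a small illustration of that last point, here is a sketch with made-up data (the review strings and variable names are hypothetical, not from the question). After filtering, the index keeps its original labels and no longer matches the row positions:

```python
import pandas as pd

# Hypothetical sample data standing in for the real reviews.
reviews = pd.Series(["good", "bad", "fine", "great"], name="User Reviews")

# Dropping a row by filtering keeps the original index labels.
filtered = reviews[reviews != "bad"]
index_after_filter = list(filtered.index)  # [0, 2, 3]: no longer 0, 1, 2

# reset_index(drop=True) renumbers the index from 0 again.
reset = filtered.reset_index(drop=True)
index_after_reset = list(reset.index)  # [0, 1, 2]
```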

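Your tokenization logic seems to be designed to work on an array of values rather than on a series. You can use pd.Series.values to extract the underlying numpy array, which contains only the values: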

text = df.loc[:, "User Reviews"].values

The numpy array representation drops the index and keeps only the underlying data.
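A minimal sketch of the difference, under some assumptions: the sample reviews are shortened stand-ins for the real data, a plain str.split() stands in for whatever tokenizer is actually used, and the guess that the row numbers came in through the printed form of the Series is not confirmed by the question.

```python
import pandas as pd

# Hypothetical sample data; the real frame has about 10K reviews.
df = pd.DataFrame({
    "User Reviews": [
        "Its Great! My cat waiting on me to feed her",
        "My 3 dogs love this so much",
        "All of my 4 dogs threw this up",
    ]
})

text = df.loc[:, "User Reviews"]

# The printed form of a Series shows the index next to each value,
# so tokenizing str(text) picks up the index labels as tokens.
tokens_with_index = str(text).split()

# .values gives only the underlying data; tokenize each review on its own.
values = df.loc[:, "User Reviews"].values
tokens = [word for review in values for word in review.split()]
```

With this input, tokens_with_index contains the index labels "0", "1" and "2", while tokens contains only the words (and numbers) that the reviewers actually wrote.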

python

tokenize

dataframe

pandas