从数据帧创建句子(行)到 POS 标签计数(列)矩阵

Create sentence (row) to POS tags counts (column) matrix from a dataframe

我正在尝试构建一个矩阵,其中第一行是词性,第一列是句子。矩阵中的值应显示句子中此类 POS 的数量。

所以我以这种方式创建 POS 标签:

data = pd.read_csv(open('myfile.csv'),sep=';') 

target = data["label"]
del data["label"]

data.sentence = data.sentence.str.lower() # All strings in data frame to lowercase

for line in data.sentence:
    Line_new= nltk.pos_tag(nltk.word_tokenize(line))
    print(Line_new)

输出为:

[('together', 'RB'), ('with', 'IN'), ('the', 'DT'), ('6th', 'CD'), ('battalion', 'NN'), ('of', 'IN'), ('the', 'DT')]

如何从这样的输出中创建上面描述的矩阵?

更新: 期望的输出是

                   NN  VB    IN    VBZ    DT
 I was there       1   1     1      0     0
 He came there     0   0     1      1     1

myfile.csv:

"A child who is exclusively or predominantly oral (using speech for communication) can experience social isolation from his or her hearing peers, particularly if no one takes the time to explicitly teach them social skills that other children acquire independently by virtue of having normal hearing.";"certain"
"Preliminary Discourse to the Encyclopedia of Diderot";"certain"
"d'Alembert claims that it would be ignorant to perceive that everything could be known about a particular subject.";"certain"
"However, as the overemphasis on parental influence of psychodynamics theory has been strongly criticized in the previous century, modern psychologists adopted interracial contact as a more important determinant than childhood experience on shaping people’s prejudice traits (Stephan & Rosenfield, 1978).";"uncertain"
"this can also be summarized as a distinguish behaviour on the peronnel level";"uncertain"

多头:

首先让我们向您的 csv 添加一些 headers,以便在访问列时它更多 humanly-readable:

>>> import pandas as pd
>>> df = pd.read_csv('myfile.csv', delimiter=';')
>>> df.columns = ['sent', 'tag']
>>> df['sent']
0    Preliminary Discourse to the Encyclopedia of D...
1    d'Alembert claims that it would be ignorant to...
2    However, as the overemphasis on parental influ...
3    this can also be summarized as a distinguish b...
Name: sent, dtype: object
>>> df['tag']
0      certain
1      certain
2    uncertain
3    uncertain

现在让我们创建一个函数 tok_and_tag,以链式方式执行 word_tokenizepos_tag

>>> from nltk import word_tokenize, pos_tag
>>> from functools import partial
>>> tok_and_tag = lambda x: pos_tag(word_tokenize(x))
>>> df['sent'][0]
'Preliminary Discourse to the Encyclopedia of Diderot'
>>> tok_and_tag(df['sent'][0])
[('Preliminary', 'JJ'), ('Discourse', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('Encyclopedia', 'NNP'), ('of', 'IN'), ('Diderot', 'NNP')]

然后,我们可以使用df.apply对数据框的句子列进行分词和标记:

>>> df['sent'].apply(tok_and_tag)
0    [(Preliminary, JJ), (Discourse, NNP), (to, TO)...
1    [(d'Alembert, NN), (claims, NNS), (that, IN), ...
2    [(However, RB), (,, ,), (as, IN), (the, DT), (...
3    [(this, DT), (can, MD), (also, RB), (be, VB), ...
Name: sent, dtype: object

如果你想让句子小写:

>>> df['sent'].apply(str.lower)
0    preliminary discourse to the encyclopedia of d...
1    d'alembert claims that it would be ignorant to...
2    however, as the overemphasis on parental influ...
3    this can also be summarized as a distinguish b...
Name: sent, dtype: object

>>> df['lower_sent'] = df['sent'].apply(str.lower)

>>> df['lower_sent'].apply(tok_and_tag)
0    [(preliminary, JJ), (discourse, NN), (to, TO),...
1    [(d'alembert, NN), (claims, NNS), (that, IN), ...
2    [(however, RB), (,, ,), (as, IN), (the, DT), (...
3    [(this, DT), (can, MD), (also, RB), (be, VB), ...
Name: lower_sent, dtype: object

此外,我们需要某种方式来获取 POS 词汇表,我们可以使用 collections.Counteritertools.chain 来展平列表的列表:

>>> df['lower_sent']
0    preliminary discourse to the encyclopedia of d...
1    d'alembert claims that it would be ignorant to...
2    however, as the overemphasis on parental influ...
3    this can also be summarized as a distinguish b...
Name: lower_sent, dtype: object

>>> df['lower_sent'].apply(tok_and_tag)
0    [(preliminary, JJ), (discourse, NN), (to, TO),...
1    [(d'alembert, NN), (claims, NNS), (that, IN), ...
2    [(however, RB), (,, ,), (as, IN), (the, DT), (...
3    [(this, DT), (can, MD), (also, RB), (be, VB), ...
Name: lower_sent, dtype: object

>>> df['tagged_sent'] = df['lower_sent'].apply(tok_and_tag)

>>> tokens, tags = zip(*chain(*df['tagged_sent'].tolist()))

>>> tags
('JJ', 'NN', 'TO', 'DT', 'NN', 'IN', 'NN', 'NN', 'NNS', 'IN', 'PRP', 'MD', 'VB', 'JJ', 'TO', 'VB', 'IN', 'NN', 'MD', 'VB', 'VBN', 'IN', 'DT', 'JJ', 'NN', '.', 'RB', ',', 'IN', 'DT', 'NN', 'IN', 'JJ', 'NN', 'IN', 'NNS', 'NN', 'VBZ', 'VBN', 'RB', 'VBN', 'IN', 'DT', 'JJ', 'NN', ',', 'JJ', 'NNS', 'VBD', 'JJ', 'NN', 'IN', 'DT', 'RBR', 'JJ', 'NN', 'IN', 'NN', 'NN', 'IN', 'VBG', 'JJ', 'NN', 'NNS', '(', 'NN', 'CC', 'NN', ',', 'CD', ')', '.', 'DT', 'MD', 'RB', 'VB', 'VBN', 'IN', 'DT', 'JJ', 'NN', 'IN', 'DT', 'NNS', 'NN')

>>> set(tags)
{'CC', 'VB', ')', 'NNS', ',', 'JJ', 'VBZ', 'DT', 'NN', 'PRP', 'RBR', 'TO', 'VBD', '(', 'VBN', '.', 'MD', 'IN', 'RB', 'VBG', 'CD'}
>>> possible_tags = sorted(set(tags))
>>> possible_tags
['(', ')', ',', '.', 'CC', 'CD', 'DT', 'IN', 'JJ', 'MD', 'NN', 'NNS', 'PRP', 'RB', 'RBR', 'TO', 'VB', 'VBD', 'VBG', 'VBN', 'VBZ']

>>> possible_tags_counter = Counter({p:0 for p in possible_tags})
>>> possible_tags_counter
Counter({'NNS': 0, 'VBZ': 0, 'DT': 0, '(': 0, 'JJ': 0, 'VBD': 0, ')': 0, 'RB': 0, 'VBG': 0, 'RBR': 0, 'VB': 0, 'IN': 0, 'CC': 0, ',': 0, 'PRP': 0, 'CD': 0, 'VBN': 0, '.': 0, 'MD': 0, 'NN': 0, 'TO': 0})

遍历每个标记的句子并获取 POS 的计数:

>>> df['tagged_sent'].apply(lambda x: Counter(list(zip(*x))[1]))
0    {'NN': 3, 'IN': 1, 'TO': 1, 'DT': 1, 'JJ': 1}
1    {'NN': 3, 'VB': 3, 'PRP': 1, 'TO': 1, 'DT': 1,...
2    {')': 1, 'JJ': 6, 'NN': 11, 'CC': 1, 'NNS': 3,...
3    {'DT': 3, 'VB': 1, 'NN': 2, 'VBN': 1, 'NNS': 1...
Name: tagged_sent, dtype: object

>>> df['pos_counts'] = df['tagged_sent'].apply(lambda x: Counter(list(zip(*x))[1]))

>>> df['pos_counts']
0    {'NN': 3, 'IN': 1, 'TO': 1, 'DT': 1, 'JJ': 1}
1    {'NN': 3, 'VB': 3, 'PRP': 1, 'TO': 1, 'DT': 1,...
2    {')': 1, 'JJ': 6, 'NN': 11, 'CC': 1, 'NNS': 3,...
3    {'DT': 3, 'VB': 1, 'NN': 2, 'VBN': 1, 'NNS': 1...
Name: pos_counts, dtype: object

# Now we can add in the POS that don't appears in the sentence with 0 counts:

>>> def add_pos_with_zero_counts(counter, keys_to_add):
...     for k in keys_to_add:
...         counter[k] = counter.get(k, 0)
...     return counter
... 
>>> df['pos_counts'].apply(lambda x: add_pos_with_zero_counts(x, possible_tags))
0    {'VB': 0, 'IN': 1, 'PRP': 0, 'DT': 1, 'CC': 0,...
1    {'VB': 3, ')': 0, 'DT': 1, 'CC': 0, 'RB': 0, '...
2    {'VB': 0, ')': 1, 'JJ': 6, 'NN': 11, 'CC': 1, ...
3    {'VB': 1, 'IN': 2, 'PRP': 0, 'NN': 2, 'CC': 0,...
Name: pos_counts, dtype: object

>>> df['pos_counts_with_zero'] = df['pos_counts'].apply(lambda x: add_pos_with_zero_counts(x, possible_tags))

现在将值展平到列表中:

>>> df['pos_counts_with_zero'].apply(lambda x: [count for tag, count in sorted(x.most_common())])
0    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 3, 0, 0, 0, 0, ...
1    [0, 0, 0, 1, 0, 0, 1, 3, 2, 2, 3, 1, 1, 0, 0, ...
2    [1, 1, 3, 1, 1, 1, 3, 7, 6, 0, 11, 3, 0, 2, 1,...
3    [0, 0, 0, 0, 0, 0, 3, 2, 1, 1, 2, 1, 0, 1, 0, ...
Name: pos_counts_with_zero, dtype: object

>>> df['sent_vector'] = df['pos_counts_with_zero'].apply(lambda x: [count for tag, count in sorted(x.most_common())])

现在您需要创建一个新矩阵来存储 BoW:

>>> df2 = pd.DataFrame(df['sent_vector'].tolist)
>>> df2.columns = sorted(possible_tags)

瞧:

>>> df2
   (  )  ,  .  CC  CD  DT  IN  JJ  MD ...   NNS  PRP  RB  RBR  TO  VB  VBD  \
0  0  0  0  0   0   0   1   1   1   0 ...     0    0   0    0   1   0    0   
1  0  0  0  1   0   0   1   3   2   2 ...     1    1   0    0   1   3    0   
2  1  1  3  1   1   1   3   7   6   0 ...     3    0   2    1   0   0    1   
3  0  0  0  0   0   0   3   2   1   1 ...     1    0   1    0   0   1    0   

   VBG  VBN  VBZ  
0    0    0    0  
1    0    1    0  
2    1    2    1  
3    0    1    0  

[4 rows x 21 columns]

简而言之:

from collections import Counter
from itertools import chain

import pandas as pd

from nltk import word_tokenize, pos_tag

df = pd.read_csv('myfile.csv', delimiter=';')
df.columns = ['sent', 'tag']

tok_and_tag = lambda x: pos_tag(word_tokenize(x))

df['lower_sent'] = df['sent'].apply(str.lower)
df['tagged_sent'] = df['lower_sent'].apply(tok_and_tag)

possible_tags = sorted(set(list(zip(*chain(*df['tagged_sent'])))[1]))

def add_pos_with_zero_counts(counter, keys_to_add):
    for k in keys_to_add:
        counter[k] = counter.get(k, 0)
    return counter


# Detailed steps.
df['pos_counts'] = df['tagged_sent'].apply(lambda x: Counter(list(zip(*x))[1]))
df['pos_counts_with_zero'] = df['pos_counts'].apply(lambda x: add_pos_with_zero_counts(x, possible_tags))
df['sent_vector'] = df['pos_counts_with_zero'].apply(lambda x: [count for tag, count in sorted(x.most_common())])

# All in one.
df['sent_vector'] = df['tagged_sent'].apply(lambda x:
    [count for tag, count in sorted(
        add_pos_with_zero_counts(
            Counter(list(zip(*x))[1]), 
                    possible_tags).most_common()
         )
    ]
)

df2 = pd.DataFrame(df['sent_vector'].tolist())
df2.columns = possible_tags