How to save words tokenized from articles to a CSV file, with sentence ID numbers?

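I am trying to extract all the words from articles stored in a CSV file, and write each sentence ID number together with the words it contains to a new CSV file.

What I have tried so far: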

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
df = pd.read_csv(r"D:\data.csv", nrows=10)

row = 0; sentNo = 0
while( row < 1 ):
    sentences = sent_tokenize(df['articles'][row])
    for index, sents in enumerate(sentences):
        sentNo += 1
        words = word_tokenize(sents)
        print(f'{sentNo}: {words}')
    row += 1

df['articles'][0] contains:

The ultimate productivity hack is saying no. Not doing something will always be faster than doing it. This statement reminds me of the old computer programming saying, “Remember that there is no code faster than no code.”

I only took df['articles'][0], and it gives this output:

1:['The', 'ultimate', 'productivity', 'hack', 'is', 'saying', 'no', '.']
2:['Not', 'doing', 'something', 'will', 'always', 'be', 'faster', 'than', 'doing', 'it', '.']
3:['This', 'statement', 'reminds', 'me', 'of', 'the', 'old', 'computer', 'programming', 'saying', ',', '“', 'Remember', 'that', 'there', 'is', 'no', 'code', 'faster', 'than', 'no', 'code', '.', '”']

How can I write a new output.csv file that contains all the sentences of all the articles in data.csv, in the following format:

Sentence No | Word
1             The
              ultimate
              productivity
              hack
              is
              saying
              no
              .
2             Not
              doing 
              something 
              will
              always
              be
              faster
              than
              doing
              it
              .
3             This 
              statement 
              reminds 
              me 
              of 
              the 
              old 
              computer 
              programming 
              saying
              , 
              “
              Remember
              that 
              there
              is
              no
              code
              faster
              than
              no
              code
              .
              ”

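I am new to Python and am using it in Jupyter Notebook.

This is my first post on Stack Overflow. If anything is off, please correct me so I can learn. Thanks.

You just need to iterate over the words and write a new line for each one.

This is a bit unpredictable, because some of your "words" are commas. You may want to use a different delimiter, or remove the commas from the word list.

Edit: this seems like a cleaner way to do it.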
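To make the comma problem concrete, here is a minimal sketch using only the standard library. The sample strings and variable names are made up for illustration. It shows how a hand-written row for the comma token is parsed, and two possible workarounds: a tab delimiter, or filtering commas out of the word list.

```python
import csv
import io

# A hand-written row for the "," token: empty sentence number, then the comma.
naive_line = ",,\n"
parsed_naive = next(csv.reader(io.StringIO(naive_line)))
print(parsed_naive)  # the parser sees three empty fields, not ["", ","]

# Workaround 1: use a tab as the delimiter so the comma is ordinary data.
tab_line = "\t,\n"
parsed_tab = next(csv.reader(io.StringIO(tab_line), delimiter="\t"))
print(parsed_tab)

# Workaround 2: drop comma tokens before writing.
words = ["saying", ",", "no"]
filtered = [w for w in words if w != ","]
print(filtered)
```

Which workaround fits depends on whether punctuation tokens need to survive in the output file.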

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

df = pd.read_csv(r"D:\data.csv", nrows=10)
stcNum = 1

with open('output.csv', 'w', encoding='utf-8') as f:
    # Sentence numbers keep counting across articles
    for article in df['articles']:
        for stc in sent_tokenize(article):
            for i, word in enumerate(word_tokenize(stc)):
                # Only the first word of a sentence carries the sentence number
                prntLine = (str(stcNum) if i == 0 else '') + ',' + word + '\n'
                f.write(prntLine)
            stcNum += 1

output.csv (rows for the first article):

1,The
,ultimate
,productivity
,hack
,is
,saying
,no
,.
2,Not
,doing
,something
,will
,always
,be
,faster
,than
,doing
,it
,.
3,This
,statement
,reminds
,me
,of
,the
,old
,computer
,programming
,saying
,,     # <<< Most CSV parsers will see this as 3 empty columns
,“
,Remember
,that
,there
,is
,no
,code
,faster
,than
,no
,code
,.
,”
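One way to avoid the ambiguous comma row is to let Python's csv module do the quoting. The following is only a sketch under assumptions: write_sentence_rows is a hypothetical helper name, and the demo tokens are written by hand so the snippet runs without NLTK. In real use the word lists would come from word_tokenize applied to each sentence returned by sent_tokenize.

```python
import csv
import io

def write_sentence_rows(tokenized_sentences, out_file):
    """Write one (sentence number, word) row per token.

    tokenized_sentences: iterable of word lists, one list per sentence.
    out_file: any writable text file object.
    """
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Sentence No", "Word"])
    for sent_no, words in enumerate(tokenized_sentences, start=1):
        for i, word in enumerate(words):
            # Only the first word of a sentence carries the sentence number.
            writer.writerow([sent_no if i == 0 else "", word])

# Hand-written demo tokens (assumption: stands in for NLTK output).
demo = [["Not", "doing", "it", "."], ["saying", ",", "no"]]
buffer = io.StringIO()
write_sentence_rows(demo, buffer)
print(buffer.getvalue())
```

With this approach the comma token is written as a quoted field, so a CSV parser reads it back as a single word. To write to disk instead of a buffer, pass a file opened with open('output.csv', 'w', newline='', encoding='utf-8').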