How to save words tokenized from articles in a CSV file with sentence ID numbers?
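I am trying to extract all the words from articles stored in a CSV file and write each sentence ID number, together with the words that sentence contains, to a new CSV file.
What I have tried so far: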
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

df = pd.read_csv(r"D:\data.csv", nrows=10)

row = 0
sentNo = 0
while row < 1:  # only the first article for now
    # split the article into sentences with the sent_tokenize imported above
    sentences = sent_tokenize(df['articles'][row])
    for sents in sentences:
        sentNo += 1
        words = word_tokenize(sents)  # split the sentence into word tokens
        print(f'{sentNo}: {words}')
    row += 1
df['articles'][0] contains:
The ultimate productivity hack is saying no. Not doing something will always be faster than doing it. This statement reminds me of the old computer programming saying, “Remember that there is no code faster than no code.”
I only used df['articles'][0], and it gives output like this:
1: ['The', 'ultimate', 'productivity', 'hack', 'is', 'saying', 'no', '.']
2: ['Not', 'doing', 'something', 'will', 'always', 'be', 'faster', 'than', 'doing', 'it', '.']
3: ['This', 'statement', 'reminds', 'me', 'of', 'the', 'old', 'computer', 'programming', 'saying', ',', '“', 'Remember', 'that', 'there', 'is', 'no', 'code', 'faster', 'than', 'no', 'code', '.', '”']
How can I write a new output.csv file that contains all sentences of all articles from the data.csv file, in the format below:
Sentence No | Word
1 The
ultimate
productivity
hack
is
saying
no
.
2 Not
doing
something
will
always
be
faster
than
doing
it
.
3 This
statement
reminds
me
of
the
old
computer
programming
saying
,
“
Remember
that
there
is
no
code
faster
than
no
code
.
”
I am new to Python and am using it in Jupyter Notebook.
This is my first post on Stack Overflow. If anything is not right, please correct me so I can learn. Thank you.
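You just need to iterate over the words and write a new line for each one.
The result can be a bit unpredictable because some of your "words" are commas too. You may want to consider another delimiter or remove the commas from the word list.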
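As a small illustration of the comma issue, here is a sketch using Python's standard-library csv module, which is not part of the original answer. It quotes any field that contains the delimiter, so a "," token should survive a write and read round trip. The helper names rows_to_csv_text and csv_text_to_rows and the sample rows are made up for this example.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Serialize rows to CSV text; fields containing the delimiter get quoted."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def csv_text_to_rows(text):
    """Parse CSV text back into a list of rows (every field is a string)."""
    return list(csv.reader(io.StringIO(text)))


# Hypothetical sample: a sentence number on the first token, blank afterwards.
rows = [["3", "saying"], ["", ","], ["", "code"]]
text = rows_to_csv_text(rows)
print(text)  # the comma token is written as ,"," rather than ,,
```

Reading the text back with csv_text_to_rows(text) gives the original rows, so the comma stays a word instead of turning into extra empty columns.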
EDIT: This seems to be a slightly cleaner way of doing it.
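import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

df = pd.read_csv(r"D:\data.csv", nrows=10)
row = 0  # index of the article to process
sentences = sent_tokenize(df['articles'][row])

# newline='' stops Python from turning '\r\n' into '\r\r\n' on Windows
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    for stcNum, stc in enumerate(sentences, start=1):
        for i, word in enumerate(word_tokenize(stc)):
            # only the first word of a sentence carries the sentence number
            prefix = str(stcNum) if i == 0 else ''
            # fields are not quoted, so a ',' token is written as a bare ',,' line
            f.write(prefix + ',' + word + '\r\n')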
output.csv:
1,The
,ultimate
,productivity
,hack
,is
,saying
,no
,.
2,Not
,doing
,something
,will
,always
,be
,faster
,than
,doing
,it
,.
3,This
,statement
,reminds
,me
,of
,the
,old
,computer
,programming
,saying
,, # <<< Most CSV parsers will see this as 3 empty columns
,“
,Remember
,that
,there
,is
,no
,code
,faster
,than
,no
,code
,.
,”
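To cover all articles rather than a single one, and to avoid the unquoted comma line, one possible variation is sketched below. It is not the answer's code: the function name write_sentence_words and its parameters are hypothetical, the header row is taken from the format in the question, and the sentence and word splitters are passed in as arguments so that any tokenizer can be used.

```python
import csv


def write_sentence_words(articles, out_path, split_sentences, split_words):
    """Write one CSV row per token across all articles.

    The running sentence number is written only on the first token of each
    sentence; the other rows leave that column empty.
    Returns the total number of sentences written.
    """
    sent_no = 0
    # newline='' lets the csv module control line endings itself
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes fields such as "," automatically
        writer.writerow(["Sentence No", "Word"])
        for article in articles:
            for sentence in split_sentences(article):
                sent_no += 1
                for i, word in enumerate(split_words(sentence)):
                    writer.writerow([sent_no if i == 0 else "", word])
    return sent_no
```

Assuming NLTK and its sentence tokenizer data are installed, this could be called as write_sentence_words(df['articles'], 'output.csv', sent_tokenize, word_tokenize). Sentence numbers keep counting up across articles, which matches the running sentNo counter in the question.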