How to stem each row in a CSV file?
I have a CSV file in which two columns contain sentences. For example,
Test.csv:
Col[1]
----------------------
This trip was amazing.
Col[2]
--------------------
The cats are playing.
So I did some NLP processing:
import codecs
import csv
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

with codecs.open('test.csv', 'r', encoding='utf-8', errors='ignore') as myfile:
    data = csv.reader(myfile, delimiter=',')
    next(data)  # skip the header row
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    for row in data:
        word_tokens1 = word_tokenize(row[1].lower())
        word_tokens2 = word_tokenize(row[2].lower())
        # keep only purely alphabetic tokens (drops punctuation)
        remo1 = [w for w in word_tokens1 if w in re.sub("[^a-zA-Z]", " ", w)]
        remo2 = [w for w in word_tokens2 if w in re.sub("[^a-zA-Z]", " ", w)]
        list1 = [w for w in remo1 if not w in stops]
        list2 = [w for w in remo2 if not w in stops]
        for w in list1:
            l = stemmer.stem(w)
            print(l)
        for w in list2:
            l2 = stemmer.stem(w)
            print(l2)
My problem is that when I stem and print, I get:
trip
amazi
cat
play
It prints each word on its own line. How can I get back the stemmed sentence for each column, like:
Col[1]:
-------------------
trip amazi
Col[2]:
-------------------
cat play
Here is a modified version of your code that produces the output you want. The most important thing you have to do is change
for w in list1:
    l = stemmer.stem(w)
    print(l)
for w in list2:
    l2 = stemmer.stem(w)
    print(l2)
to
stemmed_first = ""
c = 0
for w in list1:
    if c < len(list1)-1:
        stemmed_first += stemmer.stem(w) + " "
    else:
        stemmed_first += stemmer.stem(w)
    c += 1
The same goes for list2. However, I made some other small changes to your code:
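import csv
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))
# Python 3: open in text mode with newline='' as the csv module expects
with open('test.csv', newline='', encoding='utf-8') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        if len(row) >= 2:  # ignore blank or single-column rows
            # lowercase first so the stop-word check below can match
            word_tokens1 = nltk.tokenize.word_tokenize(row[0].lower())
            word_tokens2 = nltk.tokenize.word_tokenize(row[1].lower())
            # keep only purely alphabetic tokens (drops punctuation)
            remo1 = [w for w in word_tokens1 if w in re.sub("[^a-zA-Z]", " ", w)]
            remo2 = [w for w in word_tokens2 if w in re.sub("[^a-zA-Z]", " ", w)]
            list1 = [w for w in remo1 if not w in stops]
            list2 = [w for w in remo2 if not w in stops]
            # build the first column's stemmed sentence, space-separated
            stemmed_first = ""
            c = 0
            for w in list1:
                if c < len(list1)-1:
                    stemmed_first += stemmer.stem(w) + " "
                else:
                    stemmed_first += stemmer.stem(w)
                c += 1
            # same for the second column
            stemmed_second = ""
            c = 0
            for w in list2:
                if c < len(list2)-1:
                    stemmed_second += stemmer.stem(w) + " "
                else:
                    stemmed_second += stemmer.stem(w)
                c += 1
            print(stemmed_first)
            print(stemmed_second)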
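As a side note, the counter-based loops that build stemmed_first and stemmed_second can probably be collapsed into a single " ".join(...) call. The sketch below is only an illustration under stated assumptions: stem_sentence, toy_stem, and TOY_STOPS are hypothetical names that are not part of the code above, and a simple regex stands in for word_tokenize so the example runs without NLTK installed. In real use you would pass PorterStemmer().stem and set(stopwords.words("english")) instead of the toy stand-ins.

```python
import csv
import io
import re


def stem_sentence(sentence, stem, stop_words):
    """Return the stemmed, stop-word-free words of `sentence` as one string."""
    # Assumption: a regex replaces word_tokenize here; it keeps alphabetic runs only.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # " ".join puts a single space between items and none at the end,
    # so no counter is needed to special-case the last word.
    return " ".join(stem(w) for w in tokens if w not in stop_words)


def toy_stem(word):
    # Hypothetical stand-in for PorterStemmer().stem: strips "ing" or a final "s".
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical stand-in for set(stopwords.words("english")).
TOY_STOPS = {"this", "was", "the", "are"}

# In-memory CSV with a header row, mirroring the layout of Test.csv.
sample = io.StringIO("Col1,Col2\nThis trip was amazing.,The cats are playing.\n")
reader = csv.reader(sample)
next(reader)  # skip the header row

stemmed_rows = [
    [stem_sentence(cell, toy_stem, TOY_STOPS) for cell in row[:2]]
    for row in reader
]

for first, second in stemmed_rows:
    print(first)
    print(second)
```

With the real Porter stemmer the exact stems may differ from what the toy stemmer produces; the point is only that each column comes back as one joined string per row.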