Python and nGrams
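Aster user here, trying to move completely over to Python for basic text analytics.
I am trying to replicate the output of Aster's ngram function in Python, using nltk or some other module. I need to be able to do this for n-grams of size 1 through 4, and write the output to CSV.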
Data:
Unique_ID, Text_Narrative
Required output:
Unique_id, ngram(token), ngram(frequency)
Example output:
- 023345 "I" 1
- 023345 "Love" 1
- 023345 "Python" 1
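For educational reasons, I wrote this simple version using only Python's standard library. It covers n-grams of size 1 to 3. Production code should use spacy and pandas.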
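import collections
from operator import itemgetter as at

# Each input line is "Unique_ID,Text_Narrative". Split on the first comma
# only, so commas inside the narrative stay part of the text.
with open("input.csv", 'r') as f:
    data = [l.rstrip('\n').split(',', 1) for l in f if ',' in l]

# Join the words of an n-gram, but only when all of its tokens share one ID;
# otherwise return [] so that filter(any, ...) drops it.
spaced = lambda t: (t[0][0], ' '.join(map(at(1), t))) if len({i for i, _ in t}) == 1 else []

unigrams = [(i, w) for i, d in data for w in d.split()]
bigrams = filter(any, map(spaced, zip(unigrams, unigrams[1:])))
trigrams = filter(any, map(spaced, zip(unigrams, unigrams[1:], unigrams[2:])))

# Write one "id,ngram,count" line per distinct (id, ngram) pair.
with open("output.csv", 'w') as f:
    for ngram in [unigrams, bigrams, trigrams]:
        counts = collections.Counter(ngram)
        for t, count in counts.items():
            f.write("{i},{w},{c}\n".format(c=count, i=t[0], w=t[1]))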
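The question asks for n-grams of size 1 through 4, so here is a rough sketch of how the same idea might be generalized. It is not the code from the answer above: it builds n-grams one row at a time (so none can span two narratives) and uses the standard `csv` module so that quoted commas in the narrative are handled. The helper names `ngrams`, `count_ngrams` and `write_counts`, the `max_n` parameter, the output header names, and the in-memory sample row are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(rows, max_n=4):
    """Count n-grams of size 1..max_n, keyed by (unique_id, ngram)."""
    counts = Counter()
    for row in rows:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        unique_id, text = row[0], row[1]
        tokens = text.split()  # plain whitespace tokenization (assumption)
        for n in range(1, max_n + 1):
            counts.update((unique_id, gram) for gram in ngrams(tokens, n))
    return counts


def write_counts(counts, out_file):
    """Write Unique_id, ngram, frequency rows with CSV quoting."""
    writer = csv.writer(out_file)
    writer.writerow(["Unique_id", "ngram", "frequency"])
    for (unique_id, gram), freq in counts.items():
        writer.writerow([unique_id, gram, freq])


# Hypothetical in-memory sample standing in for input.csv.
sample = io.StringIO("023345,I love Python love Python\n")
counts = count_ngrams(csv.reader(sample))

out = io.StringIO()
write_counts(counts, out)
print(out.getvalue())
```

To run this against real files, pass `csv.reader(open("input.csv", newline=""))` to `count_ngrams` and an `open("output.csv", "w", newline="")` handle to `write_counts`. If the input has a header row, it would need to be skipped first.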
As others have said, the question is really vague, but since you are new here, here is a detailed guide. :-)
from collections import Counter

# Your starting input: a phrase with an ID.
# I added some extra words to show the counts.
dict1 = {'023345': 'I love Python love Python Python'}

# Split the dict value into a list of words for counting.
dict1['023345'] = dict1['023345'].split()

# Use Counter to count the words.
countlist = Counter(dict1['023345'])
# countlist is now Counter({'Python': 3, 'love': 2, 'I': 1})

# To output it the way you requested, iterate over the dict.
for id1 in dict1:
    for word, count in countlist.items():
        print(id1, word, count)