How to get the consecutive word count of a string in Python
I am trying to write a Python script that takes a string and gives the count of consecutive words.
For example:
string = " i have no idea how to write this script. i have an idea."
output =
['i', 'have'] 2
['have', 'no'] 1
['no', 'idea'] 1
['idea', 'how'] 1
['how', 'to'] 1
['to', 'write'] 1
...
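I am trying to do this in Python without importing Counter from collections. What I have so far is below. I am trying to use re.findall(#whatpatterndoiuse, string) to walk through the string and compare, but I am having trouble figuring out how to proceed.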
import re

punctuation = re.compile(r'[^\w\s]')  # assumed definition; not shown in the original post
word_list = re.split(r'\s+', string.lower())
freq_dic = {}  # empty dictionary
for word in word_list:
    word = punctuation.sub("", word)
    freq_dic[word] = freq_dic.get(word, 0) + 1
freq_list = sorted(freq_dic.items())
for word, freq in freq_list:
    print(word, freq)
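The version below uses Counter from collections, which I do not want. It also produces output in a format different from the one I described above.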
import re
from collections import Counter
with open('a.txt') as f:
    words = re.findall(r'\w+', f.read())
print(Counter(zip(words, words[1:])))
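You need to solve three problems:
- generate all pairs of words (['i', 'have'], ['have', 'no'], ...);
- count the occurrences of each pair;
- sort the pairs from most common to least common.
The second problem is easily solved with a Counter. Counter objects also provide a most_common() method, which solves the third problem.
The first problem can be solved in several ways. The most compact is to use zip: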
>>> import re
>>> s = 'i have no idea how to write this script. i have an idea.'
>>> words = re.findall(r'\w+', s)
>>> pairs = zip(words, words[1:])
>>> list(pairs)
[('i', 'have'), ('have', 'no'), ('no', 'idea'), ...]
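As one of the other ways mentioned above, the pairs can also be built with explicit indices instead of zip. This is only a minimal sketch; the helper name pairs_by_index is made up for illustration and is not part of the original answer.

```python
import re

def pairs_by_index(words):
    """Return consecutive word pairs using explicit indices (hypothetical helper)."""
    # range stops one short of the end, so words[i + 1] is always valid
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

s = 'i have no idea how to write this script. i have an idea.'
words = re.findall(r'\w+', s)
pairs = pairs_by_index(words)
print(pairs[:3])  # [('i', 'have'), ('have', 'no'), ('no', 'idea')]
```

For a list with fewer than two words the range is empty, so the result is an empty list and no special case is needed.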
Putting everything together:
import collections
import re

def count_pairs(s):
    """
    Returns a mapping that links each pair of words
    to its number of occurrences.
    """
    words = re.findall(r'\w+', s.lower())
    pairs = zip(words, words[1:])
    return collections.Counter(pairs)

def print_freqs(s):
    """
    Prints the number of occurrences of word pairs
    from the most common to the least common.
    """
    cnt = count_pairs(s)
    for pair, count in cnt.most_common():
        print(list(pair), count)
Edit: I just realized I accidentally read "with collections, counters, ..." instead of "without importing collections, ...". My bad, sorry.
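Since the question asked for a version without collections, here is a rough sketch of the same idea using only a plain dict and sorted(). The function names count_pairs_plain and sorted_pairs are invented for this example, and it assumes Python 3.7+ so that dicts keep insertion order.

```python
import re

def count_pairs_plain(s):
    """Count consecutive word pairs with a plain dict (no collections)."""
    words = re.findall(r'\w+', s.lower())
    counts = {}
    for pair in zip(words, words[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def sorted_pairs(counts):
    """Return (pair, count) items ordered from most to least common."""
    # sorted() is stable, so pairs with equal counts keep first-seen order
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

text = " i have no idea how to write this script. i have an idea."
for pair, count in sorted_pairs(count_pairs_plain(text)):
    print(list(pair), count)
```

The first line printed is ['i', 'have'] 2, followed by the pairs that occur once, which matches the output format requested in the question.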
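Solving this without zip is fairly simple. Just build a tuple for each pair of words and track their counts in a dictionary. There are only a couple of special cases to watch for: when the input string has only one word, and when you are at the end of the string.
Give this a try: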
def freq(input_string):
    freq = {}
    words = input_string.split()
    if len(words) == 1:
        return freq
    for idx, word in enumerate(words):
        if idx+1 < len(words):
            word_pair = (word, words[idx+1])
            if word_pair in freq:
                freq[word_pair] += 1
            else:
                freq[word_pair] = 1
    return freq
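One thing worth noting about the function above: it splits on whitespace only, so punctuation stays attached to the words and "idea" and "idea." are counted as different words. The short comparison below is just an illustration of that difference; the variable names are arbitrary.

```python
import re

text = "i have no idea how to write this script. i have an idea."

# Whitespace split: punctuation stays attached to the neighbouring word
split_words = text.split()
# Regex tokenizing: keeps only runs of letters, digits and underscores
regex_words = re.findall(r'\w+', text.lower())

print(split_words[-1])  # idea.
print(regex_words[-1])  # idea
```

If pairs such as ('no', 'idea') and ('an', 'idea.') should share the same second word, tokenizing with the regex before building the pairs is one possible fix.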
string = "i have no idea how to write this script. i have an idea."
def count_words(string):
    ''' warning, won't work with a leading or trailing space,
    though all you would have to do is check if there is one, and remove it.'''
    x = string.split(' ')
    return len(x)
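The leading/trailing space problem mentioned in the docstring comes from passing ' ' to split. A small sketch of the difference follows; count_words_any_spacing is a hypothetical name used only here.

```python
def count_words_any_spacing(text):
    """Count words; split() with no argument ignores leading, trailing and repeated whitespace."""
    return len(text.split())

padded = " i have an idea. "
print(len(padded.split(' ')))           # 6, because empty strings appear at both ends
print(count_words_any_spacing(padded))  # 4
```

Calling split() with no separator treats any run of whitespace as one delimiter, so no manual trimming is needed.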
I figured it out; the answer is posted below. :) It takes a TXT file, but it can easily be adapted to take a string. Simply remove arg1 and insert your own string!
import re
import sys

# conditions: the script expects the script name plus one file argument
if len(sys.argv) != 2:
    print('doesnt work insert 2 arguments\n')
    sys.exit()
script, arg1 = sys.argv  # takes 2 arguments

with open(arg1, 'r') as content_file:  # open file
    textsplit = content_file.read()  # read it
textsplit = textsplit.lower()  # lowercase it
textsplit = re.sub(r"[^\w\s]+", "", textsplit).split()  # strip punctuation, then split into words
freq_dic = {}  # creates empty dictionary
for i in range(0, len(textsplit) - 1):  # index of the first word of each pair
    key = textsplit[i] + ',' + textsplit[i+1]  # produces corresponding keys
    try:
        freq_dic[key] += 1  # pair already seen
    except KeyError:
        freq_dic[key] = 1  # first occurrence of this pair
for word in freq_dic:
    print([word], freq_dic[word])