Haskell 中的频率计数

Frequency Count in Haskell

所以我要完成的任务是从字符串 s 中获取长度为 n 的所有可能模式的频率列表。

Input (Text,Length) 
Output (Frequency)
String -> Int -> [Int]
freqCount s n = frequency list [int] in alphabetical order

["A","B","C","D"]

其中 s 中的字符仅限于上述四个字符,因此我认为第一步是获取长度为 n 的所有可能的排列,并允许重复。

permutationsR k = sort(replicateM k ['A','B','C','D'])

例如。

permutationsR 2 

会给出输出

 ["AA","AB","AC","AD","BA","BB","BC","BD","CA","CB","CC","CD","DA","DB","DC","DD"]

然后为每个模式计算每个模式出现的次数 所以像

patternCount:: String -> String -> Int
patternCount text pattern = length (filter (\x -> x == pattern) [take (length(pattern)) (drop x text)| x <- [0..length(text)- length(pattern)]])
frequencyCount s n = map (\x -> patternCount s x) (permutationsR n)

但是我认为这将是非常低效的,因为我实际上是遍历整个列表来检查每个模式长度 (permutationsR n) 次,而不是我认为应该只在一次迭代中就可以做的事情。

有没有办法像我在命令式语言中那样生成频率图。

即在伪代码中

where s = string
and n = length of pattern
//pattern is a map where key = pattern and value = frequencyCount

patterns = {"AA":0,"AB":0,"AC:0...}
for (i = 0; i < (length s - n); i++){
    patterns[s[i:(i+n)]] += 1
}

基本上只是迭代一次,将字符串从 (i:i+n) 中拆分出来 并在每次出现时更新模式图。

示例输入,输出是这样的

s= "AABBCA"
n = 2
frequencyList s n = [1,1,0,0,0,1,1,0,1,0,0,0,0,0,0,0]

这是一个可能的解决方案:

import Data.List

groupsOf n str =
  unfoldr (\s ->
            if length s < n
            then Nothing
            else Just (take n s, tail s)) str

frequency :: (Ord t, Eq t) => [t] -> [(t, Int)]
frequency =
  map (\s -> (head s, length s)) . group . sort

groupsOf 将输入字符串拆分为长度为 n 的重叠序列。例如,

groupsOf 3 "AABCABC"

会给

["AAB", "ABC", "BCA", "CAB", "ABC"].  

frequency然后会统计每个子序列出现的次数,所以

frequency $ groupsOf 3 "AABCABC"

应该给

[("AAB", 1), ("ABC", 2), ("BCA", 1), ("CAB", 1)].

任何未出现在结果中的子序列出现零次。