Building a Transition Matrix using words in Python/Numpy

I am trying to build a 3x3 transition matrix from this data:

days=['rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds', 
  'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun', 
  'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
  'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain', 
  'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain', 
  'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun', 
  'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun', 
  'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 
  'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 
  'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain', 
  'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain', 
  'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
  'sun', 'sun', 'rain']



from collections import Counter

# Fragment from a class method; self.transitionMatrix is a 3x3 array.
# The columns are today.
sun_total_count = 0
temp_dict = {'sun': 0, 'clouds': 0, 'rain': 0}
total_runs = 0
for (x, y), c in Counter(zip(days, days[1:])).items():
    # if column 0 is sun (compare strings with ==, not "is")
    if x == 'sun':
        # find the sum of all the numbers in this column
        sun_total_count += c
        total_runs += 1
        if y == 'sun':
            temp_dict['sun'] = c
        if y == 'clouds':
            temp_dict['clouds'] = c
        if y == 'rain':
            temp_dict['rain'] = c

        if total_runs == 3:
            self.transitionMatrix[0][0] = temp_dict['sun'] / sun_total_count
            self.transitionMatrix[1][0] = temp_dict['clouds'] / sun_total_count
            self.transitionMatrix[2][0] = temp_dict['rain'] / sun_total_count

return self.transitionMatrix


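  1. Convert the reports from day names into index codes.
  2. Iterate through the array, getting the codes for yesterday's weather and today's.
  3. Use those indices to tabulate the combinations in your 3x3 matrix.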


report = [
  'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds', 
  'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun', 
  'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
  'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain', 
  'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain', 
  'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun', 
  'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun', 
  'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 
  'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 
  'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain', 
  'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain', 
  'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
  'sun', 'sun', 'rain']

weather_dict = {"sun":0, "clouds":1, "rain": 2}
weather_code = [weather_dict[day] for day in report]
print(weather_code)

for n in range(1, len(weather_code)):
    yesterday_code = weather_code[n-1]
    today_code     = weather_code[n]

# You now have the indices you need for your 3x3 matrix.
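Step 3 is left open above. A minimal sketch of how the counting might be finished, assuming a short sample `report` (hypothetical; substitute the full list) and the names `counts` and `transition`, which are not from the answer:

```python
import numpy as np

# Short sample for illustration; substitute the full report list.
report = ['rain', 'rain', 'sun', 'clouds', 'sun', 'sun', 'rain']

weather_dict = {"sun": 0, "clouds": 1, "rain": 2}
weather_code = [weather_dict[day] for day in report]

# counts[i, j] = number of times state i was followed by state j
counts = np.zeros((3, 3), dtype=int)
for n in range(1, len(weather_code)):
    yesterday_code = weather_code[n - 1]
    today_code = weather_code[n]
    counts[yesterday_code, today_code] += 1

# Normalise each row so it sums to 1 (rows = yesterday, columns = today).
# Assumes every state occurs at least once as "yesterday".
transition = counts / counts.sum(axis=1, keepdims=True)
print(transition)
```

If the columns should represent the earlier day instead, `transition.T` gives that layout.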


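from collections import Counter

# data is the list of daily reports (e.g. the days list from the question)
def probabilityMatrix():
    # Count each state only where it has a successor (skip the last
    # day), so the probabilities out of every state sum to 1.
    occurrencesOfEach = Counter(data[:-1])
    myMatrix = Counter(zip(data, data[1:]))
    probabilityMatrix = {key : myMatrix[key] / occurrencesOfEach[key[0]] for key in myMatrix}
    return probabilityMatrix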



def getTomorrowsProbability(weather):
    probMatrix = probabilityMatrix()
    return {key[1] : probMatrix[key]  for key in probMatrix if key[0] == weather}
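For illustration, here is a self-contained sketch of how such a lookup might be used on a short sample list. The sample `data` is hypothetical, and the functions take `data` as a parameter here, which is an assumption (the answer reads it from an outer scope):

```python
from collections import Counter

def probability_matrix(data):
    # Count each state only where it has a successor, so the
    # probabilities out of every starting state sum to 1.
    occurrences = Counter(data[:-1])
    pair_counts = Counter(zip(data, data[1:]))
    return {pair: n / occurrences[pair[0]] for pair, n in pair_counts.items()}

def tomorrows_probability(data, weather):
    prob = probability_matrix(data)
    # Keep only the transitions that start from the given weather
    return {nxt: p for (cur, nxt), p in prob.items() if cur == weather}

data = ['rain', 'rain', 'sun', 'clouds', 'sun', 'sun', 'rain']
print(tomorrows_probability(data, 'rain'))
# {'rain': 0.5, 'sun': 0.5}
```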


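I like a combination of pandas + itertools for this. The code block is a bit longer than the ones above, but don't conflate verbosity with speed. (The window function should be very fast; the pandas portion will admittedly be slower.)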

First, make a "window" function. Here is one from the itertools recipes. It gets you to a list of tuples of transitions (state1 to state2).

from itertools import islice

def window(seq, n=2):
    """Sliding window width n from seq.  From old itertools recipes."""
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

# list(window(days))
# [('rain', 'rain'),
#  ('rain', 'rain'),
#  ('rain', 'clouds'),
#  ('clouds', 'rain'),
#  ('rain', 'sun'),
# ...
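As a quick sanity check, for n=2 the window function should produce the same pairs as zipping the sequence with itself shifted by one. A small sketch, using a short hypothetical sample rather than the full days list:

```python
from itertools import islice

def window(seq, n=2):
    """Sliding window of width n over seq."""
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

sample = ['rain', 'rain', 'sun', 'clouds']
pairs = list(window(sample))
print(pairs)
# [('rain', 'rain'), ('rain', 'sun'), ('sun', 'clouds')]

# For n=2 this matches the zip idiom used elsewhere on this page.
assert pairs == list(zip(sample, sample[1:]))
```

The window version also generalises to longer histories (n=3 and up) and works on any iterable, not only on sequences that support slicing.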

Then use a pandas groupby + value_counts operation to get a transition matrix from each state1 to each state2:

import pandas as pd

pairs = pd.DataFrame(window(days), columns=['state1', 'state2'])
counts = pairs.groupby('state1')['state2'].value_counts()
probs = (counts / counts.sum()).unstack()


state2  clouds  rain   sun
clouds    0.13  0.09  0.10
rain      0.06  0.11  0.09
sun       0.13  0.06  0.23
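The table above holds joint probabilities (all nine cells together sum to 1). If conditional probabilities are wanted instead, with each row summing to 1, one option is to normalise within each group. A sketch, assuming a short hypothetical sample and building the pairs with `zip` to keep the block self-contained:

```python
import pandas as pd

# Short sample for illustration; substitute the full days list.
sample = ['rain', 'rain', 'sun', 'clouds', 'sun', 'sun', 'rain']
pairs = pd.DataFrame(list(zip(sample, sample[1:])),
                     columns=['state1', 'state2'])

# normalize=True divides each count by the size of its state1 group
cond = (pairs.groupby('state1')['state2']
             .value_counts(normalize=True)
             .unstack(fill_value=0))
print(cond)
```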

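Here is a "pure" numpy solution. It creates 3x3 tables where the zeroth dimension (row number) corresponds to today and the last dimension (column number) corresponds to tomorrow.

The conversion from words to indices is done by truncating each word after its first letter and then using a lookup table.

For counting, numpy.add.at is used.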


import numpy as np

report = [
  'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds', 
  'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun', 
  'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
  'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain', 
  'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain', 
  'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun', 
  'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun', 
  'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 
  'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 
  'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain', 
  'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain', 
  'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
  'sun', 'sun', 'rain']

# create np array, keep only first letter (by forcing dtype)
# obviously, this only works because rain, sun, clouds start with different
# letters
# cast to int type so we can use for indexing
ri = np.array(report, dtype='|S1').view(np.uint8)
# create lookup
c, r, s = 99, 114, 115 # you can verify this using chr and ord
lookup = np.empty((s+1,), dtype=int)
lookup[[c, r, s]] = np.arange(3)
# translate c, r, s to 0, 1, 2
rc = lookup[ri]
# get counts (of pairs (today, tomorrow))
cnts = np.zeros((3, 3), dtype=int)
np.add.at(cnts, (rc[:-1], rc[1:]), 1)
# or as probs
probs = cnts / cnts.sum()
# or as conditional probs (if today is sun how probable is rain tomorrow etc.)
cond = cnts / cnts.sum(axis=-1, keepdims=True)


# [[13  9 10]
#  [ 6 11  9]
#  [13  6 23]]
# [[ 0.13  0.09  0.1 ]
#  [ 0.06  0.11  0.09]
#  [ 0.13  0.06  0.23]]
# [[ 0.40625     0.28125     0.3125    ]
#  [ 0.23076923  0.42307692  0.34615385]
#  [ 0.30952381  0.14285714  0.54761905]]
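The first-letter trick only works when all states start with different letters. A possible variation that avoids it is to let `np.unique` assign the codes; this swaps in a different technique from the answer's lookup table, and the short sample list is hypothetical:

```python
import numpy as np

# Short sample for illustration; substitute the full report list.
report = ['rain', 'rain', 'sun', 'clouds', 'sun', 'sun', 'rain']

# states: sorted unique labels; codes: index of each report entry in states
states, codes = np.unique(report, return_inverse=True)
n = len(states)

# count pairs (today, tomorrow)
cnts = np.zeros((n, n), dtype=int)
np.add.at(cnts, (codes[:-1], codes[1:]), 1)

# conditional probabilities: each row (today) sums to 1
# assumes every state has at least one successor in the data
cond = cnts / cnts.sum(axis=-1, keepdims=True)
print(states)
print(cond)
```

Rows and columns follow the order of `states`, which is alphabetical here.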

Below is another alternative using pandas. The transitions list can be replaced with 'rain', 'clouds', etc.

import pandas as pd
transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D'] * 2
# Pair each state with its successor; the last state has no successor
df = pd.DataFrame({'state': transitions[:-1], 'next_state': transitions[1:]})
cross_tab = pd.crosstab(df['state'], df['next_state'])
cross_tab.div(cross_tab.sum(axis=1), axis=0)

If you don't mind using pandas, there is a one-liner for extracting the transition probabilities:
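The call itself does not appear here. One form consistent with the table and with the normalize discussion further down would be `pd.crosstab` on two shifted copies of the list. This is a sketch under that assumption; the series name 'Tomorrow' and the short sample list are also assumptions:

```python
import pandas as pd

# Short sample for illustration; substitute the full days list.
days = ['rain', 'rain', 'sun', 'clouds', 'sun', 'sun', 'rain']

# Rows: tomorrow's weather, columns: today's weather.
# normalize=1 makes each column sum to 1.
probs = pd.crosstab(pd.Series(days[1:], name='Tomorrow'),
                    pd.Series(days[:-1], name='Today'),
                    normalize=1)
print(probs)
```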



Today      clouds      rain       sun
clouds    0.40625  0.230769  0.309524
rain      0.28125  0.423077  0.142857
sun       0.31250  0.346154  0.547619

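Here the (forward) probability that tomorrow will be sunny given that it rained today is found in column 'rain', row 'sun' (0.346154). If you would like backward probabilities instead (what the weather might have been yesterday, given the weather today), switch the first two parameters.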

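If you would like to store the probabilities in rows rather than columns, set normalize=0, but note that if you do that directly on this example you obtain the backward probabilities stored as rows. If you want the same result as above but transposed, you can either a) transpose it, or b) switch the order of the first two parameters and set normalize to 0.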

If you just want to keep the result as a numpy 2-D array (rather than a pandas DataFrame), type .values after the last parenthesis.
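A sketch of these variations, assuming the probabilities come from `pd.crosstab` (which accepts a `normalize` argument) and using a short hypothetical sample list; the variable names are illustrative only:

```python
import pandas as pd

# Short sample for illustration; substitute the full days list.
days = ['rain', 'rain', 'sun', 'clouds', 'sun', 'sun', 'rain']
today = pd.Series(days[:-1], name='Today')
tomorrow = pd.Series(days[1:], name='Tomorrow')

# Forward probabilities stored in columns (each column sums to 1)
by_col = pd.crosstab(tomorrow, today, normalize=1)

# Same numbers stored in rows: swap the arguments, normalise over rows
by_row = pd.crosstab(today, tomorrow, normalize=0)

# Option a) transposing and option b) swapping arguments agree
assert abs(by_col.T.values - by_row.values).max() < 1e-12

# Plain 2-D numpy array instead of a DataFrame
arr = by_row.values
print(arr)
```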