使用 Python/Numpy 中的单词构建转换矩阵
Building a Transition Matrix using words in Python/Numpy
我正在尝试使用此数据构建一个 3x3 转换矩阵
days=['rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
目前,我正在使用一些临时词典和一些单独计算每种天气概率的列表来做这件事。这不是一个很好的解决方案。有人可以指导我更合理地解决这个问题吗?
self.transitionMatrix=np.zeros((3,3))
#the columns are today
sun_total_count = 0
temp_dict={'sun':0, 'clouds':0, 'rain':0}
total_runs = 0
for (x, y), c in Counter(zip(data, data[1:])).items():
#if column 0 is sun
if x is 'sun':
#find the sum of all the numbers in this column
sun_total_count += c
total_runs += 1
if y is 'sun':
temp_dict['sun'] = c
if y is 'clouds':
temp_dict['clouds'] = c
if y is 'rain':
temp_dict['rain'] = c
if total_runs is 3:
self.transitionMatrix[0][0] = temp_dict['sun']/sun_total_count
self.transitionMatrix[1][0] = temp_dict['clouds']/sun_total_count
self.transitionMatrix[2][0] = temp_dict['rain']/sun_total_count
return self.transitionMatrix
对于每种类型的天气,我需要计算第二天的概率
- 将报告从天数转换为索引代码。
- 遍历数组,获取昨天和今天的天气代码。
- 使用这些索引计算 3x3 矩阵中的组合。
这是让您入门的编码设置。
report = [
'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
weather_dict = {"sun":0, "clouds":1, "rain": 2}
weather_code = [weather_dict[day] for day in report]
print weather_code
for n in range(1, len(weather_code)):
yesterday_code = weather_code[n-1]
today_code = weather_code[n]
# You now have the indicies you need for your 3x3 matrix.
您似乎想创建一个矩阵,表示太阳之后下雨或太阳之后出现云(或等等)的概率。你可以像这样吐出概率矩阵(不是数学术语):
def probabilityMatrix():
tomorrowsProbability=np.zeros((3,3))
occurancesOfEach = Counter(data)
myMatrix = Counter(zip(data, data[1:]))
probabilityMatrix = {key : myMatrix[key] / occurancesOfEach[key[0]] for key in myMatrix}
return probabilityMatrix
print(probabilityMatrix())
但是,您可能想根据今天的天气吐出每种天气的概率:
def getTomorrowsProbability(weather):
probMatrix = probabilityMatrix()
return {key[1] : probMatrix[key] for key in probMatrix if key[0] == weather}
print(getTomorrowsProbability('sun'))
我喜欢 pandas
和 itertools
的组合。代码块比上面的要长一些,但不要将冗长与速度混为一谈。 (window
func 应该非常快;诚然,pandas 部分会更慢。)
首先,制作一个“window”函数。这是 itertools 食谱中的一个。这使您进入转换元组列表(state1 到 state2)。
from itertools import islice
def window(seq, n=2):
"""Sliding window width n from seq. From old itertools recipes."""
it = iter(seq)
result = tuple(islice(it, n))
if len(result) == n:
yield result
for elem in it:
result = result[1:] + (elem,)
yield result
# list(window(days))
# [('rain', 'rain'),
# ('rain', 'rain'),
# ('rain', 'clouds'),
# ('clouds', 'rain'),
# ('rain', 'sun'),
# ...
然后使用pandas groupby + value counts 操作得到每个state1到每个state2的转换矩阵:
import pandas as pd
pairs = pd.DataFrame(window(days), columns=['state1', 'state2'])
counts = pairs.groupby('state1')['state2'].value_counts()
probs = (counts / counts.sum()).unstack()
您的结果如下所示:
print(probs)
state2 clouds rain sun
state1
clouds 0.13 0.09 0.10
rain 0.06 0.11 0.09
sun 0.13 0.06 0.23
这是一个 "pure" numpy 解决方案,它创建 3x3 tables,其中第 0 个 dim(行号)对应于今天,最后一个 dim(列号)对应于明天。
从单词到索引的转换是通过截断第一个字母然后使用查找来完成的 table。
用于计数 numpy.add.at
。
写这篇文章时考虑到了效率。它在不到一秒的时间内完成一百万个单词。
import numpy as np
report = [
'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
# create np array, keep only first letter (by forcing dtype)
# obviously, this only works because rain, sun, clouds start with different
# letters
# cast to int type so we can use for indexing
ri = np.array(report, dtype='|S1').view(np.uint8)
# create lookup
c, r, s = 99, 114, 115 # you can verify this using chr and ord
lookup = np.empty((s+1,), dtype=int)
lookup[[c, r, s]] = np.arange(3)
# translate c, r, s to 0, 1, 2
rc = lookup[ri]
# get counts (of pairs (today, tomorrow))
cnts = np.zeros((3, 3), dtype=int)
np.add.at(cnts, (rc[:-1], rc[1:]), 1)
# or as probs
probs = cnts / cnts.sum()
# or as condional probs (if today is sun how probable is rain tomorrow etc.)
cond = cnts / cnts.sum(axis=-1, keepdims=True)
print(cnts)
print(probs)
print(cond)
# [13 9 10]
# [ 6 11 9]
# [13 6 23]]
# [[ 0.13 0.09 0.1 ]
# [ 0.06 0.11 0.09]
# [ 0.13 0.06 0.23]]
# [[ 0.40625 0.28125 0.3125 ]
# [ 0.23076923 0.42307692 0.34615385]
# [ 0.30952381 0.14285714 0.54761905]]
下面使用 pandas 的另一种选择。转换列表可以替换为 'rain'、'clouds' 等
import pandas as pd
transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D'] * 2
df = pd.DataFrame(columns = ['state', 'next_state'])
for i, val in enumerate(transitions[:-1]): # We don't care about last state
df_stg = pd.DataFrame(index=[0])
df_stg['state'], df_stg['next_state'] = transitions[i], transitions[i+1]
df = pd.concat([df, df_stg], axis = 0)
cross_tab = pd.crosstab(df['state'], df['next_state'])
cross_tab.div(cross_tab.sum(axis=1), axis=0)
如果您不介意使用 pandas
,可以使用一行代码来提取转移概率:
pd.crosstab(pd.Series(days[1:],name='Tomorrow'),
pd.Series(days[:-1],name='Today'),normalize=1)
输出:
Today clouds rain sun
Tomorrow
clouds 0.40625 0.230769 0.309524
rain 0.28125 0.423077 0.142857
sun 0.31250 0.346154 0.547619
此处在 'rain' 行 'sun' 列中找到假设今天下雨明天是晴天的(正向)概率。如果您想获得后向概率(鉴于今天的天气,昨天的天气可能是什么),请切换前两个参数。
如果您想将概率存储在行而不是列中,请设置 normalize=0
但请注意,如果您直接在本示例中这样做,您将获得存储为行的反向概率。如果您想获得与上述相同的结果但转置您可以 a) 是,转置或 b) 切换前两个参数的顺序并将 normalize
设置为 0。
如果您只想将结果保留为 numpy
二维数组(而不是 pandas 数据框),请在最后一个括号后键入 .values
。
我正在尝试使用此数据构建一个 3x3 转换矩阵
days=['rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
目前,我正在使用一些临时词典和一些单独计算每种天气概率的列表来做这件事。这不是一个很好的解决方案。有人可以指导我更合理地解决这个问题吗?
self.transitionMatrix=np.zeros((3,3))
#the columns are today
sun_total_count = 0
temp_dict={'sun':0, 'clouds':0, 'rain':0}
total_runs = 0
for (x, y), c in Counter(zip(data, data[1:])).items():
#if column 0 is sun
if x is 'sun':
#find the sum of all the numbers in this column
sun_total_count += c
total_runs += 1
if y is 'sun':
temp_dict['sun'] = c
if y is 'clouds':
temp_dict['clouds'] = c
if y is 'rain':
temp_dict['rain'] = c
if total_runs is 3:
self.transitionMatrix[0][0] = temp_dict['sun']/sun_total_count
self.transitionMatrix[1][0] = temp_dict['clouds']/sun_total_count
self.transitionMatrix[2][0] = temp_dict['rain']/sun_total_count
return self.transitionMatrix
对于每种类型的天气,我需要计算第二天的概率
- 将报告从天数转换为索引代码。
- 遍历数组,获取昨天和今天的天气代码。
- 使用这些索引计算 3x3 矩阵中的组合。
这是让您入门的编码设置。
report = [
'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
weather_dict = {"sun":0, "clouds":1, "rain": 2}
weather_code = [weather_dict[day] for day in report]
print weather_code
for n in range(1, len(weather_code)):
yesterday_code = weather_code[n-1]
today_code = weather_code[n]
# You now have the indicies you need for your 3x3 matrix.
您似乎想创建一个矩阵,表示太阳之后下雨或太阳之后出现云(或等等)的概率。你可以像这样吐出概率矩阵(不是数学术语):
def probabilityMatrix():
tomorrowsProbability=np.zeros((3,3))
occurancesOfEach = Counter(data)
myMatrix = Counter(zip(data, data[1:]))
probabilityMatrix = {key : myMatrix[key] / occurancesOfEach[key[0]] for key in myMatrix}
return probabilityMatrix
print(probabilityMatrix())
但是,您可能想根据今天的天气吐出每种天气的概率:
def getTomorrowsProbability(weather):
probMatrix = probabilityMatrix()
return {key[1] : probMatrix[key] for key in probMatrix if key[0] == weather}
print(getTomorrowsProbability('sun'))
我喜欢 pandas
和 itertools
的组合。代码块比上面的要长一些,但不要将冗长与速度混为一谈。 (window
func 应该非常快;诚然,pandas 部分会更慢。)
首先,制作一个“window”函数。这是 itertools 食谱中的一个。这使您进入转换元组列表(state1 到 state2)。
from itertools import islice
def window(seq, n=2):
"""Sliding window width n from seq. From old itertools recipes."""
it = iter(seq)
result = tuple(islice(it, n))
if len(result) == n:
yield result
for elem in it:
result = result[1:] + (elem,)
yield result
# list(window(days))
# [('rain', 'rain'),
# ('rain', 'rain'),
# ('rain', 'clouds'),
# ('clouds', 'rain'),
# ('rain', 'sun'),
# ...
然后使用pandas groupby + value counts 操作得到每个state1到每个state2的转换矩阵:
import pandas as pd
pairs = pd.DataFrame(window(days), columns=['state1', 'state2'])
counts = pairs.groupby('state1')['state2'].value_counts()
probs = (counts / counts.sum()).unstack()
您的结果如下所示:
print(probs)
state2 clouds rain sun
state1
clouds 0.13 0.09 0.10
rain 0.06 0.11 0.09
sun 0.13 0.06 0.23
这是一个 "pure" numpy 解决方案,它创建 3x3 tables,其中第 0 个 dim(行号)对应于今天,最后一个 dim(列号)对应于明天。
从单词到索引的转换是通过截断第一个字母然后使用查找来完成的 table。
用于计数 numpy.add.at
。
写这篇文章时考虑到了效率。它在不到一秒的时间内完成一百万个单词。
import numpy as np
report = [
'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
# create np array, keep only first letter (by forcing dtype)
# obviously, this only works because rain, sun, clouds start with different
# letters
# cast to int type so we can use for indexing
ri = np.array(report, dtype='|S1').view(np.uint8)
# create lookup
c, r, s = 99, 114, 115 # you can verify this using chr and ord
lookup = np.empty((s+1,), dtype=int)
lookup[[c, r, s]] = np.arange(3)
# translate c, r, s to 0, 1, 2
rc = lookup[ri]
# get counts (of pairs (today, tomorrow))
cnts = np.zeros((3, 3), dtype=int)
np.add.at(cnts, (rc[:-1], rc[1:]), 1)
# or as probs
probs = cnts / cnts.sum()
# or as condional probs (if today is sun how probable is rain tomorrow etc.)
cond = cnts / cnts.sum(axis=-1, keepdims=True)
print(cnts)
print(probs)
print(cond)
# [13 9 10]
# [ 6 11 9]
# [13 6 23]]
# [[ 0.13 0.09 0.1 ]
# [ 0.06 0.11 0.09]
# [ 0.13 0.06 0.23]]
# [[ 0.40625 0.28125 0.3125 ]
# [ 0.23076923 0.42307692 0.34615385]
# [ 0.30952381 0.14285714 0.54761905]]
下面使用 pandas 的另一种选择。转换列表可以替换为 'rain'、'clouds' 等
import pandas as pd
transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D'] * 2
df = pd.DataFrame(columns = ['state', 'next_state'])
for i, val in enumerate(transitions[:-1]): # We don't care about last state
df_stg = pd.DataFrame(index=[0])
df_stg['state'], df_stg['next_state'] = transitions[i], transitions[i+1]
df = pd.concat([df, df_stg], axis = 0)
cross_tab = pd.crosstab(df['state'], df['next_state'])
cross_tab.div(cross_tab.sum(axis=1), axis=0)
如果您不介意使用 pandas
,可以使用一行代码来提取转移概率:
pd.crosstab(pd.Series(days[1:],name='Tomorrow'),
pd.Series(days[:-1],name='Today'),normalize=1)
输出:
Today clouds rain sun
Tomorrow
clouds 0.40625 0.230769 0.309524
rain 0.28125 0.423077 0.142857
sun 0.31250 0.346154 0.547619
此处在 'rain' 行 'sun' 列中找到假设今天下雨明天是晴天的(正向)概率。如果您想获得后向概率(鉴于今天的天气,昨天的天气可能是什么),请切换前两个参数。
如果您想将概率存储在行而不是列中,请设置 normalize=0
但请注意,如果您直接在本示例中这样做,您将获得存储为行的反向概率。如果您想获得与上述相同的结果但转置您可以 a) 是,转置或 b) 切换前两个参数的顺序并将 normalize
设置为 0。
如果您只想将结果保留为 numpy
二维数组(而不是 pandas 数据框),请在最后一个括号后键入 .values
。