Python 分布的卡方检验
Chi squared test of a distribution in Python
我对这个练习有两个疑惑:
代码的第一部分完美运行。现在我需要使用卡方检验检查分布是否平坦。
我实现的代码是:
#UNIFORM RANDOM SAMPLING
import numpy as np #library needed for numerical calculations
import matplotlib.pyplot as plt #library needed for plotting purposes
from scipy.stats import chisquare #function needed for chi square test
#*******************************************************************************
i=np.uintc(987654321) #unsigned int variable i
m=10**3 #number of 10**3 events
list1=[i] #list1 needed to be updated with random i
for count in range(m): #for cycle over expected period and update i
i=np.uintc(i*663608941)
list1.append(i)
list1=np.divide(list1,(2**32)-1) #needed in order to normalize the list1 elements
bins1=int(np.sqrt(m)) #histogram bin numbers
[hist,bin_edges]=np.histogram(list1,bins=bins1) #compute the histogram of a dataset
#*******************************************************************************
f_exp=(m/bins1)*np.ones(bins1) #expected frequency, expresses in array form.
#we define an array of ones of the exact size
#as the number of bins, and then just multiply
#it with n/N where n is number of elements
#and N is number of bins.
#So it will look like [n/N,n/N,n/N...]
chisquareval=chisquare(hist,f_exp,axis=0) #Calculate a one-way chi-square test.
#The chi-square test tests the null hypothesis
#that the categorical data has the given frequencies.
#It needs: Observed frequencies in each category, Expected frequencies in each category
print("\n")
print("The result of chi squared test is:", chisquareval, "\n")
#*******************************************************************************
plt.figure() #a unique identifier for the figure
plt.hist(list1[0:m],bins=bins1) #compute and draw the histogram of x with n bins
plt.grid() #configure the grid lines
plt.xlabel('Bins',fontweight='bold') #set the label for the y-axis
plt.ylabel('Frequency',fontweight='bold') #set the label for the y-axis
plt.title('Uniform distribution: number of elements ^{3}$') #set a title for the hist
plt.show() #display all open figures
#*******************************************************************************
i=np.uintc(987654321) #unsigned int variable i
n=10**6 #number of 10**6 events
list1=[i] #list1 needed to be updated with random i
for count in range(n): #for cycle over expected period and update i
i=np.uintc(i*663608941)
list1.append(i)
list1=np.divide(list1,(2**32)-1) #needed in order to normalize the list1 elements
bins1=int(np.sqrt(n)) #histogram bin numbers
[hist,bin_edges]=np.histogram(list1,bins=bins1) #compute the histogram of a dataset
#*******************************************************************************
f_exp=(n/bins1)*np.ones(bins1) #expected frequency, expresses in array form.
#we define an array of ones of the exact size
#as the number of bins, and then just multiply
#it with n/N where n is number of elements
#and N is number of bins.
#So it will look like [n/N,n/N,n/N...]
chisquareval=chisquare(hist,f_exp,axis=0) #Calculate a one-way chi-square test.
#The chi-square test tests the null hypothesis
#that the categorical data has the given frequencies.
#It needs: Observed frequencies in each category, Expected frequencies in each category
print("\n")
print("The result of chi squared test is:", chisquareval, "\n")
#*******************************************************************************
plt.figure() #a unique identifier for the figure
plt.hist(list1[0:n],bins=bins1) #compute and draw the histogram of x with n bins
plt.grid() #configure the grid lines
plt.xlabel('Bins',fontweight='bold') #set the label for the x-axis
plt.ylabel('Frequency',fontweight='bold') #set the label for the y-axis
plt.title('Uniform distribution: number of elements ^{6}$') #set a title for the hist
plt.show() #display all open figures
#*******************************************************************************
具有以下输出:
对于第一个直方图,一切正常,有效。在第二个直方图中,我们可以看到一个错误的统计值。我过去运行过该代码,结果是 103,但我没有更改代码!
- 为什么会这样?
- 为什么 chisquare 的显示输出看起来很糟糕?
Power_divergenceResult(statistic=32.315000000000005, pvalue=0.35302378840079285)
可以分开打印statistc和pvalue吗?
我明白这个问题了。代码是正确的,但如果我没有找到 ddof=1000 的 table,我无法判断原假设是否正确。所以得到 bin=ddof=100 我可以比较卡方检验的结果,在这种情况下我可以说假设被拒绝,错误为 2%。
我对这个练习有两个疑惑:
代码的第一部分完美运行。现在我需要使用卡方检验检查分布是否平坦。
我实现的代码是:
#UNIFORM RANDOM SAMPLING
import numpy as np #library needed for numerical calculations
import matplotlib.pyplot as plt #library needed for plotting purposes
from scipy.stats import chisquare #function needed for chi square test
#*******************************************************************************
i=np.uintc(987654321) #unsigned int variable i
m=10**3 #number of 10**3 events
list1=[i] #list1 needed to be updated with random i
for count in range(m): #for cycle over expected period and update i
i=np.uintc(i*663608941)
list1.append(i)
list1=np.divide(list1,(2**32)-1) #needed in order to normalize the list1 elements
bins1=int(np.sqrt(m)) #histogram bin numbers
[hist,bin_edges]=np.histogram(list1,bins=bins1) #compute the histogram of a dataset
#*******************************************************************************
f_exp=(m/bins1)*np.ones(bins1) #expected frequency, expresses in array form.
#we define an array of ones of the exact size
#as the number of bins, and then just multiply
#it with n/N where n is number of elements
#and N is number of bins.
#So it will look like [n/N,n/N,n/N...]
chisquareval=chisquare(hist,f_exp,axis=0) #Calculate a one-way chi-square test.
#The chi-square test tests the null hypothesis
#that the categorical data has the given frequencies.
#It needs: Observed frequencies in each category, Expected frequencies in each category
print("\n")
print("The result of chi squared test is:", chisquareval, "\n")
#*******************************************************************************
plt.figure() #a unique identifier for the figure
plt.hist(list1[0:m],bins=bins1) #compute and draw the histogram of x with n bins
plt.grid() #configure the grid lines
plt.xlabel('Bins',fontweight='bold') #set the label for the y-axis
plt.ylabel('Frequency',fontweight='bold') #set the label for the y-axis
plt.title('Uniform distribution: number of elements ^{3}$') #set a title for the hist
plt.show() #display all open figures
#*******************************************************************************
i=np.uintc(987654321) #unsigned int variable i
n=10**6 #number of 10**6 events
list1=[i] #list1 needed to be updated with random i
for count in range(n): #for cycle over expected period and update i
i=np.uintc(i*663608941)
list1.append(i)
list1=np.divide(list1,(2**32)-1) #needed in order to normalize the list1 elements
bins1=int(np.sqrt(n)) #histogram bin numbers
[hist,bin_edges]=np.histogram(list1,bins=bins1) #compute the histogram of a dataset
#*******************************************************************************
f_exp=(n/bins1)*np.ones(bins1) #expected frequency, expresses in array form.
#we define an array of ones of the exact size
#as the number of bins, and then just multiply
#it with n/N where n is number of elements
#and N is number of bins.
#So it will look like [n/N,n/N,n/N...]
chisquareval=chisquare(hist,f_exp,axis=0) #Calculate a one-way chi-square test.
#The chi-square test tests the null hypothesis
#that the categorical data has the given frequencies.
#It needs: Observed frequencies in each category, Expected frequencies in each category
print("\n")
print("The result of chi squared test is:", chisquareval, "\n")
#*******************************************************************************
plt.figure() #a unique identifier for the figure
plt.hist(list1[0:n],bins=bins1) #compute and draw the histogram of x with n bins
plt.grid() #configure the grid lines
plt.xlabel('Bins',fontweight='bold') #set the label for the x-axis
plt.ylabel('Frequency',fontweight='bold') #set the label for the y-axis
plt.title('Uniform distribution: number of elements ^{6}$') #set a title for the hist
plt.show() #display all open figures
#*******************************************************************************
具有以下输出:
对于第一个直方图,一切正常,有效。在第二个直方图中,我们可以看到一个错误的统计值。我过去运行过该代码,结果是 103,但我没有更改代码!
- 为什么会这样?
- 为什么 chisquare 的显示输出看起来很糟糕?
Power_divergenceResult(statistic=32.315000000000005, pvalue=0.35302378840079285)
可以分开打印statistc和pvalue吗?
我明白这个问题了。代码是正确的,但如果我没有找到 ddof=1000 的 table,我无法判断原假设是否正确。所以得到 bin=ddof=100 我可以比较卡方检验的结果,在这种情况下我可以说假设被拒绝,错误为 2%。