Chi-Square Test of Independence in R
I have a technical question related to the structure of my df.
It looks like this:
Month District Age Gender Education Disability Religion Occupation JobSeekers GMI
1 2020-01 Dan U17 Male None None Jewish Unprofessional workers 2 0
2 2020-01 Dan U17 Male None None Muslims Sales and costumer service 1 0
3 2020-01 Dan U17 Female None None Other Undefined 1 0
4 2020-01 Dan 18-24 Male None None Jewish Production and construction 1 0
5 2020-01 Dan 18-24 Male None None Jewish Academic degree 1 0
6 2020-01 Dan 18-24 Male None None Jewish Practical engineers and technicians 1 0
ACU NACU NewSeekers NewFiredSeekers
1 0 2 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 0 1 0 0
6 0 1 1 1
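I'm looking for a way to run a chi-square test of independence between two variables such as District and JobSeekers, so that I can tell whether the North district is more strongly associated with job seekers than the South district.
As far as I can tell, the problem is the data structure: District is a character, and JobSeekers is an integer that says how many job seekers I have for a given combination of district, gender, occupation, and so on.
I tried to aggregate it by district and job seekers, like this: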
Month District JobSeekers GMI ACU NACU NewSeekers NewFiredSeekers
<chr> <chr> <int> <int> <int> <int> <int> <int>
1 2020-01 Dan 33071 4694 9548 18829 6551 4682
2 2020-01 Jerusalem 21973 7665 3395 10913 3589 2260
3 2020-01 North 47589 22917 4318 20354 6154 3845
4 2020-01 Sharon 25403 6925 4633 13845 4131 2727
5 2020-01 South 37089 18874 2810 15405 4469 2342
6 2020-02 Dan 32660 4554 9615 18491 5529 3689
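But this is even harder to work with.
Of course, I'm open to any other test that would work.
Please help, and let me know if you need more information,
Moshe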
Update
# t test for district vs new seekers
library(dplyr)
# aggregating NewSeekers by month and district
dist.newseek <- Cdata %>%
  group_by(Month, District) %>%
  summarise(NewSeekers = sum(NewSeekers))
# performing a t test on the mini table we created
t.test(NewSeekers ~ District,data=subset(dist.newseek,District %in% c("Dan","South")))
# results
Welch Two Sample t-test
data: NewSeekers by District
t = 0.68883, df = 4.1617, p-value = 0.5274
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-119952.3 200737.3
sample estimates:
mean in group Dan mean in group South
74608.25 34215.75
# Wilcoxon test
# aggregating NewSeekers by month and age group
age.newseek <- Cdata %>%
  group_by(Month, Age) %>%
  summarise(NewSeekers = sum(NewSeekers))
# performing a Wilcoxon test on the subset
wilcox.test(NewSeekers ~ Age, data = subset(age.newseek, Age %in% c("25-34", "45-54")))
# Results
Wilcoxon rank sum exact test
data: NewSeekers by Age
W = 11, p-value = 0.4857
alternative hypothesis: true location shift is not equal to 0
ANOVA
# Aggregating NewSeekers by month and occupation
occu.newseek <- Cdata %>%
  group_by(Month, Occupation) %>%
  summarise(NewSeekers = sum(NewSeekers))
## Convert Occupation to a factor
occu.newseek$Occupation <- as.factor(occu.newseek$Occupation)
## Get the occupation group means and standard deviations
group.mean.sd <- aggregate(
x = occu.newseek$NewSeekers, # Specify data column
by = list(occu.newseek$Occupation), # Specify group indicator
FUN = function(x) c('mean'=mean(x),'sd'= sd(x))
)
## Run one way ANOVA test
anova_one_way <- aov(NewSeekers~ Occupation, data = occu.newseek)
summary(anova_one_way)
## Run the Tukey Test to compare the groups
TukeyHSD(anova_one_way)
## Check the mean differences across the groups
library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = Occupation)) +
geom_boxplot() +
geom_jitter(shape = 15,
color = "steelblue",
position = position_jitter(0.21)) +
theme_classic()
Plot
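You can use ANOVA to compare multiple groups. If the omnibus ANOVA gives a statistically significant result, you can then check which districts are higher or lower with pairwise comparisons.
You can also refer to the UCLA website, which shows which statistical test to choose for a given kind of data. The link is here.
As a simple example, let me show how to run an ANOVA test.
Here is your data: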
head(df)
r$> head(df)
Month District Age Gender Education Disability Religion Occupation JobSeekers GMI ACU NACU NewSeekers NewFiredSeekers
1 2020-01 Dan 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
2 2020-01 North 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
3 2020-01 North 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
4 2020-01 South 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
5 2020-01 Dan 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
6 2020-01 Jerusalem 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
Because I needed more data points to run the test, I enlarged your data by bootstrapping, that is, by resampling rows. I also increased the number of job seekers in the North and South districts. You don't need to do these steps with your own data; this is only how I built the example.
# For the sake of this example, I increased the number of observations by bootstrapping the example data
for (i in 1:20) df <- rbind(df[sample(6, 5), ], df)
rownames(df) <- 1:nrow(df)
df$District <- sample(c("Jerusalem", "North", "Sharon", "South", "Dan"), nrow(df), replace = TRUE)
df$JobSeekers[df$District == "North"] <- sample(1:3, sum(df$District == "North"), replace = TRUE, prob = c(0.1, 0.5, 0.4))
df$JobSeekers[df$District == "South"] <- sample(4:6, sum(df$District == "South"), replace = TRUE, prob = c(0.1, 0.5, 0.4))
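If you want a simulated example like this to be reproducible, one option is to fix the random seed and resample the rows in a single call. This is only a minimal sketch, not the code used above: `toy` is a hypothetical stand-in for the six example rows, and the seed and the 100 extra rows are arbitrary choices.

```r
# Hypothetical stand-in for the six example rows
toy <- data.frame(
  District   = c("Dan", "North", "North", "South", "Dan", "Jerusalem"),
  JobSeekers = c(1L, 1L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

set.seed(123)  # arbitrary seed, only so that reruns draw the same rows

# Draw 100 row indices with replacement and append the resampled rows
extra <- toy[sample(nrow(toy), size = 100, replace = TRUE), ]
boot  <- rbind(toy, extra)
rownames(boot) <- NULL  # reset row names to 1..n

nrow(boot)  # 106
```

Without a fixed seed, every run produces different rows, so printed results cannot be reproduced exactly.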
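When you analyze a categorical variable, it is best to convert the character column to a factor. Doing so lets you control the levels of the factor.
## Convert District to a factor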
df$District <- as.factor(df$District)
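To illustrate what "controlling the levels" can mean in practice, here is a small sketch with a hypothetical `district` vector (not your data). The level order decides, for example, the order of the groups on a plot axis and which group comes first in pairwise comparisons.

```r
# Hypothetical district labels, for illustration only
district <- c("South", "Dan", "North", "Dan", "South")

# as.factor() orders the levels alphabetically
f_default <- as.factor(district)
levels(f_default)   # "Dan" "North" "South"

# factor() with explicit levels lets you fix the order yourself
f_ordered <- factor(district, levels = c("North", "South", "Dan"))
levels(f_ordered)   # "North" "South" "Dan"

# relevel() moves one level to the front, making it the reference group
f_ref <- relevel(f_default, ref = "South")
levels(f_ref)       # "South" "Dan" "North"
```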
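Next, get the group means and standard deviations to see whether there are any meaningful differences between the groups. As you can see, I altered the South and North districts, so they have the highest means compared with the other districts.
## Get the group means and standard deviations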
group.mean.sd <- aggregate(
x = df$JobSeekers, # Specify data column
by = list(df$District), # Specify group indicator
FUN = function(x) c('mean'=mean(x),'sd'= sd(x))
)
r$> group.mean.sd
Group.1 x.mean x.sd
1 Dan 1.1000000 0.3077935
2 Jerusalem 1.0000000 0.0000000
3 North 2.3225806 0.5992827
4 Sharon 1.1363636 0.3512501
5 South 5.2380952 0.4364358
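One detail that can be surprising: when `FUN` returns more than one value, `aggregate()` stores the results in a single matrix column, even though the printout looks like separate `x.mean` and `x.sd` columns. The sketch below uses hypothetical toy data to show this, and one common way to flatten the result with `do.call(data.frame, ...)`.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  District   = c("Dan", "Dan", "North", "North", "South", "South"),
  JobSeekers = c(1, 3, 2, 4, 5, 7)
)

stats <- aggregate(
  x   = toy$JobSeekers,
  by  = list(District = toy$District),
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)

ncol(stats)          # 2: "District" plus one matrix column "x"
stats$x[, "mean"]    # the means live inside the matrix column

# Flatten the matrix column into ordinary columns x.mean and x.sd
flat <- do.call(data.frame, stats)
names(flat)          # "District" "x.mean" "x.sd"
flat$x.mean          # 2 3 6
```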
You can then run the one-way ANOVA and the Tukey test as follows.
## Run one way ANOVA test
anova_one_way <- aov(JobSeekers~ District, data = df)
summary(anova_one_way)
r$> summary(anova_one_way)
Df Sum Sq Mean Sq F value Pr(>F)
District 4 260.09 65.02 346.1 <2e-16 ***
Residuals 101 18.97 0.19
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Run the Tukey Test to compare the groups
TukeyHSD(anova_one_way)
r$> Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = JobSeekers ~ District, data = df)
$District
diff lwr upr p adj
Jerusalem-Dan -0.10000000 -0.5396190 0.3396190 0.9695592
North-Dan 1.22258065 0.8772809 1.5678804 0.0000000
Sharon-Dan 0.03636364 -0.3356042 0.4083315 0.9987878
South-Dan 4.13809524 3.7619337 4.5142567 0.0000000
North-Jerusalem 1.32258065 0.9132542 1.7319071 0.0000000
Sharon-Jerusalem 0.13636364 -0.2956969 0.5684241 0.9048406
South-Jerusalem 4.23809524 3.8024191 4.6737714 0.0000000
Sharon-North -1.18621701 -1.5218409 -0.8505932 0.0000000
South-North 2.91551459 2.5752488 3.2557803 0.0000000
South-Sharon 4.10173160 3.7344321 4.4690311 0.0000000
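With many groups the Tukey table gets long, so it can help to keep only the pairs whose adjusted p-value is below your threshold. The following is a sketch on simulated data (the data frame `sim`, the seed and the 0.05 cutoff are my assumptions, not part of your data); it relies on `TukeyHSD()` returning a matrix per factor with a column named "p adj".

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Simulated data (hypothetical): South is shifted upward on purpose
sim <- data.frame(
  District   = factor(rep(c("Dan", "North", "South"), each = 20)),
  JobSeekers = c(rnorm(20, mean = 1), rnorm(20, mean = 1), rnorm(20, mean = 5))
)

fit   <- aov(JobSeekers ~ District, data = sim)
tukey <- TukeyHSD(fit)

# tukey$District is a matrix with columns diff, lwr, upr and "p adj"
tk <- tukey$District
signif_pairs <- tk[tk[, "p adj"] < 0.05, , drop = FALSE]
signif_pairs
```

The `drop = FALSE` keeps the result a matrix even if only one pair passes the cutoff.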
Finally, you can use a box plot to show which district has the most job seekers.
## Check the mean differences across the groups
library(ggplot2)
ggplot(df, aes(x = District, y = JobSeekers, fill = District)) +
geom_boxplot() +
geom_jitter(shape = 15,
color = "steelblue",
position = position_jitter(0.21)) +
theme_classic()
Update
Based on your update, you can use the following syntax to abbreviate the x-axis labels and change the legend.
library(stringr)
library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = str_wrap(Occupation, 10))) +
  geom_boxplot() +
  geom_jitter(
    shape = 19,
    color = "black",
    position = position_jitter(0.21)
  ) +
  scale_x_discrete(
    labels = c(
      "Academic degree" = "Academic",
      "Practical engineers and technicians" = "Engineering",
      "Production and construction" = "Production",
      "Sales and costumer service" = "Sales",
      "Unprofessional workers" = "Unprofessional",
      "Undefined" = "Undefined"
    )
  ) +
  labs(fill = "Occupation") +
  theme_classic() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
    legend.key.height = unit(2, "cm")
    # legend.position = "top"
  )
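You should get a graph like this.
You can't run a chi-square test here because JobSeekers is a numeric variable (treated as continuous), not a categorical one. If you want to know whether the North and South districts differ, you can use a Wilcoxon test or a t-test; which one depends on your data. The Wilcoxon test is rank-based and does not require your data to be normally distributed.
Suppose you have counted the number of job seekers for each district and each month: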
df = data.frame(Month=rep(c("2020-01","2020-02","2020-03","2020-04","2020-05","2020-06"),3),
District=rep(c("Dan","North","South"),each=6),JobSeekers=rpois(18,20))
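A t-test looks like the following. If your samples are paired, for example 12 monthly values for the North and the corresponding 12 values for the South, you need a paired test (paired = TRUE) rather than the unpaired test shown here; see this tutorial: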
t.test(JobSeekers ~ District,data=subset(df,District %in% c("North","South")))
Welch Two Sample t-test
data: JobSeekers by District
t = 0.27455, df = 9.9435, p-value = 0.7893
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.560951 4.560951
sample estimates:
mean in group North mean in group South
21.5 21.0
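For the paired case, here is a minimal sketch on simulated data (the data frame `sim`, the seed and the Poisson counts are my assumptions). It puts the two districts side by side by month and passes two vectors to `t.test()`. I use the vector interface because, as far as I know, recent R releases no longer accept `paired` in the formula interface.

```r
set.seed(42)  # arbitrary seed for a reproducible illustration

months <- c("2020-01", "2020-02", "2020-03", "2020-04", "2020-05", "2020-06")

# Simulated monthly counts (hypothetical), one row per district and month
sim <- data.frame(
  Month      = rep(months, times = 2),
  District   = rep(c("North", "South"), each = 6),
  JobSeekers = rpois(12, lambda = 20)
)

# Put the two districts side by side so that each row is one month
north <- sim[sim$District == "North", c("Month", "JobSeekers")]
south <- sim[sim$District == "South", c("Month", "JobSeekers")]
wide  <- merge(north, south, by = "Month", suffixes = c(".north", ".south"))

# Paired test on the month-by-month differences
res <- t.test(wide$JobSeekers.north, wide$JobSeekers.south, paired = TRUE)
res
```

Merging by Month first guarantees that each North value is compared with the South value from the same month, which is what makes the test paired.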
If you are not sure whether your samples are normally distributed, use the Wilcoxon test:
wilcox.test(JobSeekers ~ District,data=subset(df,District %in% c("North","South")))
Wilcoxon rank sum test with continuity correction
data: JobSeekers by District
W = 19.5, p-value = 0.8721
alternative hypothesis: true location shift is not equal to 0
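If you would like a rough check before choosing between the two tests, one possibility is a Shapiro-Wilk test per group. The sketch below is only an illustration on simulated counts (`north`, `south`, the seed and the 5% cutoff are assumptions); with small samples the normality test has little power, so treat it as a hint rather than a rule.

```r
set.seed(7)  # arbitrary seed for a reproducible illustration

# Simulated monthly counts (hypothetical) for two districts
north <- rpois(12, lambda = 20)
south <- rpois(12, lambda = 25)

# Shapiro-Wilk test per group: a small p-value suggests non-normality
p_north <- shapiro.test(north)$p.value
p_south <- shapiro.test(south)$p.value

# Rule of thumb used in this sketch only: fall back to the Wilcoxon test
# if either group looks non-normal at the 5% level
if (min(p_north, p_south) < 0.05) {
  # counts usually contain ties, so use the normal approximation
  res <- wilcox.test(north, south, exact = FALSE)
} else {
  res <- t.test(north, south)
}
res$method
```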