R中的决策树公式
decision tree formula in R
我正在尝试分析马拉松比赛数据。我构建了一个简单的模型并创建了一个决策树:
fit <- rpart(timeCategory ~ country + age.group + participated.times, data=data)
我的目标是创建一个通用公式来预测结果,like in this article (page 4)。
我如何在 R 中做到这一点,使用什么技术?因此,我想要一个带有提供的属性的公式。
数据:我用的一些真实数据可以downloaded here。
读取数据如下:
data = read.table("data/processedData.txt", header=T)
data$timeCategory <- ntile(data$time, 10)
这些是使用时间作为连续值的回归系数,这是示例中提供的预测类型。它们可用于构建您请求的公式类型。
> lmfit <- lm(time ~ country + age.group + particip.time, data=data)
> lmfit
Call:
lm(formula = time ~ country + age.group + particip.time, data = data)
Coefficients:
(Intercept) countryJõgeva countryLääne-Viru countryLäti
9526.702 345.930 122.513 -73.239
countryLeedu countryPärnu countryRapla countrySaaremaa
120.592 -78.086 -208.882 114.292
countryTallinn countryTartu countryViljandi age.groupM20
-37.536 55.771 -70.417 -142.600
age.groupM21 age.groupM35 age.groupM40 age.groupM45
-218.225 -218.067 -20.108 -196.331
age.groupM50 particip.time
88.342 -2.487
如果你想让它们全部排成一行,那么:
> as.matrix(coef(lmfit))
[,1]
(Intercept) 9526.702146
countryJõgeva 345.930334
countryLääne-Viru 122.513294
countryLäti -73.239333
countryLeedu 120.591585
countryPärnu -78.086107
countryRapla -208.882244
countrySaaremaa 114.291592
countryTallinn -37.535659
countryTartu 55.771326
countryViljandi -70.416659
age.groupM20 -142.599598
age.groupM21 -218.224754
age.groupM35 -218.066655
age.groupM40 -20.108242
age.groupM45 -196.331263
age.groupM50 88.341978
particip.time -2.486818
进一步处理文本:
> form <- as.matrix(coef(lmfit))
> rownames(form) <- gsub("try", "try == ", rownames(form) )
> rownames(form) <- gsub("oup", "oup == ", rownames(form) )
> form
[,1]
(Intercept) 9526.702146
country == Jõgeva 345.930334
country == Lääne-Viru 122.513294
country == Läti -73.239333
country == Leedu 120.591585
country == Pärnu -78.086107
country == Rapla -208.882244
country == Saaremaa 114.291592
country == Tallinn -37.535659
country == Tartu 55.771326
country == Viljandi -70.416659
age.group == M20 -142.599598
age.group == M21 -218.224754
age.group == M35 -218.066655
age.group == M40 -20.108242
age.group == M45 -196.331263
age.group == M50 88.341978
particip.time -2.486818
即将完成:
cat(paste( form, paste0("(", rownames(form), ")" ), sep="*", collapse="+\n") )
9526.70214596473*((Intercept))+
345.93033373724*(country == Jõgeva)+
122.51329418344*(country == Lääne-Viru)+
-73.2393326763322*(country == Läti)+
120.591584530399*(country == Leedu)+
-78.0861070429056*(country == Pärnu)+
-208.882244416016*(country == Rapla)+
114.291592299937*(country == Saaremaa)+
-37.5356589458207*(country == Tallinn)+
55.771326363022*(country == Tartu)+
-70.4166587941724*(country == Viljandi)+
-142.599598141679*(age.group == M20)+
-218.224754448193*(age.group == M21)+
-218.066655292225*(age.group == M35)+
-20.1082422022072*(age.group == M40)+
-196.33126335145*(age.group == M45)+
88.3419781798024*(age.group == M50)+
-2.48681789339678*(particip.time)
我正在尝试分析马拉松比赛数据。我构建了一个简单的模型并创建了一个决策树:
fit <- rpart(timeCategory ~ country + age.group + participated.times, data=data)
我的目标是创建一个通用公式来预测结果,like in this article (page 4)。
我如何在 R 中做到这一点,使用什么技术?因此,我想要一个带有提供的属性的公式。
数据:我用的一些真实数据可以downloaded here。 读取数据如下:
data = read.table("data/processedData.txt", header=T)
data$timeCategory <- ntile(data$time, 10)
这些是使用时间作为连续值的回归系数,这是示例中提供的预测类型。它们可用于构建您请求的公式类型。
> lmfit <- lm(time ~ country + age.group + particip.time, data=data)
> lmfit
Call:
lm(formula = time ~ country + age.group + particip.time, data = data)
Coefficients:
(Intercept) countryJõgeva countryLääne-Viru countryLäti
9526.702 345.930 122.513 -73.239
countryLeedu countryPärnu countryRapla countrySaaremaa
120.592 -78.086 -208.882 114.292
countryTallinn countryTartu countryViljandi age.groupM20
-37.536 55.771 -70.417 -142.600
age.groupM21 age.groupM35 age.groupM40 age.groupM45
-218.225 -218.067 -20.108 -196.331
age.groupM50 particip.time
88.342 -2.487
如果你想让它们全部排成一行,那么:
> as.matrix(coef(lmfit))
[,1]
(Intercept) 9526.702146
countryJõgeva 345.930334
countryLääne-Viru 122.513294
countryLäti -73.239333
countryLeedu 120.591585
countryPärnu -78.086107
countryRapla -208.882244
countrySaaremaa 114.291592
countryTallinn -37.535659
countryTartu 55.771326
countryViljandi -70.416659
age.groupM20 -142.599598
age.groupM21 -218.224754
age.groupM35 -218.066655
age.groupM40 -20.108242
age.groupM45 -196.331263
age.groupM50 88.341978
particip.time -2.486818
进一步处理文本:
> form <- as.matrix(coef(lmfit))
> rownames(form) <- gsub("try", "try == ", rownames(form) )
> rownames(form) <- gsub("oup", "oup == ", rownames(form) )
> form
[,1]
(Intercept) 9526.702146
country == Jõgeva 345.930334
country == Lääne-Viru 122.513294
country == Läti -73.239333
country == Leedu 120.591585
country == Pärnu -78.086107
country == Rapla -208.882244
country == Saaremaa 114.291592
country == Tallinn -37.535659
country == Tartu 55.771326
country == Viljandi -70.416659
age.group == M20 -142.599598
age.group == M21 -218.224754
age.group == M35 -218.066655
age.group == M40 -20.108242
age.group == M45 -196.331263
age.group == M50 88.341978
particip.time -2.486818
即将完成:
cat(paste( form, paste0("(", rownames(form), ")" ), sep="*", collapse="+\n") )
9526.70214596473*((Intercept))+
345.93033373724*(country == Jõgeva)+
122.51329418344*(country == Lääne-Viru)+
-73.2393326763322*(country == Läti)+
120.591584530399*(country == Leedu)+
-78.0861070429056*(country == Pärnu)+
-208.882244416016*(country == Rapla)+
114.291592299937*(country == Saaremaa)+
-37.5356589458207*(country == Tallinn)+
55.771326363022*(country == Tartu)+
-70.4166587941724*(country == Viljandi)+
-142.599598141679*(age.group == M20)+
-218.224754448193*(age.group == M21)+
-218.066655292225*(age.group == M35)+
-20.1082422022072*(age.group == M40)+
-196.33126335145*(age.group == M45)+
88.3419781798024*(age.group == M50)+
-2.48681789339678*(particip.time)