How to tidy a data set with columns containing multiple pieces of information (sample data included)?
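Please help me tidy my data. Thank you.
The full data set has 394 observations and 26 columns. It was exported from MS Excel.
A sample of the data is below. In this sample there should really be only three observations/rows.
In the vectors d1..d2..no and Farmer.Name, the values on rows where v1 is NA should be removed from those rows and attached to the preceding row.
d1..d2..no holds three values per record (two dates and one unique ID number), and the same is true of Farmer.Name.
The sample is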
# v1 numbers each record on its first row and is NA on the two continuation rows
v1 <- c(1, NA, NA, 2, NA, NA, 3, NA, NA)
d1..d2..no <- c("27/01/2020", "43832", "KE004421", "43832", "43832",
                "KE003443", "31/12/2019", "43832", "KE0001512")
Farmer.Name <- c("S Jacob Gender:male", "farmer type :marginal", "farmer category :general",
                 "J Isac Gender :Female", "farmer type: large", "farmer category :general",
                 "P Kumar Gender :Male", "farmer type:small", "farmer category :general")
adress <- c("k11", NA, NA, "k12", NA, NA, "k13", NA, NA)
amount <- c(25, NA, NA, 25, NA, NA, 32, NA, NA)
mydata <- data.frame(v1 = v1, d1..d2..no = d1..d2..no, Farmer.Name = Farmer.Name,
                     adress = adress, amount = amount)
In other words, once the rows where v1 is NA have been folded into the row above them, I expect a result like the one this code builds:
v1<-c(1,2,3)
d1<-c("27/01/2020","43832","31/12/2019")
d2<-c("43832","43832","43832")
no<-c("KE004421","KE003443","KE0001512")
Farmer.Name1<-c("S Jacob","J Isac","P Kumar")
Gender<-c("male","female","male")
farmer_type <-c("marginal","large","small")
farmer_category <-c("general", "general", "general")
adress<-c("k11","k12","k13")
amount<-c(25,25,32)
myfinaldata<-data.frame(v1=v1,d1=d1,d2=d2,no=no,
Farmer.Name1=Farmer.Name1,
farmer_type=farmer_type,
farmer_category=farmer_category,
adress=adress,amount=amount)
The result should be
  v1         d1    d2        no Farmer.Name1 farmer_type farmer_category adress amount
1  1 27/01/2020 43832  KE004421      S Jacob    marginal         general    k11     25
2  2      43832 43832  KE003443       J Isac       large         general    k12     25
3  3 31/12/2019 43832 KE0001512      P Kumar       small         general    k13     32
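I am new to programming and to R, and I am learning from online resources. This is also my first post on this platform, so please forgive any mistakes.
I have tried many messy combinations of the tidyverse functions spread, separate and so on, but I don't know how to proceed.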
The dates in your data set are not stored as dates. Consider converting them once the reshaping is done.
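# Each record spans three rows: keep rows 1, 4, 7, ... in full and bind on
# columns 2:3 (d1..d2..no, Farmer.Name) from the second and third row of each record
df.new <- cbind(mydata[seq(1, nrow(mydata), 3), ], mydata[seq(2, nrow(mydata), 3), ][2:3], mydata[seq(3, nrow(mydata), 3), ][2:3])
colnames(df.new) <- c("v1", "d1", "Farmer.Name1", "adress", "amount", "d2", "farmer_type", "no", "farmer_category")
# Reorder the columns to match the requested layout
df.new <- df.new[c(1, 2, 6, 8, 3, 7, 9, 4, 5)]
library(stringr)
# Keep the text before " Gender", and the text after ":" with stray spaces removed
df.new$Farmer.Name1 <- word(df.new$Farmer.Name1, 1, sep = " Gender")
df.new$farmer_type <- trimws(word(df.new$farmer_type, 2, sep = ":"))
df.new$farmer_category <- trimws(word(df.new$farmer_category, 2, sep = ":"))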
Final table:
> df.new
  v1         d1    d2        no Farmer.Name1 farmer_type farmer_category adress amount
1  1 27/01/2020 43832  KE004421      S Jacob    marginal         general    k11     25
4  2      43832 43832  KE003443       J Isac       large         general    k12     25
7  3 31/12/2019 43832 KE0001512      P Kumar       small         general    k13     32
P.S.: I did not renumber the rows.
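The date conversion mentioned at the start of this answer could look roughly like the sketch below. It assumes that the bare numbers such as 43832 are Excel serial dates from the 1900 date system (hence the origin 1899-12-30) and that the other values are dd/mm/yyyy text. The helper name parse_mixed_date is made up for illustration and is not part of the original post.

```r
# Values as they appear in the sample: some are dd/mm/yyyy text, others look
# like Excel serial day counts that were exported as text (an assumption).
d1 <- c("27/01/2020", "43832", "31/12/2019")

# Hypothetical helper: parse both representations into one Date vector.
parse_mixed_date <- function(x) {
  out <- rep(as.Date(NA), length(x))
  is_serial <- grepl("^[0-9]+$", x)
  # Excel's 1900 date system counts days from 1899-12-30 (assumed here).
  out[is_serial] <- as.Date(as.numeric(x[is_serial]), origin = "1899-12-30")
  out[!is_serial] <- as.Date(x[!is_serial], format = "%d/%m/%Y")
  out
}

d1_date <- parse_mixed_date(d1)
print(d1_date)
```

The same helper could then be applied to df.new$d1 and df.new$d2. If the workbook used the 1904 date system instead, the origin would need to change.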
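Untidy data can be a challenge. Here is a tidyverse approach.
First, a column holding the suggested names d1, d2 and no is added. This assumes the rows of each record come in that order.
The column Farmer.Name is separated into two columns at the : character. The Name itself is then split off just before the word Gender.
fill carries the values shared by one person (for example v1, adress, amount and Name) down onto that person's remaining rows.
pivot_wider spreads the data into wide form, first for d1, d2 and no, and then for the other columns, including Gender, farmer_type and farmer_category.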
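Since the approach relies on every record occupying exactly three rows in a fixed order, it may be worth checking that before reshaping all 394 rows. The following is a small base R sketch of such a check. The v1 vector is an assumed stand-in for the question's column, and the variable names are illustrative only.

```r
# Assumed layout, matching the sample: a record starts where v1 is not NA.
v1 <- c(1, NA, NA, 2, NA, NA, 3, NA, NA)

starts <- which(!is.na(v1))
# Length of each record = gap between consecutive starts (the last one runs to the end).
record_len <- diff(c(starts, length(v1) + 1))

# TRUE only if the data begins with a record and every record has three rows.
all_three <- length(starts) > 0 && starts[1] == 1 && all(record_len == 3)
print(all_three)
```

If the check fails on the real data, record_len shows which records are shorter or longer than expected, and the fixed d1/d2/no labelling would need adjusting for them.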
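library(tidyverse)
df1 <- mydata %>%
  # Label the rows of each three-row record as d1, d2, no (assumes that order)
  mutate(d_var = rep(c("d1", "d2", "no"), length.out = n())) %>%
  # Split "key : value" into Var and Val
  separate(Farmer.Name, into = c("Var", "Val"), sep = ":") %>%
  # Split the name off just before "Gender"; rows without a name get NA in Name
  separate(Var, into = c("Name", "Var"), sep = "(?=Gender)", fill = "left") %>%
  mutate(across(c(Name, Var), trimws)) %>%
  # Carry the per-person values down onto the continuation rows
  fill(v1, adress, amount, Name, .direction = "down") %>%
  mutate(Var = gsub(" ", "_", Var))
# Widen the d1/d2/no values, then join the widened Gender/farmer_* values
df1 %>%
  pivot_wider(id_cols = c(v1, Name, adress, amount), names_from = d_var, values_from = d1..d2..no) %>%
  left_join(pivot_wider(df1, id_cols = c(v1, Name, adress, amount), names_from = Var, values_from = Val),
            by = c("v1", "Name", "adress", "amount"))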
Output
# A tibble: 3 x 10
     v1 Name    adress amount d1         d2    no        Gender farmer_type farmer_category
  <dbl> <chr>   <chr>   <dbl> <chr>      <chr> <chr>     <chr>  <chr>       <chr>
1     1 S Jacob k11        25 27/01/2020 43832 KE004421  male   "marginal"  general
2     2 J Isac  k12        25 43832      43832 KE003443  Female " large"    general
3     3 P Kumar k13        32 31/12/2019 43832 KE0001512 Male   "small"     general
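The quoted values in farmer_type show that " large" still carries a leading space, and Gender mixes upper and lower case, whereas the question's expected result uses lower case. This is a minimal base R sketch of one way to normalise such values. The vectors are copied from the output above, and the _clean names are illustrative.

```r
# Values as they come out of the split on ":" (copied from the output above).
farmer_type <- c("marginal", " large", "small")
Gender <- c("male", "Female", "Male")

# Strip surrounding whitespace; lower-case Gender to match the expected result.
farmer_type_clean <- trimws(farmer_type)
Gender_clean <- tolower(trimws(Gender))

print(farmer_type_clean)
print(Gender_clean)
```

Within the pipeline, adding Val to the columns passed to trimws should have the same effect on the whitespace.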