Reshape 3 Column Data with ID
I am trying to create a directed network graph in R. To do that, I need to build a node connection matrix.
SOURCE_SUBREDDIT TARGET_SUBREDDIT LINK_SENTIMENT
rddtgaming rddtrust 1
xboxone battlefield_4 1
ps4 battlefield_4 1
fitnesscirclejerk leangains 1
fitnesscirclejerk lifeprotips 1
cancer fuckcancer 1
jleague soccer 1
bestoftldr tifu 1
quityourbullshit pics 1
bestof confession 1
anarchychess funny 1
internet_box ama 1
fitnesscirclejerk nofap 1
ffxiv ffxivapp 1
switcharoo funny 1
bitcoinmining bitcoin 1
subredditdrama nfl -1
rddtgaming rddtrust -1
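As you can see above, the first and last pairs contain the same subreddits. The data shows directional relationships between subreddits, which is why some pairs appear more than once.
Please see the picture of the output I am hoping for:
My code so far: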
#reading in csv file
mydata <- read.csv(file="C:/Users/bmpmap/Documents/School/Netowrk Analysis/Connections List.csv", header=TRUE, sep=",")
colnames(mydata)
#SOURCE_SUBREDDIT TARGET_SUBREDDIT LINK_SENTIMENT
#install.packages("splitstackshape")
library(splitstackshape)
mydata_id = getanID(mydata , c("SOURCE_SUBREDDIT", "TARGET_SUBREDDIT", "LINK_SENTIMENT"))
colnames(mydata_id)
#reshaping data
I created an ID variable in the code above. I think I should use it to uniquely identify the pairs.
You can do it like this:
> table(dt$SOURCE_SUBREDDIT,dt$TARGET_SUBREDDIT)
Output:
ama battlefield_4 bitcoin confession ffxivapp fuckcancer funny leangains lifeprotips nfl nofap pics rddtrust soccer tifu
anarchychess 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
bestof 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
bestoftldr 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
bitcoinmining 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
cancer 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
ffxiv 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
fitnesscirclejerk 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0
internet_box 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
jleague 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
ps4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
quityourbullshit 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
rddtgaming 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0
subredditdrama 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
switcharoo 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
xboxone 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Note: your expected output does not show an id column.
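One caveat worth illustrating: table() counts how often each pair occurs and ignores LINK_SENTIMENT, which is why rddtgaming/rddtrust shows 2 even though its two links carry sentiments 1 and -1. A minimal sketch, assuming a small hand-made data frame named edges (a hypothetical name, not from the question), compares that count with a sentiment sum from xtabs():

```r
# Hypothetical three-row edge list taken from the sample data in the question
edges <- data.frame(
  SOURCE_SUBREDDIT = c("rddtgaming", "subredditdrama", "rddtgaming"),
  TARGET_SUBREDDIT = c("rddtrust", "nfl", "rddtrust"),
  LINK_SENTIMENT   = c(1L, -1L, -1L),
  stringsAsFactors = FALSE
)

# Number of links per (source, target) pair; sentiment is not used
counts <- table(edges$SOURCE_SUBREDDIT, edges$TARGET_SUBREDDIT)

# Sum of LINK_SENTIMENT per (source, target) pair
sentiment_sum <- xtabs(LINK_SENTIMENT ~ SOURCE_SUBREDDIT + TARGET_SUBREDDIT,
                       data = edges)

counts["rddtgaming", "rddtrust"]         # 2 links
sentiment_sum["rddtgaming", "rddtrust"]  # 1 + (-1) = 0
```

Which of the two is appropriate depends on whether the graph's edge weights should represent link counts or net sentiment; the question does not say, so treat this as one possible reading.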
I'll give it a try. First, we'll create a reproducible dataset from the sample data you posted:
df <- structure(list(SOURCE_SUBREDDIT = c("rddtgaming", "xboxone",
"ps4", "fitnesscirclejerk", "fitnesscirclejerk", "fitnesscirclejerk",
"cancer", "jleague", "bestoftldr", "quityourbullshit"), TARGET_SUBREDDIT = c("rddtrust",
"battlefield_4", "battlefield_4", "leangains", "lifeprotips",
"leangains", "fuckcancer", "soccer", "tifu", "pics"), LINK_SENTIMENT = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 10L), class = "data.frame")
Note that fitnesscirclejerk is linked to leangains twice, which is the feature you mentioned is present in your data:
df
SOURCE_SUBREDDIT TARGET_SUBREDDIT LINK_SENTIMENT
1 rddtgaming rddtrust 1
2 xboxone battlefield_4 1
3 ps4 battlefield_4 1
4 fitnesscirclejerk leangains 1
5 fitnesscirclejerk lifeprotips 1
6 fitnesscirclejerk leangains 1
7 cancer fuckcancer 1
8 jleague soccer 1
9 bestoftldr tifu 1
10 quityourbullshit pics 1
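Now, the goal is to spread this from long format to wide format, like the sample image you posted. As you have already identified, the identical rows (rows 4 and 6) cause a problem when we try to spread: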
tidyr::spread(df, key = TARGET_SUBREDDIT, value = LINK_SENTIMENT, fill = 0)
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2 rows:
* 4, 6
Do you need to create unique ID with tibble::rowid_to_column()?
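The error comes from rows that share the same SOURCE_SUBREDDIT/TARGET_SUBREDDIT pair. As a quick check before reshaping, base R's duplicated() can list them. The sketch below rebuilds only a few of the sample rows in a hypothetical data frame named pairs:

```r
# Hypothetical subset of the sample rows, including the repeated pair
pairs <- data.frame(
  SOURCE_SUBREDDIT = c("ps4", "fitnesscirclejerk",
                       "fitnesscirclejerk", "fitnesscirclejerk"),
  TARGET_SUBREDDIT = c("battlefield_4", "leangains",
                       "lifeprotips", "leangains"),
  LINK_SENTIMENT   = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

key_cols <- c("SOURCE_SUBREDDIT", "TARGET_SUBREDDIT")

# TRUE for every row whose key pair occurs more than once
# (both the first occurrence and the later copies)
is_repeated <- duplicated(pairs[key_cols]) |
  duplicated(pairs[key_cols], fromLast = TRUE)

which(is_repeated)  # rows 2 and 4 in this subset
```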
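Since you want to keep the same number of rows when spreading, we can solve this by adding a unique ID to each row, so that every row is distinct. You can use splitstackshape::getanID to do this, but we can also do it with tidyverse packages: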
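# Option 1: dplyr (appends rowid as the last column)
df2 <- dplyr::mutate(df, rowid = dplyr::row_number())
# Option 2: tibble (puts rowid first, as printed below)
df2 <- tibble::rowid_to_column(df)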
Both give us this data.frame (up to column order), which I assume is similar to your mydata_id:
df2
rowid SOURCE_SUBREDDIT TARGET_SUBREDDIT LINK_SENTIMENT
1 1 rddtgaming rddtrust 1
2 2 xboxone battlefield_4 1
3 3 ps4 battlefield_4 1
4 4 fitnesscirclejerk leangains 1
5 5 fitnesscirclejerk lifeprotips 1
6 6 fitnesscirclejerk leangains 1
7 7 cancer fuckcancer 1
8 8 jleague soccer 1
9 9 bestoftldr tifu 1
10 10 quityourbullshit pics 1
Now, when we spread, the unique ID column stops R from merging (or trying to merge) rows that have the same subreddit pair:
df3 <- tidyr::spread(df2, key = TARGET_SUBREDDIT, value = LINK_SENTIMENT, fill = 0)
df3
rowid SOURCE_SUBREDDIT battlefield_4 fuckcancer leangains lifeprotips pics rddtrust soccer tifu
1 1 rddtgaming 0 0 0 0 0 1 0 0
2 2 xboxone 1 0 0 0 0 0 0 0
3 3 ps4 1 0 0 0 0 0 0 0
4 4 fitnesscirclejerk 0 0 1 0 0 0 0 0
5 5 fitnesscirclejerk 0 0 0 1 0 0 0 0
6 6 fitnesscirclejerk 0 0 1 0 0 0 0 0
7 7 cancer 0 1 0 0 0 0 0 0
8 8 jleague 0 0 0 0 0 0 1 0
9 9 bestoftldr 0 0 0 0 0 0 0 1
10 10 quityourbullshit 0 0 0 0 1 0 0 0
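As you can see, this output mirrors the format of your desired output image and preserves both the order of the relationships and the duplicate rows.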
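For comparison, the same one-row-per-link layout can be sketched in base R without tidyr by filling a zero matrix through index pairs. This is an alternative under stated assumptions, not the method used above; the names links, targets, wide and result are hypothetical:

```r
# Hypothetical subset of the sample data, including the repeated pair
links <- data.frame(
  SOURCE_SUBREDDIT = c("rddtgaming", "fitnesscirclejerk",
                       "fitnesscirclejerk", "fitnesscirclejerk"),
  TARGET_SUBREDDIT = c("rddtrust", "leangains", "lifeprotips", "leangains"),
  LINK_SENTIMENT   = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# One column per distinct target, sorted alphabetically
targets <- sort(unique(links$TARGET_SUBREDDIT))

# One row per link, all zeros to start
wide <- matrix(0L, nrow = nrow(links), ncol = length(targets),
               dimnames = list(NULL, targets))

# Put each link's sentiment in (its own row, its target's column)
wide[cbind(seq_len(nrow(links)),
           match(links$TARGET_SUBREDDIT, targets))] <- links$LINK_SENTIMENT

# Prepend the row ID and the source column
result <- data.frame(rowid = seq_len(nrow(links)),
                     SOURCE_SUBREDDIT = links$SOURCE_SUBREDDIT,
                     wide,
                     check.names = FALSE)
result
```

Because every link gets its own row, repeated pairs never collide, which is the same reason the rowid column makes the spread() call work.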