Selectively reading a CSV file in R

Question

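I have a large data file in CSV format and I only need to import some of its rows. Let's call this big file A.csv.

I have another CSV file, B.csv, which has two columns and a number of rows.

I only need to import those rows of A.csv whose first two column values match the column values of some row of B.csv. So, after importing both files, I tried the loop below.

But it seems to take forever: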

while(count<4632)
{
    count=count+1
    count2=0
    while(count2<17415)
    {
        count2=count2+1
        if(B[count,1]==A[count2,1])
            dbase[count,]=A[count2,]
    }
}

Please help!

Answer 1

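You have two large nested loops, and you are growing the result dynamically inside them. Both are bad for performance. Try to vectorize the two operations.

For example: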

set.seed(123)
dfA <- data.frame(
    a = sample(LETTERS, 10000, TRUE),
    b = sample(LETTERS[1:3], 10000, TRUE),
    c = rnorm( 10000 ),
    stringsAsFactors = FALSE
)
dfB <- data.frame(
    a = sample(LETTERS, 1000, TRUE),
    b = sample(LETTERS[1:3], 1000, TRUE),
    stringsAsFactors = FALSE
)

# keep the rows of dfA whose (a, b) pair also occurs in dfB
dfC <- dfA[ which( paste(dfA$a, dfA$b) %in% paste(dfB$a, dfB$b)), ]
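
As a sketch of an alternative, the same row selection can also be written as a join with base R's `merge()`, which matches on both key columns directly instead of pasting them into one string. The data frames `big` and `keys` and their contents are made up for illustration; they only stand in for A.csv and B.csv.

# Hypothetical stand-ins for A.csv and B.csv (names and values are illustrative)
big <- data.frame(
    a = c("A", "A", "B", "C", "C"),
    b = c("x", "y", "x", "y", "z"),
    value = 1:5,
    stringsAsFactors = FALSE
)
keys <- data.frame(
    a = c("A", "C", "C"),
    b = c("y", "z", "z"),   # duplicated key pair on purpose
    stringsAsFactors = FALSE
)

# Drop duplicate key pairs first so the join cannot multiply rows of `big`,
# then keep only the rows of `big` whose (a, b) pair occurs in `keys`
matched <- merge(big, unique(keys), by = c("a", "b"))

Note that `merge()` sorts the result by the key columns by default, so the row order can differ from the original data frame, whereas the `%in%` filter keeps the original order.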

Answer 2

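I may not have enough information, but I'll try to answer...

I think you can simply do a join between the two files and load only the smaller one into memory. I would do something like this with the help of the sqldf package: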

library(sqldf)

tmp_csv <- "path/of/your/big/file.csv"

# load your small file and make sure the two columns 
# have the same name of the columns of the big file
tmp_df <- read.csv("path/of/your/small/file.csv")

# join the two dataset with a single sql query
out_data <- read.csv2.sql(tmp_csv, sql = "select * from file join tmp_df using (Column1, Column2)", header = TRUE)

You can use read.csv2.sql or read.csv.sql, depending on your separator. Double-check the column names, because the join is built on them.
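
If sqldf is not an option, a rough base R sketch of the same idea is to read the big file in chunks and keep only the matching rows of each chunk, so the whole file is never held in memory at once. The demo files, the column names `Column1`/`Column2`/`value`, and the chunk size are assumptions for illustration; the sketch also assumes no field contains an embedded line break.

# Hypothetical demo files standing in for the big and the small CSV
big_path <- tempfile(fileext = ".csv")
key_path <- tempfile(fileext = ".csv")
write.csv(data.frame(Column1 = rep(c("A", "B", "C"), 4),
                     Column2 = rep(c("x", "y"), 6),
                     value = 1:12),
          big_path, row.names = FALSE)
write.csv(data.frame(Column1 = "A", Column2 = "x"), key_path, row.names = FALSE)

# The small file is loaded completely; duplicates are dropped before joining
keys <- unique(read.csv(key_path, stringsAsFactors = FALSE))

# Take the column names from the first data row, then stream the rest
header <- names(read.csv(big_path, nrows = 1))
chunk_size <- 5          # illustrative; use a much larger value for real data
pieces <- list()

con <- file(big_path, open = "r")
invisible(readLines(con, n = 1))            # skip the header line
repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break           # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
    # keep only the rows of this chunk whose key pair occurs in `keys`
    hit <- merge(chunk, keys, by = c("Column1", "Column2"))
    if (nrow(hit) > 0) pieces[[length(pieces) + 1]] <- hit
}
close(con)

result <- do.call(rbind, pieces)            # NULL if nothing matched

Collecting the matches in a list and binding them once at the end avoids growing a data frame row by row inside the loop.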
