Why is my Python script so much slower than its R equivalent?
Note: this question covers why the script is so slow. However, if you are the kind of person who would rather improve it, have a look at my post on CodeReview which aims to improve the performance.
I am working on a project in which plain text files (.lst) are processed.
The file names (fileName) matter, because I extract a node (e.g. abessijn) and a component (e.g. WR-P-E-A) from them into a data frame. Examples:
abessijn.WR-P-E-A.lst
A-bom.WR-P-E-A.lst
acroniem.WR-P-E-C.lst
acroniem.WR-P-E-G.lst
adapter.WR-P-E-A.lst
adapter.WR-P-E-C.lst
adapter.WR-P-E-G.lst
Each file consists of one or more lines. Each line consists of a sentence (inside a <sentence> tag). Example (abessijn.WR-P-E-A.lst):
/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: <sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. : ) )</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: <sentence>Mijn abessijn denkt daar heel anders over .. : ) ) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .</sentence>
From each line I extract the sentence, apply some small modifications to it, and call it sentence. Next up is an element called leftContext, which takes the first part of the split between the node (e.g. abessijn) and the sentence it came from. Finally, from leftContext I get precedingWord, which is the word preceding the node in sentence, or the rightmost word in leftContext (with some limitations, such as the option of a compound written with hyphens). Example:
ID | filename | node | component | precedingWord | leftContext | sentence
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 adapter.WR-P-P-F.lst adapter WR-P-P-F aanpassingseenheid Een aanpassingseenheid ( Een aanpassingseenheid ( adapter ) ,
2 adapter.WR-P-P-F.lst adapter WR-P-P-F toestel Het toestel ( Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens
3 adapter.WR-P-P-F.lst adapter WR-P-P-F de de aansluiting tussen de sensor en de de aansluiting tussen de sensor en de adapter ,
4 airbag.WS-U-E-A.lst airbag WS-U-E-A den ja voor den ja voor den airbag op te pompen eh :p
5 airbag.WS-U-E-A.lst airbag WS-U-E-A ne Dobby , als ze valt heeft ze dan wel al ne Dobby , als ze valt heeft ze dan wel al ne airbag hee
That data frame is exported as dataset.csv.
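To make the extraction concrete, here is a small, simplified Python sketch of what happens to a single line; the helper name parse_line is only for illustration and is not part of the actual scripts further down:

import re
from html import unescape

def parse_line(file_name, line):
    # Illustrative only: derive the fields for one raw line of a .lst file.
    node, component = file_name.rsplit(".", 2)[:2]
    node = node.lower()
    sentence = re.search(r"<sentence>(.*?)</sentence>", unescape(line)).group(1).lower()
    # leftContext: everything before the first stand-alone occurrence of node
    left_context = re.split(r"(?:^| )" + re.escape(node) + r"(?: |[!\",.:;?)\]])",
                            sentence)[0]
    # precedingWord: rightmost word of leftContext (simplified; the real scripts
    # also handle hyphenated compounds and other edge cases)
    match = re.search(r"(\w+(?:-\w+)*)\W*$", left_context)
    preceding_word = match.group(1) if match else ""
    return dict(fileName=file_name, node=node, component=component,
                precedingWord=preceding_word, leftContext=left_context,
                sentence=sentence)

example = ("/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: "
           "<sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. : ) )</sentence>")
print(parse_line("abessijn.WR-P-E-A.lst", example))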
After that comes the purpose of my project: I create a frequency table that takes node and precedingWord into account. From those I defined the variables neuter and non_neuter, e.g. (in Python)
neuter = ["het", "Het"]
non_neuter = ["de","De"]
and a rest category unspecified. When precedingWord is an item in one of the lists, it is assigned to that variable. Example of the frequency table output:
node | neuter | nonNeuter | unspecified
-------------------------------------------------
A-bom 0 4 2
acroniem 3 0 2
act 3 2 1
The frequency list is exported as frequencies.csv.
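Conceptually, this frequency step just maps each precedingWord to a category and counts per node; a simplified plain-Python sketch (the real scripts below do this with data frames):

from collections import Counter, defaultdict

neuter = {"het", "Het"}
non_neuter = {"de", "De"}

def categorise(preceding_word):
    if preceding_word in neuter:
        return "neuter"
    if preceding_word in non_neuter:
        return "nonNeuter"
    return "unspecified"

# rows would be the (node, precedingWord) pairs from the data set above
rows = [("a-bom", "de"), ("a-bom", "een"), ("acroniem", "het")]
frequencies = defaultdict(Counter)
for node, preceding_word in rows:
    frequencies[node][categorise(preceding_word)] += 1
print(dict(frequencies))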
I started out with R, bearing in mind that I would do some statistical analyses on the frequencies later on. My current R script (also available as a paste):
# ---
# STEP 0: Preparations
start_time <- Sys.time()
## 1. Set working directory in R
setwd("")
## 2. Load required library/libraries
library(dplyr)
library(mclm)
library(stringi)
## 3. Create directory where we'll save our dataset(s)
dir.create("../R/dataset", showWarnings = FALSE)
# ---
# STEP 1: Loop through files, get data from the filename
## 1. Create first dataframe, based on filename of all files
files <- list.files(pattern="*.lst", full.names=T, recursive=FALSE)
d <- data.frame(fileName = unname(sapply(files, basename)), stringsAsFactors = FALSE)
## 2. Create additional columns (word & component) based on filename
d$node <- sub("\\..+", "", d$fileName, perl=TRUE)
d$node <- tolower(d$node)
d$component <- gsub("^[^\\.]+\\.|\\.lst$", "", d$fileName, perl=TRUE)
# ---
# STEP 2: Loop through files again, but now also through its contents
# In other words: get the sentences
## 1. Create second set which is an rbind of multiple frames
## One two-column data.frame per file
## First column is fileName, second column is data from each file
e <- do.call(rbind, lapply(files, function(x) {
data.frame(fileName = x, sentence = readLines(x, encoding="UTF-8"), stringsAsFactors = FALSE)
}))
## 2. Clean fileName
e$fileName <- sub("^\\.\\/", "", e$fileName, perl=TRUE)
## 3. Get the sentence and clean
e$sentence <- gsub(".*?<sentence>(.*?)</sentence>", "\\1", e$sentence, perl=TRUE)
e$sentence <- tolower(e$sentence)
# Remove floating space before/after punctuation
e$sentence <- gsub("\\s(?:(?=[.,:;?!) ])|(?<=\\( ))", "", e$sentence, perl=TRUE)
# Add space after triple dots ...
e$sentence <- gsub("\\.{3}(?=[^\\s])", "... ", e$sentence, perl=TRUE)
# Transform HTML entities into characters
# It is unfortunate that there's no easier way to do this
# E.g. Python provides the HTML package which can unescape (decode) HTML
# characters
e$sentence <- gsub("&apos;", "'", e$sentence, perl=TRUE)
e$sentence <- gsub("&amp;", "&", e$sentence, perl=TRUE)
# Prevent R from wrongly interpreting ", so replace it by single quotes
e$sentence <- gsub("&quot;|\"", "'", e$sentence, perl=TRUE)
# Get rid of some characters we can't use such as ³ and ¾
e$sentence <- gsub("[^[:graph:]\\s]", "", e$sentence, perl=TRUE)
# ---
# STEP 3:
# Create final dataframe
## 1. Merge d and e by common column name fileName
df <- merge(d, e, by="fileName", all=TRUE)
## 2. Make sure that only those sentences in which df$node is present in df$sentence are taken into account
matchFunction <- function(x, y) any(x == y)
matchedFrame <- with(df, mapply(matchFunction, node, stri_split_regex(sentence, "[ :?.,]")))
df <- df[matchedFrame, ]
## 3. Create leftContext based on the split of the word and the sentence
# Use paste0 to make sure we are looking for the node, not a compound
# node can only be preceded by a space, but can be followed by punctuation as well
contexts <- strsplit(df$sentence, paste0("(^| )", df$node, "( |[!\",.:;?})\\]])"), perl=TRUE)
df$leftContext <- sapply(contexts, `[`, 1)
## 4. Get the word preceding the node
df$precedingWord <- gsub("^.*\\b(?<!-)(\\w+(?:-\\w+)*)[^\\w]*$", "\\1", df$leftContext, perl=TRUE)
## 5. Improve readability by sorting columns
df <- df[c("fileName", "component", "precedingWord", "node", "leftContext", "sentence")]
## 6. Write dataset to dataset dir
write.dataset(df,"../R/dataset/r-dataset.csv")
# ---
# STEP 4:
# Create dataset with frequencies
## 1. Define neuter and nonNeuter classes
neuter <- c("het")
non.neuter<- c("de")
## 2. Mutate df to fit into usable frame
freq <- mutate(df, gender = ifelse(!df$precedingWord %in% c(neuter, non.neuter), "unspecified",
ifelse(df$precedingWord %in% neuter, "neuter", "non_neuter")))
## 3. Transform into table, but still usable as data frame (i.e. matrix)
## Also add column name "node"
freqTable <- table(freq$node, freq$gender) %>%
as.data.frame.matrix %>%
mutate(node = row.names(.))
## 4. Small adjustements
freqTable <- freqTable[,c(4,1:3)]
## 5. Write dataset to dataset dir
write.dataset(freqTable,"../R/dataset/r-frequencies.csv")
diff <- Sys.time() - start_time # calculate difference
print(diff) # print in nice format
However, since I am using a large data set (16,500 files, all with multiple lines), it seems to take ages. On my system the whole process takes about an hour and a quarter. I thought to myself that there ought to be a language that can do this faster, so I went off, taught myself some Python and asked plenty of questions here on SO.
Eventually I came up with the following script (paste).
import os, pandas as pd, numpy as np, regex as re
from glob import glob
from datetime import datetime
from html import unescape
start_time = datetime.now()
# Create empty dataframe with correct column names
columnNames = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0,len(columnNames))), columns=columnNames)
# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir))
# "Cache" regex
# See
p_filename = re.compile(r"[./\\]")
p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
# Loop files in folder
for file in glob(path + "\\*.lst"):
    with open(file, encoding="utf-8") as f:
        [n, c] = p_filename.split(file.lower())[-3:-1]
        fn = ".".join([n, c])
        for line in f:
            s = p_sentence.search(unescape(line)).group(1)
            s = s.lower()
            s = p_typography.sub("", s)
            s = p_non_graph.sub("", s)
            s = p_quote.sub("'", s)
            s = p_ellipsis.sub("... ", s)
            if n in re.split(r"[ :?.,]", s):
                lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
                pw = p_last_word.sub(r"\1", lc)
                df = df.append([dict(fileName=fn, component=c,
                                     precedingWord=pw, node=n,
                                     leftContext=lc, sentence=s)])
            continue
# Reset indices
df.reset_index(drop=True, inplace=True)
# Export dataset
df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
# Let's make a frequency list
# Create new dataframe
# Define neuter and non_neuter
neuter = ["het"]
non_neuter = ["de"]
# Create crosstab
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[df.precedingWord.isin(neuter + non_neuter)==0, "gender"] = "rest"
freqDf = pd.crosstab(df.node, df.gender)
freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
# How long has the script been running?
time_difference = datetime.now() - start_time
print("Time difference of", time_difference)
After making sure that the output of both scripts is identical, I thought I would put them to the test.
I am running Windows 10 64-bit with a quad-core processor and 8 GB of RAM. For R I use RGui 64-bit 3.2.2; Python runs as version 3.4.3 (Anaconda) and is executed in Spyder. Note that I am running Python in 32-bit, because I would like to use the nltk module in the future and they discourage users from using 64-bit.
What I found was that R finished in about 55 minutes. But Python had already been running for two hours straight, and I could see in the variable explorer that it was only at business.wr-p-p-g.lst (the files are sorted alphabetically). It is waaaaayyyy slower!
So what I did was make a test case to see how both scripts perform on a much smaller data set. I took about 100 files (instead of 16,500) and ran both scripts. Again R was much faster: R finished in about 2 seconds, Python in 17!
Seeing that the goal of moving to Python was to make everything go more smoothly, I was confused. I had read that Python is fast (and R rather slow), so where did I go wrong? What is the problem? Is Python slower at reading files and lines, or at doing regexes? Or is R simply better equipped for dealing with data frames and can it not be beaten by pandas? Or is my code simply badly optimised and should Python indeed be the victor?
My question thus is: why is Python slower than R in this case, and, if possible, how can we improve Python so that it shines?
Anyone who is willing to give either script a try can download the test data I used here. Please give me a heads-up when you have downloaded the files.
The most inefficient thing you do is calling the DataFrame.append method in a loop, i.e.
df = pandas.DataFrame(...)
for file in files:
    ...
    for line in file:
        ...
        df = df.append(...)
NumPy data structures are designed with functional programming in mind, so this operation is not meant to be used in an iterative, imperative way: the call does not change your data frame in place, but creates a new one each time, which leads to an enormous increase in time and memory complexity. If you really want to use data frames, accumulate your rows in a list and pass it to the DataFrame constructor, e.g.
pre_df = []
for file in files:
    ...
    for line in file:
        ...
        pre_df.append(processed_line)

df = pandas.DataFrame(pre_df, ...)
This is the easiest way, since it requires minimal changes to your code. But a better (and computationally more elegant) way is to work out how to generate your data set lazily. This can easily be achieved by splitting your workflow into discrete functions (in the sense of functional-programming style) and composing them with lazy generator expressions and/or the imap and ifilter higher-order functions (note that in Python 3 the built-in map and filter are already lazy). Then you can use the resulting generator to build your data frame, e.g.
df = pandas.DataFrame.from_records(processed_lines_generator, columns=column_names, ...)
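A rough sketch of what that could look like for your case; the name parsed_lines is made up for illustration and only carries the file name and sentence, not your full set of columns:

import re
from glob import glob
from html import unescape

import pandas as pd

def parsed_lines(paths):
    """Lazily yield one record per matching line, file by file."""
    sentence_re = re.compile(r"<sentence>(.*?)</sentence>")
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                match = sentence_re.search(unescape(line))
                if match:
                    yield {"fileName": path, "sentence": match.group(1).lower()}

column_names = ["fileName", "sentence"]
df = pd.DataFrame.from_records(parsed_lines(glob("*.lst")), columns=column_names)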
As for reading multiple files in one run, you might want to read this.
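For instance, the standard library's fileinput module can present many files as one stream of lines; a small sketch, not taken from your script:

import fileinput
from glob import glob

# Treat all .lst files as a single stream of lines;
# fileinput.filename() reports which file the current line belongs to.
for line in fileinput.input(files=glob("*.lst"),
                            openhook=fileinput.hook_encoded("utf-8")):
    print(fileinput.filename(), line.strip())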
P.S.
If you have performance issues, you should profile your code before trying to optimise it.
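For example, with the standard cProfile and pstats modules; main() and the file names here are placeholders for whatever you want to measure:

# Either run the whole script under the profiler:
#   python -m cProfile -s cumulative myscript.py
# or profile a single entry point from within the code:
import cProfile
import pstats

def main():
    ...  # the processing code you want to measure

cProfile.run("main()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)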