Julia: Does passing a DataFrame to a function create a pointer to the DataFrame?

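I have a function that normalizes the first N columns of a DataFrame. I want to return the normalized DataFrame and leave the original alone. However, the function seems to mutate the DataFrame that was passed in as well!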

using DataFrames

function normalize(input_df::DataFrame, cols::Array{Int})
    norm_df = input_df
    for i in cols
        norm_df[i] = (input_df[i] - minimum(input_df[i])) / 
            (maximum(input_df[i]) - minimum(input_df[i]))
    end
    norm_df
end

using RDatasets
iris = dataset("datasets", "iris")
println("original df:\n", head(iris))

norm_df = normalize(iris, [1:4]);
println("should be the same:\n", head(iris))

Output:

original df:
6x5 DataFrame
| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species  |
|-----|-------------|------------|-------------|------------|----------|
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | "setosa" |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | "setosa" |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | "setosa" |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | "setosa" |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | "setosa" |
| 6   | 5.4         | 3.9        | 1.7         | 0.4        | "setosa" |

should be the same:
6x5 DataFrame
| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species  |
|-----|-------------|------------|-------------|------------|----------|
| 1   | 0.222222    | 0.625      | 0.0677966   | 0.0416667  | "setosa" |
| 2   | 0.166667    | 0.416667   | 0.0677966   | 0.0416667  | "setosa" |
| 3   | 0.111111    | 0.5        | 0.0508475   | 0.0416667  | "setosa" |
| 4   | 0.0833333   | 0.458333   | 0.0847458   | 0.0416667  | "setosa" |
| 5   | 0.194444    | 0.666667   | 0.0677966   | 0.0416667  | "setosa" |
| 6   | 0.305556    | 0.791667   | 0.118644    | 0.125      | "setosa" |

Julia uses a behavior called "pass-by-sharing". From the documentation:

Julia function arguments follow a convention sometimes called “pass-by-sharing”, which means that values are not copied when they are passed to functions. Function arguments themselves act as new variable bindings (new locations that can refer to values), but the values they refer to are identical to the passed values. Modifications to mutable values (such as Arrays) made within a function will be visible to the caller. This is the same behavior found in Scheme, most Lisps, Python, Ruby and Perl, among other dynamic languages.
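A minimal sketch of what the quoted passage means in practice, using plain arrays rather than a DataFrame. The function names `mutate!` and `rebind` are made up for illustration:

```julia
# Mutating the object that an argument refers to is visible to the caller.
function mutate!(v::Vector{Int})
    v[1] = 100
    return nothing
end

# Rebinding the argument name only changes the local binding, not the caller's array.
function rebind(v::Vector{Int})
    v = [7, 8, 9]
    return nothing
end

a = [1, 2, 3]
mutate!(a)
@assert a == [100, 2, 3]   # the caller sees the change

b = [1, 2, 3]
rebind(b)
@assert b == [1, 2, 3]     # the caller's array is untouched

# Plain assignment, as in `norm_df = input_df`, only adds a second name for the same object.
c = a
@assert c === a
```

The last assertion is the situation in the question: `norm_df` and `input_df` are two names for one DataFrame, so writing to one writes to the other.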

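In your particular case, it looks like you want to create a completely new and independent DataFrame for the normalization. There are two functions for this: copy and deepcopy. If the element types of all the DataFrame's columns are immutable (e.g. Int, Float64, String, etc.), then copy is sufficient. However, if any column contains a mutable type, you need deepcopy. The calls look like this: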

norm_df = copy(input_df)     # Column types are immutable
norm_df = deepcopy(input_df) # At least one column type is mutable

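Julia generally requires you to perform these operations explicitly, because creating an independent copy of a large DataFrame is computationally expensive and Julia is a performance-oriented language.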
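Applied to the function from the question, the copy-based fix might look something like the sketch below. It assumes a recent DataFrames.jl (1.x) where columns are accessed as `df[!, i]`, which differs from the older `df[i]` syntax in the question. The name `normalize_copy` and the small example DataFrame are made up for illustration:

```julia
using DataFrames

# Scale the given columns to [0, 1] and return a new DataFrame.
# The caller's DataFrame is left untouched.
function normalize_copy(input_df::DataFrame, cols::AbstractVector{<:Integer})
    norm_df = copy(input_df)              # independent DataFrame instead of a second name
    for i in cols
        col = input_df[!, i]
        lo, hi = extrema(col)             # minimum and maximum in one pass
        norm_df[!, i] = (col .- lo) ./ (hi - lo)
    end
    return norm_df
end

# Small stand-in for the iris data (assumed values, not from the question).
df = DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 40.0], label = ["x", "y", "z"])
normed = normalize_copy(df, 1:2)

@assert normed.a == [0.0, 0.5, 1.0]
@assert df.a == [1.0, 2.0, 3.0]           # original is unchanged
```

Since every column here holds immutable elements, `copy` is enough; `deepcopy` would only be needed if a column stored mutable objects such as arrays.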

For those who want more detail on the difference between copy and deepcopy, note the following, again from the documentation:

copy(x): Create a shallow copy of x: the outer structure is copied, but not all internal values. For example, copying an array produces a new array with identically-same elements as the original.

deepcopy(x): Create a deep copy of x: everything is copied recursively, resulting in a fully independent object. For example, deep-copying an array produces a new array whose elements are deep-copies of the original elements.
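The difference only shows up when the elements themselves are mutable. A small sketch with a nested array (variable names are arbitrary):

```julia
nested = [[1, 2], [3, 4]]

shallow = copy(nested)      # new outer array, same inner arrays
deep = deepcopy(nested)     # new outer array and new inner arrays

# The outer structure of the shallow copy is independent of the original.
push!(shallow, [5, 6])
@assert length(nested) == 2

# The inner arrays are shared with the shallow copy, but not with the deep copy.
shallow[1][1] = 99
@assert nested[1][1] == 99
@assert deep[1][1] == 1
```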

The DataFrame type is array-like, so deepcopy is required if the elements are mutable. If you are not sure, use deepcopy (although it will be slower).

A related SO question is here