How to use PyCall in Julia to convert Python output to a Julia DataFrame

Question

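I want to retrieve some data from quandl and analyze it in Julia. Unfortunately, there is no official API available for this yet. I am aware of this solution, but it is still quite limited in functionality and does not follow the same syntax as the original Python API.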

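I thought it would be sensible to use PyCall to retrieve the data from within Julia through the official Python API. This does produce output, but I am not sure how to convert it into a format I can use in Julia (ideally a DataFrame).

I tried the following.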

using PyCall, DataFrames
@pyimport quandl

data = quandl.get("WIKI/AAPL", returns = "pandas");

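Julia converts this output to a Dict{Any,Any}. When using returns = "numpy" instead of returns = "pandas", I end up with a PyObject rec.array.

How can I make data a Julia DataFrame, like the one quandl.jl returns? Note that quandl.jl is not an option for me, because it does not support automatic retrieval of multiple assets and lacks several other features, so I have to use the Python API.

Thanks for any suggestions!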

Answer 1

Here is one option:

First, extract the column names from your data object:

julia> colnames = map(Symbol, data[:columns])
12-element Array{Symbol,1}:
 :Open                
 :High                
 :Low                 
 :Close               
 :Volume              
 Symbol("Ex-Dividend")
 Symbol("Split Ratio")
 Symbol("Adj. Open")  
 Symbol("Adj. High")  
 Symbol("Adj. Low")   
 Symbol("Adj. Close") 
 Symbol("Adj. Volume")

Then pour all the columns into a DataFrame:

julia> y = DataFrame(Any[Array(data[c]) for c in colnames], colnames)

6×12 DataFrames.DataFrame
│ Row │ Open  │ High  │ Low   │ Close │ Volume   │ Ex-Dividend │ Split Ratio │
├─────┼───────┼───────┼───────┼───────┼──────────┼─────────────┼─────────────┤
│ 1   │ 28.75 │ 28.87 │ 28.75 │ 28.75 │ 2.0939e6 │ 0.0         │ 1.0         │
│ 2   │ 27.38 │ 27.38 │ 27.25 │ 27.25 │ 785200.0 │ 0.0         │ 1.0         │
│ 3   │ 25.37 │ 25.37 │ 25.25 │ 25.25 │ 472000.0 │ 0.0         │ 1.0         │
│ 4   │ 25.87 │ 26.0  │ 25.87 │ 25.87 │ 385900.0 │ 0.0         │ 1.0         │
│ 5   │ 26.63 │ 26.75 │ 26.63 │ 26.63 │ 327900.0 │ 0.0         │ 1.0         │
│ 6   │ 28.25 │ 28.38 │ 28.25 │ 28.25 │ 217100.0 │ 0.0         │ 1.0         │

│ Row │ Adj. Open │ Adj. High │ Adj. Low │ Adj. Close │ Adj. Volume │
├─────┼───────────┼───────────┼──────────┼────────────┼─────────────┤
│ 1   │ 0.428364  │ 0.430152  │ 0.428364 │ 0.428364   │ 1.17258e8   │
│ 2   │ 0.407952  │ 0.407952  │ 0.406015 │ 0.406015   │ 4.39712e7   │
│ 3   │ 0.378004  │ 0.378004  │ 0.376216 │ 0.376216   │ 2.6432e7    │
│ 4   │ 0.385453  │ 0.38739   │ 0.385453 │ 0.385453   │ 2.16104e7   │
│ 5   │ 0.396777  │ 0.398565  │ 0.396777 │ 0.396777   │ 1.83624e7   │
│ 6   │ 0.420914  │ 0.422851  │ 0.420914 │ 0.420914   │ 1.21576e7   │

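Thanks to @Matt B. for the suggestion that simplified the code.

The problem with the above is that every column in the data frame has element type Any. To make it more efficient, here are a few functions that do the job: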

# first, guess the Julia equivalent of type of the object
function guess_type(x::PyCall.PyObject)
  string_dtype = x[:dtype][:name]
  julia_string = string(uppercase(string_dtype[1]), string_dtype[2:end])

  return eval(parse("$julia_string"))
end

# convert an individual column, falling back to Any array if the guess was wrong
function convert_column(x)
  y = try Array{guess_type(x)}(x) catch Array(x) end
  return y
end

# put everything together into a single function
function convert_pandas(df)
  colnames = map(Symbol, df[:columns])  # use the argument df, not the global data
  y = DataFrame(Any[convert_column(df[c]) for c in colnames], colnames)

  return y
end

Applied to your data, the above gives the same column names as before, but with the correct Float64 column types:

y = convert_pandas(data);
showcols(y)
9147×12 DataFrames.DataFrame
│ Col # │ Name        │ Eltype  │ Missing │
├───────┼─────────────┼─────────┼─────────┤
│ 1     │ Open        │ Float64 │ 0       │
│ 2     │ High        │ Float64 │ 0       │
│ 3     │ Low         │ Float64 │ 0       │
│ 4     │ Close       │ Float64 │ 0       │
│ 5     │ Volume      │ Float64 │ 0       │
│ 6     │ Ex-Dividend │ Float64 │ 0       │
│ 7     │ Split Ratio │ Float64 │ 0       │
│ 8     │ Adj. Open   │ Float64 │ 0       │
│ 9     │ Adj. High   │ Float64 │ 0       │
│ 10    │ Adj. Low    │ Float64 │ 0       │
│ 11    │ Adj. Close  │ Float64 │ 0       │
│ 12    │ Adj. Volume │ Float64 │ 0       │
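
As a possible alternative to building a type name with eval(parse(...)), the dtype name could be looked up in a small table. The following is only a sketch: the names DTYPE_TO_JULIA and julia_eltype are hypothetical, the table covers just a handful of NumPy dtype names, and anything not listed (for example "object") falls back to Any. It takes the dtype name as a plain string, so it would be called with the value that guess_type reads from x[:dtype][:name].

```julia
# Hypothetical lookup table from NumPy dtype names to Julia element types.
# Only a few common dtypes are listed; extend it as needed.
const DTYPE_TO_JULIA = Dict(
    "float64" => Float64,
    "float32" => Float32,
    "int64"   => Int64,
    "int32"   => Int32,
    "bool"    => Bool,
)

# Return the Julia element type for a dtype name.
# Unknown names fall back to Any instead of raising an error.
julia_eltype(dtype_name::AbstractString) = get(DTYPE_TO_JULIA, dtype_name, Any)
```

With this table, julia_eltype("float64") returns Float64 and julia_eltype("object") returns Any, so no try/catch is needed just to handle an unknown name, and no string is ever evaluated as code.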

Answer 2

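You are running into a difference between Python/Pandas versions. I happen to have two configurations readily available: Pandas 0.18.0 in Python 2 and Pandas 0.19.1 in Python 3. The answer @niczky12 provided works well in the first configuration, but I see your Dict{Any,Any} behavior in the second. Basically, something changed between these two configurations such that PyCall detects a mapping-like interface on the Pandas objects and then exposes that interface as a dictionary through automatic conversion. There are two options here:

Use the dictionary interface: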

data = quandl.get("WIKI/AAPL", returns = "pandas")
cols = keys(data)
df = DataFrame(Any[collect(values(data[c])) for c in cols], map(Symbol, cols))

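Explicitly disable the automatic conversion and use the PyCall interface to extract the columns. Note that data[:Open] will auto-convert to a mapped dictionary, whereas data["Open"] will just return a PyObject.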
```
data = pycall(quandl.get, PyObject, "WIKI/AAPL", returns = "pandas")
cols = data[:columns]
df = DataFrame(Any[Array(data[c]) for c in cols], map(Symbol, cols))
```

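Note, however, that in both cases the all-important date index is not included in the resulting data frame. You almost certainly want to add it as a column: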

df[:Date] = collect(data[:index])
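
One caveat when using the dictionary interface: a plain Julia Dict does not guarantee any iteration order, so collect(values(...)) may not return the rows in date order. The sketch below assumes that an auto-converted column behaves like an ordinary Dict keyed by index label; the variable col and its three entries are made-up stand-ins, not real quandl output. It sorts the index labels once and then reads the values in that order.

```julia
# Hypothetical stand-in for one auto-converted column: index label => value.
# The dates and prices are invented purely for illustration.
col = Dict(
    "2017-01-03" => 116.15,
    "2017-01-05" => 116.61,
    "2017-01-04" => 116.02,
)

# Sort the index labels once so every column can be read in the same order.
idx = sort(collect(keys(col)))

# Read the values in sorted-index order instead of relying on Dict ordering.
vals = [col[k] for k in idx]
```

Reusing the same idx for every column keeps the rows aligned across columns, and idx itself can serve as the Date column.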

Answer 3

There is an API. Just use Quandl.jl: https://github.com/milktrader/Quandl.jl

using Quandl
data = quandlget("WIKI/AAPL")

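This has the added advantage of returning the data in a useful Julia format (TimeArray), which has appropriate methods defined for working with this kind of data.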
