Converting a whole table/dataframe in pyarrow to dictionary_encoded columns

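I am loading Parquet files with Apache Arrow (pyarrow). So far I have had to convert the table to pandas, cast the string columns to categorical, and then convert it back to an Arrow table (which I later save as a Feather file).

The code looks like this: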

    import pyarrow as pa
    import pyarrow.feather as ft
    import pyarrow.parquet as pq

    # inputFile and outputFile are paths defined elsewhere
    df = pq.read_table(inputFile)
    # convert to pandas
    df2 = df.to_pandas()
    # get all object (string) columns and cast them to categorical
    list_str_obj_cols = df2.columns[df2.dtypes == "object"].tolist()
    for str_obj_col in list_str_obj_cols:
        df2[str_obj_col] = df2[str_obj_col].astype("category")

    print(df2.dtypes)
    # convert back from pandas to arrow
    table = pa.Table.from_pandas(df2)
    # write the file to the filesystem
    ft.write_feather(table, outputFile, compression='lz4')

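Is there a way to do this directly with pyarrow? Would it be faster? Thanks in advance.

In pyarrow, "categorical" is referred to as "dictionary encoded". So I think your question is whether it is possible to dictionary-encode the columns of an existing table. You can use the pyarrow.compute.dictionary_encode function to do this. Putting it all together: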

    import pyarrow as pa
    import pyarrow.compute as pc

    def dict_encode_all_str_columns(table):
        """Return a new table with every string column dictionary-encoded."""
        new_arrays = []
        for index, field in enumerate(table.schema):
            if field.type == pa.string():
                # string column: replace it with its dictionary-encoded form
                new_arr = pc.dictionary_encode(table.column(index))
                new_arrays.append(new_arr)
            else:
                # any other column is passed through unchanged
                new_arrays.append(table.column(index))
        return pa.Table.from_arrays(new_arrays, names=table.column_names)

    table = pa.Table.from_pydict({'int': [1, 2, 3, 4], 'str': ['x', 'y', 'x', 'y']})
    print(table)
    print(dict_encode_all_str_columns(table))