Using an IF condition in polars.dataframe

Question

Here is a sample of my code. I am trying to use an IF condition to find specific rows, for example:

┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin       ┆ ymin       ┆ xmax       ┆ ymax       ┆ confidence ┆ class ┆ name │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---   ┆ ---  │
│ f64        ┆ f64        ┆ f64        ┆ f64        ┆ f64        ┆ i64   ┆ str  │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 385.432404 ┆ 265.198486 ┆ 402.597534 ┆ 286.265503 ┆ 0.880611   ┆ 0     ┆ corn │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 357.966461 ┆ 424.923828 ┆ 393.622803 ┆ 473.383209 ┆ 0.8493     ┆ 0     ┆ ice  │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘

if boxes[(boxes['name']=='corn') & (boxes['ymax']>=250) & (boxes['ymin']<=265)]:
    print("found")
if boxes[(boxes['name']=='ice') & (boxes['ymax']>=460) & (boxes['ymin']<=600) & (boxes['xmin']<=600)]:
    print("found")

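The code above works, but when there is no data it gives me the error ValueError: could not convert string to float: 'corn'. I assume this is because, when no data is inserted, the dtype of every column (xmin, ymin, xmax, ymax, confidence, class, name) is automatically set to f32. How can I change that? And is there a better way to do the same job, since the code I wrote above does not look very optimized to me? (Sorry for my poor English if it is hard to understand what I am trying to say.) I would be grateful if anyone could help.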

Answer 1

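Setting the datatype of each column

To standardize the datatype of each column across a collection of DataFrames, you can pass a list of (column name, datatype) tuples to the columns keyword when creating the DataFrame. This ensures that every DataFrame has columns with the same names and datatypes, even when a DataFrame is empty.

From the documentation for DataFrame:

columns: Sequence of str or (str,DataType) pairs, default None

Let's borrow some of your code to see how this works: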

import polars as pl

ca = [
    ("xmin", pl.Float64),
    ("ymin", pl.Float64),
    ("xmax", pl.Float64),
    ("ymax", pl.Float64),
    ("confidence", pl.Float64),
    ("class", pl.Int32),
    ("name", pl.Utf8),
]  # xyxy columns

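Note that each column name is now part of a tuple, together with the datatype we want. I chose Float64 for most of your columns, but you can change this to something more appropriate. Here is a handy list of Polars datatypes.

Let's see how this works (again, borrowing code from yours).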

a = [
    [],
    [
        [
            370.01605224609375,
            346.4305114746094,
            398.3968811035156,
            384.5684814453125,
            0.9011853933334351,
            0,
            "corn",
        ]
    ],
]

for x in a:
    print(pl.DataFrame(x or None, columns=ca, orient="row"))

shape: (0, 7)
┌──────┬──────┬──────┬──────┬────────────┬───────┬──────┐
│ xmin ┆ ymin ┆ xmax ┆ ymax ┆ confidence ┆ class ┆ name │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---        ┆ ---   ┆ ---  │
│ f64  ┆ f64  ┆ f64  ┆ f64  ┆ f64        ┆ i32   ┆ str  │
╞══════╪══════╪══════╪══════╪════════════╪═══════╪══════╡
└──────┴──────┴──────┴──────┴────────────┴───────┴──────┘
shape: (1, 7)
┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin       ┆ ymin       ┆ xmax       ┆ ymax       ┆ confidence ┆ class ┆ name │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---   ┆ ---  │
│ f64        ┆ f64        ┆ f64        ┆ f64        ┆ f64        ┆ i32   ┆ str  │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 370.016052 ┆ 346.430511 ┆ 398.396881 ┆ 384.568481 ┆ 0.901185   ┆ 0     ┆ corn │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘

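The corresponding columns now have the same datatypes, even for the empty DataFrame. This will help when we query the DataFrames, because we won't need to cast datatypes every time we run a query.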

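One further note: I changed the DataFrame constructor from

pl.DataFrame(x, columns=ca, orient="row")

to

pl.DataFrame(x or None, columns=ca, orient="row")

This is a workaround for cases when your DataFrame is empty. (This workaround may no longer be needed in future versions of Polars.)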

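Querying

Now that the datatypes of the columns are standardized in every DataFrame, even the empty ones, we can run queries without worrying about casting datatypes.

Let's first create a DataFrame using the data from your example: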

_data = [
    [385.432404, 265.198486, 402.597534, 286.265503, 0.880611, 0, "corn"],
    [357.966461, 424.923828, 393.622803, 473.383209, 0.8493, 0, "ice"],
]

ca = [
    ("xmin", pl.Float64),
    ("ymin", pl.Float64),
    ("xmax", pl.Float64),
    ("ymax", pl.Float64),
    ("confidence", pl.Float64),
    ("class", pl.Int32),
    ("name", pl.Utf8),
]  # xyxy columns

boxes = pl.DataFrame(_data or None, columns=ca, orient="row")
print(boxes)

shape: (2, 7)
┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin       ┆ ymin       ┆ xmax       ┆ ymax       ┆ confidence ┆ class ┆ name │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---   ┆ ---  │
│ f64        ┆ f64        ┆ f64        ┆ f64        ┆ f64        ┆ i32   ┆ str  │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 385.432404 ┆ 265.198486 ┆ 402.597534 ┆ 286.265503 ┆ 0.880611   ┆ 0     ┆ corn │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 357.966461 ┆ 424.923828 ┆ 393.622803 ┆ 473.383209 ┆ 0.8493     ┆ 0     ┆ ice  │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘

Let's look at the second of the two queries you posted:

if boxes[(boxes['name']=='ice') & (boxes['ymax']>=460) & (boxes['ymin']<=600) & (boxes['xmin']<=600)]:
    print("found")

In Polars, we run queries using the filter method. We would express this query as:

boxes.filter(
    (pl.col("name") == "ice")
    & (pl.col("ymax") >= 460)
    & (pl.col("ymin") <= 600)
    & (pl.col("xmin") <= 600)
)

shape: (1, 7)
┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin       ┆ ymin       ┆ xmax       ┆ ymax       ┆ confidence ┆ class ┆ name │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---   ┆ ---  │
│ f64        ┆ f64        ┆ f64        ┆ f64        ┆ f64        ┆ i32   ┆ str  │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 357.966461 ┆ 424.923828 ┆ 393.622803 ┆ 473.383209 ┆ 0.8493     ┆ 0     ┆ ice  │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘

If you want to know whether a query returned any records (so that you can use the result in an if statement), use the is_empty method:

my_query = boxes.filter(
    (pl.col("name") == "ice")
    & (pl.col("ymax") >= 460)
    & (pl.col("ymin") <= 600)
    & (pl.col("xmin") <= 600)
)

if not my_query.is_empty():
    print("I found records")

>>> my_query = boxes.filter(
...     (pl.col("name") == "ice")
...     & (pl.col("ymax") >= 460)
...     & (pl.col("ymin") <= 600)
...     & (pl.col("xmin") <= 600)
... )
>>> if not my_query.is_empty():
...     print("I found records")
... 
I found records

The is_empty method is not strictly necessary. This will also work:

if my_query:
    print("I found records")

>>> if my_query:
...     print("I found records")
... 
I found records

Answer 2

This is a workaround for cases when your DataFrame is empty. (This workaround may no longer be needed in future versions of Polars.)

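Indeed, it is no longer needed. I patched this about 3 weeks ago and it was merged shortly after; you can skip the workaround in every version released since then.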
