Python DataFrames 的假设混合策略行为
Python Hypothesis mixing strategies bahavior for DataFrames
以下按预期工作
from datetime import datetime
from hypothesis.extra.pandas import columns, data_frames, indexes
import hypothesis.strategies as st
def boundarize(d: datetime):
return d.replace(minute=15 * (d.minute // 15), second=0, microsecond=0)
min_date = datetime(2022, 4, 1, 22, 22, 22)
max_date = datetime(2022, 5, 1, 22, 22, 22)
dfs = data_frames(
index=indexes(
elements=st.datetimes(min_value=min_date, max_value=max_date).map(boundarize),
min_size=3,
max_size=5,
).map(lambda idx: idx.sort_values()),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
输出类似于
A B C
2022-04-06 12:45:00 -11482 1588438979 -1994987295
2022-04-08 15:45:00 -833447611 3 -51
2022-04-24 06:15:00 -465371373 990274387 -14969
2022-05-01 01:15:00 1750446827 1214440777 116
2022-05-01 06:15:00 -44089 30508 58737
现在,当我尝试通过
生成具有均匀 spaced DatetimeIndex 值的类似 DataFrame 时
from datetime import datetime
from hypothesis.extra.pandas import columns, data_frames, indexes
import hypothesis.strategies as st
def boundarize(d: datetime):
return d.replace(minute=15 * (d.minute // 15), second=0, microsecond=0)
min_date_start = datetime(2022, 4, 1, 11, 11, 11)
max_date_start = datetime(2022, 4, 2, 11, 11, 11)
min_date_end = datetime(2022, 5, 1, 22, 22, 22)
max_date_end = datetime(2022, 5, 2, 22, 22, 22)
dfs = data_frames(
index=st.builds(pd.date_range,
start=st.datetimes(min_value=min_date_start, max_value=max_date_start).map(boundarize),
end=st.datetimes(min_value=min_date_end, max_value=max_date_end).map(boundarize),
freq=st.just("15T"),
),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
输出如下,请注意整数列在第一个示例中不存在时始终为零:
A B C
2022-04-01 15:45:00 0 0 0
2022-04-01 16:00:00 0 0 0
2022-04-01 16:15:00 0 0 0
2022-04-01 16:30:00 0 0 0
2022-04-01 16:45:00 0 0 0
... .. .. ..
2022-05-01 21:15:00 0 0 0
2022-05-01 21:30:00 0 0 0
2022-05-01 21:45:00 0 0 0
2022-05-01 22:00:00 0 0 0
2022-05-01 22:15:00 0 0 0
[2907 rows x 3 columns]
这是预期的行为还是我遗漏了什么?
编辑:
回避“随机连续子集”的方法(请参阅下面我的评论),我还尝试使用预定义索引
from datetime import datetime
from hypothesis.extra.pandas import columns, data_frames
import hypothesis.strategies as st
min_date_start = datetime(2022, 4, 1, 8, 0, 0)
dfs = data_frames(
index=st.just(pd.date_range(start=min_date_start, periods=10, freq="15T")),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
它也给出所有零列
A B C
2022-04-01 08:00:00 0 0 0
2022-04-01 08:15:00 0 0 0
2022-04-01 08:30:00 0 0 0
2022-04-01 08:45:00 0 0 0
2022-04-01 09:00:00 0 0 0
2022-04-01 09:15:00 0 0 0
2022-04-01 09:30:00 0 0 0
2022-04-01 09:45:00 0 0 0
2022-04-01 10:00:00 0 0 0
2022-04-01 10:15:00 0 0 0
编辑 2:
我试图想出一个连续子集的手工版本,它应该减少 space 的值,以便根据 @zac-hatfield-dodds 的回答为列值留下足够的熵,但根据经验它仍然生成几乎全为零的列值
from datetime import datetime
import math
import hypothesis.strategies as st
from hypothesis.extra.pandas import columns, data_frames
import pandas as pd
time_start = datetime(2022, 4, 1, 8, 0, 0)
time_stop = datetime(2022, 4, 2, 8, 0, 0)
r = pd.date_range(start=time_start, end=time_stop, freq="15T")
def build_indices(sequence):
first = 0
if len(sequence) % 2 == 0:
mid_ceiling = len(sequence) // 2
mid_floor = mid_ceiling - 1
else:
mid_floor = math.floor(len(sequence) / 2)
mid_ceiling = mid_floor + 1
second = len(sequence) - 1
return first, mid_floor, mid_ceiling, second
first, mid_floor, mid_ceiling, second = build_indices(r)
a = st.integers(min_value=first, max_value=mid_floor)
b = st.integers(min_value=mid_ceiling, max_value=second)
def indexer(sequence, lower, upper):
return sequence[lower:upper]
dfs = data_frames(
index=st.builds(lambda lower, upper: indexer(r, lower, upper), lower=a, upper=b),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
您的问题是后面的索引 大得多 ,假设是 运行 出于熵生成列内容。如果您将索引限制为最多几十个条目,一切都应该正常。
我们有这个 soft-cap 是为了限制其他无限的递归结构,所以整体设计按预期工作,但我承认在这种情况下它既不必要也不可取。
以下按预期工作
from datetime import datetime
from hypothesis.extra.pandas import columns, data_frames, indexes
import hypothesis.strategies as st
def boundarize(d: datetime):
return d.replace(minute=15 * (d.minute // 15), second=0, microsecond=0)
min_date = datetime(2022, 4, 1, 22, 22, 22)
max_date = datetime(2022, 5, 1, 22, 22, 22)
dfs = data_frames(
index=indexes(
elements=st.datetimes(min_value=min_date, max_value=max_date).map(boundarize),
min_size=3,
max_size=5,
).map(lambda idx: idx.sort_values()),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
输出类似于
A B C
2022-04-06 12:45:00 -11482 1588438979 -1994987295
2022-04-08 15:45:00 -833447611 3 -51
2022-04-24 06:15:00 -465371373 990274387 -14969
2022-05-01 01:15:00 1750446827 1214440777 116
2022-05-01 06:15:00 -44089 30508 58737
现在,当我尝试通过
生成具有均匀 spaced DatetimeIndex 值的类似 DataFrame 时from datetime import datetime
from hypothesis.extra.pandas import columns, data_frames, indexes
import hypothesis.strategies as st
def boundarize(d: datetime):
return d.replace(minute=15 * (d.minute // 15), second=0, microsecond=0)
min_date_start = datetime(2022, 4, 1, 11, 11, 11)
max_date_start = datetime(2022, 4, 2, 11, 11, 11)
min_date_end = datetime(2022, 5, 1, 22, 22, 22)
max_date_end = datetime(2022, 5, 2, 22, 22, 22)
dfs = data_frames(
index=st.builds(pd.date_range,
start=st.datetimes(min_value=min_date_start, max_value=max_date_start).map(boundarize),
end=st.datetimes(min_value=min_date_end, max_value=max_date_end).map(boundarize),
freq=st.just("15T"),
),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
输出如下,请注意整数列在第一个示例中不存在时始终为零:
A B C
2022-04-01 15:45:00 0 0 0
2022-04-01 16:00:00 0 0 0
2022-04-01 16:15:00 0 0 0
2022-04-01 16:30:00 0 0 0
2022-04-01 16:45:00 0 0 0
... .. .. ..
2022-05-01 21:15:00 0 0 0
2022-05-01 21:30:00 0 0 0
2022-05-01 21:45:00 0 0 0
2022-05-01 22:00:00 0 0 0
2022-05-01 22:15:00 0 0 0
[2907 rows x 3 columns]
这是预期的行为还是我遗漏了什么?
编辑:
回避“随机连续子集”的方法(请参阅下面我的评论),我还尝试使用预定义索引
from datetime import datetime
from hypothesis.extra.pandas import columns, data_frames
import hypothesis.strategies as st
min_date_start = datetime(2022, 4, 1, 8, 0, 0)
dfs = data_frames(
index=st.just(pd.date_range(start=min_date_start, periods=10, freq="15T")),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
它也给出所有零列
A B C
2022-04-01 08:00:00 0 0 0
2022-04-01 08:15:00 0 0 0
2022-04-01 08:30:00 0 0 0
2022-04-01 08:45:00 0 0 0
2022-04-01 09:00:00 0 0 0
2022-04-01 09:15:00 0 0 0
2022-04-01 09:30:00 0 0 0
2022-04-01 09:45:00 0 0 0
2022-04-01 10:00:00 0 0 0
2022-04-01 10:15:00 0 0 0
编辑 2:
我试图想出一个连续子集的手工版本,它应该减少 space 的值,以便根据 @zac-hatfield-dodds 的回答为列值留下足够的熵,但根据经验它仍然生成几乎全为零的列值
from datetime import datetime
import math
import hypothesis.strategies as st
from hypothesis.extra.pandas import columns, data_frames
import pandas as pd
time_start = datetime(2022, 4, 1, 8, 0, 0)
time_stop = datetime(2022, 4, 2, 8, 0, 0)
r = pd.date_range(start=time_start, end=time_stop, freq="15T")
def build_indices(sequence):
first = 0
if len(sequence) % 2 == 0:
mid_ceiling = len(sequence) // 2
mid_floor = mid_ceiling - 1
else:
mid_floor = math.floor(len(sequence) / 2)
mid_ceiling = mid_floor + 1
second = len(sequence) - 1
return first, mid_floor, mid_ceiling, second
first, mid_floor, mid_ceiling, second = build_indices(r)
a = st.integers(min_value=first, max_value=mid_floor)
b = st.integers(min_value=mid_ceiling, max_value=second)
def indexer(sequence, lower, upper):
return sequence[lower:upper]
dfs = data_frames(
index=st.builds(lambda lower, upper: indexer(r, lower, upper), lower=a, upper=b),
columns=columns("A B C".split(), dtype=int),
)
dfs.example()
您的问题是后面的索引 大得多 ,假设是 运行 出于熵生成列内容。如果您将索引限制为最多几十个条目,一切都应该正常。
我们有这个 soft-cap 是为了限制其他无限的递归结构,所以整体设计按预期工作,但我承认在这种情况下它既不必要也不可取。