dataprep.eda TypeError: Please provide npartitions as an int, or possibly as None if you specify chunksize
I'm struggling to understand a TypeError raised by the dataprep package. My setup is simple:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"phone": [
"555-234-5678",
"(555) 234-5678",
"555.234.5678",
"555/234/5678",
15551234567,
"(1) 555-234-5678",
"+1 (234) 567-8901 x. 1234",
"2345678901 extension 1234",
"2345678",
"800-299-JUNK",
"1-866-4ZIPCAR",
"123 ABC COMPANY",
"+66 91 889 8948",
"hello",
np.nan,
"NULL",
]
}
)
from dataprep.clean import clean_phone
clean_phone(df, "phone")
This is the error message that ends up in the terminal (for safety I have omitted file paths and replaced sensitive values with x):
Traceback (most recent call last):
File "c:\Users\x\x\Documents\Repositories\test.py", line 14, in <module>
clean_phone(df, "phone")
File "C:\Users\x\Anaconda3\envs\myenv\lib\site-packages\dataprep\clean\clean_phone.py", line 150, in clean_phone
df = to_dask(df)
File "C:\Users\x\Anaconda3\envs\myenv\lib\site-packages\dataprep\clean\utils.py", line 73, in to_dask
return dd.from_pandas(df, npartitions=npartitions)
File "C:\Users\x\Anaconda3\envs\myenv\lib\site-packages\dask\dataframe\io\io.py", line 236, in from_pandas
raise TypeError(
TypeError: Please provide npartitions as an int, or possibly as None if you specify chunksize.
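This is a direct attempt to reproduce the tutorial published by the dataprep team at: https://docs.dataprep.ai/user_guide/clean/clean_phone.html
According to the tutorial, the expected output is as follows:
Expected output.
Searching Google for this TypeError turns up only one semi-related result.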
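There is a small bug in the dataprep package; you can track it in this PR.
In the meantime, one way to avoid the error is to convert the data to a dask DataFrame explicitly and pass that to the function: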
import numpy as np
import pandas as pd
from dask.dataframe import from_pandas
from dataprep.clean import clean_phone
df = pd.DataFrame(
{
"phone": [
"555-234-5678",
"(555) 234-5678",
"555.234.5678",
"555/234/5678",
15551234567,
"(1) 555-234-5678",
"+1 (234) 567-8901 x. 1234",
"2345678901 extension 1234",
"2345678",
"800-299-JUNK",
"1-866-4ZIPCAR",
"123 ABC COMPANY",
"+66 91 889 8948",
"hello",
np.nan,
"NULL",
]
}
)
# to avoid the bug we are passing ddf, not df
ddf = from_pandas(df, npartitions=2)
clean_phone(ddf, "phone")
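The `npartitions=2` above is an arbitrary choice. If you would rather derive the partition count from the size of the data, here is a minimal sketch. Note that `safe_npartitions` and the 128 MiB target are hypothetical names and values of my own, not part of dataprep or dask; the only point is that the value handed to `from_pandas` is always a plain Python `int` of at least 1.

```python
import math

# Hypothetical target size per partition (an assumption, not a dask default).
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def safe_npartitions(size_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Return a partition count as a built-in int, never less than 1."""
    # int() normalizes NumPy scalars to a Python int before the arithmetic.
    size = int(size_bytes)
    # math.ceil on a Python float returns a Python int.
    return max(1, math.ceil(size / target_bytes))


print(safe_npartitions(0))
print(safe_npartitions(TARGET_PARTITION_BYTES + 1))
```

With the DataFrame from the example, this could be used as `from_pandas(df, npartitions=safe_npartitions(df.memory_usage(deep=True).sum()))`.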