Dask read_csv: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`
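I am trying to read a CSV file with dask, and it gives me the error shown below. The thing is, I want my ARTICLE_ID column to be object (string). Can anyone help me read the data successfully?
The traceback is as follows: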
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+------------+--------+----------+
| Column | Found | Expected |
+------------+--------+----------+
| ARTICLE_ID | object | int64 |
+------------+--------+----------+
The following columns also raised exceptions on conversion:
ARTICLE_ID:
ValueError("invalid literal for int() with base 10: ' July 2007 and 31 March 2008. Diagnostic practices of the medical practitioners for establishing the diagnosis of different types of EPTB were studied. Results: For the diagnosi\\'",)
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'ARTICLE_ID': 'object'}
to the call to `read_csv`/`read_table`.
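The message suggests that you change your call from
df = dd.read_csv('mylocation.csv', ...)
to
df = dd.read_csv('mylocation.csv', ..., dtype={'ARTICLE_ID': 'object'})
where you should replace the file location and any other arguments with whatever you were using before. If this still doesn't work, please update your question.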
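You can use the sample parameter of the read_csv method and assign it an integer indicating the number of bytes to use when determining dtypes. For example, I had to give it 25000000 to correctly infer the dtypes for data of shape (171907, 161).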
df = dd.read_csv("game_logs.csv", sample=25000000)
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv