How to delete an entire row of a data set given a condition on a column in a CSV file?
Below is a snippet of the data set, in CSV format:
quantity revenue time_x transaction_id user_id
1 0 57:57.0 0 0 0
1 0 18:59.0 0 1
I want to delete the entire row when user_id is empty. How can I do this in Python? Here is my code so far:
activity = pd.read_csv("activity(delimited).csv", delimiter=';', error_bad_lines=False, dtype=object)
impression = pd.read_csv("impression(delimited).csv", delimiter=';', error_bad_lines=False, dtype=object)
click = pd.read_csv("click(delimited).csv", delimiter=';', error_bad_lines=False, dtype=object)
pre_merge = activity.merge(impression, on="user_id", how="outer")
merged = pre_merge.merge(click, on="user_id", how="outer")
merged.to_csv("merged.csv", index=False)
open_merged = pd.read_csv("merged.csv", delimiter=',', error_bad_lines= False, dtype=object)
filtered_merged = open_merged.dropna(axis='columns', how='all')
Also, how can I write this code more efficiently?
With pandas:
import pandas as pd
# pandas >= 1.3: on_bad_lines="skip" replaces error_bad_lines=False (removed in pandas 2.0)
df = pd.read_csv("path/to/csv/data.csv", delimiter=';', on_bad_lines="skip")
df = df[pd.notnull(df.user_id)] # boolean indexing: keep rows where user_id is present
# Shift user_id to first column
df = df.set_index("user_id")
df = df.reset_index()
df.to_csv("path/to/csv/data.csv", index=False)
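The bracket notation accepts an iterable of booleans and keeps only the rows where the value is True. This is called boolean indexing. Similar concepts and syntax are used in NumPy, MATLAB, and R.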
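To make the boolean indexing idea concrete, here is a minimal sketch on a small in-memory DataFrame. The sample values are made up for illustration and are not the asker's data; `dropna(subset=...)` is shown as an equivalent shorthand, not as something the answer above used.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing user_id.
df = pd.DataFrame(
    {
        "quantity": [1, 1, 2],
        "revenue": [0, 0, 5],
        "user_id": ["0", None, "1"],
    }
)

# Boolean mask: True where user_id is present, False where it is missing.
mask = df["user_id"].notna()

# Indexing with the mask keeps only the rows where the mask is True.
kept = df[mask]

# dropna restricted to the user_id column performs the same row filter.
kept_alt = df.dropna(subset=["user_id"])
```

Both `kept` and `kept_alt` contain only the rows whose user_id is not missing, with the original row labels preserved.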
A different style: load the data, merge, then delete what is no longer needed. This keeps the namespace clean.
# pandas >= 1.3: on_bad_lines="skip" replaces error_bad_lines=False (removed in pandas 2.0)
activity = pd.read_csv("activity(delimited).csv", delimiter=';', on_bad_lines="skip")
impression = pd.read_csv("impression(delimited).csv", delimiter=';', on_bad_lines="skip")
pre_merge = activity.merge(impression, on="user_id", how="outer")
del activity, impression  # free the inputs once they are merged
click = pd.read_csv("click(delimited).csv", delimiter=';', on_bad_lines="skip")
merged = pre_merge.merge(click, on="user_id", how="outer")
merged.to_csv("merged.csv", index=False)
del click
open_merged = pd.read_csv("merged.csv", on_bad_lines="skip")
# Note: this drops columns that are entirely empty, not rows with a missing user_id
filtered_merged = open_merged.dropna(axis='columns', how='all')
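One possible way to combine the two ideas is to drop the rows with a missing user_id before merging, which also avoids pandas matching null keys against each other during the merge. The sketch below is an assumption-laden illustration: the helper `merge_on_user_id` and the tiny in-memory frames are hypothetical stand-ins for the three CSV files, and it skips the round trip through merged.csv.

```python
from functools import reduce

import pandas as pd


def merge_on_user_id(frames):
    """Outer-merge the frames on user_id after dropping rows without a user_id.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Remove rows whose user_id is missing from every input frame first.
    cleaned = [frame.dropna(subset=["user_id"]) for frame in frames]
    # Fold the list into a single frame with successive outer merges.
    return reduce(
        lambda left, right: left.merge(right, on="user_id", how="outer"),
        cleaned,
    )


# Made-up stand-ins for the activity, impression and click CSV files.
activity = pd.DataFrame({"user_id": ["0", "1", None], "quantity": ["1", "1", "1"]})
impression = pd.DataFrame({"user_id": ["0", "2"], "impressions": ["3", "4"]})
click = pd.DataFrame({"user_id": ["1"], "clicks": ["7"]})

filtered = merge_on_user_id([activity, impression, click])
```

With real files, each frame would come from `pd.read_csv` as in the code above, and `filtered.to_csv("merged.csv", index=False)` would write the result once at the end.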