How to count the same rows between multiple CSV files in Pandas?
I merged 3 different CSV NetFlow datasets (D1, D2, D3) into one big dataset (df) and applied KMeans clustering to it.
To merge them I did not use pd.concat, because it raised a memory error; I did the merge from the Linux terminal instead.
df = pd.read_csv('D.csv')
#D is already created in a Linux machine from terminal
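(For reference, the same merge can be done in pandas without running out of memory by streaming each file in chunks and appending to the output CSV. This is only a sketch of that alternative, not what was actually used; the file names come from the question and the chunk size is an assumption.)
import pandas as pd

header_written = False
for name in ['D1.csv', 'D2.csv', 'D3.csv']:
    # Stream each file in pieces so the full dataset never sits in memory.
    # Note: mode='a' appends, so delete any existing D.csv first.
    for chunk in pd.read_csv(name, chunksize=100_000):  # chunk size is an assumption
        chunk.to_csv('D.csv', mode='a', header=not header_written, index=False)
        header_written = True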
........
KMeans Clustering
........
As a result of the clustering, I split the clusters into separate dataframes
and then wrote each one to a CSV file.
cluster_0 = df[df['clusters'] == 0]
cluster_1 = df[df['clusters'] == 1]
cluster_2 = df[df['clusters'] == 2]
cluster_0.to_csv('cluster_0.csv')
cluster_1.to_csv('cluster_1.csv')
cluster_2.to_csv('cluster_2.csv')
# My goal is to count how many rows each cluster has in common
# with D1, D2, and D3
D1 = pd.read_csv('D1.csv')
D2 = pd.read_csv('D2.csv')
D3 = pd.read_csv('D3.csv')
All of these datasets have the same column names, with 12 columns in total (all numeric).
Example of the expected result:
cluster_0 has xxxx identical rows from D1, xxxxx identical rows from D2, and xxxxx identical rows from D3?
I don't think there is enough information in the question to cover every edge case, but if I understand correctly, this should work.
import pandas as pd

# Read in the 3 files, and add a column called "file" so we know
# which file each row came from
D1 = pd.read_csv('D1.csv')
D1['file'] = 'D1.csv'
D2 = pd.read_csv('D2.csv')
D2['file'] = 'D2.csv'
D3 = pd.read_csv('D3.csv')
D3['file'] = 'D3.csv'
# Stack them row-wise into the DF that the "awk" command was producing
df = pd.concat([D1, D2, D3], ignore_index=True)
# Save off the Series showing which file each row belongs to
files = df['file']
# Drop it so that it doesn't get included in your analysis
df.drop('file', inplace=True, axis=1)
"""
There is no code in the question to show the KMeans clustering
"""
# Add the filename back
df['filename'] = files
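Once df carries both the cluster labels and the filename column, the per-file counts can also be read off in one line with a groupby. A minimal sketch, assuming the (not shown) KMeans step added a column named clusters exactly as in the question's code:
# Assumes df['clusters'] was added by the KMeans step shown in the question
counts = df.groupby(['clusters', 'filename']).size().unstack(fill_value=0)
print(counts)  # one row per cluster, one column per source file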
We will avoid the awk command and use pd.concat instead.
# merge() with no 'on' argument inner-joins on every column name the two
# frames share (here, the 12 numeric columns), so only identical rows survive
cluster0_D1 = pd.merge(D1, cluster_0, how='inner')
number_of_rows_D1 = len(cluster0_D1)
cluster0_D2 = pd.merge(D2, cluster_0, how='inner')
number_of_rows_D2 = len(cluster0_D2)
cluster0_D3 = pd.merge(D3, cluster_0, how='inner')
number_of_rows_D3 = len(cluster0_D3)
print("How many samples belong to D1, D2, D3 for cluster_0?")
print("D1: ",number_of_rows_D1)
print("D2: ",number_of_rows_D2)
print("D3: ",number_of_rows_D3)
I think this solved my problem.