Problem comparing two DataFrames in Python: all duplicates should be excluded, but it does not work properly

Question

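I'm developing a connector from Google Analytics to a SQL Server database and have run into a problem with duplicate values.

First, the script parses a nested dict with the GA account configs, then converts each response to a pandas DataFrame and stores all the responses in a list. After that it fetches the current SQL table holding all the GA data and loops over the list, comparing the new values (from the GA API) with the current values (in the SQL table).

But for some reason, when these two DataFrames are compared, all the duplicates are preserved.

I'd be very glad if someone could help.

The nested dict with the configs used to make the GA API requests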


data_test = {
    
    'view_id_111' : {'view_id': '111', 
                           'start_date': '2019-08-01', 
                           'end_date': '2019-09-01',
                           'metrics': [{'expression': 'ga:sessions'}, {'expression':'ga:users'}],
                           'dimensions': [{'name': 'ga:country'}, {'name': 'ga:userType'}, {'name': 'ga:date'}]},
    
     'view_id_222' : {'view_id': '222', 
                           'start_date': '2019-08-01', 
                           'end_date': '2019-09-01',
                           'metrics': [{'expression': 'ga:sessions'}, {'expression':'ga:users'}],
                           'dimensions': [{'name': 'ga:country'}, {'name': 'ga:date'}]},
    
    'view_id_333' : {'view_id': '333', 
                           'start_date': '2019-01-01', 
                           'end_date': '2019-05-01',
                           'metrics': [{'expression': 'ga:sessions'}, {'expression':'ga:users'}],
                           'dimensions': [{'name': 'ga:country'}, {'name': 'ga:date'}]} 
}

Send the requests to the Google API, convert each response to a DataFrame, and store the results in a list

responses = []

for k, v in data_test.items():
    
    sample_request = {
        'viewId': v['view_id'],
        'dateRanges': {
            'startDate': v['start_date'],
            'endDate': v['end_date']
        },
        'metrics': v['metrics'],
        'dimensions': v['dimensions']
    }
    
    response = analytics.reports().batchGet(
        body={
            'reportRequests': sample_request
        }).execute()
    
    n_response=print_response_new_test(response)
    responses.append(n_response)

Fetch the current SQL table with the GA data

def get_current_sql_gadata_table():
    global sql_table_current_gadata
    sql_table_current_gadata = pd.read_sql('SELECT * FROM Table', con=conn)
    sql_table_current_gadata['date'] = pd.to_datetime(sql_table_current_gadata['date'])
    return sql_table_current_gadata

Finally, compare the two DataFrames and, if they differ, update the SQL table


def compare_df_gadata():
    
    for report in responses:
        response=pd.DataFrame.equals(sql_table_current_gadata, report)
        if response==False:
            compared_dfs = pd.concat([sql_table_current_gadata, report], sort=False)
            compared_dfs.drop_duplicates(keep=False, inplace=True)
    
            #sql params in sqlalchemy
            params = urllib.parse.quote_plus(#params)
            engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(params))

            #insert new values to the sql table
            compared_dfs.to_sql('Table', con=engine, if_exists='append', index=False)

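I also tried merging the two tables, but the result is the same. Perhaps it would make more sense to do the check in MS Studio?

This does not work properly either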

df_outer = pd.merge(sql_table_current_gadata, report, on=None, how='left', sort=True)
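
A left merge that starts from the SQL table keeps the SQL rows, not the new ones. If the goal is "rows of the report that are not yet in the SQL table", one common pattern is an anti-join using the `indicator=True` option of `merge`. This is a minimal sketch with hypothetical toy frames, assuming both frames share the same columns and dtypes; it is not necessarily how the original script should be structured.

```python
import pandas as pd

# Hypothetical stand-ins for the SQL table and one GA report.
sql_df = pd.DataFrame({"country": ["US", "DE"], "sessions": [10, 5]})
report = pd.DataFrame({"country": ["US", "FR"], "sessions": [10, 7]})

# Left-join the report onto the SQL rows; "_merge" records where each row was found.
merged = report.merge(
    sql_df.drop_duplicates(),      # avoid multiplying rows if SQL already has duplicates
    how="left",
    on=list(report.columns),
    indicator=True,
)

# Keep only rows that exist in the report but not in the SQL table.
new_rows = merged.loc[merged["_merge"] == "left_only", report.columns]
print(new_rows)
```

Only `new_rows` would then be passed to `to_sql(..., if_exists='append', index=False)`, so rows already present in the table are never written a second time.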

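Update

I checked once more with the concat function, and it looks like the problem is in the 'index'.

Originally there were 240 rows (the table had 960 rows that already contained duplicates, so I just cleaned up the SQL table and ran the script again).

I have 3 GA accounts, and the current SQL table consists of: 72 rows + 13 rows + 154 rows + header = 240 rows.

When I run the script again, compare with pd.concat, and store the result in a DataFrame (compared_dfs) without sending it to the database, it contains the 154 rows from the last request to the GA API.

I tried to reset the index here: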

if response==False:
            compared_dfs = pd.concat([sql_table_current_gadata, report], sort=False)
            compared_dfs.drop_duplicates(keep=False, inplace=True)
            compared_dfs.reset_index(inplace=True)

But as a result the old index was added as an additional column in compared_dfs

Resulting DataFrame

It shows 2 index columns, one from the SQL table and the other from pandas
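
The extra column is the documented behaviour of `reset_index()`: by default it moves the old index into a new column named `index`. Passing `drop=True` discards the old index instead. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [2, 3]})

# Both frames carry the index 0, 1, so the concatenated index repeats.
combined = pd.concat([a, b], sort=False).drop_duplicates(keep=False)

with_index_col = combined.reset_index()       # old index becomes a column named "index"
clean = combined.reset_index(drop=True)       # old index is discarded

print(list(with_index_col.columns))
print(list(clean.columns))
```

`pd.concat([...], ignore_index=True)` is another way to get a fresh index at concat time. If the SQL table itself already contains an index column (for example from an earlier `to_sql` call without `index=False`, which is only a guess here), that column would have to be dropped after `read_sql` before the frames can compare equal.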

Answer 1

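Your question is detailed but quite clear. I would first ask whether you are sure about your indices; could you try merging on specific columns to see whether that solves the problem? I am focusing on the pandas part first, since it seems to be the core of your question.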

import pandas as pd
import numpy as np

merge = True
concat = False

anp = np.ones((2, 5))
anp[1, 1] = 3
anp[1, 4] = 3
bnp = np.ones((1, 5))
bnp[0, 1] = 4  # use 4 to make it different, also works with nan
bnp[0, 4] = 4  # use 4 to make it different, also works with nan
a = pd.DataFrame(anp)
b = pd.DataFrame(bnp)
if merge:
    a.rename(columns=dict(zip(range(5), ['a', 'b', 'c', 'd', 'e'])), inplace=True)
    b.rename(columns=dict(zip(range(5), ['a', 'b', 'c', 'd', 'e'])), inplace=True)
    # choose suitable and meaningful column(s) for your merge (do you have any id column etc.?)
    a = pd.merge(a, b, how='outer', copy=False, on=['a', 'c', 'd', 'e'])
    # che
    print(a)

if concat:
    # can use ignore_index or pass keys to maintain distinction
    c = pd.concat((a, b), axis=0, join='outer', keys=['a', 'b'])
    print(c)
    c.drop_duplicates(inplace=True)
    print(c)

Answer 2

I was checking Luca Peruzzo's solution, but it crashes if a column is empty

Get the current list of columns from the SQL table

list_of_col = list(sql_table_current_gadata.columns)

Iterate over the reports (GA API responses) in the responses list

for report in responses:
    df_outer = pd.merge(test, report, how='outer', copy=False, on=list_of_col)

This raises an error

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-321-4fbfe59db175> in <module>
      1 for report in responses:
----> 2     df_outer = pd.merge(test, report, how='outer', copy=False, on=list_of_col)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     45                          right_index=right_index, sort=sort, suffixes=suffixes,
     46                          copy=copy, indicator=indicator,
---> 47                          validate=validate)
     48     return op.get_result()
     49 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    527         (self.left_join_keys,
    528          self.right_join_keys,
--> 529          self.join_names) = self._get_merge_keys()
    530 
    531         # validate the merge keys dtypes. We may need to coerce

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py in _get_merge_keys(self)
    831                         if rk is not None:
    832                             right_keys.append(
--> 833                                 right._get_label_or_level_values(rk))
    834                         else:
    835                             # work-around for merge_asof(right_index=True)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis)
   1704             values = self.axes[axis].get_level_values(key)._values
   1705         else:
-> 1706             raise KeyError(key)
   1707 
   1708         # Check for duplicates

KeyError: 'userGender'

list_of_col contains:

['view_id',
 'start_date',
 'end_date',
 'userType',
 'userGender',
 'userAgeBracket',
 'sourceMedium',
 'source',
 'socialNetwork',
 'region',
 'regionId',
 'pageTitle',
 'pagePath',
 'pageDepth',
 'operatingSystemVersion',
 'operatingSystem',
 'mobileDeviceModel',
 'mobileDeviceMarketingName',
 'mobileDeviceInfo',
 'mobileDeviceBranding',
 'medium',
 'deviceCategory',
 'dataSource',
 'country',
 'continent',
 'continentId',
 'cityId',
 'city',
 'users',
 'sessions',
 'sessionDuration',
 'pageviews',
 'newUsers',
 'bounces',
 'date']

I also checked that 'userGender' contains only None values, and the merge crashes on every empty column
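
The traceback shows the `KeyError` being raised while pandas looks up the key in the right-hand frame, which suggests that `report` simply has no `userGender` column (each view requests different dimensions), rather than the None values themselves being the problem. One way around it, sketched here with hypothetical toy frames, is to merge only on the columns both frames actually share:

```python
import pandas as pd

# Hypothetical stand-ins: the SQL table has a column the report never requested.
sql_df = pd.DataFrame({"country": ["US"], "userGender": [None], "sessions": [10]})
report = pd.DataFrame({"country": ["US", "FR"], "sessions": [10, 7]})

list_of_col = list(sql_df.columns)

# Restrict the merge keys to columns that exist in both frames.
shared = [c for c in list_of_col if c in report.columns]

df_outer = pd.merge(sql_df, report, how="outer", on=shared)
print(df_outer)
```

Rows that come only from the report get missing values in the columns they do not have, such as `userGender` here.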
