Extracting raw data from a PowerPivot model using Python

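A seemingly trivial task turned into a real nightmare when I had to read some data from a PowerPivot model using Python. I believe I have researched this thoroughly over the last couple of days, but I have now hit a brick wall and would appreciate some help from the Python/SSAS/ADO community.

Basically, all I want to do is programmatically access the raw data stored in PowerPivot models. My idea was to connect to the underlying PowerPivot (i.e. MS Analysis Services) engine via one of the methods listed below, list the tables contained in the model, and then extract the raw data from each table using a simple DAX query (something like EVALUATE (table_name)). Easy peasy, right? Well, maybe not.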
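
The per-table queries described above can be generated mechanically. The sketch below is only an illustration: build_evaluate_query is a hypothetical helper name, and it assumes the usual DAX convention of wrapping table names in single quotes (doubling any embedded quote) so that names containing spaces stay valid.

```python
def build_evaluate_query(table_name):
    """Return a DAX query that dumps every row of one table."""
    # Quote the name so tables such as "Sales 2010" are accepted;
    # an embedded single quote is escaped by doubling it.
    quoted = "'{0}'".format(table_name.replace("'", "''"))
    return "EVALUATE {0}".format(quoted)


# "Sales 2010" is a made-up table name, used for illustration only.
for name in ["dbo_DimDate", "Sales 2010"]:
    print(build_evaluate_query(name))
```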

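0. Some background information

As you can see, I have tried several different approaches. I will document everything as carefully as possible so that those unfamiliar with PowerPivot functionality get a good idea of what I am trying to do.

First of all, some background on programmatic access to the Analysis Services engine (it says SQL Server 2005, but all of it should still be applicable): SQL Server Data Mining Programmability and Data providers used for Analysis Services connections.

The sample Excel/PowerPivot file I will be using in the examples below can be found here: Microsoft PowerPivot for Excel 2010 and PowerPivot in Excel 2013 Samples.

Also, note that I am using Excel 2010, so some of my code is version-specific. For example, if you are using Excel 2013, wb.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection should be wb.Model.DataModelConnection.ModelConnection.ADOConnection.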
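
The version difference can be isolated in one place. What follows is a minimal sketch, assuming the workbook object exposes the two attribute paths quoted above and that Excel 2010 and Excel 2013 report major versions 14 and 15 respectively. get_ado_connection is a hypothetical helper, and the stand-in objects at the bottom only mimic the COM objects so the logic can be followed without Excel.

```python
from types import SimpleNamespace


def get_ado_connection(workbook, excel_major_version):
    """Return the ADO connection behind the embedded PowerPivot model."""
    if excel_major_version >= 15:
        # Excel 2013 and later: the model hangs off Workbook.Model.
        return workbook.Model.DataModelConnection.ModelConnection.ADOConnection
    # Excel 2010: the model is a named workbook connection.
    return workbook.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection


# Stand-ins for the COM objects, used for illustration only.
wb_2010 = SimpleNamespace(Connections={
    "PowerPivot Data": SimpleNamespace(
        OLEDBConnection=SimpleNamespace(ADOConnection="ado-2010"))})
wb_2013 = SimpleNamespace(Model=SimpleNamespace(
    DataModelConnection=SimpleNamespace(
        ModelConnection=SimpleNamespace(ADOConnection="ado-2013"))))

print(get_ado_connection(wb_2010, 14))
print(get_ado_connection(wb_2013, 15))
```

With a real Excel instance, the major version could be derived from the application's Version property, for example int(float(XlApp.Version)).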

The connection string I will be using throughout this question is based on the information found here: Connect to PowerPivot engine with C#. Additionally, some of the methods apparently require some sort of initialization of the PowerPivot model prior to data retrieval. See here: Automating PowerPivot Refresh operation from VBA.
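
The connection string used in the attempts below is a list of key=value pairs separated by semicolons. Here is a small sketch of assembling it for an arbitrary workbook path. build_conn_string is a hypothetical helper, and the keys are simply the ones that appear in this question.

```python
def build_conn_string(workbook_path):
    """Assemble the embedded-PowerPivot connection string for one workbook."""
    parts = [
        ("Provider", "MSOLAP"),
        ("Data Source", "$Embedded$"),
        ("Locale Identifier", "1033"),
        ("Location", workbook_path),
        ("SQLQueryMode", "DataKeys"),
    ]
    return ";".join("{0}={1}".format(key, value) for key, value in parts)


# A raw string keeps the backslash in the Windows path literal.
print(build_conn_string(r"H:\PowerPivotTutorialSample.xlsx"))
```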

Finally, here are a couple of links suggesting that this should be achievable (note, however, that these links mainly refer to C#, not Python):

1. Using ADOMD

import clr
clr.AddReference("Microsoft.AnalysisServices.AdomdClient")
import Microsoft.AnalysisServices.AdomdClient as ADOMD
ConnString = ("Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;"
              r"Location=H:\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys")

Connection = ADOMD.AdomdConnection(ConnString)
Connection.Open()

The issue here appears to be that the PowerPivot model has not been initialized:

AdomdConnectionException: A connection cannot be made. Ensure that the server is running.

2. Using AMO

import clr
clr.AddReference("Microsoft.AnalysisServices")
import Microsoft.AnalysisServices as AMO
ConnString = ("Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;"
              r"Location=H:\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys")

Connection = AMO.Server()
Connection.Connect(ConnString)

Same story, "the server is not running":

ConnectionException: A connection cannot be made. Ensure that the server is running.

Note that AMO is technically not meant for querying data, but I have included it as one of the potential ways of connecting to the PowerPivot model.

3. Using ADO.NET

import clr
clr.AddReference("System.Data")
import System.Data.OleDb as ADONET
ConnString = ("Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;"
              r"Location=H:\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys")

Connection = ADONET.OleDbConnection()
Connection.ConnectionString = ConnString
Connection.Open()

This is similar to What's the simplest way to access mssql with python or ironpython?. Unfortunately, this does not work either:

OleDbException: OLE DB error: OLE DB or ODBC error: The following system error occurred:
The requested name is valid, but no data of the requested type was found.

4. Using ADO via the adodbapi module

import adodbapi
ConnString = ("Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;"
              r"Location=H:\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys")

Connection = adodbapi.connect(ConnString)

Similar to Opposite Workings of OLEDB/ODBC between Python and MS Access VBA. The error I get is:

OperationalError: (com_error(-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB
Provider for SQL Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The
following system error occurred:  The requested name is valid, but no data of the requested
type was found...

This is basically the same problem as with ADO.NET above.

5. Using ADO via Excel/win32com

from win32com.client import Dispatch
Xlfile = r"H:\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
Connection = Workbook.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection
Recordset = Dispatch('ADODB.Recordset')

Query = "EVALUATE(dbo_DimDate)" #sample DAX query
Recordset.Open(Query, Connection)

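The idea for this approach came from this blog post that uses VBA: Export a table or DAX query from Power Pivot to CSV using VBA. Note that this approach uses an explicit Refresh command that initializes the model (i.e. the "server"). Here is the error message: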

com_error: (-2147352567, 'Exception occurred.', (0, u'ADODB.Recordset', u'Arguments are of
the wrong type, are out of acceptable range, or are in conflict with one another.',
u'C:\Windows\HELP\ADO270.CHM', 1240641, -2146825287), None)

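However, the ADO connection does appear to have been established:

The problem seems to lie in the creation of the ADODB.Recordset object.

6. Using ADO via Excel/win32com, with ADODB.Connection directly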

from win32com.client import Dispatch
ConnString = ("Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;"
              r"Location=H:\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys")

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnString)

Similar to Connection to Access from Python [duplicate] and Query access using ADO in Win32 platform (Python recipe). Unfortunately, the error Python spits out is the same as in methods 3 and 4 above:

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The following system
error occurred:  The requested name is valid, but no data of the requested type was found.
..', None, 0, -2147467259), None)

7. Using ADO via Excel/win32com, with ADODB.Connection directly plus a model refresh

from win32com.client import Dispatch
Xlfile = r"H:\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
ConnStringInternal = ("Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog="
                      "Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX "
                      "Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member "
                      "Mode=Error;Optimize Response=3;Cell Error Mode=TextValue")

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnStringInternal)

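I was hoping that I could initialize an instance of Excel, then initialize the PowerPivot model, and then create a connection using the internal connection string Excel uses for embedded PowerPivot data (similar to How do you copy the powerpivot data into the excel workbook as a table?; note that the connection string is different from the one I have used elsewhere). Unfortunately, this does not work, and my guess is that Python starts the ADODB.Connection process in a separate instance (I get the same error message when I execute the last three lines without first initializing Excel, etc.):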

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'Either the user, ****** (masked), does not have access
to the Microsoft_SQLServer_AnalysisServices database, or the database does not exist.',
None, 0, -2147467259), None)

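The problem with getting data out of PowerPivot is that the tabular engine within PowerPivot runs in-process inside Excel, and the only way of connecting to that engine is to have your code running inside Excel as well. (I suspect that it may use shared memory or some other transport, but it definitely does not listen on a TCP port or a named pipe or anything similar that would allow an external process to connect.)

We do this in DAX Studio by running a C# VSTO Excel add-in inside Excel. However, this is only designed for testing analytical queries, not for bulk data extraction. We marshal the data from the add-in to the UI using a string variable, so the entire dataset has to serialize to less than 2 GB; otherwise the response is truncated and you will see an "unrecognizable response" error. (The data is serialized as an XMLA rowset, which is very verbose, so you may see it break when extracting only a few hundred MB of data.)

If you want to build a script to automate the extraction of all the raw data from a model, I do not think you will be able to do it with Python, as I do not believe you can get the Python interpreter running in-process inside Excel. I would look at using a VBA macro like this one: http://www.powerpivotblog.nl/export-a-table-or-dax-query-from-power-pivot-to-csv-using-vba/

You should find that you can query the model using something like "SELECT * FROM $SYSTEM.DBSCHEMA_TABLES" to get a list of tables; you could then loop over each table and extract it using a variation of the code from the link above.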
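
The loop described above might look roughly like the following sketch. It deliberately leaves the data access abstract: list_tables and fetch_table are hypothetical callables standing in for whatever runs the DBSCHEMA_TABLES query and the per-table DAX query, and export_tables is likewise an illustrative name. The stand-in data at the bottom is made up.

```python
import csv
import os
import tempfile


def export_tables(list_tables, fetch_table, out_dir):
    """Write every table returned by list_tables() to <out_dir>/<table>.csv."""
    written = []
    for table in list_tables():
        columns, rows = fetch_table(table)
        path = os.path.join(out_dir, table + ".csv")
        # newline="" lets the csv module control line endings itself.
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(columns)
            writer.writerows(rows)
        written.append(path)
    return written


# Stand-in data source, used for illustration only.
fake_model = {"dbo_DimDate": (["DateKey", "Year"], [[20100101, 2010]])}
out_dir = tempfile.mkdtemp()
paths = export_tables(lambda: sorted(fake_model),
                      lambda name: fake_model[name],
                      out_dir)
print(paths)
```

Table names that contain characters not allowed in file names would need extra handling, which this sketch omits.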

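I got in touch with Tom Gleeson (aka Gobán Saor), and he kindly allowed me to post his emails here. There are some interesting nuggets in them, so hopefully others will find them useful too.

Email #1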

When you say Python, you mean running Python.NET as a standalone exe? If that’s the case, you’re out of luck with Excel PP models (different story for Power BI desktop though). I’ve accessed PP models (2010+) successfully from both VBA, and from Python.NET (via AMO) using similar code to that in your SO question. The difference being (in both VBA & .NET version) is that my code is running in-process within Excel using Excel’s various add-in technologies. (Likely Tableau is also running as an add-in or has embedded Excel within itself enabling similar behaviour). DAX Studio (a useful C# code base to learn the how-tos of PP access) runs both as an Excel add-in and as a standalone EXE, but only as an add-in can it access Excel based PP models.

Email #2

You might find the process of using Python.NET for this somewhat challenging. You would need to embed a Python engine using C#/VB.NET Excel add-in code. I’ve used Excel-DNA (a fantastic open source project) rather than MS’s highly cumbersome "official" method for developing such .NET addins in the past, but I mainly stick to VBA where at all possible.

Using VBA you’ll not be able to access the .NET-only AMO (so no ability to create calculated columns on the fly), but by loading the resulting dataset into an ADO recordset you should be able to output to a worksheet OR to a corporate-database/MS Access OR to a flat-file/CSV etc.

Unlike the 1M worksheet limit, for a flat-file or database output memory (RAM) will be the limiting factor, but, assuming you’re using 64bit Excel and have enough memory to hold the compacted model and the workspace for the largest of the model’s tables in un-compacted form (i.e. a row based rather than column based format that’ll result from a DAX Query), multiplied by 2ish (one instance within PP workspace the other within VBA’s ADO workspace) you should be okay.

Having said that, I’ve never attempted extracting a very large dataset, and using models as a dataset exchange medium is not one of PP’s "use-cases"; so, very large tables might hit some other bug/constraint!
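
The memory rule of thumb from the second email can be written as simple arithmetic. This is only one reading of the email, not a measured formula: it takes the compacted model size plus roughly two copies of the largest table in uncompacted form. estimate_required_ram_gb and the sample figures are hypothetical.

```python
def estimate_required_ram_gb(compacted_model_gb, largest_table_uncompacted_gb,
                             copies=2):
    """Rough RAM needed to export the largest table of a model."""
    # One uncompacted copy lives in the PowerPivot workspace and another
    # in the ADO recordset, hence the default of two copies.
    return compacted_model_gb + copies * largest_table_uncompacted_gb


# Hypothetical sizes: a 0.5 GB model whose largest table expands to 3 GB.
print(estimate_required_ram_gb(0.5, 3.0))
```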

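Lo and behold, I finally managed to crack the problem: it turns out that accessing Power Pivot data using Python is indeed possible! Below is a short recap of what I did; you can find a more detailed description here: Analysis Services (SSAS) on a shoestring. Note: the code has been optimized for neither efficiency nor elegance.

  • Install Microsoft Power BI Desktop (it comes with a free Analysis Services server, so there is no need for a costly SQL Server license; however, the same approach obviously also works if you have a proper license).
  • Start the AS engine by first creating the msmdsrv.ini settings file, then restore the database from the ABF file (using AMO.NET), and then extract the data using ADOMD.NET.

Here is the Python code that illustrates the AS engine + AMO.NET parts: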

import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas

def initialSetup(pathPowerBI):
    sys.path.append(pathPowerBI)

    #required Analysis Services assemblies
    clr.AddReference("Microsoft.PowerBI.Amo.Core")
    clr.AddReference("Microsoft.PowerBI.Amo")     
    clr.AddReference("Microsoft.PowerBI.AdomdClient")

    global AMO, ADOMD
    import Microsoft.AnalysisServices as AMO
    import Microsoft.AnalysisServices.AdomdClient as ADOMD

def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):   
    #create random folder
    os.chdir(pathTarget)
    folder = os.path.join(os.getcwd(), str(random.randrange(10**6, 10**7)))
    os.mkdir(folder)

    #extract PowerPivot model (abf backup)
    archive = zipfile.ZipFile(excelName)
    for member in archive.namelist():
        if ".data" in member:
            filename = os.path.basename(member)
            abfname = os.path.join(folder, filename) + ".abf"
            #copy the embedded model (an ABF backup) out of the workbook archive
            with archive.open(member) as source, open(abfname, 'wb') as target:
                shutil.copyfileobj(source, target)
    archive.close()

    #start the cmd.exe process and record its PID
    process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
    pid = str(process.pid)

    #msmdsrv.ini
    msmdsrvText = '''<ConfigurationSettings>
       <DataDir>{0}</DataDir>
       <TempDir>{0}</TempDir>
       <LogDir>{0}</LogDir>
       <BackupDir>{0}</BackupDir>
       <DeploymentMode>2</DeploymentMode>
       <RecoveryModel>1</RecoveryModel>
       <DisklessModeRequested>0</DisklessModeRequested>
       <CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
       <AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
       <Network>
          <Requests>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
          </Requests>
          <Responses>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
             <CompressionLevel>9</CompressionLevel>
          </Responses>
          <ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
       </Network>
       <Port>{1}</Port>
       <PrivateProcess>{2}</PrivateProcess>
       <InstanceVisible>0</InstanceVisible>
       <Language>1033</Language>
       <Debug>
          <CallStackInError>0</CallStackInError>
       </Debug>
       <Log>
          <Exception>
             <CrashReportsFolder>{0}</CrashReportsFolder>
          </Exception>
          <FlightRecorder>
             <Enabled>0</Enabled>
          </FlightRecorder>
       </Log>
       <AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
       <ResourceGovernance>
          <GovernIMBIScheduler>0</GovernIMBIScheduler>
       </ResourceGovernance>
       <Feature>
          <ManagedCodeEnabled>1</ManagedCodeEnabled>
       </Feature>
       <VertiPaq>
          <EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
          <EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
       </VertiPaq>
    </ConfigurationSettings>'''

    #fill in the required parameters ({0},{1},{2}) and save the ini file to disk
    msmdsrvText = msmdsrvText.format(folder, port, pid)
    with open(os.path.join(folder, "msmdsrv.ini"), "w") as msmdsrvini:
        msmdsrvini.write(msmdsrvText)

    #run AS engine inside the cmd.exe process
    initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
    initString = initString.format(pathPowerBI.replace("/", "\\"), folder)
    #encode and flush so the command also reaches cmd.exe under Python 3
    process.stdin.write((initString + " \n").encode())
    process.stdin.flush()

    #connect to the AS instance from Python
    AMOServer = AMO.Server()
    AMOServer.Connect("localhost:{0}".format(port))

    #restore database from PowerPivot abf backup, disconnect
    AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
    AMOServer.Restore(AMORestoreInfo)
    AMOServer.Disconnect()

    return process
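
One thing the function above does not do is wait for msmdsrv.exe to finish starting before AMOServer.Connect is called, so the connection attempt could fail if the engine is not yet listening. A small polling helper along the following lines might help. It is a sketch using only the standard library; wait_for_port is a hypothetical name and the timeout values are arbitrary.

```python
import socket
import time


def wait_for_port(port, host="localhost", timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Called as wait_for_port(port) right before AMOServer.Connect, it would give the engine time to come up.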

And the data extraction part:

def runQuery(query, port, flag):
    #ADOMD assembly
    ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
    ADOMDConn.Open()
    ADOMDCommand = ADOMDConn.CreateCommand() 
    ADOMDCommand.CommandText = query

    #read data in via AdomdDataReader object
    DataReader = ADOMDCommand.ExecuteReader()

    #get metadata, number of columns
    SchemaTable = DataReader.GetSchemaTable()
    numCol = SchemaTable.Rows.Count #same as DataReader.FieldCount

    #get column names
    columnNames = []
    for i in range(numCol):
        columnNames.append(str(SchemaTable.Rows[i][0]))

    #fill with data
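    data = []
    while DataReader.Read():
        row = []
        for j in range(numCol):
            try:
                row.append(DataReader[j].ToString())
            except Exception:
                #values already converted to native Python types have no ToString()
                row.append(DataReader[j])
        data.append(row)
    #passing columns here also works when the query returns no rows
    df = pandas.DataFrame(data, columns=columnNames)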

    if flag==0:
        DataReader.Close()
        ADOMDConn.Close()

        return df     
    else:   
        #metadata table
        metadataColumnNames = []
        for j in range(SchemaTable.Columns.Count):
            metadataColumnNames.append(SchemaTable.Columns[j].ToString())
        metadata = []
        for i in range(numCol):
            row = []
            for j in range(SchemaTable.Columns.Count):
                try:
                    row.append(SchemaTable.Rows[i][j].ToString())
                except Exception:
                    row.append(SchemaTable.Rows[i][j])
            metadata.append(row)
        metadf = pandas.DataFrame(metadata, columns=metadataColumnNames)

        DataReader.Close()
        ADOMDConn.Close()

        return df, metadf
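
When only the column names are needed, the reader loop above can also be written without the schema table. The sketch below relies on the FieldCount property, GetName(i), Read() and the integer indexer that .NET data readers expose. read_all is a hypothetical helper, and FakeReader merely imitates that interface with made-up data so the logic can be followed without an Analysis Services instance.

```python
def read_all(reader):
    """Drain a .NET-style data reader into (column_names, rows)."""
    count = reader.FieldCount
    names = [str(reader.GetName(i)) for i in range(count)]
    rows = []
    while reader.Read():
        rows.append([reader[i] for i in range(count)])
    return names, rows


class FakeReader(object):
    """Stand-in for AdomdDataReader, used for illustration only."""

    def __init__(self, names, rows):
        self._names, self._rows, self._pos = names, rows, -1
        self.FieldCount = len(names)

    def GetName(self, i):
        return self._names[i]

    def Read(self):
        self._pos += 1
        return self._pos < len(self._rows)

    def __getitem__(self, i):
        return self._rows[self._pos][i]


names, rows = read_all(FakeReader(["ProductKey", "Name"],
                                  [[1, "Bike"], [2, "Helmet"]]))
print(names, rows)
```

The result could then be handed to pandas.DataFrame(rows, columns=names).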

The raw data can then be extracted like this:

pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
initialSetup(pathPowerBI)
session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
endSession(session)
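
endSession is called in the snippet above, but its definition is not included here. A plausible reconstruction, offered purely as an assumption about what it needs to do, is to stop the cmd.exe process returned by restorePowerPivot together with the msmdsrv.exe child it launched. On Windows the sketch uses taskkill with /T to take down the whole process tree; elsewhere it falls back to terminating the process itself.

```python
import subprocess
import sys


def endSession(process):
    """Stop the process returned by restorePowerPivot and its children."""
    if sys.platform == "win32":
        # /T kills the whole tree (cmd.exe and msmdsrv.exe), /F forces it.
        subprocess.call(["taskkill", "/F", "/T", "/PID", str(process.pid)])
    else:
        process.terminate()
    try:
        if process.stdin is not None:
            process.stdin.close()
    except OSError:
        pass  # the pipe is already broken because the process is gone
    process.wait()
```

The temporary folder created by restorePowerPivot is left in place by this sketch and would have to be removed separately.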