Pentaho Spoon transformation from an Excel file

Question

I have yearly data in an Excel file in the following format:

Country \ Years   1980   1981   ...   2010
Abkhazia           234    334   ...    456
Afghanistan        466    789   ...    732
...

I want to transform this data into 3 separate tables and load them into a Postgres database.

The tables should look like this.

First table, countries:

id | name
1  | Abkhazia
2  | Afghanistan

Second table, dates:

id | date
1  | 1980
2  | 1981

The third table stores all the data, keyed by country and date:

country_id    date_id   data
         1          1    234
         1          2    334
         2          1    466
         2          2    789
       ...        ...    ...

Is there any way to achieve this?

Answer 1

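I assumed the source Excel file has the layout described in the question and built a sample file of my own with that structure.

Your question breaks down into 3 parts. To make it easier to follow, I have split the transformation into matching sections:

1. Loading the Country table

Given the data in the Excel file, this one is straightforward. Just do the following: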

Excel Input >> Add a sequence step. Give the Sequence name as Country ID >> Select only the Country Name and Country ID >> Load into the Country Table using Table Output.
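To illustrate what this step chain produces, here is a minimal sketch in plain Python rather than PDI. The sample rows and the names `build_country_table`, `country_id` and `name` are assumptions made for illustration; they are not part of the transformation itself.

```python
# Hypothetical sample rows mirroring the spreadsheet in the question.
rows = [
    {"Country": "Abkhazia", "1980": 234, "1981": 334},
    {"Country": "Afghanistan", "1980": 466, "1981": 789},
]


def build_country_table(rows, name_field="Country"):
    """Mimic 'Add sequence' + 'Select values': give each country a 1-based ID."""
    return [
        {"country_id": seq, "name": row[name_field]}
        for seq, row in enumerate(rows, start=1)
    ]


country_table = build_country_table(rows)
for record in country_table:
    print(record["country_id"], record["name"])
```

With the two sample rows this prints `1 Abkhazia` and `2 Afghanistan`, which is the shape the Country table needs.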

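2. Loading the Year table:

The idea here is to turn the years, which are columns in the Excel source, into rows with a year ID. PDI version 5 and later provide a very useful step for this called Metadata Structure (listed in Spoon as "Metadata structure of stream"). It returns the structure of the incoming data, one row per field. In this case we need to pull out the year columns and ignore the country column.

Follow these steps: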

Read the Excel Data >> Get the Metadata structure of your source >> Filter Out the Country Column (which is available in row at position=1) >> Add a Sequence Number. Name it YearID >> Finally Load the Year Table.
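The same idea can be sketched in plain Python: treat the header as the "metadata", drop the country field, and number what is left. This is only an illustration of the logic, not PDI code; the `header` list and the names `build_year_table`, `year_id` and `year` are assumptions.

```python
# Hypothetical header row, standing in for the field list that the
# metadata step would return for the Excel input.
header = ["Country", "1980", "1981", "2010"]


def build_year_table(header, country_field="Country"):
    """Drop the country field, then give each remaining year field a 1-based ID."""
    year_fields = [name for name in header if name != country_field]
    return [
        {"year_id": seq, "year": int(name)}
        for seq, name in enumerate(year_fields, start=1)
    ]


year_table = build_year_table(header)
for record in year_table:
    print(record["year_id"], record["year"])
```

Filtering by field name instead of by position is a choice made for this sketch; the steps above filter on position 1, which gives the same result when the country column comes first.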

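3. Loading the final table, with country, year and data:

In PDI, the way to turn column values into rows is the Row Normalizer step. Use it to produce the normalized output, then follow these steps: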

Read the Excel source data >> use Row Normalizer Step to normalize the rows based on the Years >> Do a Stream Lookup with the Above Country and Year tables to fetch the CountryID and YearID respectively >> Finally Load the necessary column data into Table Output
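As a rough sketch of what normalizing and then looking up the IDs amounts to, the plain-Python version below unpivots each wide row into one record per year and maps names to IDs with dictionaries. It is not PDI code; the sample rows and the names `normalize` and `build_fact_table` are hypothetical, and the IDs are assumed to be assigned in input order, as in the two earlier parts.

```python
# Hypothetical sample rows mirroring the spreadsheet in the question.
rows = [
    {"Country": "Abkhazia", "1980": 234, "1981": 334},
    {"Country": "Afghanistan", "1980": 466, "1981": 789},
]


def normalize(rows, country_field="Country"):
    """Mimic Row Normalizer: emit one record per (country, year) pair."""
    for row in rows:
        for field, value in row.items():
            if field != country_field:
                yield {"country": row[country_field], "year": field, "data": value}


def build_fact_table(rows, country_field="Country"):
    """Mimic the two Stream Lookups: replace names with their sequence IDs."""
    # Lookup maps, assumed to match the IDs loaded into the Country and Year tables.
    country_ids = {}
    for row in rows:
        country_ids.setdefault(row[country_field], len(country_ids) + 1)
    year_fields = [f for f in rows[0] if f != country_field] if rows else []
    year_ids = {year: seq for seq, year in enumerate(year_fields, start=1)}
    return [
        (country_ids[rec["country"]], year_ids[rec["year"]], rec["data"])
        for rec in normalize(rows, country_field)
    ]


fact_table = build_fact_table(rows)
for country_id, year_id, data in fact_table:
    print(country_id, year_id, data)
```

For the sample rows this yields (1, 1, 234), (1, 2, 334), (2, 1, 466), (2, 2, 789), matching the third table in the question.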

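Hope this helps :)

I have put the code, together with the data file I used, in a GitHub repository. It is here.

Also, I just realized that my naming does not match the convention in your question. Read YearID as your date_id, and note that instead of id I used countryid and yearid.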

postgresql

etl

pentaho

kettle