CSV 到 json 使用 NiFi 的动态模式

CSV to json with dynamic schema using NiFi

我正在从第 3 方获取 CSV 文件。此文件的架构是动态的,我唯一可以确定的是,

  1. 每个包含数据的列也将具有 header 名称。
  2. 文件将始终有一个 header.
  3. header 名称将始终是一串不带空格的字母 和点。 (所以,有点 "clean")。
  4. 值应该被视为字符串,因为我不确定它们是什么 即将发送。

现在要在我的系统中使用这种类型的数据,我正在考虑使用 MongoDB 作为暂存区。因为没有。列的数量、列的顺序或列名称在一次加载到另一次加载之间不是恒定的。我认为 MongoDB 将成为一个很好的暂存区。

我阅读了有关 ConvertRecord 处理器的信息,它是 CSV 到 JSON 转换器的理想选择,但我没有架构。我只希望每一行都作为文档,header 名称作为键,值作为值。

我该怎么办?此外,此文件将在 25-30 GB 范围内,因此我不想关闭我的系统。

我想过用我自己的处理器来做(在Java),并且我能够得到我想要的东西,但它似乎花费了太多时间,而且有点不' 看起来最佳。

请告诉我,这是否可以通过现有处理器实现?

谢谢, 拉凯什

更新时间:2018 年 9 月 5 日

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><template encoding-version="1.2"><description></description><groupId>a2bd0551-0165-1000-7c6a-a32ca4db047c</groupId><name>csv_to_json_no_schema_v1</name><snippet><connections><id>91bc4a66-704c-3a2f-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold><backPressureObjectThreshold>10000</backPressureObjectThreshold><destination><groupId>defb04c4-c15c-3a07-0000-000000000000</groupId><id>bb6c25ae-f2b6-386a-0000-000000000000</id><type>PROCESSOR</type></destination><flowFileExpiration>0 sec</flowFileExpiration><labelIndex>1</labelIndex><name></name><selectedRelationships>success</selectedRelationships><source><groupId>defb04c4-c15c-3a07-0000-000000000000</groupId><id>eb6cd54a-e1f1-3871-0000-000000000000</id><type>PROCESSOR</type></source><zIndex>0</zIndex></connections><connections><id>ad804e3c-f233-3556-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold><backPressureObjectThreshold>10000</backPressureObjectThreshold><destination><groupId>defb04c4-c15c-3a07-0000-000000000000</groupId><id>64b15a56-8a5f-3297-0000-000000000000</id><type>PROCESSOR</type></destination><flowFileExpiration>0 sec</flowFileExpiration><labelIndex>1</labelIndex><name></name><selectedRelationships>invalid</selectedRelationships><source><groupId>defb04c4-c15c-3a07-0000-000000000000</groupId><id>bb6c25ae-f2b6-386a-0000-000000000000</id><type>PROCESSOR</type></source><zIndex>0</zIndex></connections><connections><id>c30bd123-c436-36ce-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold><backPressureObjectThreshold>10000</backPressureObjectThreshold><destination><groupId>defb04c4-c15c-3a07-0000-000000000000</groupId><id>8a0e37da-acd2-3d72-0000-000000000000</id><type>PROCESSOR</type></destination><flowFileExpiration>0 sec</flowFileExpiration><labelIndex>1</labelIndex><name></name><selectedRelationships>valid</selectedRelationships><source><groupId>defb04c4-c15c-3a07-0000-000000000000</groupId><id>bb6c25ae-f2b6-386a-0000-000000000000</id><type>PROCESSOR</type></source><zIndex>0</zIndex></connections><connections><id>247d2139-26b7-31fe-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold><backPressureObjectThreshold>10000</backPressureObjectThreshold><destination><groupId>defb04c4-c15c-3a07-0000-000000000000</groupId><id>1297bea9-b30f-3f45-0000-000000000000</id><type>PROCESSOR</type></destination><flowFileExpiration>0 sec</flowFileExpiration><labelIndex>1</labelIndex><name></name><selectedRelationships>failure</selectedRelationships><source><groupId>defb04c4-c15c-3a07-0000-000000000000</groupId><id>8a0e37da-acd2-3d72-0000-000000000000</id><type>PROCESSOR</type></source><zIndex>0</zIndex></connections><connections><id>45e5403f-99f7-3ddf-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold><backPressureObjectThreshold>10000</backPressureObjectThreshold><destination><groupId>defb04c4-c15c-3a07-0000-000000000000</groupId><id>9f8f32f7-130c-35bd-0000-000000000000</id><type>PROCESSOR</type></destination><flowFileExpiration>0 sec</flowFileExpiration><labelIndex>1</labelIndex><name></name><selectedRelationships>success</selectedRelationships><source><groupId>defb04c4-c15c-3a07-0000-000000000000</groupId><id>8a0e37da-acd2-3d72-0000-000000000000</id><type>PROCESSOR</type></source><zIndex>0</zIndex></connections><controllerServices><id>88b0195a-34b2-34f0-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><bundle><artifact>nifi-record-serialization-services-nar</artifact><group>org.apache.nifi</group><version>1.6.0</version></bundle><comments></comments><descriptors><entry><key>Schema Write Strategy</key><value><name>Schema Write Strategy</name></value></entry><entry><key>schema-access-strategy</key><value><name>schema-access-strategy</name></value></entry><entry><key>schema-registry</key><value><identifiesControllerService>org.apache.nifi.schemaregistry.services.SchemaRegistry</identifiesControllerService><name>schema-registry</name></value></entry><entry><key>schema-name</key><value><name>schema-name</name></value></entry><entry><key>schema-version</key><value><name>schema-version</name></value></entry><entry><key>schema-branch</key><value><name>schema-branch</name></value></entry><entry><key>schema-text</key><value><name>schema-text</name></value></entry><entry><key>Date Format</key><value><name>Date Format</name></value></entry><entry><key>Time Format</key><value><name>Time Format</name></value></entry><entry><key>Timestamp Format</key><value><name>Timestamp Format</name></value></entry><entry><key>Pretty Print JSON</key><value><name>Pretty Print JSON</name></value></entry><entry><key>suppress-nulls</key><value><name>suppress-nulls</name></value></entry></descriptors><name>JsonRecordSetWriter</name><persistsState>false</persistsState><properties><entry><key>Schema Write Strategy</key><value>no-schema</value></entry><entry><key>schema-access-strategy</key></entry><entry><key>schema-registry</key></entry><entry><key>schema-name</key></entry><entry><key>schema-version</key></entry><entry><key>schema-branch</key></entry><entry><key>schema-text</key></entry><entry><key>Date Format</key></entry><entry><key>Time Format</key></entry><entry><key>Timestamp Format</key></entry><entry><key>Pretty Print JSON</key></entry><entry><key>suppress-nulls</key></entry></properties><state>ENABLED</state><type>org.apache.nifi.json.JsonRecordSetWriter</type></controllerServices><controllerServices><id>c3e80a29-498b-36d4-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><bundle><artifact>nifi-record-serialization-services-nar</artifact><group>org.apache.nifi</group><version>1.6.0</version></bundle><comments></comments><descriptors><entry><key>schema-access-strategy</key><value><name>schema-access-strategy</name></value></entry><entry><key>schema-registry</key><value><identifiesControllerService>org.apache.nifi.schemaregistry.services.SchemaRegistry</identifiesControllerService><name>schema-registry</name></value></entry><entry><key>schema-name</key><value><name>schema-name</name></value></entry><entry><key>schema-version</key><value><name>schema-version</name></value></entry><entry><key>schema-branch</key><value><name>schema-branch</name></value></entry><entry><key>schema-text</key><value><name>schema-text</name></value></entry><entry><key>csv-reader-csv-parser</key><value><name>csv-reader-csv-parser</name></value></entry><entry><key>Date Format</key><value><name>Date Format</name></value></entry><entry><key>Time Format</key><value><name>Time Format</name></value></entry><entry><key>Timestamp Format</key><value><name>Timestamp Format</name></value></entry><entry><key>CSV Format</key><value><name>CSV Format</name></value></entry><entry><key>Value Separator</key><value><name>Value Separator</name></value></entry><entry><key>Skip Header Line</key><value><name>Skip Header Line</name></value></entry><entry><key>ignore-csv-header</key><value><name>ignore-csv-header</name></value></entry><entry><key>Quote Character</key><value><name>Quote Character</name></value></entry><entry><key>Escape Character</key><value><name>Escape Character</name></value></entry><entry><key>Comment Marker</key><value><name>Comment Marker</name></value></entry><entry><key>Null String</key><value><name>Null String</name></value></entry><entry><key>Trim Fields</key><value><name>Trim Fields</name></value></entry><entry><key>csvutils-character-set</key><value><name>csvutils-character-set</name></value></entry></descriptors><name>CSVReader</name><persistsState>false</persistsState><properties><entry><key>schema-access-strategy</key></entry><entry><key>schema-registry</key></entry><entry><key>schema-name</key></entry><entry><key>schema-version</key></entry><entry><key>schema-branch</key></entry><entry><key>schema-text</key></entry><entry><key>csv-reader-csv-parser</key></entry><entry><key>Date Format</key></entry><entry><key>Time Format</key></entry><entry><key>Timestamp Format</key></entry><entry><key>CSV Format</key></entry><entry><key>Value Separator</key></entry><entry><key>Skip Header Line</key><value>true</value></entry><entry><key>ignore-csv-header</key><value>true</value></entry><entry><key>Quote Character</key></entry><entry><key>Escape Character</key></entry><entry><key>Comment Marker</key></entry><entry><key>Null String</key></entry><entry><key>Trim Fields</key></entry><entry><key>csvutils-character-set</key></entry></properties><state>ENABLED</state><type>org.apache.nifi.csv.CSVReader</type></controllerServices><processors><id>8a0e37da-acd2-3d72-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><position><x>0.0</x><y>227.99996948242188</y></position><bundle><artifact>nifi-standard-nar</artifact><group>org.apache.nifi</group><version>1.6.0</version></bundle><config><bulletinLevel>WARN</bulletinLevel><comments></comments><concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount><descriptors><entry><key>record-reader</key><value><identifiesControllerService>org.apache.nifi.serialization.RecordReaderFactory</identifiesControllerService><name>record-reader</name></value></entry><entry><key>record-writer</key><value><identifiesControllerService>org.apache.nifi.serialization.RecordSetWriterFactory</identifiesControllerService><name>record-writer</name></value></entry></descriptors><executionNode>ALL</executionNode><lossTolerant>false</lossTolerant><penaltyDuration>30 sec</penaltyDuration><properties><entry><key>record-reader</key><value>c3e80a29-498b-36d4-0000-000000000000</value></entry><entry><key>record-writer</key><value>88b0195a-34b2-34f0-0000-000000000000</value></entry></properties><runDurationMillis>0</runDurationMillis><schedulingPeriod>0 sec</schedulingPeriod><schedulingStrategy>TIMER_DRIVEN</schedulingStrategy><yieldDuration>1 sec</yieldDuration></config><name>ConvertRecord</name><relationships><autoTerminate>false</autoTerminate><name>failure</name></relationships><relationships><autoTerminate>false</autoTerminate><name>success</name></relationships><state>STOPPED</state><style/><type>org.apache.nifi.processors.standard.ConvertRecord</type></processors><processors><id>9f8f32f7-130c-35bd-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><position><x>11.0</x><y>483.0</y></position><bundle><artifact>nifi-standard-nar</artifact><group>org.apache.nifi</group><version>1.6.0</version></bundle><config><bulletinLevel>WARN</bulletinLevel><comments></comments><concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount><descriptors><entry><key>Log Level</key><value><name>Log Level</name></value></entry><entry><key>Log Payload</key><value><name>Log Payload</name></value></entry><entry><key>Attributes to Log</key><value><name>Attributes to Log</name></value></entry><entry><key>attributes-to-log-regex</key><value><name>attributes-to-log-regex</name></value></entry><entry><key>Attributes to Ignore</key><value><name>Attributes to Ignore</name></value></entry><entry><key>attributes-to-ignore-regex</key><value><name>attributes-to-ignore-regex</name></value></entry><entry><key>Log prefix</key><value><name>Log prefix</name></value></entry><entry><key>character-set</key><value><name>character-set</name></value></entry></descriptors><executionNode>ALL</executionNode><lossTolerant>false</lossTolerant><penaltyDuration>30 sec</penaltyDuration><properties><entry><key>Log Level</key><value>info</value></entry><entry><key>Log Payload</key><value>false</value></entry><entry><key>Attributes to Log</key></entry><entry><key>attributes-to-log-regex</key><value>.*</value></entry><entry><key>Attributes to Ignore</key></entry><entry><key>attributes-to-ignore-regex</key></entry><entry><key>Log prefix</key></entry><entry><key>character-set</key><value>UTF-8</value></entry></properties><runDurationMillis>0</runDurationMillis><schedulingPeriod>0 sec</schedulingPeriod><schedulingStrategy>TIMER_DRIVEN</schedulingStrategy><yieldDuration>1 sec</yieldDuration></config><name>LogAttribute</name><relationships><autoTerminate>true</autoTerminate><name>success</name></relationships><state>STOPPED</state><style/><type>org.apache.nifi.processors.standard.LogAttribute</type></processors><processors><id>bb6c25ae-f2b6-386a-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><position><x>670.0</x><y>225.0</y></position><bundle><artifact>nifi-standard-nar</artifact><group>org.apache.nifi</group><version>1.6.0</version></bundle><config><bulletinLevel>WARN</bulletinLevel><comments></comments><concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount><descriptors><entry><key>validate-csv-schema</key><value><name>validate-csv-schema</name></value></entry><entry><key>validate-csv-header</key><value><name>validate-csv-header</name></value></entry><entry><key>validate-csv-delimiter</key><value><name>validate-csv-delimiter</name></value></entry><entry><key>validate-csv-quote</key><value><name>validate-csv-quote</name></value></entry><entry><key>validate-csv-eol</key><value><name>validate-csv-eol</name></value></entry><entry><key>validate-csv-strategy</key><value><name>validate-csv-strategy</name></value></entry></descriptors><executionNode>ALL</executionNode><lossTolerant>false</lossTolerant><penaltyDuration>30 sec</penaltyDuration><properties><entry><key>validate-csv-schema</key><value>NotNull,ParseInt(),Optional(ParseInt()),Null</value></entry><entry><key>validate-csv-header</key><value>true</value></entry><entry><key>validate-csv-delimiter</key><value>,</value></entry><entry><key>validate-csv-quote</key><value>"</value></entry><entry><key>validate-csv-eol</key><value>\n</value></entry><entry><key>validate-csv-strategy</key><value>Line by line validation</value></entry></properties><runDurationMillis>0</runDurationMillis><schedulingPeriod>0 sec</schedulingPeriod><schedulingStrategy>TIMER_DRIVEN</schedulingStrategy><yieldDuration>1 sec</yieldDuration></config><name>ValidateCsv</name><relationships><autoTerminate>false</autoTerminate><name>invalid</name></relationships><relationships><autoTerminate>false</autoTerminate><name>valid</name></relationships><state>STOPPED</state><style/><type>org.apache.nifi.processors.standard.ValidateCsv</type></processors><processors><id>eb6cd54a-e1f1-3871-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><position><x>688.0</x><y>0.0</y></position><bundle><artifact>nifi-standard-nar</artifact><group>org.apache.nifi</group><version>1.6.0</version></bundle><config><bulletinLevel>WARN</bulletinLevel><comments></comments><concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount><descriptors><entry><key>File Size</key><value><name>File Size</name></value></entry><entry><key>Batch Size</key><value><name>Batch Size</name></value></entry><entry><key>Data Format</key><value><name>Data Format</name></value></entry><entry><key>Unique FlowFiles</key><value><name>Unique FlowFiles</name></value></entry><entry><key>generate-ff-custom-text</key><value><name>generate-ff-custom-text</name></value></entry><entry><key>character-set</key><value><name>character-set</name></value></entry><entry><key>schema.name</key><value><name>schema.name</name></value></entry></descriptors><executionNode>ALL</executionNode><lossTolerant>false</lossTolerant><penaltyDuration>30 sec</penaltyDuration><properties><entry><key>File Size</key><value>0B</value></entry><entry><key>Batch Size</key><value>1</value></entry><entry><key>Data Format</key><value>Text</value></entry><entry><key>Unique FlowFiles</key><value>false</value></entry><entry><key>generate-ff-custom-text</key><value>name,age,int_val,address Rakesh Prasad,0,99,"address 12 33333, 444441" rakesh Prasad1,1,,"address 12 33333, 444442" rakesh Prasad2,2,55,"address 12 33333, 444443" rakesh Prasad3,,33,"address 12 33333, 444444"</value></entry><entry><key>character-set</key><value>UTF-8</value></entry><entry><key>schema.name</key><value>empData</value></entry></properties><runDurationMillis>0</runDurationMillis><schedulingPeriod>1 day</schedulingPeriod><schedulingStrategy>TIMER_DRIVEN</schedulingStrategy><yieldDuration>1 sec</yieldDuration></config><name>GenerateFlowFile</name><relationships><autoTerminate>false</autoTerminate><name>success</name></relationships><state>STOPPED</state><style/><type>org.apache.nifi.processors.standard.GenerateFlowFile</type></processors><processors><id>1297bea9-b30f-3f45-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><position><x>450.0</x><y>539.0</y></position><bundle><artifact>nifi-standard-nar</artifact><group>org.apache.nifi</group><version>1.6.0</version></bundle><config><bulletinLevel>WARN</bulletinLevel><comments></comments><concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount><descriptors><entry><key>Log Level</key><value><name>Log Level</name></value></entry><entry><key>Log Payload</key><value><name>Log Payload</name></value></entry><entry><key>Attributes to Log</key><value><name>Attributes to Log</name></value></entry><entry><key>attributes-to-log-regex</key><value><name>attributes-to-log-regex</name></value></entry><entry><key>Attributes to Ignore</key><value><name>Attributes to Ignore</name></value></entry><entry><key>attributes-to-ignore-regex</key><value><name>attributes-to-ignore-regex</name></value></entry><entry><key>Log prefix</key><value><name>Log prefix</name></value></entry><entry><key>character-set</key><value><name>character-set</name></value></entry></descriptors><executionNode>ALL</executionNode><lossTolerant>false</lossTolerant><penaltyDuration>30 sec</penaltyDuration><properties><entry><key>Log Level</key><value>info</value></entry><entry><key>Log Payload</key><value>false</value></entry><entry><key>Attributes to Log</key></entry><entry><key>attributes-to-log-regex</key><value>.*</value></entry><entry><key>Attributes to Ignore</key></entry><entry><key>attributes-to-ignore-regex</key></entry><entry><key>Log prefix</key></entry><entry><key>character-set</key><value>UTF-8</value></entry></properties><runDurationMillis>0</runDurationMillis><schedulingPeriod>0 sec</schedulingPeriod><schedulingStrategy>TIMER_DRIVEN</schedulingStrategy><yieldDuration>1 sec</yieldDuration></config><name>LogAttribute</name><relationships><autoTerminate>true</autoTerminate><name>success</name></relationships><state>STOPPED</state><style/><type>org.apache.nifi.processors.standard.LogAttribute</type></processors><processors><id>64b15a56-8a5f-3297-0000-000000000000</id><parentGroupId>defb04c4-c15c-3a07-0000-000000000000</parentGroupId><position><x>837.0</x><y>482.0000305175781</y></position><bundle><artifact>nifi-standard-nar</artifact><group>org.apache.nifi</group><version>1.6.0</version></bundle><config><bulletinLevel>WARN</bulletinLevel><comments></comments><concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount><descriptors><entry><key>Log Level</key><value><name>Log Level</name></value></entry><entry><key>Log Payload</key><value><name>Log Payload</name></value></entry><entry><key>Attributes to Log</key><value><name>Attributes to Log</name></value></entry><entry><key>attributes-to-log-regex</key><value><name>attributes-to-log-regex</name></value></entry><entry><key>Attributes to Ignore</key><value><name>Attributes to Ignore</name></value></entry><entry><key>attributes-to-ignore-regex</key><value><name>attributes-to-ignore-regex</name></value></entry><entry><key>Log prefix</key><value><name>Log prefix</name></value></entry><entry><key>character-set</key><value><name>character-set</name></value></entry></descriptors><executionNode>ALL</executionNode><lossTolerant>false</lossTolerant><penaltyDuration>30 sec</penaltyDuration><properties><entry><key>Log Level</key><value>info</value></entry><entry><key>Log Payload</key><value>false</value></entry><entry><key>Attributes to Log</key></entry><entry><key>attributes-to-log-regex</key><value>.*</value></entry><entry><key>Attributes to Ignore</key></entry><entry><key>attributes-to-ignore-regex</key></entry><entry><key>Log prefix</key></entry><entry><key>character-set</key><value>UTF-8</value></entry></properties><runDurationMillis>0</runDurationMillis><schedulingPeriod>0 sec</schedulingPeriod><schedulingStrategy>TIMER_DRIVEN</schedulingStrategy><yieldDuration>1 sec</yieldDuration></config><name>LogAttribute</name><relationships><autoTerminate>true</autoTerminate><name>success</name></relationships><state>STOPPED</state><style/><type>org.apache.nifi.processors.standard.LogAttribute</type></processors></snippet><timestamp>09/05/2018 01:32:27 EDT</timestamp></template>

您可以将 ConvertRecord 与 CSV Reader 一起使用,并在 CSV Reader 中为架构访问策略选择 "Use String Fields From Header"。这将从 header.

动态创建一个模式