Using FileHelpers.Dynamic, read a fixed-width file and upload to SQL
Okay, I'll do my best to explain this. I wrote an application that uses a SQL table to define the structure of a fixed-width data source (so: header, start index, field length, and so on). When my application runs, it queries this table and creates a DataTable object (call it finalDT) whose DataColumn objects hold ColumnName = header. I then append to this table a set of DataColumn objects that exist in every data source we use (I tend to call these the derived columns). I also create a primary key field, which is an auto-incrementing integer. Originally I rolled my own solution for reading fixed-width files, but I am trying to convert it to use FileHelpers. Mainly, I want to incorporate it so I can get access to the other file types FileHelpers can parse (CSV, Excel, etc.).
Now, my problem. Using FileHelpers.Dynamic, I was able to create a FileHelperEngine object with the following method:
private static FileHelperEngine GetFixedWidthFileClass(bool ignore)
{
    singletonArguments sArgs = singletonArguments.sArgs;
    singletonSQL sSQL = singletonSQL.sSQL;

    FixedLengthClassBuilder flcb = new FixedLengthClassBuilder(sSQL.FixedDataDefinition.DataTableName);
    flcb.IgnoreFirstLines = 1;
    flcb.IgnoreLastLines = 1;
    flcb.IgnoreEmptyLines = true;

    foreach (var dcs in sSQL.FixedDataDefinition.Columns)
    {
        flcb.AddField(dcs.header, Convert.ToInt32(dcs.length), "String");

        if (ignore && dcs.ignore)
        {
            // If we want to ignore a column, this is how to do it. Would like to incorporate this.
            flcb.LastField.FieldValueDiscarded = true;
            flcb.LastField.Visibility = NetVisibility.Protected;
        }
        else
        {
            flcb.LastField.TrimMode = TrimMode.Both;
            flcb.LastField.FieldNullValue = string.Empty;
        }
    }

    return new FileHelperEngine(flcb.CreateRecordClass());
}
sSQL.FixedDataDefinition.Columns is how I store the field definitions for the fixed-width data source file. I then generate a DataTable by doing:
DataTable dt = engine.ReadFileAsDT(file);
where file is the full path to the fixed-width file and engine is where I hold the result of the GetFixedWidthFileClass() method shown above. Okay, so now I have a DataTable with no primary key and none of the derived columns. Also, every field in dt is marked ReadOnly = true. This is where things start to fall apart.
I need to get dt into finalDT, and it is fine if dt has no primary key information. If that can work, then I can use finalDT to upload my data to my SQL table. If that can't happen, then I need a way for finalDT to have no primary key but still upload to my SQL table. Will SqlBulkCopy allow that? Is there another way?
At this point, as long as I can parse the fixed-width file using FileHelpers and get the results into my SQL table, I am willing to start from scratch; I just don't see the path there.
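For context, this is roughly what I have in mind for the upload step. As far as I know, SqlBulkCopy does not require the source DataTable to have a primary key; the connection string and destination table name below are placeholders, not from my actual setup:

```csharp
using System.Data;
using System.Data.SqlClient;

internal static class BulkUploader
{
    // Sketch: push a keyless DataTable straight into a SQL Server table.
    // "connectionString" and "dbo.FixedWidthData" are illustrative placeholders.
    internal static void Upload(DataTable finalDT, string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var bulk = new SqlBulkCopy(conn))
            {
                bulk.DestinationTableName = "dbo.FixedWidthData";

                // Map by name so the column order in finalDT does not matter.
                foreach (DataColumn col in finalDT.Columns)
                    bulk.ColumnMappings.Add(col.ColumnName, col.ColumnName);

                bulk.WriteToServer(finalDT);
            }
        }
    }
}
```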
Figured it out. It isn't pretty, but here's how it works. Basically, the way I set up the code in my original post still applies, since I made no changes to the GetFixedWidthFileClass() method. I then had to add two methods to get finalDT set up properly:
/// <summary>
/// For a given datasource file, add all rows to the DataSet and collect Hexdump data
/// </summary>
/// <param name="ds">
/// The <see cref="System.Data.DataSet" /> to add to
/// </param>
/// <param name="file">
/// The datasource file to process
/// </param>
internal static void GenerateDatasource(ref DataSet ds, ref FileHelperEngine engine, DataSourceColumnSpecs mktgidSpecs, string file)
{
    // Some singleton class instances to hold program data I will need.
    singletonSQL sSQL = singletonSQL.sSQL;
    singletonArguments sArgs = singletonArguments.sArgs;

    try
    {
        // Load a DataTable with the contents of the datasource file.
        DataTable dt = engine.ReadFileAsDT(file);

        // Clean up the DataTable by removing columns that should be ignored.
        DataTableCleanUp(ref dt, ref engine);

        // ReadFileAsDT() makes all of the columns ReadOnly. Fix that.
        foreach (DataColumn column in dt.Columns)
            column.ReadOnly = false;

        // Okay, now get a Primary Key and add in the derived columns.
        GenerateDatasourceSchema(ref dt);

        // Parse all of the rows and columns to do data clean up and assign some custom
        // values. Add custom values for jobID and serial columns to each row in the DataTable.
        for (int row = 0; row < dt.Rows.Count; row++)
        {
            string version = string.Empty; // The file version
            bool found = false;            // Used to get out of the foreach loops once the required condition is found.

            // Iterate all configured jobs and add the jobID and serial number to each row
            // based upon a match.
            foreach (JobSetupDetails job in sSQL.VznJobDescriptions.JobDetails)
            {
                // Version must match an id in order to update the row. Break out once we
                // find the match to save time.
                version = dt.Rows[row][dt.Columns[mktgidSpecs.header]].ToString().Trim().Split(new char[] { '_' })[0];
                foreach (string id in job.ids)
                {
                    if (version.Equals(id))
                    {
                        dt.Rows[row][dt.Columns["jobid"]] = job.jobID;
                        lock (locklist)
                            dt.Rows[row][dt.Columns["serial"]] = job.serial++;
                        found = true;
                        break;
                    }
                }
                if (found)
                    break;
            }

            // Parse all columns to do data clean up.
            for (int column = 0; column < dt.Columns.Count; column++)
            {
                // This tab character keeps showing up in the data. It should not be there,
                // but the customer won't fix it, so we have to.
                if (dt.Rows[row][column].GetType() == typeof(string))
                    dt.Rows[row][column] = dt.Rows[row][column].ToString().Replace('\t', ' ');
            }
        }
        dt.AcceptChanges();

        // The DataTable is cleaned up and modified. Time to push it into the DataSet.
        lock (locklist)
        {
            // If dt is writing back to the DataSet for the first time, Rows.Count will be
            // zero. Since the DataTable in the DataSet does not have the table schema, and
            // since dt.Copy() is not an option (ds is passed by reference, so Copy() won't
            // work), use Merge() with the MissingSchemaAction.Add option to create the schema.
            if (ds.Tables[sSQL.FixedDataDefinition.DataTableName].Rows.Count == 0)
                ds.Tables[sSQL.FixedDataDefinition.DataTableName].Merge(dt, false, MissingSchemaAction.Add);
            else
            {
                // If this is not the first write to the DataSet, remove the PrimaryKey
                // column to avoid duplicate key values. Use ImportRow() rather than Merge()
                // since, for whatever reason, Merge() is overwriting ds each time it is
                // called, while ImportRow() actually appends the rows. Ugly, but I can't
                // figure out another way to make this work.
                dt.PrimaryKey = null;
                dt.Columns.Remove(dt.Columns[0]);
                foreach (DataRow dr in dt.Rows)
                    ds.Tables[sSQL.FixedDataDefinition.DataTableName].ImportRow(dr);
            }

            // Accept all the changes made to the DataSet.
            ds.Tables[sSQL.FixedDataDefinition.DataTableName].AcceptChanges();
        }

        // Clean up memory.
        dt.Clear();

        // Log my progress.
        log.GenerateLog("0038", log.Info
            , engine.TotalRecords.ToString() + " DataRows successfully added for file:\r\n\t"
            + file + "\r\nto DataTable "
            + sSQL.FixedDataDefinition.DataTableName);
    }
    catch (Exception e)
    {
        // Something bad happened here.
        log.GenerateLog("0038", log.Error, "Failed to add DataRows to DataTable "
            + sSQL.FixedDataDefinition.DataTableName
            + " for file\r\n\t"
            + file, e);
    }
    finally
    {
        // Successful or not, get rid of the datasource file to prevent other issues.
        File.Delete(file);
    }
}
And this method:
/// <summary>
/// Deletes columns that are not needed from a given <see cref="System.Data.DataTable" /> reference.
/// </summary>
/// <param name="dt">
/// The <see cref="System.Data.DataTable" /> to delete columns from.
/// </param>
/// <param name="engine">
/// The <see cref="FileHelperEngine" /> object containing data field usability information.
/// </param>
private static void DataTableCleanUp(ref DataTable dt, ref FileHelperEngine engine)
{
    // Tracks the DataColumns I need to remove from my temp DataTable, dt.
    List<DataColumn> removeColumns = new List<DataColumn>();

    // If a field is Discarded, the data was not imported because we don't need this
    // column. In that case, mark the column for deletion by adding it to removeColumns.
    for (int i = 0; i < engine.Options.Fields.Count; i++)
        if (engine.Options.Fields[i].Discarded)
            removeColumns.Add(dt.Columns[i]);

    // Reverse the List so changes to dt don't generate schema errors.
    removeColumns.Reverse();

    // Do the deletion.
    foreach (DataColumn column in removeColumns)
        dt.Columns.Remove(column);

    // Clean up memory.
    removeColumns.Clear();
}
Basically, since ds (the DataSet where finalDT lives) is passed by reference into the GenerateDatasource method, I could not use dt.Copy() to push data into it; I had to use Merge() for that. Then, at the point where I wanted to use Merge() again, I had to use a foreach loop with ImportRow() instead, because Merge() kept overwriting finalDT.
The other issues I had to work through were:
- When using ImportRow(), I also had to remove the PrimaryKey from dt, or I would get errors about duplicate keys.
- FileHelperEngine (or FileHelpers.Dynamic.FixedLengthClassBuilder) has trouble skipping columns I want to ignore. Either it does not acknowledge them at all, which throws off my column offsets and, in turn, the accuracy of how the data is read from the datasource file (the FieldHidden option), or it reads them and creates the columns anyway but does not load the data (the FieldValueDiscarded option with Visibility.Private or .Protected). For me that meant iterating over dt after calling engine.ReadFileAsDT(file) and removing the columns marked Discarded.
- Since FileHelpers knows nothing about my primary key column, or about the other derived columns added to all of my data sources during processing, I had to pass dt to a method (GenerateDatasourceSchema()) to sort that out. That method basically just adds those columns and makes sure the PrimaryKey is the first column.
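GenerateDatasourceSchema() itself isn't shown above; as a rough sketch of what it does (the "pkid" column name is made up for illustration, while "jobid" and "serial" match the derived columns used in GenerateDatasource()):

```csharp
using System.Data;

internal static class SchemaHelper
{
    // Rough sketch of GenerateDatasourceSchema(): add the auto-incrementing
    // primary key and the derived columns. "pkid" is an illustrative name only.
    internal static void GenerateDatasourceSchema(ref DataTable dt)
    {
        var pk = new DataColumn("pkid", typeof(int))
        {
            AutoIncrement = true,
            AutoIncrementSeed = 1,
            AutoIncrementStep = 1
        };
        dt.Columns.Add(pk);   // Existing rows get key values assigned automatically.
        pk.SetOrdinal(0);     // Make the key the first column.
        dt.PrimaryKey = new[] { pk };

        // Derived columns present in every data source we process.
        dt.Columns.Add("jobid", typeof(int));
        dt.Columns.Add("serial", typeof(long));
    }
}
```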
The rest of the code handles the fixes I need to apply to the columns and rows. In some cases I am setting a column's value for every row; in others I am cleaning up mistakes in the raw data (since it comes from my customer).
It's not great, and I hope to find a better way down the road. If anyone has input on how I did this, I would love to hear it.