How to handle a text file with multiple spaces as the delimiter
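I have a source data set consisting of text files in which the columns are separated by one or more spaces, depending on the width of the column value. The data is right-aligned, i.e. the spaces are added before the actual data.
Can I use one of the built-in extractors, or do I have to implement a custom extractor?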
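You could create a custom extractor or, more simply, import each line as a single column, then split and clean it using the C# methods available in U-SQL, such as Split and IsNullOrWhiteSpace, like this: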
// Import the row as one column to be split later; NB use a delimiter that will NOT be in the import file
@input =
    EXTRACT rawString string
    FROM "/input/input.txt"
    USING Extractors.Text(delimiter : '|');

// Add a row number to the line and remove white space elements
@working =
    SELECT ROW_NUMBER() OVER() AS rn,
           new SqlArray<string>(rawString.Split(' ').Where(x => !String.IsNullOrWhiteSpace(x))) AS columns
    FROM @input;

// Prepare the output, referencing the column's position in the array
@output =
    SELECT rn,
           columns[0] AS id,
           columns[1] AS firstName,
           columns[2] AS lastName
    FROM @working;

OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);
HTH
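The split-and-filter expression used in the U-SQL script above is plain C#, so it can be tried outside U-SQL first. The following is a minimal sketch, not part of the original answer; the class name `WhitespaceSplitDemo`, the helper `SplitColumns`, and the sample rows are made up for illustration.

```csharp
using System;
using System.Linq;

public static class WhitespaceSplitDemo
{
    // Split a line on single spaces, then drop the empty elements
    // that runs of consecutive spaces produce.
    public static string[] SplitColumns(string line)
    {
        return line.Split(' ')
                   .Where(x => !String.IsNullOrWhiteSpace(x))
                   .ToArray();
    }

    public static void Main()
    {
        // Hypothetical right-aligned sample rows, not taken from the original data set
        string[] rows =
        {
            "    1    Alice      Smith",
            "   20      Bob      Jones",
        };

        foreach (string row in rows)
        {
            string[] columns = SplitColumns(row);
            Console.WriteLine(String.Join("|", columns));
        }
    }
}
```

Note that this approach only works when the column values themselves never contain spaces; a value such as a two-word name would be split into two elements and shift the later column positions.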
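@wBob's solution works as long as each row fits into a string (128 kB). Otherwise, write a custom extractor that fixes this up during extraction. Depending on what you know about the format, you can use input.Split() to split the input into rows and then split each row according to your whitespace rules, as shown below (a full example of the Extractor pattern is here), or you could write one similar to the one described in this blog post.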
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
{
    // Split the input stream into rows on the row delimiter
    foreach (Stream current in input.Split(this._row_delim))
    {
        using (StreamReader streamReader = new StreamReader(current, this._encoding))
        {
            // Split on the column delimiter and drop empty or whitespace-only elements (requires System.Linq)
            string[] array = streamReader.ReadToEnd()
                .Split(new string[] { this._col_delim }, StringSplitOptions.None)
                .Where(x => !String.IsNullOrWhiteSpace(x))
                .ToArray();
            for (int i = 0; i < array.Length; i++)
            {
                // Now write your code to convert array[i] into the extract schema
            }
        }
        yield return outputrow.AsReadOnly();
    }
}
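The conversion step left as a comment in the loop above depends on your schema. As a rough sketch only, assuming a three-column schema of (int id, string firstName, string lastName) like the one in the first answer, the conversion of `array` could look like the following. `PersonRow`, `RowConverter`, and `TryConvert` are hypothetical names introduced here; in a real extractor the converted values would be written to `outputrow` instead of a separate class.

```csharp
using System;
using System.Globalization;

// Hypothetical target schema for illustration: (int id, string firstName, string lastName)
public sealed class PersonRow
{
    public int Id;
    public string FirstName;
    public string LastName;
}

public static class RowConverter
{
    // Returns false instead of throwing when the row does not match the assumed schema.
    public static bool TryConvert(string[] array, out PersonRow row)
    {
        row = null;

        // A wrong element count usually means a value contained a space or a column was empty.
        if (array == null || array.Length != 3)
        {
            return false;
        }

        int id;
        if (!int.TryParse(array[0], NumberStyles.Integer, CultureInfo.InvariantCulture, out id))
        {
            return false;
        }

        row = new PersonRow { Id = id, FirstName = array[1], LastName = array[2] };
        return true;
    }

    public static void Main()
    {
        PersonRow row;
        bool ok = TryConvert(new[] { "20", "Bob", "Jones" }, out row);
        Console.WriteLine(ok ? row.Id + " " + row.FirstName + " " + row.LastName : "row rejected");
    }
}
```

Checking the element count before indexing matters here because empty columns are dropped by the whitespace filter, so a short row would otherwise shift values into the wrong columns without any error.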