U-SQL script to search for a string, then group by that string and get the count of distinct files
I am new to U-SQL and am trying to solve the following problem.
str1=\global\europe\Moscow345\File1.txt
str2=\global.bee.com\europe\Moscow345\File1.txt
str3=\global\europe\amsterdam321\File1.Rvt
str4=\global.bee.com\europe\amsterdam345\File1.Rvt
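Case 1:
How can I get "\europe\Moscow345\File1.txt" from the string variables str1 and str2? I only want to take that part (\europe\Moscow345\File1.txt) from str1 and str2, then group by "\global\europe\Moscow345" and get the count of distinct files in the path "\europe\Moscow345\".
So the output would look like this: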
distinct_filesby_Location_Date
To solve the above, I tried the U-SQL code below, but I am not sure whether I am writing the script correctly:
@inArray = SELECT new SQL.ARRAY<string>(
filepath.Contains("\europe")) AS path
FROM @t;
@filesbyloc =
SELECT [ID],
path.Trim() AS path1
FROM @inArray
CROSS APPLY
EXPLODE(path1) AS r(location);
OUTPUT @filesbyloc
TO "/Outputs/distinctfilesbylocation.tsv"
USING Outputters.Tsv();
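Any help would be appreciated.
One approach is to put all the strings you want to work with in a file, e.g. strings.txt, and save it in your U-SQL input folder. Also create a file containing the cities you want to match, e.g. cities.txt. Then try the following U-SQL script: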
@input =
EXTRACT filepath string
FROM "/input/strings.txt"
USING Extractors.Tsv();
// Give the strings a row-number
@input =
SELECT ROW_NUMBER() OVER() AS rn,
filepath
FROM @input;
// Get the cities
@cities =
EXTRACT city string
FROM "/input/cities.txt"
USING Extractors.Tsv();
// Ensure there is a lower-case version of city for matching / joining
@cities =
SELECT city,
city.ToLower() AS lowercase_city
FROM @cities;
// Split the filepath into an array of path elements
@working =
SELECT rn,
new SQL.ARRAY<string>(filepath.Split('\\')) AS pathElement
FROM @input AS i;
// Explode the filepath string, also changing to lower case
@working =
SELECT rn,
x.pathElement.ToLower() AS pathElement
FROM @working AS i
CROSS APPLY
EXPLODE(pathElement) AS x(pathElement);
// Create the output query, joining on lower case city name, display, normal case name
@output =
SELECT c.city,
COUNT( * ) AS records
FROM @working AS w
INNER JOIN
@cities AS c
ON w.pathElement == c.lowercase_city
GROUP BY c.city;
// Output the result
OUTPUT @output TO "/output/output.txt"
USING Outputters.Tsv();
//OUTPUT @working TO "/output/output2.txt"
//USING Outputters.Tsv();
My results:
HTH
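The split, explode, and join steps in the script above can be checked locally with a small Python model before submitting a job. This is only a sketch of the logic, not U-SQL: the contents of `cities` are a hypothetical cities.txt, and the distinct-file count is an addition that reflects what the question asks for (the script itself counts rows).

```python
from collections import defaultdict

# The four sample paths from the question.
paths = [
    r"\global\europe\Moscow345\File1.txt",
    r"\global.bee.com\europe\Moscow345\File1.txt",
    r"\global\europe\amsterdam321\File1.Rvt",
    r"\global.bee.com\europe\amsterdam345\File1.Rvt",
]
# Hypothetical contents of cities.txt (an assumption, not from the thread).
cities = ["Moscow345", "amsterdam321", "amsterdam345"]


def count_by_city(paths, cities):
    """Return {city: (matching rows, distinct file names)}."""
    lookup = {c.lower(): c for c in cities}  # lower-case key, display name
    records = defaultdict(int)
    files = defaultdict(set)
    for p in paths:
        parts = p.split("\\")  # same idea as filepath.Split on a backslash
        for element in parts:  # same idea as CROSS APPLY EXPLODE
            city = lookup.get(element.lower())  # inner join on lower case
            if city is not None:
                records[city] += 1
                files[city].add(parts[-1].lower())  # last element = file name
    return {c: (records[c], len(files[c])) for c in records}


result = count_by_city(paths, cities)
```

For the sample paths, `Moscow345` matches two rows but only one distinct file name, which is the difference between counting rows and counting distinct files.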
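Taking the liberty of formatting your input file as a TSV file, and without knowing the semantics of all the columns, here is one way to write the query. Note that I made the assumptions stated in the comments.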
@d =
EXTRACT path string,
user string,
num1 int,
num2 int,
start_date string,
end_date string,
flag string,
year int,
s string,
another_date string
FROM @"\users\temp\citypaths.txt"
USING Extractors.Tsv(encoding: Encoding.Unicode);
// I assume that you have only one DateTime format culture in your file.
// If it becomes dependent on the region or city as expressed in the path, you need to add a lookup.
@d =
SELECT new SqlArray<string>(path.Split('\\')) AS steps,
DateTime.Parse(end_date, new CultureInfo("fr-FR", false)).Date.ToString("yyyy-MM-dd") AS end_date
FROM @d;
// This assumes your paths have a fixed formatting/mapping into the city
@d =
SELECT steps[4].ToLowerInvariant() AS city,
end_date
FROM @d;
@res =
SELECT city,
end_date,
COUNT( * ) AS count
FROM @d
GROUP BY city,
end_date;
OUTPUT @res
TO "/output/result.csv"
USING Outputters.Csv();
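The fixed-position lookup and the date normalization in the script above can be modelled in Python as a sanity check. This is a sketch under stated assumptions: the sample rows are hypothetical, the paths are assumed to start with two backslashes (one layout under which index 4 lands on the city folder), and the `dd/MM/yyyy` pattern stands in for what `fr-FR` parsing would accept.

```python
from collections import Counter
from datetime import datetime

CITY_INDEX = 4  # position of the city folder after splitting (assumption)

# Hypothetical (path, end_date) rows; only two columns of the file are modelled.
rows = [
    (r"\\global\europe\Moscow345\File1.txt", "21/11/2016"),
    (r"\\global.bee.com\europe\Moscow345\File1.txt", "21/11/2016"),
    (r"\\global\europe\amsterdam321\File1.Rvt", "22/11/2016"),
]


def count_by_city_and_date(rows):
    """Count rows per (lower-case city, ISO date)."""
    counts = Counter()
    for path, end_date in rows:
        steps = path.split("\\")  # leading "\\" yields two empty elements
        city = steps[CITY_INDEX].lower()
        day = datetime.strptime(end_date, "%d/%m/%Y").strftime("%Y-%m-%d")
        counts[(city, day)] += 1
    return counts


counts = count_by_city_and_date(rows)
```

If the paths in the real file have a different number of leading separators, `CITY_INDEX` has to change accordingly, which is the fixed-format assumption mentioned in the script's comments.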
// Now let's pivot the date and count.
@res2 =
SELECT city, MAP_AGG(end_date, count) AS date_count
FROM @res
GROUP BY city;
// This assumes you know exactly which dates you are looking for. Otherwise keep it in the first file representation.
@res2 =
SELECT city,
date_count["2016-11-21"] AS [2016-11-21],
date_count["2016-11-22"] AS [2016-11-22]
FROM @res2;
OUTPUT @res2
TO "/output/res2.csv"
USING Outputters.Csv();
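Update after receiving some sample data in a private email:
Based on the data you sent me (after extracting and counting the cities, which you can do either with the join outlined in Bob's answer, where you need to know your cities in advance, or by taking the string at the city position in the path as in my example, where you do not need to know the cities in advance), you want to pivot the rowset city, count, date into the rowset date, city1, city2, ..., where each row contains the date and the count for each city.
You can easily adapt my example above by changing the calculation of @res2 as follows: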
// Now let's pivot the city and count.
@res2 = SELECT end_date, MAP_AGG(city, count) AS city_count
FROM @res
GROUP BY end_date;
// This assumes you know exactly which cities you are looking for. Otherwise keep it in the first file representation or use a script generation (see below).
@res2 =
SELECT end_date,
city_count["istanbul"] AS istanbul,
city_count["midlands"] AS midlands,
city_count["belfast"] AS belfast,
city_count["acoustics"] AS acoustics,
city_count["amsterdam"] AS amsterdam
FROM @res2;
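To see what shape the pivot produces, here is a small Python model of the two steps: collecting a city-to-count map per date, then looking up a fixed list of cities. The input rows are hypothetical, and a city that is absent for a date is simply left as `None` in this sketch.

```python
from collections import defaultdict

# Hypothetical (city, end_date, count) rows, shaped like the @res rowset.
res = [
    ("istanbul", "2016-11-21", 3),
    ("belfast", "2016-11-21", 1),
    ("istanbul", "2016-11-22", 2),
]


def pivot_by_date(res, cities):
    """Return one tuple per date: (end_date, count for each listed city)."""
    by_date = defaultdict(dict)
    for city, end_date, count in res:
        by_date[end_date][city] = count  # one map per date, like MAP_AGG
    return [
        (end_date,) + tuple(by_date[end_date].get(c) for c in cities)
        for end_date in sorted(by_date)
    ]


table = pivot_by_date(res, ["istanbul", "belfast"])
```

The list of cities has to be supplied up front, which mirrors the need to enumerate the columns in the pivot statement.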
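Note that in my example you need to enumerate every city in the pivot statement by looking it up in the SQL.MAP column. If the cities are not known a priori, you first have to submit a script that creates the script for you. For example, assuming your city, count, date rowset is in a file (or you could duplicate the statements that produce the rowset in both the generating script and the generated script), you could write it as the following script. Then submit its result as the actual processing script.
// Get the rowset (could also be the actual calculation from the original file)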
@in = EXTRACT city string, count int?, date string
FROM "/users/temp/Revit_Last2Months_Results.tsv"
USING Extractors.Tsv();
// Generate the statements for the preparation of the data before the pivot
@stmts = SELECT * FROM (VALUES
( "@s1", "EXTRACT city string, count int?, date string FROM \"/users/temp/Revit_Last2Months_Results.tsv\" USING Extractors.Tsv();"),
( "@s2", "SELECT date, MAP_AGG(city, count) AS city_count FROM @s1 GROUP BY date;" )
) AS T( stmt_name, stmt);
// Now generate the statement doing the pivot
@cities = SELECT DISTINCT city FROM @in;
@pivots =
SELECT "@s3" AS stmt_name, "SELECT date, "+String.Join(", ", ARRAY_AGG("city_count[\""+city+"\"] AS ["+city+"]"))+ " FROM @s2;" AS stmt
FROM @cities;
// Now generate the OUTPUT statement after the pivot. Note that the OUTPUT does not have a statement name.
@output =
SELECT "OUTPUT @s3 TO \"/output/pivot_gen.tsv\" USING Outputters.Tsv();" AS stmt
FROM (VALUES(1)) AS T(x);
// Now put the statements into one rowset. Note that nulls sort high in U-SQL
@result =
SELECT stmt_name, "=" AS assign, stmt FROM @stmts
UNION ALL SELECT stmt_name, "=" AS assign, stmt FROM @pivots
UNION ALL SELECT (string) null AS stmt_name, (string) null AS assign, stmt FROM @output;
// Now output the statements in order of the stmt_name
OUTPUT @result
TO "/pivot.usql"
ORDER BY stmt_name
USING Outputters.Text(delimiter:' ', quoting:false);
Now download the file and submit it.
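To preview the text that the generating script is meant to produce, the same string assembly can be modelled in Python. This is only an illustration: the city list is hypothetical, the statement texts are copied from the script above, and placing the unnamed OUTPUT statement last is done here by an explicit sort key.

```python
def pivot_statement(cities):
    """Build the body of the @s3 statement for the given city names."""
    columns = ", ".join('city_count["{0}"] AS [{0}]'.format(c) for c in cities)
    return "SELECT date, " + columns + " FROM @s2;"


def build_script(cities):
    """Assemble the generated script, named statements first, OUTPUT last."""
    rows = [
        ("@s2", "SELECT date, MAP_AGG(city, count) AS city_count "
                "FROM @s1 GROUP BY date;"),
        (None, 'OUTPUT @s3 TO "/output/pivot_gen.tsv" USING Outputters.Tsv();'),
        ("@s1", "EXTRACT city string, count int?, date string "
                'FROM "/users/temp/Revit_Last2Months_Results.tsv" '
                "USING Extractors.Tsv();"),
        ("@s3", pivot_statement(cities)),
    ]
    # Rows without a statement name sort after all named rows.
    rows.sort(key=lambda r: (r[0] is None, r[0] or ""))
    return "\n".join(
        stmt if name is None else name + " = " + stmt for name, stmt in rows
    )


script = build_script(["istanbul", "belfast"])
```

Printing `script` gives the four statements in submission order, which makes it easy to spot a malformed column alias before running the generated job.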