Read a file from Google Cloud Storage in Dataproc
I am trying to migrate a Scala Spark job from a Hadoop cluster to GCP, and I have this snippet of code that reads a file and builds an ArrayBuffer[String]:
import java.io._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, FSDataInputStream}
import scala.collection.mutable.ArrayBuffer

val filename = "it.txt.1604607878987"
val fs = FileSystem.get(new Configuration())
val dataInputStream: FSDataInputStream = fs.open(new Path(filename))
val sourceEDR = new BufferedReader(new InputStreamReader(dataInputStream, "UTF-8"))
val outputEDRFile = ArrayBuffer[String]()
val buffer = new Array[Char](300)
var num_of_chars = 0
while (sourceEDR.read(buffer) > -1) {
  val str = new String(buffer)
  num_of_chars += str.length
  outputEDRFile += (str + "\n")
}
println(num_of_chars)
The code runs on the cluster and gives me 3025000 characters. Then I tried to run the same code in Dataproc:
val path_gs = new Path("gs://my-bucket")
val filename = "it.txt.1604607878987"
val fs = path_gs.getFileSystem(new Configuration())
val dataInputStream: FSDataInputStream = fs.open(new Path(filename))
val sourceEDR = new BufferedReader(new InputStreamReader(dataInputStream, "UTF-8"))
val outputEDRFile = ArrayBuffer[String]()
val buffer = new Array[Char](300)
var num_of_chars = 0
while (sourceEDR.read(buffer) > -1) {
  val str = new String(buffer)
  num_of_chars += str.length
  outputEDRFile += (str + "\n")
}
println(num_of_chars)
It gives 3175025 characters. Is whitespace being added to the file contents, or do I have to use another interface to read files from Google Storage in Dataproc?
I also tried other encoding options, but got the same result.
Any help?
I didn't find a solution using a buffer, so I tried reading one character at a time, and that works for me:
var i = 0
var r = 0
val response = new StringBuilder
// Read one character at a time so no stale buffer contents leak in.
while ({ r = sourceEDR.read(); r } != -1) {
  val ch = r.toChar
  if (response.length < 300) {
    response.append(ch)
  } else {
    // Flush the accumulated 300-character chunk.
    val str = response.toString().replaceAll("[\r\n]", " ")
    i += str.length
    outputEDRFile += (str + "\n")
    response.setLength(0)
    response.append(ch)
  }
}
// Flush the final, possibly shorter chunk.
val str = response.toString().replaceAll("[\r\n]", " ")
i += str.length
outputEDRFile += (str + "\n")
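A likely explanation for the mismatch, for what it's worth: `BufferedReader.read(char[])` may return fewer than `buffer.length` characters per call, and the GCS connector tends to return shorter chunks than HDFS does. Since `new String(buffer)` ignores the returned count, leftover characters from earlier reads get counted again. A minimal sketch of the buffered loop that respects the read count (`readChunks` is a hypothetical helper name; demonstrated with an in-memory reader rather than a real GCS stream):

```scala
import java.io.{BufferedReader, StringReader}
import scala.collection.mutable.ArrayBuffer

// Read a stream in fixed-size chunks, counting only the characters
// that read() actually returned on each call.
def readChunks(source: BufferedReader, chunkSize: Int = 300): (ArrayBuffer[String], Int) = {
  val out = ArrayBuffer[String]()
  val buffer = new Array[Char](chunkSize)
  var total = 0
  var n = source.read(buffer)
  while (n > -1) {
    val str = new String(buffer, 0, n) // only the chars read this call
    total += str.length
    out += (str + "\n")
    n = source.read(buffer)
  }
  (out, total)
}

// Demonstration: 700 characters yield the same total regardless of
// how read() happens to chunk the input.
val (chunks, total) = readChunks(new BufferedReader(new StringReader("a" * 700)))
println(total)
```

The same loop should give matching totals on HDFS and GCS, because short reads no longer inflate the count.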