Is there a way to collect multiple multiline strings, delimited by a specific character, into an ArrayList using Java 8 streams?
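I have a FASTA file that I want to parse into an ArrayList in which each element holds one complete sequence. The sequences span multiple lines, and I don't want the identification lines included in the strings I store.
My current code puts every line into a separate element of the ArrayList. How can I make each element correspond to a block delimited by the > character?
The FASTA file has the following format: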
>identification of a sequence 1
line1
line3
>identification of a sequence 2
line4
>identification of a sequence 3
line5
line6
line7
    public static void main(String args[]) {
        String fileName = "fastafile.fasta";
        List<String> list = new ArrayList<>();
        try (Stream<String> stream = Files.lines(Paths.get(fileName))) {
            //1. drop the identification lines (those starting with '>')
            //2. convert all content to upper case
            //3. collect the remaining lines into a List
            list = stream
                    .filter(line -> !line.startsWith(">"))
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            e.printStackTrace();
        }
        list.forEach(System.out::println);
    }
For the example above, the desired output is:
System.out.println(list.size()); // this would be 3
System.out.println(list.get(0)); //this would be line1line3
System.out.println(list.get(1)); //this would be line4
System.out.println(list.get(2)); //this would be line5line6line7
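Given your goal, using Files.lines seems to make things a bit tricky, because it hands you the file one line at a time while your records span several lines.
Assuming you can simply load the entire content of the file into a single String, the following works fine (verified with an online compiler):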
    import java.util.*;
    import java.util.stream.*;

    public class Test {
        public static void main(String args[]) {
            String content = ">identification of a sequence 1\n" +
                    "line1\n" +
                    "line3\n" +
                    ">identification of a sequence 2\n" +
                    "line4\n" +
                    ">identification of a sequence 3\n" +
                    "line5\n" +
                    "line6\n" +
                    "line7";
            // Split on each identification line ('>' plus the rest of that line),
            // then strip the newlines inside every remaining block.
            List<String> list = Arrays.stream(content.split(">.*"))
                    .filter(e -> !e.isEmpty())
                    .map(e -> e.replace("\n", "").trim())
                    .collect(Collectors.toList());
            list.forEach(System.out::println);
            System.out.println(list.size()); // this would be 3
            System.out.println(list.get(0)); // this would be line1line3
            System.out.println(list.get(1)); // this would be line4
            System.out.println(list.get(2)); // this would be line5line6line7
        }
    }
Output:
line1line3
line4
line5line6line7
3
line1line3
line4
line5line6line7
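To apply the same idea to a real file rather than a hard-coded String, something along these lines might work. This is only a sketch under a few assumptions: the class name FastaParser and its two methods are made up for illustration, the file is assumed to be UTF-8 and small enough to fit in memory, and the pattern is anchored to the start of a line and uses \R so that Windows line endings are removed as well.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class; the name and methods are illustrative only.
final class FastaParser {

    // Matches a whole identification line: '>' at the start of a line up to the line end.
    private static final Pattern HEADER = Pattern.compile("(?m)^>.*$");

    // Splits the content on identification lines and joins the lines of each block.
    static List<String> parse(String content) {
        return HEADER.splitAsStream(content)
                .map(block -> block.replaceAll("\\R", "").trim()) // drop \n and \r\n
                .filter(sequence -> !sequence.isEmpty())           // skip empty blocks
                .collect(Collectors.toList());
    }

    // Reads the whole file into memory (assumption: UTF-8, file fits in memory).
    static List<String> readSequences(Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return parse(content);
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Usage: java FastaParser fastafile.fasta
            readSequences(Paths.get(args[0])).forEach(System.out::println);
        } else {
            // No file given: run on the sample from the question.
            String sample = ">identification of a sequence 1\nline1\nline3\n"
                    + ">identification of a sequence 2\nline4\n"
                    + ">identification of a sequence 3\nline5\nline6\nline7\n";
            parse(sample).forEach(System.out::println);
        }
    }
}
```

Run without arguments, it prints line1line3, line4 and line5line6line7 for the sample. For very large files, reading everything into one String may not be practical; in that case a plain loop that appends lines to a StringBuilder and starts a new element at each > line would avoid holding the raw text and the result at the same time.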