Plus operation on Integer object; read in multiple files from a directory to create a bag-of-words in Java
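Is a bag of words the same thing as a document-term matrix?
I have a training data set made up of many files. I want to read all of them into a data structure (a hash map?) to build a bag-of-words model for a particular class of documents (science, religion, sports, or sex), in preparation for a perceptron implementation.
Right now I have the simplest Java I/O construct, i.e.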
String text;
BufferedReader br = new BufferedReader(new FileReader("file"));
while ((text = br.readLine()) != null)
{
    //read in multiple files
    //generate a hash map with each unique word
    //as a key and the frequency with which that
    //word appears as the value
}
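So what I want to do is read input from multiple files in a directory and save all of the data into one underlying structure. How can I do that? Should I write it out to a file somewhere?
From my understanding of bag of words, I think a hash map (as described in the comments in the code above) would work. Is that right? How could I implement something like this so that it takes in the input read from multiple files? How should I store it so that I can later incorporate it into my perceptron algorithm?
I have seen something like this: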
String[] names = new String[]{"a.txt", "b.txt", "c.txt"};
StringBuffer strContent = new StringBuffer("");
for (String name : names) {
    File file = new File(name);
    int ch;
    FileInputStream stream = null;
    try {
        stream = new FileInputStream(file);
        while ((ch = stream.read()) != -1) {
            strContent.append((char) ch);
        }
    } finally {
        // stream is still null if the constructor threw
        if (stream != null) {
            stream.close();
        }
    }
}
But this is a lame solution because you need to specify all of the files in advance. I think this should be more dynamic, if possible.
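On the first question: a bag of words describes one document (or one class of documents) as word counts with word order discarded, while a document-term matrix stacks one such count vector per document, with one column per vocabulary word. The two are closely related but not the same thing. Below is a minimal sketch of how per-document bags could be turned into the rows a perceptron would take as input. It assumes whitespace tokenization, and the names `DocumentTermMatrix`, `bagOf`, `vocabulary`, and `build` are hypothetical, not from the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the original post.
final class DocumentTermMatrix {

    // Bag of words for one document: word -> number of occurrences.
    static Map<String, Integer> bagOf(String text) {
        Map<String, Integer> bag = new LinkedHashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                bag.merge(word, 1, Integer::sum);
            }
        }
        return bag;
    }

    // Sorted vocabulary across all bags, so every row uses the same column order.
    static List<String> vocabulary(List<Map<String, Integer>> bags) {
        TreeSet<String> words = new TreeSet<>();
        for (Map<String, Integer> bag : bags) {
            words.addAll(bag.keySet());
        }
        return new ArrayList<>(words);
    }

    // One row per document, one column per vocabulary word.
    static int[][] build(List<Map<String, Integer>> bags, List<String> vocab) {
        int[][] matrix = new int[bags.size()][vocab.size()];
        for (int d = 0; d < bags.size(); d++) {
            for (int t = 0; t < vocab.size(); t++) {
                matrix[d][t] = bags.get(d).getOrDefault(vocab.get(t), 0);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> bags = new ArrayList<>();
        bags.add(bagOf("the cat sat on the mat"));
        bags.add(bagOf("the dog sat"));
        List<String> vocab = vocabulary(bags);
        System.out.println(vocab);
        System.out.println(Arrays.deepToString(build(bags, vocab)));
    }
}
```

Keeping the vocabulary in a fixed order is the design choice that matters here: a perceptron weight vector only makes sense if column t means the same word in every row.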
You can try the program below. It is dynamic; you only need to supply your directory path.
import java.io.File;
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class BagOfWords {
    ConcurrentHashMap<String, Set<String>> map = new ConcurrentHashMap<String, Set<String>>();

    public static void main(String[] args) throws IOException {
        File file = new File("F:/Downloads/Build/");
        new BagOfWords().iterateDirectory(file);
    }

    private void iterateDirectory(File file) throws IOException {
        for (File f : file.listFiles()) {
            if (f.isDirectory()) {
                // Recurse into the subdirectory f; recursing on file would never terminate
                iterateDirectory(f);
            } else {
                // Read File
                // Split and put it in a set
                // add to map
            }
        }
    }
}
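The three placeholder comments in the file branch could be filled in along the following lines. This is only a sketch under stated assumptions: the answer does not say what the map key is, so the file path is used here; files are assumed to be UTF-8 text split on whitespace; and the class name `WordSetsByFile` is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical completion of the placeholder comments; names are illustrative.
final class WordSetsByFile {
    final Map<String, Set<String>> map = new ConcurrentHashMap<>();

    void iterateDirectory(File dir) throws IOException {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or it could not be read
        }
        for (File f : entries) {
            if (f.isDirectory()) {
                iterateDirectory(f);
            } else {
                // Read the file (assumed to be UTF-8 text)
                String text = new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
                // Split on whitespace and keep the distinct words
                Set<String> words = new HashSet<>();
                for (String w : text.split("\\s+")) {
                    if (!w.isEmpty()) {
                        words.add(w);
                    }
                }
                // Add to the map, keyed by file path (an assumption)
                map.put(f.getPath(), words);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        WordSetsByFile collector = new WordSetsByFile();
        collector.iterateDirectory(new File(args.length > 0 ? args[0] : "."));
        collector.map.forEach((path, words) ->
                System.out.println(path + ": " + words.size() + " distinct words"));
    }
}
```

Note that a set only records which words occur in a file, not how often, so a frequency-based bag of words needs a count per word instead.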
I think this is very close, but there is some kind of discrepancy between int and Integer. How do I reconcile them?
// Value type is Integer so that put(word, 0) and get(word) + 1 compile via autoboxing
ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();

public static void main(String[] args) throws IOException
{
    String path = "path";
    File file = new File(path);
    new BagOfWords().iterateDirectory(file);
}

private void iterateDirectory(File file) throws IOException
{
    for (File f : file.listFiles())
    {
        if (f.isDirectory())
        {
            // Recurse into the subdirectory f, not the parent file
            iterateDirectory(f);
        }
        else
        {
            String line;
            // Open the current file f rather than the literal name "file"
            BufferedReader br = new BufferedReader(new FileReader(f));
            while ((line = br.readLine()) != null)
            {
                String[] words = line.split(" ");//those are your words
                // Count each word: key is the word, value is its frequency
                String word;
                for (int i = 0; i < words.length; i++)
                {
                    word = words[i];
                    if (!map.containsKey(word))
                    {
                        map.put(word, 0);
                    }
                    map.put(word, map.get(word) + 1);
                }
            }
            br.close();
        }
    }
}
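On the int versus Integer question: Java boxes and unboxes between `int` and `Integer` automatically, so `map.put(word, 0)` and `map.get(word) + 1` compile as long as the map's value type is `Integer`. If the map is still declared with `Set<String>` values, as in the earlier program, `+` has no meaning for the stored type and the compiler rejects it. Here is a minimal sketch under that reading, assuming whitespace tokenization and using a hypothetical class name `WordCounter`; `merge` stands in for the `containsKey`/`put` pair.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical class for illustration; not part of the original post.
final class WordCounter {
    // Value type is Integer; int values are boxed automatically when stored.
    final Map<String, Integer> counts = new ConcurrentHashMap<>();

    // Adds the words of one line to the running counts.
    void countLine(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // Inserts 1 for a new word, otherwise adds 1 to the stored count.
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Walks dir recursively and counts the words of every regular file.
    void iterateDirectory(File dir) throws IOException {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or it could not be read
        }
        for (File f : entries) {
            if (f.isDirectory()) {
                iterateDirectory(f);
            } else {
                // try-with-resources closes the reader even if reading fails
                try (BufferedReader br = new BufferedReader(new FileReader(f))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        countLine(line);
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        WordCounter counter = new WordCounter();
        counter.iterateDirectory(new File(args.length > 0 ? args[0] : "."));
        System.out.println(counter.counts.size() + " distinct words");
    }
}
```

One map per document class (science, religion, and so on) would give the per-class bag of words described in the question; that split is left out here to keep the sketch short.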