Data structure confusion over implementation of perceptron in Java
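I am trying to implement the perceptron algorithm in Java, just the single-layer kind, not a full neural network. It is a classification problem that I am trying to solve.
What I need to do is create a bag-of-words feature vector for each document in one of four categories: politics, science, sports and atheism. This is the data.
This is what I am trying to achieve (quoted directly from the first answer to a linked question):
Example: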
Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]
The dictionary is:
["I", "am", "awesome", "great"]
So the documents as vectors would look like:
Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]
With that, you can do all sorts of fancy math and feed it into your perceptron.
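The quoted example can be reproduced with a short sketch. The class and method names (`BagOfWordsExample`, `buildDictionary`, `toVector`) are hypothetical and not part of the code in this question; the sketch assumes the dictionary is kept in first-seen order, which is what the example's dictionary implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class BagOfWordsExample {
    // Collect every distinct word in first-seen order so vector positions are stable.
    static List<String> buildDictionary(List<List<String>> documents) {
        LinkedHashSet<String> seen = new LinkedHashSet<>();
        for (List<String> doc : documents) {
            seen.addAll(doc);
        }
        return new ArrayList<>(seen);
    }

    // Count how often each dictionary word occurs in one document.
    static int[] toVector(List<String> document, List<String> dictionary) {
        int[] vector = new int[dictionary.size()];
        for (String word : document) {
            int index = dictionary.indexOf(word);
            if (index >= 0) {
                vector[index]++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("I", "am", "awesome");
        List<String> doc2 = Arrays.asList("I", "am", "great", "great");
        List<String> dictionary = buildDictionary(Arrays.asList(doc1, doc2));

        System.out.println(dictionary);                                  // [I, am, awesome, great]
        System.out.println(Arrays.toString(toVector(doc1, dictionary))); // [1, 1, 1, 0]
        System.out.println(Arrays.toString(toVector(doc2, dictionary))); // [1, 1, 0, 2]
    }
}
```

`indexOf` is a linear scan, which is fine for a toy dictionary but likely too slow for a real corpus.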
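I can already generate the global dictionary. Now I need to make one for each document, but how can I keep them all straight? The folder structure is quite simple, i.e. `/politics/` contains many articles, and for each one I need to build a feature vector against the global dictionary. I think the iterator I am using is what is confusing me.
This is the main class: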
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class BagOfWords
{
    static Set<String> global_dict = new HashSet<String>();
    static boolean global_dict_complete = false;
    static String path = "/home/Workbench/SUTD/ISTD_50.570/assignments/data/train";

    public static void main(String[] args) throws IOException
    {
        //each of the different categories
        String[] categories = { "/atheism", "/politics", "/science", "/sports"};

        //cycle through all categories once to populate the global dict
        for(int cycle = 0; cycle < categories.length; cycle++)
        {
            String general_data_partition = path + categories[cycle];
            File file = new File( general_data_partition );
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }

        //after the global dict has been filled up
        //cycle through again to populate a set of
        //words for each document, compare it to the
        //global dict.
        //the dict is complete before this pass starts, so set the
        //flag once up front instead of only for the last category
        global_dict_complete = true;
        for(int cycle = 0; cycle < categories.length; cycle++)
        {
            String general_data_partition = path + categories[cycle];
            File file = new File( general_data_partition );
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }

        //print the data struc
        //for (String s : global_dict)
        //System.out.println( s );
    }
}
This iterates through the data structure:
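import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;

public class Iterateur
{
    static void iterateDirectory(File file,
                                 Set<String> global_dict,
                                 boolean global_dict_complete) throws IOException
    {
        for (File f : file.listFiles())
        {
            if (f.isDirectory())
            {
                //recurse into the subdirectory f, not the parent,
                //otherwise the recursion never terminates
                iterateDirectory(f, global_dict, global_dict_complete);
            }
            else
            {
                String line;
                //try-with-resources closes the reader when the file is done
                try (BufferedReader br = new BufferedReader(new FileReader( f )))
                {
                    while ((line = br.readLine()) != null)
                    {
                        if (global_dict_complete == false)
                        {
                            Dictionary.populate_dict(file, f, line, br, global_dict);
                        }
                        else
                        {
                            FeatureVecteur.generateFeatureVecteur(file, f, line, br, global_dict);
                        }
                    }
                }
            }
        }
    }
}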
This fills up the global dictionary:
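import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.util.Set;

public class Dictionary
{
    public static void populate_dict(File file,
                                     File f,
                                     String line,
                                     BufferedReader br,
                                     Set<String> global_dict) throws IOException
    {
        //line already holds the line read by the caller, so process
        //it first; reading again straight away would skip it
        do
        {
            String[] words = line.split(" ");//those are your words
            for (int i = 0; i < words.length; i++)
            {
                //add() ignores words that are already in the set
                global_dict.add(words[i]);
            }
        }
        while ((line = br.readLine()) != null);
    }
}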
This is a preliminary attempt at filling the document-specific dictionary:
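import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class FeatureVecteur
{
    public static void generateFeatureVecteur(File file,
                                              File f,
                                              String line,
                                              BufferedReader br,
                                              Set<String> global_dict) throws IOException
    {
        Set<String> file_dict = new HashSet<String>();
        //line already holds the line read by the caller, so process
        //it first; reading again straight away would skip it
        do
        {
            String[] words = line.split(" ");//those are your words
            for (int i = 0; i < words.length; i++)
            {
                //add() ignores words that are already in the set
                file_dict.add(words[i]);
            }
        }
        while ((line = br.readLine()) != null);
    }
}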
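If I understand your problem, you are trying to count how many times each word in the global dictionary occurs in a given file. I would recommend creating an array of integers, where the index is the word's index in the global dictionary and the value is the number of occurrences of that word in the file.
Then, for each word in the global dictionary, count how many times that word appears in the file. However, you need to be careful: feature vectors require a consistent order of elements, and a HashSet does not guarantee this. For instance, in your example, "I" always needs to be the first element. To solve this, you may want to convert your set to an ArrayList or some other sequential list once the global dictionary is complete.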
ArrayList<String> global_dict_list = new ArrayList<String>( global_dict );
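A possible refinement, sketched under assumptions: sorting the words before fixing their positions makes the order reproducible across runs, and a word-to-index map avoids scanning the list for every lookup. Neither step is required by the approach above, and the names `DictionaryIndex` and `buildIndex` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DictionaryIndex {
    // Sort the words (natural String order) so every run assigns the same
    // positions, then record each word's position for constant-time lookup.
    static Map<String, Integer> buildIndex(Set<String> globalDict) {
        List<String> ordered = new ArrayList<>(new TreeSet<>(globalDict));
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            index.put(ordered.get(i), i);
        }
        return index;
    }

    public static void main(String[] args) {
        Set<String> globalDict = new HashSet<>(Arrays.asList("great", "I", "awesome", "am"));
        Map<String, Integer> index = buildIndex(globalDict);
        System.out.println(index.get("I"));     // 0 (uppercase sorts before lowercase)
        System.out.println(index.get("great")); // 3
    }
}
```

With such a map, incrementing a count becomes a single lookup per word instead of a loop over the whole dictionary.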
The counting could look something like this:
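//declare this once per file, before the loop that reads lines
int[] wordFrequency = new int[global_dict_list.size()];
//words is the array produced by line.split(" ")
for ( int j = 0; j < global_dict_list.size(); j++ )
{
    String globalWord = global_dict_list.get(j);
    for ( int i = 0; i < words.length; i++ )
    {
        if ( words[i].equals(globalWord) )
        {
            //index by the word's position in the global dictionary,
            //not by its position in the line
            wordFrequency[j]++;
        }
    }
}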
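Nest that counting loop inside the while loop of your feature vector code that reads the file line by line, keeping the wordFrequency declaration outside the loop so the counts accumulate over the whole file. Hope this helps!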
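Put together, the per-file counting might look like the following minimal sketch. It assumes the dictionary has already been converted to a list; `FeatureVectorSketch` and `countWords` are hypothetical names, and the method takes a `Reader` so it can be tried on a string here and given a `FileReader` for real files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;

public class FeatureVectorSketch {
    // Reads the text line by line and counts occurrences of each dictionary word.
    static int[] countWords(Reader source, List<String> globalDictList) throws IOException {
        // Declared once per document so counts accumulate across lines.
        int[] wordFrequency = new int[globalDictList.size()];
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    int index = globalDictList.indexOf(word);
                    if (index >= 0) { // words missing from the dictionary are skipped
                        wordFrequency[index]++;
                    }
                }
            }
        }
        return wordFrequency;
    }

    public static void main(String[] args) throws IOException {
        List<String> dict = Arrays.asList("I", "am", "awesome", "great");
        int[] vector = countWords(new StringReader("I am great\ngreat"), dict);
        System.out.println(Arrays.toString(vector)); // [1, 1, 0, 2]
    }
}
```

Returning one array per document keeps each feature vector separate, so the vectors can be stored alongside their category labels before being passed to the perceptron.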