data structure confusion over implementation of perceptron in java

I am trying to implement the perceptron algorithm in Java, just the single-layer kind, not a full neural network. It's a classification problem I'm trying to solve.

What I need to do is create a bag-of-words feature vector for each document, where each document belongs to one of four categories: politics, science, sports, or atheism. This is the data.

This is what I am working towards (quoted directly from the first answer):

Example:

Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]

The dictionary would be:

["I", "am", "awesome", "great"]

So the documents as vectors would look like:

Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]

With that you can do all kinds of fancy math stuff and feed this into your perceptron.
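
In Java terms, my understanding is that the vectorization in that quote boils down to something like this (a minimal sketch of the idea, not my actual code; the class name is made up):

import java.util.Arrays;
import java.util.List;

public class VectorizeSketch
{
    public static void main(String[] args)
    {
        //ordered dictionary and document from the quoted example
        List<String> dictionary = Arrays.asList("I", "am", "awesome", "great");
        List<String> document = Arrays.asList("I", "am", "great", "great");

        //count each word at the index it has in the dictionary
        int[] vector = new int[dictionary.size()];
        for (String word : document)
        {
            int index = dictionary.indexOf(word);
            if (index >= 0)
                vector[index]++;
        }

        System.out.println(Arrays.toString(vector)); //prints [1, 1, 0, 2]
    }
}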

I can already generate the global dictionary; now I need to make one per document, but how can I keep them all straight? The folder structure is pretty simple, i.e. `/politics/` with lots of articles inside it, and for each one I need to make a feature vector against the global dictionary. I think the iterators I am using are confusing me.
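
What I picture ending up with is something along these lines (just a sketch of the shape I want; the names and the example path are made up):

import java.util.HashMap;
import java.util.Map;

public class KeepThemStraightSketch
{
    public static void main(String[] args)
    {
        //one feature vector per document, keyed by the document's path
        Map<String, int[]> featureVectors = new HashMap<String, int[]>();
        featureVectors.put("politics/article_01.txt", new int[] {1, 1, 1, 0});
    }
}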

Here is the main class:

public class BagOfWords 
{
    static Set<String> global_dict = new HashSet<String>();

    static boolean global_dict_complete = false; 

    static String path = "/home/Workbench/SUTD/ISTD_50.570/assignments/data/train";

    public static void main(String[] args) throws IOException 
    {
        //each of the different categories
        String[] categories = { "/atheism", "/politics", "/science", "/sports"};

        //cycle through all categories once to populate the global dict
        for(int cycle = 0; cycle < categories.length; cycle++)
        {
            String general_data_partition = path + categories[cycle]; 

            File file = new File( general_data_partition );
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }   

        //the global dict is now complete; flip the flag before the
        //second pass so every category gets feature vectors,
        //not just the last one
        global_dict_complete = true;

        //cycle through again to populate a set of
        //words for each document, compare it to the
        //global dict. 
        for(int cycle = 0; cycle < categories.length; cycle++)
        {
            String general_data_partition = path + categories[cycle]; 

            File file = new File( general_data_partition );
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }

        //print the data struc              
        //for (String s : global_dict)
            //System.out.println( s );
    }
}

This iterates through the directory structure:

public class Iterateur 
{
    static void iterateDirectory(File file, 
                             Set<String> global_dict, 
                             boolean global_dict_complete) throws IOException 
    {
        for (File f : file.listFiles()) 
        {
            if (f.isDirectory()) 
            {
                //recurse into the subdirectory itself (f), not the
                //parent (file), otherwise this loops forever
                iterateDirectory(f, global_dict, global_dict_complete);
            } 
            else 
            {
                String line = null; 
                BufferedReader br = new BufferedReader(new FileReader( f ));

                //hand the reader over once per file; wrapping this call
                //in a readLine() loop as well would silently skip the
                //first line of every file, since the callee reads too
                if (global_dict_complete == false)
                {
                    Dictionary.populate_dict(file, f, line, br, global_dict);
                }
                else
                {
                    FeatureVecteur.generateFeatureVecteur(file, f, line, br, global_dict);
                }

                br.close();
            }
        }
    }
}

This populates the global dictionary:

public class Dictionary 
{

    public static void populate_dict(File file, 
                                 File f, 
                                 String line, 
                                 BufferedReader br, 
                                 Set<String> global_dict) throws IOException
    {

        while ((line = br.readLine()) != null) 
        {
            String[] words = line.split(" ");//those are your words

            for (int i = 0; i < words.length; i++) 
            {
                //a HashSet ignores duplicates on add(), so no
                //contains() check is needed
                global_dict.add(words[i]);
            }   
        }
    }
}

This is an initial attempt at populating the document-specific dictionaries:

public class FeatureVecteur 
{
    public static void generateFeatureVecteur(File file, 
                                          File f, 
                                          String line, 
                                          BufferedReader br, 
                                          Set<String> global_dict) throws IOException
    {
        //the distinct words of this one document; nothing is done
        //with them yet, this is where I am stuck
        Set<String> file_dict = new HashSet<String>();

        while ((line = br.readLine()) != null) 
        {
            String[] words = line.split(" ");//those are your words

            for (int i = 0; i < words.length; i++) 
            {
                file_dict.add(words[i]);
            }   
        }
    }
}

If I understand your question, you are trying to count how many times each word of the global dictionary occurs in a given file. I would suggest creating an array of ints, where the index represents the index into the global dictionary and the value represents the count of that word in the file.

Then, for each word in the global dictionary, count how many times it appears in the file. Be careful, though: feature vectors need a consistent ordering of elements, and HashSets do not guarantee one. In your example, for instance, "I" always needs to be the first element. To solve that, you will probably want to convert your set into an ArrayList or some other sequential list once the global dictionary is completely finished.

ArrayList<String> global_dict_list = new ArrayList<String>( global_dict );

The counting might look something like this:

int[] wordFrequency = new int[global_dict_list.size()];

for ( int j = 0; j < global_dict_list.size(); j++ )
{
    for ( int i = 0; i < words.length; i++ ) 
    {
         if ( words[i].equals( global_dict_list.get(j) ) ) 
         {
             //index by the word's position in the dictionary (j),
             //not its position in the line (i)
             wordFrequency[j]++;
         }
    }
}

Nest that code inside the while loop that reads the file line by line in your feature-vector code. Hope this helps!
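
Putting it together, generateFeatureVecteur could end up looking roughly like this (an untested sketch; the static featureVectors map, keyed by file path, is just one possible way to keep the vectors per document straight):

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class FeatureVecteur 
{
    //one frequency vector per document, keyed by the file's path
    static Map<String, int[]> featureVectors = new HashMap<String, int[]>();

    //frozen copy of the global dictionary so every vector uses the same
    //ordering; built once, on the first call after the dict is complete
    static ArrayList<String> global_dict_list = null;

    public static void generateFeatureVecteur(File file, 
                                          File f, 
                                          String line, 
                                          BufferedReader br, 
                                          Set<String> global_dict) throws IOException
    {
        if (global_dict_list == null)
        {
            global_dict_list = new ArrayList<String>( global_dict );
        }

        int[] wordFrequency = new int[global_dict_list.size()];

        while ((line = br.readLine()) != null) 
        {
            String[] words = line.split(" ");

            for ( int j = 0; j < global_dict_list.size(); j++ )
            {
                for ( int i = 0; i < words.length; i++ ) 
                {
                    if ( words[i].equals( global_dict_list.get(j) ) ) 
                    {
                        wordFrequency[j]++;
                    }
                }
            }
        }

        featureVectors.put( f.getPath(), wordFrequency );
    }
}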