Hash map as unique value store / instance counter (Java)

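I am trying to create a program that learns rules by inference, i.e. 'contains'('vitamin c', 'oranges'). 'prevents'('scurvy', 'vitamin c'). would produce the output "rule" 'prevents'('scurvy', 'oranges'). I have code that generates this output, but I then want to eliminate duplicate "rules" from the input while keeping track of how many times each one was observed (as a naive confidence measure, since frequently observed rules are more likely to be true). So I implemented a hash map that stores the "rule" as the key and the number of observed instances as the value. However, the hash map does not seem to work properly, and I am at a loss as to the cause of this behavior; perhaps someone more knowledgeable than me can spot it.

Architecture of the machine learning component: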

private List<Sentence> sentences = new ArrayList<>();
/*
 * The following maps store the relation of a string occurring
 * as a subject or object, respectively, to the list of Sentence
 * ordinals where they occur.
 */
private Map<String,List<Integer>> subject2index = new HashMap<>();
private Map<String,List<Integer>> object2index = new HashMap<>();

/*
 * This set contains strings that occur as both,
 * subject and object. This is useful for determining strings
 * acting as an in-between connecting two relations. 
 */
private Set<String> joints = new HashSet<>();

public void addSentence( Sentence s )
{

    // add Sentence to the list of all Sentences
    sentences.add( s );

    // add the Subject of the Sentence to the map mapping strings
    // occurring as a subject to the ordinal of this Sentence
    List<Integer> subind = subject2index.get( s.getSubject() );
    if( subind == null )
    {
        subind = new ArrayList<>();
        subject2index.put( s.getSubject(), subind );
    }
    subind.add( sentences.size() - 1 );

    // add the Object of the Sentence to the map mapping strings
    // occurring as an object to the ordinal of this Sentence
    List<Integer> objind = object2index.get( s.getObject() );
    if( objind == null )
    {
        objind = new ArrayList<>();
        object2index.put( s.getObject(), objind );
    }
    objind.add( sentences.size() - 1 );

    // determine whether we've found a "joining" string
    if( subject2index.containsKey( s.getObject() ) )
    {
        joints.add( s.getObject() );
    }
    if( object2index.containsKey( s.getSubject() ) )
    {
        joints.add( s.getSubject() );
    }
}

public Collection<String> getJoints()
{
    return joints;
}
public List<Integer> getSubjectIndices( String subject )
{
    return subject2index.get( subject );
}
public List<Integer> getObjectIndices( String object )
{
    return object2index.get( object );
}
public Sentence getSentence( int index )
{
    return sentences.get( index );
}

Hash map used to store only unique copies and their occurrence counts:

//map to store learned 'rules'
Map<Sentence, Integer> ruleCount = new HashMap<>();
//store data
public void numberRules(Sentence sentence) 
{
    if (!ruleCount.containsKey(sentence))
    {
        ruleCount.put(sentence, 0);
    }
    ruleCount.put(sentence, ruleCount.get(sentence) + 1);
}

The Sentence object:

public class Sentence 
{
private String verb;
private String object;
private String subject;
public Sentence(String verb, String object, String subject )
{
    this.verb = verb;
    this.object = object;
    this.subject = subject;
}

public String getVerb()
{
    return verb; 
}

public String getObject()
{
    return object; 
}

public String getSubject()
{
    return subject;
}

public String toString()
{
    return verb + "(" + object + ", " + subject + ").";
}

}

Current input:

'prevents'('scurvy', 'vitamin C').
'contains'('vitamin C', 'orange').
'contains'('vitamin C', 'sauerkraut').
'is a'('fruit', 'orange').
'improves'('health', 'fruit').
'contains'('vitamin C', 'orange').
'improves'('health', 'fruit').

Current output:

prevents(scurvy, orange). : 1
improves(health, orange). : 1
prevents(scurvy, orange). : 1
prevents(scurvy, sauerkraut). : 1
improves(health, orange). : 1

Desired output:

prevents(scurvy, orange). : 2
improves(health, orange). : 2
prevents(scurvy, sauerkraut). : 1

Code that is executed:

public static void main(String[] args) throws IOException 
{


    Ontology ontology = new Ontology();
    BufferedReader br = new BufferedReader(new FileReader("file.txt"));
    // backslashes must be doubled inside a Java string literal
    Pattern p = Pattern.compile("'(.*?)'\\('(.*?)',\\s*'(.*?)'\\)\\.");
    String line;
    while ((line = br.readLine()) != null) 
    {
        Matcher m = p.matcher(line);
        if( m.matches() ) 
        {
            String verb    = m.group(1);
            String object  = m.group(2);
            String subject = m.group(3);
            ontology.addSentence( new Sentence( verb, object, subject ) );
        }
    }

    for( String joint: ontology.getJoints() )
    {
        for( Integer subind: ontology.getSubjectIndices( joint ) )
        {
            Sentence xaS = ontology.getSentence( subind );

            for( Integer obind: ontology.getObjectIndices( joint ) )
            {

                Sentence yOb = ontology.getSentence( obind );

                Sentence s = new Sentence( xaS.getVerb(),
                                           xaS.getObject(),
                                           yOb.getSubject() );

                //System.out.println( s );                
                ontology.numberRules( s );    

            }
        }
    }
    for (Map.Entry<Sentence, Integer> entry : ontology.ruleCount.entrySet()) 
    {
        System.out.println(entry.getKey()+" : "+entry.getValue());
    }       
}   

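As an aside, when I run this on a large file I get an OutOfMemory: Java heap space error and the program crashes. I know I could increase the heap size, but that seems like a poor solution. How can I optimize this code so that it can handle large data sets?

As I suggested in your previous question, you should override hashCode and equals in the Sentence class, because the default behavior implemented in the Object class does not fit your needs.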

@Override
public boolean equals (Object other)
{
    if (!(other instanceof Sentence))
        return false;
    if (other == this)
        return true;
    Sentence o = (Sentence) other;
    return o.subject.equals(subject) && o.object.equals(object) && o.verb.equals(verb);
}

@Override
public int hashCode ()
{
    return Objects.hash(object, subject, verb); // needs import java.util.Objects; this method only exists since Java 7
}

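When you use one of your custom classes as a key in a HashMap (as you do with the Sentence class), you must override equals() and hashCode(). If you don't override them, a.equals(b) returns true only when a == b, which is probably not the behavior you want.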
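A minimal sketch of that default behavior, in case it helps to see it in isolation. `PlainKey` and `distinctKeys` are hypothetical names made up for this illustration, not part of the code in the question:

```java
import java.util.HashMap;
import java.util.Map;

public class IdentityKeyDemo {
    // Hypothetical key class that does NOT override equals()/hashCode(),
    // so it inherits the identity-based versions from Object.
    public static final class PlainKey {
        final String text;

        public PlainKey(String text) {
            this.text = text;
        }
    }

    // Inserts two separately constructed keys carrying the same text
    // and reports how many distinct keys the map ends up holding.
    public static int distinctKeys() {
        Map<PlainKey, Integer> counts = new HashMap<>();
        counts.merge(new PlainKey("prevents(scurvy, orange)."), 1, Integer::sum);
        counts.merge(new PlainKey("prevents(scurvy, orange)."), 1, Integer::sum);
        return counts.size();
    }

    public static void main(String[] args) {
        // The two keys look the same but are different objects,
        // so the map keeps both entries, each with a count of 1.
        System.out.println(distinctKeys()); // prints 2
    }
}
```

This is the same symptom as in the current output of the question: every derived rule is a fresh object, so each one gets its own entry with a count of 1.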

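You want a.equals(b) to return true when the verb, object, and subject of the two Sentences being compared are respectively equal. Otherwise, two sentences that you consider identical may be treated by the HashMap as different keys.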

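The implementation of hashCode() should match the implementation of equals: if a.equals(b) is true, then a.hashCode() == b.hashCode() must also be true. That is why hashCode should be a function of the 3 properties of the Sentence class.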
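To check the effect end to end, here is a small self-contained sketch under a few assumptions. `Rule` is a hypothetical stand-in for the Sentence class with value-based equals/hashCode, and `count` is an illustrative helper rather than the numberRules method from the question. It uses Map.merge (Java 8+) in place of the containsKey/put pair, and a LinkedHashMap only so that the print order is predictable.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

public class RuleCounterDemo {
    // Hypothetical stand-in for Sentence: equality depends on all three fields.
    public static final class Rule {
        private final String verb;
        private final String object;
        private final String subject;

        public Rule(String verb, String object, String subject) {
            this.verb = verb;
            this.object = object;
            this.subject = subject;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof Rule)) return false;
            Rule o = (Rule) other;
            return verb.equals(o.verb) && object.equals(o.object) && subject.equals(o.subject);
        }

        @Override
        public int hashCode() {
            // Built from the same three fields that equals() compares.
            return Objects.hash(verb, object, subject);
        }

        @Override
        public String toString() {
            return verb + "(" + object + ", " + subject + ").";
        }
    }

    // Counts how often each distinct rule occurs, keeping first-seen order.
    public static Map<Rule, Integer> count(Rule... rules) {
        Map<Rule, Integer> counts = new LinkedHashMap<>();
        for (Rule rule : rules) {
            counts.merge(rule, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The five derived rules from the "current output" in the question.
        Map<Rule, Integer> counts = count(
            new Rule("prevents", "scurvy", "orange"),
            new Rule("improves", "health", "orange"),
            new Rule("prevents", "scurvy", "orange"),
            new Rule("prevents", "scurvy", "sauerkraut"),
            new Rule("improves", "health", "orange"));
        counts.forEach((rule, n) -> System.out.println(rule + " : " + n));
        // prints:
        // prevents(scurvy, orange). : 2
        // improves(health, orange). : 2
        // prevents(scurvy, sauerkraut). : 1
    }
}
```

With a plain HashMap the counts are the same, only the order of the printed lines is not guaranteed.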