Hash map as unique value store / instance counter (Java)
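I am trying to create a program that learns rules by inference. For example, the inputs 'contains'('vitamin c', 'oranges'). and 'prevents'('scurvy', 'vitamin c'). would produce the output "rule" 'prevents'('scurvy', 'oranges').
I have code that generates that output, but I then want to eliminate duplicate "rules" from the input while keeping track of how many times each one was observed (as a naive confidence measure, since a frequently observed rule is more likely to be true). So I implemented a hash map that stores the "rule" as the key and the number of observed instances as the value. However, the hash map does not seem to behave correctly, and I am at a loss as to why; perhaps someone more knowledgeable than I am can spot the cause.
Structure of the machine learning component: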
private List<Sentence> sentences = new ArrayList<>();
/*
* The following maps store the relation of a string occurring
* as a subject or object, respectively, to the list of Sentence
* ordinals where they occur.
*/
private Map<String,List<Integer>> subject2index = new HashMap<>();
private Map<String,List<Integer>> object2index = new HashMap<>();
/*
* This set contains strings that occur as both,
* subject and object. This is useful for determining strings
* acting as an in-between connecting two relations.
*/
private Set<String> joints = new HashSet<>();
public void addSentence( Sentence s )
{
// add Sentence to the list of all Sentences
sentences.add( s );
// add the Subject of the Sentence to the map mapping strings
// occurring as a subject to the ordinal of this Sentence
List<Integer> subind = subject2index.get( s.getSubject() );
if( subind == null )
{
subind = new ArrayList<>();
subject2index.put( s.getSubject(), subind );
}
subind.add( sentences.size() - 1 );
// add the Object of the Sentence to the map mapping strings
// occurring as an object to the ordinal of this Sentence
List<Integer> objind = object2index.get( s.getObject() );
if( objind == null )
{
objind = new ArrayList<>();
object2index.put( s.getObject(), objind );
}
objind.add( sentences.size() - 1 );
// determine whether we've found a "joining" string
if( subject2index.containsKey( s.getObject() ) )
{
joints.add( s.getObject() );
}
if( object2index.containsKey( s.getSubject() ) )
{
joints.add( s.getSubject() );
}
}
public Collection<String> getJoints()
{
return joints;
}
public List<Integer> getSubjectIndices( String subject )
{
return subject2index.get( subject );
}
public List<Integer> getObjectIndices( String object )
{
return object2index.get( object );
}
public Sentence getSentence( int index )
{
return sentences.get( index );
}
Hash map used to store only unique copies and their number of occurrences:
//map to store learned 'rules'
Map<Sentence, Integer> ruleCount = new HashMap<>();
//store data
public void numberRules(Sentence sentence)
{
if (!ruleCount.containsKey(sentence))
{
ruleCount.put(sentence, 0);
}
ruleCount.put(sentence, ruleCount.get(sentence) + 1);
}
The Sentence object:
public class Sentence
{
private String verb;
private String object;
private String subject;
public Sentence(String verb, String object, String subject )
{
this.verb = verb;
this.object = object;
this.subject = subject;
}
public String getVerb()
{
return verb;
}
public String getObject()
{
return object;
}
public String getSubject()
{
return subject;
}
public String toString()
{
return verb + "(" + object + ", " + subject + ").";
}
}
Current input:
'prevents'('scurvy', 'vitamin C').
'contains'('vitamin C', 'orange').
'contains'('vitamin C', 'sauerkraut').
'is a'('fruit', 'orange').
'improves'('health', 'fruit').
'contains'('vitamin C', 'orange').
'improves'('health', 'fruit').
Current output:
prevents(scurvy, orange). : 1
improves(health, orange). : 1
prevents(scurvy, orange). : 1
prevents(scurvy, sauerkraut). : 1
improves(health, orange). : 1
Desired output:
prevents(scurvy, orange). : 2
improves(health, orange). : 2
prevents(scurvy, sauerkraut). : 1
Code being executed:
public static void main(String[] args) throws IOException
{
Ontology ontology = new Ontology();
BufferedReader br = new BufferedReader(new FileReader("file.txt"));
// backslashes must be doubled inside a Java string literal
Pattern p = Pattern.compile("'(.*?)'\\('(.*?)',\\s*'(.*?)'\\)\\.");
String line;
while ((line = br.readLine()) != null)
{
Matcher m = p.matcher(line);
if( m.matches() )
{
String verb = m.group(1);
String object = m.group(2);
String subject = m.group(3);
ontology.addSentence( new Sentence( verb, object, subject ) );
}
}
for( String joint: ontology.getJoints() )
{
for( Integer subind: ontology.getSubjectIndices( joint ) )
{
Sentence xaS = ontology.getSentence( subind );
for( Integer obind: ontology.getObjectIndices( joint ) )
{
Sentence yOb = ontology.getSentence( obind );
Sentence s = new Sentence( xaS.getVerb(),
xaS.getObject(),
yOb.getSubject() );
//System.out.println( s );
ontology.numberRules( s );
}
}
}
for (Map.Entry<Sentence, Integer> entry : ontology.ruleCount.entrySet())
{
System.out.println(entry.getKey()+" : "+entry.getValue());
}
}
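By the way, when I run this on a large file, I get an OutOfMemory: Java heap space error and the program crashes. I know I could increase the heap size, but that seems like a poor solution. How can I make this code more efficient so that it can handle large datasets?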
As I suggested in your previous question, you should override hashCode and equals in the Sentence class, because the default behavior implemented in the Object class does not fit your needs.
@Override
public boolean equals (Object other)
{
if (!(other instanceof Sentence))
return false;
if (other == this)
return true;
Sentence o = (Sentence) other;
return o.subject.equals(subject) && o.object.equals(object) && o.verb.equals(verb);
}
@Override
public int hashCode ()
{
return Objects.hash(object, subject, verb); // this method only exists since Java 7
}
When you use one of your own classes as a key in a HashMap (as you do with the Sentence class), you must override equals() and hashCode(). If you don't, a.equals(b) returns true only when a == b, which is probably not the behavior you want.
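You want a.equals(b) to return true when the verb, object, and subject of the two Sentence instances being compared are respectively equal. Otherwise, two sentences that you consider identical may be treated as different keys by the HashMap.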
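To make the difference concrete, here is a minimal sketch that puts two structurally identical keys into a HashMap, once with a class that inherits the default equals/hashCode from Object and once with a class that overrides both. The names KeyEqualityDemo, IdentityKey, and ValueKey are hypothetical stand-ins for illustration, not part of the question's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyEqualityDemo {

    // Hypothetical key WITHOUT overrides: inherits identity-based equals/hashCode from Object.
    static final class IdentityKey {
        final String verb, object, subject;

        IdentityKey(String verb, String object, String subject) {
            this.verb = verb;
            this.object = object;
            this.subject = subject;
        }
    }

    // Hypothetical key WITH value-based equals/hashCode over the same three fields.
    static final class ValueKey {
        final String verb, object, subject;

        ValueKey(String verb, String object, String subject) {
            this.verb = verb;
            this.object = object;
            this.subject = subject;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof ValueKey)) return false;
            ValueKey o = (ValueKey) other;
            return verb.equals(o.verb) && object.equals(o.object) && subject.equals(o.subject);
        }

        @Override
        public int hashCode() {
            return Objects.hash(verb, object, subject);
        }
    }

    // Inserts the same rule twice as two separate objects; returns the number of map entries.
    static int distinctIdentityKeys() {
        Map<IdentityKey, Integer> counts = new HashMap<>();
        counts.merge(new IdentityKey("prevents", "scurvy", "orange"), 1, Integer::sum);
        counts.merge(new IdentityKey("prevents", "scurvy", "orange"), 1, Integer::sum);
        return counts.size();
    }

    // Same experiment with the value-based key.
    static int distinctValueKeys() {
        Map<ValueKey, Integer> counts = new HashMap<>();
        counts.merge(new ValueKey("prevents", "scurvy", "orange"), 1, Integer::sum);
        counts.merge(new ValueKey("prevents", "scurvy", "orange"), 1, Integer::sum);
        return counts.size();
    }

    public static void main(String[] args) {
        System.out.println("identity keys: " + distinctIdentityKeys()); // prints "identity keys: 2"
        System.out.println("value keys: " + distinctValueKeys());       // prints "value keys: 1"
    }
}
```

The first map ends up with two entries, each counted once, which is the same symptom as the duplicated lines in the current output above.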
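The implementation of hashCode() must be consistent with that of equals: if a.equals(b) is true, then a.hashCode() == b.hashCode() must also be true. That is why hashCode should be a function of the 3 properties of the Sentence class.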
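As a rough end-to-end check, the sketch below counts the five derived rules listed under the current output, using a key class whose equals and hashCode are built from the same three fields. Rule, RuleCountDemo, countRules, and sampleRules are hypothetical names, and LinkedHashMap is swapped in for the question's HashMap purely so that the print order is predictable; the counting itself works the same with HashMap.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class RuleCountDemo {

    // Hypothetical stand-in for Sentence: equals and hashCode use the same three fields.
    static final class Rule {
        private final String verb;
        private final String object;
        private final String subject;

        Rule(String verb, String object, String subject) {
            this.verb = verb;
            this.object = object;
            this.subject = subject;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof Rule)) return false;
            Rule o = (Rule) other;
            return verb.equals(o.verb) && object.equals(o.object) && subject.equals(o.subject);
        }

        @Override
        public int hashCode() {
            // Equal rules have equal fields, hence equal hash codes.
            return Objects.hash(verb, object, subject);
        }

        @Override
        public String toString() {
            return verb + "(" + object + ", " + subject + ").";
        }
    }

    // Counts occurrences per rule; LinkedHashMap (an assumption) keeps first-seen order.
    static Map<Rule, Integer> countRules(List<Rule> rules) {
        Map<Rule, Integer> counts = new LinkedHashMap<>();
        for (Rule rule : rules) {
            counts.merge(rule, 1, Integer::sum);
        }
        return counts;
    }

    // The five derived rules from the question's current output, in the same order.
    static List<Rule> sampleRules() {
        return Arrays.asList(
                new Rule("prevents", "scurvy", "orange"),
                new Rule("improves", "health", "orange"),
                new Rule("prevents", "scurvy", "orange"),
                new Rule("prevents", "scurvy", "sauerkraut"),
                new Rule("improves", "health", "orange"));
    }

    public static void main(String[] args) {
        for (Map.Entry<Rule, Integer> entry : countRules(sampleRules()).entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
        // prints:
        // prevents(scurvy, orange). : 2
        // improves(health, orange). : 2
        // prevents(scurvy, sauerkraut). : 1
    }
}
```

Making the fields final is a design choice in this sketch: a key whose fields change after insertion would change its hash code and could no longer be found in the map.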