PrefixSpan sequence extraction misunderstanding
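I have a list of tuples of size 3 that represent windowed sequences.
What I need is to use pyspark to get the third element of a tuple, given its first two elements.
So I need it to build three-element sequences based on frequency.
This is what I am doing: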
from pyspark.mllib.fpm import PrefixSpan

data = [[['a','b','c'],['b','c','d'],['c','d','e'],['d','e','f'],['e','f','g'],['f','g','h'],['a','b','c'],['d','e','f'],['a','b','c'],['b','c','d'],['f','g','h'],['d','e','f'],['b','c','d']]]
rdd = spark.sparkContext.parallelize(data, 2)
rdd.cache()
model = PrefixSpan.train(rdd, 0.2, 3)
print(sorted(model.freqSequences().take(100)))
I expected the sequences and their frequencies to follow the alphabetical order of the windows, but they don't.
I get sequences such as:
FreqSequence(sequence=[[u'c'], [u'd'], [u'b']], freq=1)
FreqSequence(sequence=[[u'g'], [u'c'], [u'c']], freq=1)
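which do not appear among the windows I defined. Clearly there is a problem with the way I build the features, or I am missing something about the purpose and workings of the algorithm.
Thanks!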
First, let's take a look at your input:
rdd.count()
1
As you can see, you've created a dataset that contains only one sequence. It can be described as:
<(abc)(bcd)(cde)(def)(efg)(fgh)(abc)(def)(abc)(bcd)(fgh)(def)(bcd)>
So the patterns you get are indeed correct given the input. For example
FreqSequence(sequence=[[u'c'], [u'd'], [u'b']], freq=1)
corresponds to:
...(abc)(def)(abc)...
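To make the matching rule concrete, here is a minimal pure-Python sketch of the containment check that sequential pattern mining relies on. The helper name `contains_pattern` is hypothetical and not part of PySpark; it only illustrates that each itemset of a pattern has to be a subset of some itemset of the sequence, in order, with arbitrary gaps allowed in between.

```python
def contains_pattern(sequence, pattern):
    """Return True if `pattern` is a subsequence of `sequence`.

    Both arguments are lists of itemsets (lists of items). Matching is
    greedy from left to right: each pattern itemset must be a subset of a
    sequence itemset located after the previous match.
    """
    pos = 0
    for itemset in pattern:
        wanted = set(itemset)
        # Skip itemsets that do not contain everything we are looking for.
        while pos < len(sequence) and not wanted <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match a later position
    return True


# The single sequence from the question: 13 itemsets of 3 items each.
sequence = [list(window) for window in
            "abc bcd cde def efg fgh abc def abc bcd fgh def bcd".split()]

print(contains_pattern(sequence, [['c'], ['d'], ['b']]))         # True
print(contains_pattern(sequence, [['g'], ['c'], ['c']]))         # True
print(contains_pattern(sequence, [['a'], ['a'], ['a'], ['a']]))  # False
```

Both patterns from the question are contained in the one and only sequence, which is why they are reported with freq=1.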
If each element of the dataset is meant to represent a separate sequence, the data should have the following shape:
rdd = sc.parallelize([
[['a'], ['b'], ['c']], [['b'], ['c'], ['d']], [['c'], ['d'], ['e']],
[['d'], ['e'], ['f']], [['e'], ['f'], ['g']], [['f'], ['g'], ['h']],
[['a'], ['b'], ['c']], [['d'], ['e'], ['f']], [['a'], ['b'], ['c']],
[['b'], ['c'], ['d']], [['f'], ['g'], ['h']], [['d'], ['e'], ['f']],
[['b'], ['c'], ['d']]
])
rdd.count()
13
rdd.first()
[['a'], ['b'], ['c']]
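where:
- Each element is a list of lists.
- Each inner list is an itemset, that is, the possible alternatives at the given position.
With the data structured like this: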
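If the windows already exist as 3-tuples, as in the question, the reshaping could be sketched as follows. The helper name `windows_to_sequences` is an assumption made for illustration; its result is a plain Python list that can be handed to `sc.parallelize`.

```python
def windows_to_sequences(windows):
    """Turn each window into its own sequence of single-item itemsets."""
    return [[[item] for item in window] for window in windows]


# A few of the windows from the question, one tuple per window.
windows = [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')]
sequences = windows_to_sequences(windows)

print(sequences[0])    # [['a'], ['b'], ['c']]
print(len(sequences))  # 3
```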
model = PrefixSpan.train(rdd, 0.2, 3)
model.freqSequences().top(5, key=lambda x: len(x.sequence))
[FreqSequence(sequence=[['d'], ['e'], ['f']], freq=3),
FreqSequence(sequence=[['b'], ['c'], ['d']], freq=3),
FreqSequence(sequence=[['a'], ['b'], ['c']], freq=3),
FreqSequence(sequence=[['f'], ['g']], freq=3),
FreqSequence(sequence=[['d'], ['f']], freq=3)]
model.freqSequences().top(5, key=lambda x: x.freq)
[FreqSequence(sequence=[['d']], freq=7),
FreqSequence(sequence=[['c']], freq=7),
FreqSequence(sequence=[['f']], freq=6),
FreqSequence(sequence=[['b']], freq=6),
FreqSequence(sequence=[['b'], ['c']], freq=6)]
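For the goal stated in the question (getting the third element from the first two), one possible follow-up is to index the three-step patterns by their two-item prefix. This is only a sketch and not part of the original answer: `build_lookup` and `predict_third` are hypothetical helpers, and the patterns are copied by hand from the output above rather than read from a live model.

```python
from collections import defaultdict

# Length-3 patterns copied from the output above as (sequence, freq) pairs.
# With a live model they would come from model.freqSequences().collect().
patterns = [
    ([['d'], ['e'], ['f']], 3),
    ([['b'], ['c'], ['d']], 3),
    ([['a'], ['b'], ['c']], 3),
]


def build_lookup(patterns):
    """Map each (first, second) prefix to its candidate third items and freqs."""
    lookup = defaultdict(dict)
    for sequence, freq in patterns:
        # Only full three-step patterns of single-item itemsets are usable.
        if len(sequence) == 3 and all(len(step) == 1 for step in sequence):
            first, second, third = (step[0] for step in sequence)
            lookup[(first, second)][third] = freq
    return lookup


def predict_third(lookup, first, second):
    """Return the most frequent third item for the prefix, or None."""
    candidates = lookup.get((first, second))
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


lookup = build_lookup(patterns)
print(predict_third(lookup, 'a', 'b'))  # c
print(predict_third(lookup, 'x', 'y'))  # None
```

Only windows that reach the support threshold show up as patterns, so a prefix such as ('f', 'g'), whose window occurs just twice in the data, gets no prediction here.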