[WIP][SPARK-1405][MLLIB] Collapsed Gibbs sampling based latent Dirichlet allocation #1983
witgo wants to merge 1 commit into apache:master from
|
QA tests have started for PR 1983 at commit
|
|
QA tests have finished for PR 1983 at commit
|
@mengxr This patch removed the |
|
Tests timed out after a configured wait of |
|
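@witgo Can the following piece of code be multithreaded? Change this code to My current situation is that each machine in the cluster has many CPU cores (24) but limited memory, so I cannot make full use of the CPU resources. I would like to multithread part of the code. |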
|
@allwefantasy Spark lets you adjust the number of tasks that an executor runs concurrently. |
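The comment above does not say which settings were meant. As a rough sketch of the trade-off being discussed (many cores, limited memory), the number of tasks an executor runs at once is its core count divided by the cores reserved per task, which in Spark are controlled by the `spark.executor.cores` and `spark.task.cpus` properties. The helper names and the 8192 MB heap figure below are hypothetical and only illustrate the arithmetic; Spark does not statically split the heap per task, so the per-slot figure is an average, not a guarantee.

```python
def task_slots(executor_cores, task_cpus=1):
    """Number of tasks one executor can run at the same time.

    executor_cores corresponds to spark.executor.cores and
    task_cpus to spark.task.cpus (cores reserved per task).
    """
    if task_cpus < 1 or executor_cores < task_cpus:
        raise ValueError("need executor_cores >= task_cpus >= 1")
    return executor_cores // task_cpus


def avg_memory_per_slot_mb(executor_memory_mb, executor_cores, task_cpus=1):
    """Average heap share per concurrently running task (a rough estimate only)."""
    return executor_memory_mb / task_slots(executor_cores, task_cpus)


if __name__ == "__main__":
    # 24 cores comes from the thread; 8192 MB is a hypothetical heap size.
    for cpus in (1, 2, 4):
        print(cpus, task_slots(24, cpus), avg_memory_per_slot_mb(8192, 24, cpus))
```

Raising `spark.task.cpus` lowers the number of concurrent tasks, which gives each task a larger average share of executor memory at the cost of idle cores, unless the task itself uses several threads.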
|
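@witgo Thanks for sharing this tip. I have run into another problem. Yesterday you asked how many words my 240,000 documents contain. I counted 24 million words, computed as (parsedData.map(f:Document=>f.content.size).sum()), with 80,000 terms. Initialization is very fast and finishes in about a minute. During the first iteration, however, each task needs to serialize about 26 MB of data. Then, after "Cleaned broadcast", spark-shell stops responding. On a URL like http://csdn-hdp-nn-01:4040/stages/stage/?id=11 all tasks are shown as running, and on each worker the JVM old generation and so on look normal. But the CPU is idle, and it looks as if the tasks are not actually running. Have you run into this problem? After that it stays stuck there with no response. |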
|
@allwefantasy The current code creates too many TopicModel instances during the iterative computation. I am working on fixing this now. |
|
@witgo OK. Please let me know when there is an update. I can test it here right away. |
|
@witgo @allwefantasy We had an offline discussion about LDA's implementation. Please check the JIRA page for the notes. |
|
The current broadcast-based implementation suffers a serious performance loss, especially when the corpus is large. Next week I will submit a GraphX-based implementation. |
@rxin @mengxr
The closure of the mapPartitions method does not seem to be cleaned correctly. The serialized corpus RDD is almost as large as the serialized topicModel broadcast.
cat spark.log | grep 'stored as values in memory' =>
14/09/13 00:47:59 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 218.2 KB, free 2.8 GB)
14/09/13 00:48:04 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 2.8 GB)
14/09/13 00:48:08 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.7 KB, free 2.8 GB)
14/09/13 00:48:20 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.4 KB, free 2.8 GB)
14/09/13 00:48:23 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.6 KB, free 2.8 GB)
14/09/13 00:48:25 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.6 KB, free 2.8 GB)
14/09/13 00:48:25 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.1 KB, free 2.8 GB)
14/09/13 00:48:30 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 2.9 KB, free 2.8 GB)
14/09/13 00:48:35 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 3.2 KB, free 2.8 GB)
14/09/13 00:48:44 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 68.6 KB, free 2.8 GB)
14/09/13 00:48:45 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 41.7 KB, free 2.8 GB)
14/09/13 00:49:21 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 197.5 MB, free 2.6 GB)
14/09/13 00:49:24 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 197.7 MB, free 2.3 GB)
14/09/13 00:53:25 INFO MemoryStore: Block broadcast_13 stored as values in memory (estimated size 163.9 MB, free 2.1 GB)
14/09/13 00:53:28 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 164.0 MB, free 1878.0 MB)
14/09/13 00:57:34 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 149.7 MB, free 1658.5 MB)
14/09/13 00:57:36 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 150.0 MB, free 1444.0 MB)
14/09/13 01:01:34 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 141.1 MB, free 1238.3 MB)
14/09/13 01:01:36 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 141.2 MB, free 1036.2 MB)
14/09/13 01:05:12 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 134.5 MB, free 840.7 MB)
14/09/13 01:05:14 INFO MemoryStore: Block broadcast_20 stored as values in memory (estimated size 134.7 MB, free 647.8 MB)
14/09/13 01:08:39 INFO MemoryStore: Block broadcast_21 stored as values in memory (estimated size 218.3 KB, free 589.5 MB)
14/09/13 01:08:39 INFO MemoryStore: Block broadcast_22 stored as values in memory (estimated size 218.3 KB, free 589.2 MB)
14/09/13 01:08:40 INFO MemoryStore: Block broadcast_23 stored as values in memory (estimated size 134.6 MB, free 454.6 MB)
14/09/13 01:08:53 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 129.3 MB, free 267.1 MB)
14/09/13 01:08:55 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 129.4 MB, free 82.0 MB)
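To make the pattern in this log easier to see, here is a minimal sketch that extracts the block name, estimated size, and free memory from `MemoryStore` lines of the form shown above and lists the large blocks. It is a stand-alone helper written for illustration, not part of the PR; the function names and the 100 MB threshold are assumptions.

```python
import re

# Matches lines such as:
# ... INFO MemoryStore: Block broadcast_11 stored as values in memory
#     (estimated size 197.5 MB, free 2.6 GB)
LINE_RE = re.compile(
    r"Block (broadcast_\d+) stored as values in memory "
    r"\(estimated size ([\d.]+) (B|KB|MB|GB), free ([\d.]+) (B|KB|MB|GB)\)"
)

# Conversion factors from each unit to megabytes.
TO_MB = {"B": 1.0 / (1024 * 1024), "KB": 1.0 / 1024, "MB": 1.0, "GB": 1024.0}


def parse_broadcast_sizes(lines):
    """Return (block_name, size_mb, free_mb) for every matching log line."""
    rows = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            name, size, size_unit, free, free_unit = match.groups()
            rows.append(
                (name, float(size) * TO_MB[size_unit], float(free) * TO_MB[free_unit])
            )
    return rows


def large_blocks(rows, threshold_mb=100.0):
    """Keep only blocks whose estimated size is at least threshold_mb."""
    return [row for row in rows if row[1] >= threshold_mb]


if __name__ == "__main__":
    import sys

    # Usage: python broadcast_sizes.py < spark.log
    for name, size_mb, free_mb in large_blocks(parse_broadcast_sizes(sys.stdin)):
        print(f"{name}\t{size_mb:.1f} MB\tfree {free_mb:.1f} MB")
```

Run over the log above, this would list the blocks from broadcast_11 onward that exceed 100 MB, alongside the steadily shrinking free memory.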
|
@allwefantasy |
|
@witgo I have seen your Spark configuration for the new performance test. I will try your latest code and test it on my data today. |
|
@witgo I have tried your latest code on my corpus. It no longer gets stuck in broadcasting. However, some exceptions are thrown. |
|
@witgo Since we are converging on a GraphX-based implementation and distributed representation of the topic model, do you mind closing this PR? Thanks! |




This PR is based on @yinxusen's #476
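For readers unfamiliar with the algorithm named in the title, the following is a minimal single-machine sketch of collapsed Gibbs sampling for LDA. It is not the code in this PR and says nothing about how the PR distributes the work. It assumes symmetric priors `alpha` and `beta`, documents given as lists of integer term ids, and the standard conditional p(z = k) ∝ (n_dk + alpha) · (n_kw + beta) / (n_k + V · beta); all function and variable names are hypothetical.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iterations=50, seed=0):
    """Single-machine collapsed Gibbs sampler for LDA (illustrative sketch).

    docs: list of documents, each a list of term ids in [0, vocab_size).
    Returns (assignments, doc_topic, topic_term, topic_total).
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n_dk
    topic_term = [[0] * vocab_size for _ in range(num_topics)]   # n_kw
    topic_total = [0] * num_topics                               # n_k
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_term[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    topics = range(num_topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_term[k_old][w] -= 1
                topic_total[k_old] -= 1

                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_term[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in topics
                ]
                k_new = rng.choices(topics, weights=weights)[0]

                # Add the token back under its newly sampled topic.
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_term[k_new][w] += 1
                topic_total[k_new] += 1

    return assignments, doc_topic, topic_term, topic_total
```

Every token update reads and writes the shared topic-term counts. In a distributed setting those counts have to be shipped to the workers in some form, and the thread above reports that broadcasting the topic model each iteration becomes expensive on a large corpus.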
The performance test:
500topics:1001000topics:10010002000topics:1502000
conf/spark-defaults.conf: