BasMonkey left a comment
I've made some remarks about performance and code duplication. Please see the comments left on the code.
is_codes <- rownames(is_list)

# check if there is data present for all the samples that the pipeline started with,
# if not write sample name to a log file.
It seems this comment has been lost in this change. I think it's quite useful.
DIMS/GenerateQCOutput.R
Outdated
# pos
for (line_index in seq_len(nrow(is_pos_selection_subset))) {
  is_selected <- is_pos_selection_subset$HMDB_name[line_index]
  thresh_selected <- all_is_thresholds$plasma$pos[which(all_is_thresholds$names$pos == is_selected)]
The which() call performs a linear search for every row. A more efficient approach is to use match().
In this case, the list of values is really small, so I'm not too worried about performance. which() is used often in different scripts of the pipeline, so the v3.5 refactor will investigate which instances of which() can be replaced by match(). The main difference between the two is that which() returns all matching positions, whereas match() returns only the first. For each occurrence of which() in the code, we'll have to decide whether it can be replaced by match().
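To make the trade-off discussed in this thread concrete, here is a minimal sketch of the lookup written with match(). The objects `thresholds` and `selection` are hypothetical stand-ins for `all_is_thresholds` and `is_pos_selection_subset`, with invented names and values; the real structures in the pipeline may differ.

```r
# Hypothetical stand-in for all_is_thresholds (names and values are invented).
thresholds <- list(
  names  = list(pos = c("IS_A", "IS_B", "IS_C")),
  plasma = list(pos = c(1000, 2500, 4000))
)

# Hypothetical stand-in for is_pos_selection_subset.
selection <- data.frame(
  HMDB_name = c("IS_C", "IS_A", "IS_B"),
  Intensity = c(3500, 1200, 2500),
  stringsAsFactors = FALSE
)

# One vectorised lookup for all rows instead of one which() call per row.
idx <- match(selection$HMDB_name, thresholds$names$pos)
thresh_selected <- thresholds$plasma$pos[idx]

# Rows whose intensity is below the matching threshold.
# A name without a threshold gives NA, so those rows are excluded explicitly.
is_below <- !is.na(thresh_selected) & selection$Intensity < thresh_selected
below <- selection[is_below, , drop = FALSE]

# Behavioural difference to check before replacing which() with match():
dup <- c("a", "b", "a")
all_hits  <- which(dup == "a")  # every matching position
first_hit <- match("a", dup)    # only the first matching position
no_hit    <- match("z", dup)    # NA, where which() would give integer(0)
```

If the threshold names are guaranteed to be unique, both functions select the same element, so the replacement should be safe for this lookup. Where duplicates are possible, the two are not interchangeable.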
DIMS/GenerateQCOutput.R
Outdated
  is_selected <- is_pos_selection_subset$HMDB_name[line_index]
  thresh_selected <- all_is_thresholds$plasma$pos[which(all_is_thresholds$names$pos == is_selected)]
  if (is_pos_selection_subset$Intensity[line_index] < thresh_selected) {
    is_below_threshold <- rbind(is_below_threshold, is_pos_selection_subset[line_index, ])
Avoid rbind() in a loop: it repeatedly reallocates and copies the data frame, which is inefficient and can use a large amount of RAM on larger datasets. Consider collecting the indices or rows first and binding once at the end.
Duly noted. This is a very small data frame, so I'll leave it as is, but in the refactor for v3.5 where all scripts are evaluated, I will take this point into consideration.
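For the v3.5 refactor, the suggestion above could look roughly like the sketch below. It assumes the thresholds have already been aligned to the rows of the selection; `selection` and `thresh_selected` are hypothetical stand-ins with invented values, not the pipeline's actual data.

```r
# Hypothetical stand-in for is_pos_selection_subset (values are invented).
selection <- data.frame(
  HMDB_name = c("IS_A", "IS_B", "IS_C"),
  Intensity = c(1200, 2000, 3500),
  stringsAsFactors = FALSE
)
# Hypothetical thresholds, assumed to be in the same order as the rows above.
thresh_selected <- c(1000, 2500, 4000)

# Option 1: fully vectorised, no loop and no repeated rbind().
is_below_threshold <- selection[selection$Intensity < thresh_selected, , drop = FALSE]

# Option 2: keep the loop, but only record which rows qualify
# in a preallocated logical vector and subset once at the end.
keep <- logical(nrow(selection))
for (line_index in seq_len(nrow(selection))) {
  keep[line_index] <- selection$Intensity[line_index] < thresh_selected[line_index]
}
is_below_loop <- selection[keep, , drop = FALSE]
```

Both options touch the data frame once instead of copying it on every iteration. For a data frame as small as the one in this script the difference is unlikely to be measurable, which is consistent with leaving the current code as is for now.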
This feature adds extra QC information from the DIMS pipeline to the final email, so that the user can assess the quality of the run at a glance.
Several steps of the pipeline, in particular AverageTechReplicates and GenerateQCOutput, generate additional txt files, which are included as content in DIMS.nf.