-
Notifications
You must be signed in to change notification settings - Fork 33
Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation #483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
|
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.4.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
jasmezz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a great addition! @HaidYi I really appreciate your effort, your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.
Some other comments we could consider:
- Thinking about renaming the new
dbcansubworkflow tocazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.- This would include changing the output dir in
modules.configto${params.outdir}/cazyme/cazyme_annotation,${params.outdir}/cazyme/cgc,${params.outdir}/cazyme/substrate - file tree in output docs
- test names
- nextflow_schema.json ...
- This would include changing the output dir in
- The database download takes very long because of low download rate (>2 GB at at rate of ~ 1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
- Adding manual dbCAN database download (via bioconda) to the respective section in usage docs.
conf/test_preannotated_dbcan.config
Outdated
| dbcan_skip_cgc = true // skip cgc as .gbk is used | ||
| dbcan_skip_substrate = true // skip substrate as .gbk is used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.
docs/output.md
Outdated
| - `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation. | ||
| - `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation. | ||
| - `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation. | ||
| - `cgc` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many of the files of the cgc and substrate section seem duplicated. Maybe we don't need to store those which are created in the cazyme step already? Can control this in modules.config (e.g. see RGI_MAIN entry).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasmezz Thank you for reviewing the codes. I will revise it based on your comments.
jfy133
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good first PR @HaidYi ! Clean and pretty much all of my comments are sort of minor/just polishing
Some additional things to my direct comments:
- Missing
citations.mdupdate - Missing the how to cite/methods text in this file: https://github.com/HaidYi/funcscan/blob/0cad8f95c553b3cdd3a59c34a0db107bd6df14f4/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf#L174
- Missing metromap update (but we can probably do this before release)
- Missing nf-test test and snapshots for the new tests
| @@ -0,0 +1,34 @@ | |||
| /* | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test should be for all cazyme screening tools, so I would rename accordingly for 'future proofing'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea for leaving a placeholder for other cazyme screening developers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow...
Basicaally what I mean this should be: test_cazyme_pyrodigal not test_dbcan_pyrodigal!
conf/test_preannotated_dbcan.config
Outdated
| run_bgc_screening = false | ||
| run_cazyme_screening = true | ||
|
|
||
| dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add gff files!
You can generate them from a normal funcscan fun, and make a PR against teh funscan branch of nf-core/testdatasets, which has the files and an updated samplesheet for the next funcscan version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, currently the cazyme screening can only use the .gff files in the pipeline. To use the pre-annotated one, I generated the .gff files from pyrodigal. The PR can be found at nf-core/test-datasets#1683.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be updated now you have the file?
docs/output.md
Outdated
| | ├── deepbgc/ | ||
| | ├── gecco/ | ||
| | └── hmmsearch/ | ||
| ├── dbcan/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory with each tool (in this case dbcan), and within that each of the different output directories
docs/output.md
Outdated
|
|
||
| - `dbcan/` | ||
| - `cazyme` | ||
| - `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're missing the <sample.id> sample subdirectory underneath the tool name (accoeding to your modules.confg)
subworkflows/local/cazyme.nf
Outdated
| .join(ch_gffs_for_rundbcan) | ||
| .multiMap { meta, faa, gff -> | ||
| faa: [meta, faa] | ||
| gff: [meta, gff, 'prodigal'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the gff always from prodigal? Or is this a dummy value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If it's the generated in the pipeline, it is always the prodigal. But if it's provided using the pre-annotated one, then it could be either NCBI_prok, JGI, NCBI_euk or prodigal. This makes things complicated. An easier way is to define a parameter in the cli for this option but it's kind of hard to deal with the mixed case in a batch without doing the modifications in the samplesheet.
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
|
@jfy133 Thank you for the comments and suggestions. I will fix all the problems one-by-one. As I don't want this PR corrupt other screening steps, I will do a more comprehensive testing, which may take more time. I will let you know when I fix all the issues. |
|
@jfy133 Please review it again. All the issues you raised have already been resolved in the current codes. |
|
@nf-core-bot fix linting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are almost there I think @HaidYi !
A few minor code things, and I noticed a few comments that are still outstanding from previous reviews.
A major thing though: you need to add yourself a as contributor! Please add a changelog entry, and add yourself to the main README list of people and the manifest section of the nextflow.config!
I think we are at maturity of the PR that on my next round of review, I may just make sense for me to directly make changes rather than making you have to do it yourself as I think they they will all be very minor (each fixing the annjoying RO crate linting error) - if you're OK with it, could you add me as a collaborator to your fork, and then I can have push changes rights for my next review? How does that sound?
Also note I've finished travelling now for a while so should be much faster to respond and review - I will check each Wednesday morning :)
workflows/funcscan.nf
Outdated
| // Add gff_type to meta for cazyme screening | ||
| if ((params.run_cazyme_screening && !params.cazyme_skip_dbcan && (!params.dbcan_skip_cgc || !params.dbcan_skip_substrate)) && params.annotation_tool in ['pyrodigal', 'prodigal', 'prokka', 'bakta']) { | ||
| ch_new_annotation_short.map { meta, fasta, faa, gff, gbk -> | ||
| def new_meta = meta + [gff_type: 'prodigal'] // Only Use 'prodigal' as dbcan does not distinguish 'pyrodigal' and 'prodigal' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what if the annotation is from prokka or bakta 🤔 , this would be the wrong type of GFF, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to use prokka or bakta because they also use either pyrodigal or prodigal in their implementation.
conf/test_preannotated_dbcan.config
Outdated
| run_bgc_screening = false | ||
| run_cazyme_screening = true | ||
|
|
||
| dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be updated now you have the file?
| }, | ||
| "dbcan_skip_cgc": { | ||
| "type": "boolean", | ||
| "description": "Skip CGC during the dbCAN screening.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still missing
|
I see you've not got to this yet @HaidYi , will check again next week :) (no rush though!) |
|
I hope you feel better soon @HaidYi ! |
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
|
@jfy133 Hi James, I am back for this PR. Following your comments, I fixed the problems left if I understand your comments correctly. I will ensure it can pass all the |
|
Hi @HaidYi - sorry for the delay, lots of end of year deadlines. Sounds good! So the error is a time out during the downloading of the dbCan dtabase: |
|
@jfy133 Thank you for pointing this. For this issue, I contacted the tool author. The problem is because the server has a low uploading bandwidth at UNL. So, the authors just got approved for hosting the data on AWS S3. So, they will update the Then, I think the time-out problem in the testing will be resolved automatically when I pull the newest module. I will keep you in the loop. |
|
Ok! Let's see if it helps 👍👍 |
|
@HaidYi I will check again after the holidays, but I just had a though it may also be an idea to ask the developer to make a mini database anyway . It may be useful for other cases too, of just needs to include a couple of gene sequences so there is something that is compatible with running db_can (even if output is nonsense). |
PR checklist
Close #481.
The main changes include:
subworkflows/dbcan.nf) for the support of run_dbcan screening..gfffiles and added the alias of the current modules (e.g.,PYRODIGAL_GFF). So, the inputgbkcolumn may also usegfffile as input. Feel free to change this part as it may need some tweaks considering the both the pipeline and the document.module.config, etc.readmeandoutputThings that are needed the changes from the maintainer:
nf-core pipelines lint).nextflow run . -profile test,docker --outdir <OUTDIR>).nextflow run . -profile debug,test,docker --outdir <OUTDIR>).docs/usage.mdis updated.docs/output.mdis updated.CHANGELOG.mdis updated.README.mdis updated (including new tool citations and authors/contributors).