Conversation


@HaidYi HaidYi commented Jul 2, 2025

PR checklist

Close #481.

The main changes include:

  • Like the other screening tools, added a dedicated subworkflow (subworkflows/dbcan.nf) to support run_dbcan screening.
  • Added an annotation step for generating the .gff files and added aliases of the current modules (e.g., PYRODIGAL_GFF), so the input gbk column can also take a gff file as input. Feel free to change this part, as it may need some tweaks considering both the pipeline and the documentation.
  • Other utilities:
    • CI/CD, test profiles for dbcan, modules.config, etc.
    • documentation: README and output docs

Changes needed from the maintainer:

  • Add the changelog for this change in the next release version.
  • Add the dbcan screening step in the schematic workflow.

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/funcscan branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@HaidYi HaidYi self-assigned this Jul 2, 2025
@HaidYi HaidYi added the enhancement Improvement for existing functionality label Jul 2, 2025

nf-core-bot commented Jul 2, 2025

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.4.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@jasmezz jasmezz (Collaborator) left a comment

What a great addition! @HaidYi, I really appreciate your effort; your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.

Some other comments we could consider:

  • Thinking about renaming the new dbcan subworkflow to cazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.
    • This would include changing the output dir in modules.config to ${params.outdir}/cazyme/cazyme_annotation, ${params.outdir}/cazyme/cgc, ${params.outdir}/cazyme/substrate (see the config sketch below this list)
    • file tree in output docs
    • test names
    • nextflow_schema.json ...
  • The database download takes very long because of the low download rate (>2 GB at a rate of ~1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
  • Adding manual dbCAN database download (via bioconda) to the respective section in usage docs.
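
If the rename is adopted, the publish paths in conf/modules.config could look roughly like the sketch below. This is illustrative only: the withName selectors (RUNDBCAN_CAZYMEANNOTATION, RUNDBCAN_EASYCGC, RUNDBCAN_EASYSUBSTRATE) are assumed names based on the modules discussed in this PR, not verified against the current code.

process {
    withName: RUNDBCAN_CAZYMEANNOTATION {
        // CAZyme family annotation results
        publishDir = [path: { "${params.outdir}/cazyme/cazyme_annotation" }, mode: params.publish_dir_mode]
    }
    withName: RUNDBCAN_EASYCGC {
        // CAZyme gene cluster (CGC) results
        publishDir = [path: { "${params.outdir}/cazyme/cgc" }, mode: params.publish_dir_mode]
    }
    withName: RUNDBCAN_EASYSUBSTRATE {
        // Substrate prediction results
        publishDir = [path: { "${params.outdir}/cazyme/substrate" }, mode: params.publish_dir_mode]
    }
}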

Comment on lines 35 to 36
dbcan_skip_cgc = true // skip cgc as .gbk is used
dbcan_skip_substrate = true // skip substrate as .gbk is used
Collaborator

If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.
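
For illustration, such a samplesheet could look roughly like this (hypothetical: the existing column names should be taken from the pipeline's assets/schema_input.json, and the extra gff column does not exist yet):

sample,fasta,protein,gbk,gff
sample_1,sample_1.fasta.gz,sample_1.faa,sample_1.gbk,sample_1.gff
sample_2,sample_2.fasta.gz,,,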

docs/output.md Outdated
- `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation.
- `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation.
- `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation.
- `cgc`
Collaborator

Many of the files in the cgc and substrate sections seem duplicated. Maybe we don't need to store those that are already created in the cazyme step? This can be controlled in modules.config (e.g. see the RGI_MAIN entry).
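
One way to do that, as a minimal conf/modules.config sketch (the selector name and the saveAs logic are illustrative assumptions, not the actual entry):

process {
    withName: RUNDBCAN_EASYSUBSTRATE {
        publishDir = [
            path: { "${params.outdir}/cazyme/substrate" },
            mode: params.publish_dir_mode,
            // Returning null from saveAs skips publishing that file, so outputs
            // already published by the CAZyme annotation step can be dropped here.
            saveAs: { filename -> filename.endsWith('_overview.tsv') ? null : filename },
        ]
    }
}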

Author

@jasmezz Thank you for reviewing the code. I will revise it based on your comments.

@jfy133 jfy133 (Member) left a comment

Really good first PR, @HaidYi! Clean, and pretty much all of my comments are minor / just polishing.

Some additional things on top of my direct comments:

@@ -0,0 +1,34 @@
/*
Member

The test should be for all cazyme screening tools, so I would rename accordingly for 'future proofing'

Author

This is a good idea for leaving a placeholder for other cazyme screening developers.

Member

I'm not sure I follow...

Basically, what I mean is this should be test_cazyme_pyrodigal, not test_dbcan_pyrodigal!

run_bgc_screening = false
run_cazyme_screening = true

dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
Member

We should probably add gff files!

You can generate them from a normal funcscan run, and make a PR against the funcscan branch of nf-core/test-datasets, which has the files and an updated samplesheet for the next funcscan version

Author

Yes, currently the cazyme screening can only use the .gff files generated in the pipeline. To use pre-annotated ones, I generated the .gff files with pyrodigal. The PR can be found at nf-core/test-datasets#1683.

Member

Can this be updated now you have the file?

docs/output.md Outdated
| ├── deepbgc/
| ├── gecco/
| └── hmmsearch/
├── dbcan/
Member

The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory for each tool (in this case dbcan), and within that each of the different output directories.
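
For example, something like the following layout (illustrative only; the subdirectory names follow the ones used elsewhere in this PR):

cazyme/
└── dbcan/
    ├── cazyme_annotation/
    ├── cgc/
    └── substrate/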

docs/output.md Outdated

- `dbcan/`
- `cazyme`
- `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation
Member

You're missing the <sample.id> sample subdirectory underneath the tool name (according to your modules.config).

.join(ch_gffs_for_rundbcan)
.multiMap { meta, faa, gff ->
faa: [meta, faa]
gff: [meta, gff, 'prodigal']
Member

Is the gff always from prodigal? Or is this a dummy value?

Author

Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If the GFF is generated in the pipeline, it is always prodigal. But if a pre-annotated one is provided, it could be NCBI_prok, JGI, NCBI_euk or prodigal. This makes things complicated. An easier way is to define a parameter on the CLI for this option, but it's hard to deal with a mixed batch without modifying the samplesheet.
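
As a rough Nextflow sketch of that idea, assuming a channel of [meta, faa, gff] tuples upstream of the multiMap: a pipeline-wide parameter could provide the default while an optional per-sample value overrides it (the dbcan_gff_type parameter and the gff_type meta field are hypothetical, not part of the current code):

// Per-sample gff_type from the samplesheet (if present) wins; otherwise fall
// back to a CLI parameter, and finally to 'prodigal'.
ch_dbcan_input = ch_annotations.map { meta, faa, gff ->
    def gff_type = meta.gff_type ?: params.dbcan_gff_type ?: 'prodigal'
    [ meta, faa, gff, gff_type ]
}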

HaidYi and others added 4 commits July 16, 2025 19:24

HaidYi commented Jul 17, 2025

@jfy133 Thank you for the comments and suggestions. I will fix all the problems one by one. As I don't want this PR to break other screening steps, I will do more comprehensive testing, which may take more time. I will let you know when I have fixed all the issues.


HaidYi commented Sep 28, 2025

@jfy133 Please review it again. All the issues you raised have been resolved in the current code.

@HaidYi HaidYi requested a review from jfy133 September 28, 2025 21:46

jfy133 commented Oct 4, 2025

@jfy133 Please review it again. All the issues you raised have been resolved in the current code.

Thanks @HaidYi! I just made a release so you have a clean slate and thus a 'clean release' once this PR is in.

I will resolve the conflicts next week and do the review :)


jfy133 commented Oct 8, 2025

@nf-core-bot fix linting

@jfy133 jfy133 (Member) left a comment

We are almost there, I think, @HaidYi!

A few minor code things, and I noticed a few comments that are still outstanding from previous reviews.

A major thing though: you need to add yourself as a contributor! Please add a changelog entry, and add yourself to the main README list of people and the manifest section of the nextflow.config!

I think the PR is mature enough that on my next round of review it may just make sense for me to directly make changes rather than making you do it yourself, as I think they will all be very minor (e.g. fixing the annoying RO-crate linting error). If you're OK with it, could you add me as a collaborator on your fork so that I have push rights for my next review? How does that sound?

Also note I've finished travelling for a while now, so I should be much faster to respond and review. I will check each Wednesday morning :)

// Add gff_type to meta for cazyme screening
if ((params.run_cazyme_screening && !params.cazyme_skip_dbcan && (!params.dbcan_skip_cgc || !params.dbcan_skip_substrate)) && params.annotation_tool in ['pyrodigal', 'prodigal', 'prokka', 'bakta']) {
ch_new_annotation_short.map { meta, fasta, faa, gff, gbk ->
def new_meta = meta + [gff_type: 'prodigal'] // Only Use 'prodigal' as dbcan does not distinguish 'pyrodigal' and 'prodigal'
Member

But what if the annotation is from prokka or bakta 🤔? Then this would be the wrong type of GFF, right?

Author

I think it's fine to use prokka or bakta because they also use either pyrodigal or prodigal in their implementation.

run_bgc_screening = false
run_cazyme_screening = true

dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
Member

Can this be updated now you have the file?

},
"dbcan_skip_cgc": {
"type": "boolean",
"description": "Skip CGC during the dbCAN screening.",
Member

Still missing

@HaidYi HaidYi removed the enhancement Improvement for existing functionality label Oct 10, 2025

jfy133 commented Oct 15, 2025

I see you've not got to this yet @HaidYi; I will check again next week :) (no rush though!)


HaidYi commented Oct 16, 2025

I see you've not got to this yet @HaidYi; I will check again next week :) (no rush though!)

@jfy133 Sure, I will get back to working on it this weekend. I've had a few trips since last weekend and have been sick this week.


jfy133 commented Oct 16, 2025

I hope you feel better soon @HaidYi !

jfy133 and others added 8 commits November 5, 2025 12:14

HaidYi commented Nov 24, 2025

@jfy133 Hi James, I am back on this PR. Following your comments, I have fixed the remaining problems, if I understood your comments correctly. I will make sure it passes all the CI/CD test cases; then you can review it.


jfy133 commented Dec 8, 2025

Hi @HaidYi - sorry for the delay, lots of end of year deadlines.

Sounds good!

So the error is a timeout during the download of the dbCAN database:

ERROR ~ Error executing process > 'NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_DATABASE'

Caused by:
  Process exceeded running time limit (1h)

Command executed:

  run_dbcan database \
      --db_dir dbcan_db

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_DATABASE":
      dbcan: $(echo $(run_dbcan version) | cut -f2 -d':' | cut -f2 -d' ')
  END_VERSIONS

Command exit status:
  -

Command output:
  (empty)

Command error:
  Downloading dbCAN-sub.hmm:  81%|████████  | 1.99G/2.46G [27:30<06:37, 1.19MiB/s]
  [... repeated progress lines truncated; download rate stuck at ~1.2 MiB/s ...]
  Downloading dbCAN-sub.hmm:  81%|████████  | 2.00G/2.46G [27:35<06:28, 1.20MiB/s]

Work dir:
  /home/runner/_work/funcscan/funcscan/~/tests/d7fb331664515ac8c0840002642ad00c/work/83/1bba6f4132d4df56d5262258836271

Container:
  quay.io/biocontainers/dbcan:5.1.2--pyhdfd78af_0

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '/home/runner/_work/funcscan/funcscan/~/tests/d7fb331664515ac8c0840002642ad00c/meta/nextflow.log' file for details
Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '/home/runner/_work/funcscan/funcscan/~/tests/d7fb331664515ac8c0840002642ad00c/meta/nextflow.log' file for details
-[nf-core/funcscan] Pipeline completed with errors-
    
Normally our preferred workaround is to make a very small version of the database that we host with our test data (e.g. just a few genes/taxa etc.). Do you think that would be possible to make?

HaidYi commented Dec 12, 2025

@jfy133 Thank you for pointing this out. For this issue, I contacted the tool authors. The problem is that the hosting server at UNL has low upload bandwidth. The authors have just been approved to host the data on AWS S3, so they will update the nf-core rundbcan_database module when they finish migrating the database to S3.

I think the timeout problem in the tests will then be resolved automatically once I pull the newest module. I will keep you in the loop.


jfy133 commented Dec 12, 2025

Ok! Let's see if it helps 👍👍


jfy133 commented Dec 17, 2025

@HaidYi I will check again after the holidays, but I just had a thought: it may also be an idea to ask the developers to make a mini database anyway. It may be useful for other cases too; it just needs to include a couple of gene sequences so there is something compatible with running run_dbcan (even if the output is nonsense).
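
As a rough sketch of how such a mini database might be cut down, assuming the dbCAN HMM files are standard HMMER3 profiles and that keep_families.txt / keep_proteins.txt list a handful of entries to retain (the file names inside the dbCAN database directory are assumptions):

# Pull a few family profiles out of the full dbCAN HMM database
hmmfetch -f dbCAN.hmm keep_families.txt > mini_dbCAN.hmm
hmmpress mini_dbCAN.hmm

# Rebuild a tiny DIAMOND database from a handful of CAZyDB protein sequences
seqkit grep -f keep_proteins.txt CAZyDB.fa > mini_CAZyDB.fa
diamond makedb --in mini_CAZyDB.fa --db mini_CAZyDB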
