Phylogenetic Covariance Matrix Sparsification
The package and its dependencies can be installed as follows.
To avoid conflicts with other Python packages, we recommend installing this package in a new virtual environment inside pcms/.
- The sparsification algorithm requires the Intel Math Kernel Library (MKL). See installation instructions here.
- The permutation test requires the mvhg package, which should be installed into pcms/lib/. First clone the pcms repository:
  git clone https://github.com/spsvihla/pcms.git
  Now, from inside pcms/, install the mvhg package in lib/:
  mkdir -p lib/ && cd lib/
  git clone https://github.com/spsvihla/mvhg.git && cd mvhg/
  ./build.sh && ./build.sh --clean
- Finally, to install the pcms package itself, navigate back to pcms/ and run
  ./build.sh && ./build.sh --clean
  The remaining Python dependencies are listed in pyproject.toml.
- By default, only the Python dependencies required to run the package itself are installed. To run the notebooks, install the additional Python dependencies with the command
  pip install ".[notebooks]"
We make use of the Greengenes database [1-2], the most up-to-date version of which can be found under /greengenes_release/current at the link provided.
As an example, the gg_13_8 dataset can be downloaded as follows:
wget https://ftp.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
tar -xzf gg_13_8_otus.tar.gz
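For readers who prefer to unpack the archive from Python (for example inside a notebook), the extraction step can be sketched with the standard library alone. The helper name extract_archive and the destination directory are illustrative assumptions, not part of pcms; the download itself is still assumed to have been done with wget as above.

```python
import tarfile
from pathlib import Path


def extract_archive(archive, dest):
    """Extract a gzip-compressed tarball into dest and return dest as a Path.

    Hypothetical helper for illustration; it is not provided by pcms.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Use the safer "data" extraction filter when this Python supports it.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **extra)
    return dest


# Example call (assumes the archive is in the current directory):
#   extract_archive("gg_13_8_otus.tar.gz", "greengenes")
```

This mirrors tar -xzf, with the one difference that the destination directory is created if it does not yet exist.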
Other notebooks use the Guerrero Negro microbial mat dataset collected and analyzed by Harris, Caporaso, et al. [3].
In particular, the sequences available on GenBank are used to pick OTUs clustered against the Greengenes database as described in [3].
These data can be downloaded at the link provided by selecting Send to > Gene features above the search results and creating a file with the "FASTA Nucleotide" format.
Note: The above analysis is for the full-length Sanger sequences. The corresponding table for the 454 partial-length sequences described in the paper may be found on Qiita.
The notebooks in this repository are written assuming the following directory structure:
$DATA/
├── greengenes/
│   ├── gg_13_5_otus/
│   │   └── <contents of gg_13_5.tar.gz>
│   ├── gg_13_8_otus/
│   │   └── <contents of gg_13_8.tar.gz>
│   └── ...
└── greengenes2/
    ├── gg_22_10_otus/
    │   ├── 2022.10.phylogeny.id.nwk
    │   └── ...
    └── ...
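Before running a notebook, it can help to confirm that the data directory matches this layout. The following is a minimal sketch, assuming the root directory is passed in explicitly (for instance from a DATA environment variable); the names EXPECTED and missing_paths are hypothetical and not part of pcms, and the list covers only the entries named in the tree above.

```python
from pathlib import Path

# Relative paths taken from the directory tree above; extend as needed.
EXPECTED = [
    "greengenes/gg_13_5_otus",
    "greengenes/gg_13_8_otus",
    "greengenes2/gg_22_10_otus/2022.10.phylogeny.id.nwk",
]


def missing_paths(data_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


# Example call (assumes the DATA environment variable is set):
#   import os
#   print(missing_paths(os.environ["DATA"]))
```

An empty list means every listed entry was found; any returned path points at a dataset that still needs to be downloaded or moved into place.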
We have provided a Mathematica notebook used in the original publication of this work.
1. McDonald, D. et al. Greengenes2 unifies microbial data in a single reference tree. Nat Biotechnol 42, 715–718 (2024). https://doi.org/10.1038/s41587-023-01845-1
2. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6, 610–618 (2012). https://doi.org/10.1038/ismej.2011.139
3. Kirk Harris, J. et al. Phylogenetic stratigraphy in the Guerrero Negro hypersaline microbial mat. ISME J 7, 50–60 (2013). https://doi.org/10.1038/ismej.2012.79
4. Svihla, S. & Lladser, M. E. Sparsification of Phylogenetic Covariance Matrices of Critical Beta-splitting Random Trees. In preparation.
5. Gorman, E. & Lladser, M. E. Sparsification of large ultrametric matrices: insights into the microbial Tree of Life. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 479, 20220847 (2023). https://doi.org/10.1098/rspa.2022.0847