Annotating a MAG and prepare it for ENA submission
In the next step, we want to submit a fully annotated MAG to ENA. The corresponding checklist ist this one:
https://www.ebi.ac.uk/ena/browser/view/ERC000047
By checking the mandatory fields, we need an information, that is missing so far: completeness and contamination scores. We can get those using the tool checkm.
Computing completeness and contamination using CheckM
Source the environment to make sure, the check database path is set as variable:
source /etc/environment
Run checkm on all bins (replace the bin folder name with the correct path from your metabat binning):
sudo checkm lineage_wf -t 28 -x fa /mnt/WGS-data/megahit_out/metabat/final.contigs.fa.metabat-bins-*/ /mnt/WGS-data/megahit_out/metabat/checkm/ > /mnt/WGS-data/megahit_out/metabat/checkm.log
Now, the results can be found in the file checkm.log:
tail -20 /mnt/WGS-data/megahit_out/metabat/checkm.log
Note the completeness and contamination values for your MAG, you will need them later.
Annotation with prokka
The multiplex capability and high yield of current day DNA-sequencing instruments has made bacterial whole genome sequencing a routine affair. The subsequent de novo assembly of reads into contigs has been well addressed. The final step of annotating all relevant genomic features on those contigs can be achieved slowly using existing web- and email-based systems, but these are not applicable for sensitive data or integrating into computational pipelines. Prokka is a command line software tool to fully annotate a draft bacterial genome in about 10 min on a typical desktop computer. It produces standards-compliant output files for further analysis or viewing in genome browsers.
prokka is pretty straightforward to use, however, it needs the sequence in unzipped fasta format. Unzip the bin file again:
gunzip /mnt/WGS-data/megahit_out/metabat/bin.*.fa.gz
Check the taxonomy of your bin again, we need the kingdom to call prokka correctly. It probably is a Bacteria (if not, change accordingly to archaea). Then we call Prokka with the bin sequence, the kingdom and specify an output directory:
prokka --kingdom bacteria --cpus 28 /mnt/WGS-data/megahit_out/metabat/bin.*.fa --outdir /mnt/WGS-data/megahit_out/metabat/prokka
Then have look at the output folder:
ls -l /mnt/WGS-data/megahit_out/metabat/prokka
It contains``.gff`` file that we will use along with the bin fasta file to generate an EMBL compatible flat file now.
Create an EMBL flat file
Unfortunately, prokka does not produce a format, that we can submit to ENA. However, we have all the information we need in the bin fasta and gff3 files. An EMBL compatible flat file can be generated using the tool EMBLmyGFF3.
An important note: In order to submit annotated sequences to ENA, you would need to get a locus tag prefix for each of your MAGs. See: https://ena-docs.readthedocs.io/en/latest/faq/locus_tags.html
These need to be registered along with your study and take at least 24 hours to be available. The test service however, does not allow registration of locus tags. We just use the placeholder LOCUSTAG instead.
First we need to change the default python3 version to python3.8 using:
sudo update-alternatives --config python3
Then select the python 3.8 option (2).
The following command yields us an EMBL compatible flat file, you need to fill in some of the fields (correct bin fasta file, study/project accession, and taxid):
EMBLmyGFF3 /mnt/WGS-data/megahit_out/metabat/prokka/PROKKA_YOUR_RESULT.gff /mnt/WGS-data/megahit_out/bin.YOURBIN.fa \
--data_class STD \
--topology linear \
--molecule_type "genomic DNA" \
--transl_table 11 \
--species TODO: your taxid here! \
--environmental_sample \
--isolation_source "forest soil" \
--locus_tag LOCUSTAG \
--project_id TODO: PRJXXXXXXX \
-o /mnt/WGS-data/megahit_out/metabat/mybin.embl
Data class might be HTG as well: https://ena-docs.readthedocs.io/en/latest/retrieval/general-guide/data-classes.html
Inspect your EMBL file. Then we proceed with the submission of a MAG sample before we submit the generated EMBL file.
References
prokka http://www.vicbioinformatics.com/software.prokka.shtml
CheckM https://github.com/Ecogenomics/CheckM
EMBLmyGFF3 https://github.com/NBISweden/EMBLmyGFF3