EZpipe


EZpipe helps users quickly build a data set of aligned and concatenated genetic sequences while avoiding operator errors. It was developed for large data sets (such as phylomitogenomic analyses, which are difficult to handle manually), but it can also be useful for small matrices. Note that the procedure covers all the manually intensive steps of data set construction but not the actual phylogenetic analysis: we cannot currently provide the substantial computational resources required for this final step. We suggest you check other services such as the CIPRES Portal.

Use EZpipe according to the following instructions:

  1. Prepare a compressed file (.zip, .rar, .tar, .gz, .7z) of fasta files. Each fasta file must be named after the genetic marker and must contain all sequences for that locus. Please make sure that:
    • the compressed file contains only fasta data;
    • there are no duplicate taxon names or sequences (duplicate sequences are allowed but may complicate the downstream analysis; they will be flagged during the pipeline);
    • taxa are named consistently across fasta files (e.g. if one taxon is named Drosophila in the cox1 file, it has to be named Drosophila in all gene files for sequences to be correctly concatenated);
    • there are no non-standard nucleotides (only IUPAC ambiguous nucleotides are accepted);
    • there are no gaps within the sequences (if there are, they will be automatically removed).
  2. Upload the compressed archive using the Upload link button.
  3. Choose the Genetic Code for the analysis (a list of available codes as well as some additional information can be found here).
  4. Choose which codon positions should be retained in the concatenated file (e.g. whether or not to remove third codon positions). We recommend performing two analyses, one with each option, and taking saturation into account. If your data set is composed of highly divergent sequences, retaining only the first 2 codon positions is the preferred option. An example can be seen here.
  5. The output consists of a concatenated .phy matrix and a .cfg file. These can be used as the entry point for subsequent partitioning and model optimization using Partition Finder2. Although we do not recommend it, the final concatenated matrix can also be used directly in ML programs (e.g. IQ-TREE) if no partitioning is desired.
  6. If the analysis does not produce errors, a compressed folder will be downloaded. It includes all intermediate files (aligned, G-blocked, ...) as well as a final 'log.txt' document describing all steps performed.
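
The input checklist in step 1 can be pre-checked locally before uploading. The following is a minimal Python sketch, not EZpipe's own code: the function names (parse_fasta, check_marker, taxa_missing_per_marker) are hypothetical, and it assumes that taxon names are the full fasta header lines and that gaps are written as "-".

```python
from collections import Counter

# IUPAC nucleotide codes, including ambiguity codes (assumed accepted set).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from fasta-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_marker(records):
    """List problems found in one marker file (gaps are stripped first)."""
    problems = []
    counts = Counter(name for name, _ in records)
    for name, n in counts.items():
        if n > 1:
            problems.append(f"duplicate taxon name: {name}")
    for name, seq in records:
        cleaned = seq.replace("-", "").upper()
        bad = sorted(set(cleaned) - IUPAC_NUCLEOTIDES)
        if bad:
            problems.append(f"{name}: non-IUPAC characters {''.join(bad)}")
    return problems


def taxa_missing_per_marker(markers):
    """For each marker, list taxon names seen elsewhere but absent here.

    A name that is missing from most markers is often a spelling variant.
    """
    all_taxa = set()
    for records in markers.values():
        all_taxa.update(name for name, _ in records)
    return {
        marker: sorted(all_taxa - {name for name, _ in records})
        for marker, records in markers.items()
    }
```

For instance, calling taxa_missing_per_marker on a cox1 file containing "Drosophila" and a cox2 file containing "drosophila" reports each spelling as missing from the other file, which is the kind of inconsistency that would prevent correct concatenation.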

Your data will go through the following steps:

  1. sequences will be checked for the presence of stop codons and unusual length (compared to the rest of the dataset);
  2. sequences of each gene will be aligned based on their amino acid alignment (i.e. retroaligned);
  3. regions of unreliable alignment will be removed from each gene using G-blocks (options: codon, strict);
  4. all gene alignments will be concatenated into a final dataset;
  5. third codon positions will be removed (if requested);
  6. a .cfg file will be created for use in Partition Finder2. Starting partitions are by gene and by codon position (i.e. 26 or 39 partitions).
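
Steps 4 to 6 can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than EZpipe's implementation: the function names are hypothetical, every gene alignment is assumed to be in frame and to contain the same taxa, and the block lines follow the `start-end\step` notation used for data blocks in Partition Finder configuration files. The exact contents of the .cfg file written by EZpipe may differ.

```python
def drop_third_positions(seq):
    """Keep codon positions 1 and 2 of an in-frame aligned sequence."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def concatenate(alignments, keep_third=True):
    """Concatenate per-gene alignments and derive gene-by-codon blocks.

    alignments: {gene: {taxon: aligned_sequence}}
    Returns (matrix, blocks): matrix maps taxon -> concatenated sequence,
    blocks is a list of data-block lines, one per gene and codon position.
    """
    genes = list(alignments)
    taxa = sorted(alignments[genes[0]])
    matrix = {taxon: "" for taxon in taxa}
    blocks = []
    start = 1
    # With third positions removed, codon positions repeat every 2 sites.
    step = 3 if keep_third else 2
    for gene in genes:
        seqs = alignments[gene]
        if sorted(seqs) != taxa:
            raise ValueError(f"{gene}: taxon set differs from {genes[0]}")
        if not keep_third:
            seqs = {t: drop_third_positions(s) for t, s in seqs.items()}
        length = len(seqs[taxa[0]])
        if any(len(s) != length for s in seqs.values()):
            raise ValueError(f"{gene}: sequences are not aligned")
        end = start + length - 1
        for pos in range(step):
            blocks.append(f"{gene}_pos{pos + 1} = {start + pos}-{end}\\{step};")
        for taxon in taxa:
            matrix[taxon] += seqs[taxon]
        start = end + 1
    return matrix, blocks
```

Each gene contributes three blocks when third positions are kept and two when they are removed, so a data set of 13 genes yields 39 or 26 starting partitions respectively.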

Example of input file preparation

[Screencast: example of input file preparation]
Download an input example

*Useful to organize your processed jobs. If not provided, a random ID will be created.