evcouplings.complex package

evcouplings.complex.protocol module

Protocols for matching putatively interacting sequences in protein complexes to create a concatenated sequence alignment

Authors:
Anna G. Green, Thomas A. Hopf
evcouplings.complex.protocol.best_hit(**kwargs)

Protocol:

Concatenate alignments based on the best hit to the focus sequence in each species

Parameters: kwargs (mandatory) – Keyword arguments; the required keys are listed in the check_required call in the function source
Returns: outcfg – Output configuration of the pipeline, including the following fields:
  • alignment_file
  • raw_alignment_file
  • focus_mode
  • focus_sequence
  • segments
  • frequencies_file
  • identities_file
  • num_sequences
  • num_sites
  • raw_focus_alignment_file
  • statistics_file
Return type: dict
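
The best-hit idea can be illustrated with a minimal sketch. It is not the package's implementation: the record layout (sequence ID, species, identity to the focus sequence) and the function names best_hit_per_species and pair_best_hits are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (sequence_id, species, identity_to_focus)
Hit = Tuple[str, str, float]


def best_hit_per_species(hits: List[Hit]) -> Dict[str, Hit]:
    """Keep, for each species, the hit most identical to the focus sequence."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        _, species, identity = hit
        # Replace the stored hit only if this one has a strictly higher identity
        if species not in best or identity > best[species][2]:
            best[species] = hit
    return best


def pair_best_hits(hits_1: List[Hit], hits_2: List[Hit]) -> List[Tuple[str, str]]:
    """Pair the best hits of both monomers for every species found in both."""
    best_1 = best_hit_per_species(hits_1)
    best_2 = best_hit_per_species(hits_2)
    shared_species = sorted(set(best_1) & set(best_2))
    return [(best_1[s][0], best_2[s][0]) for s in shared_species]


# Toy data, invented for illustration
hits_1 = [("A1", "E. coli", 0.9), ("A2", "E. coli", 0.6), ("A3", "B. subtilis", 0.7)]
hits_2 = [("B1", "E. coli", 0.8), ("B2", "B. subtilis", 0.5), ("B3", "B. subtilis", 0.75)]
pairs = pair_best_hits(hits_1, hits_2)
```

Each resulting pair names one sequence per monomer from the same species; these are the rows that would be joined end to end in a concatenated alignment.
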
evcouplings.complex.protocol.describe_concatenation(annotation_file_1, annotation_file_2, genome_location_filename_1, genome_location_filename_2, outfile)

Describes properties of concatenated alignment.

Writes a CSV file with the following columns:
  • num_seqs_1: number of sequences in the first monomer alignment
  • num_seqs_2: number of sequences in the second monomer alignment
  • num_nonred_species_1: number of unique species annotations in the first monomer alignment
  • num_nonred_species_2: number of unique species annotations in the second monomer alignment
  • num_species_overlap: number of unique species found in both alignments
  • median_num_per_species_1: median number of paralogs per species in the first monomer alignment
  • median_num_per_species_2: median number of paralogs per species in the second monomer alignment
  • num_with_embl_cds_1: number of IDs for which an EMBL CDS was found in the first monomer alignment (relevant to distance concatenation only)
  • num_with_embl_cds_2: number of IDs for which an EMBL CDS was found in the second monomer alignment (relevant to distance concatenation only)
Parameters:
  • annotation_file_1 (str) – Path to annotation.csv file for first monomer alignment
  • annotation_file_2 (str) – Path to annotation.csv file for second monomer alignment
  • genome_location_filename_1 (str) – Path to genome location mapping file for first alignment
  • genome_location_filename_2 (str) – Path to genome location mapping file for second alignment
  • outfile (str) – Path to output file
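
As a rough illustration of how the species-level columns could be derived, the sketch below computes them from two plain lists of species annotations (one entry per sequence). The function name describe_species and the list-based input are assumptions for this example; the EMBL CDS columns are left out because they depend on the genome location files.

```python
from collections import Counter
from statistics import median
from typing import Dict, List


def describe_species(species_1: List[str], species_2: List[str]) -> Dict[str, float]:
    """Summarize species annotations of two monomer alignments (illustrative only)."""
    # Number of sequences (paralogs) observed per species in each alignment
    counts_1 = Counter(species_1)
    counts_2 = Counter(species_2)
    return {
        "num_seqs_1": len(species_1),
        "num_seqs_2": len(species_2),
        "num_nonred_species_1": len(counts_1),
        "num_nonred_species_2": len(counts_2),
        "num_species_overlap": len(set(counts_1) & set(counts_2)),
        # Guard against empty input, for which the median is undefined
        "median_num_per_species_1": median(counts_1.values()) if counts_1 else 0,
        "median_num_per_species_2": median(counts_2.values()) if counts_2 else 0,
    }


# Toy annotations, invented for illustration
stats = describe_species(
    ["E. coli", "E. coli", "B. subtilis", "V. cholerae"],
    ["E. coli", "B. subtilis", "B. subtilis", "S. aureus", "S. aureus", "S. aureus"],
)
```
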
evcouplings.complex.protocol.genome_distance(**kwargs)

Protocol:

Concatenate alignments based on genomic distance

Parameters: kwargs (mandatory) – Keyword arguments; the required keys are listed in the check_required call in the function source
Returns: outcfg – Output configuration of the pipeline, including the following fields:
  • alignment_file
  • raw_alignment_file
  • focus_mode
  • focus_sequence
  • segments
  • frequencies_file
  • identities_file
  • num_sequences
  • num_sites
  • raw_focus_alignment_file
  • statistics_file
Return type: dict
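
A minimal sketch of pairing by genomic distance follows. It assumes, for illustration only, that every gene is given as (sequence ID, species, start, end) on a single contig per species, that distance is the gap between the two genes, and that pairs beyond a max_distance cutoff are discarded. The names gene_distance, closest_pairs and max_distance are hypothetical, and the package's own protocol may differ in all of these respects.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (sequence_id, species, genome_start, genome_end)
Gene = Tuple[str, str, int, int]


def gene_distance(a: Gene, b: Gene) -> int:
    """Gap in bases between two genes on the same contig; 0 if they overlap."""
    return max(0, max(a[2], b[2]) - min(a[3], b[3]))


def closest_pairs(
    genes_1: List[Gene], genes_2: List[Gene], max_distance: int
) -> List[Tuple[str, str, int]]:
    """For each gene of monomer 1, pick the nearest monomer 2 gene in the same species."""
    by_species: Dict[str, List[Gene]] = {}
    for gene in genes_2:
        by_species.setdefault(gene[1], []).append(gene)

    pairs = []
    for g1 in genes_1:
        candidates = by_species.get(g1[1], [])
        if not candidates:
            continue  # species absent from the second alignment
        g2 = min(candidates, key=lambda g: gene_distance(g1, g))
        distance = gene_distance(g1, g2)
        if distance <= max_distance:
            pairs.append((g1[0], g2[0], distance))
    return pairs


# Toy coordinates, invented for illustration
genes_1 = [("A1", "eco", 100, 400), ("A2", "eco", 9000, 9300), ("A3", "bsu", 50, 350)]
genes_2 = [("B1", "eco", 450, 800), ("B2", "eco", 20000, 20300), ("B3", "bsu", 5000, 5300)]
pairs = closest_pairs(genes_1, genes_2, max_distance=1000)
```

This sketch does not enforce one-to-one pairing, so the same monomer 2 gene could be chosen for several monomer 1 genes; a real pipeline would need to resolve such ties.
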
evcouplings.complex.protocol.modify_complex_segments(outcfg, **kwargs)

Modifies the output configuration so that the segments are correct for a concatenated alignment

Parameters: outcfg (dict) – The output configuration
Returns: outcfg – The output configuration, with a new field called “segments”
Return type: dict
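
To make the idea concrete: in a concatenated alignment the second monomer's columns follow the first monomer's, so each monomer needs its own segment entry. The sketch below is purely illustrative. The dictionary layout, the segment IDs "A_1" and "B_1", and the helper names are assumptions; the segment representation actually used by the package is not described in this section.

```python
from typing import Any, Dict, List


def concatenated_segments(len_1: int, len_2: int) -> List[Dict[str, Any]]:
    """Describe two monomers laid end to end (hypothetical layout, 1-based, inclusive)."""
    return [
        {"segment_id": "A_1", "start": 1, "end": len_1},
        # The second monomer starts right after the last column of the first
        {"segment_id": "B_1", "start": len_1 + 1, "end": len_1 + len_2},
    ]


def add_segments(outcfg: Dict[str, Any], len_1: int, len_2: int) -> Dict[str, Any]:
    """Return a copy of outcfg with a 'segments' field; the input is left unchanged."""
    updated = dict(outcfg)
    updated["segments"] = concatenated_segments(len_1, len_2)
    return updated


# Toy configuration, invented for illustration
outcfg = {"num_sites": 250}
new_cfg = add_segments(outcfg, len_1=100, len_2=150)
```
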
evcouplings.complex.protocol.run(**kwargs)

Run alignment concatenation protocol

Parameters: kwargs (mandatory) – Keyword arguments, including:
  • protocol: concatenation protocol to run
  • prefix: output prefix for all generated files
Returns: outcfg – Output configuration of the concatenation stage, a dictionary with results in the following fields (fields in brackets are not mandatory):
  • alignment_file
  • raw_alignment_file
  • focus_mode
  • focus_sequence
  • segments
  • frequencies_file
  • identities_file
  • num_sequences
  • num_sites
  • raw_focus_alignment_file
  • statistics_file
Return type: dict
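
A dispatcher of this kind can be sketched as a lookup from protocol name to function. The code below is a generic illustration under stated assumptions, not the package's source: run_concatenation, toy_protocol and the "_alignment.a2m" file suffix are made up for the example, and only the two documented keys (protocol, prefix) are checked.

```python
from typing import Any, Callable, Dict

# A protocol takes keyword arguments and returns an output configuration
Protocol = Callable[..., Dict[str, Any]]


def run_concatenation(protocols: Dict[str, Protocol], **kwargs: Any) -> Dict[str, Any]:
    """Check the mandatory keys, then call the protocol selected by name."""
    missing = [key for key in ("protocol", "prefix") if key not in kwargs]
    if missing:
        raise ValueError("Missing required arguments: " + ", ".join(missing))

    name = kwargs["protocol"]
    if name not in protocols:
        raise ValueError(f"Unknown protocol {name!r}; available: {sorted(protocols)}")

    # All keyword arguments are forwarded unchanged to the chosen protocol
    return protocols[name](**kwargs)


def toy_protocol(**kwargs: Any) -> Dict[str, Any]:
    """Stand-in protocol that only derives one output path from the prefix."""
    return {"alignment_file": kwargs["prefix"] + "_alignment.a2m"}


outcfg = run_concatenation({"toy": toy_protocol}, protocol="toy", prefix="output/complex")
```

Keeping the name-to-function mapping in a dictionary means a new protocol can be added without touching the dispatch logic.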