evcouplings.align package

evcouplings.align.alignment module

evcouplings.align.pfam module

Code for identifying Pfam domains and mapping Pfam alignments and ECs into target sequence mode.

Authors:
Thomas A. Hopf

Todo

  1. Write code to create list of family sizes
  2. Implement alignments against Pfam-HMM so precomputed results can be reused in focus mode
evcouplings.align.pfam.create_family_size_table(full_pfam_file, outfile=None)[source]

Parse family size table from Pfam flat file (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz)

Parameters:
  • full_pfam_file (str) – Path to the pfam file (gzip).
  • outfile (str, optional (default: None)) – Save the parsed table to this file as a csv file.
Returns:

Parsed Pfam table.

Return type:

pd.DataFrame

evcouplings.align.pfam.pfam_hits(query_file, hmm_database, prefix, clan_table_file, size_table_file, resolve_overlaps=True, **kwargs)[source]

Identify hits of Pfam HMMs in a set of sequences.

Parameters:
  • query_file (str) – File containing query sequence(s)
  • hmm_database (str) – File containing HMM database (Pfam-A.hmm, with hmmpress applied)
  • prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
  • clan_table_file (str) – File with table linking Pfam families to clans (Pfam-A.clans.tsv). Set to None if not available, but resolve_overlaps cannot be True in that case.
  • size_table_file (str) – File with table of family sizes. Create using create_family_size_table(). Set to None if not available.
  • resolve_overlaps (bool) – Resolve overlapping hits by families from the same clan. Only possible if clan_table_file is given.
  • **kwargs (dict) – kwargs passed on to evcouplings.align.tools.run_hmmscan
Returns:

Pfam hit table

Return type:

pd.DataFrame

evcouplings.align.pfam.remove_clan_overlaps(pfam_table)[source]

Remove overlapping Pfam hits from same Pfam clan (equivalent of PfamScan.pl). Currently only allows to remove overlaps by domain bitscore.

Todo

is bitscore the most sensible choice if different length hits?

Parameters:pfam_table (pd.DataFrame) – Pfam hit table as generated by pfam_hits() function (must contain Pfam clan annotation).
Returns:Pfam hit table with lower-scoring overlaps removed
Return type:pd.DataFrame

evcouplings.align.protocol module

Protein sequence alignment creation protocols/workflows.

Authors:
Thomas A. Hopf Anna G. Green - complex protocol, hmm_build_and_search Chan Kang - hmm_build_and_search
evcouplings.align.protocol.complex(**kwargs)[source]

Protocol:

Run monomer alignment protocol and postprocess it for EVcomplex calculations

Parameters:kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:outcfg – Output configuration of the alignment protocol, and the following additional field:
genome_location_file : path to file containing
the genomic locations for CDs’s corresponding to identifiers in the alignment.
Return type:dict
evcouplings.align.protocol.cut_sequence(sequence, sequence_id, region=None, first_index=None, out_file=None)[source]

Cut a given sequence to sub-range and save it in a file

Parameters:
  • sequence (str) – Full sequence that will be cut
  • sequence_id (str) – Identifier of sequence, used to construct header in output file
  • region (tuple(int, int), optional (default: None)) – Region that will be cut out of full sequence. If None, full sequence will be returned.
  • first_index (int, optional (default: None)) – Define index of first position in sequence. Will be set to 1 if None.
  • out_file (str, optional (default: None)) – Save sequence in a FASTA file (header: >sequence_id/start_region-end_region)
Returns:

  • str – Subsequence contained in region
  • tuple(int, int) – Region. If no input region is given, this will be (1, len(sequence)); otherwise, the input region is returned.

Raises:

InvalidParameterError – Upon invalid region specification (violating boundaries of sequence)

evcouplings.align.protocol.describe_coverage(alignment, prefix, first_index, minimum_column_coverage)[source]

Produce “classical” buildali coverage statistics, i.e. number of sequences, how many residues have too many gaps, etc.

Only to be applied to alignments focused around the target sequence.

Parameters:
  • alignment (Alignment) – Alignment for which coverage statistics will be calculated
  • prefix (str) – Prefix of alignment file that will be stored as identifier in table
  • first_index (int) – Sequence index of first position of target sequence
  • minimum_column_coverage (Iterable(float) or float) –

    Minimum column coverage threshold(s) that will be tested (creating one row for each threshold in output table).

    Note

    int values given to this function instead of a float will be divided by 100 to create the corresponding floating point representation. This parameter is 1.0 - maximum fraction of gaps per column.

Returns:

Table with coverage statistics for different gap thresholds

Return type:

pd.DataFrame

evcouplings.align.protocol.describe_frequencies(alignment, first_index, target_seq_index=None)[source]

Get parameters of alignment such as gaps, coverage, conservation and summarize.

Parameters:
  • alignment (Alignment) – Alignment for which description statistics will be calculated
  • first_index (int) – Sequence index of first residue in target sequence
  • target_seq_index (int, optional (default: None)) – If given, will add the symbol in the target sequence into a separate column of the output table
Returns:

Table detailing conservation and symbol frequencies for all positions in the alignment

Return type:

pandas.DataFrame

evcouplings.align.protocol.describe_seq_identities(alignment, target_seq_index=0)[source]

Calculate sequence identities of any sequence to target sequence and create result dataframe.

Parameters:alignment (Alignment) – Alignment for which description statistics will be calculated
Returns:Table giving the identity to target sequence for each sequence in alignment (in order of occurrence)
Return type:pandas.DataFrame
evcouplings.align.protocol.existing(**kwargs)[source]

Protocol:

Use external sequence alignment and extract all relevant information from there (e.g. sequence, region, etc.), then apply gap & fragment filtering as usual

Parameters:kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:outcfg – Output configuration of the pipeline, including the following fields:
  • sequence_id (passed through from input)
  • alignment_file
  • raw_focus_alignment_file
  • statistics_file
  • sequence_file
  • first_index
  • target_sequence_file
  • annotation_file (None)
  • frequencies_file
  • identities_file
  • focus_mode
  • focus_sequence
  • segments
Return type:dict
evcouplings.align.protocol.extract_header_annotation(alignment, from_annotation=True)[source]

Extract Uniprot/Uniref sequence annotation from Stockholm file (as output by jackhmmer). This function may not work for other formats.

Parameters:
  • alignment (Alignment) – Multiple sequence alignment object
  • from_annotation (bool, optional (default: True)) – Use annotation line (in Stockholm file) rather than sequence ID line (e.g. in FASTA file)
Returns:

Table containing all annotation (one row per sequence in alignment, in order of occurrence)

Return type:

pandas.DataFrame

evcouplings.align.protocol.fetch_sequence(sequence_id, sequence_file, sequence_download_url, out_file)[source]

Fetch sequence either from database based on identifier, or from input sequence file.

Parameters:
  • sequence_id (str) – Identifier of sequence that should be retrieved
  • sequence_file (str) – File containing sequence. If None, sqeuence will be downloaded from sequence_download_url
  • sequence_download_url (str) – URL from which to download missing sequence. Must contain “{}” at the position where sequence ID will be inserted into download URL (using str.format).
  • out_file (str) – Output file in which sequence will be stored, if sequence_file is not existing.
Returns:

  • str – Path of file with stored sequence (can be sequence_file or out_file)
  • tuple (str, str) – Identifier of sequence as stored in file, and sequence

Protocol:

Build HMM from sequence alignment using hmmbuild and search against a sequence database using hmmsearch.

Parameters:kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:outcfg – Output configuration of the protocol, including the following fields:
  • target_sequence_file
  • sequence_file
  • raw_alignment_file
  • hittable_file
  • focus_mode
  • focus_sequence
  • segments
Return type:dict

Protocol:

Iterative jackhmmer search against a sequence database.

Parameters:kwargs arguments (Mandatory) – See list below in code where calling check_required

:param .. todo::: explain meaning of parameters in detail.

Returns:outcfg – Output configuration of the protocol, including the following fields:
  • sequence_id (passed through from input)
  • first_index (passed through from input)
  • target_sequence_file
  • sequence_file
  • raw_alignment_file
  • hittable_file
  • focus_mode
  • focus_sequence
  • segments
Return type:dict
evcouplings.align.protocol.modify_alignment(focus_ali, target_seq_index, target_seq_id, region_start, **kwargs)[source]

Apply pairwise identity filtering, fragment filtering, and exclusion of columns with too many gaps to a sequence alignment. Also generates files describing properties of the alignment such as frequency distributions, conservation, and “old-style” alignment statistics files.

Note

assumes focus alignment (otherwise unprocessed) as input.

Todo

come up with something more clever to filter fragments than fixed width (e.g. use 95% quantile of length distribution as reference point)

Parameters:
  • focus_ali (Alignment) – Focus-mode input alignment
  • target_seq_index (int) – Index of target sequence in alignment
  • target_seq_id (str) – Identifier of target sequence (without range)
  • region_start (int) – Index of first sequence position in target sequence
  • kwargs (See required arguments in source code) –
Returns:

  • outcfg (Dict) – File products generated by the function:
    • alignment_file
    • statistics_file
    • frequencies_file
    • identities_file
    • raw_focus_alignment_file
  • ali (Alignment) – Final processed alignment

evcouplings.align.protocol.run(**kwargs)[source]

Run alignment protocol to generate multiple sequence alignment from input sequence.

Parameters:
  • kwargs arguments (Mandatory) – protocol: Alignment protocol to run prefix: Output prefix for all generated files
  • Optional
Returns:

  • Alignment
  • Dictionary with results of stage in following fields (in brackets - not returned by all protocols)
    • alignment_file
    • [raw_alignment_file]
    • statistics_file
    • target_sequence_file
    • sequence_file
    • [annotation_file]
    • frequencies_file
    • identities_file
    • [hittable_file]
    • focus_mode
    • focus_sequence
    • segments

evcouplings.align.protocol.search_thresholds(use_bitscores, seq_threshold, domain_threshold, seq_len)[source]

Set homology search inclusion parameters.

HMMER hits get included in the HMM according to a two-step rule
  1. sequence passes sequence-level treshold
  2. domain passes domain-level threshold
Therefore, search thresholds are set based on the following logic:
  1. If only sequence threshold is given, a MissingParameterException is raised
  2. If only bitscore threshold is given, sequence threshold is set to the same
  3. If both thresholds are given, they are according to defined values
Valid inputs for bitscore thresholds:
  1. int or str: taken as absolute score threshold
  2. float: taken as relative threshold (absolute threshold derived by

multiplication with domain length)

Valid inputs for integer thresholds:
  1. int: Used as negative exponent, threshold will be set to 1E-<exponent>
  2. float or str: Interpreted literally
Parameters:
  • use_bitscores (bool) – Use bitscore threshold instead of E-value threshold
  • domain_threshold (str or int or float) – Domain-level threshold. See rules above.
  • seq_threshold (str or int or float) – Sequence-level threshold. See rules above.
  • seq_len (int) – Length of sequence. Used to calculate absolute bitscore threshold for relative bitscore thresholds.
Returns:

Sequence- and domain-level thresholds ready to be fed into HMMER

Return type:

tuple (str, str)

evcouplings.align.protocol.standard(**kwargs)[source]

Protocol:

Standard buildali4 workflow (run iterative jackhmmer search against sequence database, than determine which sequences and columns to include in the calculation based on coverage and maximum gap thresholds).

Parameters:kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:
  • outcfg (dict) – Output configuration of the pipeline, including the following fields:
    • sequence_id (passed through from input)
    • first_index (passed through from input)
    • alignment_file
    • raw_alignment_file
    • raw_focus_alignment_file
    • statistics_file
    • target_sequence_file
    • sequence_file
    • annotation_file
    • frequencies_file
    • identities_file
    • hittable_file
    • focus_mode
    • focus_sequence
    • segments
  • ali (Alignment) – Final sequence alignment

evcouplings.align.tools module

Wrappers for running external sequence alignment tools

Authors:
Thomas A. Hopf Anna G. Green - run_hmmbuild, run_hmmsearch Chan Kang - run_hmmbuild, run_hmmsearch
class evcouplings.align.tools.HmmbuildResult(prefix, hmmfile, output)

Bases: tuple

hmmfile

Alias for field number 1

output

Alias for field number 2

prefix

Alias for field number 0

class evcouplings.align.tools.HmmscanResult(prefix, output, tblout, domtblout, pfamtblout)

Bases: tuple

domtblout

Alias for field number 3

output

Alias for field number 1

pfamtblout

Alias for field number 4

prefix

Alias for field number 0

tblout

Alias for field number 2

class evcouplings.align.tools.HmmsearchResult(prefix, alignment, output, tblout, domtblout)

Bases: tuple

alignment

Alias for field number 1

domtblout

Alias for field number 4

output

Alias for field number 2

prefix

Alias for field number 0

tblout

Alias for field number 3

class evcouplings.align.tools.JackhmmerResult(prefix, alignment, output, tblout, domtblout)

Bases: tuple

alignment

Alias for field number 1

domtblout

Alias for field number 4

output

Alias for field number 2

prefix

Alias for field number 0

tblout

Alias for field number 3

evcouplings.align.tools.read_hmmer_domtbl(filename)[source]

Read a HMMER domtbl file into DataFrame.

Parameters:filename (str) – Path of domtbl file
Returns:DataFrame with parsed domtbl
Return type:pd.DataFrame
evcouplings.align.tools.read_hmmer_tbl(filename)[source]

Read a HMMER tbl file into DataFrame.

Parameters:filename (str) – Path of tbl file
Returns:DataFrame with parsed tbl
Return type:pd.DataFrame
evcouplings.align.tools.run_hhfilter(input_file, output_file, threshold=95, columns='a2m', binary='hhfilter')[source]

Redundancy-reduce a sequence alignment using hhfilter from the HHsuite alignment suite.

Parameters:
  • input_file (str) – Path to input alignment in A2M/FASTA format
  • output_file (str) – Path to output alignment (will be in A3M format)
  • threshold (int, optional (default: 95)) – Sequence identity threshold for maximum pairwise identity (between 0 and 100)
  • columns ({"first", "a2m"}, optional (default: "a2m")) – Definition of match columns (based on first sequence or upper-case columns (a2m))
  • binary (str) – Path to hhfilter binary
Returns:

output_file

Return type:

str

Raises:
  • ResourceError – If output alignment is non-existent/empty
  • ValueError – Upon invalid value of columns parameter
evcouplings.align.tools.run_hmmbuild(alignment_file, prefix, cpu=None, stdout_redirect=None, symfrac=None, binary='hmmbuild')[source]

Profile HMM construction from multiple sequence alignments Refer to HMMER documentation for details.

http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf

Parameters:
  • alignment_file (str) – File containing the multiple sequence alignment. Can be in Stockholm, a2m, or clustal formats, or any other format recognized by hmmer. Please note that ALL POSITIONS above the symfrac cutoff will be used in HMM construction (if the alignment contains columns that are insertions relative to the query sequence, this may be problematic for structure comparison)
  • prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
  • cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
  • stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
  • symfrac (float, optional (default: None)) – range 0.0 - 1.0, HMMbuild will use columns with > symfrac percent gaps to construct the HMM. If None provided, HMMbuild internal default is 0.5. (Note: this is calculated after their internal sequence weighting is calculated)
  • binary (str (default: "hmmbuild")) – Path to jackhmmer binary (put in PATH for default to work)
Returns:

namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)

Return type:

HmmbuildResult

Raises:

ExternalToolError, ResourceError

evcouplings.align.tools.run_hmmscan(query, database, prefix, use_model_threshold=True, threshold_type='cut_ga', use_bitscores=True, domain_threshold=None, seq_threshold=None, nobias=False, cpu=None, stdout_redirect=None, binary='hmmscan')[source]

Run hmmscan of HMMs in database against sequences in query to identify matches of these HMMs. Refer to HMMER Userguide for explanation of these parameters.

Parameters:
  • query (str) – File containing query sequence(s)
  • database (str) – File containing HMM database (prepared with hmmpress)
  • prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
  • use_model_threshold (bool (default: True)) – Use model-specific inclusion thresholds from HMM database rather than global bitscore/E-value thresholds (use_bitscores, domain_threshold and seq_threshold are overriden by this flag).
  • threshold-type ({"cut_ga", "cut_nc", "cut_tc"} (default: "cut_ga")) – Use gathering (default), noise or trusted cutoff to define scan hits. Please refer to HMMER manual for details.
  • use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values. Overriden by use_model_threshold flag.
  • domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
  • seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
  • nobias (bool, optional (default: False)) – Turn of bias correction
  • cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
  • stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
  • binary (str (default: "hmmscan")) – Path to hmmscan binary (put in PATH for default to work)
Returns:

namedtuple with fields corresponding to the different output files (prefix, output, tblout, domtblout, pfamtblout)

Return type:

HmmscanResult

Raises:

ExternalToolError, ResourceError

evcouplings.align.tools.run_hmmsearch(hmmfile, database, prefix, use_bitscores, domain_threshold, seq_threshold, nobias=False, cpu=None, stdout_redirect=None, binary='hmmsearch')[source]

Search profile(s) against a sequence database. Refer to HMMER documentation for details.

http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf

Parameters:
  • hmmfile (str) – File containing the profile(s)
  • database (str) – File containing sequence database
  • prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
  • use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values.
  • domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
  • seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
  • nobias (bool, optional (default: False)) – Turn of bias correction
  • cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
  • stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
  • binary (str (default: "hmmsearch")) – Path to jackhmmer binary (put in PATH for default to work)
Returns:

namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)

Return type:

HmmsearchResult

Raises:

ExternalToolError, ResourceError

evcouplings.align.tools.run_jackhmmer(query, database, prefix, use_bitscores, domain_threshold, seq_threshold, iterations=5, nobias=False, cpu=None, stdout_redirect=None, checkpoints_hmm=False, checkpoints_ali=False, binary='jackhmmer')[source]

Run jackhmmer sequence search against target database. Refer to HMMER Userguide for explanation of these parameters.

Parameters:
  • query (str) – File containing query sequence
  • database (str) – File containing sequence database
  • prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
  • use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values.
  • domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
  • seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
  • iterations (int) – number of jackhmmer search iterations
  • nobias (bool, optional (default: False)) – Turn of bias correction
  • cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
  • stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
  • checkpoints_hmm (bool, optional (default: False)) – Store checkpoint HMMs to prefix.<iter>.hmm
  • checkpoints_ali (bool, optional (default: False)) – Store checkpoint alignments to prefix.<iter>.sto
  • binary (str (default: "jackhmmer")) – Path to jackhmmer binary (put in PATH for default to work)
Returns:

namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)

Return type:

JackhmmerResult

Raises:

ExternalToolError, ResourceError