evcouplings.align package

evcouplings.align.alignment module

Class and functions for reading, storing and manipulating multiple sequence alignments.

Authors:
Thomas A. Hopf
class evcouplings.align.alignment.Alignment(sequence_matrix, sequence_ids=None, annotation=None, alphabet='-ACDEFGHIKLMNPQRSTVWY')[source]

Bases: object

Container to store and manipulate multiple sequence alignments.

Note

Important:

  1. Sequence annotation currently is not transformed when selecting subsets of columns or positions (e.g. affects GR and GC lines in Stockholm alignments)
  2. Sequence ranges in IDs are not adjusted when selecting subsets of positions
apply(columns=None, sequences=None, func=<MagicMock name='mock.lower' id='140580392284960'>)[source]

Apply a function along columns and/or rows of alignment matrix, or to entire matrix.

Parameters:
  • columns (np.array(bool) or np.array(int), optional) – Vector containing True for each column that should be retained, False otherwise; or the indices of columns that should be selected
  • sequences (np.array(bool) or np.array(int), optional) – Vector containing True for each sequence that should be retained, False otherwise; or the indices of sequences that should be selected
  • func (callable) – Vectorized numpy function that will be applied to the selected subset of the alignment matrix
Returns:

Alignment with modified columns and sequences (this alignment maintains annotation)

Return type:

Alignment

conservation(normalize=True)[source]

Calculate per-column conservation of sequence alignment based on entropy of single-column frequency distribution.

If self.set_weights() was called previously, the frequencies used to calculate site entropies will be based on reweighted sequences.

Parameters:normalize (bool, optional (default: True)) – Transform column entropy to range 0 (no conservation) to 1 (fully conserved)
Returns:Vector of length L with conservation scores
Return type:np.array
count(char, axis='pos', normalize=True)[source]

Count occurrences of a character in the sequence alignment.

Note

The counts are raw counts not adjusted for sequence redundancy.

Parameters:
  • char (str) – Character which is counted
  • axis ({"pos", "seq"}, optional (default="pos")) – Count along positions or sequences
  • normalize (bool, optional (default=True)) – Normalize count for length of axis (i.e. relative count)
Returns:

Vector containing counts of char along the axis

Return type:

np.array

Raises:

ValueError – Upon invalid axis specification

frequencies

Returns/calculates single-site frequencies of symbols in alignment. Also sets self._frequencies member variable for later reuse.

Previously calculated sequence weights using self.set_weights() will be used to adjust frequency counts; otherwise, each sequence will contribute with equal weight.

Returns:Reference to self._frequencies
Return type:np.array
classmethod from_dict(sequences, **kwargs)[source]

Construct an alignment object from a dictionary with sequence IDs as keys and aligned sequences as values.

Parameters:sequences (dict-like) – Dictionary with pairs of sequence ID (key) and aligned sequence (value)
Returns:initialized alignment
Return type:Alignment
classmethod from_file(fileobj, format='fasta', a3m_inserts='first', raise_hmmer_prefixes=True, **kwargs)[source]

Construct an alignment object by reading in an alignment file.

Parameters:
  • fileobj (file-like obj) – Alignment to be read in
  • format ({"fasta", "stockholm", "a3m"}) – Format of input alignment
  • a3m_inserts ({"first", "delete"}, optional (default: "first")) – Strategy to deal with inserts in a3m alignment files (see read_a3m documentation for details)
  • raise_hmmer_prefixes (bool, optional (default: True)) – HMMER adds number prefixes to sequence identifiers in Stockholm files if identifiers are not unique. If True, the parser will raise an exception if a Stockholm alignment has such prefixes.
  • **kwargs – Additional arguments to be passed to class constructor
Returns:

Parsed alignment

Return type:

Alignment

Raises:

ValueError – For invalid alignments or alignment formats

identities_to(seq, normalize=True)[source]

Calculate sequence identity between sequence and all sequences in the alignment.

seq : np.array, list-like, or str
Sequence for comparison
normalize : bool, optional (default: True)
Calculate relative identity between 0 and 1 by normalizing with length of alignment
lowercase_columns(columns)[source]

Change a subset of columns to lowercase character and replace “-” gaps with “.” gaps, e.g. to exclude them from EC calculations

Parameters:columns (numpy index array) – Subset of columns to make lowercase
Returns:Alignment with lowercase columns
Return type:Alignment
pair_frequencies

Returns/calculates pairwise frequencies of symbols in alignment. Also sets self._pair_frequencies member variable for later reuse.

Previously calculated sequence weights using self.set_weights() will be used to adjust frequency counts; otherwise, each sequence will contribute with equal weight.

Returns:Reference to self._pair_frequencies
Return type:np.array
replace(original, replacement, columns=None, sequences=None)[source]

Replace character with another in full matrix or subset of columns/sequences.

Parameters:
  • original (char) – Character that should be replaced
  • replacement (char) – Replacement character
  • columns (numpy index array) – See self.apply for explanation
  • sequences (numpy index array) – See self.apply for explanation
Returns:

Alignment with replaced characters

Return type:

Alignment

select(columns=None, sequences=None)[source]

Create a sub-alignment that contains a subset of sequences and/or columns.

Note

This does currently not adjust the indices of the sequences. Annotation in the original alignment will be lost and not passed on to the new object.

Parameters:
  • columns (np.array(bool) or np.array(int), optional) – Vector containing True for each column that should be retained, False otherwise; or the indices of columns that should be selected
  • sequences (np.array(bool) or np.array(int), optional) – Vector containing True for each sequence that should be retained, False otherwise; or the indices of sequences that should be selected
Returns:

Alignment with selected columns and sequences (note this alignment looses annotation)

Return type:

Alignment

set_weights(identity_threshold=0.8)[source]

Calculate weights for sequences in alignment by clustering all sequences with sequence identity greater or equal to the given threshold.

Note

This method sets self.weights. After this method was called, methods/attributes such as self.frequencies or self.conservation() will make use of sequence weights.

Note

(Implementation: cannot use property here since we need identity threshold as a parameter….)

Parameters:identity_threshold (float, optional (default: 0.8)) – Sequence identity threshold
write(fileobj, format='fasta', width=80)[source]

Write an alignment to a file.

Parameters:
  • fileobj (file-like object) – File to which alignment is saved
  • format ({"fasta", "aln", "a3m"}) – Output format for alignment
  • width (int) – Column width for fasta alignment
Raises:

ValueError – Upon invalid file format specification

class evcouplings.align.alignment.StockholmAlignment(seqs, gf, gc, gs, gr)

Bases: tuple

gc

Alias for field number 2

gf

Alias for field number 1

gr

Alias for field number 4

gs

Alias for field number 3

seqs

Alias for field number 0

evcouplings.align.alignment.detect_format(fileobj)[source]

Detect if an alignment file is in FASTA or Stockholm format.

Parameters:fileobj (file-like obj) – Alignment file for which to detect format
Returns:format – Format of alignment, None if not detectable
Return type:{“fasta”, “stockholm”, None}
evcouplings.align.alignment.map_from_alphabet(alphabet='-ACDEFGHIKLMNPQRSTVWY', default='-')[source]

Creates a mapping dictionary from a given alphabet.

Parameters:
  • alphabet (str) – Alphabet for remapping. Elements will be remapped according to alphabet starting from 0
  • default (Elements in matrix that are not) – contained in alphabet will be treated as this character
Raises:

ValueError – For invalid default character

evcouplings.align.alignment.map_matrix(matrix, map_)[source]

Map elements in a numpy array using alphabet

Parameters:
  • matrix (np.array) – Matrix that should be remapped
  • map (defaultdict) – Map that will be applied to matrix elements
Returns:

Remapped matrix

Return type:

np.array

evcouplings.align.alignment.parse_header(header)[source]

Extract ID of the (overall) sequence and the sequence range fromat a sequence header of the form sequenceID/start-end.

If the header contains any additional information after the first whitespace character (e.g. sequence annotation), it will be discarded before parsing. If there is no sequence range, only the id (part of string before whitespace) will be returned but no range.

Parameters:header (str) – Sequence header
Returns:
  • seq_id (str) – Sequence identifier
  • start (int) – Start of sequence range. Will be None if it cannot be extracted from the header.
  • stop (int) – End of sequence range. Will be None if it cannot be extracted from the header.
evcouplings.align.alignment.read_a3m(fileobj, inserts='first')[source]

Read an alignment in compressed a3m format and expand into a2m format.

Note

this function is currently not able to keep inserts in all the sequences

..todo:

implement this
Parameters:
  • fileobj (file-like object) – A3M alignment file
  • inserts ({"first", "delete"}) – Keep inserts in first sequence, or delete any insert column and keep only match state columns.
Returns:

Sequences in alignment (key: ID, value: sequence), in order they appeared in input file

Return type:

OrderedDict

Raises:

ValueError – Upon invalid choice of insert strategy

evcouplings.align.alignment.read_fasta(fileobj)[source]

Generator function to read a FASTA-format file (includes aligned FASTA, A2M, A3M formats)

Parameters:fileobj (file-like object) – FASTA alignment file
Returns:Returns tuples of (sequence ID, sequence)
Return type:generator of (str, str) tuples
evcouplings.align.alignment.read_stockholm(fileobj, read_annotation=False, raise_hmmer_prefixes=True)[source]

Generator function to read Stockholm format alignment file (e.g. from hmmer).

Note

Generator iterates over different alignments in the same file (but not over individual sequences, which makes little sense due to wrapped Stockholm alignments).

Parameters:
  • fileobj (file-like object) – Stockholm alignment file
  • read_annotation (bool, optional (default=False)) – Read annotation columns from alignment
  • raise_hmmer_prefixes (bool, optional (default: True)) – HMMER adds number prefixes to sequence identifiers if identifiers are not unique. If True, the parser will raise an exception if the alignment has such prefixes.
Returns:

namedtuple with the following fields: seqs, gf, gc, gs, gr (all DefaultOrderedDict)

Return type:

StockholmAlignment

Raises:

ValueError

evcouplings.align.alignment.sequences_to_matrix(sequences)[source]

Transforms a list of sequences into a numpy array.

Parameters:sequences (list-like (str)) – List of strings containing aligned sequences
Returns:2D array containing sequence alignment (first axis: sequences, second axis: columns)
Return type:numpy.array
evcouplings.align.alignment.write_a3m(sequences, fileobj, insert_gap='.', width=80)[source]

Write a list of IDs/sequences to a FASTA-format file

Parameters:
  • sequences (Iterable) – Iterator returning tuples(str, str), where first value is header/ID and second the sequence
  • fileobj (file-like obj) – File to which alignment will be written
evcouplings.align.alignment.write_aln(sequences, fileobj, width=80)[source]

Write a list of IDs/sequences to a ALN-format file

Currently, the file will not contain headers but simply a block matrix of the alignment

Parameters:
  • sequences (Iterable) – Iterator returning tuples(str, str), where first value is header/ID and second the sequence
  • fileobj (file-like obj) – File to which alignment will be written
evcouplings.align.alignment.write_fasta(sequences, fileobj, width=80)[source]

Write a list of IDs/sequences to a FASTA-format file

Parameters:
  • sequences (Iterable) – Iterator returning tuples(str, str), where first value is header/ID and second the sequence
  • fileobj (file-like obj) – File to which alignment will be written

evcouplings.align.pfam module

Code for identifying Pfam domains and mapping Pfam alignments and ECs into target sequence mode.

Authors:
Thomas A. Hopf

Todo

  1. Write code to create list of family sizes
  2. Implement alignments against Pfam-HMM so precomputed results can be reused in focus mode
evcouplings.align.pfam.create_family_size_table(full_pfam_file, outfile=None)[source]

Parse family size table from Pfam flat file (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz)

Parameters:
  • full_pfam_file (str) – Path to the pfam file (gzip).
  • outfile (str, optional (default: None)) – Save the parsed table to this file as a csv file.
Returns:

Parsed Pfam table.

Return type:

pd.DataFrame

evcouplings.align.pfam.pfam_hits(query_file, hmm_database, prefix, clan_table_file, size_table_file, resolve_overlaps=True, hmmscan_binary='hmmscan')[source]

Identify hits of Pfam HMMs in a set of sequences.

Parameters:
  • query_file (str) – File containing query sequence(s)
  • hmm_database (str) – File containing HMM database (Pfam-A.hmm, with hmmpress applied)
  • prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
  • clan_table_file (str) – File with table linking Pfam families to clans (Pfam-A.clans.tsv). Set to None if not available, but resolve_overlaps cannot be True in that case.
  • size_table_file (str) – File with table of family sizes. Create using create_family_size_table(). Set to None if not available.
  • resolve_overlaps (bool) – Resolve overlapping hits by families from the same clan. Only possible if clan_table_file is given.
  • hmmscan_binary (str (default: "hmmscan")) – Path to hmmscan binary (put in PATH for default to work)
Returns:

Pfam hit table

Return type:

pd.DataFrame

evcouplings.align.pfam.remove_clan_overlaps(pfam_table)[source]

Remove overlapping Pfam hits from same Pfam clan (equivalent of PfamScan.pl). Currently only allows to remove overlaps by domain bitscore.

Todo

is bitscore the most sensible choice if different length hits?

Parameters:pfam_table (pd.DataFrame) – Pfam hit table as generated by pfam_hits() function (must contain Pfam clan annotation).
Returns:Pfam hit table with lower-scoring overlaps removed
Return type:pd.DataFrame

evcouplings.align.protocol module

Protein sequence alignment creation protocols/workflows.

Authors:
Thomas A. Hopf Anna G. Green - complex protocol, hmm_build_and_search Chan Kang - hmm_build_and_search
evcouplings.align.protocol.complex(**kwargs)[source]

Protocol:

Run monomer alignment protocol and postprocess it for EVcomplex calculations

Parameters:kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:outcfg – Output configuration of the alignment protocol, and the following additional field:
genome_location_file : path to file containing
the genomic locations for CDs’s corresponding to identifiers in the alignment.
Return type:dict
evcouplings.align.protocol.cut_sequence(sequence, sequence_id, region=None, first_index=None, out_file=None)[source]

Cut a given sequence to sub-range and save it in a file

Parameters:
  • sequence (str) – Full sequence that will be cut
  • sequence_id (str) – Identifier of sequence, used to construct header in output file
  • region (tuple(int, int), optional (default: None)) – Region that will be cut out of full sequence. If None, full sequence will be returned.
  • first_index (int, optional (default: None)) – Define index of first position in sequence. Will be set to 1 if None.
  • out_file (str, optional (default: None)) – Save sequence in a FASTA file (header: >sequence_id/start_region-end_region)
Returns:

  • str – Subsequence contained in region
  • tuple(int, int) – Region. If no input region is given, this will be (1, len(sequence)); otherwise, the input region is returned.

Raises:

InvalidParameterError – Upon invalid region specification (violating boundaries of sequence)

evcouplings.align.protocol.describe_coverage(alignment, prefix, first_index, minimum_column_coverage)[source]

Produce “classical” buildali coverage statistics, i.e. number of sequences, how many residues have too many gaps, etc.

Only to be applied to alignments focused around the target sequence.

Parameters:
  • alignment (Alignment) – Alignment for which coverage statistics will be calculated
  • prefix (str) – Prefix of alignment file that will be stored as identifier in table
  • first_index (int) – Sequence index of first position of target sequence
  • minimum_column_coverage (Iterable(float) or float) –

    Minimum column coverage threshold(s) that will be tested (creating one row for each threshold in output table).

    Note

    int values given to this function instead of a float will be divided by 100 to create the corresponding floating point representation. This parameter is 1.0 - maximum fraction of gaps per column.

Returns:

Table with coverage statistics for different gap thresholds

Return type:

pd.DataFrame

evcouplings.align.protocol.describe_frequencies(alignment, first_index, target_seq_index=None)[source]

Get parameters of alignment such as gaps, coverage, conservation and summarize.

Parameters:
  • alignment (Alignment) – Alignment for which description statistics will be calculated
  • first_index (int) – Sequence index of first residue in target sequence
  • target_seq_index (int, optional (default: None)) – If given, will add the symbol in the target sequence into a separate column of the output table
Returns:

Table detailing conservation and symbol frequencies for all positions in the alignment

Return type:

pandas.DataFrame

evcouplings.align.protocol.describe_seq_identities(alignment, target_seq_index=0)[source]

Calculate sequence identities of any sequence to target sequence and create result dataframe.

Parameters:alignment (Alignment) – Alignment for which description statistics will be calculated
Returns:Table giving the identity to target sequence for each sequence in alignment (in order of occurrence)
Return type:pandas.DataFrame
evcouplings.align.protocol.existing(**kwargs)[source]

Protocol:

Use external sequence alignment and extract all relevant information from there (e.g. sequence, region, etc.), then apply gap & fragment filtering as usual

Parameters:kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:outcfg – Output configuration of the pipeline, including the following fields:
  • sequence_id (passed through from input)
  • alignment_file
  • raw_focus_alignment_file
  • statistics_file
  • sequence_file
  • first_index
  • target_sequence_file
  • annotation_file (None)
  • frequencies_file
  • identities_file
  • focus_mode
  • focus_sequence
  • segments
Return type:dict
evcouplings.align.protocol.extract_header_annotation(alignment, from_annotation=True)[source]

Extract Uniprot/Uniref sequence annotation from Stockholm file (as output by jackhmmer). This function may not work for other formats.

Parameters:
  • alignment (Alignment) – Multiple sequence alignment object
  • from_annotation (bool, optional (default: True)) – Use annotation line (in Stockholm file) rather than sequence ID line (e.g. in FASTA file)
Returns:

Table containing all annotation (one row per sequence in alignment, in order of occurrence)

Return type:

pandas.DataFrame

evcouplings.align.protocol.fetch_sequence(sequence_id, sequence_file, sequence_download_url, out_file)[source]

Fetch sequence either from database based on identifier, or from input sequence file.

Parameters:
  • sequence_id (str) – Identifier of sequence that should be retrieved
  • sequence_file (str) – File containing sequence. If None, sqeuence will be downloaded from sequence_download_url
  • sequence_download_url (str) – URL from which to download missing sequence. Must contain “{}” at the position where sequence ID will be inserted into download URL (using str.format).
  • out_file (str) – Output file in which sequence will be stored, if sequence_file is not existing.
Returns:

  • str – Path of file with stored sequence (can be sequence_file or out_file)
  • tuple (str, str) – Identifier of sequence as stored in file, and sequence

Protocol:

Build HMM from sequence alignment using hmmbuild and search against a sequence database using hmmsearch.

Parameters:kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:outcfg – Output configuration of the protocol, including the following fields:
  • target_sequence_file
  • sequence_file
  • raw_alignment_file
  • hittable_file
  • focus_mode
  • focus_sequence
  • segments
Return type:dict

Protocol:

Iterative jackhmmer search against a sequence database.

Parameters:kwargs arguments (Mandatory) – See list below in code where calling check_required

:param .. todo::: explain meaning of parameters in detail.

Returns:outcfg – Output configuration of the protocol, including the following fields:
  • sequence_id (passed through from input)
  • first_index (passed through from input)
  • target_sequence_file
  • sequence_file
  • raw_alignment_file
  • hittable_file
  • focus_mode
  • focus_sequence
  • segments
Return type:dict
evcouplings.align.protocol.modify_alignment(focus_ali, target_seq_index, target_seq_id, region_start, **kwargs)[source]

Apply pairwise identity filtering, fragment filtering, and exclusion of columns with too many gaps to a sequence alignment. Also generates files describing properties of the alignment such as frequency distributions, conservation, and “old-style” alignment statistics files.

Note

assumes focus alignment (otherwise unprocessed) as input.

Todo

come up with something more clever to filter fragments than fixed width (e.g. use 95% quantile of length distribution as reference point)

Parameters:
  • focus_ali (Alignment) – Focus-mode input alignment
  • target_seq_index (int) – Index of target sequence in alignment
  • target_seq_id (str) – Identifier of target sequence (without range)
  • region_start (int) – Index of first sequence position in target sequence
  • kwargs (See required arguments in source code) –
Returns:

  • outcfg (Dict) – File products generated by the function:
    • alignment_file
    • statistics_file
    • frequencies_file
    • identities_file
    • raw_focus_alignment_file
  • ali (Alignment) – Final processed alignment

evcouplings.align.protocol.run(**kwargs)[source]

Run alignment protocol to generate multiple sequence alignment from input sequence.

Parameters:
  • kwargs arguments (Mandatory) – protocol: Alignment protocol to run prefix: Output prefix for all generated files
  • Optional
Returns:

  • Alignment
  • Dictionary with results of stage in following fields (in brackets - not returned by all protocols)
    • alignment_file
    • [raw_alignment_file]
    • statistics_file
    • target_sequence_file
    • sequence_file
    • [annotation_file]
    • frequencies_file
    • identities_file
    • [hittable_file]
    • focus_mode
    • focus_sequence
    • segments

evcouplings.align.protocol.search_thresholds(use_bitscores, seq_threshold, domain_threshold, seq_len)[source]

Set homology search inclusion parameters.

HMMER hits get included in the HMM according to a two-step rule
  1. sequence passes sequence-level treshold
  2. domain passes domain-level threshold
Therefore, search thresholds are set based on the following logic:
  1. If only sequence threshold is given, a MissingParameterException is raised
  2. If only bitscore threshold is given, sequence threshold is set to the same
  3. If both thresholds are given, they are according to defined values
Valid inputs for bitscore thresholds:
  1. int or str: taken as absolute score threshold
  2. float: taken as relative threshold (absolute threshold derived by

multiplication with domain length)

Valid inputs for integer thresholds:
  1. int: Used as negative exponent, threshold will be set to 1E-<exponent>
  2. float or str: Interpreted literally
Parameters:
  • use_bitscores (bool) – Use bitscore threshold instead of E-value threshold
  • domain_threshold (str or int or float) – Domain-level threshold. See rules above.
  • seq_threshold (str or int or float) – Sequence-level threshold. See rules above.
  • seq_len (int) – Length of sequence. Used to calculate absolute bitscore threshold for relative bitscore thresholds.
Returns:

Sequence- and domain-level thresholds ready to be fed into HMMER

Return type:

tuple (str, str)

evcouplings.align.protocol.standard(**kwargs)[source]

Protocol:

Standard buildali4 workflow (run iterative jackhmmer search against sequence database, than determine which sequences and columns to include in the calculation based on coverage and maximum gap thresholds).

Parameters:kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:
  • outcfg (dict) – Output configuration of the pipeline, including the following fields:
    • sequence_id (passed through from input)
    • first_index (passed through from input)
    • alignment_file
    • raw_alignment_file
    • raw_focus_alignment_file
    • statistics_file
    • target_sequence_file
    • sequence_file
    • annotation_file
    • frequencies_file
    • identities_file
    • hittable_file
    • focus_mode
    • focus_sequence
    • segments
  • ali (Alignment) – Final sequence alignment

evcouplings.align.tools module

Wrappers for running external sequence alignment tools

Authors:
Thomas A. Hopf Anna G. Green - run_hmmbuild, run_hmmsearch Chan Kang - run_hmmbuild, run_hmmsearch
class evcouplings.align.tools.HmmbuildResult(prefix, hmmfile, output)

Bases: tuple

hmmfile

Alias for field number 1

output

Alias for field number 2

prefix

Alias for field number 0

class evcouplings.align.tools.HmmscanResult(prefix, output, tblout, domtblout, pfamtblout)

Bases: tuple

domtblout

Alias for field number 3

output

Alias for field number 1

pfamtblout

Alias for field number 4

prefix

Alias for field number 0

tblout

Alias for field number 2

class evcouplings.align.tools.HmmsearchResult(prefix, alignment, output, tblout, domtblout)

Bases: tuple

alignment

Alias for field number 1

domtblout

Alias for field number 4

output

Alias for field number 2

prefix

Alias for field number 0

tblout

Alias for field number 3

class evcouplings.align.tools.JackhmmerResult(prefix, alignment, output, tblout, domtblout)

Bases: tuple

alignment

Alias for field number 1

domtblout

Alias for field number 4

output

Alias for field number 2

prefix

Alias for field number 0

tblout

Alias for field number 3

evcouplings.align.tools.read_hmmer_domtbl(filename)[source]

Read a HMMER domtbl file into DataFrame.

Parameters:filename (str) – Path of domtbl file
Returns:DataFrame with parsed domtbl
Return type:pd.DataFrame
evcouplings.align.tools.read_hmmer_tbl(filename)[source]

Read a HMMER tbl file into DataFrame.

Parameters:filename (str) – Path of tbl file
Returns:DataFrame with parsed tbl
Return type:pd.DataFrame
evcouplings.align.tools.run_hhfilter(input_file, output_file, threshold=95, columns='a2m', binary='hhfilter')[source]

Redundancy-reduce a sequence alignment using hhfilter from the HHsuite alignment suite.

Parameters:
  • input_file (str) – Path to input alignment in A2M/FASTA format
  • output_file (str) – Path to output alignment (will be in A3M format)
  • threshold (int, optional (default: 95)) – Sequence identity threshold for maximum pairwise identity (between 0 and 100)
  • columns ({"first", "a2m"}, optional (default: "a2m")) – Definition of match columns (based on first sequence or upper-case columns (a2m))
  • binary (str) – Path to hhfilter binary
Returns:

output_file

Return type:

str

Raises:
  • ResourceError – If output alignment is non-existent/empty
  • ValueError – Upon invalid value of columns parameter
evcouplings.align.tools.run_hmmbuild(alignment_file, prefix, cpu=None, stdout_redirect=None, symfrac=None, binary='hmmbuild')[source]

Profile HMM construction from multiple sequence alignments Refer to HMMER documentation for details.

http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf

Parameters:
  • alignment_file (str) – File containing the multiple sequence alignment. Can be in Stockholm, a2m, or clustal formats, or any other format recognized by hmmer. Please note that ALL POSITIONS above the symfrac cutoff will be used in HMM construction (if the alignment contains columns that are insertions relative to the query sequence, this may be problematic for structure comparison)
  • prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
  • cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
  • stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
  • symfrac (float, optional (default: None)) – range 0.0 - 1.0, HMMbuild will use columns with > symfrac percent gaps to construct the HMM. If None provided, HMMbuild internal default is 0.5. (Note: this is calculated after their internal sequence weighting is calculated)
  • binary (str (default: "hmmbuild")) – Path to jackhmmer binary (put in PATH for default to work)
Returns:

namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)

Return type:

HmmbuildResult

Raises:

ExternalToolError, ResourceError

evcouplings.align.tools.run_hmmscan(query, database, prefix, use_model_threshold=True, threshold_type='cut_ga', use_bitscores=True, domain_threshold=None, seq_threshold=None, nobias=False, cpu=None, stdout_redirect=None, binary='hmmscan')[source]

Run hmmscan of HMMs in database against sequences in query to identify matches of these HMMs. Refer to HMMER Userguide for explanation of these parameters.

Parameters:
  • query (str) – File containing query sequence(s)
  • database (str) – File containing HMM database (prepared with hmmpress)
  • prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
  • use_model_threshold (bool (default: True)) – Use model-specific inclusion thresholds from HMM database rather than global bitscore/E-value thresholds (use_bitscores, domain_threshold and seq_threshold are overriden by this flag).
  • threshold-type ({"cut_ga", "cut_nc", "cut_tc"} (default: "cut_ga")) – Use gathering (default), noise or trusted cutoff to define scan hits. Please refer to HMMER manual for details.
  • use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values. Overriden by use_model_threshold flag.
  • domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
  • seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
  • nobias (bool, optional (default: False)) – Turn of bias correction
  • cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
  • stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
  • binary (str (default: "hmmscan")) – Path to hmmscan binary (put in PATH for default to work)
Returns:

namedtuple with fields corresponding to the different output files (prefix, output, tblout, domtblout, pfamtblout)

Return type:

HmmscanResult

Raises:

ExternalToolError, ResourceError

evcouplings.align.tools.run_hmmsearch(hmmfile, database, prefix, use_bitscores, domain_threshold, seq_threshold, nobias=False, cpu=None, stdout_redirect=None, binary='hmmsearch')[source]

Search profile(s) against a sequence database. Refer to HMMER documentation for details.

http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf

Parameters:
  • hmmfile (str) – File containing the profile(s)
  • database (str) – File containing sequence database
  • prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
  • use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values.
  • domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
  • seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
  • nobias (bool, optional (default: False)) – Turn of bias correction
  • cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
  • stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
  • binary (str (default: "hmmsearch")) – Path to jackhmmer binary (put in PATH for default to work)
Returns:

namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)

Return type:

HmmsearchResult

Raises:

ExternalToolError, ResourceError

evcouplings.align.tools.run_jackhmmer(query, database, prefix, use_bitscores, domain_threshold, seq_threshold, iterations=5, nobias=False, cpu=None, stdout_redirect=None, checkpoints_hmm=False, checkpoints_ali=False, binary='jackhmmer')[source]

Run jackhmmer sequence search against target database. Refer to HMMER Userguide for explanation of these parameters.

Parameters:
  • query (str) – File containing query sequence
  • database (str) – File containing sequence database
  • prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
  • use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values.
  • domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
  • seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
  • iterations (int) – number of jackhmmer search iterations
  • nobias (bool, optional (default: False)) – Turn of bias correction
  • cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
  • stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
  • checkpoints_hmm (bool, optional (default: False)) – Store checkpoint HMMs to prefix.<iter>.hmm
  • checkpoints_ali (bool, optional (default: False)) – Store checkpoint alignments to prefix.<iter>.sto
  • binary (str (default: "jackhmmer")) – Path to jackhmmer binary (put in PATH for default to work)
Returns:

namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)

Return type:

JackhmmerResult

Raises:

ExternalToolError, ResourceError