evcouplings.align package¶

evcouplings.align.alignment module¶

evcouplings.align.pfam module¶

Code for identifying Pfam domains and mapping Pfam alignments and ECs into target sequence mode.

Authors:: Thomas A. Hopf

Todo

Write code to create list of family sizes
Implement alignments against Pfam-HMM so precomputed results can be reused in focus mode

evcouplings.align.pfam.create_family_size_table(full_pfam_file, outfile=None)[source]¶

Parse family size table from Pfam flat file (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz)

Parameters:	full_pfam_file (str) – Path to the pfam file (gzip). outfile (str, optional (default: None)) – Save the parsed table to this file as a csv file.
Returns:	Parsed Pfam table.
Return type:	pd.DataFrame

evcouplings.align.pfam.pfam_hits(query_file, hmm_database, prefix, clan_table_file, size_table_file, resolve_overlaps=True, **kwargs)[source]¶

Identify hits of Pfam HMMs in a set of sequences.

Parameters:	query_file (str) – File containing query sequence(s) hmm_database (str) – File containing HMM database (Pfam-A.hmm, with hmmpress applied) prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing. clan_table_file (str) – File with table linking Pfam families to clans (Pfam-A.clans.tsv). Set to None if not available, but resolve_overlaps cannot be True in that case. size_table_file (str) – File with table of family sizes. Create using create_family_size_table(). Set to None if not available. resolve_overlaps (bool) – Resolve overlapping hits by families from the same clan. Only possible if clan_table_file is given. *kwargs (dict*) – kwargs passed on to evcouplings.align.tools.run_hmmscan
Returns:	Pfam hit table
Return type:	pd.DataFrame

evcouplings.align.pfam.remove_clan_overlaps(pfam_table)[source]¶

Remove overlapping Pfam hits from same Pfam clan (equivalent of PfamScan.pl). Currently only allows to remove overlaps by domain bitscore.

Todo

is bitscore the most sensible choice if different length hits?

Parameters:	pfam_table (pd.DataFrame) – Pfam hit table as generated by pfam_hits() function (must contain Pfam clan annotation).
Returns:	Pfam hit table with lower-scoring overlaps removed
Return type:	pd.DataFrame

evcouplings.align.protocol module¶

Protein sequence alignment creation protocols/workflows.

Authors:: Thomas A. Hopf Anna G. Green - complex protocol, hmm_build_and_search Chan Kang - hmm_build_and_search

evcouplings.align.protocol.complex(**kwargs)[source]¶

Protocol:

Run monomer alignment protocol and postprocess it for EVcomplex calculations

Parameters:	kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:	outcfg – Output configuration of the alignment protocol, and the following additional field: genome_location_file : path to file containing the genomic locations for CDs’s corresponding to identifiers in the alignment.
Return type:	dict

evcouplings.align.protocol.cut_sequence(sequence, sequence_id, region=None, first_index=None, out_file=None)[source]¶

Cut a given sequence to sub-range and save it in a file

Parameters:

sequence (str) – Full sequence that will be cut
sequence_id (str) – Identifier of sequence, used to construct header in output file
region (tuple(int, int), optional (default: None)) – Region that will be cut out of full sequence. If None, full sequence will be returned.
first_index (int, optional (default: None)) – Define index of first position in sequence. Will be set to 1 if None.
out_file (str, optional (default: None)) – Save sequence in a FASTA file (header: >sequence_id/start_region-end_region)

Returns:

str – Subsequence contained in region
tuple(int, int) – Region. If no input region is given, this will be (1, len(sequence)); otherwise, the input region is returned.

Raises:

InvalidParameterError – Upon invalid region specification (violating boundaries of sequence)

evcouplings.align.protocol.describe_coverage(alignment, prefix, first_index, minimum_column_coverage)[source]¶

Produce “classical” buildali coverage statistics, i.e. number of sequences, how many residues have too many gaps, etc.

Only to be applied to alignments focused around the target sequence.

Parameters:	alignment (Alignment) – Alignment for which coverage statistics will be calculated prefix (str) – Prefix of alignment file that will be stored as identifier in table first_index (int) – Sequence index of first position of target sequence minimum_column_coverage (Iterable(float) or float) – Minimum column coverage threshold(s) that will be tested (creating one row for each threshold in output table). Note `int` values given to this function instead of a float will be divided by 100 to create the corresponding floating point representation. This parameter is 1.0 - maximum fraction of gaps per column.
Returns:	Table with coverage statistics for different gap thresholds
Return type:	pd.DataFrame

evcouplings.align.protocol.describe_frequencies(alignment, first_index, target_seq_index=None)[source]¶

Get parameters of alignment such as gaps, coverage, conservation and summarize.

Parameters:	alignment (Alignment) – Alignment for which description statistics will be calculated first_index (int) – Sequence index of first residue in target sequence target_seq_index (int, optional (default: None)) – If given, will add the symbol in the target sequence into a separate column of the output table
Returns:	Table detailing conservation and symbol frequencies for all positions in the alignment
Return type:	pandas.DataFrame

evcouplings.align.protocol.describe_seq_identities(alignment, target_seq_index=0)[source]¶

Calculate sequence identities of any sequence to target sequence and create result dataframe.

Parameters:	alignment (Alignment) – Alignment for which description statistics will be calculated
Returns:	Table giving the identity to target sequence for each sequence in alignment (in order of occurrence)
Return type:	pandas.DataFrame

evcouplings.align.protocol.existing(**kwargs)[source]¶

Protocol:

Use external sequence alignment and extract all relevant information from there (e.g. sequence, region, etc.), then apply gap & fragment filtering as usual

Parameters:	kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:	outcfg – Output configuration of the pipeline, including the following fields: sequence_id (passed through from input) alignment_file raw_focus_alignment_file statistics_file sequence_file first_index target_sequence_file annotation_file (None) frequencies_file identities_file focus_mode focus_sequence segments
Return type:	dict

evcouplings.align.protocol.extract_header_annotation(alignment, from_annotation=True)[source]¶

Extract Uniprot/Uniref sequence annotation from Stockholm file (as output by jackhmmer). This function may not work for other formats.

Parameters:	alignment (Alignment) – Multiple sequence alignment object from_annotation (bool, optional (default: True)) – Use annotation line (in Stockholm file) rather than sequence ID line (e.g. in FASTA file)
Returns:	Table containing all annotation (one row per sequence in alignment, in order of occurrence)
Return type:	pandas.DataFrame

evcouplings.align.protocol.fetch_sequence(sequence_id, sequence_file, sequence_download_url, out_file)[source]¶

Fetch sequence either from database based on identifier, or from input sequence file.

Parameters:

sequence_id (str) – Identifier of sequence that should be retrieved
sequence_file (str) – File containing sequence. If None, sqeuence will be downloaded from sequence_download_url
sequence_download_url (str) – URL from which to download missing sequence. Must contain “{}” at the position where sequence ID will be inserted into download URL (using str.format).
out_file (str) – Output file in which sequence will be stored, if sequence_file is not existing.

Returns:

str – Path of file with stored sequence (can be sequence_file or out_file)
tuple (str, str) – Identifier of sequence as stored in file, and sequence

evcouplings.align.protocol.hmmbuild_and_search(**kwargs)[source]¶

Protocol:

Build HMM from sequence alignment using hmmbuild and search against a sequence database using hmmsearch.

Parameters:	kwargs arguments (Mandatory) – See list below in code where calling check_required
Returns:	outcfg – Output configuration of the protocol, including the following fields: target_sequence_file sequence_file raw_alignment_file hittable_file focus_mode focus_sequence segments
Return type:	dict

evcouplings.align.protocol.jackhmmer_search(**kwargs)[source]¶

Protocol:

Iterative jackhmmer search against a sequence database.

Parameters:	kwargs arguments (Mandatory) – See list below in code where calling check_required

:param .. todo::: explain meaning of parameters in detail.

Returns:	outcfg – Output configuration of the protocol, including the following fields: sequence_id (passed through from input) first_index (passed through from input) target_sequence_file sequence_file raw_alignment_file hittable_file focus_mode focus_sequence segments
Return type:	dict

evcouplings.align.protocol.modify_alignment(focus_ali, target_seq_index, target_seq_id, region_start, **kwargs)[source]¶

Apply pairwise identity filtering, fragment filtering, and exclusion of columns with too many gaps to a sequence alignment. Also generates files describing properties of the alignment such as frequency distributions, conservation, and “old-style” alignment statistics files.

Note

assumes focus alignment (otherwise unprocessed) as input.

Todo

come up with something more clever to filter fragments than fixed width (e.g. use 95% quantile of length distribution as reference point)

Parameters:

focus_ali (Alignment) – Focus-mode input alignment
target_seq_index (int) – Index of target sequence in alignment
target_seq_id (str) – Identifier of target sequence (without range)
region_start (int) – Index of first sequence position in target sequence
kwargs (See required arguments in source code) –

Returns:

outcfg (Dict) – File products generated by the function:
- alignment_file
- statistics_file
- frequencies_file
- identities_file
- raw_focus_alignment_file
ali (Alignment) – Final processed alignment

evcouplings.align.protocol.run(**kwargs)[source]¶

Run alignment protocol to generate multiple sequence alignment from input sequence.

Parameters:

kwargs arguments (Mandatory) – protocol: Alignment protocol to run prefix: Output prefix for all generated files
Optional –

Returns:

Alignment
Dictionary with results of stage in following fields (in brackets - not returned by all protocols) –
- alignment_file
- [raw_alignment_file]
- statistics_file
- target_sequence_file
- sequence_file
- [annotation_file]
- frequencies_file
- identities_file
- [hittable_file]
- focus_mode
- focus_sequence
- segments

evcouplings.align.protocol.search_thresholds(use_bitscores, seq_threshold, domain_threshold, seq_len)[source]¶

Set homology search inclusion parameters.

HMMER hits get included in the HMM according to a two-step rule

sequence passes sequence-level treshold
domain passes domain-level threshold

Therefore, search thresholds are set based on the following logic:

If only sequence threshold is given, a MissingParameterException is raised
If only bitscore threshold is given, sequence threshold is set to the same
If both thresholds are given, they are according to defined values

Valid inputs for bitscore thresholds:

int or str: taken as absolute score threshold

float: taken as relative threshold (absolute threshold derived by

multiplication with domain length)

Valid inputs for integer thresholds:

int: Used as negative exponent, threshold will be set to 1E-<exponent>
float or str: Interpreted literally

Parameters:	use_bitscores (bool) – Use bitscore threshold instead of E-value threshold domain_threshold (str or int or float) – Domain-level threshold. See rules above. seq_threshold (str or int or float) – Sequence-level threshold. See rules above. seq_len (int) – Length of sequence. Used to calculate absolute bitscore threshold for relative bitscore thresholds.
Returns:	Sequence- and domain-level thresholds ready to be fed into HMMER
Return type:	tuple (str, str)

evcouplings.align.protocol.standard(**kwargs)[source]¶

Protocol:

Standard buildali4 workflow (run iterative jackhmmer search against sequence database, than determine which sequences and columns to include in the calculation based on coverage and maximum gap thresholds).

Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required

Returns:

outcfg (dict) – Output configuration of the pipeline, including the following fields:
- sequence_id (passed through from input)
- first_index (passed through from input)
- alignment_file
- raw_alignment_file
- raw_focus_alignment_file
- statistics_file
- target_sequence_file
- sequence_file
- annotation_file
- frequencies_file
- identities_file
- hittable_file
- focus_mode
- focus_sequence
- segments
ali (Alignment) – Final sequence alignment

evcouplings.align.tools module¶

Wrappers for running external sequence alignment tools

Authors:: Thomas A. Hopf Anna G. Green - run_hmmbuild, run_hmmsearch Chan Kang - run_hmmbuild, run_hmmsearch

class evcouplings.align.tools.HmmbuildResult(prefix, hmmfile, output)¶

Bases: tuple

hmmfile¶: Alias for field number 1

output¶: Alias for field number 2

prefix¶: Alias for field number 0

class evcouplings.align.tools.HmmscanResult(prefix, output, tblout, domtblout, pfamtblout)¶

Bases: tuple

domtblout¶: Alias for field number 3

output¶: Alias for field number 1

pfamtblout¶: Alias for field number 4

prefix¶: Alias for field number 0

tblout¶: Alias for field number 2

class evcouplings.align.tools.HmmsearchResult(prefix, alignment, output, tblout, domtblout)¶

Bases: tuple

alignment¶: Alias for field number 1

domtblout¶: Alias for field number 4

output¶: Alias for field number 2

prefix¶: Alias for field number 0

tblout¶: Alias for field number 3

class evcouplings.align.tools.JackhmmerResult(prefix, alignment, output, tblout, domtblout)¶

Bases: tuple

alignment¶: Alias for field number 1

domtblout¶: Alias for field number 4

output¶: Alias for field number 2

prefix¶: Alias for field number 0

tblout¶: Alias for field number 3

evcouplings.align.tools.read_hmmer_domtbl(filename)[source]¶

Read a HMMER domtbl file into DataFrame.

Parameters:	filename (str) – Path of domtbl file
Returns:	DataFrame with parsed domtbl
Return type:	pd.DataFrame

evcouplings.align.tools.read_hmmer_tbl(filename)[source]¶

Read a HMMER tbl file into DataFrame.

Parameters:	filename (str) – Path of tbl file
Returns:	DataFrame with parsed tbl
Return type:	pd.DataFrame

evcouplings.align.tools.run_hhfilter(input_file, output_file, threshold=95, columns='a2m', binary='hhfilter')[source]¶

Redundancy-reduce a sequence alignment using hhfilter from the HHsuite alignment suite.

Parameters:	input_file (str) – Path to input alignment in A2M/FASTA format output_file (str) – Path to output alignment (will be in A3M format) threshold (int, optional (default: 95)) – Sequence identity threshold for maximum pairwise identity (between 0 and 100) columns ({"first", "a2m"}, optional (default: "a2m")) – Definition of match columns (based on first sequence or upper-case columns (a2m)) binary (str) – Path to hhfilter binary
Returns:	output_file
Return type:	str
Raises:	`ResourceError` – If output alignment is non-existent/empty `ValueError` – Upon invalid value of columns parameter

evcouplings.align.tools.run_hmmbuild(alignment_file, prefix, cpu=None, stdout_redirect=None, symfrac=None, binary='hmmbuild')[source]¶

Profile HMM construction from multiple sequence alignments Refer to HMMER documentation for details.

http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf

Parameters:	alignment_file (str) – File containing the multiple sequence alignment. Can be in Stockholm, a2m, or clustal formats, or any other format recognized by hmmer. Please note that ALL POSITIONS above the symfrac cutoff will be used in HMM construction (if the alignment contains columns that are insertions relative to the query sequence, this may be problematic for structure comparison) prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing. cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None. stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose) symfrac (float, optional (default: None)) – range 0.0 - 1.0, HMMbuild will use columns with > symfrac percent gaps to construct the HMM. If None provided, HMMbuild internal default is 0.5. (Note: this is calculated after their internal sequence weighting is calculated) binary (str (default: "hmmbuild")) – Path to jackhmmer binary (put in PATH for default to work)
Returns:	namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)
Return type:	HmmbuildResult
Raises:	ExternalToolError, ResourceError

evcouplings.align.tools.run_hmmscan(query, database, prefix, use_model_threshold=True, threshold_type='cut_ga', use_bitscores=True, domain_threshold=None, seq_threshold=None, nobias=False, cpu=None, stdout_redirect=None, binary='hmmscan')[source]¶

Run hmmscan of HMMs in database against sequences in query to identify matches of these HMMs. Refer to HMMER Userguide for explanation of these parameters.

Parameters:	query (str) – File containing query sequence(s) database (str) – File containing HMM database (prepared with hmmpress) prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing. use_model_threshold (bool (default: True)) – Use model-specific inclusion thresholds from HMM database rather than global bitscore/E-value thresholds (use_bitscores, domain_threshold and seq_threshold are overriden by this flag). threshold-type ({"cut_ga", "cut_nc", "cut_tc"} (default: "cut_ga")) – Use gathering (default), noise or trusted cutoff to define scan hits. Please refer to HMMER manual for details. use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values. Overriden by use_model_threshold flag. domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50) seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50) nobias (bool, optional (default: False)) – Turn of bias correction cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None. stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose) binary (str (default: "hmmscan")) – Path to hmmscan binary (put in PATH for default to work)
Returns:	namedtuple with fields corresponding to the different output files (prefix, output, tblout, domtblout, pfamtblout)
Return type:	HmmscanResult
Raises:	ExternalToolError, ResourceError

evcouplings.align.tools.run_hmmsearch(hmmfile, database, prefix, use_bitscores, domain_threshold, seq_threshold, nobias=False, cpu=None, stdout_redirect=None, binary='hmmsearch')[source]¶

Search profile(s) against a sequence database. Refer to HMMER documentation for details.

http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf

Parameters:	hmmfile (str) – File containing the profile(s) database (str) – File containing sequence database prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing. use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values. domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50) seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50) nobias (bool, optional (default: False)) – Turn of bias correction cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None. stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose) binary (str (default: "hmmsearch")) – Path to jackhmmer binary (put in PATH for default to work)
Returns:	namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)
Return type:	HmmsearchResult
Raises:	ExternalToolError, ResourceError

evcouplings.align.tools.run_jackhmmer(query, database, prefix, use_bitscores, domain_threshold, seq_threshold, iterations=5, nobias=False, cpu=None, stdout_redirect=None, checkpoints_hmm=False, checkpoints_ali=False, binary='jackhmmer')[source]¶

Run jackhmmer sequence search against target database. Refer to HMMER Userguide for explanation of these parameters.

Parameters:	query (str) – File containing query sequence database (str) – File containing sequence database prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing. use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values. domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50) seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50) iterations (int) – number of jackhmmer search iterations nobias (bool, optional (default: False)) – Turn of bias correction cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None. stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose) checkpoints_hmm (bool, optional (default: False)) – Store checkpoint HMMs to prefix.<iter>.hmm checkpoints_ali (bool, optional (default: False)) – Store checkpoint alignments to prefix.<iter>.sto binary (str (default: "jackhmmer")) – Path to jackhmmer binary (put in PATH for default to work)
Returns:	namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)
Return type:	JackhmmerResult
Raises:	ExternalToolError, ResourceError