Consensus Sequence #3

Closed
j23414 opened this issue Aug 29, 2018 · 8 comments

Comments

@j23414
Contributor

j23414 commented Aug 29, 2018

Any thoughts on smof being able to output a consensus sequence when given an alignment file?

input fasta

>seq_1
AAATATAT
>seq_2
AAAATTAT
>seq_3
AAATATAA

output fasta

>Consensus
AAATATAT
@arendsee
Collaborator

@j23414 Thanks for the suggestion. It's a fine idea and I think falls within the intended scope of smof. I'll put it on the TODO list.

@arendsee arendsee self-assigned this Aug 29, 2018
arendsee added a commit that referenced this issue Aug 29, 2018
I can now build a consensus table that counts the characters in each
column of the alignment.

If we want to reduce these to a single consensus string, then we need to
decide how to resolve ties.

```
>seq_1
AAATATAT
>seq_2
AAAATTAT
>seq_3
AAATATAA
```

```
A       T
3       0
3       0
3       0
1       2
2       1
0       3
3       0
1       2
```
@arendsee
Collaborator

I've partially implemented a consensus function over on the dev branch. You can check out the comment in the commit message.

The current output is a table of character counts across the columns in the alignment. But if we want to go from this table to a single consensus string, we will need to decide how to resolve ties.
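For readers following along, the count table from the commit message can be reproduced with a few lines of Python. This is only a sketch of the idea, not smof's code: the helper names `read_fasta` and `column_counts` are made up for illustration, and the input is the three-sequence alignment from the top of this issue.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def column_counts(seqs):
    """Return one Counter of characters per alignment column."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences in an alignment must have equal length")
    return [Counter(col) for col in zip(*seqs)]


ALIGNMENT = """>seq_1
AAATATAT
>seq_2
AAAATTAT
>seq_3
AAATATAA
"""

seqs = [seq for _, seq in read_fasta(ALIGNMENT)]
counts = column_counts(seqs)
alphabet = sorted(set("".join(seqs)))

# Print a tab-separated table: one row per column of the alignment.
print("\t".join(alphabet))
for c in counts:
    print("\t".join(str(c[ch]) for ch in alphabet))
```

A tie appears whenever two characters share the highest count in a row of this table, which is the case the rest of the thread is about.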

@j23414
Contributor Author

j23414 commented Aug 30, 2018

For ties, would it make sense to output both options (both consensus sequences)?

Although I could see that becoming many consensus sequences (ties in more than 1 position)...

@arendsee
Collaborator

I can't do that in FASTA format.

Perhaps if there is a near tie, I could use a wildcard, like *. For the cutoff, I could use Shannon entropy. This is the standard measure of conservation used in LOGO plots and the like.
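To make the entropy idea concrete, here is a rough sketch of how a per-column Shannon entropy cutoff could drive the wildcard. It is an illustration under assumptions, not what smof does: the function names and the `max_entropy=0.5` threshold (in bits) are arbitrary choices for this example.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def consensus_with_wildcard(seqs, max_entropy=0.5, wildcard="*"):
    """Emit the majority character per column, or a wildcard when the
    column's entropy exceeds max_entropy (an assumed, tunable cutoff)."""
    out = []
    for col in zip(*seqs):
        if shannon_entropy(col) > max_entropy:
            out.append(wildcard)
        else:
            out.append(Counter(col).most_common(1)[0][0])
    return "".join(out)


seqs = ["AAATATAT", "AAAATTAT", "AAATATAA"]
print(consensus_with_wildcard(seqs))
```

With only three sequences, any column that is not fully conserved has an entropy of about 0.92 bits, so a cutoff of 0.5 masks every variable column. A looser cutoff would let the 2-to-1 majorities through.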

@j23414
Contributor Author

j23414 commented Aug 30, 2018

A standard input for LOGO plots would be acceptable.

By multiple consensuses I'm talking about:

>consensus_1
AATATAT
>consensus_2
AAAATAT

I'm looking at influenza virus sequences, which group as H1.alpha, H1.beta, H1.gamma, etc.
I need the consensus for each group (alpha, beta, gamma, ...) and am aligning the consensus(es) so I can determine the major nucleotide (or amino acid) differences between groups.

This might explain why a vaccine works for alpha (targets certain amino acids), but not for gamma, etc...

No worries either way, I can work with the table of character counts. : )
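The between-group comparison described above could be sketched as a position-by-position diff of two aligned consensus strings. The function name `consensus_differences` is hypothetical, and the inputs are the two example consensus sequences from this comment; real group consensuses would need to be aligned to the same length first.

```python
def consensus_differences(a, b):
    """Return (position, char_in_a, char_in_b) for each mismatching column.

    Positions are 1-based. Both strings must already be aligned to the
    same length.
    """
    if len(a) != len(b):
        raise ValueError("consensus sequences must be aligned to equal length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


consensus_1 = "AATATAT"
consensus_2 = "AAAATAT"
for pos, x, y in consensus_differences(consensus_1, consensus_2):
    print(f"position {pos}: {x} vs {y}")
```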

@arendsee
Collaborator

Hmm, I'd worry about a combinatorial explosion.
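The size of that explosion is easy to estimate: if every tie were expanded into its own sequence, the number of consensus sequences would be the product, over all columns, of the number of characters tied for the top count. A small sketch follows, with hypothetical function names and a made-up two-sequence alignment chosen so that every column is tied.

```python
from collections import Counter
from math import prod


def tied_top_characters(column):
    """Return the characters that share the highest count in a column."""
    counts = Counter(column)
    best = max(counts.values())
    return sorted(ch for ch, n in counts.items() if n == best)


def count_consensus_sequences(seqs):
    """Number of distinct consensus strings if every tie were expanded."""
    return prod(len(tied_top_characters(col)) for col in zip(*seqs))


# Hypothetical alignment: the two sequences differ in all 10 columns,
# so each column is a 2-way tie.
seqs = ["ACGTACGTAC", "CGTACGTACG"]
print(count_consensus_sequences(seqs))
```

Ten 2-way ties already give 2**10 = 1024 sequences, so the count doubles with each additional tied column.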

@arendsee
Collaborator

@j23414 You can check out the latest version (2.13.0) of smof. Now the consensus command prints the consensus (resolving ties alphabetically) by default. Printing a table is optional.

Is good?
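For anyone curious what "resolving ties alphabetically" means in practice, here is a minimal sketch of that rule. It is an assumption about the behavior described above, not smof's actual implementation, which may differ in how it handles gaps, case, or ambiguous characters.

```python
from collections import Counter


def consensus(seqs):
    """Majority character per column; ties go to the alphabetically
    smallest character."""
    out = []
    for col in zip(*seqs):
        counts = Counter(col)
        # Sort key: highest count first, then alphabetical order.
        best = min(counts, key=lambda ch: (-counts[ch], ch))
        out.append(best)
    return "".join(out)


print(">Consensus")
print(consensus(["AAATATAT", "AAAATTAT", "AAATATAA"]))
```

On the alignment from the top of this issue there are no ties, so the sketch prints AAATATAT. For a tied input such as `["AT", "TA"]` it returns `AA`.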

@j23414
Contributor Author

j23414 commented Sep 19, 2018

Good! : ) Consensus and table look good.

smof consensus fasta_aln.fna > consensus.fna
smof consensus -t fasta_aln.fna > consensus.tsv
