* POC
** This is the phi6 genome:
[[file:phi6 RefWT_from Lele.txt]]

** CSV file
[[file:phi6 wt protein start stops.csv]]

This is a CSV file with three columns: protein name, start nucleotide, and end nucleotide.
The start and end numbers are inclusive.  Everything else in the genome that’s not in at least one of those ranges isn’t protein-coding
(some reading frames overlap by one nucleotide).
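
The inclusive-range lookup described above can be sketched as follows.  This is a minimal illustration in Python, not the Perl scripts used
further down; the function names and the sample coordinates are made up, and it assumes 1-based positions and tolerates an optional header row.

#+begin_src python
import csv
import io

def load_ranges(handle):
    """Read (name, start, end) rows; start and end are inclusive."""
    ranges = []
    for row in csv.reader(handle):
        if len(row) < 3:
            continue
        name, start, end = (field.strip() for field in row[:3])
        if not (start.isdigit() and end.isdigit()):
            continue  # assumption: a non-numeric row is a header, skip it
        ranges.append((name, int(start), int(end)))
    return ranges

def proteins_at(ranges, pos):
    """Names of every protein whose inclusive range covers pos."""
    return [name for name, start, end in ranges if start <= pos <= end]

# Made-up coordinates, for illustration only (A and B share position 21).
sample = io.StringIO("name,start,end\nA,10,21\nB,21,32\n")
ranges = load_ranges(sample)
print(proteins_at(ranges, 21))                 # ['A', 'B']
print(proteins_at(ranges, 5) or "noncoding")   # noncoding
#+end_src

Returning a list rather than a single name keeps the one-nucleotide overlaps visible instead of silently picking one protein.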

** Standard genetic code
[[file:Genetic-Code-Amino-Acid-Codon-Chart-sidebyside-03.png]]

The standard genetic code that you’ve used for some of my class projects applies.  We will be using the single capital letter abbreviations for
amino acids, so please use lowercase “a, c, g, t” for nucleotides.  This chart uses the DNA bases (no need to switch “u”
to “t” in your head) and has the single-letter amino acids.  The three stop codons (taa, tag, tga) should all code for the same thing: it could be
“STOP” or it could be an asterisk… you can have some creative control here :-)
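
As a sketch of the convention above (lowercase nucleotides, single capital letters for amino acids, one symbol for all three stop codons), the
standard genetic code can be built from the usual t, c, a, g ordering.  This is an illustrative Python version; the names are hypothetical, the
asterisk for stop is one of the choices the email allows, and mapping unknown codons to “X” is an added assumption.

#+begin_src python
from itertools import product

BASES = "tcag"
# Standard genetic code, codons enumerated ttt, ttc, tta, ttg, tct, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate lowercase DNA codon by codon; '*' marks a stop codon."""
    dna = dna.lower()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    # Assumption: codons with characters other than a, c, g, t become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

print(translate("atgtcttaa"))  # MS*
#+end_src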

** Test
As a test that our coordinates are correct, can you spit out the protein sequence for each of those proteins? Each will start with an M (one starts
with a V; it’s an “alternate start codon”) and should end with a stop.  Please send me that as a text file.
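
The coordinate check described above can be sketched like this.  It is a hedged Python illustration with hypothetical function names: it assumes
1-based inclusive coordinates on the forward strand and that the protein has already been translated with “*” standing for stop.

#+begin_src python
def extract_cds(genome, start, end):
    """Slice a 1-based inclusive range out of a 0-based Python string."""
    return genome[start - 1:end]

def check_protein(name, cds, protein):
    """Return a list of problems; an empty list means the range looks right."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append(f"{name}: length {len(cds)} is not a multiple of 3")
    if protein[:1] not in ("M", "V"):  # V covers the alternate start codon
        problems.append(f"{name}: starts with {protein[:1]!r}, expected M or V")
    if not protein.endswith("*"):
        problems.append(f"{name}: does not end with a stop")
    if "*" in protein[:-1]:
        problems.append(f"{name}: stop codon before the end")
    return problems

# Toy genome, for illustration only: the coding region is positions 3..11.
genome = "ccatgtcttaagg"
print(extract_cds(genome, 3, 11))                       # atgtcttaa
print(check_protein("toy", "atgtcttaa", "MS*"))         # []
#+end_src

An off-by-one in either coordinate usually shows up as a wrong first letter or a missing stop, which is why both ends are checked.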

If that works, I’ll get you sample input and output for what we need the program to actually do.

The input is a nucleotide number and a nucleotide.
Print out the reference sequence nt at that number, the nt number, and the inputted nucleotide, (Tab) the name of the protein involved OR
“noncoding”, (Tab) the amino acid called by the wild type sequence, the number of that amino acid within the protein, and the amino acid
called when the inputted nucleotide is in the sequence.

Something like:
input 7500g

output:
a7500g P7 S34T

-- P7 is the protein name from the first email
-- S is the original aa
-- 34 is the amino acid index inside P7
-- T is the new aa

-- say "noncoding" instead of "P7 S34T" if the protein can't be determined

(Sometimes the variant nucleotide will be in a protein-coding region but won’t change the called amino acid.  This is normal and fine, so we’ll
see, for example, “S34S”.)

Thanks!
SD
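
The input and output format requested in the email can be sketched as below.  This is an illustrative Python version, not the logic of
=varscan2codon.pl=; the function name, the toy genome, and the protein name “P1” are made up.  It assumes 1-based positions, forward-strand
reading frames, tab-separated fields, and one output line per protein when ranges overlap.

#+begin_src python
import re
from itertools import product

# Standard genetic code in t, c, a, g order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("tcag", repeat=3), AMINO)}

def annotate(genome, ranges, variant):
    """Turn input such as '7500g' into lines such as 'a7500g<Tab>P7<Tab>S34T'."""
    match = re.fullmatch(r"(\d+)([acgt])", variant.strip().lower())
    if not match:
        raise ValueError(f"cannot parse variant {variant!r}")
    pos, alt = int(match.group(1)), match.group(2)
    if not 1 <= pos <= len(genome):
        raise ValueError(f"position {pos} is outside the genome")
    label = f"{genome[pos - 1].lower()}{pos}{alt}"
    lines = []
    for name, start, end in ranges:          # start and end are inclusive
        if not start <= pos <= end:
            continue
        offset = pos - start                 # 0-based offset inside the protein's range
        frame = offset % 3                   # position of the variant within its codon
        first = pos - 1 - frame              # 0-based genome index of the codon's first base
        codon = genome[first:first + 3].lower()
        mutated = codon[:frame] + alt + codon[frame + 1:]
        old_aa = CODON_TABLE.get(codon, "X")     # assumption: unknown codons print as X
        new_aa = CODON_TABLE.get(mutated, "X")
        lines.append(f"{label}\t{name}\t{old_aa}{offset // 3 + 1}{new_aa}")
    return lines or [f"{label}\tnoncoding"]

# Toy data, for illustration only: P1 covers positions 3..11 (atg tct taa).
genome = "ccatgtcttaagg"
ranges = [("P1", 3, 11)]
print(annotate(genome, ranges, "6a"))   # ['t6a\tP1\tS2T']
print(annotate(genome, ranges, "8c"))   # ['t8c\tP1\tS2S']
print(annotate(genome, ranges, "1g"))   # ['c1g\tnoncoding']
#+end_src

The second call shows the silent-change case from the email: the codon changes from tct to tcc, but both code for S.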

* Sample runs
Protein runs:
#+begin_src shell
  ./codon2aa.pl 'phi6 RefWT_from Lele.txt'  'phi6 wt protein start stops.csv'
#+end_src


Full conversion:
#+begin_src shell
  ./varscan2codon.pl 'phi6 RefWT_from Lele.txt' 'phi6 wt protein start stops.csv' t10-varscan.csv
#+end_src

Iterate over the input files passed on stdin:
#+name: iter
#+begin_src shell :stdin e8k-redo
  # Read one input path per line from stdin; quoting keeps paths with spaces intact.
  while IFS= read -r i; do
      res=$(basename "$i" .csv).res.csv
      ./varscan2codon.pl 'phi6 RefWT_from Lele.txt' 'phi6 wt protein start stops.csv' "$i" > "$res"
      echo "file:$res"
  done
#+end_src

#+RESULTS: iter
: file:E8K_lowStrin.res.csv

Process every CSV in the todo directory:
#+name: todo-directory
#+begin_src shell
  # Write one .res.csv per input file into to-send/.
  mkdir -p to-send
  for i in todo/*.csv; do
      res=to-send/$(basename "$i" .csv).res.csv
      guix shell --pure -m manifest.scm -- ./varscan2codon.pl 'phi6 RefWT_from Lele.txt' 'phi6 wt protein start stops.csv' "$i" > "$res"
      echo "$res"
  done
#+end_src

#+RESULTS: todo-directory
: to-send/jan4_samplevariation.res.csv