#!/home/erik/bin/python3
#import packages to be used
import cgi, cgitb
import warnings
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.externals import joblib #removed in scikit-learn 0.23; newer installs need the standalone joblib package
import re
warnings.simplefilter("ignore", UserWarning)#ignore a joblib version warning

#----------------------------------------------\
#  Parse the web-form information to variables  \
#                                                \_______________________________________________________
#                                                                                                        |
cgitb.enable(display=1, logdir="/var/www/html/bin/")
form=cgi.FieldStorage()
alignment = form.getvalue('fasta')

#demo sequence used when the form is empty or not in FASTA format
demoSeq = "MPSKKSGPQPHKRWVFTLNNPSEEEKNKIRELPISLFDYFVCGEEGLEEGRTAHLQGFANFAKKQTFNKVKWYFGARCHIEKAKGTDQQNKEYCSKEGHILIECGAPRNQGKRSDLSTAYFDYQQSGPPGMVLLNCCPSCRSSLSEDYYFAILEDCWRTINGGTRRPI"

seqList=[]
lenList=[]
nameList=[]
if alignment and alignment.startswith(">"): #naive check for FASTA format; also guards against a missing field
    records=alignment.split(">") #renamed from "list" to avoid shadowing the built-in
    if records[0] == "":
        records.pop(0)#get rid of the leading empty string
    for a in records:
        tempList=a.split("\r\n")
        if tempList[-1]=="":
            tempList.pop(-1)#get rid of the trailing empty string
        tempSeq=""
        nameList.append(tempList[0])
        for element in tempList[1:]:
            tempSeq+=element
        seqList.append(tempSeq)
        lenList.append(str(len(tempSeq)))
if len(seqList)==0: #empty or non-FASTA input: fall back to the demo sequence
    seqList=[demoSeq]
    nameList=['Demo']
    lenList=[str(len(seqList[0]))] #was len(alignment[0]), which measures a single character
#--------------------------------------------------------------------------------------------------------+

#----------------------------------------------\
#  predict genus of input sequences             \
#                                                \_______________________________________________________
#                                                                                                        |
#list of amino acids as vocabulary for the CountVectorizer
AAs=['a','c','d','e','f','g','h','i','k','l','m','n','p','q','r','s','t','v','w','y']
#load the classifier and scaler
clf=joblib.load("./clf_11_21_2017.pkl")
StSc=joblib.load("./StSc_11_21_2017.pkl")
cv=CountVectorizer(analyzer='char',ngram_range=(1,1),vocabulary=AAs) #initialize text data vectorizer
dataVect=cv.transform(seqList)
#Scale the data to the training set
X=StSc.transform(dataVect.astype("float64"))
#make predictions for the original dataset
predictions=clf.predict(X)

#----------------------------------------------\
#  Build HTML table of results                  \
#                                                \_______________________________________________________
#
#results="Entered Text Content Seq Name is {0} length {1}".format(nameList,predictions)
results=""
results+="""Sequence Name | Length | Prediction |
---|---|---|
{0} | {1} | {2} |
Welcome to CRESSdna.org
Part of the National Science Foundation's Assembling the Tree of Life.
Many animal-infecting CRESS-DNA viruses are classified into the Circoviridae family. There are two genera within the family, the older Circovirus and the more recently codified Cyclovirus, and both are well represented. At least one disease of economic importance is associated with circovirus infections: post-weaning multisystemic wasting syndrome in pigs (caused in part by porcine circovirus 2, which is now largely controlled through vaccination in commercial hog production). Several other veterinary diseases found worldwide are also due to circoviruses, including beak and feather disease and fatal acute diarrhea in dogs.
While some of the environmental isolates assigned to Circoviridae have genomes over 3,000 or even 4,000 bases long, the family also contains some of the smallest genomes among CRESS-DNA viruses: some well-studied circoviruses have genomes about 1,700 nt long, and circularized putative genomes from metagenomics studies can be even smaller. Most analyzed sequences have two ORFs: the replication-associated protein (Rep, also referred to as the replication initiator protein) and the capsid protein (Cp or Cap). Some isolates have had a third ORF experimentally verified, and some sequences have many predicted (hypothetical) ORFs that have not yet been studied in the lab.
Both cycloviruses and circoviruses have non-enveloped, icosahedral virions of 15-25 nm encapsidating their circular ssDNA genomes. Members of Circovirus are found infecting or associated with mammals, birds and fish, while cycloviruses have been found infecting or associated with mammals, birds and insects. Sequences assigned to Circovirus have ambisense genomes with the Rep gene in the sense orientation, whereas sequences in Cyclovirus are typically ambisense in the opposite orientation (Rep gene in antisense).
A great primer on Circoviridae
For more information about Circovirus:
ICTV report on circovirus
ExPASy ViralZone summary of circovirus
Type species: Porcine circovirus 1 (NC_001792.2)
For more information about Cyclovirus:
ICTV report on cyclovirus
ExPASy ViralZone summary of cyclovirus
Type species: Human-associated cyclovirus 8 (KF031466)
The plant-infecting CRESS-DNA viruses with more than two genomic segments belong to the family Nanoviridae, which includes the genera Babuvirus and Nanovirus. One of the most economically important species in the family is Banana bunchy top virus (BBTV), the type species of Babuvirus. BBTV causes banana bunchy top disease, which is common in banana-growing areas such as Southeast Asia, the South Pacific, India and Africa. The virus is transmitted by the banana aphid and causes plant crumpling, shrinking and chlorosis, which may develop into necrosis.
Viruses in the family Nanoviridae have multipartite genomes consisting of 6 to 8 segments of circular ssDNA, each about 1,000 nucleotides long. Five of these DNA components are shared between babuviruses and nanoviruses (DNA-R, -N, -S, -C and -M). Nanoviruses infect dicots, have 8 genomic DNAs and may include three other DNA components with functions that have yet to be determined (DNA-U1, -U2 and -U4). Babuviruses infect monocots, have 6 genomic DNAs and may include another DNA component with an unknown function (DNA-U4). Each of these components encodes a single ORF that is transcribed in one direction, though a second putative ORF has been identified on one segment of BBTV (DNA-R). The virions are non-enveloped, 17-20 nm in diameter and have one CP (coat protein). Additional DNA segments (alphasatellites) are also associated with many viruses in the family and can alter disease symptoms.
For more information about Nanovirus:
ICTV report on nanovirus
ExPASy ViralZone summary of nanovirus
Type species: Subterranean clover stunt virus (NC_003818.1)
For more information about Babuvirus:
ICTV report on babuvirus
ExPASy ViralZone summary of babuvirus
Type species: Banana bunchy top virus (NC_003479.1)
And has been trained on the following Genera:
ProtTest3 with CRESS DNA virus model
ProtTest3 with a CRESS DNA virus based substitution matrix developed by Lele Zhao.
This site is under construction
Please be patient while we tidy up a bit!
Results from Taxonomy prediction
"""
#Page contents, second part (results fit between body1 and body2)
body2="""
This classifier will return the best fit of the submitted sequence to the training data.
Currently included in the training data: