#!/home/erik/bin/python3
#import packages to be used
import cgi, cgitb
import html
import warnings
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
try:
    from sklearn.externals import joblib #bundled copy in older scikit-learn releases
except ImportError:
    import joblib #standalone package, used where sklearn.externals.joblib is unavailable
warnings.simplefilter("ignore", UserWarning) #ignore a joblib version warning

#sequence used when the form is empty or not in FASTA format
DEMO_SEQ = "MPSKKSGPQPHKRWVFTLNNPSEEEKNKIRELPISLFDYFVCGEEGLEEGRTAHLQGFANFAKKQTFNKVKWYFGARCHIEKAKGTDQQNKEYCSKEGHILIECGAPRNQGKRSDLSTAYFDYQQSGPPGMVLLNCCPSCRSSLSEDYYFAILEDCWRTINGGTRRPI"

#----------------------------------------------\
#   Parse the web-form information to variables \
#                                                \_______________________________________________________
#                                                                                                         |
cgitb.enable(display=1, logdir="/var/www/html/bin/")
form=cgi.FieldStorage()
alignment=form.getvalue('fasta') or "" #empty string when the field is missing

seqList=[]
lenList=[]
nameList=[]
if alignment.startswith(">"): #naive check for FASTA format
    for record in alignment.split(">"):
        lines=record.splitlines() #handles both \r\n and \n line endings
        if not lines:
            continue #skip the leading empty string
        tempSeq="".join(lines[1:]) #join the wrapped sequence lines
        nameList.append(lines[0])
        seqList.append(tempSeq)
        lenList.append(str(len(tempSeq)))
if len(seqList)==0: #empty sequence list or non-FASTA input: use the demo sequence
    seqList=[DEMO_SEQ]
    nameList=['Demo']
    lenList=[str(len(seqList[0]))] #length of the demo sequence, not of the raw form text
#--------------------------------------------------------------------------------------------------------+

#----------------------------------------------\
#   predict genus of input sequences            \
#                                                \_______________________________________________________
#                                                                                                         |
#list of amino acids as vocabulary for the CountVectorizer
AAs=['a','c','d','e','f','g','h','i','k','l','m','n','p','q','r','s','t','v','w','y']
#load the classifier and scaler
clf=joblib.load("./clf_11_21_2017.pkl")
StSc=joblib.load("./StSc_11_21_2017.pkl")
cv=CountVectorizer(analyzer='char',ngram_range=(1,1),vocabulary=AAs) #initialize text data vectorizer
dataVect=cv.transform(seqList) #per-sequence amino acid counts
#Scale the data to the training set
X=StSc.transform(dataVect.astype("float64"))
#make predictions for the submitted sequences
predictions=clf.predict(X)

#----------------------------------------------\
#   Build HTML table of results                 \
#                                                \_______________________________________________________
#
#results="Entered Text Content Seq Name is {0} length {1}".format(nameList,predictions)
results="""<table>
<tr><th>Sequence Name</th><th>Length</th><th>Prediction</th></tr>
"""
for k in range(len(nameList)):
    #escape the user-supplied name so it cannot inject markup into the page
    results+="<tr><td>{0}</td><td>{1}</td><td>{2}</td></tr>\n".format(html.escape(nameList[k]),lenList[k],predictions[k])
results+="</table>"

#----------------------------------------------\
#   Build output page                           \
#                                                \_______________________________________________________
#                                                                                                         |
#build output page parts
#Header and CSS Style bits
header="""Content-type:text/html

"""
#Page contents, first part
body1="""

Welcome to CRESSdna.org

Home

Part of the National Science Foundation's Assembling the Tree of Life.

Sponsored by a grant from the National Science Foundation

Circoviridae


Many animal-infecting CRESS-DNA viruses are classified into the Circoviridae family. There are two genera within the family, the older Circovirus and the more recently codified Cyclovirus, and both are well represented. At least one disease of economic importance is associated with circovirus infections: postweaning multisystemic wasting syndrome in pigs (caused in part by porcine circovirus 2, which is now largely controlled through vaccination in commercial hog production). Several other veterinary diseases found worldwide are also due to circoviruses, including beak and feather disease and fatal acute diarrhea in dogs.

Gastrointestinal system of dogs infected with dog circovirus (DogCV) with hemorrhaging in stomach and intestines. CC-BY Li et al. 2013
Immune electron microscopy image of PCV2 (porcine circovirus 2) particles. CC-BY Guo et al. 2011

While some of the environmental isolates assigned to Circoviridae have genomes of more than 3,000 or even 4,000 bases, the family also contains some of the smallest genomes of CRESS-DNA viruses: some well-studied circoviruses have genomes about 1,700 nt long, and circularized putative genomes from metagenomics studies can be even smaller. Most analyzed sequences have two ORFs, encoding the replication-associated protein (Rep, also referred to as the replication initiator protein) and the capsid protein (Cp or Cap). Some isolates have had a third ORF experimentally verified, and some sequences have many predicted hypothetical ORFs that have not yet been studied in the lab.

Both cycloviruses and circoviruses have non-enveloped, icosahedral virions of 15-25 nm encapsidating their circular ssDNA genomes. Members of Circovirus are found infecting or associated with mammals, birds and fish, whereas cycloviruses have been found infecting or associated with mammals, birds and insects. Sequences assigned to Circovirus have ambisense genomes with the Rep gene in the sense orientation; sequences in Cyclovirus are typically ambisense in the opposite orientation (Rep gene in antisense).

A great primer on Circoviridae

For more information about Circovirus:
ICTV report on circovirus
ExPASy ViralZone summary of circovirus Type species: Porcine circovirus 1 (NC_001792.2)

For more information about Cyclovirus:
ICTV report on cyclovirus
ExPASy ViralZone summary of cyclovirus Type species: Human-associated cyclovirus 8 (KF031466)

Nanoviridae

The plant-infecting CRESS-DNA viruses with more than two genomic segments belong to the family Nanoviridae, which includes the genera Babuvirus and Nanovirus. One of the most economically important species in the family is Banana bunchy top virus (BBTV), the type species of Babuvirus. BBTV causes banana bunchy top disease, which is common in banana-growing areas such as Southeast Asia, the South Pacific, India and Africa. The virus is transmitted by the banana aphid and causes plant crumpling, shrinking and chlorosis, which may develop into necrosis.

Banana bunchy top, caused by Banana bunchy top virus (BBTV). CC-BY Scott Nelson 2014.

Viruses in the family Nanoviridae have multipartite genomes consisting of 6 to 8 segments of circular ssDNA, each about 1,000 nucleotides long. Five of these DNA components are shared between babuviruses and nanoviruses (DNA-R, -N, -S, -C and -M). Nanoviruses infect dicots, have 8 genomic DNAs and may include three other DNA components with functions that have yet to be determined (DNA-U1, -U2 and -U4). Babuviruses infect monocots, have 6 genomic DNAs and may include another DNA component with an unknown function (DNA-U4). Each of these components encodes a single ORF that is transcribed in one direction, though a second putative ORF has been identified on one segment of BBTV (DNA-R). The virions are non-enveloped, 17-20 nm in diameter, and have one CP (coat protein). Additional DNA segments (alphasatellites) are also associated with many viruses in the family and can alter disease symptoms.

For more information about Nanovirus:
ICTV report on nanovirus
ExPASy ViralZone summary of nanovirus
Type species: Subterranean clover stunt virus (NC_003818.1)

For more information about Babuvirus:
ICTV report on babuvirus
ExPASy ViralZone summary of babuvirus
Type species: Banana bunchy top virus (NC_003479.1)

Taxonomy



Contact

This site is under construction

Please be patient while we tidy up a bit!

Contributors

This site is under construction

Please be patient while we tidy up a bit!

Results

Results from Taxonomy prediction

""" #Page contents, second part (results fit between body1 and body2) body2="""



This classifier will return the best fit of the submitted sequence to the training data.
Currently included in the training data:

  • Circoviridae
  • Nanoviridae
  • Genomoviridae
  • Geminiviridae
  • Smacovirus


"""
#close the Page
footer="""
"""
#build the output page
page=header+body1+results+body2+footer
#send the output as html
print(page)