blast教程文档格式.docx
《blast教程文档格式.docx》由会员分享,可在线阅读,更多相关《blast教程文档格式.docx(16页珍藏版)》请在冰点文库上搜索。
![blast教程文档格式.docx](https://file1.bingdoc.com/fileroot1/2023-5/5/c967f82f-1df7-448b-9a0e-9a27ab299acb/c967f82f-1df7-448b-9a0e-9a27ab299acb1.gif)
module4
Wewilldiveintoeachdirectory,oneforeachmodule,
intheorderbelow.
Module-1:
FASTA,BLASTandMSA
InthismodulewewanttopracticemanipulatingFASTAfileandindoingso,constructphylogenetictreeforacertaingenefamilyCesA.
$cdmodule1/
Kenttools
isanexcellent
setoftoolsusedbyUCSCgenomebrowserteam(http:
//hgdownload.cse.ucsc.edu/admin/jksrc.zip;
butwe’veinstalledforyou).Nodocumentations,buteasytouse.WewillpracticemanipulatingFASTAfile.
Gotomedicagowebsite(http:
//medicago.jcvi.org/cgi-bin/medicago/download.cgi).Findalinkcalled“Mt3.5v5_GenesCDSSeq_20111014.fa”.RightclickandcopytheURL,backtoyourterminal,type“wget”,andthenCmd+V(Mac)orShift+Insert(Windows/Linux)topastethatlinkin!
$wget
ftp:
//ftp.jcvi.org/pub/data/m_truncatula/Mt3.5/Annotation/Mt3.5v5/Mt3.5v...
ToseethegeneralstatisticsofaFASTAfile:
$faSizeMt3.5v5_GenesCDSSeq_20111014.fa
Thisfilecontains:
#sequences______,minsize______,maxsize_____
Toseethesizeofeachsequence:
$faSizeMt3.5v5_GenesCDSSeq_20111014.fa-detailed-tab
HowaboutfindingthelongestCDS?
Pipetheabovecommandtoasort(youwanttosortthesecondcolumn,numerically,andinreversesothelongestappearsfirst)
$faSizeMt3.5v5_GenesCDSSeq_20111014.fa-detailed-tab|sort-k2,2nr
Whichsequenceisthelongest?
(Hint:
you’llneedtoseethefirstfewlinesoftheoutputfromtheabovecommand)
LongestsequenceID:
________________,it’ssize:
____________bp
Now,youwanttoextractthelongestsequence.
$faOneRecordMt3.5v5_GenesCDSSeq_20111014.fa"
IMGA|contig_49801_3.1"
Youneedtoquotetheidentifierbecausethereisapipe(|)
inthename.
Savetheoutputtoafile(use>
toredirectoutputtoanewfile).RunfaSize-detailed
onthenewfile,isthelengththesameasyoursawbefore?
Findgenefamilymembers
Iwanttostudyallthecellulosesynthase(CesA)genesinMedicago.HowshouldIdoit?
1.ExtractalistofCesAgenesfromarelatedspecies
2.QuerythislistagainsttheMedicagogenesettofishouttheMedicagoCesAgenes
3.Runmultiplesequencealignmentsandbuildaphylogenetictree
Manyofyouarealreadyfamiliarwiththisprocess,butlet’sdoitthroughthecommandline!
ExtractalistofCesAgenesfromA.thaliana
GotoTAIRwebsite(http:
//www.arabidopsis.org).Inthesearchboxontopright,typein“CesA”.Onresultspage,findabutton“getallsequences”.Youwanttodownloadthemasproteinsequences.
Thereturnedpagehasallthesequences.Nowcopythemandthencreateafileontheserver,pastethecontentsin,andcallit“At-CesA.fasta”.Rememberyoucandoitthroughatexteditor:
gedit
ornano.Note:
youdonotneedthetwocommentlinesinthebeginning.YouFASTAfileshouldstartwith‘>
’.
Makesureyouhavealltheproteinsyouwantedtodownload:
$faSizeAt-CesA.fasta
HowmanysequencesinArabidopsis?
______
RunalocalBLAST
Inthiscontext,‘local’meansyouarerunningBLASTonyourownserver,notatNCBIoranyoneelse’sserver.Thisgivesyoutheflexibilityofcomparingyourqueryeitheragainstprecomputeddatabases(likeNR,Swissprot,trEMBL,etc.)oragainstacustomizeddatabase,containingaspecificsetofsequences.
QuerySequenceSet:
At-CesA.fasta
ReferenceDatabase:
Mt3.5v5_GenesCDSSeq_20111014.fa
Rememberthequeryisasetofprotein
sequences,databaseisasetofnucleotide
sequences.WhichBLASTprogramshouldyouusehere?
______________________
(Hint:
chooseamong:
BLASTP,BLASTN,BLASTX,TBLASTN,TBLASTX)
BLASTReference:
http:
//www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml
Generateacustomizedsearchdatabase(MedicagoCDSs)
$makeblastdb-dbtypenucl-inMt3.5v5_GenesCDSSeq_20111014.fa
BLASTtheAtCesA’sagainstMedicagoCDSs
$tblastn-queryAt-CesA.fasta-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-outAt-CesA.MtCDS.tblastn
Waituntilitfinishesthentakealookattheoutput:
$lessAt-CesA.MtCDS.tblastn
BLASTtabularformat(asspecifiedin-outfmt6)hasmultiplecolumns,andistheeasiestoutputformattoworkwith.
Whichcolumnhasthequery?
Whichcolumnhasthesubject(i.e.thesequencesinthedatabasethatmatchyourquery,whichyouwanttoextractoutlater)?
Thesubjectcolumnis______
Cutoutthecolumn(-f2):
$cut-f2At-CesA.MtCDS.tblastn|sort-u>
hits.ids
Howmanyuniquehitsdidweget?
Thisequalstothenumberoflinesinthefile:
$wc-lhits.ids
Numberofuniquehits:
_________
Itappearswearegettinglotsofhits.WemayneedtogobacktotheBLASTcommandtoaddanE-valuecutofftobemorestringent.HowtoaddE-valuecutoff?
Let’slookatthehelpfortblastn
$tblastn-help
NowrunBLASTwithmorestringentsettings(E-valuecutoff:
1e-20):
$tblastn-queryAt-CesA.fasta-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-outAt-CesA.MtCDS.tblastn-evalue1e-20
NumberofuniquehitswithE-value1e-20:
Extractsequencesfrommedicago
ThecommandfaSomeRecords
canbeusedtoextractmultiplesequences,nowlet’sgetalltheBLASThitstoretrieveagenefamily.
$faSomeRecordsMt3.5v5_GenesCDSSeq_20111014.fahits.idshits.fasta
Verifythatyouaregettingallthemedicagohitsthatyouwant:
-countthelinesinhits.ids(hint:
$wc-lhits.ids)_________
-countthenumberofrecordsinhits.fasta(hint:
$faSizehits.fasta)________
Thetwonumbersshouldagree.Whenworkinginbioinformatics,alwayscheck,check,andchecktomakesurenumbersaddup.
RunMUSCLEtogetamultiplesequencealignment
$muscle-maxiters1-inhits.fasta-clw-outhits.aln
Notewejustwantafastalignmenthere(-maxiter1).Takealookattheoutputfilehits.aln.
HowtomakemyslowBLASTjobrunfaster?
!
Welcometotheworldofparallelcomputing!
TheBLASTjobiswhatwecall“embarrassinglyparallel”,meaningthatit’srelativelyeasytodo-byhavingeachCPUworkonapartoftheinputfile!
Todothis,let’ssplitthequeryFASTAfirstintofourfiles.LearnhowtoconstructthecommandbyreadingthefaSplit
manual,simplytypefaSplit
withnoarguments.
$faSplitsequenceAt-CesA.fasta4split
Wewillthenuseashelltoolcalled
GNUParallel,thatallowsuserstoeasilylaunchmultiplejobsbyusingatemplatecontainingplaceholders,whichcaneasilybereplacedwithcontentfromaninputsource.
Firstlet’sidentifytheinputsource.Inourcase,itisgoingtobethe4FASTAfilesyoujustcreatedusingfaSplit
$lssplit*.fa*
Whatdoyousee?
Wewillnowtakethetblastn
commandfromabove,insertsomeplaceholdersandpasstheinputsourcetoit(usingthepipetoredirecttheinput).
$lssplit*.fa*|parallel'
tblastn-query{}-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-out{.}.MtCDS.tblastn-evalue1e-20'
Meaningoftheplaceholdersused:
{}:
Replacedbyeachlinefromtheinputsource
{.}:
Replacedbyeachlinefromtheinputsource,withtheextensionremoved
Withthismechanism,eachsplittedFASTAfilewillgenerateatblastncommand,makingatotalof4commandstorunsimultaneously,thereforegetjobsdonefaster.Don’tforgettoconcatenatetheoutputstogether.
$catsplit*.MtCDS.tblastn>
all.MtCDS.tblastn
Refertothe
manpage
forGNUparallel
forfurtherdetailedinformation,ifyouhavetime.
Module-2:
FASTQ,FASTQCandTRIMMOMATIC
$cd../module2/
PE-350.1.fastq
PE-350.2.fastq
adapters.fasta
UnderstandFASTQformat
AlsothefilePE-350.1.fastq
andPE-350.2.fastq
arepairedfiles,withrecordfromeachfilesequentiallymatcheachother,i.e.pairofreads,sothisiscalleda“paired-end”library.FASTQfilesaretextfiles:
$lessPE-350.1.fastq
Whatinformationisincluded?
Whatdoeseachlinemean?
FourconsecutivelinesdefineONEreadinthefastqfile
EachbasehasaqualityvaluecalledPHREDscore,typicallybetween0-40forIllumina.Phredscore:
10-10%error
20-1%error
30-0.1%error
40-0.01%error
Well,showingnumbersinfilesisnotreallycompact,soFASTQfileconvertsthenumericvaluetoacharacter.Therearealsotwosystemsofsuchconversion,PHRED+33andPHRED+64.PHRED+33isusedinallnewIlluminaprotocols.
CanyouguessifyourFASTQfileisinPHRED+33orPHRED+64encoding?
Quicktip:
Lookforstretchof#######orBBBBBBBthattrailsanyread,whichmeansit’sreallyBADquality.Ifyousee#######,thenthat’sPhred+33,andBBBBBBBBthat’sPhred+64(Why?
)
GobacktoourFASTQfiles,
findoutanswerstothesequestions:
LibraryPE-350containsatotalof______reads.(hint:
count#oflineswith`wc-lPE-350.1.fastq`,thendivideitby4).
Thebasequalityinmydataisencodedin________(Phred+33orPhred+64?
).
Howgoodisyoursequencingrun?
UseFASTQC
WearetoolazytoinstallFASTQCforyou!
J
Seeifyoucangetittowork-GoogleFASTQCandfindthe“Download”link.Findthelinkthatsayssomethinglike“FastQC(Linuxzipfile)”.Nowdownloadit.
//www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.10.1.zip
$unzipfastqc_v0.10.1.zip
$cdFastQC
$