blast教程文档格式.docx

上传人:b****2 文档编号:5799063 上传时间:2023-05-05 格式:DOCX 页数:16 大小:21.10KB
下载 相关 举报
blast教程文档格式.docx_第1页
第1页 / 共16页
blast教程文档格式.docx_第2页
第2页 / 共16页
blast教程文档格式.docx_第3页
第3页 / 共16页
blast教程文档格式.docx_第4页
第4页 / 共16页
blast教程文档格式.docx_第5页
第5页 / 共16页
blast教程文档格式.docx_第6页
第6页 / 共16页
blast教程文档格式.docx_第7页
第7页 / 共16页
blast教程文档格式.docx_第8页
第8页 / 共16页
blast教程文档格式.docx_第9页
第9页 / 共16页
blast教程文档格式.docx_第10页
第10页 / 共16页
blast教程文档格式.docx_第11页
第11页 / 共16页
blast教程文档格式.docx_第12页
第12页 / 共16页
blast教程文档格式.docx_第13页
第13页 / 共16页
blast教程文档格式.docx_第14页
第14页 / 共16页
blast教程文档格式.docx_第15页
第15页 / 共16页
blast教程文档格式.docx_第16页
第16页 / 共16页
亲,该文档总共16页,全部预览完了,如果喜欢就下载吧!
下载资源
资源描述

blast教程文档格式.docx

《blast教程文档格式.docx》由会员分享,可在线阅读,更多相关《blast教程文档格式.docx(16页珍藏版)》请在冰点文库上搜索。

blast教程文档格式.docx

module4 

Wewilldiveintoeachdirectory,oneforeachmodule, 

intheorderbelow.

Module-1:

FASTA,BLASTandMSA

InthismodulewewanttopracticemanipulatingFASTAfileandindoingso,constructphylogenetictreeforacertaingenefamilyCesA.

$cdmodule1/

Kenttools 

isanexcellent 

setoftoolsusedbyUCSCgenomebrowserteam(http:

//hgdownload.cse.ucsc.edu/admin/jksrc.zip;

butwe’veinstalledforyou).Nodocumentations,buteasytouse.WewillpracticemanipulatingFASTAfile.

Gotomedicagowebsite(http:

//medicago.jcvi.org/cgi-bin/medicago/download.cgi).Findalinkcalled“Mt3.5v5_GenesCDSSeq_20111014.fa”.RightclickandcopytheURL,backtoyourterminal,type“wget”,andthenCmd+V(Mac)orShift+Insert(Windows/Linux)topastethatlinkin!

$wget 

ftp:

//ftp.jcvi.org/pub/data/m_truncatula/Mt3.5/Annotation/Mt3.5v5/Mt3.5v...

ToseethegeneralstatisticsofaFASTAfile:

$faSizeMt3.5v5_GenesCDSSeq_20111014.fa

Thisfilecontains:

#sequences______,minsize______,maxsize_____

Toseethesizeofeachsequence:

$faSizeMt3.5v5_GenesCDSSeq_20111014.fa-detailed-tab

HowaboutfindingthelongestCDS?

Pipetheabovecommandtoasort(youwanttosortthesecondcolumn,numerically,andinreversesothelongestappearsfirst)

$faSizeMt3.5v5_GenesCDSSeq_20111014.fa-detailed-tab|sort-k2,2nr

Whichsequenceisthelongest?

(Hint:

you’llneedtoseethefirstfewlinesoftheoutputfromtheabovecommand)

LongestsequenceID:

________________,it’ssize:

____________bp

Now,youwanttoextractthelongestsequence.

$faOneRecordMt3.5v5_GenesCDSSeq_20111014.fa"

IMGA|contig_49801_3.1"

Youneedtoquotetheidentifierbecausethereisapipe(|) 

inthename.

Savetheoutputtoafile(use>

toredirectoutputtoanewfile).RunfaSize-detailed 

onthenewfile,isthelengththesameasyoursawbefore?

Findgenefamilymembers

Iwanttostudyallthecellulosesynthase(CesA)genesinMedicago.HowshouldIdoit?

1.ExtractalistofCesAgenesfromarelatedspecies

2.QuerythislistagainsttheMedicagogenesettofishouttheMedicagoCesAgenes

3.Runmultiplesequencealignmentsandbuildaphylogenetictree

Manyofyouarealreadyfamiliarwiththisprocess,butlet’sdoitthroughthecommandline!

ExtractalistofCesAgenesfromA.thaliana

GotoTAIRwebsite(http:

//www.arabidopsis.org).Inthesearchboxontopright,typein“CesA”.Onresultspage,findabutton“getallsequences”.Youwanttodownloadthemasproteinsequences.

Thereturnedpagehasallthesequences.Nowcopythemandthencreateafileontheserver,pastethecontentsin,andcallit“At-CesA.fasta”.Rememberyoucandoitthroughatexteditor:

gedit 

ornano.Note:

youdonotneedthetwocommentlinesinthebeginning.YouFASTAfileshouldstartwith‘>

’.

Makesureyouhavealltheproteinsyouwantedtodownload:

$faSizeAt-CesA.fasta

HowmanysequencesinArabidopsis?

______

RunalocalBLAST

Inthiscontext,‘local’meansyouarerunningBLASTonyourownserver,notatNCBIoranyoneelse’sserver.Thisgivesyoutheflexibilityofcomparingyourqueryeitheragainstprecomputeddatabases(likeNR,Swissprot,trEMBL,etc.)oragainstacustomizeddatabase,containingaspecificsetofsequences.

QuerySequenceSet:

At-CesA.fasta

ReferenceDatabase:

Mt3.5v5_GenesCDSSeq_20111014.fa

Rememberthequeryisasetofprotein 

sequences,databaseisasetofnucleotide 

sequences.WhichBLASTprogramshouldyouusehere?

______________________

(Hint:

chooseamong:

BLASTP,BLASTN,BLASTX,TBLASTN,TBLASTX)

BLASTReference:

http:

//www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml

Generateacustomizedsearchdatabase(MedicagoCDSs)

$makeblastdb-dbtypenucl-inMt3.5v5_GenesCDSSeq_20111014.fa

BLASTtheAtCesA’sagainstMedicagoCDSs

$tblastn-queryAt-CesA.fasta-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-outAt-CesA.MtCDS.tblastn

Waituntilitfinishesthentakealookattheoutput:

$lessAt-CesA.MtCDS.tblastn

BLASTtabularformat(asspecifiedin-outfmt6)hasmultiplecolumns,andistheeasiestoutputformattoworkwith. 

Whichcolumnhasthequery?

Whichcolumnhasthesubject(i.e.thesequencesinthedatabasethatmatchyourquery,whichyouwanttoextractoutlater)?

Thesubjectcolumnis______

Cutoutthecolumn(-f2):

$cut-f2At-CesA.MtCDS.tblastn|sort-u>

hits.ids

Howmanyuniquehitsdidweget?

Thisequalstothenumberoflinesinthefile:

$wc-lhits.ids

Numberofuniquehits:

_________

Itappearswearegettinglotsofhits.WemayneedtogobacktotheBLASTcommandtoaddanE-valuecutofftobemorestringent.HowtoaddE-valuecutoff?

Let’slookatthehelpfortblastn

$tblastn-help

NowrunBLASTwithmorestringentsettings(E-valuecutoff:

1e-20):

$tblastn-queryAt-CesA.fasta-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-outAt-CesA.MtCDS.tblastn-evalue1e-20

NumberofuniquehitswithE-value1e-20:

Extractsequencesfrommedicago

ThecommandfaSomeRecords 

canbeusedtoextractmultiplesequences,nowlet’sgetalltheBLASThitstoretrieveagenefamily.

$faSomeRecordsMt3.5v5_GenesCDSSeq_20111014.fahits.idshits.fasta

Verifythatyouaregettingallthemedicagohitsthatyouwant:

-countthelinesinhits.ids(hint:

$wc-lhits.ids)_________

-countthenumberofrecordsinhits.fasta(hint:

$faSizehits.fasta)________

Thetwonumbersshouldagree.Whenworkinginbioinformatics,alwayscheck,check,andchecktomakesurenumbersaddup.

RunMUSCLEtogetamultiplesequencealignment

$muscle-maxiters1-inhits.fasta-clw-outhits.aln

Notewejustwantafastalignmenthere(-maxiter1).Takealookattheoutputfilehits.aln.

HowtomakemyslowBLASTjobrunfaster?

!

Welcometotheworldofparallelcomputing!

TheBLASTjobiswhatwecall“embarrassinglyparallel”,meaningthatit’srelativelyeasytodo-byhavingeachCPUworkonapartoftheinputfile!

Todothis,let’ssplitthequeryFASTAfirstintofourfiles.LearnhowtoconstructthecommandbyreadingthefaSplit 

manual,simplytypefaSplit 

withnoarguments.

$faSplitsequenceAt-CesA.fasta4split

Wewillthenuseashelltoolcalled 

GNUParallel,thatallowsuserstoeasilylaunchmultiplejobsbyusingatemplatecontainingplaceholders,whichcaneasilybereplacedwithcontentfromaninputsource.

Firstlet’sidentifytheinputsource.Inourcase,itisgoingtobethe4FASTAfilesyoujustcreatedusingfaSplit

$lssplit*.fa*

Whatdoyousee?

Wewillnowtakethetblastn 

commandfromabove,insertsomeplaceholdersandpasstheinputsourcetoit(usingthepipetoredirecttheinput).

$lssplit*.fa*|parallel'

tblastn-query{}-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-out{.}.MtCDS.tblastn-evalue1e-20'

Meaningoftheplaceholdersused:

{}:

Replacedbyeachlinefromtheinputsource

{.}:

 

Replacedbyeachlinefromtheinputsource,withtheextensionremoved

Withthismechanism,eachsplittedFASTAfilewillgenerateatblastncommand,makingatotalof4commandstorunsimultaneously,thereforegetjobsdonefaster.Don’tforgettoconcatenatetheoutputstogether.

$catsplit*.MtCDS.tblastn>

all.MtCDS.tblastn

Refertothe 

manpage 

forGNUparallel 

forfurtherdetailedinformation,ifyouhavetime.

Module-2:

FASTQ,FASTQCandTRIMMOMATIC

$cd../module2/

PE-350.1.fastq 

PE-350.2.fastq 

adapters.fasta

UnderstandFASTQformat

AlsothefilePE-350.1.fastq 

andPE-350.2.fastq 

arepairedfiles,withrecordfromeachfilesequentiallymatcheachother,i.e.pairofreads,sothisiscalleda“paired-end”library.FASTQfilesaretextfiles:

$lessPE-350.1.fastq

Whatinformationisincluded?

Whatdoeseachlinemean?

FourconsecutivelinesdefineONEreadinthefastqfile

EachbasehasaqualityvaluecalledPHREDscore,typicallybetween0-40forIllumina.Phredscore:

10-10%error

20-1%error

30-0.1%error

40-0.01%error

Well,showingnumbersinfilesisnotreallycompact,soFASTQfileconvertsthenumericvaluetoacharacter.Therearealsotwosystemsofsuchconversion,PHRED+33andPHRED+64.PHRED+33isusedinallnewIlluminaprotocols.

CanyouguessifyourFASTQfileisinPHRED+33orPHRED+64encoding?

Quicktip:

Lookforstretchof#######orBBBBBBBthattrailsanyread,whichmeansit’sreallyBADquality.Ifyousee#######,thenthat’sPhred+33,andBBBBBBBBthat’sPhred+64(Why?

GobacktoourFASTQfiles, 

findoutanswerstothesequestions:

LibraryPE-350containsatotalof______reads.(hint:

count#oflineswith`wc-lPE-350.1.fastq`,thendivideitby4).

Thebasequalityinmydataisencodedin________(Phred+33orPhred+64?

).

Howgoodisyoursequencingrun?

UseFASTQC

WearetoolazytoinstallFASTQCforyou!

J

Seeifyoucangetittowork-GoogleFASTQCandfindthe“Download”link.Findthelinkthatsayssomethinglike“FastQC(Linuxzipfile)”.Nowdownloadit.

//www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.10.1.zip

$unzipfastqc_v0.10.1.zip

$cdFastQC

$

展开阅读全文
相关资源
猜你喜欢
相关搜索
资源标签

当前位置:首页 > 解决方案 > 学习计划

copyright@ 2008-2023 冰点文库 网站版权所有

经营许可证编号:鄂ICP备19020893号-2