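I. Introduction

The Cufflinks suite contains several main programs: cufflinks, cuffmerge, cuffcompare, and cuffdiff. It is used mainly to quantify gene expression and to find differentially expressed genes.

II. Installation

See the Cufflinks download page.

1. Cufflinks requires the Boost C++ libraries. Download Boost and install it; by default it installs under /usr/local.

    $ tar jxvf boost_1_53_0.tar.bz2
    $ cd boost_1_53_0
    $ ./bootstrap.sh
    $ sudo ./b2 install

2. Install SAM tools. Download SAM tools, then:

    $ tar jxvf samtools-0.1.18.tar.bz2
    $ cd samtools-0.1.18
    $ make
    $ sudo su
    # mkdir /usr/local/include/bam
    # cp libbam.a /usr/local/lib
    # cp *.h /usr/local/include/bam/
    # cp samtools /usr/bin/

3. Install the Eigen libraries. Download Eigen, then:

    $ tar jxvf 3.1.2.tar.bz2
    $ cd eigen-eigen-5097c01bcdc4
    $ sudo cp -r Eigen/ /usr/local/include/

4. Install Cufflinks.

    $ tar zxvf cufflinks-2.0.2.tar.gz
    $ cd cufflinks-2.0.2
    $ ./configure --prefix=/path/to/cufflinks/install --with-boost=/usr/local/ --with-eigen=/usr/local/include/Eigen/
    $ make
    $ make install

5. Alternatively, download the Linux x86_64 binary. None of the steps above are needed; the unpacked programs run as they are. (Recommended.)

III. Using Cufflinks

1. Overview

Cufflinks takes the alignments produced by TopHat and, with or without a reference GTF annotation, computes FPKM values for the isoforms of each gene and writes the assembled transcriptome to transcripts.gtf.

Notes:

1. Fragment length estimation: for paired-end data, Cufflinks estimates the fragment length with its own algorithm. For single-end data it assumes a Gaussian distribution by default, or you can supply the parameters yourself.
2. Multi-mapped reads: a read mapping to 10 positions will count as 10% of a read at each position.
3. Assembling bacterial transcriptomes with cufflinks is generally not recommended; Glimmer is suggested instead. If an annotation file exists, however, cufflinks and cuffdiff can still be used to measure gene expression and differential expression.
4. cufflinks/cuffdiff cannot compute FPKM for individual exons or splicing events.
5. For time-series data, run cuffdiff with -T (--time-series).
6. cufflinks or cuffdiff may reach 99% and then appear to stall. Loci with a very large number of aligned reads need much more CPU time, and they are usually processed only after all other loci are finished. Use -M with a suitable annotation file to mask rRNA or mitochondrial RNA.
7. If cufflinks or cuffdiff crashes with a 'bad_alloc' error, or runs for a very long time before finishing, the machine ran out of memory while assembling or quantifying a highly expressed gene. Fix: lower --max-bundle-frags. Try 500000 first and keep lowering it if the error persists.
8. If every gene and transcript in the cuffdiff output has FPKM = 0, the chromosome names in the GTF do not match those in the BAM file.
9. A limitation of cuffdiff and cufflinks: some reported genes and transcripts are spurious (causes: sequencing depth, sequencing quality, number of replicates, and annotation errors).
10. A large fold change does not by itself mean the difference is significant (the gene may have many isoforms, or few reads and low overall expression). Genes that cuffdiff reports with large fold changes still carry uncertainty.
11. In the transcripts.gtf produced by cufflinks, transcripts labeled with the CUFF prefix are novel transcripts. The same holds for CUFF-labeled entries in the output of the other programs.
12. If you see the following error:

    You are using Cufflinks v2.2.1, which is the most recent release.
    open: No such file or directory
    File 30 doesn't appear to be a valid BAM file, trying SAM...
    Error: cannot open alignment file 30 for reading

    one of your options is mistyped, for example "--min-intron-length" written as "-min-intron-length".

2. Usage

    $ cufflinks [options]*

A common example:

    $ cufflinks -p 8 -G transcript.gtf --library-type fr-unstranded -o cufflinks_output tophat_out/accepted_hits.bam

3. General options

-h | --help
-o | --output-dir (default: ./)
    Name of the output directory.
-p | --num-threads (default: 1)
    Number of CPU threads to use.
-G | --GTF
    Supply a GFF file and estimate isoform expression against it. No novel transcripts are assembled, and alignments incompatible with the reference transcripts are ignored.
-g | --GTF-guide
    Supply a GFF file to guide the assembly (RABT assembly). The output contains the reference transcripts as well as novel genes and isoforms.
-M | --mask-file
    Supply a GFF file. Cufflinks ignores reads that align to the transcripts in this file. It usually holds rRNA annotations, and may also include mitochondrial or any other transcripts you want ignored. Removing these RNAs improves the estimate of mRNA expression.
-b | --frag-bias-correct
    Supply a FASTA file so that Cufflinks runs its bias detection and correction algorithm. This can markedly improve the accuracy of transcript abundance estimates.
-u | --multi-read-correct
    Have Cufflinks run an initial estimation step to weight reads that map to multiple genome locations more accurately.
--library-type (default: fr-unstranded)
    Library type of the reads. For strand-specific reads the alignments carry an XS tag. Typical Illumina data uses fr-unstranded.
--library-norm-method
    See the official site for details. Three methods are available:
    classic-fpkm: the default.
    geometric: the DESeq-style method.
    quartile: uses the upper quartile (75%) of the fragment and total mapped counts.

4. Abundance estimation options

-m | --frag-len-mean (default: 200)
    Mean fragment length. Cufflinks now learns the fragment length mean itself, so setting this by hand is not recommended.
-s | --frag-len-std-dev (default: 80)
    Standard deviation of the fragment length. Cufflinks now learns this value itself, so setting it by hand is not recommended.
-N | --upper-quartile-norm
    Normalize by the upper quartile (75th percentile) of the number of fragments mapping to individual loci instead of the total. This helps when looking for differential expression among low-abundance genes and transcripts.
--total-hits-norm (default: TRUE)
    Count all fragments and aligned reads when computing FPKM. Mutually exclusive with the next option. Active by default.
--compatible-hits-norm
    Count only fragments and aligned reads compatible with the reference transcripts when computing FPKM. Inactive by default. Only valid together with --GTF; it has no effect in RABT or ab initio assembly.
--max-mle-iterations
    Number of iterations allowed for maximum-likelihood estimation. Default: 5000.
--max-bundle-frags
    Maximum number of fragments a locus may have before it is skipped. Default: 1000000.
--no-effective-length-correction
    Cufflinks will not employ its "effective" length normalization to transcript FPKM.
--no-length-correction
    Cufflinks will not normalize fragment counts by transcript length at all. Use this when fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Use with caution.

5. Common assembly options

-L | --label (default: CUFF)
    Cufflinks reports transfrags in GTF format; this option sets the prefix used for them.
-F/--min-isoform-fraction <0.0-1.0>
    After estimating the isoform abundances of a gene, Cufflinks filters out transcripts of very low abundance, because they cannot be trusted. Exons with very low read support can be filtered the same way. The default is 0.1, or 10% of the most abundant isoform (the major isoform) of the gene.
-j/--pre-mrna-fraction <0.0-1.0>
    The minimum depth of coverage in the intronic region covered by the alignment is divided by the number of spliced reads, and if the result is lower than this parameter value, the intronic alignments are ignored. The default is 15%.
-I/--max-intron-length
    Maximum intron length. Cufflinks will not report transcripts with introns longer than this, and will ignore SAM alignments with REF_SKIP CIGAR operations longer than this.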
    The default is 300,000.
-a/--junc-alpha <0.0-1.0>
    The alpha value for the binomial test used to filter false-positive spliced alignments. Default: 0.001.
-A/--small-anchor-fraction <0.0-1.0>
    Spliced reads with less than this fraction of their length on one side of a junction are considered suspicious and may be filtered out before assembly. Default: 0.09.
--min-frags-per-transfrag (default: 10)
    Assembled transfrags supported by fewer RNA-seq fragments than this are not reported.
--overhang-tolerance
    The number of bp a read or transcript may overhang when deciding whether it is compatible with another transcript. The default is 8 bp, matching the bowtie/TopHat defaults.
--max-bundle-length
    Maximum genomic length allowed for a given bundle. The default is 3,500,000 bp.
--min-intron-length (default: 50)
    Minimum intron size.
--trim-3-avgcov-thresh
    Minimum average coverage required to attempt 3' trimming. The default is 10.
--trim-3-dropoff-frac
    The fraction of average coverage below which to trim the 3' end of an assembled transcript. The default is 0.1.
--max-multiread-fraction <0.0-1.0>
    The fraction of a transfrag's supporting reads that may be multiply mapped to the genome. A transcript composed of more than this fraction will not be reported by the assembler. Default: 0.75 (75% multireads or more is suppressed).
--overlap-radius (default: 50)
    Transfrags separated by less than this distance are merged.

Advanced Reference Annotation Based Transcript (RABT) Assembly Options. These apply when you use -g/--GTF-guide.

--3-overhang-tolerance
    The number of bp allowed to overhang the 3' end of a reference transcript when determining if an assembled transcript should be merged with it (i.e., the assembled transcript is not novel). The default is 600 bp.
--intron-overhang-tolerance
    The number of bp allowed to enter the intron of a reference transcript when determining if an assembled transcript should be merged with it (i.e., the assembled transcript is not novel). The default is 50 bp.
--no-faux-reads
    This option disables tiling of the reference transcripts with faux reads.
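    Use this if you only want to use sequencing reads in assembly but do not want to output assembled transcripts that lie within reference transcripts. All reference transcripts in the input annotation will also be included in the output.

Other options (minor):

-v/--verbose
    Print verbose status information (version information and so on).
-q/--quiet
    Print nothing except warnings and errors.
--no-update-check
    Turn off Cufflinks' automatic update check.

6. Cufflinks output

The input to cufflinks is a SAM or BAM file, and it must be sorted. (The SAM file supplied to Cufflinks must be sorted by reference position.) The SAM/BAM output of TopHat is already sorted. Other unsorted SAM files can be sorted like this:

    sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted

1. transcripts.gtf

This file contains the isoforms assembled by Cufflinks. The first 7 columns are standard GTF, and the last column holds the attributes. The columns mean:

| Column | Name | Example | Description |
|---|---|---|---|
| 1 | seqname | chrX | Chromosome or contig name |
| 2 | source | Cufflinks | Name of the program that generated the file |
| 3 | feature | exon | Record type, normally transcript or exon |
| 4 | start | 1 | Start position, 1-based |
| 5 | end | 1000 | End position |
| 6 | score | 1000 | |
| 7 | strand | + | Cufflinks' guess of which strand of the reference the isoform came from: '+', '-' or '.' |
| 8 | frame | . | Cufflinks does not predict the position of start or stop codon frames |
| 9 | attributes | ... | See below |

Each GTF record carries the following attributes:

| Attribute | Example | Description |
|---|---|---|
| gene_id | CUFF.1 | Cufflinks gene id |
| transcript_id | CUFF.1.1 | Cufflinks transcript id |
| FPKM | 101.267 | Isoform-level abundance in Fragments Per Kilobase of exon model per Million mapped fragments |
| frac | 0.7647 | Reserved; ignore it, it may be removed in the future |
| conf_lo | 0.07 | Lower bound of the 95% confidence interval of the isoform abundance: lower bound = FPKM * (1.0 - conf_lo) |
| conf_hi | 0.1102 | Upper bound of the 95% confidence interval of the isoform abundance: upper bound = FPKM * (1.0 + conf_hi) |
| cov | 100.765 | Estimated read coverage across the whole transcript |
| full_read_support | yes | With RABT assembly, reports whether all introns and exons are fully covered by reads |

2. isoforms.fpkm_tracking

FPKM results for isoforms (the alternative transcripts of each gene).

3. genes.fpkm_tracking

FPKM results for genes.

IV. Using Cuffmerge

1. Overview

Cuffmerge merges the transcripts.gtf files produced by separate Cufflinks runs into one more complete annotation file, merged.gtf, which is then convenient for differential expression analysis with Cuffdiff.

2. Usage

    $ cuffmerge [options]*

The input is a text file listing the paths of the GTF files. A common example:

    $ cuffmerge -o ./merged_asm -p 8 assembly_list.txt

3. Options

-h | --help
-o (default: ./merged_asm)
    Write the results to this directory.
-g | --ref-gtf
    Merge this reference GTF into the final result as well.
-p | --num-threads (default: 1)
    Number of CPU threads to use.
-s | --ref-sequence
    Points to the genomic DNA sequence. If it is a directory, each contig must be its own FASTA file; if it is a single FASTA file, all contigs must be in it. Cuffmerge uses the reference sequence to help classify transfrags and to exclude repeats. For example, transcripts consisting largely of lower-case bases are classified as repeats.

4. Cuffmerge output

By default the result is written to merged.gtf in the output directory.

V. Using Cuffcompare

1. Overview

Cuffcompare compares the GTF files produced by Cufflinks with each other and with a reference GTF, for example to find novel transcripts.

2. Usage

    $ cuffcompare [options]* [cuff2.gtf] ... [cuffN.gtf]

Example:

    $ cuffcompare -o cuffcmp cuff1.gtf cuff2.gtf

3. Options

-h
-V
    Verbose mode; show progress.
-C
    On by default; "contained" transcripts are also written to .combined.gtf.
-o (default: cuffcmp)
    Prefix of the output files.
-r
    A reference GFF file, used to assess the accuracy of the gene models in the input GTF files. The isoforms of every input GTF are compared with this reference and tagged as overlapping, matching, or novel.
-R
    With -r, ignore reference transcripts that are not overlapped by any transcript in the input GTF files.
-s
    Points to the genomic DNA sequence. If it is a directory, each contig must be its own FASTA file; if it is a single FASTA file, all contigs must be in it. Lower-case bases are used to classify the corresponding transcripts as repeats.

4. Output

Three files are written to the current directory (if -o is not set, the file prefix defaults to cuffcmp):

.stats
    Accuracy statistics from the comparison with the reference annotation. The Sn and Sp columns show sensitivity and specificity values. The fSn and fSp columns show "fuzzy" variants of these same accuracy calculations, which allow some variation.
.combined.gtf
    Reports all transfrags of every sample. A transfrag present in several samples is reported only once.
.tracking
    This file matches transcripts up between samples. Each row contains a transcript structure that is present in one or more input GTF files. Because the transcripts will generally have different IDs (unless you assembled your RNA-Seq reads against a reference transcriptome), cuffcompare examines the structure of each of the transcripts, matching transcripts that agree on the coordinates and order of all of their introns, as well as strand.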
    Matching transcripts are allowed to differ on the length of the first and last exons, since these lengths will naturally vary from sample to sample due to the random nature of sequencing.

Example:

    TCONS_00000045 XLOC_000023 Tcea|uc007afj.1 j q1:exp.115|exp.115.0|100|3.061355|0.350242|0.350207 q2:60hr.292|60hr.292.0|100|4.094084|0.000000|0.000000

In this example, a transcript present in the two input files, called exp.115.0 in the first and 60hr.292.0 in the second, doesn't match any reference transcript exactly, but shares exons with uc007afj.1, an isoform of the gene Tcea, as indicated by the class code j. The leading columns are as follows:

| Column | Name | Example | Description |
|---|---|---|---|
| 1 | Cufflinks transfrag id | TCONS_00000045 | Internal transfrag id |
| 2 | Cufflinks locus id | XLOC_000023 | Internal locus id |
| 3 | Reference gene id | Tcea | Gene id in the reference annotation, or "-" if no reference transcript matched |
| 4 | Reference transcript id | uc007afj.1 | Transcript id in the reference annotation, or "-" if no reference transcript matched |
| 5 | Class code | j | Type of match between the transcript and the reference transcript |

Each column after the fifth has the following form:

    qJ: | | | | | | |

The .refmap and .tmap files are written to the same directory as the input GTF.

The .refmap file contains:

| Column | Name | Description |
|---|---|---|
| 1 | Reference gene name | Gene name in the reference GTF annotation |
| 2 | Reference transcript id | Reference transcript id |
| 3 | Class code | How the Cufflinks transcript matches the reference transcript: c means partial match, = means full match |
| 4 | Cufflinks matches | Ids of the Cufflinks transcripts that match the reference transcript |

The .tmap file contains:

| Column | Name | Description |
|---|---|---|
| 1 | Reference gene name | Gene name in the reference GTF annotation |
| 2 | Reference transcript id | Reference transcript id |
| 3 | Class code | How the Cufflinks transcript matches the reference transcript: c means partial match, = means full match |
| 4 | Cufflinks gene id | |
| 5 | Cufflinks transcript id | |
| 6 | Fraction of major isoform (FMI) | |
| 7 | FPKM | |
| 8 | FPKM_conf_lo | |
| 9 | FPKM_conf_hi | |
| 10 | Coverage | |
| 11 | Length | |
| 12 | Major isoform ID | |

Class codes:

| Priority | Code | Description |
|---|---|---|
| 1 | = | Complete match of intron chain |
| 2 | c | Contained |
| 3 | j | Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript |
| 4 | e | Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment |
| 5 | i | A transfrag falling entirely within a reference intron |
| 6 | o | Generic exonic overlap with a reference transcript |
| 7 | p | Possible polymerase run-on fragment (within 2 Kbases of a reference transcript) |
| 8 | r | Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case |
| 9 | u | Unknown, intergenic transcript |
| 10 | x | Exonic overlap with reference on the opposite strand |
| 11 | s | An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) |
| 12 | . | (.tracking file only, indicates multiple classifications) |
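To make the layout of a .tracking row concrete, here is a minimal Python sketch that splits the example row shown above into its parts. It assumes whitespace-separated columns and that the reference gene and transcript are joined by "|" exactly as in that example row; the function name `parse_tracking_row` is hypothetical and not part of Cufflinks.

```python
# Minimal sketch: split one cuffcompare .tracking row into its parts.
# Assumption: columns are whitespace-separated and laid out as in the
# example row from the text. parse_tracking_row is a hypothetical helper.

EXAMPLE_ROW = (
    "TCONS_00000045 XLOC_000023 Tcea|uc007afj.1 j "
    "q1:exp.115|exp.115.0|100|3.061355|0.350242|0.350207 "
    "q2:60hr.292|60hr.292.0|100|4.094084|0.000000|0.000000"
)


def parse_tracking_row(line):
    cols = line.split()
    transfrag_id, locus_id, reference, class_code = cols[:4]

    # "-" means no reference transcript matched.
    if reference == "-":
        ref_gene, ref_transcript = None, None
    else:
        ref_gene, ref_transcript = reference.split("|", 1)

    # One entry per input GTF, e.g. "q1:<field>|<field>|..."; "-" if absent.
    samples = {}
    for entry in cols[4:]:
        if entry == "-":
            continue
        label, payload = entry.split(":", 1)
        samples[label] = payload.split("|")

    return {
        "transfrag_id": transfrag_id,
        "locus_id": locus_id,
        "ref_gene": ref_gene,
        "ref_transcript": ref_transcript,
        "class_code": class_code,
        "samples": samples,
    }


parsed = parse_tracking_row(EXAMPLE_ROW)
print(parsed["class_code"], parsed["ref_gene"], sorted(parsed["samples"]))
```

The per-sample fields are kept as raw strings on purpose, since their number differs between the example row and the format line in the text.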
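VI. Using Cuffdiff

1. Overview

Cuffdiff finds significant differences in transcript expression.

2. Usage

cuffdiff detects significant changes in transcript expression, splicing, and promoter use.

    $ cuffdiff [options]* ...[sampleN_1.sam[,...,sampleN_M.sam]]

Here transcripts.gtf is a file produced by cufflinks, cuffcompare, or cuffmerge, or by some other program. The replicates of one sample are separated by commas. When more than one sample is given, cuffdiff tests for differences in gene expression between the samples. A common example:

    $ cuffdiff --labels label1,label2 -p 8 --time-series --multi-read-correct --library-type fr-unstranded --poisson-dispersion transcripts.gtf sample1.sam sample2.sam

cuffdiff accepts BAM/SAM files or the CXB files written by cuffquant. BAM and SAM files can be mixed, but BAM/SAM files cannot be mixed with CXB files.

3. Options

-h | --help
-o | --output-dir (default: ./)
    Output directory.
-L | --labels (default: q1,q2,...qN)
    A name or condition label for each sample.
-p | --num-threads (default: 1)
    Number of CPU threads to use.
-T | --time-series
    Compare the samples in the order given rather than all pairs: the second SAM against the first, the third against the second, the fourth against the third, and so on.
-N | --upper-quartile-norm
    Normalize by the upper quartile (75th percentile) of the number of fragments mapping to individual loci instead of the total. This helps when looking for differential expression among low-abundance genes and transcripts.
--total-hits-norm
    Count all fragments and aligned reads when computing FPKM. Mutually exclusive with the next option. Inactive by default.
--compatible-hits-norm
    Count only fragments and aligned reads compatible with the reference transcripts when computing FPKM. Active by default. It reduces the effect of ribosomal RNA reads on gene expression estimates.
-b | --frag-bias-correct
    Supply a FASTA file (usually genome.fa) so that the bias detection and correction algorithm is run. This can markedly improve the accuracy of transcript abundance estimates.
-u | --multi-read-correct
    Run an initial estimation step to weight reads that map to multiple genome locations more accurately.
-c | --min-alignment-count (default: 10)
    If fewer fragments than this align to a locus, no significance test is run for it, and its expression is treated as not significantly different.
-M | --mask-file
    Supply a GFF file. Reads that align to the transcripts in this file are ignored. It usually holds rRNA annotations, and may also include mitochondrial or any other transcripts you want ignored. Removing these RNAs improves the estimate of mRNA expression.
--FDR (default: 0.05)
    The allowed false discovery rate.
--library-type (default: fr-unstranded)
    Library type of the reads. For strand-specific reads the alignments carry an XS tag. Typical Illumina data uses fr-unstranded.
--dispersion-method

Other advanced options:

-m | --frag-len-mean (default: 200)
    Mean fragment length. Cufflinks now learns the fragment length mean itself, so setting this by hand is not recommended.
-s | --frag-len-std-dev (default: 80)
    Standard deviation of the fragment length. Cufflinks now learns this value itself, so setting it by hand is not recommended.
-v/--verbose
    Print verbose status information (version information and so on).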
-q/--quiet
    Print nothing except warnings and errors.
--no-update-check
    Turn off the automatic update check.
-F/--min-isoform-fraction <0.0-1.0>
    Changing this is not recommended. Alternative isoforms quantified below this fraction of the major isoform have their abundance rounded down to 0. Default: 1e-5.
--max-bundle-frags
    Maximum number of fragments a locus may have before it is skipped. Default: 1000000.
--max-frag-count-draws (default: 100) and --max-frag-assign-draws (default: 50)
--min-reps-for-js-test
    Minimum number of replicates needed to test a gene for differential regulation. Cuffdiff won't test genes for differential regulation unless the conditions in question have at least this many replicates. Default: 3.
--no-effective-length-correction
    Cuffdiff will not employ its "effective" length normalization to transcript FPKM.
--no-length-correction
    Cuffdiff will not normalize fragment counts by transcript length at all. Use this when fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Use with caution.
--max-mle-iterations
    Number of iterations for maximum-likelihood estimation. Default: 5000.
--poisson-dispersion
    Use the Poisson fragment dispersion model instead of learning one in each condition.

4. Cuffdiff output

1. FPKM tracking files

Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of the transcripts in each primary transcript group or gene group.

| File | Contents |
|---|---|
| isoforms.fpkm_tracking | Transcript FPKMs |
| genes.fpkm_tracking | Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene_id |
| cds.fpkm_tracking | Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id |
| tss_groups.fpkm_tracking | Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id |
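The grouping rule in the table above (gene and primary-transcript FPKMs are sums over transcripts sharing a gene_id or tss_id) can be sketched in a few lines of Python. This is only an illustration of the summation; the records, IDs, and FPKM values below are made up, and `sum_fpkm_by_group` is a hypothetical helper, not a Cuffdiff function.

```python
# Minimal sketch of the "summed FPKM" rule: group transcripts by an ID
# attribute and add up their FPKM values.
# All records and values here are invented for illustration.
from collections import defaultdict


def sum_fpkm_by_group(records, key):
    """Sum transcript FPKMs over all records sharing records[i][key]."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["FPKM"]
    return dict(totals)


transcripts = [
    {"transcript_id": "T1", "gene_id": "G1", "tss_id": "TSS1", "FPKM": 1.5},
    {"transcript_id": "T2", "gene_id": "G1", "tss_id": "TSS1", "FPKM": 2.5},
    {"transcript_id": "T3", "gene_id": "G1", "tss_id": "TSS2", "FPKM": 3.0},
    {"transcript_id": "T4", "gene_id": "G2", "tss_id": "TSS3", "FPKM": 0.5},
]

gene_fpkm = sum_fpkm_by_group(transcripts, "gene_id")  # genes.fpkm_tracking view
tss_fpkm = sum_fpkm_by_group(transcripts, "tss_id")    # tss_groups.fpkm_tracking view
print(gene_fpkm)
print(tss_fpkm)
```

The same helper covers the cds grouping by passing a p_id key, provided the records carry that attribute.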
2. Count tracking files

Estimates the number of fragments from each transcript, primary transcript, and gene in each sample. Primary transcript and gene counts are the sums of the counts of the transcripts in each primary transcript group or gene group.

| File | Contents |
|---|---|
| isoforms.count_tracking | Transcript counts |
| genes.count_tracking | Gene counts. Tracks the summed counts of transcripts sharing each gene_id |
| cds.count_tracking | Coding sequence counts. Tracks the summed counts of transcripts sharing each p_id, independent of tss_id |
| tss_groups.count_tracking | Primary transcript counts. Tracks the summed counts of transcripts sharing each tss_id |
3. Read group tracking files

Reports the expression and fragment counts of each transcript, primary transcript, and gene in each replicate.

| File | Contents |
|---|---|
| isoforms.read_group_tracking | Transcript read group tracking |
| genes.read_group_tracking | Gene read group tracking. Tracks the summed expression and counts of transcripts sharing each gene_id in each replicate |
| cds.read_group_tracking | Coding sequence FPKMs. Tracks the summed expression and counts of transcripts sharing each p_id, independent of tss_id in each replicate |
| tss_groups.read_group_tracking | Primary transcript FPKMs. Tracks the summed expression and counts of transcripts sharing each tss_id in each replicate |
4. Differential expression test

Tests for differential expression between samples for spliced transcripts, primary transcripts, genes, and coding sequences. For each pair of samples x and y, the following four files are produced:

| File | Contents |
|---|---|
| isoform_exp.diff | Transcript differential FPKM |
| gene_exp.diff | Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id |
| tss_group_exp.diff | Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id |
| cds_exp.diff | Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id |
Each file has the following format:

| Column number | Column name | Example | Description |
|---|---|---|---|
| 1 | Tested id | XLOC_000001 | A unique identifier describing the transcript, gene, primary transcript, or CDS being tested |
| 2 | gene | Lypla1 | The gene_name(s) or gene_id(s) being tested |
| 3 | locus | chr1:4797771-4835363 | Genomic coordinates for easy browsing to the genes or transcripts being tested |
| 4 | sample 1 | Liver | Label (or number if no labels provided) of the first sample being tested |
| 5 | sample 2 | Brain | Label (or number if no labels provided) of the second sample being tested |
| 6 | Test status | NOTEST | Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing |
| 7 | FPKMx | 8.01089 | FPKM of the gene in sample x |
| 8 | FPKMy | 8.551545 | FPKM of the gene in sample y |
| 9 | log2(FPKMy/FPKMx) | 0.06531 | The (base 2) log of the fold change y/x |
| 10 | test stat | 0.860902 | The value of the test statistic used to compute significance of the observed change in FPKM |
| 11 | p value | 0.389292 | The uncorrected p-value of the test statistic |
| 12 | q value | 0.985216 | The FDR-adjusted p-value of the test statistic |
| 13 | significant | no | Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple testing |
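The relation between the p value, q value, and significant columns can be illustrated with the textbook Benjamini-Hochberg procedure. The Python sketch below is an assumption-laden illustration, not Cuffdiff's internal code: the p-values are invented, the function names are hypothetical, and calling a test significant when q <= FDR is the usual convention rather than something the text above spells out.

```python
# Minimal sketch of the standard Benjamini-Hochberg adjustment.
# Assumption: this is the textbook procedure, not Cuffdiff's own code.
# The p-values below are invented for illustration.


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def call_significant(p_values, fdr=0.05):
    """Label each test "yes"/"no" by comparing its q-value with the FDR."""
    return ["yes" if q <= fdr else "no" for q in benjamini_hochberg(p_values)]


p_values = [0.001, 0.02, 0.04, 0.5]
q_values = benjamini_hochberg(p_values)
labels = call_significant(p_values, fdr=0.05)
print(q_values)
print(labels)
```

Note that a p-value of 0.04 is below 0.05 on its own, yet it is not called significant once the correction for four tests is applied.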
5. Differential splicing tests – splicing.diff

For each primary transcript, tests for differences among its isoforms. Only primary transcripts with two or more isoforms are listed.

| Column number | Column name | Example | Description |
|---|---|---|---|
| 1 | Tested id | TSS10015 | A unique identifier describing the primary transcript being tested |
| 2 | gene name | Rtkn | The gene_name or gene_id that the primary transcript being tested belongs to |
| 3 | locus | chr6:83087311-83102572 | Genomic coordinates for easy browsing to the genes or transcripts being tested |
| 4 | sample 1 | Liver | Label (or number if no labels provided) of the first sample being tested |
| 5 | sample 2 | Brain | Label (or number if no labels provided) of the second sample being tested |
| 6 | Test status | OK | Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing |
| 7 | Reserved | 0 | |
| 8 | Reserved | 0 | |
| 9 | √JS(x,y) | 0.22115 | The splice overloading of the primary transcript, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the splice variants |
| 10 | test stat | 0.22115 | The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y) |
| 11 | p value | 0.000174982 | The uncorrected p-value of the test statistic |
| 12 | q value | 0.985216 | The FDR-adjusted p-value of the test statistic |
| 13 | significant | yes | Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple testing |
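The √JS(x,y) column is the square root of the Jensen-Shannon divergence between the relative isoform abundances in the two samples. The Python sketch below shows that calculation under stated assumptions: base-2 logarithms are chosen here so the value lies between 0 and 1 (the log base Cuffdiff uses is not given in the text), the FPKM values are invented, and the function names are hypothetical.

```python
# Minimal sketch: square root of the Jensen-Shannon divergence between the
# relative isoform abundances of two samples.
# Assumptions: base-2 logs (result in [0, 1]); FPKM values are invented.
import math


def relative_abundances(fpkms):
    """Turn isoform FPKMs into fractions that sum to 1."""
    total = sum(fpkms)
    return [x / total for x in fpkms]


def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def sqrt_js(p, q):
    """sqrt(JS(p, q)) with JS = H((p+q)/2) - (H(p) + H(q)) / 2."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    js = entropy(mid) - (entropy(p) + entropy(q)) / 2
    return math.sqrt(max(js, 0.0))  # guard against tiny negative rounding


# Two isoforms of one primary transcript, quantified in samples x and y.
x = relative_abundances([30.0, 10.0])
y = relative_abundances([10.0, 30.0])

print(sqrt_js(x, x))  # identical isoform mix
print(sqrt_js(x, y))  # shifted isoform mix
```

Identical isoform mixes give 0 and a complete switch from one isoform to another gives 1, so the statistic measures a change in relative usage rather than in total expression.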
6. Differential coding output – cds.diff

For each gene, tests for differences between samples in the output of its coding sequences. Only genes with two or more CDS (multi-protein genes) are listed in this file.

| Column number | Column name | Example | Description |
|---|---|---|---|
| 1 | Tested id | XLOC_000002-[chr1:5073200-5152501] | A unique identifier describing the gene being tested |
| 2 | gene name | Atp6v1h | The gene_name or gene_id |
| 3 | locus | chr1:5073200-5152501 | Genomic coordinates for easy browsing to the genes or transcripts being tested |
| 4 | sample 1 | Liver | Label (or number if no labels provided) of the first sample being tested |
| 5 | sample 2 | Brain | Label (or number if no labels provided) of the second sample being tested |
| 6 | Test status | OK | Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing |
| 7 | Reserved | 0 | |
| 8 | Reserved | 0 | |
| 9 | √JS(x,y) | 0.0686517 | The CDS overloading of the gene, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the coding sequences |
| 10 | test stat | 0.0686517 | The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y) |
| 11 | p value | 0.00546783 | The uncorrected p-value of the test statistic |
| 12 | q value | 0.985216 | The FDR-adjusted p-value of the test statistic |
| 13 | significant | yes | Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple testing |
7. Differential promoter use – promoters.diff

Tests for differences in promoter use between samples. Only genes expressing two or more isoforms are listed here.

8. Read group info – read_groups.info

Lists, for each replicate, the key properties cuffdiff used during quantification.

| Column number | Column name | Example | Description |
|---|---|---|---|
| 1 | file | mCherry_rep_A/accepted_hits.bam | BAM or SAM file containing the data for the read group |
| 2 | condition | mCherry | Condition to which the read group belongs |
| 3 | replicate_num | 0 | Replicate number of the read group |
| 4 | total_mass | 4.72517e+06 | Total number of fragments for the read group |
| 5 | norm_mass | 4.72517e+06 | Fragment normalization constant used during calculation of FPKMs |
| 6 | internal_scale | 1.23916 | Internal scaling factor, used to transform replicates of a single condition onto the "internal" common count scale |
| 7 | external_scale | 0.96 | External scaling factor, used to transform counts from different conditions onto an internal common count scale |
9. Run info – run.info

Information about the run.

The FPKM tracking output files have the following format:

1 tracking_id