26 March 2012

组装方面的老问题了,就是从大到小排序然后累加,第一次超过比例后输出当前值。
关于N50,可以看wiki文章 (Assembly algorithms for next-generation sequencing data)

Assemblies are measured by the size and accuracy of their contigs and scaffolds. Assembly size is usually given by statistics including maximum length, average length, combined total length, and N50. The contig N50 is the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly. The N50 statistics for different assemblies are not comparable unless each is calculated using the same combined length value. Assembly accuracy is difficult to measure. Some inherent measure of accuracy is provided by the degrees of mate-constraint satisfaction and violation [17]. Alignment to reference sequences is useful whenever trusted references exist.

代码可以看SEQanswers论坛的帖子,第二页的附件也是。
关键的就是这堆:

@x=sort{$b<=>$a}@x;
my $count=0;
my ($n50,$n80,$n90);
my $len_mean=$tlen/$n;
foreach my $i(@x)
{
	$count+=$i;
	$n50=$i if (($count>=$tlen*0.5)&&(!defined $n50));
	$n80=$i if (($count>=$tlen*0.8)&&(!defined $n80));
	$n90=$i if (($count>=$tlen*0.9)&&(!defined $n90));
}

#/usr/bin/perl -w
use strict;
my ($len,$total)=(0,0);
my @x;
while(<>){
	if(/^[\>\@]/){
		if($len>0){
			$total+=$len;
			push @x,$len;
		}
		$len=0;
	}
	else{
		s/\s//g;
		$len+=length($_);
	}
}
if ($len>0){
	$total+=$len;
	push @x,$len;
}
@x=sort{$b<=>$a} @x; 
my ($count,$half)=(0,0);
for (my $j=0;$j<@x;$j++){
	$count+=$x[$j];
	if (($count>=$total/2)&&($half==0)){
		print "N50: $x[$j]\n";
		$half=$x[$j]
	}elsif ($count>=$total*0.9){
		print "N90: $x[$j]\n";
		exit;
	}
}

or run this command as before:
Code:

perl -e 'my ($len,$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if($len>0){$total+=$len;[email protected],$len;};$len=0;}else{s/\s//g;$len+=length($_);}}if ($len>0){$total+=$len;push @x,$len;}@x=sort{$b<=>$a}@x; my ($count,$half)=(0,0);for (my $j=0;$j<@x;$j++){$count+=$x[$j];if(($count>=$total/2)&&($half==0)){print "N50: $x[$j]\n";$half=$x[$j]}elsif($count>=$total*0.9){print "N90: $x[$j]\n";exit;}}' contigs.fa

论坛附件:calengc.pl



blog comments powered by Disqus