Quality Assurance / Quality Control

Eric Archer

2017-04-10

Quality of genotypes and sequences can be assessed with several functions. For genotypic data, by-locus summaries can be conducted with the summarizeLoci function, which will also produce summaries for each strata:

data(msats.g)
msats <- msats.g
smry <- summarizeLoci(msats)
head(smry)
##       num.genotyped prop.genotyped num.alleles allelic.richness
## D11t            125           0.99          12            0.096
## EV37            119           0.94          22            0.185
## EV94            125           0.99          15            0.120
## Ttr11           125           0.99           9            0.072
## Ttr34           126           1.00          10            0.079
##       prop.unique.alleles exptd.heterozygosity obsvd.heterozygosity
## D11t                0.250                 0.75                 0.70
## EV37                0.136                 0.83                 0.70
## EV94                0.067                 0.83                 0.78
## Ttr11               0.222                 0.80                 0.70
## Ttr34               0.200                 0.81                 0.70

The dupGenotypes function identifies samples that have the same or nearly the same genotypes. The number (or percent) of loci that must be shared in order for it to be considered a duplicate can be set by the num.shared argument. The return data.frame provides which loci the two samples show mismatches at so they can be reviewed.

# Find samples that share alleles at 2/3rds of the loci
dupGenotypes(msats, num.shared = 0.66)
##    ids.1 ids.2       strata.1       strata.2 num.loci.genotyped
## 1  41579 45237        Coastal        Coastal                  4
## 2  23945 78065        Coastal        Coastal                  5
## 3  25503 78053        Coastal        Coastal                  5
## 4  25509 41822        Coastal        Coastal                  5
## 5  41540 78040        Coastal        Coastal                  5
## 6  41578 45233        Coastal        Coastal                  5
## 7  44720 78058        Coastal        Coastal                  5
## 8  45230 78040        Coastal        Coastal                  5
## 9  78034 78040        Coastal        Coastal                  5
## 10 78034 78043        Coastal        Coastal                  5
## 11 78035 78053        Coastal        Coastal                  5
## 12 78045 78058        Coastal        Coastal                  5
## 13 40916 78038        Coastal        Coastal                  4
## 14 41820 42193        Coastal        Coastal                  4
## 15 42193 44719        Coastal        Coastal                  4
## 16 42193 45230        Coastal        Coastal                  4
## 17 42193 78035        Coastal        Coastal                  4
## 18 42193 78045        Coastal        Coastal                  4
## 19 41821  4495        Coastal Offshore.North                  3
## 20  4495 78041 Offshore.North        Coastal                  3
## 21  4495 78054 Offshore.North        Coastal                  3
## 22  4495 78056 Offshore.North        Coastal                  3
##    num.loci.shared prop.loci.shared mismatch.loci
## 1                4             1.00              
## 2                4             0.80          EV37
## 3                4             0.80         Ttr11
## 4                4             0.80         Ttr34
## 5                4             0.80          D11t
## 6                4             0.80          EV94
## 7                4             0.80          EV94
## 8                4             0.80         Ttr11
## 9                4             0.80          EV94
## 10               4             0.80          EV37
## 11               4             0.80          D11t
## 12               4             0.80         Ttr34
## 13               3             0.75         Ttr34
## 14               3             0.75         Ttr11
## 15               3             0.75          EV94
## 16               3             0.75          EV94
## 17               3             0.75         Ttr34
## 18               3             0.75          EV94
## 19               2             0.67         Ttr11
## 20               2             0.67         Ttr11
## 21               2             0.67         Ttr11
## 22               2             0.67         Ttr11

The start and end positions and number of N’s and indels can be generated with the summarizeSeqs function:

library(ape)
data(dolph.seqs)
seq.smry <- summarizeSeqs(as.DNAbin(dolph.seqs))
head(seq.smry)
##      start end length num.ns num.indels
## 4495     1 402    402      0          2
## 4496     1 402    402      0          2
## 4498     1 402    402      0          1
## 5814     1 402    402      0          2
## 5815     1 402    402      0          2
## 5816     1 402    402      0          2

Base frequencies can be generated with baseFreqs:

bf <- baseFreqs(as.DNAbin(dolph.seqs))

# nucleotide frequencies by site
bf$site.freq[, 1:15]
##     1   2   3   4   5   6 7   8   9  10  11  12  13  14  15
## a   0 126 126 126 126 126 5   0   0   0   0 126   0   0   0
## c   0   0   0   0   0   0 0   0 126   0   0   0   0   0   0
## g 126   0   0   0   0   0 0 126   0   0   0   0   0   0 126
## t   0   0   0   0   0   0 0   0   0 126 126   0 126 126   0
## u   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## r   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## y   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## m   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## k   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## w   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## s   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## b   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## d   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## h   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## v   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## n   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## x   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## -   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
## .   0   0   0   0   0   0 0   0   0   0   0   0   0   0   0
# overall nucleotide frequencies
bf$base.freqs
## 
##      a      c      g      t      u      r      y      m      k      w 
## 0.2997 0.2282 0.1283 0.3389 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 
##      s      b      d      h      v      n      x      -      . 
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0048 0.0000

Sequences can be scanned for low-frequency substitutions with lowFreqSubs:

lowFreqSubs(as.DNAbin(dolph.seqs), min.freq = 2)
##      id site base       motif
## 1 23792  274    t cctattgatcc
## 2 23794  287    g cctccgttata
## 3 26304  274    a cctataaatcc
## 4 26304  394    t taccttgtggg
## 5 74962   57    a taaaaataatt
## 6 74962  104    g catacgcatgt
## 7 74962  392    t catgctccgtg
## 8 74962  393    c atgctccgtgg

Unusual sequences can be identified by plotting likelihoods based on pairwise distances:

data(dolph.haps)
sequenceLikelihoods(as.DNAbin(dolph.haps))

##        id mean.dist negLogLik deltaLogLik
## 1  Hap.01       8.0        85         1.7
## 2  Hap.02       6.3        99        16.0
## 3  Hap.03       7.1        92         8.9
## 4  Hap.04       8.1        92         8.8
## 5  Hap.05       7.9        86         2.9
## 6  Hap.06      13.1       102        19.2
## 7  Hap.07       7.2        90         6.8
## 8  Hap.08       8.9        85         1.4
## 9  Hap.09       7.3        91         7.5
## 10 Hap.10       7.8        94        11.1
## 11 Hap.11       7.7        83         0.0
## 12 Hap.12      11.6        91         7.4
## 13 Hap.13       8.6        88         4.3
## 14 Hap.14       8.2        91         7.6
## 15 Hap.15       7.4        98        14.8
## 16 Hap.16       8.3        86         2.9
## 17 Hap.17       7.2        86         2.2
## 18 Hap.18      11.7        90         7.2
## 19 Hap.19      11.3        90         7.0
## 20 Hap.20       8.1        87         3.3
## 21 Hap.21       8.5        88         4.3
## 22 Hap.22      12.7       104        20.7
## 23 Hap.23       9.7        93         9.2
## 24 Hap.24       8.8        86         2.7
## 25 Hap.25      11.2        85         1.8
## 26 Hap.26       8.7        87         3.3
## 27 Hap.27       7.2        86         3.2
## 28 Hap.28      11.0        84         1.2
## 29 Hap.29      11.8        97        13.4
## 30 Hap.30       8.9        93         9.4
## 31 Hap.31       7.2        91         8.1
## 32 Hap.32      13.7       110        26.3
## 33 Hap.33       7.9        92         8.8

All of the above functions can be conducted at once with the qaqc function. Only those functions appropriate to the data type contained (haploid or diploid) will be run. Files are written for each output that are labelled either by the @description slot of the gtypes object or the optional label argument of the function.