Simple regex example

The goal of this vignette is to explain how to estimate asymptotic complexity for custom units (other than the defaults, seconds and kilobytes).

Simple regex example

Let us consider a simple adaptation of the regex example from the README. The code below defines a function which replaces all occurrences of the character a with some other string replace, in subject.

gsub_replace <- function(replace,subject){
  string <- gsub("a",replace,subject)
  data.frame(string, nchar=nchar(string))
}
gsub_replace("-","foobar")
#>   string nchar
#> 1 foob-r     6
gsub_replace("--","foobar")
#>    string nchar
#> 1 foob--r     7

The output above shows that we can replace the a in foobar with either one or two dashes, and the output is a data frame with one row and two columns, including the numeric nchar column which we may want to visualize as a function of N. To do that in the code below, we use that function inside a call to atime with result=TRUE (required in order to save the resulting 1 row data frames).

atime.gsub.list <- atime::atime(
  setup={
    subject <- paste(rep("a", N), collapse="")
    pattern <- paste(rep(c("a?", "a"), each=N), collapse="")
  },
  constant.replacement=gsub_replace("constant size replacement",subject),
  linear.replacement=gsub_replace(subject,subject),
  result=TRUE)
plot(atime.gsub.list)
#> Warning in ggplot2::scale_y_log10("median line, min/max band"): log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-2

The output above shows that atime computes and plots nchar, in addition to the usual kilobytes and seconds. In this case, it is useful to see the different asymptotic slopes of linear and constant replacement, for both nchar and seconds. To estimate their asymptotic complexity classes, we use the code below,

ref.gsub.list <- atime::references_best(atime.gsub.list)
plot(ref.gsub.list)
#> Warning in ggplot2::scale_y_log10(""): log-10 transformation introduced
#> infinite values.

plot of chunk unnamed-chunk-3

The plot above clearly shows the linear complexity of constant replacement, and the quadratic complexity of linear replacement.

Dynamic programming algorithms for change-point detection

The time complexity of the Dynamic Programming algorithm implemented in the PeakSegDisk package depends on the number of intervals (candidate changepoints stored), so we may want to look at how the number of intervals changes with the data size N. Also we must input a non-negative penalty parameter, and then the algorithm outputs a certain number of segments, again which may be interesting to look at as a function of data size N. Here we compute the mean number of intervals for real Mono27ac data, and synthetic count data which are always increasing.

First, in the code below, we define the data sets, as a function of data size N,

library(data.table)
data(Mono27ac, package="PeakSegDisk", envir=environment())
setup <- quote({
  data.list <- list(real=Mono27ac$coverage[1:N])
  data.list$synthetic <- data.table(data.list$real)[, count := 1:.N]
})

Next, we define a list of expressions to run for each data size:

penalty <- 1e6
(expr.list <- c(
  if(requireNamespace("PeakSegDisk"))atime::atime_grid(
    list(Data=c("real","synthetic")),
    FPOP={
      fit <- PeakSegDisk::PeakSegFPOP_df(data.list[[Data]], penalty)
      fit$loss[, .(intervals=mean.intervals, segments)]
    }),
  atime::atime_grid(mean={
    mean(data.list$real$count)
    data.frame(intervals=NA, segments=1)
  })
))
#> Loading required namespace: PeakSegDisk
#> $`FPOP Data=real`
#> {
#>     fit <- PeakSegDisk::PeakSegFPOP_df(data.list[["real"]], penalty)
#>     fit$loss[, .(intervals = mean.intervals, segments)]
#> }
#> 
#> $`FPOP Data=synthetic`
#> {
#>     fit <- PeakSegDisk::PeakSegFPOP_df(data.list[["synthetic"]], 
#>         penalty)
#>     fit$loss[, .(intervals = mean.intervals, segments)]
#> }
#> 
#> $mean
#> {
#>     mean(data.list$real$count)
#>     data.frame(intervals = NA, segments = 1)
#> }

The output above shows the expressions which will be run for each data size. Note that each expression returns a data frame with one row and two numeric columns, intervals and segments, which will be used as additional units in asymptotic complexity analysis.

Next, we use substitute to use the setup we defined above as an argument to atime, along with the other required arguments:

(atime.DP.lang <- substitute(atime::atime(
  N=as.integer(10^seq(1, 3, by=0.5)),
  setup=SETUP,
  expr.list=expr.list,
  seconds.limit=Inf,
  result=TRUE),
  list(SETUP=setup)))
#> atime::atime(N = as.integer(10^seq(1, 3, by = 0.5)), setup = {
#>     data.list <- list(real = Mono27ac$coverage[1:N])
#>     data.list$synthetic <- data.table(data.list$real)[, `:=`(count, 
#>         1:.N)]
#> }, expr.list = expr.list, seconds.limit = Inf, result = TRUE)

Next we run the timings:

atime.DP.list <- eval(atime.DP.lang)
plot(atime.DP.list)
#> Warning in ggplot2::scale_y_log10("median line, min/max band"): log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_dl()`).

plot of chunk unnamed-chunk-7

The plot above shows the timings in both kinds of data (real and synthetic). Clearly the algorithm has much fewer intervals, in real data than in synthetic increasing data. The code below computes and plots the best asymptotic references:

ref.DP.list <- atime::references_best(atime.DP.list)
plot(ref.DP.list)
#> Warning in ggplot2::scale_y_log10(""): log-10 transformation introduced
#> infinite values.
#> Warning in (function (..., deparse.level = 1) : number of rows of result is not
#> a multiple of vector length (arg 2)

plot of chunk unnamed-chunk-8

Note in the code above the more.units="intervals" argument, which says to use the intervals column as an additional unit. The plot above shows plots of all three units as a function of data size. It is clear that there is a substantial difference in the number of intervals stored by the algorithm, between real and synthetic increasing data. From the plot above it is clear that

Exercise for the reader: to see the expected asymptotic time complexity in the last plot, re-do the previous analyses, increasing the penalty as well as the max data size N.