Furniture

2018-01-17

This vignette is current as of furniture v1.7.2.

Using furniture

We will first make a fictitious data set:

df <- data.frame(a = rnorm(1000, 1.5, 2), 
                 b = seq(1, 1000, 1), 
                 c = c(rep("control", 400), rep("Other", 70), rep("treatment", 500), rep("None", 30)),
                 d = c(sample(1:1000, 900, replace=TRUE), rep(-99, 100)))

There are five functions that we’ll demonstrate here:

  1. washer
  2. table1
  3. tableC
  4. tableF
  5. tableX

Washer

washer is a great function for quick data cleaning. It is useful in situations where there are placeholder values, extra levels in a factor, or several values that need to be changed to a single other value.

library(tidyverse)

df <- df %>%
  mutate(d = washer(d, -99),  ## changes the placeholder -99 to NA
         c = washer(c, "Other", "None", value = "control")) ## changes "Other" and "None" to "control"
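
Conceptually, washer() behaves like the following base-R sketch (an illustration of the behavior only, not the package's actual implementation):

```r
# Rough base-R sketch of washer()'s behavior: values matching any of the
# targets in `...` are replaced by `value`, which defaults to NA.
washer_sketch <- function(x, ..., value = NA) {
  targets <- unlist(list(...))
  x[x %in% targets] <- value
  x
}

washer_sketch(c(1, -99, 3), -99)                  # -99 becomes NA
washer_sketch(c("Other", "control", "None"),
              "Other", "None", value = "control") # all become "control"
```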

Table1

Now that the data is “washed” we can start exploring and reporting.

table1(df, a, b, factor(c), d)
## 
## ────────────────────────────────
##               Mean/Count (SD/%)
##               n = 1000         
##  a                             
##               1.5 (2.0)        
##  b                             
##               500.5 (288.8)    
##  factor.c.                     
##     control   500 (50%)        
##     treatment 500 (50%)        
##  d                             
##               496.5 (291.7)    
## ────────────────────────────────

The variables must be numeric or factor. Because table1() uses a special type of evaluation (i.e., non-standard evaluation), we can transform variables inside the call (e.g., factor(c)). This extends to creating an entirely new variable within the call as well.

table1(df, a, b, d, ifelse(a > 1, 1, 0))
## 
## ───────────────────────────────────────
##                      Mean/Count (SD/%)
##                      n = 1000         
##  a                                    
##                      1.5 (2.0)        
##  b                                    
##                      500.5 (288.8)    
##  d                                    
##                      496.5 (291.7)    
##  ifelse.a...1..1..0.                  
##                      0.6 (0.5)        
## ───────────────────────────────────────

This is just the beginning, though. Two of the function’s more powerful features are shown below:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE)
## 
## ─────────────────────────────────────────────────────────
##                                 factor(c) 
##                      control       treatment     P-Value
##                      n = 500       n = 500              
##  a                                               0.705  
##                      1.5 (2.0)     1.6 (2.0)            
##  b                                               <.001  
##                      280.5 (221.7) 720.5 (144.5)        
##  d                                               0.492  
##                      502.9 (289.9) 489.5 (293.8)        
##  ifelse.a...1..1..0.                             0.848  
##                      0.6 (0.5)     0.6 (0.5)            
## ─────────────────────────────────────────────────────────

The splitby = ~factor(c) argument stratifies the means and counts by a factor variable (here, control versus treatment). When we use this, we can also compute tests of significance automatically with test=TRUE.

We can also use it intuitively within the pipe:

library(tidyverse)

df %>%
  group_by(c) %>%
  table1(a, b, d, ifelse(a > 1, 1, 0), 
       test=TRUE)
## 
## ─────────────────────────────────────────────────────────
##                                     c 
##                      control       treatment     P-Value
##                      n = 500       n = 500              
##  a                                               0.705  
##                      1.5 (2.0)     1.6 (2.0)            
##  b                                               <.001  
##                      280.5 (221.7) 720.5 (144.5)        
##  d                                               0.492  
##                      502.9 (289.9) 489.5 (293.8)        
##  ifelse.a...1..1..0.                             0.848  
##                      0.6 (0.5)     0.6 (0.5)            
## ─────────────────────────────────────────────────────────

In this case, we used the group_by() function from dplyr (within the tidyverse), and table1() knows to use that grouping variable in place of the splitby argument.

Finally, you can polish it quite a bit using a few other options. For example, you can do the following:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       type = c("simple", "condensed"))
## 
## ─────────────────────────────────────────────
##                     factor(c) 
##          control       treatment     P-Value
##          n = 500       n = 500              
##  A       1.5 (2.0)     1.6 (2.0)     0.705  
##  B       280.5 (221.7) 720.5 (144.5) <.001  
##  D       502.9 (289.9) 489.5 (293.8) 0.492  
##  New Var 0.6 (0.5)     0.6 (0.5)     0.848  
## ─────────────────────────────────────────────

Note that var_names can be used for more complex names (e.g., with spaces or brackets) that otherwise cannot be used in data frames. Alternatively, for simpler names, we can assign them directly.

table1(df, A = a, B = b, D = d, A2 = ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       type = c("simple", "condensed"))
## 
## ────────────────────────────────────────
##                factor(c) 
##     control       treatment     P-Value
##     n = 500       n = 500              
##  A  1.5 (2.0)     1.6 (2.0)     0.705  
##  B  280.5 (221.7) 720.5 (144.5) <.001  
##  D  502.9 (289.9) 489.5 (293.8) 0.492  
##  A2 0.6 (0.5)     0.6 (0.5)     0.848  
## ────────────────────────────────────────

You can also format the numbers (adding a comma to big numbers, e.g., 20,000 instead of 20000):

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       format_number = TRUE)
## 
## ─────────────────────────────────────────────
##                     factor(c) 
##          control       treatment     P-Value
##          n = 500       n = 500              
##  A                                   0.705  
##          1.5 (2.0)     1.6 (2.0)            
##  B                                   <.001  
##          280.5 (221.7) 720.5 (144.5)        
##  D                                   0.492  
##          502.9 (289.9) 489.5 (293.8)        
##  New Var                             0.848  
##          0.6 (0.5)     0.6 (0.5)            
## ─────────────────────────────────────────────

The table can be exported directly to a folder in the working directory called “Table1”. With export, we provide a string that becomes the name of the CSV file containing the formatted table.

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       format_number = TRUE,
       export = "example_table1")

The table can also be output as a LaTeX, markdown, or pandoc table (matching all the output types of knitr::kable). Below we produce a LaTeX table (not via kable, but with a built-in function that prints the variable name at the top of the table):

table1(df, a, b, d, "new var" = ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       output = "latex2")
## \begin{table}[ ht ] 
## \centering 
## \caption{}
## \begin{tabular}{ l c c c }
## \hline 
##  & \multicolumn{ 2 }{c}{ factor(c) }\\ 
##   & control & treatment & P-Value \\ 
##   & n = 500 & n = 500 &   \\ 
##  \hline 
## a &   &   & 0.705\\ 
## \hspace{6pt}   & 1.5 (2.0) & 1.6 (2.0) & \\ 
## b &   &   & <.001\\ 
## \hspace{6pt}   & 280.5 (221.7) & 720.5 (144.5) & \\ 
## d &   &   & 0.492\\ 
## \hspace{6pt}   & 502.9 (289.9) & 489.5 (293.8) & \\ 
## new.var &   &   & 0.848\\ 
## \hspace{6pt}   & 0.6 (0.5) & 0.6 (0.5) & \\ 
##  \hline
##       \end{tabular}
##       \end{table}

The last thing to show regarding table1() is that it can be printed in a simplified, condensed form. Instead of reporting counts and percentages for categorical variables, it reports only percentages, and the table has much less white space.

table1(df, a, b, d, "new var" = ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       type = c("simple", "condensed"))
## 
## ─────────────────────────────────────────────
##                     factor(c) 
##          control       treatment     P-Value
##          n = 500       n = 500              
##  a       1.5 (2.0)     1.6 (2.0)     0.705  
##  b       280.5 (221.7) 720.5 (144.5) <.001  
##  d       502.9 (289.9) 489.5 (293.8) 0.492  
##  new.var 0.6 (0.5)     0.6 (0.5)     0.848  
## ─────────────────────────────────────────────

Table C

This function creates simple, beautiful correlation tables. The syntax is just like table1() in most respects. Below we include all the numeric variables to see their correlations. Since there are missing values in d, we use na.rm = TRUE.

tableC(df, 
       a, b, d,
       na.rm = TRUE)
## N = 900
## Note: pearson correlation (p-value).
## 
## ──────────────────────────────────────────
##       [1]            [2]            [3]  
##  [1]a 1.00                               
##  [2]b -0.017 (0.621) 1.00                
##  [3]d 0.024 (0.463)  -0.047 (0.161) 1.00 
## ──────────────────────────────────────────
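
The N = 900 note above suggests tableC() analyzed complete cases, dropping the 100 rows where d is missing. Under that assumption, base R's cor() and cor.test() should reproduce the coefficients and p-values; here is a sketch on toy data standing in for df:

```r
# Toy data with some missingness, standing in for df above.
set.seed(42)
toy <- data.frame(a = rnorm(50), b = rnorm(50),
                  d = c(rnorm(45), rep(NA, 5)))

d2 <- na.omit(toy)              # complete cases, like na.rm = TRUE above
round(cor(d2), 3)               # pairwise Pearson correlations
cor.test(d2$a, d2$b)$p.value    # p-value for one pair
```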

All the adjustments that you can make in table1() can be done here as well. For example,

tableC(df, 
       "A" = a, "B" = b, "D" = d,
       na.rm = TRUE,
       output = "html")
## N = 900
## Note: pearson correlation (p-value).
##      [1]            [2]            [3]
## [1]A 1.00
## [2]B -0.017 (0.621) 1.00
## [3]D 0.024 (0.463)  -0.047 (0.161) 1.00

Table F

This function creates simple frequency tables. The syntax is just like table1() and tableC() in most respects, except that it takes only one variable instead of many.

tableF(df, a)
## 
## ────────────────────────────────────────────────
##  a                 Freq CumFreq Percent CumPerc
##  -4.60325965856683 1    1       0.10%   0.10%  
##  -3.52258149514284 1    2       0.10%   0.20%  
##  -3.5115413578839  1    3       0.10%   0.30%  
##  -3.48102300021307 1    4       0.10%   0.40%  
##  -3.14750984447431 1    5       0.10%   0.50%  
##  -3.1016143004761  1    6       0.10%   0.60%  
##  -3.04911354067806 1    7       0.10%   0.70%  
##  -3.03588637157607 1    8       0.10%   0.80%  
##  -2.97697941696271 1    9       0.10%   0.90%  
##  -2.93306499179011 1    10      0.10%   1.00%  
##  ...               ...  ...     ...     ...    
##  5.92522014874906  1    990     0.10%   99.00% 
##  6.13806804549696  1    991     0.10%   99.10% 
##  6.24562679786518  1    992     0.10%   99.20% 
##  6.27224005633084  1    993     0.10%   99.30% 
##  6.48213529514952  1    994     0.10%   99.40% 
##  6.57381801850885  1    995     0.10%   99.50% 
##  6.90050548224347  1    996     0.10%   99.60% 
##  6.95312810209947  1    997     0.10%   99.70% 
##  7.05215987446464  1    998     0.10%   99.80% 
##  7.18689125870712  1    999     0.10%   99.90% 
##  7.66620501427896  1    1000    0.10%   100.00%
## ────────────────────────────────────────────────

As with table1(), we can use a splitby argument (or group_by()).

tableF(df, d, splitby = c)
## Variable:d
## 
## ─────────────────────────────────────────────────────
##  control Freq CumFreq Percent CumPerc Valid CumValid
##  2       1    1       0.20%   0.20%   0.21% 0.21%   
##  3       1    2       0.20%   0.40%   0.21% 0.43%   
##  4       1    3       0.20%   0.60%   0.21% 0.64%   
##  8       2    5       0.40%   1.00%   0.43% 1.06%   
##  11      1    6       0.20%   1.20%   0.21% 1.28%   
##  18      1    7       0.20%   1.40%   0.21% 1.49%   
##  19      1    8       0.20%   1.60%   0.21% 1.70%   
##  23      2    10      0.40%   2.00%   0.43% 2.13%   
##  26      2    12      0.40%   2.40%   0.43% 2.55%   
##  27      1    13      0.20%   2.60%   0.21% 2.77%   
##  ...     ...  ...     ...     ...     ...   ...     
##  975     2    461     0.40%   92.20%  0.43% 98.09%  
##  979     1    462     0.20%   92.40%  0.21% 98.30%  
##  981     1    463     0.20%   92.60%  0.21% 98.51%  
##  984     1    464     0.20%   92.80%  0.21% 98.72%  
##  987     1    465     0.20%   93.00%  0.21% 98.94%  
##  990     1    466     0.20%   93.20%  0.21% 99.15%  
##  991     1    467     0.20%   93.40%  0.21% 99.36%  
##  992     1    468     0.20%   93.60%  0.21% 99.57%  
##  993     1    469     0.20%   93.80%  0.21% 99.79%  
##  997     1    470     0.20%   94.00%  0.21% 100.00% 
##  Missing 30   500     6.00%   100.00%               
## ─────────────────────────────────────────────────────
## 
## ───────────────────────────────────────────────────────
##  treatment Freq CumFreq Percent CumPerc Valid CumValid
##  1         1    1       0.20%   0.20%   0.23% 0.23%   
##  6         3    4       0.60%   0.80%   0.70% 0.93%   
##  12        1    5       0.20%   1.00%   0.23% 1.16%   
##  14        2    7       0.40%   1.40%   0.47% 1.63%   
##  15        1    8       0.20%   1.60%   0.23% 1.86%   
##  16        1    9       0.20%   1.80%   0.23% 2.09%   
##  18        1    10      0.20%   2.00%   0.23% 2.33%   
##  21        1    11      0.20%   2.20%   0.23% 2.56%   
##  23        1    12      0.20%   2.40%   0.23% 2.79%   
##  24        1    13      0.20%   2.60%   0.23% 3.02%   
##  ...       ...  ...     ...     ...     ...   ...     
##  977       1    419     0.20%   83.80%  0.23% 97.44%  
##  979       2    421     0.40%   84.20%  0.47% 97.91%  
##  982       1    422     0.20%   84.40%  0.23% 98.14%  
##  984       1    423     0.20%   84.60%  0.23% 98.37%  
##  985       1    424     0.20%   84.80%  0.23% 98.60%  
##  988       2    426     0.40%   85.20%  0.47% 99.07%  
##  994       1    427     0.20%   85.40%  0.23% 99.30%  
##  996       1    428     0.20%   85.60%  0.23% 99.53%  
##  998       1    429     0.20%   85.80%  0.23% 99.77%  
##  1000      1    430     0.20%   86.00%  0.23% 100.00% 
##  Missing   70   500     14.00%  100.00%               
## ───────────────────────────────────────────────────────

df %>%
  group_by(c) %>%
  tableF(d)
## Using a grouped data frame: default using the grouping variables and not splitby
## Variable:d
## 
## ─────────────────────────────────────────────────────
##  control Freq CumFreq Percent CumPerc Valid CumValid
##  2       1    1       0.20%   0.20%   0.21% 0.21%   
##  3       1    2       0.20%   0.40%   0.21% 0.43%   
##  4       1    3       0.20%   0.60%   0.21% 0.64%   
##  8       2    5       0.40%   1.00%   0.43% 1.06%   
##  11      1    6       0.20%   1.20%   0.21% 1.28%   
##  18      1    7       0.20%   1.40%   0.21% 1.49%   
##  19      1    8       0.20%   1.60%   0.21% 1.70%   
##  23      2    10      0.40%   2.00%   0.43% 2.13%   
##  26      2    12      0.40%   2.40%   0.43% 2.55%   
##  27      1    13      0.20%   2.60%   0.21% 2.77%   
##  ...     ...  ...     ...     ...     ...   ...     
##  975     2    461     0.40%   92.20%  0.43% 98.09%  
##  979     1    462     0.20%   92.40%  0.21% 98.30%  
##  981     1    463     0.20%   92.60%  0.21% 98.51%  
##  984     1    464     0.20%   92.80%  0.21% 98.72%  
##  987     1    465     0.20%   93.00%  0.21% 98.94%  
##  990     1    466     0.20%   93.20%  0.21% 99.15%  
##  991     1    467     0.20%   93.40%  0.21% 99.36%  
##  992     1    468     0.20%   93.60%  0.21% 99.57%  
##  993     1    469     0.20%   93.80%  0.21% 99.79%  
##  997     1    470     0.20%   94.00%  0.21% 100.00% 
##  Missing 30   500     6.00%   100.00%               
## ─────────────────────────────────────────────────────
## 
## ───────────────────────────────────────────────────────
##  treatment Freq CumFreq Percent CumPerc Valid CumValid
##  1         1    1       0.20%   0.20%   0.23% 0.23%   
##  6         3    4       0.60%   0.80%   0.70% 0.93%   
##  12        1    5       0.20%   1.00%   0.23% 1.16%   
##  14        2    7       0.40%   1.40%   0.47% 1.63%   
##  15        1    8       0.20%   1.60%   0.23% 1.86%   
##  16        1    9       0.20%   1.80%   0.23% 2.09%   
##  18        1    10      0.20%   2.00%   0.23% 2.33%   
##  21        1    11      0.20%   2.20%   0.23% 2.56%   
##  23        1    12      0.20%   2.40%   0.23% 2.79%   
##  24        1    13      0.20%   2.60%   0.23% 3.02%   
##  ...       ...  ...     ...     ...     ...   ...     
##  977       1    419     0.20%   83.80%  0.23% 97.44%  
##  979       2    421     0.40%   84.20%  0.47% 97.91%  
##  982       1    422     0.20%   84.40%  0.23% 98.14%  
##  984       1    423     0.20%   84.60%  0.23% 98.37%  
##  985       1    424     0.20%   84.80%  0.23% 98.60%  
##  988       2    426     0.40%   85.20%  0.47% 99.07%  
##  994       1    427     0.20%   85.40%  0.23% 99.30%  
##  996       1    428     0.20%   85.60%  0.23% 99.53%  
##  998       1    429     0.20%   85.80%  0.23% 99.77%  
##  1000      1    430     0.20%   86.00%  0.23% 100.00% 
##  Missing   70   500     14.00%  100.00%               
## ───────────────────────────────────────────────────────

Table X

Lastly, tableX() is a pipeable two-way version of table() with a syntax similar to that of the rest of the furniture functions.

df %>%
  tableX(c, ifelse(d > 500, 1, 0))
##            ifelse(d > 500, 1, 0)
## c           0   1   Missing Total
##   control   239 231 30      500  
##   treatment 217 213 70      500  
##   Total     456 444 100     1000

By default, it provides the total counts for the rows and columns with flexibility as to what is displayed and where.
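
For reference, base R can build a similar cross-tabulation with total margins (a sketch on toy vectors standing in for df$c and the derived indicator; tableX() additionally formats the Missing column as shown above):

```r
# Toy vectors standing in for df$c and df$d.
grp <- c(rep("control", 5), rep("treatment", 5))
val <- c(600, 200, NA, 700, 100, 300, NA, 900, 50, 400)

# Two-way counts, keeping NA as its own column, plus row/column totals.
addmargins(table(grp, ifelse(val > 500, 1, 0), useNA = "ifany"))
```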

Conclusion

The functions table1(), tableC(), tableF(), tableX(), and washer() add simplicity to cleaning up and understanding your data. Use these pieces of furniture to make your quantitative life a bit easier.