Abstract

This vignette is an introduction to the package
`groupdata2`

.

`groupdata2`

is a set of methods for easy grouping,
windowing, folding, partitioning, splitting and balancing of data.

For a more extensive description of `groupdata2`

, please
see Description of
groupdata2

Contact author at r-pkgs@ludvigolsen.dk

When working with data you sometimes want to divide it into groups and subgroups for processing or descriptive statistics. It can help reduce the amount of information, allowing you to compare measurements on different scales - e.g. income per year instead of per month.

`groupdata2`

is a set of tools for creating groups from
your data. It consists of six, easy to use, main functions, namely
`group_factor()`

, `group()`

, `splt()`

,
`partition()`

, `fold()`

, and
`balance()`

.

**group_factor**() is at the heart of it all. It creates
the groups and is used by the other functions. It returns a grouping
factor with group numbers, i.e. 1s for all elements in group 1, 2s for
group 2, etc. So if you ask it to create 2 groups from a
`vector`

`('Hans', 'Dorte', 'Mikkel', 'Leif')`

it
will return a factor `(1, 1, 2, 2)`

.

**group**() takes in either a `data frame`

or
`vector`

and returns a `data frame`

with a
grouping factor added to it. The `data frame`

is grouped by
the grouping factor (using `dplyr::group_by`

), which makes it
very easy to use in `dplyr`

/`magrittr`

pipelines.

If, for instance, you have a column in a `data frame`

with
quarterly measurements, and you would like to see the average
measurement per year, you can simply create groups with a size of 4, and
take the mean of each group, all within a 3-line pipeline.

**splt**() takes in either a `data frame`

or
`vector`

, creates a grouping factor, and splits the given
data by this factor using `base::split`

. Often it will be
faster to use `group()`

instead of `splt()`

. I
also find it easier to work with the output of `group()`

.

**partition**() creates (optionally) balanced partitions
(e.g. train/test sets) from given group sizes. It can balance partitions
on one categorical variable and/or one numerical variable. It is able to
keep all datapoints with a shared ID in the same partition.

**fold**() creates (optionally) balanced folds for
cross-validation. It can balance folds on one categorical variable
and/or one numerical variable. It is able to keep all datapoints with a
shared ID in the same fold.

**balance**() uses up- or downsampling to fix the size
of all groups to the min, max, mean, or median group size or to a
specific number of rows. Balancing can also happen on the ID level,
e.g. to ensure the same number of IDs in each category.

I came up with too many use cases to present them all neatly in one vignette. To give each example more space I instead aim to create vignettes for each of them. For now, these are the available vignettes dealing with each their topic:

Cross-validation with
groupdata2

In this vignette, we go through the basics of cross-validation, such as
creating balanced train/test sets with `partition()`

and
balanced folds with `fold()`

. We also write up a simple
cross-validation function and compare multiple linear regression
models.

Time series with
groupdata2

In this vignette, we divide up a time series into groups (windows) and
subgroups using `group()`

with the `greedy`

and
`staircase`

methods. We do some basic descriptive stats of
each group and use them to reduce the data size.

Automatic groups with
groupdata2

In this vignette, we will use the `l_starts`

method with
`group()`

to allow transferring of information from one
dataset to another. We will use the automatic grouping function that
finds group starts all by itself.

For a more extensive description of the features in
`groupdata2`

, see Description of groupdata2.

Well done, you made it to the end of this introduction to
`groupdata2`

! If you want to know more about the various
methods and arguments, you can read the Description of groupdata2.

If you have any questions or comments to this vignette (tutorial) or
`groupdata2`

, please send them to me at

r-pkgs@ludvigolsen.dk, or open an issue on the github
page https://github.com/LudvigOlsen/groupdata2 so I can make
improvements.