Splitting data frames into smaller subsets
I recently needed a way to break my large data frame into smaller subgroups for batch processing. Sounds easy enough. Thanks to Stack Overflow, I learned that there is the R command split(). Re-using the example from the first answer in the Stack Overflow post:
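A minimal sketch of that kind of call, assuming a 26-row data frame (the column names num, let and LET are illustrative) and a grouping vector that labels the first 13 rows 1 and the last 13 rows 2:

```r
# Illustrative 26-row data frame: one row per letter of the alphabet
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)

# Grouping vector (an assumption): 13 ones followed by 13 twos, one label per row
g <- rep(1:2, each = 13)

# split() returns a named list of data frames, one per distinct label
parts <- split(x, g)

names(parts)         # "1" "2"
sapply(parts, nrow)  # 13 rows in each group
```

Any vector with one label per row would work as the second argument; the labels only decide which sub-data-frame each row ends up in.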
The second argument of split() essentially assigns each row of x to one of two groups. Each group consists of 13 values. While this solution is great if you know exactly how to split your data (e.g. two groups of size 13), it's not as intuitive if you have thousands or millions of rows: how many items should each group contain? Say you want to split it into k equal groups; what if nrow(x) does not divide evenly by k? What happens to the remainder? These questions can be easily solved with some simple arithmetic. To save time, I have created a convenience function, split_k(), to do just that.
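As a quick illustration of the arithmetic involved, here is a small sketch using R's integer-division and modulo operators. The values n = 26 and k = 4 are chosen only for illustration:

```r
n <- 26  # number of rows, e.g. nrow(x)
k <- 4   # desired number of groups

size <- n %/% k  # integer division: rows per evenly sized group
rem  <- n %% k   # modulo: rows left over after filling k groups

size  # 6
rem   # 2
```

So four groups of 6 rows account for 24 rows, and 2 rows remain to be placed somewhere.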
The function split_k() takes as input the data frame x and the number of groups k. It then splits x into k evenly sized subsets. If nrow(x) does not divide evenly by k, then the remainder (computed by rem <- nrow(x) %% k) is added to the k-th group. Alternatively, the remainder can be assigned to its own group, k+1.
Using the same example as above, say we now want to split the data frame into 4 groups. Then, using split_k(), we would get:
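The function itself is not reproduced here, so the following is only a sketch of a helper with the behaviour described, not the original implementation. It assumes rows are assigned to groups in consecutive blocks, and own_group is a hypothetical argument added to show both ways of handling the remainder:

```r
# Sketch of a split_k()-style helper (an assumption, not the original code).
# own_group is a hypothetical argument: FALSE appends the remainder to
# group k, TRUE places it in a separate group k + 1.
split_k <- function(x, k, own_group = FALSE) {
  n <- nrow(x)
  stopifnot(k >= 1, k <= n)
  size <- n %/% k   # rows per evenly sized group
  rem  <- n %% k    # leftover rows
  # Label the first size * k rows 1..k in consecutive blocks
  g <- rep(seq_len(k), each = size)
  # Label the leftover rows k (default) or k + 1
  g <- c(g, rep(if (own_group) k + 1 else k, rem))
  split(x, g)
}

# Illustrative 26-row data frame
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)

sizes <- sapply(split_k(x, 4), nrow)
sizes      # groups 1-3 have 6 rows, group 4 has 6 + 2 = 8

sizes_own <- sapply(split_k(x, 4, own_group = TRUE), nrow)
sizes_own  # groups 1-4 have 6 rows, group 5 holds the 2 leftover rows
```

With 26 rows and k = 4, the last group is slightly larger by default. Whether that is preferable to a small fifth group depends on how sensitive the batch processing is to uneven batch sizes.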