# Create a character vector or file hash of a dataset and each variable

Source:`R/utils.R`

`hashDataset.Rd`

Given a `data.frame`

or `data.table`

, create a character vector
MD5 hash of the overall dataset and each variable. The goal of this is to create
a secure vector / text file that can be tracked using version control
(e.g., GitHub) without requiring commiting sensitive datasets.
The tracking will make it possible to evaluate whether two datasets are the
same, such as when sending data or when datasets may change over time
to know which variable(s) changed, if any.

## Arguments

- x
A

`data.frame`

or`data.table`

to be hashed.- file
An optional character string. If given, assumed to be the path/name of a file to write the character string hash out to, for convenience. When non missing, the character vector is returned invisibly and a file written. When missing (default), the character vector is returned directly.

## Value

A (possibly invisible) character vector. Also (optionally) a text file written version of the character string.

## Examples

```
hashDataset(mtcars)
#> [1] "dataset: 'mtcars', MD5: a63c70e73b58d0823ab3bcbd3b543d6f"
#> [2] "'mpg' (numeric), f8e0303e137d946eec8ee06e2f152f51, 95d3087aa4b1ccb7e1ef49ba4b6ed83b (sorted)"
#> [3] "'cyl' (numeric), 1b8fcb72575f2a120d77279ecf8851f9, ebfb767c78836025f7917bc845443804 (sorted)"
#> [4] "'disp' (numeric), 4743a496928535c54d157d145aa114e1, 46b67296541e9f9c4c4e2b70227f9673 (sorted)"
#> [5] "'hp' (numeric), c1d443aac7c7e1284e2d0a5a6cbe1a98, cd2e69df37f9f864e98780c5d588c95a (sorted)"
#> [6] "'drat' (numeric), 87a99a22f86d30cea74565dbad577790, f8bbcfc7e573cf1c72600683ac354d7a (sorted)"
#> [7] "'wt' (numeric), 2515679118871778eb21a32575973085, 67b94c293a66443abcb0427684b5ae28 (sorted)"
#> [8] "'qsec' (numeric), 8632d4fe83c9a91aa2c3f5069002c870, 6875ea7f4d4a3ec56ff6fc214b69f7ff (sorted)"
#> [9] "'vs' (numeric), 34b3720888fd730d1dbb5d01c5d57ae3, e5a86583130d2649356bc9dc6060787d (sorted)"
#> [10] "'am' (numeric), 9bc5bef75a08e8812da6f8542311aaba, 3ec14454a35c411d95eedb78c176c0b4 (sorted)"
#> [11] "'gear' (numeric), 1415993e090fa6fd2078f6d907deb29f, 1cd9d9b506d3170cfeb79a869b29211f (sorted)"
#> [12] "'carb' (numeric), 5b570f391814e8569d81dd5e2a9f30e0, 2acb8f5f747919d7befb9bba53dd6a68 (sorted)"
## if a file is specified it will write the results to the text file
## nicely formatted, along these lines
cat(hashDataset(cars), sep = "\n")
#> dataset: 'cars', MD5: f98a59010652c8e1ee062ed4c43f648e
#> 'speed' (numeric), 4eb3e01aee9abbc01e91d22b651be559, 4eb3e01aee9abbc01e91d22b651be559 (sorted)
#> 'dist' (numeric), be6c7701fccdd59b5e47c48881a2acae, 5cf4867b5bc8dc320c7aea4513d0f557 (sorted)
```