Use cases
- Public reports.
- Public data sharing, e.g. R packages download logs from CRAN's RStudio mirror - cran-logs.rstudio.com - mask ip addresses.
- Reports or data sharing for external vendor.
- Development works can operate on anonymized PRODUCTION data.
Manually or semi-manually populated data can often brings some new issue after migration to PRODUCTION data.
Such anonymized PRODUCTION data can be quite handy for the devs.
Dependencies
suppressPackageStartupMessages({library(data.table)library(digest)library(knitr)# used only for post creation
})
Sample of survey data
Anonymize sensitive information in survey data, data storage in a single table.
# pretty print
kable(head(SURV))
City | Postal Code | Address | Name | Sex | Age | Height | Weight | Score |
---|
London | SW1H 0QW | Silk Road 17 | John Lennon | M | 48 | 176 | 94 | 3 |
Cardiff | CF23 9AE | Queen Road 19 | Edward Snowden | M | 55 | 185 | 74 | 2 |
London | SW1P 3BU | Edinburgh Road 19 | John Kennedy | M | 46 | 156 | 84 | 1 |
London | SW1P 3BU | Cardiff Road 21 | Mahatma Gandhi | M | 56 | 186 | 54 | 5 |
Cardiff | CF23 9AE | King Road 10 | Nelson Mandela | M | 61 | 181 | 84 | 2 |
London | SW1P 2EE | Cardiff Road 23 | Vandana Shiva | F | 41 | 192 | 64 | 5 |
Anonymize function
Function will calculate hashes only for unique inputs and return vector of masked inputs.
My version will use digest(x, algo="crc32")
because it fits better into html tables, algo crc32
is not really secure.
Read ?digest::digest
for supported algo
, also consider to salt your input vector, e.g. x=paste0("prefix",x,"suffix")
.
Performance improvement possible using Rcpp
/ C
: digest #2.
anonymize<-function(x,algo="crc32"){unq_hashes<-vapply(unique(x),function(object)digest(object,algo=algo),FUN.VALUE="",USE.NAMES=TRUE)unname(unq_hashes[x])}
Anonymize survey data
We will keep city and sex fields unmasked.
# choose columns to mask
cols_to_mask<-c("name","address","postal_code")# backup original data
SURV_ORG<-copy(SURV)# anonymize
SURV[,cols_to_mask:=lapply(.SD,anonymize),.SDcols=cols_to_mask,with=FALSE]# pretty print
kable(head(SURV))
City | Postal Code | Address | Name | Sex | Age | Height | Weight | Score |
---|
London | 913ad86c | c26dc5a8 | a6ccb226 | M | 48 | 176 | 94 | 3 |
Cardiff | 921485db | 58be1ead | 14404453 | M | 55 | 185 | 74 | 2 |
London | 4c0d9ac8 | 7996c8e1 | 66dc3ad0 | M | 46 | 156 | 84 | 1 |
London | 4c0d9ac8 | 1a5ecf8b | 44f84c46 | M | 56 | 186 | 54 | 5 |
Cardiff | 921485db | b4dce820 | b3445a6d | M | 61 | 181 | 84 | 2 |
London | 1f39765c | f450aea7 | 56efd861 | F | 41 | 192 | 64 | 5 |
Why not just random data or integer sequence
When using the digest
function to hide sensitive data you:
- keep rows distribution:
aggregates by masked columns will still match to aggregates on original columns, see simple grouping below:
SURV_ORG[,.(.N,mean_age=mean(age),mean_score=mean(score)),by=.(city,postal_code)][,kable(.SD)]
City | Postal Code | N | Mean Age | Mean Score |
---|
London | SW1H 0QW | 1 | 48.00 | 3.00 |
Cardiff | CF23 9AE | 3 | 65.33 | 2.33 |
London | SW1P 3BU | 2 | 51.00 | 3.00 |
London | SW1P 2EE | 2 | 36.50 | 3.50 |
Glasgow | G40 3AS | 1 | 53.00 | 2.00 |
SURV[,.(.N,mean_age=mean(age),mean_score=mean(score)),by=.(city,postal_code)][,kable(.SD)]
City | Postal Code | N | Mean Age | Mean Score |
---|
London | 913ad86c | 1 | 48.00 | 3.00 |
Cardiff | 921485db | 3 | 65.33 | 2.33 |
London | 4c0d9ac8 | 2 | 51.00 | 3.00 |
London | 1f39765c | 2 | 36.50 | 3.50 |
Glasgow | 90b79e54 | 1 | 53.00 | 2.00 |
- keep relationships on equi joins:
if t1.col1 == t2.col4
TRUE then also digest(t1.col1) == digest(t2.col4)
TRUE
Example in next section below.
Sample of sales data
Anonymize relational data in sales data, data normalized into SALES and CUSTOMER tables.
Customer Uid | Product Name | Transaction Date | Quantity | Value |
---|
CUST_3 | rgr | 2014-10-28 | 34 | 612 |
CUST_4 | jfc | 2014-10-13 | 42 | 588 |
CUST_6 | hnm | 2014-11-06 | 40 | 200 |
CUST_9 | zgm | 2014-11-04 | 40 | 760 |
Customer Uid | City | Postal Code | Address | Name | Sex |
---|
CUST_1 | London | SW1H 0QW | Silk Road 17 | John Lennon | M |
CUST_2 | Cardiff | CF23 9AE | Queen Road 19 | Edward Snowden | M |
# join
kable(head(CUSTOMER[SALES]))
Customer Uid | City | Postal Code | Address | Name | Sex | Product Name | Transaction Date | Quantity | Value |
---|
CUST_3 | London | SW1P 3BU | Edinburgh Road 19 | John Kennedy | M | rgr | 2014-10-28 | 34 | 612 |
CUST_4 | London | SW1P 3BU | Cardiff Road 21 | Mahatma Gandhi | M | jfc | 2014-10-13 | 42 | 588 |
CUST_6 | London | SW1P 2EE | Cardiff Road 23 | Vandana Shiva | F | hnm | 2014-11-06 | 40 | 200 |
CUST_9 | Glasgow | G40 3AS | Simple Road 11 | Bob Marley | M | zgm | 2014-11-04 | 40 | 760 |
CUST_2 | Cardiff | CF23 9AE | Queen Road 19 | Edward Snowden | M | qej | 2014-11-06 | 29 | 493 |
CUST_9 | Glasgow | G40 3AS | Simple Road 11 | Bob Marley | M | fnz | 2014-10-30 | 59 | 649 |
# join and aggregate
kable(head(CUSTOMER[SALES][,.(quantity=sum(quantity),value=sum(value)),by=.(city,postal_code)]))
City | Postal Code | Quantity | Value |
---|
London | SW1P 3BU | 845 | 10783 |
London | SW1P 2EE | 729 | 9732 |
Glasgow | G40 3AS | 376 | 4887 |
Cardiff | CF23 9AE | 981 | 12983 |
London | SW1H 0QW | 329 | 4099 |
Anonymize sales data
SALES[,customer_uid:=anonymize(customer_uid)]cols_to_mask<-c("customer_uid","name","address","postal_code")CUSTOMER[,cols_to_mask:=lapply(.SD,anonymize),.SDcols=cols_to_mask,with=FALSE]setkey(CUSTOMER,customer_uid)
# preview result
kable(head(CUSTOMER,2))
Customer Uid | City | Postal Code | Address | Name | Sex |
---|
4a7d777 | Cardiff | 921485db | a759d95 | b51a2e5c | F |
73a0e7e1 | Glasgow | 90b79e54 | a8708751 | 7c739cd6 | M |
Customer Uid | Product Name | Transaction Date | Quantity | Value |
---|
93750eff | rgr | 2014-10-28 | 34 | 612 |
d119b5c | jfc | 2014-10-13 | 42 | 588 |
# datasets will still join correctly even on masked columns
kable(head(CUSTOMER[SALES]))
Customer Uid | City | Postal Code | Address | Name | Sex | Product Name | Transaction Date | Quantity | Value |
---|
93750eff | London | 4c0d9ac8 | 7996c8e1 | 66dc3ad0 | M | rgr | 2014-10-28 | 34 | 612 |
d119b5c | London | 4c0d9ac8 | 1a5ecf8b | 44f84c46 | M | jfc | 2014-10-13 | 42 | 588 |
e31ffa70 | London | 1f39765c | f450aea7 | 56efd861 | F | hnm | 2014-11-06 | 40 | 200 |
73a0e7e1 | Glasgow | 90b79e54 | a8708751 | 7c739cd6 | M | zgm | 2014-11-04 | 40 | 760 |
e4723e69 | Cardiff | 921485db | 58be1ead | 14404453 | M | qej | 2014-11-06 | 29 | 493 |
73a0e7e1 | Glasgow | 90b79e54 | a8708751 | 7c739cd6 | M | fnz | 2014-10-30 | 59 | 649 |
# also the aggregates on masked columns will match to the origin
kable(head(CUSTOMER[SALES][,.(quantity=sum(quantity),value=sum(value)),by=.(city,postal_code)]))
City | Postal Code | Quantity | Value |
---|
London | 4c0d9ac8 | 845 | 10783 |
London | 1f39765c | 729 | 9732 |
Glasgow | 90b79e54 | 376 | 4887 |
Cardiff | 921485db | 981 | 12983 |
London | 913ad86c | 329 | 4099 |
Reproduce from Rmd
Script used to produce this post is available in the github repo (link in the page footer) as Rmd
file and can be easily reproduced locally in R (required knitr or rmarkdown) to any format (md
, html
, pdf
, docx
).
# html output
rmarkdown::render("2014-11-07-Data-Anonymization-in-R.Rmd",html_document())# markdown file used as current post
knitr::knit("2014-11-07-Data-Anonymization-in-R.Rmd")
Minimal script
Minimal script example on survey data as SURV_ORG
data.table:
anonymize<-function(x,algo="crc32"){unq_hashes<-vapply(unique(x),function(object)digest(object,algo=algo),FUN.VALUE="",USE.NAMES=TRUE)unname(unq_hashes[x])}cols_to_mask<-c("name","address","postal_code")SURV_ORG[,cols_to_mask:=lapply(.SD,anonymize),.SDcols=cols_to_mask,with=FALSE][]