
Data anonymization in R

Use cases

  • Public reports.
  • Public data sharing, e.g. R package download logs from CRAN's RStudio mirror (cran-logs.rstudio.com), with IP addresses masked.
  • Reports or data sharing for external vendor.
  • Development work can operate on anonymized PRODUCTION data.
    Manually or semi-manually populated data often brings new issues after migration to PRODUCTION data.
    Anonymized PRODUCTION data can therefore be quite handy for the devs.

Dependencies

suppressPackageStartupMessages({
  library(data.table)
  library(digest)
  library(knitr) # used only for post creation
})

Sample of survey data

Anonymize sensitive information in survey data stored in a single table.

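# pretty print
kable(head(SURV))

| City    | Postal Code | Address           | Name           | Sex | Age | Height | Weight | Score |
| ------- | ----------- | ----------------- | -------------- | --- | --- | ------ | ------ | ----- |
| London  | SW1H 0QW    | Silk Road 17      | John Lennon    | M   | 48  | 176    | 94     | 3     |
| Cardiff | CF23 9AE    | Queen Road 19     | Edward Snowden | M   | 55  | 185    | 74     | 2     |
| London  | SW1P 3BU    | Edinburgh Road 19 | John Kennedy   | M   | 46  | 156    | 84     | 1     |
| London  | SW1P 3BU    | Cardiff Road 21   | Mahatma Gandhi | M   | 56  | 186    | 54     | 5     |
| Cardiff | CF23 9AE    | King Road 10      | Nelson Mandela | M   | 61  | 181    | 84     | 2     |
| London  | SW1P 2EE    | Cardiff Road 23   | Vandana Shiva  | F   | 41  | 192    | 64     | 5     |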

Anonymize function

The function calculates hashes only for the unique inputs and returns a vector of masked inputs.
My version uses digest(x, algo="crc32") because its short hashes fit better into HTML tables; note that crc32 is not cryptographically secure.
Read ?digest::digest for the supported algorithms, and consider salting your input vector, e.g. x=paste0("prefix",x,"suffix").
A performance improvement is possible using Rcpp / C: digest #2.

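anonymize <- function(x, algo = "crc32") {
  # hash each unique value once; the result is named by the input values
  unq_hashes <- vapply(unique(x), function(object) digest(object, algo = algo),
                       FUN.VALUE = "", USE.NAMES = TRUE)
  # look up the hash of every element by name, then drop the names
  unname(unq_hashes[x])
}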
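The salting advice above can be sketched as a variant of the function. This is a minimal sketch, not the post's method: the name anonymize_salted, the salt argument, the sha256 default, and the use of as.character() and match() are all assumptions added for illustration. It relies only on digest() with its documented algo and serialize arguments.

```r
library(digest)

# Hypothetical variant of anonymize(): salted input, SHA-256 by default.
anonymize_salted <- function(x, salt, algo = "sha256") {
  x <- as.character(x)   # factors and numbers are hashed by their text value
  unq <- unique(x)
  # hash each distinct salted value once; serialize = FALSE hashes the raw string
  unq_hashes <- vapply(
    paste0(salt, unq),
    function(object) digest(object, algo = algo, serialize = FALSE),
    FUN.VALUE = "", USE.NAMES = FALSE
  )
  # map every element of x back to the hash of its distinct value
  unq_hashes[match(x, unq)]
}

# "my-secret-salt" is a placeholder; keep the real salt out of shared code
masked <- anonymize_salted(c("John Lennon", "Bob Marley", "John Lennon"),
                           salt = "my-secret-salt")
```

Equal inputs still receive equal hashes, so grouping and joins keep working, while someone without the salt cannot simply hash a list of candidate names and compare.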

Anonymize survey data

We will keep city and sex fields unmasked.

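# choose columns to mask
cols_to_mask <- c("name", "address", "postal_code")
# backup original data
SURV_ORG <- copy(SURV)
# anonymize; (cols_to_mask) := assigns to the columns named in the vector
# (this replaces the deprecated with=FALSE form)
SURV[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
# pretty print
kable(head(SURV))

| City    | Postal Code | Address  | Name     | Sex | Age | Height | Weight | Score |
| ------- | ----------- | -------- | -------- | --- | --- | ------ | ------ | ----- |
| London  | 913ad86c    | c26dc5a8 | a6ccb226 | M   | 48  | 176    | 94     | 3     |
| Cardiff | 921485db    | 58be1ead | 14404453 | M   | 55  | 185    | 74     | 2     |
| London  | 4c0d9ac8    | 7996c8e1 | 66dc3ad0 | M   | 46  | 156    | 84     | 1     |
| London  | 4c0d9ac8    | 1a5ecf8b | 44f84c46 | M   | 56  | 186    | 54     | 5     |
| Cardiff | 921485db    | b4dce820 | b3445a6d | M   | 61  | 181    | 84     | 2     |
| London  | 1f39765c    | f450aea7 | 56efd861 | F   | 41  | 192    | 64     | 5     |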

Why not just random data or integer sequence

When using the digest function to hide sensitive data you:

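  • keep the distribution of rows:
    aggregates by masked columns still match the aggregates on the original columns, see the simple grouping below:

SURV_ORG[, .(.N, mean_age = mean(age), mean_score = mean(score)), by = .(city, postal_code)][, kable(.SD)]

| City    | Postal Code | N | Mean Age | Mean Score |
| ------- | ----------- | - | -------- | ---------- |
| London  | SW1H 0QW    | 1 | 48.00    | 3.00       |
| Cardiff | CF23 9AE    | 3 | 65.33    | 2.33       |
| London  | SW1P 3BU    | 2 | 51.00    | 3.00       |
| London  | SW1P 2EE    | 2 | 36.50    | 3.50       |
| Glasgow | G40 3AS     | 1 | 53.00    | 2.00       |

SURV[, .(.N, mean_age = mean(age), mean_score = mean(score)), by = .(city, postal_code)][, kable(.SD)]

| City    | Postal Code | N | Mean Age | Mean Score |
| ------- | ----------- | - | -------- | ---------- |
| London  | 913ad86c    | 1 | 48.00    | 3.00       |
| Cardiff | 921485db    | 3 | 65.33    | 2.33       |
| London  | 4c0d9ac8    | 2 | 51.00    | 3.00       |
| London  | 1f39765c    | 2 | 36.50    | 3.50       |
| Glasgow | 90b79e54    | 1 | 53.00    | 2.00       |
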
  • keep relationships on equi joins:
    if t1.col1 == t2.col4 is TRUE, then digest(t1.col1) == digest(t2.col4) is also TRUE.
    An example follows in the next section.
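The equi join property can be checked on plain vectors. The sketch below is an illustration only: the vectors t1_col1 and t2_col4 hold made-up keys named after the columns in the bullet above, and match() stands in for the row pairing that a join performs.

```r
library(digest)

# anonymize() as defined in this post
anonymize <- function(x, algo = "crc32") {
  unq_hashes <- vapply(unique(x), function(object) digest(object, algo = algo),
                       FUN.VALUE = "", USE.NAMES = TRUE)
  unname(unq_hashes[x])
}

# toy join keys (made-up values)
t1_col1 <- c("CUST_1", "CUST_2", "CUST_3")
t2_col4 <- c("CUST_3", "CUST_1", "CUST_1", "CUST_7")

# position in t1 of the partner of each t2 row, before and after masking
match_before <- match(t2_col4, t1_col1)
match_after  <- match(anonymize(t2_col4), anonymize(t1_col1))

# identical positions mean an equi join pairs up the same rows
same_pairs <- identical(match_before, match_after)
```

The check holds as long as no two distinct values collide on the same hash. With a 32-bit checksum such as crc32, that risk grows with the number of distinct values, which is one more reason to pick a longer hash for large tables.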

Sample of sales data

Anonymize relational data: sales data normalized into SALES and CUSTOMER tables.

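kable(head(SALES, 4))

| Customer Uid | Product Name | Transaction Date | Quantity | Value |
| ------------ | ------------ | ---------------- | -------- | ----- |
| CUST_3       | rgr          | 2014-10-28       | 34       | 612   |
| CUST_4       | jfc          | 2014-10-13       | 42       | 588   |
| CUST_6       | hnm          | 2014-11-06       | 40       | 200   |
| CUST_9       | zgm          | 2014-11-04       | 40       | 760   |
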
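kable(head(CUSTOMER, 2))

| Customer Uid | City    | Postal Code | Address       | Name           | Sex |
| ------------ | ------- | ----------- | ------------- | -------------- | --- |
| CUST_1       | London  | SW1H 0QW    | Silk Road 17  | John Lennon    | M   |
| CUST_2       | Cardiff | CF23 9AE    | Queen Road 19 | Edward Snowden | M   |
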
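# join
kable(head(CUSTOMER[SALES]))

| Customer Uid | City    | Postal Code | Address           | Name           | Sex | Product Name | Transaction Date | Quantity | Value |
| ------------ | ------- | ----------- | ----------------- | -------------- | --- | ------------ | ---------------- | -------- | ----- |
| CUST_3       | London  | SW1P 3BU    | Edinburgh Road 19 | John Kennedy   | M   | rgr          | 2014-10-28       | 34       | 612   |
| CUST_4       | London  | SW1P 3BU    | Cardiff Road 21   | Mahatma Gandhi | M   | jfc          | 2014-10-13       | 42       | 588   |
| CUST_6       | London  | SW1P 2EE    | Cardiff Road 23   | Vandana Shiva  | F   | hnm          | 2014-11-06       | 40       | 200   |
| CUST_9       | Glasgow | G40 3AS     | Simple Road 11    | Bob Marley     | M   | zgm          | 2014-11-04       | 40       | 760   |
| CUST_2       | Cardiff | CF23 9AE    | Queen Road 19     | Edward Snowden | M   | qej          | 2014-11-06       | 29       | 493   |
| CUST_9       | Glasgow | G40 3AS     | Simple Road 11    | Bob Marley     | M   | fnz          | 2014-10-30       | 59       | 649   |
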
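# join and aggregate
kable(head(CUSTOMER[SALES][, .(quantity = sum(quantity), value = sum(value)), by = .(city, postal_code)]))

| City    | Postal Code | Quantity | Value |
| ------- | ----------- | -------- | ----- |
| London  | SW1P 3BU    | 845      | 10783 |
| London  | SW1P 2EE    | 729      | 9732  |
| Glasgow | G40 3AS     | 376      | 4887  |
| Cardiff | CF23 9AE    | 981      | 12983 |
| London  | SW1H 0QW    | 329      | 4099  |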

Anonymize sales data

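# mask the join key in SALES
SALES[, customer_uid := anonymize(customer_uid)]
# mask the join key and the sensitive columns in CUSTOMER
cols_to_mask <- c("customer_uid", "name", "address", "postal_code")
CUSTOMER[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
# re-key on the masked join column
setkey(CUSTOMER, customer_uid)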
# preview result
kable(head(CUSTOMER,2))
Customer UidCityPostal CodeAddressNameSex
4a7d777Cardiff921485dba759d95b51a2e5cF
73a0e7e1Glasgow90b79e54a87087517c739cd6M
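kable(head(SALES, 2))

| Customer Uid | Product Name | Transaction Date | Quantity | Value |
| ------------ | ------------ | ---------------- | -------- | ----- |
| 93750eff     | rgr          | 2014-10-28       | 34       | 612   |
| d119b5c      | jfc          | 2014-10-13       | 42       | 588   |
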
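# datasets will still join correctly even on masked columns
kable(head(CUSTOMER[SALES]))

| Customer Uid | City    | Postal Code | Address  | Name     | Sex | Product Name | Transaction Date | Quantity | Value |
| ------------ | ------- | ----------- | -------- | -------- | --- | ------------ | ---------------- | -------- | ----- |
| 93750eff     | London  | 4c0d9ac8    | 7996c8e1 | 66dc3ad0 | M   | rgr          | 2014-10-28       | 34       | 612   |
| d119b5c      | London  | 4c0d9ac8    | 1a5ecf8b | 44f84c46 | M   | jfc          | 2014-10-13       | 42       | 588   |
| e31ffa70     | London  | 1f39765c    | f450aea7 | 56efd861 | F   | hnm          | 2014-11-06       | 40       | 200   |
| 73a0e7e1     | Glasgow | 90b79e54    | a8708751 | 7c739cd6 | M   | zgm          | 2014-11-04       | 40       | 760   |
| e4723e69     | Cardiff | 921485db    | 58be1ead | 14404453 | M   | qej          | 2014-11-06       | 29       | 493   |
| 73a0e7e1     | Glasgow | 90b79e54    | a8708751 | 7c739cd6 | M   | fnz          | 2014-10-30       | 59       | 649   |
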
# the aggregates on masked columns also match the originals
kable(head(CUSTOMER[SALES][, .(quantity = sum(quantity), value = sum(value)), by = .(city, postal_code)]))

| City    | Postal Code | Quantity | Value |
| ------- | ----------- | -------- | ----- |
| London  | 4c0d9ac8    | 845      | 10783 |
| London  | 1f39765c    | 729      | 9732  |
| Glasgow | 90b79e54    | 376      | 4887  |
| Cardiff | 921485db    | 981      | 12983 |
| London  | 913ad86c    | 329      | 4099  |
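The comparison of masked and original aggregates can also be done programmatically rather than by eye. The following is a minimal sketch on made-up toy tables (customer, sales, and the helper agg are hypothetical names, not the post's data); it uses the on= join argument, which assumes a reasonably recent data.table.

```r
library(data.table)
library(digest)

# anonymize() as defined in this post
anonymize <- function(x, algo = "crc32") {
  unq_hashes <- vapply(unique(x), function(object) digest(object, algo = algo),
                       FUN.VALUE = "", USE.NAMES = TRUE)
  unname(unq_hashes[x])
}

# toy stand-ins for CUSTOMER and SALES (made-up rows)
customer <- data.table(
  customer_uid = c("CUST_1", "CUST_2", "CUST_3"),
  city         = c("London", "Cardiff", "London"),
  postal_code  = c("SW1H 0QW", "CF23 9AE", "SW1H 0QW")
)
sales <- data.table(
  customer_uid = c("CUST_1", "CUST_3", "CUST_2", "CUST_1"),
  quantity     = c(10L, 20L, 30L, 40L),
  value        = c(100, 200, 300, 400)
)

# join and aggregate: totals per (city, postal_code)
agg <- function(cust, sal) {
  cust[sal, on = "customer_uid"][
    , .(quantity = sum(quantity), value = sum(value)), by = .(city, postal_code)]
}

agg_org <- agg(customer, sales)

# mask copies so the originals stay available for comparison
customer_m <- copy(customer)
sales_m <- copy(sales)
cols_to_mask <- c("customer_uid", "postal_code")
customer_m[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
sales_m[, customer_uid := anonymize(customer_uid)]

agg_masked <- agg(customer_m, sales_m)

# groups appear in the row order of sales in both cases,
# so the unmasked columns can be compared position by position
totals_match <- isTRUE(all.equal(agg_org[, .(city, quantity, value)],
                                 agg_masked[, .(city, quantity, value)]))
```

A check like this can run as a quick sanity test after anonymization, before the masked tables are shared.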

Reproduce from Rmd

The script used to produce this post is available in the GitHub repo (link in the page footer) as an Rmd file. It can easily be reproduced locally in R (requires knitr or rmarkdown) in any format (md, html, pdf, docx).

# html output
rmarkdown::render("2014-11-07-Data-Anonymization-in-R.Rmd", rmarkdown::html_document())
# markdown file used as current post
knitr::knit("2014-11-07-Data-Anonymization-in-R.Rmd")

Minimal script

Minimal script example on survey data, with SURV_ORG as a data.table:

library(data.table)
library(digest)

anonymize <- function(x, algo = "crc32") {
  unq_hashes <- vapply(unique(x), function(object) digest(object, algo = algo),
                       FUN.VALUE = "", USE.NAMES = TRUE)
  unname(unq_hashes[x])
}
cols_to_mask <- c("name", "address", "postal_code")
SURV_ORG[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask][]
