New AI tool ensures anonymous COVID-19 data remains secure and private


A new data privacy tool has been developed through a collaboration between CSIRO’s Data61, the digital specialist arm of Australia’s national science agency, the NSW Government, the Australian Computer Society (ACS) and several other groups. It helps ensure key datasets – such as those tracking COVID-19 – can be publicly shared with an extra layer of security for sensitive personal information. The privacy tool assesses the risks to an individual’s data within any dataset, allowing targeted and effective protection mechanisms to be put in place.

Traditionally, such assessments are undertaken by leading data and privacy experts, who can now rely on computer models to validate this work.

Since 2020, CSIRO has explored ways of enhancing the tool in collaboration with the Cyber Security Cooperative Research Centre (CSCRC). 

Known as the Personal Information Factor (PIF) tool, the software uses a sophisticated data analytics algorithm to identify the risk that sensitive, de-identified and personal information within a dataset can be re-identified and matched to its owner.

An early version of the tool is already being used by the NSW Government to analyse datasets tracking the spread of COVID-19 in the state since March 2020 and to apply appropriate levels of protection before this data is released as open data.

Dr Ian Oppermann is the NSW Government’s Chief Data Scientist.

“There’s no other piece of software like the PIF tool,” Dr Oppermann said.

“It was developed through a long and very collaborative process involving many state, Commonwealth and industry colleagues. CSIRO’s Data61 really brought it to life and made it useable.

“Every day, it helps us analyse the security and privacy risks of releasing de-identified datasets of people infected with COVID-19 in NSW and the testing cases for COVID-19, allowing us to minimise the re-identification risk before releasing to the public.”

Dr Oppermann said COVID-19 had amplified public awareness of the need for data privacy.

“Given the very strong community interest in growing COVID-19 cases, we needed to release critical and timely information at a fine-grained level detailing when and where COVID-19 cases were identified,” Dr Oppermann said. 

“This also included information such as the likely cause of infection and, earlier in the pandemic, the age range of people confirmed to be infected.

“We wanted the data to be as detailed and granular as possible, but we also needed to protect the privacy and identity of the individuals associated with those datasets.” 

Project lead researcher and Senior Research Scientist at CSIRO’s Data61, Dr Sushmita Ruj, said new methods of data de-identification can provide enhanced levels of data privacy and ensure data involving personal information is protected.

“Having studied other privacy metrics, the team concluded a one-size-fits-all approach to estimating the re-identification risks of unique applications of data can be significantly improved upon,” Dr Ruj said. 

“The evolving approach to a PIF takes a tailored approach to each dataset by considering various attack scenarios used to re-identify information. The tool then assigns a PIF score to each set.”

If the PIF is higher than a desired threshold, the program makes recommendations on how to design a more secure and safe framework to certify the dataset is safe to be publicly released.
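
The score-and-threshold workflow described above can be illustrated with a minimal sketch. This is not the PIF algorithm, which is not specified here. It assumes a much simpler stand-in metric (the worst-case chance of singling out a record, taken as 1 divided by the number of records sharing the same quasi-identifier values), and the function names, field names, threshold and sample records are all hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Stand-in risk score (NOT the PIF metric): the highest per-record
    risk, where each record's risk is 1 / (number of records sharing
    its combination of quasi-identifier values)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return max(1.0 / class_sizes[k] for k in keys)


def assess(records, quasi_identifiers, threshold):
    """Compare the score for one attack scenario (a chosen set of
    quasi-identifiers) with a release threshold and return advice."""
    score = reidentification_risk(records, quasi_identifiers)
    if score > threshold:
        return score, "review: generalise or suppress fields before release"
    return score, "within threshold for this scenario"


# Made-up illustrative records, not real case data.
records = [
    {"age_range": "30-39", "postcode": "2000", "source": "local"},
    {"age_range": "30-39", "postcode": "2000", "source": "overseas"},
    {"age_range": "70-79", "postcode": "2880", "source": "local"},
]

score, advice = assess(records, ["age_range", "postcode"], threshold=0.5)
print(score, advice)
```

In this sketch the third record is unique on age range and postcode, so the score exceeds the threshold and the dataset is flagged for further protection, mirroring the recommend-before-release step.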

The CSCRC’s Research Director, Professor Helge Janicke, said privacy must be protected while balancing the need to share information.

“With PIF, you have a scale on which you can understand the risk, and that is something other tools don’t provide,” Professor Janicke said.

“Data analysis is well understood but how good the output is once shared is very difficult to understand. 

“Hence, the metrics-based approach and analysis that underpins PIF is hugely valuable in achieving the ethical and responsible sharing of critical data, with this technology allowing data owners to fully assess the risks and residual impacts associated with data sharing.”

The PIF tool is also being used to examine other data sets before public release in areas such as domestic violence data collected during the COVID-19 lockdown and public transport usage. 

The tool will continue to be developed by CSIRO’s Data61 and the CSCRC and is expected to be made available for wider public use by June 2022. CSIRO would like to acknowledge and thank the Government of New South Wales, the Government of Western Australia and the Australian Computer Society (ACS) for providing datasets needed to test PIF and supporting the research, along with our partners in advancing the Cyber Security Cooperative Research Centre.