Stanford University researchers have demonstrated that forcing a black-box AI system to comply with data protection and privacy regulations is going to be onerous and potentially impossible. As outlined in a recently published paper, the process often requires completely retraining the neural network model itself, which is both expensive and time-consuming.
Consequently, the cost of complying with something like GDPR could be ruinous for an AI system that has already been deployed. If it transpired that the training data contained information that the developer was later instructed to delete, such as faces, voices, personal data, or health records, compliance might require a complete overhaul, starting from scratch.
The paper, authored by four Stanford academics, demonstrates two algorithms that have some success in making these deletions. The system checks whether the data in question has an effect on the model; if there is no impact, it can simply be deleted. For data that does feed into other parts of the model, the algorithms look to find the extent of its reach, so that only the affected parts need to be retrained.
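As a rough illustration of that idea, and not the paper's actual algorithms, here is a minimal sketch for a k-means-style model: if removing a point barely moves its cluster's centroid, the deletion is effectively free; otherwise only that one centroid is recomputed instead of refitting the whole model. The function name and the tolerance threshold are assumptions made for this example.

```python
import numpy as np

def delete_point(X, labels, centroids, idx, tol=1e-3):
    """Remove training point `idx` from a fitted k-means-style model.

    Illustrative sketch only: assumes each cluster keeps at least one
    member after deletion. If the point's removal barely moves its
    cluster's centroid (below `tol`), the stored model is unchanged and
    deletion is free; otherwise only that one centroid is recomputed,
    rather than re-running clustering on the full data set.
    """
    c = labels[idx]                       # cluster the point belongs to
    members = np.flatnonzero(labels == c)
    members = members[members != idx]     # remaining points in that cluster
    new_centroid = X[members].mean(axis=0)

    X = np.delete(X, idx, axis=0)
    labels = np.delete(labels, idx)

    if np.linalg.norm(new_centroid - centroids[c]) < tol:
        return X, labels, centroids       # no visible impact: nothing to retrain
    centroids = centroids.copy()
    centroids[c] = new_centroid           # retrain only the affected part
    return X, labels, centroids
```

The key design point is that the expensive path (recomputing a centroid) only touches the one cluster the deleted point belonged to, which is what makes the deletion cheap relative to a full refit.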
However, this approach is only suited to clustering models; for deep learning models, it is not going to work as easily. As models get more complex, the chance of any given data point being needed increases, and if some models feed into entirely new models, the knock-on effect from the original training data can be quite far-reaching.
To this end, developers are going to have to have a good long think about the quality of their training data. It’s not hard to imagine that there’s a business opportunity here, selling data that has been sanitized and properly checked for compliance with global legislative frameworks.
Retroactive compliance looks like it is going to be a real burden, so every step should be taken to avoid having to carry it out. However, tools like the ones described in the paper are definitely going to have a place going forward. What's more, concerns that data sets could be ticking time bombs might lead to even greater islanding between AI and ML developers, where lack of trust leads to working in greater isolation.
The abstract describes how the researchers built a framework for studying what to do when you are no longer allowed to deploy models derived from specific user data. The main concern is how to efficiently delete individual data points from a trained machine learning model, using an algorithmic process to avoid having to delete the model entirely and retrain it from scratch. The results are the two deletion algorithms, which the researchers say achieved an average of more than a 100x improvement in deletion efficiency across six data sets.
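To make that efficiency gap concrete, here is a toy sketch, emphatically not the paper's method: a trivially simple "model" (the mean of the training data) implemented twice, once where deletion means a full refit on the remaining data, and once where a maintained running sum lets a point be removed in constant time with an identical result. All class and attribute names are invented for illustration.

```python
import numpy as np

class RetrainingMean:
    """Naive baseline: 'deleting' a point means refitting on what's left."""
    def fit(self, X):
        self.X = np.asarray(X, dtype=float)
        self.mean_ = self.X.mean(axis=0)
    def delete(self, idx):
        self.X = np.delete(self.X, idx, axis=0)
        self.fit(self.X)                  # full retrain: O(n) per deletion

class DeletableMean:
    """Deletion-efficient version: keeps a running sum, so removing a
    point is O(1) yet yields exactly the same model as retraining."""
    def fit(self, X):
        self.X = np.asarray(X, dtype=float)
        self.sum_ = self.X.sum(axis=0)
        self.n_ = len(self.X)
        self.mean_ = self.sum_ / self.n_
    def delete(self, idx):
        self.sum_ -= self.X[idx]          # subtract the point's contribution
        self.n_ -= 1
        self.X = np.delete(self.X, idx, axis=0)
        self.mean_ = self.sum_ / self.n_  # identical to a full refit
```

The point of the sketch is the contract, not the model: an efficient deletion operation must leave the model exactly as if the point had never been trained on, while doing far less work than retraining.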
The inspiration for the work was an email from the UK Biobank, which the researchers describe as one of the most valuable collections of genetic and medical data records. It covers over half a million participants, and the UK Biobank had written to one of the researchers notifying them that they were no longer allowed to use the data of a specific individual who had withdrawn their consent.
This was a problem: as the researchers noted, thousands of machine learning classifiers have been trained using this data, resulting in thousands of published papers. Consequently, if this individual could find a bold (or dumb) enough lawyer, and prove that their data was still being used somewhere, there could be grounds for a lawsuit.
Now, whether a judge would take their side is another matter. It seems that there’s not much of a chance of proving damages from this sort of complaint, and arguing that you felt great emotional distress from being 0.0002% of a data set probably isn’t going to get you far. However, the law is the law, and if a company isn’t complying with regulations, then fines could be levied by regulators, rather than by individual citizens – a much higher chance of causing actual pain to these businesses, perhaps.
GDPR is the most pressing concern in this regard, particularly its Right to Be Forgotten provision. Legal academics have already argued that it may be impossible to comply with these regulations, and some model-inversion attacks have shown that it is possible to extract user data from a trained machine learning model – meaning that as long as the data is used in the model, there is a risk that it can be extracted with malicious intent.
The researchers point out that their tools could be used for other purposes, such as identifying key or valuable data points within a model, supporting a user data marketplace, or speeding up leave-one-out cross-validation, as well as simply for efficiency-driven data deletion. You might find that a lot of fat could be trimmed from larger models using such tools.
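The leave-one-out cross-validation point can be sketched with the same toy mean predictor as above: when single-point deletion is cheap, every held-out fit falls out of one pass over the data instead of n separate refits. The function name and the choice of error metric here are illustrative assumptions, not anything from the paper.

```python
import numpy as np

def loocv_mean_errors(X):
    """Leave-one-out errors for a mean predictor, in O(n) total time.

    Naively, each fold refits on n-1 points (O(n) each, O(n^2) overall);
    with an efficient 'delete one point' update, each held-out mean is
    derived from the full sum in constant time.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    total = X.sum(axis=0)
    loo_means = (total - X) / (n - 1)             # mean with row i removed
    return np.linalg.norm(X - loo_means, axis=1)  # error on each held-out point
```

Real models won't collapse to a closed form this neatly, but the principle carries over: the faster you can delete a point, the faster you can run any procedure that repeatedly asks "what would the model look like without this point?"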