Digital Provenance – what it is and why it’s important to you…


Many will be familiar with the idea of provenance in art; the provenance is the documentation that authenticates a particular piece of art. Digital provenance is the same for digital objects.

With AI performing many tasks behind the scenes in the digital transactions and experiences we all perform daily, the question of digital provenance is an important one.

If the insurance premium for a car goes up based on some predictive model of traffic data run by your insurance company, you may want to understand why.

Digital provenance would ensure that both developers and users of AI-based systems understand how the algorithm made that decision. 

A group of researchers at Victoria University of Wellington and the University of Canterbury are currently looking into the issue as part of a Science for Technological Innovation National Science Challenge project on veracity technology that aims to ensure integrity of data and products.

“Who wouldn’t want to be confident that their personal data are safe and used in a meaningful way by DHBs when tracking medical treatments? Or why would banks purchase software if they cannot be certain that the software meets regulatory requirements of the financial sector,” the authors write in a recent post on the ITP techblog.

The team wants feedback (all responses are anonymous) from those in the tech sector via a short survey – available here.

I reached out to Jens Dietrich, an associate professor in the School of Engineering and Computer Science at Victoria University of Wellington, and Matthias Galster, an associate professor in the Department of Computer Science and Software Engineering at the University of Canterbury, to find out more.

Firstly, can you tell our readers, in simple terms, what Digital Provenance is and why it should be important to them?

Digital provenance is about the truthfulness, trustworthiness, and authenticity of actions and interactions in software, and of digital products, services and data. To give an example, in ransomware attacks data is stolen, copied or manipulated. This not only damages trust in a business, but also has severe negative consequences for those whose data is affected. Similarly, social media sites may use photos of their community to develop algorithms to classify our mood based on facial expressions. This may happen without us even knowing. 

Therefore, understanding how software works and explaining to users why software works in a certain way can build trust among current and future users (and customers). Furthermore, it allows us to increase transparency during the development and use of digital products and services, and to “bake in” ethical and broader human and societal values. For example, we can provide “proof” to users that their data is secure and not used for a purpose they did not agree to.

Finally, tools and practices that help developers implement and demonstrate provenance can increase the efficiency and effectiveness of developers. For example, a developer may be interested in ensuring that their software does not use any malicious third-party components that compromise the security of their app. 

You say “we need explainable applications and services, no matter whether they use AI or “old-fashioned” procedural logic.” As AI develops at pace are we in danger of losing control of accountability?

AI is increasingly used across a full range of applications, from decisions about home loans and insurance policies to whether a person gets hired. In the US, courts have even been using AI to decide on the sentence of criminals based on a risk score produced by an AI-based system. How exactly an insurance premium, hiring decision or risk score comes about is often not understood by those who make decisions (insurance brokers, hiring managers, judges) or by those who have to live with a decision (insurance customers, applicants, the convicted); see for example discussions around the risks of AI in law [https://www.americanbar.org/groups/judicial/publications/judges_journal/2021/winter/artificial-intelligence-benefits-and-unknown-risks/, https://www.theatlantic.com/ideas/archive/2019/06/should-we-be-afraid-of-ai-in-the-criminal-justice-system/592084/]. In New Zealand, the risk of using AI in law has also been discussed by the New Zealand Law Foundation’s “Artificial Intelligence and Law in New Zealand” Project.

There is a significant risk that humans are not part of the decision process anymore and instead “delegate” the responsibility to the system which makes the decision. Developers of these systems on the other hand may delegate responsibility to those who use their software to make recommendations. Therefore, provenance is critically important to help developers and users of AI-based systems understand which algorithm made a decision, how the algorithm made a decision and why. 

AI is now rolled into many enterprise applications that we use every day as consumers – banking, insurance etc. – how would folding in digital provenance assist the end user?

Folding in digital provenance would allow end users to understand what a system does and why. It would “explain” to users what data the system processes, that the processed data is trustworthy, and how exactly that data is used (imagine a system that explains why an insurance premium is what it is and what data the premium is calculated from). However, provenance information may not always be immediately relevant to the end user; it may instead help engineers build and maintain systems that are safe for end users to use (think for example of provenance mechanisms in flight control software).

Therefore, we also see provenance as a foundation for building contestability into digital products and services. Once a user can understand why software makes a certain decision, protocols can be developed to allow users to object to and challenge decisions, for example, objecting to data gathered in an app being used to display ads in a browser, or to metadata about ethnicity being used in calculating an insurance premium or determining a sentence.

Ultimately, provenance leads to more trust in digital products and services. This will benefit businesses. Product and service ratings, which influence customers’ purchasing decisions, are a good example. Trust in those reviews is being eroded because the scores can be easily manipulated, as shown in the famous “Shed at Dulwich” story. This hurts businesses with genuine customer feedback.

How does blockchain figure in this discussion – that’s a pretty undeniable form of digital provenance isn’t it?

Blockchain is an umbrella term applied to a number of different technologies, but in general those technologies have a common goal of forming a distributed and immutable record of events over time. Blockchains can definitely be useful in improving management of digital provenance, but they are not a panacea in and of themselves. If we consider provenance as inferred metadata, then there are use cases for archiving this in a secure way. This is where blockchains could also play a role. Also, where the recorded events represent transactions on a system contained within the blockchain, such as a cryptocurrency, then blockchain typically provides good digital provenance. However, if these recorded events are observations of real-world interactions, they may well not capture all the information required to establish provenance. In some situations aspects of the provenance may be subjective. 

Furthermore, blockchain may help ensure the integrity of data records once those records are in the system. We still need mechanisms in place to ensure that data records that go into a system are trustworthy. Once in the system, compromised data cannot be corrected due to the immutability of the blockchain. Also, blockchain-based solutions assume that whatever entity is governing a blockchain is trustworthy and prevents bad actors from intruding into the system (malicious smart contracts, etc.).

Finally, blockchains may not be practically feasible in all cases, e.g., when enhancing existing systems with provenance mechanisms, or when working in domains where scalability and energy consumption are key drivers.
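To make the integrity point above concrete, here is a minimal sketch of a hash-chained, append-only log, which is the core idea behind the "immutable record" that blockchains provide. It is an illustration only, not the researchers' tooling and not a real blockchain (there is no distribution or consensus); the function names and the example events are hypothetical.

```python
import hashlib
import json


def record_hash(payload: dict, prev_hash: str) -> str:
    """Hash a record together with the hash of the record before it."""
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> None:
    """Append a record that is linked to the current tail of the log."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    log.append({"payload": payload, "prev": prev_hash,
                "hash": record_hash(payload, prev_hash)})


def verify(log: list) -> bool:
    """Return True only if no record was altered after it was appended."""
    prev_hash = "0" * 64
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != record_hash(entry["payload"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical events, for illustration only.
log = []
append(log, {"event": "premium_calculated", "amount": 640})
append(log, {"event": "premium_disputed", "by": "customer-17"})
intact_before = verify(log)

log[0]["payload"]["amount"] = 320  # tamper with an earlier record
intact_after = verify(log)
```

Note what the sketch does not do: if the amount had been wrong when it was first appended, `verify` would still pass. The chain detects later tampering, but it cannot vouch for the trustworthiness of what was recorded in the first place.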

This seems a particularly timely discussion in light of the upcoming Vaccine Passport app, which some won’t trust because of privacy concerns. What’s your view?

The Vaccine Passport app could be a real-world use case for built-in provenance. Other examples could be driver licenses or passports. These examples also illustrate the broader impact and relevance of built-in provenance for end users: ensuring provenance is not only about technical implementation details, but also about the trustworthiness of these apps. Gaining the trust of users via built-in provenance mechanisms, and presenting provenance information in a meaningful way to users and to those verifying the passports and licenses, will increase the use of the apps.

How is digital provenance related to veracity technology – or is it the same thing?

Generally speaking, veracity is broader. Simply speaking (and taking a bit of a philosophical stance), veracity is a sense of truth. Or, as Markus Luczak-Roesch, the science lead of the veracity technology project [https://www.sftichallenge.govt.nz/our-research/projects/spearhead/veracity-technology/], explains, “Veracity is about ensuring the integrity of data or products as it moves into different spaces.”

In our context we look at veracity in terms of transactions of products, services or data, and the assurance that everything that those involved in producing, delivering and consuming that product, service or data need to know is trustworthy, truthful and authentic. Digital provenance on the other hand focuses on actual mechanisms for achieving veracity in digital products and services. For example, how can the manufacturer of a banking app show regulators in Australia that their app meets privacy standards, and how can the Australian regulator be sure that the provenance data provided by the manufacturer is trustworthy and authentic?

Many app developers and engineers already roll into their software some kind of digital provenance – is more needed?

While this might be true for regulated domains such as health care and banking (where regulators enforce certain provenance requirements), our initial industry engagement (involving CTO/CEO level representatives) has shown that many companies and developers only use basic provenance tools and practices. We have also found that many provenance practices are manual, for instance in the form of source code and process reviews or security audits, or by restricting access to software components so that they are only available to certified “digital service providers”. There is little automation, and provenance is considered an overhead.

Furthermore, the “provenance needs” of developers may vary and these needs may require different provenance mechanisms. We have found that some companies require provenance for the whole software supply chain, while some focus on security or data sovereignty. Others have provenance concerns during development (for instance, that no developer should see actual customer data). Some companies have specific requirements that arise from regulations and industry standards, such as GDPR or FDA regulations. How can we support these?

Finally, provenance may be a concern during development and impact how developers work (but not what end users experience). This may result in constraints on development processes and how data are used during development.

You mention the “why am I seeing this ad” feature from Google or Facebook as an example of provenance making its way to the end user – but what else can Big Tech do?

The example of “why am I seeing this ad” is not really provenance as we would like to see it, but some “lightweight” assurance to users that allows them to understand why the ad is shown. For example, users may search for a gift via some web search. When they log into Facebook, they may see ads for the products they just searched for. However, users cannot see the connection between browsing behaviour, app use and ads. Instead, the information presented to users is usually some broad metadata (for example, the user resides in New Zealand and is older than 18) to justify the ad.

Therefore, Big Tech could, for example, allow users to trace their data: where it comes from, where it goes, to whom, and how it is used. This would require some end-user-oriented data tracing and visualization mechanisms. Such mechanisms would not necessarily provide business value to companies, but would increase the trust of users. Furthermore, this would make it harder for individuals and organisations with bad intentions to deceive, misappropriate and defraud, and it would make it easier to find and penalise actors who do so.

You say the team has “started to implement a few demonstrators based on real-world scenarios and datasets” – can you go into a little more detail around this?

Our first demonstrator is a movie recommendation system based on the Netflix Prize dataset with some embedded machine learning. This is a non-critical system to illustrate what is possible and how built-in provenance could work. Our second demonstrator will use a car insurance scenario to compute eligibility and premiums based on a fixed set of business rules. Our third scenario is a health record management system based on OpenMRS. Note that these scenarios are purely demonstrators at the early stage of this project. We will build more tools and systems to solve concrete problems later in the project and encourage interested individuals and organisations to reach out to us.
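As a rough illustration of what built-in provenance could mean in the car insurance scenario, the sketch below computes a premium from a fixed set of business rules and records which input triggered which surcharge, so the result can be explained afterwards. This is not the project's demonstrator: the rules, field names and amounts are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    premium: float
    trace: list = field(default_factory=list)  # human-readable provenance


# Hypothetical business rules: (name, input field used, condition, surcharge)
RULES = [
    ("young_driver", "age", lambda v: v < 25, 200.0),
    ("prior_claims", "claims_last_3y", lambda v: v > 0, 150.0),
    ("high_mileage", "km_per_year", lambda v: v > 20000, 80.0),
]

BASE_PREMIUM = 500.0  # hypothetical starting amount


def calculate_premium(applicant: dict) -> Decision:
    """Apply each rule and record which input triggered which surcharge."""
    decision = Decision(premium=BASE_PREMIUM)
    decision.trace.append(f"base premium: {BASE_PREMIUM:.2f}")
    for name, key, condition, surcharge in RULES:
        value = applicant[key]
        if condition(value):
            decision.premium += surcharge
            decision.trace.append(f"{name}: {key}={value} added {surcharge:.2f}")
    return decision


decision = calculate_premium(
    {"age": 22, "claims_last_3y": 0, "km_per_year": 25000}
)
for line in decision.trace:
    print(line)
print(f"total: {decision.premium:.2f}")
```

Because every surcharge is tied to a named rule and a specific input, a customer (or a regulator) could see that age and mileage raised this premium and that claims history did not, which is also the kind of information a contestability process would need.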

We have a voluntary code, the Algorithm Charter, that seems to govern some of this for government agencies – is this a good start?

The Algorithm Charter for Aotearoa New Zealand is a good start and provides relevant principles. We are interested in supporting the development and maintenance of systems that provide provenance information to different types of stakeholders, for instance, regulators or end users. Some of that provenance information is also captured in the Charter (for instance, commitments in the Charter around transparency, partnership, data, and human oversight). The “Trustworthy AI in Aotearoa AI Principles” touch on these points as well. Also, we are interested in how such information can be represented in a meaningful way to users. Not all end users are familiar with computer science concepts and algorithms. How can we help those users understand (and trust) the provenance information we provide?

Are you supporting regulating this area – aren’t there already standards in use by developers?

What we have learned from our initial industry engagement is that some companies would in fact appreciate more regulations and guidance from government bodies to target and justify their efforts, time and cost spent on provenance. 

But even with regulations, developers face challenges. Think for example of “typosquatting”, which makes a malicious package appear to be a trustworthy commodity package simply by using a similar name. Or the SolarWinds cyberattack, which spread to many clients of a major US IT company for months without being noticed. Regulations and standards did not prevent these incidents. In New Zealand there are already some regulations, for example regarding the compatibility of licenses in software components and the use of open source. The question then becomes how we can show that software and related processes and actors comply with these regulations and standards. This is where provenance as envisioned in our work comes in.
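A small sketch can show why typosquatting is easy to miss by eye but straightforward to flag automatically. The check below compares each dependency name against an allow-list of vetted packages and flags names that are close to, but not exactly, a trusted one. The allow-list, the similarity cutoff and the function name are assumptions made for illustration; real supply-chain tools use richer signals than name similarity.

```python
import difflib

# Hypothetical allow-list of packages the team has vetted.
TRUSTED = ["requests", "numpy", "pandas", "cryptography"]


def check_dependency(name: str, trusted=TRUSTED, cutoff: float = 0.8) -> str:
    """Classify a dependency name as trusted, suspicious, or unknown."""
    if name in trusted:
        return "trusted"
    # A near-miss on a trusted name is the typical typosquatting pattern.
    near = difflib.get_close_matches(name, trusted, n=1, cutoff=cutoff)
    if near:
        return f"suspicious: looks like '{near[0]}'"
    return "unknown"


results = {
    name: check_dependency(name)
    for name in ["numpy", "reqeusts", "leftpad"]
}
for name, verdict in results.items():
    print(f"{name}: {verdict}")
```

A check like this is cheap enough to run on every build, which is the kind of automation that could reduce the manual review effort described earlier.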

The focus of our work is to show what is technically feasible, and to investigate how the cost of integrating provenance into applications can be brought down by means of automation. We see provenance as a software quality attribute like scalability, reliability etc. Creating software always means finding a suitable trade-off between those qualities, while staying within budget. But here research can make a difference: it might make it easier to add provenance to applications if the cost of doing so is transparent and reasonable. And this might go hand-in-hand with regulation.

For instance, a tech company might argue that adding provenance to an ad campaign is technically impossible: the algorithms don’t scale, there is too much runtime overhead, and the cost of implementing the respective features would render the company uncompetitive and unprofitable. Research can demonstrate that this is not the case, and a regulator might use these results to force or incentivise the adoption of provenance technologies in certain areas. But in many domains, we see end users, not regulations, as the drivers.

How will you use data gained from the study?

The data collected in this survey will allow us to understand current provenance practices and requirements in the industry. This will help us ensure that whatever solutions we build to support trustworthiness and authenticity of digital products and services are relevant for New Zealand’s technology sector. For example, we are interested in finding out what provenance data is captured, how it is stored and processed, how it is presented, and for whom. The ultimate goal is to support “veracity-auditing” of existing and future systems in a minimally invasive and computer-supported manner.

Anything else you’d like to add…

Our work is part of a larger project on veracity technology [https://www.sftichallenge.govt.nz/our-research/projects/spearhead/veracity-technology/] which tries to find data and computer science solutions for verifying that something is what it appears to be in all possible contexts of the world around us. Whether something is a natural product, a digital product or data, it is increasingly difficult to know if it is truly what we think it is. How do we know for sure where a product, service or piece of data was made, or whether the claims made about it by the producer are authentic?

How can we trust that our data is protected and used only in ways we have agreed to?