Having played with this and researched the products and services available, we can see the opportunities in the long term. Redacting DSARs (data subject access requests), or removing PII of any kind, is a skilled and often very time-consuming task.

The questions we look at in this article are:

  1. How does AI help redaction (more than dictionaries and regular expressions)?
  2. When should we invest?
  3. Who will provide the facilities?

But first, let’s look at what AI is not.

Not using AI

One thing many suppliers are adding to their marketing is the phrase “AI based” and “learning models” for suggesting redactions. Most are, however, offering simple dictionaries (e.g. first names, second names, nicknames, postal codes, etc.) and Regular Expressions (RegEx) tools. The latter allow you to search for strings so, if an asterisk “*” is a wildcard in a search, then “*@*.*” would be a way of saying:

any string where there are characters, then an “@” sign, followed by some characters either side of a dot “.”.

The system would then be programmed to recognise those as email addresses and mark them up as suggestions for redaction.
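The wildcard idea can be sketched in a few lines of Python using the standard-library fnmatch module, where “*” matches any run of characters. The sample sentence and the function name are made up for illustration; this is a rough sketch of the principle, not how any particular supplier’s tool works.

```python
from fnmatch import fnmatchcase

def suggest_wildcard_matches(text, pattern="*@*.*"):
    """Return the words in text that match a simple wildcard pattern."""
    suggestions = []
    for word in text.split():
        # Strip punctuation that commonly surrounds a word in a sentence.
        candidate = word.strip(".,;:()<>\"'")
        if fnmatchcase(candidate, pattern):
            suggestions.append(candidate)
    return suggestions

sample = "Please reply to jane.doe@example.org, or phone the office."
print(suggest_wildcard_matches(sample))
```

Note how loose the wildcard is: it accepts anything with an “@” and a later dot, which is why real tools tend to use stricter patterns.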

This mechanism can redact the names of authors or expert interviewees from a document provided as a response to an EIR applicant about a land development policy. In a response to a DSAR, it could redact all the other PII (personally identifiable information) in the document and offer you the option to leave the applicant’s name showing. Usually, the results of these “string” searches are suggested to a user to accept or reject, in the same way as the red-underlined correction suggestions from a spell checker in text you are writing.
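A minimal sketch of that suggest-then-review workflow might look like the following. Everything here is an assumption for illustration: the three-name dictionary, the function names and the “[REDACTED]” marker are all hypothetical, and a real product would have a far larger dictionary and a proper review screen.

```python
import re

# Hypothetical starter dictionary of names; a real tool would ship a much larger one.
NAME_DICTIONARY = {"Alice Smith", "Bob Jones", "Carol White"}

def suggest_redactions(text, dictionary, applicant=None):
    """Return sorted (start, end, matched_text) suggestions, skipping the applicant."""
    suggestions = []
    for name in dictionary:
        if applicant is not None and name == applicant:
            continue  # leave the applicant's own name showing
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            suggestions.append((match.start(), match.end(), match.group()))
    return sorted(suggestions)

def apply_redactions(text, accepted):
    """Replace each accepted span with a marker, working right to left
    so that earlier character positions stay valid."""
    for start, end, _ in sorted(accepted, reverse=True):
        text = text[:start] + "[REDACTED]" + text[end:]
    return text

note = "Alice Smith met Bob Jones and Carol White on Monday."
suggestions = suggest_redactions(note, NAME_DICTIONARY, applicant="Alice Smith")
# A reviewer would accept or reject each suggestion; here every one is accepted.
print(apply_redactions(note, suggestions))
```

Rejecting a suggestion is simply a matter of leaving it out of the accepted list, which keeps the human in charge of the final document.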

You can add your vocabulary and names to the dictionaries if you see they get missed in the suggested redactions. Likewise, you can change the RegEx if you read up on the skills. This is quite complicated. For example, the actual RegEx for searching for a valid email would be more like \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b.
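To give a feel for what that pattern does, here is a short sketch using Python’s built-in re module. The sample sentence and the “[EMAIL]” marker are invented for the example, and the pattern is the pragmatic one quoted above rather than a complete definition of a valid email address.

```python
import re

# The pattern quoted in the text, compiled as a raw string.
EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

def find_emails(text):
    """Return every substring of text that the pattern treats as an email address."""
    return EMAIL_PATTERN.findall(text)

def redact_emails(text, marker="[EMAIL]"):
    """Replace every matched address with a marker."""
    return EMAIL_PATTERN.sub(marker, text)

sample = "Contact j.bloggs@example.co.uk or the duty officer; ignore user@localhost."
print(find_emails(sample))
print(redact_emails(sample))
```

The second address is deliberately left alone: it has no dot followed by at least two letters after the “@”, so the stricter pattern rejects it where a bare wildcard would not care.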

Most of the good tools offer some user-friendly way of evolving and improving the dictionaries and the character strings recognised by their software. All the above is possible without AI. Where AI might help is if it learned the way that human experts redacted documents and started to spot and change things that even a careful human might miss. All the well-known limitations of AI learning would be expected.

How does AI help redaction?

The suppliers say their AI will learn from experts’ previous redactions and suggest redactions on that basis.

This sounds plausible. After all, this method has revolutionised certain complex work such as medical diagnoses from scans, X-rays and MRIs. Obviously, a consultant specialist with decades of experience would see this as a tool to improve their diagnoses and not a replacement. It would have to be the same for any IG professional, as shown by the simple example of social services notes below.

Many learning algorithms will increase their accuracy in discovering items for you to redact as humans correct their previous attempts. However, that is a long journey. Simple ones work, like this example below from ChatGPT.

Some specialist AI tools are better at this specific job.

However, it seems to fail when given just slightly harder texts and certainly cannot deal with this example. With apologies to Happy Valley, Vera and Line of Duty writers, Sally Plod is the applicant for this DSAR.

“I interviewed Sally Plod at her home on 23/01/2023 and was unable to prevent her sister Clare Wino listening in from the hall. Mrs Plod reported that her Grandson’s estranged father Billy Bunter was in prison serving an 18 year sentence for various vicious violent assaults. She expressed concern about his escape from a court visit. She stated that Mr Bunter was dangerous and she fears injury to her grandson. She informed me that her son John Plod’s teacher Mrs Jones has been in communication with Billy Bunter and has facilitated correspondence and is attempting to arrange a meeting between them. Sally believes that her grandson, John, has been influenced by his father’s example and has recently been arrested for breaking and entering and that he also has a serious drug dependency and keeps stealing money from her at home. I later spoke with the grandson and he denied the allegations of stealing from Sally.”

If you asked four experienced IG managers, you would probably receive seven answers as to how it should be redacted. So AI in redaction has a fair way to go yet. Nevertheless, to save you the trouble, we asked ChatGPT as the “spokesbot” for the AI community.

The format of the two screens is slightly different.  The screenshot above is from https://platform.openai.com/playground
Below is a variation from https://chat.openai.com/chat.
You can spend many a happy hour varying your question and even more training it to improve.
Slightly different question and model.
Would you give ChatGPT thumbs up or thumbs down for either of these? How would you start to train it? Who is confident in their own ability to provide the “text book” answer?
Is there anything here that a dictionary and some RegEx could not do for a competent information governance manager? Neither realises Sally is the applicant. Both hide her name. You can see that an attempt has been made to address this in the question. One appears far more hung up on gender as PII than the other. How would that difference have come from the slight change in question? Let me know if you have had the time to learn more than me about how to construct such goals or correct AI answers and adjust the “models”.

Correcting learning engines so they improve their redaction is exhausting for humans. This application of AI will take a huge amount of training and guidance. Suppliers may come and go, and there is a risk they may take your hard work with them. So, will we all be duplicating the effort (and hence multiplying the cost to the taxpayer), or can someone lead everybody to work together as one on behalf of the Crown?

The suppliers are also hoping to participate in the discovery process. For example, if you asked a complex business like a local authority to correct the spelling of your name, they may need to look in twenty systems covering tax, rent, student loans, planning etc. Then there will be hundreds of other places your name could be stored, from social services notes to the leisure centre membership system and the spreadsheets for booking playing fields for your five-a-side team. The suppliers are pitching their wares at a multi-file-type scan for PII right across anywhere ICT users access, including the email systems. Having done a few collation exercises for publication schemes and information asset registers, this sounds helpful. However, access will be a challenge – especially in the shadow IT area of people’s spreadsheets.
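As a very rough sketch of what such a scan involves, the snippet below walks a folder tree and reports which files contain email addresses. It assumes plain-text files only (the suffix list is my own choice); spreadsheets, PDFs and mailboxes would each need their own parser, and the throwaway folder and file contents are made up for the demonstration.

```python
import re
import tempfile
from pathlib import Path

EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
TEXT_SUFFIXES = {".txt", ".csv", ".md"}  # assumed: plain-text formats only

def scan_for_emails(root):
    """Map each readable text file under root to the email addresses found in it."""
    root = Path(root)
    findings = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in TEXT_SUFFIXES:
            continue
        try:
            content = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # files we cannot open are skipped
        hits = EMAIL_PATTERN.findall(content)
        if hits:
            findings[path.relative_to(root).as_posix()] = hits
    return findings

# Demonstration on a throwaway folder with made-up files.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "bookings.csv").write_text(
        "team,contact\nRovers,captain@example.org\n", encoding="utf-8")
    Path(folder, "notes.txt").write_text(
        "No contact details recorded.", encoding="utf-8")
    demo_findings = scan_for_emails(folder)

print(demo_findings)
```

Even this toy version shows where the effort goes: the pattern matching is the easy part, while reaching and reading every file type is the hard part.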

When should I invest in AI for redaction?

The suppliers say “now”. Many are hundreds of thousands underwater financially and need a lifeline. We all understand that work is required to make this happen, but what if the work is all lost because a supplier goes bust? Set against that, what if we do nothing and never obtain the benefits? It seems there are three possibilities.

  1. Now: Jump in and collaborate with the learning process.  Help build better AI.
  2. Later: Wait till someone else has taken the risk and done the work – especially the AI learning process and huge time investment.  Let’s see which supplier emerges with the market leadership and domination. Dictionaries and RegEx are fine for now and the updates we make to those may be fed into the learning engines later.
  3. Never: We will not trust it, and the better approach is to keep the RegEx and dictionaries, which we can control and understand. The players could go bust and we could lose our learning investment. It could even be dangerous.

Full author transparency: my heart says number 1. My head says number 2 (and so does my wallet).

Who should provide the AI?

Presently, only private sector companies are investing in the R&D and winning clients to use it. However, the learning engine, some starter dictionaries, and RegEx are not enough. The suppliers do not want the responsibility of holding client data. The onus for all the corrections and training of the AIs will fall on the end user. That means huge amounts of duplication of effort. If a new spelling of a first name becomes popular after a film or TV series, every police force, health board, local authority etc. must add it to their learning engine’s dictionary.

The answer must surely be that a government-led initiative should get everyone together. The taxpayer could also own the IP for UK Ltd and reduce risk.

If you follow that argument to its conclusion, the plan could only work if a standard is created with which AI suppliers must comply. Alternatively, a single AI supplier could be chosen and their technology locked up in escrow.

For more information on how to redact material, this link from the ICO (Information Commissioner’s Office) is a good place to start.