Setting fire to the haystack 🔥
In my last post, I mentioned that I’d finally managed to crack the problem of extracting meaningful data from the ~3.2 million emails I’d scraped from WhatDoTheyKnow. I thought it was worth jotting down how.
Over the years, I’ve seen many people approach the problem by trying to figure out exactly what to extract, but the breakthrough for me came when I flipped it upside down. If you are looking for a needle in a haystack, you can rummage around to find it, or just set fire to the hay.
The substantive part of the requests and responses I was interested in finding is rich and varied, which also makes it extremely hard to predict. The “junk” surrounding it, on the other hand, is formulaic, soulless, bland, and digestible. There is a rhythm to any official correspondence, and so it is with FOI. There are natural points where the conversation can branch or extend, but at the most basic level, FOI is a very functional thing. Even authorities who don’t spam templates will still unwittingly fall into patterns, given that they are essentially performing the same repetitive task day in and day out.
So I started my quest by identifying and labelling the repeating language, not just within responses from one authority, but across all of them. I quickly established that my hunch was correct, and that this “bumpf” and “assorted gubbins” is a far more reliable indicator of the function of a message than the substantive content itself.
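Here’s roughly what that first pass can look like - a minimal sketch, assuming a stream of (authority, body) pairs; the function name, the sentence splitter and the threshold are illustrative choices of mine, not the actual pipeline:

```python
from collections import defaultdict
import re

def find_boilerplate(messages, min_authorities=25):
    """Collect sentences that recur verbatim across many distinct authorities."""
    seen_at = defaultdict(set)  # normalised sentence -> authorities that used it
    for authority, body in messages:
        # crude sentence split; a real pass would want something sturdier
        for sentence in re.split(r"(?<=[.!?])\s+", body):
            normalised = " ".join(sentence.lower().split())
            if len(normalised) > 40:  # skip fragments too short to be distinctive
                seen_at[normalised].add(authority)
    # anything used by dozens of unrelated bodies is bumpf, not substance
    return {s for s, auths in seen_at.items() if len(auths) >= min_authorities}
```

The nice property of counting across authorities rather than within one is that nobody has to enumerate the disclaimers and footers up front - they surface themselves.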
This noise includes warnings, disclaimers, copyright notices, details on your appeal rights, re-use information, service standards, privacy policies, virus warnings, system-generated footers, salutations, apologies, and general advertising of village fetes. Even the requesters use templates if you look closely enough. The default WhatDoTheyKnow “Dear Sir/Madam/Public_Authority … Yours faithfully/sincerely, user_name” scaffolding dominates everything, but so do set phrases about the desired response format, availability to provide clarification, and so on. As an aside, a classic tell for requests that have been generated by ChatGPT is a semi-passive-aggressive sentence about expecting “a response inside 20 working days.” Use it, and I will know.
When you programmatically find and discard all of the above, you are left with the real substance of the thing.
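The discard step itself can stay just as unglamorous. A minimal sketch, with a handful of made-up patterns standing in for the much longer list you would actually accumulate, plus the learned sentences from the pass above:

```python
import re

# Illustrative patterns only - the real list is long and mostly learned from data.
BUMPF_PATTERNS = [
    re.compile(r"^dear\s+(sir|madam|sir/madam).*$", re.I | re.M),
    re.compile(r"^yours\s+(faithfully|sincerely).*$", re.I | re.M),
    re.compile(r"please consider the environment before printing.*$", re.I | re.M),
    # disclaimer footers tend to run from their first line to the end of the message
    re.compile(r"this e-?mail (and any attachments )?(is|are) confidential.*", re.I | re.S),
]

def strip_bumpf(body, learned=frozenset()):
    """Delete the known junk; whatever survives is, by construction, substance."""
    for pattern in BUMPF_PATTERNS:
        body = pattern.sub("", body)
    # also drop whole sentences flagged by the cross-authority pass above
    sentences = re.split(r"(?<=[.!?])\s+", body)
    kept = [s for s in sentences if " ".join(s.lower().split()) not in learned]
    return " ".join(kept).strip()
```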
The things I looked for when labelling might not be what you would expect. An acknowledgement can almost always be spotted by looking for a mention of the possibility of fees. Authorities almost never charge, and the conditional language (“there might be a fee”) is always boilerplate, not an invoice. Another strong indicator is the mention of 20 working days.
Responses, regardless of outcome, have to inform requesters of their appeal rights, and will often tell you all about copyright, or permit or forbid re-use, by mentioning ROPSI or an OGL licence. Find this, and you’ve found the response. If it’s not in the message body, they’ll usually point to it in an attachment in a detectable way. And what of the attachments themselves? File type is a huge clue: a spreadsheet or zip file is almost always a sign of substantive data being released, for example. We also know the message order and the timeframe between the parts of the correspondence, so you can sense-check auto-labels that way.
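Stitched together, a first-pass labeller built from those tells can be as blunt as the sketch below. The regexes, label names and extension list are illustrative assumptions of mine, and the order-and-timing sense-check would sit on top of this:

```python
import re

FEE_HINT = re.compile(r"\bthere (may|might) be a fee\b|\ba fee (may|might) (apply|be payable)\b", re.I)
DEADLINE_HINT = re.compile(r"\b20 working days\b", re.I)
RESPONSE_HINT = re.compile(
    r"\b(internal review|information commissioner"
    r"|re-?use of public sector|open government licence)\b", re.I)
DATA_EXTENSIONS = (".csv", ".xls", ".xlsx", ".ods", ".zip")

def label_message(body, attachment_names=()):
    """First-pass label from the tells above; thread position/timing sense-checks it later."""
    if RESPONSE_HINT.search(body):
        return "response"  # appeal-rights / re-use wording marks the actual answer
    if any(name.lower().endswith(DATA_EXTENSIONS) for name in attachment_names):
        return "response"  # a spreadsheet or zip almost always means released data
    if FEE_HINT.search(body) or DEADLINE_HINT.search(body):
        return "acknowledgement"  # conditional fee talk plus 20 working days = boilerplate ack
    return "unlabelled"
```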
Putting this all together is really magic. With a mix of basic classifiers, regex and hope, you can get most of the way there without needing anything more sophisticated. You do need to practise good data hygiene, though. This means masking personally identifiable information and using symbols for things like the names of all the public authorities to standardise the data. You need to leave the ICO and OSIC alone, of course. I also made sure to look at all the words that appeared only once to fix typos - there is a clear opening for a second request about aardvarks (plural).
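For the hygiene pass, something along these lines - again a sketch, where the placeholder token, the function names and the crude tokeniser are my assumptions, and real PII masking needs considerably more care than this:

```python
from collections import Counter
import re

KEEP_NAMED = {"ICO", "OSIC"}  # the regulators stay as themselves

def mask_authorities(body, authority_names):
    """Replace each authority name with a placeholder symbol to standardise the text."""
    for name in authority_names:
        if name not in KEEP_NAMED:
            body = re.sub(re.escape(name), "<AUTHORITY>", body, flags=re.I)
    return body

def hapax_words(corpus):
    """Words appearing exactly once - a cheap shortlist of likely typos to review."""
    counts = Counter(w for body in corpus for w in re.findall(r"[a-z]+", body.lower()))
    return sorted(w for w, n in counts.items() if n == 1)
```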
The beauty of doing it like this is that, unlike with generative extraction methods, stripping away the noise guarantees that what is left was present in the original text. There is no risk of hallucinations - I once had an AI award the freedom of a small town to Terry Wogan out of the ether when I asked it to reproduce a list, so it happens.
The remaining big challenge with this dataset is that a lot of the information I am looking for is locked away inside PDFs and other attachments. I think I know how to get at this data too, though, and I have found around 100k attachments cached on the Internet Archive that I can use to test my theory if the mood ever takes me.