What was the data approach to breaking The FinCEN Files?

By Gary Flood, September 24, 2020
How do you work backwards from thousands of disconnected PDFs to the 100,000 suspicious financial transactions they’re reporting on? A combination of hard human work, OCR, data analysis and graph.


The International Consortium of Investigative Journalists (ICIJ) is a Washington, DC-based nonprofit that sees itself as breaking the "most important" stories in the world. So far, its work has centred on exposing efforts by the global elite to avoid tax (the Panama Papers), but its latest scoop, which it's calling The FinCEN Files, claims to have uncovered the role of global banks in what it dubs "industrial-scale" money laundering.

The files are copies of 2,500 documents, mostly PDFs of the alerts (suspicious activity reports, or SARs) that global banks were obliged for compliance reasons to send US authorities between 2000 and 2017. These SARs were originally leaked to BuzzFeed News, then passed along to ICIJ and its global network of journalist volunteers to unravel.

'Extensive cross-correlation and spotting of connections needed to be made'

Please note that what we're going to focus on in this article is not the substance of the revelations, but how they were effectively reverse-engineered out of the disconnected sources the Consortium was originally given. This is of interest to enterprise IT as it seems to be yet another case of the graph data representation approach proving useful for mining insights out of a large mass of information (see previous diginomica analyses of this topic here). It's possibly also another example of how data visualisation tools can help non-data experts navigate complex data.

First, some basics. To be clear up-front, graph was just one approach used; the FinCEN project involved several different approaches, including a non-graph, ICIJ-developed open source document analysis tool, Datashare, which can perform smart data capture and extraction in large sets of scanned documents. In terms of graph, the two main applications were the Neo4j native graph database and a graph-based AML (Anti-Money Laundering) data visualisation app, Linkurious.

Speaking to us from her DC office, Emilia Díaz Struck, ICIJ's research editor and Latin American coordinator (and a former investigative reporter herself), explained that the database was used to explore transactional data in approximately 400 spreadsheets and connect it with data from the UK corporate registry.

A key problem here: while ICIJ had the PDFs, it obviously didn't have the spreadsheets they were generated from, so extensive cross-correlation and spotting of connections needed to be made. Much of that work was automated, but not all of it could be, so Linkurious was in turn used by 85 non-data specialist volunteer journalists in 30 countries to go the next step and deepen understanding of the questionable activities the SARs were pointing to.
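To make the cross-correlation step concrete, here is a minimal Python sketch of one way transaction rows could be matched against company registry records by normalised name. It is an illustration only, not ICIJ's actual pipeline: the field names (sender, beneficiary, company_name, company_number), the suffix list and the sample rows are all invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical legal-form suffixes to strip; a real list would be far longer.
LEGAL_SUFFIXES = {"ltd", "limited", "llp", "llc", "inc", "corp", "sa"}


def normalise(name: str) -> str:
    """Lower-case a name, drop punctuation and trailing legal-form words."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def link_to_registry(transactions, registry):
    """Return (transaction id, role, company number) for every name match."""
    index = defaultdict(list)
    for record in registry:
        index[normalise(record["company_name"])].append(record)

    links = []
    for row in transactions:
        for role in ("sender", "beneficiary"):
            for record in index.get(normalise(row[role]), []):
                links.append((row["id"], role, record["company_number"]))
    return links


# Invented sample data, for illustration only.
transactions = [
    {"id": "T1", "sender": "Acme Trading Ltd.", "beneficiary": "Borealis Holdings LLP"},
    {"id": "T2", "sender": "BOREALIS HOLDINGS", "beneficiary": "Unlisted Co"},
]
registry = [
    {"company_number": "00000001", "company_name": "ACME TRADING LIMITED"},
    {"company_number": "00000002", "company_name": "Borealis Holdings LLP"},
]

links = link_to_registry(transactions, registry)
for link in links:
    print(link)
```

Exact matching on a normalised key is the crudest form of this kind of linking. It misses misspellings and transliterations, which is one reason a human check on the results remains necessary.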

The combined Datashare-graph database and journo grunt work took a year, she says:

Either when we have millions of records to work through, like in the Panama or Paradise Papers, or like FinCEN with just thousands, it's all about the extraction of identities. The data was very, very complex, as it's about information on about 100,000 separate transactions. To find connections between the data and those financial transactions, we did try Machine Learning for some of it, we tried a lot of things actually. But the data was written in narrative form in many of the reports, so you really need a human to do the last part of it.

How does the FinCEN project compare to ICIJ's other data-driven investigations, a run of similar probes that began with the Panama Papers in 2016 and led up to the Luanda Leaks earlier this year? The common denominator is the continual use of graph:

The way we use graph databases is always the same: to find hidden connections that are not obvious. If you find a shareholder or a person, could this person also actually be this person or entity you've seen over here, and so be connected to more things I'm not seeing yet. Whenever you have vast amounts of data, your risk is missing what is there; technology and machine learning, things like graph databases, allow you to see things that sometimes could take you years as a human.

Yes, we have many approaches based on the complexity of the data we're dealing with; we have built a fact-checking tool to fact-check the extraction, we do text and statistical analysis, there are many audit things to explore, but we continue to use graph databases as a way of finding connections inside the dataset we're working on.
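As a rough illustration of what "finding hidden connections" means in graph terms, the sketch below builds a tiny undirected graph of people, companies and accounts in plain Python and uses a breadth-first search to find the shortest chain linking two parties. The entities and relationships are invented, and this is not how ICIJ's Neo4j setup works internally; a graph database answers the same kind of question through its own query language and at far larger scale.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of nodes, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Invented relationships, for illustration only.
edges = [
    ("Person A", "Shell Co 1"),   # director of
    ("Shell Co 1", "Account X"),  # holds account
    ("Account X", "Account Y"),   # sent money to
    ("Account Y", "Shell Co 2"),  # held by
    ("Person B", "Shell Co 2"),   # shareholder of
    ("Person C", "Shell Co 3"),   # unrelated cluster
]

graph = build_graph(edges)
path = shortest_path(graph, "Person A", "Person B")
print(" -> ".join(path))
```

Nothing in the raw rows says Person A and Person B are linked; the connection only appears once the records are joined up as a graph and traversed.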

A great way of opening up data

There's no doubt, then, that graph is a powerful addition to the armoury of investigative journalists. But Díaz Struck also stresses the importance of data visualisation as a way to foster collaboration among what is, by the nature of the ICIJ's distributed way of working, a very heterogeneous workforce:

I used to be an investigative reporter, based in Venezuela, where there are big issues around lack of access to information. I actually started working with data as a way to bulletproof my own stories, and when data was not available I learned that connecting with colleagues in other countries could be a big help. I became convinced a data-driven approach was a way of opening up transparency and empowering journalism, but also how powerful collaborations are and how you can actually go beyond what is in one country, as corruption is global and the stories are global.

At ICIJ, data visualisation has been something we have used in several of our projects as a way of helping partners navigate and edit data even though they might not always have coding skills. Now, we have a very nice interface for users to log in and then navigate, do queries and searches without them needing to code. It's a great way of opening up data and graph databases for everyone in the network.

In terms of next steps for ICIJ and data, we couldn't help asking if more leaks on the scale of the Panama Papers and the FinCEN Files are coming. Díaz Struck laughed us off, but did make an offer diginomica readers may find intriguing:

We're always looking for new projects and new stories. If anyone has any material or any idea for a great public interest story of global scale where people are affected or the system is broken, they can pitch them to us. But yes, there's always more coming!
