What GMail knows about me and you

MyEmail

We’ve known for a long time that Google collects vast amounts of data about us. What may not be so apparent is how that can be visualized and then rendered open to interpretation. Until now.

The image you see above is a snapshot of all the email that Google Mail has stored for me stretching back close to eight years. I got this information courtesy of a free service called Immersion, from the MIT Media Lab. Immersion takes metadata held in Google’s mail service and then plots the relationships between me and others. I’ll get to what this all means in a moment. First, something stands out as a massive surprise.

Immersion says that Google has processed close to a quarter of a million emails, yet according to my inbox there are ‘only’ 71,300 emails still sitting in my account. What’s going on?

Just to check what that 71K represents, I deleted a couple of messages in my inbox and sure enough, the counter dropped by two. I then went to my sent box and deleted another message. Once again the counter dropped. I then went to my spam box, selected 100 messages and rechecked the counter. No change. That must mean that, whatever actions I might take, Google is still holding some data about more than 70 percent of all the email that has been addressed to me in some way or another over that near eight-year period.
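A rough check on that 70 percent figure, as a short Python sketch. The 250,000 total is an assumption, rounded from "close to a quarter of a million"; the exact count is not given here, so the result is only approximate.

```python
# Assumption: "close to a quarter of a million" rounded to 250,000.
total_processed = 250_000
# Counter shown in the account, as quoted in the post.
still_in_account = 71_300

# Messages counted by the analysis but no longer visible in the account.
no_longer_visible = total_processed - still_in_account
share = no_longer_visible / total_processed

print(f"{no_longer_visible:,} of {total_processed:,} ({share:.1%})")
# prints "178,700 of 250,000 (71.5%)"
```

A somewhat lower true total would pull the share down toward 70 percent, which is consistent with the "more than 70 percent" wording.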

What is Google doing with all that crap? I’ve no idea but I can guess. Moving on.

Clustering nodes

Ethan Zuckerman, who directs the Center for Civic Media at MIT, makes the point:

…analyzing these diagrams is a bit like analyzing your dreams – fascinating to you, but off-putting to everyone else). It’s to make the case that this metadata paints a very revealing portrait of oneself.

He’s right. The above snapshot shows, for instance, that I have very strong ties with a limited number of people, but they in turn have very strong networks. I can also see how certain individuals reach ‘through’ to others in sometimes surprising ways. Out of the top 15 ‘collaborators’, as Immersion calls them, I know 14 personally, and of that group I would count at least 10 as close colleagues in one realm or another.

I ran a couple of other analyses. One was for the previous year and the other for the previous two years. Here is the one for the last year:

MyGmail 1 year

Once again, you can see there is strong clustering. This time, however, the emphasis on certain groups is clearer. That should not be a surprise given the time frame.

One interpretation is that the clustering reflects the main topics in which I am interested – or rather the topics that flow through my long-held email account.

Another speaks to the frequency with which I am in communication with certain groups and people within those groups. You can deduce that from the thickness of the communication lines between various individuals. That makes sense to me.
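To make the idea of line thickness concrete, here is a minimal Python sketch of how header metadata alone could be turned into weighted links between contacts. It is not Immersion's actual algorithm, which this post does not describe. The addresses, the `RAW_HEADERS` sample and the helper names are all made up for illustration, and only From, To and Cc headers are read; no message bodies are involved.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses
from itertools import combinations

# Hypothetical sample: headers only, with placeholder addresses.
RAW_HEADERS = [
    "From: me@example.com\nTo: ann@example.com, bob@example.com\n\n",
    "From: ann@example.com\nTo: me@example.com\nCc: bob@example.com\n\n",
    "From: bob@example.com\nTo: me@example.com\n\n",
    "From: carol@example.com\nTo: me@example.com\n\n",
]


def participants(raw):
    """Return the set of addresses found in the From, To and Cc headers."""
    msg = message_from_string(raw)
    fields = []
    for name in ("From", "To", "Cc"):
        fields.extend(msg.get_all(name, []))
    return {addr.lower() for _, addr in getaddresses(fields) if addr}


def contact_counts(raw_messages, owner):
    """Count how many messages each contact shares with the account owner."""
    counts = Counter()
    for raw in raw_messages:
        counts.update(participants(raw) - {owner})
    return counts


def edge_weights(raw_messages, owner):
    """Count how many messages each pair of contacts appears on together."""
    weights = Counter()
    for raw in raw_messages:
        others = sorted(participants(raw) - {owner})
        for pair in combinations(others, 2):
            weights[pair] += 1
    return weights


if __name__ == "__main__":
    owner = "me@example.com"
    print(contact_counts(RAW_HEADERS, owner).most_common())
    print(edge_weights(RAW_HEADERS, owner))
```

In this sketch, a pair's weight plays the role of line thickness and a contact's count plays the role of a 'collaborator' ranking. Nothing beyond who wrote to whom is needed to produce either number.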

The crunch

Here comes the crunch. In order to get this data I explicitly gave Immersion permission to harvest it for the purpose of this analysis. Google has acknowledged that it shares some of those data with security agencies. Those same agencies (and goodness knows who else) have access to many accounts. They can therefore easily draw all sorts of inferences simply by looking at the various pieces of connecting ‘tissue’ and patterns that exist within the metadata that Google holds, and which seems to include anything that has been sent to me, whether it went straight into the junk box or was deleted without reading.

That’s pretty powerful stuff. It also strikes me as scary. But what does this mean for business?

Business beware

Beyond search, GMail has been the most successful of the services Google markets and sells. We use it, as do countless other businesses. Our purposes are entirely legitimate and I would like to think that most of those email conversations are private. So far, the evidence suggests that they are. That is, until one day some agency or other comes knocking on the door out of the blue demanding to impound all your machines. Think that’s far-fetched?

Several years ago I got in touch with one of my old firms. During the conversation they told me that, for reasons far too complicated to get into here, they had that ‘knock’ one morning. It caused chaos for several days. They were entirely innocent of any wrongdoing but because of an association with someone who was eventually charged with a serious crime, they got caught up in the dragnet. They never did get a satisfactory explanation as to what law officers were looking for but the experience left them unnerved.

We are all being encouraged to give up our on-premise systems in favour of various cloud services. That comes with a level of giving up sovereignty over ‘our’ data. Hasn’t the time come for services like Google and others to safeguard our interests in a more transparent manner?

While I have no difficulty with the idea that services like Google can and should use our data to meet their commercial interests, it is another matter altogether when it is so easy for third parties to assemble patterns of which we may be blissfully unaware and yet which may come back to bite us for all the wrong reasons.

The lesson is clear. Regardless of who you choose to partner with for cloud services, the question of your privacy just moved up a notch on the RFI.

In closing, I want to make a small observation about networks and the ties that link us. The notion of running this service came to me via Christine Cavalier, someone I’ve known from afar since the early days of Seesmic. That’s about five years. It is an age since I read anything of hers so you can deduce we have very loose ties. Having said that, the post that drew my attention popped up in another social networking service. We truly are far more connected than we know.

Image credit: © creative soul – Fotolia.com

Den Howlett
Following 20+ years in finance and IT-related roles, Den Howlett became a freelance writer/analyst/commenter specialising in enterprise IT. He co-launched Information Week in the UK (1996-7) and was a contributor to numerous UK-based trade magazines. Most recently, he was a long-term columnist on ZDNet. Today, Howlett provides strategic product direction support to a variety of enterprise vendors along with delivering M&A due diligence services. The raw idea for diginomica came to him at a time when enterprise topics in media were being crushed by consumer stories. There had to be a better way. diginomica is that better way.

@dahowlett

Disruptor, enterprise applications drama critic, BS cleaner, buyer champion and foodie trying to suck less every day.

Leave a Reply

  • cliveboulton says:

    Good research. As far as business cloud services RFIs go. Posit the ability to graph metadata independent of the application and sell that data onward drives a new class of on-premise in-cloud database. Likely a trusted technology similar to GitHub with content-addressable sharing for interoperability in the cloud. The PRISM datastore reportedly uses Apache Accumulo with cell-level security tags to selectively share data with different intelligence agencies. What’s good for the goose is…