The Privacy Equation

Edward Snowden recently did an AMA (“Ask Me Anything”) on Reddit where he said:

Arguing that you don’t care about the right to privacy because you have nothing to hide is no different than saying you don’t care about free speech because you have nothing to say.

A pithy statement, is it not? Unfortunately, the situation is not so simple – privacy and free speech are at odds with each other both technologically and legally. This is because the ability to preserve privacy and the ability to preserve free speech are inversely related by the same fundamental processes. Simply stated, that which makes free speech more possible makes privacy less possible.

In this article I will show how the degradation of one’s privacy is inevitable and potentially accelerates over time due to factors outside one’s direct control. This is a recent phenomenon brought about by the digitization of information, always-on connectivity and continuous advancements in machine learning. These technologies and the infrastructures built from them also facilitate the propagation of uncensored free speech.

Thus one can accept the futility of preserving their privacy yet still cherish their freedom of expression. One day we will truly have very little to hide, regardless of whether we have something to say.

Semantic Quibble? (Define “Privacy”)

A valid point – is my issue with Snowden’s statement simply a matter of applying a different context of interpretation than the one he intended? Am I seeing a rhetorical statement meant for the field of law through the eyes of a physicist or computer scientist instead?

Yes, and no. My issue is the irreconcilability of the common definition of privacy with the direction the world is headed. By demonstrating this irreconcilability from a scientist’s point of view (i.e., via a formal equation), I hope to steer the legal conversation in a more realistic, longer-term direction:

It is more important to establish a legal framework that guides the utilization of one’s information than it is to attempt to use the law to prevent our information from being obtained in the first place.

Note that this is an apolitical statement – I am not arguing for or against a surveillance state. I am also not implying that laws such as HIPAA have no value. I am stating that no law can prevent the eventual dissemination of personal information. Once the information is out in the digital wild, it’s out. What we can do though is ensure that legal constructs are in place for victims of mis-utilized personal information to seek recompense. In addition, we can precisely define the procedure by which our government may utilize personal information for the purpose of preventing or prosecuting a crime.

Fully extrapolated, privacy will evolve into a construct that exists primarily in law only. What makes this inevitable though? Can we characterize the erosion of privacy as we currently know it? I believe we can by stepping through an equation that shows how one’s digital footprint expands over time.

The Privacy Equation

At any instant t in time, an individual’s digital footprint (\Phi_t) can be represented by this simple equation:

\Phi_t= \vec{F}_t\circ \vec{I}_t

One’s digital footprint is the set of information quanta about an individual that is out in the digital wild. In other words, it is the set of all stored, digital information about an individual. This could be a tax return on the IRS’ servers, a set of photographs on a memory stick, a Facebook profile or even the third letter of the second sentence of the electric bill of the house they lived in seven years ago. As long as it is in digital form and in some way relates to an individual (no matter how trivially), it is part of their digital footprint.

\vec{F}_t is the feature selector. It is a vector of k binary values – every element is a 1 or a 0. A 1 at a particular index i indicates the piece of information with index i is digitally known, and 0 indicates it is not.

\vec{I}_t is the set of all information that can possibly be known about an individual. From a physicist’s point of view you could say it is that individual’s wave function at time t, but it will be easier to understand (and use) as a discrete vector also of k elements, where each element represents a single quantum or unit of information. This is the interpretation we’ll use throughout, and we’ll call a single quantum of information a feature. Thus \vec{I}_t is the knowable feature vector for an individual at time t.

Note that k is equal for both vectors. This is because each element of the feature selector indicates whether its corresponding element in the knowable feature vector is known.

By performing an element-wise multiplication (the hollow dot ‘\circ‘) of the feature selector \vec{F}_t with the knowable feature vector \vec{I}_t of an individual, we obtain their digital footprint. Before we go into how the feature selector is derived, let’s set up an example:

Pretend for a moment that we live in a bizarre universe where there are only three pieces of information that distinguish one human being from another: a person’s name, the color of their hair, and what they had for breakfast two days ago. In bizarro land there are literally no other properties that define one’s entire existence. In that case, k=3 and \vec{I}_t=\left<name,hair color,breakfast_{t-2 days}\right>.

Let’s say that two days after birth (t=day2) a bizarropian creates a Facebook profile with their name and hair color. Facebook’s feature selector would be \vec{F}_{day2}^{(Facebook)}=\left<1,1,0\right>. The first bit is on (1) because Facebook has their current name, the second bit is on because Facebook has their current hair color, and the third bit is off (0) because Facebook does not know what that person ate for breakfast two days before t.

On day three (t=day3), this person posts what they ate for breakfast two days ago. Now \vec{F}_{day3}^{(Facebook)}=\left<1,1,1\right>. From this we have:

\Phi_{day2}=\vec{F}_{day2}^{(Facebook)}\circ \vec{I}_{day2} = \left<name,haircolor,0\right>

and

\Phi_{day3} = \vec{F}_{day3}^{(Facebook)} \circ \vec{I}_{day3}=\left<name,haircolor,breakfast_{day1}\right>
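The bizarro-land example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name footprint and the labels in KNOWABLE are hypothetical, and None stands in for the 0 ("pointing to nothing") entries of the footprint.

```python
from typing import List, Optional

# Hypothetical labels for the three bizarro-land features (k = 3).
KNOWABLE = ["name", "haircolor", "breakfast"]


def footprint(selector: List[int], knowable: List[str]) -> List[Optional[str]]:
    """Element-wise product: keep a feature only where the selector bit is 1."""
    if len(selector) != len(knowable):
        raise ValueError("selector and knowable vector must both have k elements")
    # None plays the role of the 0 placeholder (a pointer to nothing).
    return [feature if bit == 1 else None for bit, feature in zip(selector, knowable)]


facebook_day2 = [1, 1, 0]  # name and hair color known, breakfast unknown
facebook_day3 = [1, 1, 1]  # breakfast has now been posted as well

phi_day2 = footprint(facebook_day2, KNOWABLE)
phi_day3 = footprint(facebook_day3, KNOWABLE)

print(phi_day2)  # ['name', 'haircolor', None]
print(phi_day3)  # ['name', 'haircolor', 'breakfast']
```

As in the text, the sketch works with placeholders rather than actual values such as “Bob” or “brown”; only whether a feature is known matters.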

A few clarifications. The superscript (Facebook) in \vec{F}_t^{(Facebook)} indicates that feature selector pertains to the system that is known as Facebook. We’ll continue to use this notation but generalize it later with a single variable like x or i.

Also notice that the footprint and knowable feature vector use placeholders like name and haircolor instead of actual values like “Bob” and “brown.” From a computer scientist’s perspective, these are equivalent to pointers that point to the actual value (and yes, a 0 indicates pointing to nothing). We use placeholders because we’re not concerned with the actual values, we just want to know if they are known by someone/something else.

Another thing. In my simple example I have arbitrarily constrained the knowable feature vector to three elements, i.e., k=3. In reality, k changes over time because an individual is not static. From a physics perspective, you could say that one is constantly gaining new particles and energies and shedding old ones. I’d prefer to state it more practically, however: what was known about a person yesterday may not be the same as it is today. For instance, after a legal name change, you would have two elements in \vec{I}_t where there was one originally: your name before the change and your name thereafter.

The growth of k is actually not just due to simple changes in state. Historical facts are created every time you interact with something. Walking across a room creates the historical fact that you walked across a certain room at a certain time in a certain way. Whenever you scroll across a post in Facebook, the amount of time the post was visible is tracked by their site. Every call you make, every text you send, every time you visit a website, every time you appear on a CCTV camera recording, a historical fact about you is created.

As a result, k grows at least linearly with t. This is one important element of the inevitable decline in privacy: as the number of things to know about you increases, it becomes more difficult to prevent some of them from being known. And as we’ll discuss later, the more information we have about someone, the easier it is to fill the important gaps in our knowledge of them.

Finally, the placeholders in my example are intentionally broad in order to make it easier to convey what I mean by an “information quantum.” For instance, I used name as the label for a person’s name, but in actuality a name is composed of multiple components depending on culture and language (such as first and last name, second letter of last name, etc.). Strictly speaking, the elements of the knowable feature vector would point to encodings of information close to or at the size of their Shannon entropies. Don’t glaze over yet – this detail is not important in order to use the equation for practical purposes.

Now, to calculate the feature selector at a given t:

\vec{F}_t=\sum\limits_{x=1}^{N}\vec{F}_t^{(x)}

Where N is the number of systems that have acquired or could potentially acquire information about an individual. It’s important to stop and consider what this means for a moment: N is not entirely under an individual’s control. In my earlier contrived example, N=1 because there was only one system in existence: Facebook. Even if the example individual didn’t have a Facebook account, N=1 because Facebook could still potentially acquire information about them.

How? Consider, for example, that someone (person A) posts a photograph that has their friend (person B) in it as well. Person B currently doesn’t have an account. That photo may have GPS coordinates and a timestamp in the EXIF header. Perhaps person A even writes person B’s name in the body of the post. Person B never created an account, but Facebook has a picture of their likeness, their name and a timestamp of when they visited a certain place.
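To make the combination of per-system feature selectors concrete, here is a minimal Python sketch. It reads the summation as an element-wise logical OR, which is how the article interprets + and \sum in its equations. The system names other than Facebook and the function name combine_selectors are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def combine_selectors(selectors: List[List[int]]) -> List[int]:
    """Element-wise logical OR across all per-system feature selectors."""
    if not selectors:
        return []
    k = len(selectors[0])
    if any(len(s) != k for s in selectors):
        raise ValueError("every feature selector must have the same k elements")
    # A feature is digitally known if at least one system knows it.
    return [int(any(bits)) for bits in zip(*selectors)]


# Hypothetical per-system selectors over <name, haircolor, breakfast>.
systems: Dict[str, List[int]] = {
    "facebook": [1, 1, 0],
    "photo_app": [1, 0, 0],  # assumed second system
    "food_blog": [0, 0, 1],  # assumed third system
}

overall = combine_selectors(list(systems.values()))
print(overall)  # [1, 1, 1]
```

Note that no single system in this sketch knows everything, yet the combined selector is all ones: the footprint is a property of the whole ecosystem, not of any one system.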

The Growth of N

That’s not where the story ends. Any account that is a “friend” of person A now has the potential to access the post unless it was explicitly restricted. If any of those friend accounts has a snoopy app installed that can access person A’s posts, N increases because now the maker of that app has the potential to acquire information about person B.

You can stop most or all of this sharing to apps on Facebook by configuring your privacy settings, so is this example a bit unrealistic? No. Even if we assume that every user of Facebook is 100% on top of perfectly configuring their privacy settings, Facebook is not the only social media game in town. And each approaches privacy differently. Imgur, for instance, flat out states that an image on their site can never truly be secret. Also, many systems require you to agree to an EULA whereby they can directly share your information with other parties. Each share with a new system (and their subsequent shares with other systems) increases N.

There’s another way N can increase – malware, hacking and physical theft. Back to the Facebook example. If the computer of any friend of person A (and including person A) has been compromised, N increases with the number of systems that receive data from that malware package, hacker and/or thief.

I’m not done. Mistakes happen and legitimate systems leak information through no act of malice. Thus, N can increase any time an information leaking mistake is made by any existing system with one or more quanta of information about an individual.

Finally, the most obvious – N increases with the number of systems an individual purposely interacts with. Just think of how many different types of systems the average individual uses – social networking, phones, SMS, instant messaging, email, banking, tax services, insurance companies, investment accounts, healthcare services, online shopping, travel booking, airlines, hotels, car rentals, video streaming services, utility companies, ISPs, educational institutions, online dating, cloud backup, search engines, and other websites.

Virtually all major systems also use both offsite physical backups and online cloud/cluster redundancy, which is not always provided by the same company. Remember – any of those previously mentioned systems or their backups have the potential to be compromised and make information leaking mistakes.

The crux of the matter is that to prevent N from increasing, you must never interact with anyone or anything in any conceivable way. Or ensure that no human ever makes a mistake or acts maliciously with your data. Therefore I argue that preventing the growth of N is intractable. See my post The Intractability Problem for a thorough discussion on intractability.
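One way to picture the growth of N is as reachability in a graph of systems, where an edge means "may pass information to" (through sharing, EULA terms, backups, leaks or theft). The sketch below is a toy model under that assumption. The graph in SHARES and every system name in it other than Facebook are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical sharing graph: each system lists the systems it may pass data to.
SHARES: Dict[str, List[str]] = {
    "facebook": ["snoopy_app", "ad_partner"],
    "snoopy_app": ["data_broker"],
    "ad_partner": ["data_broker"],
    "data_broker": [],
    "bank": ["cloud_backup"],
    "cloud_backup": [],
}


def reachable_systems(start: List[str], shares: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first search: every system that could end up holding the data."""
    seen: Set[str] = set(start)
    queue = deque(start)
    while queue:
        system = queue.popleft()
        for receiver in shares.get(system, []):
            if receiver not in seen:
                seen.add(receiver)
                queue.append(receiver)
    return seen


only_facebook = reachable_systems(["facebook"], SHARES)
with_bank = reachable_systems(["facebook", "bank"], SHARES)
print(len(only_facebook))  # 4
print(len(with_bank))      # 6
```

In this toy model the individual chooses only the starting systems. Every other edge is decided by someone else, which is why N is not entirely under the individual's control.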

Can N Decrease?

So can N ever decrease? Ignoring the obvious (natural disasters), what about by deliberate action of the individual? Perhaps something like the EU’s “right to be forgotten?” No, in fact a law like that can backfire. For example, in a recent case against Google, the search engine was ordered to prevent certain links from appearing when specific queries (a person’s name) were entered.

Do you see how this doesn’t reduce N? Google didn’t delete the information, and neither did the site the search originally pointed to. Instead, a correlation between the information and a specific term was prevented from appearing in Google’s search results. The irony is that searches for “Mario Costeja González” (the plaintiff) now not only reveal the information he tried to have removed, they also reveal his attempts to remove it. Thus, his N increased. Welcome to the Streisand effect.

Can an individual directly delete or overwrite their data to decrease N? Of course, but consider for a moment what it takes to truly delete something. For instance, dragging a file to the trash/recycle bin and emptying the bin in no way actually causes the data from that file to disappear. Instead, it tells the operating system that the space occupied by that file is now available for something else in the future.

To truly delete data, you must utilize a forensically sound procedure. To do that requires direct, physical control over the system in question.

So imagine asking the online dating site you use to “delete” your profile. What does that even mean? That site probably has backups of your data going back several months or more, in addition to redundant copies of your data stored on separate hard drives, and perhaps a bit of your data even lingers in database transaction journals. Every device involved would have to be forensically wiped for a true “delete” to occur. Do you trust every site with a “delete” button to do that immediately and expediently, without mistakes?

Thus for most cases, N does not immediately decrease by any action of the individual. Instead, the decreased likelihood of information transferring out of the system is reflected in its system transfer vector (discussed in the next section).

So we’ve established that N is huge and preventing its growth is intractable, but for any given system x (where x \in \{1,2,\ldots,N\}), what information does it actually have?

The System State Equation

The process by which a particular system’s feature selector \vec{F}_t^{(x)} evolves over time is defined by:

\vec{F}_t^{(x)} = \mathcal{L}_{t}^{(x)}(\psi_{t}^{(x)})

In other words, this equation tells us what information is known by system x at time t. \mathcal{L}_{t}^{(x)} is the learning function, which can figure out new features by analyzing the ones it’s given. I’ll go in depth on this crucial detail in the next section. \psi_{t}^{(x)} is the transfer function, which shows the transfer, prior accumulation and loss of data by a system at a particular time. It is:

\psi_{t}^{(x)} = \vec{\mathcal{U}}_t^{(x)}+\left(\vec{F}_{t-1}^{(x)}\land \vec{S}_t^{(x)}\right)+\sum\limits_{\substack{i=1\\i\neq{x}}}^{N}\vec{\mathcal{T}}_t^{(i\to x)}

If this seems intimidating, let me start the explanation by pointing out that the only mathematical operations these equations perform are logical OR (the + and \sum symbols) and logical AND (the \land symbol).

\vec{\mathcal{U}}_t^{(x)} is the user transfer vector. It is a k element feature selector indicating what information the individual transferred into system x at time t, knowingly or not. For instance, suppose an individual makes a picture post on Facebook. They knowingly uploaded the picture and entered the text of the post, but they also indicated to Facebook what path the information was delivered over, e.g., an IP address and type of browser/client. If location services weren’t specifically disabled, their physical location was provided as well.

For each piece of information they transferred, the corresponding element of the user transfer vector would be 1, and all other elements would be 0. For those times when the user transfers no data to the system, the user transfer vector would be all zeroes (a zero vector).

The third term, \sum\limits_{\substack{i=1\\i\neq{x}}}^{N}\vec{\mathcal{T}}_t^{(i\to x)}, is the accumulation (by logical OR) of the individual system transfer vectors. A system transfer vector \vec{\mathcal{T}}_t^{(i\to x)} is a k element feature selector indicating what information system i transferred to system x at time t. This term accounts for data sharing (and theft) between systems.

The middle term of the transfer function, \left(\vec{F}_{t-1}^{(x)}\land \vec{S}_t^{(x)}\right), accounts for the accumulation of data obtained prior to t as well as any loss of that data that may have occurred in this instant. \vec{F}_{t-1}^{(x)} (the prior feature selector) is a zero vector at t=0, which puts a finite bound on the recursion of the transfer function.

\vec{S}_t^{(x)} is the entropy vector, a k element vector which indicates what prior data is lost at time t. Note I use the term entropy here to describe any event that could cause a system to lose data. A bit flipping in the storage medium of a system would be just as much an entropic event as a hacker wiping a drive, a system administrator spilling his coffee on a machine or even the individual purposely deleting their data using whatever process is provided by the system.

Unlike the user and system transfer vectors, the elements of the entropy vector are normally all 1s (not zeros). This is to simplify the equation by allowing entropy to be applied with a single element-wise logical AND. Any element of the entropy vector that is 0 will cause the corresponding element of the prior (t-1) feature selector to be lost.
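The transfer function and its recursion can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the helper names (v_or, v_and, transfer, evolve) are hypothetical, the example vectors are made up, and the learning function is assumed to be the identity so that only the transfer function is exercised.

```python
from typing import Callable, List, Tuple

Vector = List[int]
# One time step: (user transfer vector, entropy vector, incoming system transfer vectors)
Step = Tuple[Vector, Vector, List[Vector]]


def v_or(*vectors: Vector) -> Vector:
    """Element-wise logical OR (the + and summation symbols)."""
    return [int(any(bits)) for bits in zip(*vectors)]


def v_and(a: Vector, b: Vector) -> Vector:
    """Element-wise logical AND (the wedge symbol)."""
    return [x & y for x, y in zip(a, b)]


def transfer(user: Vector, prior: Vector, entropy: Vector, incoming: List[Vector]) -> Vector:
    """psi_t = U_t OR (F_{t-1} AND S_t) OR (OR over all incoming T_t vectors)."""
    retained = v_and(prior, entropy)  # a 0 in the entropy vector erases prior data
    return v_or(user, retained, *incoming)


def identity(v: Vector) -> Vector:
    """Placeholder learning function (assumption: nothing new is inferred)."""
    return v


def evolve(steps: List[Step], k: int, learn: Callable[[Vector], Vector] = identity) -> List[Vector]:
    """Apply F_t = L(psi_t) step by step, starting from a zero prior selector."""
    prior = [0] * k
    history = []
    for user, entropy, incoming in steps:
        prior = learn(transfer(user, prior, entropy, incoming))
        history.append(prior)
    return history


steps: List[Step] = [
    ([1, 1, 0], [1, 1, 1], []),           # user uploads name and hair color
    ([0, 0, 1], [1, 0, 1], []),           # breakfast posted; hair color lost to entropy
    ([0, 0, 0], [1, 1, 1], [[0, 1, 0]]),  # another system transfers hair color in
]
history = evolve(steps, k=3)
print(history)  # [[1, 1, 0], [1, 0, 1], [1, 1, 1]]
```

The last step shows why loss within one system is fragile: a feature erased by the entropy vector returns as soon as any other system transfers it back in.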

Let us now address the elephant in the room…

The Learning Function

We’ve seen how N (the number of systems) has a natural inclination to increase and do so somewhat outside of an individual’s control. As a result, one’s digital footprint becomes more resilient over time because it becomes increasingly difficult to remove a particular piece of information from every system that has it.

We’ve also seen how k (the number of knowable features) increases over time, which leads to the expansion of one’s digital footprint because of the difficulty in preventing some of that information from being leaked. In other words, every action you take not only increases k by creating new historical facts, it also increases the probability that some action will lead to digital information about you being created and stored.

What may be less obvious though is how a particular system’s feature selector can grow to include more features over time, without additional information being transferred in by the user or another system. Said differently, systems can figure out new information about you by getting better at analyzing the data they already have.

But how?

Personal information in digital form is highly correlated! What this means is that most of our features are so related to each other that it’s possible to figure one out by just analyzing another. Consider for example if I have your street address without zip code. Say, “123 Main Street.” All I need is one other innocuous piece of information about you (say the IP address you connected to my website with) and I can determine your zip code, city and state. I can then search voter registration records, court documents and other sources to determine your full name and how much you spent on your house. If you’re renting, I can look at what your landlord paid/is paying for the property. Either way will give me insight into your financial capabilities.

This is literally just scratching the surface of the type of inferences, deductions and inductions that are possible. I challenge you to spend 30 minutes figuring out ways to piece someone’s life together with just a little bit of information. Now here’s the kicker – every method you came up with, a computer can (or will be able to) do, regardless of whether the original human operators can or know how to do it themselves.
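As a toy illustration of a learning function, the sketch below treats inference as a set of hand-written rules and applies them until nothing new can be derived. Real systems would rely on statistical models rather than fixed rules. The feature names, the rules in RULES and the function name learn are all hypothetical, loosely modeled on the street address example above.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical feature labels (k = 5).
FEATURES = ["street_address", "ip_address", "zip_code", "full_name", "home_price"]

# Hypothetical inference rules: if every premise is known, the conclusion becomes known.
RULES: List[Tuple[FrozenSet[str], str]] = [
    (frozenset({"street_address", "ip_address"}), "zip_code"),
    (frozenset({"street_address", "zip_code"}), "full_name"),   # e.g. via public records
    (frozenset({"street_address", "zip_code"}), "home_price"),  # e.g. via property records
]


def learn(selector: List[int]) -> List[int]:
    """Expand a feature selector by applying the rules until a fixed point is reached."""
    known = {name for name, bit in zip(FEATURES, selector) if bit}
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return [int(name in known) for name in FEATURES]


print(learn([1, 1, 0, 0, 0]))  # [1, 1, 1, 1, 1]
print(learn([1, 0, 0, 0, 0]))  # [1, 0, 0, 0, 0]
```

With only a street address nothing more is learned, but adding one innocuous feature (the IP address) unlocks every other feature in this toy model, with no new data transferred in.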

If this seems hyperbolic, that is only because we are still in the early stages of the machine learning revolution. Most people don’t realize yet that a field that took off less than 30 years ago has led to the creation of machines that can read Wikipedia (and other sources) and then beat human champions at Jeopardy. Or news aggregators like Google News that automatically determine what news articles are breaking and cover the same topic in order to group them together. Facebook, Google, Apple and others use machine learning to recognize faces and automatically tag them with the right name. Google can even analyze an image and group it with similar ones under the same search term.

The point of all of this is that we are at the dawn of an age where a single picture of someone will be enough to discover almost every piece of digital information that exists about them. Through public profiles, public information and leaked private information, an executive summary of the last 10 years of someone’s life will be only a few clicks away.

Is this preventable? What if Google explicitly prevents a search on someone’s face? Unfortunately, for the same reasons we already discussed, this is outside Google’s control. They may be able to remove certain explicit correlations from being directly searchable, but they can’t make the original information go away. Thus some other entity will fill that gap in functionality.

I’m Scared

Don’t be. Machine learning will likely save your life someday. Its development is being fueled by the quickening of computers and the availability of more data. Those same technologies that will enable us to find people by their face will allow us to accurately diagnose disease and construct personalized cures. It will allow us to identify individuals on the verge of suicide or murder and get them help before it’s too late. It will automate tedious tasks, safely route traffic, simplify the fields of governance and law and even facilitate our learning.

Think about it – one day machine learning will tell you immediately if that political meme someone just posted has any credibility to it, with the ability to drill down into any related issue in detail. Fact checking will occur live and automatically at every debate, speech and public hearing. Attorneys, judges and patent clerks will have instant access to all contextually relevant precedent. You will be able to converse with anyone and read any book on Earth, regardless of language.

Machine learning is a decidedly good thing, as are many of the inferences made possible at the cost of reduced privacy.

Conclusion

The ability for our actions and who we are to remain private will fade, but that does not mean we have to sacrifice our individuality. When everyone can be held under the same microscope it will be easier to accept the bumbling series of mistakes we all make that is called being human.

As I said before, I am not advocating the abandonment of reasonable efforts to slow the spread of our personal information. This applies both legally (regulations such as HIPAA) and to the things we do. For instance, I am not particularly interested in the activities that take place in your bedroom, nor do I really care to share such things about myself. Thus I choose to continue to perform them in a physically isolated setting and hope that you will do the same. On the other hand, I don’t fret that some day there will be a computer out there that will know exactly what you and I are doing in our respective bedrooms, regardless of how we attempt to conceal it.

I believe we won’t be truly at peace with the loss of privacy until society is at peace with itself. For example, the day that the name, face and itinerary of a child is no longer considered sensitive is the day that information being publicly available no longer meaningfully increases the threat to that child.

For the immediate future, this is a very tall order. As a stopgap measure, we may have to utilize intent prediction and detection in order to stave off some of the more severe ramifications of our degrading privacy until society catches up. Which of course will only facilitate its further encroachment.

For sure, the downward spiral of privacy is going to be one of humanity’s wilder and defining rides.

– Rob

P.S. My wife bet that I wouldn’t complete this post in 4000 words or less. She was right (~4400 words). Thank you for reading!
