The hazards, difficulties and dangers of name matching in large data pools were explored in NSA…Verizon…Obama…Connecting the Dots. Or not., republished at Naked Capitalism as: Could the Verizon-NSA Metadata Collection Be a Stealth Political Kickback?. It is safe to conclude that, without more, name matching is at best happenstance.
A private comment wondered if Social Security Numbers (SSNs) could be the magic key that ties phone records to credit records to bank records and so on. The SSN is, after all, the default government identifier in the United States, and it is certainly less untrustworthy than simple name matching. How trustworthy an SSN actually is as an identifier is the subject of this post.
Are SSNs a magic key for matching phone, credit, bank, government records?
SSNs: A Short History
Wikipedia gives us a common starting point to answer that question: http://en.wikipedia.org/wiki/Social_Security_number
In the United States, a Social Security number (SSN) is a nine-digit number issued to U.S. citizens, permanent residents, and temporary (working) residents under section 205(c)(2) of the Social Security Act, codified as 42 U.S.C. § 405(c)(2). The number is issued to an individual by the Social Security Administration, an independent agency of the United States government. Its primary purpose is to track individuals for Social Security purposes.
(…)
The original purpose of this number was to track individuals’ accounts within the Social Security program. It has since come to be used as an identifier for individuals within the United States, although rare errors occur where duplicates do exist.
The Wikipedia article also points out that duplicates issued by the Social Security Administration are rare, but people claiming the same SSN are not.
The Social Security Administration expands the story of the SSN of Mrs. Hilda Schrader Whitcher (mentioned in Wikipedia) this way:
The most misused SSN of all time was (078-05-1120). In 1938, wallet manufacturer the E. H. Ferree company in Lockport, New York decided to promote its product by showing how a Social Security card would fit into its wallets. A sample card, used for display purposes, was inserted in each wallet. Company Vice President and Treasurer Douglas Patterson thought it would be a clever idea to use the actual SSN of his secretary, Mrs. Hilda Schrader Whitcher.
The wallet was sold by Woolworth stores and other department stores all over the country. Even though the card was only half the size of a real card, was printed all in red, and had the word “specimen” written across the face, many purchasers of the wallet adopted the SSN as their own. In the peak year of 1943, 5,755 people were using Hilda’s number. SSA acted to eliminate the problem by voiding the number and publicizing that it was incorrect to use it. (Mrs. Whitcher was given a new number.) However, the number continued to be used for many years. In all, over 40,000 people reported this as their SSN. As late as 1977, 12 people were found to still be using the SSN “issued by Woolworth.” (Social Security Cards Issued By Woolworth)
Do People Claim More Than One SSN?
The best evidence that people can and do claim multiple SSNs are our information systems for tracking individuals.
The FBI’s Guidelines for Preparation of Fingerprint Cards and Associated Criminal History Information provides that:
Enter the subject’s Social Security number, if known. Additional Social Security numbers used by the subject may be entered in the “Additional Information/Basis for Caution” block #34 on the reverse side of the fingerprint card.
The FBI maintains the National Crime Information Center (NCIC), “…an electronic clearinghouse of crime data….” The system requires authorization for access and there are no published statistics about the number of social security numbers claimed by people listed in NCIC.
I can relate anecdotally that I have seen NCIC printouts that reported multiple SSNs for a single individual. I have written to the FBI asking for either a pointer to the number of individuals with multiple SSNs in NCIC or a response with that statistic.
Beyond evildoers who claim multiple SSNs, there is also the problem of identity theft, which commonly involves a person’s SSN.
Identity Theft
Another source of dirty identity data is identity theft.
How prevalent is identity theft?
Approximately 15 million United States residents have their identities used fraudulently each year with financial losses totalling upwards of $50 billion.*
On a case-by-case basis, that means approximately 7% of all adults have their identities misused with each instance resulting in approximately $3,500 in losses.
Close to 100 million additional Americans have their personal identifying information placed at risk of identity theft each year when records maintained in government and corporate databases are lost or stolen.
These alarming statistics demonstrate identity theft may be the most frequent, costly and pervasive crime in the United States. (http://www.identitytheft.info/victims.aspx)
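As a quick sanity check on those numbers (the adult-population figure below is my own rough assumption, not from the source), the quoted per-victim loss and victim rate hold up roughly as stated:

```python
# Quick sanity check of the quoted identity theft figures.
# The adult-population number is my own ballpark assumption, not from the source.
victims_per_year = 15_000_000        # "approximately 15 million" victims
total_losses = 50_000_000_000        # "upwards of $50 billion" in losses
us_adults = 235_000_000              # assumed U.S. adult population, circa 2012

loss_per_victim = total_losses / victims_per_year
victim_rate = victims_per_year / us_adults

print(f"Loss per victim: ${loss_per_victim:,.0f}")       # ~$3,333, near the quoted $3,500
print(f"Share of adults victimized: {victim_rate:.1%}")  # ~6.4%, near the quoted 7%
```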
BTW, www.IdentityTheft.info reports as of June 9, 2013, “…6,558,655 identity theft victims year-to-date.”
Assuming the NSA is monitoring all phone and other electronic traffic, what difficulties does it face with SSNs?
Partial Summary: How Dirty Are SSNs?
- From identity theft, 2012 to date, roughly 21,558,655 compromised identities (the 15 million victims in 2012 plus the 6,558,655 reported so far in 2013), each a potential error when resolving an SSN against other data.
- An unknown number of people claiming multiple SSNs, as evidenced in part by individuals listed in NCIC.
- Mistakes, foul-ups, confusion and bad record keeping by credit reporting agencies (The NSA Verizon Collection Coming on DVD) (an unknown number)
- Terrorists, being bent on mass murder, are unlikely to be stymied by “…I declare under penalties of perjury…” language or by contract clauses warning that false statements will result in denial of future service. (an unknown number)
You don’t have to take my word that reliable identification is difficult. Ask your local district attorney what evidence they need to prove someone was previously convicted of drunk driving. The courts have wrestled with this type of issue for years, which is one reason why FBI record keeping requires biometric data along with names and numbers.
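To make the “dirty SSN” problem concrete, here is a minimal sketch, with entirely hypothetical records and field names, of the kind of check a matching system needs before trusting an SSN as a join key: flag any SSN that appears under more than one distinct name across sources.

```python
from collections import defaultdict

# Hypothetical records from different sources; every field and value here is illustrative.
records = [
    {"source": "phone",  "ssn": "078-05-1120", "name": "Hilda Whitcher"},
    {"source": "credit", "ssn": "078-05-1120", "name": "John Doe"},
    {"source": "bank",   "ssn": "219-09-9999", "name": "Jane Roe"},
]

# Collect the distinct names seen for each SSN across all sources.
names_by_ssn = defaultdict(set)
for rec in records:
    names_by_ssn[rec["ssn"]].add(rec["name"])

# Any SSN claimed by more than one name is unsafe to use as a blind join key.
suspect = {ssn: sorted(names) for ssn, names in names_by_ssn.items() if len(names) > 1}
print(suspect)  # {'078-05-1120': ['Hilda Whitcher', 'John Doe']}
```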
Does More Data = Better Data?
The debate over data collection should distinguish two uses of large data sets.
Pattern Matching
The most common use is to search for patterns in data. For example, Twitter users forming tribes with own language, tweet analysis shows.
Another example of pattern matching research was described as:
When Senn was first given his assignment to compare two months of weather satellite data with 830 million GPS records of 80 million taxi trips, he was a little disappointed. “Everyone in Singapore knows it’s impossible to get a taxi in a rainstorm,” says Senn, “so I expected the data to basically confirm that assumption.” As he sifted through the data related to a vast fleet of more than 16,000 taxicabs, a strange pattern emerged: it appeared that many taxis weren’t moving during rainstorms. In fact, the GPS records showed that when it rained (a frequent occurrence in this tropical island state), many drivers pulled over and didn’t pick up passengers at all.
Senn confirmed his findings by sitting down with drivers. And what did he learn?
He learned that the company owning most of the island’s taxis would withhold S$1,000 (about US$800) from a driver’s salary immediately after an accident until it was determined who was at fault. The process could take months, and the drivers had independently decided that it simply wasn’t worth the risk of having their livelihood tangled up in bureaucracy for that long. So when it started raining, they simply pulled over and waited out the storm. Why you don’t get taxis in Singapore when it rains?
“…[U]sing two months of weather satellite data with 830 million GPS records of 80 million taxi trips…” Sounds like the NSA project. Yes?
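The taxi study is, at bottom, a large join-and-aggregate. A toy version of the same pattern detection, with invented data and column names standing in for the GPS and weather feeds, could look like this:

```python
import pandas as pd

# Toy stand-ins for the two data sets; taxi IDs, hours and weather values are invented.
gps = pd.DataFrame({
    "taxi_id": [1, 1, 2, 2, 3, 3],
    "hour":    [9, 10, 9, 10, 9, 10],
    "moving":  [True, False, True, False, True, True],
})
weather = pd.DataFrame({
    "hour":    [9, 10],
    "raining": [False, True],
})

# Join GPS pings to weather by hour, then compare movement in rain vs. no rain.
joined = gps.merge(weather, on="hour")
idle_rate = 1 - joined.groupby("raining")["moving"].mean()
print(idle_rate)
# Expect a noticeably higher idle rate when raining is True, the pattern Senn saw.
```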
Detecting patterns is one thing. But patterns don’t connect diverse data sources. Nor do they provide explanations.
Reconciling Dirty Data
Starting from diverse data sets, even if they purport to share SSNs, the difficult question is how to reconcile the data. Any of the data sets could be correct or they could all be incorrect.
Here is a more formal statement on error analysis and multiple data sets:
The most challenging problem within data cleansing remains the correction of values to eliminate domain format errors, constraint violations, duplicates and invalid tuples. In many cases the available information and knowledge is insufficient to determine the correct modification of tuples to remove these anomalies. This leaves deleting those tuples as the only practical solution. This deletion of tuples leads to a loss of information if the tuple is not invalid as a whole. This loss of information can be avoided by keeping the tuple in the data collection and mask the erroneous values until appropriate information for error correction is available. The data management system is then responsible for enabling the user to include and exclude erroneous tuples in processing and analysis where this is desired.
In other cases the proper correction is known only roughly. This leads to a set of alternative values. The same is true when dissolving contradictions and merging duplicates without exactly knowing which of the contradicting values is the correct one. The ability of managing alternative values allows to defer the error correction until one of the alternatives is selected as the right correction. Keeping alternative values has a major impact on managing and processing the data. Logically, each of the alternatives forms a distinct version of the data collection, because the alternatives are mutually exclusive. It is a technical challenge to manage the large amount of different logical versions and still enable high performance in accessing and processing them.
When performing data cleansing one has to keep track of the version of data used because the deduced values can depend on a certain value from the set of alternatives of being true. If this specific value later becomes invalid, maybe because another value is selected as the correct alternative, all deduced and corrected values based on the now invalid value have to be discarded. For this reason the cleansing lineage of corrected values has to be maintained. By cleansing lineage we mean the entirety of values and tuples used within the cleansing of a certain tuple. If any value in the lineage becomes invalid or changes the performed operations have to be redone to verify the result is still valid. The management of cleansing lineage is also of interest for the cleansing challenges described in the following two sections. (Problems, Methods, and Challenges in Comprehensive Data Cleansing by Heiko Müller and Johann-Christoph Freytag)
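A bare-bones illustration of the ideas in that passage, masking suspect values, keeping mutually exclusive alternatives, and remembering which inputs a correction depends on, might be structured like this (the class and field names are mine, not the paper’s):

```python
from dataclasses import dataclass, field

@dataclass
class CleansedValue:
    """One field of a tuple during cleansing: a trusted value, a masked
    (erroneous) value, or a set of mutually exclusive alternative corrections."""
    raw: str
    masked: bool = False                               # True: exclude from analysis for now
    alternatives: list = field(default_factory=list)   # candidate corrections, none chosen yet
    lineage: list = field(default_factory=list)        # inputs this correction depends on

    def choose(self, value, depends_on=()):
        """Select one alternative as the correction and record its lineage,
        so the choice can be revisited if any dependency is later invalidated."""
        assert value in self.alternatives
        self.raw = value
        self.masked = False
        self.alternatives = []
        self.lineage = list(depends_on)

# Hypothetical usage: an SSN field that conflicts across sources.
ssn = CleansedValue(raw="078-05-1120", masked=True,
                    alternatives=["078-05-1120", "219-09-9999"])
ssn.choose("078-05-1120", depends_on=["bank_record_42"])
print(ssn)
```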
The more data you collect, the more problematic accurate mass identification becomes.
NSA Competency with Data (SSN or otherwise)
The “Underwear Bomber’s” parents met with the CIA at least twice to warn the agency about their son.
A useful Senate Budget hearing on the NSA and its acquisition of phone, credit, bank and other records should go something like this:
The following dialogue is fictional but the facts and links are real.
Sen. X: Mr. N, as a representative of the NSA, are you familiar with the case of Umar Farouk Abdulmutallab?
Mr. N: Yes.
Sen. X: I understand that the CIA entered his name in the Terrorist Identities Datamart Environment in November of 2009. But his name was not added to the FBI’s Terrorist Screening Database, which feeds the Secondary Screening Selectee list and the U.S.’s No Fly List.
Mr. N: Yes.
Sen. X: The Terrorist Identities Datamart Environment, Terrorist Screening Database, Secondary Screening Selectee list and the U.S.’s No Fly List are all U.S. government databases? Databases to which the NSA has complete access?
Mr. N: Yes.
Sen. X: So, the NSA was unable to manage data in four (4) U.S. government databases well enough to prevent a terrorist from boarding a plane destined for the United States.
My question is: if the NSA can’t manage four U.S. government databases, what proof is there that the NSA can usefully manage all phone and other electronic traffic?
Mr. N: I don’t know.
Sen. X: Who would know?
Mr. N: The computer and software bidders for the NSA DarkStar facility in Utah.
Sen. X: And who would they be?
Mr. N: That’s classified.
Sen. X: Classified?
Mr. N: Yes, classified.
Sen. X: Has it ever occurred to you that bidders have an interest in their contracts being funded and agencies in having larger budgets?
Mr. N: Sorry, I don’t understand the question.
Sen. X: My bad, I really didn’t think you would.
End of fictional hearing transcript
The known facts show the NSA couldn’t manage four (4) U.S. government databases well enough to prevent a known terrorist from entering the U.S.
What evidence would you offer to prove the NSA is competent to use complex data sets? (You can find more negative evidence on eavesdropping at Bruce Schneier’s NSA Eavesdropping Yields Dead Ends.)
PS: On the National Security Industrial Complex, see: Apparently Some Stuff Happened This Weekend
Addendum: Edward Snowden Makes Himself an Even Bigger Problem to the Officialdom
A must watch interview with Edward Snowden and great commentary as well.
On a quick listen, you may think Edward is describing a more competent system than I do above.
On the contrary, if you listen closely, Edward does not diverge from anything that I have said on this issue to date.
Starting at time mark 07:10, Glenn Greenwald asks:
Why should people care about surveillance?
Snowden: Because even if you aren’t doing anything wrong, you are being watched and recorded and the storage capability of these systems increases every year, consistently, by orders of magnitude, ah, to where it is getting to the point that you don’t have to have done anything wrong, you have to eventually fall under suspicion from somebody, even from a wrong call, and they can use the system to go back in time and scrutinize every decision you have ever made, every friend you have ever discussed anything with, and attack you on that basis to sort of derive suspicion from an innocent life and paint anyone in the context of a wrong doer.
Yes, the NSA can use a phone call to search all other phone calls within the phone call database. Ho-hum. Annoying, but hardly high-tech merging of data from diverse data sources.
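That kind of within-database search is simple. A sketch of what “use a phone call to search all other phone calls” amounts to, with an invented call-record layout, is just a filter over one table:

```python
# Invented call-detail records as (caller, callee, timestamp) tuples; not any real schema.
calls = [
    ("555-0101", "555-0199", "2013-06-01T09:00"),
    ("555-0199", "555-0142", "2013-06-02T14:30"),
    ("555-0142", "555-0177", "2013-06-03T11:15"),
]

seed = "555-0199"

# One-hop lookup: every call that touches the seed number.
touching_seed = [c for c in calls if seed in (c[0], c[1])]
print(touching_seed)  # the first two records
```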
It is also true that once you are selected, the NSA could invest the time and effort to reconcile all the information about you, on a one-off basis.
But that has always been the case.
The public motivation for the NSA project was to data mine diverse data sources: computers replacing labor-intensive human investigation of terrorism.
But as Snowden points out, it takes a human to connect dots in the noisy results of computer processing.
Fewer humans = less effective against terrorism.