The Programming Language for Mass Surveillance

According to government documents studied by The New York Times, the FBI asked several phone companies to analyze phone-call patterns of Americans using a technology called “communities of interest”. Verizon refused, saying that it didn’t have any such technology. AT&T, famously, did not refuse.

What is the “communities of interest” technology? It’s spelled out very clearly in a 2001 research paper from AT&T itself, entitled “Communities of Interest” (by C. Cortes, D. Pregibon, and C. Volinsky). They use high-tech data-mining algorithms to scan through the huge daily logs of every call made on the AT&T network; then they use sophisticated algorithms to analyze the connections between phone numbers: who is talking to whom? The paper literally uses the term “Guilt by Association” to describe what they’re looking for: what phone numbers are in contact with other numbers that are in contact with the bad guys?

When this research was done, back in the last century, the bad guys were people who wanted to rip off AT&T by making fraudulent credit-card calls. (Remember, back in the last century, intercontinental long-distance voice communication actually cost money!) But it’s easy to see how the FBI could use this to chase down anyone who talked to anyone who talked to a terrorist. Or even to a “terrorist.”

AT&T Invents Surveillance Programming Language  

By Ryan Singel

From the company that brought you the C programming language comes Hancock, a C variant developed by AT&T researchers to mine gigabytes of the company’s telephone and internet records for surveillance purposes.

An AT&T research paper published in 2001 and unearthed today by Andrew Appel at Freedom to Tinker shows how the phone company uses Hancock-coded software to crunch through tens of millions of long distance phone records a night to draw up what AT&T calls “communities of interest” — i.e., calling circles that show who is talking to whom.

The system was built in the late 1990s to develop marketing leads, and as a security tool to see if new customers called the same numbers as previously cut-off fraudsters — something the paper refers to as “guilt by association.”
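The "guilt by association" check described above can be pictured as a simple overlap score. What follows is a minimal Python sketch, not AT&T's method: the function names, the (caller, callee) record layout and the scoring rule (share of a new account's dialed numbers that a cut-off fraud account also dialed) are all assumptions made for illustration.

```python
from collections import defaultdict


def called_numbers(call_records):
    """Map each caller to the set of numbers it has dialed."""
    contacts = defaultdict(set)
    for caller, callee in call_records:
        contacts[caller].add(callee)
    return contacts


def overlap_with_fraudsters(call_records, fraud_accounts, new_accounts):
    """Score each new account by the share of its dialed numbers
    that were also dialed by a previously cut-off fraud account."""
    contacts = called_numbers(call_records)
    fraud_contacts = set()
    for account in fraud_accounts:
        fraud_contacts |= contacts.get(account, set())
    scores = {}
    for account in new_accounts:
        dialed = contacts.get(account, set())
        shared = dialed & fraud_contacts
        scores[account] = len(shared) / len(dialed) if dialed else 0.0
    return scores


# Hypothetical records: (caller, callee)
calls = [
    ("fraud-1", "555-0101"), ("fraud-1", "555-0102"),
    ("new-A", "555-0101"), ("new-A", "555-0102"),
    ("new-B", "555-0199"), ("new-B", "555-0101"),
]
scores = overlap_with_fraudsters(calls, {"fraud-1"}, ["new-A", "new-B"])
print(scores)
```

A high score only says that two accounts dialed the same numbers. It does not show that the same person is behind both, which is exactly why the phrase "guilt by association" is apt.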

But it’s of interest to THREAT LEVEL because of recent revelations that the FBI has been requesting “communities of interest” records from phone companies under the USA PATRIOT Act without a warrant. Where the bureau got the idea that phone companies collect such data has, until now, been a mystery.

According to a letter from Verizon to a congressional committee earlier this month, the FBI has been asking Verizon for “community of interest” records on some of its customers out to two generations — i.e., not just the people that communicated with an FBI target, but also those who talked to people who talked to an FBI target. Verizon, though, doesn’t create those records and couldn’t comply. Now it appears that AT&T invented the concept and the technology. It even owns a patent on some of its data mining methods, issued to two of Hancock’s creators in 2002.
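A request "out to two generations" amounts to a breadth-first walk of the call graph that stops two hops from the target. The Python sketch below assumes call records are plain (caller, callee) pairs and treats each call as an undirected link. The function name and record layout are hypothetical, and nothing here is taken from AT&T's or the FBI's actual procedures.

```python
from collections import defaultdict, deque


def community_of_interest(call_records, target, generations=2):
    """Return {number: hops} for every number within `generations`
    calls of `target`, using breadth-first search."""
    graph = defaultdict(set)
    for caller, callee in call_records:
        graph[caller].add(callee)
        graph[callee].add(caller)  # a call links both parties

    hops = {target: 0}
    queue = deque([target])
    while queue:
        number = queue.popleft()
        if hops[number] == generations:
            continue  # do not expand past the last generation
        for neighbor in graph[number]:
            if neighbor not in hops:
                hops[neighbor] = hops[number] + 1
                queue.append(neighbor)

    del hops[target]  # the target is not part of its own circle
    return hops


# Hypothetical records: T called A, A called B, B called C, D called T
calls = [("T", "A"), ("A", "B"), ("B", "C"), ("D", "T")]
circle = community_of_interest(calls, "T")
print(circle)
```

In this toy data C is three calls away from the target and is left out. Raising `generations` by one would pull C in, which shows how quickly such circles grow.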

Programs written in Hancock work by analyzing data as it flows into a data warehouse. That differentiates the language from traditional data-mining applications, which tend to look for patterns in static databases. A 2004 paper published in ACM Transactions on Programming Languages and Systems shows how Hancock code can sift calling card records, long distance calls, IP addresses and internet traffic dumps, and even track the physical movements of mobile phone customers as their signal moves from cell site to cell site.
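The difference between stream processing and querying a static database can be shown with a short sketch. The Python below is an illustration only, not Hancock: it folds each arriving record into a small per-number profile in a single pass, so the raw records never need to be scanned again. The record layout (caller, callee, seconds) and the profile fields are assumptions.

```python
from collections import defaultdict


def new_profile():
    """Empty running summary for one phone number."""
    return {"calls": 0, "seconds": 0, "contacts": set()}


def update_profiles(profiles, record_stream):
    """Fold each arriving record into per-caller running totals.
    The stream is consumed once; no stored table is re-queried."""
    for caller, callee, seconds in record_stream:
        profile = profiles[caller]
        profile["calls"] += 1
        profile["seconds"] += seconds
        profile["contacts"].add(callee)
    return profiles


def nightly_feed():
    """Hypothetical stand-in for a night's worth of call records."""
    yield ("555-0100", "555-0101", 60)
    yield ("555-0100", "555-0102", 30)
    yield ("555-0103", "555-0100", 120)


profiles = defaultdict(new_profile)
update_profiles(profiles, nightly_feed())
print(profiles["555-0100"])
```

Calling `update_profiles` again with the next night's feed extends the same profiles, which is the sense in which the analysis follows the data as it arrives.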

With Hancock, “analysts could store sufficiently precise information to enable new applications previously thought to be infeasible,” the program authors wrote. AT&T uses Hancock code to sift 9 GB of telephone traffic data a night, according to the paper.

The good news for budding data miners is that Hancock’s source code and binaries (now up to version 2.0) are available free to noncommercial users from an AT&T Research website.

The instruction manual (.pdf) is also free, and old-timers will appreciate its spare Kernighan & Ritchie style. The manual even includes a few sample programs in the style of K&R’s Hello World, but coded specifically to handle data collected by AT&T’s phone and internet switches. This one reads in a dump of internet headers, picks out the packets sent to or from a list of IP addresses of interest, and prints them out, in fewer than 40 lines of code.

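#include "ipRec.hh" 
#include "ihash.h" 

hash_table *ofInterest; 

/* Return 1 if either endpoint of the packet is in the ofInterest table. */
int inSet (ipPacket_t * p) 
{
 if (hash_get (ofInterest, p->source.hash_value) == 1) 
  return 1; 
 if (hash_get (ofInterest, p->dest.hash_value) == 1) 
  return 1; 
 return 0; 
}

void sig_main (ipAddr_s addrs < l:>, 
     ipPacket_s packets < p:>)  /* packet stream; declaration inferred from its use below */
{
 /* code to set up hash table */ 
 ofInterest = hash_empty (); 
 iterate
  (over addrs) { 
  event (ipAddr_t * addr) { 
    if (hash_insert (ofInterest, addr->hash_value, 1) < 0) 
     exit (1);  /* assumed failure handling; this branch is missing from the listing */
  } 
 }; 

 /* code to select packets */ 
 iterate
  (over packets 
   filteredby inSet) { 
  event (ipPacket_t * p) { 
    printPacketInfo (p); 
  } 
 }; 
}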
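
For readers without a Hancock compiler, the same logic can be approximated in ordinary Python. This is a rough analog rather than a translation of the manual's program: the `Packet` type, its field names and the sample addresses are hypothetical, and a Python set stands in for the hash table.

```python
from typing import Iterable, Iterator, NamedTuple


class Packet(NamedTuple):
    """Hypothetical stand-in for a packet header record."""
    source: str
    dest: str


def select_packets(addrs: Iterable[str],
                   packets: Iterable[Packet]) -> Iterator[Packet]:
    """Yield packets whose source or destination is of interest."""
    of_interest = set(addrs)      # plays the role of the hash table
    for p in packets:             # plays the role of "iterate ... filteredby"
        if p.source in of_interest or p.dest in of_interest:
            yield p


packets = [
    Packet("10.0.0.1", "192.0.2.7"),
    Packet("10.0.0.2", "198.51.100.4"),
    Packet("192.0.2.7", "10.0.0.3"),
]
selected = list(select_packets(["192.0.2.7"], packets))
for p in selected:
    print(p.source, "->", p.dest)
```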

Another sample program included in the manual shows how a Hancock program could create historical maps of a person’s travels by recording nightly what cell phone towers a person’s phone had used or pinged throughout a day.
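
The idea behind that second program can be sketched in a few lines of Python. This is not the manual's code: the ping layout (phone, ISO-format timestamp, tower ID) and the function name are assumptions. The sketch groups a day's pings by phone, puts them in time order and collapses repeated sightings at the same tower.

```python
from collections import defaultdict
from itertools import groupby


def daily_tower_trails(pings):
    """Build {(phone, day): [towers in order visited]} from
    (phone, iso_timestamp, tower_id) records."""
    by_phone_day = defaultdict(list)
    for phone, timestamp, tower in pings:
        day = timestamp[:10]  # "YYYY-MM-DD" prefix of an ISO timestamp
        by_phone_day[(phone, day)].append((timestamp, tower))

    trails = {}
    for key, seen in by_phone_day.items():
        seen.sort()  # ISO timestamps in one format sort chronologically
        towers_in_order = (tower for _, tower in seen)
        # groupby merges consecutive repeats of the same tower
        trails[key] = [tower for tower, _ in groupby(towers_in_order)]
    return trails


# Hypothetical pings, deliberately out of order
pings = [
    ("555-0100", "2007-10-29T08:00", "tower-A"),
    ("555-0100", "2007-10-29T09:30", "tower-B"),
    ("555-0100", "2007-10-29T08:30", "tower-A"),
    ("555-0100", "2007-10-30T07:00", "tower-C"),
]
trails = daily_tower_trails(pings)
print(trails)
```

Run nightly and appended to a per-phone history, output like this would accumulate into the kind of travel map the manual's example is said to produce.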

AT&T is currently defending itself in federal court from allegations that it installed, on behalf of the NSA, secret internet spying rooms in its domestic internet switching facilities. AT&T and Verizon are also accused of giving the NSA access to billions of Americans’ phone records, in order to data-mine them to spot suspected terrorists, and presumably to identify targets for warrantless wiretapping.