Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 12, 2013

Anatomy of Programming Languages

Filed under: Functional Programming,Haskell,Programming — Patrick Durusau @ 11:14 am

Anatomy of Programming Languages by William R. Cook.

From the Introduction:

In order to understand programming languages, it is useful to spend some time thinking about languages in general. Usually we treat language like the air we breathe: it is everywhere but it is invisible. I say that language is invisible because we are usually more focused on the message, or the content, that is being conveyed than on the structure and mechanisms of the language itself. Even when we focus on our use of language, for example in writing a paper or a poem, we are still mostly focused on the message we want to convey, while working with (or struggling with) the rules and vocabulary of the language as a given set of constraints. The goal is to work around and avoid problems. A good language is invisible, allowing us to speak and write our intent clearly and creatively.

The same is true for programming. Usually we have some important goal in mind when writing a program, and the programming language is a vehicle to achieve the goal. In some cases the language may fail us, by acting as an impediment or obstacle rather than an enabler. The normal reaction in such situations is to work around the problem and move on.

The study of language, including the study of programming languages, requires a different focus. We must examine the language itself, as an artifact. What are its rules? What is the vocabulary? How do different parts of the language work together to convey meaning? A user of a language has an implicit understanding of answers to these questions. But to really study language we must create an explicit description of the answers to these questions.

The concepts of structure and meaning have technical names. The structure of a language is called its syntax. The rules that define the meaning of a language are called semantics. Syntax is a particular way to structure information, while semantics can be viewed as a mapping from syntax to its meaning, or interpretation. The meaning of a program is usually some form of behavior, because programs do things. Fortunately, as programmers we are adept at describing the structure of information, and at creating mappings between different kinds of information and behaviors. This is what data structures and functions/procedures are for.

Thus the primary technique in these notes is to use programming to study programming languages. In other words, we will write programs to represent and manipulate programs. One general term for this activity is metaprogramming. A metaprogram is any program whose input or output is a program. Familiar examples of metaprograms include compilers, interpreters, virtual machines. In this course we will read, write and discuss many metaprograms.
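
Cook's definition of a metaprogram (any program whose input or output is a program) is easy to see in miniature. The book works in Haskell; what follows is only a rough Python sketch of the same idea, an interpreter that maps the syntax of tiny arithmetic expressions to their meaning:

    # A minimal sketch of a metaprogram: a program whose input is a program
    # (a tiny arithmetic expression represented as nested tuples) and whose
    # output is that program's meaning (a number). Not Cook's Haskell code.

    def evaluate(expr):
        """Map syntax (nested tuples) to semantics (an integer)."""
        if isinstance(expr, int):          # literal
            return expr
        op, left, right = expr             # compound expression
        if op == "add":
            return evaluate(left) + evaluate(right)
        if op == "mul":
            return evaluate(left) * evaluate(right)
        raise ValueError("unknown operator: %r" % op)

    # ("add", 1, ("mul", 2, 3)) is the "program"; 7 is its meaning.
    print(evaluate(("add", 1, ("mul", 2, 3))))   # -> 7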

I have only started to read this book but to say:

A good language is invisible, allowing us to speak and write our intent clearly and creatively.

May describe a good quality for a language but it is also the source of much difficulty.

Our use of a word may be perfectly clear to us but that does not mean it is clear to others.

For example:

Verbal Confusion

Wood you believe that I didn’t no
About homophones until too daze ago?
That day in hour class in groups of for,
We had to come up with won or more.

Mary new six; enough to pass,
But my ate homophones lead the class.
Then a thought ran threw my head,
”Urn a living from homophones,” it said.

I guess I just sat and staired into space.
My hole life seamed to fall into place.
Our school’s principle happened to come buy,
And asked about the look in my I.

“Sir,” said I as bowled as could bee,
My future roll I clearly see.”
“Sun,” said he, “move write ahead,
Set sail on your coarse, Don’t be mislead.”

I herd that gnus with grate delight.
I will study homophones both day and knight,
For weaks and months, through thick oar thin,
I’ll pursue my goal. Eye no aisle win.

—George E. Coon
The Reading Teacher, April, 1976

I first saw this at Verbal Confusion.

June 11, 2013

How the NSA, and your boss, can intercept and break SSL

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 2:31 pm

How the NSA, and your boss, can intercept and break SSL by Steven J. Vaughan-Nichols.

From the post:

Is the National Security Agency (NSA) really “wiretapping” the Internet? Accused accomplices Microsoft and Google deny that they have any part in it and the core evidence isn’t holding up that well under closer examination.

Some, however, doubt that the NSA could actually intercept and break Secure-Socket Layer (SSL) protected Internet communications.

Ah, actually the NSA can.

And, you can too and it doesn’t require “Mission Impossible” commandos, hackers or supercomputers. All you need is a credit-card number.

There are many ways to attack SSL, but you don’t need fake SSL certificates, a rogue Certification Authority (CA), or variations on security expert Moxie Marlinspike’s man-in-the-middle SSL attacks. Why go to all that trouble when you can just buy a SSL interception proxy, such as Blue Coat Systems’ ProxySG or their recently acquired Netronome SSL appliance to do the job for you?

Blue Coat, the biggest name in the SSL interception business, is far from the only one offering SSL interception and breaking in a box. Until recently, for example, Microsoft would sell you a program, Forefront Threat Management Gateway 2010, which could do the job for you as well.

There’s nothing new about these services. Packet Forensics was advertising appliances that could do this in 2010. The company is still in business and, while they’re keeping a low profile, they appear to be offering the same kind of devices with the same services.

What would you like to bet the NSA took the most expensive route to a commodity product to break SSL?

Improving your Erlang programming skills doing katas

Filed under: Erlang,Programming — Patrick Durusau @ 2:14 pm

Improving your Erlang programming skills doing katas by Paolo D’incau.

From the post:

There is one sure thing about programming: you should try to improve your set of skills in a regular way. There are several different methods to achieve this kind of result: reading books and blogs, working on your own pet project and doing pair programming are all very good examples of this, but today I want to introduce you code kata. What is a kata? Well, since you ask, you won’t mind if I digress for a while first!

What is a kata?

In Japanese, the word kata is used to describe choreographed patterns of movements that are practised in solo or possibly with a partner. Kata are especially applied in martial arts because they do represent a way of teaching and practicing in a systematic approach rather than as individuals in a clumsy manner. If the concept of kata is still not clear (shame on me!) you just need to watch again the movie Karate Kid. For the whole movie Mr. Miyagi San teaches Daniel LaRusso the importance of kata and we know that Miyagi San is always right!

The basic concept behind kata is fairly simple: if we keep on practicing in a repetitive manner we can acquire the ability to execute movements without hesitation and to adapt them to a set of different situations without any fear. Pretty cool uh?

Coming back to the good old world of software developers (and especially Erlang ones) we may ask ourselves: “how can we apply the concept of kata to our daily routine?”. David Thomas (one of the authors of “The Pragmatic Programmer”) introduced the concept of Code Kata which is a programming exercise useful to improve our knowledge and skills through practice and repetition. The interesting point of code kata is that usually the exercises proposed are easy and can be implemented on a step-by-step fashion.

A chance to learn/improve your Erlang skills and to learn a good new habit! (Bad habits are easy to acquire.)

Enjoy!

Rya: A Scalable RDF Triple Store for the Clouds

Filed under: Cloud Computing,RDF,Rya,Semantic Web — Patrick Durusau @ 2:06 pm

Rya: A Scalable RDF Triple Store for the Clouds by Roshan Punnoose, Adina Crainiceanu, and David Rapp.

Abstract:

Resource Description Framework (RDF) was designed with the initial goal of developing metadata for the Internet. While the Internet is a conglomeration of many interconnected networks and computers, most of today’s best RDF storage solutions are confined to a single node. Working on a single node has significant scalability issues, especially considering the magnitude of modern day data. In this paper we introduce a scalable RDF data management system that uses Accumulo, a Google Bigtable variant. We introduce storage methods, indexing schemes, and query processing techniques that scale to billions of triples across multiple nodes, while providing fast and easy access to the data through conventional query mechanisms such as SPARQL. Our performance evaluation shows that in most cases, our system outperforms existing distributed RDF solutions, even systems much more complex than ours.

Based on Accumulo (open-source NoSQL database by the NSA).

Interesting re-thinking of indexing of triples.
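
The indexing idea, as I read the paper, is to write each triple under three permuted orderings (SPO, POS, OSP) so that any pattern with a bound leading term becomes a cheap range scan. A rough Python sketch of the row-key layout (my own illustration, not Rya's actual Accumulo code):

    # Rough illustration of triple-permutation indexing (my reading of the
    # paper, not Rya's code): each triple is written under three key
    # orderings so any single bound term can start a prefix/range scan.

    SEP = "\x00"

    def index_rows(s, p, o):
        """Return the three row keys written for one triple."""
        return {
            "spo": SEP.join([s, p, o]),
            "pos": SEP.join([p, o, s]),
            "osp": SEP.join([o, s, p]),
        }

    def scan(table, prefix):
        """Stand-in for an Accumulo range scan: rows starting with prefix."""
        return [row for row in table if row.startswith(prefix)]

    triples = [("alice", "knows", "bob"), ("alice", "worksAt", "navy"),
               ("carol", "knows", "bob")]
    pos_table = sorted(index_rows(*t)["pos"] for t in triples)

    # ?who knows bob  ->  scan the POS table with prefix "knows\x00bob"
    print(scan(pos_table, SEP.join(["knows", "bob"])))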

Future work includes owl:sameAs, owl:inverseOf and other inferencing rules.

Certainly a project to watch.

The Pragmatic Haskeller, Episode 5 – Let’s Write a DSL!

Filed under: DSL,Functional Programming,Haskell — Patrick Durusau @ 1:26 pm

The Pragmatic Haskeller, Episode 5 – Let’s Write a DSL! by Alfredo Di Napoli.

From the post:

Good morning everyone,

after a small break let’s resume our journey into the pragmatic world of the “Pragmatic Haskeller” series, this time exploring Parsec, a combinators library which will allow us to write a domain specific language to describe recipes.

We’ll see how Haskell type safety makes the process a breeze, and as a nice side effect (pun intended), our parser will be less than 100 SLOC! What are you waiting for? The code and the full disclosure is hosted once again on “The School of Haskell”, so that I can provide the reader with interactive examples:

https://www.fpcomplete.com/user/adinapoli/the-pragmatic-haskeller/episode-5-a-simple-dsl

Very cool!

Alfredo challenges you to find a bug in the data structure.

My challenge would be how do we reconcile DSLs with the same subjects but different tokens to represent them?
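
If you want the flavor of the exercise without Haskell, here is a tiny hand-rolled sketch of the same kind of thing, parsing a recipe-like DSL into a data structure. The grammar here is invented for illustration and is not Alfredo's Parsec code:

    # Toy recipe DSL parser, invented purely for illustration (the real post
    # uses Haskell's Parsec and its own grammar). Lines look like:
    #   "2 cups flour"

    import re

    LINE = re.compile(r"^\s*(?P<qty>\d+(?:\.\d+)?)\s+(?P<unit>\w+)\s+(?P<name>.+?)\s*$")

    def parse_recipe(text):
        """Parse one ingredient per line into (quantity, unit, name) tuples."""
        ingredients = []
        for lineno, line in enumerate(text.splitlines(), 1):
            if not line.strip():
                continue
            m = LINE.match(line)
            if not m:
                raise SyntaxError("line %d: cannot parse %r" % (lineno, line))
            ingredients.append((float(m.group("qty")), m.group("unit"), m.group("name")))
        return ingredients

    print(parse_recipe("2 cups flour\n1 pinch salt"))
    # -> [(2.0, 'cups', 'flour'), (1.0, 'pinch', 'salt')]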

The Filtering vs. Clustering Dilemma

Filed under: Clustering,Graphs,Mathematics,Topological Data Analysis,Topology — Patrick Durusau @ 1:10 pm

Hierarchical information clustering by means of topologically embedded graphs by Won-Min Song, T. Di Matteo, and Tomaso Aste.

Abstract:

We introduce a graph-theoretic approach to extract clusters and hierarchies in complex data-sets in an unsupervised and deterministic manner, without the use of any prior information. This is achieved by building topologically embedded networks containing the subset of most significant links and analyzing the network structure. For a planar embedding, this method provides both the intra-cluster hierarchy, which describes the way clusters are composed, and the inter-cluster hierarchy which describes how clusters gather together. We discuss performance, robustness and reliability of this method by first investigating several artificial data-sets, finding that it can outperform significantly other established approaches. Then we show that our method can successfully differentiate meaningful clusters and hierarchies in a variety of real data-sets. In particular, we find that the application to gene expression patterns of lymphoma samples uncovers biologically significant groups of genes which play key-roles in diagnosis, prognosis and treatment of some of the most relevant human lymphoid malignancies.

I like the framing of the central issue a bit further in the paper:

Filtering information out of complex datasets is becoming a central issue and a crucial bottleneck in any scientific endeavor. Indeed, the continuous increase in the capability of automatic data acquisition and storage is providing an unprecedented potential for science. However, the ready accessibility of these technologies is posing new challenges concerning the necessity to reduce data-dimensionality by filtering out the most relevant and meaningful information with the aid of automated systems. In complex datasets information is often hidden by a large degree of redundancy and grouping the data into clusters of elements with similar features is essential in order to reduce complexity [1]. However, many clustering methods require some a priori information and must be performed under expert supervision. The requirement of any prior information is a potential problem because often the filtering is one of the preliminary processing on the data and therefore it is performed at a stage where very little information about the system is available. Another difficulty may arise from the fact that, in some cases, the reduction of the system into a set of separated local communities may hide properties associated with the global organization. For instance, in complex systems, relevant features are typically both local and global and different levels of organization emerge at different scales in a way that is intrinsically not reducible. We are therefore facing the problem of catching simultaneously two complementary aspects: on one side there is the need to reduce the complexity and the dimensionality of the data by identifying clusters which are associated with local features; but, on the other side, there is a need of keeping the information about the emerging global organization that is responsible for cross-scale activity. It is therefore essential to detect clusters together with the different hierarchical gatherings above and below the cluster levels. (emphasis added)

Simplification of data is always lossy. The proposed technique does not avoid all loss but hopes to mitigate its consequences.

Briefly, the technique relies upon building a network of the “most significant” links and analyzing the network structure. The synthetic and real data sets show that the technique works quite well. At least for data sets where we can judge the outcome.
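
A crude way to get a feel for the “network of most significant links” idea (a simplification for intuition only, not the planar embedding method of the paper): keep only the strongest pairwise similarities as edges and read clusters off as connected components.

    # Crude illustration of clustering by keeping only the strongest links
    # (the paper builds topologically embedded planar graphs, which this
    # sketch does not attempt).

    from itertools import combinations

    def cluster_by_top_links(similarity, items, keep):
        """Keep the `keep` strongest pairs as edges; clusters = components."""
        edges = sorted(combinations(items, 2),
                       key=lambda e: similarity[e], reverse=True)[:keep]
        parent = {x: x for x in items}                 # union-find
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for a, b in edges:
            parent[find(a)] = find(b)
        clusters = {}
        for x in items:
            clusters.setdefault(find(x), []).append(x)
        return list(clusters.values())

    sim = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.1,
           ("c", "d"): 0.8, ("a", "d"): 0.15, ("b", "d"): 0.05}
    print(cluster_by_top_links(sim, ["a", "b", "c", "d"], keep=2))
    # -> [['a', 'b'], ['c', 'd']]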

What of larger data sets? Where the algorithmic approaches are the only feasible means of analysis? How do we judge accuracy in those cases?

A revised version of this paper appears as: Hierarchical Information Clustering by Means of Topologically Embedded Graphs by Won-Min Song, T. Di Matteo, and Tomaso Aste.

The original development of the technique used here can be found in: A tool for filtering information in complex systems by M. Tumminello, T. Aste, T. Di Matteo, and R. N. Mantegna.

Orthogonal Range Searching for Text Indexing

Filed under: Indexing,Text Mining — Patrick Durusau @ 10:32 am

Orthogonal Range Searching for Text Indexing by Moshe Lewenstein.

Abstract:

Text indexing, the problem in which one desires to preprocess a (usually large) text for future (shorter) queries, has been researched ever since the suffix tree was invented in the early 70’s. With textual data continuing to increase and with changes in the way it is accessed, new data structures and new algorithmic methods are continuously required. Therefore, text indexing is of utmost importance and is a very active research domain.

Orthogonal range searching, classically associated with the computational geometry community, is one of the tools that has increasingly become important for various text indexing applications. Initially, in the mid 90’s there were a couple of results recognizing this connection. In the last few years we have seen an increase in use of this method and are reaching a deeper understanding of the range searching uses for text indexing.

From the paper:

Orthogonal range searching refers to the preprocessing of a collection of points in d-dimensional space to allow queries on ranges defined by rectangles whose sides are aligned with the coordinate axes (orthogonal).

If you are not already familiar with this area, you may find Lecture 11: Orthogonal Range Searching useful.
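
For intuition, the simplest (and far from optimal) version of the problem: preprocess a set of 2-D points, then report everything inside an axis-aligned query rectangle. Real text-indexing applications use much cleverer structures, but the interface looks like this sketch:

    # Naive orthogonal range searching, for intuition only: practical
    # solutions use range trees / wavelet trees with far better query bounds.

    import bisect

    class RangeSearch2D:
        def __init__(self, points):
            # Preprocess: sort by x so we can binary-search the x-range.
            self.points = sorted(points)                 # sorted by x, then y
            self.xs = [x for x, _ in self.points]

        def query(self, x1, x2, y1, y2):
            """All points (x, y) with x1 <= x <= x2 and y1 <= y <= y2."""
            lo = bisect.bisect_left(self.xs, x1)
            hi = bisect.bisect_right(self.xs, x2)
            return [(x, y) for x, y in self.points[lo:hi] if y1 <= y <= y2]

    rs = RangeSearch2D([(1, 5), (2, 3), (4, 9), (7, 2)])
    print(rs.query(1, 4, 3, 9))   # -> [(1, 5), (2, 3), (4, 9)]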

In a very real sense, indexing, as in a human indexer, lies at the heart of topic maps.

A human indexer recognizes synonyms, relationships represented by synonyms and distinguishes other uses of identifiers.

Topic maps are an effort to record that process so it can be followed mechanically by a calculator.

Mechanical indexing is a powerful tool in the hands of a human indexer, whether working on a traditional index or its successor, a topic map.

What type of mechanical indexing are you using?

IEEE Computer: Special Issue on Computing in Astronomy

Filed under: Astroinformatics,Topic Maps — Patrick Durusau @ 9:39 am

IEEE Computer: Special Issue on Computing in Astronomy

From the post:

Edited by Victor Pankratius (MIT) and Chris Mattmann (NASA)
Final submissions due: December 1, 2013
Publication date: September 2014

Computer seeks submissions for a September 2014 special issue on computing in astronomy.

Computer science has become a key enabler in astronomy’s ability to progress beyond the processing capacity of humans. In fact, computer science is a major bottleneck in the quest of making new discoveries and understanding the universe. Sensors of all kinds collect vast amounts of data that require unprecedented storage capacity, network bandwidth, and compute performance in cloud environments. We are now capable of more sophisticated data acquisition, analysis, and prediction than ever before, thanks to progress in parallel computing and multicore technologies. Social media, open source, and distributed scientific communities have also shed light on new methods for spreading astronomical observations and results quickly. The field of astroinformatics is emerging to unite interdisciplinary efforts across several communities.

This special issue aims to present high-quality articles to the computer science community that describe these new directions in computing in astronomy and astroinformatics. Only submissions describing previously unpublished, original, state-of-the-art research that are not currently under review by a conference or journal will be considered.

Appropriate topics include, but are not limited to, the following:

  • Collecting and processing big data in astronomy
  • Multicore systems, GPU accelerators, high-performance computing, clusters, clouds
  • Data mining, classification, information retrieval
  • Computational astronomy, simulations, algorithms
  • Astronomical visualization, graphics processing, computer vision
  • Crowdsourcing and social media in astronomical data collection
  • Computing aspects of next-generation instruments, sensor networks
  • Astronomical open source software and libraries
  • Automated searches for astronomical objects or phenomena, such as planets, pulsars, organic molecules
  • Feature and event recognition in complex multidimensional datasets
  • Analysis of cosmic ray airshower data of various kinds
  • Computing in antenna arrays, very long baseline interferometry

The guest editors are soliciting three types of contributions: (1) regular research articles describing novel research results (full page length, 5,000 words); (2) experience reports describing approaches, instruments, experiments, or missions, with an emphasis on computer science aspects (half the page length of a regular article, 2,500 words); and (3) sidebars serving as summaries or quick pointers to projects, missions, systems, or results that complement any of the topics of interest (600 words).

Articles should be original and understandable to a broad audience of computer science and engineering professionals. All manuscripts are subject to peer review on both technical merit and relevance to Computer’s readership. Accepted papers will be professionally edited for content and style.

For additional information, contact the guest editors directly: Victor Pankratius, Massachusetts Institute of Technology, Haystack Observatory (http://www.victorpankratius.com); and Chris Mattmann, NASA JPL (http://sunset.usc.edu/~mattmann).

Paper submissions are due December 1, 2013. For author guidelines and information on how to submit a manuscript electronically, visit http://www.computer.org/portal/web/peerreviewmagazines/computer

A great opportunity to combine two fun interests: astronomy and topic maps!

The deadline of December 1, 2013 will be here sooner than you think. Best to start drafting now.

Updated Database Landscape map – June 2013

Filed under: Database,Graphs,NoSQL — Patrick Durusau @ 9:27 am

Updated Database Landscape map – June 2013 by Matthew Aslett.

database map

I appreciate all the work that went into the creation of the map but even in a larger size (see Matthew’s post), I find it difficult to use.

Or perhaps that’s part of the problem, I don’t know what use it was intended to serve?

If I understand the legend, then “search” isn’t found in the relational or grid/cache zones. Which I am sure would come as a surprise to the many vendors and products in those zones.

Moreover, the ordering of entries along each colored line isn’t clear. Taking graph databases for example, they are listed from top to bottom:

But GrapheneDB is Neo4j as a service. So shouldn’t they be together?

I have included links to all the listed graph databases in case you can see a pattern that I am missing.

BTW, GraphLab, which in May of 2013 raised $6.75M for further development (GraphLab – Next Generation [Johnny Come Lately VCs]), and GraphChi, a project at GraphLab, were both omitted from this list.

Are there other graph databases that are missing?

How would you present this information differently? What ordering would you use? What other details would you want to have accessible?

Kiji

Filed under: Hadoop,HBase,KIji Project — Patrick Durusau @ 8:40 am

What’s Next for HBase? Big Data Applications Using Frameworks Like Kiji by Michael Stack.

From the post:

Apache Hadoop and HBase have quickly become industry standards for storage and analysis of Big Data in the enterprise, yet as adoption spreads, new challenges and opportunities have emerged. Today, there is a large gap — a chasm, a gorge — between the nice application model your Big Data Application builder designed and the raw, byte-based APIs provided by HBase and Hadoop. Many Big Data players have invested a lot of time and energy in bridging this gap. Cloudera, where I work, is developing the Cloudera Development Kit (CDK). Kiji, an open source framework for building Big Data Applications, is another such thriving option. A lot of thought has gone into its design. More importantly, long experience building Big Data Applications on top of Hadoop and HBase has been baked into how it all works.

Kiji provides a model and a set of libraries that allow developers to get up and running quickly. Intuitive Java APIs and Kiji’s rich data model allow developers to build business logic and machine learning algorithms without having to worry about bytes, serialization, schema evolution, and lower-level aspects of the system. The Kiji framework is modularized into separate components to support a wide range of usage and encourage clean separation of functionality. Kiji’s main components include KijiSchema, KijiMR, KijiHive, KijiExpress, KijiREST, and KijiScoring. KijiSchema, for example, helps team members collaborate on long-lived Big Data management projects, and does away with common incompatibility issues, and helps developers build more integrated systems across the board. All of these components are available in a single download called a BentoBox.

When mainstream news only has political scandals, wars and rumors of wars, tech news can brighten your day!

Be sure to visit the Kiji Project website.

Turn-key tutorials to get you started.

SSNs: Close Enough for a Drone Strike?

Filed under: NSA,Security — Patrick Durusau @ 8:28 am

The hazards, difficulties and dangers of name matching in large data pools were explored in NSA…Verizon…Obama…Connecting the Dots. Or not., republished at Naked Capitalism as: Could the Verizon-NSA Metadata Collection Be a Stealth Political Kickback?. Safe to conclude that without more, name matching is at best happenstance.

A private comment wondered if Social Security Numbers (SSNs) could be the magic key that ties phone records to credit records to bank records and so on. It is, after all, the default government identifier in the United States. It is certainly less untrustworthy than simple name matching. How trustworthy an SSN identifier is, in fact, is the subject of this post.

Are SSNs a magic key for matching phone, credit, bank, government records?

SSNs: A Short History

Wikipedia gives us a common starting point to answer that question: http://en.wikipedia.org/wiki/Social_Security_number

In the United States, a Social Security number (SSN) is a nine-digit number issued to U.S. citizens, permanent residents, and temporary (working) residents under section 205(c)(2) of the Social Security Act, codified as 42 U.S.C. § 405(c)(2). The number is issued to an individual by the Social Security Administration, an independent agency of the United States government. Its primary purpose is to track individuals for Social Security purposes.

(…)

The original purpose of this number was to track individuals’ accounts within the Social Security program. It has since come to be used as an identifier for individuals within the United States, although rare errors occur where duplicates do exist.

The Wikipedia article also points out that duplicates issued by the Social Security Administration are rare, but people claiming the same SSN are not.

The Social Security Administration expands the story of the SSN of Mrs. Hilda Schrader Whitcher (in Wikipedia) this way:

The most misused SSN of all time was (078-05-1120). In 1938, wallet manufacturer the E. H. Ferree company in Lockport, New York decided to promote its product by showing how a Social Security card would fit into its wallets. A sample card, used for display purposes, was inserted in each wallet. Company Vice President and Treasurer Douglas Patterson thought it would be a clever idea to use the actual SSN of his secretary, Mrs. Hilda Schrader Whitcher.

The wallet was sold by Woolworth stores and other department stores all over the country. Even though the card was only half the size of a real card, was printed all in red, and had the word “specimen” written across the face, many purchasers of the wallet adopted the SSN as their own. In the peak year of 1943, 5,755 people were using Hilda’s number. SSA acted to eliminate the problem by voiding the number and publicizing that it was incorrect to use it. (Mrs. Whitcher was given a new number.) However, the number continued to be used for many years. In all, over 40,000 people reported this as their SSN. As late as 1977, 12 people were found to still be using the SSN “issued by Woolworth.” (Social Security Cards Issued By Woolworth)

Do People Claim More Than One SSN?

The best evidence that people can and do claim multiple SSNs is our information systems for tracking individuals.

The FBI’s Guidelines for Preparation of Fingerprint Cards and Associated Criminal History Information provides that:

Enter the subject’s Social Security number, if known. Additional Social Security numbers used by the subject may be entered in the “Additional Information/Basis for Caution” block #34 on the reverse side of the fingerprint card.

The FBI maintains the National Crime Information Center (NCIC), “…an electronic clearinghouse of crime data….” The system requires authorization for access and there are no published statistics about the number of social security numbers claimed by people listed in NCIC.

I can relate anecdotally that I have seen NCIC printouts that reported multiple SSNs for a single individual. I have written to the FBI asking for either a pointer to the number of individuals with multiple SSNs in NCIC or a response with that statistic.

Beyond evildoers who claim multiple SSNs, there is also the problem of identity theft, which commonly involves a person’s SSN.

Identity Theft

Another source of dirty identity data is identity theft.

How prevalent is identity theft?

Approximately 15 million United States residents have their identities used fraudulently each year with financial losses totalling upwards of $50 billion.*

On a case-by-case basis, that means approximately 7% of all adults have their identities misused with each instance resulting in approximately $3,500 in losses.

Close to 100 million additional Americans have their personal identifying information placed at risk of identity theft each year when records maintained in government and corporate databases are lost or stolen.

These alarming statistics demonstrate identity theft may be the most frequent, costly and pervasive crime in the United States. (http://www.identitytheft.info/victims.aspx)

BTW, www.IdentityTheft.info reports as of June 9, 2013, “…6,558,655 identity theft victims year-to-date.”

Assuming the NSA is monitoring all phone and other electronic traffic, what difficulties does it face with SSNs?

Partial Summary of How Dirty are SSNs?

  • From identity theft, 2012 to date: approximately 21,558,655 errors in resolving SSNs to other data.
  • An unknown number of multiple SSNs as evidenced in part by people listed in NCIC.
  • Mistakes, foul-ups, confusion, bad record keeping by credit reporting agencies (The NSA Verizon Collection Coming on DVD) (an unknown number)
  • Terrorists, being bent on mass murder, are unlikely to be stymied by “…I declare under penalties of perjury…” or warnings about false statements resulting in denial of future service clauses in contracts. (an unknown number)

Don’t have to take my word that reliable identification is difficult. Ask your local district attorney what evidence they need to prove someone was previously convicted of drunk driving. The courts have wrestled with this type of issue for years. Which is one reason why FBI record keeping requires biometric data along with names and numbers.

Does More Data = Better Data?

The debate over data collection should distinguish two uses of large data sets.

Pattern Matching

The most common use is to search for patterns in data. For example, Twitter users forming tribes with own language, tweet analysis shows.

Another example of pattern matching research was described as:

When Senn was first given his assignment to compare two months of weather satellite data with 830 million GPS records of 80 million taxi trips, he was a little disappointed. “Everyone in Singapore knows it’s impossible to get a taxi in a rainstorm,” says Senn, “so I expected the data to basically confirm that assumption.” As he sifted through the data related to a vast fleet of more than 16,000 taxicabs, a strange pattern emerged: it appeared that many taxis weren’t moving during rainstorms. In fact, the GPS records showed that when it rained (a frequent occurrence in this tropical island state), many drivers pulled over and didn’t pick up passengers at all.

Senn confirmed his findings by sitting down with drivers. And what did he learn?

He learned that the company owning most of the island’s taxis would withhold S$1,000 (about US$800) from a driver’s salary immediately after an accident until it was determined who was at fault. The process could take months, and the drivers had independently decided that it simply wasn’t worth the risk of having their livelihood tangled up in bureaucracy for that long. So when it started raining, they simply pulled over and waited out the storm. Why you don’t get taxis in Singapore when it rains?

“…[U]sing two months of weather satellite data with 830 million GPS records of 80 million taxi trips…” Sounds like the NSA project. Yes?

Detecting patterns is one thing. But patterns don’t connect diverse data sources. Nor do they provide explanations.

Reconciling Dirty Data

Starting from diverse data sets, even if they purport to share SSNs, the difficult question is how to reconcile the data. Any of the data sets could be correct or they could all be incorrect.

Here is a more formal statement on error analysis and multiple data sets:

The most challenging problem within data cleansing remains the correction of values to eliminate domain format errors, constraint violations, duplicates and invalid tuples. In many cases the available information and knowledge is insufficient to determine the correct modification of tuples to remove these anomalies. This leaves deleting those tuples as the only practical solution. This deletion of tuples leads to a loss of information if the tuple is not invalid as a whole. This loss of information can be avoided by keeping the tuple in the data collection and mask the erroneous values until appropriate information for error correction is available. The data management system is then responsible for enabling the user to include and exclude erroneous tuples in processing and analysis where this is desired.

In other cases the proper correction is known only roughly. This leads to a set of alternative values. The same is true when dissolving contradictions and merging duplicates without exactly knowing which of the contradicting values is the correct one. The ability of managing alternative values allows to defer the error correction until one of the alternatives is selected as the right correction. Keeping alternative values has a major impact on managing and processing the data. Logically, each of the alternatives forms a distinct version of the data collection, because the alternatives are mutually exclusive. It is a technical challenge to manage the large amount of different logical versions and still enable high performance in accessing and processing them.

When performing data cleansing one has to keep track of the version of data used because the deduced values can depend on a certain value from the set of alternatives of being true. If this specific value later becomes invalid, maybe because another value is selected as the correct alternative, all deduced and corrected values based on the now invalid value have to be discarded. For this reason the cleansing lineage of corrected values has to maintained. By cleansing lineage we mean the entirety of values and tuples used within the cleansing of a certain tuple. If any value in the lineage becomes invalid or changes the performed operations have to be redone to verify the result is still valid. The management of cleansing lineage is also of interest for the cleansing challenges described in the following two sections. Problems, Methods, and Challenges in Comprehensive Data Cleansing by Heiko Müller and Johann-Christoph Freytag.

The more data you collect, the more problematic accurate mass identification becomes.
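
A toy example of why that matters. If you merge records on SSN alone, every misused or duplicated number silently fuses different people, and you are left holding exactly the kind of “alternative values” the quote above describes. (The records below are invented; the SSN is the Woolworth specimen number.)

    # Invented data illustrating why SSN-only merging is unreliable: one
    # widely misused SSN (think of the Woolworth card) fuses unrelated
    # people, and the merged record holds contradictory "alternative values".

    from collections import defaultdict

    phone_records  = [{"ssn": "078-05-1120", "name": "H. Whitcher", "city": "Lockport"},
                      {"ssn": "219-09-9999", "name": "J. Doe",      "city": "Austin"}]
    credit_records = [{"ssn": "078-05-1120", "name": "R. Smith",    "city": "Topeka"},
                      {"ssn": "078-05-1120", "name": "A. Jones",    "city": "Reno"}]

    def merge_on_ssn(*sources):
        merged = defaultdict(lambda: defaultdict(set))
        for source in sources:
            for rec in source:
                for field, value in rec.items():
                    if field != "ssn":
                        merged[rec["ssn"]][field].add(value)
        return merged

    for ssn, fields in merge_on_ssn(phone_records, credit_records).items():
        conflicts = {f: v for f, v in fields.items() if len(v) > 1}
        print(ssn, "conflicting fields:", conflicts or "none")
    # 078-05-1120 ends up with three names and three cities; which is correct?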

NSA Competency with Data (SSN or otherwise)

The “Underwear Bomber’s” parents met with the CIA at least twice to warn them about their son.

A useful Senate Budget hearing on the NSA and its acquisition of phone, credit, bank and other records should go something like:

The following dialogue is fictional but the facts and links are real.

Sen. X: Mr. N, as a representative of the NSA, are you familiar with the case of Umar Farouk Abdulmutallab?

Mr. N: Yes.

Sen. X: I understand that the CIA entered his name in the Terrorist Identities Datamart Environment in November of 2009. But his name was not added to the FBI’s Terrorist Screening Database, which feeds the Secondary Screening Selectee list and the U.S.’s No Fly List.

Mr. N: Yes.

Sen. X: The Terrorist Identities Datamart Environment, Terrorist Screening Database, Secondary Screening Selectee list and the U.S.’s No Fly List are all U.S. government databases? Databases to which the NSA has complete access?

Mr. N: Yes.

Sen. X: So, the NSA was unable to manage data in four (4) U.S. government databases well enough to prevent a terrorist from boarding a plane destined for the United States.

My question is, if the NSA can’t manage four U.S. government databases, what proof is there that it can usefully manage all phone and other electronic traffic?

Mr. N: I don’t know.

Sen. X: Who would know?

Mr. N: The computer and software bidders for the NSA DarkStar facility in Utah.

Sen. X: And who would they be?

Mr. N: That’s classified.

Sen. X: Classified?

Mr. N: Yes, classified.

Sen. X: Has it ever occurred to you that bidders have an interest in their contracts being funded and agencies in having larger budgets?

Mr. N: Sorry, I don’t understand the question.

Sen. X: My bad, I really didn’t think you would.

End of fictional hearing transcript

The known facts show the NSA can’t manage four (4) U.S. government databases to prevent a known terrorist from entering the U.S.

What evidence would you offer to prove the NSA is competent to use complex data sets? (You can find more negative evidence on eavesdropping at Bruce Schneier’s NSA Eavesdropping Yields Dead Ends.)

PS: On the National Security Industrial Complex, see: Apparently Some Stuff Happened This Weekend


Addendum: Edward Snowden Makes Himself an Even Bigger Problem to the Officialdom

A must watch interview with Edward Snowden and great commentary as well.

On a quick listen, you may think Edward is describing a more competent system than I do above.

On the contrary, if you listen closely, Edward does not diverge from anything that I have said on this issue to date.

Starting at time mark 07:10, Glenn Greenwald asks:

Why should people care about surveillance?

Snowden: Because even if you aren’t doing anything wrong, you are being watched and recorded and the storage capability of these systems increases every year, consistently, by orders of magnitude, ah, to where it is getting to the point that you don’t have to have done anything wrong, you have to eventually fall under suspicion from somebody, even from a wrong call, and they can use the system to go back in time and scrutinize every decision you have ever made, every friend you have ever discussed anything with, and attack you on that basis to sort of derive suspicion from an innocent life and paint anyone in the context of a wrong doer.

Yes, the NSA can use a phone call to search all other phone calls, within the phone call database. Ho-hum. Annoying but hardly high tech merging of data from diverse data sources.

It is also true that once you are selected, the NSA could invest the time and effort to reconcile all the information about you, on a one-off basis.

But that has always been the case.

The public motivation for the NSA project was to data mine diverse data sources. Computers replacing labor-intensive human investigation of terrorism.

But as Snowden points out, it takes a human to connect dots in the noisy results of computer processing.

Fewer humans = less effective against terrorism.

June 10, 2013

FoundationDB Beta 2 [NSA Scale?]

Filed under: FoundationDB,NoSQL,NSA — Patrick Durusau @ 4:31 pm

Beta 2 is here – with 100X increased capacity!

From the post:

We’re happy to announce that we’ve released FoundationDB Beta 2!

Most of our testing and tuning in the past has focused on data sets ranging up to 1TB, but our users have told us that they’re excited to begin applying FoundationDB’s transactional processing to data sets larger than 1 TB, so we made that our major focus for Beta 2.

db scale

Beta 2 significantly reduces memory and CPU usage while increasing server robustness when working with larger data sets. FoundationDB now supports data sets up to 100 TB of aggregate key-value size. Though if you are planning on going above 10 TB you might want to talk to us at support@foundationdb.com for some configuration recommendations—we’re always happy to help.

Also new in Beta 2 is support for Node 0.10 and Ruby on Windows. Of course, there are a whole lot of behind-the-scenes improvements to both the core and our APIs, some of which are documented in the release notes.

New Website!

We also recently rolled out a cool new website to explain the transformative effect that ACID transactions have on NoSQL technology. Be sure to check it out, along with our community site where you can share your insights and get questions answered.

Do you think “web scale” is rather passé nowadays?

Really should be talking about NSA scale.

Yes?

When will my computer understand me?

Filed under: Language,Markov Decision Processes,Semantics,Translation — Patrick Durusau @ 2:57 pm

When will my computer understand me?

From the post:

It’s not hard to tell the difference between the “charge” of a battery and criminal “charges.” But for computers, distinguishing between the various meanings of a word is difficult.

For more than 50 years, linguists and computer scientists have tried to get computers to understand human language by programming semantics as software. Driven initially by efforts to translate Russian scientific texts during the Cold War (and more recently by the value of information retrieval and data analysis tools), these efforts have met with mixed success. IBM’s Jeopardy-winning Watson system and Google Translate are high profile, successful applications of language technologies, but the humorous answers and mistranslations they sometimes produce are evidence of the continuing difficulty of the problem.

Our ability to easily distinguish between multiple word meanings is rooted in a lifetime of experience. Using the context in which a word is used, an intrinsic understanding of syntax and logic, and a sense of the speaker’s intention, we intuit what another person is telling us.

“In the past, people have tried to hand-code all of this knowledge,” explained Katrin Erk, a professor of linguistics at The University of Texas at Austin focusing on lexical semantics. “I think it’s fair to say that this hasn’t been successful. There are just too many little things that humans know.”

Other efforts have tried to use dictionary meanings to train computers to better understand language, but these attempts have also faced obstacles. Dictionaries have their own sense distinctions, which are crystal clear to the dictionary-maker but murky to the dictionary reader. Moreover, no two dictionaries provide the same set of meanings — frustrating, right?

Watching annotators struggle to make sense of conflicting definitions led Erk to try a different tactic. Instead of hard-coding human logic or deciphering dictionaries, why not mine a vast body of texts (which are a reflection of human knowledge) and use the implicit connections between the words to create a weighted map of relationships — a dictionary without a dictionary?

“An intuition for me was that you could visualize the different meanings of a word as points in space,” she said. “You could think of them as sometimes far apart, like a battery charge and criminal charges, and sometimes close together, like criminal charges and accusations (“the newspaper published charges…”). The meaning of a word in a particular context is a point in this space. Then we don’t have to say how many senses a word has. Instead we say: ‘This use of the word is close to this usage in another sentence, but far away from the third use.'”

Before you jump to the post looking for the code, Erk is working with a 10,000-dimension space to analyze her data.
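
The intuition of meanings as points in space is easy to show at toy scale (my own illustration, nothing like Erk's 10,000-dimension model): represent each use of a word by a bag of its context words and compare uses with cosine similarity.

    # Toy illustration of word uses as points in space: each use of
    # "charge(s)" is a small bag of context words (invented examples), and
    # cosine similarity says which uses sit close together.

    import math
    from collections import Counter

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm = (math.sqrt(sum(v * v for v in a.values())) *
                math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    uses = {
        "battery":   Counter("battery drained voltage overnight".split()),
        "criminal":  Counter("prosecutor court filed indictment".split()),
        "newspaper": Counter("prosecutor court published indictment".split()),
    }

    print(cosine(uses["criminal"], uses["newspaper"]))  # 0.75 -- close together
    print(cosine(uses["criminal"], uses["battery"]))    # 0.0  -- far apart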

The most recent paper: Montague Meets Markov: Deep Semantics with Probabilistic Logical Form (2013)

Abstract:

We combine logical and distributional representations of natural language meaning by transforming distributional similarity judgments into weighted inference rules using Markov Logic Networks (MLNs). We show that this framework supports both judging sentence similarity and recognizing textual entailment by appropriately adapting the MLN implementation of logical connectives. We also show that distributional phrase similarity, used as textual inference rules created on the fly, improves its performance.

Mahout for R Users

Filed under: Machine Learning,Mahout,R — Patrick Durusau @ 2:31 pm

Mahout for R Users by Simon Raper.

From the post:

I have a few posts coming up on Apache Mahout so I thought it might be useful to share some notes. I came at it as primarily an R coder with some very rusty Java and C++ somewhere in the back of my head so that will be my point of reference. I’ve also included at the bottom some notes for setting up Mahout on Ubuntu.

What is Mahout?

A machine learning library written in Java that is designed to be scalable, i.e. run over very large data sets. It achieves this by ensuring that most of its algorithms are parallelizable (they fit the map-reduce paradigm and therefore can run on Hadoop.) Using Mahout you can do clustering, recommendation, prediction etc. on huge datasets by increasing the number of CPUs it runs over. Any job that you can split up into little jobs that can be done at the same time is going to see vast improvements in performance when parallelized.

Like R it’s open source and free!

So why use it?

Should be obvious from the last point. The parallelization trick brings data and tasks that were once beyond the reach of machine learning suddenly into view. But there are other virtues. Java’s strictly object orientated approach is a catalyst to clear thinking (once you get used to it!). And then there is a much shorter path to integration with web technologies. If you are thinking of a product rather than just a one off piece of analysis then this is a good way to go.
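
The “split it into little jobs” point is the whole trick. Mahout does it on Hadoop; purely to illustrate the shape of a map-reduce job (this is not Mahout or Hadoop code), here is a word count split across local processes:

    # Illustration of the map-reduce shape that lets Mahout's algorithms
    # scale (plain Python multiprocessing standing in for Hadoop).

    from collections import Counter
    from multiprocessing import Pool

    def map_chunk(lines):
        """The 'little job': count words in one chunk of the input."""
        return Counter(word for line in lines for word in line.split())

    def reduce_counts(partials):
        total = Counter()
        for c in partials:
            total.update(c)
        return total

    if __name__ == "__main__":
        corpus = ["the cat sat", "the dog sat", "the cat ran"] * 1000
        chunks = [corpus[i::4] for i in range(4)]          # split the work
        with Pool(4) as pool:
            partial_counts = pool.map(map_chunk, chunks)   # chunks in parallel
        print(reduce_counts(partial_counts).most_common(3))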

Large data sets have been in the news of late. 😉

Are you ready to apply machine learning techniques to large data sets?

And will you be familiar enough with the techniques to spot computational artifacts?

Can’t say for sure but more knowledge of and practice with Mahout might help with those questions.

Beyond NSA, the intelligence community has a big technology footprint [Funding]

Filed under: Funding,NSA — Patrick Durusau @ 2:10 pm

Beyond NSA, the intelligence community has a big technology footprint

From the post:

While all through the past few days the focus has been on NSA activities, the discussion has often veered around the technologies and products used by NSA. At the same time, a side discussion topic has been the larger technical ecosystem of intelligence units. CIA has been one of the more prolific users of Information Technology by its own admission. To that extent, CIA spinned off a venture capital firm In-Q-Tel in 1999 to invest in focused sector companies. Per Helen Coster of Fortune Magazine, In-Q-Tel (IQT) has been named “after the gadget-toting James Bond character Q”.

In-Q-Tel states on its website that “We design our strategic investments to accelerate product development and delivery for this ready-soon innovation, and specifically to help companies add capabilities needed by our customers in the Intelligence Community”. To that effect, it has made over 200 investments in early stage companies for propping up products. Being a not-for-profit group, unlike Private Venture capitalists, the ROI is not the primary motive. Beyond funds, the startups have benefited from a government organization association. “It’s an ego boost to get a phone call from In-Q-Tel, but more importantly, it’s a direct path to major government customers.” states Venture Beat’s Christina Farr.

A partial listing of companies from its portfolio:

  • 10gen:  the company behind MongoDB, a leading NOSQL database.
  • Adaptive Computing: offers a portfolio of Moab cloud management and Moab HPC workload management products for HPC, data centers and cloud.
  • Adapx: helps to collect data with paper forms, GIS maps, and/or all-weather field notebooks using digital pens; software integrated with Excel, Sharepoint and ArcGIS among others.
  • Apigee: provides API technology and services for enterprises and developers
  • Basis Technology: provides software solutions for text analytics, natural language processing, information retrieval, and name resolution in over twenty languages
  • Cleversafe: provides resilient storage solutions, ideally suited for storage clouds and massive digital archives
  • Cloudera: leading provider of Hadoop distribution and services.
  • Cloudant: provides “data layer as a service” for loading, storing, analyzing, and distributing application data
  • Delphix: provides solutions for provisioning, managing refreshing, and recovering databases for business-critical applications
  • Digital Reasoning: provides tools people need to understand relationships between entities in vast amounts of unstructured, semi-structured and structured data.
  • FMS Advanced Systems Group: focuses on visualization and analytical solutions, as well as solutions in the area of Geospatial/Temporal Analysis and Situational Awareness
  • Huddle: provides enterprise content software with intelligent technology for recommending valuable information to users, with no need for search.
  • LucidWorks: provides search solution development platforms built on the power of Apache Solr/Lucene open source search via enterprise-grade subscriptions.
  • Narrative Insight: provides automated business analytics and natural language communication technology that transform data into personalized stories.
  • Novo Dynamics: develops intelligent information capture software and provides advanced analytics solutions that transform data into actionable insights
  • NetBase: provides semantic technology tool that reads sentences to surface insights from public and private online information sources.
  • piXlogic: has created software that automatically analyzes and searches images and video based on their visual content
  • Palantir Technologies: “was developed to address the most complex information analysis and security challenges faced by the U.S. intelligence, military, and law enforcement communities”. Provides extensible software solution designed from the ground up for data integration and analysis.
  • Pure Storage: the all-flash enterprise storage company, enables the broad deployment of flash in the data center
  • Power Assure: developer of data center infrastructure and energy management software for large enterprises, government agencies, and managed service providers
  • Platfora: works with existing Hadoop clusters and provide business intelligence software for Big Data.
  • Quantum 4D: integrates large data sets into sophisticated models producing moving visual representations that enable users to identify trends, relationships, and anomalies in real-time data
  • ReversingLabs: provides tools for analysis of all unknown binary content which may involve removing of all protection and obfuscation artifacts, unwrapping formatting elements and extracting relevant meta-data.  Results are compared against analysis reports on billions of goodware and malware files.
  • Recorded Future: provides software which utilizes linguistics and statistics algorithms to extract time-related information, and through temporal reasoning, help users understand relationships between entities and events over time, to form the world’s “first temporal analytics engine”.
  • Signal Innovations Group, Inc. (SIG): provides customers with an advanced platform for video analysis, including motion tracking, enhanced metadata creation, and motion-based anomalous behavior detection
  • SitScape: provides software to assemble situational business dashboards based on on-demand componentization and user interface virtualization of disparate enterprise-wide live Web applications and digital assets.
  • StreamBase Systems: provides software for high-performance Complex Event Processing (CEP) that analyze and act on real-time streaming data for instantaneous decision-making
  • Traction software: provides enterprise hypertext collaboration software which combines wiki-style group editing and the simplicity of a blog to provide business and government organizations with enterprise software
  • Visible Technologies: provides enterprise ready social media solution, offering a combination of software and services to harness business value from social communities.
  • Weather Analytics: delivers global climate intelligence by providing statistically stable, gap-free data formed by an extensive collection of historical, current and forecasted weather content, coupled with proprietary analytics and methodologies.

Thinking that after a long Summer and Fall of hearings rooting out some of the grifters in the NSA surveillance program, there could be opportunities opening up for new faces.

If the U.S. loses its taste for intelligence gathering, there are plenty of other governments and private organizations who might be interested.

“… it’s an ill wind blaws naebody gude.” (Walter Scott in Rob Roy)

Edward Snowden Makes Himself an Even Bigger Problem to the Officialdom

Filed under: NSA,Security — Patrick Durusau @ 1:42 pm

Edward Snowden Makes Himself an Even Bigger Problem to the Officialdom by Yves Smith.

Unless you have been in a coma or a reeducation camp for the last week or so, you have heard about Edward Snowden. Snowden is the source of a recent set of leaks about NSA surveillance programs that target everyone.

The post links to an interview with Snowden and adds extensive commentary along with links to more materials.

It will be a useful resource for those of you creating Edward Snowden or related topic maps.

Why Theoretical Computer Scientists Aren’t Worried About Privacy

Filed under: Cryptography,NSA,Privacy,Security — Patrick Durusau @ 1:33 pm

Why Theoretical Computer Scientists Aren’t Worried About Privacy by Jeremy Kun.

From the post:

There has been a lot of news recently on government surveillance of its citizens. The biggest two that have pervaded my news feeds are the protests in Turkey, which in particular have resulted in particular oppression of social media users, and the recent light on the US National Security Agency’s widespread “backdoor” in industry databases at Google, Verizon, Facebook, and others. It appears that the facts are in flux, as some companies have denied their involvement in this program, but regardless of the truth the eye of the public has landed firmly on questions of privacy.

Barack Obama weighed in on the controversy as well, being quoted as saying,

You can’t have 100% security and 100% privacy, and also zero inconvenience.

I don’t know what balance the US government hopes to strike, but what I do know is that privacy and convenience are technologically possible, and we need not relinquish security to attain it.

Before I elaborate, let me get my personal beliefs out of the way. I consider the threat of terrorism low compared to the hundreds of other ways I can die. I should know, as I personally have been within an ε fraction of my life for all ε > 0 (when I was seven I was hit by a bus, proclaimed dead, and revived). So I take traffic security much more seriously than terrorism, and the usual statistics will back me up in claiming one would be irrational to do otherwise. On the other hand, I also believe that I only need so much privacy. So I don’t mind making much of my personal information public, and I opt in to every one of Google’s tracking services in the hopes that my user experience can be improved. Indeed it has, as services like Google Now will, e.g., track my favorite bands for me based on my Google Play listening and purchasing habits, and alert me when there are concerts in my area. If only it could go one step further and alert me of trending topics in theoretical computer science! I have much more utility for timely knowledge of these sorts of things than I do for the privacy of my Facebook posts. Of course, ideologically I’m against violating privacy as a matter of policy, but this is a different matter. One can personally loathe a specific genre of music and still recognize its value and one’s right to enjoy it.

But putting my personal beliefs aside, I want to make it clear that there is no technological barrier to maintaining privacy and utility. This may sound shocking, but it rings true to the theoretical computer scientist. Researchers in cryptography have experienced this feeling many times, that their wildest cryptographic dreams are not only possible but feasible! Public-key encryption and digital signatures, secret sharing on a public channel, zero-knowledge verification, and many other protocols have been realized quite soon after being imagined. There are still some engineering barriers to implementing these technologies efficiently in large-scale systems, but with demand and a few years of focused work there is nothing stopping them from being used by the public. I want to use this short post to describe two of the more recent ideas that have pervaded the crypto community and provide references for further reading.
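
One of the protocols Jeremy mentions, secret sharing, is simple enough to show in a few lines. This toy additive scheme (my own illustration, not from his post) splits a secret into n random-looking shares; any n-1 of them reveal nothing, but all n together reconstruct it:

    # Toy additive secret sharing: all n shares are needed to reconstruct;
    # any fewer look uniformly random. Illustration only, not from the post.

    import secrets

    P = 2**127 - 1   # a large prime modulus

    def share(secret, n):
        shares = [secrets.randbelow(P) for _ in range(n - 1)]
        shares.append((secret - sum(shares)) % P)
        return shares

    def reconstruct(shares):
        return sum(shares) % P

    shares = share(42, n=5)
    print(reconstruct(shares))        # -> 42
    print(reconstruct(shares[:4]))    # -> some unrelated huge number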

Jeremy injects a note of technical competence into the debate over privacy and security in the wake of NSA disclosures.

Not that our clueless representatives in government, greedy bidders or turf-building agencies will pick up on this line of discussion.

The purpose of the NSA program is what the Republicans call a “transfer of wealth,” in this case from the government to select private contractors.

How much is being transferred isn’t known. If we knew the amount of the transfer and that the program it funds is almost wholly ineffectual, we might object to our representatives.

Some constitutional law scholars (Obama) have forgotten that informed participation by voters in public debate is a keystone of the U.S. Constitution.

Advanced Suggest-As-You-Type With Solr

Filed under: Indexing,Searching,Solr — Patrick Durusau @ 10:10 am

Advanced Suggest-As-You-Type With Solr by John Berryman.

From the post:

In my previous post, I talked about implementing Search-As-You-Type using Solr. In this post I’ll cover a closely related functionality called Suggest-As-You-Type.

Here’s the use case: A user comes to your search-driven website to find something. And it is your goal to be as helpful as possible. Part of this is by making term suggestions as they type. When you make these suggestions, it is critical to make sure that your suggestion leads to search results. If you make a suggestion of a word just because it is somewhere in your index, but it is inconsistent with the other terms that the user has typed, then the user is going to get a results page full of white space and you’re going to get another dissatisfied customer!

A lot of search teams jump at the Solr suggester component because, after all, this is what it was built for. However, I haven’t found a way to configure the suggester so that it suggests only completions that correspond to search results. Rather, it is based upon a dictionary lookup that is agnostic of what the user is currently searching for. (Please someone tell me if I’m wrong!) In any case, getting the suggester working takes a bit of configuration. So why not use a solution that is based upon the normal, out-of-the-box Solr setup? Here’s how:

Topic map authoring is what jumps to my mind as a use case for suggest-as-you-type, particularly if you use fields for particular types of topics, making the suggestions more focused and manageable.

Good for search as well, for the same reasons.
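
Just to make the result-aware idea concrete, here is a rough sketch of my own (plain faceting with facet.prefix, which may or may not be John’s exact recipe): run the words the user has completed as the query, facet on the field, and restrict the facet to terms starting with the fragment, so every suggestion is guaranteed to have matching documents.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SuggestAsYouType {
    public static void main(String[] args) throws SolrServerException {
        // Assumes the stock Solr example running locally with a tokenized "text" field.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // The user has typed "apple gr": "apple" is complete, "gr" is the fragment.
        SolrQuery query = new SolrQuery("text:apple");
        query.setRows(0);              // only the facet counts are needed
        query.setFacet(true);
        query.addFacetField("text");
        query.setFacetPrefix("gr");    // only terms starting with the fragment
        query.setFacetMinCount(1);     // only terms that actually co-occur

        QueryResponse rsp = solr.query(query);
        FacetField suggestions = rsp.getFacetField("text");
        for (FacetField.Count c : suggestions.getValues()) {
            System.out.println(c.getName() + " (" + c.getCount() + " results)");
        }
    }
}

In practice you would facet on a lighter copyField meant for suggestions rather than the full text field, but the guarantee is the point: a term only shows up if choosing it leaves the user with results.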

John offers several cautions near the end of his post, but the final one is quite amusing:

Inappropriate Content: Be very cautious about content of the fields being used for suggestions. For instance, if the content has misspellings, so will the suggestions. And don’t include user comments unless you want to endorse their opinions and choice of language as your search suggestions!

I don’t think of auto-suggestions as “endorsing” anything. They are purely mechanical assistance for the user.

If a suggested term or opinion offends a user, they don’t have to follow it.

At least in my view, technology should not be used to build intellectual tombs for users, tombs that protect them from thoughts or expressions different from their own.

Search-As-You-Type With Solr

Filed under: Lucene,Searching,Solr — Patrick Durusau @ 9:53 am

Search-As-You-Type With Solr by John Berryman.

From the post:

In my previous post, I talked about implementing Suggest-As-You-Type using Solr. In this post I’ll cover a closely related functionality called Search-As-You-Type.

Several years back, Google introduced an interesting new interface for their search called Search-As-You-Type. Basically, as you type in the search box, the result set is continually updated with better and better search results. By this point, everyone is used to Google’s Search-As-You-Type, but for some reason I have yet to see any of our clients use this interface. So I thought it would be cool to take a stab at this with Solr.

Let’s get started. First things first, download Solr and spin up Solr’s example.

cd solr-4.2.0/example
java -jar start.jar

Next click this link and POOF! you will have the following documents indexed:

  • There’s nothing better than a shiny red apple on hot summer day.
  • Eat an apple!
  • I prefer a Grannie Smith apple over Fuji.
  • Apricots is kinda like a peach minus the fuzz.

(Kinda cool how that link works isn’t it?) Now let’s work on the strategy. Let’s assume that the user is going to search for “apple”. When the user types “a” what should we do? In a normal index, there’s a buzillion things that start with “a”, so maybe we should just do nothing. Next “ap” depending upon how large your index is, two letters may be a reasonably small enough set to start providing feedback back to your users. The goal is to provide Solr with appropriate information so that it continuously comes back with the best results possible.

A good demonstration that how you form a query makes a large difference in the results you get.
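
A minimal sketch of how such a query might be formed (my own take, not necessarily John’s final approach): treat every completed word as a required term and the trailing fragment as a prefix, so each keystroke tightens the result set.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class SearchAsYouType {
    // Completed words become required terms; the trailing fragment becomes a prefix.
    // (Escaping of Lucene special characters is omitted for brevity.)
    static String asYouTypeQuery(String typed, String field) {
        String[] words = typed.trim().split("\\s+");
        StringBuilder q = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) q.append(" AND ");
            q.append(field).append(':').append(words[i].toLowerCase());
            if (i == words.length - 1) q.append('*');   // prefix match on the fragment
        }
        return q.toString();
    }

    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        // The user has typed "shiny re" so far -> text:shiny AND text:re*
        SolrQuery query = new SolrQuery(asYouTypeQuery("shiny re", "text"));
        for (SolrDocument doc : solr.query(query).getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
    }
}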

The NSA Files

Filed under: NSA,Security,Topic Maps — Patrick Durusau @ 8:26 am

The NSA Files (Guardian)

Chinese military intelligence may be freer of U.S. influence than the Guardian (U.K.), but if so, the Guardian runs a close second.

The Guardian has helpfully collected up the NSA files that it has published here, along with other news coverage about the files.

You can comment on other people’s comments or you can read the documents and make knowledgeable comments. I suggest the latter.

So far as I know, no one other than the NSA is tracking all the public statements from various figures.

Excellent opportunity to improve your skills at research and analysis.

Consider creating partial topic maps of portions of the story that can be merged to reveal other connections.

Think of it as distributed investigation and curation of information.

June 9, 2013

NLTK 2.1 – Working with Text Corpora

Filed under: Natural Language Processing,NLTK,Text Corpus — Patrick Durusau @ 5:46 pm

NLTK 2.1 – Working with Text Corpora by Vsevolod Dyomkin.

From the post:

Let’s return to start of chapter 2 and explore the tools needed to easily and efficiently work with various linguistic resources.

What are the most used and useful corpora? This is a difficult question to answer because different problems will likely require specific annotations and often a specific corpus. There are even special conferences dedicated to corpus linguistics.

Here’s a list of the most well-known general-purpose corpora:

  • Brown Corpus – one of the first big corpora and the only one in the list really easily accessible – we’ve already worked with it in the first chapter
  • Penn Treebank – Treebank is a corpus of sentences annotated with their constituency parse trees so that they can be used to train and evaluate parsers
  • Reuters Corpus (not to be confused with the ApteMod version provided with NLTK)
  • British National Corpus (BNC) – a really huge corpus, but, unfortunately, not freely available

Another very useful resource, which isn’t structured specifically like the academic corpora mentioned above but has other dimensions of useful connections and annotations, is Wikipedia. And there has been a lot of interesting linguistic research performed with it.

Besides these, there are two additional valuable language resources that can’t be classified as text corpora at all, but rather as language databases: WordNet and Wiktionary. We have already discussed the CL-NLP interface to WordNet, and we’ll touch on working with Wiktionary in this part.

Vsevolod continues to recast the NLTK into Lisp.

Learning corpus processing along with Lisp. How can you lose?

Presto is Coming!

Filed under: Facebook,Hive,Presto — Patrick Durusau @ 5:34 pm

Facebook unveils Presto engine for querying 250 PB data warehouse by Jordan Novet.

From the post:

At a conference for developers at Facebook headquarters on Thursday, engineers working for the social networking giant revealed that it’s using a new homemade query engine called Presto to do fast interactive analysis on its already enormous 250-petabyte-and-growing data warehouse.

More than 850 Facebook employees use Presto every day, scanning 320 TB each day, engineer Martin Traverso said.

“Historically, our data scientists and analysts have relied on Hive for data analysis,” Traverso said. “The problem with Hive is it’s designed for batch processing. We have other tools that are faster than Hive, but they’re either too limited in functionality or too simple to operate against our huge data warehouse. Over the past few months, we’ve been working on Presto to basically fill this gap.”

Facebook created Hive several years ago to give Hadoop some data warehouse and SQL-like capabilities, but it is showing its age in terms of speed because it relies on MapReduce. Scanning over an entire dataset could take many minutes to hours, which isn’t ideal if you’re trying to ask and answer questions in a hurry.

With Presto, however, simple queries can run in a few hundred milliseconds, while more complex ones will run in a few minutes, Traverso said. It runs in memory and never writes to disk, Traverso said.

Traverso goes on to say that Facebook will open source Presto this coming fall.

See my prior post for a more technical description of Presto: Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.

Bear in mind that getting an answer from 250 PB of data quickly isn’t the same thing as getting a useful answer quickly.

IEEE Intelligence and Security Informatics 2013

Filed under: Intelligence,Security — Patrick Durusau @ 4:49 pm

IEEE Intelligence and Security Informatics 2013

The conference ran from June 4–7, 2013, and, with the recent disclosures about the NSA, is a subject of interest.

You may want to scan the program (title link) for topics and/or researchers of interest.

Making Sense of Patterns in the Twitterverse

Filed under: Pattern Matching,Pattern Recognition,Tweets — Patrick Durusau @ 4:45 pm

Making Sense of Patterns in the Twitterverse

From the post:

If you think keeping up with what’s happening via Twitter, Facebook and other social media is like drinking from a fire hose, multiply that by 7 billion — and you’ll have a sense of what Court Corley wakes up to every morning.

Corley, a data scientist at the Department of Energy’s Pacific Northwest National Laboratory, has created a powerful digital system capable of analyzing billions of tweets and other social media messages in just seconds, in an effort to discover patterns and make sense of all the information. His social media analysis tool, dubbed “SALSA” (SociAL Sensor Analytics), combined with extensive know-how — and a fair degree of chutzpah — allows someone like Corley to try to grasp it all.

“The world is equipped with human sensors — more than 7 billion and counting. It’s by far the most extensive sensor network on the planet. What can we learn by paying attention?” Corley said.

Among the payoffs Corley envisions are emergency responders who receive crucial early information about natural disasters such as tornadoes; a tool that public health advocates can use to better protect people’s health; and information about social unrest that could help nations protect their citizens. But finding those jewels amidst the effluent of digital minutia is a challenge.

“The task we all face is separating out the trivia, the useless information we all are blasted with every day, from the really good stuff that helps us live better lives. There’s a lot of noise, but there’s some very valuable information too.”

The work by Corley and colleagues Chase Dowling, Stuart Rose and Taylor McKenzie was named best paper given at the IEEE conference on Intelligence and Security Informatics in Seattle this week.

Another one of those “name” issues, as the IEEE conference site reports:

Courtney Corley, Chase Dowling, Stuart Rose and Taylor McKenzie. SociAL Sensor Analytics: Measuring Phenomenology at Scale.

I am assuming, since the other researchers match, that this is the “Court/Courtney” in question.

I was unable to find an online copy of the paper but suspect it will eventually appear in an IEEE archive.

From the news report, very interesting and useful work.

OpenVis Conf 2013 [Videos]

Filed under: Graphics,Visualization — Patrick Durusau @ 4:32 pm

OpenVis Conf 2013

Videos of the presentations at OpenVis 2013 are collected in a YouTube playlist.

We don’t have a Senator like Sam Ervin now, so I am laying in a supply of course lectures and conference videos.

Pointers to other conference videos or lectures appreciated.

Build your own finite state transducer

Filed under: FSTs,Lucene — Patrick Durusau @ 3:26 pm

Build your own finite state transducer by Michael McCandless.

From the post:

Have you always wanted your very own Lucene finite state transducer (FST) but you couldn’t figure out how to use Lucene’s crazy APIs?

Then today is your lucky day! I just built a simple web application that creates an FST from the input/output strings that you enter.

If you just want a finite state automaton (no outputs) then enter only inputs, such as this example:

(…)

Mike’s post: Lucene finite state transducer (FST) summarizes the potential for FSTs in Lucene.
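
If you would rather build one in code than in the web application, a minimal FST construction against the public org.apache.lucene.util.fst API looks roughly like this (a sketch only; constructor and method signatures shift between 4.x releases):

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class TinyFst {
    public static void main(String[] args) throws Exception {
        // Input strings must be added in sorted order.
        String[] inputs = {"cat", "dog", "dogs"};
        long[] outputs = {5L, 7L, 12L};

        // Note: getSingleton() takes a boolean flag in some 4.x releases.
        PositiveIntOutputs fstOutputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, fstOutputs);

        BytesRef scratchBytes = new BytesRef();
        IntsRef scratchInts = new IntsRef();
        for (int i = 0; i < inputs.length; i++) {
            scratchBytes.copyChars(inputs[i]);
            builder.add(Util.toIntsRef(scratchBytes, scratchInts), outputs[i]);
        }

        FST<Long> fst = builder.finish();
        System.out.println(Util.get(fst, new BytesRef("dogs")));   // prints 12
    }
}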

HTRT? Be good with your tools. Be very good with your tools.

Build Your Own Lucene Codec!

Filed under: Indexing,Lucene — Patrick Durusau @ 3:09 pm

Build Your Own Lucene Codec! by Doug Turnbull.

From the post:

I’ve been having a lot of fun hacking on a Lucene Codec lately. My hope is to create a Lucene storage layer based on FoundationDB – a new distributed and transactional key-value store. It’s a fun opportunity to learn about both FoundationDB and low-level Lucene details.

But before we get into all that fun technical stuff, there’s some work we need to do. Our goal is going to be to get MyFirstCodec to work! Here’s the source code:

(…)

From the Lucene 4.1 documentation: Codec (class in org.apache.lucene.codecs) encodes/decodes an inverted index segment.
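
As a point of reference, and this is my own minimal sketch rather than Doug’s MyFirstCodec, a do-nothing codec in Lucene 4.x can simply extend FilterCodec and delegate everything to the default codec. The interesting work begins when you override individual pieces, such as the postings format, which is presumably where a FoundationDB-backed storage layer would plug in.

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;

// A codec that changes nothing: every format is delegated to the default codec
// of whatever Lucene 4.x release is on the classpath. To make it loadable by
// name, list the class in META-INF/services/org.apache.lucene.codecs.Codec.
public class PassThroughCodec extends FilterCodec {
    public PassThroughCodec() {
        super("PassThroughCodec", Codec.getDefault());
    }
}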

How good do you want to be with your tools?

Edward Snowden:…. [American Hero]

Filed under: Security — Patrick Durusau @ 2:39 pm

Edward Snowden: the whistleblower behind revelations of NSA surveillance by Glenn Greenwald.

From the post:

The individual responsible for one of the most significant leaks in US political history is Edward Snowden, a 29-year-old former technical assistant for the CIA and current employee of the defence contractor Booz Allen Hamilton. Snowden has been working at the National Security Agency for the last four years as an employee of various outside contractors, including Booz Allen and Dell.

The Guardian, after several days of interviews, is revealing his identity at his request. From the moment he decided to disclose numerous top-secret documents to the public, he was determined not to opt for the protection of anonymity. “I have no intention of hiding who I am because I know I have done nothing wrong,” he said.

Snowden will go down in history as one of America’s most consequential whistleblowers, alongside Daniel Ellsberg and Bradley Manning. He is responsible for handing over material from one of the world’s most secretive organisations – the NSA.

In a note accompanying the first set of documents he provided, he wrote: “I understand that I will be made to suffer for my actions,” but “I will be satisfied if the federation of secret law, unequal pardon and irresistible executive powers that rule the world that I love are revealed even for an instant.”

(…)

Tyranny cannot be resisted by anonymous comments, emails or vigils.

Resistance to tyranny is a question of acts by brave people, some of whom will not remain anonymous.

Ask yourself: How To Resist Tyranny (HTRT)?

Then ask your friends the same question.

Under the Hood: The entities graph

Filed under: Entities,Facebook,Graphs — Patrick Durusau @ 1:23 pm

Under the Hood: The entities graph (Eric Sun is a tech lead on the entities team, and Venky Iyer is an engineering manager on the entities team.)

From the post:

Facebook’s social graph now comprises over 1 billion monthly active users, 600 million of whom log in every day. What unites each of these people is their social connections, and one way we map them is by traversing the graph of their friendships.

[Image: entity graph]

But this is only a small portion of the connections on Facebook. People don’t just have connections to other people—they may use Facebook to check in to restaurants and other points of interest, they might show their favorite books and movies on their timeline, and they may also list their high school, college, and workplace. These 100+ billion connections form the entity graph.

There are even connections between entities themselves: a book has an author, a song has an artist, and movies have actors. All of these are represented by different kinds of edges in the graph, and the entities engineering team at Facebook is charged with building, cleaning, and understanding this graph.

Instructive read on building an entity graph.

Differs from NSA data churning in several important ways:

  1. The participants want their data to be found with like data. Participants generally have no motive to lie or hide.
  2. The participants seek out similar users and data.
  3. The participants correct bad data for the benefit of others.

None of those characteristics can be attributed to the victims of NSA data collection efforts.

June 8, 2013

Released OrientDB v1.4:…

Filed under: Graphs,OrientDB — Patrick Durusau @ 3:57 pm

Released OrientDB v1.4: new TinkerPop Blueprints API and many other features

From the post:

NuvolaBase is glad to announce the new release 1.4 of OrientDB: http://www.orientdb.org/download.htm!

What’s new with 1.4?

  • Graph: total rewrite of TinkerPop Blueprints API that now are the default Java interface, support for light-weight edges (no document), labeled relationships using separate classes and vertex fields
  • Storage: new Paged-Local compressed “plocal” engine  (not yet transactional)
  • SQL: INSERT and UPDATE supports JSON syntax, improved usage of indexes upon ORDER BY, supported timeout in query and global, new create function command, flatten() now is expand(), new OSQLMethod classes to handle methods even in chain, new encode() and decode() functions, support for new dictionary:<key> as target in SELECT and TRAVERSE
  • new SCHEDULER component using CRON syntax
  • new OTriggered class to use JS as hook
  • MMap: auto flush of pages on regular basis
  • Fetch-plan: support for skip field using “-2”
  • Index: auto rebuild in background, usage of different data-segment
  • Export: supported partial export like schema, few clusters, etc.
  • Console: improved formatting of resultsets
  • HTTP: new /batch command supporting transaction too, faster connection through /connect command, /document returns a JSON
  • StudioUML display of class

I did not mean to get so distracted yesterday writing about useless projects that I would neglect very useful ones, like the new release of OrientDB!

The “light-weight edges” feature sounds particularly interesting.

With no backing document you could not reliably merge the edges, but as a design decision, that could be entirely appropriate.

You are always going to face decisions about which subjects are represented at all and to what extent.

Nice to have a mechanism that doesn’t make that an entirely binary decision.
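
For anyone who wants to poke at the new default API, a small Blueprints-style session might look like the sketch below (written against the generic TinkerPop 2.x interfaces; the exact OrientGraph options and transaction calls may differ slightly in 1.4):

import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;

public class GraphSketch {
    public static void main(String[] args) {
        // "plocal" is the new paged-local storage engine in 1.4.
        OrientGraph graph = new OrientGraph("plocal:/tmp/testdb");
        try {
            Vertex author = graph.addVertex(null);
            author.setProperty("name", "A. Author");

            Vertex book = graph.addVertex(null);
            book.setProperty("title", "A Book");

            // An edge with no properties can be stored light-weight (no document);
            // setting a property on it would force a full edge document.
            graph.addEdge(null, author, book, "wrote");

            graph.commit();
        } finally {
            graph.shutdown();
        }
    }
}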

