Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 9, 2013

Neo4j 2.0 Milestone 4

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:58 pm

Summer Release – Neo4j 2.0 Milestone 4 by Chris Leishman.

From the post:

Perfectly suited for your summer holiday exploration, we are proud to present the Neo4j 2.0 Milestone 4 release (2.0.0-M04).

In working towards a major 2.0 release with some outstanding new functionality, this milestone contains many beneficial and necessary changes, some of which will require changes in the way Neo4j is used (see: deprecations).

As Chris says, it is something to explore over the summer holidays!

Download.

BTW, update to Java 7 before you try this milestone. It’s required.

Complex Adaptive Dynamical Systems, a Primer

Filed under: Cellular Automata,Complexity,Game Theory,Information Theory,Self-Organizing — Patrick Durusau @ 3:47 pm

Complex Adaptive Dynamical Systems, a Primer by Claudius Gros. (PDF)

The high level table of contents should capture your interest:

  1. Graph Theory and Small-World Networks
  2. Chaos, Bifurcations and Diffusion
  3. Complexity and Information Theory
  4. Random Boolean Networks
  5. Cellular Automata and Self-Organized Criticality
  6. Darwinian Evolution, Hypercycles and Game Theory
  7. Synchronization Phenomena
  8. Elements of Cognitive Systems Theory

If not, you can always try the video lectures by the author.

While big data is a crude approximation of some part of the world as we experience it, it is less coarse than prior representations.

Curious how much less coarse representations will need to become in order to exhibit the complex behavior of what they represent?

I first saw this at Complex Adaptive Dynamical Systems, a Primer (Claudius Gros) by Charles Iliya Krempeaux.

Counting Citations in U.S. Law

Filed under: Graphics,Law,Law - Sources,Visualization — Patrick Durusau @ 3:17 pm

Counting Citations in U.S. Law by Gary Sieling.

From the post:

The U.S. Congress recently released a series of XML documents containing U.S. Laws. The structure of these documents allow us to find which sections of the law are most commonly cited. Examining which citations occur most frequently allows us to see what Congress has spent the most time thinking about.

Citations occur for many reasons: a justification for addition or omission in subsequent laws, clarifications, or amendments, or repeals. As we might expect, the most commonly cited sections involve the IRS (Income Taxes, specifically), Social Security, and Military Procurement.

To arrive at this result, we must first see how the U.S. Code is laid out. The laws are divided into a hierarchy of units, which allows anything from an entire title to individual sentences to be cited. These sections have an ID and an identifier – “identifier” is used as a citation reference within the XML documents, and has a different form from the citations used by the legal community, which come in a form like “25 USC Chapter 21 § 1901”.

If you are interested in some moderate XML data processing, this is the project for you!

Gary has posted the code for developing a citation index to the U.S. Laws in XML.
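
If you want to roll a quick count of your own before reading Gary’s code, here is a minimal sketch. The element name (“ref”), attribute name (“href”) and download directory are guesses for illustration, not the actual US Code XML schema:

# Minimal sketch: count citation references across US Code XML files.
# "ref" and "href" are guesses at the markup, not the real schema names.
import glob
from collections import Counter
import xml.etree.ElementTree as ET

counts = Counter()
for path in glob.glob("uscode-xml/*.xml"):      # hypothetical download directory
    tree = ET.parse(path)
    for ref in tree.iter("ref"):                # each citation reference element
        target = ref.get("href")
        if target:
            counts[target] += 1

for target, n in counts.most_common(10):
    print(n, target)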

If you want to skip to one great result of this effort, see: Visualizing Citations in U.S. Law, also by Gary, which is based on d3.js and Uber Data visualization.

In the “Visualizing” post Gary enables the reader to see what laws (by title) cite other titles in U.S. law.

More interesting than you would think.

Take Title 26, Internal Revenue Code (IRC).

Among others, the IRC does not cite:

Title 30 – MINERAL LANDS AND MINING
Title 31 – MONEY AND FINANCE
Title 32 – NATIONAL GUARD

I can understand not citing the NATIONAL GUARD but MONEY AND FINANCE?

Looking forward to more ways to explore the U.S. Laws.

Tying the legislative history of laws to, say, New York Times stories on the subject matter of a law could prove to be very interesting.

I started to suggest tracking donations to particular sponsors and then to legislation that benefits the donors.

But that level of detail is just a distraction. Most elected officials have no shame at selling their offices. Documenting their behavior may regularize pricing of senators and representatives but not have much other impact.

I suggest you find a button other than truth to influence their actions.

Using Hue to Access Hive Data Through Pig

Filed under: Hive,Hue,Pig — Patrick Durusau @ 2:39 pm

Demo: Using Hue to Access Hive Data Through Pig by Hue Team.

From the post:

This installment of the Hue demo series is about accessing the Hive Metastore from Hue, as well as using HCatalog with Hue. (Hue, of course, is the open source Web UI that makes Apache Hadoop easier to use.)

What is HCatalog?

HCatalog is a module in Apache Hive that enables non-Hive scripts to access Hive tables. You can then directly load tables with Apache Pig or MapReduce without having to worry about re-defining the input schemas, or caring about or duplicating the data’s location.

Hue contains a Web application for accessing the Hive metastore called Metastore Browser, which lets you explore, create, or delete databases and tables using wizards. (You can see a demo of these wizards in a previous tutorial about how to analyze Yelp data.) However, Hue uses HiveServer2 for accessing the metastore instead of HCatalog. This is because HiveServer2 is the new secure and concurrent server for Hive and it includes a fast Hive Metastore API.

HCatalog connectors are still useful for accessing Hive data through Pig, though. Here is a demo about accessing the Hive example tables from the Pig Editor:

Even prior to the semantics of data is access to the data! 😉

Plus mentions of what’s coming in Hue 3.0. (go read the post)

60+ R resources to improve your data skills

Filed under: Data,R — Patrick Durusau @ 1:12 pm

60+ R resources to improve your data skills by Sharon Machlis.

Great collection of R resources! Some you will know, others are likely to be new to you.

Definitely worth the time to take a quick scan of Sharon’s listing.

I first saw this in Vincent Granville’s 60+ R resources.

August 8, 2013

Google Developers R Programming Video Lectures

Filed under: Programming,R — Patrick Durusau @ 6:14 pm

Google Developers R Programming Video Lectures by Stephen Turner.

From the post:

Google Developers recognized that most developers learn R in bits and pieces, which can leave significant knowledge gaps. To help fill these gaps, they created a series of introductory R programming videos. These videos provide a solid foundation for programming tools, data manipulation, and functions in the R language and software. The series of short videos is organized into four subsections: intro to R, loading data and more data formats, data processing and writing functions. Start watching the YouTube playlist here, or watch an individual lecture below:
(…)

Twenty-one (21) high quality lectures on R.

Don’t be frightened off by the number of videos! Just scanning their lengths, I found only one that is over four (4) minutes, one that is over three (3) minutes but less than four, and a number of them between two (2) and four (4) minutes.

They are short but should be sufficient to get you started with R.

Using the Unix Chainsaw:…

Filed under: Bioinformatics,Linux OS,Programming — Patrick Durusau @ 2:50 pm

Using the Unix Chainsaw: Named Pipes and Process Substitution by Vince Buffalo.

From the post:

It’s hard not to fall in love with Unix as a bioinformatician. In a past post I mentioned how Unix pipes are an extremely elegant way to interface bioinformatics programs (and do inter-process communication in general). In exploring other ways of interfacing programs in Unix, I’ve discovered two great but overlooked ways of interfacing programs: the named pipe and process substitution.

Why We Love Unix and Pipes

A few weeks ago I stumbled across a great talk by Gary Bernhardt entitled The Unix Chainsaw. Bernhardt’s “chainsaw” analogy is great: people sometimes fear doing work in Unix because it’s a powerful tool, and it’s easy to screw up with powerful tools. I think in the process of grokking Unix it’s not uncommon to ask “is this clever and elegant? or completely fucking stupid?”. This is normal, especially if you come from a language like Lisp or Python (or even C really). Unix is a get-shit-done system. I’ve used a chainsaw, and you’re simultaneously amazed at (1) how easily it slices through a tree, and (2) that you’re dumb enough to use this thing three feet away from your vital organs. This is Unix.
(…)

“The Unix Chainsaw.” Definitely a title for a drama about a group of shell hackers that uncover fraud and waste in large government projects. 😉

If you are not already a power user on *nix, this could be a step in that direction.
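
Named pipes are not shell-only, either. Here is a minimal Python sketch of the same idea on a Unix system (the sort | uniq -c consumer is just an example I picked):

# Minimal sketch of a named pipe (FIFO) on Unix: one process writes into the
# pipe while another reads from it, with no intermediate file on disk.
import os
import subprocess
import tempfile

fifo = os.path.join(tempfile.mkdtemp(), "lines.fifo")
os.mkfifo(fifo)                                     # create the named pipe

# Downstream consumer reads from the pipe (an ordinary shell pipeline here).
consumer = subprocess.Popen("sort %s | uniq -c" % fifo, shell=True)

# Producer writes into the pipe; the consumer reads concurrently.
with open(fifo, "w") as out:
    for line in ["chr1", "chr2", "chr1", "chrX"]:
        out.write(line + "\n")

consumer.wait()
os.remove(fifo)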

Bret Victor – The Future of Programming

Filed under: Computer Science,Design,Programming — Patrick Durusau @ 1:32 pm

I won’t try to describe or summarize Bret’s presentation for fear I will spoil it for you.

I can say that if you aspire to make a difference in computer science, large or small, this is a video for you.

There are further materials at Bret’s website: http://worrydream.com/

How to Design Programs, Second Edition

Filed under: Computer Science,Design,Programming — Patrick Durusau @ 12:44 pm

How to Design Programs, Second Edition by Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, Shriram Krishnamurthi.

From the webpage:

Bad programming is easy. Idiots can learn it in 21 days, even if they are Dummies.

Good programming requires thought, but everyone can do it and everyone can experience the satisfaction that comes with it. The price is worth paying for the sheer joy of the discovery process, the elegance of the result, and the commercial benefits of a systematic program design process.

The goal of our book is to introduce readers of all ages and backgrounds to the craft of designing programs systematically. We assume few prerequisites: arithmetic, a tiny bit of middle school algebra, and the willingness to think through issues. We promise that the travails will pay off not just for future programmers but for anyone who has to follow a process or create one for others.

We are grateful to Ada Brunstein, our editor at MIT Press, who gave us permission to develop this second edition of “How to Design Programs” on-line.

Good to see this “classic” being revised online.

Differences: This second edition of “How to Design Programs” continues to present an introduction to systematic program design and problem solving. Here are some important differences:

  1. The recipes are applied in two different, typical settings: interactive graphical programs and so-called “batch” programs. The former mode of interaction is typical for games, the latter for data processing in business centers. Both kinds of programs are still created with our design recipes.

  2. While testing has always been a part of the “How to Design Programs” philosophy, the software started supporting it properly only in 2002, just after we had released the first edition. This new edition heavily relies on this testing support now.
  3. This edition of the book drops the design of imperative programs. The old chapters remain available on-line. The material will flow into the next volume of the book, “How to Design Components.”
  4. The book and its programs employ new libraries, also known as teachpacks. The preferred style is to link in these libraries via so-called require specifications, but it is still possible to add teachpacks via a menu in DrRacket.
  5. Finally, we decided to use a slightly different terminology:

    HtDP/1e       HtDP/2e
    contract      signature
    union         itemization

Any other foundation texts that have abandoned imperative programming?

I first saw this in Nat Torkington’s Four short links: 5 August 2013.

August 7, 2013

Extremely Large Images: Considerations for Contemporary Approach

Filed under: Astroinformatics,Semantics — Patrick Durusau @ 6:53 pm

Extremely Large Images: Considerations for Contemporary Approach by Bruce Berriman.

From the post:

This is the title of a paper by Kitaeff, Wicenec, Wu and Taubman recently posted on astro-ph. The paper addresses the issues of accessing and interacting with very large data-cube images that will be produced by the next generation of radio telescopes such as the Square Kilometer Array (SKA), the Low Frequency Array for Radio Astronomy (LOFAR) and others. Individual images may be TB-sized, and one SKA Reference Mission Project, “Galaxy Evolution in the Nearby Universe: HI Observations,” will generate individual images of 70-90 TB each.

Data sets this large cannot reside on local disks, even with anticipated advances in storage and network technology. Nor will any new lossless compression techniques that preserve the low S/N of the data save the day, for the act of decompression will impose excessive computational demands on servers and clients.

(emphasis added)

Yes, you read that correctly: “generate individual images of 70-90 TB each.”

Looks like the SW/WWW is about to get a whole lot smaller, comparatively speaking.

But the data you will be encountering will be getting larger. A lot larger.

Bear in mind that the semantics we associate with data will be getting larger as well.

Read that carefully, especially the part about “…we associate with data…”

Data may appear to have intrinsic semantics, but only because we project semantics onto it without acknowledging the projection.

The more data we have, the more space there is for semantic projection, by everyone who views the data.

Whose views/semantics do you want to capture?

Visualizing Astronomical Data with Blender

Filed under: Astroinformatics,Image Processing,Visualization — Patrick Durusau @ 6:42 pm

Visualizing Astronomical Data with Blender by Brian R. Kent.

From the post:

Astronomy is a visually stunning science. From wide-field multi-wavelength images to high-resolution 3D simulations, astronomers produce many kinds of important visualizations. Astronomical visualizations have the potential for generating aesthetically appealing images and videos, as well as providing scientists with the ability to inspect phase spaces not easily explored in 2D plots or traditional statistical analysis. A new paper is now available in the Publications of the Astronomical Society of the Pacific (PASP) entitled “Visualizing Astronomical Data with Blender.” The paper discusses:
(…)

Don’t just skip to the paper, Brian’s post has a video demo of Blender that you will want to see!

BaseX 7.7 has been released!

Filed under: BaseX,XML,XPath,XQuery — Patrick Durusau @ 6:27 pm

BaseX 7.7 has been released!

From the webpage:

BaseX is a very light-weight, high-performance and scalable XML Database engine and XPath/XQuery 3.0 Processor, including full support for the W3C Update and Full Text extensions. An interactive and user-friendly GUI frontend gives you great insight into your XML documents.

To maximize your productivity and workflows, we offer professional support, highly customized software solutions and individual trainings on XML, XQuery and BaseX. Our product itself is completely Open Source (BSD-licensed) and platform independent; join our mailing lists to get regular updates!

But most important: BaseX runs out of the box and is easy to use…

This was a fortunate find. I have some XML work coming up and need to look at the latest offerings.
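
For anyone else poking at it, here is a minimal sketch of running an XQuery against BaseX from Python over its REST interface. It assumes the HTTP server (basexhttp) is running with the defaults (port 8984, admin credentials) and that a database named factbook exists; none of that comes from the announcement.

# Minimal sketch: run an XQuery against a local BaseX HTTP server via REST.
# Port, credentials and the "factbook" database are illustrative defaults.
import requests

resp = requests.get(
    "http://localhost:8984/rest/factbook",
    params={"query": "count(//country)"},
    auth=("admin", "admin"),
)
print(resp.status_code, resp.text)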

Topic Map Patterns?

Filed under: Authoring Topic Maps,Topic Maps — Patrick Durusau @ 6:18 pm

A comment yesterday:

However, the first step would be to create a catalog of common topic map structures or patterns. It seems like such a catalog could eventually enable automated or computer assisted construction of topic maps to supplement hand-editing of topic maps. Hand-editing is a necessary first step but it does not scale well. Imagine how few applications there would be now if everything had to be coded in assembler. Or how few databases would there be now if everyone had to build their own out of B-Trees. (Carl)

resonated when I was writing an entry about computational linguistics today.

I think Carl is right, people don’t create their own databases out of B-Trees.

But by the same token, they don’t forge completely new patterns of speaking either.

I don’t know what the numbers are, but how many original constructions in your native language do you use every day? Particularly in a professional setting?

Rather than looking for “topic map” patterns, shouldn’t we be looking for speech patterns in particular user communities?

Such that our interfaces, when set to a particular community, can automatically parse input into a topic map.

Not unconstrained subject recognition, but using language patterns to capture some percentage of subjects automatically rather than relying on the user to do it.
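
To make that concrete, here is a toy sketch of turning one recurring community speech pattern into topic map-ish structures. The pattern and the output shape are both invented for illustration:

# Toy sketch: turn one speech pattern ("X is a member of Y") into topics and
# an association. The pattern and the output structure are invented.
import re

PATTERN = re.compile(r"^(?P<member>.+?) is a member of (?P<group>.+?)\.?$")

def parse(sentence):
    m = PATTERN.match(sentence.strip())
    if not m:
        return None
    member, group = m.group("member"), m.group("group")
    return {
        "topics": [member, group],
        "association": {"type": "membership",
                        "roles": {"member": member, "group": group}},
    }

print(parse("Carl is a member of the topic maps community."))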

EACL 2014 – Gothenburg, Sweden – Call for Papers

Filed under: Computational Linguistics,Conferences,Linguistics — Patrick Durusau @ 6:02 pm

EACL 2014 – 26-30 April, Gothenburg, Sweden

IMPORTANT DATES

Long papers:

  • Long paper submissions due: 18 October 2013
  • Long paper reviews due: 19 November 2013
  • Long paper author responses due: 29 November 2013
  • Long paper notification to authors: 20 December 2013
  • Long paper camera-ready due: 14 February 2014

Short papers:

  • Short paper submissions due: 6 January 2014
  • Short paper reviews due: 3 February 2014
  • Short paper notification to authors: 24 February 2014
  • Short paper camera-ready due: 3 March 2014

EACL conference: 26–30 April 2014

From the call:

The 14th Conference of the European Chapter of the Association for Computational Linguistics invites the submission of long and short papers on substantial, original, and unpublished research in all aspects of automated natural language processing, including but not limited to the following areas:

  • computational and cognitive models of language acquisition and language processing
  • information retrieval and question answering
  • generation and summarization
  • language resources and evaluation
  • machine learning methods and algorithms for natural language processing
  • machine translation and multilingual systems
  • phonetics, phonology, morphology, word segmentation, tagging, and chunking
  • pragmatics, discourse, and dialogue
  • semantics, textual entailment
  • social media, sentiment analysis and opinion mining
  • spoken language processing and language modeling
  • syntax, parsing, grammar formalisms, and grammar induction
  • text mining, information extraction, and natural language processing applications

Papers accepted to TACL by 30 November 2013 will also be eligible for presentation at EACL 2014; please see the TACL website (http://www.transacl.org) for details.

It’s not too early to begin making plans for next Spring!

Drilling into Big Data with Apache Drill

Filed under: BigData,Dremel,Drill — Patrick Durusau @ 5:52 pm

Drilling into Big Data with Apache Drill by Steven J Vaughan-Nichols.

From the post:

Apache Drill’s goal is nothing less than answering queries from petabytes of data and trillions of records in less than a second.

You can’t claim that the Apache Drill programmers think small. Their design goal is for Drill to scale to 10,000 servers or more and to process petabytes of data and trillions of records in less than a second.

If this sounds impossible, or at least very improbable, consider that the NSA already seems to be doing exactly the same kind of thing. If they can do it, open-source software can do it.

In an interview at OSCon, the major open source convention in Portland, OR, Ted Dunning, the chief application architect for MapR, a big data company, and a Drill mentor and committer, explained the reason for the project. “There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data in such formats as Avro, the Apache Hadoop data serialization system; JSON (JavaScript Object Notation); and Protocol Buffers, Google’s data interchange format.”

As Dunning explained, big business wants fast access to big data and none of the traditional solutions, such as a relational database management system (RDBMS), MapReduce, or Hive, can deliver those speeds.

Dunning continued, “This need was identified by Google and addressed internally with a system called Dremel.” Dremel was the inspiration for Drill, which also is meant to complement such open-source big data systems as Apache Hadoop. The difference between Hadoop and Drill is that while Hadoop is designed to achieve very high throughput, it’s not designed to achieve the sub-second latency needed for interactive data analysis and exploration.

(…)

At this point, Drill is very much a work in progress. “It’s not quite production quality at this point, but by third or fourth quarter of 2013 it will become quite usable.” Specifically, Drill should be in beta by the third quarter.

So, if Drill sounds interesting to you, you can start contributing as soon as you get up to speed. To do that, there’s a weekly Google Hangout on Tuesdays at 9am Pacific time and a Twitter feed at @ApacheDrill. And, of course, there’s an Apache Drill Wiki and users’ and developers’ mailing lists.

NSA claims, indeed any claims by government officials, have to be judged against President Obama announcing yesterday: “There is No Spying on Americans.”

It has been creeping along for a long time but the age of Newspeak is here.

But leaving doubtful comments by members of the government to one side, Apache Drill does sound like an exciting project!

DB-Engines Ranking

Filed under: Database,Graphs — Patrick Durusau @ 4:02 pm

Method of calculating the scores of the DB-Engines Ranking

From the webpage:

The DB-Engines Ranking is a list of database management systems ranked by their current popularity. We measure the popularity of a system by using the following parameters:

  • Number of mentions of the system on websites, measured as number of results in search engines queries. At the moment, we use Google and Bing for this measurement. In order to count only relevant results, we are searching for “<system name> database”, e.g. “Oracle database”.
  • General interest in the system. For this measurement, we use the frequency of searches in Google Trends.
  • Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
  • Number of job offers, in which the system is mentioned.
    We use the number of offers on the leading job search engines Indeed and Simply Hired.

  • Number of profiles in professional networks, in which the system is mentioned. We use the internationally most popular professional network LinkedIn.

We calculate the popularity value of a system by standardizing and averaging the individual parameters. These mathematical transformations are made in such a way that the distance between the individual systems is preserved. That means that when system A has twice as large a value in the DB-Engines Ranking as system B, it is twice as popular when averaged over the individual evaluation criteria.

The DB-Engines Ranking does not measure the number of installations of the systems, or their use within IT systems. It can be expected that an increase in the popularity of a system as measured by the DB-Engines Ranking (e.g. in discussions or job offers) precedes a corresponding broad use of the system by a certain time factor. Because of this, the DB-Engines Ranking can act as an early indicator. (emphasis added in last paragraph)

I mention this ranking explanation for two reasons.

First, it is a remarkably honest statement about how databases are ranked. It is as if the RIAA were to admit their “piracy” estimates are chosen for verbal impact rather than any relationship to a measurable reality.

Second, it demonstrates that semantics are lurking just behind the numbers of any ranking. True, DB-Engines said some ranking was X, but anyone who relies on that ranking needs to evaluate how it was arrived at.
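
If you are curious what “standardizing and averaging” while preserving ratios could look like, here is a toy construction of my own (not DB-Engines’ actual formula): normalize each parameter against a reference system and combine with a geometric mean, so that doubling every raw parameter doubles the final score.

# Toy illustration (my construction, not DB-Engines' method): normalize each
# popularity parameter against a reference system, then take a geometric mean.
from math import prod

reference = {"search_hits": 1_000_000, "trends": 50, "so_questions": 20_000, "jobs": 5_000}
system_a  = {"search_hits": 2_000_000, "trends": 100, "so_questions": 40_000, "jobs": 10_000}
system_b  = {"search_hits": 1_000_000, "trends": 50,  "so_questions": 20_000, "jobs": 5_000}

def score(system):
    ratios = [system[k] / reference[k] for k in reference]
    return prod(ratios) ** (1 / len(ratios))     # geometric mean of normalized parameters

print(score(system_a), score(system_b))          # 2.0 vs. 1.0 -- A is "twice as popular"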

August 6, 2013

Lire

Filed under: Image Processing,Image Recognition,Searching — Patrick Durusau @ 6:29 pm

Lire

From the webpage:

LIRE (Lucene Image Retrieval) is an open source library for content based image retrieval, which means you can search for images that look similar. Besides providing multiple common and state of the art retrieval mechanisms LIRE allows for easy use on multiple platforms. LIRE is actively used for research, teaching and commercial applications. Due to its modular nature it can be used on process level (e.g. index images and search) as well as on image feature level. Developers and researchers can easily extend and modify Lire to adapt it to their needs.

The developer wiki & blog are currently hosted on http://www.semanticmetadata.net

An online demo can be found at http://demo-itec.uni-klu.ac.at/liredemo/

Lire will be useful if you start collecting images of surveillance cameras or cars going into or out of known alphabet agency parking lots.

HBase CON2013 – Videos are UP!

Filed under: HBase — Patrick Durusau @ 6:19 pm

HBase CON2013 – Videos are UP!

The videos from HBase CON2013 are up!

I will create a sorted speakers list with links to the videos/presentations later this week.

Thought you might be as tired of the PBS fund raising specials as I am. 😉

I would prefer to have shows I like to watch as opposed to the “specials” they have on during fund raising.

Stop writing Regular Expressions. Express them with Verbal Expressions.

Filed under: Interface Research/Design,Topic Maps — Patrick Durusau @ 6:12 pm

Stop writing Regular Expressions. Express them with Verbal Expressions. by Jerod Santo.

From the post:

GitHub user jehna has fashioned a runaway hit with his unique way of constructing difficult regular expressions.

VerbalExpressions turns the often-obscure-and-tricky-to-type regular expression operators into descriptive, chainable functions. The result of this is quite astounding. Here’s the example URL tester from the README:

// Create an example of how to test for correctly formed URLs
var tester = VerEx()
            .startOfLine()
            .then( "http" )
            .maybe( "s" )
            .then( "://" )
            .maybe( "www." )
            .anythingBut( " " )
            .endOfLine();

You can think of other regex languages that are more concise, but can you think of one as easy to teach to new users?

I think there is a lesson here for hand editing of topic maps.

Such as discovering what common terms particular people would use for topic map constructs.

Those terms could become their topic map authoring language, with a translation script that casts their expression into some briefer notation.

Yes?
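
Here is the chainable-builder idea stripped to its bones in Python (purely illustrative, not the VerbalExpressions library), which is roughly the shape such a translation script could take: descriptive vocabulary on the outside, a terser notation generated underneath.

# Purely illustrative: a tiny chainable builder that turns descriptive method
# names into a regular expression, in the spirit of VerbalExpressions.
import re

class Verbal:
    def __init__(self):
        self.parts = []
    def start_of_line(self):
        self.parts.append("^"); return self
    def then(self, text):
        self.parts.append("(?:%s)" % re.escape(text)); return self
    def maybe(self, text):
        self.parts.append("(?:%s)?" % re.escape(text)); return self
    def anything_but(self, chars):
        self.parts.append("[^%s]*" % re.escape(chars)); return self
    def end_of_line(self):
        self.parts.append("$"); return self
    def regex(self):
        return re.compile("".join(self.parts))

tester = (Verbal().start_of_line().then("http").maybe("s").then("://")
          .maybe("www.").anything_but(" ").end_of_line().regex())

print(bool(tester.match("https://www.google.com")))   # True
print(bool(tester.match("not a url")))                 # False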

PLOP – Place Oriented Programming

Filed under: Clojure,Functional Programming — Patrick Durusau @ 5:57 pm

The Value of Values by Rich Hickey.

Description:

Rich Hickey compares value-oriented programming with place-oriented programming concluding that the time of imperative languages has passed and it is the time of functional programming.

A deeply entertaining keynote by Rich Hickey on value based programming.

I do have a couple of quibbles, probably mostly terminology, with his presentation.

My first quibble is that Rich says that values are “semantically transparent.”

There are circumstances where that’s true but I would hesitate to make that claim for all values.

Remember that Egyptian hieroglyphs were “translated” centuries before the Rosetta Stone was found.

We don’t credit those earlier translations now, but I don’t see how a value can be “semantically transparent” (when written), then become “semantically opaque” at some point, and then return to “semantic transparency” at some future point.

How would we judge the “semantic transparency” of any value, simply upon receipt of the value?

If the Egyptian example seems a bit far fetched, what about the number 42?

In one context it meant: Year of the Consulship of Caesar and Piso.

In another context, it was the answer to the Ultimate Question of Life, the Universe, and Everything.

In another context, it was Jackie Robinson‘s number.

In yet another context, it is the natural number immediately following 41 and directly preceding 43 (Now there’s a useful definition.).

So I would say that values are not semantically transparent.

You?
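
To put the first quibble in code, a toy illustration that the value 42 only picks out a subject once a context is supplied:

# Toy illustration: the same value resolves to different subjects by context.
value = 42
contexts = {
    "Roman consular dating": "Year of the Consulship of Caesar and Piso",
    "The Hitchhiker's Guide to the Galaxy": "the Answer to the Ultimate Question",
    "Major League Baseball": "Jackie Robinson's number",
    "Peano arithmetic": "the natural number after 41 and before 43",
}
for context, subject in contexts.items():
    print("%d in the context of %r -> %s" % (value, context, subject))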

My second quibble is with Rich’s definition of a fact as “an event or thing known to have happened or existed.”

That is a very impoverished notion of what qualifies as a fact.

Like the topic map definition of a subject, I think a fact is anything a user chooses to record.

If on no other basis than some particular user claimed the existence of such a “fact.”

Rich closes with a quote from Aldous Huxley: “Facts do not cease to exist because they are ignored.”

I would paraphrase that to read: Facts do not cease to exist because of narrow definitions of “fact.”

Yes?

August 5, 2013

The Uninformed to Liberals: Terry Connell

Filed under: NSA,Security — Patrick Durusau @ 6:16 pm

Terry Connell writes of liberal criticism of the NSA’s “metadata” collection efforts:

Progressives rightly call these climate science deniers Luddites with good reason. But they should be careful that the term does not boomerang against them as they take on the NSA’s metadata telephone record collection program.

And continues to extol the virtues of our “new data analytic science” this way:

The core of scientific method involved hypotheses tested against well-designed samples or “focus groups.” Trial and error and retrial prevailed because that was the best we could do. In terms of intelligence gathering, we spied on people based on suspicions that prompted us to check them out. We looked for the needle in the haystack as best we could — essentially, on our hands and knees.

But now digital science has progressed to the point where we can actually capture the “whole haystack” in its most minute parts. When we do that, the needle just sticks out like a sore thumb. We can see the whole chess board and all the possible moves at once. That’s why Big Blue can now match the world’s best grand masters.

Is Deep Blue a quantum computer?

I thought not.

It isn’t “new data analytic science” to take a known phone number and track incoming and outgoing calls from that number.

Signals intelligence (tracking and intercepting electronic communications) dates from the Boer War in 1900. The British were the only ones broadcasting. Made the task of interpretation much easier.

Nor is Starting with the Needle a new data strategy.

I am interested in advances in data analytic techniques.

But that doesn’t require repeating the mindless hype that “big data” is mystically different from existing data analysis techniques.

And I would be careful about the “dots” that get connected.

On the most recent high alert:

To CNN National Security Contributor Frances Fragos Townsend, the timing of the prison breaks and increased intelligence chatter building up to the end of Ramadan signaled heightened al Qaeda activity that required precautionary steps in response.

“These seem like dots that ought to be connected,” said Townsend, a former homeland security and anti-terrorism adviser to the Bush administration. “You can figure out later whether or not you were right.” (Prison breaks part of heightened security) (emphasis added)

Here is the scary thought: “You can figure out later whether or not you were right.”

I guess if you aren’t a regular target of police sweeps, surveillance, etc., that may not bother you overmuch.

It should.

Semantic Parsing with Combinatory Categorial Grammars

Filed under: Parsing,Semantics — Patrick Durusau @ 10:19 am

Semantic Parsing with Combinatory Categorial Grammars by Yoav Artzi, Nicholas FitzGerald and Luke Zettlemoyer.

Slides from an ACL tutorial, 2013. Three hundred and fifty-one (351) slides.

You may want to also visit: The University of Washington Semantic Parsing Framework v1.3 site where you can download source or binary files.

The ACL wiki introduces combinatory categorical grammars with:

Combinatory Categorial Grammar (CCG) is an efficiently parseable, yet linguistically expressive grammar formalism. It has a completely transparent interface between surface syntax and underlying semantic representation, including predicate-argument structure, quantification and information structure. CCG relies on combinatory logic, which has the same expressive power as the lambda calculus, but builds its expressions differently.

The first linguistic and psycholinguistic arguments for basing the grammar on combinators were put forth by Mark Steedman and Anna Szabolcsi. More recent proponents of the approach are Jacobson and Baldridge. For example, the combinator B (the compositor) is useful in creating long-distance dependencies, as in “Who do you think Mary is talking about?” and the combinator W (the duplicator) is useful as the lexical interpretation of reflexive pronouns, as in “Mary talks about herself”. Together with I (the identity mapping) and C (the permutator) these form a set of primitive, non-interdefinable combinators. Jacobson interprets personal pronouns as the combinator I, and their binding is aided by a complex combinator Z, as in “Mary lost her way”. Z is definable using W and B.

CCG is known to define the same language class as tree-adjoining grammar, linear indexed grammar, and head grammar, and is said to be mildly context-sensitive.

One of the key publications of CCG is The Syntactic Process by Mark Steedman. There are various efficient parsers available for CCG.

The ACL wiki page also lists other software packages and references.
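
If you want to experiment while working through the slides, NLTK ships a CCG module. A small sketch follows; the function names are from memory and the toy lexicon is mine, so check it against your NLTK version’s documentation.

# Small CCG sketch using NLTK's ccg module; API names from memory, toy lexicon invented.
from nltk.ccg import chart, lexicon

lex = lexicon.fromstring(r"""
    :- S, NP
    Mary => NP
    herself => NP
    talks => S\NP
    about => ((S\NP)\(S\NP))/NP
""")

parser = chart.CCGChartParser(lex, chart.DefaultRuleSet)
for parse in parser.parse("Mary talks about herself".split()):
    chart.printCCGDerivation(parse)
    break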

Machine parsing/searching are absolute necessities if you want to create topic maps on a human scale. (Web Scale? Or do you want to try for human scale?)

To surpass current search results, build correction/interaction with users directly into your interface, so that search results “get smarter” the more your interface is used.

In contrast to the pagerank/lemming approach to document searching.

August 4, 2013

Server-side clustering of geo-points…

Server-side clustering of geo-points on a map using Elasticsearch by Gianluca Ortelli.

From the post:

Plotting markers on a map is easy using the tooling that is readily available. However, what if you want to add a large number of markers to a map when building a search interface? The problem is that things start to clutter and it’s hard to view the results. The solution is to group results together into one marker. You can do that on the client using client-side scripting, but as the number of results grows, this might not be the best option from a performance perspective.

This blog post describes how to do server-side clustering of those markers, combining them into one marker (preferably with a counter indicating the number of grouped results). It provides a solution to the “too many markers” problem with an Elasticsearch facet.

The Problem

The image below renders quite well the problem we were facing in a project:

[image: clustering]

The mass of markers is so dense that it replicates the shape of the Netherlands! These items represent monuments and other things of general interest in the Netherlands; for an application we developed for a customer we need to manage about 200,000 of them and they are especially concentrated in the cities, as you can see in this case in Amsterdam: The “draw everything” strategy doesn’t help much here.

Server-side clustering of geo-points will be useful for representing dense geo-points.

Such as an Interactive Surveillance Map.

Or if you were building a map of police and security force sightings over multiple days to build up a pattern database.
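
The post (written in 2013) uses a facet; in current Elasticsearch the same server-side clustering falls out of a geohash_grid aggregation. A rough sketch, with made-up index and field names:

# Rough sketch: server-side marker clustering with a geohash_grid aggregation.
# The index ("monuments") and geo_point field ("location") are made-up names.
import json
import urllib.request

query = {
    "size": 0,
    "aggs": {
        "clusters": {
            "geohash_grid": {"field": "location", "precision": 5}
        }
    },
}
req = urllib.request.Request(
    "http://localhost:9200/monuments/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    buckets = json.load(resp)["aggregations"]["clusters"]["buckets"]

for bucket in buckets:
    # One map marker per geohash cell, labeled with the number of grouped results.
    print(bucket["key"], bucket["doc_count"])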

Web Scale? Or do you want to try for human scale?

Filed under: Data Mining,Machine Learning,Ontology — Patrick Durusau @ 4:41 pm

How often have your heard the claim this or that technology is “web scale?”

How big is “web scale?”

Visit http://www.worldwidewebsize.com/ to get an estimate of the size of the Web.

As of today, the estimated number of indexed web pages for Google is approximately 47 billion pages.

How does that compare, say to scholarly literature?

Would you believe 1 trillion pages of scholarly journal literature?

An incomplete inventory (Fig. 1), divided into biological, social, and physical sciences, contains 400, 200, and 65 billion pages, respectively (see supplemental data*).

Or better with an image:

[image: webscale]

I didn’t bother putting in the trillion page data but for your information, the indexed Web is < 5% of all scholarly journal literature.
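
The arithmetic behind that “< 5%,” using the numbers above:

# The "< 5%" figure, from the numbers quoted above.
indexed_web = 47e9    # Google's estimated indexed pages (worldwidewebsize.com)
scholarly   = 1e12    # estimated pages of scholarly journal literature
print("{:.1%}".format(indexed_web / scholarly))   # 4.7%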

Nor did I try to calculate the data that Chicago is collecting every day with 10,000 video cameras.

Is your app ready to step up to human scale information retrieval?

*Advancing science through mining libraries, ontologies, and communities by JA Evans, A. Rzhetsky. J Biol Chem. 2011 Jul 8;286(27):23659-66. doi: 10.1074/jbc.R110.176370. Epub 2011 May 12.

Interesting times for literary theory

Filed under: Computer Science,Humanities,Literature — Patrick Durusau @ 3:16 pm

Interesting times for literary theory by Ted Underwood.

From the post:

(…)
This could be the beginning of a beautiful friendship. I realize a marriage between machine learning and literary theory sounds implausible: people who enjoy one of these things are pretty likely to believe the other is fraudulent and evil.** But after reading through a couple of ML textbooks,*** I’m convinced that literary theorists and computer scientists wrestle with similar problems, in ways that are at least loosely congruent. Neither field is interested in the mere accumulation of data; both are interested in understanding the way we think and the kinds of patterns we recognize in language. Both fields are interested in problems that lack a single correct answer, and have to be mapped in shades of gray (ML calls these shades “probability”). Both disciplines are preoccupied with the danger of overgeneralization (literary theorists call this “essentialism”; computer scientists call it “overfitting”). Instead of saying “every interpretation is based on some previous assumption,” computer scientists say “every model depends on some prior probability,” but there’s really a similar kind of self-scrutiny involved.
(…)

Computer science and the humanities could enrich each other greatly.

This could be a starting place for that enrichment.

DBLP feeds

Filed under: Bibliography,Computer Science — Patrick Durusau @ 2:16 pm

DBLP feeds

An RSS feed for conferences and journals that appear in the DBLP Computer Science Bibliography.

I count 1448 conference and 977 journal RSS feeds.

A great resource that merits your attention.

Google, Friends with Big Brother

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 1:59 pm

I was reading Update: Now We Know Why Googling ‘Pressure Cookers’ Gets a Visit from Cops and decided to write a post on how the lonely, depressed, etc., can get new friends by Web searches.

So I start writing:

Try searching for “backpacks” and “pressure cookers.”

If you need friends more quickly, search for “pressure cooker bomb instructions.”

But then I noticed something funny about my search URL.

I first searched for “pressure cooker bomb instructions” and got:

https://www.google.com/search?q=
pressure+cooker+bomb+instructions
&oq=pressure+cooker+bomb+instructions

OK, that makes sense, but then I searched for just “pressure cookers.”

https://www.google.com/search?q=
pressure+cooker+bomb+instructions
&oq=pressure+cooker+bomb+instructions&aqs=
chrome.0.69i57j0l3.10770j0&sourceid=chrome&ie=UTF-8#bav=
on.2,or.r_cp.r_qf.&fp=ad7ba9d9beaae1da&q=
pressure+cooker

So, why is my prior search for “pressure cooker bomb instructions” showing up with my search for “pressure cookers”?

Try putting “backpack” in the searchbox next:

https://www.google.com/search?q=
pressure+cooker+bomb+instructions
&oq=pressure+cooker+bomb+instructions&aqs=
chrome.0.69i57j0l3.10770j0&sourceid=chrome&ie=
UTF-8#bav=on.2,or.r_cp.r_qf.&fp=ad7ba9d9beaae1da&q=
backpack

Now my search for “pressure cooker bomb instructions” shows up with my search for “backpack.”

A “search customization” feature of Google, but do you really think the government has the chops to mine Google search data effectively?

Or is it more likely that Google is using its expertise against the rest of us at the request of the government?

What do you think?

PS: When Google corporate protests that it is under government orders, remind them the word they have forgotten is “no.”

If you thought people died in the 20th century because citizens forgot how to say “no” to government, you haven’t seen anything yet.

By this time next century, they won’t be able to number government victims in the 21st century.


Addendum

A close friend pointed out the reason for my analysis of the Google URLs was unclear.

Google is pre-collating search queries, such as putting “pressure cooker bomb instructions” together with “pressure cookers.”

Google/NSA does not have to collate those queries to a particular user on its own. They come pre-collated, courtesy of Google.
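
You can check the pre-collation yourself by parsing the last URL above:

# Parsing the last URL above shows both queries riding along: the old one in
# the query string, the new one in the fragment.
from urllib.parse import urlparse, parse_qs

url = ("https://www.google.com/search?q=pressure+cooker+bomb+instructions"
       "&oq=pressure+cooker+bomb+instructions&aqs=chrome.0.69i57j0l3.10770j0"
       "&sourceid=chrome&ie=UTF-8"
       "#bav=on.2,or.r_cp.r_qf.&fp=ad7ba9d9beaae1da&q=backpack")

parts = urlparse(url)
print(parse_qs(parts.query)["q"])     # ['pressure cooker bomb instructions']
print(parse_qs(parts.fragment)["q"])  # ['backpack']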

Building Smaller Data

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:41 am

Throw the Bath Water Out, Keep the Baby: Keeping Medically-Relevant Terms for Text Mining by Jay Jarman, MS and Donald J. Berndt, PhD.

Abstract:

The purpose of this research is to answer the question, can medically-relevant terms be extracted from text notes and text mined for the purpose of classification and obtain equal or better results than text mining the original note? A novel method is used to extract medically-relevant terms for the purpose of text mining. A dataset of 5,009 EMR text notes (1,151 related to falls) was obtained from a Veterans Administration Medical Center. The dataset was processed with a natural language processing (NLP) application which extracted concepts based on SNOMED-CT terms from the Unified Medical Language System (UMLS) Metathesaurus. SAS Enterprise Miner was used to text mine both the set of complete text notes and the set represented by the extracted concepts. Logistic regression models were built from the results, with the extracted concept model performing slightly better than the complete note model.

The researchers created two datasets. One composed of the original text medical notes and the second of extracted named entities using NLP and medical vocabularies.

The named entity only dataset was found to perform better than the full text mining approach.

A smaller data set had higher performance than the larger data set of full notes.

Wait! Isn’t that backwards? I thought “big data” was always better than “smaller data?”

Maybe not?

Maybe having the “right” dataset is better than having a “big data” set.
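
For a feel of the comparison the paper makes, here is a toy sketch: classify notes from full text versus from extracted concept terms only. The notes, labels, and keyword “extractor” are stand-ins for the EMR data and the SNOMED-CT/UMLS pipeline.

# Toy sketch: full-text features vs. concept-only features for classification.
# The notes, labels, and keyword "extractor" are stand-ins, not the study's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

notes = [
    "patient slipped on wet floor and fell, hip pain",
    "fell from ladder at home, wrist fracture noted",
    "routine follow-up, medications refilled",
    "mild headache, no acute findings",
]
labels = [1, 1, 0, 0]                      # 1 = fall-related note

def extract_concepts(note):
    # Stand-in for the NLP + SNOMED-CT/UMLS concept extraction step.
    vocabulary = {"fell", "fall", "fracture", "slipped", "hip"}
    return " ".join(w.strip(",") for w in note.split() if w.strip(",") in vocabulary)

for name, docs in [("full text", notes),
                   ("concepts only", [extract_concepts(n) for n in notes])]:
    X = TfidfVectorizer().fit_transform(docs)
    model = LogisticRegression().fit(X, labels)
    print(name, model.score(X, labels))    # training accuracy on the toy data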

The 97% Junk Part of Human DNA

Filed under: Bioinformatics,Biomedical,Gene Ontology,Genome,Genomics — Patrick Durusau @ 9:21 am

Researchers from the Gene and Stem Cell Therapy Program at Sydney’s Centenary Institute have confirmed that, far from being “junk,” the 97 per cent of human DNA that does not encode instructions for making proteins can play a significant role in controlling cell development.

And in doing so, the researchers have unravelled a previously unknown mechanism for regulating the activity of genes, increasing our understanding of the way cells develop and opening the way to new possibilities for therapy.

Using the latest gene sequencing techniques and sophisticated computer analysis, a research group led by Professor John Rasko AO and including Centenary’s Head of Bioinformatics, Dr William Ritchie, has shown how particular white blood cells use non-coding DNA to regulate the activity of a group of genes that determines their shape and function. The work is published today in the scientific journal Cell.*

There’s a poke with a sharp stick to any gene ontology.

Roles in associations of genes have suddenly expanded.

Your call:

  1. Wait until a committee can officially name the new roles and parts of the “junk” that play those roles, or
  2. Create names/roles on the fly and merge those with subsequent identifiers on an ongoing basis as our understanding improves.

Any questions?
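
Option 2 in code, stripped to its essence (every identifier below is invented for illustration):

# Option 2: name the new role now, merge in the official identifier whenever a
# committee gets around to minting one. All identifiers here are invented.
subjects = {
    "intron-retention-regulator": {
        "identifiers": {"local:intron-retention-regulator"},
        "roles": ["regulates granulocyte differentiation"],
    },
}

def merge_identifier(subjects, local_name, official_id):
    subjects[local_name]["identifiers"].add(official_id)

merge_identifier(subjects, "intron-retention-regulator", "GO:9999999 (hypothetical)")
print(subjects["intron-retention-regulator"]["identifiers"])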

*Justin J.-L. Wong, William Ritchie, Olivia A. Ebner, Matthias Selbach, Jason W.H. Wong, Yizhou Huang, Dadi Gao, Natalia Pinello, Maria Gonzalez, Kinsha Baidya, Annora Thoeng, Teh-Liane Khoo, Charles G. Bailey, Jeff Holst, John E.J. Rasko. Orchestrated Intron Retention Regulates Normal Granulocyte Differentiation. Cell, 2013; 154 (3): 583 DOI: 10.1016/j.cell.2013.06.052

August 3, 2013

Interactive Surveillance Map?

Filed under: Maps,Security — Patrick Durusau @ 7:39 pm

Interactive crime map of London (Map by James Cheshire.)

An interactive map of crime in London, May 2012 – April 2013.

Which should be titillating for tourists, etc.

A better interactive map would be of the London surveillance cameras and their fields of view.

In case you didn’t know, the number of surveillance cameras is increasing.

Ginny Sloan writes in Will More Video Surveillance Cameras Make Us Any Safer?:

In the wake of the Boston marathon bombing, Boston Police Commissioner Davis has called for more surveillance cameras, and press accounts report new calls for cameras from Richmond, Virginia to San Francisco. Mayor Emmanuel has said Chicago will keep adding cameras, and Mayor Bloomberg is warning New York City residents that more cameras are coming, scoffing at complaints that this will be “Big Brother,” and telling New Yorkers to “Get used to it!” But does the Boston investigation really teach us that what our major cities need is more cameras?

True, it was video surveillance footage from a department store camera that provided the first important clues leading to the suspects in the marathon bombing. Additional video footage from members of the public also helped police identify and apprehend the suspects. The law enforcement officials who sought and examined the video footage, and the businesses and individuals who provided their videos in response, all deserve our praise and gratitude.

But we must be careful in identifying lessons from this use of video evidence. Most importantly, we should recognize that video cameras did not, and cannot, prevent an attack like the Boston marathon bombing. Nor did the ubiquitous cameras in London, the most-surveilled city on the planet, prevent the devastating bombing attacks in that city in 2005. This is not to discredit the important role that surveillance footage has played in identifying suspects after the fact in these cases and others. Yet increasing the number of cameras in cities like Boston, or Chicago — which already has over ten-thousand cameras — would not convert the cameras into a terrorism-prevention tool. Nor is there any indication that Boston investigators were hampered by having too little video footage to examine.

I think Ginny is missing the point. Cameras are cheaper than police officers; they don’t get sick or need insurance, paid vacation, or retirement. No, cameras are never going to prevent any crimes, but then that isn’t the point.

The point is that cameras are an easy way to appear to be doing something, even if the something is ineffectual and an invasion of your privacy.

If you want to protect your privacy and the privacy of others, take pictures of surveillance cameras with a GPS enabled cellphone.

That won’t give you field of view but just having all of them located will be a major step forward.

Chicago has approximately 2.7 million residents. With 10,000 cameras, if one out of every 270 residents took an image of a single camera, all of the cameras’ locations would be captured.

Hardly a secret, the cameras are in public view.

The freedom you regain may be your own.
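
If you do start collecting GPS-tagged photos of cameras, pulling the coordinates out of the EXIF data is the easy part. A rough sketch with Pillow follows; how the rational values come back differs across Pillow versions, so treat the conversion as a guide only.

# Rough sketch: read GPS coordinates from a phone photo's EXIF data with Pillow.
# Tag 34853 is the standard EXIF GPS IFD; rational handling varies by version.
from PIL import Image

GPSINFO_TAG = 34853

def to_degrees(dms, ref):
    degrees, minutes, seconds = (float(v) for v in dms)
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value

def photo_location(path):
    exif = Image.open(path)._getexif() or {}
    gps = exif.get(GPSINFO_TAG)
    if not gps:
        return None
    return (to_degrees(gps[2], gps[1]),   # GPSLatitude, GPSLatitudeRef
            to_degrees(gps[4], gps[3]))   # GPSLongitude, GPSLongitudeRef

print(photo_location("camera_photo.jpg"))  # hypothetical file name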
