Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 15, 2013

NSA — Untangling the Web: A Guide to Internet Research

Filed under: Humor,Requirements,Research Methods,WWW — Patrick Durusau @ 2:28 pm

NSA — Untangling the Web: A Guide to Internet Research

A Freedom of Information Act (FOIA) request caused the NSA to disgorge its guide to web research, which is some six years out of date.

From the post:

The National Security Agency just released “Untangling the Web,” an unclassified how-to guide to Internet search. It’s a sprawling document, clocking in at over 650 pages, and is the product of many years of research and updating by a NSA information specialist whose name is redacted on the official release, but who is identified as Robyn Winder of the Center for Digital Content on the Freedom of Information Act request that led to its release.

It’s a droll document on many levels. First and foremost, it’s funny to think of officials who control some of the most sophisticated supercomputers and satellites ever invented turning to a .pdf file for tricks on how to track down domain name system information on an enemy website. But “Untangling the Web” isn’t for code-breakers or wire-tappers. The target audience seems to be staffers looking for basic factual information, like the preferred spelling of Kazakhstan, or telephonic prefix information for East Timor.

I take it as guidance on how “good” your application or service needs to be to pitch to the government.

I keep thinking that to attract government attention, an application needs to fall just short of solving P = NP.

On the contrary, the government needs spell checkers, phone information and no doubt lots of other dull information, quickly.

Perhaps an app that signals fresh doughnuts from bakeries within X blocks would be just the thing. 😉

Stock Trading – The Movie

Filed under: BigData,Data Streams — Patrick Durusau @ 2:10 pm

How Half a Second of High Frequency Stock Trading Looks Like by Andrew Vande Moere.

(image omitted: stock trading visualization)

If you fancy your application as handling data at velocity with a capital V, you need to see the movie of half a second of stock trades.

The rate is slowed down so you can see the trades at millisecond intervals.

From the post:

In the movie, one can observe how High Frequency Traders (HFT) jam thousands of quotes at the millisecond level, and how every exchange must process every quote from the others for proper trade through price protection. This complex web of technology must run flawlessly every millisecond of the trading day, or arbitrage (HFT profit) opportunities will appear. However, it is easy for HFTs to cause delays in one or more of the connections between each exchange. Yet if any of the connections are not running perfectly, High Frequency Traders tend to profit from the price discrepancies that result.

See Andrew’s post for the movie and more details.

Topic Maps in Lake Wobegon

Filed under: Authoring Topic Maps,Decision Making,Topic Maps — Patrick Durusau @ 12:55 pm

Jim Harris writes in The Decision Wobegon Effect:

In his book The Most Human Human, Brian Christian discussed what Baba Shiv of the Stanford Graduate School of Business called the decision dilemma, “where there is no objectively best choice, where there are simply a number of subjective variables with trade-offs between them. The nature of the situation is such that additional information probably won’t even help. In these cases – consider the parable of the donkey that, halfway between two bales of hay and unable to decide which way to walk, starves to death – what we want, more than to be correct, is to be satisfied with our choice (and out of the dilemma).”

(…)

Jim describes the Wobegon effect, an effect that blinds decision makers to alternative bales of hay.

Topic maps are composed of a mass of decisions, both large and small.

Is the Wobegon effect affecting your topic map authoring?

Check Jim’s post and think about your topic map authoring practices.

May 14, 2013

Eating dog food with Lucene

Filed under: Lucene,Solr — Patrick Durusau @ 4:22 pm

Eating dog food with Lucene by Michael McCandless.

From the post:

Eating your own dog food is important in all walks of life: if you are a chef you should taste your own food; if you are a doctor you should treat yourself when you are sick; if you build houses for a living you should live in a house you built; if you are a parent then try living by the rules that you set for your kids (most parents would fail miserably at this!); and if you build software you should constantly use your own software.

So, for the past few weeks I’ve been doing exactly that: building a simple Lucene search application, searching all Lucene and Solr Jira issues, and using it instead of Jira’s search whenever I need to go find an issue.

It’s currently running at jirasearch.mikemccandless.com and it’s still quite rough (feedback welcome!).

Now there’s a way to learn the details!

Makes me think about the poor search capabilities at an SDO I frequent.

Could be a way to spend some quality time with Lucene and Solr.

Will have to give it some thought.

Interpreting the knowledge map of digital library research (1990–2010)

Filed under: Digital Library,Knowledge Graph,Knowledge Map — Patrick Durusau @ 4:13 pm

Interpreting the knowledge map of digital library research (1990–2010) by Son Hoang Nguyen and Gobinda Chowdhury. (Nguyen, S. H. and Chowdhury, G. (2013), Interpreting the knowledge map of digital library research (1990–2010). J. Am. Soc. Inf. Sci., 64: 1235–1258. doi: 10.1002/asi.22830)

Abstract:

A knowledge map of digital library (DL) research shows the semantic organization of DL research topics and also the evolution of the field. The research reported in this article aims to find the core topics and subtopics of DL research in order to build a knowledge map of the DL domain. The methodology is comprised of a four-step research process, and two knowledge organization methods (classification and thesaurus building) were used. A knowledge map covering 21 core topics and 1,015 subtopics of DL research was created and provides a systematic overview of DL research during the last two decades (1990–2010). We argue that the map can work as a knowledge platform to guide, evaluate, and improve the activities of DL research, education, and practices. Moreover, it can be transformed into a DL ontology for various applications. The research methodology can be used to map any human knowledge domain; it is a novel and scientific method for producing comprehensive and systematic knowledge maps based on literary warrant.

This is an impressive piece of work and likely to be read by librarians, particularly digital librarians.

That restricted readership is unfortunate because anyone building a knowledge (topic) map will benefit from the research methodology detailed in this article.

Information organization and the philosophy of history

Filed under: History,Library,Philosophy,Subject Identity — Patrick Durusau @ 3:54 pm

Information organization and the philosophy of history by Ryan Shaw. (Shaw, R. (2013), Information organization and the philosophy of history. J. Am. Soc. Inf. Sci., 64: 1092–1103. doi: 10.1002/asi.22843)

Abstract:

The philosophy of history can help articulate problems relevant to information organization. One such problem is “aboutness”: How do texts relate to the world? In response to this problem, philosophers of history have developed theories of colligation describing how authors bind together phenomena under organizing concepts. Drawing on these ideas, I present a theory of subject analysis that avoids the problematic illusion of an independent “landscape” of subjects. This theory points to a broad vision of the future of information organization and some specific challenges to be met.

You are unlikely to find this article directly actionable in your next topic map project.

On the other hand, if you enjoy the challenge of thinking about how we think, you will find it a real treat.

Shaw writes:

Different interpretive judgments result in overlapping and potentially contradictory organizing principles. Organizing systems ought to make these overlappings evident and show the contours of differences in perspective that distinguish individual judgments. Far from providing a more “complete” view of a static landscape, organizing systems should multiply and juxtapose views. As Geoffrey Bowker (2005) has argued,

the goal of metadata standards should not be to produce a convergent unity. We need to open a discourse—where there is no effective discourse now—about the varying temporalities, spatialities and materialities that we might represent in our databases, with a view to designing for maximum flexibility and allowing as much as possible for an emergent polyphony and polychrony. (pp. 183–184)

The demand for polyphony and polychrony leads to a second challenge, which is to find ways to open the construction of organizing systems to wider participation. How might academics, librarians, teachers, public historians, curators, archivists, documentary editors, genealogists, and independent scholars all contribute to a shared infrastructure for linking and organizing historical discourse through conceptual models? If this challenge can be addressed, the next generation of organizing systems could provide the infrastructure for new kinds of collaborative scholarship and organizing practice.

Once upon a time, you could argue that physical limitations of cataloging systems meant that a single classification system (convergent unity) was necessary for systems to work at all.

But that was an artifact of the physical medium of the catalog.

The deepest irony of the digital age is the continuation of the single classification system requirement, a requirement past its discard date.

Binify + D3 = Gorgeous honeycomb maps

Filed under: D3,Graphics,Maps,Visualization — Patrick Durusau @ 2:42 pm

Binify + D3 = Gorgeous honeycomb maps by Chris Wilson.

From the post:

Most Americans prefer to huddle together around urban areas, which raises all sorts of problems for map-based visualizations. Coloring regions according to a data value, known as a choropleth map, leaves the map maker beholden to arbitrary political boundaries and, at the county level, pixel-wide polygons in parts of the Northeast. Many publications prefer to place dots proportional in area to the data values over the center of each county, which inevitably produces overlapping circles in these same congested regions. Here’s a particularly atrocious example of that strategy I once made at Slate:

(image omitted: Slate map)

Two weeks ago, Kevin Schaul released an exciting new command-line tool called binify that offers a brilliant alternative. Schaul’s tool takes a series of points and clusters them (or “bins” them) into hexagonal tiles. Check out the introductory blog post on his site.

Binify operates on .shp files, which can be a bit difficult to work with for those of us who aren’t GIS pros. I put together this tutorial to demonstrate how you can take a raw series of coordinates and end up with a binned hexagonal map rendered in the browser using d3js and topojson, both courtesy of the beautiful mind of Mike Bostock. All the source files we’ll need are on Github.

I think everyone will agree with Chris, that is truly an ugly map. 😉

Chris’ post takes you through how to make a much better one.
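Binify itself works on shapefiles and the tutorial renders the result with d3js and topojson, but the basic idea, pouring scattered points into hexagonal bins, is easy to try for yourself. Here is a minimal sketch in Python using matplotlib's hexbin on synthetic points; the cities and cluster sizes are made up for illustration, not taken from Chris' data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic "population" points: a few dense urban clusters plus rural scatter.
rng = np.random.default_rng(42)
clusters = [(-74.0, 40.7), (-87.6, 41.9), (-118.2, 34.1)]  # rough NYC, Chicago, LA
x = np.concatenate([rng.normal(cx, 0.8, 4000) for cx, cy in clusters] +
                   [rng.uniform(-125, -67, 2000)])
y = np.concatenate([rng.normal(cy, 0.8, 4000) for cx, cy in clusters] +
                   [rng.uniform(25, 49, 2000)])

# Hexagonal binning: each hexagon's color encodes how many points fall inside it,
# which avoids the overlapping-circles problem of proportional-dot maps.
plt.figure(figsize=(9, 5))
plt.hexbin(x, y, gridsize=40, bins='log', mincnt=1)
plt.colorbar(label='points per hexagon (log scale)')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Hexagonal binning of point data (toy example)')
plt.show()
```

The real tutorial works in proper geographic projections; this toy version just shows why hexagons read better than piles of overlapping dots.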

Aaron Swartz – Accountability?

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 2:14 pm

Court orders names to be withheld before release of Aaron Swartz records by John Ribeiro.

From the post:

The government dismissed charges against Swartz shortly after his death. But his estate filed to remove a protective order of November 2011, barring disclosure of documents, files or records except in certain situations. The estate cited the need to disclose the records to the U.S. Congress after a House Committee on Oversight and Government Reform decided to investigate the prosecution of Swartz, and review one of the statutes under which he was charged.

MIT, JSTOR and the government, however, asked that the names and other personal identification of their staff referred to in the documents should be redacted.

(…)

The judge said the court concludes that “the estate’s interest in disclosing the identity of individuals named in the production, as it relates to enhancing the public’s understanding of the investigation and prosecution of Mr. Swartz, is substantially outweighed by the interest of the government and the victims in shielding their employees from potential retaliation.”

Well, that certainly makes sense.

The government and MIT can smear Aaron Swartz, engage in “intimidation and prosecutorial overreach,” literally drive Aaron to suicide, but after all, it’s MIT and the U.S. Attorney’s office.

Why should they be held accountable?

It’s clear the government isn’t going to hold those responsible accountable, but that doesn’t have to be the end of the story.

First, MIT donors can withhold donations to MIT unless and until such time as MIT outs all of those involved at MIT and they are no longer employed by MIT.

Second, everyone in education, industry and technology, here or abroad, can shun those outed by MIT. No jobs, no appointments, no contracts, not ever. They need a long opportunity to feel some of the pain they inflicted on Aaron Swartz.

Third, the U.S. Attorney’s office personnel should be known from court records, although holding them accountable may be more difficult.

Their conduct in this case will be a plus for the sort of law firms likely to hire them when they leave government service.

You will have to be creative in finding legal social practices to make them sincerely regret their conduct in this case.

If the government won’t act on our behalf, who else do we have to turn to?

Cascade: Crowdsourcing Taxonomy Creation

Filed under: Crowd Sourcing,Taxonomy — Patrick Durusau @ 12:52 pm

Cascade: Crowdsourcing Taxonomy Creation by Lydia B. Chilton, Greg Little, Darren Edge, Daniel S. Weld, James A. Landay.

Abstract:

Taxonomies are a useful and ubiquitous way of organizing information. However, creating organizational hierarchies is difficult because the process requires a global understanding of the objects to be categorized. Usually one is created by an individual or a small group of people working together for hours or even days. Unfortunately, this centralized approach does not work well for the large, quickly-changing datasets found on the web. Cascade is an automated workflow that creates a taxonomy from the collective efforts of crowd workers who spend as little as 20 seconds each. We evaluate Cascade and show that on three datasets its quality is 80-90% of that of experts. The cost of Cascade is competitive with expert information architects, despite taking six times more human labor. Fortunately, this labor can be parallelized such that Cascade will run in as fast as five minutes instead of hours or days.

In the introduction the authors say:

Crowdsourcing has become a popular way to solve problems that are too hard for today’s AI techniques, such as translation, linguistic tagging, and visual interpretation. Most successful crowdsourcing systems operate on problems that naturally break into small units of labor, e.g., labeling millions of independent photographs. However, taxonomy creation is much harder to decompose, because it requires a global perspective. Cascade is a unique, iterative workflow that emergently generates this global view from the distributed actions of hundreds of people working on small, local problems.

The authors demonstrate the potential for time and cost savings in the creation of taxonomies but I take the significance of their paper to be something different.

As the paper demonstrates, taxonomy creation does not require a global perspective.

Any one of the individuals who participated, contributed localized knowledge that when combined with other localized knowledge, can be formed into what an observer would call a taxonomy.

A critical point since every user represents/reflects slightly varying experiences and viewpoints, while the most learned expert represents only one.

Does “your” taxonomy reflect your views or some expert’s?

What You Don’t See Makes A Difference

Filed under: Multimedia,Recommendation — Patrick Durusau @ 12:21 pm

Social and Content Hybrid Image Recommender System for Mobile Social Networks by Faustino Sanchez, Marta Barrilero, Silvia Uribe, Federico Alvarez, Agustin Tena, Jose Manuel Menendez.

Recommender System for Sport Videos Based on User Audiovisual Consumption by Sanchez, F.; Alduan, M.; Alvarez, F.; Menendez, J.M.; Baez, O.

A pair of papers I discovered at: New Model to Recommend Media Content According to Your Preferences, which summarizes the work as:

Traditional recommender systems usually use semantic techniques that describe products by themes or tags similar to the user’s interests, or algorithms that use the collective intelligence of a large set of users, so that the system recommends items that suit other people with similar preferences.

From this starting point, an applied model for multimedia content that goes beyond this paradigm has been developed; it incorporates other features whose influence the user is not always aware of and which, for that reason, have not been used so far in these types of systems.

Therefore, researchers at the UPM have analyzed in depth the audiovisual features that can be influential for users and they proved that some of these features that determine aesthetic trends and usually go unnoticed can be decisive when defining the user tastes.

For example, researchers proved that in a movie, the relative information to the narrative rhythm (shot length, scenes and sequences), the movements (camera or frame content) or the image nature (brightness, color, texture, information quantity) is relevant when cataloguing the preferences of each piece of information. Analogously to the movies, the researchers have analyzed images using a subset of descriptors considered in the case of video.

In order to verify this model, researchers used a database of 70,000 users and a million reviews of a set of 200 movies whose features were previously extracted.

These descriptors, once they are standardized, processed and generated adequate statistical data, allow researchers to formally characterize the contents and to find the influence degree on each user as well as their preference conditions.

This makes me curious about how to exploit similar “unseen / unnoticed” factors that influence subject identification?

Both from a quality control perspective but also for the design of topic map authoring/consumption interfaces.

Our senses, as Scrooge points out, are easily deceived: “a slight disorder of the stomach makes them cheats.”

Now we know they may be cheating and we are unaware of it.
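To make the idea of “unseen” features concrete, here is a hedged sketch of pulling a few low-level image descriptors (brightness, color spread, a crude texture measure) and ranking candidates by similarity to something the user liked. The descriptors and file names are my own illustrative stand-ins, not the UPM researchers’ actual feature set:

```python
import numpy as np
from PIL import Image

def image_features(path):
    """Crude stand-ins for 'unseen' aesthetic descriptors: brightness,
    color spread, and a simple texture measure (edge energy)."""
    arr = np.asarray(Image.open(path).convert("RGB"), dtype=float) / 255.0
    brightness = arr.mean()
    color_spread = arr.std(axis=(0, 1)).mean()  # spread of the RGB channels
    gray = arr.mean(axis=2)
    texture = np.abs(np.diff(gray, axis=0)).mean() + np.abs(np.diff(gray, axis=1)).mean()
    return np.array([brightness, color_spread, texture])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical catalog: recommend items whose low-level "look" resembles what
# the user already liked, regardless of genre tags or collaborative signals.
# The .jpg names below are placeholders for whatever images you have on disk.
liked = image_features("liked_poster.jpg")
candidates = {p: image_features(p) for p in ["cand_a.jpg", "cand_b.jpg"]}
ranked = sorted(candidates, key=lambda p: cosine(liked, candidates[p]), reverse=True)
print(ranked)
```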

CHI2013 [Warning: Cognitive Overload Ahead]

Filed under: CHI,HCIR,Interface Research/Design,Usability,Users,UX — Patrick Durusau @ 9:52 am

I have commented on several papers from CHI2013 that Enrico Bertini posted to his blog.

I wasn’t aware of the difficulty Enrico must have had coming up with his short list!

Take a look at the day-by-day schedule for CHI2013.

You will gravitate to some papers more than others. But I haven’t seen any slots that don’t have interesting material.

It may be an oversight on my part, but I did not see any obvious links to the presentations/papers.

Definitely a resource to return to over and over again.

Labels and Schema Indexes in Neo4j

Filed under: Cypher,Indexing,Neo4j — Patrick Durusau @ 9:24 am

Labels and Schema Indexes in Neo4j by Tareq Abedrabbo.

From the post:

Neo4j recently introduced the concept of labels and their sidekick, schema indexes. Labels are a way of attaching one or more simple types to nodes (and relationships), while schema indexes allow to automatically index labelled nodes by one or more of their properties. Those indexes are then implicitly used by Cypher as secondary indexes and to infer the starting point(s) of a query.

I would like to shed some light in this blog post on how these new constructs work together. Some details will be inevitably specific to the current version of Neo4j and might change in the future but I still think it’s an interesting exercise.

Before we start though I need to populate the graph with some data. I’m more into cartoon for toddlers than second-rate sci-fi and therefore Peppa Pig shall be my universe.

So let’s create some labeled graph resources.

Nice review of the impact of the new label + schema index features in Neo4j.

I am still wondering why Neo4j “simple types” cannot be added to nodes and edges without the additional machinery of labels.

Why not simply allow users to declare properties to be indexed and used by Cypher for queries?

That would create a generalized mechanism that requires no changes to the data model.

I have a question pending with the Neo4j team on this issue and will report back with their response.
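In the meantime, for concreteness, here is roughly what the label plus schema index workflow looks like from Python. This is a minimal sketch assuming a local Neo4j instance and the current official Python driver; the index syntax below is the newer CREATE INDEX FOR form, whereas the 2013-era syntax in Tareq's post was CREATE INDEX ON :Label(property):

```python
from neo4j import GraphDatabase

# Assumptions: a local Neo4j instance on the default bolt port with these credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # A label attaches a simple type to a node...
    session.run("CREATE (:Character {name: 'Peppa Pig'}), (:Character {name: 'George Pig'})")

    # ...and a schema index lets Cypher look nodes up by label + property
    # instead of scanning the whole graph.
    session.run("CREATE INDEX character_name IF NOT EXISTS "
                "FOR (c:Character) ON (c.name)")

    # Cypher uses the index implicitly to pick the starting point of the query.
    result = session.run("MATCH (c:Character) WHERE c.name = $name RETURN c.name AS name",
                         name="Peppa Pig")
    for record in result:
        print(record["name"])

driver.close()
```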

HeadStart for Planet Earth [Titan]

Filed under: Education,Graphs,Networks,Titan — Patrick Durusau @ 8:45 am

Educating the Planet with Pearson by Marko A. Rodriguez.

From the post:

Pearson is striving to accomplish the ambitious goal of providing an education to anyone, anywhere on the planet. New data processing technologies and theories in education are moving much of the learning experience into the digital space — into massive open online courses (MOOCs). Two years ago Pearson contacted Aurelius about applying graph theory and network science to this burgeoning space. A prototype proved promising in that it added novel, automated intelligence to the online education experience. However, at the time, there did not exist scalable, open-source graph database technology in the market. It was then that Titan was forged in order to meet the requirement of representing all universities, students, their resources, courses, etc. within a single, unified graph. Moreover, beyond representation, the graph needed to be able to support sub-second, complex graph traversals (i.e. queries) while sustaining at least 1 billion transactions a day. Pearson asked Aurelius a simple question: “Can Titan be used to educate the planet?” This post is Aurelius’ answer.

Liking the graph approach in general and Titan in particular does not make me any more comfortable with some aspects of this posting.

You don’t need to spin up a very large Cassandra database on Amazon to see the problems.

Consider the number of concepts for educating the world, some 9,000 if the chart is to be credited.

Suggested Upper Merged Ontology (SUMO) has “~25,000 terms and ~80,000 axioms when all domain ontologies are combined.”

The SUMO totals being before you get into the weeds of any particular subject, discipline or course material.

Or the subset of concepts and facts represented in DBpedia:

The English version of the DBpedia knowledge base currently describes 3.77 million things, out of which 2.35 million are classified in a consistent Ontology, including 764,000 persons, 573,000 places (including 387,000 populated places), 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games), 192,000 organizations (including 45,000 companies and 42,000 educational institutions), 202,000 species and 5,500 diseases.

In addition, we provide localized versions of DBpedia in 111 languages. All these versions together describe 20.8 million things, out of which 10.5 million overlap (are interlinked) with concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 10.3 million unique things in up to 111 different languages; 8.0 million links to images and 24.4 million HTML links to external web pages; 27.2 million data links into external RDF data sets, 55.8 million links to Wikipedia categories, and 8.2 million YAGO categories. The dataset consists of 1.89 billion pieces of information (RDF triples) out of which 400 million were extracted from the English edition of Wikipedia, 1.46 billion were extracted from other language editions, and about 27 million are data links to external RDF data sets. The Datasets page provides more information about the overall structure of the dataset. Dataset Statistics provides detailed statistics about 22 of the 111 localized versions.

I don’t know if the 9,000 concepts cited in the post would be sufficient for a world wide HeadStart program in multiple languages.

Moreover, why would any sane person want a single unified graph to represent course delivery from Zaire to the United States?

How is a single unified graph going to deal with the diversity of educational institutions around the world? A diversity that I take as a good thing.

It sounds like Pearson is offering a unified view of education.

My suggestion is to consider the value of your own diversity before passing on that offer.

May 13, 2013

How to Build a Text Mining, Machine Learning….

Filed under: Document Classification,Machine Learning,R,Text Mining — Patrick Durusau @ 3:51 pm

How to Build a Text Mining, Machine Learning Document Classification System in R! by Timothy D’Auria.

From the description:

We show how to build a machine learning document classification system from scratch in less than 30 minutes using R. We use a text mining approach to identify the speaker of unmarked presidential campaign speeches. Applications in brand management, auditing, fraud detection, electronic medical records, and more.

A well-made video introduction to R and text mining.
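The video builds everything in R; for comparison, the same pipeline (vectorize the documents, train a classifier, predict the speaker) is only a few lines in Python with scikit-learn. This is a generic sketch with toy data, not a reconstruction of the video's code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: labeled speeches standing in for the campaign speech corpus.
speeches = [
    "we will cut taxes and shrink government",
    "government must invest in jobs and education",
    "lower taxes mean more freedom for families",
    "health care and education are rights for every family",
]
speakers = ["candidate_a", "candidate_b", "candidate_a", "candidate_b"]

# TF-IDF turns each document into a weighted bag of words; the Naive Bayes
# classifier then learns which words are characteristic of which speaker.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(speeches, speakers)

# On this toy data the unmarked speech should come back as candidate_a.
print(model.predict(["we should cut taxes for families"]))
```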

Motif Simplification…[Simplifying Graphs]

Filed under: Graphics,Graphs,Interface Research/Design,Networks,Visualization — Patrick Durusau @ 3:22 pm

Motif Simplification: Improving Network Visualization Readability with Fan, Connector, and Clique Glyphs by Cody Dunne and Ben Shneiderman.

Abstract:

Analyzing networks involves understanding the complex relationships between entities, as well as any attributes they may have. The widely used node-link diagrams excel at this task, but many are difficult to extract meaning from because of the inherent complexity of the relationships and limited screen space. To help address this problem we introduce a technique called motif simplification, in which common patterns of nodes and links are replaced with compact and meaningful glyphs. Well-designed glyphs have several benefits: they (1) require less screen space and layout effort, (2) are easier to understand in the context of the network, (3) can reveal otherwise hidden relationships, and (4) preserve as much underlying information as possible. We tackle three frequently occurring and high-payoff motifs: fans of nodes with a single neighbor, connectors that link a set of anchor nodes, and cliques of completely connected nodes. We contribute design guidelines for motif glyphs; example glyphs for the fan, connector, and clique motifs; algorithms for detecting these motifs; a free and open source reference implementation; and results from a controlled study of 36 participants that demonstrates the effectiveness of motif simplification.

When I read “replace,” “aggregation,” etc., I automatically think about merging in topic maps. 😉

After replacing “common patterns of nodes and links” I may still be interested in the original content of those nodes and links.

Or I may wish to partially unpack them based on some property in the original content.

Definitely a paper for a slow, deep read.

Not to mention research on the motifs in graph representations of your topic maps.

I first saw this in Visualization Papers at CHI 2013 by Enrico Bertini.

Putting Linked Data on the Map

Filed under: Linked Data,Semantic Web — Patrick Durusau @ 2:32 pm

Putting Linked Data on the Map by Richard Wallis.

In fairness to Linked Data/Semantic Web, I really should mention this post by one of its more mainstream advocates:

Show me an example of the effective publishing of Linked Data – That, or a variation of it, must be the request I receive more than most when talking to those considering making their own resources available as Linked Data, either in their enterprise, or on the wider web.

There are some obvious candidates. The BBC for instance, makes significant use of Linked Data within its enterprise. They built their fantastic Olympics 2012 online coverage on an infrastructure with Linked Data at its core. Unfortunately, apart from a few exceptions such as Wildlife and Programmes, we only see the results in a powerful web presence. The published data is only visible within their enterprise.

Dbpedia is another excellent candidate. From about 2007 it has been a clear demonstration of Tim Berners-Lee’s principles of using URIs as identifiers and providing information, including links to other things, in RDF – it is just there at the end of the dbpedia URIs. But for some reason developers don’t seem to see it as a compelling example. Maybe it is influenced by the Wikipedia effect – interesting but built by open data geeks, so not to be taken seriously.

A third example, which I want to focus on here, is Ordnance Survey. Not generally known much beyond the geographical patch they cover, Ordnance Survey is the official mapping agency for Great Britain. Formally a government agency, they are best known for their incredibly detailed and accurate maps that are the standard accessory for anyone doing anything in the British countryside. A little less known is that they also publish information about post-code areas, parish/town/city/county boundaries, parliamentary constituency areas, and even European regions in Britain. As you can imagine, these all don’t neatly intersect, which makes the data about them a great case for a graph based data model and hence for publishing as Linked Data. Which is what they did a couple of years ago.

The reason I want to focus on their efforts now, is that they have recently beta released a new API suite, which I will come to in a moment. But first I must emphasise something that is often missed.

Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’. With only standard [http] web protocols, you can get the data for an entity in their dataset by just doing a http GET request on the identifier…

(images omitted)
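Richard's “just there to consume” point is easy to demonstrate: a plain HTTP GET with an RDF media type in the Accept header returns the data for an entity. Here is a minimal sketch in Python against DBpedia; the Ordnance Survey URIs in his post work the same way in principle, and DBpedia's exact response behavior is an assumption that may change over time:

```python
import requests
from rdflib import Graph

# Content negotiation: ask the Linked Data URI for Turtle instead of HTML.
uri = "http://dbpedia.org/resource/Ordnance_Survey"
resp = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=30)
resp.raise_for_status()

# Parse the returned RDF and list a few predicate/object pairs for the entity.
g = Graph()
g.parse(data=resp.text, format="turtle")
for s, p, o in list(g)[:10]:
    print(p, "->", o)
```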

Richard does a great job describing the Linked Data APIs from the Ordnance Survey.

My only quibble is with his point:

Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’.

True enough but it omits the authoring side of Linked Data.

Or understanding the data to be consumed.

With HTML, authoring hyperlinks was only marginally more difficult than “using” hyperlinks.

And the consumption of a hyperlink, beyond mime types, was unconstrained.

So linked data isn’t “just there.”

It’s there with an authoring burden that remains unresolved and that constrains consumption, should you decide to follow “standard [http] web protocols” and Linked Data.

I am sure the Ordnance Survey Linked Data and other Linked Data resources Richard mentions will be very useful, to some people in some contexts.

But pretending Linked Data is easier than it is, will not lead to improved Linked Data or other semantic solutions.

Analyzing Twitter: An End-to-End Data Pipeline Recap

Filed under: BigData,Cloudera,Mahout,Tweets — Patrick Durusau @ 10:32 am

Analyzing Twitter: An End-to-End Data Pipeline Recap by Jason Barbour.

Jason reviews presentations at a recent Data Science MD meeting:

Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data becomes a perfect use case to demonstrate a complete big data pipeline.

(…)

Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C’s of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set is supervised, and new examples can be categorized. Clustering allows examples to be grouped together by common features, while collaborative filtering allows new candidates to be suggested.

Great summaries, links to additional resources and the complete slides.

Check the DC Data Community Events Calendar if you plan to visit the DC area. (I assume residents already do.)

Seventh ACM International Conference on Web Search and Data Mining

Filed under: Conferences,Data Mining,Searching,WWW — Patrick Durusau @ 10:08 am

WSDM 2014 : Seventh ACM International Conference on Web Search and Data Mining

Abstract submission deadline: August 19, 2013
Paper submission deadline: August 26, 2013
Tutorial proposals due: September 9, 2013
Tutorial and paper acceptance notifications: November 25, 2013
Tutorials: February 24, 2014
Main Conference: February 25-28, 2014

From the call for papers:

WSDM (pronounced “wisdom”) is one of the premier conferences covering research in the areas of search and data mining on the Web. The Seventh ACM WSDM Conference will take place in New York City, USA during February 25-28, 2014.

WSDM publishes original, high-quality papers related to search and data mining on the Web and the Social Web, with an emphasis on practical but principled novel models of search, retrieval and data mining, algorithm design and analysis, economic implications, and in-depth experimental analysis of accuracy and performance.

WSDM 2014 is a highly selective, single track meeting that includes invited talks as well as refereed full papers. Topics covered include but are not limited to:

(…)

Papers emphasizing novel algorithmic approaches are particularly encouraged, as are empirical/analytical studies of specific data mining problems in other scientific disciplines, in business, engineering, or other application domains. Application-oriented papers that make innovative technical contributions to research are welcome. Visionary papers on new and emerging topics are also welcome.

Authors are explicitly discouraged from submitting papers that do not present clearly their contribution with respect to previous works, that contain only incremental results, and that do not provide significant advances over existing approaches.

Sets a high bar but one that can be met.

Would be very nice PR to have a topic map paper among those accepted.

EarSketch

Filed under: Music,Programming — Patrick Durusau @ 9:07 am

EarSketch: computational music remixing and sharing as a tool to drive engagement and interest in computing

From the “about” page:

EarSketch engages students in computing principles through collaborative computational music composition and remixing. It consists of an integrated curriculum, software toolset, and social media website. The EarSketch curriculum targets introductory high school and college computing education. The software toolset enables students to create music by manipulating loops, composing beats, and applying effects with Python code. The social media website invites students to upload their music and source code, view other students’ work, and create derivative musical remixes from other students’ code. EarSketch is built on top of Reaper, an intuitive digital audio workstation (DAW) program comparable to those used in professional recording studios.

EarSketch is designed to enable student creativity, to enhance collaboration, and to leverage cultural relevance. This focus has created unique advantages for our approach to computing education:

  • EarSketch leverages musical remixing as it relates to popular musical forms, such as hip hop, and to industry-standard methods of music production, in an attempt to connect to students in a culturally relevant fashion that spans gender, ethnicity, and socioeconomic status.
  • EarSketch focuses on the level of beats, loops, and effects more than individual notes, enabling students with no background in music theory or composition to begin creating personally relevant music immediately, with a focus on higher-level musical concepts such as formal organization, texture, and mixing.
  • The EarSketch social media site allows a tight coupling between code sharing / reuse and the musical practice of remixing. Students can grab code snippets from other projects and directly inject them into their own work, modifying them to fit their idiosyncratic musical ideas.
  • EarSketch builds on professional development techniques using an industry-relevant, text-based programming language (Python), giving students concrete skills directly applicable to further study.

EarSketch is a National Science Foundation-funded initiative that was created to motivate students to consider further study and careers in computer science. The program, now in its second year, is focused on groups traditionally underrepresented in computing, but with an approach that is intended to have broad appeal.

I encountered EarSketch when I found: Creating your own effects: 8. Graph data structures.

Curious how you would use music to introduce topic maps and/or semantic integration?

Non-Adoption of Semantic Web, Reason #1002

Filed under: RDF,Semantic Web,Virtuoso — Patrick Durusau @ 8:22 am

Kingsley Idehen offers yet another explanation/excuse for non-adoption of the semantic web in On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen by Roberto V. Zicari.

The highlight of this interview reads:

The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point

You may recall Kingsley’s demonstration of the non-complexity of authoring for the Semantic Web in The Semantic Web Is Failing — But Why? (Part 3).

Could it be users sense the “lock-in” of RDF/Semantic Web?

Q14. Big Data Analysis: could you connect Virtuoso with Hadoop? How does Virtuoso relate to commercial data analytics platforms, e.g. Hadapt, Vertica?

Kingsley Uyi Idehen: You can integrate data managed by Hadoop based ETL workflows via ODBC or Web Services driven by Hadoop clusters that expose RESTful interaction patterns for data access. As for how Virtuoso relates to the likes of Vertica re. analytics, this is about Virtuoso being the equivalent of Vertica plus the added capability of RDF based data management, Linked Data Deployment, and share-nothing clustering. There is no job that Vertica performs that Virtuoso can’t perform.

There are several jobs that Virtuoso can perform that Vertica, VoltDB, Hadapt, and many other NoSQL and NewSQL simply cannot perform with regards to scalable, high-performance RDF data management and Linked Data deployment. Remember, RDF based Linked Data is all about data management and data access without any kind of platform lock-in. Virtuoso locks you into a value proposition (performance and scale) not the platform itself. (emphasis added to last sentence)

It’s comforting to know RDF/Semantic Web “lock-in” has our best interest at heart.

See Kingsley dodging the next question on Virtuoso’s ability to scale:

Q15. Do you also benchmark loading trillions of RDF triples? Do you have current benchmark results? How much time does it take to query them?

K​ingsley Uyi Idehen: As per my earlier responses, there is no shortage of benchmark material for Virtuoso.

The benchmarks are also based on realistic platform configurations unlike the RDBMS patterns of the past which compromised the utility of TPC benchmarks.

Full Disclosure: I haven’t actually counted all of Kingsley’s reasons for non-adoption of the Semantic Web. The number I assign here may be high or low.

May 12, 2013

Guess: The Graph Exploration System

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 4:47 pm

Guess: The Graph Exploration System

From the webpage:

GUESS is an exploratory data analysis and visualization tool for graphs and networks. The system contains a domain-specific embedded language called Gython (an extension of Python, or more specifically Jython) which supports the operators and syntactic sugar necessary for working on graph structures in an intuitive manner. An interactive interpreter binds the text that you type in the interpreter to the objects being visualized for more useful integration. GUESS also offers a visualization front end that supports the export of static images and dynamic movies.

Graph movies? Cool!

If you could catch a graph in an unguarded moment, what would you want to capture in a movie?

See also: Sourceforge – Guess.

Contextifier: Automatic Generation of Annotated Stock Visualizations

Filed under: Annotation,Business Intelligence,Interface Research/Design,News — Patrick Durusau @ 4:36 pm

Contextifier: Automatic Generation of Annotated Stock Visualizations by Jessica Hullman, Nicholas Diakopoulos and Eytan Adar.

Abstract:

Online news tools—for aggregation, summarization and automatic generation—are an area of fruitful development as reading news online becomes increasingly commonplace. While textual tools have dominated these developments, annotated information visualizations are a promising way to complement articles based on their ability to add context. But the manual effort required for professional designers to create thoughtful annotations for contextualizing news visualizations is difficult to scale. We describe the design of Contextifier, a novel system that automatically produces custom, annotated visualizations of stock behavior given a news article about a company. Contextifier’s algorithms for choosing annotations is informed by a study of professionally created visualizations and takes into account visual salience, contextual relevance, and a detection of key events in the company’s history. In evaluating our system we find that Contextifier better balances graphical salience and relevance than the baseline.

The authors use a stock graph as the primary context in which to link in other news about a publicly traded company.

Other aspects of Contextifier were focused on enhancement of that primary context.

The lesson here is that a tool with a purpose is easier to hone than a tool that could be anything for just about anybody.

I first saw this at Visualization Papers at CHI 2013 by Enrico Bertini.

Bond Percolation in GraphLab

Filed under: GraphLab,Graphs,Networks — Patrick Durusau @ 4:11 pm

Bond Percolation in GraphLab by Danny Bickson.

From the post:

I was asked by Prof. Scott Kirkpatrick to help and implement bond percolation in GraphLab. It is an oldie but goldie problem which is closely related to the connected components problem.

Here is an explanation about bond percolation from Wikipedia:

A representative question (and the source of the name) is as follows. Assume that some liquid is poured on top of some porous material. Will the liquid be able to make its way from hole to hole and reach the bottom? This physical question is modelled mathematically as a three-dimensional network of n × n × n vertices, usually called “sites”, in which the edge or “bonds” between each two neighbors may be open (allowing the liquid through) with probability p, or closed with probability 1 – p, and they are assumed to be independent. Therefore, for a given p, what is the probability that an open path exists from the top to the bottom? The behavior for large n is of primary interest. This problem, called now bond percolation, was introduced in the mathematics literature by Broadbent & Hammersley (1957), and has been studied intensively by mathematicians and physicists since.

(image omitted: percolation graph)

In social networks, Danny notes this algorithm is used to find groups of friends.

Similar mazes appear in puzzle books.

My curiosity is about finding groups of subject identity properties.
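For the curious, here is a minimal sketch of the bond percolation experiment in Python with networkx rather than GraphLab (so it will not scale the way Danny's implementation does): keep each bond of a grid with probability p and ask whether an open path still connects the top row to the bottom row.

```python
import random
import networkx as nx

def percolates(n=50, p=0.5, seed=0):
    """Bond percolation on an n x n grid: keep each edge with probability p,
    then test whether any open path connects the top row to the bottom row."""
    random.seed(seed)
    full = nx.grid_2d_graph(n, n)
    open_bonds = [(u, v) for u, v in full.edges() if random.random() < p]

    g = nx.Graph()
    g.add_nodes_from(full.nodes())
    g.add_edges_from(open_bonds)

    # Virtual source/sink tied to every node in the top and bottom rows.
    g.add_edges_from(("top", (0, col)) for col in range(n))
    g.add_edges_from(("bottom", (n - 1, col)) for col in range(n))
    return nx.has_path(g, "top", "bottom")

# Near the 2D bond percolation threshold (p = 0.5) the answer flips frequently.
for p in (0.3, 0.5, 0.7):
    hits = sum(percolates(p=p, seed=s) for s in range(20))
    print(f"p={p}: percolated in {hits}/20 trials")
```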

A couple of other percolation resources of interest:

Percolation Exercises by Eric Mueller.

PercoVis (Mac), visualization of percolation by Daniel B. Larremore.

Finding Significant Phrases in Tweets with NLTK

Filed under: Natural Language Processing,NLTK,Tweets — Patrick Durusau @ 3:17 pm

Finding Significant Phrases in Tweets with NLTK by Sujit Pal.

From the post:

Earlier this week, there was a question about finding significant phrases in text on the Natural Language Processing People (login required) group on LinkedIn. I suggested looking at this LingPipe tutorial. The idea is to find statistically significant word collocations, ie, those that occur more frequently than we can explain away as due to chance. I first became aware of this approach from the LLG Book, where two approaches are described – one based on Log-Likelihood Ratios (LLR) and one based on the Chi-Squared test of independence – the latter is used by LingPipe.

I had originally set out to actually provide an implementation for my suggestion (to answer a followup question). However, the Scipy Pydoc notes that the chi-squared test may be invalid when the number of observed or expected frequencies in each category are too small. Our algorithm compares just two observed and expected frequencies, so it probably qualifies. Hence I went with the LLR approach, even though it is slightly more involved.

The idea is to find, for each bigram pair, the likelihood that the components are dependent on each other versus the likelihood that they are not. For bigrams which have a positive LLR, we repeat the analysis by adding its neighbor word, and arrive at a list of trigrams with positive LLR, and so on, until we reach the N-gram level we think makes sense for the corpus. You can find an explanation of the math in one of my earlier posts, but you will probably find a better explanation in the LLG book.

For input data, I decided to use Twitter. I’m not that familiar with the Twitter API, but I’m taking the Introduction to Data Science course on Coursera, and the first assignment provided some code to pull data from the Twitter 1% feed, so I just reused that. I preprocess the feed so I am left with about 65k English tweets using the following code:

An interesting look “behind the glass” on n-grams.

I am using AntConc to generate n-grams for proofing standards prose.

But as a finished tool, AntConc doesn’t give you insight into the technical side of the process.
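For a peek behind the glass, NLTK ships the LLR collocation machinery Sujit describes. A minimal sketch, with a short placeholder list standing in for the ~65k preprocessed tweets:

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Placeholder corpus; in practice this would be the preprocessed tweet text.
tweets = [
    "new york city traffic is terrible today",
    "loving the weather in new york city",
    "machine learning talk in new york tonight",
]
tokens = [w for t in tweets for w in t.lower().split()]

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # ignore bigrams seen fewer than 2 times

# Rank bigrams by log-likelihood ratio: high scores mean the pair co-occurs
# far more often than chance alone would explain.
for bigram, score in finder.score_ngrams(BigramAssocMeasures.likelihood_ratio):
    print(bigram, round(score, 2))
```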

Visualization – HCIL – University of Maryland

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 3:00 pm

Visualization – Human-Computer Interaction Lab – University of Maryland

From the webpage:

We believe that the future of user interfaces is in the direction of larger, information-abundant displays. With such designs, the worrisome flood of information can be turned into a productive river of knowledge. Our experience during the past eight years has been that visual query formulation and visual display of results can be combined with the successful strategies of direct manipulation. Human perceptual skills are quite remarkable and largely underutilized in current information and computing systems. Based on this insight, we developed dynamic queries, starfield displays, treemaps, treebrowsers, zoomable user interfaces, and a variety of widgets to present, search, browse, filter, and compare rich information spaces.

There are many visual alternatives but the basic principle for browsing and searching might be summarized as the Visual Information Seeking Mantra: Overview first, zoom and filter, then details-on-demand. In several projects we rediscovered this principle and therefore wrote it down and highlighted it as a continuing reminder. If we can design systems with effective visual displays, direct manipulation interfaces, and dynamic queries then users will be able to responsibly and confidently take on even more ambitious tasks.

Projects and summaries of projects too numerous to list.

Working my way through them now.

Thought you might enjoy perusing the list for yourself.

Lots of very excellent work!

Evaluating the Efficiency of Physical Visualizations

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 2:52 pm

Evaluating the Efficiency of Physical Visualizations by Yvonne Jansen, Pierre Dragicevic and Jean-Daniel Fekete.

Abstract:

Data sculptures are an increasingly popular form of physical visualization whose purposes are essentially artistic, communicative or educational. But can physical visualizations help carry out actual information visualization tasks? We present the first infovis study comparing physical to on-screen visualizations. We focus on 3D visualizations, as these are common among physical visualizations but known to be problematic on computers. Taking 3D bar charts as an example, we show that moving visualizations to the physical world can improve users’ efficiency at information retrieval tasks. In contrast, augmenting on-screen visualizations with stereoscopic rendering alone or with prop-based manipulation was of limited help. The efficiency of physical visualizations seems to stem from features that are unique to physical objects, such as their ability to be touched and their perfect visual realism. These findings provide empirical motivation for current research on fast digital fabrication and self-reconfiguring interfaces.

My first thought on reading this paper was a comparison of looking at a topographic map of an area and seeing it from the actual location.

May explain some of the disconnect between military planners looking at maps and troops looking at terrain.

I’m not current on the latest feedback research to simulate the sense of touch in VR.

Curious how good the simulation would need to be to approach the efficiency of physical visualizations?

While others struggle to deliver content to a 3″ to 5″ inch screen, you can work on the next generation of interfaces, which are as large as you can “see.”

I first saw this at: Visualization Papers at CHI 2013 by Enrico Bertini.

Every Band On Spotify Gets A Soundrop Listening Room [Almost a topic map]

Filed under: Music,Music Retrieval,Topic Maps — Patrick Durusau @ 2:07 pm

Every Band On Spotify Gets A Soundrop Listening Room by Eliot Van Buskirk.

From the post:

Soundrop, a Spotify app that shares a big investor with Spotify, says it alone has the ability to scale listening rooms up so that thousands of people can listen to the same song together at the same time, using a secret sauce called Erlang — a hyper-efficient coding language developed by Ericsson for use on big telecom infrastructures (updated).

Starting today, Soundrop will offer a new way to listen: individual rooms dedicated to any single artist or band, so that fans of (or newcomers to) their music can gather to listen to that band’s music. The rooms are filled with tunes already, but anyone in the room can edit the playlist, add new songs (only from that artist or their collaborations), and of course talk to other listeners in the chatroom.

“The rooms are made automatically whenever someone clicks on the artist,” Soundrop head of partnerships Cortney Harding told Evolver.fm. “No one owns the rooms, though. Artists, labels and management have to come to us to get admin rights.”

In topic map terminology, what I hear is:

Using the Soundrop app, Spotify listeners can create topics for any single artist or band with a single click. Associations between the artist/band and their albums, individual songs, etc., are created automatically.

What I don’t hear is the exposure of subject identifiers to allow fans to merge in information from other resources, such as fan zines, concert reports and of course, covers from the Rolling Stone.

Perhaps Soundrop will offer subject identifiers and merging as a separate, perhaps subscription feature.

Could be a win-win if the Rolling Stone, for example, were to start exposing their subject identifiers for articles, artists and bands.

Some content producers will follow others, some will invent their own subject identifiers.

The important point being that with topic maps we can merge based on their identifiers.

Not some uniform-identifier-in-the-sky-by-and-by, which stymies progress until universal agreement arrives.

NuoDB: Deployment Architecture with Seth Proctor

Filed under: NuoDB — Patrick Durusau @ 10:45 am

NuoDB: Deployment Architecture with Seth Proctor.

A very, very high level view of NuoDB (warning SQL/ACID database).

For those of you who don’t think SQL/ACID is all that weird, it is a fairly compelling presentation.

What I like about it is that Seth is clear and to the point. Not always the case, even with short videos.

Will be interesting to see if the clarity remains in future videos.

I visited www.nuodb.com today but did not see more recent videos listed.

May 11, 2013

Defeating DRM in HTML5

Filed under: Cybersecurity,DRM,HTML5 — Patrick Durusau @ 4:59 pm

You may have heard that the W3C is giving the WWW label to DRM-based content vendors in HTML5: W3C presses ahead with DRM interface in HTML5

From the post:

On Friday, the World Wide Web Consortium (W3C) published the first public draft of Encrypted Media Extensions (EME). EME enables content providers to integrate digital rights management (DRM) interfaces into HTML5-based media players. Encrypted Media Extensions is being developed jointly by Google, Microsoft and online streaming-service Netflix. No actual encryption algorithm is part of the draft; that element is designed to be contained in a CDM (Content Decryption Module) that works with EME to decode the content. CDMs may be plugins or built into browsers.

The publication of the new draft is a blow for critics of the extensions, led by the Free Software Foundation (FSF). Under the slogan, “We don’t want the Hollyweb”, FSF’s anti-DRM campaign Defective by Design has started a petition against the “disastrous proposal”, though FSF and allied organisations have so far only succeeded in mobilising half of their target of 50,000 supporters.

I could understand this better if the W3C was getting paid by the DRM-based content vendors for the WWW label. Giving it away to commercial profiteers seems like poor business judgement.

On the order of the U.S. government developing the public internet and then giving it away as it became commercially viable. As one of the involuntary investors in the U.S. government, I would have liked a better return on that investment.

There is one fairly easy way to defeat DRM in HTML5.

Don’t use it. Don’t view/purchase products that use it, don’t produce products or services that use it.

The people who produce and sell DRM-based products will find other ways to occupy themselves should DRM-based products fail.

Unlike the FSF, they are not producing products for obscure motives. They are looking to make a profit. No profit, no DRM-vendors.

You may say that “other people” will purchase those products and services, encouraging DRM vendors. They very well may but that’s their choice.

It is unconvincing to argue for a universe of free choice when some people get to choose on behalf of others, like the public.

The Map Myth of Sandy Island

Filed under: Geographic Data,Geography,Mapping,Maps — Patrick Durusau @ 4:39 pm

The Map Myth of Sandy Island by Rebecca Maxwell.

From the post:

Sandy Island has long appeared on maps dating back to the early twentieth century. This island was supposedly located in the Pacific Ocean northwest of Australia in the Coral Sea. It first appeared on an edition of a British admiralty map back in 1908, proving that Sandy Island had been discovered by the French in 1876. Even modern maps, like the General Bathymetric Chart of the Oceans (the British Oceanographic Data Centre issued an errata about Sandy Island) and Google Earth, show the presence of an island at its coordinates. Sandy Island is roughly the size of Manhattan; it is about three miles wide and fifteen miles long. However, there is only one problem. The island does not actually exist.

Back in October 2012, an Australian research ship undiscovered the island. The ship, called the Southern Surveyor, was led by Maria Seton, a scientist from the University of Sydney. The purpose of the twenty-five-day expedition was to gather information about tectonic activity, map the sea floor, and gather rock samples from the bottom. The scientific data that they had, including the General Bathymetric Chart of the Oceans, indicated the presence of Sandy Island halfway between Australia and the island of New Caledonia, a French possession. The crew began to get suspicious, however, when the chart from the ship’s master only showed open water. Plus, Google Earth only showed a dark blob where it should have been.

When the ship arrived at Sandy Island’s supposed coordinates, they found nothing but ocean a mile deep. One of the ship’s crewmembers, Steven Micklethwaite, said that they all had a good laugh at Google’s expense as they sailed through the island. The crew was quick to make their findings known. The story originally appeared in the Sydney Morning Herald and prompted a large amount of controversy. Cartographers were the most puzzled of all. Many wondered whether the island had ever existed or if it had been eroded away by the ocean waves over the years. Others wondered if the island mysteriously disappeared into the ocean like the legendary city of Atlantis. An “obituary” for Sandy Island, reporting the findings, was published in Eos, Transactions of the Geophysical Union in April of 2013.

Rebecca traces the discovered/undiscovered history of Sandy Island in rich detail.

It’s a great story and you should treat yourself by reading it.

My only disagreement with Rebecca comes when she writes:

Maps are continually changing and modern maps still contain a human element that is vulnerable to mistakes.

On the contrary, maps, even modern ones, are wholly human constructs.

Not just the mistakes but the degree of accuracy, the implicit territorial or political claims, what is interesting enough to record, etc., are all human choices in production.

To say nothing of humans on the side of reading/interpretation as well.

If there were no sentient creatures to read it, would a map have any meaning?

