Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 11, 2012

Convergent and Commutative Replicated Data Types [Warning: Heavy Sledding Ahead]

Filed under: Consistency,CRDT,Data Structures,Data Types — Patrick Durusau @ 4:23 pm

A comprehensive study of Convergent and Commutative Replicated Data Types (PDF file) by Marc Shapiro, Nuno M. Preguiça, Carlos Baquero, and Marek Zawirski.

Abstract:

Eventual consistency aims to ensure that replicas of some mutable shared object converge without foreground synchronisation. Previous approaches to eventual consistency are ad-hoc and error-prone. We study a principled approach: to base the design of shared data types on some simple formal conditions that are sufficient to guarantee eventual consistency. We call these types Convergent or Commutative Replicated Data Types (CRDTs). This paper formalises asynchronous object replication, either state based or operation based, and provides a sufficient condition appropriate for each case. It describes several useful CRDTs, including container data types supporting both add and remove operations with clean semantics, and more complex types such as graphs, monotonic DAGs, and sequences. It discusses some properties needed to implement non-trivial CRDTs.

I found this following a link in the readme for riak_dt, which said:

WHAT?

Currently under initial development, riak_dt is a platform for convergent data types. It’s built on riak core and deployed with riak. All of our current work is around supporting fast, replicated, eventually consistent counters (though more data types are in the repo, and on the way.) This work is based on the paper – A Comprehensive study of Convergent and Commutative Replicated Data Types – which you may find an interesting read.

WHY?

Riak’s current model for handling concurrent writes is to store sibling values and present them to the client for resolution on read. The client must encode the logic to merge these into a single, meaningful value, and then inform Riak by doing a further write. Convergent data types remove this burden from the client, as their structure guarantees they will deterministically converge to a single value. The simplest of these data types is a counter.
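To make the counter example concrete, here is a minimal state-based (convergent) G-counter sketch in Python (my own illustration, not Basho's Erlang code): each replica increments only its own slot, merging takes the element-wise maximum, and the value is the sum, so replicas converge no matter how or in what order merges happen.

```python
class GCounter:
    """Grow-only, state-based (convergent) counter CRDT sketch."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}          # replica_id -> local increment count

    def increment(self, n=1):
        # Each replica only ever bumps its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # so replicas converge regardless of merge order.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)


# Two replicas accept writes independently, then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```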

I haven’t thought of merging of subject representatives as a quest for “consistency” but that is one way to think about it.

The paper is forty-seven pages long and has forty-four references, most of which I suspect are necessary to fully appreciate the work.

Having said that, I suspect it will be well worth the effort.

Big Data Security Part One: Introducing PacketPig

Filed under: BigData,Hadoop,PacketPig,Security,Systems Administration — Patrick Durusau @ 4:04 pm

Big Data Security Part One: Introducing PacketPig by Michael Baker.

From the post:

Packetloop CTO Michael Baker (@cloudjunky) made a big splash when he presented ‘Finding Needles in Haystacks (the Size of Countries)‘ at Blackhat Europe earlier this year. The paper outlines a toolkit based on Apache Pig, Packetpig @packetpig (available on github), for doing network security monitoring and intrusion detection analysis on full packet captures using Hadoop.

In this series of posts, we’re going to introduce Big Data Security and explore using Packetpig on real full packet captures to understand and analyze networks. In this post, Michael will introduce big data security in the form of full data capture, Packetpig and Packetloop.

If you are a bit rusty on packets and TCP/IP, I could just wave my hands, say “see the various tutorials,” and send you off to hunt something down.

Let me be more helpful than that and suggest the TCP/IP Tutorial and Technical Overview from the IBM Redbooks we were talking about earlier.

It’s not short (almost a thousand pages), but on the other hand it isn’t W. Richard Stevens’s three-volume set. 😉

You won’t need all of either resource but it is better to start with too much than too little.

Conflict History: All Human Conflicts on a Single Map [Battle of Jericho -1399-04-20?]

Filed under: Geography,History,Mapping,Maps — Patrick Durusau @ 3:44 pm

Conflict History: All Human Conflicts on a Single Map

From the post:

Conflict History [conflicthistory.com], developed by TecToys, summarizes all major human conflicts onto a single world map – from the historical wars way before the birth of Christ, until the drone attacks in Pakistan that are still happening today. The whole interactive map is built upon data retrieved from Google and Freebase open data services.

The world map is controlled by an interactive timeline. An additional search box allows more focused exploration by names or events, while individual conflict titles or icons can be selected to reveal more detailed information, all geographically mapped.

I had to run it back a good ways before I could judge its coverage.

I am not sure about the Battle of Jericho occurring on 04-20 in 1399 BCE. That seems a tad precise.

Still, it is an interesting demonstration of mapping technology.

For Eurocentric points, can you name the longest continuous period of peace (according to European historians)?

Think Bayes: Bayesian Statistics Made Simple

Filed under: Bayesian Data Analysis,Bayesian Models,Mathematics,Statistics — Patrick Durusau @ 3:24 pm

Think Bayes: Bayesian Statistics Made Simple by Allen B. Downey.

Think Bayes is an introduction to Bayesian statistics using computational methods. This version of the book is a rough draft. I am making this draft available for comments, but it comes with the warning that it is probably full of errors.

Allen has written free books on Python, statistics, complexity and now Bayesian statistics.

If you don’t know his books, good opportunity to give them a try.

Hadoop/HBase Cluster ~ 1 Hour/$10 (What do you have to lose?)

Filed under: Hadoop,HBase — Patrick Durusau @ 3:08 pm

Set Up a Hadoop/HBase Cluster on EC2 in (About) an Hour by George London.

From the post:

I’m going to walk you through a (relatively) simple set of steps that will get you up and running MapReduce programs on a cloud-based, six-node distributed Hadoop/HBase cluster as fast as possible. This is all based on what I’ve picked up on my own, so if you know of better/faster methods, please let me know in comments!

We’re going to be running our cluster on Amazon EC2, and launching the cluster using Apache Whirr and configuring it using Cloudera Manager Free Edition. Then we’ll run some basic programs I’ve posted on Github that will parse data and load it into Apache HBase.

Altogether, this tutorial will take a bit over one hour and cost about $10 in server costs.

This is the sort of tutorial that I long to write for topic maps.

There is a longer version of this tutorial here.

IBM Redbooks

Filed under: Books,Data,Marketing,Topic Maps — Patrick Durusau @ 2:22 pm

IBM Redbooks

You can look at this resource in one of two ways:

First, as a great source of technical information about mostly IBM products and related technologies.

Second, as a starting point of IBM content for mining and navigation using a topic map.

May not be of burning personal interest to you, but to IBM clients, consultants and customers?

Here’s one pitch:

How much time do you spend searching the WWW or IBM sites for answers to IBM software questions? In a week? In a month?

Try (TM4IBM-Product-Name) for a week or a month. Then you do the time math.

(I would host a little time keeping applet to “assist” with the record keeping.)

“IBM® Compatible” On The Outside?

Filed under: Content Management System (CMS),Topic Map Software,Topic Maps — Patrick Durusau @ 2:06 pm

I ran across Customizing and Extending IBM Content Navigator today.

Abstract:

IBM® Content Navigator is a ready-to-use, modern, standards-based user interface that supports Enterprise Content Management (ECM) use cases, including collaborative document management, production imaging, and report management. It is also a flexible and powerful user platform for building custom ECM applications using open web-based standards.

This IBM Redbooks® publication has an overview of the functions and features that IBM Content Navigator offers, and describes how you can configure and customize the user interface with the administration tools that are provided. This book also describes the extension points and customization options of IBM Content Navigator and how you can customize and extend it with sample code. Specifically, the book shows you how to set up a development environment, and develop plug-ins that add new actions and provide a special production imaging layout to the user interface. Other customization topics include working with external data services, using IBM Content Navigator widgets externally in other applications, and wrapping the widgets as iWidgets to be used in other applications. In addition, this book describes how to reuse IBM Content Navigator components in mobile development, and how to work with the existing viewer or incorporate a third-party viewer into IBM Content Navigator.

This book is intended for IT architects, and application designers and developers. It offers both a high-level description of how to extend and customize IBM Content Navigator and also more technical details of how to do implementation with sample code.

IBM Content Navigator has all the hooks and features you expect in a content navigation system.

Except for explicit subject identity and merging out of the box. Like you would have with a topic map based solution.

Skimming through the table of contents, it occurred to me that IBM has done most of the work necessary for a topic map based content management system.

Subject identity and merging doctrines are domain specific so entirely appropriate to handle as extensions to the IBM Content Navigator.

Think about it. Given IBM’s marketing budget and name recognition, is saying:

“IBM® Compatible” on the outside of your product a bad thing?

Using (Spring Data) Neo4j for the Hubway Data Challenge [Boston Biking]

Filed under: Challenges,Data,Dataset,Graphs,Neo4j,Networks,Spring — Patrick Durusau @ 12:33 pm

Using (Spring Data) Neo4j for the Hubway Data Challenge by Michael Hunger.

From the post:

Using Spring Data Neo4j it was incredibly easy to model and import the Hubway Challenge dataset into a Neo4j graph database, to make it available for advanced querying and visualization.

The Challenge and Data

Tonight @graphmaven pointed me to the boston.com article about the Hubway Data Challenge.

(graphics omitted)

Hubway is a bike sharing service which is currently expanding worldwide. In the Data challenge they offer the CSV-data of their 95 Boston stations and about half a million bike rides up until the end of September. The challenge is to provide answers to some posted questions and develop great visualizations (or UI’s) for the Hubway data set. The challenge is also supported by MAPC (Metropolitan Area Planning Council).

Useful tips on importing data into Neo4j and on modeling this particular dataset.

Not to mention the resulting database as well!

PS: From the challenge site:

Submission will open here on Friday, October 12, 2012.

Deadline

MIDNIGHT (11:59 p.m.) on Halloween,
Wednesday, October 31, 2012.

Winners will be announced on Wednesday, November 7, 2012.

Prizes:

  • A one-year Hubway membership
  • Hubway T-shirt
  • Bern helmet
  • A limited edition Hubway System Map—one of only 61 installed in the original Hubway stations.

For other details, see the challenge site.

Verification: In God We Trust, All Others Pay Cash

Filed under: Authoring Topic Maps,Crowd Sourcing — Patrick Durusau @ 10:56 am

Crowdsourcing is a valuable technique, at least if accurate information is the result. Incorrect information or noise is still incorrect information or noise, crowdsourced or not.

From PLOS ONE (not Nature or Science) comes news of progress on verification of crowdsourced information. (Naroditskiy V, Rahwan I, Cebrian M, Jennings NR (2012) Verification in Referral-Based Crowdsourcing. PLoS ONE 7(10): e45924. doi:10.1371/journal.pone.0045924)

Abstract:

Online social networks offer unprecedented potential for rallying a large number of people to accomplish a given task. Here we focus on information gathering tasks where rare information is sought through “referral-based crowdsourcing”: the information request is propagated recursively through invitations among members of a social network. Whereas previous work analyzed incentives for the referral process in a setting with only correct reports, misreporting is known to be both pervasive in crowdsourcing applications, and difficult/costly to filter out. A motivating example for our work is the DARPA Red Balloon Challenge where the level of misreporting was very high. In order to undertake a formal study of verification, we introduce a model where agents can exert costly effort to perform verification and false reports can be penalized. This is the first model of verification and it provides many directions for future research, which we point out. Our main theoretical result is the compensation scheme that minimizes the cost of retrieving the correct answer. Notably, this optimal compensation scheme coincides with the winning strategy of the Red Balloon Challenge.

UCSD Jacobs School of Engineering, in Making Crowdsourcing More Reliable, reported the following experience with this technique:

The research team has successfully tested this approach in the field. Their group accomplished a seemingly impossible task by relying on crowdsourcing: tracking down “suspects” in a jewel heist on two continents in five different cities, within just 12 hours. The goal was to find five suspects. Researchers found three. That was far better than their nearest competitor, which located just one “suspect” at a much later time.

It was all part of the “Tag Challenge,” an event sponsored by the U.S. Department of State and the U.S. Embassy in Prague that took place March 31. Cebrian’s team promised $500 to those who took winning pictures of the suspects. If these people had been recruited to be part of “CrowdScanner” by someone else, that person would get $100. To help spread the word about the group, people who recruited others received $1 per person for the first 2,000 people to join the group.
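The arithmetic of that scheme is easy to sketch. Here is a hypothetical payout calculator in Python using just the figures quoted above ($500 for a winning photo, $100 to the winner's recruiter, $1 per recruit), simplified for illustration:

```python
def tag_challenge_payouts(winners, recruiter_of, recruit_counts):
    """Rough payout sketch using the figures quoted above (hypothetical code).

    winners        -- iterable of people who submitted winning photos
    recruiter_of   -- dict mapping a person to whoever recruited them (or None)
    recruit_counts -- dict mapping a person to how many people they recruited
    """
    payouts = {}
    for person, recruits in recruit_counts.items():
        # $1 per recruit (the challenge capped this at the first 2,000 joiners;
        # simplified here as a per-recruiter cap).
        payouts[person] = payouts.get(person, 0) + min(recruits, 2000) * 1
    for winner in winners:
        payouts[winner] = payouts.get(winner, 0) + 500
        recruiter = recruiter_of.get(winner)
        if recruiter is not None:
            payouts[recruiter] = payouts.get(recruiter, 0) + 100
    return payouts


print(tag_challenge_payouts(
    winners=["carol"],
    recruiter_of={"carol": "bob", "bob": "alice"},
    recruit_counts={"alice": 10, "bob": 1},
))
# {'alice': 10, 'bob': 101, 'carol': 500}
```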

This has real potential!

Could use money, but what of other inducements?

What if department professors agree to substitute participation in a verified crowdsourced bibliography in place of the usual 10% class participation?

Motivation and structuring of the task are open areas for experimentation and research.

Suggestions on areas for topic maps using this methodology?

Some other resources you may find of interest:

Tag Challenge website

Tag Challenge – Wikipedia (Has links to team pages, etc.)

…[A] Common Operational Picture with Google Earth (webcast)

Filed under: Geographic Data,Geographic Information Retrieval,Google Earth,Mapping,Maps — Patrick Durusau @ 10:01 am

Joint Task Force – Homeland Defense Builds a Common Operational Picture with Google Earth

October 25, 2012 at 02:00 PM Eastern Daylight Time

The security for the Asia-Pacific Economic Collaboration summit in 2011 in Honolulu, Hawaii involved many federal, state & local agencies. The complex task of coordinating information sharing among agencies was the responsibility of Joint Task Force – Homeland Defense (JTF-HD). JTF-HD turned to Google Earth technology to build a visualization capability that enabled all agencies to share information easily & ensure a safe and secure summit.

What you will learn:

  • Best practices for sharing geospatial information among federal, state & local agencies
  • How to incorporate data from many sources into your own Google Earth globe
  • How to get accurate maps with limited bandwidth or no connection at all

Speaker: Marie Kennedy, Joint Task Force – Homeland Defense

Sponsored by Google.

In addition to the techniques demonstrated, I suspect the main lesson will be leveraging information/services that already exist.

Or information integration if you prefer a simpler description.

Information can be integrated by conversion or mapping.

Which one you choose depends upon your requirements and the information.

Reusable information integration (RI2), where you leverage your own investment, well, that’s another topic altogether. 😉

Ask: Are you spending money to be effective or spending money to maintain your budget relative to other departments?

If the former, consider topic maps. If the latter, carry on.

October 10, 2012

Artificial Intelligence and Machine Learning [Mid-week present]

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 4:20 pm

Artificial Intelligence and Machine Learning (Research at Google)

I assume you have been good so far this week so time for a mid-week present!

As of today, a list of two hundred and forty-nine publications in artificial intelligence and machine learning from Google Research!

From the webpage:

Much of our work on language, speech, translation, and visual processing relies on Machine Learning and AI. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, and we apply learning algorithms to generalize from that evidence to new cases of interest. Machine Learning at Google raises deep scientific and engineering challenges. Contrary to much of current theory and practice, the statistics of the data we observe shifts very rapidly, the features of interest change as well, and the volume of data often precludes the use of standard single-machine training algorithms. When learning systems are placed at the core of interactive services in a rapidly changing and sometimes adversarial environment, statistical models need to be combined with ideas from control and game theory, for example when using learning in auction algorithms.

Research at Google is at the forefront of innovation in Machine Learning with one of the most active groups working on virtually all aspects of learning, theory as well as applications, and a strong academic presence through technical talks and publications in major conferences and journals.

Don’t neglect your “real” work but either find a paper relevant to your “real” work or read one during lunch or on break.

You will be glad you did!

Distributed Algorithms in NoSQL Databases

Filed under: Algorithms,Distributed Systems,NoSQL — Patrick Durusau @ 4:20 pm

Distributed Algorithms in NoSQL Databases by Ilya Katsov.

From the post:

Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche of practical studies and real-life trials of different combinations of protocols and algorithms. These developments gradually highlight a system of relevant database building blocks with proven practical efficiency. In this article I’m trying to provide more or less systematic description of techniques related to distributed operations in NoSQL databases.

In the rest of this article we study a number of distributed activities like replication or failure detection that could happen in a database. These activities, highlighted in bold below, are grouped into three major sections:

  • Data Consistency. Historically, NoSQL paid a lot of attention to tradeoffs between consistency, fault-tolerance and performance to serve geographically distributed systems, low-latency or highly available applications. Fundamentally, these tradeoffs spin around data consistency, so this section is devoted to data replication and data repair.
  • Data Placement. A database should accommodate itself to different data distributions, cluster topologies and hardware configurations. In this section we discuss how to distribute or rebalance data in such a way that failures are handled rapidly, persistence guarantees are maintained, queries are efficient, and system resources like RAM or disk space are used evenly throughout the cluster.
  • System Coordination. Coordination techniques like leader election are used in many databases to implement fault-tolerance and strong data consistency. However, even decentralized databases typically track their global state, detect failures and topology changes. This section describes several important techniques that are used to keep the system in a coherent state.
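One staple of the data-placement discussion in NoSQL systems is consistent hashing. Here is a minimal sketch in Python (my own illustration, not code from the article): keys and nodes hash onto the same ring, each key lives on the next node clockwise, and adding or removing a node only moves the keys in its arc.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing sketch with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []            # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        # The key is owned by the first virtual node clockwise from its hash.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, chr(0x10FFFF)))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))    # stays put when unrelated nodes join/leave
```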

Slow going but well worth the effort.

Not the issues discussed in the puff-piece webinars extolling NoSQL solutions to “big data.”

But you already knew that if you read this far! Enjoy!

I first saw this at Christophe Lalanne’s A bag of tweets / September 2012

Machine Learning in Gradient Descent

Filed under: Machine Learning — Patrick Durusau @ 4:19 pm

Machine Learning in Gradient Descent by Ricky Ho.

From the post:

In Machine Learning, gradient descent is a very popular learning mechanism that is based on a greedy, hill-climbing approach.

Gradient Descent

The basic idea of Gradient Descent is to use a feedback loop to adjust the model based on the error it observes (between its predicted output and the actual output). The adjustment (notice that there are multiple model parameters and therefore should be considered as a vector) is pointing to a direction where the error is decreasing in the steepest sense (hence the term “gradient”).

A general introduction to a machine learning technique you are going to see fairly often.
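If you want something concrete to poke at, here is a minimal sketch of that feedback loop in Python/NumPy (my own toy example, not Ricky's code): fit a line by repeatedly stepping the parameters against the gradient of the squared error.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=500):
    """Fit y ≈ X @ w by repeatedly stepping against the error gradient."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        error = X @ w - y                  # predicted minus actual
        grad = X.T @ error / len(y)        # gradient of mean squared error
        w -= lr * grad                     # step in the steepest-descent direction
    return w

# Toy data: y = 1 + 2*x plus noise; a column of ones models the intercept.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 100)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + rng.normal(0, 0.1, 100)
print(gradient_descent(X, y))              # approximately [1, 2]
```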

Will Data Storage Make Us Dumber?

Filed under: Information Overload,Information Retrieval,Searching — Patrick Durusau @ 4:19 pm

Coming to a data center and then a desktop near you:

Case Western Reserve University researchers have developed technology aimed at making an optical disc that holds 1 to 2 terabytes of data – the equivalent of 1,000 to 2,000 copies of Encyclopedia Britannica. The entire print collection of the Library of Congress could fit on five to 10 discs.

Only a matter of time before you have the Library of Congress on a single disk on your local computer. All of it.

Questions:

  • Can you find useful information about a subject?
  • If you find it once, can you find it again?
  • If you can find it again, how much work does it take?
  • Can you share your trail of discovery or “bread crumbs” with others?

If TB data storage means you can’t find information, doesn’t that mean you are getting dumber, one TB at a time?

Storage density isn’t going to slow down so we had better start working on search/IR.

See: Making computer data storage cheaper and easier

Interesting large scale dataset: D4D mobile data [Deadline: October 31, 2012]

Filed under: Data,Data Mining,Dataset,Graphs,Networks — Patrick Durusau @ 4:19 pm

Interesting large scale dataset: D4D mobile data by Danny Bickson.

From the post:

I got the following from Prof. Scott Kirkpatrick.

Write a 250-word research project and get access within a week to the largest ever released mobile phone datasets: datasets based on 2.5 billion records, calls and text messages exchanged between 5 million anonymous users over 5 months.

Participation rules: http://www.d4d.orange.com/

Description of the datasets: http://arxiv.org/abs/1210.0137

The “Terms and Conditions” by Orange allows the publication of results obtained from the datasets even if they do not directly relate to the challenge.

Cash prizes for winning participants and an invitation to present the results at the NetMob conference, to be held May 2-3, 2013 at the Media Lab at MIT (www.netmob.org).

Deadline: October 31, 2012

Looking to exercise your graph software? Compare to other graph software? Do interesting things with cell phone data?

This could be your chance!

Twitter Recommendations by @alpa

Filed under: Algorithms,Recommendation,Tweets — Patrick Durusau @ 4:18 pm

Twitter Recommendations by @alpa by Marti Hearst.

From the post:

Alpa Jain has great experience teaching from her time as a graduate student at Columbia University, and it shows in the clarity of her descriptions of SVD and other recommendation algorithms in today’s lecture:
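If you want to experiment with the SVD part yourself, here is a tiny sketch in Python/NumPy (my own toy example, not Alpa's material): factor a user-item ratings matrix with a truncated SVD and rank a user's unrated items by the reconstructed scores.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not yet rated".
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Truncated SVD: keep the k strongest latent factors.
k = 2
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Recommend the unrated item with the highest reconstructed score for user 0.
user = 0
unrated = np.where(ratings[user] == 0)[0]
best = unrated[np.argmax(approx[user, unrated])]
print(f"recommend item {best} to user {user} (score {approx[user, best]:.2f})")
```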

Would you incorporate recommendation algorithms into a topic map authoring solution?

Fighting Spam at Twitter [Spam means non-licensed by service provider?]

Filed under: Ad Targeting,Spam,Tweets — Patrick Durusau @ 4:18 pm

Fighting Spam at Twitter by Marti Hearst.

From the post:

On Thursday, Delip Rao electrified the class with a lecture on how Twitter combats the pervasive threat of tweet spam:

The video failed but lecture notes are available.

Spam defined:

An unintended use of a service by an adversary to potentially cause harm or degrade user experience while maximizing benefit for the adversary.

On the slides, “Rate-limit avoidance” appears under “unintended use.”

Does licensing by the service provider mean that material which “degrade[s] user experience while maximizing benefit for the adversary” isn’t spam?

My experience with licensed spam on television (including cable) and online is that it all degrades my experience in hope of maximizing their gain.

We need a pull model for advertising instead of a push one.

Banning all push spam would be a step in the right direction.

Explore Python, machine learning, and the NLTK library

Filed under: Machine Learning,NLTK,Python — Patrick Durusau @ 4:18 pm

Explore Python, machine learning, and the NLTK library by Chris Joakim (cjoakim@bellsouth.net), Senior Software Engineer, Primedia Inc.

From the post:

The challenge: Use machine learning to categorize RSS feeds

I was recently given the assignment to create an RSS feed categorization subsystem for a client. The goal was to read dozens or even hundreds of RSS feeds and automatically categorize their many articles into one of dozens of predefined subject areas. The content, navigation, and search functionality of the client website would be driven by the results of this daily automated feed retrieval and categorization.

The client suggested using machine learning, perhaps with Apache Mahout and Hadoop, as she had recently read articles about those technologies. Her development team and ours, however, are fluent in Ruby rather than Java™ technology. This article describes the technical journey, learning process, and ultimate implementation of a solution.
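The core of such a categorizer is surprisingly small. Here is a hypothetical sketch with NLTK's Naive Bayes classifier, using simple whitespace bag-of-words features (the categories and training snippets are made up, not taken from the article):

```python
import nltk

def features(text):
    # Simple bag-of-words feature dict, as NLTK's classifiers expect.
    return {word.lower(): True for word in text.split()}

# Tiny, made-up training set: (feed item text, category).
train = [
    ("New Hadoop release improves HDFS throughput", "big data"),
    ("MapReduce tuning tips for large clusters", "big data"),
    ("City council approves new cycling lanes", "local news"),
    ("Mayor announces downtown parking changes", "local news"),
]

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train]
)

print(classifier.classify(features("HDFS upgrade speeds up Hadoop jobs")))
# -> 'big data'
```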

If a wholly automated publication process leaves you feeling uneasy, imagine the same system feeding content to subject matter experts for further processing.

Think of it as processing raw ore on the way to finding diamonds and then deciding which ones get polished.

Information Retrieval and Search Engines [Committers Needed!]

Filed under: Information Retrieval,Search Engines — Patrick Durusau @ 4:18 pm

Information Retrieval and Search Engines

A proposal is pending to create a Q&A site for people interested in information retrieval and search engines.

But it needs people to commit to using it and answering questions!

That could be you!

There’s a lot of action left in information retrieval and search engines.

You don’t have to believe me. Have you tried one lately? 😉

What is Hadoop Metrics2?

Filed under: Cloudera,Hadoop,Systems Administration — Patrick Durusau @ 4:17 pm

What is Hadoop Metrics2? by Ahmed Radwan.

I’ve been wondering about that. How about you? 😉

From the post:

Metrics are collections of information about Hadoop daemons, events and measurements; for example, data nodes collect metrics such as the number of blocks replicated, number of read requests from clients, and so on. For that reason, metrics are an invaluable resource for monitoring Hadoop services and an indispensable tool for debugging system problems.

This blog post focuses on the features and use of the Metrics2 system for Hadoop, which allows multiple metrics output plugins to be used in parallel, supports dynamic reconfiguration of metrics plugins, provides metrics filtering, and allows all metrics to be exported via JMX.

However cool the software, you can’t ever really get away from managing it.

And it isn’t a bad skill to have. Read on!

October 9, 2012

Appropriating IT: Glue Steps [Gluing Subject Representatives Together?]

Filed under: Legends,Proxies,Semantic Diversity,Semantic Inconsistency,TMRM — Patrick Durusau @ 4:39 pm

Appropriating IT: Glue Steps by Tony Hirst.

Over the years, I’ve been fortunate enough to have been gifted some very evocative, and powerful, ideas that immediately appealed to me when I first heard them and that I’ve been able to draw on, reuse and repurpose over and over again. One such example is “glue logic”, introduced to me by my original OU PhD supervisor George Kiss. The idea of glue logic is to provide a means by which two digital electronic circuits (two “logic” circuits) that don’t share a common interface can be “glued” together.

(diagrams and other material omitted)

This idea is powerful enough in its own right, but there was a second bit to it that made it really remarkable: the circuitry typically used to create the glue logic was a device known as a Field Programmable Gate Array, or FPGA. This is a type of digital circuit whose logical function can be configured, or programmed. That is, I can take my “shapeless” FPGA, and programme it so that it physically implements a particular digital circuit. Just think about that for a moment… You probably have a vague idea that the same computer can be reprogrammed to do particular things, using some vaguely mysterious and magical thing called software, instructions that computer processors follow in order to do incredible things. With an FPGA, the software actually changes the hardware: there is no processor that “runs a programme”; when you programme an FPGA, you change its hardware. FPGAs are, literally, programmable chips. (If you imagine digital circuits to be like bits of plastic, an FPGA is like polymorph.)

The notion of glue logic has stuck with me for two reasons, I think: firstly, because of what it made possible, the idea of flexibly creating an interface between two otherwise incompatible components; secondly, because of the way in which it could be achieved – using a flexible, repurposable, reprogrammable device – one that you could easily reprogramme if the mapping from one device to another wasn’t quite working properly.

If instead of “don’t share a common interface” you read “semantic diversity” and in place of Field Programmable Gate Array, or FPGA, you read “legend,” to “creat[e] an interface between two otherwise incompatible [subject representatives],” you would think Tony’s post was about the topic maps reference model.

Well, this post is, and Tony’s comes very close.

Particularly the part about being a “reprogrammable device.”

I can tell you: “black” = “schwarz,” but without more, you won’t be able to rely on or extend that statement.

For that, you need a “reprogrammable device” and some basis on which to do the reprogramming.
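A toy sketch of what I mean (illustrative Python, not TMRM notation): the mapping is explicit data, so it can be inspected, extended, and “reprogrammed” rather than being buried in code.

```python
class Legend:
    """Toy 'reprogrammable glue': an explicit, editable basis for saying
    when two subject representatives identify the same subject."""

    def __init__(self):
        self.equivalences = {}     # value -> canonical subject key

    def declare_same(self, canonical, *aliases):
        # The basis for the mapping is recorded, not hard-wired.
        self.equivalences[canonical] = canonical
        for alias in aliases:
            self.equivalences[alias] = canonical

    def same_subject(self, a, b):
        return self.equivalences.get(a, a) == self.equivalences.get(b, b)


legend = Legend()
legend.declare_same("black", "schwarz", "noir")   # reprogram as needed
print(legend.same_subject("black", "schwarz"))    # True
print(legend.same_subject("black", "rouge"))      # False
```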

Legends anyone?

Animating Random Projections of High Dimensional Data [“just looking around a bit”]

Filed under: Data Mining,Graphics,High Dimensionality,Visualization — Patrick Durusau @ 4:02 pm

Animating Random Projections of High Dimensional Data by Andreas Mueller.

From the post:

Recently Jake showed some pretty cool videos in his blog.

This inspired me to go back to an idea I had some time ago, about visualizing high-dimensional data via random projections.

I love to do exploratory data analysis with scikit-learn, using the manifold, decomposition and clustering module. But in the end, I can only look at two (or three) dimensions. And I really like to see what I am doing.

So I go and look at the first two PCA directions, then at the first and third, then at the second and third… and so on. That is a bit tedious and looking at more would be great. For example using time.

There is software out there, called ggobi, which does a pretty good job at visualizing high dimensional data sets. It is possible to take interactive tours of your high dimensions, set projection angles and whatnot. It has a UI and tons of settings.

I used it a couple of times and I really like it. But it doesn’t really fit into my usual work flow. It has good R integration, but not Python integration that I know of. And it also seems a bit overkill for “just looking around a bit”.
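For “just looking around a bit,” the core of a random 2-D projection really is only a few lines of NumPy. A sketch of the general idea (not Andreas's animation code):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Some fake high-dimensional data: three blobs in 20 dimensions.
X = np.vstack([rng.normal(loc, 1.0, size=(100, 20)) for loc in (0, 4, 8)])

def random_projection_2d(X, rng):
    """Project onto a random orthonormal 2-D basis."""
    Q, _ = np.linalg.qr(rng.standard_normal((X.shape[1], 2)))
    return X @ Q

# Look at the data from a few random directions; loop faster to animate.
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax in axes:
    P = random_projection_2d(X, rng)
    ax.scatter(P[:, 0], P[:, 1], s=5)
plt.show()
```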

It’s hard to overestimate the value of “just looking around a bit.”

As opposed to defending a fixed opinion about data, data structures, or processing.

Who knows?

Practice at “just looking around a bit” may make your opinions less fixed.

That’s a chance you will have to take.

A Semantic Look at the Presidential Debates

Filed under: Debate,Natural Language Processing,Politics,Semantics — Patrick Durusau @ 3:30 pm

A Semantic Look at the Presidential Debates

Warning: For entertainment purposes only.*

Angela Guess reports:

Luca Scagliarini of Expert System reports, “This week’s presidential debate is being analyzed across the web on a number of fronts, from a factual analysis of what was said, to the number of tweets it prompted. Instead, we used our Cogito semantic engine to analyze the transcript of the debate through a semantic and linguistic lens. Cogito extracted the responses by question, breaking sentences down to their granular detail. This analysis allows us to look at the individual language elements to better understand what was said, as well as how the combined effect of word choice, sentence structure and sentence length might be interpreted by the audience.”

The full post: Presidential Debates 2012: Semantically speaking

*I don’t doubt the performance of the Cogito engine, just the semantics, if any, of the target content. 😉

WebPlatform.org [Pump Up Web Technology Search Clutter]

Filed under: CSS3,HTML,HTML5 — Patrick Durusau @ 2:58 pm

WebPlatform.org

From the webpage:

We are an open community of developers building resources for a better web, regardless of brand, browser or platform. Anyone can contribute and each person who does makes us stronger. Together we can continue to drive innovation on the Web to serve the greater good. It starts here, with you.

From Matt Brian:

In an attempt to create the “definitive resource” for all open Web technologies, Apple, Adobe, Facebook, Google, HP, Microsoft, Mozilla, Nokia, and Opera have joined the W3C to launch a new website called ‘Web Platform’.

The new website will serve as a single source of relevant, up-to-date and quality information on the latest HTML5, CSS3, and other Web standards, offering tips on web development and best practices for the technologies.

I first saw this at Semanticweb.com (Angela Guess).

So, maybe having documentation, navigable and good documentation, isn’t so weird after all. 😉

Assume I search for guidance on HTML5, CSS3, etc. Now there is a new site to add to web technology search results.

Glad to see the site, but not the addition to search clutter.

I suppose you could boost the site in response to all searches for web technology. Wonder if that will happen?

Doesn’t help your local silo of links.

How MongoDB’s Journaling Works

Filed under: MongoDB — Patrick Durusau @ 2:05 pm

How MongoDB’s Journaling Works by Kristina Chodorow.

From the post:

I was working on a section on the gooey innards of journaling for The Definitive Guide, but then I realized it’s an implementation detail that most people won’t care about. However, I had all of these nice diagrams just laying around.

Well, journaling may be “an implementation detail,” but Kristina explains it well and some “implementation details” shape our views of what is or isn’t possible.

Doesn’t hurt to know more than we did when we started reading the post.

Is your appreciation of journaling the same or different after reading Kristina’s post?

The 13 Steps to Running Any Statistical Model (Webinar)

Filed under: Statistics — Patrick Durusau @ 1:46 pm

The 13 Steps to Running Any Statistical Model

Webinar:

Date: December 5, 2012

Time: 3pm Eastern Time UTC -4 (2pm Central, 1pm Mountain, 12pm Pacific)

From the post:

All statistical modeling–whether ANOVA, Multiple Regression, Poisson Regression, Multilevel Model–is about understanding the relationship between independent and dependent variables. The content differs, but as a data analyst, you need to follow the same 13 steps to complete your modeling.

This webinar will give you an overview of these 13 steps:

  • what they are
  • why each one is important
  • the general order in which to do them
  • on which steps the different types of modeling differ and where they’re the same

Having a road map for the steps to take will make your modeling more efficient and keep you on track.

Whether the model is the point of your analysis or you are using a statistical model to discover subjects, this could be useful.

Code for America: open data and hacking the government

Filed under: Government,Government Data,Open Data,Open Government,Splunk — Patrick Durusau @ 12:50 pm

Code for America: open data and hacking the government by Rachel Perkins.

From the post:

Last week, I attended the Code for America Summit here in San Francisco. I attended as a representative of Splunk>4Good (we sponsored the event via a nice outdoor patio lounge area and gave away some of our (in)famous tshirts and a few ponies). Since this wasn’t your typical “conference”, and I’m not so great at schmoozing, I was a little nervous–what would Christy Wilson, Clint Sharp, and I do there? As it turned out, there were so many amazing takeaways and so much potential for awesomeness that my nervousness was totally unfounded.

So what is Code for America?

Code for America is a program that sends technologists (who take a year off and apply to their Fellowship program) to cities throughout the US to work with advocates in city government. When they arrive, they spend a few weeks touring the city and its outskirts, meeting residents, getting to know the area and its issues, and brainstorming about how the city can harness its public data to improve things. Then they begin to hack.
Some of these partnerships have come up with amazing tools–for example,

  • Opencounter Santa Cruz mashes up several public datasets to provide tactical and strategic information for persons looking to start a small business: what forms and permits you’ll need, zoning maps with overlays of information about other businesses in the area, and then partners with http://codeforamerica.github.com/sitemybiz/ to help you find commercial space for rent that matches your zoning requirements.
  • Another Code for America Fellow created blightstatus.org, which uses public data in New Orleans to inform residents about the status and plans for blighted properties in their area.
  • Other apps from other cities do cool things like help city maintenance workers prioritize repairs of broken streetlights based on other public data like crime reports in the area, time of day the light was broken, and number of other broken lights in the vicinity, or get the citizenry involved with civic data, government, and each other by setting up a Stack Exchange type of site to ask and answer common questions.

Whatever your view of data sharing by the government (too little, too much, or just right), Rachel points to good things that can come from open data.

Splunk has a corporate responsibility program: Splunk>4Good.

Check it out!

BTW, do you have a topic maps “corporate responsibility” program?

“The treacherous are ever distrustful…” (Gandalf to Saruman at Orthanc)

Filed under: Business Intelligence,Marketing,Transparency — Patrick Durusau @ 12:29 pm

Andrew Gelman’s post: Ethical standards in different data communities reminded me of this quote from The Two Towers (Lord of the Rings, Book II, J.R.R. Tolkien).

Andrew reports on a widely repeated claim by a former associate of a habitual criminal offender enterprise that recent government statistics were “cooked” to help President Obama in his re-election campaign.

After examining motives for “cooking” data and actual instances of data being “cooked” (by the habitual criminal offender enterprise), Andrew remarks:

One reason this interests me is the connection to ethics in the scientific literature. Jack Welch has experience in data manipulation and so, when he sees a number he doesn’t like, he suspects it’s been manipulated.

The problem is that anyone searching for this accusation or further information about the former associate or the habitual criminal offender enterprise is unlikely to encounter GE: Decades of Misdeeds and Wrongdoing.

Everywhere the GE stock ticker appears, there should be a link to: GE Corporate Criminal History. With links to the original documents, including pleas, fines, individuals, etc. Under whatever name or guise the activity was conducted.

This isn’t an anti-corruption rant. People in other criminal offender enterprises should be able to judge for themselves the trustworthiness of their individual counterparts in other enterprises.

That said, someone willing to cheat the government is certainly ready to cheat you.

Topic maps can deliver that level of transparency.

Or not, if you are the sort with a “cheating heart.”

A Good Example of Semantic Inconsistency [C-Suite Appropriate]

Filed under: Marketing,Semantic Diversity,Semantic Inconsistency,Semantics — Patrick Durusau @ 10:27 am

A Good Example of Semantic Inconsistency by David Loshin.

You can guide users through the intellectual minefield of Frege, Peirce, Russell, Carnap, Sowa and others to illustrate the need for topic maps, with stunning (as in daunting) graphics.

Or, you can use David’s story:

I was at an event a few weeks back talking about data governance, and a number of the attendees were from technology or software companies. I used the term “semantic inconsistency” and one of the attendees asked me to provide an example of what I meant.

Since we had been discussing customers, I thought about it for a second and then asked him what his definition was of a customer. He said that a customer was someone who had paid the company money for one of their products. I then asked if anyone in the audience was on the support team, and one person raised his hand. I asked him for a definition, and he said that a customer is someone to whom they provide support.

I then posed this scenario: the company issued a 30-day evaluation license to a prospect with full support privileges. Since the prospect had not paid any money for the product, according to the first definition that individual was not a customer. However, since that individual was provided full support privileges, according to the second definition that individual was a customer.

Within each silo, the associated definition is sound, but the underlying data sets are not compatible. An attempt to extract the two customer lists and merge them together into a single list will lead to inconsistent results. This may be even worse if separate agreements dictate how long a purchaser is granted full support privileges – this may lead to many inconsistencies across those two data sets.
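You can see the problem in a few lines of code. A hypothetical sketch of the two silos David describes, where a naive merge quietly mixes two different meanings of “customer”:

```python
# Silo 1 (sales): a customer is someone who has paid for a product.
paying_customers = {"acme corp", "globex"}

# Silo 2 (support): a customer is someone entitled to support.
supported_customers = {"acme corp", "globex", "initech"}   # initech: 30-day eval

# A naive "single customer list" quietly merges two different definitions.
merged = paying_customers | supported_customers
print(len(merged))                       # 3 -- but 3 of *what*, exactly?

# Is the evaluation prospect a customer? Depends on whose definition you ask.
print("initech" in paying_customers)     # False (sales' definition)
print("initech" in supported_customers)  # True  (support's definition)
```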

Illustrating “semantic inconsistency,” one story at a time.

What’s your 250 – 300 word semantic inconsistency story?

PS: David also points to webinar that will be of interest. Visit his post.

Advanced Data Structures [Jeff Erickson, UIUC]

Filed under: Algorithms,Data Structures — Patrick Durusau @ 10:08 am

Advanced Data Structures

From the description:

This course will survey important developments in data structures that have not (yet) worked their way into the standard computer science curriculum. The precise topics will depend on the interests and background of the course participants; see the current schedule for details. Potential topics include self-adjusting binary search trees, dynamic trees and graphs, persistent data structures, geometric data structures, kinetic data structures, I/O-efficient and cache-oblivious data structures, hash tables and Bloom filters, data structures that beat information-theoretic lower bounds, and applications of these data structures in computational geometry, combinatorial optimization, systems and networking, databases, and other areas of computer science.
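As a taste of the material, here is a minimal Bloom filter in Python, one of the structures on the topic list (a sketch for intuition, not an optimized implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                       # Python int used as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # "Maybe present" if every bit is set; definitely absent otherwise.
        return all(self.bits & (1 << pos) for pos in self._positions(item))


bf = BloomFilter()
bf.add("hash tables")
print("hash tables" in bf)   # True
print("kinetic data" in bf)  # almost certainly False (could be a false positive)
```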

The course page has links to similar courses.

For hardy souls exploring data structures per se or for specialized topic maps, there is an annotated bibliography of readings.

If you haven’t seen it, visit Jeff’s homepage.

To see what Jeff has been up to lately: DBLP: Jeff Erickson.
