Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 16, 2013

Hadoop, The Perfect App for OpenStack

Filed under: Cloud Computing,Hadoop,Hortonworks,OpenStack — Patrick Durusau @ 6:03 pm

Hadoop, The Perfect App for OpenStack by Shaun Connolly.

From the post:

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs. By bringing our Hadoop expertise to the OpenStack community in concert with Red Hat (the leading contributor to OpenStack), Mirantis (the leading system integrator for OpenStack), and Rackspace (a founding member of OpenStack), we feel we can speed the delivery of operational agility and efficient sharing of infrastructure that deploying elastic Hadoop on OpenStack can provide.

Why is this news important for topic maps?

Have you noticed that none, read none, of the big data or cloud efforts say anything about data semantics?

As if when big data and the cloud arrive, all your data integration problems will magically melt away.

I don’t think so.

What I think is going to happen is discordant data sets are going to start rubbing and binding on each other. Perhaps not a lot at first but as data explorers get bolder, the squeaks are going to get louder.

So loud in fact the squeaks (now tearing metal sounds) are going to attract the attention of… (drum roll)… the CEO.

What’s your answer for discordant data?

  • Ear plugs?
  • Job with another company?
  • Job in another country?
  • Job under an assumed name?

I would say none of the above.

Probabilistic Bounds — A Primer

Filed under: Mathematics,Probability — Patrick Durusau @ 4:43 pm

Probabilistic Bounds — A Primer by Jeremy Kun.

From the post:

Probabilistic arguments are a key tool for the analysis of algorithms in machine learning theory and probability theory. They also assume a prominent role in the analysis of randomized and streaming algorithms, where one imposes a restriction on the amount of storage space an algorithm is allowed to use for its computations (usually sublinear in the size of the input).

While a whole host of probabilistic arguments are used, one theorem in particular (or family of theorems) is ubiquitous: the Chernoff bound. In its simplest form, the Chernoff bound gives an exponential bound on the deviation of sums of random variables from their expected value.

This is perhaps most important to algorithm analysis in the following mindset. Say we have a program whose output is a random variable X. Moreover suppose that the expected value of X is the correct output of the algorithm. Then we can run the algorithm multiple times and take a median (or some sort of average) across all runs. The probability that the algorithm gives a wildly incorrect answer is the probability that more than half of the runs give values which are wildly far from their expected value. Chernoff’s bound ensures this will happen with small probability.

So this post is dedicated to presenting the main versions of the Chernoff bound that are used in learning theory and randomized algorithms. Unfortunately the proof of the Chernoff bound in its full glory is beyond the scope of this blog. However, we will give short proofs of weaker, simpler bounds as a straightforward application of this blog’s previous work laying down the theory.

If the reader has not yet intuited it, this post will rely heavily on the mathematical formalisms of probability theory. We will assume our reader is familiar with the material from our first probability theory primer, and it certainly wouldn’t hurt to have read our conditional probability theory primer, though we won’t use conditional probability directly. We will refrain from using measure-theoretic probability theory entirely (some day my colleagues in analysis will like me, but not today).

Another heavy-sledding post from Jeremy, but if you persist, you will gain a deeper understanding of the algorithms of machine learning theory.

If that sounds esoteric, consider that it will help you question results produced by machine learning algorithms.

Do you really want to take a machine’s “word” for something important?

Or do you want the chance to know why an answer is correct, questionable or incorrect?
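If you want to see the bound at work without any formalism, here is a minimal simulation of the median trick described above (plain Python/NumPy, my own toy estimator, not Jeremy's code):

import numpy as np

rng = np.random.default_rng(42)

TRUE_VALUE = 10.0   # the "correct output" of our imaginary randomized algorithm
SPREAD = 8.0        # a single run returns a value uniform in [TRUE_VALUE - 8, TRUE_VALUE + 8]
TOLERANCE = 4.0     # how far off counts as "wildly incorrect"
TRIALS = 2000       # how many times we test each configuration

def single_run():
    # One run of the noisy algorithm: unbiased, but with large variance.
    return rng.uniform(TRUE_VALUE - SPREAD, TRUE_VALUE + SPREAD)

def median_of_runs(k):
    # The amplification trick: repeat k times and report the median.
    return np.median([single_run() for _ in range(k)])

for k in (1, 5, 15, 45):
    failures = sum(abs(median_of_runs(k) - TRUE_VALUE) > TOLERANCE
                   for _ in range(TRIALS))
    print(f"k={k:2d} runs per answer -> empirical failure rate {failures / TRIALS:.3f}")

The empirical failure rate should fall off roughly exponentially as k grows, which is exactly the behavior the Chernoff-style bounds promise.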

The non-negative matrix factorization toolbox for biological data mining

Filed under: Bioinformatics,Data Mining,Matrix — Patrick Durusau @ 4:20 pm

The non-negative matrix factorization toolbox for biological data mining by Yifeng Li and Alioune Ngom. (Source Code for Biology and Medicine 2013, 8:10 doi:10.1186/1751-0473-8-10)

From the post:

Background: Non-negative matrix factorization (NMF) has been introduced as an important method for mining biological data. Though there currently exist packages implemented in R and other programming languages, they either provide only a few optimization algorithms or focus on a specific application field. There does not yet exist a complete NMF package for the bioinformatics community to perform various data mining tasks on biological data.

Results: We provide a convenient MATLAB toolbox containing both the implementations of various NMF techniques and a variety of NMF-based data mining approaches for analyzing biological data. Data mining approaches implemented within the toolbox include data clustering and bi-clustering, feature extraction and selection, sample classification, missing values imputation, data visualization, and statistical comparison.

Conclusions: A series of analyses such as molecular pattern discovery, biological process identification, dimension reduction, disease prediction, visualization, and statistical comparison can be performed using this toolbox.

Written in a bioinformatics context, but NMF is also used in text data mining (Enron emails), spectral analysis, and other data mining fields. (See Non-negative matrix factorization.)
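The toolbox itself is MATLAB, but the core idea is easy to try out. A minimal sketch of the Lee-Seung multiplicative updates in Python/NumPy (a toy illustration, not the toolbox's algorithms):

import numpy as np

def nmf(V, rank, iterations=500, eps=1e-9):
    # Factor a non-negative matrix V (m x n) into W (m x rank) @ H (rank x n).
    rng = np.random.default_rng(0)
    m, n = V.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(iterations):
        # Lee-Seung multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "expression matrix": 20 samples x 30 features with two hidden patterns.
rng = np.random.default_rng(1)
true_W = rng.random((20, 2))
true_H = rng.random((2, 30))
V = true_W @ true_H + 0.01 * rng.random((20, 30))

W, H = nmf(V, rank=2)
print("relative reconstruction error:", np.linalg.norm(V - W @ H) / np.linalg.norm(V))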

Visualization and Projection

Filed under: Mathematics,Projection,Visualization — Patrick Durusau @ 4:06 pm

Visualization and Projection by Jesse Johnson.

From the post:

One of the common themes that I’ve emphasized so far on this blog is that we should try to analyze high dimensional data sets without being able to actually “see” them. However, it is often useful to visualize the data in some way, and having just introduced principal component analysis, this is probably a good time to start the discussion. There are a number of types of visualization that involve representing statistics about the data in different ways, but today we’ll look at ways of representing the actual data points in two or three dimensions.

In particular, what we want to do is to draw the individual data points of a high dimensional data set in a lower dimensional space so that the hard work can be done by the best pattern recognition machine there is: your brain. When you look at a two- or three-dimensional data set, you will naturally recognize patterns based on how close different points are to each other. Therefore, we want to represent the points so that the distances between them change as little as possible. In general, this is called projection, the term coming from the idea that we will do the same thing to the data as you do when you make shadow puppets: We project a high dimensional object (such as your three-dimensional hands) onto a lower dimensional object (such as the two-dimensional wall).

We’ve already used linear projection implicitly when thinking about higher dimensional spaces. For example, I suggested that you think about the three-dimensional space that we live in as being a piece of paper on the desk of someone living in a higher dimensional space. In the post about the curse of dimensionality, we looked at the three-dimensional cube from the side and saw that it was two-dimensional, then noted that a four-dimensional cube would look like a three-dimensional cube if we could look at it “from the side”.

When we look at an object from the side like this, we are essentially ignoring one of the dimensions. This is the simplest form of projection, and in general we can choose to ignore more than one dimension at a time. For example, if you have data points in five dimensions and you want to plot them in two dimensions, you could just pick the two dimensions that you thought were most important and plot the points based on those. It’s hard to picture how that works because you still have to think about the original five dimensional data. But this is similar to the picture if we were to take a two-dimensional data set and throw away one of the dimensions as in the left and middle pictures in the Figure below. You can see the shadow puppet analogy too: In the figure on the left, the light is to the right of the data, while in the middle, it’s above.

I hesitated over:

…then noted that a four-dimensional cube would look like a three-dimensional cube if we could look at it “from the side”

I barely remember Martin Gardner’s column on visualizing higher dimensions and its illustrations of projecting a four-dimensional box into three dimensions.

But the original post is describing a four-dimensional cube in a four-dimensional space being viewed “on its side” by a being that exists in that four-dimensional space.

That works.

How would you choose which dimensions to project for a human observer to judge?
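To play with that question, here is a small sketch comparing the naive "drop some dimensions" projection with a PCA projection, measured by how much pairwise distances get distorted (NumPy only, my own toy data, not Jesse's figures):

import numpy as np

rng = np.random.default_rng(0)

# Toy 5-dimensional data whose interesting variation runs along a diagonal direction.
n = 200
t = rng.normal(size=n)
X = np.column_stack([t + 0.1 * rng.normal(size=n) for _ in range(5)])

def distance_distortion(original, projected):
    # Average relative change in pairwise distances after projection.
    d_orig = np.linalg.norm(original[:, None] - original[None, :], axis=-1)
    d_proj = np.linalg.norm(projected[:, None] - projected[None, :], axis=-1)
    mask = d_orig > 0
    return np.mean(np.abs(d_orig[mask] - d_proj[mask]) / d_orig[mask])

# Naive projection: just keep the first two coordinates.
naive = X[:, :2]

# PCA projection: project onto the two directions of greatest variance.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pca = Xc @ Vt[:2].T

print("distortion, drop dimensions:", round(distance_distortion(X, naive), 3))
print("distortion, PCA projection: ", round(distance_distortion(X, pca), 3))

On data whose variation runs diagonally across the axes, the PCA projection preserves pairwise distances far better than simply discarding coordinates.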

Does statistics have an ontology? Does it need one? (draft 2)

Filed under: Ontology,Statistics — Patrick Durusau @ 3:49 pm

Does statistics have an ontology? Does it need one? (draft 2) by D. Mayo.

From the post:

Chance, rational beliefs, decision, uncertainty, probability, error probabilities, truth, random sampling, resampling, opinion, expectations. These are some of the concepts we bandy about by giving various interpretations to mathematical statistics, to statistical theory, and to probabilistic models. But are they real? The question of “ontology” asks about such things, and given the “Ontology and Methodology” conference here at Virginia Tech (May 4, 5), I’d like to get your thoughts (for possible inclusion in a Mayo-Spanos presentation).* Also, please consider attending**.

Interestingly, I noticed the posts that have garnered the most comments have touched on philosophical questions of the nature of entities and processes behind statistical idealizations (e.g., http://errorstatistics.com/2012/10/18/query/).

The post and ensuing comments offer much to consider.

From my perspective, if assumptions, ontological and otherwise, go unstated, the results are opaque.

You can accept them because they fit your prior opinion or desired result, or reject them because they don’t.

Iterative Map Reduce – Prior Art

Filed under: Hadoop,MapReduce — Patrick Durusau @ 11:59 am

Iterative Map Reduce – Prior Art

From the post:

There have been several attempts in the recent past at extending Hadoop to support efficient iterative data processing on clusters. To facilitate understanding this problem better, here is a collection of some prior art relating to this problem space.

Short summaries of:

Other proposals to add to this list?

How To Debug Solr With Eclipse

Filed under: Eclipse,Lucene,Solr — Patrick Durusau @ 11:49 am

How To Debug Solr With Eclipse by Doug Turnbull.

From the post:

Recently I was puzzled by some behavior Solr was showing me. I scratched my head and called over a colleague. We couldn’t quite figure out what was going on. Well Solr is open source so… next stop – Debuggersville!

Running Solr in the Eclipse debugger isn’t hard, but there are many scattered user group posts and blog articles that you’ll need to manually tie together into a coherent picture. So let me do you the favor of tying all of that info together for you here.

This looks very useful.

Curious if there are any statistical function debuggers?

That step you through the operations and show the state of values as they change?

Thinking that could be quite useful as a sanity test when the numbers just don’t jibe.

April 15, 2013

Almost There: Neo4j 1.9-RC1!

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:31 pm

Almost There: Neo4j 1.9-RC1! by Philip Rathle.

From the post:

Today is Leonhard Euler’s birthday, and we’re celebrating by announcing a first Release Candidate for Neo4j 1.9, now available for download! This release includes a number of incremental changes from the last Milestone (1.9-M05). This release candidate includes the last set of features we’d love our community to try out, as we prepare Neo4j 1.9 for General Availability (GA).

Philip also reports changes since the last milestone (1.9-M05).

I’m curious if the final Neo4j 1.9 release is going to be benchmarked against earlier releases?

Or released with benchmarks at all?

2ND International Workshop on Mining Scientific Publications

Filed under: Conferences,Data Mining,Searching,Semantic Search,Semantics — Patrick Durusau @ 2:49 pm

2ND International Workshop on Mining Scientific Publications

May 26, 2013 – Submission deadline
June 23, 2013 – Notification of acceptance
July 7, 2013 – Camera-ready
July 26, 2013 – Workshop

From the CFP:

Digital libraries that store scientific publications are becoming increasingly important in research. They are used not only for traditional tasks such as finding and storing research outputs, but also as sources for mining this information, discovering new research trends and evaluating research excellence. The rapid growth in the number of scientific publications being deposited in digital libraries makes it no longer sufficient to provide access to content to human readers only. It is equally important to allow machines to analyse this information and by doing so facilitate the processes by which research is being accomplished. Recent developments in natural language processing, information retrieval, the semantic web and other disciplines make it possible to transform the way we work with scientific publications. However, in order to make this happen, researchers first need to be able to easily access and use large databases of scientific publications and research data, to carry out experiments.

This workshop aims to bring together people from different backgrounds who:
(a) are interested in analysing and mining databases of scientific publications,
(b) develop systems, infrastructures or datasets that enable such analysis and mining,
(c) design novel technologies that improve the way research is being accomplished or
(d) support the openness and free availability of publications and research data.

2. TOPICS

The topics of the workshop will be organised around the following three themes:

  1. Infrastructures, systems, open datasets or APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
  3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence and to aid content exploration.

Of particular interest for topic mappers:

Topics of interest relevant to theme 2 include, but are not limited to:

  • Novel information extraction and text-mining approaches to semantic enrichment of publications. This might range from mining publication structure, such as title, abstract, authors, citation information etc. to more challenging tasks, such as extracting names of applied methods, research questions (or scientific gaps), identifying parts of the scholarly discourse structure etc.
  • Automatic categorization and clustering of scientific publications. Methods that can automatically categorize publications according to an established subject-based classification/taxonomy (such as Library of Congress classification, UNESCO thesaurus, DOAJ subject classification, Library of Congress Subject Headings) are of particular interest. Other approaches might involve automatic clustering or classification of research publications according to various criteria.
  • New methods and models for connecting and interlinking scientific publications. Scientific publications in digital libraries are not isolated islands. Connecting publications using explicitly defined citations is very restrictive and has many disadvantages. We are interested in innovative technologies that can automatically connect and interlink publications or parts of publications, according to various criteria, such as semantic similarity, contradiction, argument support or other relationship types.
  • Models for semantically representing and annotating publications. This topic is related to aspects of semantically modeling publications and scholarly discourse. Models that are practical with respect to the state-of-the-art in Natural Language Processing (NLP) technologies are of special interest.
  • Semantically enriching/annotating publications by crowdsourcing. Crowdsourcing can be used in innovative ways to annotate publications with richer metadata or to approve/disapprove annotations created using text-mining or other approaches. We welcome papers that address the following questions: (a) what incentives should be provided to motivate users in contributing metadata, (b) how to apply crowdsourcing in the specialized domains of scientific publications, (c) what tasks in the domain of organising scientific publications is crowdsourcing suitable for and where it might fail, (d) other relevant crowdsourcing topics relevant to the domain of scientific publications.

The other themes could be viewed through a topic map lens but semantic enrichment seems like a natural.

Node.js integrates with M:…

Filed under: MUMPS,node-js — Patrick Durusau @ 1:08 pm

Node.js integrates with M: The NoSQL Hierarchical database by Luis Ibanez.

From the post:

We have talked recently about the significance of integrating the Node.js language with the powerful M database, particularly in the space of healthcare applications.

The efficiency of Node.js combined with the high performance of M, provides an unparalleled fresh approach for building healthcare applications.

Healthcare needs help from Node.js but other areas do as well!

I first saw this at: Node.js Integrates With M: The NoSQL Hierarchical Database by Alex Popescu.

Miriam Registry [More Identifiers For Science]

Filed under: Identifiers,Science — Patrick Durusau @ 12:42 pm

Miriam Registry

From the homepage:

Persistent identification for life science data

The MIRIAM Registry provides a set of online services for the generation of unique and perennial identifiers, in the form of URIs. It provides the core data which is used by the Identifiers.org resolver.

The core of the Registry is a catalogue of data collections (corresponding to controlled vocabularies or databases), their URIs and the corresponding physical URLs or resources. Access to this data is made available via exports (XML) and Web Services (SOAP).

And from the FAQ:

What is MIRIAM, and what does it stand for?

MIRIAM is an acronym for the Minimal Information Required In the Annotation of Models. It is important to distinguish between the MIRIAM Guidelines, and the MIRIAM Registry. Both being part of the wider BioModels.net initiative.

What are the ‘MIRIAM Guidelines’?

The MIRIAM Guidelines are an effort to standardise upon the essential, minimal set of information that is sufficient to annotate a model in such a way as to enable its reuse. This includes a means to identify the model itself, the components of which it is composed, and formalises a means by which unambiguous annotation of components should be encoded. This is essential to allow collaborative working by different groups which may not be spatially co-located, and facilitates model sharing and reuse by the general modelling community. The goal of the project, initiated by the BioModels.net effort, was to produce a set of guidelines suitable for model annotation. These guidelines can be implemented in any structured format used to encode computational models, for example SBML, CellML, or NeuroML . MIRIAM is a member of the MIBBI family of community-developed ‘minimum information’ reporting guidelines for the biosciences.

More information on the requirements to achieve MIRIAM Guideline compliance is available on the MIRIAM Guidelines page.

What is the MIRIAM Registry?

The MIRIAM Registry provides the necessary information for the generation and resolving of unique and perennial identifiers for life science data. Those identifiers are of the URI form and make use of Identifiers.org for providing access to the identified data records on the Web. Examples of such identifiers: http://identifiers.org/pubmed/22140103, http://identifiers.org/uniprot/P01308, …

More identifiers for the life sciences, for those who choose to use them.

The curation may be helpful in terms of mappings to other identifiers.
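If you want to see how these identifiers behave in practice, a minimal sketch like the following builds a few of them and follows where they resolve (the redirect behavior of Identifiers.org is my assumption, based on the examples above):

import requests

# Example MIRIAM/Identifiers.org URIs: a collection namespace plus an accession.
examples = [
    ("pubmed", "22140103"),
    ("uniprot", "P01308"),
]

for collection, accession in examples:
    uri = f"http://identifiers.org/{collection}/{accession}"
    # Assumption: the resolver answers with an HTTP redirect to a provider's record page.
    response = requests.get(uri, allow_redirects=True, timeout=10)
    print(uri)
    print("  resolved to:", response.url)
    print("  status:", response.status_code)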

The Power of Collaboration [Cultural Gulfs]

Filed under: Collaboration,Design — Patrick Durusau @ 8:57 am

The Power of Collaboration by Andrea Ruskin.

From the post:

A quote that I stumbled on during grad school stuck with me. From the story of the elder’s box as told by Eber Hampton, it sums up my philosophy of working and teaching:

“How many sides do you see?”
“One,” I said.
He pulled the box towards his chest and turned it so one corner faced me.
“Now how many do you see?”
“Now I see three sides.”
He stepped back and extended the box, one corner towards him and one towards me.
“You and I together can see six sides of this box,” he told me.

—Eber Hampton (2002) The Circle Unfolds, p. 41–42

Andrea describes a graduate school project to develop a learning resource for Aboriginal students.

A task made more difficult by Andrea being a non-Aboriginal designer.

The gap between you and a topic map customer may not be as obvious but will be no less real.

Disambiguating Hilarys

Filed under: Disambiguation,Entities,Entity Resolution — Patrick Durusau @ 5:57 am

Hilary Mason (live, data scientist) writes about Google confusing her with Hilary Mason (deceased, actress) in Et tu, Google?

To be fair, Hilary Mason (live, data scientist), notes Bing has made the same mistake in the past.

Hilary Mason (live, data scientist) goes on to say:

I know that entity disambiguation is a hard problem. I’ve worked on it, though never with the kind of resources that I imagine Google can bring to it. And yet, this is absurd!

Is entity disambiguation a hard problem?

Or is entity disambiguation a hard problem after the act of authorship?

Authors (in general) know what entities they meant.

The hard part is inferring what entity they meant when they forgot to disambiguate between possible entities.

Rather than focusing on mining low grade ore (content where entities are not disambiguated), wouldn’t a better solution be authoring with automatic entity disambiguation?

We have auto-correction in word processing software now, why not auto-entity software that tags entities in content?

Presenting the author of content with disambiguated entities for them to accept, reject or change.

It won’t solve the problem of prior content with undisambiguated entities, but it can keep the problem from worsening.
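A toy sketch of what authoring-time disambiguation could look like, with a made-up gazetteer and a prompt to the author (purely illustrative):

# A tiny, hypothetical gazetteer: surface form -> candidate entities with identifiers.
GAZETTEER = {
    "Hilary Mason": [
        {"id": "ex:hilary-mason-data-scientist", "note": "data scientist, bit.ly"},
        {"id": "ex:hilary-mason-actress", "note": "actress, 1917-2006"},
    ],
}

def suggest_entities(text):
    # Yield (surface form, candidates) for every gazetteer name found in the text.
    for name, candidates in GAZETTEER.items():
        if name in text:
            yield name, candidates

def disambiguate_interactively(text):
    # Ask the author, at writing time, which entity each ambiguous name refers to.
    chosen = {}
    for name, candidates in suggest_entities(text):
        print(f'Which "{name}" did you mean?')
        for i, candidate in enumerate(candidates):
            print(f"  [{i}] {candidate['id']} ({candidate['note']})")
        pick = int(input("Enter a number (or -1 to leave ambiguous): "))
        if pick >= 0:
            chosen[name] = candidates[pick]["id"]
    return chosen

if __name__ == "__main__":
    draft = "Hilary Mason gave a talk on entity disambiguation."
    print(disambiguate_interactively(draft))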

April 14, 2013

Let Them Pee:…

Filed under: Interface Research/Design,Usability,UX — Patrick Durusau @ 3:46 pm

Let Them Pee: Avoiding the Sign-Up/Sign-In Mobile Antipattern by Greg Nudelman.

From the post:

The application SitOrSquat is a brilliant little piece of social engineering software that enables people to find bathrooms on the go, when they gotta go. Obviously, the basic use case implies a, shall we say, certain sense of urgency. This urgency is all but unfelt by the company that acquired the app, Procter and Gamble (P&G), as it would appear for the express purposes of marketing the Charmin brand of toilet paper. (It’s truly a match made in heaven—but I digress.)

Not content with the business of simply “Squeezing the Charmin” (that is, simple advertising), P&G executives decided for some unfathomable reason to force people to sign up for the app in multiple ways. First, as you can see in Figure 1, the app forces the customer (who is urgently looking for a place to relieve himself, let’s not forget) to use the awkward picker control to select his birthday to allegedly find out if he has been “potty trained.” This requirement would be torture on a normal day, but—I think you’ll agree—it’s excruciating when you really gotta go.

The horror of SitOrSquat doesn’t stop there.

Greg’s telling of the story is masterful. You owe it to yourself to read it more than once.

Relevant for mobile apps but also to “free” whitepapers, demo software or the other crap that requires name/email/phone details.

Name/email/phone details support marketing people who drain funds away from development and induce potential customers to look elsewhere.

Nozzle R Package

Filed under: Documentation,Graphics,R,Visualization — Patrick Durusau @ 3:29 pm

Nozzle R Package

From the webpage:

Nozzle is an R package for generation of reports in high-throughput data analysis pipelines. Nozzle reports are implemented in HTML, JavaScript, and Cascading Style Sheets (CSS), but developers do not need any knowledge of these technologies to work with Nozzle. Instead they can use a simple R API to design and implement powerful reports with advanced features such as foldable sections, zoomable figures, sortable tables, and supplementary information. Please cite our Bioinformatics paper if you are using Nozzle in your work.

I have only looked at the demo reports but this looks quite handy.

It doesn’t hurt to have extensive documentation to justify a conclusion that took you only moments to reach.

Planform:… [Graph vs. SQL?]

Filed under: Bioinformatics,Graphs,SQL,SQLite — Patrick Durusau @ 3:16 pm

Planform: an application and database of graph-encoded planarian regenerative experiments by Daniel Lobo, Taylor J. Malone and Michael Levin. Bioinformatics (2013) 29 (8): 1098-1100. doi: 10.1093/bioinformatics/btt088

Abstract:

Summary: Understanding the mechanisms governing the regeneration capabilities of many organisms is a fundamental interest in biology and medicine. An ever-increasing number of manipulation and molecular experiments are attempting to discover a comprehensive model for regeneration, with the planarian flatworm being one of the most important model species. Despite much effort, no comprehensive, constructive, mechanistic models exist yet, and it is now clear that computational tools are needed to mine this huge dataset. However, until now, there is no database of regenerative experiments, and the current genotype–phenotype ontologies and databases are based on textual descriptions, which are not understandable by computers. To overcome these difficulties, we present here Planform (Planarian formalization), a manually curated database and software tool for planarian regenerative experiments, based on a mathematical graph formalism. The database contains more than a thousand experiments from the main publications in the planarian literature. The software tool provides the user with a graphical interface to easily interact with and mine the database. The presented system is a valuable resource for the regeneration community and, more importantly, will pave the way for the application of novel artificial intelligence tools to extract knowledge from this dataset.

Availability: The database and software tool are freely available at http://planform.daniel-lobo.com.

Watch the video tour for an example of a domain specific authoring tool.

It does not use any formal graph notation/terminology or attempt a new form of ASCII art.

Users can enter data about worms with four (4) heads. That bodes well for new techniques to author topic maps.

On the use of graphs, the authors write:

We have created a formalism based on graphs to encode the resultant morphologies and manipulations of regenerative experiments (Lobo et al., 2013). Mathematical graphs are ideal to encode relationships between individuals and have been previously used to encode morphologies (Lobo et al., 2011). The formalism divided a morphology into adjacent regions (graph nodes) connected to each other (graph edges). The geometrical characteristics of the regions (connection angles, distances, shapes, type, etc.) are stored as node and link labels. Importantly, the formalism permits automatic comparisons between morphologies: we implemented a metric to quantify the difference between two morphologies based on the graph edit distance algorithm.

The experiment manipulations are encoded in a tree structure. Nodes represent specific manipulations (cuts, irradiation and transplantations) where links define the order and relations between manipulations. This approach permits encoding the majority of published planarian regenerative experiments.
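As a rough illustration of the graph idea (not the Planform formalism itself), here is a sketch that compares two tiny "morphology" graphs with networkx's graph edit distance:

import networkx as nx

def morphology(regions, connections):
    # Build a small "morphology" graph: nodes are regions, edges are adjacencies.
    g = nx.Graph()
    for name, kind in regions:
        g.add_node(name, kind=kind)
    g.add_edges_from(connections)
    return g

# A one-headed worm versus a two-headed worm, very crudely.
one_head = morphology(
    regions=[("h1", "head"), ("t", "trunk"), ("tail", "tail")],
    connections=[("h1", "t"), ("t", "tail")],
)
two_heads = morphology(
    regions=[("h1", "head"), ("t", "trunk"), ("h2", "head")],
    connections=[("h1", "t"), ("t", "h2")],
)

def same_region_type(a, b):
    # Nodes are considered equal only when their region types match.
    return a["kind"] == b["kind"]

distance = nx.graph_edit_distance(one_head, two_heads, node_match=same_region_type)
print("graph edit distance:", distance)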

The graph vs. relational crowd will be disappointed to learn the project uses SQLite (“the most widely deployed SQL database engine in the world”) for the storage/access to its data. 😉

You were aware that hypergraphs were used to model relational databases in the “old days.” Yes?

I will try to pull together some of those publications in the near future.

A walk-through for the Twitter streaming API

Filed under: Scala,Tweets — Patrick Durusau @ 2:42 pm

A walk-through for the Twitter streaming API by Jason Baldridge.

From the post:

Analyzing tweets is all the rage, and if you are new to the game you want to know how to get them programmatically. There are many ways to do this, but a great start is to use the Twitter streaming API, a RESTful service that allows you to pull tweets in real time based on criteria you specify. For most people, this will mean having access to the spritzer, which provides only a very small percentage of all the tweets going through Twitter at any given moment. For access to more, you need to have a special relationship with Twitter or pay Twitter or an affiliate like Gnip.

This post provides a basic walk-through for using the Twitter streaming API. You can get all of this based on the documentation provided by Twitter, but this will be slightly easier going for those new to such services. (This post is mainly geared for the first phase of the course project for students in my Applied Natural Language Processing class this semester.)

You need to have a Twitter account to do this walk-through, so obtain one now if you don’t have one already.

Basics of obtaining tweets from the Twitter stream.

I mention it as an active data source that may find its way into your topic map.
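Jason's walk-through is Scala-flavored. A rough Python equivalent, with the endpoint name and OAuth details as they stood in the 1.1 streaming API (treat them as assumptions that may have changed), looks something like this:

import json
import requests
from requests_oauthlib import OAuth1

# Fill in the credentials from your Twitter developer account.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# Public filter endpoint of the 1.1 streaming API (an assumption; check the current docs).
url = "https://stream.twitter.com/1.1/statuses/filter.json"

# Track a keyword and read tweets as they arrive, one JSON object per line.
response = requests.post(url, auth=auth, data={"track": "topic maps"}, stream=True)
response.raise_for_status()
for line in response.iter_lines():
    if not line:  # keep-alive newlines
        continue
    tweet = json.loads(line)
    print(tweet.get("user", {}).get("screen_name"), ":", tweet.get("text"))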

Data Mining with Weka

Filed under: Machine Learning,Weka — Patrick Durusau @ 3:59 am

Data Mining with Weka by Prof Ian H. Witten.

From the documentation:

The purpose of this study is to gain information to help design and implement the main WekaMOOC course.

If you are interested in Weka or helping with the development of a MOOC or both, this is an opportunity for you.

I am curious if MOOCs or at least mini-MOOCs are going to replace the extended infomercials touted as webinars.


Update already: for Ubuntu, install manually (not with Aptitude) so you can set JVM memory options at startup. I got JDBC error messages but it otherwise ran properly.

April 13, 2013

Office of the Director of National Intelligence: Data Mining 2012

Filed under: Data Mining,Intelligence — Patrick Durusau @ 6:57 pm

Office of the Director of National Intelligence: Data Mining 2012

Office of the Director of National Intelligence = ODNI

To cut directly to the chase:

II. ODNI Data Mining Activities

The ODNI did not engage in any activities to use or develop data mining functionality during the reporting period.

My source, KDNuggets, provides the legal loophole analysis.

Who watches the watchers?

Looks like it’s going to be you and me.

Every citizen who recognizes a government employee, agent, or official: tweet the name you know them by, along with your location.

Just that.

If enough of us do that, patterns will begin to appear in the data stream.

If enough patterns appear in the data stream, the identities of government employees, agents, and officials will slowly become known.

Transparency won’t happen overnight or easily.

But if you are waiting for the watchers to watch themselves, you are going to be severely disappointed.

Cypher: It doesn’t all start with the START (in Neo4j 2.0!) [Benchmarks?]

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 6:35 pm

Cypher: It doesn’t all start with the START (in Neo4j 2.0!)

From the post:

So, apparently, the Neo Technology guys read one of my last blog posts titled “It all starts with the START” and wanted to make a liar out of me. Actually, I’m quite certain it had nothing at all to do with that–they are just wanting to improve Cypher to make it the best graph query language out there. But yes, the START clause is now optional. “How do I tell Neo4j where to start my traversals”, you might ask. Well, in the long run, you won’t need to anymore. Neo4j will keep index and node/rel statistics and know which index to use, and know which start points to use to make the match and where the most efficient query based on its cost optimization. It’s not quite there yet, so for a while we’ll probably want to make generous use of “index hints”, but I love the direction this is going–feels just like the good old SQL.

While you are looking at Neo4j 2.0, remember the performance benchmarks by René Pickhardt up through Neo4j 1.9:

Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language

As of Neo4j 1.7, the core Java API was a full order of magnitude faster than Cypher, and up through Neo4j 1.9 the difference was even greater.

Has anyone run the benchmark against Neo4j 2.0?

Apache Marmotta (incubator)

Filed under: Apache Marmotta,Linked Data — Patrick Durusau @ 6:19 pm

Apache Marmotta (incubator)

From the webpage:

Apache Marmotta (incubator) is an Open Platform for Linked Data.

The goal of Apache Marmotta is to provide an open implementation of a Linked Data Platform that can be used, extended and deployed easily by organizations who want to publish Linked Data or build custom applications on Linked Data.

Right now the project is being set up in the Apache Software Foundation infrastructure. The team is working to have the first release under the incubator available for download in the upcoming weeks. Check the development section for further details on how we work or subscribe to our mailing lists to follow the project day to day.

Features

  • Read-Write Linked Data
  • RDF triple store with transactions, versioning and rule-base reasoning
  • SPARQL and LDPath query languages
  • Transparent Linked Data Caching
  • Integrated security mechanisms

Background

Marmotta comes as a continuation of the work in the Linked Media Framework project. LMF is an easy-to-setup server application that bundles some technologies such as Apache Stanbol or Apache Solr to offer some advanced services. After the 2.6 release, the Read-Write Linked Data server code and some related libraries have been set aside to incubate Marmotta within The Apache Software Foundation. LMF still keeps exactly the same functionality, but now bundles Marmotta too.

If a client wants a Linked Data Platform, the least you can do is recommend one from Apache.

Mongraph

Filed under: MongoDB,Mongraph,Neo4j — Patrick Durusau @ 6:00 pm

Mongraph

From the readme:

Mongraph combines a document-storage database with graph-database relationships by creating a corresponding node for each document.

Flies in the face of the orthodoxy that every app should be the “universal” app, but still worth watching.

Wikileaks: Kissinger Cables

Filed under: Data,Wikileaks — Patrick Durusau @ 2:13 pm

Wikileaks: Kissinger Cables

The code behind the Public Library of US Diplomacy.

Another rich source of information for anyone creating a mapping of relationships and events in the early 1970s.

My only puzzle over Wikileaks is their apparent focus on US diplomatic cables.

Where are the diplomatic cables of the former government in Egypt? Or the USSR? Or of any of the many existing regimes around the globe?

Surely those aren’t more difficult to obtain than those of the US?

Perhaps that would make an interesting topic map.

Those who could be exposed by Wikileaks but aren’t.

I first saw this as: Wikileaks ProjectK Code (Github) on Nat Torkington’s Four short links: 12 April 2013.

Debuggex [Emacs Alternative, Others?]

Filed under: Regex,Regexes — Patrick Durusau @ 1:28 pm

Debuggex: A visual regex helper

Regexes (regular expressions) are a mainstay of data mining/extraction.

Debuggex is a regex debugger with visual cues to help you with writing/debugging regular expressions.

The webpage reports full JS regexes are not yet supported.

If you need a fuller alternative, consider debugging regexes in Emacs:

M-x regexp-builder

which shows matches as you type.

Be aware that regex languages vary (no real surprise).

One helpful resource: Regular Expression Flavor Comparison
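Another lightweight option, if Python is handy, is the re module's debug output, which dumps the compiled structure of a pattern (a small sketch; the exact dump format varies by Python version):

import re

# Compiling with re.DEBUG prints the parsed structure of the pattern to stdout,
# which is handy for spotting precedence or grouping mistakes.
pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})", re.DEBUG)

# re.VERBOSE lets you annotate the pattern itself.
dated = re.compile(r"""
    (?P<year>\d{4}) -   # four-digit year
    (?P<month>\d{2}) -  # two-digit month
    (?P<day>\d{2})      # two-digit day
""", re.VERBOSE)

print(dated.match("2013-04-16").groupdict())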

Linked Data and Law

Filed under: Law,Linked Data — Patrick Durusau @ 4:48 am

Linked Data and Law

A listing of linked data and law resources maintained by the Legal Informatics Blog.

Most recently updated to reflect the availability of the Library of Congress K Class – Law Classification – as linked data.

Law Classification Added to Library of Congress Linked Data Service

Filed under: Classification,Law,Linked Data — Patrick Durusau @ 4:39 am

Law Classification Added to Library of Congress Linked Data Service by Kevin Ford.

From the post:

The Library of Congress is pleased to make the K Class – Law Classification – and all its subclasses available as linked data from the LC Linked Data Service, ID.LOC.GOV. K Class joins the B, N, M, and Z Classes, which have been in beta release since June 2012. With about 2.2 million new resources added to ID.LOC.GOV, K Class is nearly eight times larger than the B, M, N, and Z Classes combined. It is four times larger than the Library of Congress Subject Headings (LCSH). If it is not the largest class, it is second only to the P Class (Literature) in the Library of Congress Classification (LCC) system.

We have also taken the opportunity to re-compute and reload the B, M, N, and Z classes in response to a few reported errors. Our gratitude to Caroline Arms for her work crawling through B, M, N, and Z and identifying a number of these issues.

Please explore the K Class for yourself at http://id.loc.gov/authorities/classification/K or all of the classes at http://id.loc.gov/authorities/classification.

The classification section of ID.LOC.GOV remains a beta offering. More work is needed not only to add the additional classes to the system but also to continue to work out issues with the data.

As always, your feedback is important and welcomed. Your contributions directly inform service enhancements. We are interested in all forms of constructive commentary on all topics related to ID. But we are particularly interested in how the data available from ID.LOC.GOV is used and continue to encourage the submission of use cases describing how the community would like to apply or repurpose the LCC data.

You can send comments or report any problems via the ID feedback form or ID listserv.

Not leisure reading for everyone but if you are interested, this is fascinating source material.

And an important source of information for potential associations between subjects.
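To poke at the data programmatically, something along these lines should work (content negotiation against ID.LOC.GOV is my assumption about the service; adjust the Accept header to whatever the site documents):

import requests

# A class in the K (Law) scheme; the identifier is taken from the post above.
uri = "http://id.loc.gov/authorities/classification/K"

# Ask for a machine-readable representation instead of the HTML page.
response = requests.get(uri, headers={"Accept": "application/rdf+xml"}, timeout=30)

print("status:", response.status_code)
print("content type:", response.headers.get("Content-Type"))
print(response.text[:500])  # first few hundred characters of the RDF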

I first saw this at: Ford: Law Classification Added to Library of Congress Linked Data Service.

Unknown Association Roles (TMDM specific)

Filed under: Associations,TMDM — Patrick Durusau @ 4:22 am

As I was pondering the treatment of nulls in Neo4j (Null Values in Neo4j), it occurred to me that we have something quite similar in the TMDM.

The definition of association items includes this language:

[roles]: A non-empty set of association role items. The association roles for all the topics that participate in this relationship.

I read this as saying that if I don’t know their role, I can’t include a known player in an association.

For example, I am modeling an association between two players in a phone conversation who are discussing a drone strike or a terrorist attack by other means.

I know their identities but I don’t know their respective roles in relationship to each other or in the planned attack.

I want to capture this association because I may have other associations where they are players where roles are known. Perhaps enabling me to infer possible roles in this association.

Newcomb has argued that roles in associations are unique and, in sum, constitute the type of the association. I appreciate the subtlety and usefulness of that position, but it isn’t a universal model for associations.

By the same token, the TMDM restricts associations to cases where all roles are known. Given that roles are often unknown, that also isn’t a universal model for associations.

I don’t think the problem can be solved by an “unknown role” topic because that would merge unknown roles across associations.

My preference would be to allow players to appear in associations without roles.

Where the lack of a role prevents the default merging of associations. That is, all unknown roles are presumed to be unique.
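A sketch of what that could look like outside any TMDM machinery: associations as sets of (role, player) pairs, where a missing role gets a fresh placeholder so it never merges with another unknown (purely illustrative, not a proposed syntax):

import itertools

_unknown_counter = itertools.count()

def role_or_placeholder(role):
    # Each missing role gets its own placeholder, so two associations that both
    # lack role information are never treated as the same association.
    return role if role is not None else f"__unknown_{next(_unknown_counter)}__"

def association(assoc_type, *role_player_pairs):
    # Represent an association as its type plus a frozen set of (role, player) pairs.
    return (assoc_type, frozenset(
        (role_or_placeholder(role), player) for role, player in role_player_pairs
    ))

# Two phone-call associations with the same players but unknown roles.
a1 = association("phone-call", (None, "caller-x"), (None, "caller-y"))
a2 = association("phone-call", (None, "caller-x"), (None, "caller-y"))

# With known roles, identical associations compare equal and would merge.
b1 = association("phone-call", ("caller", "caller-x"), ("callee", "caller-y"))
b2 = association("phone-call", ("caller", "caller-x"), ("callee", "caller-y"))

print("unknown roles merge?", a1 == a2)   # False: unknown roles stay unique
print("known roles merge?  ", b1 == b2)   # True: duplicates collapse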

Suggestions?

April 12, 2013

A First Encounter with Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 6:41 pm

A First Encounter with Machine Learning (PDF) by Max Welling, Professor at University of California, Irvine.

From the preface:

In winter quarter 2007 I taught an undergraduate course in machine learning at UC Irvine. While I had been teaching machine learning at a graduate level, it soon became clear that teaching the same material to an undergraduate class was a whole new challenge. Much of machine learning is built upon concepts from mathematics such as partial derivatives, eigenvalue decompositions, multivariate probability densities and so on. I quickly found that these concepts could not be taken for granted at an undergraduate level. The situation was aggravated by the lack of a suitable textbook. Excellent textbooks do exist for this field, but I found all of them to be too technical for a first encounter with machine learning. This experience led me to believe there was a genuine need for a simple, intuitive introduction into the concepts of machine learning. A first read to whet the appetite, so to speak, a prelude to the more technical and advanced textbooks. Hence, the book you see before you is meant for those starting out in the field who need a simple, intuitive explanation of some of the most useful algorithms that our field has to offer.

This looks like a fun read!

Although I think an intuitive approach may be more important in its own right than as a prelude to more technical explanations.

In part because the machinery of technical explanations, and its use, may obscure fundamental “meta-questions” that are important.

For example, in Jeremy Kun’s Homology series, which I strongly recommend, the technical side of homology isn’t going to prepare a student to ask questions like:

How did data collection impact the features of the data now subject to homology calculations?

How did the modeling of features impact the outcome of homology calculations?

What features are missing that could impact the findings from homology calculations?

Persistent homology is important, but however well you learn the rules for its use, those rules won’t answer the meta-questions about that use.

An intuitive understanding of the technique and its limitations is as important as learning the latest computational details.

I first saw this at: Introductory Machine Learning Textbook by Ryan Swanstrom.

Visual Computing: Geometry, Graphics, and Vision (source code)

Filed under: Geometry,Graphics,Programming — Patrick Durusau @ 4:48 pm

Frank Nielsen blogged today that he had posted the C++ source code for “Visual Computing: Geometry, Graphics, and Vision.”

See: Source codes for all chapters of “Visual Computing: Geometry, Graphics, and Vision”

New demos are reported to be on the way.

…Apache HBase REST Interface, Part 2

Filed under: HBase,JSON,XML — Patrick Durusau @ 4:29 pm

How-to: Use the Apache HBase REST Interface, Part 2 by Jesse Anderson.

From the post:

This how-to is the second in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 below will show you how to insert multiple rows at once using XML and JSON. The full code samples can be found on GitHub.

Only fair to cover both XML and TBL’s new favorite, JSON. (Tim Berners-Lee Renounces XML?)
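For flavor, here is a minimal Python sketch of a JSON batch insert against the HBase REST server. The base64-encoded row/cell format is how I recall the interface working; Jesse's post and the GitHub samples are the authoritative versions:

import base64
import json
import requests

def b64(value):
    # The HBase REST interface expects row keys, columns and values base64-encoded.
    return base64.b64encode(value.encode("utf-8")).decode("ascii")

rows = {
    "Row": [
        {
            "key": b64(f"row{i}"),
            "Cell": [{"column": b64("cf:message"), "$": b64(f"hello {i}")}],
        }
        for i in range(3)
    ]
}

# Assumes an HBase REST server on localhost:8080 and a table 'messages' with family 'cf'.
url = "http://localhost:8080/messages/fake-row-key"  # the row in the URL is ignored for batch puts
response = requests.post(url, data=json.dumps(rows),
                         headers={"Content-Type": "application/json",
                                  "Accept": "application/json"})
print(response.status_code)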
