Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 7, 2011

logstash

Filed under: Data Mining — Patrick Durusau @ 7:06 am

logstash

From the website:

logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use (like, for searching). Speaking of searching, logstash comes with a web interface for searching and drilling into all of your logs.

I mention this for two reasons:

First, obviously, as a tool for mining/searching logs. Deciding what subjects in a log will later appear in a topic map starts with discovering those subjects.

Secondly, perhaps less obviously, adding subject identity to events discovered in logs could enable mapping across logs, say, for example, if you were mining TCP/IP packet traffic.

Can’t imagine why anyone would be sitting on or near a big switch doing that ;-) but just to cover all the edge cases.

If you filtered out all the known porn site and search engine traffic, both of which are large but knowable lists, the amount of stuff you have to process starts to look pretty manageable.

Does anyone know the ratio of porn/search to other traffic into the Pentagon? Or Congress? Just curious if there is a useful baseline.

March 6, 2011

An Introduction To The Scala Programming Language by Bill Venners – Webinar

Filed under: Scala — Patrick Durusau @ 3:33 pm

An Introduction To The Scala Programming Language by Bill Venners

From the post:

As those who know me will most definitely know, I have been dabbling with functional programming again. At first with F# and now with Scala. Just thought I’d share this webinar about it. Now I am starting to get it; back in university I never quite got what the deal was and found it very hard to comprehend.

Somewhat dated (2008) but interesting background and idea material. Does have some examples.

Processing

Filed under: Graphs,Processing,Visualization — Patrick Durusau @ 3:32 pm

Processing

From the website:

Processing is an open source programming language and environment for people who want to create images, animations, and interactions. Initially developed to serve as a software sketchbook and to teach fundamentals of computer programming within a visual context, Processing also has evolved into a tool for generating finished professional work. Today, there are tens of thousands of students, artists, designers, researchers, and hobbyists who use Processing for learning, prototyping, and production.

  • Free to download and open source
  • Interactive programs using 2D, 3D or PDF output
  • OpenGL integration for accelerated 3D
  • For GNU/Linux, Mac OS X, and Windows
  • Projects run online or as double-clickable applications
  • Over 100 libraries extend the software into sound, video, computer vision, and more…

With advances in graph databases, visualization techniques, availability of data, etc., now is a great time to be working with topic maps.

Scala + Processing – an entertaining way to learn a new language – Post

Filed under: Processing,Scala,Visualization — Patrick Durusau @ 3:32 pm

Scala + Processing – an entertaining way to learn a new language

From the post:

If you’ve read a book about some new technology it doesn’t necessarily mean that you learned or even understood it. Without practice your newly acquired knowledge will vanish soon. That’s why doing exercises from the book you are reading is important.

But all those examples are usually boring. Of course you can start your own pet project to master your skills. Several months ago to learn Scala I started my little command line tool which semi-worked at the end and I gave up on it. So, in a month or so I had to google syntax of “for loop”…

That’s where I decided that I should start writing simple examples for different Scala features that must be fun. Here’s where Processing comes into play. Using it, every novice like me can turn dull exercises into visual installations. And later you can try advanced stuff like fractals, particle systems or data visualisation.

You might be wondering what the hell is Scala. It’s a relatively new and extremely cool programming language. You can read more about it on Wikipedia or on official web site.

Processing, in case you haven’t heard, is a graphics language/environment. It has a great deal of potential for topic maps and their representations.
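To make the pairing concrete, here is a minimal sketch of driving Processing from Scala. It is my own toy, assuming Processing’s core.jar (the 1.x/2.x API, where size() is called from setup()) is on the classpath:

  import processing.core.PApplet

  // A bouncing-dot sketch: Scala subclassing Processing's PApplet.
  class BounceSketch extends PApplet {
    var x  = 0f
    var dx = 3f

    override def setup(): Unit = {
      size(400, 200) // in Processing 1.x/2.x, size() belongs in setup()
    }

    override def draw(): Unit = {
      background(255)
      x += dx
      if (x < 0 || x > width) dx = -dx // reverse direction at the edges
      ellipse(x, height / 2f, 20, 20)
    }
  }

  object BounceSketch {
    def main(args: Array[String]): Unit =
      PApplet.main(Array("BounceSketch")) // Processing instantiates the sketch by class name
  }

Compile against core.jar and run BounceSketch: a window with a dot bouncing across it is the whole show, which is exactly the kind of instant feedback that makes this an entertaining way to practice a language.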

Gaussian Processes for Machine Learning

Filed under: Algorithms,Gaussian Processes,Machine Learning — Patrick Durusau @ 3:31 pm

Gaussian Processes for Machine Learning

Complete text of:

Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K. I. Williams, MIT Press, 2006. ISBN-10 0-262-18253-X, ISBN-13 978-0-262-18253-9.

I like the quote from James Clerk Maxwell that goes:

The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man’s mind.

Interesting. Is our identification of subjects probabilistic or is our identification of what we thought others meant probabilistic?

Or both? Neither?

From the preface:

Over the last decade there has been an explosion of work in the “kernel machines” area of machine learning. Probably the best known example of this is work on support vector machines, but during this period there has also been much activity concerning the application of Gaussian process models to machine learning tasks. The goal of this book is to provide a systematic and unified treatment of this area. Gaussian processes provide a principled, practical, probabilistic approach to learning in kernel machines. This gives advantages with respect to the interpretation of model predictions and provides a well founded framework for learning and model selection. Theoretical and practical developments over the last decade have made Gaussian processes a serious competitor for real supervised learning applications.
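For a taste of the book’s core machinery: given training inputs X with targets y, kernel matrix K = k(X, X), noise variance sigma_n^2, and a test input x_* with k_* = k(X, x_*), the standard GP regression predictive equations (Chapter 2 of Rasmussen and Williams) are

  \bar{f}_* = \mathbf{k}_*^\top \left( K + \sigma_n^2 I \right)^{-1} \mathbf{y}

  \mathbb{V}[f_*] = k(x_*, x_*) - \mathbf{k}_*^\top \left( K + \sigma_n^2 I \right)^{-1} \mathbf{k}_*

Every prediction carries its own variance, which is the “principled, probabilistic” part of the pitch.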

I am downloading the PDF version but have just ordered a copy from Amazon.

If you want to encourage MIT Press and other publishers to put materials online as well as in print, order a copy of this and other online materials.

Saying online copies don’t hurt print sales isn’t as convincing as hearing the cash register go “cha-ching!”

(I would also drop a note to the press saying you bought a copy of the online book as well.)

Genetic Algorithm Examples – Post

Filed under: Artificial Intelligence,Genetic Algorithms,Machine Learning — Patrick Durusau @ 3:31 pm

Genetic Algorithm Examples

From the post:

There’s been a lot of buzz recently on reddit and HN about genetic algorithms. Some impressive new demos have surfaced and I’d like to take this opportunity to review some of the cool things people have done with genetic algorithms, a fascinating subfield of evolutionary computing / machine learning (which is itself a part of the broader study of artificial intelligence (ah how academics love to classify things (and nest parentheses (especially computer scientists)))).

Interesting collection of examples of uses of genetic algorithms.

Posted here to provoke thinking about the use of genetic algorithms in topic maps.

See also the author’s tutorial: Genetic Algorithm For Hello World.
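In that spirit, here is a minimal “evolve a string” sketch in Scala (my own toy, not the tutorial’s code): genomes are random strings, fitness counts characters already in place, and each generation keeps the fitter half and breeds the rest by one-point crossover plus mutation.

  import scala.util.Random

  object HelloGA {
    val target   = "HELLO WORLD"
    val alphabet = ('A' to 'Z').mkString + " "
    val rng      = new Random()

    def randomGenome: String =
      Seq.fill(target.length)(alphabet(rng.nextInt(alphabet.length))).mkString

    // Fitness: number of characters already in the right position.
    def fitness(g: String): Int = g.zip(target).count { case (a, b) => a == b }

    def mutate(g: String): String =
      g.map(c => if (rng.nextDouble() < 0.05) alphabet(rng.nextInt(alphabet.length)) else c)

    def crossover(a: String, b: String): String = {
      val cut = rng.nextInt(a.length)
      a.take(cut) + b.drop(cut)
    }

    def main(args: Array[String]): Unit = {
      var pop = Vector.fill(100)(randomGenome)
      var gen = 0
      while (fitness(pop.maxBy(fitness)) < target.length) {
        val parents = pop.sortBy(g => -fitness(g)).take(50) // truncation selection
        pop = parents ++ Vector.fill(50)(
          mutate(crossover(parents(rng.nextInt(50)), parents(rng.nextInt(50)))))
        gen += 1
      }
      println("Reached '" + target + "' in " + gen + " generations")
    }
  }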

Have you used genetic algorithms with a topic map?

Appreciate a note if you have.

March 5, 2011

Keep an Eye on the emerging Open-Source Analytics Stack – Post

Filed under: Examples,Marketing — Patrick Durusau @ 3:34 pm

Keep an Eye on the emerging Open-Source Analytics Stack

David Smith’s summary captures the tone of the piece:

For the business user, the key takeaway is that this data analytics stack, built on commodity hardware and leading-edge open-source software, is a lower-cost, higher-value alternative to the existing status quo solutions offered by traditional vendors. Just a couple of years ago, these types of robust analytic capabilities were only available through major vendors. Today, the open-source community provides everything that the traditional vendors provide — and more. With open-source, you have choice, support, lower costs and faster cycles of innovation. The open-source analytics stack is more than a handy collection of interoperable tools — it’s an intelligence platform.

In that sense, the open-source analytics stack is genuinely revolutionary.

I use and promote the use of open source software so don’t take this as being anti-open source.

I think the jury is still out on the lower-cost question.

In part because the notion that anyone who can use a keyboard and an open source package is qualified to do BI will reap its own reward.

There was a rumor years ago that local bar associations actually sponsored the “How to Avoid Probate” kits, reasoning that self-help would only increase the eventual fees for qualified counsel.

Curious to see how much of the “lower cost” of open source software is absorbed by correcting amateurish mistakes (assuming they are even admitted).

Cassandra Data Model – Semantic Impedance

Filed under: Cassandra,NoSQL — Patrick Durusau @ 3:13 pm

WTF is a SuperColumn? An Intro to the Cassandra Data Model

A bit dated now but I thought some readers might find it useful.

From the posting:

If you’re coming from an RDBMS background (which is almost everyone) you’ll probably trip over some of the naming conventions while learning about Cassandra’s data model. It took me and my team members at Digg a couple days of talking things out before we “got it”. In recent weeks a bikeshed went down in the dev mailing list proposing a completely new naming scheme to alleviate some of the confusion. Throughout this discussion I kept thinking: “maybe if there were some decent examples out there people wouldn’t get so confused by the naming.” So, this is my stab at explaining Cassandra’s data model; It’s intended to help you get your feet wet & doesn’t go into every single detail but, hopefully, it helps clarify a few things.
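For the impatient, the post’s nested-map intuition fits in a few Scala type aliases. This is my own shorthand (ignoring timestamps, comparators and ordering), not Cassandra’s API:

  object CassandraModelSketch {
    // A column is just a name/value pair (real columns also carry a timestamp).
    type Columns           = Map[String, String]               // column name -> value
    type ColumnFamily      = Map[String, Columns]              // row key -> columns
    type SuperColumnFamily = Map[String, Map[String, Columns]] // row key -> super column -> columns

    val authors: ColumnFamily = Map(
      "pdurusau" -> Map("name" -> "Patrick Durusau", "interest" -> "topic maps")
    )

    // A SuperColumn is just one more level of nesting under the row key.
    val blogEntries: SuperColumnFamily = Map(
      "pdurusau" -> Map(
        "2011-03-05" -> Map("title" -> "Cassandra Data Model", "tag" -> "NoSQL")
      )
    )
  }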

Seems like I have heard about grouping sets of key/value pairs before but I will have to look for it. 😉

More seriously, the current wave of data sets only aggravates the known semantic impedance problem.

A wave of data sets that promises to only increase.

So semantic impedance is going to increase.

Semantic impedance can be:

  • ignored – most current stove-piped information systems
  • met with save-the-world semantic solutions – poor adoption rates
  • broken by self-interested mapping that is reusable – the topic maps solution

Procrastination Flowchart – Cross-Cultural?

Filed under: Humor — Patrick Durusau @ 2:58 pm

Procrastination Flowchart

Strictly for your weekend enjoyment!

Has anyone researched procrastination cross-culturally?

That would make an interesting topic map.

Would help with regard to expectations of productivity, etc.

Suggestions?

GraphLab

Filed under: GraphLab,Graphs,Machine Learning,Visualization — Patrick Durusau @ 2:51 pm

GraphLab

Progress on graph processing continues.

From the website:

A New Parallel Framework for Machine Learning

Designing and implementing efficient and provably correct parallel machine learning (ML) algorithms can be very challenging. Existing high-level parallel abstractions like MapReduce are often insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance.

The popular MapReduce abstraction is defined in two parts: a Map stage, which performs computation on independent problems that can be solved in isolation, and a Reduce stage, which combines the results.

GraphLab provides a similar analog to the Map in the form of an Update Function. The Update Function, however, is able to read and modify overlapping sets of data (program state) in a controlled fashion as defined by the user-provided data graph. The user-provided data graph represents the program state with arbitrary blocks of memory associated with each vertex and edge. In addition, update functions can be recursively triggered, with one update function spawning the application of update functions to other vertices in the graph, enabling dynamic iterative computation. GraphLab uses powerful scheduling primitives to control the order in which update functions are executed.

The GraphLab analog to Reduce is the Sync Operation. The Sync Operation also provides the ability to perform reductions in the background while other computation is running. Like the update function, sync operations can look at multiple records simultaneously, providing the ability to operate on larger dependent contexts.
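To see the shape of the idea, here is a deliberately toy PageRank-flavored update function. GraphLab itself is a C++ framework; the Scala names and the schedule callback below are illustrative only, not its API:

  // Illustrative only: an "update function" reads a vertex's neighborhood
  // (its scope), modifies the vertex, and may schedule further updates.
  case class Vertex(id: Int, var rank: Double, out: Seq[Int])

  def pageRankUpdate(v: Vertex, graph: Map[Int, Vertex], schedule: Int => Unit): Unit = {
    val incoming = graph.values.filter(_.out.contains(v.id))
    val newRank  = 0.15 + 0.85 * incoming.map(u => u.rank / u.out.size).sum
    // Dynamic iterative computation: if v changed enough, wake its dependents.
    if (math.abs(newRank - v.rank) > 1e-6) v.out.foreach(schedule)
    v.rank = newRank
  }

The scheduler decides the order in which scheduled vertices actually run, which is where the framework’s consistency guarantees earn their keep.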

See also: GraphLab: A New Framework For Parallel Machine Learning (The original paper.)

This is a project that bears close watching.

Gephi Workshop

Filed under: Gephi,Graphs,Visualization — Patrick Durusau @ 2:14 pm

Gephi Workshop 23 March 2011

It’s events like this that make me wish I were on the West Coast.

Even so, there are a number of resources listed for those of us who cannot attend.

From the website:

The next Gephi Workshop will be on Wednesday, March 23rd at 1PM at the IC classroom in Green Library.

I’ll occasionally be able to provide two-hour workshops on the basics of using Gephi, the network analysis package with which I’ve made the images and videos below. The workshops will focus on:

  • getting graph data into Gephi using .gexf, .csv and database connections
  • running Filters, Analytics and Layouts on the data
  • optimization of Gephi for large datasets
  • overview of layout algorithms and strategies for their use
  • creating dynamic (time-enabled) networks
  • general Q&A

March 4, 2011

Your Help Needed: the Effect of Aesthetics on Visualization – Post

Filed under: Graphical Models,Graphics,Visualization — Patrick Durusau @ 4:09 pm

Your Help Needed: the Effect of Aesthetics on Visualization

Your opportunity to make a contribution to the study of visualization!

From the website:

We have just launched an online study on measuring the effect of aesthetics in data visualization. If you have about 10-20 minutes of uninterrupted time, please head over to Aesthetic Impact [aesthetic-impact.com] and take part in our online study. The main task that will be expected from you is to interact with a visualization, and describe what you have learned from it.

The study is not only meant for visualization fanatics, so please pass around the URL to any person who might be interested in participating. The only thing you need to know is that the study is less about usability, utility or usefulness, and more about measuring what non-trivial and unexpected insights you actually ‘get’ from interacting with a specific data representation.

As communicating insight is the main reason for any interactive visualization, we think that measuring this aspect has become really important. Yet, we require the help of many ‘users’ to be able to say something meaningful…

Learning to classify text using support vector machines

Filed under: Classifier,Machine Learning,Vectors — Patrick Durusau @ 3:58 pm

I saw a tweet recently that pointed to: Learning to classify text using support vector machines, which is Thorsten Joachims’ dissertation, The Maximum-Margin Approach to Learning Text Classifiers, as published by Kluwer (not Springer).

Of possibly greater interest would be Joachims’ more recent work, found at his homepage, which includes software from his dissertation as well as more recent projects.

I am sure his dissertation will repay close study but at > $150 U.S., I am going to have to wait for a library ILL to find its way to me.

Castles Made of Sand or Blowing in the Wind?

Filed under: Ontology — Patrick Durusau @ 3:30 pm

Products Types Ontology

I haven’t covered the 300,000 product descriptions offered by this site because I could not choose a blog title between “castles made of sand” and “blowing in the wind.”

From the webpage:

Your idea sucks: What you call an ontology is no ontology, because it lacks an axiomatic theory.

First, this is not a question but a statement. Second, yes, you are absolutely right: Besides the rdfs:subClassOf axiom, we don’t have any formal semantics for each class. Third: Your ontology lacks

  • social grounding (ours: constant challenging by millions of reviews and revisions),
  • ….

The line: social grounding (ours: constant challenging by millions of reviews and revisions) captures the problem doesn’t it?

Your classes are constantly changing, so I won’t know if your class tomorrow means the same thing as when I used it today. (Hence, the “castles made of sand” line as a possible header.)

Yes?

But, social grounding is at work on both ends, that is my use of an identification has a social grounding.

So we have an uncertain/changing meaning of your classes being applied to an equally uncertain/changing meaning in my application of your class. (Hence, the “blowing in the wind” line as a possible header.)

There is other information about each of the “300,000” (is that a possible movie title?) classes, but we don’t know what information has to match to identify a particular class. Or to tell others why we used one class and not another.

Appreciate the social grounding but identifiers without more leave sand moving under our feet and don’t enable us to make meaningful statements about our choices of identifiers.

*****

PS: Show of hands your preference for “Castles Made of Sand” or “Blowing in the Wind” as a title.

PPS: Best of luck with the axiomatic critics. Axioms are all they have. How does that go, “…tis an ill-favored thing, Sir, but mine own”? Something like that.

Table competition at ICDAR 2011

Filed under: Dataset,Subject Identity — Patrick Durusau @ 10:40 am

I first noticed this item at Mathew Hurst’s blog Table Competition at ICDAR 2011.

As a markup person with some passing familiarity with table encoding issues, this is just awesome!

Update: Competition registration closes March 10, 2011. Registration consists of expressing interest in competing, by email, to the competition organisers.

The basic description is OK:

Motivation: Tables are a prominent element of communication in documents, often containing information that would take many a paragraph to write otherwise. The first step to table understanding is to draw the table’s physical model, i.e. identify its location and component cells, rows and columns. Several authors have dedicated themselves to these tasks, using diverse methods; however, it is difficult to know which methods work best under which circumstances because of the diverse testing conditions used by each. This competition aims at addressing this lacuna in our field.

Tasks: This competition will involve two independent sub-competitions. Authors may choose to compete for one task or the other or both.

1. Table location sub-competition:

This task consists of identifying which lines in the document belong to one same table area or not;

2. Table segmentation sub-competition:

This task consists of identifying which column the cells of each table belong to, i.e. identifying which cells belong to one same column. Each cell should be attributed a start and end column index (which will be different from each other for spanning cells). Identifying row spanning cells is not relevant for this competition.
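To make the location task concrete, a naive baseline (mine, and nowhere near competition grade) might flag ASCII lines that split into three or more cells on runs of two-plus spaces, then group adjacent flagged lines into candidate table regions:

  // Naive baseline: flag lines that split into 3+ cells on runs of 2+ spaces.
  def looksTabular(line: String): Boolean =
    line.trim.split(" {2,}").count(_.nonEmpty) >= 3

  // Group adjacent flagged lines into (start, end) line-index regions.
  def tableRegions(lines: IndexedSeq[String]): Seq[(Int, Int)] = {
    val flagged = lines.map(looksTabular)
    val starts  = flagged.indices.filter(i => flagged(i) && (i == 0 || !flagged(i - 1)))
    starts.map(s => (s, (s until flagged.length).takeWhile(flagged).last))
  }

Spanning headers, footnotes and two-column page layouts will defeat this immediately, which is rather the point of holding a competition.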

But what I think will excite markup folks (and possibly topic map advocates) is the description of the data sets:

Description of the datasets: We have gathered 22 PDF financial statements. Our documents have lengths varying between 13 and 235 pages with very diverse page layouts, for example, pages can be organised in one or two columns and page headers and footers are included; each document contains between 3 and 162 tables. In Appendix A, we present some examples of pages in our dataset with tables that we consider hard to locate or segment. We randomly chose 19 documents for training and 3 for validation; our tougher cases turned out to be in the training set.

We then converted all files to ASCII using the pdftotext Linux utility (Red Hat Linux 7.2 (Enigma), October 22, 2001, Linux 2.4.7-10, pdftotext version 0.92, copyright 1996-2000 Derek B. Noonburg). As a result of the conversion, each line of each document became a line of ASCII, which when imported into a database becomes a record in a relational table. Apart from this, we collected an extra 19 PDF financial statements to form the test set; these were converted into ASCII using the same tool as the training set.

Table 1 underneath shows the resulting dimensions of the datasets and how they compare to those used by other authors (Wang et al. (2002)’s tables were automatically generated and Pinto et al. (2003)’s belong to the same government statistics website). The sizes of the datasets in other papers are not distant from ours. An exception would be Cafarella et al. (2008), who created the first large repository of HTML tables, with 154 million tables. These consist of non-marked up HTML tables detected using Wang and Hu (2002)’s algorithm, which is naturally subject to mistakes.

We have then manually created the ground-truth for this data, which involved: a) identifying which lines belong to tables and which do not; b) for each line, identifying how it should be clipped into cells; c) for each cell, identifying which table column it belongs to.

Whether you choose to compete or not, this should prove to be very interesting.

Sorry, left off the dates from the original post:

Important dates:

  • February 26, 2011 Training set is made available on the Competition Website
  • March 10, 2011 Competition registration, which consists of expressing interest in competing, by email, to the competition organisers
  • May 13, 2011 Validation set is made available on the Competition Website
  • May 15, 2011 Submission of results by competitors, which should be executable files; if at all impossible, the test data will be given out to competitors, but results must be submitted within no more than one hour (negotiable)
  • June 15, 2011 Submission of summary paper for ICDAR’s proceedings, already including the identification of the competition’s winner
  • September, 2011 Test set is made available on the Competition Website
  • September, 2011 Announcement of the results will be made during ICDAR’2011, the competition session

ApacheCon NA 2011

Filed under: Cassandra,Cloud Computing,Conferences,CouchDB,HBase,Lucene,Mahout,Solr — Patrick Durusau @ 7:17 am

ApacheCon NA 2011

Proposals: Be sure to submit your proposal no later than Friday, 29 April 2011 at midnight Pacific Time.

7-11 November 2011, Vancouver

From the website:

This year’s conference theme is “Open Source Enterprise Solutions, Cloud Computing, and Community Leadership”, featuring dozens of highly-relevant technical, business, and community-focused sessions aimed at beginner, intermediate, and expert audiences that demonstrate specific professional problems and real-world solutions that focus on “Apache and …”:

  • … Enterprise Solutions (from ActiveMQ to Axis2 to ServiceMix, OFBiz to Chemistry, the gang’s all here!)
  • … Cloud Computing (Hadoop, Cassandra, HBase, CouchDB, and friends)
  • … Emerging Technologies + Innovation (Incubating projects such as Libcloud, Stonehenge, and Wookie)
  • … Community Leadership (mentoring and meritocracy, GSoC and related initiatives)
  • … Data Handling, Search + Analytics (Lucene, Solr, Mahout, OODT, Hive and friends)
  • … Pervasive Computing (Felix/OSGi, Tomcat, MyFaces Trinidad, and friends)
  • … Servers, Infrastructure + Tools (HTTP Server, SpamAssassin, Geronimo, Sling, Wicket and friends)

Berlin Buzzwords 2011

Filed under: Conferences,NoSQL,Topic Maps — Patrick Durusau @ 7:02 am

Berlin Buzzwords 2011

What a great name for a conference!

Extended Deadline: Sunday, March 6th at midnight MST

From the website:

Berlin Buzzwords 2011 is a conference for developers and users of open source software projects, focussing on the issues of scalable search, data-analysis in the cloud and NoSQL-databases. Berlin Buzzwords presents more than 30 talks and presentations of international speakers specific to the three tags “search”, “store” and “scale”.

Would be nice to have at least one or two topic map entries under search, if not one of the other terms.

Metaoptimize Q+A

Metaoptimize Q+A is one of the Q/A sites I just stumbled across.

From the website:

A community of scientists interested in machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization, as well as adjacent topics.

Looks like an interesting place to hang out.

Third Cross Validated Journal Club

Filed under: Data Mining,Statistics — Patrick Durusau @ 6:08 am

Third Cross Validated Journal Club

From the posting:

  • CVJC is a whole day meeting on chat where we discuss some paper and its theoretical/practical surroundings.
  • As mentioned above, the event is whole-day (00:00-23:59 UTC), but there are three meet-up sessions at 1:00, 9:00 and 16:00 UTC at which most talking takes place; they are spread over the day to put at least one CVJC session in reach regardless of time zone.
  • The paper must be OpenAccess or a (p)reprint suggested previously on a meta thread like this one and selected in voting.
  • I would try to invite the author (it worked last time).

See the posting for the proposal for the next Cross Validated meeting date and discussion material.

Thinking something like this could be of interest to the topic maps community.

Cross Validated

Filed under: Data Mining,Statistics,Visualization — Patrick Durusau @ 5:58 am

Cross Validated

From the website:

This is a collaboratively edited question and answer site for statisticians, data analysts, data miners and data visualization experts. It’s 100% free, no registration required.

This is one of a series of such Q/A sites that I am going to be listing as of possible interest to the topic maps community.

User Interface Design

Filed under: Interface Research/Design — Patrick Durusau @ 5:55 am

User Interface Design

From the website:

Many technological innovations rely upon User Interface Design to elevate their technical complexity to a usable product. Technology alone may not win user acceptance and subsequent marketability. The User Experience, or how the user experiences the end product, is the key to acceptance. And that is where User Interface Design enters the design process. While product engineers focus on the technology, usability specialists focus on the user interface. For greatest efficiency and cost effectiveness, this working relationship should be maintained from the start of a project to its rollout.

When applied to computer software, User Interface Design is also known as Human-Computer Interaction or HCI. While people often think of Interface Design in terms of computers, it also refers to many products where the user interacts with controls or displays. Military aircraft, vehicles, airports, audio equipment, and computer peripherals, are a few products that extensively apply User Interface Design.

Optimized User Interface Design requires a systematic approach to the design process. But, to ensure optimum performance, Usability Testing is required. This empirical testing permits naïve users to provide data about what does work as anticipated and what does not work. Only after the resulting repairs are made can a product be deemed to have a user optimized interface.

The importance of good User Interface Design can be the difference between product acceptance and rejection in the marketplace. If end-users feel it is not easy to learn, not easy to use, or too cumbersome, an otherwise excellent product could fail. Good User Interface Design can make a product easy to understand and use, which results in greater user acceptance.

Caveat: I know nothing about this company or its services other than what I read on the website.

I am listing their site because of the wealth of materials they have gathered together on this important area for topic map authors.

OrientDB v0.9.25 & beyond!

Filed under: NoSQL,OrientDB — Patrick Durusau @ 5:53 am

OrientDB v0.9.25 has been released!

Features include:

  • Brand new memory model with level-1 and level-2 caches (Issue #242)
  • SQL prepared statement (Issue #49)
  • SQL Projections with the support of links (Issue #15)
  • Graphical editor for documents in OrientDB Studio app (Issue #217)
  • Graph representation in OrientDB Studio app
  • Support for JPA annotation by the Object Database interface (Issue #102)
  • Smart Console under bash: history, auto completion, etc. (Issue #228)
  • Operations to work with GEO-spatial points (Issue #182)
  • @rid support in SQL UPDATE statement (Issue #72)
  • Range queries against Indexes (Issue #231)
  • 100% support of TinkerPop Blueprints 0.5

Even more good news: 1.0RC1 is planned for April 2011.

Shark: Machine Learning Library

Filed under: Machine Learning — Patrick Durusau @ 5:51 am

Shark: Machine Learning Library

From the website:

SHARK is a modular C++ library for the design and optimization of adaptive systems. It provides methods for linear and nonlinear optimization, in particular evolutionary and gradient-based algorithms, kernel-based learning algorithms and neural networks, and various other machine learning techniques.

SHARK serves as a toolbox to support real world applications as well as research in different domains of computational intelligence and machine learning. The sources are compatible with the following platforms: Linux, Windows, Solaris and MacOS X.

Benchmark: Python Machine Learning – Post

Filed under: Dataset,Machine Learning — Patrick Durusau @ 5:49 am

Benchmark for several Python machine learning packages

From the website:

We compare computation time for a few algorithms implemented in the major machine learning toolkits accessible in Python. We use the Madelon data set (Guyon 2004), 4400 instances and 500 attributes, which can be used in supervised and unsupervised settings and is quite large, but small enough for most algorithms to run.

Useful site for a couple of reasons:

1) A cross-check to make sure I have some of the major Python machine learning packages listed.

2) Another reminder that we don’t have similar test sets of data for topic maps.

The first one I can check and remedy fairly quickly.

The second one is going to take more thought, planning and mostly effort. 😉

Suggestions/comments?

March 3, 2011

Baking a topic map (err, I mean bread)

Filed under: Authoring Topic Maps,Examples,Topic Maps — Patrick Durusau @ 1:49 pm

Benjamin Bock asked last week about how to topic map ingredients (and their measures) as well as the order of steps in a recipe.

I can’t give you a complete answer in one post (or even in several) but I can highlight some of the issues and possible solutions.

First, we need a recipe. I will be using the basic bread recipe, from the Artisan Bread in 5 Minutes a Day site, which lists the following ingredients:

  • 3 1/2 cups lukewarm water
  • 4 teaspoons active dry yeast
  • 4 teaspoons coarse salt
  • 7 1/4 cups (2 lb. 4 oz.; 1027.67 grams) unbleached all-purpose flour (measure using scoop and sweep method)

That’s right. Carol has been teaching me to cook and I really enjoy baking bread.

If it is a good day, call ahead and I am likely to have fresh bread out of the oven within minutes of your arrival.

Anyway, at first blush, this looks easy. After all, people have been passing recipes along for thousands of years.

Second look, not so easy.

First try at baking the topic map

The recipe itself has a name, Master Artisan Bread Recipe.

That looks like a promising place to start, we have a recipe, it has a name and from what we read above, some ingredients.

We could simply create a topic for the recipe, record its name and include the ingredients as occurrences, of type ingredient.

After all, since we can search for strings across the topic map, it won’t be hard to find recipes with yeast, flour, etc., whatever ingredient we want.

And that would be a perfectly valid topic map.

Well, except that you or I may want to say something about the yeast, as a subject. Could be which brand to use, etc.

Could simply stuff that information into the occurrence but topic maps have a better solution.

Second try at baking the topic map

Isn’t there a hint in the way we have passed recipes down for years about how we should represent them in a topic map?

That is, each ingredient, more or less, stands on its own. We can talk about each one and often measure them all out before starting.

What if we represented each ingredient as a subject, that is with a topic?

And we represent their relationships to the recipe, remember Master Artisan Bread Recipe?, with an ingredient_of association. (Stolen shamelessly from Sam Hunting’s chapter, How to Start Topic Mapping Right Away with the XTM Specification, in XML Topic Maps, ed. by Jack Park and Sam Hunting.)

Oh, err, one thing: how do I get from 3 1/2 cups lukewarm water to water as a subject in an ingredient_of association?

That wasn’t explained very well. 😉

Third try at baking the topic map

Err, hmmm, yes (stalling for time),

Well, let’s break the water subject out and see if we can establish some principles for a solution that works for the other ingredients.

The measurement, 3 1/2 cups, and the temperature, lukewarm, do not affect the subject identity of the water. But the first establishes a particular, set-aside amount of water, and the second defines a temperature for that set-aside portion.

At its core the problem is that we would prefer to talk about water as an ingredient and to not have to use 3 1/2 cups as part of its identity.

That is, how would your topic map look with an ingredient_of association between a recipe and 3 1/2 cups of water?

Would your 3 1/2 cups of water only merge with other 3 1/2 cups of water topics in other recipes?

That sounds like a bad plan.

Fourth try at baking the topic map

Let’s think about this for a moment.

We want ingredients as subjects so we can say things about them. We also want to record the amount or some condition of an ingredient as part of the recipe.

One workaround, not necessarily a good one (discussion, please!), would be to model the recipe – ingredient association as a three-role relationship:

  • recipe
  • ingredient
  • measure_condition

That breaks out the measurement or condition of the ingredient as a separate subject. It also dodges some fairly complicated issues with regard to measurement but those are probably not critical to a bread recipe anyway.
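To make the three-role idea concrete in code (not XTM syntax; the type and identifier names are my own invention), a minimal in-memory sketch:

  object RecipeSketch {
    // Hypothetical model, not a standard topic map API.
    case class Topic(id: String, name: String)
    case class Role(roleType: Topic, player: Topic)
    case class Association(assocType: Topic, roles: Set[Role])

    val recipeRole     = Topic("t-recipe", "recipe")
    val ingredientRole = Topic("t-ingredient", "ingredient")
    val measureRole    = Topic("t-measure", "measure_condition")
    val ingredientOf   = Topic("t-ingredient-of", "ingredient_of")

    val bread  = Topic("t-bread-recipe", "Master Artisan Bread Recipe")
    val water  = Topic("t-water", "water") // identity independent of any amount
    val amount = Topic("t-water-measure", "3 1/2 cups, lukewarm")

    // water stays a clean, mergeable subject; the measure plays its own role.
    val waterInBread = Association(ingredientOf, Set(
      Role(recipeRole, bread),
      Role(ingredientRole, water),
      Role(measureRole, amount)
    ))
  }

Merging now happens on water, not on “3 1/2 cups of water,” and anything we want to say about the measurement hangs off its own topic.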

Oh, sorry, did not mean to avoid answering Benjamin’s question about ordering steps in the recipe.

Did you know that when practicing my typing in grade school I duplicated my mother’s recipes and then discarded the originals?

I also left off the steps then. Had the amounts and ingredients, but no steps. 😉

She took it good-naturedly enough but declined my further help where the recipe box was concerned.

I promise I won’t repeat that error but I won’t reach the step question today.

Besides, I am interested to hear what you think about the recipe illustration so far.

Understand that I need to include syntax but thought I would do that in the next post, before I get to the steps question.

MongoDB 1.8 Released!

Filed under: MongoDB,NoSQL — Patrick Durusau @ 10:12 am

MongoDB 1.8 Released

Release notes for MongoDB 1.8.

Incremental map/reduce is supported, enabling incremental updating of collections.
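As a sketch of what that enables, here is the 2011-era MongoDB Java driver used from Scala (the collection and field names are mine): map/reduce output can be folded into an existing collection with OutputType.REDUCE, so each run only needs to process new documents.

  import com.mongodb.{Mongo, BasicDBObject, MapReduceCommand}

  val events  = new Mongo().getDB("logs").getCollection("events")
  val lastRun = new java.util.Date(System.currentTimeMillis() - 3600 * 1000L)

  val map    = "function() { emit(this.host, 1); }"
  val reduce = "function(key, values) { return Array.sum(values); }"

  // REDUCE folds results for new events into the existing counts collection.
  val cmd = new MapReduceCommand(events, map, reduce, "event_counts",
    MapReduceCommand.OutputType.REDUCE,
    new BasicDBObject("ts", new BasicDBObject("$gt", lastRun)))

  events.mapReduce(cmd)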

Reminds me to ask about incremental updating of topic maps.

Introduction to MongoDB – Post

Filed under: MongoDB,NoSQL — Patrick Durusau @ 10:06 am

Introduction to MongoDB

If you want a quick introduction to MongoDB with some light examples, this post is for you.

Enterprise Integration Patterns

Filed under: Enterprise Integration — Patrick Durusau @ 10:04 am

Enterprise Integration Patterns

Just an illustration of the breadth of meaning that the term integration has in a modern IT context.

I have ordered a copy of the book because I wasn’t overly impressed with the message patterns on the website.

It is necessary to document patterns such as Publish-Subscribe Channel but I would not be holding my breath for the applause.

More news to follow.

Real-Time Log Processing System based on Flume and Cassandra – Post

Filed under: Cassandra,Flume,NoSQL — Patrick Durusau @ 10:01 am

Real-Time Log Processing System based on Flume and Cassandra

Very cool!

What would be even cooler would be to have real-time associations with subjects that have information from outside the data set.

Or better yet, real-time on-demand associations with subjects that have information from outside the data set.

I suppose the classic use case would be running stats on all the sports events on a Saturday or Sunday, including individual stats and merging in the latest doping, paternity and similar tests.

Other applications?

Wikipedia Page Traffic Statistics Dataset

Filed under: Dataset,Topic Maps,Uncategorized — Patrick Durusau @ 9:46 am

Wikipedia Page Traffic Statistics Dataset

Data Wrangling reports a 320 GB sample data set of Wikipedia traffic.

Thoughts on similar sample data sets for topic maps?

Sizes, subjects, complexity?
