Archive for January, 2011

Introduction to Riak Video with Rusty Klophaus – Post

Monday, January 31st, 2011

Introduction to Riak Video with Rusty Klophaus from MyNoSQL by Alex Popescu. Viewable online or downloadable in a couple of formats.

Starts with the observation that there are 47 different NoSQL projects. Doesn’t list them. 😉

I would watch this at the PivotLabs link because of the related talks.

Oh, Riak homepage.

While I like the video, it is also an example that you don’t need high-end video production or editing to produce useful video of presentations.

I mention this as an answer to conferences that protest they need expensive equipment to video presentations.

That is simply not the case and anyone who says otherwise, to be generous, is misinformed.

MongoVUE

Monday, January 31st, 2011

MongoVUE

From the website:

What Is MongoVUE?

  • MongoVUE is a GUI (graphical user interface) application that helps you administer, develop and learn MongoDB.
  • MongoVUE is FREE to use.
  • To run properly, MongoVUE requires Microsoft .NET Framework 2.0 SP1 installed on your computer.

Tools for working with NoSQL databases are starting to appear.

Any thoughts on this one that you would like to share?

Applicatives are generalized functors

Monday, January 31st, 2011

Applicatives are generalized functors

A continuation of Heiko Seeberger’s coverage of Scala and category theory.

Highly recommended.

Redis Tutorials

Monday, January 31st, 2011

Redis Tutorials

As of 01-31-2011 you will find at DeGizmo:

  1. Getting Started: Redis and Python
  2. Redis: Relations in a NoSQL world
  3. Redis: Relations in a NoSQL world: Using Hashes
  4. (To-Do) – Redis on the web: Redis and Django
  5. (To-Do) – Redis and Express: Ultra-fast REST API

For topic mappers who are interested in Python and Redis, not a bad place to start.


Update: 23 November 2011.

As of today, parts 4 and 5 remain to-do items.

Who Identified Roger Magoulas?

Monday, January 31st, 2011

Did you know that Roger Magoulas appears 28 times on the O’Reilly website? (as of 01-29-2010)

With the following 5 hyperlink texts:

Can you name the year that Tim O’Reilly used a hyperlink for Roger Magoulas three times but hasn’t since then?

One consistent resolution for Roger Magoulas, reflecting updates and presented without hand-authoring HTML, would be nice.

But, that’s just me.

What do you think?

Pseudo-Code: A New Definition

Monday, January 31st, 2011

How to Speed up Machine Learning using a Set-Oriented Approach

The detailed article behind Need faster machine learning? Take a set-oriented approach, which I mentioned in a separate post.

Well, somewhat more detail.

Gives new meaning to pseudo-code:

The application side becomes:

Computing the model:

Fetch “compute-model over data items”

Classifying new items:

Fetch “classify over data items”

I am reminded of the cartoon with two people at a blackboard where the text reads “Then a miracle occurs,” and one of them says: “I think you should be more explicit in step two.”

How about you?
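
For a sense of what sits behind a one-line “fetch,” here is a minimal sketch of a set-oriented classifier: the model is one SQL aggregate over the whole training set, and classification is one join over all new items, with no per-record loop doing the scoring. The table names, the sample data, and the crude word-count scoring rule are my own assumptions for illustration, not the method described in the article.

```python
import sqlite3

# Hypothetical training data: (item_id, label, word). Not from the article.
TRAINING = [
    (1, "spam", "free"), (1, "spam", "offer"),
    (2, "spam", "free"), (2, "spam", "winner"),
    (3, "ham", "meeting"), (3, "ham", "agenda"),
    (4, "ham", "meeting"), (4, "ham", "free"),
]
# Hypothetical unlabeled items: (item_id, word).
NEW_ITEMS = [(10, "free"), (10, "winner"), (11, "meeting"), (11, "agenda")]


def classify_set_oriented(training, new_items):
    """Score every new item against every label with set operations."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE training (item_id INT, label TEXT, word TEXT)")
    db.execute("CREATE TABLE new_items (item_id INT, word TEXT)")
    db.executemany("INSERT INTO training VALUES (?, ?, ?)", training)
    db.executemany("INSERT INTO new_items VALUES (?, ?)", new_items)

    # "Compute model": a single aggregate over the whole training set.
    db.execute(
        "CREATE TABLE model AS "
        "SELECT label, word, COUNT(*) AS n FROM training GROUP BY label, word"
    )

    # "Classify": join all new items to the model at once, sum the counts
    # per (item, label), and order so the best label comes first per item.
    rows = db.execute(
        "SELECT item_id, label, SUM(n) AS score "
        "FROM new_items JOIN model USING (word) "
        "GROUP BY item_id, label "
        "ORDER BY item_id, score DESC, label"
    ).fetchall()
    db.close()

    best = {}
    for item_id, label, _score in rows:
        best.setdefault(item_id, label)  # first row per item has top score
    return best
```

Calling `classify_set_oriented(TRAINING, NEW_ITEMS)` labels each new item with the class whose training words it shares most often. The point is only that both steps are expressed over sets of rows, which is where the database can do the heavy lifting.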

Hurl is now open! – Post

Monday, January 31st, 2011

Hurl is now open!

From the website:

Chris Wanstrath and I originally developed Hurl for the 2009 Rails Rumble where it won Most Complete. The idea was to create a simple web version of cURL, a command-line tool often used to test web APIs.

Hurl is super easy – just enter a URL and any extra parameters such as HTTP headers, body parameters, and authentication and then click “Send.” You’ll get the response and can save and share it.

By open sourcing the code behind Hurl we hope that other developers will be able to build on the concept. One very requested feature is to create an embeddable version of Hurl that can be used in developer documentation for easy try-it functionality.

Not topic map specific but certainly will be useful for developers working on web APIs for topic map software.

OpenData + R + Google = Easy Maps
(& Lessons for Topic Maps)

Monday, January 31st, 2011

OpenData + R + Google = Easy Maps from James Cheshire (via R-Bloggers) is a compelling illustration of the use of R for mapping.

It also illustrates a couple of principles that are important for topic map authors to keep in mind:

1) An incomplete [topic] map is better than no [topic] map at all.

Cheshire could have waited until he had all the data from every agency studying the issue of child labor and reconciled that data with field surveys, plus published reports from news organizations, etc., but then we would not have this article, would we?

We also would not have a useful mapping of the data we have on hand.

I mention this one first because it is one that afflicts me the most.

I work on example topic maps but because they aren’t complete I am reluctant to see them as being in publishable shape.

The principle from software coding, release early and often, should be the operative principle for topic map authoring.

2) There is no true view of the data that should be honored.

Many governments of countries on this map would dispute the accuracy of the data. And your point would be?

Every map tells a story from a point of view.

There isn’t any reason for your topic map to await approval of any particular group or organization included in it.

A world of data awaits us as topic mappers.

The only question is whether we are going to step up to take advantage of it.

*****
PS: My position on incomplete topic maps is not inconsistent with my view on PR driven SQL data dumps that are topic maps in name only. As they say, you can put lipstick on a pig, ….

Tutorial: Developing in Erlang with Webmachine, ErlyDTL, and Riak

Monday, January 31st, 2011

Tutorial: Developing in Erlang with Webmachine, ErlyDTL, and Riak

From Alex Popescu’s MyNoSQL blog:

  • Part 1
    • In Part 1 of the series we covered the basics of getting the development environment up and running. We also looked at how to get a really simple ErlyDTL template rendering
  • Part 2
    • There are a few reasons this series is targeting this technology stack. One of them is uptime. We’re aiming to build a site that stays up as much as possible. Given that, one of the things that I missed in the previous post was setting up a load balancer. Hence this post will attempt to fill that gap.
  • Part 3 In this post we’re going to cover:
    • A slight refactor of code structure to support the “standard” approach to building applications in Erlang using OTP.
    • Building a small set of modules to talk to Riak.
    • Creation of some JSON helper functions for reading and writing data.
    • Calling all the way from the Webmachine front-end to Riak to extract data and display it in a browser using ErlyDTL templates.

Erlang is important for anyone building high availability (think telecommunications) systems that can be dynamically reconfigured without taking the systems offline.

Hubs and Connectors: Understanding Networks Through Data Visualization – Post

Sunday, January 30th, 2011

Hubs and Connectors: Understanding Networks Through Data Visualization

I have been shying away from the rash of LinkedIn graph visualizations but then I ran across this one by Whitney Hess at her Pleasure + Pain: Improving the human experience one day at a time blog.

The title alone made me do a double take. 😉

This post merits your reading as an introduction to network analysis, albeit presented in an easy to understand way with eye-candy along the way.

While you are there, check out her archives and other posts as well.

Such as: 10 Most Common Misconceptions About User Experience Design

If I could get topic map project managers to read one article, it would be that one.

Need faster machine learning? Take a
set-oriented approach

Saturday, January 29th, 2011

Need faster machine learning? Take a set-oriented approach.

Roger Magoulas, using not-so-small iron, reports:

The result: The training set was processed and the sample data set classified in six seconds. We were able to classify the entire 400,000-record data set in under six minutes — more than a four-orders-of-magnitude records processed per minute (26,000-fold) improvement. A process that would have run for days, in its initial implementation, now ran in minutes! The performance boost let us try out different feature options and thresholds to optimize the classifier. On the latest run, a random sample showed the classifier working with 92% accuracy.

or

set-oriented machine learning makes for:

  • Handling larger and more diverse data sets
  • Applying machine learning to a larger set of problems
  • Faster turnarounds
  • Less risk
  • Better focus on a problem
  • Improved accuracy, greater understanding and more usable results

Seems to me sameness of subject representation is a classification task. Yes?

    Going from days to minutes sounds attractive to me.

    How about you?

    R & Subject Identity/Identification

    Saturday, January 29th, 2011

    While posting R Books for Undergraduates, it occurred to me that having examples of using R for subject identity/identification would be helpful.

    I could create examples from scratch, but that would be a lot of work.

    Not to mention limiting me to domains in which I have some interest and expertise.

    What if I were to re-cast existing R examples as subject identity/identification issues?

    That saves me the time of creating new examples.

    More importantly, it gives me a ready-made audience to chime in on how I did with subject identity:

    • correct
    • close but incorrect
    • incorrect
    • incorrect and far away
    • incoherent
    • what subject did I think I was talking about?
    • etc.

    More than one answer is possible for any one example. 😉

    R Books for Undergraduate Students

    Saturday, January 29th, 2011

    R Books for Undergraduate Students.

    Recommended R titles by Colin Gillespie, Statistics Lecturer, Newcastle University – University of Newcastle, Newcastle Upon Tyne, United Kingdom.

    R is useful for exploring and analyzing data sets in order to discover, confirm or investigate subjects and their identification.

    Unified Intelligence: Completing the Mosaic of Analytics

    Friday, January 28th, 2011

    Unified Intelligence: Completing the Mosaic of Analytics

    Tuesday, Feb. 15 @ 4 ET

    From the announcement:

    Seeing the big picture requires a convergence of both structured and unstructured data. While each side of that puzzle presents challenges, the unstructured world poses a wider range of issues that must be resolved before meaningful analysis can be done. However, many organizations are discovering that new technologies can be employed to process and transform this unwieldy data, such that it can be united with the traditional realm of business intelligence to bring new meaning and context to analytics.

    Register for this episode of The Briefing Room to learn from veteran Analyst James Taylor about how companies can incorporate unstructured data into their decision systems and processes. Taylor will be briefed by Sid Probstein of Attivio, who will tout his company’s patented technology, the Active Intelligence Engine, which uses inverted indexing and a mathematical graph engine to extract, process and align unstructured data. A host of Attivio connectors allow integration with most analytical and many operational systems, including the capability for hierarchical XML data.

    I am not real sure what a non-mathematical graph engine would look like but this could be fun.

    It is also an opportunity to learn something about how others view the world.

    CouchDB 1.0.2: 3rd is Lucky – Post

    Friday, January 28th, 2011

    CouchDB 1.0.2: 3rd is Lucky

    Alex Popescu covers the release of CouchDB 1.0.2.

    A point release with new features.

    Next Generation Data Integration – Webinar

    Friday, January 28th, 2011

    Next Generation Data Integration

    Date: April 12, 2011 Time: 9:00AM PT

    Speaker: Philip Russom

    From the website:

    Data integration (DI) has undergone an impressive evolution in recent years. Today, DI is a rich set of powerful techniques, including ETL (extract, transform, and load), data federation, replication, synchronization, change data capture, natural language processing, business-to-business data exchange, and more. Furthermore, vendor products for DI have achieved maturity, users have grown their DI teams to epic proportions, competency centers regularly staff DI work, new best practices continue to arise (like collaborative DI and agile DI), and DI as a discipline has earned its autonomy from related practices like data warehousing and database administration.

    Given these and the many other generational changes data integration has gone through recently, it’s natural that many people aren’t quite up-to-date with the full potential of modern data integration. Based on a recent TDWI Best Practices report this webinar seeks to cure that malady by redefining data integration in modern terms, plus showing where it’s going with its next generation. This information will help user organizations make more enlightened decisions, as they upgrade, modernize, and expand existing data integration solutions, plus plan infrastructure for next generation data integration.

    Every group (tribe as Jack Park would call them) has its own terminology when it comes to data and managing data.

    As you can tell from the description of the webinar, data integration is concerned with many of the same issues as topic maps. Albeit under different names.

    Regard this as an opportunity to visit another tribe and learn some new terminology.

    And some new ideas you can use with topic maps.

    Alchemy Database: A Hybrid Relational-Database/NOSQL-Datastore

    Friday, January 28th, 2011

    Alchemy Database: A Hybrid Relational-Database/NOSQL-Datastore

    From the website:

    Alchemy Database is a lightweight SQL server that is built on top of the NOSQL datastore redis. It supports redis data-structures and redis commands and supports (de)normalisation of these data structures (lists,sets,hash-tables) to/from SQL tables. Lua is deeply embedded and lua scripts can be run internally on Alchemy’s data objects. Alchemy Database is not only a data storage Swiss Army Knife, it is also blazingly fast and extremely memory efficient.

    • Speed is achieved by being an event driven network server that stores ALL data in RAM and achieves disk persistence by using a spare cpu-core to periodically log data changes (i.e. no threads, no locks, no undo-logs, no disk-seeks, serving data over a network at RAM speed)
    • Storage data structures w/ very low memory overhead and data compression, via algorithms w/ insignificant performance hits, greatly increase the amount of data you can fit in RAM
    • Optimising to the SQL statements most commonly used in OLTP workloads yields a lightweight SQL server designed for low latency at high concurrency (i.e. mindblowing speed).

    The Philosophy of Alchemy Database is that RAM is now affordable enough to be able to store ENTIRE OLTP Databases in a single machine’s RAM (e.g. Wikipedia’s DB was 50GB in 2009 and a Dell PowerEdge R415 w/ 64GB RAM costs $4000), as long as the data is made persistent to disk. So Alchemy Database provides a non-blocking event-driven network-I/O-based relational-database, with very little memory overhead, that does the most common OLTP SQL statements amazingly fast and then throws in the NOSQL Data-store redis to create fantastic optimisation possibilities.

    Leaving words/phrases like “blazingly fast,” “amazingly fast,” “fantastic optimisation,” and “mindblowing speed” to one side, one does wonder how it performs for a topic map.

    Reports welcome!

    NLP (Natural Language Processing) tools

    Friday, January 28th, 2011

    Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources

    From Stanford University.

    It may not be every NLP resource but it is the place to start looking if you are looking for a new tool.

    This should give you an idea of the range of tools that could be applied to the Afghan War Diaries, for example.

    Sho: the .NET Playground for Data

    Friday, January 28th, 2011

    Sho: the .NET Playground for Data

    Since we are talking about data analysis and display tools.

    From the website:

    Sho is an interactive environment for data analysis and scientific computing that lets you seamlessly connect scripts (in IronPython) with compiled code (in .NET) to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

    Building a Better Word Cloud – Post

    Friday, January 28th, 2011

    Building a Better Word Cloud

    Drew Conway talks about why word clouds don’t work (space-based display of non-spatial data is how I would summarize it, but see for yourself).

    He then proceeds to create a comparative word cloud. Palin and Obama on the Arizona shootings.

    I include this post here as a caution that space-based clustering can be misleading if not outright deceptive.

    Node.js Key-Value Stores – Post

    Friday, January 28th, 2011

    Node.js Key-Value Stores

    A blog post by Alex Popescu on key-value stores built for node.js.

    • Alfred: in-process key-value store. You can read more about it here
    • node-dirty: tiny and fast key-value store with append-only disk log

    And I’d bet there are more to come. Real question though, is how many will be around in 6-12 months.

    I suspect the number of node.js key-value stores will increase and then decrease.

    I am not sure I understand the implied problem?

    But then I remember when there were 300+ formats for document conversion.

    There are fewer than that now. Well, at least major ones. 😉

    Is an upsurge in solutions, experimentation and a winnowing down, until the next explosion of creativity, a bad thing?

    Sofia-ML and Maui: Two Cool Machine Learning and Extraction libraries – Post

    Friday, January 28th, 2011

    Sofia-ML and Maui: Two Cool Machine Learning and Extraction libraries

    Jeff Dalton reports on two software packages for text analysis.

    These are examples of just some of the tools that could be run on a corpus like the Afghan War Diaries.

    Functional Data Structures – Post

    Friday, January 28th, 2011

    On the Theoretical Computer Science blog the following question was asked:

    What’s new in purely functional data structures since Okasaki?

    Since Chris Okasaki’s 1998 book “Purely functional data structures”, I haven’t seen too many new exciting purely functional data structures appear; I can name just a few:…

    What follows is a listing of resources that will be of interest to topic map researchers.

    Why Command Helpers Suck – Post

    Friday, January 28th, 2011

    Why Command Helpers Suck is an amusing rant by Kristina Chodorow (author of MongoDB: The Definitive Guide) on the different command helpers for the same underlying database commands.

    Shades of XWindows documentation and the origins of topic maps. Same commands, different terminology.

    If, as Robert Cerny has suggested, topic maps don’t offer something new, then I think it is fair to observe that the problems topic maps work to solve aren’t new either. 😉

    A bit more seriously, topic maps could offer Kristina a partial solution.

    Imagine a utility for command helpers that is actively maintained and that has a mapping between all the known command helpers and a given database command.

    Just enter the command you know and the appropriate command is sent to the database.

    That is the sort of helper application that could easily find a niche.

    The master mapping could be maintained with full identifications, notes, etc. but there needs to be a compiled version for speed of response.
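
A minimal sketch of that split between an annotated master mapping and a compiled lookup table might look like the following. Every command and helper name here is a made-up placeholder for illustration, not a verified list of any driver’s helpers, and the case-insensitive matching is my own assumption.

```python
# Master mapping: one entry per underlying database command, carrying the
# known helper spellings plus notes. All names are hypothetical placeholders.
MASTER = [
    {
        "command": "index.create",
        "helpers": ["makeIndex", "make_index", "addIndex"],
        "notes": "Different client libraries spell this helper differently.",
    },
    {
        "command": "collection.drop",
        "helpers": ["dropColl", "drop_coll", "removeCollection"],
        "notes": "Some clients hang it off the collection, some off the db.",
    },
]


def compile_mapping(master):
    """Flatten the annotated master mapping into a helper -> command dict."""
    compiled = {}
    for entry in master:
        for name in [entry["command"], *entry["helpers"]]:
            key = name.lower()
            # Refuse ambiguous spellings rather than silently picking one.
            if compiled.get(key, entry["command"]) != entry["command"]:
                raise ValueError(f"{name!r} maps to two commands")
            compiled[key] = entry["command"]
    return compiled


def resolve(compiled, helper):
    """Return the underlying command for whatever helper the user knows."""
    try:
        return compiled[helper.lower()]
    except KeyError:
        raise KeyError(f"unknown command helper: {helper!r}") from None
```

The notes stay in the master mapping for the people maintaining it. Only the flat dictionary is consulted at lookup time, which is the “compiled version for speed of response.”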

    Comet – An Example of the New Key-Code Databases – Post

    Thursday, January 27th, 2011

    Comet – An Example of the New Key-Code Databases

    Another NoSQL database.

    The post summarizes the goals of Comet, which is described as: … an extensible storage service that allows clients to inject snippets of code that control their data’s behavior inside the storage service.

    One thing you will notice fairly quickly when reading Comet: An active distributed key-value store is that the authors were not trying to build a fully generalized solution.

    They had specific requirements in mind to be met and if your needs fall outside those requirements, you need to look elsewhere.

    Rather refreshing to find a project that expressly isn’t trying to replace MS Office or Facebook. 😉

    That still leaves a lot of interesting and commercially successful work to be done.
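
To make “inject snippets of code that control their data’s behavior” a little more concrete, here is a toy in-memory sketch. It is not Comet’s API: the class, the `on_get` hook, and the read-limit example are all my own assumptions, and a real service would need to sandbox client code, which this does not attempt.

```python
class ActiveStore:
    """Toy key-value store whose values can carry a client-supplied handler."""

    def __init__(self):
        self._data = {}

    def put(self, key, value, on_get=None):
        # on_get is client code stored alongside the value (hypothetical hook).
        self._data[key] = {"value": value, "on_get": on_get, "reads": 0}

    def get(self, key):
        obj = self._data[key]
        obj["reads"] += 1
        if obj["on_get"] is None:
            return obj["value"]
        # The handler runs inside the store and decides what the caller sees.
        return obj["on_get"](obj["value"], obj["reads"])


def read_limited(max_reads):
    """Build a handler that hides the value after max_reads reads."""
    def handler(value, reads):
        return value if reads <= max_reads else None
    return handler
```

A value stored with `read_limited(2)` is returned for the first two reads and comes back as `None` afterwards, while values stored without a handler behave like an ordinary key-value store.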

    Isidorus

    Thursday, January 27th, 2011

    Isidorus

    From the website:

    Isidorus is an Open Source Topic Map engine actively developed using sbcl and elephant. Isidorus supports import and export of XTM 1.0 and 2.0, full versioning, merge semantics, an Atom-based RESTful API and Topic Map querying — with more to come.

    Current areas of development include:

    • Enforcements of constraints (TMCL)
    • Json-import / export and a AJAX front end for data curation
    • Enhanced querying

    Also note:

    A Virtual Box image of a pre-installed isidorus-environment on an Ubuntu-Linux system is available at: http://festus.textgrid.it.fh-worms.de/TMRA2009/isidorus-vbox-image.tar.gz.

    Flapjax

    Thursday, January 27th, 2011

    Flapjax

    From the website:

    Flapjax is a new programming language designed around the demands of modern, client-based Web applications. Its principal features include:

    • Event-driven, reactive evaluation
    • An event-stream abstraction for communicating with web services
    • Interfaces to external web services

    Flapjax is easy to learn: it is just a JavaScript framework. Furthermore, because Flapjax is built entirely atop JavaScript, it runs on traditional Web browsers without the need for plug-ins or other downloads. It integrates seamlessly with existing JavaScript code and other frameworks.

    Don’t know if anyone will find this useful but some of the demos looked interesting.

    Thought it would be worth mentioning for anyone looking to build client-based topic map applications.

    Baltimore – Semi-Transparent or Semi-Opaque?

    Thursday, January 27th, 2011

    Open Baltimore is leading the way towards semi-transparent or semi-opaque government.

    You be the judge.

    The City of Baltimore is leading in placing hundreds of data sets online.

    But is that being semi-transparent or semi-opaque?

    Data sets I would like to see:

    • City contracts, their amounts and who was successful at bidding on them?
    • Successful bidders, not just by corporate name: Who owns them? Who works there? What lawyers represent them?
    • What are the relationships, personal, business, etc., between staff, elected officials and anyone who does business with the city?
    • Same questions for school, fire, police and other departments.
    • Code violations, what are they, which inspectors write them, for what locations?
    • Arrests: of whom, by which officers, for what crimes, at what locations and times.
    • etc. (these are illustrations and not an exhaustive list)

    Make no mistake, I am grateful for the information the city has already provided.

    What they have provided took a lot of work and will be useful for a number of purposes.

    But I don’t want people to think that a large number of data sets means transparency.

    Transparency involves questions of relevant data and meaningful ways to evaluate it and to connect it to other data.

    Think Outside the (Comment) Box

    Thursday, January 27th, 2011

    Think Outside the (Comment) Box: Social Applications for Publishers

    From the announcement:

    Learn about the next generation of social applications and how publishers are leveraging them for editorial and financial benefit.

    I will spare you the rest of the breathless language.

    Still, I will be there and suggest you be there as well.

    Norm Walsh, who needs no introduction in markup circles, works at MarkLogic.

    That gives me confidence this may be worth hearing.

    Details:

    February 9, 2011 – 8:00 am pacific, 11:00 am eastern – 4:00 pm GMT

    *****
    PS: For anyone who has been under a rock for the last several years, MarkLogic makes an excellent XML database solution.

    See for example, MarkMail, a collection of technical mailing lists from around the web.

    Searching it also illustrates how much semantic improvement can be made to searching.

    Infochimps

    Thursday, January 27th, 2011

    Infochimps.com

    Another free data source. (Commercial plans also available.)

    Large number of data sources and what looks like a friendly number of free API calls while you are building an application.

    Observation: Finding one data source or project seems to lead to several others in the same area.

    Definitely worth a visit.

    *****
    PS: The abundance of online data sources opens the door to semantic mappings (can you say topic maps?) that enhance the value of these data sets.

    Such as resolving the semantic impedance between the data sets.

    Topic map artifacts as commercial products.

    The trick is going to be discovering (and resolving) semantic impedances that people are willing to pay to avoid.