Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 20, 2012

Solving the Queens Problem with Haskell

Filed under: Algorithms,Haskell — Patrick Durusau @ 8:35 pm

Solving the Queens Problem with Haskell

From the post:

The eight queens puzzle is a classic problem in algorithm design because it showcases the power of backtracking and recursion. The premise is simple: find a way to place 8 queens on an 8×8 board such that none of them are attacking each other, which means that no two queens can share the same row, column, or diagonal.

Sounds simple enough, but how to implement it? The most basic solution would be to brute force it: try every possible configuration, stopping when a valid one is found. Not very efficient, but hey, at least it’s easy to understand.
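
The post works the solution out in Haskell; as a rough sketch of the same recursion-plus-backtracking idea, here it is in Scala (my rendering, not the post’s code):

    object Queens {
      // A partial solution is a list of column positions, one per placed row
      // (head = the most recently placed row).
      def safe(col: Int, placed: List[Int]): Boolean =
        placed.zipWithIndex.forall { case (c, i) =>
          c != col && math.abs(c - col) != i + 1   // no shared column or diagonal
        }

      // Solve for `remaining` rows on a `size`-wide board, backtracking on conflicts.
      def place(remaining: Int, size: Int): List[List[Int]] =
        if (remaining == 0) List(Nil)
        else for {
          partial <- place(remaining - 1, size)    // solve the smaller board first
          col     <- (1 to size).toList            // try every column for the new row
          if safe(col, partial)                    // prune: abandon unsafe branches
        } yield col :: partial

      def main(args: Array[String]): Unit =
        println(place(8, 8).size)                  // prints 92 for the classic board
    }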

A good illustration of recursion and backtracking, but one that also suggests brute force isn’t the best option for authoring or manipulating topic maps.

A deep understanding of your data, relevant subjects, methods for identifying those subjects, etc., will be more useful than brute force.

Understand Neo4j Commercial License Options

Filed under: Conferences,Neo4j — Patrick Durusau @ 8:35 pm

Understand Neo4j Commercial License Options by Andreas Kollegger.

Webinar: Thursday, February 23, 2012 10:00 PST // 18:00 GMT

Register

Description:

In this fast-paced 30 minute webinar, Andreas Kollegger walks through the different types of Neo4j open source and commercial licenses. Learn how to decide what type of license, commercial or open source, is right for your situation.

I assume that if you intend to make money in some way from your software, you already have a lawyer. If you don’t, get one. Then have your lawyer look at the various licenses from Neo4j (or any other software source) and the plans for your software. Then take their advice.

Lay discussions of licensing options for software always give me the shakes. IBM, MS and others have troops of lawyers for a reason. It isn’t because they like collecting lawyers. “….Go and do likewise.” Luke 10:37.

Rather than cursing semantic darkness…

Filed under: Humor — Patrick Durusau @ 8:35 pm

Rather than finding fault with the Semantic Web and cursing the semantic darkness, I should be trying to light one candle of semantic clarity, even if it is a universal meaning for a diverse world. (Oops, sorry, lapsed again.)

In the spirit of lighting one candle of semantic clarity, consider the: Plergb Bylaws Summary.

The background statement reads:

The Plergb Language Entropy Regulatory Governing Bureaucratic Commission Overseeing Multiple Managerial Issues Surrounding Singular Instances Of Nomenclature (“P. L. E. R. G. B. C. O. M. M. I. S. S. I. O. N.” or “Plergb Commission” for short) (Hereafter referred to as the Commission) is the body which oversees and administers the use of the word Plergb and its authorized variants.

Further, the Uses of the Word Plergb reads:

The word Plergb may be used in many ways, some of which have not been discovered yet.

  • The simplest use of the word Plergb is as a general-purpose noun, sort of like "doohickey" and "thingamabob". It may be dynamically redefined within a sentence if the context makes the meaning clear, as in the old saying "A rolling Plergb gathers no Plergb". It can also be used as a euphemism, as in "What the Plergb is this Plergb?!?!" The use of the word Plergb to maliciously hinder communication is strongly discouraged.
  • Another simple use is to bring good luck by saying "Plergb" as you are plugging something in. The Commission makes no warranty covering such use.
  • Whenever the word Plergb is used in a song, it may be defined as the entire remainder of the song. For example, singing "Oh, say can you see by the Plergb?" lets you start the ball game that much sooner. When used in a song on radio or TV it may be defined as the rest of the broadcast day, allowing the staff to shut off the transmitter and go home without further ado or explanation (CAUTION: Some advertisers consider this to be bad manners, especially if their commercials are affected).
  • The most sophisticated use of the word Plergb, however, is as an operator. Computer people may think of it as a sort of macro to be executed by one’s audience. For example, if you are composing something on a manual typewriter but wish to have a professional-appearing document, you may add at the bottom "Plergb, defined as justifying margins, correcting typos, and general cleanup". It can also be defined as minus something you wish to cancel. For example, if you are running for office you might end all your speeches with "Plergb, defined as minus anything you didn’t want to hear". The word Plergb can also be used to correct hyphenation at ends of lines, clean up politically incorrect language, cleanse erotica of anything the authorities might consider obscene, conditionally insert or delete paragraphs depending on the state of other variables, insert illustrations or other special effects that your Web software cannot handle, and so on.

Now that we have clarity for the word Plergb, who will volunteer to take up the cause of owl:sameAs?

To be truly universal, we will need to talk about UN semantic enforcement coalition forces in the event any group fails to heed UN resolutions on semantic usage. But that is for another post.

(Sam Hunting forwarded the Plergb Bylaws to my attention. Thanks Sam!)

February 19, 2012

Combinatorial Algorithms and Data Structures

Filed under: Combinatorics,Data Structures — Patrick Durusau @ 8:41 pm

Combinatorial Algorithms and Data Structures

In the Berkeley course list I posted about earlier, the link for this course came up as a 404.

After a little digging I found it (the page has links to prior versions of the class), and I thought you might want something challenging to start the week!

From Data to Knowledge

Filed under: BigData,Conferences — Patrick Durusau @ 8:40 pm

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications / May 7-11, 2012, University of California, Berkeley.

From the website:

We are experiencing a revolution in the capacity to quickly collect and transport large amounts of data. Not only has this revolution changed the means by which we store and access this data, but has also caused a fundamental transformation in the methods and algorithms that we use to extract knowledge from data. In scientific fields as diverse as climatology, medical science, astrophysics, particle physics, computer vision, and computational finance, massive streaming data sets have sparked innovation in methodologies for knowledge discovery in data streams. Cutting-edge methodology for streaming data has come from a number of diverse directions, from on-line learning, randomized linear algebra and approximate methods, to distributed optimization methodology for cloud computing, to multi-class classification problems in the presence of noisy and spurious data.

This workshop will bring together researchers from applied mathematics and several diverse scientific fields to discuss the current state of the art and open research questions in streaming data and real-time machine learning. The workshop will be domain driven, with talks focusing on well-defined areas of application and describing the techniques and algorithms necessary to address the current and future challenges in the field. Sessions will be accessible to a broad audience.

This looks really good!

That said, I am unsure that “big data” is as important as our skill at extracting (conferring?) meaning from it. To put it another way, careful analysis of a small amount of data is just as likely to be useful as coarse analysis of a large amount of data.

Neocons, a Clojure client for the Neo4J REST API

Filed under: Clojure,Neo4j,Neocons — Patrick Durusau @ 8:40 pm

Neocons, a Clojure client for the Neo4J REST API

From the webpage:

Neocons is a young idiomatic Clojure client for the Neo4J REST API.

Supported Features

Neocons currently supports the following features (all via REST API, so you can use open source Neo4J Server edition for commercial projects):

  • Create, read, update and delete nodes
  • Create, read, update and delete relationships
  • Fetch relationships for given node
  • Create and delete indexes
  • Index nodes
  • Query node indexes for exact matches and using full text search queries
  • Query automatic node index
  • Traverse nodes, relationships and paths
  • Find shortest path or all paths between nodes
  • Predicates over paths, for example, if they include specific nodes/relationships
  • Cypher queries (with Neo4J Server 1.6 and later)

Another client for Neo4j! Which one do you use?

Talking with Neo4j Graphs

Filed under: Challenges,Neo4j — Patrick Durusau @ 8:39 pm

Talking with Neo4j Graphs by Tomás Augusto Müller.

From the post:

In this post I will be covering all main details regarding the development of my entry for the Neo4j Challenge.

The main objective of this challenge is to create a Heroku-ready template or demo application using Neo4j. So, I thought to myself: what kind of application would be nice to show up in this contest?

After many ideas, here it is!

In short, the application is a Stock Exchange symbol lookup using Neo4j and your voice.

Looks like competitors are starting to emerge in the Neo4j challenge!

Very cool!

Mechanical Turk vs oDesk: My experiences

Filed under: Mechanical Turk,oDesk — Patrick Durusau @ 8:38 pm

Mechanical Turk vs oDesk: My experiences by Panos Ipeirotis.

From the post:

A question that I receive often is how to structure tasks on Mechanical Turk for which it is necessary for the workers to pass training before doing the task. My common answer to most such questions is that Mechanical Turk is not the ideal environment for such tasks: When training and frequent interaction is required, an employer is typically better off using a site such as oDesk to hire people for the long term to do the job.

Like most such decisions, the choice between oDesk and Amazon’s Mechanical Turk should not be automatic or knee-jerk.

Panos is an “academic-in-residence” with oDesk but even-handedly points out when oDesk or Mechanical Turk would be the better choice. It depends on the task at hand and a number of other factors.

If you are considering using either service now or in the future, this is definitely an article you need to keep close at hand.

EECS Course WEB Sites

Filed under: CS Lectures — Patrick Durusau @ 8:37 pm

EECS Course WEB Sites

Archives of EE and CS classes at Berkeley.

Some with more resources than others. But interesting nonetheless.

MoleculaRnetworks

Filed under: Data Mining,Graphs,PageRank — Patrick Durusau @ 8:37 pm

MoleculaRnetworks: An integrated graph theoretic and data mining tool to explore solvent organization in molecular simulation by Barbara Logan Mooney, L. René Corrales and Aurora E. Clark.

Abstract:

This work discusses scripts for processing molecular simulations data written using the software package R: A Language and Environment for Statistical Computing. These scripts, named moleculaRnetworks, are intended for the geometric and solvent network analysis of aqueous solutes and can be extended to other H-bonded solvents. New algorithms, several of which are based on graph theory, that interrogate the solvent environment about a solute are presented and described. This includes a novel method for identifying the geometric shape adopted by the solvent in the immediate vicinity of the solute and an exploratory approach for describing H-bonding, both based on the PageRank algorithm of Google search fame. The moleculaRnetworks codes include a preprocessor, which distills simulation trajectories into physicochemical data arrays, and an interactive analysis script that enables statistical, trend, and correlation analysis, and other data mining. The goal of these scripts is to increase access to the wealth of structural and dynamical information that can be obtained from molecular simulations. © 2012 Wiley Periodicals, Inc.

Data mining, graph theory, PageRank, something for everyone in this article!

Not to mention innovative use of PageRank with non-WWW data.
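
PageRank itself has nothing Web-specific in it. As a toy illustration (a generic Scala power iteration over an adjacency list, not the moleculaRnetworks R code):

    object ToyPageRank {
      // Damped power iteration; dangling nodes are ignored for brevity.
      def rank(adj: Map[Int, Seq[Int]], damping: Double = 0.85, iters: Int = 50): Map[Int, Double] = {
        val nodes = adj.keys.toSeq
        val n = nodes.size.toDouble
        var r = nodes.map(v => v -> 1.0 / n).toMap
        for (_ <- 1 to iters)
          r = nodes.map { v =>
            // Sum the rank flowing in from every node that links to v.
            val in = adj.collect { case (u, out) if out.contains(v) => r(u) / out.size }.sum
            v -> ((1 - damping) / n + damping * in)
          }.toMap
        r
      }
    }

Swap Web pages for water molecules and hyperlinks for H-bonds, and the same iteration ranks solvent structure.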

MoleculaRnetworks code.

Selling Data Mining to Management

Filed under: Data Management,Data Mining,Marketing — Patrick Durusau @ 8:36 pm

Selling Data Mining to Management by Sandro Saitta.

From the post:

Preparing data and building data mining models are two very well documented steps of analytics projects. However, whatever interesting your results are, they are useless if no action is taken. Thus, the step from analytics to action is a crucial one in any analytics project. Imagine you have the best data and found the best model of all time. You need to industrialize the data mining solution to make your company benefits from them. Often, you will first need to sell your project to the management.

Sandro references three very good articles on pitching data management/mining/analytics to management.

I would rephrase Sandro’s opening line to read: “Preparing data [for a topic map] and building [a topic map] are two very well documented steps of [topic map projects]. However, whatever interesting your results are, [there is no revenue if no one buys the map].”

OK, maybe I am being generous on the preparing data and building a topic map points but you can see where the argument is going.

And there are successful topic map merchants with active clients, just not enough of either one.

These papers may be the push in the right direction to get more of both.

Identity – The Philosophical Challenge For the Web

Filed under: Identity,Subject Identifiers,Subject Identity — Patrick Durusau @ 8:35 pm

Identity – The Philosophical Challenge For the Web by Matthew Hurst.

From the post:

I work in local search at Microsoft which means, like all those working in this space, I have to deal with an identity crisis on a daily basis. Currently, most local search products – like Bing’s and Google’s – leverage multiple data sets to derive a digital model of the world that users can then interact with. In creating this digital model, multiple statements have to be conflated to form a unified representation. This can be extremely challenging for two reasons. Firstly, the system has to decide when two records are intended to denote the same real world entity. Secondly, the designers of the system have to determine what real world entities are and how to describe them.

For example, if a business moves is that the same business or the closure of one and the opening of another? What does it mean to categorize a business? The cafe in Barnes and Noble is branded Starbucks but isn’t actually part of the Starbucks chain – should it surface as a separate entity or is it ‘hidden’ within the bookshop as an attribute (‘has cafe’)?

Thinking through these hard representational problems is as much part of the transformative trends going on in the tech industry as are those characterized by terms like ‘big data’ and ‘data scientist’.

Questions of identity, and how to resolve multiple references to the same entity, have been debated at least since the Greek philosophers. See the Wikipedia page on Identity and the references on its related pages.

This “philosophical challenge” has been going on for a very long time and so far I haven’t seen any demonstrations that the Web raises new questions.

You need to read Matthew’s identity example in his post.

The songs in question could be said to be instances of the same subject, and a reference to that subject would be satisfied by any of those instances. From another point of view, the origin of the instances could be said to distinguish them into different subjects, say for proof of licensing purposes. Other viewpoints are possible. It depends upon the purpose of your criteria of identification.

SML: Scalable Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 8:35 pm

SML: Scalable Machine Learning

Alex Smola’s lectures on Scalable Machine Learning at Berkeley with a wealth of supplemental materials.

Overview:

Scalable Machine Learning occurs when Statistics, Systems, Machine Learning and Data Mining are combined into flexible, often nonparametric, and scalable techniques for analyzing large amounts of data at internet scale. This class aims to teach methods which are going to power the next generation of internet applications. The class will cover systems and processing paradigms, an introduction to statistical analysis, algorithms for data streams, generalized linear methods (logistic models, support vector machines, etc.), large scale convex optimization, kernels, graphical models and inference algorithms such as sampling and variational approximations, and explore/exploit mechanisms. Applications include social recommender systems, real time analytics, spam filtering, topic models, and document analysis.

Just to give you a taste for the content, the first set of lectures is on Hardware and covers:

  • Hardware
    • Processor, RAM, buses, GPU, disk, SSD, network, switches, racks, server centers
    • Bandwidth, latency and faults
  • Basic parallelization paradigms
    • Trees, stars, rings, queues
    • Hashing (consistent, proportional)
    • Distributed hash tables and P2P
  • Storage
    • RAID
    • Google File System / HadoopFS
    • Distributed (key, value) storage
  • Processing
    • MapReduce
    • Dryad
    • S4 / stream processing
  • Structured access beyond SQL
    • BigTable
    • Cassandra

Each set of lectures was back-to-back (to reduce travel time for Smola).

Hardware influences our thinking and design choices so it was good to see the lectures starting with coverage of hardware.

There is an interesting point near the end of the first lecture about never using editors to create editorial data. Alex explains that query results were at one point validated by women in their twenties, so other perspectives on query results were not reflected in the results. He suggested getting users to provide data for search validation rather than using experts to label the data.

I would split his comments on editorial content into:

  1. Editorial content from experts
  2. Editorial content from users

I would put #1 in the same category as getting ontologists or linked data types to mark up data. It works for them and from their point of view, but that doesn’t mean it works for the users of the data.

On the other hand, #2, content from users about how they think about their data and what constitutes a good result, seems a lot more appealing to me.

I would say that Alex’s point isn’t to avoid editors altogether but to choose one’s editors carefully, favoring the users who will be using the results of the searches. (And avoid the activity of labeling; there are better ways to get the needed data from users.)

That doesn’t work for a generalized search interface like Google but then a public ….., err, water trough is a public water trough.

February 18, 2012

Hadoop and Machine Learning: Present and Future

Filed under: Hadoop,Machine Learning — Patrick Durusau @ 5:26 pm

Hadoop and Machine Learning: Present and Future by Josh Wills.

Presentation at LA Machine Learning.

Josh Wills is Cloudera’s Director of Data Science, working with customers and engineers to develop Hadoop-based solutions across a wide range of industries. Prior to joining Cloudera, Josh worked at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+. Prior to Google, Josh worked at a variety of startups, sometimes as a Software Engineer and sometimes as a Statistician. He earned his Bachelor’s degree in Mathematics from Duke University and his Master’s in Operations Research from The University of Texas – Austin.

A very practice-oriented view of Hadoop and machine learning. If you aren’t excited about Hadoop and machine learning already, you will be after this presentation!

Variations for computing results from sequences in Scala

Filed under: Functional Programming,Scala — Patrick Durusau @ 5:26 pm

Variations for computing results from sequences in Scala

From the post:

A common question from students who are new to Scala is: What is the difference between using the map function on lists, using for expressions and foreach loops? One of the major sources of confusion with regard to this question is that a for expression in Scala is not the equivalent of for loops in languages like Python and Java — instead, the equivalent of for loops is foreach in Scala. This distinction highlights the importance of understanding what it means to return values versus relying on side-effects to perform certain computations. It also helps reinforce some points about fixed versus reassignable variables and immutable versus mutable data structures.
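
To make that distinction concrete, a minimal Scala comparison (toy values, nothing from the post itself):

    val xs = List(1, 2, 3)

    val viaMap = xs.map(_ * 2)               // List(2, 4, 6): map returns a value
    val viaFor = for (x <- xs) yield x * 2   // same List(2, 4, 6): for/yield desugars to map
    xs.foreach(x => println(x * 2))          // returns Unit: used only for its side effect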

Continuing with the FP theme. Don’t miss the links to additional tutorial materials on Scala at the end of this post.

An invitation to FP for Clojure noobs

Filed under: Clojure,Functional Programming — Patrick Durusau @ 5:26 pm

An invitation to FP for Clojure noobs

From the post:

I’ve heard newcomers to Clojure ask how to get started with functional programming. I believe that learning to program in the functional style is mostly a matter of practice. The newcomer needs to become familiar with a handful of higher order functions, and how they are used in common idioms. This can be done by practice with simple, well defined problems. I assume that the prospective reader already has a grasp of the rudiments of Clojure, and can operate the Leiningen build tool.

Here is what I propose. I have prepared a set of annotated exercises illustrating typical Clojure fp idioms. The exercises are the first 31 Project Euler problems, one for each day of the month. I believe these problems are ideal for the purpose at hand. Each problem is succinctly stated, interesting, and well defined. Each lends itself to a natural functional solution.
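
The exercises themselves are in Clojure; purely to show the flavor of the higher-order style being practiced, here is the first Project Euler problem (sum of the multiples of 3 or 5 below 1000) sketched in Scala:

    // No loops or mutation: build the range, keep the multiples, sum them.
    val answer = (1 until 1000).filter(n => n % 3 == 0 || n % 5 == 0).sum
    // answer == 233168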

Sounds like an interesting approach to learning Clojure.

It would be interesting if processor speed and virtually unlimited storage tip the scales in favor of write-only memory and functional programming.

Imagine a system where even mis-typed keystrokes are recorded, it being easier to record the next try than to correct the previous one. That will appear in secure facilities first, but it won’t remain there.

The 7 Deadly Sins of Solr

Filed under: Solr — Patrick Durusau @ 5:26 pm

The 7 Deadly Sins of Solr

Largely the same material appears in an interview with Jay Hill, but the illustrations Jay pairs with the slides are worth the effort of registering at Dzone.com for the download.

Particularly the final one. 😉

Bird’s Eye View of the ElasticSearch Query DSL

Filed under: DSL,ElasticSearch,Query Language — Patrick Durusau @ 5:26 pm

Bird’s Eye View of the ElasticSearch Query DSL by Peter Karich.

From the post:

I’ve copied the whole post into a gist so that you can simply clone, copy and paste the important stuff and even contribute easily.

Several times per month there are questions regarding the query structure on the ElasticSearch user group.

Although there are good docs explaining this in depth, I think a bird’s eye view of the Query DSL is necessary to understand what is written there. There is even some good external documentation available. And there were attempts to define a schema, but nevertheless I’ll add my 2 cents here. I assume you have set up your ElasticSearch instance correctly on the local machine, filled with exactly those 3 articles.

Do you have a feel for what a “bird’s eye view” would say about the differences in NoSQL query languages?

SQL has been relatively uniform, enabling users to learn the basics and then fill in the particulars as necessary. How far are we from a query DSL that obscures most of the differences from the average user?

How to Store Google n-gram in Neo4j

Filed under: AutoSuggestion,N-Grams,Neo4j — Patrick Durusau @ 5:25 pm

How to Store Google n-gram in Neo4j by r.schiessler.

From the post:

In the end of September I discovered an amazing data set which is provided by Google! It is called the Google n-gram data set. Even though the English Wikipedia article about n-grams needs some clean up, it explains nicely what an n-gram is. http://en.wikipedia.org/wiki/N-gram The data set is available in several languages and I am sure it is very useful for many tasks in web retrieval, data mining, information retrieval and natural language processing.

This data set is very well described on the official google n gram page which I also include as an iframe directly here on my blog.

Schiessler describes the project as follows:

The idea is that once a user has started to type a sentence, statistics about the n-grams can be used to semantically and syntactically correctly predict what the next word will be, and in this way increase the speed of typing by making suggestions to the user. This will be particularly useful with all these mobile devices where typing is really annoying.
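
A rough sketch of the idea in Scala, with toy bigram counts standing in for the Google data set (nothing here is from Schiessler’s code):

    // Predict the next word by ranking bigram counts for the previous word.
    val bigrams = Map(
      ("the", "quick") -> 12L, ("the", "lazy") -> 9L, ("the", "dog") -> 30L
    )

    def suggest(prev: String, k: Int = 3): Seq[String] =
      bigrams.collect { case ((w1, w2), count) if w1 == prev => (w2, count) }
        .toSeq.sortBy(-_._2).take(k).map(_._1)

    // suggest("the") == Seq("dog", "quick", "lazy")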

Another suggestion project!

Worth your time both for its substance and use of Neo4j.

Online Python Tutor

Filed under: Programming,Python — Patrick Durusau @ 5:25 pm

Online Python Tutor: Learn and practice Python programming in your web browser

Lars Marius Garshol tweeted:

Wow. For anyone who wants to learn programming, this looks like an amazing resource

Truly. Not to mention being another example of web browsers as interfaces.

Different ways to make auto suggestions with Solr

Filed under: AutoComplete,AutoSuggestion,Query Expansion,Solr — Patrick Durusau @ 5:24 pm

Different ways to make auto suggestions with Solr

From the post:

Nowadays almost every website has a full text search box as well as the auto suggestion feature in order to help users find what they are looking for by typing the least number of characters possible. The example below shows what this feature looks like in Google. It progressively suggests how to complete the current word and/or phrase, and corrects typo errors. That’s a meaningful example which contains multi-term suggestions depending on the most popular queries, combined with spelling correction.

(graphic omitted)

There are different ways to make auto complete suggestions with Solr. You can find many articles and examples on the internet, but making the right choice is not always easy. The goal of this post is to compare the available options in order to identify the best solution tailored to your needs, rather than describe any one specific approach in depth.

It’s common practice to make auto-suggestions based on the indexed data. In fact a user is usually looking for something that can be found within the index, that’s why we’d like to show the words that are similar to the current query and at the same time relevant within the index. On the other hand, it is recommended to provide query suggestions; we can for example capture and index on a specific Solr core all the user queries which return more than zero results, so we can use that information to make auto-suggestions as well. What actually matters is that we are going to make suggestions based on what’s inside the index; for this purpose it’s not relevant if the index contains user queries or “normal data”, the solutions we are going to consider can be applied in both cases.

The Suggester module is the method that looks the most promising:

This solution has its own separate index which you can automatically build on every commit. Using collation you can have multi-term suggestions. Furthermore, it is possible to use a custom dictionary instead of the index content, which makes the current solution even more flexible.

I like to think of multi-term suggestions as tuneable query expansions that return materials on a subject more precisely than the original query.

The custom dictionary has even more potential:

When a file-based dictionary is used (non-empty sourceLocation parameter above) then it’s expected to be a plain text file in UTF-8 encoding. Blank lines and lines that start with a ‘#’ are ignored. The remaining lines must consist of either a string without a literal TAB (\u0009) character, or a string and a TAB-separated floating-point weight. (http://wiki.apache.org/solr/Suggester)

The custom dictionary can contain single terms or phrases.
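
Following that spec, a dictionary file can be as short as this (a TAB separates each phrase from its optional weight; the entries are invented examples):

    # suggestion dictionary: '#' lines and blank lines are ignored
    topic maps	5.0
    subject identity	3.5
    merging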

Hmmm, a custom dictionary:

  1. Is easy to author
  2. Contains words and phrases
  3. Is an editorial artifact
  4. Not limited to a single Solr installation
  5. Could be domain specific
  6. Assists in returning more, not less precise results

The handling of the more precise results is up to your imagination.

Seeking an efficient algorithm to group identical values

Filed under: Group identical values,IBM Cognos,LINQ — Patrick Durusau @ 5:24 pm

Seeking an efficient algorithm to group identical values a post by Daniel Lemire from 2008.

Grouping identical values is a common operation for topic map engines. Daniel’s post and the comments on it should prove helpful to anyone seeking to solve that problem.

Just from a dirty search, I found that LINQ has “GroupBy(v => v)” for this operation (November of 2011).

And, IBM’s Cognos Business Intelligence software has a group by identical value function.

I suspect other BI software has a similar “group by identical value” capability.

Occurs to me that depending on the scripting/programming capabilities of BI software with “group by identical value” functions, it should be possible to create merging capabilities in that software.

The “merging” being what happens after you have grouped a set of items by some identical value set.
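
As a sketch of that two-step in Scala, grouping on an identity key and then merging each group (Record is an invented stand-in for real topic map items):

    // Group records on an identity key, then collapse each group into one record.
    case class Record(key: String, names: Set[String])

    def mergeByKey(records: Seq[Record]): Seq[Record] =
      records.groupBy(_.key)                  // the LINQ GroupBy(v => v) step
        .values.map { group =>
          Record(group.head.key, group.flatMap(_.names).toSet)
        }.toSeq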

That would work for identical values but doesn’t do anything for groups of different values that should lead to merging.

Anyone working with these or other software packages with “group by identical value” functions?

Thinking it may be easier to offer an extension or service of merging that doesn’t rely on changing software.

Signal/Collect

Filed under: Graphs,Parallel Programming,Signal/Collect — Patrick Durusau @ 5:24 pm

Signal/Collect: a framework for parallel graph processing

I became aware of Signal/Collect because of René Pickhardt’s graph reading club assignment for 22 February 2012.

A paper to use as a starting point for Signal/Collect: Signal/Collect: Graph Algorithms for the (Semantic) Web.

From the code.google.com website (first link above):

Signal/Collect is a programming model and framework for large-scale graph processing. The model is expressive enough to concisely formulate many iterated and data-flow algorithms on graphs, while allowing the framework to transparently parallelize the processing. The current release of the framework is not distributed yet, but this is planned for March 2012.

In Signal/Collect an algorithm is written from the perspective of vertices and edges. Once a graph has been specified the edges will signal and the vertices will collect. When an edge signals it computes a message based on the state of its source vertex. This message is then sent along the edge to the target vertex of the edge. When a vertex collects it uses the received messages to update its state. These operations happen in parallel all over the graph until all messages have been collected and all vertex states have converged.

Many algorithms have very simple and elegant implementations in Signal/Collect. You find more information about the programming model and features in the project wiki. Please take the time to explore some of the example algorithms below.
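
To make the model concrete, here is a miniature of the two operations in Scala (invented types, not the framework’s actual API):

    // Edges signal: compute a message from the source vertex's state.
    // Vertices collect: fold received messages into a new state.
    trait Vertex[S, M] {
      var state: S
      def collect(messages: Seq[M]): S
    }

    trait Edge[S, M] {
      def signal(source: Vertex[S, M]): M
    }

Iterating signal then collect over every edge and vertex until states stop changing is the whole execution model; the framework’s contribution is doing that transparently in parallel.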

Signal/Collect development and source code is now on github.

The name of the project is written variously: Signal/Collect, signal collect, signal-collect. Except for when I am quoting other sources, I will be using Signal/Collect.

February 17, 2012

Should you use SQL or Hadoop?

Filed under: Humor — Patrick Durusau @ 5:11 pm

Should you use SQL or Hadoop?

Follow the link; you need to see it full size.

I first saw this at: Dr Data’s Blog.

Effective Scala – Best Practices from Twitter

Filed under: Scala — Patrick Durusau @ 5:10 pm

Effective Scala – Best Practices from Twitter by Bienvenido David III.

From the post:

Twitter has open sourced its Effective Scala guide. The document is on GitHub, with the Git repository URL https://github.com/twitter/effectivescala.git. The document is licensed under CC-BY 3.0.

Scala is one of the primary programming languages used at Twitter, and most of the Twitter infrastructure is written in Scala. The Effective Scala guide is a series of short essays, a set of “best practices” learned from using Scala inside Twitter. Twitter’s use of Scala is mainly for creating high volume, distributed systems, though most of the guide should be applicable to other domains.

Sounds like a guide to read if you are either looking for work at Twitter or just want to get better at Scala. Both are worthwhile goals.

Building a Graph data structure in PHP

Filed under: Graphs,PHP — Patrick Durusau @ 5:09 pm

Building a Graph data structure in PHP

From the post:

Graphs are one of the most frequently used data structures, along with linked lists and trees. In a recent PHP project I needed to build a Graph structure to analyze some interlinked urls. The problem was of a simple nature, so rather than writing my own code, I went with the one available in the Pear repository.

The Pear Structures_Graph package allows creating and manipulating graph data structures. It allows building of either directed or undirected graphs, with data and metadata stored in nodes. The library provides functions for graph traversing as well as for characteristic extraction from the graph topology.

You won’t be processing sharded graph databases in PHP (hopefully) but you may have other graph applications for which PHP will be entirely appropriate.

And if nothing else, it is an easy way to experiment with graphs as a data structure.

Oracle Announces General Availability of MySQL Cluster 7.2

Filed under: MySQL,Oracle — Patrick Durusau @ 5:08 pm

Oracle Announces General Availability of MySQL Cluster 7.2

Another demonstration that high-quality open source projects are not inconsistent with commercial products.

From the post:

Delivers up to 70x More Performance for Complex Queries; Adds New NoSQL Memcached Interface

News Facts

  • Continuing to drive MySQL innovation, Oracle today announced the general availability of MySQL Cluster 7.2.
  • For highly demanding Web-based and communications products and services, MySQL Cluster is designed to cost-effectively deliver 99.999% availability, high write scalability and very low latency.
  • With SQL and NoSQL access through a new Memcached API, MySQL Cluster represents a “best of both worlds” solution allowing key value operations and complex SQL queries within the same database.
  • With MySQL Cluster 7.2, users can also gain up to a 70x increase in performance on complex queries, and enhanced multi-data center scalability.
  • MySQL Cluster 7.2 is also certified with Oracle VM. The combination of its elastic, on-demand scalability and self-healing features, together with Oracle VM support, makes MySQL Cluster an ideal choice for deployments in the cloud.
  • Also generally available today is the latest release of the MySQL Cluster Manager, version 1.1.4, further improving the ease of use and administration of MySQL Cluster.

Scalding

Filed under: Cascading,Scalding — Patrick Durusau @ 5:08 pm

Scalding by Patrick Oscar Boykin.

From the blog:

Today, we’re excited to open source Scalding, a Scala API for Cascading. Cascading is a thin Java library and API that sits on top of Apache Hadoop’s MapReduce layer. Scalding is comprised of two main components:

  • a DSL to make MapReduce computations look very similar to Scala’s collection API
  • A wrapper for Cascading to make it simpler to define the typical use cases of jobs, tests and describing data sources on a Hadoop Distributed File System (HDFS) or local disk
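
Scalding’s own examples include the canonical word count, which shows how closely the DSL tracks Scala’s collection API (treat the exact field names here as illustrative):

    import com.twitter.scalding._

    // Read lines, split into words, count occurrences per word, write TSV.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }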

Interesting find since I just mentioned Cascading yesterday.

How to create a visualization

Filed under: Graphics,Visualization — Patrick Durusau @ 5:06 pm

How to create a visualization by Pete Warden, who walks through the steps behind his latest Facebook visualization.

From the post:

Over the last few years I’ve created a few popular visualizations, a lot of duds, and I’ve learned a few lessons along the way. For my latest analysis of where Facebook users go on vacation, I decided to document the steps I follow to build my visualizations. It’s a very rough guide; these are just stages I’ve learned to follow by trial and error, but following these guidelines is a good way to start if you’re looking to create your first visualization.

If you want to get better at creating visualizations, this is a post to read and re-read, on a fairly regular basis.

FOSDEM Videos

Filed under: Conferences,Open Source — Patrick Durusau @ 5:05 pm

FOSDEM – 2012 – First videos uploaded!

Some of the videos for FOSDEM 2012 have been uploaded, with more on the way. So check back or watch for announcements.

I was delighted to find that the video server, http://video.fosdem.org/ has videos going back to 2005!

I don’t know that a FOSDEM video would be a real crowd pleaser at your house but you won’t know unless you ask. 😉
