Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

March 6, 2012

Computer Algorithms: Merge Sort

Filed under: Algorithms,Merge Sort — Patrick Durusau @ 8:10 pm

Computer Algorithms: Merge Sort

From the post:

Basically, sorting algorithms can be divided into two main groups: those based on comparisons and those that are not. I already posted about some of the algorithms of the first group. Insertion sort, bubble sort and Shell sort are based on the comparison model. The problem with these three algorithms is that their complexity is O(n²), so they are very slow.

So is it possible to sort a list by comparing its items faster than O(n²)? The answer is yes, and here’s how we can do it.
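
The post’s answer is merge sort. As a minimal sketch of the idea (mine, not the post’s code, assuming nothing beyond the standard library):

    def merge_sort(items):
        """Sort a list in O(n log n) by splitting in half and merging sorted halves."""
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):   # merge in one linear pass
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        merged.extend(left[i:]); merged.extend(right[j:])
        return merged

    print(merge_sort([5, 2, 4, 7, 1, 3, 2, 6]))  # [1, 2, 2, 3, 4, 5, 6, 7]

There are O(log n) levels of splitting and each level does O(n) work merging, which is where the speed-up over the O(n²) sorts comes from.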

Nice illustrations!

I suspect that algorithms, and graph algorithms in particular, are going to become fairly common here. Suggestions of work or posts that I should cover are most welcome!

Stanford – Delayed Classes – Enroll Now!

If you have been waiting for notices about the delayed Stanford courses for Spring 2012, your wait is over!

Even if you signed up for more information, you must register at the course webpage to take the course.

Details as I have them on 6 March 2012 (check course pages for official information):

Cryptography Starts March 12th.

Design and Analysis of Algorithms Part 1 Starts March 12th.

Game Theory Starts March 19th.

Natural Language Processing Starts March 12th.

Probabilistic Graphical Models Starts March 19th.

You may be asking yourself, “Are all these courses useful for topic maps?”

I would answer by pointing out that librarians and indexers rely on a broad knowledge of the world to make information more accessible to users.

By way of contrast, “big data” and Google have made it less accessible.

Something to think about while you are registering for one or more of these courses!

February 24, 2012

Graph Mining: Laws, Generators, and Algorithms

Filed under: Algorithms,Graph Database Benchmark,Graphs — Patrick Durusau @ 5:06 pm

Graph Mining: Laws, Generators, and Algorithms by Deepayan Chakrabarti and Christos Faloutsos.

Abstract:

How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M : N relation in database terminology can be represented as a graph. A lot of these questions boil down to the following: “How can we generate synthetic but realistic graphs?” To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey gives an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.

If any readers of this blog have doubts about the need for mappings between terminology in fields of research (topic maps), consider the authors’ remark:

…we need to detect patterns in graphs and then generate synthetic graphs matching such patterns automatically.

This is a hard problem. What patterns should we look for? What do such patterns mean? How can we generate them? A lot of research ink has been spent on this problem, not only by computer scientists but also physicists, mathematicians, sociologists, and others. However, there is little interaction among these fields with the result that they often use different terminology and do not benefit from each other’s advances. In this survey, we attempt to give an overview of the main ideas. Our focus is on combining sources from all the different fields to gain a coherent picture of the current state-of-the-art. The interested reader is also referred to some excellent and entertaining books on the topic, namely, Barabási [2002], Watts [2003], and Dorogovtsev and Mendes [2003]. (emphasis added)

Extremely detailed survey with copious references. Dates from 2006.

Do you know of a later cross-field survey that updates this article?

This would make a good article for the Reading club on Graph databases and distributed systems.

February 20, 2012

A beautiful algorithm

Filed under: Algorithms — Patrick Durusau @ 8:36 pm

A beautiful algorithm

Breakthroughs are possible, even for something as well studied as Conway’s Game of Life.

Read the post to be inspired to keep looking for better solutions.

Solving the Queens Problem with Haskell

Filed under: Algorithms,Haskell — Patrick Durusau @ 8:35 pm

Solving the Queens Problem with Haskell

From the post:

The eight queens puzzle is a classic problem in algorithm design because it showcases the power of backtracking and recursion. The premise is simple: find a way to place 8 queens on an 8×8 board such that none of them are attacking each other, which means that no two queens can share the same row, column, or diagonal.

Sounds simple enough, but how to implement it? The most basic solution would be to brute force it: try every possible configuration, stopping when a valid one is found. Not very efficient, but hey, at least it’s easy to understand.
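
The post’s implementation is in Haskell; as a sketch of the same backtracking idea in Python (function names are mine, not the post’s):

    def queens(n):
        """Return one placement of n queens as a column index per row, or None."""
        def safe(cols, col):
            row = len(cols)  # the row we are trying to fill
            return all(col != c and abs(col - c) != row - r
                       for r, c in enumerate(cols))

        def place(cols):
            if len(cols) == n:            # every row filled: a valid placement
                return cols
            for col in range(n):
                if safe(cols, col):       # prune: only recurse into safe columns
                    found = place(cols + [col])
                    if found:
                        return found
            return None                   # dead end: backtrack

        return place([])

    print(queens(8))  # [0, 4, 7, 5, 2, 6, 1, 3]

The pruning in safe is the whole point: entire subtrees of the brute-force search are never visited.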

A good illustration of recursion and backtracking, but one that also suggests brute force isn’t the best option for authoring or manipulating topic maps.

A deep understanding of your data, relevant subjects, methods for identifying those subjects, etc., will be more useful than brute force.

February 14, 2012

On Approximating String Selection Problems with Outliers

Filed under: Algorithms,Bioinformatics,String Matching — Patrick Durusau @ 5:07 pm

On Approximating String Selection Problems with Outliers by Christina Boucher, Gad M. Landau, Avivit Levy, David Pritchard and Oren Weimann.

Abstract:

Many problems in bioinformatics are about finding strings that approximately represent a collection of given strings. We look at more general problems where some input strings can be classified as outliers. The Close to Most Strings problem is, given a set S of same-length strings, and a parameter d, find a string x that maximizes the number of “non-outliers” within Hamming distance d of x. We prove this problem has no PTAS unless ZPP=NP, correcting a decade-old mistake. The Most Strings with Few Bad Columns problem is to find a maximum-size subset of input strings so that the number of non-identical positions is at most k; we show it has no PTAS unless P=NP. We also observe Closest to k Strings has no EPTAS unless W[1]=FPT. In sum, outliers help model problems associated with using biological data, but we show the problem of finding an approximate solution is computationally difficult.
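
To make the Close to Most Strings statement concrete, here is a small sketch (names are mine, not the paper’s) that scores a candidate x against S for a given d. The part the paper proves hard is finding the best x, not scoring one:

    def hamming(a, b):
        """Hamming distance between two same-length strings."""
        return sum(x != y for x, y in zip(a, b))

    def non_outliers(x, S, d):
        """Count the strings in S within Hamming distance d of candidate x."""
        return sum(hamming(x, s) <= d for s in S)

    S = ["ACGT", "ACGA", "TCGT", "GGGG"]
    print(non_outliers("ACGT", S, 1))  # 3, with "GGGG" as the outlier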

Just in case you need a break from graph algorithms, intractable and otherwise. 😉

Sublinear Time Algorithm for PageRank Computations and Related Applications

Filed under: Algorithms,PageRank — Patrick Durusau @ 5:06 pm

Sublinear Time Algorithm for PageRank Computations and Related Applications by Christian Borgs, Michael Brautbar, Jennifer Chayes, Shang-Hua Teng

In a network, identifying all vertices whose PageRank is more than a given threshold value Δ is a basic problem that has arisen in Web and social network analyses. In this paper, we develop a nearly optimal, sublinear time, randomized algorithm for a close variant of this problem. When given a network G = (V, E), a threshold value Δ, and a positive constant c > 1, with probability 1 − o(1), our algorithm will return a subset S ⊆ V with the property that S contains all vertices of PageRank at least Δ and no vertex with PageRank less than Δ/c. The running time of our algorithm is always Õ(n/Δ). In addition, our algorithm can be efficiently implemented in various network access models including the Jump and Crawl query model recently studied by [brautbar_kearns10], making it suitable for dealing with large social and information networks.

As part of our analysis, we show that any algorithm for solving this problem must have expected time complexity of Ω(n/Δ). Thus, our algorithm is optimal up to a logarithmic factor. Our algorithm (for identifying vertices with significant PageRank) applies a multi-scale sampling scheme that uses a fast personalized PageRank estimator as its main subroutine. We develop a new local randomized algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed by Jeh and Widom [JehW03] and by Andersen, Chung, and Lang [AndersenCL06]. Our multi-scale sampling scheme can also be adapted to handle a large class of matrix sampling problems that may have potential applications to online advertising on large social networks (see the appendix).
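
For contrast with the sublinear approach, the naive baseline computes the full PageRank vector by power iteration and then thresholds it, paying for the whole graph on every pass, which is exactly what the paper avoids. A rough sketch (parameter names and the tiny example graph are mine):

    import numpy as np

    def pagerank_at_least(adj, delta, alpha=0.15, iters=50):
        """Vertices with PageRank >= delta, via full power iteration.

        adj is a list of out-neighbor lists; this sketch assumes every
        vertex has at least one out-edge (no dangling-node handling).
        """
        n = len(adj)
        pr = np.full(n, 1.0 / n)
        for _ in range(iters):
            nxt = np.full(n, alpha / n)          # teleport mass
            for u, outs in enumerate(adj):
                share = (1 - alpha) * pr[u] / len(outs)
                for v in outs:                   # push mass along out-edges
                    nxt[v] += share
            pr = nxt
        return {v for v in range(n) if pr[v] >= delta}

    adj = [[1, 2], [2], [0], [0]]  # a tiny directed graph
    print(pagerank_at_least(adj, 0.25))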

Pay close attention to the authors’ definition of “significant” vertices:

A basic problem in network analysis is to identify the set of its vertices that are “significant.” For example, the significant nodes in the web graph defined by a query could provide the authoritative contents in web search; they could be the critical proteins in a protein interaction network; and they could be the set of people (in a social network) most effective to seed the influence for online advertising. As the networks become larger, we need more efficient algorithms to identify these “significant” nodes.

As far as online advertising goes, I await vendors’ discovery that “pull” models of advertising pre-qualify potential purchasers. “Push” models spam everyone within reach, with correspondingly low success rates.

For your convenience, the citations that don’t work well as sources in the abstract:

brautbar_kearns10 – Local Algorithms for Finding Interesting Individuals in Large Networks by Mickey Brautbar and Michael Kearns.

Jeh and Widom – Scaling personalized web search (ACM), Scaling personalized web search (Stanford, free).

Andersen, Chung, and Lang – Local graph partitioning using PageRank vectors.

January 28, 2012

Citogenesis in science and the importance of real problems

Filed under: Algorithms,CS Lectures — Patrick Durusau @ 10:54 pm

Citogenesis in science and the importance of real problems

Daniel Lemire writes:

Many papers in Computer Science tell the following story:

  • There is a pre-existing problem P.
  • There are a few relatively simple but effective solutions to problem P. Among them is solution X.
  • We came up with a new solution X+ which is a clever variation on X. It looks good on paper.
  • We ran some experiments and tweaked our results until X+ looked good. We found a clever way to avoid comparing X+ and X directly and fairly, as it might then become obvious that the gains are small, or even negative! We would gladly report negative results, but then our paper could not be published.

It is a very convenient story for reviewers: the story is simple and easy to assess superficially. The problem is that sometimes, especially if the authors are famous and the idea is compelling, the results will spread. People will adopt X+ and cite it in their work. And the more they cite it, the more enticing it is to use X+ as every citation becomes further validation for X+. And why bother with algorithm X given that it is older and X+ is the state-of-the-art?

Occasionally, someone might try both X and X+, and they may report results showing that the gains due to X+ are small, or negative. But they have no incentive to make a big deal of it because they are trying to propose yet another better algorithm (X++).

But don’t we see the same thing in blogs, where writers say “some,” “many,” “often,” etc., but make no claims that can be evaluated by others?

Make no mistake, given the rate of mis-citation that I find in published proceedings, I really want to agree with Daniel, but I think the matter is more complex than simply saying that “engineers” work with “real” tests.

One of my pet peeves is the lack of history that I find in most CS papers. They may go back ten years, but what about thirty or even forty years ago?

But as for engineers, why is there so little code reuse if they are so interested in efficiency? Is rewriting code really more efficient, or just NWH (Not Written Here)?

January 18, 2012

Compact Binary Relation Representations with Rich Functionality

Filed under: Algorithms,Binary Relations,Graphs,Trees,Wavelet Trees — Patrick Durusau @ 7:57 pm

Compact Binary Relation Representations with Rich Functionality by Jérémy Barbay, Francisco Claude, Gonzalo Navarro.

Abstract:

Binary relations are an important abstraction arising in many data representation problems. The data structures proposed so far to represent them support just a few basic operations required to fit one particular application. We identify many of those operations arising in applications and generalize them into a wide set of desirable queries for a binary relation representation. We also identify reductions among those operations. We then introduce several novel binary relation representations, some simple and some quite sophisticated, that not only are space-efficient but also efficiently support a large subset of the desired queries.

Read the introduction (runs about two of the thirty-two pages) and tell me you aren’t interested. Go ahead.

Just the start of what looks like a very interesting line of research.

January 11, 2012

Algorithms exercise: Find mistakes in Wikipedia articles

Filed under: Algorithms,Critical Reading — Patrick Durusau @ 8:06 pm

Algorithms exercise: Find mistakes in Wikipedia articles by René Pickhardt.

From the post:

Today I started an experiment: I created an exercise for coursework in algorithms and data structures that is very unusual, and many people have been critical of whether this was a good idea. The idea behind the exercise is that students should read Wikipedia articles on topics related to lectures and find mistakes or suggest things that could be improved. Thereby I hope that people will do something that many people in science don’t do often enough: read something critically and carefully and question the things that you have learnt. (more discussion after the exercise)

This is great! Not only can students practice thinking critically, but there is a forum for testing their answers: other users of Wikipedia.

Read the article, do the exercises and see how your critical reading skills fare.

January 5, 2012

Graph Algorithms

Filed under: Algorithms,Cypher,Graphs,Gremlin,Neo4j — Patrick Durusau @ 4:14 pm

Graph Algorithms

I ran across this Wikipedia book while working on one of the data structures posts for today.

I think you may find it useful, but some cautions:

First, being a collection of Wikipedia articles, it doesn’t have a consistent editorial voice. That is more than fussiness: the depth and usefulness of the explanations will vary from article to article.

Second, you will find topics that are “stubs,” and hence not very useful.

Third, with the advent of Neo4j, Gremlin, Cypher and other graph databases/software, I think future entries should include, in addition to text, exercises that users can perform with common software to reinforce their understanding of the entries.

December 15, 2011

AlgLab: An Open Laboratory for Experiments On Algorithms

Filed under: Algorithms — Patrick Durusau @ 7:48 pm

AlgLab: An Open Laboratory for Experiments On Algorithms

From the webpage:

Welcome to the open laboratory for experiments on algorithms. Computer scientists experiment on algorithms for many reasons, including:

  • To understand their fundamental properties.
  • To compare several algorithms to choose the best one for application.
  • To tune an algorithm to perform well in a given context.

This web site contains laboratories for conducting experiments on algorithms. It is a companion site for A Guide to Experimental Algorithmics, by Catherine McGeoch (Cambridge University Press).

If you didn’t get the brochure, go to: www.cambridge.org/us/compsci11. Publication date is 2012.

Given the large number of similarity measures, I suspect that experimentation with similarity algorithms is going to be a “hot” area of research as well. No one measure is going to be the “best” one for all uses.

What is unknown (at this point) is what factors may be indicators or break-points for choosing one similarity algorithm over another.
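
As a small illustration of why no single measure wins, here is a sketch (the example “documents” are invented) comparing Jaccard similarity on token sets with cosine similarity on term-count vectors; the two answer subtly different questions about the same data:

    import math

    def jaccard(a, b):
        """Jaccard similarity of two sets: |intersection| / |union|."""
        return len(a & b) / len(a | b)

    def cosine(u, v):
        """Cosine similarity of two equal-length vectors."""
        dot = sum(x * y for x, y in zip(u, v))
        norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norms

    # The same two "documents" as token sets and as term-count vectors.
    print(jaccard({"topic", "map", "merge"}, {"topic", "map", "graph", "node"}))  # 0.4
    print(cosine([2, 1, 0], [1, 1, 4]))  # ~0.316, term weights change the story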

December 4, 2011

Clustering Large Attributed Graphs: An Efficient Incremental Approach

Filed under: Algorithms,Clustering,Graphs — Patrick Durusau @ 8:19 pm

Clustering Large Attributed Graphs: An Efficient Incremental Approach by Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. (PDF file)

Abstract:

In recent years, many networks have become available for analysis, including social networks, sensor networks, biological networks, etc. Graph clustering has shown its effectiveness in analyzing and visualizing large networks. The goal of graph clustering is to partition vertices in a large graph into clusters based on various criteria such as vertex connectivity or neighborhood similarity. Many existing graph clustering methods mainly focus on the topological structures, but largely ignore the vertex properties which are often heterogeneous. Recently, a new graph clustering algorithm, SA-Cluster, has been proposed which combines structural and attribute similarities through a unified distance measure. SA-Cluster performs matrix multiplication to calculate the random walk distances between graph vertices. As the edge weights are iteratively adjusted to balance the importance between structural and attribute similarities, matrix multiplication is repeated in each iteration of the clustering process to recalculate the random walk distances which are affected by the edge weight update.

In order to improve the efficiency and scalability of SA-Cluster, in this paper, we propose an efficient algorithm Inc-Cluster to incrementally update the random walk distances given the edge weight increments. Complexity analysis is provided to estimate how much runtime cost Inc-Cluster can save. Experimental results demonstrate that Inc-Cluster achieves significant speedup over SA-Cluster on large graphs, while achieving exactly the same clustering quality in terms of intra-cluster structural cohesiveness and attribute value homogeneity.

Seeing this reminded me that I need to review the other papers presented at the 2010 IEEE International Conference on Data Mining. The problem is that papers that seem most relevant at one time don’t seem as relevant six months later. Same papers, same person looking at them. The passage of time and other papers, I suspect.

Graph algorithms continue to improve. If you are working with large graphs, I suggest you give some time to this paper.

November 30, 2011

balanced binary search trees exercise for algorithms and data structures class

Filed under: Algorithms,Data Structures,Search Trees — Patrick Durusau @ 8:11 pm

balanced binary search trees exercise for algorithms and data structures class by René Pickhardt.

From the post:

I created some exercises regarding binary search trees. This time there is no coding involved. My experience from teaching former classes is that many people have a hard time understanding why trees are useful and what the dangers of these trees are. Therefore I have created some straightforward exercises that nevertheless involve some work and will hopefully help the students to better understand and internalize the concepts of binary search trees, which are in my opinion one of the most fundamental and important concepts in a class about algorithms and data structures.
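
The danger René’s exercises target is easy to demonstrate: inserting keys in sorted order degrades a plain binary search tree into a linked list. A minimal sketch, assuming a bare BST with no rebalancing:

    import random

    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def insert(root, key):
        """Insert key into an unbalanced BST; return the (possibly new) root."""
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    def height(root):
        return 0 if root is None else 1 + max(height(root.left), height(root.right))

    sorted_tree, shuffled_tree = None, None
    keys = list(range(100))
    for k in keys:
        sorted_tree = insert(sorted_tree, k)      # sorted input: a 100-deep chain
    random.shuffle(keys)
    for k in keys:
        shuffled_tree = insert(shuffled_tree, k)  # random input: roughly balanced
    print(height(sorted_tree), height(shuffled_tree))  # 100 vs. ~14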

I visited René’s blog because of the Google n-gram post but could not leave without mentioning these exercises.

Great teaching technique!

What parts of topic maps should be illustrated with similar exercises?

PS: Still working on it but I am thinking that the real power of topic maps lies in its lack of precision or rather that a topic map can be as precise or as loose as need be. No pre-set need to have a decidable outcome. Or perhaps rather, it can have a decidable outcome that is the decidable outcome because I say it is so. 😉

November 1, 2011

Dictionary of Algorithms and Data Structures

Filed under: Algorithms — Patrick Durusau @ 3:33 pm

Dictionary of Algorithms and Data Structures

From the webpage:

This web site is hosted in part by the Software and Systems Division, Information Technology Laboratory.

This is a dictionary of algorithms, algorithmic techniques, data structures, archetypal problems, and related definitions. Algorithms include common functions, such as Ackermann’s function. Problems include traveling salesman and Byzantine generals. Some entries have links to implementations and more information. Index pages list entries by area and by type. The two-level index has a total download 1/20 as big as this page.

Don’t use this site to cheat. Teachers, contact us if we can help.

To define or correct terms, please contact Paul E. Black. We do not include algorithms particular to business data processing, communications, operating systems or distributed algorithms, programming languages, AI, graphics, or numerical analysis: it is tough enough covering “general” algorithms and data structures.
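
As a flavor of the entries, Ackermann’s function, mentioned above, is tiny to write down yet famously explosive; a quick sketch:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def ackermann(m, n):
        """Ackermann's function: total and computable, but not primitive recursive."""
        if m == 0:
            return n + 1
        if n == 0:
            return ackermann(m - 1, 1)
        return ackermann(m - 1, ackermann(m, n - 1))

    print(ackermann(2, 3))  # 9
    print(ackermann(3, 3))  # 61, while ackermann(4, 2) already has 19,729 digits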

I thought I had listed this site before but apparently never did. Although it covers only general algorithms, it is a good resource to have on hand.

October 31, 2011

Inquiry: Algebraic Geometry and Topology

Filed under: Algebraic Geometry,Algorithms,Geometry,Topological Data Analysis,Topology — Patrick Durusau @ 7:33 pm

Inquiry: Algebraic Geometry and Topology

Speaking of money and such matters, a call for assistance from Quantivity:

Algebraic geometry and topology traditionally focused on fairly pure math considerations. With the rise of high-dimensional machine learning, these fields are increasingly being pulled into interesting computational applications such as manifold learning. Algebraic statistics and information geometry offer potential to help bridge these fields with modern statistics, especially time-series and random matrices.

Early evidence suggests potential for significant intellectual cross-fertilization with finance, both mathematical and computational. Geometrically, richer modeling and analysis of latent geometric structure than available from classic linear algebraic decomposition (e.g. PCA, one of the main workhorses of modern ℙ finance); for example, cumulant component analysis. Topologically, more effective qualitative analysis of data sampled from manifolds or singular algebraic varieties; for example, persistent homology (see CompTop).

As evidenced by Twitter followers, numerous Quantivity readers are experts in these fields. Thus, perhaps the best way to explore is to seek insight from readers.

Readers: please use comments to suggest applied literature from these fields; ideally, although not required, that of potential relevance to finance modeling. All types of literature are requested, from intro texts to survey articles to preprint working papers on specific applications.

These suggestions will be synthesized into one or more subsequent posts, along with appropriate additions to People of Quant Research.

If you or a member of your family knows of any relevant resources, please go to: Inquiry: Algebraic Geometry and Topology and volunteer those resources as comments. You might even make the People of Quant Research list!

I wonder if there would be any interest in tracking bundled instruments using topic maps? That could be an interesting question.

October 29, 2011

We Really Don’t Know How To Compute!

Filed under: Algorithms,CS Lectures,Parallel Programming — Patrick Durusau @ 7:20 pm

We Really Don’t Know How To Compute! by Gerald Jay Sussman.

This is a must watch video! Sussman tries to make the case that we need to think differently about computing. For example, being able to backtrack the provenance of data in a series of operations. Or being able to maintain inconsistent world views while maintaining locally consistent world views in a single system. Or being able to say where world views diverge, without ever claiming either one to be correct/incorrect.

He argues in general for systems that will be robust enough for massively parallel programming, where inconsistencies and the like are going to abound when applied to, say, all known medical literature. It isn’t going to be helpful if our systems fall over on their sides when they encounter inconsistency.

A lot of what Sussman says is, I think, largely applicable to parallel processing of topic maps. I will certainly be looking up some of his class videos from MIT.

From the webpage:

Summary

Gerald Jay Sussman compares our computational skills with the genome, concluding that we are way behind in creating complex systems such as living organisms, and proposing a few areas of improvement.

Bio

Gerald Jay Sussman is the Panasonic Professor of EE at MIT. Sussman is a coauthor (with Hal Abelson and Julie Sussman) of the MIT computer science textbook “Structure and Interpretation of Computer Programs”. Sussman has made a number of important contributions to Artificial Intelligence, and with his former student, Guy L. Steele Jr., invented the Scheme programming language in 1975.

About the conference

Strange Loop is a multi-disciplinary conference that aims to bring together the developers and thinkers building tomorrow’s technology in fields such as emerging languages, alternative databases, concurrency, distributed systems, mobile development, and the web.

October 26, 2011

Can an algorithm be wrong? Twitter Trends,…

Filed under: Algorithms — Patrick Durusau @ 6:58 pm

Can an algorithm be wrong? Twitter Trends, the specter of censorship, and our faith in the algorithms around us

From the post:

The interesting question is not whether Twitter is censoring its Trends list. The interesting question is, what do we think the Trends list is, what it represents and how it works, that we can presume to hold it accountable when we think it is “wrong?” What are these algorithms, and what do we want them to be?

It’s not the first time it has been asked. Gilad Lotan at SocialFlow (and erstwhile Microsoft UX designer), spurred by questions raised by participants and supporters of the Occupy Wall Street protests, asks the question: is Twitter censoring its Trends list to exclude #occupywallstreet and #occupyboston? While the protest movement gains traction and media coverage, and participants, observers and critics turn to Twitter to discuss it, why are these widely-known hashtags not Trending? Why are they not Trending in the very cities where protests have occurred, including New York?

Careful analysis of the data indicates that Twitter is not censoring its Trends list. But I have heard people that I consider to be quite responsible argue to the contrary.

I raise this here as a caution against criticism of topic maps that don’t reflect what you think are undisputed “facts.” That you think some case to be true or that it must be true (in the case of political discussions) isn’t a sufficient basis for others to feel the same way. Nor does that mean their topic maps are “censoring” your view. Others have no obligation to advance your perspective.

October 25, 2011

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)

Filed under: Algorithms,Hadoop,MapReduce — Patrick Durusau @ 7:34 pm

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011) by Amund Tveit.

From the post:

It’s been a year since I updated the mapreduce algorithms posting last time, and it has been truly an excellent year for mapreduce and hadoop – the number of commercial vendors supporting it has multiplied, e.g. with 5 announcements at EMC World only last week (Greenplum, Mellanox, Datastax, NetApp, and Snaplogic) and today’s Datameer funding announcement, which benefits the mapreduce and hadoop ecosystem as a whole (even for small fish like us here in Atbrox). The work-horse in mapreduce is the algorithm; this update has added 35 new papers compared to the prior posting, and new ones are marked with *. I’ve also added 2 new categories since the last update – astronomy and social networking.

A truly awesome resource!

This promises to be hours of entertainment!

October 9, 2011

Approximating Edit Distance in Near-Linear Time

Filed under: Algorithms,Edit Distance,Levenshtein Distance — Patrick Durusau @ 6:44 pm

Approximating Edit Distance in Near-Linear Time

Abstract:

We show how to compute the edit distance between two strings of length n up to a factor of 2^Õ(√(log n)) in n^(1+o(1)) time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art n^(1/3+o(1)) approximation. Previously, approximation of 2^Õ(√(log n)) was known only for embedding edit distance into ℓ₁, and it is not known if that embedding can be computed in less than quadratic time.

Deeply important research for bioinformatics and text searching. Note that the edit distance is “approximated.”

If you are not familiar with this area, Levenshtein Distance, in Three Flavors by Michael Gilleland is a nice starting point with source code in three languages.
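
If you want the exact (quadratic-time) distance to compare approximations against, the classic dynamic program fits in a few lines; a sketch:

    def levenshtein(a, b):
        """Exact edit distance via the classic O(len(a) * len(b)) dynamic program."""
        prev = list(range(len(b) + 1))    # distances from "" to prefixes of b
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # delete from a
                                curr[j - 1] + 1,              # insert into a
                                prev[j - 1] + (ca != cb)))    # substitute
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3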

October 8, 2011

Tree Traversal in O(1) Space

Filed under: Algorithms,Graphs,Software,Trees — Patrick Durusau @ 8:14 pm

Tree Traversal in O(1) Space by Sanjoy.

From the post:

I’ve been reading some texts recently, and came across a very interesting way to traverse a graph, called pointer reversal. The idea is this — instead of maintaining an explicit stack (of the places you’ve visited), you try to store the relevant information in the nodes themselves. One approach that works on directed graphs with two (outgoing) arcs per node is called the Deutsch-Schorr-Waite algorithm. This was later extended by Thorelli to work for directed graphs with an unknown number of (outgoing) arcs per node.

Implemented here for a tree. Care to go for a more general graph?
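
The post implements Deutsch-Schorr-Waite pointer reversal. A related O(1)-extra-space trick, not the post’s algorithm but simpler to show, is Morris in-order traversal, which temporarily rewires right pointers instead of keeping a stack:

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def morris_inorder(root):
        """Yield keys in order using O(1) extra space by temporarily threading."""
        node = root
        while node:
            if node.left is None:
                yield node.key
                node = node.right
            else:
                pred = node.left                  # find the in-order predecessor
                while pred.right and pred.right is not node:
                    pred = pred.right
                if pred.right is None:
                    pred.right = node             # thread: remember the way back
                    node = node.left
                else:
                    pred.right = None             # unthread: restore the tree
                    yield node.key
                    node = node.right

    tree = Node(2, Node(1), Node(3))
    print(list(morris_inorder(tree)))  # [1, 2, 3]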

October 7, 2011

Getting Creative with MapReduce

Filed under: Algorithms,Cascalog,MapReduce — Patrick Durusau @ 6:16 pm

Getting Creative with MapReduce

From the post:

One problem with many existing MapReduce abstraction layers is the utter difficulty of testing queries and workflows. End-to-end tests are maddening to craft in vanilla Hadoop and frustrating at best in Pig and Hive. The difficulty of testing MapReduce workflows makes it scary to change code, and destroys your desire to be creative. A proper testing suite is an absolute prerequisite to doing creative work in big data.

In this blog post, I aim to show how most of the difficulty of writing and testing MapReduce queries stems from the fact that Hadoop confounds application logic with decisions about data storage. These problems are the result of poorly implemented abstractions over the primitives of MapReduce, not problems with the core MapReduce algorithms.

The author advocates the use of Cascalog and its testing suite. Comments?

October 3, 2011

Algorithms of the Intelligent Web Review

Algorithms of the Intelligent Web Review by Pearlene McKinley

From the post:

I have always had an interest in AI, machine learning, and data mining but I found the introductory books too mathematical and focused mostly on solving academic problems rather than real-world industrial problems. So, I was curious to see what this book was about.

I have read the book front-to-back (twice!) before writing this report. I started reading the electronic version a couple of months ago and read the paper print again over the weekend. This is the best practical book in machine learning that you can buy today — period. All the examples are written in Java and all algorithms are explained in plain English. The writing style is superb! The book was written by one author (Marmanis) while the other (Babenko) contributed the source code, so there are no gaps in the narrative; it is engaging, pleasant, and fluent. The author leads the reader from the very introductory concepts to some fairly advanced topics. Some of the topics are covered in the book and some are left as an exercise at the end of each chapter (there is a “To Do” section, which was a wonderful idea!). I did not like some of the figures (they were probably made by the authors, not an artist) but this was only a minor aesthetic inconvenience.

The book covers four cornerstones of machine learning and intelligence, i.e. intelligent search, recommendations, clustering, and classification. It also covers a subject that today you can find only in the academic literature, i.e. combination techniques. Combination techniques are very powerful, and although the author presents the techniques in the context of classifiers, it is clear that the same can be done for recommendations — as the BellKor team did for the Netflix prize.

Wonder if this will be useful in the Stanford AI course that starts next week with more than 130,000 students? Introduction to Artificial Intelligence – Stanford Class

I am going to order a copy, if for no other reason than to evaluate the reviewer’s claim of explanations “in plain English.” I have seen some fairly clever explanations of AI algorithms and would like to see how these stack up.

October 2, 2011

“Algorithm” is not a !@%#$@ 4-letter Word

Filed under: Algorithms,Graphs — Patrick Durusau @ 6:37 pm

“Algorithm” is not a !@%#$@ 4-letter Word by Jamis Buck (via @CompSciFact)

Very nice algorithm and graph presentation.

Instructive on algorithms and graphs, and despite the fact that more than a little skill went into the presentation, technically simple enough that it ought to encourage all of us to do more such explanations for topic maps. (Well, I am talking to myself in particular, to be honest. I keep wanting to do the “perfect” presentation rather than the presently “possible” ones.)

September 27, 2011

k-means Approach to the Karhunen-Loève Transform (aka PCA – Principal Component Analysis)

Filed under: Algorithms,Principal Component Analysis (PCA) — Patrick Durusau @ 6:52 pm

k-means Approach to the Karhunen-Loeve Transform by Krzysztof Misztal, Przemyslaw Spurek, and Jacek Tabor.

Abstract:

We present a simultaneous generalization of the well-known Karhunen-Loeve (PCA) and k-means algorithms. The basic idea lies in approximating the data with k affine subspaces of a given dimension n. In the case n=0 we obtain the classical k-means, while for k=1 we obtain the PCA algorithm.

We show that for some data exploration problems this method gives better results than either of the classical approaches.

I know, it is a very forbidding title but once you look at the paper you will be glad you did.

First, the authors begin with a graphic illustration of the goal of their technique (no, you have to look at the paper to see it), which even the most “lay” reader can appreciate.

Second, the need for topic maps strikes again as in the third paragraph we learn: “…Karhunen-Loéve transform (called also PCA – Principle Component Analysis)….”

Third, some of the uses of this technique:

  • data mining – we can detect important coordinates and subsets with similar properties;
  • clustering – our modification of k-means can detect different, high-dimensional relations in data;
  • image compression and image segmentation;
  • pattern recognition – thanks to the detection of relations in data, we can assign data to classes defined beforehand.

A sample implementation is available at: http://www.ii.uj.edu.pl/~misztalk.
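
To see the two extremes the abstract names, here is a sketch (numpy only; the toy data is mine): the best single affine line through the data is PCA (k = 1), while 0-dimensional “subspaces”, i.e. points, recover classical k-means (n = 0). The paper’s contribution is everything in between:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))

    # k = 1, n = 1: the best single affine line through the data is PCA.
    center = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - center)
    principal_axis = vt[0]                        # direction of maximum variance

    # k = 2, n = 0: 0-dimensional "subspaces" are points, i.e. classical k-means.
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(20):                           # a few Lloyd iterations
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([X[labels == i].mean(axis=0) for i in range(2)])

    print(principal_axis)
    print(centers)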

September 25, 2011

Furnace — A Property Graph Algorithms Package

Filed under: Algorithms,Blueprints,Frames,Furnace,Graphs,Gremlin,Neo4j,Pipes,Rexster,TinkerPop — Patrick Durusau @ 7:48 pm

Furnace — A Property Graph Algorithms Package

Marko Rodriguez posted the following note to the Gremlin-users mailing list today:

Hello,

For many months, the TinkerPop community has been trying to realize the best way to go about providing a graph analysis package to the TinkerPop stack ( http://bit.ly/qCMlcP ). With the increased flexibility and power of Pipes and the partitioning of Gremlin into multiple JVM languages, we feel that the stack is organized correctly now to support Furnace — A Property Graph Algorithms Package.

http://furnace.tinkerpop.com
( https://github.com/tinkerpop/furnace/wiki if the domain hasn’t propagated to your DNS yet )

The project is currently just stubbed, but over time you can expect the ability to evaluate standard (and non-standard) graph analysis algorithms over Blueprints-enabled graphs in a way that respects explicit and implicit associations in the graph. In short, it will implement the ideas articulated in:

http://markorodriguez.com/2011/02/08/property-graph-algorithms/
http://arxiv.org/abs/0806.2274

This will be possible due to Pipes and the ability to represent abstract relationships using Pipes, Gremlin_groovy (and the upcoming Gremlin_scala). Moreover, while more thought is needed, there will be a way to talk at the Frames level (http://frames.tinkerpop.com) and thus calculate graph algorithms according to one’s domain model. Ultimately, in time, as Furnace develops, we will see a Rexster-Kibble that supports the evaluation of algorithms via Rexster.

While the project is still developing, please feel free to contribute ideas and/or participate in the development process. To conclude, we hope people are excited about the promises that Furnace will bring by raising the processing abstraction level above the imperative representations of Pipes/Gremlin.

Thank you,
Marko.

http://markorodriguez.com

You have been waiting for the opportunity to contribute to the TinkerPop stack, particularly on graph analysis, so here is your chance! Seriously, you need to forward this to every graph person, graph project and graduate student taking graph theory.

We can use simple graphs and hope (pray?) the world is a simple place. Or use more complex graphs to model the world. Do you feel lucky? Do you?

September 18, 2011

Approaching optimality for solving SDD systems

Filed under: Algorithms,Mathematics,Matrix — Patrick Durusau @ 7:29 pm

In October 2010, the authors presented this paper:

Approaching optimality for solving SDD systems by Ioannis Koutis, Gary L. Miller, and Richard Peng.

Public reports on that paper can be found in A Breakthrough in Algorithm Design, in the September 2011 issue of CACM, and in PC Pro: Algorithm sees massive jump in complex number crunching.

The claim is that the new approach will be a billion times faster than traditional techniques.

In February 2011, the authors posted a new and improved version of their algorithm:

A nearly-mlogn time solver for SDD linear systems.

Koutis has written a MATLAB implementation at: CMG: Combinatorial Multigrid

For further background, see: Combinatorial Preconditioning, sparsification, local clustering, low-stretch trees, etc. by Spielman, one of the principal researchers in this area.
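
For a sense of the baseline such solvers improve on, here is plain conjugate gradients on a small SDD system (scipy’s stock solver on a path-graph Laplacian plus a diagonal shift; a generic sketch, not the paper’s algorithm):

    import numpy as np
    from scipy.sparse import diags, identity
    from scipy.sparse.linalg import cg

    # Laplacian of a path graph plus a small shift: SDD and positive definite.
    n = 1000
    L = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n)) + 0.01 * identity(n)
    b = np.ones(n)

    x, info = cg(L, b)           # plain CG; CMG would supply a preconditioner here
    print(info, np.linalg.norm(L @ x - b))  # info == 0 means it converged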

The most obvious application in topic maps would be recommender systems that bring possible merges to a topic map author’s attention or even perform merging on specified conditions. (If the application doesn’t seem obvious, read the post I refer to in Text Feature Extraction (tf-idf) – Part 1 again. It will also give you some ideas about scalable merging tests.)

Years ago Lars Marius told me that topic maps needed to scale on laptops to be successful. It looks like algorithms are catching up to meet his requirement.

September 1, 2011

Greenplum Community

Filed under: Algorithms,Analytics,Machine Learning,SQL — Patrick Durusau @ 6:00 pm

A post by Alex Popescu, Data Scientist Summit Videos, led me to discover the Greenplum Community.

Hosted by Greenplum:

Greenplum is the pioneer of Enterprise Data Cloud™ solutions for large-scale data warehousing and analytics, providing customers with flexible access to all their data for business intelligence and advanced analytics. Greenplum offers industry-leading performance at a low cost for companies managing terabytes to petabytes of data. Data-driven businesses around the world, including NASDAQ, NYSE Euronext, Silver Spring Networks and Zions Bancorporation, have adopted Greenplum Database-based products to support their mission-critical business functions.

Registration (free) brings access to the videos from the Data Scientist Summit.

The “community” is focused on Greenplum software (there is a “community” edition). Do be aware that Greenplum Database CE is a 1.7 GB download. Just so you know.

August 15, 2011

Skiena’s Algorithms Lectures

Filed under: Algorithms,CS Lectures — Patrick Durusau @ 7:31 pm

Skiena’s Algorithms Lectures by Steven Skiena, Stony Brook University.

From the website:

Below are audio, video and lecture slides for 1997 and 2007. Since the lectures are 10 years apart, some of the topics covered by the course have changed. The 1997 lectures have better-quality video and audio than the 2007 ones, although the 2007 lectures cover the newer material and have better lecture notes.

If you found this useful also check out the video lectures of my Discrete Mathematics, Computational Biology, and Computational Finance courses.

I have the first edition of “The Algorithm Design Manual,” which is now out in a second edition. Guess it is time for an upgrade. 😉

There are going to be startups that re-implement assumptions rooted in prior hardware limitations, and startups that shed those assumptions for a competitive advantage. Which one do you want to be?

Saw this in a tweet from @CompSciFact.

August 14, 2011

Springy

Filed under: Algorithms,Graphs,Visualization — Patrick Durusau @ 7:12 pm

Springy

From the webpage:

Springy is a force directed graph layout algorithm.

So what does this ‘force directed’ stuff mean anyway? Excellent question!

It basically means that it uses some real-world physics to try to figure out how to show a network graph in a nice way.

Try to imagine it as a bunch of springs connected to each other.

Written in JavaScript. No surprises, but you may find it useful for webpages.
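
The “bunch of springs” intuition translates almost directly into code. A crude force-directed step in Python (Springy itself is JavaScript; the constants and names here are mine):

    import random

    def layout(nodes, edges, steps=200, spring=0.02, repulse=0.05, cool=0.97):
        """Crude force-directed layout: springs pull neighbors, charges repel all."""
        pos = {v: [random.random(), random.random()] for v in nodes}
        step = 0.1
        for _ in range(steps):
            force = {v: [0.0, 0.0] for v in nodes}
            for a in nodes:                       # every pair repels, like charges
                for b in nodes:
                    if a == b:
                        continue
                    dx = pos[a][0] - pos[b][0]
                    dy = pos[a][1] - pos[b][1]
                    d2 = dx * dx + dy * dy + 1e-9
                    force[a][0] += repulse * dx / d2
                    force[a][1] += repulse * dy / d2
            for a, b in edges:                    # connected nodes attract, springs
                dx = pos[b][0] - pos[a][0]
                dy = pos[b][1] - pos[a][1]
                force[a][0] += spring * dx; force[a][1] += spring * dy
                force[b][0] -= spring * dx; force[b][1] -= spring * dy
            for v in nodes:                       # take a small step, then cool
                pos[v][0] += step * force[v][0]
                pos[v][1] += step * force[v][1]
            step *= cool
        return pos

    print(layout(["a", "b", "c"], [("a", "b"), ("b", "c")]))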
