Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 28, 2012

Algorithms for Massive Data Sets

Filed under: Algorithms,BigData,CS Lectures — Patrick Durusau @ 8:43 am

Algorithms for Massive Data Sets by Inge Li Gørtz and Philip Bille.

From the course description:

A student who has met the objectives of the course will be able to:

  • Describe an algorithm in a comprehensible manner, i.e., accurately, concisely, and unambiguously.
  • Prove correctness of algorithms.
  • Analyze, evaluate, and compare the performance of algorithms in models of computation relevant to massive data sets.
  • Analyze, evaluate, and compare the quality and reliability of solutions.
  • Apply and extend relevant algorithmic techniques for massive data sets.
  • Design algorithms for problems related to massive data sets.
  • Look up and apply relevant research literature for problems related to massive data sets.
  • Systematically identify and analyze problems and make informed choices for solving the problems based on the analysis.
  • Argue clearly for the choices made when solving a problem.

Papers, slides and exercises provided for these topics:

Week 1: Introduction and Hashing: Chained, Universal, and Perfect.

Week 2: Predecessor Data Structures: x-fast tries and y-fast tries.

Week 3: Decremental Connectivity in Trees: Cluster decomposition, Word-Level Parallelism.

Week 4: Nearest Common Ancestors: Distributed data structures, Heavy-path decomposition, alphabetic codes.

Week 5: Amortized analysis and Union-Find.

Week 6: Range Reporting: Range Trees, Fractional Cascading, and kD Trees.

Week 7: Persistent data structures.

Week 8: String matching.

Week 9: String Indexing: Dictionaries, Tries, Suffix trees, and Suffix Sorting.

Week 10: Introduction to approximation algorithms. TSP, k-center, vertex cover.

Week 11: Approximation algorithms: Set Cover, stable matching.

Week 12: External Memory: I/O Algorithms, Cache-Oblivious Algorithms, and Dynamic Programming.

Just reading the papers will improve your big data skills.
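Week 1's topics, chained and universal hashing, are simple enough to sketch. Below is a minimal chained hash table drawing its hash function from the classic multiply-add-mod-prime family; the class and parameter names are mine, not the course's, and integer keys are assumed.

```python
import random

class ChainedHashTable:
    """Chained hashing with a hash function drawn at random from the
    universal family h(x) = ((a*x + b) mod p) mod m (Carter-Wegman style)."""

    def __init__(self, m=64, p=(1 << 61) - 1):
        self.m, self.p = m, p
        self.a = random.randrange(1, p)   # a != 0
        self.b = random.randrange(0, p)
        self.buckets = [[] for _ in range(m)]

    def _hash(self, key):
        return ((self.a * key + self.b) % self.p) % self.m

    def insert(self, key, value):
        bucket = self.buckets[self._hash(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def lookup(self, key):
        for k, v in self.buckets[self._hash(key)]:
            if k == key:
                return v
        return None

table = ChainedHashTable()
table.insert(42, "answer")
print(table.lookup(42))   # -> "answer"
```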

October 27, 2012

Designing good MapReduce algorithms

Filed under: Algorithms,BigData,Hadoop,MapReduce — Patrick Durusau @ 6:28 pm

Designing good MapReduce algorithms by Jeffrey D. Ullman.

From the introduction:

If you are familiar with “big data,” you are probably familiar with the MapReduce approach to implementing parallelism on computing clusters [1]. A cluster consists of many compute nodes, which are processors with their associated memory and disks. The compute nodes are connected by Ethernet or switches so they can pass data from node to node.

Like any other programming model, MapReduce needs an algorithm-design theory. The theory is not just the theory of parallel algorithms—MapReduce requires we coordinate parallel processes in a very specific way. A MapReduce job consists of two functions written by the programmer, plus some magic that happens in the middle:

  1. The Map function turns each input element into zero or more key-value pairs. A “key” in this sense is not unique, and it is in fact important that many pairs with a given key are generated as the Map function is applied to all the input elements.
  2. The system sorts the key-value pairs by key, and for each key creates a pair consisting of the key itself and a list of all the values associated with that key.
  3. The Reduce function is applied, for each key, to its associated list of values. The result of that application is a pair consisting of the key and whatever is produced by the Reduce function applied to the list of values. The output of the entire MapReduce job is what results from the application of the Reduce function to each key and its list.

When we execute a MapReduce job on a system like Hadoop [2], some number of Map tasks and some number of Reduce tasks are created. Each Map task is responsible for applying the Map function to some subset of the input elements, and each Reduce task is responsible for applying the Reduce function to some number of keys and their associated lists of values. The arrangement of tasks and the key-value pairs that communicate between them is suggested in Figure 1. Since the Map tasks can be executed in parallel and the Reduce tasks can be executed in parallel, we can obtain an almost unlimited degree of parallelism—provided there are many compute nodes for executing the tasks, there are many keys, and no one key has an unusually long list of values.

A very important feature of the Map-Reduce form of parallelism is that tasks have the blocking property [3]; that is, no Map or Reduce task delivers any output until it has finished all its work. As a result, if a hardware or software failure occurs in the middle of a MapReduce job, the system has only to restart the Map or Reduce tasks that were located at the failed compute node. The blocking property of tasks is essential to avoid restart of a job whenever there is a failure of any kind. Since Map-Reduce is often used for jobs that require hours on thousands of compute nodes, the probability of at least one failure is high, and without the blocking property large jobs would never finish.

There is much more to the technology of MapReduce. You may wish to consult a free online text that covers MapReduce and a number of its applications [4].
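To make the three-step contract concrete, here is a minimal single-machine sketch of the Map/group/Reduce pipeline Ullman describes, using word count. It is illustrative only and is not Hadoop code; the function names are mine.

```python
from collections import defaultdict

def map_fn(document):
    # Step 1: each input element becomes zero or more key-value pairs.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Step 3: the Reduce function sees one key and its full list of values.
    return (key, sum(values))

def mapreduce(inputs, map_fn, reduce_fn):
    # Step 2: the "magic in the middle" -- group all values by key.
    grouped = defaultdict(list)
    for element in inputs:
        for key, value in map_fn(element):
            grouped[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(grouped.items())]

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
print(mapreduce(docs, map_fn, reduce_fn))
# [('brown', 1), ('dog', 2), ('fox', 1), ('lazy', 1), ('quick', 2), ('the', 3)]
```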

Warning: This article may change your interest in the design of MapReduce algorithms.

Ullman’s stories of algorithm tradeoffs provide motivation to evaluate (or reevaluate) your own design tradeoffs.

October 16, 2012

Report on XLDB Tutorial on Data Structures and Algorithms

Filed under: Algorithms,Data Structures,Fractal Trees,TokuDB,Tokutek — Patrick Durusau @ 3:55 am

Report on XLDB Tutorial on Data Structures and Algorithms by Michael Bender.

From the post:

The tutorial was organized as follows:

  • Module 0: Tutorial overview and introductions. We describe an observed (but not necessary) tradeoff in ingestion, querying, and freshness in traditional databases.
  • Module 1: I/O model and cache-oblivious analysis.
  • Module 2: Write-optimized data structures. We give the optimal trade-off between inserts and point queries. We show how to build data structures that lie on this tradeoff curve.
  • Module 2 continued: Write-optimized data structures perform writes much faster than point queries; this asymmetry affects the design of an ACID compliant database.
  • Module 3: Case study – TokuFS. How to design and build a write-optimized file system.
  • Module 4: Page-replacement algorithms. We give relevant theorems on the performance of page-replacement strategies such as LRU.
  • Module 5: Index design, including covering indexes.
  • Module 6: Log-structured merge trees and fractional cascading.
  • Module 7: Bloom filters.

These algorithms and data structures are used both in NoSQL implementations such as MongoDB and HBase, and in SQL-oriented implementations such as MySQL and TokuDB.

The slides are available here.

A tutorial offered by Michael and Bradley C. Kuszmaul at the 6th XLDB conference.

If you are committed to defending your current implementation choices against all comers, don’t bother with the slides.

If you want a peek at one future path in data structures, get the slides. You won’t be disappointed.
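Module 7 covers Bloom filters, which fit in a few lines. This is a minimal, illustrative version (k hash functions derived from SHA-256 with a salt), not the tutorial's code.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("tokudb")
print("tokudb" in bf, "mongodb" in bf)   # True, (almost certainly) False
```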

October 10, 2012

Distributed Algorithms in NoSQL Databases

Filed under: Algorithms,Distributed Systems,NoSQL — Patrick Durusau @ 4:20 pm

Distributed Algorithms in NoSQL Databases by Ilya Katsov.

From the post:

Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that the NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche of practical studies and real-life trials of different combinations of protocols and algorithms. These developments gradually highlight a system of relevant database building blocks with proven practical efficiency. In this article I’m trying to provide a more or less systematic description of techniques related to distributed operations in NoSQL databases.

In the rest of this article we study a number of distributed activities, like replication or failure detection, that could happen in a database. These activities, highlighted in bold below, are grouped into three major sections:

  • Data Consistency. Historically, NoSQL paid a lot of attention to tradeoffs between consistency, fault-tolerance and performance to serve geographically distributed systems, low-latency or highly available applications. Fundamentally, these tradeoffs spin around data consistency, so this section is devoted to data replication and data repair.
  • Data Placement. A database should accommodate itself to different data distributions, cluster topologies and hardware configurations. In this section we discuss how to distribute or rebalance data in such a way that failures are handled rapidly, persistence guarantees are maintained, queries are efficient, and system resources like RAM or disk space are used evenly throughout the cluster.
  • System Coordination. Coordination techniques like leader election are used in many databases to implement fault-tolerance and strong data consistency. However, even decentralized databases typically track their global state, detect failures and topology changes. This section describes several important techniques that are used to keep the system in a coherent state.

Slow going but well worth the effort.

Not the issues discussed in the puff-piece webinars extolling NoSQL solutions to “big data.”

But you already knew that if you read this far! Enjoy!

I first saw this at Christophe Lalanne’s A bag of tweets / September 2012
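Under data placement, one widely used technique in NoSQL systems is consistent hashing. A minimal sketch of a hash ring with virtual nodes, with names of my own choosing:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes: each key maps to the
    first node clockwise from its hash, so adding or removing a node only
    moves a small fraction of the keys."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"), ring.node_for("user:43"))
```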

Twitter Recommendations by @alpa

Filed under: Algorithms,Recommendation,Tweets — Patrick Durusau @ 4:18 pm

Twitter Recommendations by @alpa by Marti Hearst.

From the post:

Alpa Jain has great experience teaching from her time as a graduate student at Columbia University, and it shows in the clarity of her descriptions of SVD and other recommendation algorithms in today’s lecture:

Would you incorporate recommendation algorithms into a topic map authoring solution?
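As a rough illustration of the SVD-based approach described in the lecture, here is a minimal low-rank reconstruction of a toy ratings matrix with NumPy; the data and the rank are invented for the example.

```python
import numpy as np

# Toy user x item rating matrix (0 = unrated); the values are made up.
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

# Rank-2 truncated SVD gives a smoothed estimate of the full matrix.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Recommend, for user 1, the unrated item with the highest predicted score.
user = 1
unrated = np.where(R[user] == 0)[0]
best = unrated[np.argmax(R_hat[user, unrated])]
print(f"recommend item {best} to user {user} (predicted {R_hat[user, best]:.2f})")
```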

October 9, 2012

Advanced Data Structures [Jeff Erickson, UIUC]

Filed under: Algorithms,Data Structures — Patrick Durusau @ 10:08 am

Advanced Data Structures

From the description:

This course will survey important developments in data structures that have not (yet) worked their way into the standard computer science curriculum. The precise topics will depend on the interests and background of the course participants; see the current schedule for details. Potential topics include self-adjusting binary search trees, dynamic trees and graphs, persistent data structures, geometric data structures, kinetic data structures, I/O-efficient and cache-oblivious data structures, hash tables and Bloom filters, data structures that beat information-theoretic lower bounds, and applications of these data structures in computational geometry, combinatorial optimization, systems and networking, databases, and other areas of computer science.

The course page has links to similar courses.

For hardy souls exploring data structures per se, or building specialized topic maps, it offers an annotated bibliography of readings.

If you haven’t seen it, visit Jeff’s homepage.

To see what Jeff has been up to lately: DBLP: Jeff Erickson.

September 25, 2012

How to teach Algorithms ?

Filed under: Algorithms — Patrick Durusau @ 10:20 am

How to teach Algorithms ? by Shiva Kintali.

From the post:

The goal is to have a very simple to understand “executable pseudo-code” along with an animation framework that “understands” this language. So I started designing a new language and called it Kintali language, for lack of a better word 🙂 . I borrowed syntax from several pseudo-codes. It took me almost two years to implement all the necessary features keeping in mind a broad range of algorithms. I developed an interpreter to translate this language into an intermediate representation with callbacks to an animation library. This summer, I finally implemented the animation library and the front-end in Objective-C. The result is the Algorithms App for iPad, released on Sep 20, 2012. This is my attempt to teach as many algorithms as possible by intuitive visualization and friendly exercises.

I like the idea of an “executable pseudo-code,” but I have this nagging feeling this has been done before.

Yes?

If I had a topic map of algorithm classes, textbooks and teaching aids I would know the answer right away.

But I don’t.

Are you aware of similar projects?

September 24, 2012

The tyranny of algorithms

Filed under: Algorithms,Computation — Patrick Durusau @ 4:18 pm

The tyranny of algorithms by Kaiser Fung.

This WSJ book review on “The Tyranny of Algorithms” (link) is well worth reading for those interested in how computer algorithms are used in business and government. I agree with most of what this author has to say.

You need to read Kaiser’s comment on the review before proceeding….

Back?

I am not altogether sure that algorithms are the problem the book and/or review make them out to be.

No doubt the concerns and problems described are real, but they don’t exist because of algorithms.

Rather they exist because we are basically lazy and accept the result of algorithms, just like we accept the judgements of others, advertising, etc.

Were we to question algorithms, judgements of others, advertising, we might be living in a very different world.

But we don’t, so we’re not.

So the question is, how to live with algorithms knowing we are too lazy to question them. Yes?

Are these shadows/echoes of Thinking, Fast and Slow?

September 22, 2012

Damn Cool Algorithms: Cardinality Estimation

Filed under: Algorithms,Cardinality Estimation — Patrick Durusau @ 3:12 pm

Damn Cool Algorithms: Cardinality Estimation by Nick Johnson.

From the post:

Suppose you have a very large dataset – far too large to hold in memory – with duplicate entries. You want to know how many duplicate entries, but your data isn’t sorted, and it’s big enough that sorting and counting is impractical. How do you estimate how many unique entries the dataset contains? It’s easy to see how this could be useful in many applications, such as query planning in a database: the best query plan can depend greatly on not just how many values there are in total, but also on how many unique values there are.

I’d encourage you to give this a bit of thought before reading onwards, because the algorithms we’ll discuss today are quite innovative – and while simple, they’re far from obvious.

Duplicate entries?

They are singing our song!

😉

I found this looking around for newer entries after stumbling on the older one.

Enjoy!
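To convey the flavor of the estimators Nick discusses, here is a k-minimum-values sketch: hash items into (0, 1), keep the k smallest distinct hashes, and estimate the distinct count from the k-th smallest. This is a minimal illustration of the idea, not the post's exact construction.

```python
import hashlib

def estimate_distinct(stream, k=256):
    """Estimate the number of distinct items in one pass using bounded memory:
    keep the k smallest hash values seen and return (k - 1) / h_k."""
    smallest = set()          # the k smallest hash values seen so far
    for item in stream:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) / float(1 << 160)
        if h not in smallest:
            if len(smallest) < k:
                smallest.add(h)
            elif h < max(smallest):
                smallest.remove(max(smallest))
                smallest.add(h)
    if len(smallest) < k:     # fewer than k distinct items: the count is exact
        return len(smallest)
    return int((k - 1) / max(smallest))

stream = [i % 1000 for i in range(100_000)]   # 1,000 distinct values
print(estimate_distinct(stream))              # roughly 1,000
```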

September 20, 2012

25th ACM Symposium on Parallelism in Algorithms and Architectures

Filed under: Algorithms,Conferences,Parallel Programming — Patrick Durusau @ 7:59 pm

25th ACM Symposium on Parallelism in Algorithms and Architectures

Submission Deadlines:
Abstracts: February 11 (11:59 pm EST)
Full papers: February 13 (11:59 pm EST)
These are firm deadlines. No extensions will be granted.
Notification: April 15
Camera-ready copy due: May 14

From the call for papers:

This year, SPAA is co-located with PODC. SPAA defines the term “parallel” broadly, encompassing any computational system that can perform multiple operations or tasks simultaneously. Topics include, but are not limited to:

  • Parallel and Distributed Algorithms
  • Parallel and Distributed Data Structures
  • Green Computing and Power-Efficient Architectures
  • Management of Massive Data Sets
  • Parallel Complexity Theory
  • Parallel and Distributed Architectures
  • Multi-Core Architectures
  • Instruction Level Parallelism and VLSI
  • Compilers and Tools for Concurrent Programming
  • Supercomputer Architecture and Computing
  • Transactional Memory Hardware and Software
  • The Internet and the World Wide Web
  • Game Theory and Collaborative Learning
  • Routing and Information Dissemination
  • Resource Management and Awareness
  • Peer-to-Peer Systems
  • Mobile Ad-Hoc and Sensor Networks
  • Robustness, Self-Stabilization and Security
  • Synergy of Parallelism in Algorithms, Programming and Architecture

Montreal, Canada, July 23 – 25, 2013.

Think about it. Balisage won’t be that far away; you could put some vacation time together with the conferences at either end.

September 12, 2012

Fast integer compression: decoding billions of integers per second

Filed under: Algorithms,Integers,Search Engines — Patrick Durusau @ 3:14 pm

Fast integer compression: decoding billions of integers per second by Daniel Lemire.

At > 2 billion integers per second, you may find there is plenty of action left in your desktop processor!

From the post:

Databases and search engines often store arrays of integers. In search engines, we have inverted indexes that map a query term to a list of document identifiers. This list of document identifiers can be seen as a sorted array of integers. In databases, indexes often work similarly: they map a column value to row identifiers. You also get arrays of integers in databases through dictionary coding: you map all column values to an integer in a one-to-one manner.

Our modern processors are good at processing integers. However, you also want to keep much of the data close to the CPU for better speed. Hence, computer scientists have worked on fast integer compression techniques for the last 4 decades. One of the earliest clever techniques is Elias coding. Over the years, many new techniques have been developed: Golomb and Rice coding, Frame-of-Reference and PFOR-Delta, the Simple family, and so on.

The general story is that while people initially used bit-level codes (e.g., gamma codes), simpler byte-level codes like Google’s group varint are more practical. Byte-level codes like what Google uses do not compress as well, and there is less opportunity for fancy information theoretical mathematics. However, they can be much faster.

Yet we noticed that there was no trace in the literature of a sensible integer compression scheme running on a desktop processor able to decompress data at a rate of billions of integers per second. The best schemes, such as Stepanov et al.’s varint-G8IU, report top speeds of 1.5 billion integers per second.

As you may expect, we eventually found out that it was entirely feasible to decode billions of integers per second. We designed a new scheme that typically compresses better than Stepanov et al.’s varint-G8IU or Zukowski et al.’s PFOR-Delta, sometimes quite a bit better, while being twice as fast over real data residing in RAM (we call it SIMD-BP128). That is, we cleanly exceed a speed of 2 billion integers per second on a regular desktop processor.

We posted our paper online together with our software. Note that our scheme is not patented whereas many other schemes are.

So, how did we do it? Some insights:
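The post goes on to list those insights. As a baseline for comparison, here is the classic byte-level variable-byte (varint) scheme mentioned above, in a minimal sketch; it is nothing like SIMD-BP128, just the kind of simple byte-level code it is compared against.

```python
def varint_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, high bit = 'more bytes'."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)

def varint_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit set: more bytes follow
            shift += 7
        else:                    # last byte of this integer
            numbers.append(n)
            n, shift = 0, 0
    return numbers

values = [3, 270, 123456, 1]
assert varint_decode(varint_encode(values)) == values
```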

August 31, 2012

SMART – string matching research tool

Filed under: Algorithms,String Matching — Patrick Durusau @ 5:10 pm

SMART – string matching research tool by Simone Faro and Thierry Lecroq.

From the webpage:

1 tool

smart (string matching algorithms research tool) is a tool which provides a standard framework for researchers in string matching. It helps users to test, design, evaluate and understand existing solutions for the exact string matching problem. Moreover it provides the implementation of (almost) all string matching algorithms and a wide corpus of text buffers.

40 years

In the last 40 years of research in computer science, string matching has been one of the most extensively studied problems, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Moreover, string matching algorithms are also basic components used in implementations of practical software existing under most operating systems.

85 algos

Since 1970 more than 80 string matching algorithms have been proposed, and more than 50% of them in the last ten years. The smart tool provides a comprehensive collection of all string matching algorithms, implemented in the C programming language, and helps researchers perform experiments and compare algorithms from a practical point of view. Smart provides a practical and standard platform for testing string matching algorithms and sharing results with the community.

12 texts

The smart tool provides also a corpus of 12 texts on which the string matching algorithms can be tested. Texts in the corpus are of different types, including natural language texts, genome sequences, protein sequences, and random texts with a uniform distribution of characters.

Do you know of any similar research tools for named entity recognition?

Bio-hackers will be interested in the “Complete genome of the E. Coli bacterium.”

A Fast Suffix Automata Based Algorithm for Exact Online String Matching

Filed under: Algorithms,String Matching — Patrick Durusau @ 4:41 pm

A Fast Suffix Automata Based Algorithm for Exact Online String Matching by Simone Faro and Thierry Lecroq (Implementation and Application of Automata Lecture Notes in Computer Science, 2012, Volume 7381/2012, 149-158, DOI: 10.1007/978-3-642-31606-7_13)

Abstract:

Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. Automata play a very important role in the design of efficient solutions for the exact string matching problem. In this paper we propose a new very simple solution which turns out to be very efficient in practical cases. It is based on a suitable factorization of the pattern and on a straightforward and light encoding of the suffix automaton. It turns out that on average the new technique leads to longer shift than that proposed by other known solutions which make use of suffix automata.

I suppose it is too much to expect writers to notice matching “…pattern[s] in a text is a fundamental problem in [topic maps]….”

Pattern matching in topic maps can extend beyond strings so perhaps the oversight is understandable. 😉

August 28, 2012

‘The Algorithm That Runs the World’ [Optimization, Identity and Polytopes]

Filed under: Algorithms,Dimensions,Identification,Identity,Polytopes — Patrick Durusau @ 12:28 pm

“The Algorithm That Runs the World” by Erwin Gianchandani.

From the post:

New Scientist published a great story last week describing the history and evolution of the simplex algorithm — complete with a table capturing “2000 years of algorithms”:

The simplex algorithm directs wares to their destinations the world over [image courtesy PlainPicture/Gozooma via New Scientist]. Its services are called upon thousands of times a second to ensure the world’s business runs smoothly — but are its mathematics as dependable as we thought?

You might not have heard of the algorithm that runs the world. Few people have, though it can determine much that goes on in our day-to-day lives: the food we have to eat, our schedule at work, when the train will come to take us there. Somewhere, in some server basement right now, it is probably working on some aspect of your life tomorrow, next week, in a year’s time.

Perhaps ignorance of the algorithm’s workings is bliss. The door to Plato’s Academy in ancient Athens is said to have borne the legend “let no one ignorant of geometry enter”. That was easy enough to say back then, when geometry was firmly grounded in the three dimensions of space our brains were built to cope with. But the algorithm operates in altogether higher planes. Four, five, thousands or even many millions of dimensions: these are the unimaginable spaces the algorithm’s series of mathematical instructions was devised to probe.

Perhaps, though, we should try a little harder to get our heads round it. Because powerful though it undoubtedly is, the algorithm is running into a spot of bother. Its mathematical underpinnings, though not yet structurally unsound, are beginning to crumble at the edges. With so much resting on it, the algorithm may not be quite as dependable as it once seemed [more following the link].

A fund manager might similarly want to arrange a portfolio optimally to balance risk and expected return over a range of stocks; a railway timetabler to decide how best to roster staff and trains; or a factory or hospital manager to work out how to juggle finite machine resources or ward space. Each such problem can be depicted as a geometrical shape whose number of dimensions is the number of variables in the problem, and whose boundaries are delineated by whatever constraints there are (see diagram). In each case, we need to box our way through this polytope towards its optimal point.

This is the job of the algorithm.

Its full name is the simplex algorithm, and it emerged in the late 1940s from the work of the US mathematician George Dantzig, who had spent the second world war investigating ways to increase the logistical efficiency of the U.S. air force. Dantzig was a pioneer in the field of what he called linear programming, which uses the mathematics of multidimensional polytopes to solve optimisation problems. One of the first insights he arrived at was that the optimum value of the “target function” — the thing we want to maximise or minimise, be that profit, travelling time or whatever — is guaranteed to lie at one of the corners of the polytope. This instantly makes things much more tractable: there are infinitely many points within any polytope, but only ever a finite number of corners.

If we have just a few dimensions and constraints to play with, this fact is all we need. We can feel our way along the edges of the polytope, testing the value of the target function at every corner until we find its sweet spot. But things rapidly escalate. Even just a 10-dimensional problem with 50 constraints — perhaps trying to assign a schedule of work to 10 people with different expertise and time constraints — may already land us with several billion corners to try out.
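To see a tiny polytope in action, here is a two-variable linear program solved with SciPy's linprog (assuming SciPy is installed); the numbers are invented for illustration.

```python
from scipy.optimize import linprog

# Maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
# linprog minimizes, so we negate the objective.
result = linprog(
    c=[-3, -2],
    A_ub=[[1, 1], [1, 3]],
    b_ub=[4, 6],
    bounds=[(0, None), (0, None)],
)
print(result.x, -result.fun)   # optimum at a corner of the polytope: [4, 0], value 12
```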

Apologies but I saw this article too late to post within the “free” days allowed by New Scientist.

But, I think from Erwin’s post and long quote from the original article, you can see how the simplex algorithm may be very useful where identity is defined in multidimensional space.

The literature in this area is vast and it may not offer an appropriate test for all questions of subject identity.

For example, the possessor of a credit card is presumed to be the owner of the card. Other assumptions are possible, but fraud costs are recouped from fees paid by customers, which creates a lack of interest in more stringent identity tests.

On the other hand, if your situation requires multidimensional identity measures, this may be a useful approach.


PS: Be aware that naming confusion, the sort that can be managed (not solved) by topic maps abounds even in mathematics:

The elements of a polytope are its vertices, edges, faces, cells and so on. The terminology for these is not entirely consistent across different authors. To give just a few examples: Some authors use face to refer to an (n−1)-dimensional element while others use face to denote a 2-face specifically, and others use j-face or k-face to indicate an element of j or k dimensions. Some sources use edge to refer to a ridge, while H. S. M. Coxeter uses cell to denote an (n−1)-dimensional element. (Polytope)

August 20, 2012

Fast Set Intersection in Memory [Foul! They Peeked!]

Filed under: Algorithms,Memory,Set Intersection,Sets — Patrick Durusau @ 4:06 pm

Fast Set Intersection in Memory by Bolin Ding and Arnd Christian König.

Abstract:

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with totally n elements, we will show how to compute their intersection in expected time O(n / sqrt(w) + kr), where r is the intersection size and w is the number of bits in a machine-word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state of the art techniques for both synthetic and real data sets and workloads.

Important not only for the algorithm but how they arrived at it.

They peeked at the data.

Imagine that.

Not trying to solve the set intersection problem in the abstract but looking at data you are likely to encounter.

I am all for the pure theory side of things but there is something to be said for less airy (dare I say windy?) solutions. 😉

I first saw this at Theoretical Computer Science: Most efficient algorithm to compute set difference?
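As a point of reference, here is the textbook merge-based intersection of two sorted lists, the kind of baseline the paper improves on; the authors' word-level-parallel structures are considerably more involved.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists in O(len(a) + len(b))."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect_sorted([1, 3, 4, 7, 9], [2, 3, 7, 8, 9]))   # [3, 7, 9]
```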

August 12, 2012

Systematic benchmark of substructure search in molecular graphs – From Ullmann to VF2

Filed under: Algorithms,Bioinformatics,Graphs,Molecular Graphs — Patrick Durusau @ 1:11 pm

Systematic benchmark of substructure search in molecular graphs – From Ullmann to VF2 by Hans-Christian Ehrlich and Matthias Rarey. (Journal of Cheminformatics 2012, 4:13 doi:10.1186/1758-2946-4-13)

Abstract:

Background

Searching for substructures in molecules belongs to the most elementary tasks in cheminformatics and is nowadays part of virtually every cheminformatics software. The underlying algorithms, used over several decades, are designed for application to general graphs. Little effort has been spent on characterizing their performance when applied to molecular graphs. Therefore, it is not clear how current substructure search algorithms behave on such special graphs. One of the main reasons why such an evaluation was not performed in the past was the absence of appropriate data sets.

Results

In this paper, we present a systematic evaluation of Ullmann’s and the VF2 subgraph isomorphism algorithms on molecular data. The benchmark set consists of a collection of 1236 SMARTS substructure expressions and selected molecules from the ZINC database. The benchmark evaluates substructure search times for complete database scans as well as individual substructure-molecule pairs. In detail, we focus on the influence of substructure formulation and size, the impact of molecule size, and the ability of both algorithms to be used on multiple cores.

Conclusions

The results show a clear superiority of the VF2 algorithm in all test scenarios. In general, both algorithms solve most instances in less than one millisecond, which we consider to be acceptable. Still, in direct comparison, VF2 is most often several times faster than Ullmann’s algorithm. Additionally, Ullmann’s algorithm shows a surprising number of run time outliers.

Questions:

How do your graphs compare to molecular graphs? Similarities? Differences?

For searching molecular graphs, what algorithm does your software use for substructure searches?
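NetworkX ships a VF2 implementation, so the subgraph-isomorphism check at the heart of the benchmark can be tried directly. The sketch below uses plain unlabeled graphs; real molecular matching would also compare atom and bond labels (for example via SMARTS in a cheminformatics toolkit).

```python
import networkx as nx
from networkx.algorithms import isomorphism

# "Molecule" as a plain graph: a 6-cycle with one pendant node (labels omitted).
molecule = nx.cycle_graph(6)
molecule.add_edge(0, 6)

# Substructure query: a simple path on 4 nodes.
pattern = nx.path_graph(4)

matcher = isomorphism.GraphMatcher(molecule, pattern)
print(matcher.subgraph_is_isomorphic())                       # True: the pattern embeds
print(sum(1 for _ in matcher.subgraph_isomorphisms_iter()))   # number of embeddings
```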

August 11, 2012

Themes in streaming algorithms (workshop at TU Dortmund)

Filed under: Algorithms,Data Streams,Stream Analytics — Patrick Durusau @ 8:31 pm

Themes in streaming algorithms (workshop at TU Dortmund) by Anna C. Gilbert.

From the post:

I recently attended the streaming algorithms workshop at Technische Universität Dortmund. It was a follow-on to the very successful series of streaming algorithms workshops held in Kanpur over the last six years. Suresh and Andrew have both given excellent summaries of the individual talks at the workshop (see day 1, day 2, and day 3) so, as both a streaming algorithms insider and outsider, I thought it would be good to give a high-level overview of what themes there are in streaming algorithms research these days, to identify new areas of research and to highlight advances in existing areas.

Anna gives the briefest of summaries but I think they will entice you to look further.

Curious: how would you distinguish a "stream of data" from "read-once data"?

That is, in the second case you only get one pass at reading the data. Errors are just that, errors, but you can’t look back to be sure.

Is data “integrity” an artifact of small data sets and under-powered computers?
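The one-pass constraint is what forces the sublinear-memory tricks. Reservoir sampling is perhaps the simplest example of an algorithm that reads a stream exactly once and still gives an exact guarantee (a uniform sample); a minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown
    length, reading each item exactly once and storing only k of them."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randrange(i + 1)   # replace with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))
```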

August 5, 2012

Algorithms for Modern Massive Data Sets [slides]

Filed under: Algorithms,BigData,GraphLab,Sparse Data — Patrick Durusau @ 3:40 pm

Algorithms for Modern Massive Data Sets [slides]

Igor Carron writes:

In case you have to take your mind off tomorrow’s suspense-filled and technologically challenging landing of Curiosity on Mars (see 7 minutes of Terror, a blockbuster taking place on Mars this Summer), Michael Mahoney, Alex Shkolnik, Gunnar Carlsson, Petros Drineas, the organizers of the Workshop on Algorithms for Modern Massive Data Sets (MMDS 2012), just made available the slides of the meeting. Other relevant meeting slides include those of the Coding, Complexity, and Sparsity Workshop and the GraphLab workshop. Don’t grind your teeth too much tomorrow and Go Curiosity!

Igor has helpfully listed four conference days of links for the algorithms workshop. When you finish there, take a look at the other two slide listings.

July 25, 2012

Searching the WWW for found things, a bad solution

Filed under: Algorithms,Graphs,Networks — Patrick Durusau @ 1:53 pm

Searching for something you have found once before is a very bad solution.

It is like catching a fish and then throwing it back into the ocean and attempting to find that same fish, again.

A real example happened to me this week. I had mentioned this research in a post but did not include a link to it. I didn’t remember the names of the researchers or their location.

From the news bulletin:

Whether sudoku, a map of Germany or solid bodies – in all of these cases, it’s all about counting possibilities. In the sudoku, it is the permitted solutions; in the solid body, it is the possible arrangements of atoms. In the map, the question is how many ways the map can be colored so that adjacent countries are always shown in a different color. Scientists depict these counting problems as a network of lines and nodes. Consequently, they need to answer just one question: How many different ways are there to color in the nodes with a certain number of colors? The only condition: nodes joined by a line may not have the same color. Depending on the application, the color of a node is given a completely new significance. In the case of the map, “color” actually means color; with sudoku the “colors” represent different figures.

“The existing algorithm copies the whole network for each stage of the calculation and only changes one aspect of it each time,” explains Frank van Bussel of the Max Planck Institute for Dynamics and Self-Organization (MPIDS). Increasing the number of nodes dramatically increases the calculation time. For a square lattice the size of a chess board, this is estimated to be many billions of years. The new algorithm developed by the Göttingen-based scientists is significantly faster. “Our calculation for the chess board lattice only takes seven seconds,” explains Denny Fliegner from MPIDS.

Without naming the search engine, would you believe that:

network +color +node

results in 24 “hits,” none of which are the research in question.

Remembering some of the terms in the actual scholarly article I searched using:

"chromatic polynomials" +network

Which resulted in three (3) scholarly articles and one “hit,” none of which were the research in question.

As you may suspect, variations on these searches resulted in similar “non-helpful” results.

I had not imagined the research in question but searching was unable to recover the reference.

Well, using a search engine was unable to recover the reference.

Knowing that I had bookmarked the site, I had to scan a large bookmark file for the likely entry.

I found it. So that I don’t have to repeat this non-productive behavior, what follows are the citations and some text from each to help with the finding next time.

The general news article:

A new kind of counting

How many different sudokus are there? How many different ways are there to color in the countries on a map? And how do atoms behave in a solid? Researchers at the Max Planck Institute for Dynamics and Self-Organization in Göttingen and at Cornell University (Ithaca, USA) have now developed a new method that quickly provides an answer to these questions. In principle, there has always been a way to solve them. However, computers were unable to find the solution as the calculations took too long. With the new method, the scientists look at separate sections of the problem and work through them one at a time. Up to now, each stage of the calculation has involved the whole map or the whole sudoku. The answers to many problems in physics, mathematics and computer science can be provided in this way for the first time. (New Journal of Physics, February 4, 2009)

The New Journal of Physics article:

Counting complex disordered states by efficient pattern matching: chromatic polynomials and Potts partition functions by Marc Timme, Frank van Bussel, Denny Fliegner, and Sebastian Stolzenberg.

Abstract:

Counting problems, determining the number of possible states of a large system under certain constraints, play an important role in many areas of science. They naturally arise for complex disordered systems in physics and chemistry, in mathematical graph theory, and in computer science. Counting problems, however, are among the hardest problems to access computationally. Here, we suggest a novel method to access a benchmark counting problem, finding chromatic polynomials of graphs. We develop a vertex-oriented symbolic pattern matching algorithm that exploits the equivalence between the chromatic polynomial and the zero-temperature partition function of the Potts antiferromagnet on the same graph. Implementing this bottom-up algorithm using appropriate computer algebra, the new method outperforms standard top-down methods by several orders of magnitude, already for moderately sized graphs. As a first application, we compute chromatic polynomials of samples of the simple cubic lattice, for the first time computationally accessing three-dimensional lattices of physical relevance. The method offers straightforward generalizations to several other counting problems.

GENERAL SCIENTIFIC SUMMARY

Introduction and background. The number of accessible states of a complex physical system fundamentally impacts its static and dynamic properties. For instance, antiferromagnets often exhibit an exponential number of energetically equivalent ground states and thus positive entropy at zero temperature – an exception to the third law of thermodynamics. However, counting the number of ground states, such as for the Potts model antiferromagnet, is computationally very hard (so-called sharp-P hard), i.e. the computation time generally increases exponentially with the size of the system. Standard computational counting methods that use theorems of graph theory are therefore mostly restricted to very simple or very small lattices.

Main results. Here we present a novel general-purpose method for counting. It relies on a symbolic algorithm that is based on the original physical representation of a Potts partition function and is implemented in the computer algebra language FORM that was successfully used before in precision high-energy physics.

Wider implications. The bottom-up nature of the algorithm, together with the purely symbolic implementation make the new method many orders of magnitude faster than standard methods. It now enables exact solutions of various systems that have been thus far computationally inaccessible, including lattices in three dimensions. Through the relation of the Potts partition functions to universal functions in graph theory, this new method may also help to access related counting problems in communication theory, graph theory and computer science.

The language used was FORM.
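For small graphs the counting problem itself is easy to state in code: the chromatic polynomial evaluated at q is just the number of ways to color the nodes with q colors so that adjacent nodes differ. A brute-force counter (nothing like the FORM-based algorithm above, which is exactly the point):

```python
from itertools import product

def count_colorings(nodes, edges, q):
    """Number of proper q-colorings, i.e. the chromatic polynomial P(G, q),
    by brute force -- exponential in the number of nodes."""
    total = 0
    for coloring in product(range(q), repeat=len(nodes)):
        color = dict(zip(nodes, coloring))
        if all(color[u] != color[v] for u, v in edges):
            total += 1
    return total

# 4-cycle: P(C4, q) = (q-1)^4 + (q-1); with q = 3 that is 16 + 2 = 18.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(count_colorings(nodes, edges, 3))   # 18
```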

This search reminded me that maps are composed of found and identified things.

Which explains the difference between searching the WWW versus a map of found things.

July 21, 2012

The Amazing Mean Shift Algorithm

Filed under: Algorithms,Merging — Patrick Durusau @ 7:47 pm

The Amazing Mean Shift Algorithm by Larry Wasserman.

From the post:

The mean shift algorithm is a mode-based clustering method due to Fukunaga and Hostetler (1975) that is commonly used in computer vision but seems less well known in statistics.

The steps are: (1) estimate the density, (2) find the modes of the density, (3) associate each data point to one mode.

If you are puzzling over why I cited this post, it might help if you read “(3)” as:

(3) merge data points associated with one mode.

The notion that topics can only be merged on the basis of URLs (really, on discrete values of any sort) is one way to think about merging. Your data may or may not admit to robust processing on that basis.

Those are all very good ways to merge topics, if and only if that works for your data.

If not, then you need to find ways that work with your data.
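For the curious, a minimal mean shift sketch with a flat (uniform) kernel, assuming NumPy; points that converge to the same mode end up in the same cluster, which is the "merge" step in (3):

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, iters=50):
    """Shift every point toward the mean of its neighbors within `bandwidth`
    until it settles near a mode; points sharing a mode form one cluster."""
    shifted = points.copy()
    for _ in range(iters):
        for i, p in enumerate(shifted):
            neighbors = points[np.linalg.norm(points - p, axis=1) < bandwidth]
            shifted[i] = neighbors.mean(axis=0)
    # Group points whose modes coincide (up to rounding).
    modes = np.round(shifted)
    _, labels = np.unique(modes, axis=0, return_inverse=True)
    return labels

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(1, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
print(mean_shift(data, bandwidth=1.0))   # typically two clusters: labels 0 and 1
```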

July 13, 2012

Search Algorithms for Conceptual Graph Databases

Filed under: Algorithms,Graph Databases,Search Algorithms — Patrick Durusau @ 5:10 pm

Search Algorithms for Conceptual Graph Databases by Abdurashid Mamadolimov.

Abstract:

We consider a database composed of a set of conceptual graphs. Using conceptual graphs and graph homomorphism it is possible to build a basic query-answering mechanism based on semantic search. Graph homomorphism defines a partial order over conceptual graphs. Since graph homomorphism checking is an NP-Complete problem, the main requirement for database organizing and managing algorithms is to reduce the number of homomorphism checks. Searching is a basic operation for database manipulating problems. We consider the problem of searching for an element in a partially ordered set. The goal is to minimize the number of queries required to find a target element in the worst case. First we analyse conceptual graph database operations. Then we propose a new algorithm for a subclass of lattices. Finally, we suggest a parallel search algorithm for a general poset.

While I have no objection to efficient solutions for particular cases, as a general rule those solutions are valid for some known set of cases.

Here we appear to have an efficient solution for some unknown number of cases. I mention it to keep in mind while watching the search literature on graph databases develop.

June 30, 2012

Algorithms

Filed under: Algorithms — Patrick Durusau @ 6:50 pm

Algorithms by S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani.

From the webpage:

This is a penultimate draft of our soon to appear textbook.

For more information, visit http://www.mhhe.com/dasgupta.



Table of contents

Preface

Chapter 0: Prologue

Chapter 1: Algorithms with numbers

Chapter 2: Divide-and-conquer algorithms

Chapter 3: Decompositions of graphs

Chapter 4: Paths in graphs

Chapter 5: Greedy algorithms

Chapter 6: Dynamic programming

Chapter 7: Linear programming

Chapter 8: NP-complete problems

Chapter 9: Coping with NP-completeness

Chapter 10: Quantum algorithms

Entire book (draft)

The published version was reviewed by Dean Kelley (Dean Kelley. 2009. Joint review of Algorithms by Richard Johnsonbaugh and Marcus Schaefer (Pearson/Prentice-Hall, 2004) and Algorithms by Sanjoy Dasgupta, Christos Papadimitriou and Umesh Vazirani (McGraw-Hill, 2008). SIGACT News 40, 2 (June 2009), 23-25. DOI=10.1145/1556154.1556159 http://doi.acm.org/10.1145/1556154.1556159)

Who noted:

….Eschewing a formal and traditional presentation, the authors focus on distilling the core of a problem and/or the fundamental idea that makes an algorithm work. Definitions, theorems and proofs are present, of course, but less visibly so and are less formally presented than in the other text reviewed here.

The result is a book which finds a rigorous, but nicely direct path through standard subjects such as divide-and-conquer, graph algorithms, greedy algorithms, dynamic programming, and NP-completeness. You won’t necessarily find every topic that you might want in all of these subjects, but the book doesn’t claim to be encyclopedic and the authors’ presentation doesn’t suffer as a result of their choice of specific topics.

Nice collections of chapter problems provide opportunities to formalize parts of the presentation and explore additional topics. The text contains plenty of “asides” (digressions, illuminations, addenda, perspectives, etc.) presented as boxes on some of the pages. These little side trips are fun to read, enhance the presentation and can often lead to opportunities to include additional material in the course. It has been my experience that even a disinterested, academically self-destructive student can find something of sufficient interest in these excursions to grab their attention.

A good text on algorithms and one that merits a hard copy for the shelf!

An example of what to strive for when writing a textbook.

June 25, 2012

…a phylogeny-aware graph algorithm

Filed under: Algorithms,Alignment,Genome,Graphs,Sequence Detection — Patrick Durusau @ 2:43 pm

Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm by Loytynoja, A., Vilella, A. J., Goldman, N.

From the post:

Motivation: Accurate alignment of large numbers of sequences is demanding and the computational burden is further increased by downstream analyses depending on these alignments. With the abundance of sequence data, an integrative approach of adding new sequences to existing alignments without their full re-computation and maintaining the relative matching of existing sequences is an attractive option. Another current challenge is the extension of reference alignments with fragmented sequences, as those coming from next-generation metagenomics, that contain relatively little information. Widely used methods for alignment extension are based on profile representation of reference sequences. These do not incorporate and use phylogenetic information and are affected by the composition of the reference alignment and the phylogenetic positions of query sequences.

Results: We have developed a method for phylogeny-aware alignment of partial-order sequence graphs and apply it here to the extension of alignments with new data. Our new method, called PAGAN, infers ancestral sequences for the reference alignment and adds new sequences in their phylogenetic context, either to predefined positions or by finding the best placement for sequences of unknown origin. Unlike profile-based alternatives, PAGAN considers the phylogenetic relatedness of the sequences and is not affected by inclusion of more diverged sequences in the reference set. Our analyses show that PAGAN outperforms alternative methods for alignment extension and provides superior accuracy for both DNA and protein data, the improvement being especially large for fragmented sequences. Moreover, PAGAN-generated alignments of noisy next-generation sequencing (NGS) sequences are accurate enough for the use of RNA-seq data in evolutionary analyses.

Availability: PAGAN is written in C++, licensed under the GPL and its source code is available at http://code.google.com/p/pagan-msa.

Contact: ari.loytynoja@helsinki.fi

Does your graph software support “…phylogeny-aware alignment of partial-order sequence graphs…?”

June 14, 2012

Why My Soap Film is Better than Your Hadoop Cluster

Filed under: Algorithms,Hadoop,Humor — Patrick Durusau @ 6:56 pm

Why My Soap Film is Better than Your Hadoop Cluster

From the post:

The ever amazing slime mold is not the only way to solve complex compute problems without performing calculations. There is another: soap film. Unfortunately for soap film it isn’t nearly as photogenic as slime mold, all we get are boring looking pictures, but the underlying idea is still fascinating and ten times less spooky.

As a quick introduction we’ll lean on Long Ouyang, who has a really straightforward explanation of how soap film works in Approaching P=NP: Can Soap Bubbles Solve The Steiner Tree Problem In Polynomial.

And no, this isn’t what I am writing about on Hadoop for next Monday. 😉

I point this out partially for humor.

But considering unconventional computational methods may give you ideas about more conventional things to try.

May 3, 2012

Lock-Free Algorithms: How Intel X86_64 Processors and Their Memory Model Works

Filed under: Algorithms,Lock-Free Algorithms — Patrick Durusau @ 6:23 pm

Lock-Free Algorithms: How Intel X86_64 Processors and Their Memory Model Works

Alex Popescu has links to both presentations and other resources by Martin Thompson on how to write lock-free algorithms.

Just in case you are interested in performance for your semantic/topic map applications.

April 22, 2012

GraphLab: Workshop on Big Learning

Filed under: Algorithms,BigData,Machine Learning — Patrick Durusau @ 7:06 pm

GraphLab: Workshop on Big Learning

Monday, July 9, 2012

From the webpage:

The GraphLab workshop on large scale machine learning is a meeting place for both academia and industry to discuss upcoming challenges of large scale machine learning and solution methods. GraphLab is Carnegie Mellon’s large scale machine learning framework. The workshop will include demos and tutorials showcasing the next generation of the GraphLab framework, as well as lectures and demos from the top technology companies about their applied large scale machine learning solutions.

and

There is a related workshop on Algorithms for Modern Massive Datasets at Stanford, immediately after the GraphLab workshop.

If you are going to be in the Bay area, definitely a good way to start the week!

April 15, 2012

Computer Algorithms: Morris-Pratt String Searching

Filed under: Algorithms,Searching,String Matching — Patrick Durusau @ 7:15 pm

Computer Algorithms: Morris-Pratt String Searching

From the post:

We saw that neither brute force string searching nor Rabin-Karp string searching is effective. However, in order to improve an algorithm, we first need to understand its principles in detail. We know already that brute force string matching is slow and we tried to improve it somehow by using a hash function in the Rabin-Karp algorithm. The problem is that Rabin-Karp has the same complexity as brute force string matching, which is O(mn).

Obviously we need a different approach, but to come up with a different approach let’s see what’s wrong with brute force string searching. Indeed, by taking a closer look at its principles we can answer the question.

In brute force matching we checked each character of the text with the first character of the pattern. In case of a match we shifted the comparison between the second character of the pattern and the next character of the text. The problem is that in case of a mismatch we must go several positions back in the text. Well in fact this technique can’t be optimized.

Good posts on string matching!
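The fix Morris-Pratt introduces is precomputing, for every prefix of the pattern, its longest proper border, so the search never re-examines text characters. A minimal sketch of the failure table and the matcher:

```python
def mp_failure(pattern):
    """fail[i] = length of the longest proper border of pattern[:i+1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def mp_search(text, pattern):
    """Return start positions of pattern in text in O(len(text) + len(pattern))."""
    fail, k, hits = mp_failure(pattern), 0, []
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]       # never move backwards in the text
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

print(mp_search("abcabcabd", "abcabd"))   # [3]
```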

March 20, 2012

Worst-case Optimal Join Algorithms

Filed under: Algorithms,Database,Joins — Patrick Durusau @ 3:52 pm

Worst-case Optimal Join Algorithms by Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra.

Abstract:

Efficient join processing is one of the most fundamental and well-studied tasks in database research. In this work, we examine algorithms for natural join queries over many relations and describe a novel algorithm to process these queries optimally in terms of worst-case data complexity. Our result builds on recent work by Atserias, Grohe, and Marx, who gave bounds on the size of a full conjunctive query in terms of the sizes of the individual relations in the body of the query. These bounds, however, are not constructive: they rely on Shearer’s entropy inequality which is information-theoretic. Thus, the previous results leave open the question of whether there exist algorithms whose running time achieve these optimal bounds. An answer to this question may be interesting to database practice, as it is known that any algorithm based on the traditional select-project-join style plans typically employed in an RDBMS are asymptotically slower than the optimal for some queries. We construct an algorithm whose running time is worst-case optimal for all natural join queries. Our result may be of independent interest, as our algorithm also yields a constructive proof of the general fractional cover bound by Atserias, Grohe, and Marx without using Shearer’s inequality. This bound implies two famous inequalities in geometry: the Loomis-Whitney inequality and the Bollobás-Thomason inequality. Hence, our results algorithmically prove these inequalities as well. Finally, we discuss how our algorithm can be used to compute a relaxed notion of joins.

With reference to the optimal join problem the authors say:

Implicitly, this problem has been studied for over three decades: a modern RDBMS uses decades of highly tuned algorithms to efficiently produce query results. Nevertheless, as we described above, such systems are asymptotically suboptimal – even in the above simple example of (1). Our main result is an algorithm that achieves asymptotically optimal worst-case running times for all conjunctive join queries.

The authors’ strategy involves evaluating the keys in a join and dividing those keys into separate sets. The information used by the authors has always been present, just not used in join processing. (pp. 2-3 of the article)

There are a myriad of details to be mastered in the article but I suspect this line of thinking may be profitable in many situations where “join” operations are relevant.
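To convey the flavor of the approach, here is a toy attribute-at-a-time join for the triangle query Q(a,b,c) = R(a,b) ⋈ S(b,c) ⋈ T(a,c): bind one attribute at a time and intersect candidate sets rather than joining two relations at a time. This is only meant to illustrate the idea, not the paper's algorithm.

```python
from collections import defaultdict

def triangle_join(R, S, T):
    """Enumerate (a, b, c) with (a, b) in R, (b, c) in S, (a, c) in T,
    binding one attribute at a time and intersecting candidate sets."""
    R_by_a = defaultdict(set)
    for a, b in R:
        R_by_a[a].add(b)
    S_by_b = defaultdict(set)
    for b, c in S:
        S_by_b[b].add(c)
    T_by_a = defaultdict(set)
    for a, c in T:
        T_by_a[a].add(c)

    out = []
    for a in set(R_by_a) & set(T_by_a):           # bind a
        for b in R_by_a[a] & set(S_by_b):         # bind b given a
            for c in S_by_b[b] & T_by_a[a]:       # bind c given a, b
                out.append((a, b, c))
    return sorted(out)

R = {(1, 2), (1, 3), (2, 3)}
S = {(2, 3), (3, 1)}
T = {(1, 3)}
print(triangle_join(R, S, T))   # [(1, 2, 3)]
```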

March 19, 2012

Clever Algorithms: Statistical Machine Learning Recipes

Filed under: Algorithms,Programming — Patrick Durusau @ 6:52 pm

Clever Algorithms: Statistical Machine Learning Recipes by Jason Brownlee PhD. First Edition, Lulu Enterprises, [Expected mid 2012]. ISBN: xxx.

From the website cleveralgorithms.com

Implementing Machine Learning algorithms is difficult. Algorithm descriptions may be incomplete, inconsistent, and distributed across a number of papers, chapters and even websites. This can result in varied interpretations of algorithms, undue attrition of algorithms, and ultimately bad science.

This book is an effort to address these issues by providing a handbook of algorithmic recipes drawn from the field of Machine Learning, described in a complete, consistent, and centralized manner. These standardized descriptions were carefully designed to be accessible, usable, and understandable.

An encyclopedic algorithm reference, this book is intended for research scientists, engineers, students, and interested amateurs. Each algorithm description provides a working code example in R.

Also see: Clever Algorithms: Nature-Inspired Programming Recipes.

Led to this by Experiments in Genetic Programming.

Experiments in genetic programming

Filed under: Algorithms,Authoring Topic Maps,Genetic Algorithms,Record Linkage — Patrick Durusau @ 6:52 pm

Experiments in genetic programming

Lars Marius Garshol writes:

I made an engine called Duke that can automatically match records to see if they represent the same thing. For more background, see a previous post about it. The biggest problem people seem to have with using it is coming up with a sensible configuration. I stumbled across a paper that described using so-called genetic programming to configure a record linkage engine, and decided to basically steal the idea.

You need to read about the experiments in the post but I can almost hear Lars saying the conclusion:

The result is pretty clear: the genetic configurations are much the best. The computer can configure Duke better than I can. That’s almost shocking, but there you are. I guess I need to turn the script into an official feature.

😉

Excellent post and approach by the way!

Lars also posted a link to Reddit about his experiments. Several links appear in comments that I have turned into short posts to draw more attention to them.

Another tool for your topic mapping toolbox.

Question: I wonder what it would look like to have the intermediate results used for mapping, only to be replaced as “better” mappings become available? That has a terminating condition, but new content can trigger additional cycles, though only as relevant to its content.

Or would queries count as new content? If they expressed synonymy or other relations?
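For what the genetic search might look like in miniature, here is a hypothetical, stripped-down sketch: represent a matcher configuration as per-field weights plus a threshold, score it against labeled pairs, and let selection plus mutation do the tuning. None of this is Duke's actual API; the field names, scoring, and parameters are invented.

```python
import random

FIELDS = ["name", "address", "phone"]          # hypothetical record fields

def similarity(rec1, rec2, weights):
    """Weighted average of crude per-field similarities (exact match = 1)."""
    scores = [1.0 if rec1[f] == rec2[f] else 0.0 for f in FIELDS]
    total = sum(weights.values())
    return sum(weights[f] * s for f, s in zip(FIELDS, scores)) / total

def fitness(config, labeled_pairs):
    """Fraction of labeled pairs the configuration classifies correctly."""
    weights, threshold = config
    correct = 0
    for rec1, rec2, is_match in labeled_pairs:
        predicted = similarity(rec1, rec2, weights) >= threshold
        correct += (predicted == is_match)
    return correct / len(labeled_pairs)

def mutate(config):
    weights, threshold = config
    new_weights = {f: max(0.01, w + random.gauss(0, 0.1)) for f, w in weights.items()}
    return new_weights, min(0.99, max(0.01, threshold + random.gauss(0, 0.05)))

def evolve(labeled_pairs, population=20, generations=50):
    configs = [({f: random.uniform(0.01, 1.0) for f in FIELDS}, random.random())
               for _ in range(population)]
    for _ in range(generations):
        configs.sort(key=lambda c: fitness(c, labeled_pairs), reverse=True)
        survivors = configs[: population // 2]            # selection
        configs = survivors + [mutate(random.choice(survivors))
                               for _ in range(population - len(survivors))]
    return configs[0]

pairs = [
    ({"name": "Acme", "address": "1 Main St", "phone": "555"},
     {"name": "Acme", "address": "1 Main Street", "phone": "555"}, True),
    ({"name": "Acme", "address": "1 Main St", "phone": "555"},
     {"name": "Apex", "address": "9 Elm St", "phone": "123"}, False),
]
best_weights, best_threshold = evolve(pairs)
print(best_weights, best_threshold)
```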

