Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 24, 2014

Twitter open sourced a recommendation algorithm for massive datasets

Filed under: Algorithms,Recommendation,Similarity — Patrick Durusau @ 7:19 pm

Twitter open sourced a recommendation algorithm for massive datasets by Derrick Harris.

From the post:

Late last month, Twitter open sourced an algorithm that’s designed to ease the computational burden on systems trying to recommend content — contacts, articles, products, whatever — across seemingly endless sets of possibilities. Called DIMSUM, short for Dimension Independent Matrix Square using MapReduce (rolls off the tongue, no?), the algorithm trims the list of potential combinations to a reasonable number, so other recommendation algorithms can run in a reasonable amount of time.

Reza Zadeh, the former Twitter data scientist and current Stanford consulting professor who helped create the algorithm, describes it in terms of the famous handshake problem. Two people in a room? One handshake; no problem. Ten people in a room? Forty-five handshakes; still doable. However, he explained, “The number of handshakes goes up quadratically … That makes the problem very difficult when x is a million.”

Twitter claims 271 million active users.

DIMSUM works primarily in two different areas: (1) matching promoted ads with the right users, and (2) suggesting similar people to follow after users follow someone. Running through all the possible combinations would take days even on a large cluster of machines, Zadeh said, but sampling the user base using DIMSUM takes significantly less time and significantly fewer machines.

The “similarity” of two or more people or bits of content is a variation on the merging rules of the TMDM.

In recommendation language, two or more topics are “similar” if they have:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.

TMDM 5.3.5 Properties

The TMDM says “equal,” not “similar,” but the point is that you can decide for yourself how “similar” two or more topics must be in order to trigger merging.
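
For illustration, here is a minimal Python sketch of that idea. It is not normative TMDM: topics are assumed to be dicts of string sets with field names of my own choosing, the IRI is made up, and the second function is the relaxed, similarity-based variant described above.

    # Equality-based merge test modeled on the TMDM conditions listed above.
    # The field names here are illustrative shorthand, not TMDM-mandated.
    def tmdm_merge(t1, t2):
        return (bool(t1["subject_identifiers"] & t2["subject_identifiers"]) or
                bool(t1["item_identifiers"] & t2["item_identifiers"]) or
                bool(t1["subject_locators"] & t2["subject_locators"]) or
                bool(t1["subject_identifiers"] & t2["item_identifiers"]) or
                bool(t1["item_identifiers"] & t2["subject_identifiers"]) or
                (t1["reified"] is not None and t1["reified"] is t2["reified"]))

    # Relaxed variant: merge when any identifier pair is "similar enough,"
    # for whatever similarity function and threshold you choose.
    def similarity_merge(t1, t2, similarity, threshold=0.9):
        ids1 = t1["subject_identifiers"] | t1["item_identifiers"] | t1["subject_locators"]
        ids2 = t2["subject_identifiers"] | t2["item_identifiers"] | t2["subject_locators"]
        return any(similarity(a, b) >= threshold for a in ids1 for b in ids2)

    t1 = {"subject_identifiers": {"http://example.org/people/jdoe"},
          "item_identifiers": set(), "subject_locators": set(), "reified": None}
    t2 = {"subject_identifiers": set(), "subject_locators": set(), "reified": None,
          "item_identifiers": {"http://example.org/people/jdoe"}}
    print(tmdm_merge(t1, t2))    # True, via the cross identifier rule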

That realization opens up the entire realm of “similarity” and “recommendation” algorithms and techniques for application to topic maps.

Which brings us back to the algorithm just open sourced by Twitter.

With DIMSUM, you don’t have to do a brute-force, topic-by-topic comparison for merging purposes. Topics that fall below a merging “threshold” never reach the merging routines at all.

Of course, since TMDM merging is either true or false, you may be stuck with brute force. Suggestions?

But if you have other similarity measures, you may be able to profit from DIMSUM.
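
To make the contrast concrete, here is the brute-force all-pairs comparison that DIMSUM is designed to avoid, as a small numpy sketch (illustrative only, not Twitter’s code). Every column pair gets scored, which is exactly the quadratic “handshake” cost; DIMSUM keeps the same cosine scores but samples row contributions with probability tied to column norms, so most low-similarity pairs are never touched.

    import numpy as np

    def all_pairs_cosine(A, threshold=0.8):
        """A: rows = users, columns = items. Return column pairs above threshold."""
        norms = np.linalg.norm(A, axis=0)
        pairs = []
        n = A.shape[1]
        for i in range(n):
            for j in range(i + 1, n):                  # the quadratic part
                sim = A[:, i] @ A[:, j] / (norms[i] * norms[j] + 1e-12)
                if sim >= threshold:
                    pairs.append((i, j, float(sim)))
        return pairs

    A = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 1]], dtype=float)
    print(all_pairs_cosine(A, threshold=0.5))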

BTW, I would not follow #dimsum on Twitter because it is apparently a type of dumpling. 😉

Update: All-pairs similarity via DIMSUM reports that DIMSUM has been implemented in Spark!

September 22, 2014

Algorithms and Data – Example

Filed under: Algorithms,Data,Statistics — Patrick Durusau @ 10:41 am

(Embedded: People’s Climate coverage from AJ+)

AJ+ was all over the #OurClimate march in New York City.

Let’s be generous and say the march attracted 400,000 people.

At approximately 10:16 AM Eastern time this morning, the world population clock reported a population of 7,262,447,500.

About 0.0055% of the world’s population expressed an opinion on climate change in New York yesterday.

I mention that calculation, disclosing both data and the algorithm, to point out the distortion between the number of people driving policy versus the number of people impacted.
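
Spelled out as a trivial snippet:

    # The disclosed data and "algorithm" in runnable form.
    marchers = 400_000
    world_population = 7_262_447_500
    print(100 * marchers / world_population)   # ≈ 0.0055 percent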

Other minority opinions promoted by AJ+ include that of the United States (population: 318,776,000) on what role Iran (population: 77,176,930) should play in the Middle East (population: 395,133,109) and the world (population: 7,262,447,500), on issues such as the Islamic State. BBC News: Islamic State crisis: Kerry says Iran can help defeat IS.

Isn’t that “the tail wagging the dog?”

Is there any wonder why international decision making departs from the common interests of the world’s population?

Hopefully AJ+ will stop beating the drum quite so loudly for minority opinions and seek out more representative ones, even if not conveniently located in New York City.

September 10, 2014

Taxonomies and Toolkits of Regular Language Algorithms

Filed under: Algorithms,Automata,Finite State Automata,String Matching — Patrick Durusau @ 3:32 pm

Taxonomies and Toolkits of Regular Language Algorithms by Bruce William Watson.

From 1.1 Problem Statement:

A number of fundamental computing science problems have been extensively studied since the 1950s and the 1960s. As these problems were studied, numerous solutions (in the form of algorithms) were developed over the years. Although new algorithms still appear from time to time, each of these fields can be considered mature. In the solutions to many of the well-studied computing science problems, we can identify three deficiencies:

  1. Algorithms solving the same problem are difficult to compare to one another. This is usually due to the use of different programming languages, styles of presentation, or simply the addition of unnecessary details.
  2. Collections of implementations of algorithms solving a problem are difficult, if not impossible, to find. Some of the algorithms are presented in a relatively obsolete manner, either using old notations or programming languages for which no compilers exist, making it difficult to either implement the algorithm or find an existing implementation.
  3. Little is known about the comparative practical running time performance of the algorithms. The lack of existing implementations in one and the same framework, especially of the older algorithms, makes it difficult to determine the running time characteristics of the algorithms. A software engineer selecting one of the algorithms will usually do so on the basis of the algorithm’s theoretical running time, or simply by guessing.

In this dissertation, a solution to each of the three deficiencies is presented for each of the following three fundamental computing science problems:

  1. Keyword pattern matching in strings. Given a finite non-empty set of keywords (the patterns) and an input string, find the set of all occurrences of a keyword as a substring of the input string.
  2. Finite automata (FA) construction. Given a regular expression, construct a finite automaton which accepts the language denoted by the regular expression.
  3. Deterministic finite automata (DFA) minimization. Given a DFA, construct the unique minimal DFA accepting the same language.

We do not necessarily consider all the known algorithms solving the problems. For example, we restrict ourselves to batch-style algorithms, as opposed to incremental algorithms.

It could use updating given its age (1995), but it is a work that merits mention.
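
As a toy illustration of the first of the three problems (keyword pattern matching), here is a minimal brute-force sketch; the families of far better algorithms (Aho-Corasick, Commentz-Walter, and friends) that the dissertation taxonomizes replace exactly this loop.

    # Brute-force keyword pattern matching: report every (position, keyword)
    # pair where a keyword occurs as a substring of the input string.
    def find_keywords(keywords, text):
        hits = []
        for i in range(len(text)):
            for kw in keywords:
                if text.startswith(kw, i):
                    hits.append((i, kw))
        return hits

    print(find_keywords(["he", "she", "his", "hers"], "ushers"))
    # [(1, 'she'), (2, 'he'), (2, 'hers')]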

I first saw this in a tweet by silentbicycle.srec.

August 26, 2014

Improving sparse word similarity models…

Filed under: Algorithms,Linguistics,Modeling,Similarity — Patrick Durusau @ 6:02 pm

Improving sparse word similarity models with asymmetric measures by Jean Mark Gawron.

Abstract:

We show that asymmetric models based on Tversky (1977) improve correlations with human similarity judgments and nearest neighbor discovery for both frequent and middle-rank words. In accord with Tversky’s discovery that asymmetric similarity judgments arise when comparing sparse and rich representations, improvement on our two tasks can be traced to heavily weighting the feature bias toward the rarer word when comparing high- and mid- frequency words.

From the introduction:

A key assumption of most models of similarity is that a similarity relation is symmetric. This assumption is foundational for some conceptions, such as the idea of a similarity space, in which similarity is the inverse of distance; and it is deeply embedded into many of the algorithms that build on a similarity relation among objects, such as clustering algorithms. The symmetry assumption is not, however, universal, and it is not essential to all applications of similarity, especially when it comes to modeling human similarity judgments.
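
A minimal sketch of the Tversky-style measure the paper builds on may help: similarity of feature sets X and Y, with separate weights on the two set differences. The weights and toy feature sets below are mine, chosen only to show the asymmetry.

    # Tversky index: |X ∩ Y| / (|X ∩ Y| + a*|X \ Y| + b*|Y \ X|).
    # With a != b the measure is asymmetric, unlike cosine or Jaccard.
    def tversky(x, y, a=0.8, b=0.2):
        x, y = set(x), set(y)
        common = len(x & y)
        return common / (common + a * len(x - y) + b * len(y - x))

    sparse_word = {"mane", "tail"}        # rarer word: few features
    rich_word = {"mane", "tail", "hoof", "saddle", "gallop", "bridle"}
    print(tversky(sparse_word, rich_word))   # ≈ 0.71
    print(tversky(rich_word, sparse_word))   # ≈ 0.38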

What assumptions underlie your “similarity” measures?

Not that we can get away from “assumptions” but are your assumptions based on evidence or are they unexamined assumptions?

Do you know of any general techniques for discovering assumptions in algorithms?

August 25, 2014

Desperately Seeking Algorithms!

Filed under: Algorithms,Clojure,Functional Programming — Patrick Durusau @ 2:33 pm

I don’t know for sure that Christophe Grand is “desperately” seeking algorithms but he has tweeted a request for “favorite algorithms” to be cast into posts similar to:

Tarjan’s strongly connected components algorithm

I dislike algorithms that are full of indices and mutations. Not because they are bad but because I always have the feeling that the core idea is buried. As such, Tarjan’s SCC algorithm irked me.

So I took the traditional algorithm, implemented it in Clojure with explicit environment passing, then I replaced indices by explicit stacks (thanks to persistence!) and after some tweaks, I realized that I’ve gone full circle and could switch to stacks lengths instead of stacks themselves and get rid of the loop. However the whole process made the code cleaner to my eye. You can look at the whole history.

Christophe’s post ends with “Here is the resulting code:” followed by the Clojure implementation; see the Tarjan post for that code. Something similar is planned for other “favorite” algorithms.
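
For readers who want the flavor of the traditional, index-and-stack formulation Christophe started from, here is a minimal recursive sketch (in Python, not Clojure; illustrative only):

    # Tarjan's strongly connected components, recursive form.
    # graph: {node: [successor, ...]}
    def tarjan_scc(graph):
        counter = [0]
        index, lowlink = {}, {}
        stack, on_stack = [], set()
        sccs = []

        def strongconnect(v):
            index[v] = lowlink[v] = counter[0]
            counter[0] += 1
            stack.append(v)
            on_stack.add(v)
            for w in graph.get(v, []):
                if w not in index:
                    strongconnect(w)
                    lowlink[v] = min(lowlink[v], lowlink[w])
                elif w in on_stack:
                    lowlink[v] = min(lowlink[v], index[w])
            if lowlink[v] == index[v]:        # v is the root of a component
                component = []
                while True:
                    w = stack.pop()
                    on_stack.discard(w)
                    component.append(w)
                    if w == v:
                        break
                sccs.append(component)

        for v in graph:
            if v not in index:
                strongconnect(v)
        return sccs

    print(tarjan_scc({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}))
    # [['c', 'b', 'a'], ['d']]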

What algorithm are you going to submit?

Pass this along.

July 29, 2014

Alphabetical Order

Filed under: Algorithms,Sorting,Unicode — Patrick Durusau @ 3:00 pm

Alphabetical order explained in a mere 27,817 words by David Weinberger.

From the post:

This is one of the most amazing examples I’ve seen of the complexity of even simple organizational schemes. “Unicode Collation Algorithm (Unicode Technical Standard #10)” spells out in precise detail how to sort strings in what we might colloquially call “alphabetical order.” But it’s way, way, way more complex than that.

Unicode is an international standard for how strings of characters get represented within computing systems. For example, in the familiar ASCII encoding, the letter “A” is represented in computers by the number 65. But ASCII is too limited to encode the world’s alphabets. Unicode does the job.

As the paper says, “Collation is the general term for the process and function of determining the sorting order of strings of characters” so that, for example, users can look them up on a list. Alphabetical order is a simple form of collation.

The best part is the summary of Unicode Technical Standard #10:

This document dives resolutely into the brambles and does not give up. It incidentally exposes just how complicated even the simplest of sorting tasks is when looked at in their full context, where that context is history, language, culture, and the ambiguity in which they thrive.

We all learned the meaning of “alphabetical order” in elementary school. But which “alphabetical order” depends upon language, culture, context, etc.
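
A two-line illustration of the gap between what we learned in school and what a computer does by default (naive code point order; a UCA-conformant collation, via ICU or similar, is what UTS #10 specifies):

    # Code point order is not "alphabetical order": uppercase sorts before
    # lowercase, and accented letters land after 'z'.
    words = ["apple", "Zebra", "éclair", "banana"]
    print(sorted(words))
    # ['Zebra', 'apple', 'banana', 'éclair'] -- not what most readers expect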

Other terms and phrases have the same problem. But the vast majority of them have no Unicode Technical Report with all the possible meanings.

For those terms there are topic maps.

I first saw this in a tweet by Computer Science.

July 27, 2014

Elementary Algorithms

Filed under: Algorithms,Data Structures — Patrick Durusau @ 3:57 pm

Elementary Algorithms by Xinyu LIU.

From the github page:

AlgoXY is a free book about elementary algorithms and data structures. This book doesn’t only focus on an imperative (or procedural) approach, but also includes purely functional algorithms and data structures. It doesn’t require readers to master any programming languages, because all the algorithms are described using mathematical functions and pseudocode.

For reference and implementation purposes, source code in C, C++, Haskell, Python, Scheme/Lisp is available in addition to the book.

The contents of the book are provided under GNU FDL and the source code is under GNU GPLv3.

The PDF version can be downloaded from github: https://github.com/liuxinyu95/AlgoXY/blob/algoxy/preview/elementary-algorithms.pdf?raw=true

This book is also available online at: https://sites.google.com/site/algoxy/home

I was concerned when the HTML version for trie was only 2 pages long. You need to view the pdf version, which for trie is some forty (40) pages, to get an idea of the coverage of any particular algorithm.
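
Since the trie chapter is the one mentioned above, here is the smallest possible sketch of the structure itself (a dict-of-dicts trie with insert and lookup), just to fix the idea; the book’s forty pages cover far more.

    # Minimal trie: nested dicts, with "$" marking end of word.
    def trie_insert(root, word):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True

    def trie_contains(root, word):
        node = root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

    root = {}
    for w in ["car", "cart", "cat"]:
        trie_insert(root, w)
    print(trie_contains(root, "cart"), trie_contains(root, "ca"))   # True False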

I first saw this in a tweet by OxAX

July 20, 2014

Combinatorial Algorithms

Filed under: Algorithms,Combinatorics — Patrick Durusau @ 4:54 pm

Combinatorial Algorithms For Computers and Calculators by Albert Nijenhuis and Herbert S. Wilf. (PDF)

I suspect the word “calculators” betrays the age of this item. This edition was published in 1978. Still, Amazon is currently asking > $50.00 U.S. for it so at least knowing about it can save you some money.

Not that the algorithms covered have changed but the authors say that combinatorics changed enough between 1975 and 1978 to warrant a second edition.

I suspect that is true several times over between 1978 and 2014.

Still, there is value in a different presentation than you would see today, even without the latest content.

I first saw this in a tweet by Atabey Kaygun.

July 19, 2014

HOGWILD!

Filed under: Algorithms,Machine Learning — Patrick Durusau @ 2:21 pm

Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent by Feng Niu, Benjamin Recht, Christopher Ré and Stephen J. Wright.

Abstract:

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called Hogwild! which allows processors access to shared memory with the possibility of over-writing each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild! achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild! outperforms alternative schemes that use locking by an order of magnitude. (emphasis in original)

From further in the paper:

Our second graph cut problem sought a multi-way cut to determine entity recognition in a large database of web data. We created a data set of clean entity lists from the DBLife website and of entity mentions from the DBLife Web Crawl [11]. The data set consists of 18,167 entities and 180,110 mentions and similarities given by string similarity. In this problem each stochastic gradient step must compute a Euclidean projection onto a simplex of dimension 18,167.

A 9X speedup on 10 cores. (Against Vowpal Wabbit.)

A must read paper.
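
To make the scheme concrete, here is a toy Hogwild-style sketch, assuming numpy and Python threads: several workers run SGD on a shared weight vector with no locks, counting on sparse updates to rarely collide. (CPython’s GIL keeps this from being truly parallel, so treat it as an illustration of the update scheme, not a benchmark.)

    import numpy as np
    import threading

    dim = 1000
    w = np.zeros(dim)                          # shared parameters, never locked

    def worker(seed, steps=5000, lr=0.05, k=5):
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            idx = rng.choice(dim, size=k, replace=False)   # sparse support
            x = rng.standard_normal(k)
            y = x.sum()                        # consistent with true weights of 1.0
            grad = (w[idx] @ x - y) * x        # gradient on the support only
            w[idx] -= lr * grad                # unsynchronized write

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(round(w.mean(), 2))                  # close to 1.0 despite the races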

I first saw this in Nat Torkington’s Four short links: 15 July 2014. Nat says:

the algorithm that Microsoft credit with the success of their Adam deep learning system.

July 5, 2014

Understanding and Expressing Scalable Concurrency

Filed under: Algorithms,Concurrent Programming,Parallelism,Scalability — Patrick Durusau @ 4:35 pm

Understanding and Expressing Scalable Concurrency by Aaron Turon.

Abstract


The Holy Grail of parallel programming is to provide good speedup while hiding or avoiding the pitfalls of concurrency. But some level in the tower of abstraction must face facts: parallel processors execute code concurrently, and the interplay between concurrent code, synchronization, and the memory subsystem is a major determiner of performance. Effective parallel programming must ultimately be supported by scalable concurrent algorithms—algorithms that tolerate (or even embrace) concurrency for the sake of scaling with available parallelism. This dissertation makes several contributions to the understanding and expression of such algorithms:

  • It shows how to understand scalable algorithms in terms of local protocols governing each part of their hidden state. These protocols are visual artifacts that can be used to informally explain an algorithm at the whiteboard. But they also play a formal role in a new logic for verifying concurrent algorithms, enabling correctness proofs that are local in space, time, and thread execution. Correctness is stated in terms of refinement: clients of an algorithm can reason as if they were using the much simpler specification code it refines.
  • It shows how to express synchronization in a declarative but scalable way, based on a new library providing join patterns. By declarative, we mean that the programmer needs only to write down the constraints of a synchronization problem, and the library will automatically derive a correct solution. By scalable, we mean that the derived solutions deliver robust performance with increasing processor count and problem complexity. The library’s performance on common synchronization problems is competitive with specialized algorithms from the literature.
  • It shows how to express scalable algorithms through reagents, a new monadic abstraction. With reagents, concurrent algorithms no longer need to be constructed from “whole cloth,” i.e., by using system-level primitives directly. Instead, they are built using a mixture of shared-state and message-passing combinators. Concurrency experts benefit, because they can write libraries at a higher level, with more reuse, without sacrificing scalability. Their clients benefit, because composition empowers them to extend and tailor a library without knowing the details of its underlying algorithms.

Not for the faint of heart! 😉

But if you are interested in algorithms for when processing is always parallel by default, best dig in.

I like the author’s imagery of “Go Fish” when he says:

A scalable hashtable is useful not just for concurrent systems; it can also be a boon for explicit parallel programming. A simple but vivid example is the problem of duplicate removal: given a vector of items, return the items in any order, but without any duplicates. Since the input is unstructured, any way of dividing it amongst parallel threads appears to require global coordination to discover duplicate items. The key to avoiding a multiprocessor game of “Go Fish” is to focus on producing the output rather than dividing the input. If threads share a scalable hashtable that allows parallel insertion of distinct elements, they can construct the correct output with (on average) very little coordination, by simply each inserting a segment of the input into the table, one element at a time.
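
A toy version of the “produce the output, don’t divide the input” idea in that passage, assuming Python threads and a plain dict as the shared table (CPython’s GIL makes the insertions safe here; a genuinely scalable concurrent hash table is what the dissertation builds):

    import threading

    items = ["a", "b", "a", "c", "b", "d"] * 1000
    seen = {}                                   # shared "hash table"

    def insert_segment(segment):
        for x in segment:
            seen[x] = True                      # duplicates collapse on insert

    n = 4
    chunk = len(items) // n
    threads = [threading.Thread(target=insert_segment,
                                args=(items[i*chunk:(i+1)*chunk],)) for i in range(n)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(sorted(seen))                         # ['a', 'b', 'c', 'd']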

Now that I think about it, topic map processing does a lot of duplicate removal.

Topic maps in a parallel processing environment anyone?

I first saw this in a tweet by Alex Clemmer.

June 26, 2014

Visualizing Algorithms

Filed under: Algorithms,Visualization — Patrick Durusau @ 4:03 pm

Visualizing Algorithms by Mike Bostock.

From the post:

Algorithms are a fascinating use case for visualization. To visualize an algorithm, we don’t merely fit data to a chart; there is no primary dataset. Instead there are logical rules that describe behavior. This may be why algorithm visualizations are so unusual, as designers experiment with novel forms to better communicate. This is reason enough to study them.

But algorithms are also a reminder that visualization is more than a tool for finding patterns in data. Visualization leverages the human visual system to augment human intellect: we can use it to better understand these important abstract processes, and perhaps other things, too.

Better start with a fresh pot of coffee before you read Mike’s post. Mike covers visualization of sampling algorithms (using Van Gogh’s The Starry Night), sorting, and maze generation (2-D). It is well written and illustrated, but it is a lot of material to cover in one read.

The post finishes up with numerous references to other algorithm visualization efforts.

Put on your “must read” list for this weekend!

June 8, 2014

Laboratory for Web Algorithmics

Filed under: Algorithms,Search Algorithms,Search Engines,Webcrawler — Patrick Durusau @ 2:53 pm

Laboratory for Web Algorithmics

From the homepage:

The Laboratory for Web Algorithmics (LAW) was established in 2002 at the Dipartimento di Scienze dell’Informazione (now merged in the Computer Science Department) of the Università degli studi di Milano.

The LAW is part of the NADINE FET EU project.

Research at LAW concerns all algorithmic aspects of the study of the web and of social networks. More in detail…

The details include:

  • High-performance web crawling: Including an open source web crawler
  • Compression of web graphs and social networks: compression of web crawling results
  • Analysis of web graphs and social networks: research and algorithms for exploration of web graphs

Deeply impressive project and one with several papers and resources that I will be covering in more detail in future posts.

I first saw this in a tweet by Network Fact.

May 30, 2014

Quantum Computing Playground

Filed under: Algorithms,Computation,Quantum — Patrick Durusau @ 9:03 am

Google’s “Quantum Computing Playground” lets you fiddle with quantum algorithms by Dario Borghino.

From the post:

Google has just launched a new web-based integrated development environment (IDE) that allows users to write, run and debug software that makes use of quantum algorithms. The tool could allow computer scientists to stay ahead of the game and get acquainted with the many quirks of quantum algorithms even before the first practical quantum computer is built.

Homepage: Quantum Computing Playground.

From the homepage:

We strongly recommend running Quantum Playground in Google Chrome.

I accessed the homepage using, gasp, another browser. 😉 Just happenstance.

The page doesn’t say anything about use of another browser leaving your computer out of phase but probably best not to take chances.

Enjoy!

April 19, 2014

Algorithmic Newspapers?

Filed under: Algorithms,News — Patrick Durusau @ 1:20 pm

A print newspaper generated by robots: Is this the future of media or just a sideshow? by Mathew Ingram.

From the post:

What if you could pick up a printed newspaper, but instead of a handful of stories hand-picked by a secret cabal of senior editors in a dingy newsroom somewhere, it had pieces that were selected based on what was being shared — either by your social network or by users of Facebook, Twitter etc. as a whole? Would you read it? More importantly, would you pay for it?

You can’t buy one of those yet, but The Guardian (see disclosure below) is bringing an experimental print version it has been working on to the United States for the first time: a printed paper that is generated entirely — or almost entirely — by algorithms based on social-sharing activity and other user behavior by the paper’s readers. Is this a glimpse into the future of newspapers?

According to Digiday, the Guardian‘s offering — known as #Open001 — is being rolled out later this week. But you won’t be able to pick one up at the corner store: only 5,000 copies will be printed each month, and they are going to the offices of media and ad agencies. In other words, it’s as much a marketing effort at this point for the Guardian (which isn’t printed in the U.S.) as it is a publishing experiment.

Mathew recounts the Guardian effort and similar services, and asks whether robots can preserve the serendipity alleged to be introduced by editors. It’s a good read.

The editors at the Guardian may introduce stories simply because they are “important,” but is that the case for other media outlets?

I know that is often alleged, but peer review was also alleged to fund good projects and to ensure that good papers were published. The alleged virtues of peer review, when tested, have been found to be false.

How would you test for “serendipity” in a news outlet? That is, running stories not simply because they are popular in the local market, but because they are “important”?

Or to put it another way: Is the news from a local media outlet already being personalized/customized?

April 15, 2014

The Bw-Tree: A B-tree for New Hardware Platforms

Filed under: Algorithms,B-trees,Programming — Patrick Durusau @ 4:22 pm

The Bw-Tree: A B-tree for New Hardware Platforms by Justin J. Levandoski, David B. Lomet, and, Sudipta Sengupta.

Abstract:

The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B-tree, called the Bw-tree achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.

With easy availability of multi-core chips, what new algorithms are you going to discover while touring SICP or TAOCP?

Is that going to be an additional incentive to tour one or both of them?

April 9, 2014

Revealing the Uncommonly Common…

Filed under: Algorithms,ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:34 pm

Revealing the Uncommonly Common with Elasticsearch by Mark Harwood.

From the summary:

Mark Harwood shows how anomaly detection algorithms can spot card fraud, incorrectly tagged movies and the UK’s most unexpected hotspot for weapon possession.

Makes me curious about the market for a “Mr./Ms. Normal” service?

A service that lets you enter your real viewing/buying/entertainment preferences and, for a fee, generates a paper trail that hides your real habits in digital dust.

If you order porn from Netflix, then the “Mr./Ms. Normal” service will order enough PBS and NatGeo material to even out your rental record.

Depending on how extreme your buying habits happen to be, you may need a “Mr./Ms. Abnormal” service that shields you from any paper trail at all.

As data surveillance grows, having a pre-defined Mr./Ms. Normal/Abnormal account may become a popular high school/college graduation or even a wedding present.

The usefulness of data surveillance depends on the cooperation of its victims. Have you ever considered not cooperating? But appearing to?

April 5, 2014

Algorithmic Number Theory, Vol. 1: Efficient Algorithms

Filed under: Algebra,Algorithms,Mathematics — Patrick Durusau @ 7:20 pm

Algorithmic Number Theory, Vol. 1: Efficient Algorithms by Eric Bach and Jeffrey Shallit.

From the preface:

This is the first volume of a projected two-volume set on algorithmic number theory, the design and analysis of algorithms for problems from the theory of numbers. This volume focuses primarily on those problems from number theory that admit relatively efficient solutions. The second volume will largely focus on problems for which efficient algorithms are not known, and applications thereof.

We hope that the material in this book will be useful for readers at many levels, from the beginning graduate student to experts in the area. The early chapters assume that the reader is familiar with the topics in an undergraduate algebra course: groups, rings, and fields. Later chapters assume some familiarity with Galois theory.

As stated above, this book discusses the current state of the art in algorithmic number theory. This book is not an elementary number theory textbook, and so we frequently do not give detailed proofs of results whose central focus is not computational. Choosing otherwise would have made this book twice as long.

The webpage offers the BibTeX files for the bibliography, which includes more than 1800 papers and books.

BTW, Amazon notes that Volume 2 was never published.

Now that high performance computing resources are easily available, perhaps you can start working on your own volume 2. Yes?
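
Purely for flavor, here is one of the classic “efficient algorithm” staples from this area, the extended Euclidean algorithm (the basis for computing modular inverses); a minimal sketch, not taken from the book’s presentation:

    def ext_gcd(a, b):
        """Return (g, x, y) with g = gcd(a, b) and a*x + b*y == g."""
        x0, x1, y0, y1 = 1, 0, 0, 1
        while b:
            q, a, b = a // b, b, a % b
            x0, x1 = x1, x0 - q * x1
            y0, y1 = y1, y0 - q * y1
        return a, x0, y0

    g, x, y = ext_gcd(240, 46)
    print(g, x, y)            # 2 -9 47  (240*-9 + 46*47 == 2)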

I first saw this in a tweet by Alvaro Videla.

March 12, 2014

Raft Consensus Algorithm

Filed under: Algorithms,Consensus,Consistency,Paxos — Patrick Durusau @ 1:34 pm

Raft: Understandable Distributed Consensus

A compelling visualization of the Raft consensus algorithm!

I first saw the visualization link in a tweet by Aaron Bull Schaefer.

The visualization closes with pointers to more information on Raft.

One pointer is to the Raft Consensus Algorithm website.

From the homepage:

Raft is a consensus algorithm that is designed to be easy to understand. It’s equivalent to Paxos in fault-tolerance and performance. The difference is that it’s decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today.

There are links to videos + slides, the raft-dev Google Group, and numerous implementations of the Raft algorithm.

The other pointer from the visualization is to the Raft paper: In Search of an Understandable Consensus Algorithm (PDF) by Diego Ongaro and John Ousterhout.

From the paper (section 4):

We had several goals in designing Raft: it must provide a complete and appropriate foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.

Who would have thought that choosing more obvious/understandable approaches would have practical benefits?

There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications? Given a choice between an alternative that was concise but subtle and one that was longer (either in lines of code or explanation) but more obvious, we chose the more obvious approach. Fortunately, in most cases the more obvious approach was also more concise. (emphasis added)

Understandability, now there’s a useful requirement.

February 17, 2014

Understanding Classic SoundEx Algorithms

Filed under: Algorithms,SoundEx,Subject Identity — Patrick Durusau @ 8:53 pm

Understanding Classic SoundEx Algorithms

From the webpage:

Terms that are often misspelled can be a problem for database designers. Names, for example, are variable length, can have strange spellings, and they are not unique. American names have a diversity of ethnic origins, which give us names pronounced the same way but spelled differently and vice versa.

Words too, can be misspelled or have multiple spellings, especially across different cultures or national sources.

To help solve this problem, we need phonetic algorithms which can find similar sounding terms and names. Just such a family of algorithms exist and have come to be called SoundEx algorithms.

A Soundex search algorithm takes a written word, such as a person’s name, as input, and produces a character string that identifies a set of words that are (roughly) phonetically alike. It is very handy for searching large databases when the user has incomplete data.

The method used by Soundex is based on the six phonetic classifications of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal), which are themselves based on where you put your lips and tongue to make the sounds.

The algorithm itself is fairly straight forward to code and requires no backtracking or multiple passes over the input word. In fact, it is so straight forward, I will start (after a history section) by presenting it as an outline. Further on, I will give C, JavaScript, Perl, and VB code that implements the two standard algorithms used in the American Census as well as an enhanced version, which is described in this article.
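
The article gives C, JavaScript, Perl, and VB versions; as a quick reference, here is a minimal Python sketch of the classic census-style rules (first letter kept, consonants mapped to six digit classes, H and W ignored, runs of the same code collapsed, vowels breaking runs, padded to four characters):

    def soundex(name):
        codes = {"B": "1", "F": "1", "P": "1", "V": "1",
                 "C": "2", "G": "2", "J": "2", "K": "2", "Q": "2",
                 "S": "2", "X": "2", "Z": "2",
                 "D": "3", "T": "3", "L": "4", "M": "5", "N": "5", "R": "6"}
        name = name.upper()
        result = name[0]
        prev = codes.get(name[0], "")
        for ch in name[1:]:
            if ch in "HW":
                continue                      # H and W do not break a run
            code = codes.get(ch, "")          # vowels code to "" and do break runs
            if code and code != prev:
                result += code
            prev = code
        return (result + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))
    # R163 R163 A261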

A timely reminder that knowing what is likely to be confused can be more powerful than the details of any particular confusion.

Even domain level semantics may be too difficult to capture. What if we were to capture only the known cases of confusion?

That would be a much smaller set than the domain in general and easier to maintain. (As well as to distinguish in a solution.)

February 14, 2014

“Envy-Free” Algorithm

Filed under: Algorithms — Patrick Durusau @ 4:07 pm

Valentine’s Day is a good time to discuss the flaws in this “envy free” algorithm.

Researchers Develop “Envy-Free” Algorithm for Settling Disputes from Divorce to Inheritance

The headline should have read:

Researchers Develop Theoretical Algorithm for Settling Disputes from Divorce to Inheritance

From the article:

This algorithm is “envy free” because each party prefers each of its items to a corresponding item of the other party. A potential conflict arises, of course, when the two parties desire the same item at the same time. For example, assume players A and B rank four items, going from left to right, as follows:

A: 1 2 3 4
B: 2 3 4 1

Now, if we give A item 1 and B item 2 (their most preferred), the next unallocated item on both their lists is item 3. Who should get it? The algorithm gives it to A and gives item 4 to B, which is an envy-free allocation because each player prefers its items to the other player’s:

A prefers item 1 to 2 and item 3 to 4
B prefers item 2 to 3 and item 4 to 1

Not only does each party prefer its pair of items to the other’s, but there is no alternative allocation that both parties would prefer, which makes it efficient. Although such an efficient, envy-free allocation is not always possible, the algorithm finds one that comes as close to this ideal as can be achieved.
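
The example lends itself to a small checker (a sketch of the stated condition only, not the researchers’ allocation algorithm): given each party’s ranking, most preferred first, and a proposed split, test whether each party prefers its own k-th best item to the other party’s k-th best item.

    def prefers(ranking, a, b):
        return ranking.index(a) < ranking.index(b)

    def envy_free(rank_a, rank_b, alloc_a, alloc_b):
        a_own = sorted(alloc_a, key=rank_a.index)
        a_other = sorted(alloc_b, key=rank_a.index)
        a_ok = all(prefers(rank_a, x, y) for x, y in zip(a_own, a_other))
        b_own = sorted(alloc_b, key=rank_b.index)
        b_other = sorted(alloc_a, key=rank_b.index)
        b_ok = all(prefers(rank_b, x, y) for x, y in zip(b_own, b_other))
        return a_ok and b_ok

    rank_a = [1, 2, 3, 4]                  # A's ranking, most preferred first
    rank_b = [2, 3, 4, 1]                  # B's ranking
    print(envy_free(rank_a, rank_b, [1, 3], [2, 4]))   # True, as in the example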

The first problem with the algorithm is that while it may account for envy, it does not account for the desire of party A to deprive party B of anything B desires.

Failing to account for spite, the algorithm is useful only in non-spiteful divorces and inheritance disputes.

Care to wager on the incidence of non-spiteful divorces and inheritance disputes?

Distinct from any “envy” or “spite” with regard to material items, there are other issues that the algorithm fails to account for in a divorce or inheritance dispute.

I can’t share the details, but I do recall the testimony in one case being that the other spouse had remarried, “that ugly woman.” That was the real driver behind all of the other issues.

The “envy-free” algorithm isn’t quite ready for a scorned spouse case.

All algorithms sound good in an ideal world. Next time you read about an algorithm, think of data that isn’t ideal for it. Become an algorithm skeptic. Your clients will be better off for it.

February 13, 2014

Algebra for Analytics:…

Filed under: Algebra,Algorithms,Analytics — Patrick Durusau @ 3:44 pm

Algebra for Analytics: Two pieces for scaling computations, ranking and learning by P. Oscar Boykin.

Slide deck from Oscar’s presentation at Strataconf 2014.

I don’t normally say a slide deck on algebra is inspirational but I have to for this one!

Looking forward to watching the video of the presentation that went along with it.

Think of all the things you can do with associativity and hashes before you review the slide deck.

It will make it all the more amazing.
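
One small seed for that exercise, assuming nothing beyond the standard library: if the combining step is associative, per-chunk partial results can be merged in any grouping, which is exactly what lets the work spread across mappers and reducers.

    from functools import reduce

    def combine(a, b):                 # associative: (a + b) + c == a + (b + c)
        return a + b

    data = list(range(1, 101))
    chunks = [data[i:i + 25] for i in range(0, len(data), 25)]   # "per machine"
    partials = [reduce(combine, chunk) for chunk in chunks]
    print(reduce(combine, partials))   # 5050, same as summing everything at once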

I first saw this in a tweet by Twitter Open Source.

January 14, 2014

Algorithmic Music Discovery at Spotify

Filed under: Algorithms,Machine Learning,Matrix,Music,Music Retrieval,Python — Patrick Durusau @ 3:19 pm

Algorithmic Music Discovery at Spotify by Chris Johnson.

From the description:

In this presentation I introduce various Machine Learning methods that we utilize for music recommendations and discovery at Spotify. Specifically, I focus on Implicit Matrix Factorization for Collaborative Filtering, how to implement a small scale version using python, numpy, and scipy, as well as how to scale up to 20 Million users and 24 Million songs using Hadoop and Spark.

Among a number of interesting points, Chris points out differences between movie and music data.

One difference is that songs are consumed over and over again. Another is that users rate movies but “vote” by their streaming behavior on songs.*

Which leads to Chris’ main point: implicit matrix factorization. Code. The source code page points to: Collaborative Filtering for Implicit Feedback Datasets by Yifan Hu, Yehuda Koren, and Chris Volinsky.

Scaling that process is represented in blocks for Hadoop and Spark.
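
For readers who want to poke at it locally, here is a tiny dense sketch of the implicit-feedback ALS formulation from the Hu, Koren and Volinsky paper, using numpy only; the matrix, weights, and iteration counts are toy values, and real systems use the sparse Hadoop/Spark versions Chris describes.

    import numpy as np

    R = np.array([[5, 0, 0, 1],
                  [0, 3, 1, 0],
                  [4, 0, 0, 2]], dtype=float)     # play counts: users x songs
    alpha, lam, k, iters = 40.0, 0.1, 2, 15
    C = 1.0 + alpha * R                            # confidence weights
    P = (R > 0).astype(float)                      # binary preference
    rng = np.random.default_rng(0)
    X = rng.normal(scale=0.1, size=(R.shape[0], k))    # user factors
    Y = rng.normal(scale=0.1, size=(R.shape[1], k))    # item factors

    def solve(factors, C_rows, P_rows):
        out = np.empty((C_rows.shape[0], k))
        for i in range(C_rows.shape[0]):
            Cu = np.diag(C_rows[i])
            A = factors.T @ Cu @ factors + lam * np.eye(k)
            b = factors.T @ Cu @ P_rows[i]
            out[i] = np.linalg.solve(A, b)
        return out

    for _ in range(iters):
        X = solve(Y, C, P)         # fix item factors, solve for users
        Y = solve(X, C.T, P.T)     # fix user factors, solve for items

    print(np.round(X @ Y.T, 2))    # reconstructed preference scores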

* I suspect that “behavior” is more reliable than “ratings” from the same user, my reasoning being that ratings are more likely to be subject to social influences. I don’t have any research at my fingertips on that issue. Do you?

January 4, 2014

Know Thy Complexities!

Filed under: Algorithms,Complexity — Patrick Durusau @ 9:26 pm

Know Thy Complexities!

From the post:

Hi there! This webpage covers the space and time Big-O complexities of common algorithms used in Computer Science. When preparing for technical interviews in the past, I found myself spending hours crawling the internet putting together the best, average, and worst case complexities for search and sorting algorithms so that I wouldn’t be stumped when asked about them. Over the last few years, I’ve interviewed at several Silicon Valley startups, and also some bigger companies, like Yahoo, eBay, LinkedIn, and Google, and each time that I prepared for an interview, I thought to myself “Why oh why hasn’t someone created a nice Big-O cheat sheet?”. So, to save all of you fine folks a ton of time, I went ahead and created one. Enjoy!
….

The algorithms are linked to appropriate entries in Wikipedia.

But other data exists on these algorithms and new results are common.

If this is a “cheatsheet” view, what other views of that data would you create?

I first saw this in a tweet by The O.C.R.

December 31, 2013

Provable Algorithms for Machine Learning Problems

Provable Algorithms for Machine Learning Problems by Rong Ge.

Abstract:

Modern machine learning algorithms can extract useful information from text, images and videos. All these applications involve solving NP-hard problems in average case using heuristics. What properties of the input allow it to be solved efficiently? Theoretically analyzing the heuristics is very challenging. Few results were known.

This thesis takes a different approach: we identify natural properties of the input, then design new algorithms that provably works assuming the input has these properties. We are able to give new, provable and sometimes practical algorithms for learning tasks related to text corpus, images and social networks.

The first part of the thesis presents new algorithms for learning thematic structure in documents. We show under a reasonable assumption, it is possible to provably learn many topic models, including the famous Latent Dirichlet Allocation. Our algorithm is the first provable algorithms for topic modeling. An implementation runs 50 times faster than latest MCMC implementation and produces comparable results.

The second part of the thesis provides ideas for provably learning deep, sparse representations. We start with sparse linear representations, and give the first algorithm for dictionary learning problem with provable guarantees. Then we apply similar ideas to deep learning: under reasonable assumptions our algorithms can learn a deep network built by denoising autoencoders.

The final part of the thesis develops a framework for learning latent variable models. We demonstrate how various latent variable models can be reduced to orthogonal tensor decomposition, and then be solved using tensor power method. We give a tight sample complexity analysis for tensor power method, which reduces the number of sample required for learning many latent variable models.

In theory, the assumptions in this thesis help us understand why intractable problems in machine learning can often be solved; in practice, the results suggest inherently new approaches for machine learning. We hope the assumptions and algorithms inspire new research problems and learning algorithms.

Admittedly an odd notion, starting with the data rather than with an answer and working back toward the data, but it does happen. 😉

Given the performance improvements for LDA (50X), I anticipate this approach being applied to algorithms for “big data.”

I first saw this in a tweet by Chris Deihl.

December 22, 2013

The sound of sorting…

Filed under: Algorithms,Programming,Sorting,Visualization — Patrick Durusau @ 8:33 pm

The sound of sorting – 15 sorting algorithms visualized and sonified by Alex Popescu.

Alex has embedded a YouTube video that visualizes and sonifies 15 sorting algorithms.

As a special treat, Alex links to more details on the video.

If you are interested in more visualizations of algorithms, see Algoviz.org.

December 14, 2013

Everything is Editorial:..

Filed under: Algorithms,Law,Legal Informatics,Search Algorithms,Searching,Semantics — Patrick Durusau @ 7:57 pm

Everything is Editorial: Why Algorithms are Hand-Made, Human, and Not Just For Search Anymore by Aaron Kirschenfeld.

From the post:

Down here in Durham, NC, we have artisanal everything: bread, cheese, pizza, peanut butter, and of course coffee, coffee, and more coffee. It’s great—fantastic food and coffee, that is, and there is no doubt some psychological kick from knowing that it’s been made carefully by skilled craftspeople for my enjoyment. The old ways are better, at least until they’re co-opted by major multinational corporations.

Aside from making you either hungry or jealous, or perhaps both, why am I talking about fancy foodstuffs on a blog about legal information? It’s because I’d like to argue that algorithms are not computerized, unknowable, mysterious things—they are produced by people, often painstakingly, with a great deal of care. Food metaphors abound, helpfully I think. Algorithms are the “special sauce” of many online research services. They are sets of instructions to be followed and completed, leading to a final product, just like a recipe. Above all, they are the stuff of life for the research systems of the near future.

Human Mediation Never Went Away

When we talk about algorithms in the research community, we are generally talking about search or information retrieval (IR) algorithms. A recent and fascinating VoxPopuLII post by Qiang Lu and Jack Conrad, “Next Generation Legal Search – It’s Already Here,” discusses how these algorithms have become more complicated by considering factors beyond document-based, topical relevance. But I’d like to step back for a moment and head into the past for a bit to talk about the beginnings of search, and the framework that we have viewed it within for the past half-century.

Many early information-retrieval systems worked like this: a researcher would come to you, the information professional, with an information need, that vague and negotiable idea which you would try to reduce to a single question or set of questions. With your understanding of Boolean search techniques and your knowledge of how the document corpus you were searching was indexed, you would then craft a search for the computer to run. Several hours later, when the search was finished, you would be presented with a list of results, sometimes ranked in order of relevance and limited in size because of a lack of computing power. Presumably you would then share these results with the researcher, or perhaps just turn over the relevant documents and send him on his way. In the academic literature, this was called “delegated search,” and it formed the background for the most influential information retrieval studies and research projects for many years—the Cranfield Experiments. See also “On the History of Evaluation in IR” by Stephen Robertson (2008).

In this system, literally everything—the document corpus, the index, the query, and the results—were mediated. There was a medium, a middle-man. The dream was to some day dis-intermediate, which does not mean to exhume the body of the dead news industry. (I feel entitled to this terrible joke as a former journalist… please forgive me.) When the World Wide Web and its ever-expanding document corpus came on the scene, many thought that search engines—huge algorithms, basically—would remove any barrier between the searcher and the information she sought. This is “end-user” search, and as algorithms improved, so too would the system, without requiring the searcher to possess any special skills. The searcher would plug a query, any query, into the search box, and the algorithm would present a ranked list of results, high on both recall and precision. Now, the lack of human attention, evidenced by the fact that few people ever look below result 3 on the list, became the limiting factor, instead of the lack of computing power.

(Illustration: delegated search)

The only problem with this is that search engines did not remove the middle-man—they became the middle-man. Why? Because everything, whether we like it or not, is editorial, especially in reference or information retrieval. Everything, every decision, every step in the algorithm, everything everywhere, involves choice. Search engines, then, are never neutral. They embody the priorities of the people who created them and, as search logs are analyzed and incorporated, of the people who use them. It is in these senses that algorithms are inherently human.

A delightful piece on search algorithms that touches at the heart of successful search and/or data integration.

Its first three words capture the issue: Everything is Editorial….

Despite the pretensions of scholars, sages and rogues, everything is editorial, there are no universal semantic primitives.

For convenience in data processing we may choose to treat some tokens as semantic primitives, but that is always a choice that we make.

Once you make that leap, it comes as no surprise that owl:sameAs wasn’t used the same way by everyone who used it.

See: When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web by Harry Halpin, Ivan Herman, and Patrick J. Hayes, for one take on the confusion around owl:sameAs.

If you are interested in moving beyond opaque keyword searching, consider Aaron’s post carefully.

November 30, 2013

CGAL: Computational Geometry Algorithms Library

Filed under: Algorithms,Computational Geometry,Mathematics,Programming — Patrick Durusau @ 7:47 pm

CGAL: Computational Geometry Algorithms Library

From the webpage:

The goal of the CGAL Open Source Project is to provide easy access to efficient and reliable geometric algorithms in the form of a C++ library. CGAL is used in various areas needing geometric computation, such as: computer graphics, scientific visualization, computer aided design and modeling, geographic information systems, molecular biology, medical imaging, robotics and motion planning, mesh generation, numerical methods… More on the projects using CGAL web page.

The Computational Geometry Algorithms Library (CGAL), offers data structures and algorithms like triangulations (2D constrained triangulations, and Delaunay triangulations and periodic triangulations in 2D and 3D), Voronoi diagrams (for 2D and 3D points, 2D additively weighted Voronoi diagrams, and segment Voronoi diagrams), polygons (Boolean operations, offsets, straight skeleton), polyhedra (Boolean operations), arrangements of curves and their applications (2D and 3D envelopes, Minkowski sums), mesh generation (2D Delaunay mesh generation and 3D surface and volume mesh generation, skin surfaces), geometry processing (surface mesh simplification, subdivision and parameterization, as well as estimation of local differential properties, and approximation of ridges and umbilics), alpha shapes, convex hull algorithms (in 2D, 3D and dD), search structures (kd trees for nearest neighbor search, and range and segment trees), interpolation (natural neighbor interpolation and placement of streamlines), shape analysis, fitting, and distances (smallest enclosing sphere of points or spheres, smallest enclosing ellipsoid of points, principal component analysis), and kinetic data structures.

All these data structures and algorithms operate on geometric objects like points and segments, and perform geometric tests on them. These objects and predicates are regrouped in CGAL Kernels.

Finally, the Support Library offers geometric object generators and spatial sorting functions, as well as a matrix search framework and a solver for linear and quadratic programs. It further offers interfaces to third party software such as the GUI libraries Qt, Geomview, and the Boost Graph Library.

I found this earlier today while searching for support for half-edges in graphs (CGAL supports half-edges).

November 26, 2013

A Book About Bandit Algorithms

Filed under: Algorithms,Computer Science,Programming — Patrick Durusau @ 9:50 am

A Book About Bandit Algorithms by Noel Welsh.

From the introduction:

…The basic bandit problem is this:

We are alone in a room full of slot machines (fruit machines, or one-armed bandits). At every tick of the clock we must choose a machine to play. When we pull the arm on a machine we get a reward – perhaps a river of coins tumbles from the machine, or perhaps nothing at all happens – but we don’t know what reward we will get till we try the arm. By pulling arms we can learn about the machines, and we must use this knowledge to choose the best arms to pull.

Though I’ve tried to make the problem concrete (and also show how the name multi-armed bandit arises) my description leaves many things undefined:

  • I haven’t said how the machines behave. Do they always reward at the same rate, or does the rate of reward change arbitrarily, or even capriciously?
  • Do we know anything about the machines before we start? Are some of them, for example, from the same manufacturer, so we might assume they behave similarly?
  • What exactly is the goal? Are we playing this game forever, and if so do we want to make the most money possible? Alternatively, we might want to find the best machine as quickly as possible, so we can go do something more interesting with our time instead.

All of these different variations are bandit problems that we will discuss. The essence of a bandit problem is meeting two criteria. Firstly, we only learn about actions when we take them. Thus bandits are partial information problems. We also require that our actions do not change the machines’ behaviour, though their behaviour may change for other reasons. This important restriction means we don’t have to consider the effect our past actions might have on the present, and makes bandit algorithms tractable to solve. If we drop this restriction we have a problem known as reinforcement learning.
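
For a concrete starting point, here is a minimal epsilon-greedy sketch of the basic problem just described, assuming the simplest variant: machines with fixed, unknown Bernoulli payout rates, one pull per tick, and estimates updated only for the arm actually pulled.

    import random

    true_rates = [0.2, 0.5, 0.7]           # hidden payout probability per machine
    counts = [0] * len(true_rates)
    estimates = [0.0] * len(true_rates)
    epsilon, ticks, total_reward = 0.1, 10_000, 0

    for _ in range(ticks):
        if random.random() < epsilon:                          # explore
            arm = random.randrange(len(true_rates))
        else:                                                  # exploit best guess
            arm = max(range(len(true_rates)), key=lambda a: estimates[a])
        reward = 1 if random.random() < true_rates[arm] else 0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]   # running mean
        total_reward += reward

    print([round(e, 2) for e in estimates], total_reward)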

Is merging a “partial information problem (PIP)?”

Or does that answer depend upon the complexity of the merging criteria?

If the status of merging as a PIP rests on the complexity of merging criteria, what is the boundary where merging becomes a PIP?

A book in progress so be sure to sign up for new content and assist Noel with comments.

I first saw this mentioned in a tweet by Lars Marius Garshol.

October 21, 2013

Work Efficient Parallel Algorithms for Large Graph Exploration

Filed under: Algorithms,Graphs,Parallel Programming — Patrick Durusau @ 6:27 pm

Work Efficient Parallel Algorithms for Large Graph Exploration by Dip Sankar Banerjee, Shashank Sharma, Kishore Kothapalli.

Abstract:

Graph algorithms play a prominent role in several fields of sciences and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of the above problems on a variety of modern multiprocessor architectures. It can be noticed in recent times that the size of the graphs that correspond to real world data sets has been increasing. Parallelism offers only a limited succor to this situation as current parallel architectures have severe short-comings when deployed for most graph algorithms. At the same time, these graphs are also getting very sparse in nature. This calls for particular work efficient solutions aimed at processing large, sparse graphs on modern parallel architectures. In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained for the pruned graph, the solution is extended to the entire graph. We apply the above technique on three fundamental graph algorithms: breadth first search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP). To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average of 35% improvement compared to state-of-the-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.

This paper makes me curious whether pruning could be a viable solution for processing very large topic maps.

I may have a map that includes all of Wikipedia but for some purposes, I have no interest in any of the Wikipedia data.

Or I may have different accounting entries that display depending upon the user accessing my accounting map. Totals, audit trails should be generated from one consistent set of entries.

Any experience or thoughts in that regard?

October 15, 2013

Nearest Neighbor Search in Google Correlate

Filed under: Algorithms,Google Correlate,Nearest Neighbor — Patrick Durusau @ 6:41 pm

Nearest Neighbor Search in Google Correlate by Dan Vanderkam, Rob Schonberger, Henry Rowley, Sanjiv Kumar.

Abstract:

This paper presents the algorithms which power Google Correlate, a tool which finds web search terms whose popularity over time best matches a user-provided time series. Correlate was developed to generalize the query-based modeling techniques pioneered by Google Flu Trends and make them available to end users.

Correlate searches across millions of candidate query time series to find the best matches, returning results in less than 200 milliseconds. Its feature set and requirements present unique challenges for Approximate Nearest Neighbor (ANN) search techniques. In this paper, we present Asymmetric Hashing (AH), the technique used by Correlate, and show how it can be adapted to the specific needs of the product.

We then develop experiments to test the throughput and recall of Asymmetric Hashing as compared to a brute-force search. For “full” search vectors, we achieve a 10x speedup over brute force search while maintaining 97% recall. For search vectors which contain holdout periods, we achieve a 4x speedup over brute force search, also with 97% recall. (I followed the paragraphing in the PDF for the abstract.)
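
Not Asymmetric Hashing, but the brute-force baseline the paper measures against, as a small numpy sketch: find the candidate time series with the highest Pearson correlation to the user’s series (the names and numbers below are made up).

    import numpy as np

    def normalize(v):
        v = np.asarray(v, dtype=float)
        v = v - v.mean()
        return v / (np.linalg.norm(v) + 1e-12)

    def best_match(user_series, candidates):
        u = normalize(user_series)
        best_name, best_corr = None, -2.0
        for name, series in candidates.items():
            corr = float(u @ normalize(series))         # Pearson correlation
            if corr > best_corr:
                best_name, best_corr = name, corr
        return best_name, best_corr

    candidates = {"flu remedies": [1, 2, 8, 9, 3, 1],
                  "sunscreen":    [9, 7, 2, 1, 3, 8]}
    print(best_match([2, 3, 9, 10, 4, 2], candidates))  # ('flu remedies', ...)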

Ten-x speedups on Google Flu Trends size data sets are non-trivial.

If you are playing in that size sand box, this is a must read.

