Archive for the ‘Algorithms’ Category

Colorized Math Equations [Algorithms?]

Friday, December 15th, 2017

Colorized Math Equations by Kalid Azad.

From the post:

Years ago, while working on an explanation of the Fourier Transform, I found this diagram:


Argh! Why aren’t more math concepts introduced this way?

Most ideas aren’t inherently confusing, but their technical description can be (e.g., reading sheet music vs. hearing the song.)

My learning strategy is to find what actually helps when learning a concept, and do more of it. Not the stodgy description in the textbook — what made it click for you?

The checklist of what I need is ADEPT: Analogy, Diagram, Example, Plain-English Definition, and Technical Definition.

Here’s a few reasons I like the colorized equations so much:

  • The plain-English description forces an analogy for the equation. Concepts like “energy”, “path”, “spin” aren’t directly stated in the equation.
  • The colors, text, and equations are themselves a diagram. Our eyes bounce back and forth, reading the equation like a map (not a string of symbols).
  • The technical description — our ultimate goal — is not hidden. We’re not choosing between intuition or technical, it’s intuition for the technical.

Of course, we still need examples to check our understanding, but 4/5 ain’t bad!

Azad includes a LaTeX template that he uses to create colorized math equations.

Consider the potential use of color + explanation for algorithms. Being mindful that use of color presents accessibility issues that will require cleverness on your part.

Another tool for your explanation quiver!


Monday, December 11th, 2017

Mathwashing: How Algorithms Can Hide Gender and Racial Biases by Kimberley Mok.

From the post:

Scholars have long pointed out that the way languages are structured and used can say a lot about the worldview of their speakers: what they believe, what they hold sacred, and what their biases are. We know humans have their biases, but in contrast, many of us might have the impression that machines are somehow inherently objective. But does that assumption apply to a new generation of intelligent, algorithmically driven machines that are learning our languages and training from human-generated datasets? By virtue of being designed by humans, and by learning natural human languages, might these artificially intelligent machines also pick up on some of those same human biases too?

It seems that machines can and do indeed assimilate human prejudices, whether they are based on race, gender, age or aesthetics. Experts are now finding more evidence that supports this phenomenon of algorithmic bias. As sets of instructions that help machines to learn, reason, recognize patterns and perform tasks on their own, algorithms increasingly pervade our lives. And in a world where algorithms already underlie many of those big decisions that can change lives forever, researchers are finding that many of these algorithms aren’t as objective as we assume them to be.

If you have ever suffered from the delusion that algorithms, any algorithm is “objective,” this post is a must read. Or re-read to remind yourself that “objectivity” is a claim used to put your position beyond question for self-interest. Nothing more.

For my part, I’m not sure what’s unclear about data collected, algorithms chosen, interpretation of results, all being the results of bias?

There may be acceptable biases, or degrees of bias, but the goal of any measurement is a result, which automatically biases a measurer in favor of phenomena that can be measured by a convenient technique. Phenomena that cannot be easily measured, no matter how important, won’t be included.

By the same token, “bias-correction” is the introduction of an acceptable bias and/or limiting bias to what is seen as, to the person judging the presence of bias, to an acceptable level of bias.

Bias is omnipresent and while evaluating algorithms is important, always bear in mind you are choosing acceptable bias over unacceptable bias.

Or to mis-quote the Princess Bride: “Bias is everywhere. Anyone who says differently is selling something.”

DHS Algorithms – Putting Discrimination Beyond Discussion

Friday, November 17th, 2017

Coalition of 100+ tech groups and leaders warn the DHS that “extreme vetting” software will be a worse-than-useless, discriminatory nightmare by Cory Doctorow.

From the post:

In a pair of open letters to Letter to The Honorable Elaine C. Duke, Acting Secretary of Homeland, a coalition of more than 100 tech liberties groups and leading technology experts urged the DHS to abandon its plan to develop a black-box algorithmic system for predicting whether foreigners coming to the USA to visit or live are likely to be positive contributors or risks to the nation.

The letters warn that algorithmic assessment tools will be prone to religious and racial bias, in which programmers get to decide, without evidence, debate or transparency, what kind of person should be an American — which jobs, attitudes, skills and family types are “American” and which ones are “undesirable.”

Further, the system for predicting terrorist proclivities will draw from an infinitesimal data-set of known terrorists, whose common characteristics will be impossible to divide between correlative and coincidental.

If the Department of Homeland Security (DHS) needed confirmation it’s on the right track, then Doctorow and “the 100 tech liberties groups and leading technology experts” have provided that confirmation.

The letters warn that algorithmic assessment tools will be prone to religious and racial bias, in which programmers get to decide, without evidence, debate or transparency, what kind of person should be an American — which jobs, attitudes, skills and family types are “American” and which ones are “undesirable.”

To discriminate “…without evidence, debate or transparency…” is an obvious, if unstated, goal of the DHS “black-box algorithmic system.”

The claim by Doctorow and others the system will be ineffectual:

…the system for predicting terrorist proclivities will draw from an infinitesimal data-set of known terrorists, whose common characteristics will be impossible to divide between correlative and coincidental

imposes a requirement of effectiveness that has never been applied to the DHS.

Examples aren’t hard to find but consider that since late 2001, the Transportation Safety Administration (TSA) has not caught a single terrorist. Let me repeat that: Since late 2001, the Transportation Safety Administration (TSA) has not caught a single terrorist. But visit any airport and the non-terrorist catching TSA is in full force.

Since the Naturalization Act of 1790 forward, granting naturalization to “…free white person[s]…,” US immigration policy has been, is and likely will always be, a seething cauldron of discrimination.

That the DNS wants to formalize whim, caprice and discrimination into algorithms “…without evidence, debate or transparency…” comes as no surprise.

That Doctorow and others think pointing out discrimination to those with a history, habit and intent to discriminate is meaningful is surprising.

I’m doubtful that educating present members of Congress about the ineffective and discriminatory impact of the DHS plan will be useful as well. Congress is the source of the current discriminatory laws governing travel and immigration so I don’t sense a favorable reception there either.

Perhaps new members of Congress or glitches in DHS algorithms/operations that lead to unforeseen consequences?

The power of algorithms and how to investigate them (w/ resources)

Tuesday, May 23rd, 2017

The power of algorithms and how to investigate them by Katrien Vanherck.

From the post:

Most Americans these days get their main news from Google or Facebook, two tools that rely heavily on algorithms. A study in 2015 showed that the way a search engine like Google selects and prioritises search results on political candidates can have an influence on voters’ preferences.

Similarly, it has been shown that by tweaking the algorithms behind the Facebook newsfeed, the turnout of voters in American elections can be influenced. If Marc Zuckerberg were ever to run for president, he would theoretically have an enormously powerful tool at his disposal. (Note: as recent article in The Guardian investigated the misuse of big data and social media in the context of the Brexit referendum).

Algorithms are everywhere in our everyday life and are exerting a lot of power in our society. They prioritise, classify, connect and filter information, automatically making decisions on our behalf all the time. But as long as the algorithms remain a ‘black box’, we don’t know exactly how these decisions are made.

Are these algorithms always fair? Examples of possible racial bias in algorithms include the risk analysis score that is calculated for prisoners that are up for parole or release (white people appear to get more favourable scores more often) and the service quality of Uber in Washington DC (waiting times are shorter in predominantly white neighbourhoods). Maybe such unfair results are not only due to the algorithms, but the lack of transparency remains a concern.

So what is going on in these algorithms, and how can we make them more accountable?
… (emphasis in original)

A great inspirational keynote but short on details for investigation of algorithms.

Such as failing to mention the algorithms of both Google and Facebook are secret.

Reverse engineering those from results would be a neat trick.

Google would be the easier of the two, since you could script searches domain by domain with a list of search terms to build up a data set of its results. That would not result in the algorithm per se but you could detect some of its contours.

Google has been accused of liberal bias, Who would Google vote for? An analysis of political bias in internet search engine results, bias in favor of Hillary Clinton, Google defends its search engine against charges it favors Clinton, and, bias in favor of the right wing, How Google’s search algorithm spreads false information with a rightwing bias.

To the extent you identify Hillary Clinton with the rightwing, those results may be expressions of the same bias.

In any event, you can discern from those studies some likely techniques to use in testing Google search/auto-completion results.

Facebook is be harder because you don’t have access to or control over the content it is manipulating for delivery. Although by manipulating social media identities, you could test and compare the content that Facebook delivers.

3 Reasons to Read: Algorithms to Live By

Monday, April 24th, 2017

How Algorithms can untangle Human Questions. Interview with Brian Christian by Roberto V. Zican.

The entire interview is worth your study but the first question and answer establish why you should read Algorithms to Live By:

Q1. You have worked with cognitive scientist Tom Griffiths (professor of psy­chol­ogy and cognitive science at UC Berkeley) to show how algorithms used by computers can also untangle very human questions. What are the main lessons learned from such a joint work?

Brian Christian: I think ultimately there are three sets of insights that come out of the exploration of human decision-making from the perspective of computer science.

The first, quite simply, is that identifying the parallels between the problems we face in everyday life and some of the canonical problems of computer science can give us explicit strategies for real-life situations. So-called “explore/exploit” algorithms tell us when to go to our favorite restaurant and when to try something new; caching algorithms suggest — counterintuitively — that the messy pile of papers on your desk may in fact be the optimal structure for that information.

Second is that even in cases where there is no straightforward algorithm or easy answer, computer science offers us both a vocabulary for making sense of the problem, and strategies — using randomness, relaxing constraints — for making headway even when we can’t guarantee we’ll get the right answer every time.

Lastly and most broadly, computer science offers us a radically different picture of rationality than the one we’re used to seeing in, say, behavioral economics, where humans are portrayed as error-prone and irrational. Computer science shows us that being rational means taking the costs of computation — the costs of decision-making itself — into account. This leads to a much more human, and much more achievable picture of rationality: one that includes making mistakes and taking chances.
… (emphasis in original)

After the 2016 U.S. presidential election, I thought the verdict that humans are error-prone and irrational was unassailable.

Looking forward to the use of a human constructed lens (computer science) to view “human questions.” There are answers to “human questions” baked into computer science so watching the authors unpack those will be an interesting read. (Waiting for my copy to arrive.)

Just so you know, the Picador edition is a reprint. It was originally published by William Collins, 21/04/2016 in hardcover, see: Algorithms to Live By, a short review by Roberto Zicari, October 24, 2016.

Financing the Revolution – Hacking Slot Machines

Tuesday, February 7th, 2017

Russians Engineer A Brilliant Slot Machine Cheat-And Casinos Have No Fix by Brendan I. Koerner.

From the post:

IN EARLY JUNE 2014, accountants at the Lumiere Place Casino in St. Louis noticed that several of their slot machines had—just for a couple of days—gone haywire. The government-approved software that powers such machines gives the house a fixed mathematical edge, so that casinos can be certain of how much they’ll earn over the long haul—say, 7.129 cents for every dollar played. But on June 2 and 3, a number of Lumiere’s machines had spit out far more money than they’d consumed, despite not awarding any major jackpots, an aberration known in industry parlance as a negative hold. Since code isn’t prone to sudden fits of madness, the only plausible explanation was that someone was cheating.

Casino security pulled up the surveillance tapes and eventually spotted the culprit, a black-haired man in his thirties who wore a Polo zip-up and carried a square brown purse. Unlike most slots cheats, he didn’t appear to tinker with any of the machines he targeted, all of which were older models manufactured by Aristocrat Leisure of Australia. Instead he’d simply play, pushing the buttons on a game like Star Drifter or Pelican Pete while furtively holding his iPhone close to the screen.

He’d walk away after a few minutes, then return a bit later to give the game a second chance. That’s when he’d get lucky. The man would parlay a $20 to $60 investment into as much as $1,300 before cashing out and moving on to another machine, where he’d start the cycle anew. Over the course of two days, his winnings tallied just over $21,000. The only odd thing about his behavior during his streaks was the way he’d hover his finger above the Spin button for long stretches before finally jabbing it in haste; typical slots players don’t pause between spins like that.

On June 9, Lumiere Place shared its findings with the Missouri Gaming Commission, which in turn issued a statewide alert. Several casinos soon discovered that they had been cheated the same way, though often by different men than the one who’d bilked Lumiere Place. In each instance, the perpetrator held a cell phone close to an Aristocrat Mark VI model slot machine shortly before a run of good fortune.

… (emphasis in original)

A very cool story with recording a slot machine’s “pseudo-random” behavior, predicting its future “pseudo-random” behavior, acting on those predictions, all while in full view of security cameras and human observers.

Now, there’s a hacker’s challenge for you!

Legislation is pending to bring casino gambling to Georgia (USA). I assume they will have all new slot machines so newer algorithms are going to be in demand.

Assuming you don’t have access to the source code, how much video of a particular machine would you need to predict the “pseudo-random” behavior?

Thinking that people do have “favorite” machines and there would be nothing odd about the same person playing the same machine for several hours. Or even returning over a period of days. Think of that person as the information forager.

The goal of the information forager being to record as much activity of the slot machine as possible, that is not to play quickly.

One or more players can return to the casino and play the machines for which predictions are available. In that respect, the hackers in the article were smart in trying to keep winnings low, so as to avoid attracting attention.

Casinos’ crunch data like everyone else so don’t expect for unusual winnings to go unnoticed. Still, in college towns, like Atlanta, it should not be too difficult to recruit players with some equitable split of the winnings.

PS: You can buy a slot machine once you know the model being used in the target casino. Hell, install it in a frat house so it pays for itself and avoids a direct connection to you.

Ethics for Powerful Algorithms

Sunday, August 28th, 2016

Ethics for Powerful Algorithms by Abe Gong.

Abe’s four questions:

  1. Are the statistics solid?
  2. Who wins? Who loses?
  3. Are those changes to power structures healthy?
  4. How can we mitigate harms?

Remind me of my favorite scene from Labyrinth:


Sarah: That’s not fair!
Jareth: You say that so often, I wonder what your basis for comparison is?

Isn’t the question of “fairness” one for your client?

GPU + Russian Algorithm Bests Supercomputer

Thursday, June 30th, 2016

No need for supercomputers

From the post:

Senior researchers Vladimir Pomerantcev and Olga Rubtsova, working under the guidance of Professor Vladimir Kukulin (SINP MSU), were able to use on an ordinary desktop PC with GPU to solve complicated integral equations of quantum mechanics — previously solved only with the powerful, expensive supercomputers. According to Vladimir Kukulin, the personal computer does the job much faster: in 15 minutes it is doing the work requiring normally 2-3 days of the supercomputer time.

The main problem in solving the scattering equations of multiple quantum particles was the calculation of the integral kernel — a huge two-dimensional table, consisting of tens or hundreds of thousands of rows and columns, with each element of such a huge matrix being the result of extremely complex calculations. But this table appeared to look like a monitor screen with tens of billions of pixels, and with a good GPU it was quite possible to calculate all of these. Using the software developed in Nvidia and having written their own programs, the researchers split their calculations on the many thousands of streams and were able to solve the problem brilliantly.

“We reached the speed we couldn’t even dream of,” Vladimir Kukulin said. “The program computes 260 million of complex double integrals on a desktop computer within three seconds only. No comparison with supercomputers! My colleague from the University of Bochum in Germany (recently deceased, mournfully), whose lab did the same, carried out the calculations by one of the largest supercomputers in Germany with the famous blue gene architecture that is actually very expensive. And what his group is seeking for two or three days, we do in 15 minutes without spending a dime.”

The most amazing thing is that the desired quality of graphics processors and a huge amount of software to them exist for ten years already, but no one used them for such calculations, preferring supercomputers. Anyway, our physicists surprised their Western counterparts pretty much.

One of the principal beneficiaries of the US restricting the export of the latest generation of computer technology to the former USSR, was of course Russia.

Deprived of the latest hardware, Russian mathematicians and computer scientists were forced to be more efficient with equipment that was one or two generations off the latest mark for computing.

Parity between the USSR and the USA in nuclear weapons is testimony to their success and the failure of US export restriction policies.

For the technical details: V.N. Pomerantsev, V.I. Kukulin, O.A. Rubtsova, S.K. Sakhiev. Fast GPU-based calculations in few-body quantum scattering. Computer Physics Communications, 2016; 204: 121 DOI: 10.1016/j.cpc.2016.03.018.

Will a GPU help you startle your colleagues in the near future?

Apache Spark as a Compiler:… [This is wicked cool!]

Tuesday, May 24th, 2016

Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop by Sameer Agarwal, Davies Liu and Reynold Xin.

From the post:

When our team at Databricks planned our contributions to the upcoming Apache Spark 2.0 release, we set out with an ambitious goal by asking ourselves: Apache Spark is already pretty fast, but can we make it 10x faster?

This question led us to fundamentally rethink the way we built Spark’s physical execution layer. When you look into a modern data engine (e.g. Spark or other MPP databases), a majority of the CPU cycles are spent in useless work, such as making virtual function calls or reading or writing intermediate data to CPU cache or memory. Optimizing performance by reducing the amount of CPU cycles wasted in this useless work has been a long-time focus of modern compilers.

Apache Spark 2.0 will ship with the second generation Tungsten engine. Built upon ideas from modern compilers and MPP databases and applied to data processing queries, Tungsten emits (SPARK-12795) optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. As a result of this streamlined strategy, called “whole-stage code generation,” we significantly improve CPU efficiency and gain performance.

(emphasis in original)

How much better you ask?

cost per row (in nanoseconds, single thread)

primitive Spark 1.6 Spark 2.0
filter 15 ns 1.1 ns
sum w/o group 14 ns 0.9 ns
sum w/ group 79 ns 10.7 ns
hash join 115 ns 4.0 ns
sort (8-bit entropy) 620 ns 5.3 ns
sort (64-bit entropy) 620 ns 40 ns
sort-merge join 750 ns 700 ns
Parquet decoding (single int column) 120 ns 13 ns

Don’t just stare at the numbers:

Try the whole-stage code generation notebook in Databricks Community Edition

What’s the matter?

Haven’t you ever seen a 1 billion record join in 0.8 seconds? (Down from 61.7 seconds.)

If all that weren’t impressive enough, the post walks you through the dominate (currently) query evaluation strategy as a setup to Spark 2.0 and then into why “whole-stage code generation is so powerful.”

A must read!

FOIA – For Algorithms

Tuesday, May 24th, 2016

We need to know the algorithms the government uses to make important decisions about us by Nicholas Diakopoulos.

From the post:

In criminal justice systems, credit markets, employment arenas, higher education admissions processes and even social media networks, data-driven algorithms now drive decision-making in ways that touch our economic, social and civic lives. These software systems rank, classify, associate or filter information, using human-crafted or data-induced rules that allow for consistent treatment across large populations.

But while there may be efficiency gains from these techniques, they can also harbor biases against disadvantaged groups or reinforce structural discrimination. In terms of criminal justice, for example, is it fair to make judgments on an individual’s parole based on statistical tendencies measured across a wide group of people? Could discrimination arise from applying a statistical model developed for one state’s population to another, demographically different population?

The public needs to understand the bias and power of algorithms used in the public sphere, including by government agencies. An effort I am involved with, called algorithmic accountability, seeks to make the influences of those sorts of systems clearer and more widely understood.

Existing transparency techniques, when applied to algorithms, could enable people to monitor, audit and criticize how those systems are functioning – or not, as the case may be. Unfortunately, government agencies seem unprepared for inquiries about algorithms and their uses in decisions that significantly affect both individuals and the public at large.

Nicholas makes a great case for Freedom of Information Act (FOIA) legislation being improved to explicitly include algorithms used by government or on its behalf.

I include “on its behalf” because as Nicholas documents, some states have learned the trick of having algorithms held by vendors, thus making them “proprietary.”

If you can’t see the algorithms behind data results, there is no meaningful transparency.

Demand meaningful transparency!

More Bad News For EC Brain Project Wood Pigeons

Sunday, February 14th, 2016

I heard the story of how the magpie tried to instruct other birds, particularly the wood pigeon, on how to build nests in a different form but the lesson was much the same.

The EC Brain project reminds me of the wood pigeon hearing “…take two sticks…” and running off to build its nest.

With no understanding of the human brain, the EC set out to build one, on a ten year deadline.

Byron Spice’s report in: Project Aims to Reverse-engineer Brain Algorithms, Make Computers Learn Like Humans casts further doubt upon that project:

Carnegie Mellon University is embarking on a five-year, $12 million research effort to reverse-engineer the brain, seeking to unlock the secrets of neural circuitry and the brain’s learning methods. Researchers will use these insights to make computers think more like humans.

The research project, led by Tai Sing Lee, professor in the Computer Science Department and the Center for the Neural Basis of Cognition (CNBC), is funded by the Intelligence Advanced Research Projects Activity (IARPA) through its Machine Intelligence from Cortical Networks (MICrONS) research program. MICrONS is advancing President Barack Obama’s BRAIN Initiative to revolutionize the understanding of the human brain.

“MICrONS is similar in design and scope to the Human Genome Project, which first sequenced and mapped all human genes,” Lee said. “Its impact will likely be long-lasting and promises to be a game changer in neuroscience and artificial intelligence.”

Artificial neural nets process information in one direction, from input nodes to output nodes. But the brain likely works in quite a different way. Neurons in the brain are highly interconnected, suggesting possible feedback loops at each processing step. What these connections are doing computationally is a mystery; solving that mystery could enable the design of more capable neural nets.

My goodness! Unknown loops in algorithms?

The Carnegie Mellon project is exploring potential algorithms, not trying to engineer the unknown.

If the EC had titled its project the Graduate Assistant and Hospitality Industry Support Project, one could object to the use of funds for travel junkets but it would otherwise be intellectually honest.

Document Summarization via Markov Chains

Saturday, October 17th, 2015

Document Summarization via Markov Chains by Atabey Kaygun.

From the post:

Description of the problem

Today’s question is this: we have a long text and we want a machine generated summary of the text. Below, I will describe a statistical (hence language agnostic) method to do just that.

Sentences, overlaps and Markov chains.

In my previous post I described a method to measure the overlap between two sentences in terms of common words. Today, we will use the same measure, or a variation, to develop a discrete Markov chain whose nodes are labeled by individual sentences appearing in our text. This is essentially page rank applied to sentences.

Atabey says the algorithm (code supplied) works well on:

news articles, opinion pieces and blog posts.

Not so hot on Supreme Court decisions.

In commenting on a story from the New York Times, Obama Won’t Seek Access to Encrypted User Data, I suspect, Atabey says that we have no reference for “what frustrated him” in the text summary.

If you consider the relevant paragraph from the New York Times story:

Mr. Comey had expressed alarm a year ago after Apple introduced an operating system that encrypted virtually everything contained in an iPhone. What frustrated him was that Apple had designed the system to ensure that the company never held on to the keys, putting them entirely in the hands of users through the codes or fingerprints they use to get into their phones. As a result, if Apple is handed a court order for data — until recently, it received hundreds every year — it could not open the coded information.

The reference is clear. Several other people are mentioned in the New York Times article but none rank high enough to appear in the summary.

Not a sure bet but with testing, try attribution to people who rank high enough to appear in the summary.

Non-News: Algorithms Are Biased

Wednesday, August 19th, 2015

Programming and prejudice

From the post:

Software may appear to operate without bias because it strictly uses computer code to reach conclusions. That’s why many companies use algorithms to help weed out job applicants when hiring for a new position.

But a team of computer scientists from the University of Utah, University of Arizona and Haverford College in Pennsylvania have discovered a way to find out if an algorithm used for hiring decisions, loan approvals and comparably weighty tasks could be biased like a human being.

The researchers, led by Suresh Venkatasubramanian, an associate professor in the University of Utah’s School of Computing, have discovered a technique to determine if such software programs discriminate unintentionally and violate the legal standards for fair access to employment, housing and other opportunities. The team also has determined a method to fix these potentially troubled algorithms.

Venkatasubramanian presented his findings Aug. 12 at the 21st Association for Computing Machinery’s Conference on Knowledge Discovery and Data Mining in Sydney, Australia.

“There’s a growing industry around doing resume filtering and resume scanning to look for job applicants, so there is definitely interest in this,” says Venkatasubramanian. “If there are structural aspects of the testing process that would discriminate against one community just because of the nature of that community, that is unfair.”

It’s a puff piece and therefore misses that all algorithms are biased, but some algorithms are biased in ways not permitted under current law.

The paper, which this piece avoids citing for some reason, Certifying and removing disparate impact by Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, Suresh Venkatasubramanian

The abstract for the paper does a much better job of setting the context for this research:

What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender, religious practice) and an explicit description of the process.

When the process is implemented using computers, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the algorithm, we propose making inferences based on the data the algorithm uses.

We make four contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has received relatively little attention. Second, we propose a test for disparate impact based on analyzing the information leakage of the protected class from the other data attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.

If you are a bank, you want a loan algorithm to be biased against people with a poor history of paying their debts. The distinction being that is a legitimate basis for discrimination among loan applicants.

The lesson here is that all algorithms are biased, the question is whether the bias is in your favor or not.

Suggestion: Only bet when using your own dice (algorithm).


Sunday, July 19th, 2015

Aho-Corasick – Java implementation of the Aho-Corasick algorithm for efficient string matching.

From the webpage:

Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords this approach is great. But what about it if you are not looking for just a couple of keywords, but a 100,000 of them? Like, for example, checking against a dictionary?

This is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords to build up a construct called a Trie. There are three crucial components to Aho-Corasick:

  • goto
  • fail
  • output

Every character encountered is presented to a state object within the goto structure. If there is a matching state, that will be elevated to the new current state.

However, if there is no matching state, the algorithm will signal a fail and fall back to states with less depth (ie, a match less long) and proceed from there, until it found a matching state, or it has reached the root state.

Whenever a state is reached that matches an entire keyword, it is emitted to an output set which can be read after the entire scan has completed.

The beauty of the algorithm is that it is O(n). No matter how many keywords you have, or how big the search text is, the performance will decline in a linear way.

Some examples you could use the Aho-Corasick algorithm for:

  • looking for certain words in texts in order to URL link or emphasize them
  • adding semantics to plain text
  • checking against a dictionary to see if syntactic errors were made

This library is the Java implementation of the afore-mentioned Aho-Corasick algorithm for efficient string matching. The algorithm is explained in great detail in the white paper written by Aho and Corasick:

The link to the Aho-Corasick paper timed out on me. Try: Efficient String Matching: An Aid to Bibliographic Search (CiteSeer).

The line that caught my eye was “adding semantics to plain text.

Apologies for the “lite” posting over the past several days. I have just completed a topic maps paper for the Balisage conference in collaboration with Sam Hunting. My suggestion is that you register before all the seats are gone.

Top 10 data mining algorithms in plain English

Thursday, May 21st, 2015

Top 10 data mining algorithms in plain English by Raymond Li.

From the post:

Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.

What are we waiting for? Let’s get started!

Raymond covers:

  1. C4.5
  2. k-means
  3. Support vector machines
  4. Apriori
  5. EM
  6. PageRank
  7. AdaBoost
  8. kNN
  9. Naive Bayes
  10. CART
  11. Interesting Resources
  12. Now it’s your turn…

Would be nice if we all had a similar ability to explain algorithms!


The Future of Algorithm Development

Friday, March 13th, 2015

The Future of Algorithm Development

From the post:

We’re building a community around state-of-the-art algorithm development. Users can create, share, and build on other algorithms and then instantly make them available as a web service.

A year ago, we shared with you our belief that Algorithm Development is Broken, where we laid out problems in the world of algorithm development and our vision of how to make things better.

Since then, we’ve been working hard to create our vision of the world’s first open marketplace for algorithms – and you, the developers of the world, have enthusiastically joined us to help active the algorithm economy.

Today, we are proud to open our doors to the world, and deliver on our promise of a global, always-available, open marketplace for algorithms.

We believe that the future of algorithm development is:

  • Live: easy access to any algorithm, callable at any time
  • No dependencies: no downloads, no installs, no stress
  • Interoperable: call any algorithm regardless of programming language
  • Composable: algorithms are building blocks, build something bigger!
  • Invisible but powerful infrastructure: don’t worry about containers, networks, or scaling; we make your algorithm scale to hundreds of machines
  • No operations: we scale, meter, optimize and bill for you

For the first time, algorithm developers can publish their algorithms as a live API and applications can be made smarter by taking advantage of these intelligent algorithms.

With our public launch we are enabling payments and go from being a repository of algorithms, to a true marketplace:

  • For algorithm developers, Algorithmia provides a unique opportunity to increase the impact of their work and collect the rewards associated with it.  
  • For application developers, Algorithmia provides the largest repository of live algorithms ever assembled, supported by a vibrant community of developers.

So far, users have contributed over 800 algorithms, all under one API that grows every day. The most exciting thing from our point of view is watching the Algorithmia Marketplace grow and expand in its functionality.

In just the last few months, Algorithmia has gained the ability to:

Thanks to the thousands of developers who joined our cause, contributed algorithms, and gave us countless hours of feedback so far. We are very excited to be opening our doors, and we look forward to helping you create something amazing!

Come check it out

doppenhe & platypii

Algorithmia has emerged from private beta!.

The future of Snow Crash, at least in terms of contracting, gets closer every day!


Watch Hilary Mason discredit the cult of the algorithm

Friday, March 6th, 2015

Watch Hilary Mason discredit the cult of the algorithm by Stacey Higginbotham.

From the post:

Want to see Hilary Mason, the CEO and founder at Fast Forward Labs, get fired up? Tell her about your new connected product and its machine learning algorithm that will help it anticipate your needs over time and behave accordingly. “That’s just a bunch of marketing bullshit,” said Mason when I asked her about these claims.

Mason actually builds algorithms and is well-versed in what they can and cannot do. She’s quick to dismantle the cult that has been built up around algorithms and machine learning as companies try to make sense of all the data they have coming in, and as they try to market products built on learning algorithms in the wake of Nest’s $3.2 billion sale to Google (I call those efforts faithware). She’ll do more of this during our opening session with Data collective co-managing partner Matt Ocko at Structure Data on March 18 in New York. You won’t want to miss it.

I won’t be there to see Hilary call “bullshit” on algorithms but you can be:

Structure Data, March 18-19, 2015, New York, NY.


The Ultimate Google Algorithm Cheat Sheet

Wednesday, March 4th, 2015

The Ultimate Google Algorithm Cheat Sheet by Neil Patel.

From the post:

Have you ever wondered why Google keeps pumping out algorithm updates?

Do you understand what Google algorithm changes are all about?

No SEO or content marketer can accurately predict what any future update will look like. Even some Google employees don’t understand everything that’s happening with the web’s most dominant search engine.

All this confusion presents a problem for savvy site owners: Since there are about 200 Google ranking factors, which ones should you focus your attention on after each major Google algorithm update?

Google has issued four major algorithm updates, named (in chronological order) Panda, Penguin, Hummingbird and Pigeon. In between these major updates, Google engineers also made some algorithm tweaks that weren’t that heavily publicized, but still may have had an impact on your website’s rankings in search results.

A very detailed post describing the history of changes to the Google algorithm, the impact of those changes and suggestions for how you can improve your search rankings.

The post alone is worth your attention but so are the resources that Neil relies upon, such as:

200 Google ranking factors, which really is a list of 200 Google ranking factors.

I first saw this in a tweet by bloggerink.


Saturday, February 28th, 2015


Algorithmia was born in 2013 with the goal of advancing the art of algorithm development, discovery and use. As developers ourselves we believe that given the right tools the possibilities for innovation and discovery are limitless.

Today we build what we believe to be the next era of programming: a collaborative, always live and community driven approach to making the machines that we interact with better.

The community drives the Algorithmia API. One API that exposes the collective knowledge of algorithm developers across the globe.

Currently in private beta but sounds very promising!

I first saw Algorithmia mentioned in Algorithmia API Exposes Collective Knowledge of Developers by Martin W. Brennan.

Comparing supervised learning algorithms

Friday, February 27th, 2015

Comparing supervised learning algorithms by Kevin Markham.

From the post:

In the data science course that I instruct, we cover most of the data science pipeline but focus especially on machine learning. Besides teaching model evaluation procedures and metrics, we obviously teach the algorithms themselves, primarily for supervised learning.

Near the end of this 11-week course, we spend a few hours reviewing the material that has been covered throughout the course, with the hope that students will start to construct mental connections between all of the different things they have learned. One of the skills that I want students to be able to take away from this course is the ability to intelligently choose between supervised learning algorithms when working a machine learning problem. Although there is some value in the “brute force” approach (try everything and see what works best), there is a lot more value in being able to understand the trade-offs you’re making when choosing one algorithm over another.

I decided to create a game for the students, in which I gave them a blank table listing the supervised learning algorithms we covered and asked them to compare the algorithms across a dozen different dimensions. I couldn’t find a table like this on the Internet, so I decided to construct one myself! Here’s what I came up with:

Eight (8) algorithms compared across a dozen (12) dimensions.

What algorithms would you add? Comments to add or take away?

Looks like the start of a very useful community resource.

A Gentle Introduction to Algorithm Complexity Analysis

Monday, February 23rd, 2015

A Gentle Introduction to Algorithm Complexity Analysis by Dionysis “dionyziz” Zindros.

From the post:

A lot of programmers that make some of the coolest and most useful software today, such as many of the stuff we see on the Internet or use daily, don’t have a theoretical computer science background. They’re still pretty awesome and creative programmers and we thank them for what they build.

However, theoretical computer science has its uses and applications and can turn out to be quite practical. In this article, targeted at programmers who know their art but who don’t have any theoretical computer science background, I will present one of the most pragmatic tools of computer science: Big O notation and algorithm complexity analysis. As someone who has worked both in a computer science academic setting and in building production-level software in the industry, this is the tool I have found to be one of the truly useful ones in practice, so I hope after reading this article you can apply it in your own code to make it better. After reading this post, you should be able to understand all the common terms computer scientists use such as “big O”, “asymptotic behavior” and “worst-case analysis”.

Do you nod when encountering “big O,” “asymptotic behavior” and “worst-case analysis” in CS articles?

Or do you understand what is meant when you encounter “big O,” “asymptotic behavior” and “worst-case analysis” in CS articles?

You are the only one who can say for sure. If it has been a while or you aren’t sure, this should act as a great refresher.

As an incentive, you can intimidate co-workers with descriptions of your code. 😉

I first saw this in a tweet by Markus Sagebiel.

The Parable of Google Flu… [big data hubris]

Sunday, February 8th, 2015

The Parable of Google Flu: Traps in Big Data Analysis by David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani.

From the article:

In February 2013, Google Flu Trends (GFT) made headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1,2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we drawfrom this error?

The problems we identify are not limited to GFT. Research on whether search or social media can predict x has become commonplace (5-7) and is often put in sharp contrast with traditional methods and hypotheses. Although these studies have shown the value of these data, we are far from a place where they can supplant more traditional methods or theories (8). We explore two issues that contributed to GFT’s mistakes— big data hubris and algorithm dynamics— and offer lessons for moving forward in the big data age.

Highly recommended reading for big data advocates.

Not that I doubt the usefulness of big data, but I do doubt its usefulness in the absence of an analyst who understands the data.

Did you catch the aside about documentation?

There are multiple challenges to replicating GFT’s original algorithm. GFT has never documented the 45 search terms used, and the examples that have been released appear misleading (14) (SM). Google does provide a service, Google Correlate, which allows the user to identify search data that correlate with a given time series; however, it is limited to national level data, whereas GFT was developed using correlations at the regional level (13). The service also fails to return any of the sample search terms reported in GFT-related publications (13,14).

Document your analysis and understanding of data. Or you can appear in a sequel to Google Flu. Not really where I want my name to appear. You?

I first saw this in a tweet by Edward Tufte.

Business Analytics Error: Learn from Uber’s Mistake During the Sydney Terror Attack

Tuesday, January 27th, 2015

Business Analytics Error: Learn from Uber’s Mistake During the Sydney Terror Attack by RK Paleru.

From the post:

Recently, as a sad day of terror ended in Sydney, a bad case of Uber’s analytical approach to pricing came to light – an “algorithm based price surge.” Uber’s algorithm driven price surge started overcharging people fleeing the Central Business District (CBD) of Sydney following the terror attack.

I’m not sure the algorithm got it wrong. If you asked me to drive into a potential war zone to ferry strangers out, I suspect a higher fee than normal is to be expected.

The real dilemma for Uber is that not all ground transportation has surge price algorithms. When buses, subways, customary taxis, etc. all have surge price algorithms, the price hikes won’t appear to be abnormal.

One of the consequences of an algorithm/data-driven world is that factors known or unknown to you may be driving the price or service. To say it another way, your “expectations” of system behavior may be at odds with how the system will behave.

The inventory algorithm at my local drugstore thought a recent prescription was too unusual to warrant stocking. My drugstore had to order it from a regional warehouse. Just-in-time inventory I think they call it. That was five (5) days ago. That isn’t “just-in-time” for the customer (me) but that isn’t the goal of most cost/pricing algorithms. Particularly when the customer has little choice about the service.

I first saw this in a tweet by Kirk Borne.


Sunday, January 4th, 2015


Speaking of combat machine learning environments:

AdversariaLib is an open-source python library for the security evaluation of machine learning (ML)-based classifiers under adversarial attacks. It comes with a set of powerful features:

  • Easy-to-use. Running sophisticated experiments is as easy as launching a single script. Experimental settings can be defined through a single setup file.
  • Wide range of supported ML algorithms. All supervised learning algorithms supported by scikit-learn are available, as well as Neural Networks (NNs), by means of our scikit-learn wrapper for FANN. In the current implementation, the library allows for the security evaluation of SVMs with linear, rbf, and polynomial kernels, and NNs with one hidden layer, against evasion attacks.
  • Fast Learning and Evaluation. Thanks to scikit-learn and FANN, all supported ML algorithms are optimized and written in C/C++ language.
  • Built-in attack algorithms. Evasion attacks based on gradient-descent optimization.
  • Extensible. Other attack algorithms can be easily added to the library.
  • Multi-processing. Do you want to further save time? The built-in attack algorithms can run concurrently on multiple processors.

Last, but not least, AdversariaLib is free software, released under the GNU General Public License version 3!

The “full documentation” link on the homepage returns a “no page.” I puzzled over it until I realized that the failing link reads:

and the successful link reads:

I have pinged the site owners.

The sourceforge link for the code: still works.

The full documentation page notes:

However, learning algorithms typically assume data stationarity: that is, both the data used to train the classifier and the operational data it classifies are sampled from the same (though possibly unknown) distribution. Meanwhile, in adversarial settings such as the above mentioned ones, intelligent and adaptive adversaries may purposely manipulate data (violating stationarity) to exploit existing vulnerabilities of learning algorithms, and to impair the entire system.

Not quite the case of reactive data that changes representations depending upon the source of a query but certainly a move in that direction.

Do you have a data stability assumption?

Show and Tell (C-suite version)

Thursday, January 1st, 2015

How Google “Translates” Pictures into Words Using Vector Space Mathematics

From the post:

Translating one language into another has always been a difficult task. But in recent years, Google has transformed this process by developing machine translation algorithms that change the nature of cross cultural communications through Google Translate.

Now that company is using the same machine learning technique to translate pictures into words. The result is a system that automatically generates picture captions that accurately describe the content of images. That’s something that will be useful for search engines, for automated publishing and for helping the visually impaired navigate the web and, indeed, the wider world.

One of the best c-suite level explanations I have seen of Show and Tell: A Neural Image Caption Generator.

May be useful to you in obtaining support/funding for similar efforts in your domain.

Take particular note of the decision to not worry overmuch about the meaning of words. I would never make that simplifying assumption. Just runs counter to the grain for the meaning of the words to not matter. However, I am very glad that Oriol Vinyals and colleagues made that assumption!

That assumption enables the processing of images at a large scale.

I started to write that I would not use such an assumption for more precise translation tasks, say the translation of cuneiform tablets. But as a rough finding aid for untranslated cuneiform or hieroglyphic texts, this could be the very thing. Doesn’t have to be 100% precise or accurate, just enough that the vast archives of ancient materials becomes easier to use.

Is there an analogy for topic maps here? That topic maps need not be final production quality materials when released but can be refined over time by authors, editors and others?

Like Wikipedia but not quite so eclectic and more complete. Imagine a Solr reference manual that inlines or at least links to the most recent presentations and discussions on a particular topic. And incorporates information from such sources into the text.

Is Google offering us “good enough” results with data, expectations that others will refine the data further? Perhaps a value-add economic model where the producer of the “good enough” content has an interest in the further refinement of that data by others?

Inadvertent Algorithmic Cruelty

Thursday, December 25th, 2014

Inadvertent Algorithmic Cruelty by Eric A. Meyers.

From the post:

I didn’t go looking for grief this afternoon, but it found me anyway, and I have designers and programmers to thank for it. In this case, the designers and programmers are somewhere at Facebook.

I know they’re probably pretty proud of the work that went into the “Year in Review” app they designed and developed. Knowing what kind of year I’d had, though, I avoided making one of my own. I kept seeing them pop up in my feed, created by others, almost all of them with the default caption, “It’s been a great year! Thanks for being a part of it.” Which was, by itself, jarring enough, the idea that any year I was part of could be described as great.

Suffice it to say that Eric suffered a tragic loss this year and the algorithms behind “See Your Year” didn’t take that into account.

While I think Eric is right in saying users should have the ability to opt out of “See Your Year,” I am less confident about his broader suggestion:

If I could fix one thing about our industry, just one thing, it would be that: to increase awareness of and consideration for the failure modes, the edge cases, the worst-case scenarios. And so I will try.

That might be helpful but uncovering edge cases or worst-case scenarios takes time and resources, to say nothing of accommodating them. Once an edge case comes up, then it can be accommodated as I am sure Facebook will do next year with “See Your Year.” But it has to happen first and become noticed.

Keep Eric’s point that algorithms being “thoughtless” in mind when using machine learning techniques. Algorithms aren’t confirming your classification, they are confirming the conditions they have been taught to recognize are present. Not the same thing. Recalling that deep learning algorithms can be fooled into recognizing noise as objects..

On the Computational Complexity of MapReduce

Monday, October 27th, 2014

On the Computational Complexity of MapReduce by Jeremy Kun.

From the post:

I recently wrapped up a fun paper with my coauthors Ben Fish, Adam Lelkes, Lev Reyzin, and Gyorgy Turan in which we analyzed the computational complexity of a model of the popular MapReduce framework. Check out the preprint on the arXiv.

As usual I’ll give a less formal discussion of the research here, and because the paper is a bit more technically involved than my previous work I’ll be omitting some of the more pedantic details. Our project started after Ben Moseley gave an excellent talk at UI Chicago. He presented a theoretical model of MapReduce introduced by Howard Karloff et al. in 2010, and discussed his own results on solving graph problems in this model, such as graph connectivity. You can read Karloff’s original paper here, but we’ll outline his model below.

Basically, the vast majority of the work on MapReduce has been algorithmic. What I mean by that is researchers have been finding more and cleverer algorithms to solve problems in MapReduce. They have covered a huge amount of work, implementing machine learning algorithms, algorithms for graph problems, and many others. In Moseley’s talk, he posed a question that caught our eye:

Is there a constant-round MapReduce algorithm which determines whether a graph is connected?

After we describe the model below it’ll be clear what we mean by “solve” and what we mean by “constant-round,” but the conjecture is that this is impossible, particularly for the case of sparse graphs. We know we can solve it in a logarithmic number of rounds, but anything better is open.

In any case, we started thinking about this problem and didn’t make much progress. To the best of my knowledge it’s still wide open. But along the way we got into a whole nest of more general questions about the power of MapReduce. Specifically, Karloff proved a theorem relating MapReduce to a very particular class of circuits. What I mean is he proved a theorem that says “anything that can be solved in MapReduce with so many rounds and so much space can be solved by circuits that are yae big and yae complicated, and vice versa.

But this question is so specific! We wanted to know: is MapReduce as powerful as polynomial time, our classical notion of efficiency (does it equal P)? Can it capture all computations requiring logarithmic space (does it contain L)? MapReduce seems to be somewhere in between, but it’s exact relationship to these classes is unknown. And as we’ll see in a moment the theoretical model uses a novel communication model, and processors that never get to see the entire input. So this led us to a host of natural complexity questions:

  1. What computations are possible in a model of parallel computation where no processor has enough space to store even one thousandth of the input?
  2. What computations are possible in a model of parallel computation where processor’s can’t request or send specific information from/to other processors?
  3. How the hell do you prove that something can’t be done under constraints of this kind?
  4. How do you measure the increase of power provided by giving MapReduce additional rounds or additional time?

These questions are in the domain of complexity theory, and so it makes sense to try to apply the standard tools of complexity theory to answer them. Our paper does this, laying some brick for future efforts to study MapReduce from a complexity perspective.

Given the prevalence of MapReduce, progress on understanding what is or is not possible is an important topic.

The first two complexity questions strike me as the ones most relevant to topic map processing with map reduce. Depending upon the nature of your merging algorithm.


Sudoku, Linear Optimization, and the Ten Cent Diet

Tuesday, September 30th, 2014

Sudoku, Linear Optimization, and the Ten Cent Diet by Joh Orwant.

From the post:

In 1945, future Nobel laureate George Stigler wrote an essay in the Journal of Farm Economics titled The Cost of Subsistence about a seemingly simple problem: how could a soldier be fed for as little money as possible?

The “Stigler Diet” became a classic problem in the then-new field of linear optimization, which is used today in many areas of science and engineering. Any time you have a set of linear constraints such as “at least 50 square meters of solar panels” or “the amount of paint should equal the amount of primer” along with a linear goal (e.g., “minimize cost” or “maximize customers served”), that’s a linear optimization problem.

At Google, our engineers work on plenty of optimization problems. One example is our YouTube video stabilization system, which uses linear optimization to eliminate the shakiness of handheld cameras. A more lighthearted example is in the Google Docs Sudoku add-on, which instantaneously generates and solves Sudoku puzzles inside a Google Sheet, using the SCIP mixed integer programming solver to compute the solution.

(image omitted)

Today we’re proud to announce two new ways for everyone to solve linear optimization problems. First, you can now solve linear optimization problems in Google Sheets with the Linear Optimization add-on written by Google Software Engineer Mihai Amarandei-Stavila. The add-on uses Google Apps Script to send optimization problems to Google servers. The solutions are displayed inside the spreadsheet. For developers who want to create their own applications on top of Google Apps, we also provide an API to let you call our linear solver directly.

(image omitted)

Second, we’re open-sourcing the linear solver underlying the add-on: Glop (the Google Linear Optimization Package), created by Bruno de Backer with other members of the Google Optimization team. It’s available as part of the or-tools suite and we provide a few examples to get you started. On that page, you’ll find the Glop solution to the Stigler diet problem. (A Google Sheets file that uses Glop and the Linear Optimization add-on to solve the Stigler diet problem is available here. You’ll need to install the add-on first.)

For a fuller introduction to linear programming: Practical Optimization: A Gentle Introduction by John W. Chinneck. Online, draft chapters.

I would say more about the utility of linear optimization in the subject identity space but it might violate an NDA I signed many years ago. Sorry.

Twitter open sourced a recommendation algorithm for massive datasets

Wednesday, September 24th, 2014

Twitter open sourced a recommendation algorithm for massive datasets by Derrick Harris.

From the post:

Late last month, Twitter open sourced an algorithm that’s designed to ease the computational burden on systems trying to recommend content — contacts, articles, products, whatever — across seemingly endless sets of possibilities. Called DIMSUM, short for Dimension Independent Matrix Square using MapReduce (rolls off the tongue, no?), the algorithm trims the list of potential combinations to a reasonable number, so other recommendation algorithms can run in a reasonable amount of time.

Reza Zadeh, the former Twitter data scientist and current Stanford consulting professor who helped create the algorithm, describes it in terms of the famous handshake problem. Two people in a room? One handshake; no problem. Ten people in a room? Forty-five handshakes; still doable. However, he explained, “The number of handshakes goes up quadratically … That makes the problem very difficult when x is a million.”

Twitter claims 271 million active users.

DIMSUM works primarily in two different areas: (1) matching promoted ads with the right users, and (2) suggesting similar people to follow after users follow someone. Running through all the possible combinations would take days even on a large cluster of machines, Zadeh said, but sampling the user base using DIMSUM takes significantly less time and significantly fewer machines.

The “similarity” of two or more people or bits of content is a variation on the merging rules of the TMDM.

In recommendation language, two or more topics are “similar” if:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.

TMDM 5.3.5 Properties

The TMDM says “equal” and not “similar” but the point being that you can arbitrarily decide on how “similar” two or more topics must be in order to trigger merging.

That realization opens up the entire realm of “similarity” and “recommendation” algorithms and techniques for application to topic maps.

Which brings us back to the algorithm just open sourced by Twitter.

With DIMSUM, you don’t have to do a brute force topic by topic comparison for merging purposes. Some topics will not meet a merging “threshold” and not be considered by merging routines.

Of course, with the TMDM, merging being either true or false, you may be stuck with brute force. Suggestions?

But if you have other similarity measures, you may be able to profit from DIMSUM.

BTW, I would not follow #dimsum on Twitter because it is apparently a type of dumpling. 😉

Update: All-pairs similarity via DIMSUM DIMSUM has been implemented in Spark!

Algorithms and Data – Example

Monday, September 22nd, 2014

People's Climate

AJ+ was all over the #OurClimate march in New York City.

Let’s be generous and say the march attracted 400,000 people.

At approximately 10:16 AM Eastern time this morning, the world population clock reported a population of 7,262,447,500.

0.000550 % of the world’s population expressed an opinion on climate change in New York yesterday.

I mention that calculation, disclosing both data and the algorithm, to point out the distortion between the number of people driving policy versus the number of people impacted.

Other minority opinions promoted by AJ+ include that of the United States (population: 318,776,000) on what role Iran (population: 77,176,930) should play in the Middle East (population: 395,133,109) and the world (population: 7,262,447,500), on issues such as the Islamic State. BBC News: Islamic State crisis: Kerry says Iran can help defeat IS.

Isn’t that “the tail wagging the dog?”

Is there any wonder why international decision making departs from the common interests of the world’s population?

Hopefully AJ+ will stop beating the drum quite so loudly for minority opinions and seek out more representative ones, even if not conveniently located in New York City.