Archive for the ‘Algorithms’ Category
Thursday, May 2nd, 2013
HyperLogLog — Cornerstone of a Big Data Infrastructure
From the introduction:
In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et. al. in the phenomenal paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. This sketch extends upon the earlier Loglog Counting of Large Cardinalities (Durand et. al.) which in turn is based on the seminal AMS work FM-85, Flajolet and Martin’s original work on probabilistic counting. (Many thanks to Jérémie Lumbroso for the correction of the history here. I am very much looking forward to his upcoming introduction to probabilistic counting in Flajolet’s complete works.) UPDATE – Rob has recently published a blog about PCSA, a direct precursor to LogLog counting which is filled with interesting thoughts. There have been a few posts on HLL recently so I thought I would dive into the intuition behind the sketch and into some of the details.
After seeing the HyperLogLog references in Approximate Methods for Scalable Data Mining I started looking for a fuller explanation/illustration of HyperLogLog.
Stumbled on this posting.
Includes a great HyperLogLog (HLL) simulation written in JavaScript.
Enjoy!
Posted in Algorithms, BigData, Data Mining, Scalability | No Comments »
Wednesday, May 1st, 2013
Algorithms, 4th Edition by Robert Sedgewick and Kevin Wayne.
Here you will find excerpts from the text, Java code and solutions to some exercises on:
- Fundamentals
- Sorting
- Searching
- Graphs
- Strings
- Context
My latest from Sedgewick is almost a decade old so I may need to upgrade to the latest version.
At almost 1,000 pages (976 to be exact), it may push me towards a Kindle reader.
Posted in Algorithms, Programming | No Comments »
Sunday, April 28th, 2013
Algorithms Every Data Scientist Should Know: Reservoir Sampling by Josh Wills.
Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:
Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.
The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.

Remember: Stay Calm.
The second thing to do is to think deeply about the question. Assume that you are talking to a good person who has read Daniel Tunkelang’s excellent advice about interviewing data scientists. This means that this interview question probably originated in a real problem that this data scientist has encountered in her work. Therefore, a simple answer like, “I would put all of the items in a list and then select one at random once the stream ended,” would be a bad thing for you to say, because it would mean that you didn’t think deeply about what would happen if there were more items in the stream than would fit in memory (or even on disk!) on a single computer.
The third thing to do is to create a simple example problem that allows you to work through what should happen for several concrete instances of the problem. The vast majority of humans do a much better job of solving problems when they work with concrete examples instead of abstractions, so making the problem concrete can go a long way toward helping you find a solution.
In addition to great interview advice, Josh also provides a useful overview of reservoir sampling.
Whether reservoir sampling will be useful to you depends on your test for subject identity.
I tend to think of subject identity as being very precise but that isn’t necessarily the case.
Or should I say that precision of subject identity is a matter of requirements?
For some purposes, it may be sufficient to know the gender of attendees, as a subject, within some margin of statistical error. With enough effort we could know that more precisely but the cost may be prohibitive.
Thinking of any test for subject identity being located on a continuum of subject identification. Where the notion of “precision” itself is up for definition.
Russia’s warnings on one of the Boston Marathon bombers, a warning that used his name as he did, not as captured by the US intelligence community, was a case of mistaken level of precision.
Mostly likely the result of an analyst schooled in an English-only curriculum.
Posted in Algorithms, Data Science, Reservoir Sampling | No Comments »
Thursday, April 25th, 2013
A different take on data skepticism by Beau Cronin.
From the post:
Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.
…
…Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.
Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?
Beau make several good points on questioning data methods.
I would extend those “…more fundamental questions…” to data as well.
Data, at least as far as I know, doesn’t drop from the sky. It is collected, generated, sometimes both, by design.
That design had some reason for collecting that data, in some particular way and in a given format.
Like methods, data stands mute with regard to those designs, what choices were made, by who and for what reason?
Giving voice what can be known about methods and data falls to human users.
Posted in Algorithms, Data, Data Models, Data Quality | No Comments »
Sunday, April 7th, 2013
Deploying Graph Algorithms on GPUs: an Adaptive Solution by Da Li and Michela Becchi. (27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013)
From the post:
Thanks to their massive computational power and their SIMT computational model, Graphics Processing Units (GPUs) have been successfully used to accelerate a wide variety of regular applications (linear algebra, stencil computations, image processing and bioinformatics algorithms, among others). However, many established and emerging problems are based on irregular data structures, such as graphs. Examples can be drawn from different application domains: networking, social networking, machine learning, electrical circuit modeling, discrete event simulation, compilers, and computational sciences. It has been shown that irregular applications based on large graphs do exhibit runtime parallelism; moreover, the amount of available parallelism tends to increase with the size of the datasets. In this work, we explore an implementation space for deploying a variety of graph algorithms on GPUs. We show that the dynamic nature of the parallelism that can be extracted from graph algorithms makes it impossible to find an optimal solution. We propose a runtime system able to dynamically transition between different implementations with minimal overhead, and investigate heuristic decisions applicable across algorithms and datasets. Our evaluation is performed on two graph algorithms: breadth-first search and single-source shortest paths. We believe that our proposed mechanisms can be extended and applied to other graph algorithms that exhibit similar computational patterns.
A development that may surprise some graph software vendors, there are “no optimal solution[s] across graph problems and datasets” for graph algorithms on GPU.
This paper points towards an adaptive technique that may prove to be “resilient to the irregularity and heterogeneity of real world graphs.”
I first saw this in a tweet by Stefano Bertolo.
Posted in Algorithms, GPU, Graphs, Networks | No Comments »
Wednesday, March 20th, 2013
Large-Scale Learning with Less RAM via Randomization by Daniel Golovin, D. Sculley, H. Brendan McMahan, Michael Young.
Abstract:
We reduce the memory footprint of popular large-scale online learning methods by projecting our weight vector onto a coarse discrete set using randomized rounding. Compared to standard 32-bit float encodings, this reduces RAM usage by more than 50% during training and by up to 95% when making predictions from a fixed model, with almost no loss in accuracy. We also show that randomized counting can be used to implement per-coordinate learning rates, improving model quality with little additional RAM. We prove these memory-saving methods achieve regret guarantees similar to their exact variants. Empirical evaluation confirms excellent performance, dominating standard approaches across memory versus accuracy tradeoffs.
I mention this in part because topic map authoring can be assisted by the results of machine learning.
It is also a data point for the proposition that unlike their human masters, machines are too precise.
Perhaps it is the case that the vagueness of human reasoning has significant advantages over the disk grinding precision of our machines.
The question then becomes: How do we capture vagueness in a system where every point is either 0 or 1?
Not probabilistic because that can be expressed but vagueness, which I experience as something different.
Suggestions?
PS: Perhaps that is what makes artificial intelligence artificial. It is too precise.
I first saw this in a tweet by Stefano Bertolo.
Posted in Algorithms, Artificial Intelligence, Machine Learning | No Comments »
Tuesday, March 19th, 2013
AI Algorithms, Data Structures, and Idioms in Prolog, Lisp and Java by George F. Luger and William A. Stubblefield.
From the introduction:
Writing a book about designing and implementing representations and search algorithms in Prolog, Lisp, and Java presents the authors with a number of exciting opportunities.
The first opportunity is the chance to compare three languages that give very different expression to the many ideas that have shaped the evolution of programming languages as a whole. These core ideas, which also support modern AI technology, include functional programming, list processing, predicate logic, declarative representation, dynamic binding, meta-linguistic abstraction, strong-typing, meta-circular definition, and object-oriented design and programming. Lisp and Prolog are, of course, widely recognized for their contributions to the evolution, theory, and practice of programming language design. Java, the youngest of this trio, is both an example of how the ideas pioneered in these earlier languages have shaped modern applicative programming, as well as a powerful tool for delivering AI applications on personal computers, local networks, and the world wide web.
Where could you go wrong with comparing Prolog, Lisp and Java?
Either for the intellectual exercise or because you want a better understanding of AI, a resource to enjoy!
Posted in Algorithms, Artificial Intelligence, Data Structures, Java, Lisp, Prolog | No Comments »
Sunday, March 10th, 2013
SPMF: A Sequential Pattern Mining Framework
From the webpage:
SPMF is an open-source data mining mining platform written in Java.
It is distributed under the GPL v3 license.
It offers implementations of 52 data mining algorithms for:
- sequential pattern mining,
- association rule mining,
- frequent itemset mining,
- sequential rule mining,
- clustering
It can be used as a standalone program with a user interface or from the command line. Moreover, the source code of each algorithm can be integrated in other Java software.
The documentation consists entirely of examples of using SPMF for data mining tasks.
The algorithms page details the fifty-two (52) algorithms of SPMF by references to the literature.
I first saw this at: SPMF: Sequential Pattern Mining Framework.
Posted in Algorithms, Data Mining | No Comments »
Friday, March 8th, 2013
Untangling algorithmic illusions from reality in big data by Alex Howard.
From the post:
Microsoft principal researcher Kate Crawford (@katecrawford) gave a strong talk at last week’s Strata Conference in Santa Clara, Calif. about the limits of big data. She pointed out potential biases in data collection, questioned who may be excluded from it, and hammered home the constant need for context in conclusions. Video of her talk is embedded below:
See Alex’s post for the video and the interview that follows.
Both are simply golden.
How important are biases in data collection?
Consider the classic example:
Are you in favor of convicted felons owning firerams?
90%+ of all surveyed say they favor gun control.
Are you in favor of gun control?
Much lower percentage saying they favor gun control.
The numbers are from memory and surveys probably forty years ago but the lesson is to watch the question being asked.
A survey that doesn’t expose its questions, how people were contacted, at what time of day, just to name a few factors, isn’t worthy of comment.
Posted in Algorithms, BigData | 1 Comment »
Tuesday, March 5th, 2013
A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time by Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, Zeyuan Allen Zhu.
Abstract:
In this paper, we present a simple combinatorial algorithm that solves symmetric diagonally dominant (SDD) linear systems in nearly-linear time. It uses very little of the machinery that previously appeared to be necessary for a such an algorithm. It does not require recursive preconditioning, spectral sparsification, or even the Chebyshev Method or Conjugate Gradient. After constructing a “nice” spanning tree of a graph associated with the linear system, the entire algorithm consists of the repeated application of a simple (non-recursive) update rule, which it implements using a lightweight data structure. The algorithm is numerically stable and can be implemented without the increased bit-precision required by previous solvers. As such, the algorithm has the fastest known running time under the standard unit-cost RAM model. We hope that the simplicity of the algorithm and the insights yielded by its analysis will be useful in both theory and practice.
In one popular account, the importance of the discovery was described this way:
The real value of the MIT paper, Spielman says, is in its innovative theoretical approach. “My work and the work of the folks at Carnegie Mellon, we’re solving a problem in numeric linear algebra using techniques from the field of numerical linear algebra,” he says. “Jon’s paper is completely ignoring all of those techniques and really solving this problem using ideas from data structures and algorithm design. It’s substituting one whole set of ideas for another set of ideas, and I think that’s going to be a bit of a game-changer for the field. Because people will see there’s this set of ideas out there that might have application no one had ever imagined.”
Thirty-two pages of tough sledding but if the commentaries are correct, this paper may have a major impact on graph processing.
Posted in Algorithms, Data Structures, Mathematics | No Comments »
Tuesday, February 5th, 2013
Chaotic Nihilists and Semantic Idealists by Alistair Croll.
From the post:
There are competing views of how we should tackle an abundance of data, which I’ve referred to as big data’s “odd couple”.
One camp—made up of semantic idealists who fetishize taxonomies—is to tag and organize it all. Once we’ve marked everything and how it relates to everything else, they hope, the world will be reasonable and understandable.
The poster child for the Semantic Idealists is Wolfram Alpha, a “reasoning engine” that understands, for example, a question like “how many blue whales does the earth weigh?”—even if that question has never been asked before. But it’s completely useless until someone’s told it the weight of a whale, or the earth, or, for that matter, what weight is.
They’re wrong.
Alistair continues with the other camp:
Wolfram Alpha’s counterpart for the Algorithmic Nihilists is IBM’s Watson, a search engine that guesses at answers based on probabilities (and famously won on Jeopardy.) Watson was never guaranteed to be right, but it was really, really likely to have a good answer. It also wasn’t easily controlled: when it crawled the Urban Dictionary website, it started swearing in its responses[1], and IBM’s programmers had to excise some of its more colorful vocabulary by hand.
She’s wrong too.
And projects the future as:
The future of data is a blend of both semantics and algorithms. That’s one reason Google recently introduced a second search engine, called the Knowledge Graph, that understands queries.[3] Knowledge Graph was based on technology from Metaweb, a company it acquired in 2010, and it augments “probabilistic” algorithmic search with a structured, tagged set of relationships.
Why are we missing asking users what they meant as a third option?
Depends on who you want to be in charge:
Algorithms — Empower Computer Scientists.
Ontologies/taxonomies — Empower Ontologists.
Asking Users — Empowers Users.
Topic maps are a solution that can ask users.
Any questions?
Posted in Algorithms, Ontology, Semantics, Taxonomy, Topic Maps | No Comments »
Tuesday, February 5th, 2013
Posted in Algorithms, MapReduce, Texts | No Comments »
Tuesday, January 29th, 2013
Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations by Arnab Bhattacharya and B. Palvali Teja.
(Submitted on 28 Jun 2012)
Abstract:
The multi-criteria decision making, which is possible with the advent of skyline queries, has been applied in many areas. Though most of the existing research is concerned with only a single relation, several real world applications require finding the skyline set of records over multiple relations. Consequently, the join operation over skylines where the preferences are local to each relation, has been proposed. In many of those cases, however, the join often involves performing aggregate operations among some of the attributes from the different relations. In this paper, we introduce such queries as “aggregate skyline join queries”. Since the naive algorithm is impractical, we propose three algorithms to efficiently process such queries. The algorithms utilize certain properties of skyline sets, and processes the skylines as much as possible locally before computing the join. Experiments with real and synthetic datasets exhibit the practicality and scalability of the algorithms with respect to the cardinality and dimensionality of the relations.
The authors illustrate a “skyline” query with a search for a hotel that has a good price and it close to the beach. A “skyline” set of hotels excludes hotels that are not as good on those points as hotels in the set. They then observe:
In real applications, however, there often exists a scenario when a single relation is not sufficient for the application, and the skyline needs to be computed over multiple relations [16]. For example, consider a flight database. A person traveling from city A to city B may use stopovers, but may still be interested in flights that are cheaper, have a less overall journey time, better ratings and more amenities. In this case, a single relation specifying all direct flights from A to B may not suffice or may not even exist. The join of multiple relations consisting of flights starting from A and those ending at B needs to be processed before computing the preferences.
The above problem becomes even more complex if the person is interested in the travel plan that optimizes both on the total cost as well as the total journey time for the two flights (other than the ratings and amenities of each
airline). In essence, the skyline now needs to be computed on attributes that have been aggregated from multiple relations in addition to attributes whose preferences are local within each relation. The common aggregate operations are sum, average, minimum, maximum, etc.
No doubt the travel industry thinks it has conquered semantic diversity in travel arrangements. If they have, it has since I stopped traveling several years ago.
Even simple tasks such as coordination of air and train schedules was unnecessarily difficult.
I suspect that is still the case and so mention “skyline” queries as a topic to be aware of and if necessary, to include in a topic map application that brings sanity to travel arrangements.
True, you can get a travel service that handles all the details, but only for a price and only if you are that trusting.
Posted in Aggregation, Algorithms, Query Engine, Query Language | No Comments »
Monday, January 28th, 2013
Pathfinding Algorithms for Changing Graphs
The algorithms summarized at the end of the discussion:
The increased interest in graph based information processing promises this will be an active area of research.
I first saw this in a tweet by GraphHopper.
Posted in Algorithms, Graphs | No Comments »
Sunday, January 20th, 2013
New big data firm to pioneer topological data analysis by John Burn-Murdoch.
From the post:
A US big data firm is set to establish algebraic topology as the gold standard of data science with the launch of the world’s leading topological data analysis (TDA) platform.
Ayasdi, whose co-founders include renowned mathematics professor Gunnar Carlsson, launched today in Palo Alto, California, having secured $10.25m from investors including Khosla Ventures in the first round of funding.
The funds will be used to build on its Insight Discovery platform, the culmination of 12 years of research and development into mathematics, computer science and data visualisation at Stanford.
Ayasdi’s work prior to launching as a company has already yielded breakthroughs in the pharmaceuticals industry. In one case it revealed new insights in eight hours – compared to the previous norm of over 100 hours – cutting the turnaround from analysis to clinical trials in the process.
The project? CompTop, which I covered here.
Does topological data analysis sound more interesting now than before?
Posted in Algorithms, Topological Data Analysis, Topology | No Comments »
Friday, January 18th, 2013
Graph Algorithms by David Eppstein.
Graph algorithms course with a Wikipedia book, Graph Algorithms, made up of articles from Wikipedia.
The syllabus does include some materials not found at Wikipedia, so be sure to check there as well.
Posted in Algorithms, Graphs, Networks | No Comments »
Friday, January 18th, 2013
Strong components of the Wikipedia graph
From the post:
I recently covered strong connectivity analysis in my graph algorithms class, so I’ve been playing today with applying it to the link structure of (small subsets of) Wikipedia.
For instance, here’s one of the strong components among the articles linked from Hans Freudenthal (a mathematician of widely varied interests): Algebraic topology, Freudenthal suspension theorem, George W. Whitehead, Heinz Hopf, Homotopy group, Homotopy groups of spheres, Humboldt University of Berlin, Luitzen Egbertus Jan Brouwer, Stable homotopy theory, Suspension (topology), University of Amsterdam, Utrecht University. Mostly this makes sense, but I’m not quite sure how the three universities got in there. Maybe from their famous faculty members?
One of responses to this post suggest grabbing the entire Wikipedia dataset for purposes of trying out algorithms.
A good suggestion for algorithms, perhaps even algorithms meant to reduce visual clutter, but at what point does a graph become too “busy” for visual analysis?
Recalling the research that claims people can only remember seven or so things at one time.
Posted in Algorithms, Graphs, Networks, Wikipedia | No Comments »
Monday, January 14th, 2013
How to implement an algorithm from a scientific paper by Emmanuel Goossaert.
From the post:
This article is a short guide to implementing an algorithm from a scientific paper. I have implemented many complex algorithms from books and scientific publications, and this article sums up what I have learned while searching, reading, coding and debugging. This is obviously limited to publications in domains related to the field of Computer Science. Nevertheless, you should be able to apply the guidelines and good practices presented below to any kind of paper or implementation.
Seven (7) rules, some with sub-parts, to follow when trying to implement an algorithm from a paper.
With the growth of research on parallel programming, likely the most important skill you will pick up this year.
I first saw this in Four short links: 11 January 2013 by Nat Torkington.
Posted in Algorithms, Programming | No Comments »
Tuesday, January 8th, 2013
Designing algorithms for Map Reduce by Ricky Ho.
From the post:
Since the emerging of Hadoop implementation, I have been trying to morph existing algorithms from various areas into the map/reduce model. The result is pretty encouraging and I’ve found Map/Reduce is applicable in a wide spectrum of application scenarios.
So I want to write down my findings but then found the scope is too broad and also I haven’t spent enough time to explore different problem domains. Finally, I realize that there is no way for me to completely cover what Map/Reduce can do in all areas, so I just dump out what I know at this moment over the long weekend when I have an extra day.
Notice that Map/Reduce is good for “data parallelism”, which is different from “task parallelism”. Here is a description about their difference and a general parallel processing design methodology.
I’ll cover the abstract Map/Reduce processing model below. For a detail description of the implementation of Hadoop framework, please refer to my earlier blog here.
A bit dated (2010) but still worth your time.
I missed its initial appearance so appreciated Ricky pointing back to it in MapReduce: Detecting Cycles in Network Graph.
You may also want to consult: Designing good MapReduce algorithms by Jeffrey Ullman.
Posted in Algorithms, BigData, Hadoop, MapReduce | No Comments »
Friday, December 28th, 2012
LIBOL 0.1.0
From the webpage:
LIBOL is an open-source library for large-scale online classification, which consists of a large family of efficient and scalable state-of-the-art online learning algorithms for large-scale online classification tasks. We have offered easy-to-use command-line tools and examples for users and developers. We also have made documents available for both beginners and advanced users. LIBOL is not only a machine learning tool, but also a comprehensive experimental platform for conducting online learning research.
In general, the existing online learning algorithms for linear classication tasks can be grouped into two major categories: (i) first order learning (Rosenblatt, 1958; Crammer et al., 2006), and (ii) second order learning (Dredze et al., 2008; Wang et al., 2012; Yang et al., 2009).
Example online learning algorithms in the first order learning category implemented in this library include:
• Perceptron: the classical online learning algorithm (Rosenblatt, 1958);
• ALMA: A New ApproximateMaximal Margin Classification Algorithm (Gentile, 2001);
• ROMMA: the relaxed online maxiumu margin algorithms (Li and Long, 2002);
• OGD: the Online Gradient Descent (OGD) algorithms (Zinkevich, 2003);
• PA: Passive Aggressive (PA) algorithms (Crammer et al., 2006), one of state-of-the-art first order online learning algorithms;
Example algorithms in the second order online learning category implemented in this library include the following:
• SOP: the Second Order Perceptron (SOP) algorithm (Cesa-Bianchi et al., 2005);
• CW: the Confidence-Weighted (CW) learning algorithm (Dredze et al., 2008);
• IELLIP: online learning algorithms by improved ellipsoid method (Yang et al., 2009);
• AROW: the Adaptive Regularization of Weight Vectors (Crammer et al., 2009);
• NAROW: New variant of Adaptive Regularization (Orabona and Crammer, 2010);
• NHERD: the Normal Herding method via Gaussian Herding (Crammer and Lee, 2010)
• SCW: the recently proposed Soft ConfidenceWeighted algorithms (Wang et al., 2012).
LIBOL is still being improved by improvements from practical users and new research results.
More information can be found in our project website: http://libol.stevenhoi.org/
Consider this an early New Year’s present!
Posted in Algorithms, Classification, Machine Learning | No Comments »
Wednesday, December 5th, 2012
Algorithms: Design and Analysis, Part 2 by Tim Roughgarden. (Coursera)
From the course description:
In this course you will learn several fundamental principles of advanced algorithm design: greedy algorithms and applications; dynamic programming and applications; NP-completeness and what it means for the algorithm designer; the design and analysis of heuristics; and more.
The course started December 3, 2012 so if you are going to join, best do so soon.
Posted in Algorithms, CS Lectures | No Comments »
Wednesday, December 5th, 2012
Fast Parallel Sorting Algorithms on GPUs by Bilal Jan, Bartolomeo Montrucchio, Carlo Ragusa, Fiaz Gul Khan, Omar Khan.
Abstract:
This paper presents a comparative analysis of the three widely used parallel sorting algorithms: OddEven sort, Rank sort and Bitonic sort in terms of sorting rate, sorting time and speed-up on CPU and different GPU architectures. Alongside we have implemented novel parallel algorithm: min-max butterfly network, for finding minimum and maximum in large data sets. All algorithms have been implemented exploiting data parallelism model, for achieving high performance, as available on multi-core GPUs using the OpenCL specification. Our results depicts minimum speed-up19x of bitonic sort against oddeven sorting technique for small queue sizes on CPU and maximum of 2300x speed-up for very large queue sizes on Nvidia Quadro 6000 GPU architecture. Our implementation of full-butterfly network sorting results in relatively better performance than all of the three sorting techniques: bitonic, odd-even and rank sort. For min-max butterfly network, our findings report high speed-up of Nvidia quadro 6000 GPU for high data set size reaching 224 with much lower sorting time.
Is there a GPU in your topic map processing future?
I first saw this in a tweet by Stefano Bertolo.
Posted in Algorithms, GPU, Parallel Programming, Sorting | No Comments »
Saturday, November 10th, 2012
Algorithmic Economics, August 6-10, 2012, Carnegie Mellon University.
You will find slides and videos for:
Another view of social dynamics. Which is everywhere when you think about it. Not just consumers but sellers, manufacturers, R&D.
There isn’t any human activity separate and apart from social dynamics or influenced by them.
That includes the design, authoring and marketing of topic maps.
I first saw this in a tweet from Stefano Bertolo, mentioning the general link and also the lecture on game theory.
Posted in Algorithms, Game Theory, Networks, Social Networks | 1 Comment »
Sunday, October 28th, 2012
1 MB Sorting Explained by Jeff Preshing.
From the post:
In my previous post, I shared some source code to sort one million 8-digit numbers in 1MB of RAM as an answer to this Stack Overflow question. The program works, but I didn’t explain how, leaving it as a kind of puzzle for the reader.
(image omitted)
I had promised to explain it in a followup post, and in the meantime, there’s been a flurry of discussion in the comments and on Reddit. In particular, commenter Ben Wilhelm (aka ZorbaTHut) already managed to explain most of it (Nice work!), and by now, I think quite a few people already get it. Nonetheless, I’ll write up another explanation as promised.
You may want to also review the answers and comments at Stack Overflow as well.
Sorting being one of those fundamental operations you will encounter time and again.
Even in topic maps.
Posted in Algorithms, Programming, Sorting | No Comments »
Sunday, October 28th, 2012
Algorithms for Massive Data Sets by Inge Li Gørtz and Philip Bille.
From the course description:
A student who has met the objectives of the course will be able to:
- Describe an algorithm in a comprehensible manner, i.e., accurately, concise, and unambiguous.
- Prove correctness of algorithms.
- Analyze, evaluate, and compare the performance of algorithms in models of computation relevant to massive data sets.
- Analyze, evaluate, and compare the quality and reliability of solutions.
- Apply and extend relevant algorithmic techniques for massive data sets.
- Design algorithms for problems related to massive data sets.
- Lookup and apply relevant research literature for problems related to massive data sets.
- Systematically identify and analyze problems and make informed choices for solving the problems based on the analysis.
- Argue clearly for the choices made when solving a problem.
Papers, slides and exercises provided for these topics:
Week 1: Introduction and Hashing: Chained, Universal, and Perfect.
Week 2: Predecessor Data Structures: x-fast tries and y-fast tries.
Week 3: Decremental Connectivity in Trees: Cluster decomposition, Word-Level Parallelism.
Week 4: Nearest Common Ancestors: Distributed data structures, Heavy-path decomposition, alphabetic codes.
Week 5: Amortized analysis and Union-Find.
Week 6: Range Reporting: Range Trees, Fractional Cascading, and kD Trees.
Week 7: Persistent data structures.
Week 8: String matching.
Week 9: String Indexing: Dictionaries, Tries, Suffix trees, and Suffix Sorting.
Week 10: Introduction to approximation algorithms. TSP, k-center, vertex cover.
Week 11: Approximation algorithms: Set Cover, stable matching.
Week 12: External Memory: I/O Algorithms, Cache-Oblivious Algorithms, and Dynamic Programming
Just reading the papers will improve your big data skills.
Posted in Algorithms, BigData, CS Lectures | No Comments »
Saturday, October 27th, 2012
Designing good MapReduce algorithms by Jeffrey D. Ullman.
From the introduction:
If you are familiar with “big data,” you are probably familiar with the MapReduce approach to implementing parallelism on computing clusters [1]. A cluster consists of many compute nodes, which are processors with their associated memory and disks. The compute nodes are connected by Ethernet or switches so they can pass data from node to node.
Like any other programming model, MapReduce needs an algorithm-design theory. The theory is not just the theory of parallel algorithms—MapReduce requires we coordinate parallel processes in a very specific way. A MapReduce job consists of two functions written by the programmer, plus some magic that happens in the middle:
- The Map function turns each input element into zero or more key-value pairs. A “key” in this sense is not unique, and it is in fact important that many pairs with a given key are generated as the Map function is applied to all the input elements.
- The system sorts the key-value pairs by key, and for each key creates a pair consisting of the key itself and a list of all the values associated with that key.
- The Reduce function is applied, for each key, to its associated list of values. The result of that application is a pair consisting of the key and whatever is produced by the Reduce function applied to the list of values. The output of the entire MapReduce job is what results from the application of the Reduce function to each key and its list.
When we execute a MapReduce job on a system like Hadoop [2], some number of Map tasks and some number of Reduce tasks are created. Each Map task is responsible for applying the Map function to some subset of the input elements, and each Reduce task is responsible for applying the Reduce function to some number of keys and their associated lists of values. The arrangement of tasks and the key-value pairs that communicate between them is suggested in Figure 1. Since the Map tasks can be executed in parallel and the Reduce tasks can be executed in parallel, we can obtain an almost unlimited degree of parallelism—provided there are many compute nodes for executing the tasks, there are many keys, and no one key has an unusually long list of values
A very important feature of the Map-Reduce form of parallelism is that tasks have the blocking property [3]; that is, no Map or Reduce task delivers any output until it has finished all its work. As a result, if a hardware or software failure occurs in the middle of a MapReduce job, the system has only to restart the Map or Reduce tasks that were located at the failed compute node. The blocking property of tasks is essential to avoid restart of a job whenever there is a failure of any kind. Since Map-Reduce is often used for jobs that require hours on thousands of compute nodes, the probability of at least one failure is high, and without the blocking property large jobs would never finish.
There is much more to the technology of MapReduce. You may wish to consult, a free online text that covers MapReduce and a number of its applications [4].
Warning: This article may change your interest in the design of MapReduce algorithms.
Ullman’s stories of algorithm tradeoffs provide motivation to evaluate (or reevaluate) your own design tradeoffs.
Posted in Algorithms, BigData, Hadoop, MapReduce | No Comments »
Tuesday, October 16th, 2012
Report on XLDB Tutorial on Data Structures and Algorithms by Michael Bender.
From the post:
The tutorial was organized as follows:
- Module 0: Tutorial overview and introductions. We describe an observed (but not necessary) tradeoff in ingestion, querying, and freshness in traditional database.
- Module 1: I/O model and cache-oblivious analysis.
- Module 2: Write-optimized data structures. We give the optimal trade-off between inserts and point queries. We show how to build data structures that lie on this tradeoff curve.
- Module 2 continued: Write-optimized data structures perform writes much faster than point queries; this asymmetry affects the design of an ACID compliant database.
- Module 3: Case study – TokuFS. How to design and build a write-optimized file systems.
- Module 4: Page-replacement algorithms. We give relevant theorems on the performance of page-replacement strategies such as LRU.
- Module 5: Index design, including covering indexes.
- Module 6: Log-structured merge trees and fractional cascading.
- Module 7: Bloom filters.
These algorithms and data structures are used both in NoSQL implementations such as MongoDB, HBase and in SQL-oriented implementations such as MySQL and TokuDB.
The slides are available here.
A tutorial offered by Michael and Bradley C. Kuszmaul at the 6th XLDB conference.
If you are committed to defending your current implementation choices against all comers, don’t bother with the slides.
If you want a peek at one future path in data structures, get the slides. You won’t be disappointed.
Posted in Algorithms, Data Structures, Fractal Trees, TokuDB, Tokutek | No Comments »
Wednesday, October 10th, 2012
Distributed Algorithms in NoSQL Databases by Ilya Katsov.
From the post:
Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche of practical studies and real-life trials of different combinations of protocols and algorithms. These developments gradually highlight a system of relevant database building blocks with proven practical efficiency. In this article I’m trying to provide more or less systematic description of techniques related to distributed operations in NoSQL databases.
In the rest of this article we study a number of distributed activities like replication of failure detection that could happen in a database. These activities, highlighted in bold below, are grouped into three major sections:
- Data Consistency. Historically, NoSQL paid a lot of attention to tradeoffs between consistency, fault-tolerance and performance to serve geographically distributed systems, low-latency or highly available applications. Fundamentally, these tradeoffs spin around data consistency, so this section is devoted data replication and data repair.
- Data Placement. A database should accommodate itself to different data distributions, cluster topologies and hardware configurations. In this section we discuss how to distribute or rebalance data in such a way that failures are handled rapidly, persistence guarantees are maintained, queries are efficient, and system resource like RAM or disk space are used evenly throughout the cluster.
- System Coordination. Coordination techniques like leader election are used in many databases to implements fault-tolerance and strong data consistency. However, even decentralized databases typically track their global state, detect failures and topology changes. This section describes several important techniques that are used to keep the system in a coherent state.
Slow going but well worth the effort.
Not the issues discussed in the puff-piece webinars extolling NoSQL solutions to “big data.”
But you already knew that if you read this far! Enjoy!
I first saw this at Christophe Lalanne’s A bag of tweets / September 2012
Posted in Algorithms, Distributed Systems, NoSQL | No Comments »
Tuesday, October 9th, 2012
Advanced Data Structures
From the description:
This course will survey important developments in data structures that have not (yet) worked their way into the standard computer science curriculum. The precise topics will depend on the interests and background of the course participants; see the current schedule for details. Potential topics include include self-adjusting binary search trees, dynamic trees and graphs, persistent data structures, geometric data structures, kinetic data structures, I/O-efficient and cache-oblivious data structures, hash tables and Bloom filters, data structures that beat information-theoretic lower bounds, and applications of these data structures in computational geometry, combinatorial optimization, systems and networking, databases, and other areas of computer science.
The course page has links to similar courses.
For hardy souls exploring data structures per se or for specialized topic maps, an annotated bibliography of readings.
If you haven’t seen it, visit Jeff’s homepage.
To see what Jeff has been up to lately: DBLP: Jeff Erickson.
Posted in Algorithms, Data Structures | No Comments »