Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 14, 2013

Astrophysics Source Code Library…

Filed under: Algorithms,Astroinformatics,Programming,Software — Patrick Durusau @ 4:25 pm

Astrophysics Source Code Library: Where do we go from here? by Ms. Alice Allen.

From the introduction:

This week I am featuring a guest post by Ms. Alice Allen, the Editor of the Astrophysics Source Code Library, an on-line index of codes used in astronomical research that have been referenced in peer-reviewed journal articles. The post is essentially a talk given by Ms. Allen at the recent ADASS XXIII meeting. The impact of the ASCL is growing – a poster by Associate Editor Kim DuPrie at ADASS XXIII showed that there are now 700+ codes indexed, and quarterly page views have quadrupled from Q1/2011 to 24,000. Researchers are explicitly citing the code in papers that use the software, the ADS is linking software to papers about the code, and the ASCL is sponsoring workshops and discussion forums to identify obstacles to code sharing and propose solutions. And now, over to you, Alice: (emphasis in original)

Alice describes “success” as:

Success for us is this: you read a paper, want to see the code, click a link or two, and can look at the code for its underlying assumptions, methods, and computations. Alternately, if you want to investigate an unfamiliar domain, you can peruse the ASCL to see what codes have been written in that area.

Imagine having that level of “success” for data sets or data extraction source code.

October 8, 2013

From Algorithms to Z-Scores:…

Filed under: Algorithms,Computer Science,Mathematics,Probability,R,Statistics — Patrick Durusau @ 2:47 pm

From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science by Norm Matloff.

From the Overview:

The materials here form a textbook for a course in mathematical probability and statistics for computer science students. (It would work fine for general students too.)


“Why is this text different from all other texts?”

  • Computer science examples are used throughout, in areas such as: computer networks; data and text mining; computer security; remote sensing; computer performance evaluation; software engineering; data management; etc.
  • The R statistical/data manipulation language is used throughout. Since this is a computer science audience, a greater sophistication in programming can be assumed. It is recommended that my R tutorials be used as a supplement (see the link to R for Programmers below).

  • Throughout the units, mathematical theory and applications are interwoven, with a strong emphasis on modeling: What do probabilistic models really mean, in real-life terms? How does one choose a model? How do we assess the practical usefulness of models?

    For instance, the chapter on continuous random variables begins by explaining that such distributions do not actually exist in the real world, due to the discreteness of our measuring instruments. The continuous model is therefore just that–a model, and indeed a very useful model.

    There is actually an entire chapter on modeling, discussing the tradeoff between accuracy and simplicity of models.

  • There is considerable discussion of the intuition involving probabilistic concepts, and the concepts themselves are defined through intuition. However, all models and so on are described precisely in terms of random variables and distributions.

Another open-source textbook from Norm Matloff!

Algorithms to Z-Scores (the book).

Source files for the book available at: http://heather.cs.ucdavis.edu/~matloff/132/PLN .

Norm suggests his R tutorial, R for Programmers http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf as supplemental reading material.

To illustrate the importance of statistics, Norm gives the following examples in chapter 1:

  • The statistical models used on Wall Street made the “quants” (quantitative analysts) rich— but also contributed to the worldwide financial crash of 2008.
  • In a court trial, large sums of money or the freedom of an accused may hinge on whether the judge and jury understand some statistical evidence presented by one side or the other.
  • Wittingly or unconsciously, you are using probability every time you gamble in a casino— and every time you buy insurance.
  • Statistics is used to determine whether a new medical treatment is safe/effective for you.
  • Statistics is used to flag possible terrorists —but sometimes unfairly singling out innocent people while other times missing ones who really are dangerous.

Mastering the material in this book will put you a long way toward becoming a “statistical skeptic.”

So you can debunk misleading or simply wrong claims by government, industry and special interest groups. Wait! Those are also known as advertisers. Never mind.

September 29, 2013

Foundations of Data Science

Foundations of Data Science by John Hopcroft and Ravindran Kannan.

From the introduction:

Computer science as an academic discipline began in the 60’s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context free languages, and computability. In the 70’s, algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks, which are by far the largest such structures, presents both opportunities and challenges for theory.

While traditional areas of computer science are still important and highly skilled individuals are needed in these areas, the majority of researchers will be involved with using computers to understand and make usable massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods.

In draft form but impressive!

Current chapters:

  1. Introduction
  2. High-Dimensional Space
  3. Random Graphs
  4. Singular Value Decomposition (SVD)
  5. Random Walks and Markov Chains
  6. Learning and the VC-dimension
  7. Algorithms for Massive Data Problems
  8. Clustering
  9. Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation
  10. Other Topics [Rankings, Hare System for Voting, Compressed Sensing and Sparse Vectors]
  11. Appendix

I am certain the authors would appreciate comments and suggestions concerning the text.

I first saw this in a tweet by CompSciFact.

July 2, 2013

AK Data Science Summit – Streaming and Sketching

Filed under: Algorithms,BigData,Stream Analytics — Patrick Durusau @ 12:35 pm

AK Data Science Summit – Streaming and Sketching – June 20, 2013

From the post:

Aggregate Knowledge, along with Foundation Capital, proudly presented the AK Data Science Summit on Streaming and Sketching Algorithms in Big Data and Analytics at 111 Minna Gallery in San Francisco on June 20th, 2013. It was a one day conference dedicated to bridging the gap between implementers and researchers. You can find video and slides of the talks given and panels held on that day. Thank you again to our speakers and panelists for their time and efforts!

What do you do when big data is too big?

Use streaming and sketching algorithms!

Not every subject identity problem allows time for human editorial intervention.

Consider target acquisition in a noisy environment, where some potential targets are hostile and some are not.

Capturing what caused a fire/no-fire decision (identification as hostile) enables refinement or transfer to other systems.

June 22, 2013

Machine Learning Cheat Sheet [Suggestions for a better one]

Filed under: Algorithms,Machine Learning — Patrick Durusau @ 3:39 pm

Machine Learning Cheat Sheet (pdf)

If you need to memorize machine learning formulas for an exam, this might be the very thing.

On the other hand, if you are sitting at your console, you are likely to have online or hard copy references with this formula and more detailed information.

A generally helpful machine learning cheat sheet would include some common cases where each algorithm has been successful. Perhaps even some edge cases you are unlikely to think about.

The algorithms are rarely in question. Proper application, well, that’s an entirely different story.

I first saw this in a tweet by Siah.

May 26, 2013

SIAM Archives

Filed under: Algorithms,Archives,Computer Science — Patrick Durusau @ 2:05 pm

I saw an announcement today for SDM 2014: SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24 – 26, 2014, but the call for papers hasn’t appeared yet.

While visiting the conference site I followed the proceedings link to discover:

Workshop on Algorithm Engineering and Experiments (ALENEX) 2006 – 2013

Workshop on Analytic Algorithmics and Combinatorics (ANALCO) 2006 – 2013

ACM-SIAM Symposium on Discrete Algorithms (SODA) 2009 – 2013

Data Mining – 2001 – 2013

Mathematics for Industry 2009

Just in case you are short on reading material for the summer. 😉

May 2, 2013

HyperLogLog — Cornerstone of a Big Data Infrastructure

Filed under: Algorithms,BigData,Data Mining,HyperLogLog,Scalability — Patrick Durusau @ 10:44 am

HyperLogLog — Cornerstone of a Big Data Infrastructure

From the introduction:

In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et. al. in the phenomenal paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. This sketch extends upon the earlier Loglog Counting of Large Cardinalities (Durand et. al.) which in turn is based on the seminal AMS work FM-85, Flajolet and Martin’s original work on probabilistic counting. (Many thanks to Jérémie Lumbroso for the correction of the history here. I am very much looking forward to his upcoming introduction to probabilistic counting in Flajolet’s complete works.) UPDATE – Rob has recently published a blog about PCSA, a direct precursor to LogLog counting which is filled with interesting thoughts. There have been a few posts on HLL recently so I thought I would dive into the intuition behind the sketch and into some of the details.

After seeing the HyperLogLog references in Approximate Methods for Scalable Data Mining I started looking for a fuller explanation/illustration of HyperLogLog.

Stumbled on this posting.

Includes a great HyperLogLog (HLL) simulation written in JavaScript.
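
If you would rather poke at the idea in code first, here is a minimal, illustrative HLL sketch of my own in Python (a toy, not AK’s implementation; production sketches add further bias corrections): hash each item, let the low bits pick a register, keep the longest run of leading zeros seen, and combine the registers with a harmonic mean.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog distinct-value sketch, for illustration only."""

    def __init__(self, b=12):
        self.b = b                      # index bits
        self.m = 1 << b                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)          # low b bits choose a register
        w = h >> self.b                 # remaining 64 - b bits
        rank = (64 - self.b) - w.bit_length() + 1   # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        e = alpha * self.m * self.m * z
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:             # small-range (linear counting) correction
            e = self.m * math.log(self.m / zeros)
        return e

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))            # close to 100,000, within a few percent
```

The entire state is 4,096 one-byte registers, which is the point: a few kilobytes can estimate the cardinality of an arbitrarily large stream.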

Enjoy!

May 1, 2013

Algorithms, 4th Edition

Filed under: Algorithms,Programming — Patrick Durusau @ 8:46 am

Algorithms, 4th Edition by Robert Sedgewick and Kevin Wayne.

Here you will find excerpts from the text, Java code and solutions to some exercises on:

  1. Fundamentals
  2. Sorting
  3. Searching
  4. Graphs
  5. Strings
  6. Context

My latest from Sedgewick is almost a decade old so I may need to upgrade to the latest version.

At almost 1,000 pages (976 to be exact), it may push me towards a Kindle reader. 😉

April 28, 2013

Algorithms Every Data Scientist Should Know: Reservoir Sampling

Filed under: Algorithms,Data Science,Reservoir Sampling — Patrick Durusau @ 12:12 pm

Algorithms Every Data Scientist Should Know: Reservoir Sampling by Josh Wills.

Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:

Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.

Remember: Stay Calm.

The second thing to do is to think deeply about the question. Assume that you are talking to a good person who has read Daniel Tunkelang’s excellent advice about interviewing data scientists. This means that this interview question probably originated in a real problem that this data scientist has encountered in her work. Therefore, a simple answer like, “I would put all of the items in a list and then select one at random once the stream ended,” would be a bad thing for you to say, because it would mean that you didn’t think deeply about what would happen if there were more items in the stream than would fit in memory (or even on disk!) on a single computer.

The third thing to do is to create a simple example problem that allows you to work through what should happen for several concrete instances of the problem. The vast majority of humans do a much better job of solving problems when they work with concrete examples instead of abstractions, so making the problem concrete can go a long way toward helping you find a solution.

In addition to great interview advice, Josh also provides a useful overview of reservoir sampling.
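
For reference, the textbook answer to Josh’s interview question is reservoir sampling (Algorithm R). A minimal sketch in Python, generalized to a reservoir of k items:

```python
import random

def reservoir_sample(stream, k=1):
    """Choose k items uniformly at random from a stream of unknown length,
    seen only once, using O(k) memory (classic Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randint(0, i)     # uniform index in [0, i]
            if j < k:
                reservoir[j] = item      # keep the new item with probability k/(i+1)
    return reservoir

# One uniformly chosen item from a stream too large to hold in memory
print(reservoir_sample((f"item-{n}" for n in range(1_000_000)), k=1))
```

A quick induction shows each item ends up in the reservoir with probability k/n, where n is the length of the stream.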

Whether reservoir sampling will be useful to you depends on your test for subject identity.

I tend to think of subject identity as being very precise but that isn’t necessarily the case.

Or should I say that precision of subject identity is a matter of requirements?

For some purposes, it may be sufficient to know the gender of attendees, as a subject, within some margin of statistical error. With enough effort we could know that more precisely but the cost may be prohibitive.

Think of any test for subject identity as being located on a continuum of subject identification, where the notion of “precision” itself is up for definition.

Russia’s warning about one of the Boston Marathon bombers, a warning that used his name as he spelled it rather than as captured by the US intelligence community, was a case of a mistaken level of precision.

Most likely the result of an analyst schooled in an English-only curriculum.

April 25, 2013

A different take on data skepticism

Filed under: Algorithms,Data,Data Models,Data Quality — Patrick Durusau @ 1:26 pm

A different take on data skepticism by Beau Cronin.

From the post:

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

…Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?
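
To make the hidden-assumption point concrete, here is a small sketch (assuming scikit-learn and NumPy are available): k-means will carve even structureless data into exactly as many clusters as you ask for, and the objective it optimizes gives no warning that the split is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Uniform noise: there is no cluster structure to find.
rng = np.random.default_rng(42)
X = rng.uniform(size=(500, 2))

for k in (2, 4, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia keeps shrinking as k grows, so it cannot tell you k is wrong.
    print(k, round(model.inertia_, 2))
```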

Beau makes several good points on questioning data methods.

I would extend those “…more fundamental questions…” to data as well.

Data, at least as far as I know, doesn’t drop from the sky. It is collected, generated, sometimes both, by design.

That design had some reason for collecting that data, in some particular way and in a given format.

Like methods, data stands mute with regard to those designs: what choices were made, by whom, and for what reason?

Giving voice to what can be known about methods and data falls to human users.

April 7, 2013

Deploying Graph Algorithms on GPUs: an Adaptive Solution

Filed under: Algorithms,GPU,Graphs,Networks — Patrick Durusau @ 5:46 am

Deploying Graph Algorithms on GPUs: an Adaptive Solution by Da Li and Michela Becchi. (27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013)

From the post:

Thanks to their massive computational power and their SIMT computational model, Graphics Processing Units (GPUs) have been successfully used to accelerate a wide variety of regular applications (linear algebra, stencil computations, image processing and bioinformatics algorithms, among others). However, many established and emerging problems are based on irregular data structures, such as graphs. Examples can be drawn from different application domains: networking, social networking, machine learning, electrical circuit modeling, discrete event simulation, compilers, and computational sciences. It has been shown that irregular applications based on large graphs do exhibit runtime parallelism; moreover, the amount of available parallelism tends to increase with the size of the datasets. In this work, we explore an implementation space for deploying a variety of graph algorithms on GPUs. We show that the dynamic nature of the parallelism that can be extracted from graph algorithms makes it impossible to find an optimal solution. We propose a runtime system able to dynamically transition between different implementations with minimal overhead, and investigate heuristic decisions applicable across algorithms and datasets. Our evaluation is performed on two graph algorithms: breadth-first search and single-source shortest paths. We believe that our proposed mechanisms can be extended and applied to other graph algorithms that exhibit similar computational patterns.

A development that may surprise some graph software vendors: there are “no optimal solution[s] across graph problems and datasets” for graph algorithms on GPUs.

This paper points towards an adaptive technique that may prove to be “resilient to the irregularity and heterogeneity of real world graphs.”

I first saw this in a tweet by Stefano Bertolo.

March 20, 2013

Large-Scale Learning with Less… [Less Precision Viable?]

Filed under: Algorithms,Artificial Intelligence,Machine Learning — Patrick Durusau @ 4:32 pm

Large-Scale Learning with Less RAM via Randomization by Daniel Golovin, D. Sculley, H. Brendan McMahan, Michael Young.

Abstract:

We reduce the memory footprint of popular large-scale online learning methods by projecting our weight vector onto a coarse discrete set using randomized rounding. Compared to standard 32-bit float encodings, this reduces RAM usage by more than 50% during training and by up to 95% when making predictions from a fixed model, with almost no loss in accuracy. We also show that randomized counting can be used to implement per-coordinate learning rates, improving model quality with little additional RAM. We prove these memory-saving methods achieve regret guarantees similar to their exact variants. Empirical evaluation confirms excellent performance, dominating standard approaches across memory versus accuracy tradeoffs.
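
The core trick is easy to illustrate. Below is a toy sketch of unbiased randomized rounding onto a coarse grid (my own illustration of the idea, not the authors’ code): each weight rounds up or down with probability proportional to its position between the two nearest grid points, so the rounded weight equals the original in expectation while needing far fewer bits to store.

```python
import numpy as np

def randomized_round(weights, resolution=1 / 256):
    """Project weights onto a grid of spacing `resolution` using unbiased
    randomized rounding: round up with probability equal to the fractional
    distance above the lower grid point, so E[rounded] == original."""
    w = np.asarray(weights, dtype=np.float64)
    scaled = w / resolution
    lower = np.floor(scaled)
    round_up = np.random.random(w.shape) < (scaled - lower)
    return (lower + round_up) * resolution

weights = np.random.randn(10) * 0.1
coarse = randomized_round(weights)
print(np.abs(coarse - weights).max())   # error bounded by the grid spacing
# Each coarse weight can now be stored as a small integer code instead of a
# 32-bit float, which is where the memory savings come from.
```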

I mention this in part because topic map authoring can be assisted by the results of machine learning.

It is also a data point for the proposition that unlike their human masters, machines are too precise.

Perhaps it is the case that the vagueness of human reasoning has significant advantages over the disk grinding precision of our machines.

The question then becomes: How do we capture vagueness in a system where every point is either 0 or 1?

Not probability, because that can be expressed, but vagueness, which I experience as something different.

Suggestions?

PS: Perhaps that is what makes artificial intelligence artificial. It is too precise. 😉

I first saw this in a tweet by Stefano Bertolo.

March 19, 2013

AI Algorithms, Data Structures, and Idioms…

Filed under: Algorithms,Artificial Intelligence,Data Structures,Java,Lisp,Prolog — Patrick Durusau @ 10:51 am

AI Algorithms, Data Structures, and Idioms in Prolog, Lisp and Java by George F. Luger and William A. Stubblefield.

From the introduction:

Writing a book about designing and implementing representations and search algorithms in Prolog, Lisp, and Java presents the authors with a number of exciting opportunities.

The first opportunity is the chance to compare three languages that give very different expression to the many ideas that have shaped the evolution of programming languages as a whole. These core ideas, which also support modern AI technology, include functional programming, list processing, predicate logic, declarative representation, dynamic binding, meta-linguistic abstraction, strong-typing, meta-circular definition, and object-oriented design and programming. Lisp and Prolog are, of course, widely recognized for their contributions to the evolution, theory, and practice of programming language design. Java, the youngest of this trio, is both an example of how the ideas pioneered in these earlier languages have shaped modern applicative programming, as well as a powerful tool for delivering AI applications on personal computers, local networks, and the world wide web.

Where could you go wrong with comparing Prolog, Lisp and Java?

Either for the intellectual exercise or because you want a better understanding of AI, a resource to enjoy!

March 10, 2013

SPMF

Filed under: Algorithms,Data Mining — Patrick Durusau @ 3:14 pm

SPMF: A Sequential Pattern Mining Framework

From the webpage:

SPMF is an open-source data mining platform written in Java.

It is distributed under the GPL v3 license.

It offers implementations of 52 data mining algorithms for:

  • sequential pattern mining,
  • association rule mining,
  • frequent itemset mining,
  • sequential rule mining,
  • clustering

It can be used as a standalone program with a user interface or from the command line. Moreover, the source code of each algorithm can be integrated in other Java software.

The documentation consists entirely of examples of using SPMF for data mining tasks.

The algorithms page details the fifty-two (52) algorithms of SPMF by references to the literature.
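
To give a flavor of the kind of task SPMF automates, here is a toy frequent-itemset miner in Python, an Apriori-style, level-wise sketch that bears no resemblance to SPMF’s optimized Java implementations:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=2):
    """Tiny Apriori-style frequent itemset miner (level-wise candidate
    generation). Illustrative only."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}       # candidate 1-itemsets
    result = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets to build (k+1)-candidates
        keys = list(frequent)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        k += 1
    return result

baskets = [{"bread", "milk"}, {"bread", "beer", "eggs"},
           {"milk", "beer", "bread"}, {"bread", "milk", "beer"}]
for itemset, count in sorted(frequent_itemsets(baskets).items(), key=lambda x: -x[1]):
    print(set(itemset), count)
```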

I first saw this at: SPMF: Sequential Pattern Mining Framework.

March 8, 2013

Untangling algorithmic illusions from reality in big data

Filed under: Algorithms,BigData — Patrick Durusau @ 3:46 pm

Untangling algorithmic illusions from reality in big data by Alex Howard.

From the post:

Microsoft principal researcher Kate Crawford (@katecrawford) gave a strong talk at last week’s Strata Conference in Santa Clara, Calif. about the limits of big data. She pointed out potential biases in data collection, questioned who may be excluded from it, and hammered home the constant need for context in conclusions. Video of her talk is embedded below:

See Alex’s post for the video and the interview that follows.

Both are simply golden.

How important are biases in data collection?

Consider the classic example:

Are you in favor of convicted felons owning firearms?

Asked that way, 90%+ of those surveyed give an answer that favors gun control.

Are you in favor of gun control?

Asked that way, a much lower percentage favors gun control.

The numbers are from memory, and the surveys are probably forty years old, but the lesson is to watch the question being asked.

A survey that doesn’t expose its questions, how people were contacted, at what time of day, just to name a few factors, isn’t worthy of comment.

March 5, 2013

A Simple, Combinatorial Algorithm for Solving…

Filed under: Algorithms,Data Structures,Mathematics — Patrick Durusau @ 2:55 pm

A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time by Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, Zeyuan Allen Zhu.

Abstract:

In this paper, we present a simple combinatorial algorithm that solves symmetric diagonally dominant (SDD) linear systems in nearly-linear time. It uses very little of the machinery that previously appeared to be necessary for such an algorithm. It does not require recursive preconditioning, spectral sparsification, or even the Chebyshev Method or Conjugate Gradient. After constructing a “nice” spanning tree of a graph associated with the linear system, the entire algorithm consists of the repeated application of a simple (non-recursive) update rule, which it implements using a lightweight data structure. The algorithm is numerically stable and can be implemented without the increased bit-precision required by previous solvers. As such, the algorithm has the fastest known running time under the standard unit-cost RAM model. We hope that the simplicity of the algorithm and the insights yielded by its analysis will be useful in both theory and practice.

In one popular account, the importance of the discovery was described this way:

The real value of the MIT paper, Spielman says, is in its innovative theoretical approach. “My work and the work of the folks at Carnegie Mellon, we’re solving a problem in numeric linear algebra using techniques from the field of numerical linear algebra,” he says. “Jon’s paper is completely ignoring all of those techniques and really solving this problem using ideas from data structures and algorithm design. It’s substituting one whole set of ideas for another set of ideas, and I think that’s going to be a bit of a game-changer for the field. Because people will see there’s this set of ideas out there that might have application no one had ever imagined.”

Thirty-two pages of tough sledding but if the commentaries are correct, this paper may have a major impact on graph processing.

February 5, 2013

Chaotic Nihilists and Semantic Idealists [And What of Users?]

Filed under: Algorithms,Ontology,Semantics,Taxonomy,Topic Maps — Patrick Durusau @ 5:54 pm

Chaotic Nihilists and Semantic Idealists by Alistair Croll.

From the post:

There are competing views of how we should tackle an abundance of data, which I’ve referred to as big data’s “odd couple”.

One camp—made up of semantic idealists who fetishize taxonomies—is to tag and organize it all. Once we’ve marked everything and how it relates to everything else, they hope, the world will be reasonable and understandable.

The poster child for the Semantic Idealists is Wolfram Alpha, a “reasoning engine” that understands, for example, a question like “how many blue whales does the earth weigh?”—even if that question has never been asked before. But it’s completely useless until someone’s told it the weight of a whale, or the earth, or, for that matter, what weight is.

They’re wrong.

Alistair continues with the other camp:

Wolfram Alpha’s counterpart for the Algorithmic Nihilists is IBM’s Watson, a search engine that guesses at answers based on probabilities (and famously won on Jeopardy.) Watson was never guaranteed to be right, but it was really, really likely to have a good answer. It also wasn’t easily controlled: when it crawled the Urban Dictionary website, it started swearing in its responses[1], and IBM’s programmers had to excise some of its more colorful vocabulary by hand.

She’s wrong too.

And projects the future as:

The future of data is a blend of both semantics and algorithms. That’s one reason Google recently introduced a second search engine, called the Knowledge Graph, that understands queries.[3] Knowledge Graph was based on technology from Metaweb, a company it acquired in 2010, and it augments “probabilistic” algorithmic search with a structured, tagged set of relationships.

Why is asking users what they meant missing as a third option?

Depends on who you want to be in charge:

Algorithms — Empower Computer Scientists.

Ontologies/taxonomies — Empower Ontologists.

Asking Users — Empowers Users.

Topic maps are a solution that can ask users.

Any questions?

MapReduce Algorithms

Filed under: Algorithms,MapReduce,Texts — Patrick Durusau @ 4:55 pm

MapReduce Algorithms by Bill Bejeck.

Bill is writing a series of posts on implementing the algorithms given in pseudo-code in: Data-Intensive Text Processing with MapReduce.

  1. Working Through Data-Intensive Text Processing with MapReduce
  2. Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II
  3. Calculating A Co-Occurrence Matrix with Hadoop
  4. MapReduce Algorithms – Order Inversion
  5. Secondary Sorting

Another resource to try with your Hadoop Sandbox install!
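
If you want a feel for the local-aggregation pattern before reading Bill’s posts, here is a plain-Python sketch that simulates the map, shuffle and reduce phases for word counting, with the mapper combining counts in memory before emitting anything (no Hadoop required, purely illustrative):

```python
from collections import Counter, defaultdict

def map_with_local_aggregation(document):
    """Mapper that pre-aggregates counts in memory ("in-mapper combining"),
    emitting one (word, n) pair per distinct word instead of one per token."""
    return list(Counter(document.lower().split()).items())

def shuffle(mapped_outputs):
    """Group partial counts by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for word, n in pairs:
            groups[word].append(n)
    return groups

def reduce_counts(word, partial_counts):
    return word, sum(partial_counts)

docs = ["the cat sat on the mat", "the dog sat on the log"]
mapped = [map_with_local_aggregation(d) for d in docs]
for word, total in sorted(reduce_counts(w, ns) for w, ns in shuffle(mapped).items()):
    print(word, total)
```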

I first saw this at Alex Popescu’s 3 MapReduce and Hadoop Links: Secondary Sorting, Hadoop-Based Letterpress, and Hadoop Vaidya.

January 29, 2013

Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations

Filed under: Aggregation,Algorithms,Query Engine,Query Language — Patrick Durusau @ 6:50 pm

Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations by Arnab Bhattacharya and B. Palvali Teja.
(Submitted on 28 Jun 2012)

Abstract:

The multi-criteria decision making, which is possible with the advent of skyline queries, has been applied in many areas. Though most of the existing research is concerned with only a single relation, several real world applications require finding the skyline set of records over multiple relations. Consequently, the join operation over skylines where the preferences are local to each relation, has been proposed. In many of those cases, however, the join often involves performing aggregate operations among some of the attributes from the different relations. In this paper, we introduce such queries as “aggregate skyline join queries”. Since the naive algorithm is impractical, we propose three algorithms to efficiently process such queries. The algorithms utilize certain properties of skyline sets, and processes the skylines as much as possible locally before computing the join. Experiments with real and synthetic datasets exhibit the practicality and scalability of the algorithms with respect to the cardinality and dimensionality of the relations.

The authors illustrate a “skyline” query with a search for a hotel that has a good price and is close to the beach. A “skyline” set of hotels excludes hotels that are not as good on those points as hotels in the set. They then observe:

In real applications, however, there often exists a scenario when a single relation is not sufficient for the application, and the skyline needs to be computed over multiple relations [16]. For example, consider a flight database. A person traveling from city A to city B may use stopovers, but may still be interested in flights that are cheaper, have a less overall journey time, better ratings and more amenities. In this case, a single relation specifying all direct flights from A to B may not suffice or may not even exist. The join of multiple relations consisting of flights starting from A and those ending at B needs to be processed before computing the preferences.

The above problem becomes even more complex if the person is interested in the travel plan that optimizes both on the total cost as well as the total journey time for the two flights (other than the ratings and amenities of each airline). In essence, the skyline now needs to be computed on attributes that have been aggregated from multiple relations in addition to attributes whose preferences are local within each relation. The common aggregate operations are sum, average, minimum, maximum, etc.
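
In code, the hotel example boils down to a Pareto filter: keep every record that no other record beats on all criteria at once. A naive O(n²) sketch, with lower values better on both price and beach distance (real skyline algorithms are far more efficient):

```python
def skyline(points):
    """Return the skyline (Pareto-optimal set) of (price, distance) pairs,
    assuming lower is better on both criteria."""
    def dominates(a, b):
        # a dominates b: no worse on every criterion, strictly better on one
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [p for p in points if not any(dominates(q, p) for q in points)]

hotels = [(120, 0.3), (90, 1.2), (200, 0.1), (130, 0.4), (95, 1.5)]
print(skyline(hotels))   # [(120, 0.3), (90, 1.2), (200, 0.1)]
```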

No doubt the travel industry thinks it has conquered semantic diversity in travel arrangements. If it has, that happened after I stopped traveling several years ago.

Even simple tasks such as coordination of air and train schedules were unnecessarily difficult.

I suspect that is still the case and so mention “skyline” queries as a topic to be aware of and if necessary, to include in a topic map application that brings sanity to travel arrangements.

True, you can get a travel service that handles all the details, but only for a price and only if you are that trusting.

January 28, 2013

Pathfinding Algorithms for Changing Graphs

Filed under: Algorithms,Graphs — Patrick Durusau @ 1:18 pm

Pathfinding Algorithms for Changing Graphs

The algorithms are summarized at the end of the discussion.

The increased interest in graph based information processing promises this will be an active area of research.

I first saw this in a tweet by GraphHopper.

January 20, 2013

…topological data analysis

Filed under: Algorithms,Topological Data Analysis,Topology — Patrick Durusau @ 8:05 pm

New big data firm to pioneer topological data analysis by John Burn-Murdoch.

From the post:

A US big data firm is set to establish algebraic topology as the gold standard of data science with the launch of the world’s leading topological data analysis (TDA) platform.

Ayasdi, whose co-founders include renowned mathematics professor Gunnar Carlsson, launched today in Palo Alto, California, having secured $10.25m from investors including Khosla Ventures in the first round of funding.

The funds will be used to build on its Insight Discovery platform, the culmination of 12 years of research and development into mathematics, computer science and data visualisation at Stanford.

Ayasdi’s work prior to launching as a company has already yielded breakthroughs in the pharmaceuticals industry. In one case it revealed new insights in eight hours – compared to the previous norm of over 100 hours – cutting the turnaround from analysis to clinical trials in the process.

The project? CompTop, which I covered here.

Does topological data analysis sound more interesting now than before?

January 18, 2013

Graph Algorithms

Filed under: Algorithms,Graphs,Networks — Patrick Durusau @ 7:16 pm

Graph Algorithms by David Eppstein.

Graph algorithms course with a Wikipedia book, Graph Algorithms, made up of articles from Wikipedia.

The syllabus does include some materials not found at Wikipedia, so be sure to check there as well.

Strong components of the Wikipedia graph

Filed under: Algorithms,Graphs,Networks,Wikipedia — Patrick Durusau @ 7:16 pm

Strong components of the Wikipedia graph

From the post:

I recently covered strong connectivity analysis in my graph algorithms class, so I’ve been playing today with applying it to the link structure of (small subsets of) Wikipedia.

For instance, here’s one of the strong components among the articles linked from Hans Freudenthal (a mathematician of widely varied interests): Algebraic topology, Freudenthal suspension theorem, George W. Whitehead, Heinz Hopf, Homotopy group, Homotopy groups of spheres, Humboldt University of Berlin, Luitzen Egbertus Jan Brouwer, Stable homotopy theory, Suspension (topology), University of Amsterdam, Utrecht University. Mostly this makes sense, but I’m not quite sure how the three universities got in there. Maybe from their famous faculty members?
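
If you want to repeat the exercise on a subset of your own, here is a minimal sketch with networkx. The edge list below is hypothetical shorthand for “page A links to page B”; substitute links extracted from the Wikipedia dumps or API.

```python
import networkx as nx

# Hypothetical link graph: each edge means "page A links to page B".
G = nx.DiGraph([
    ("Algebraic topology", "Homotopy group"),
    ("Homotopy group", "Freudenthal suspension theorem"),
    ("Freudenthal suspension theorem", "Algebraic topology"),
    ("Homotopy group", "Heinz Hopf"),
    ("Heinz Hopf", "Homotopy group"),
    ("Algebraic topology", "Hans Freudenthal"),   # no link back, so not in the component
])

for component in nx.strongly_connected_components(G):
    if len(component) > 1:
        print(sorted(component))
```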

One of the responses to this post suggests grabbing the entire Wikipedia dataset for purposes of trying out algorithms.

A good suggestion for algorithms, perhaps even algorithms meant to reduce visual clutter, but at what point does a graph become too “busy” for visual analysis?

Recalling the research that claims people can only remember seven or so things at one time.

January 14, 2013

How to implement an algorithm from a scientific paper

Filed under: Algorithms,Programming — Patrick Durusau @ 8:36 pm

How to implement an algorithm from a scientific paper by Emmanuel Goossaert.

From the post:

This article is a short guide to implementing an algorithm from a scientific paper. I have implemented many complex algorithms from books and scientific publications, and this article sums up what I have learned while searching, reading, coding and debugging. This is obviously limited to publications in domains related to the field of Computer Science. Nevertheless, you should be able to apply the guidelines and good practices presented below to any kind of paper or implementation.

Seven (7) rules, some with sub-parts, to follow when trying to implement an algorithm from a paper.

With the growth of research on parallel programming, this is likely the most important skill you will pick up this year.

I first saw this in Four short links: 11 January 2013 by Nat Torkington.

January 8, 2013

Designing algorithms for Map Reduce

Filed under: Algorithms,BigData,Hadoop,MapReduce — Patrick Durusau @ 11:48 am

Designing algorithms for Map Reduce by Ricky Ho.

From the post:

Since the emerging of Hadoop implementation, I have been trying to morph existing algorithms from various areas into the map/reduce model. The result is pretty encouraging and I’ve found Map/Reduce is applicable in a wide spectrum of application scenarios.

So I want to write down my findings but then found the scope is too broad and also I haven’t spent enough time to explore different problem domains. Finally, I realize that there is no way for me to completely cover what Map/Reduce can do in all areas, so I just dump out what I know at this moment over the long weekend when I have an extra day.

Notice that Map/Reduce is good for “data parallelism”, which is different from “task parallelism”. Here is a description about their difference and a general parallel processing design methodology.

I’ll cover the abstract Map/Reduce processing model below. For a detail description of the implementation of Hadoop framework, please refer to my earlier blog here.

A bit dated (2010) but still worth your time.

I missed its initial appearance so appreciated Ricky pointing back to it in MapReduce: Detecting Cycles in Network Graph.

You may also want to consult: Designing good MapReduce algorithms by Jeffrey Ullman.

December 28, 2012

LIBOL 0.1.0

Filed under: Algorithms,Classification,Machine Learning — Patrick Durusau @ 7:44 pm

LIBOL 0.1.0

From the webpage:

LIBOL is an open-source library for large-scale online classification, which consists of a large family of efficient and scalable state-of-the-art online learning algorithms for large-scale online classification tasks. We have offered easy-to-use command-line tools and examples for users and developers. We also have made documents available for both beginners and advanced users. LIBOL is not only a machine learning tool, but also a comprehensive experimental platform for conducting online learning research.

In general, the existing online learning algorithms for linear classification tasks can be grouped into two major categories: (i) first order learning (Rosenblatt, 1958; Crammer et al., 2006), and (ii) second order learning (Dredze et al., 2008; Wang et al., 2012; Yang et al., 2009).

Example online learning algorithms in the first order learning category implemented in this library include:

• Perceptron: the classical online learning algorithm (Rosenblatt, 1958);

• ALMA: A New Approximate Maximal Margin Classification Algorithm (Gentile, 2001);

• ROMMA: the relaxed online maximum margin algorithms (Li and Long, 2002);

• OGD: the Online Gradient Descent (OGD) algorithms (Zinkevich, 2003);

• PA: Passive Aggressive (PA) algorithms (Crammer et al., 2006), one of state-of-the-art first order online learning algorithms;

Example algorithms in the second order online learning category implemented in this library include the following:

• SOP: the Second Order Perceptron (SOP) algorithm (Cesa-Bianchi et al., 2005);

• CW: the Confidence-Weighted (CW) learning algorithm (Dredze et al., 2008);

• IELLIP: online learning algorithms by improved ellipsoid method (Yang et al., 2009);

• AROW: the Adaptive Regularization of Weight Vectors (Crammer et al., 2009);

• NAROW: New variant of Adaptive Regularization (Orabona and Crammer, 2010);

• NHERD: the Normal Herding method via Gaussian Herding (Crammer and Lee, 2010)

• SCW: the recently proposed Soft Confidence-Weighted algorithms (Wang et al., 2012).

LIBOL is still being improved through contributions from practical users and new research results.

More information can be found in our project website: http://libol.stevenhoi.org/

Consider this an early New Year’s present!
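
To make the “first order learning” category concrete, here is a minimal sketch of the classical Perceptron update in Python, written independently of LIBOL’s own code: one pass over the stream, updating the weight vector only when a prediction is wrong.

```python
import numpy as np

def perceptron_online(stream, n_features, learning_rate=1.0):
    """Classical online Perceptron (Rosenblatt, 1958): mistake-driven updates."""
    w = np.zeros(n_features)
    mistakes = 0
    for x, y in stream:                          # labels y in {-1, +1}
        y_hat = 1 if np.dot(w, x) >= 0 else -1
        if y_hat != y:
            w += learning_rate * y * x           # update only on a mistake
            mistakes += 1
    return w, mistakes

# Toy linearly separable stream
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w, mistakes = perceptron_online(zip(X, y), n_features=2)
print(w, mistakes)
```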

December 5, 2012

Algorithms: Design and Analysis, Part 2

Filed under: Algorithms,CS Lectures — Patrick Durusau @ 12:31 pm

Algorithms: Design and Analysis, Part 2 by Tim Roughgarden. (Coursera)

From the course description:

In this course you will learn several fundamental principles of advanced algorithm design: greedy algorithms and applications; dynamic programming and applications; NP-completeness and what it means for the algorithm designer; the design and analysis of heuristics; and more.

The course started December 3, 2012 so if you are going to join, best do so soon.

Fast Parallel Sorting Algorithms on GPUs

Filed under: Algorithms,GPU,Parallel Programming,Sorting — Patrick Durusau @ 6:00 am

Fast Parallel Sorting Algorithms on GPUs by Bilal Jan, Bartolomeo Montrucchio, Carlo Ragusa, Fiaz Gul Khan, Omar Khan.

Abstract:

This paper presents a comparative analysis of the three widely used parallel sorting algorithms: OddEven sort, Rank sort and Bitonic sort in terms of sorting rate, sorting time and speed-up on CPU and different GPU architectures. Alongside we have implemented a novel parallel algorithm: min-max butterfly network, for finding minimum and maximum in large data sets. All algorithms have been implemented exploiting data parallelism model, for achieving high performance, as available on multi-core GPUs using the OpenCL specification. Our results depict a minimum speed-up of 19x for bitonic sort against the odd-even sorting technique for small queue sizes on CPU and a maximum of 2300x speed-up for very large queue sizes on the Nvidia Quadro 6000 GPU architecture. Our implementation of full-butterfly network sorting results in relatively better performance than all of the three sorting techniques: bitonic, odd-even and rank sort. For the min-max butterfly network, our findings report a high speed-up on the Nvidia Quadro 6000 GPU for large data set sizes reaching 2^24, with much lower sorting time.
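
Bitonic sort is the easiest of the three to sketch. Here is a sequential Python illustration of the compare-exchange network (my own sketch, not the authors’ OpenCL code); on a GPU every compare-exchange within a stage runs in parallel, which is where the speed-ups come from.

```python
def bitonic_sort(a, ascending=True):
    """Sequential sketch of bitonic sort; input length must be a power of two."""
    def merge(lo, n, up):
        if n > 1:
            k = n // 2
            for i in range(lo, lo + k):
                if (a[i] > a[i + k]) == up:
                    a[i], a[i + k] = a[i + k], a[i]   # compare-exchange
            merge(lo, k, up)
            merge(lo + k, k, up)

    def sort(lo, n, up):
        if n > 1:
            k = n // 2
            sort(lo, k, True)        # sort first half ascending
            sort(lo + k, k, False)   # sort second half descending -> bitonic
            merge(lo, n, up)

    sort(0, len(a), ascending)
    return a

print(bitonic_sort([7, 3, 9, 1, 6, 2, 8, 5]))   # [1, 2, 3, 5, 6, 7, 8, 9]
```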

Is there a GPU in your topic map processing future?

I first saw this in a tweet by Stefano Bertolo.

November 10, 2012

Algorithmic Economics

Filed under: Algorithms,Game Theory,Networks,Social Networks — Patrick Durusau @ 1:28 pm

Algorithmic Economics, August 6-10, 2012, Carnegie Mellon University.

You will find slides and videos for the lectures.

Another view of social dynamics, which is everywhere when you think about it: not just consumers, but sellers, manufacturers, R&D.

There isn’t any human activity separate and apart from social dynamics or uninfluenced by them.

That includes the design, authoring and marketing of topic maps.

I first saw this in a tweet from Stefano Bertolo, mentioning the general link and also the lecture on game theory.

October 28, 2012

1MB Sorting Explained

Filed under: Algorithms,Programming,Sorting — Patrick Durusau @ 2:36 pm

1 MB Sorting Explained by Jeff Preshing.

From the post:

In my previous post, I shared some source code to sort one million 8-digit numbers in 1MB of RAM as an answer to this Stack Overflow question. The program works, but I didn’t explain how, leaving it as a kind of puzzle for the reader.

(image omitted)

I had promised to explain it in a followup post, and in the meantime, there’s been a flurry of discussion in the comments and on Reddit. In particular, commenter Ben Wilhelm (aka ZorbaTHut) already managed to explain most of it (Nice work!), and by now, I think quite a few people already get it. Nonetheless, I’ll write up another explanation as promised.

You may want to also review the answers and comments at Stack Overflow as well.
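
A back-of-the-envelope sketch of why the puzzle is solvable at all: sort the numbers, store only the deltas between consecutive values, and variable-length encode them. With a million 8-digit numbers the average delta is about 100, so most deltas fit in a single byte. The sketch below is not Preshing’s scheme, just the intuition; it lands slightly over 1 MB, which is exactly why the real solution needs a tighter encoding.

```python
import random

def encode_sorted(numbers):
    """Delta + base-128 varint encoding of an already sorted list."""
    out = bytearray()
    prev = 0
    for n in numbers:
        delta = n - prev
        prev = n
        while True:                      # little-endian base-128 varint
            byte = delta & 0x7F
            delta >>= 7
            out.append(byte | (0x80 if delta else 0))
            if not delta:
                break
    return bytes(out)

nums = sorted(random.randrange(10**8) for _ in range(1_000_000))
encoded = encode_sorted(nums)
print(f"{len(encoded) / 2**20:.2f} MiB")   # typically about 1.2 MiB: close, but over budget
```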

Sorting being one of those fundamental operations you will encounter time and again.

Even in topic maps.
