Archive for the ‘Genetic Algorithms’ Category

AI vs. Taxpayer (so far, taxpayer wins)

Saturday, October 10th, 2015

Computer Scientists Wield Artificial Intelligence to Battle Tax Evasion by Lynnley Browning.

From the post:

When federal authorities want to ferret out abusive tax shelters, they send an army of forensic accountants, auditors and lawyers to burrow into suspicious tax returns.

Analyzing mountains of filings and tracing money flows through far-flung subsidiaries is notoriously difficult; even if the Internal Revenue Service manages to unravel a major scheme, it typically does so only years after its emergence, by which point a fresh dodge has often already replaced it.

But what if that needle-in-a-haystack quest could be done routinely, and quickly, by a computer? Could the federal tax laws — 74,608 pages of legal gray areas and welters of credits, deductions and exemptions — be accurately rendered in an algorithm?

“We see the tax code as a calculator,” said Jacob Rosen, a researcher at the Massachusetts Institute of Technology who focuses on the abstract representation of financial transactions and artificial intelligence techniques. “There are lots of extraordinarily smart people who take individual parts of the tax code and recombine them in complex transactions to construct something not intended by the law.”

A recent paper by Mr. Rosen and four other computer scientists — two others from M.I.T. and two at the Mitre Corporation, a nonprofit technology research and development organization — demonstrated how an algorithm could detect a certain type of known tax shelter used by partnerships.

I had to chuckle when I read:

“There are lots of extraordinarily smart people who take individual parts of the tax code and recombine them in complex transactions to construct something not intended by the law.”

It would be more accurate to say: “…something not intended by the tax policy wonks at the IRS.”

Or as Justice Sutherland said in Gregory v. Helvering (1935):

The legal right of a taxpayer to decrease the amount of what otherwise would be his taxes, or altogether to avoid them, by means which the law permits, cannot be doubted.

Gregory v. Helvering isn’t much comfort because Sutherland also found against the taxpayer in that case on a “not intended by the law” basis.

Still, if you read the paper you will realize taxpayers remain well ahead of any AI:

Drawbacks are that currently SCOTE has a very simplified view of transactions, audit points and law.

Should we revisit the Turing test?

Perhaps a series of tax code tests, 1040A, 1040 long form, corporate reorganization, each one more complex than the one before.

Pitch the latest AIs against tax professionals?

Creating a genetic algorithm for beginners

Wednesday, September 16th, 2015

Creating a genetic algorithm for beginners by Lee Jacobson.

From the post:

A genetic algorithm (GA) is great for finding solutions to complex search problems. They’re often used in fields such as engineering to create incredibly high quality products thanks to their ability to search through a huge combination of parameters to find the best match. For example, they can search through different combinations of materials and designs to find the perfect combination of both which could result in a stronger, lighter and overall, better final product. They can also be used to design computer algorithms, to schedule tasks, and to solve other optimization problems. Genetic algorithms are based on the process of evolution by natural selection which has been observed in nature. They essentially replicate the way in which life uses evolution to find solutions to real world problems. Surprisingly, although genetic algorithms can be used to find solutions to incredibly complicated problems, they are themselves pretty simple to use and understand.

How they work

As we now know they’re based on the process of natural selection, this means they take the fundamental properties of natural selection and apply them to whatever problem it is we’re trying to solve.

The basic process for a genetic algorithm is:

  1. Initialization – Create an initial population. This population is usually randomly generated and can be any desired size, from only a few individuals to thousands.
  2. Evaluation – Each member of the population is then evaluated and we calculate a ‘fitness’ for that individual. The fitness value is calculated by how well it fits with our desired requirements. These requirements could be simple, ‘faster algorithms are better’, or more complex, ‘stronger materials are better but they shouldn’t be too heavy’.
  3. Selection – We want to be constantly improving our population’s overall fitness. Selection helps us to do this by discarding the bad designs and only keeping the best individuals in the population. There are a few different selection methods but the basic idea is the same: make it more likely that fitter individuals will be selected for our next generation.
  4. Crossover – During crossover we create new individuals by combining aspects of our selected individuals. We can think of this as mimicking how sex works in nature. The hope is that by combining certain traits from two or more individuals we will create an even ‘fitter’ offspring which will inherit the best traits from each of its parents.
  5. Mutation – We need to add a little bit of randomness into our population’s genetics, otherwise every combination of solutions we can create would be in our initial population. Mutation typically works by making very small changes at random to an individual’s genome.
  6. And repeat! – Now we have our next generation we can start again from step two until we reach a termination condition.


There are a few reasons why you would want to terminate your genetic algorithm from continuing its search for a solution. The most likely reason is that your algorithm has found a solution which is good enough and meets predefined minimum criteria. Other reasons for terminating could be constraints such as time or money.
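The six steps and the two kinds of termination condition can be sketched in a few lines of Python. This is a toy of my own, not Lee Jacobson's code: the function name `run_ga`, the "count the 1 bits" fitness, the tournament selection and every parameter value are assumptions chosen only to keep the example short.

```python
import random

def run_ga(genome_len=32, pop_size=40, mutation_rate=0.02,
           max_generations=200, seed=0):
    """Evolve bit strings toward all ones; returns (best, generations_run)."""
    rng = random.Random(seed)

    def fitness(ind):
        # Evaluation: number of 1 bits, a toy stand-in for real requirements.
        return sum(ind)

    def select(pop):
        # Selection: tournament of 3, so fitter individuals win more often.
        return max(rng.sample(pop, 3), key=fitness)

    def crossover(a, b):
        # Crossover: single cut point, head from a and tail from b.
        cut = rng.randrange(1, genome_len)
        return a[:cut] + b[cut:]

    def mutate(ind):
        # Mutation: flip each bit with a small probability.
        return [bit ^ 1 if rng.random() < mutation_rate else bit for bit in ind]

    # Initialization: a randomly generated population.
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]

    for generation in range(max_generations):
        best = max(pop, key=fitness)
        # Termination 1: a good-enough solution has been found.
        if fitness(best) == genome_len:
            return best, generation
        # And repeat: build the next generation from selected parents.
        pop = [mutate(crossover(select(pop), select(pop)))
               for _ in range(pop_size)]

    # Termination 2: the generation budget (time or money) ran out.
    return max(pop, key=fitness), max_generations
```

Calling `run_ga(seed=1)` twice returns the same result, while changing the seed, the selection method or the generation budget can change it.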

A bit old, 2012, but it is a good introduction to genetic algorithms and if you read the comments (lots of those), you will find ports into multiple languages.

The important point: when presented with genetic algorithm results, be sure to ask for the fitness criteria, selection method, termination condition and the number of generations run.

Personally I would ask for the starting population and code as well.

There are any number of ways to produce an “objective” result from simply running a genetic algorithm so adopt that Heinlein adage: “Always cut cards.”

Applies in data science as it does in moon colonies.

The Genetic Programming Bibliography (Hits 10K!)

Sunday, March 29th, 2015

The Genetic Programming Bibliography by William Langdon.

A truly awesome bibliography collection and tool!

The introduction (10 pages PDF) is a model of clarity and will enhance your use/enjoyment of this bibliography.

You will also find there: Ai/index.html

I first saw this in a tweet by Jason H. Moore, PhD.

I suppose Google may downgrade my search listing because I have included a list of URLs that may be useful to you.

I care more about posting useful data for my readers than about gaming Google. If more of us felt that way, search results might be less the products of SEO gaming.

Active learning, almost black magic

Tuesday, October 22nd, 2013

Active learning, almost black magic by Lars Marius Garshol.

From the post:

I’ve written Duke, an engine for figuring out which records represent the same thing. It works fine, but people find it difficult to configure correctly, which is not so strange. Getting the configurations right requires estimating probabilities and choosing between comparators like Levenshtein, Jaro-Winkler, and Dice coefficient. Can we get the computer to do something people cannot? It sounds like black magic, but it’s actually pretty simple.

I implemented a genetic algorithm that can set up a good configuration automatically. The genetic algorithm works by making lots of configurations, then removing the worst and making more of the best. The configurations that are kept are tweaked randomly, and the process is repeated over and over again. It’s dead simple, but it works fine. The problem is: how is the algorithm to know which configurations are the best? The obvious solution is to have test data that tells you which records should be linked, and which ones should not be linked.

But that leaves us with a bootstrapping problem. If you can create a set of test data big enough for this to work, and find all the correct links in that set, then you’re fine. But how do you find all the links? You can use Duke, but if you can set up Duke well enough to do that you don’t need the genetic algorithm. Can you do it in other ways? Maybe, but that’s hard work, quite possibly harder than just battling through the difficulties and creating a configuration.

So, what to do? For a year or so I was stuck here. I had something that worked, but it wasn’t really useful to anyone.

Then I came across a paper where Axel Ngonga described how to solve this problem with active learning. Basically, the idea is to pick some record pairs that perhaps should be linked, and ask the user whether they should be linked or not. There’s an enormous number of pairs we could ask the user about, but most of these pairs provide very little information. The trick is to select those pairs which teach the algorithm the most.
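One way to picture "select those pairs which teach the algorithm the most" is to ask the user about the pairs on which the current candidate configurations disagree. The sketch below is my own illustration of that heuristic (often called query by committee). It is not taken from Duke or from the Ngonga paper, and the pair format, the threshold "configurations" and all names are hypothetical.

```python
def most_informative_pairs(pairs, committee, k=3):
    """Return the k candidate pairs on which the committee disagrees most."""
    def disagreement(pair):
        votes = sum(1 for classify in committee if classify(pair))
        # 0 when the vote is unanimous, largest when it splits evenly.
        return min(votes, len(committee) - votes)
    return sorted(pairs, key=disagreement, reverse=True)[:k]

def threshold_config(threshold):
    # Hypothetical stand-in for one evolved configuration: link two records
    # when their similarity score reaches the threshold.
    return lambda pair: pair[2] >= threshold

# Candidate record pairs as (id_a, id_b, similarity); made-up values.
PAIRS = [
    ("a1", "b1", 0.95),
    ("a2", "b2", 0.75),
    ("a3", "b3", 0.30),
    ("a4", "b4", 0.60),
]
COMMITTEE = [threshold_config(t) for t in (0.5, 0.7, 0.9)]

# The pairs worth showing to the user first.
to_ask = most_informative_pairs(PAIRS, COMMITTEE, k=2)
```

Pairs that every configuration accepts or rejects add little, so the user's answers are spent on the contested middle.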

This is great stuff.

Particularly since I have a training problem that lacks a training set. 😉

Looking forward to trying this on “real-world problems” as Lars says.

Astrophysical data mining with GPU…

Tuesday, April 9th, 2013

Astrophysical data mining with GPU. A case study: genetic classification of globular clusters by Stefano Cavuoti, Mauro Garofalo, Massimo Brescia, Maurizio Paolillo, Antonio Pescape’, Giuseppe Longo, Giorgio Ventre.


We present a multi-purpose genetic algorithm, designed and implemented with GPGPU / CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200x in the training phase with respect to the CPU based version.

BTW, DAMEWARE (DAta Mining Web Application REsource,

In case you are curious about the application of genetic algorithms in a low signal/noise situation with really “big” data, this is a good starting point.

Makes me curious about the “noise” in other communications.

The “signal” is fairly easy to identify in astronomy, but what about in text or speech?

I suppose “background noise, music, automobiles” would count as “noise” on a tape recording of a conversation, but is there “noise” in a written text?

Or noise in a conversation that is clearly audible?

If we have 100% signal, how do we explain failing to understand a message in speech or writing?

If it is not “noise,” then what is the problem?

Machine Learning: Genetic Algorithms in Javascript Part 2

Sunday, September 16th, 2012

Machine Learning: Genetic Algorithms in Javascript Part 2 by Burak Kanber.

From the post:

Today we’re going to revisit the genetic algorithm. If you haven’t read Genetic Algorithms Part 1 yet, I strongly recommend reading that now. This article will skip over the fundamental concepts covered in part 1 — so if you’re new to genetic algorithms you’ll definitely want to start there.

Just looking for the example?

The Problem

You’re a scientist that has recently been framed for murder by an evil company. Before you flee the lab you have an opportunity to steal 1,000 pounds (or kilograms!) of pure elements from the chemical warehouse; your plan is to later sell them and survive off of the earnings.

Given the weight and value of each element, which combination should you take to maximize the total value without exceeding the weight limit?

This is called the knapsack problem. The one above is a one-dimensional problem, meaning the only constraint is weight. We could complicate matters by also considering volume, but we need to start somewhere. Note that in our version of the problem only one piece of each element is available, and each piece has a fixed weight. There are some knapsack problems where you can take unlimited platinum or up to 3 pieces of gold or something like that, but here we only have one of each available to us.

Why is this problem tough to solve? We’ll be using 118 elements. The brute-force approach would require that we test 2^118 or 3.3 * 10^35 different combinations of elements.
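To make the scale concrete: with one piece of each of 118 elements, every element is either taken or left, so there are 2^118 (about 3.3 * 10^35) selections to test. The Python sketch below is mine, not Burak Kanber's Javascript. It uses eight made-up (weight, value) items and a limit of 15 so that brute force is still possible for comparison, and it scores overweight selections as zero, which is only one of several ways to handle the constraint.

```python
import random
from itertools import product

# Hypothetical items as (weight, value); the post uses 118 chemical elements.
ITEMS = [(12, 4), (2, 2), (1, 1), (4, 10), (1, 2), (9, 7), (5, 6), (7, 8)]
LIMIT = 15

def total(mask, idx):
    # Sum the weights (idx=0) or values (idx=1) of the items that are taken.
    return sum(item[idx] for item, take in zip(ITEMS, mask) if take)

def fitness(mask):
    # Overweight selections are infeasible, so they score zero.
    return total(mask, 1) if total(mask, 0) <= LIMIT else 0

def brute_force():
    # Fine for 2**8 masks; hopeless for the 2**118 masks of the full problem.
    return max(product((0, 1), repeat=len(ITEMS)), key=fitness)

def genetic(pop_size=30, generations=60, mutation_rate=0.1, seed=1):
    rng = random.Random(seed)
    n = len(ITEMS)
    pop = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            # Tournament selection of two parents.
            a = max(rng.sample(pop, 3), key=fitness)
            b = max(rng.sample(pop, 3), key=fitness)
            # Single-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            children.append(tuple(g ^ 1 if rng.random() < mutation_rate else g
                                  for g in child))
        pop = children
        # Remember the best selection seen in any generation.
        best = max(pop + [best], key=fitness)
    return best
```

On a problem this small `brute_force()` gives the true optimum to check `genetic()` against. On 118 items only the genetic search is practical, and it offers no guarantee of optimality.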

What if you have subject identity criteria of varying reliability? What is the best combination for the highest reliability?

To sharpen the problem: Your commanding officer has requested declaration of sufficient identity for a drone strike target.

Machine Learning: Genetic Algorithms Part 1 (Javascript)

Sunday, September 16th, 2012

Machine Learning: Genetic Algorithms Part 1 (Javascript) by Burak Kanber.

From the post:

I like starting my machine learning classes with genetic algorithms (which we’ll abbreviate “GA” sometimes). Genetic algorithms are probably the least practical of the ML algorithms I cover, but I love starting with them because they’re fascinating and they do a good job of introducing the “cost function” or “error function”, and the idea of local and global optima — concepts both important and common to most other ML algorithms.

Genetic algorithms are inspired by nature and evolution, which is seriously cool to me. It’s no surprise, either, that artificial neural networks (“NN”) are also modeled from biology: evolution is the best general-purpose learning algorithm we’ve experienced, and the brain is the best general-purpose problem solver we know. These are two very important pieces of our biological existence, and also two rapidly growing fields of artificial intelligence and machine learning study. While I’m tempted to talk more about the distinction I make between the GA’s “learning algorithm” and the NN’s “problem solver” terminology, we’ll drop the topic of NNs altogether and concentrate on GAs… for now.

One phrase I used above is profoundly important: “general-purpose”. For almost any specific computational problem, you can probably find an algorithm that solves it more efficiently than a GA. But that’s not the point of this exercise, and it’s also not the point of GAs. You use the GA not when you have a complex problem, but when you have a complex problem of problems. Or you may use it when you have a complicated set of disparate parameters.

Off to a great start!

Genetic algorithms: a simple R example

Saturday, August 4th, 2012

Genetic algorithms: a simple R example by Bart Smeets.

From the post:

Genetic algorithm is a search heuristic. GAs can generate a vast number of possible model solutions and use these to evolve towards an approximation of the best solution of the model. Hereby it mimics evolution in nature.

GA generates a population, the individuals in this population (often called chromosomes) have a given state. Once the population is generated, the state of these individuals is evaluated and graded on their value. The best individuals are then taken and crossed-over – in order to hopefully generate ‘better’ offspring – to form the new population. In some cases the best individuals in the population are preserved in order to guarantee ‘good individuals’ in the new generation (this is called elitism).
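Elitism is easy to show in isolation. The sketch below is in Python rather than the R of Bart Smeets's post, and the function name `next_generation`, the bit-list chromosomes and the parameter values are my assumptions: the top `elite_count` individuals are copied unchanged, and the rest of the new population is bred from the better half.

```python
import random

def next_generation(population, fitness, elite_count=2,
                    mutation_rate=0.05, rng=random):
    """Build a new population of the same size, copying the elite unchanged."""
    ranked = sorted(population, key=fitness, reverse=True)
    # Elitism: the best individuals survive as they are.
    new_population = [list(ind) for ind in ranked[:elite_count]]
    # The better half of the population serves as the parent pool.
    parents = ranked[:max(2, len(ranked) // 2)]
    while len(new_population) < len(population):
        mother, father = rng.sample(parents, 2)
        # Single-point crossover, then bit-flip mutation.
        cut = rng.randrange(1, len(mother))
        child = mother[:cut] + father[cut:]
        child = [g ^ 1 if rng.random() < mutation_rate else g for g in child]
        new_population.append(child)
    return new_population
```

Because the elite are copied before any crossover or mutation, the best fitness in the new population can never be lower than in the old one.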

The GA site by Marek Obitko has a great tutorial for people with no previous knowledge on the subject.

As the size of data stores increases, so will the cost of personal judgement on each subject identity test. Genetic algorithms may be one way of creating subject identity tests in such situations.

In any event, it won’t harm anyone to be aware of the basic contours of the technique.

I first saw this at R-Bloggers.

Experiments in genetic programming

Monday, March 19th, 2012

Experiments in genetic programming

Lars Marius Garshol writes:

I made an engine called Duke that can automatically match records to see if they represent the same thing. For more background, see a previous post about it. The biggest problem people seem to have with using it is coming up with a sensible configuration. I stumbled across a paper that described using so-called genetic programming to configure a record linkage engine, and decided to basically steal the idea.

You need to read about the experiments in the post but I can almost hear Lars saying the conclusion:

The result is pretty clear: the genetic configurations are much the best. The computer can configure Duke better than I can. That’s almost shocking, but there you are. I guess I need to turn the script into an official feature.


Excellent post and approach by the way!

Lars also posted a link to Reddit about his experiments. Several links appear in comments that I have turned into short posts to draw more attention to them.

Another tool for your topic mapping toolbox.

Question: I wonder what it would look like to have the intermediate results used for mapping, only to be replaced as “better” mappings become available? The process would have a terminating condition, but new content could trigger additional cycles, though only as relevant to that content.

Or would queries count as new content? If they expressed synonymy or other relations?

Data Mining: Professor Pier Luca Lanzi, Politecnico di Milano

Monday, August 8th, 2011

This post started with my finding the data mining slides at Slideshare (about 4 years old) and after organizing those, deciding to check Professor Pier Luca Lanzi’s homepage for more recent material. I think you will find it useful material.

Pier Luca Lanzi – homepage

The professor is obviously interested in video games, a rapidly growing area of development and research.

Combining video games with data mining, that would be a real coup.

Data Mining Course page

Data Mining

Includes prior exams, video (2009 course), transparencies from all lectures.

Lecture slides on Data Mining and Machine Learning at Slideshare.

Not being a lemming, I don’t find most viewed a helpful sorting criterion.

I organized the data mining slides in course order (as nearly as I could determine, there are two #6 presentations and no #7 or #17 presentations):

00 Course Introduction

01 Data Mining

02 Machine Learning

03 The representation of data

04 Association rule mining

05 Association rules: advanced topics

06 Clustering: Introduction

06 Clustering: Partitioning Methods

08 Clustering: Hierarchical

09 Density-based, Grid-based, and Model-based Clustering

10 Introduction to Classification

11 Decision Trees

12 Classification Rules

13 Nearest Neighbor and Bayesian Classifiers

14 Evaluation

15 Data Exploration and Preparation

16 Classifiers Ensembles

18 Mining Data Streams

19 Text and Web Mining

Genetic Algorithms

Genetic Algorithms Course Notes

Genetic Algorithm Examples – Post

Sunday, March 6th, 2011

Genetic Algorithm Examples

From the post:

There’s been a lot of buzz recently on reddit and HN about genetic algorithms. Some impressive new demos have surfaced and I’d like to take this opportunity to review some of the cool things people have done with genetic algorithms, a fascinating subfield of evolutionary computing / machine learning (which is itself a part of the broader study of artificial intelligence (ah how academics love to classify things (and nest parentheses (especially computer scientists)))).

Interesting collection of examples of uses of genetic algorithms.

Posted here to provoke thinking about the use of genetic algorithms in topic maps.

See also the author’s tutorial: Genetic Algorithm For Hello World.

Have you used genetic algorithms with a topic map?

Appreciate a note if you have.