Archive for February, 2013

R Bootcamp Materials!

Monday, February 25th, 2013

R Bootcamp Materials! by Jared Knowles.

From the post:

To train new employees at the Wisconsin Department of Public Instruction, I have developed a 2-3 day series of training modules on how to get work done in R. These modules cover everything from setting up and installing R and RStudio to creating reproducible analyses using the knitr package. There are also some experimental modules for introductions to basic computer programming, and a refresher course on statistics. I hope to improve both of these over time. 

I am happy to announce that all of these materials are available online, for free.

The bootcamp covers the following topics:

  1. Introduction to R: History of R, R as a programming language, and features of R.
  2. Getting Data In: How to import data into R, manipulate, and manage multiple data objects.
  3. Sorting and Reshaping Data: Long to wide, wide to long, and everything in between!
  4. Cleaning Education Data: Includes material from the Strategic Data Project about how to implement common business rules in processing administrative data.
  5. Regression and Basic Analytics in R: Using school mean test scores to do OLS regression and regression diagnostics — a real world example.
  6. Visualizing Data: Harness the power of R’s data visualization packages to make compelling and informative visualizations.
  7. Exporting Your Work: Learn the knitr package, how to export graphics, and how to create PDF reports.
  8. Advanced Topics: A potpourri of advanced features in R (by request).
  9. A Statistics Refresher: With interactive examples using shiny.
  10. Programming Principles: Tips and pointers about writing code. (Needs work)

The best part is, all of the materials are available online and free of charge! (Check out the R Bootcamp page). They are constantly evolving. We have done two R Bootcamps so far, and hope to do more. Each time the materials get a little better.

The R Bootcamp page enables you to download all the materials or view the modules separately.

If you already know R, pass it on.

TCP Traceroute

Monday, February 25th, 2013

TCP Traceroute by Peteris Krumins.

From the post:

Did you know you could traceroute over the TCP protocol?

The regular traceroute usually uses either the ICMP or UDP protocol. Unfortunately, firewalls and routers often block the ICMP protocol completely or disallow ICMP echo requests (ping requests), and/or block various UDP ports.

However, you’d rarely have firewalls and routers drop the TCP protocol on port 80 because it’s the web’s port.

Check this out. Let’s try a traceroute using the ICMP protocol:
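
Whatever the outer protocol, every traceroute variant works the same way underneath: it caps the IP TTL on successive probes so each router along the path reveals itself. A minimal Python sketch of that mechanism (an illustration of the concept, not a working traceroute):

```python
import socket

def make_probe_socket(ttl):
    # Each TCP traceroute probe is an ordinary TCP SYN whose IP TTL is
    # capped, so the Nth router on the path discards the packet and
    # answers with an ICMP "time exceeded" that reveals its address.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
    return s

# A real run would connect each probe to port 80 of the target and
# collect the ICMP replies; here we only demonstrate the TTL control.
for hop in range(1, 4):
    s = make_probe_socket(hop)
    assert s.getsockopt(socket.IPPROTO_IP, socket.IP_TTL) == hop
    s.close()
```

Tools like tcptraceroute (or `traceroute -T` on Linux) package exactly this loop, sending one TCP SYN per TTL value and reading back the time-exceeded replies.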

Certainly a way to document subjects along a TCP route.

Can you think of other reasons to use traceroute?

UW Courses in Computer Science and Engineering

Monday, February 25th, 2013

University of Washington Courses in Computer Science and Engineering

When I noticed the 2008 date on CSE 321: Discrete Structures 2008, I checked for a later offering of the course. The most recent offering I found was from 2010.

That’s still not terribly recent for a fundamental course, so I ended up at the general courses page you see above.

By my count there are two hundred and twenty-eight (228) courses, and many of the ones I checked include video lectures and other materials.

I never did discover the “official” successor for CSE 321, but given the wealth of course materials, that is a small matter.

Discrete Structures (University of Washington 2008)

Monday, February 25th, 2013

I found the following material via a link in Christophe Lalanne’s A bag of tweets / February 2013 on “Common Mistakes in Discrete Mathematics.”

The material is clearly organized around a textbook, but it wasn’t clear which text was being used.

Backing up the URI, I found the homepage for: CSE 321: Discrete Structures 2008, which listed the textbook as Rosen, Discrete Mathematics and Its Applications, McGraw-Hill, 6th Edition. (BTW, there is a 7th edition, Discrete Mathematics and Its Applications).

I also found links for:


Lecture Slides

Recorded Lectures

Post-Section Notes (note and a problem correction)

and, the origin of my inquiry:

Common Mistakes in Discrete Mathematics

In this section of the Guide we list many common mistakes that people studying discrete mathematics sometimes make. The list is organized chapter by chapter, based on when they first occur, but sometimes mistakes made early in the course perpetuate in later chapters. Also, some of these mistakes are remnants of misconceptions from high school mathematics (such as the impulse to assume that every operation distributes over every other operation).

In most cases we describe the mistake, give a concrete example, and then offer advice about how to avoid it. Note that additional advice about common mistakes is given, implicitly or explicitly, in the solutions to the odd-numbered exercises, which constitute the bulk of this Guide.

If 2008 sounds a bit old, you’re right. There is an update that requires a separate post. See: UW Courses in Computer Science and Engineering.

Computational Journalism

Monday, February 25th, 2013

Computational Journalism by Jonathan Stray.

From the webpage:

Maybe it’s not obvious that computer science and journalism go together, but they do!

Computational journalism combines classic journalistic values of storytelling and public accountability with techniques from computer science, statistics, the social sciences, and the digital humanities.

This course, given at the University of Hong Kong during January-February 2013, is an advanced look at how techniques from visualization, natural language processing, social network analysis, statistics, and cryptography apply to four different areas of journalism: finding stories through data mining, communicating what you’ve learned, filtering an overwhelming volume of information, and tracking the spread of information and effects.

The course assumes knowledge of computer science, including standard algorithms and linear algebra. The assignments are in Python and require programming experience. But this introductory video, which explains the topics covered, is for everyone.

For more, see the syllabus, or jump directly to a lecture:

  1. Basics. Feature vectors, clustering, projections.
  2. Text analysis. Tokenization, TF-IDF, topic modeling.
  3. Algorithmic filters. Information overload. Newsblaster and Google News.
  4. Hybrid filters. Social networks as filters. Collaborative Filtering.
  5. Social network analysis. Using it in journalism. Centrality algorithms.
  6. Knowledge representation. Structured data. Linked open data. General Q&A.
  7. Drawing conclusions. Randomness. Competing hypotheses. Causation.
  8. Security, surveillance, and privacy. Cryptography. Threat modeling.
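
To make one of these topics concrete: the TF-IDF weighting from lecture 2 scores a term by its frequency in a document, discounted by how many documents contain it. A minimal sketch (my illustration, not the course’s code):

```python
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term)                       # term frequency in this document
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["the", "mayor", "resigned"],
    ["the", "budget", "passed"],
    ["the", "mayor", "spoke", "about", "the", "budget"],
]
# "the" appears in every document, so its weight collapses to zero,
# while the rarer "mayor" keeps a positive score:
assert tf_idf("the", corpus[0], corpus) == 0.0
assert tf_idf("mayor", corpus[0], corpus) > 0.0
```

That collapse of ubiquitous terms is exactly what makes TF-IDF useful for finding story leads in large document dumps.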

CS knowledge and programming experience still required.

Interfaces will lessen that need over time but that knowledge/experience will help you question when interfaces have given odd results.

I would settle for journalists who question reports, like the Mandiant advertisement on cybersecurity last week. (Crowdsourcing Cybersecurity: A Proposal (Part 1))

Even the talking heads on the PBS Sunday morning news treated it as serious content. It was poorly written/researched ad copy, nothing more.

Of course, you would have to read the first couple of pages to discover that, not just skim the press release.

I first saw this at Christophe Lalanne’s A bag of tweets / February 2013.

ApacheCon NA 2013

Monday, February 25th, 2013

ApacheCon NA 2013

The schedule for ApacheCon NA 2013 (February 26th – 28th, 2013)

Sessions with authors.

A good starting point for reading topics in the Apache community.

BTW, is there a channel on YouTube for ApacheCon presentations that I am overlooking?


Linked Data for Holdings and Cataloging

Monday, February 25th, 2013

From the ALA Midwinter Meeting:

Linked Data for Holdings and Cataloging: The First Step Is Always the Hardest! by Eric Miller (Zepheira) and Richard Wallis (OCLC). (Video + Slides)

Linked Data for Holdings and Cataloging: Interactive Session. (Audio)

Since linked data wasn’t designed for human users, the advantage for library catalogs isn’t clear.

Most users can’t use LCSH so perhaps the lack of utility will go unnoticed. (Subject Headings and the Semantic Web)

I first saw this at: Linked Data for Holdings and Cataloging – recordings now available!

Calculate Return On Analytics Investment! [TM ROI/ROA?]

Monday, February 25th, 2013

Excellent Analytics Tip #22: Calculate Return On Analytics Investment! by Avinash Kaushik.

From the post:

Analysts: Put up or shut up time!

This blog is centered around creating incredible digital experiences powered by qualitative and quantitative data insights. Every post is about unleashing the power of digital analytics (the potent combination of data, systems, software and people). But we’ve never stopped to consider this question:

What is the return on investment (ROI) of digital analytics? What is the incremental revenue impact on the company’s bottom-line for the investment in data, systems and people?

Isn’t it amazing? We’ve not pointed the sexy arrow of accountability on ourselves!

Let’s fix that in this post. Let’s calculate the ROI of digital analytics. Let’s show, with real numbers (!) and a mathematical formula (oh, my!), that we are worth it!

We shall do that in two parts.

In part one, my good friend Jesse Nichols will present his wonderful formula for computing ROA (return on analytics).

In part two, we are going to build on the formula and create a model (ok, spreadsheet :)) that you can use to compute ROA for your own company. We’ll have a lot of detail in the model. It contains a sample computation you can use to build your own. It also contains multiple tabs full of specific computations of revenue incrementality delivered for various analytical efforts (Paid Search, Email Marketing, Attribution Analysis, and more). It also has one tab so full of awesomeness, you are going to have to download it to bathe in its glory.

Bottom-line: The model will give you the context you need to shine the bright sunshine of Madam Accountability on your own analytics practice.

Ready? (It is okay if you are scared. :)).
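
Whatever the details of Nichols’s formula, the core arithmetic of a return-on-analytics number is a simple ratio; a minimal sketch (my own illustration, not the formula from the post):

```python
def return_on_analytics(incremental_revenue, analytics_cost):
    # Profit attributable to analytics, expressed per dollar invested.
    return (incremental_revenue - analytics_cost) / analytics_cost

# E.g. $500k of incremental revenue on a $100k analytics investment:
assert return_on_analytics(500_000, 100_000) == 4.0
# Break-even: the analytics practice exactly pays for itself.
assert return_on_analytics(100_000, 100_000) == 0.0
```

The hard part, as the post makes clear, is not this division but attributing incremental revenue to specific analytical efforts in the first place.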

Would this work for measuring topic map ROI/ROA?

What other measurement techniques would you suggest?

Drill Sideways faceting with Lucene

Monday, February 25th, 2013

Drill Sideways faceting with Lucene by Mike McCandless.

From the post:

Lucene’s facet module, as I described previously, provides a powerful implementation of faceted search for Lucene. There’s been a lot of progress recently, including awesome performance gains as measured by the nightly performance tests we run for Lucene:

[3.8X speedup!]


For example, try searching for an LED television at Amazon, and look at the Brand field, seen in the image to the right: this is a multi-select UI, allowing you to select more than one value. When you select a value (check the box or click on the value), your search is filtered as expected, but this time the field does not disappear: it stays where it was, allowing you to then drill sideways on additional values. Much better!

LinkedIn’s faceted search, seen on the left, takes this even further: not only are all fields drill sideways and multi-select, but there is also a text box at the bottom for you to choose a value not shown in the top facet values.

To recap, a single-select field only allows picking one value at a time for filtering, while a multi-select field allows picking more than one. Separately, drilling down means adding a new filter to your search, reducing the number of matching docs. Drilling up means removing an existing filter from your search, expanding the matching documents. Drilling sideways means changing an existing filter in some way, for example picking a different value to filter on (in the single-select case), or adding another or’d value to filter on (in the multi-select case). (images omitted)
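
The recap can be made concrete with plain Python dicts (a toy illustration of the idea, not Lucene’s facet API):

```python
# Hypothetical mini-catalog: each doc carries a brand and a size facet.
docs = [
    {"brand": "Samsung", "size": "40in"},
    {"brand": "Samsung", "size": "50in"},
    {"brand": "LG",      "size": "40in"},
    {"brand": "Sony",    "size": "50in"},
]

def facet_counts(docs, field):
    counts = {}
    for d in docs:
        counts[d[field]] = counts.get(d[field], 0) + 1
    return counts

# Drill down on brand=Samsung: every *other* facet is computed over the
# filtered results, as usual...
matching = [d for d in docs if d["brand"] == "Samsung"]
size_counts = facet_counts(matching, "size")
assert size_counts == {"40in": 1, "50in": 1}

# ...but the brand facet itself "drills sideways": its counts come from
# the docs matching all filters *except* the brand filter, so Samsung's
# siblings stay visible for multi-select.
sideways_counts = facet_counts(docs, "brand")
assert sideways_counts == {"Samsung": 2, "LG": 1, "Sony": 1}
```

The trick Lucene’s DrillSideways class automates is computing both views in a single pass rather than re-running the query once per facet field.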

More details: DrillSideways class being developed under LUCENE-4748.

Just following the progress on Lucene is enough to make you dizzy!

Charging for Your Product is…

Sunday, February 24th, 2013

Charging for Your Product is About 2000 Times More Effective than Relying on Ad Revenue by Bob Warfield.

From the post:

I was reading Gabriel Weinberg’s piece on the depressing math behind consumer-facing apps. He’s talking about conversion rates for folks to actually use such apps and I got to thinking about the additional conversion rate of an ad-based revenue model since he refers to the Facebooks and Twitters of the world. Just for grins, I put together a comparison between the numbers Gabriel uses and the numbers from my bootstrapped company, CNCCookbook. The difference is stark:

Ad-Based Revenue Model:

  Conversion from impression to user: 5%
  Ad clickthrough rate: 0.10%
  Clickthrough revenue: $1.00
  Value of an impression: $0.00005

CNCCookbook (selling a B2B and B2C product):

  Conversion to trial from visitor: 0.50%
  Trial purchase rate: 13%
  Average order size: $152.03
  Value of a visitor: $0.10 (1,976.35 times better)

The table’s point is clear.

I mention this to ask:

What topic map product are you charging for?

In-Q-Tel (IQT)

Sunday, February 24th, 2013

In-Q-Tel (IQT)

From the about page:


Launched in 1999 as an independent, not-for-profit organization, IQT was created to bridge the gap between the technology needs of the U.S. Intelligence Community (IC) and new advances in commercial technology. With limited insight into fast-moving private sector innovation, the IC needed a way to find emerging companies, and, more importantly, to work with them. As a private company with deep ties to the commercial world, we attract and build relationships with technology startups outside the reach of the Intelligence Community. In fact, more than 70 percent of the companies that IQT partners with have never before done business with the government.

As a strategic investor, our model is unique. We make investments in startup companies that have developed commercially-focused technologies that will provide strong, near-term advantages (within 36 months) to the IC mission. We design our strategic investments to accelerate product development and delivery for this ready-soon innovation, and specifically to help companies add capabilities needed by our customers in the Intelligence Community. Additionally, IQT effectively leverages its direct investments by attracting a significant amount of private sector funds, often from top-tier venture capital firms, to co-invest in our portfolio companies. On average, for every dollar that IQT invests in a company, the venture capital community has invested over nine dollars, helping to deliver crucial new capabilities at a lower cost to the government.

Topic maps could offer advantages to an intelligence community, either vis-à-vis other intelligence communities and/or vis-à-vis competitors in the same intelligence community.

A funding source to consider for topic maps in intelligence work.

I first saw this at Beyond Search.


usenet-legend

Sunday, February 24th, 2013

usenet-legend by Zach Beane

From the description:

This is Usenet Legend, an application for producing a searchable archive of an author’s comp.lang.lisp history from Ron Garrett’s large archive dump.

Zach mentions this in his post The Rob Warnock Lisp Usenet Archive but I thought it needed a separate post.

Making content more navigable is always a step in the right direction.

The Rob Warnock Lisp Usenet Archive [Selling Topic Maps]

Sunday, February 24th, 2013

The Rob Warnock Lisp Usenet Archive by Zach Beane.

From the post:

I've been reading and enjoying comp.lang.lisp for over 10 years. I find it important to ignore the noise and seek out material from authors that clearly have something interesting and informative to say.

Rob Warnock has posted neat stuff for many years, both in comp.lang.lisp and comp.lang.scheme. After creating the Erik Naggum archive, Rob was next on my list of authors to archive. It took me a few years, but here it is: the Rob Warnock Lisp Usenet archive. It has 3,265 articles from comp.lang.lisp and comp.lang.scheme from 1995 to 2009, indexed and searchable. I hope it helps you find as many useful articles as I have over the years.

You can imagine my heartbreak when the Erik Naggum archive turned out to be for comp.lang.lisp. 😉

I think Zach’s point, that it is “important to ignore the noise and seek out material from authors that clearly have something interesting and informative to say,” is a clue to the difficulty of selling topic maps.

Who thinks that is important?

If I am being paid by the hour to sort through search engine results, what is my motivation to do it faster/better?

If I am managing hourly workers, who are doing the sorting of search engine results, won’t doing it faster reduce the payroll I manage?

If my department has the manager with hourly workers and the facilities to house them, what is my motivation for faster/better?

If my company/government agency has the department with the manager with hourly workers and facilities under contract, what is my motivation for faster/better?

If that helps identify who has no motivation for topic maps, who should be interested in topic maps?

I first saw this at Christophe Lalanne’s A bag of tweets / February 2013.

Purely Functional Data Structures in Clojure: Leftist Heaps

Sunday, February 24th, 2013

Purely Functional Data Structures in Clojure: Leftist Heaps by Leonardo Borges.

From the post:

Last year I started reading a book called Purely Functional Data Structures. It’s a fascinating book and if you’ve ever wondered how Clojure’s persistent data structures work, it’s mandatory reading.

However, all code samples in the book are written in ML – with Haskell versions in the end of the book. This means I got stuck in Chapter 3, where the ML snippets start.

I had no clue about Haskell’s – much less ML’s! – syntax and I was finding it very difficult to follow along. What I did notice is that their syntaxes are not so different from each other.

So I put the book down and read Learn You a Haskell For Great Good! with the hopes that learning more about Haskell’s syntax – in particular, learning how to read its type signatures – would help me get going with Purely Functional Data Structures.

Luckily, I was right – and I recommend you do the same if you’re not familiar with either of those languages. Learn You a Haskell For Great Good! is a great book and I got a lot out of it. My series on Monads is a product of reading it.

Enough background though.

The purpose of this post is two-fold: One is to share the github repository I created and that will contain the Clojure versions of the data structures in the book as well as most solutions to the exercises – or at least as many as my time-poor life allows me to implement.

The other is to walk you through some of the code and get a discussion going. Hopefully we will all learn something – as I certainly have when implementing these. Today, we’ll start with Leftist Heaps.

This sounds like a wonderful resource in the making!

I first saw this at Christophe Lalanne’s A bag of tweets / February 2013.

Natural Language Meta Processing with Lisp

Sunday, February 24th, 2013

Natural Language Meta Processing with Lisp by Vsevolod Dyomkin.

From the post:

Recently I’ve started work on gathering and assembling a comprehensive suite of NLP tools for Lisp — CL-NLP. Something along the lines of OpenNLP or NLTK. There’s actually quite a lot of NLP tools in Lisp accumulated over the years, but they are scattered over various libraries, internet sites and books. I’d like to have them in one place with a clean and concise API which would provide easy startup point for anyone willing to do some NLP experiments or real work in Lisp. There’s already a couple of NLP libraries, most notably, langutils, but I don’t find them very well structured and also their development isn’t very active. So, I see real value in creating CL-NLP.

Besides, I’m currently reading the NLTK book. I thought that implementing the examples from the book in Lisp could be likewise a great introduction to NLP and to Lisp as it is an introduction to Python. So I’m going to work through them using CL-NLP toolset. I plan to cover 1 or 2 chapters per month. The goal is to implement pretty much everything meaningful, including the graphs — for them I’m going to use gnuplot driven by cgn of which I’ve learned answering questions on StackOverflow. 🙂 I’ll try to realize the examples just from the description — not looking at NLTK code — although, I reckon it will be necessary sometimes if the results won’t match. Also in the process I’m going to discuss different stuff re NLP, Lisp, Python, and NLTK — that’s why there’s “meta” in the title. 🙂

Just in case you haven’t found a self-improvement project for 2013! 😉

Seriously, this could be a real learning experience.

I first saw this at Christophe Lalanne’s A bag of tweets / February 2013.

Text processing (part 2): Inverted Index

Sunday, February 24th, 2013

Text processing (part 2): Inverted Index by Ricky Ho.

From the post:

This is the second part of my text processing series. In this blog, we’ll look into how text documents can be stored in a form that can be easily retrieved by a query. I’ll use the popular open source Apache Lucene index for illustration.

Not only do you get to learn about inverted indexes but some Lucene in the bargain.

That’s not a bad deal!
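
The core idea fits in a few lines of Python (a toy sketch, nothing like Lucene’s actual implementation):

```python
from collections import defaultdict

def build_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND query: intersect the posting sets of every query term.
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = ["Lucene is a search library",
        "An inverted index maps terms to documents",
        "Lucene uses an inverted index"]
index = build_index(docs)
assert search(index, "inverted index") == {1, 2}
assert search(index, "lucene index") == {2}
```

Real engines add the pieces Ho’s post walks through: tokenization, posting-list compression, and scoring, but the term-to-documents inversion above is the heart of it.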

Lucene 4 Finite State Automata In 10 Minutes (Intro & Tutorial)

Sunday, February 24th, 2013

Lucene 4 Finite State Automata In 10 Minutes (Intro & Tutorial) by Doug Turnbull.

From the post:

This article is intended to help you bootstrap your ability to work with Finite State Automata (note automata == plural of automaton). Automata are a unique data structure, requiring a bit of theory to process and understand. Hopefully what’s below can give you a foundation for playing with these fun and useful Lucene data structures!

Motivation, Why Automata?

When working in search, a big part of the job is making sense of loosely-structured text. For example, suppose we have a list of about 1000 valid first names and 100,000 last names. Before ingesting data into a search application, we need to extract first and last names from free-form text.

Unfortunately the data sometimes has full names in the format “LastName, FirstName” like “Turnbull, Doug”. In other places, however, full names are listed “FirstName LastName” like “Doug Turnbull”. Add a few extra representations, and to make sense out of what strings represent valid names becomes a chore.

This becomes especially troublesome when we’re depending on these as natural identifiers for looking up or joining across multiple data sets. Each data set might textually represent the natural identifier in subtly different ways. We want to capture the representations across multiple data sets to ensure our join works properly.

So… Whats a text jockey to do when faced with such annoying inconsistencies?

You might initially think “regular expression”. Sadly, a normal regular expression can’t help in this case. Just trying to write a regular expression that allows a controlled vocabulary of 100k valid last names but nothing else is non-trivial. Not to mention the task of actually using such a regular expression.

But there is one tool that looks promising for solving this problem: Lucene 4.0’s new Automaton API. Let’s explore what this API has to offer by first reminding ourselves about a bit of CS theory.
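
To see why an automaton handles this where a regex struggles, here is a hand-rolled trie-shaped acceptor for a controlled vocabulary (my sketch; Lucene’s Automaton API is far more compact and general):

```python
def build_trie_automaton(vocabulary):
    # States are nested dicts of character transitions; a special key
    # marks accepting states.
    root = {}
    for word in vocabulary:
        state = root
        for ch in word:
            state = state.setdefault(ch, {})
        state["__accept__"] = True
    return root

def accepts(automaton, s):
    state = automaton
    for ch in s:
        if ch not in state:
            return False          # no transition: reject immediately
        state = state[ch]
    return state.get("__accept__", False)

# Accepts exactly the listed last names and nothing else:
last_names = build_trie_automaton(["Turnbull", "Berryman", "Doe"])
assert accepts(last_names, "Turnbull")
assert not accepts(last_names, "Turner")
```

Matching 100k last names this way stays linear in the length of the input string, which is the property that makes automata practical where a 100k-alternative regex is not.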

Are you motivated?

I am!

See John Berryman’s comment about matching patterns of words.

Then think about finding topics, associations and occurrences in free form data.

Or creating a collection of automata as a tool set for building topic maps.

S3G2: A Scalable Structure-Correlated Social Graph Generator

Sunday, February 24th, 2013

S3G2: A Scalable Structure-Correlated Social Graph Generator by Minh-Duc Pham, Peter Boncz, Orri Erling. (The same text you will find at: Selected Topics in Performance Evaluation and Benchmarking Lecture Notes in Computer Science Volume 7755, 2013, pp 156-172. DOI: 10.1007/978-3-642-36727-4_11)


From the abstract:

Benchmarking graph-oriented database workloads and graph-oriented database systems is increasingly becoming relevant in analytical Big Data tasks, such as social network analysis. In graph data, structure is not mainly found inside the nodes, but especially in the way nodes happen to be connected, i.e. structural correlations. Because such structural correlations determine join fan-outs experienced by graph analysis algorithms and graph query executors, they are an essential, yet typically neglected, ingredient of synthetic graph generators. To address this, we present S3G2: a Scalable Structure-correlated Social Graph Generator. This graph generator creates a synthetic social graph, containing non-uniform value distributions and structural correlations, which is intended as test data for scalable graph analysis algorithms and graph database systems. We generalize the problem by decomposing correlated graph generation in multiple passes that each focus on one so-called correlation dimension; each of which can be mapped to a MapReduce task. We show that S3G2 can generate social graphs that (i) share well-known graph connectivity characteristics typically found in real social graphs (ii) contain certain plausible structural correlations that influence the performance of graph analysis algorithms and queries, and (iii) can be quickly generated at huge sizes on common cluster hardware.

You may also want to see the slides.

What a nice way to start the week!


I first saw this at Datanami.

Apache HBase 0.94.5 is out!

Sunday, February 24th, 2013

Apache HBase 0.94.5 is out! by Enis Soztutar.

From the post:

Last week, the HBase community released 0.94.5, which is the most stable release of HBase so far. The release includes 76 jira issues resolved, with 61 bug fixes, 8 improvements, and 2 new features.

Have you upgraded your HBase installation?

Indexing StackOverflow In Solr

Saturday, February 23rd, 2013

Indexing StackOverflow In Solr by John Berryman.

From the post:

One thing I really like about Solr is that it’s super easy to get started. You just download Solr, fire it up, and after following the 10 minute tutorial you’ll have a basic understanding of indexing, updating, searching, faceting, filtering, and generally using Solr. But you’ll soon get bored of playing with the 50 or so demo documents. So quit insulting Solr with this puny, measly, wimpy dataset; index something of significance and watch what Solr can do.

One of the most approachable large datasets is the StackExchange data set which most notably includes all of StackOverflow, but also contains many of the other StackExchange sites (Cooking, English Grammar, Bicycles, Games, etc.) So if StackOverflow is not your cup of tea, there’s bound to be a data set in there that jives more with your interests.

Once you’ve pulled down the data set, then you’re just moments away from having your own SolrExchange index. Simply unzip the dataset that you’re interested in (7-zip format zip files), pull down this git repo which walks you through indexing the data, and finally, just follow the instructions in the

Interesting data set for Solr.

More importantly, a measure of how easy it needs to be to get started with software.

Software like topic maps.


Apache Camel meets Redis

Saturday, February 23rd, 2013

Apache Camel meets Redis by Bilgin Ibryam.

From the post:

The Lamborghini of Key-Value stores

Camel is the best-of-breed integration framework, and in this post I’m going to show you how to make it even more powerful by leveraging another great project – Redis. Camel 2.11 is on its way to be released soon with lots of new features, bug fixes and components. A couple of these new components are authored by me, redis-component being my favourite one. Redis – a light key/value store – is an amazing piece of Italian software designed for speed (same as Lamborghini – a two-seater Italian car designed for speed). Written in C and having an in-memory, close-to-the-metal nature, Redis performs extremely well (Lamborghini’s motto is “Closer to the Road”). Redis is often referred to as a data structure server since keys can contain strings, hashes, lists and sorted sets. A fast and light data structure server is like a super sportscar for software engineers – it just flies. If you want to find out more about Redis’ and Lamborghini’s unique performance characteristics, google around and you will see for yourself.

Idempotent Repository

The term idempotent is used in mathematics to describe a function that produces the same result if it is applied to itself. In messaging, this concept translates into a message that has the same effect whether it is received once or multiple times. In Camel this pattern is implemented using the IdempotentConsumer class, which uses an Expression to calculate a unique message ID string for a given message exchange; this ID can then be looked up in the IdempotentRepository to see if it has been seen before. If it has, the message is consumed (filtered out); if it has not, the message is processed and the ID is added to the repository. RedisIdempotentRepository uses a set structure to store and check for existing IDs.
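
The pattern is easy to mimic outside Camel; a minimal sketch with an in-memory set standing in for the Redis-backed repository (hypothetical code, not Camel’s API):

```python
class IdempotentConsumer:
    def __init__(self, processor):
        self.seen = set()        # stand-in for RedisIdempotentRepository
        self.processor = processor

    def consume(self, message_id, payload):
        # Process each message id at most once, however often the
        # broker redelivers it.
        if message_id in self.seen:
            return False         # duplicate: filtered out
        self.processor(payload)
        self.seen.add(message_id)
        return True

handled = []
consumer = IdempotentConsumer(handled.append)
consumer.consume("msg-1", "hello")
consumer.consume("msg-1", "hello")   # redelivery of the same message
assert handled == ["hello"]
```

Backing the `seen` set with Redis, as the component does, is what lets multiple consumer instances share the same deduplication state.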

If you have or are considering a message passing topic map application, this may be of interest.

U.S. Statutes at Large 1951-2009

Saturday, February 23rd, 2013

GPO is Closing Gap on Public Access to Law at JCP’s Direction, But Much Work Remains by Daniel Schuman.

From the post:

The GPO’s recent electronic publication of all legislation enacted by Congress from 1951-2009 is noteworthy for several reasons. It makes available nearly 40 years of lawmaking that wasn’t previously available online from any official source, narrowing part of a much larger information gap. It meets one of three long-standing directives from Congress’s Joint Committee on Printing regarding public access to important legislative information. And it has published the information in a way that provides a platform for third-party providers to cleverly make use of the information. While more work is still needed to make important legislative information available to the public, this online release is a useful step in the right direction.

Narrowing the Gap

In mid-January 2013, GPO published approximately 32,000 individual documents, along with descriptive metadata, including all bills enacted into law, joint concurrent resolutions that passed both chambers of Congress, and presidential proclamations from 1951-2009. The documents have traditionally been published in print in volumes known as the “Statutes at Large,” which commonly contain all the materials issued during a calendar year.

The Statutes at Large are literally an official source for federal laws and concurrent resolutions passed by Congress. The Statutes at Large are compilations of “slip laws,” bills enacted by both chambers of Congress and signed by the President. By contrast, while many people look to the US Code to find the law, many sections of the Code in actuality are not the “official” law. A special office within the House of Representatives reorganizes the contents of the slip laws thematically into the 50 titles that make up the US Code, but unless that reorganized document (the US Code) is itself passed by Congress and signed into law by the President, it remains an incredibly helpful but ultimately unofficial source for US law. (Only half of the titles of the US Code have been enacted by Congress, and thus have become law themselves.) Moreover, if you want to see the intact text of the legislation as originally passed by Congress — before it’s broken up and scattered throughout the US Code — the place to look is the Statutes at Large.

Policy wonks and trivia experts will have a field day, but the practical value of the Statutes at Large for most readers isn’t apparent to me.

I assume there are cases where the U.S. Code (U.S.C.) and the Statutes at Large disagree. The significance of those discrepancies is unknown.

As with my comments on the SEC Midas program: knowing a law was passed isn’t the same as knowing who benefits from it.

Or who paid for its passage.

Knowing which laws were passed is useful.

Knowing who benefited or who paid, priceless.


MLBase

Saturday, February 23rd, 2013

MLBase by Danny Bickson.

From the post:

Here is an interesting post I got from Ben Lorica of O’Reilly about MLbase:

It is a proof-of-concept machine learning library on top of Spark, with a custom declarative language called MQL.

Slated for release in August, 2013.

Suggest you digest Lorica’s post and the links therein.

Failure By Design

Saturday, February 23rd, 2013

Did you know the Securities and Exchange Commission (SEC) is now collecting 400 gigabytes of market data daily?

Midas [Market Information Data Analytics System], which is costing the SEC $2.5 million a year, captures data such as time, price, trade type and order number on every order posted on national stock exchanges, every cancellation and modification, and every trade execution, including some off-exchange trades. Combined, it adds up to billions of daily records.

So, what’s my complaint?

Midas won’t be able to fill in all of the current holes in SEC’s vision. For example, the SEC won’t be able to see the identities of entities involved in trades and Midas doesn’t look at, for example, futures trades and trades executed outside the system in what are known as “dark pools.” (emphasis added)


The one piece of information that could reveal patterns of insider trading, churning, and a whole host of other securities crimes is simply not collected.

I wonder who would benefit from the SEC not being able to track insider trading, churning, etc.?

People engaged in insider trading, churning, etc. would be my guess.


Maybe someone should ask SEC Chairman Elisse Walter or Gregg Berman (who oversees Midas) whether tracking entities would help with SEC enforcement.

If they agree, then ask why not now?

For that matter, why not open up the data + entities so others can help the SEC with analysis of the data?

These are obvious questions J. Nicholas Hoover should have asked when writing SEC Makes Big Data Push To Analyze Markets.

Philosophy behind YARN Resource Management

Saturday, February 23rd, 2013

Philosophy behind YARN Resource Management by Bikas Saha.

From the post:

YARN is part of the next generation Hadoop cluster compute environment. It creates a generic and flexible resource management framework to administer the compute resources in a Hadoop cluster. The YARN application framework allows multiple applications to negotiate resources for themselves and perform their application specific computations on a shared cluster. Thus, resource allocation lies at the heart of YARN.

YARN ultimately opens up Hadoop to additional compute frameworks, like Tez, so that an application can optimize compute for its specific requirements.

The YARN Resource Manager service is the central controlling authority for resource management and makes allocation decisions. It exposes a Scheduler API that is specifically designed to negotiate resources and not schedule tasks. Applications can request resources at different layers of the cluster topology such as nodes, racks etc. The scheduler determines how much and where to allocate based on resource availability and the configured sharing policy.

If YARN does become the cluster operating system, knowing the “why” of its behavior will be as important as knowing the “how.”
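The quoted description of the Scheduler API — negotiating resources rather than scheduling tasks, with requests expressed at the node, rack, or cluster level — can be sketched as a toy model. Everything below (class names, the allocation policy) is illustrative only, not the actual YARN API:

```python
# Toy model of YARN-style resource negotiation (illustrative only --
# not the real org.apache.hadoop.yarn API).

class ResourceRequest:
    def __init__(self, memory_mb, vcores, locality):
        self.memory_mb = memory_mb
        self.vcores = vcores
        self.locality = locality  # a node name, a rack name, or "*" (anywhere)

class Scheduler:
    """Allocates containers from per-node capacity, honoring locality."""
    def __init__(self, node_capacity_mb, racks):
        self.free_mb = dict(node_capacity_mb)   # node -> free memory
        self.racks = racks                      # node -> rack

    def allocate(self, request):
        # Candidate nodes: the exact node, any node on the rack, or any node.
        if request.locality in self.free_mb:
            candidates = [request.locality]
        elif request.locality == "*":
            candidates = list(self.free_mb)
        else:
            candidates = [n for n, r in self.racks.items() if r == request.locality]
        for node in candidates:
            if self.free_mb[node] >= request.memory_mb:
                self.free_mb[node] -= request.memory_mb
                return node          # container granted on this node
        return None                  # ask again later (resources busy)

sched = Scheduler({"node1": 4096, "node2": 2048}, {"node1": "rack1", "node2": "rack1"})
print(sched.allocate(ResourceRequest(3072, 1, "rack1")))  # node1 has room
print(sched.allocate(ResourceRequest(3072, 1, "*")))      # nothing fits now -> None
```

The design point the post makes survives even in this toy: the application says how much and where it would prefer, while the scheduler decides the actual placement from availability and policy.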

Large-Scale Data Analysis Beyond Map/Reduce

Saturday, February 23rd, 2013

Large-Scale Data Analysis Beyond Map/Reduce by Fabian Hüske.

From the description:

Stratosphere is a joint project by TU Berlin, HU Berlin, and HPI Potsdam and researches “Information Management on the Cloud”. In the course of the project, a massively parallel data processing system is built. The current version of the system consists of the parallel PACT programming model, a database inspired optimizer, and the parallel dataflow processing engine, Nephele. Stratosphere has been released as open source. This talk will focus on the PACT programming model, which is a generalization of Map/Reduce, and show how PACT eases the specification of complex data analysis tasks. At the end of the talk, an overview of Stratosphere’s upcoming release will be given.

In Stratosphere, the parallel programming model is separated from the execution engine (unlike in Hadoop).

Interesting demonstration of the differences between the Hadoop and PACT programming models.
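The sense in which PACT “generalizes” Map/Reduce can be sketched in plain Python: alongside the familiar map and reduce contracts, PACT adds further second-order functions such as match, which pairs records from two inputs by key. The function names below are illustrative, not the actual PACT API:

```python
# Sketch of PACT-style second-order functions ("contracts").
# pact_map and pact_reduce mirror Hadoop; pact_match is one of PACT's
# additional contracts: it pairs records from two inputs sharing a key.
from collections import defaultdict

def pact_map(udf, records):
    # The first-order udf is called once per record.
    return [out for rec in records for out in udf(rec)]

def pact_reduce(udf, records):
    # udf is called once per group of records with the same key.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return [out for key, vals in groups.items() for out in udf(key, vals)]

def pact_match(udf, left, right):
    # udf is called once per pair of records (one from each input)
    # with equal keys -- an equi-join, awkward to express as a single
    # map or reduce without manually tagging which input a record came from.
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [out for key, lval in left
                for rval in index[key]
                for out in udf(key, lval, rval)]

orders = [("alice", "book"), ("bob", "pen")]
cities = [("alice", "Berlin"), ("bob", "Potsdam")]
print(pact_match(lambda k, l, r: [(k, l, r)], orders, cities))
# [('alice', 'book', 'Berlin'), ('bob', 'pen', 'Potsdam')]
```

This is why complex analysis tasks (joins, cross products) are easier to specify in PACT: each becomes one contract with a small user function, rather than a chain of hand-wired Map/Reduce jobs.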

Home: Stratosphere: Above the Clouds

I first saw this at DZone.


Graph Data Sets

Saturday, February 23rd, 2013


While looking for more information on ArangoDB, I stumbled across this collection of graph data sets:

Brief descriptions: ArangoDB-Data

Storing and Traversing Graphs in ArangoDB

Saturday, February 23rd, 2013

Storing and Traversing Graphs in ArangoDB by Frank Celler.


In this session we will use bibliographic data as an example of a large data-set with graph structure. In order to investigate this structure the data is imported into the multi-model database ArangoDB. This database allows you to investigate and access the underlying graph: a query language gives you access to basic path structure. Graph traversals written in JavaScript allow you to explore that graph in-depth. Finally, a library of graph algorithms is available to look for hot-spots and the like.


ArangoDB supports its own query language as well as Gremlin.

Interesting for its use of JavaScript to explore the graph.
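The visitor-style traversal described in the talk can be sketched conceptually in plain Python (this is not ArangoDB’s actual JavaScript API, just the underlying idea: walk the graph and hand each vertex, with its depth, to a user-supplied visitor):

```python
# Conceptual sketch of a visitor-style graph traversal, the idea
# behind ArangoDB's JavaScript traversals (plain Python, not
# ArangoDB's actual API).
from collections import deque

def traverse(edges, start, visitor, max_depth=3):
    """Breadth-first traversal; the visitor sees each vertex with its depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        visitor(vertex, depth)
        if depth == max_depth:
            continue
        for neighbor in edges.get(vertex, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))

# Tiny bibliographic graph: paper -> papers it cites.
citations = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": ["p4"]}
visited = []
traverse(citations, "p1", lambda v, d: visited.append((v, d)))
print(visited)  # [('p1', 0), ('p2', 1), ('p3', 1), ('p4', 2)]
```

Swapping in a different visitor (counting, filtering, collecting paths) changes the analysis without touching the traversal machinery, which is the appeal of the JavaScript-traversal approach.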

BTW, ArangoDB home.

Social Network Analysis [Coursera – March 4, 2013]

Saturday, February 23rd, 2013

Social Network Analysis by Lada Adamic (University of Michigan)


Everything is connected: people, information, events and places, all the more so with the advent of online social media. A practical way of making sense of the tangle of connections is to analyze them as networks. In this course you will learn about the structure and evolution of networks, drawing on knowledge from disciplines as diverse as sociology, mathematics, computer science, economics, and physics. Online interactive demonstrations and hands-on analysis of real-world data sets will focus on a range of tasks: from identifying important nodes in the network, to detecting communities, to tracing information diffusion and opinion formation.

The item on the syllabus that caught my eye:

Ahn et al., and Teng et al.: Learning about cooking from ingredient and flavor networks

On which see:

Flavor network and the principles of food pairing, Yong-Yeol Ahn, Sebastian E. Ahnert, James P. Bagrow & Albert-László Barabási.


Recipe recommendation using ingredient networks, Chun-Yuen Teng, Yu-Ru Lin, and Lada A. Adamic.

Heavier on the technical side than Julia Child reruns, but enjoyable nonetheless.
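The ingredient-network idea behind the Teng et al. paper can be sketched with a toy example: recipes induce a co-occurrence network over ingredients, and simple network statistics fall out of it. The recipes below are made up for illustration, not the paper’s data:

```python
# Toy ingredient co-occurrence network, in the spirit of Teng et al.
# (illustrative recipes, not the paper's data set).
from itertools import combinations
from collections import Counter

recipes = [
    {"flour", "butter", "sugar", "egg"},
    {"flour", "butter", "egg", "milk"},
    {"tomato", "garlic", "basil", "olive oil"},
    {"tomato", "garlic", "olive oil", "pasta"},
]

# Edge weight = number of recipes in which two ingredients co-occur.
cooccur = Counter()
for recipe in recipes:
    for a, b in combinations(sorted(recipe), 2):
        cooccur[(a, b)] += 1

# Degree of an ingredient = number of distinct co-occurrence partners.
degree = Counter()
for (a, b), weight in cooccur.items():
    degree[a] += 1
    degree[b] += 1

print(cooccur[("garlic", "tomato")])  # 2 (they co-occur in two recipes)
print(degree.most_common(2))          # the best-connected ingredients
```

On real recipe corpora the same construction yields the networks the papers analyze: edge weights drive pairing recommendations, and high-degree ingredients are the versatile staples.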

NICCS National Initiative For Cybersecurity Careers And Studies

Friday, February 22nd, 2013

NICCS National Initiative For Cybersecurity Careers And Studies

From the Education and Training Catalog Search page:

The NICCS interactive Training Catalog is currently under development. The Training Catalog will provide a robust listing of all the cybersecurity or cybersecurity-related education and training courses offered in the U.S. NICCS will not teach the courses, but rather will provide a central resource to help people find the information on courses they want or need.

The Training Catalog will allow users to search for education and training courses based on a variety of criteria including: location, modality, and Framework Specialty Area. To get an idea of what the Training Catalog will be like, click on the “Try our Demo” tab.

Here is where we need your help!

To make the Training Catalog a useful resource, we need your help! We are asking education and training providers to submit a list of their courses, mapped to the Framework, which can be included in our database.

Another U.S. government response to cybersecurity issues.

An advertising campaign for private training vendors, if they contribute the content.

As Dogbert once observed: “There are times when no snide comment seems adequate.”

PS: A number of U.S. agencies make major contributions on cybersecurity. None of them are mentioned on this page.