Archive for October, 2011

Inquiry: Algebraic Geometry and Topology

Monday, October 31st, 2011

Inquiry: Algebraic Geometry and Topology

Speaking of money and such matters, a call for assistance from Quantivity:

Algebraic geometry and topology traditionally focused on fairly pure math considerations. With the rise of high-dimensional machine learning, these fields are increasingly being pulled into interesting computational applications such as manifold learning. Algebraic statistics and information geometry offer potential to help bridge these fields with modern statistics, especially time-series and random matrices.

Early evidence suggests potential for significant intellectual cross-fertilization with finance, both mathematical and computational. Geometrically, richer modeling and analysis of latent geometric structure than available from classic linear algebraic decomposition (e.g. PCA, one of the main workhorses of modern $\mathbb{P}$ finance); for example, cumulant component analysis. Topologically, more effective qualitative analysis of data sampled from manifolds or singular algebraic varieties; for example, persistent homology (see CompTop).

As evidenced by Twitter followers, numerous Quantivity readers are experts in these fields. Thus, perhaps the best way to explore is to seek insight from readers.

Readers: please use comments to suggest applied literature from these fields; ideally, although not required, that of potential relevance to finance modeling. All types of literature are requested, from intro texts to survey articles to preprint working papers on specific applications.

These suggestions will be synthesized into one or more subsequent posts, along with appropriate additions to People of Quant Research.

If you or a member of your family knows of any relevant resources, please go to: Inquiry: Algebraic Geometry and Topology and volunteer those resources as comments. You might even make the People of Quant Research list!

I wonder if there would be any interest in tracking bundled instruments using topic maps? That could be an interesting question.

Venture Capital (VC) Blog Directory – 2011 Edition

Monday, October 31st, 2011

Venture Capital (VC) Blog Directory – 2011 Edition

From Larry Cheng’s blog “Thinking about Thinking:”

This is the 4th edition of the Venture Capital Blog Directory (1st edition, 2nd edition, 3rd edition). This directory includes 149 venture capital, microVC/seed, and growth equity blogs. The imperfect statistic used to rank these blogs is their average monthly uniques in Q410 from Compete (more methodology info below). Blogs that have seen increased traffic over Q409 by 1,000+ uniques/month are highlighted in bold. There is an additional list below of VC blogs that had insufficient Compete data. To subscribe to the top 15 VC blogs through Google Reader, click here: Top 15 VC Blogs. As always, if there is any information missing or incorrect, please leave it in the comment field. Many thanks to my colleagues at Volition Capital for their assistance with this directory – we hope it’s a useful service for everyone.

Topic maps are a lot of fun to write and write about but end of the day, we all have to make money from their use as well.

Thought this would be of interest to start-ups using topic maps (or technology start-ups generally).

Best of luck!

PS: Speaking of making money, if you are lucky and need a standards editor/topic map theorist, you know where to find me. 😉

Term Extraction and Image Labeling with Python and topia.termextract

Monday, October 31st, 2011

Term Extraction and Image Labeling with Python and topia.termextract

From the post:

Here’s the question I want to answer: given an image and some related text, can I automatically find a subset of phrases in the text that describe the image?

An amusing description of the use of topia.termextract for term extraction/image labeling.
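The post relies on topia.termextract for the phrase-extraction step. As a rough stdlib-only approximation of the same idea (candidate terms found by capitalization runs and repetition; the real library does part-of-speech tagging and is much smarter), here is a sketch:

```python
import re
from collections import Counter

def extract_terms(text, min_occurrences=1):
    """Crude term extraction: treat runs of capitalized words as
    candidate multi-word terms, plus any capitalized word that repeats."""
    # Multi-word candidates: consecutive capitalized words ("Susan Coffey").
    phrases = re.findall(r"(?:[A-Z][a-z]+\s+)+[A-Z][a-z]+", text)
    counts = Counter(p.strip() for p in phrases)
    # Single-word candidates that repeat often enough.
    words = re.findall(r"[A-Za-z]{4,}", text)
    for word, n in Counter(words).items():
        if n > min_occurrences and word[0].isupper():
            counts[word] += n
    return counts.most_common()

terms = extract_terms("Susan Coffey posed for a photo. Susan Coffey is a model.")
```

Ranking the extracted terms against phrases that co-occur with the image caption would then give candidate labels, which is essentially the matching step the post describes.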

To be honest, even after being told the image is of Susan Coffey, an alleged celebrity, I still don’t know who it is. I haven’t seen her at Balisage so I assume she isn’t a markup person. Suggestions?

Lessons learned from bountying bugs

Monday, October 31st, 2011

Lessons learned from bountying bugs

From the post:

A bit over a week ago, I wrote here about the $1265 of Tarsnap bugs fixed as a result of the Tarsnap bug bounty which I’ve been running since April. I was impressed by the amount of traffic that post received — over 82000 hits so far — as well as the number of people who said they were considering following my example. For them and anyone else interested in “crowdsourcing” bug hunting, here are some of the lessons I learned over the past few months.

I suppose next to writing documentation, debugging is the least attractive job of all.

Curious if anyone has used a bountying bugs approach with their topic maps?

Seems like it would be a useful technique, particularly with large topic maps spread across organizations, where users could get some sort of reward for reporting errors.

I suppose it isn’t a really big step to giving at least some users the ability to suggest new topics or merges.

I think it is important to remember that our users are people first and users of topic maps and/or software second. We all like to get rewards, even small ones.

It occurs to me that if a topic map is only one of many resources a user could consult, and the plan is to move users toward relying on the topic map, then offering incentives for use of the topic map would be an extension of the reward idea.

Graph Words: A Free Visual Thesaurus of the English Language

Monday, October 31st, 2011

Graph Words: A Free Visual Thesaurus of the English Language

From the post:

One of the very first examples of visualization that succeeds in merging beauty with function is Visual Thesaurus, a subscription-based online thesaurus and dictionary that shows the relationships between words through a beautiful interactive map.

The idea behind Graph Words is quite similar, though the service can be used completely free of charge.

Based on the knowledge captured in WordNet, a large lexical database of the English language, Graph Words is an interactive English dictionary and thesaurus that helps one find the meanings of words by revealing their connections among associated words. Any resulting network graph can also be stored as images.

I particularly liked “…helps one find the meanings of words by revealing their connections among associated words.”

I would argue that words only have meaning in the context of associated words. The unfortunate invention of the modern dictionary falsely portrays words as being independent of their context.
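That view of meaning-through-association is easy to experiment with: a word's "connections" are just a breadth-first walk over an adjacency structure, which is roughly what Graph Words draws. A toy sketch with hypothetical data (WordNet itself ships far richer relation types):

```python
from collections import deque

# A toy WordNet-style association graph (hypothetical data).
ASSOCIATIONS = {
    "bank": ["river", "money", "shore"],
    "money": ["bank", "currency", "wealth"],
    "river": ["bank", "stream", "water"],
    "currency": ["money", "exchange"],
}

def connections(word, depth=2):
    """Breadth-first walk: the words reachable within `depth` hops,
    i.e. the context of associated words that hints at the word's senses."""
    seen, queue = {word}, deque([(word, 0)])
    found = []
    while queue:
        current, d = queue.popleft()
        if d == depth:
            continue
        for neighbor in ASSOCIATIONS.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, d + 1))
    return found
```

With depth 1 you get the immediate senses of "bank" (river bank vs. money bank); with depth 2 the neighborhoods start to disambiguate them.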

The standard Arabic dictionary, Lisan al-‘Arab (roughly, “The Arab Tongue”), was reported by my Arabic professor to be very difficult to use because the entries consisted of poetry and prose selections that illustrated the use of words in context. You have to be conversant to use the dictionary but that would be one of the reasons for using it, to become conversant. 😉 Both Lisan al-‘Arab (about 20,000 pages) and Lane’s Arabic-English Lexicon (about 8,000+ pages) are online now.

Using StackOverflow’s API to Find the Top Web Frameworks

Monday, October 31st, 2011

Using StackOverflow’s API to Find the Top Web Frameworks by Bryce Boe.

From the post:

Adam and I are currently in the process of working on our research about the Execution After Redirect, or EAR, Vulnerability which I previously discussed in my blog post about the 2010 iCTF. While Adam is working on a static analyzer to detect EARs in ruby on rails projects, I am testing how simple it is for a developer to introduce an EAR vulnerability in several popular web frameworks. In order to do that, I first needed to come up with a mostly unbiased list of popular web frameworks.

My first thought was to perform a search on the top web frameworks hoping that the information I seek may already be available. This search provided a few interesting results, such as the site, Best Web-Frameworks as well as the page Framework Usage Statistics by the group BuiltWith. The Best Web-Frameworks page lists and compares various web frameworks by language, however it offers no means to compare the adoption of each. The Framework Usage Statistics page caught my eye as its usage statistics are generated by crawling and fingerprinting various websites in order to determine what frameworks are in use. Their fingerprinting technique, however, is too generic in some cases thus resulting in the labeling of languages like php and perl as frameworks. While these results were a step in the right direction, what I was really hoping to find was a list of top web frameworks that follow the model, view, controller, or MVC, architecture.

After a bit more consideration I realized it wouldn’t be very simple to get a list of frameworks by usage, thus I had to consider alternative metrics. I thought how I could measure the popularity of the framework by either the number of developers using or at least interested in the framework. It was this train of thought that led me to both Google Trends and StackOverflow. Google Trends allows one to perform a direct comparison of various search queries over time, such as ruby on rails compared to python. The problem, as evidenced by the former link, is that some of the search queries don’t directly apply to the web framework; in this case not all the people searching for django are looking for the web framework. Because of this problem, I decided a more direct approach was needed.

StackOverflow is a website geared towards developers where they can go to ask questions about various programming languages, development environments, algorithms, and, yes, even web frameworks. When someone asks a question, they can add tags to the question to help guide it to the right community. Thus if I had a question about redirects in ruby on rails, I might add the tag ruby-on-rails. Furthermore, if I was interested in questions other people had about ruby on rails I might follow the ruby-on-rails tag.

Bryce’s use of the StackOverflow API is likely to interest anyone creating topic maps on CS topics. Not to mention that his use of graphs for visualization is interesting as well.
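Bryce's own scripts aren't reproduced here, but the core idea, ranking frameworks by the question counts of their tags, can be sketched. The endpoint shape below follows the current Stack Exchange 2.x API and is my assumption, not his code (and note the real API gzip-compresses responses, which this sketch does not handle):

```python
import json
from urllib.request import urlopen

FRAMEWORK_TAGS = ["ruby-on-rails", "django", "asp.net-mvc", "spring-mvc"]

def rank_frameworks(tag_counts):
    """Rank frameworks by question-tag count, descending."""
    return sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)

def fetch_tag_counts(tags):
    """Query the Stack Exchange API for per-tag question counts.
    (Endpoint shape is an assumption; gzip handling omitted.)"""
    url = ("https://api.stackexchange.com/2.3/tags/"
           + ";".join(tags) + "/info?site=stackoverflow")
    with urlopen(url) as resp:
        items = json.load(resp)["items"]
    return {item["name"]: item["count"] for item in items}

# Offline example with made-up counts, to show the ranking step:
sample = {"ruby-on-rails": 180000, "django": 150000, "spring-mvc": 40000}
ranking = rank_frameworks(sample)
```

The same two-step shape (fetch counts, then sort) works for any tag-based popularity question, not just web frameworks.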

Statement of Disbursements

Monday, October 31st, 2011

Statement of Disbursements – United States House of Representatives.

From the Introduction:

The Statement of Disbursements (SOD) is a quarterly public report of all receipts and expenditures for U.S. House of Representatives Members, Committees, Leadership, Officers and Offices. The House has been required by law to publish the SOD since 1964.

The Chief Administrative Officer of the House publishes the SOD within 60 days of the end of each calendar year quarter (January–March, April–June, July–September and October–December).

Since 2009 the SOD has been published online to increase governmental transparency and accountability.

As a result of a new House financial system, all SODs from the fourth quarter of 2010 onward will display new transaction codes while maintaining the same data transparency as before. These codes (AP for Accounts Payable; AR for Accounts Receivable and GL for General Ledger) will replace all previously used SOD transaction codes.

Later in the Introduction it is noted:

Because of the procedural complexity inherent in balancing hundreds of Congressional budgets, the SOD document is not easy to read.

Well, that’s certainly true.

What would you need to make this a meaningful document?

Who are these salaried employees and how are they related to campaign contributors?

We pay for phones and cellphones, so where are the calling records to and from those phones?

Where are the calendars of the House members so expenses for meetings with particular lobbyists or others can be matched to the expense record?

As it is, we have a document that shows bakery expenses and cab fares mixed in with larger and smaller expenses.

Still, I suppose it is better than nothing at all. But only just.

What other public data would you match up with these expense documents to uncover patterns of behavior?

Accessing Chicago, Cook and Illinois Open Data via PHP

Monday, October 31st, 2011

Accessing Chicago, Cook and Illinois Open Data via PHP by Paul Weinstein.

From the post:

In Accessing the CTA’s API with PHP I outlined a class file I created [footnote omitted] for querying the Chicago Transit Authority’s three web-based application programming interfaces (APIs). However, that isn’t the only open data development I’ve been working on recently, I’ve also been working on a class file for accessing the City of Chicago’s Open Data Portal.

The City of Chicago’s Open Data Portal is driven by a web application developed by Socrata. Socrata’s platform provides a number of web-based API methods for retrieving published datasets [footnote omitted].

class.windy.php provides the definition for a PHP object that wraps around Socrata’s API providing PHP written apps access in turn to the city’s open data.
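For readers not working in PHP, here is a rough Python analogue of what class.windy.php wraps. The `/resource/{id}.json` pattern is Socrata's documented SODA endpoint; the dataset id below is only an example:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

class OpenDataPortal:
    """Minimal Python analogue of class.windy.php: builds SODA
    (Socrata Open Data API) request URLs for a portal host."""

    def __init__(self, host="data.cityofchicago.org"):
        self.host = host

    def url_for(self, dataset_id, **params):
        query = "?" + urlencode(params) if params else ""
        return f"https://{self.host}/resource/{dataset_id}.json{query}"

    def fetch(self, dataset_id, **params):
        # Network call; dataset ids are found by browsing the portal itself.
        with urlopen(self.url_for(dataset_id, **params)) as resp:
            return json.load(resp)

portal = OpenDataPortal()
url = portal.url_for("xzkq-xp2w", **{"$limit": 5})  # example dataset id
```

Because Cook County and the State of Illinois also run Socrata portals, the same class works against them by swapping the host.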

The general trend seems to be more access to more government data and I suspect the 2012 elections will only speed up that trend.

The question is: How valuable is the government data that is posted? Or to put it more bluntly: Is the data that is released simply noise to cover up more interesting data?

How would you use a topic map to try to answer that question? Can you think of a way to use topic maps to uncover “missing” associations?


streamgraph.js

Monday, October 31st, 2011

streamgraph.js by Jason Sundram.

From the post:

Streamgraphs are cool. They’re great at displaying trends in data over time, similar to a stacked graph, but much prettier. The first one I saw was Lee Byron’s listening history graphic, a beautiful poster showing trends in the music he had listened to over the course of two years. The New York Times used an interactive streamgraph (created by Matthew Bloch and Shawn Carter) to great effect to show box office receipts over 22 years in The Ebb and Flow of Movies.

When Lee Byron & Martin Wattenberg open-sourced their streamgraph implementation in Java and Processing, I was pretty excited. I’ve been playing around with processing.js a lot lately, so I decided to port their code.

I’ve posted the code on github. The algorithms are much the same as Byron & Wattenberg’s, but I’ve added code to make the graphs more interactive and easily configurable.
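The layout math behind a streamgraph is small: pick a baseline, then stack each layer's thickness on top of the previous one. A sketch of the simple symmetric ("silhouette"/ThemeRiver-style) baseline from Byron and Wattenberg's paper (my sketch, not Jason's port):

```python
def stack_layers(layers):
    """Streamgraph layout with the 'silhouette' baseline: at each x
    position the total stack is centred on zero, then layers are
    stacked upward from that baseline."""
    n_points = len(layers[0])
    # Baseline = minus half the total thickness at each x position.
    baselines = [-0.5 * sum(layer[i] for layer in layers)
                 for i in range(n_points)]
    stacked = []
    for layer in layers:
        bottom = baselines[:]
        top = [b + v for b, v in zip(bottom, layer)]
        stacked.append((bottom, top))  # one band per layer
        baselines = top
    return stacked

bands = stack_layers([[2, 4], [2, 4]])
```

Byron and Wattenberg's "wiggle-minimizing" baseline replaces the `-0.5 * sum` term with a weighted version, but the stacking step is identical.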

Another visualization tool never hurt anyone.

And as Jason says, these graphs are “prettier” than some alternatives. I will let someone else be the judge of the aesthetics, but different visualizations work for different data sets and different people.

Statistics with R

Monday, October 31st, 2011

Statistics with R

From the webpage:

Here are the notes I took while discovering and using the statistical environment R. However, I do not claim any competence in the domains I tackle: I hope you will find those notes useful, but keep your eyes open — errors and bad advice are still lurking in those pages…

Another statistics-with-R guide. If you have used more than one guide for learning R, I would appreciate your comparisons/comments.

A new interface for IUCAT: Blacklight

Sunday, October 30th, 2011

A new interface for IUCAT: Blacklight by Hyeran Kang.

From the post:

As you may have heard, work has begun on a new interface for IUCAT. The IU Libraries OLE Discovery Layer Implementation Task Force (DLITF) will be overseeing the implementation of a new discovery layer, powered by Blacklight, to overlay our current SirsiDynix system. Development work is going on during this fall semester and a public Beta will be launched in spring 2012. This is a good time to share some background information around the new discovery interface, Blacklight.

What is Blacklight?

Blacklight is a free and open source OPAC (Online Public Access Catalog) solution developed at the University of Virginia (UVA) Library; check the project site for detailed information. While some OSS (Open Source Software) systems, such as Evergreen and Koha, were developed to replace a library’s entire ILS (Integrated Library System), Blacklight has been designed to work with a library’s current ILS to assist in reengineering the library’s searching tools. It uses Apache Solr for indexing and searching records and Ruby on Rails for its front end.

If Solr or Ruby on Rails is on your “to be learned” list, you might want to bump one or both up a notch or two and consider contributing to the Blacklight project.

How to beat the CAP theorem

Sunday, October 30th, 2011

How to beat the CAP theorem by Nathan Marz.

After the Storm video, I ran across this post by Nathan and just had to add it as well!

From the post:

The CAP theorem states a database cannot guarantee consistency, availability, and partition-tolerance at the same time. But you can’t sacrifice partition-tolerance (see here and here), so you must make a tradeoff between availability and consistency. Managing this tradeoff is a central focus of the NoSQL movement.

Consistency means that after you do a successful write, future reads will always take that write into account. Availability means that you can always read and write to the system. During a partition, you can only have one of these properties.

Systems that choose consistency over availability have to deal with some awkward issues. What do you do when the database isn’t available? You can try buffering writes for later, but you risk losing those writes if you lose the machine with the buffer. Also, buffering writes can be a form of inconsistency because a client thinks a write has succeeded but the write isn’t in the database yet. Alternatively, you can return errors back to the client when the database is unavailable. But if you’ve ever used a product that told you to “try again later”, you know how aggravating this can be.

The other option is choosing availability over consistency. The best consistency guarantee these systems can provide is “eventual consistency”. If you use an eventually consistent database, then sometimes you’ll read a different result than you just wrote. Sometimes multiple readers reading the same key at the same time will get different results. Updates may not propagate to all replicas of a value, so you end up with some replicas getting some updates and other replicas getting different updates. It is up to you to repair the value once you detect that the values have diverged. This requires tracing back the history using vector clocks and merging the updates together (called “read repair”).

I believe that maintaining eventual consistency in the application layer is too heavy of a burden for developers. Read-repair code is extremely susceptible to developer error; if and when you make a mistake, faulty read-repairs will introduce irreversible corruption into the database.

So sacrificing availability is problematic and eventual consistency is too complex to reasonably build applications. Yet these are the only two options, so it seems like I’m saying that you’re damned if you do and damned if you don’t. The CAP theorem is a fact of nature, so what alternative can there possibly be?
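The "vector clocks" Nathan mentions for read repair are worth seeing concretely: each replica keeps a counter per node, and two versions conflict exactly when neither clock dominates the other. A minimal illustration (mine, not from Nathan's post):

```python
def merge_clocks(a, b):
    """Element-wise max of two vector clocks (dicts node -> counter):
    the smallest clock that has seen every event both have seen."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in set(a) | set(b)}

def descends(a, b):
    """True if clock a has seen every event clock b has (a >= b)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    """Neither replica's history contains the other: exactly the
    conflict case application-level read-repair code must resolve."""
    return not descends(a, b) and not descends(b, a)
```

Detecting the conflict is the easy part; the error-prone burden Nathan objects to is writing the application-specific merge of the two divergent values.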

Nathan finds a way and it is as clever as his coding for Storm.

Take your time and read slowly. See what you think. Comments welcome!

Storm: Distributed and Fault-tolerant Real-time Computation

Sunday, October 30th, 2011

Storm: Distributed and Fault-tolerant Real-time Computation by Nathan Marz.


Nathan Marz explains Storm, a distributed, fault-tolerant and real-time computational system currently used by Twitter to keep statistics on user clicks for every URL and domain.

A presentation that makes you want a cluster for Christmas!

Curious if processing by subject identifiers would be one way to harness Storm for processing a topic map? Not a general solution, but then I am not sure general solutions are all that interesting.

A topic map for a library, for instance, could be configured to merge topics that match on subject identifiers for items held in the catalog and to reject other input. No doubt the other input may be of interest to others, but there is no topic map requirement that any topic map application accept all input.

Some quick Storm links:

Storm Deploy – one click to deploy on AWS

Storm Starter – sample code

Storm Wiki

Revised Report on the Propagator Model

Sunday, October 30th, 2011

Revised Report on the Propagator Model by Alexey Radul and Gerald Jay Sussman.

After I watched Sussman’s We Really Don’t Know How To Compute!, I just had to track down more information on the propagator model!

From the report:

In the past year we have made serious progress on elaborating the propagator programming model [2,3]. Things have gotten serious enough to build a system that can be used for real experiments.

The most important problem facing a programmer is the revision of an existing program to extend it for some new situation. Unfortunately, the traditional models of programming provide little support for this activity. The programmer often finds that commitments made in the existing code impede the extension, but the costs of reversing those commitments are excessive.

Such commitments tend to take the form of choices of strategy. In the design of any significant system there are many implementation plans proposed for every component at every level of detail. However, in the system that is finally delivered this diversity of plans is lost and usually only one unified plan is adopted and implemented. As in an ecological system, the loss of diversity in the traditional engineering process has serious consequences.

The Propagator Programming Model is an attempt to mitigate this problem. It is a model that supports the expression and integration of multiple viewpoints on a design. It incorporates explicit structure to support the integration of redundant pieces and subsystems that solve problems in several different ways. It will help us integrate the diversity that was inherent in the design process into the delivered operational product.

The Propagator Programming Model is built on the idea that the basic computational elements are autonomous machines interconnected by shared cells through which they communicate. Each machine continuously examines the cells it is interested in, and adds information to some based on computations it can make from information from the others. Cells accumulate information from the propagators that produce that information. The key idea here is additivity. New ways to make contributions can be added just by adding new propagators; if an approach to a problem doesn’t turn out to work well, it can be identified by its premises and ignored, dynamically and without disruption.

This work was supported in part by the MIT Mind Machine Project.
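The cell-and-propagator idea in that last quoted paragraph can be sketched in a few lines (a toy of mine, not Radul and Sussman's Scheme system, which handles partial information and premises far more carefully):

```python
class Cell:
    """A cell accumulates information; here, just a single value that
    may not be contradicted once known."""
    def __init__(self):
        self.value = None
        self.watchers = []

    def add_content(self, value):
        if value is None or value == self.value:
            return  # nothing new
        if self.value is not None:
            raise ValueError("contradiction")
        self.value = value
        for run in self.watchers:  # wake interested propagators
            run()

def propagator(inputs, output, fn):
    """Attach an autonomous machine: whenever an input cell gains a
    value, try to compute and contribute to the output cell."""
    def run():
        if all(c.value is not None for c in inputs):
            output.add_content(fn(*[c.value for c in inputs]))
    for cell in inputs:
        cell.watchers.append(run)
    run()

# Additivity demo: fahrenheit computed from celsius.
c, f = Cell(), Cell()
propagator([c], f, lambda x: x * 9 / 5 + 32)
c.add_content(100)
```

Additivity shows up in that a second propagator writing the same answer into `f` is harmless, while a disagreeing one surfaces as an explicit contradiction rather than silent overwriting.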

I really like the idea of additivity. I think it has long legs in a world characterized by diversity. That is to say, a world outside most programming paradigms, which presume a syntax, semantics and inviolate rules, the very opposite of diversity.

Spring Data Neo4j

Sunday, October 30th, 2011

Spring Data Neo4j

From the webpage:

Spring Data Neo4j enables POJO based development for graph databases like Neo4j. It extends annotated entity classes with transparent mapping functionality. Spring Data Neo4j is part of the bigger Spring Data project which aims to provide convenient support for NOSQL databases.

Examples, a webinar, all the community friendly stuff you have come to expect from Neo4j!

Computer Science Teachers Association

Sunday, October 30th, 2011

Computer Science Teachers Association

From the website:

The Computer Science Teachers Association is a membership organization that supports and promotes the teaching of computer science and other computing disciplines. CSTA provides opportunities for K-12 teachers and students to better understand the computing disciplines and to more successfully prepare themselves to teach and learn.

I suspect that the issues that face teachers in more formal classroom settings are largely the same ones that face us when we try to teach topic maps to users. As a matter of fact, allowing for age/gender/culture adaptations, I would venture to say that the basic teaching techniques remain largely the same over a lifetime.

I can remember very enthusiastic teachers who had great examples that got kids (myself included) interested in literature, math, science, etc., and I can remember those who were putting in the hours until the end of the school day. I saw the same techniques, with some dressing up because we (allegedly) become more “serious” as we get older, in college and a couple of rounds of graduate school.

Not that we have to invent silly techniques to teach topic maps, but having a few of those isn’t going to hurt anyone or detract from the “gravitas” of the paradigm.

BTW, the price is right for the Computer Science Teachers Association, it’s free! Who knows? You might learn something and perhaps get better at learning with others (what else would teaching be?).

Easy as Pie? – Teaching Code Literacy

Sunday, October 30th, 2011

Easy as Pie? – Teaching Code Literacy by Sarah Allen.

A very entertaining presentation on teaching programming to children.

One of its key points was the need for immediate gratification. (I suspect that is the case for adults as well, but don’t tell anyone.)

The presentation made me think that one of the barriers to teaching topic maps (under whatever guise or name) is delayed gratification.

That is, it is all well and good to talk about the issues that interest us as topic map specialists, but users are really more interested in results that are of interest to them.

I don’t have a specific game or scenario in mind but wanted to point out this presentation as a starting point for discussion of subject-centric gaming.

Your suggestions and comments are always welcome but especially here. I don’t know what interests/motivates other adults, much less children.

PS: Sarah mentions that the “computer science” classes in SF are teaching Word and PowerPoint. Says “…having a class in Word is like having a class in pencil.” Thought you would appreciate that. 😉

From the description:

Sarah Allen talks on how to introduce children to the basics of programming, presenting a new related language called “Pie” along with lessons learned from creating a DSL in Ruby.

Experimenting with Real-Time Search using Sphinx

Sunday, October 30th, 2011

Experimenting with Real-Time Search using Sphinx by Jeremy Zawodny.

From the post:

In the last few weeks I’ve been experimenting with using real-time indexes in Sphinx to allow full-text searching of data recently added to craigslist. While this posting is not intended to be a comprehensive guide to what I’ve implemented (that may come later in the form of a few posts and/or a conference talk) or a tutorial for doing it yourself, I have had a few questions about it and learned a few things along the way.

Will help you evaluate Sphinx for your indexing needs.
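For orientation, a real-time index in Sphinx is declared directly in sphinx.conf, with fields and attributes fixed up front, and is filled via SphinxQL INSERTs rather than by batch indexer runs. A minimal fragment (index name, paths and field names here are illustrative, not from Jeremy's setup) might look like:

```
# In sphinx.conf: a real-time index is declared up front, with its
# full-text fields and attributes fixed at configuration time.
index rt_listings
{
    type         = rt
    path         = /var/lib/sphinx/rt_listings   # illustrative path
    rt_field     = title
    rt_field     = body
    rt_attr_uint = posted_at
    rt_mem_limit = 512M
}

# Rows are then pushed in over the SphinxQL protocol (MySQL wire
# protocol, typically port 9306) instead of running the batch indexer:
#   INSERT INTO rt_listings (id, title, body, posted_at)
#   VALUES (1, 'bike for sale', 'barely used', 1320000000);
```

The tradeoffs Jeremy explores (memory limits, merge behavior, what happens on restart) all hang off these few settings.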

Case study with Data blogs, from 300 to 1000

Sunday, October 30th, 2011

Case study with Data blogs, from 300 to 1000

From the post:

We looked at how we could help increase the size of a list of Top300 Data blogs fast.

Initial idea from Marshall Kirkpatrick.

Project turnaround: 1 day. Here is the initial list of 300 bloggers:

Great study in how to take an initial data set and expand it quickly.

Data: Making a List of the Top 300 Blogs about Data, Who Did We Miss?

Sunday, October 30th, 2011

Data: Making a List of the Top 300 Blogs about Data, Who Did We Miss? by Marshall Kirkpatrick.

From the post:

Dear friends and neighbors, as part of my ongoing practice of using robots and algorithms to make grandiose claims about topics I know too little about, I have enlisted a small army of said implements of journalistic danger to assemble the above collection of blogs about data. I used a variety of methods to build the first half of the list, then scraped all the suggestions from this Quora discussion to flesh out the second half. Want to see if your blog is on this list? Control-F and search for its name or URL and your browser will find it if it’s there.

Why data? Because we live in a time when the amount of data being produced is exploding and it presents incredible opportunities for software developers and data analysts. Opportunities to build new products and services, but also to discover patterns. Those patterns will represent further opportunities for innovation, or they’ll illuminate injustices, or they’ll simply delight us with a greater sense of self-awareness than we had before. (I was honored to have some of my thoughts on data as a platform cited in this recent Slate write-up on the topic, if you’re interested in a broader discussion.) Data is good, and these are the leading people I’ve found online who are blogging about it.

A bit dated now but instructive for the process of mining and then ranking the blogs. There are any number of subject areas that await similar treatment.

  1. What subject area would interest you enough to collect the top 100 or 300 blogs?
  2. Would collecting and ranking be enough to be useful? For what purposes? Where would that fail?
  3. How would you envision topic maps making a difference for such a collection of blogs?

Printer Dots, Pervasive Tracking and the Transparent Society

Saturday, October 29th, 2011

Printer Dots, Pervasive Tracking and the Transparent Society

From the post:

So far in the fingerprinting series, we’ve seen how a variety of objects and physical devices [1, 2, 3, 4], often even supposedly identical ones, can be uniquely fingerprinted. This article is non-technical; it is an opinion on some philosophical questions about tracking and surveillance.

Here’s a fascinating example of tracking that’s all around you but that you’re probably unaware of:

Color laser printers and photocopiers print small yellow dots on every page for tracking purposes.

My source for this is the EFF’s Seth Schoen, who has made his presentation on the subject available.

If you are tracking the provenance of data in your topic map, does that mean that you are also tracking the users who submitted it?

And is that tracking “transparent” to the users who are being tracked or only “transparent” to the aggregators of that submitted content?

Or is that tracking only “transparent” to the sysops who are working one level above the aggregators?

And at any level, what techniques would you suggest for tracking data, whether transparent or not?

For that matter, what techniques would you suggest for detecting tracking?

Ironic that “transparent” government may require that some (all?) of its citizens and their transactions be “transparent” as well. How else to track the web of influence of lobbyists of various sorts if their transactions are not “transparent?” Which then means their employers will need to be “transparent.” Along with the objects of their largesse and attention. And the web of actors just keeps spreading out.

Visualizing Facebook Friends: Eye-Candy in R

Saturday, October 29th, 2011

Visualizing Facebook Friends: Eye-Candy in R by Paul Butler.

From the post:

Earlier this week I published a data visualization on the Facebook Engineering blog which, to my surprise, has received a lot of media coverage.

I’ve received a lot of comments about the image, many asking for more details on how I created it. When I tell people I used R, the reaction I get is roughly what I would expect if I told them I made it with Microsoft Paint and a bottle of Jägermeister. Some people even questioned whether it was actually done in R. The truth is, aside from the addition of the logo and date text, the image was produced entirely with about 150 lines of R code with no external dependencies. In the process I learned a few things about creating nice-looking graphs in R.

Great visualization example and encouragement to consider R for your next visualization project!

Data Structures for Range-Sum Queries (slides)

Saturday, October 29th, 2011

Data Structures for Range-Sum Queries (slides) by Paul Butler.

From the post:

This week I attended the Canadian Undergraduate Mathematics Conference. I enjoyed talks from a number of branches of mathematics, and gave a talk of my own on range-sum queries. Essentially, range-aggregate queries are a class of database queries which involve taking an aggregate (in SQL terms, SUM, AVG, COUNT, MIN, etc.) over a set of data where the elements are filtered by simple inequality operators (in SQL terms, WHERE colname {<, <=, =, >=, >} value AND …). Range-sum queries are the subset of those queries where SUM is the aggregation function.

Due to the nature of the conference, I did my best to make things as accessible as possible to someone with a general mathematics background, rather than assuming familiarity with databases or order notation.

I’ve put the slides (pdf link, embedded below also) online. They may be hard to follow as slides, but I hope they pique your interest enough to check out the papers referenced at the end if that’s the sort of thing that interests you. I may turn them into a blog post at some point. The presentation begins with tabular data and shows some of the insights that led to the Dynamic Data Cube, which is a clever data structure for answering range-sum queries.

I will run down the links and see what other materials I can find on the “Dynamic Data Cube” (this post is from 2010). Data structures for range-sum queries look quite interesting.
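As a warm-up for the slides, here is one classic structure for dynamic range-sum queries, a Fenwick (binary indexed) tree, sketched in Python. The Dynamic Data Cube mentioned in the talk generalizes this kind of prefix-sum idea to multiple dimensions:

```python
class FenwickTree:
    """Dynamic range-sum queries in O(log n) per update/query (1-based)."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        """Add delta to element i (1-based)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)  # step to the next node covering a larger range

    def prefix_sum(self, i):
        """Sum of elements 1..i."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)
        return total

    def range_sum(self, lo, hi):
        """Sum of elements lo..hi inclusive."""
        return self.prefix_sum(hi) - self.prefix_sum(lo - 1)

ft = FenwickTree(8)
for i, v in enumerate([3, 1, 4, 1, 5, 9, 2, 6], start=1):
    ft.add(i, v)
print(ft.range_sum(3, 6))  # 4 + 1 + 5 + 9 = 19
```

In SQL terms, `range_sum(3, 6)` answers `SELECT SUM(val) WHERE 3 <= i AND i <= 6` without rescanning the data, and `add` keeps the answers current as rows change.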

Failing at Google Interviews

Saturday, October 29th, 2011

Failing at Google Interviews by Alex Bowe.

Not strictly a topic map topic except that topic map practitioners do apply for jobs, including at places like Google. I thought the advice in this article important enough to pass along for anyone in the topic map audience.

The one thing Alex did not mention is the proposition:

Not being hired by Google is their loss.

Proof is left up to the reader.

FM-Indexes and Backwards Search

Saturday, October 29th, 2011

FM-Indexes and Backwards Search

From the post:

Last time (way back in June! I have got to start blogging consistently again) I discussed a gorgeous data structure called the Wavelet Tree. When a Wavelet Tree is stored using RRR sequences, it can answer rank and select operations in O(log A) time, where A is the size of the alphabet. If the size of the alphabet is 2, we could just use RRR by itself, which answers rank and select in O(1) time for binary strings. RRR also compresses the binary strings, and hence compresses a Wavelet Tree which is stored using RRR.

So far so good, but I suspect rank and select queries seem to be of limited use right now (although once you are familiar with the available structures, applications show up often). One of the neatest uses of rank that I’ve seen is in substring search, which is certainly a wide-reaching problem (for a very recent application to genome assembly, see Jared Simpson’s paper from 2010 called Efficient construction of an assembly string graph using the FM-index).

Also, my apologies but I am using 1-basing instead of 0-basing, because that is how I did my diagrams a year ago 🙂 (and bear with my lack of nicely typeset math, I am migrating to GitHub pages where I will be allowed to use Mathjax soon)

If you are interested in stretching your indexing muscles a bit, this post will get them limbered up!
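To give a feel for what backwards search actually does, here is a naive sketch in Python: build the Burrows-Wheeler transform, then count pattern occurrences by narrowing a suffix-array interval right-to-left. A real FM-index replaces the per-character scans in `rank` below with O(1) rank queries on an RRR-backed wavelet tree; everything else is the same algorithm.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' terminates)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def backwards_search(bwt_str, pattern):
    """Count occurrences of pattern using LF-mapping over the BWT."""
    # C[ch] = number of characters in the text strictly smaller than ch
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(ch, i):
        """Occurrences of ch in bwt_str[:i] (naive scan; an FM-index
        answers this in O(1) via a wavelet tree)."""
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)        # current suffix-array interval
    for ch in reversed(pattern):    # consume the pattern right-to-left
        if ch not in C:
            return 0
        lo = C[ch] + rank(ch, lo)
        hi = C[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo                  # width of the interval = match count

b = bwt("abracadabra")
print(backwards_search(b, "abra"))  # 2
print(backwards_search(b, "cad"))   # 1
```

Note this sketch is 0-based, unlike the post's diagrams; the interval arithmetic is otherwise the standard FM-index count procedure.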

Mapreduce in Search

Saturday, October 29th, 2011

Mapreduce in Search by Amund Tveit.

Interesting coverage of MapReduce in search applications. Covers basic MapReduce, what I would call “search” MapReduce, and concludes with “advanced” MapReduce, as in expert systems, training, etc.

Worth a close look. Or even a tutorial focused on a specific data set with problem sets. Something to think about.
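The “search” use of MapReduce is, at its core, inverted-index construction. A toy single-process sketch of the two phases in Python (assuming plain whitespace tokenization, and collapsing the framework's shuffle step into the reducer):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit (term, doc_id) pairs for every token in a document."""
    for term in text.lower().split():
        yield term, doc_id

def reduce_phase(pairs):
    """Reduce: group postings by term into a sorted inverted index."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(docs) for term, docs in index.items()}

docs = {1: "topic maps merge subjects", 2: "search engines index subjects"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index["subjects"])  # [1, 2]
```

In a real cluster the mappers and reducers run on different machines and the framework sorts and routes the `(term, doc_id)` pairs between them, but the logic per phase is no more complicated than this.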

IFIP Working Conference on Uncertainty Quantification in Scientific Computing

Saturday, October 29th, 2011

IFIP Working Conference on Uncertainty Quantification in Scientific Computing

From the webpage:

I just came across the following presentations at the IFIP Working Conference on Uncertainty Quantification in Scientific Computing held at the Millennium Harvest House in Boulder, on August 1-4, 2011. Here are the talks and some abstracts:

I really like the title of this blog: The Robust Mathematical Modeling Blog …When modeling Reality is not an option.

I think you will find the presentations good starting points for reviewing what we know or suspect about uncertainty.

Does anyone know of references to modeling uncertainties in the humanities?

Seems to me that our notions of subject identity should be understood along a continuum of uncertainty.

Testling – Automated Cross-Browser Javascript Testing

Saturday, October 29th, 2011

Testling – Automated Cross-Browser Javascript Testing

I ran across this service while looking at data blogs. It looks like a useful resource for topic map developers working on web interfaces.

I am sure others of this type exist, suggestions/pointers?

Check out the other items on the site. The sed and awk “cheat” sheets are two items that may be of interest.

We Really Don’t Know How To Compute!

Saturday, October 29th, 2011

We Really Don’t Know How To Compute! by Gerald Jay Sussman.

This is a must watch video! Sussman tries to make the case that we need to think differently about computing. For example, being able to backtrack the provenance of data in a series of operations. Or being able to maintain inconsistent world views while maintaining locally consistent world views in a single system. Or being able to say where world views diverge, without ever claiming either one to be correct/incorrect.

Argues in general for systems that will be robust enough for massively parallel programming, where inconsistencies and the like are going to abound when applied to, say, all known medical literature. It isn’t going to be helpful if our systems fall over on their sides when they encounter inconsistency.

Much of what Sussman says is, I think, applicable to parallel processing of topic maps. I will certainly be looking up some of his class videos from MIT.

From the webpage:


Gerald Jay Sussman compares our computational skills with the genome, concluding that we are way behind in creating complex systems such as living organisms, and proposing a few areas of improvement.


Gerald Jay Sussman is the Panasonic Professor of EE at MIT. Sussman is a coauthor (with Hal Abelson and Julie Sussman) of the MIT computer science textbook “Structure and Interpretation of Computer Programs”. Sussman has made a number of important contributions to Artificial Intelligence, and with his former student, Guy L. Steele Jr., invented the Scheme programming language in 1975.

About the conference

Strange Loop is a multi-disciplinary conference that aims to bring together the developers and thinkers building tomorrow’s technology in fields such as emerging languages, alternative databases, concurrency, distributed systems, mobile development, and the web.


Jubatus: Distributed Online Machine Learning Framework

Saturday, October 29th, 2011

Jubatus: Distributed Online Machine Learning Framework

From the webpage:

The Jubatus library is an online machine learning framework which runs in a distributed environment. The Jubatus library includes these functions:

  • multi-class/binary classification,
  • pre-processing of data (for natural language), and
  • process management.

Talk about something that will make you perk up on a rainy afternoon!
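Jubatus’s actual API aside (consult its documentation for the real interface), the core idea of online multi-class classification is a model updated one example at a time as data streams past, never holding the full dataset. A toy multi-class perceptron sketch in Python, with sparse feature dictionaries and all names invented for illustration:

```python
from collections import defaultdict

class OnlinePerceptron:
    """Toy online multi-class classifier: one sparse weight vector per
    class, updated example-by-example (not Jubatus's actual API)."""

    def __init__(self):
        self.weights = defaultdict(lambda: defaultdict(float))

    def score(self, label, features):
        w = self.weights[label]
        return sum(w[f] * v for f, v in features.items())

    def predict(self, features, labels):
        return max(labels, key=lambda l: self.score(l, features))

    def learn(self, features, gold, labels):
        """Standard perceptron update on a single streamed example."""
        guess = self.predict(features, labels)
        if guess != gold:  # only mistakes move the weights
            for f, v in features.items():
                self.weights[gold][f] += v
                self.weights[guess][f] -= v

labels = ["ham", "spam"]
clf = OnlinePerceptron()
stream = [({"buy": 1.0, "now": 1.0}, "spam"),
          ({"meeting": 1.0, "notes": 1.0}, "ham"),
          ({"buy": 1.0, "cheap": 1.0}, "spam")]
for feats, gold in stream * 3:   # a few passes over the stream
    clf.learn(feats, gold, labels)
print(clf.predict({"buy": 1.0}, labels))  # 'spam'
```

Jubatus’s contribution is running this kind of learner across many machines and periodically mixing the models, which is what makes the “distributed” in its name interesting.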