Archive for February, 2012

What is Google Missing?

Wednesday, February 29th, 2012

After I wrote Graph Databases: Information Silo Busters it occurred to me that it names what Google (as an interface) is missing:

  • Declaring relationships
  • Persisting declared relationships

Think about it.

Google crawls a large percentage of the WWW, peering down into first one information silo and then another. And by using Google, I can look down into the same information silos. That is better than nothing, but it could be game-changingly better.

What if instead of looking down into one data silo after another, I can gather together all the information I find about a subject?

Much like creating a collection of bookmarks, but better: I get to say what “string” the various URLs have information about. Nothing fancy, no complicated whistle-out-your-left-ear syntax, just a string.

See how people like it. If successful, bump that up to two collections, with a limited set of relationships (think operators) between the two strings.

Oh, that’s the other thing, relationships need to be persisted. Think of all the traction that Facebook and others have gotten from persisting relationships.
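The two missing pieces can be sketched in a few lines of Python (all names and URLs here are invented for illustration and have nothing to do with Google's actual systems): declarations tie a plain string to URLs, and relationships between strings are persisted alongside them.

```python
import json

class SearchDeclarations:
    """Toy store for user-declared search relationships.

    declare() ties a plain string (the subject) to a URL the user says
    is about it; relate() links two subjects with an operator; dump()
    persists everything as JSON instead of throwing it away.
    """

    def __init__(self):
        self.subjects = {}   # subject string -> set of URLs
        self.relations = []  # (subject_a, operator, subject_b)

    def declare(self, subject, url):
        self.subjects.setdefault(subject, set()).add(url)

    def relate(self, a, op, b):
        self.relations.append((a, op, b))

    def dump(self):
        return json.dumps({
            "subjects": {s: sorted(u) for s, u in self.subjects.items()},
            "relations": self.relations,
        })

store = SearchDeclarations()
store.declare("graph databases", "http://example.com/silo-busters")
store.declare("graph databases", "http://example.com/infinitegraph")
store.relate("graph databases", "related-to", "information silos")
```

Nothing fancy, as promised: a string, some URLs, and a persisted relationship between two strings.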

Unless you know a good reason to throw away searches and whatever declarations I want to make about them?

I can hear Googlers saying they already do the foregoing with all the annoying tracking information attached to search results. True, Google tracks search results and the sites chosen for particular search requests, but Googlers are guessing at the relevance of any result for a particular user.

So, rather than guessing, and remembering that making Google more (and not less) useful to users is the key to ad revenue, why not give users the ability to declare and persist relationships in Google search results? (Any other search wannabe is free to try the same strategy.)

Solving Problems on Recursively Constructed Graphs

Wednesday, February 29th, 2012

Solving Problems on Recursively Constructed Graphs by Richard B. Borie, R. Gary Parker, and Craig A. Tovey.

Abstract:

Fast algorithms can be created for many graph problems when instances are confined to classes of graphs that are recursively constructed. This article first describes some basic conceptual notions regarding the design of such fast algorithms, and then the coverage proceeds through several recursive graph classes. Specific classes include trees, series-parallel graphs, k-terminal graphs, treewidth-k graphs, k-trees, partial k-trees, k-jackknife graphs, pathwidth-k graphs, bandwidth-k graphs, cutwidth-k graphs, branchwidth-k graphs, Halin graphs, cographs, cliquewidth-k graphs, k-NLC graphs, k-HB graphs, and rankwidth-k graphs. The definition of each class is provided. Typical algorithms are applied to solve problems on instances of most classes. Relationships between the classes are also discussed.

Part survey and part tutorial, this article (at 51 pages) will take some time to read in detail.

I wanted to mention it because one of the topics being discussed in the graph reading club will be the partitioning of graph databases.

I suspect, but obviously don’t know for certain, that the graph databases that are constructed in enterprise settings are not going to be random graphs. That is to say some (all?) have repetitive (if not recursive) structures that can be exploited to solve particular graph operations on those databases.
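Trees are the simplest recursively constructed class in the article's list, and they show the pattern: decompose along the recursive structure and combine small local answers. A sketch of maximum-weight independent set on a tree, a problem that is NP-hard on general graphs but linear-time here:

```python
def max_independent_set(tree, weights, root=0):
    """Maximum-weight independent set on a tree.

    The classic two-state dynamic program over the recursive
    structure: for each node, track the best value with the node
    included (children must be excluded) vs. excluded (children free).
    """
    def solve(node, parent):
        incl, excl = weights[node], 0
        for child in tree[node]:
            if child == parent:
                continue
            c_incl, c_excl = solve(child, node)
            incl += c_excl                  # node taken: skip child
            excl += max(c_incl, c_excl)     # node skipped: child free
        return incl, excl

    return max(solve(root, None))

# Path 0 - 1 - 2: taking the two endpoints beats the middle node.
tree = {0: [1], 1: [0, 2], 2: [1]}
best = max_independent_set(tree, {0: 3, 1: 2, 2: 3})  # -> 6
```

The same two-state shape generalizes to k-state tables on the wider classes (series-parallel, treewidth-k, and so on) the article surveys.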

Suggestions of other resources on recursively constructed graphs?

Designing Search (part 2): As-you-type suggestions

Wednesday, February 29th, 2012

Designing Search (part 2): As-you-type suggestions by Tony Russell-Rose.

From the post:

Have you ever tried the “I’m Feeling Lucky” button on Google? The idea is, of course, that Google will take you directly to the result you want, rather than return a list of results. It’s a simple idea, and when it works, it seems like magic.

(graphic omitted)

But most of the time we are not so lucky. Instead, we submit a query and review the results; only to find that they’re not quite what we were looking for. Occasionally, we review a further page or two of results, but in most cases it’s quicker just to enter a new query and try again. In fact, this pattern of behaviour is so common that techniques have been developed specifically to help us along this part of our information journey. In particular, three versions of as-you-type suggestions—auto-complete, auto-suggest, and instant results—subtly guide us in creating and reformulating queries.

Tony guides the reader through auto-complete, auto-suggest, and instant results in his usual delightful manner. He illustrates the principles under discussion with well known examples from the WWW.
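Of the three, auto-complete is the easiest to sketch. With a sorted vocabulary, matching a prefix is just two binary searches (a toy illustration, not anything from Tony's post):

```python
import bisect

def autocomplete(vocab, prefix, limit=5):
    """Return up to `limit` completions of `prefix` from a sorted
    vocabulary. Two binary searches bracket the matching range, so
    each keystroke costs O(log n) rather than a scan."""
    lo = bisect.bisect_left(vocab, prefix)
    hi = bisect.bisect_right(vocab, prefix + "\uffff")
    return vocab[lo:hi][:limit]

vocab = sorted(["search", "seaside", "semantic", "series", "server"])
```

Real systems layer ranking — popularity, personalization — on top of the raw prefix match, which is where auto-suggest and instant results diverge from plain auto-complete.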

A collection of his posts should certainly be supplemental (if not primary) reading for any course on information interfaces.

Work-Stealing & Recursive Partitioning with Fork/Join

Wednesday, February 29th, 2012

Work-Stealing & Recursive Partitioning with Fork/Join by Ilya Grigorik.

From the post:

Implementing an efficient parallel algorithm is, unfortunately, still a non-trivial task in most languages: we need to determine how to partition the problem, determine the optimal level of parallelism, and finally build an implementation with minimal synchronization. This last bit is especially critical since as Amdahl’s law tells us: “the speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program”.

The Fork/Join framework (JSR 166) in JDK7 implements a clever work-stealing technique for parallel execution that is worth learning about – even if you are not a JDK user. Optimized for parallelizing divide-and-conquer (and map-reduce) algorithms it abstracts all the CPU scheduling and work balancing behind a simple to use API.

I appreciated the observation later in the post that map-reduce is a special case of the pattern described in this post. A better understanding of the special cases can lead to a deeper understanding of the general one.
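Python has no work-stealing scheduler like JSR 166's, but the recursive partitioning half of the pattern — split until a task is small enough, fork one half, compute the other, then join — can be sketched with the standard library (the threshold and pool size are arbitrary choices):

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 1000  # below this, recursion costs more than it saves

def fork_join_sum(data, pool):
    """Divide-and-conquer sum in the Fork/Join style: fork the left
    half as a task, compute the right half in place, join results.
    (No work-stealing here; this shows only the partitioning.)"""
    if len(data) <= THRESHOLD:
        return sum(data)                                 # sequential base case
    mid = len(data) // 2
    left = pool.submit(fork_join_sum, data[:mid], pool)  # fork
    right = fork_join_sum(data[mid:], pool)              # keep this thread busy
    return left.result() + right                         # join

with ThreadPoolExecutor(max_workers=32) as pool:
    total = fork_join_sum(list(range(10_000)), pool)
```

The sequential cutoff is where "the optimal level of parallelism" from the post lives: forking below it only adds synchronization cost.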

Announcing Google-hosted workshop videos from NIPS 2011

Wednesday, February 29th, 2012

Announcing Google-hosted workshop videos from NIPS 2011 by John Blitzer and Douglas Eck.

From the post:

At the 25th Neural Information Processing Systems (NIPS) conference in Granada, Spain last December, we engaged in dialogue with a diverse population of neuroscientists, cognitive scientists, statistical learning theorists, and machine learning researchers. More than twenty Googlers participated in an intensive single-track program of talks, nightly poster sessions and a workshop weekend in the Spanish Sierra Nevada mountains. Check out the NIPS 2011 blog post for full information on Google at NIPS.

In conjunction with our technical involvement and gold sponsorship of NIPS, we recorded the five workshops that Googlers helped to organize on various topics from big learning to music. We’re now pleased to provide access to these rich workshop experiences to the wider technical community.

Watch videos of Googler-led workshops on the YouTube Tech Talks Channel:

Not to mention several other videos you will find at the original post.

Suspect everyone will find something they will enjoy!

Comments on any of these that you find particularly useful?

The meaning of most

Wednesday, February 29th, 2012

The meaning of most by Junkcharts.

This is an important post at Junkcharts for its lesson in bad infographics and for its humor. And if you think you “know” the meaning of “most,” hang on!

We don’t have to wait for Nineteen Eighty-Four to arrive; we need to wait for it to leave.

Will the Circle Be Unbroken? Interactive Annotation!

Wednesday, February 29th, 2012

I have to agree with Bob Carpenter, the title is a bit much:

Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

From the post:

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build a reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

Both the paper and Bob’s review merit close reading.

Graph Databases: Information Silo Busters

Wednesday, February 29th, 2012

In a post about InfiniteGraph 2.1 I found the following:

Other big data solutions all lack one thing, Clark contends. There is no easy way to represent the connection information, the relationships across the different silos of data or different data stores, he says. “That is where Objectivity can provide the enhanced storage for actually helping extract and persist those relationships so you can then ask queries about how things are connected.”

(Brian Clark, vice president, Data Management, Objectivity)

It was the last line of the post but I would have sharpened it and made it the lead slug.

Think about what Clark is saying: Not only can we persist relationship information within a datastore but also generate and persist relationship information between datastores. With no restriction on the nature of the datastores.

Try doing that with a relational database and SQL.

What I find particularly attractive is that persisting relationships across datastores means that we can jump the hurdle of making everyone use a common data model. It can be as common (in the graph) as it needs to be and no more.
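A minimal sketch of that point (invented data, no particular product's API): each silo keeps its own model, and the only shared structure is persisted edges keyed by (silo, key) pairs.

```python
# Two silos with incompatible models: neither is forced to change.
crm = {"c42": {"customer_name": "Acme Corp"}}             # relational-style
docs = {"d7": {"title": "Acme contract", "body": "..."}}  # document-style

edges = []  # persisted relationships: (silo_a, key_a, rel, silo_b, key_b)

def persist_edge(silo_a, key_a, rel, silo_b, key_b):
    edges.append((silo_a, key_a, rel, silo_b, key_b))

def neighbors(silo, key):
    """Follow persisted edges out of one silo into any other."""
    for sa, ka, rel, sb, kb in edges:
        if (sa, ka) == (silo, key):
            yield rel, sb, kb

persist_edge("crm", "c42", "has-contract", "docs", "d7")
```

The edge list is exactly "as common as it needs to be and no more": the silos themselves never agree on a schema.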

Of course I think of this as particularly suited for topic maps, since we can document why we have mapped components of diverse data models to particular points in the graph. But what did you expect?

But used robustly, graph databases are going to allow you to perform integration across whatever datastores are available to you, using whatever data models they use, mapped to whatever data model you like. And others may map your graph database to models they prefer as well.

I think documenting those mappings deserves attention sooner rather than later.

BTW, feel free to use the phrase “Graph Databases: Information Silo Busters.” (with or without attribution – I want information silos to fall more than I want personal recognition.)

Everything Goes Better With Bacon

Wednesday, February 29th, 2012

Everything Goes Better With Bacon by Nick Quinn, Senior Software Developer, InfiniteGraph.

From the post:

Whenever someone considers a large movie database like Internet Movie Database, or IMDB, inevitably the classic six degrees of Kevin Bacon problem comes up. It is a famous problem posed like this, “…any individual involved in the Hollywood, California film industry can be linked through his or her film roles to actor Kevin Bacon within six steps” [http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon]. This problem even helped Kevin Bacon begin his own social charity organization called SixDegrees.org linking people with charities that they might be interested in.

Below is an example of how InfiniteGraph can be used to store and navigate through large sets of connected data like the IMDB. In the example, I will show how to both find the links between various actors and Kevin Bacon, but also how to output the navigation results in various formats including JSON and GraphML. Note: Custom navigator plugins and custom formatter plugins (including the default JSON/GraphML formatters) can be created and used in any InfiniteGraph (2.1) graph database instance. See the InfiniteGraph developer wiki for more details and examples of how to write and use custom plugins (http://wiki.infinitegraph.com).

Here is a visualization of the actors connected to Kevin Bacon within just two degrees of separation (up to 1500 connections).

Even if you are not interested in movies or Kevin Bacon (there are a few of us around), this post rocks!

Good demonstration of the power of a graph database (in this case, InfiniteGraph) for navigation of relationships in data.
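Stripped of any particular product, the navigation itself is breadth-first search over a co-appearance graph. A toy sketch (the cast data below is invented shorthand, not pulled from IMDB or InfiniteGraph):

```python
from collections import deque

def degrees_of_separation(graph, start, target):
    """Breadth-first search: the depth at which `target` is first
    reached is its Bacon number relative to `start`."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return depth
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None  # not connected

costars = {
    "Kevin Bacon": ["Tom Hanks"],
    "Tom Hanks": ["Kevin Bacon", "Meg Ryan"],
    "Meg Ryan": ["Tom Hanks"],
}
```

The graph database earns its keep when the graph is IMDB-sized and the neighbor lookups must hit disk-backed, indexed storage rather than a dict.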

Code for visualization as well!

AlchemyDB – The world’s first integrated GraphDB + RDBMS + KV Store + Document Store

Wednesday, February 29th, 2012

AlchemyDB – The world’s first integrated GraphDB + RDBMS + KV Store + Document Store by Russell Sullivan.

From the post:

I recently added a fairly feature rich Graph Database to AlchemyDB (called it LuaGraphDB) and it took roughly 10 days to prototype. I implemented the graph traversal logic in Lua (embedded in AlchemyDB) and used AlchemyDB’s RDBMS to index the data. The API for the GraphDB is modeled after the very advanced GraphDB Neo4j. Another recently added functionality in AlchemyDB, a column type that stores a Lua Table (called it LuaTable), led me to mix Lua-function-call-syntax into every part of SQL I could fit it into (effectively tacking on Document-Store functionality to AlchemyDB). Being able to call lua functions from any place in SQL and being able to call lua functions (that can call into the data-store) directly from the client, made building a GraphDB on top of AlchemyDB possible as a library, i.e. it didn’t require any new core functionality. This level of extensibility is unique and I am gonna refer to AlchemyDB as a “Data Platform”. This is the best term I can come up with, I am great at writing cache invalidation algorithms, but I suck at naming things :)

Another graph contender! It’s a fairly long post so get a cup of coffee before you start!

An observation that interests me:

It is worth noting (at some point, why not here:) that as long as you keep requests relatively simple, meaning they dont look at 1000 table-rows or traverse 1000 graph-nodes, your performance will range between 10K-100K TPS on a single core w/ single millisecond latencies, these are the types of numbers people should demand for OLTP.

Are we moving in the direction of databases that present “good enough” performance for particular use cases?

A related question: How would you optimize a graph database for particular recursive graphs?
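The layering Sullivan describes — graph traversal logic on top of an RDBMS index — can be sketched with sqlite3 (a toy, not AlchemyDB's actual storage or API):

```python
import sqlite3

# Edges live in an ordinary indexed table; traversal is driven by SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
db.execute("CREATE INDEX idx_src ON edges (src)")
db.executemany("INSERT INTO edges VALUES (?, ?)",
               [("a", "b"), ("b", "c"), ("b", "d")])

def traverse(node, depth):
    """Depth-limited traversal: each hop is one indexed lookup."""
    reached = {node}
    if depth > 0:
        for (dst,) in db.execute("SELECT dst FROM edges WHERE src = ?",
                                 (node,)):
            reached |= traverse(dst, depth - 1)
    return reached
```

Keeping requests to a bounded number of hops, as in the quoted TPS observation, is exactly what the depth limit enforces.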

Map your Twitter Friends

Tuesday, February 28th, 2012

Map your Twitter Friends by Nathan Yau.

From the post:

You’d think that this would’ve been done by now, but this simple mashup does exactly what the title says. Just connect your Twitter account and the people you follow pop up, with some simple clustering so that people don’t get all smushed together in one location.
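The "simple clustering" mentioned can be as modest as binning coordinates into grid cells, so co-located markers collapse into one. A sketch with made-up coordinates:

```python
from collections import defaultdict

def grid_cluster(points, cell=1.0):
    """Bin (lat, lon) points into grid cells of `cell` degrees so
    nearby markers render as one cluster instead of overlapping."""
    cells = defaultdict(list)
    for lat, lon in points:
        cells[(int(lat // cell), int(lon // cell))].append((lat, lon))
    return cells

# Two followers near London, one in New York -> two clusters.
clusters = grid_cluster([(51.5, -0.1), (51.6, -0.2), (40.7, -74.0)])
```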

Too bad the FBI’s social media mining contract will be secret. I wonder how much better it will be than freely available capabilities.

Security requirements will drive up the cost. Like secure installations where the computers have R/W DVDs installed.

Not that I carry a brief for any government, other than paying ones, but I do dislike incompetence, on any side.

Really old maps online

Tuesday, February 28th, 2012

Really old maps online by Nathan Yau.

From the post:

Maps have been around for a long time, but you might not know it looking online. It can be hard to find them. Old Maps Online, a project by The Great Britain Historical GIS Project and Klokan Technologies GmbH, Switzerland, is a catalog of just that.

I do wonder when the organization of information in its various forms will be recognized as maps? Or for that matter, visualized as maps?

Juriscraper: A New Tool for Scraping Court Websites

Tuesday, February 28th, 2012

Juriscraper: A New Tool for Scraping Court Websites

Legalinformatics reports a new tool for scraping court websites.

I understand the need for web scraping tools, but fail to understand why public data sources make them necessary. Scraping is getting to be a fairly trivial exercise, so obstruction only impedes access, it does not deny it.

Not that denying access would be acceptable, but at least it would be an understandable motivation. Trying to deny access while knowing you are going to fail just makes you look dumb. Perhaps that is its own reward.

Metaphorical search engine finds creative new meanings

Tuesday, February 28th, 2012

Metaphorical search engine finds creative new meanings

From the post:

TYPING “love” into Google, I find the Wikipedia entry, a “relationship calculator” and Lovefilm, a DVD rental service. Doing the same in YossarianLives, a new search engine due to launch this year, I might receive quite different results: “river”, “sleep” and “prison”. Its creators claim YossarianLives is a metaphorical search engine, designed to spark creativity by returning disparate but conceptually related terms. So the results perhaps make sense if you accept that love can ebb and flow, provide rejuvenating comfort or just make you feel trapped.

“Today’s internet search tells us what the world already knows,” explains the CEO of YossarianLives, J. Paul Neeley. “We don’t want you to know what everyone else knows, we want you to generate new knowledge.” He says that metaphors help us see existing concepts in a new way and create innovative ideas. For example, using a Formula 1 pit crew as a metaphor for doctors in an emergency room has helped improve medical procedures. YossarianLives aims to create new metaphors for designers, artists, writers or even scientists.

The name is derived from the anti-hero of the novel Catch-22, as the company wants to solve the catch-22 of existing search engines, which they say help us to access current knowledge but also harm us by reinforcing that knowledge above all else.

Sounds too good to be true but good things do happen.

What do you think?

Instruction Delivery

Tuesday, February 28th, 2012

It may be just the materials I have encountered, but here is how I would rate (highest to lowest) instruction delivery using the following methods:

  1. Interactive lecture/presentation
  2. Non-interactive lecture/presentation (think recorded CS lectures)
  3. Short non-interactive lectures plus online quizzes
  4. Webinars

I am not sure where pod/screencasts would fit into that ranking, probably between #2 and #3.

I suspect my feelings about webinars are colored by the appearance of corporate apparatchiks and fairly shallow technical content of those I have encountered.

Not all; some are quite good, but that is like pointing to PBS’s good programming in apology for the 500 channels of trash on local cable TV.

So that it isn’t too narrow a question: what stands out for you as the most successful learning experiences you have had? What components or techniques seemed to make them so?

Can’t promise I will have the skill or talent to follow some or all of your suggestions but I am truly interested in what might make a successful learning experience. It will be for a fairly unique audience but every audience is unique in some way.

Any and all suggestions are deeply appreciated!

PS: And yes, to narrow the question or present the opportunity for more criticism, I will be venturing into the video realm in the near future.

git-oh-$#!†

Tuesday, February 28th, 2012

git-oh-$#!† by Kristina Chodorow.

From the post:

I’ve learned a lot about git, usually in a hurry after I mess up and have to fix it. Here are some basic techniques I’ve learned that may help a git beginner.

I have always found humor to be a good tool for teaching material you want remembered. See what you think.

OECD Homepage

Tuesday, February 28th, 2012

OECD Homepage

More about how I got to this site in a moment but it is a wealth of statistical information.

From the about page:

The mission of the Organisation for Economic Co-operation and Development (OECD) is to promote policies that will improve the economic and social well-being of people around the world.

The OECD provides a forum in which governments can work together to share experiences and seek solutions to common problems. We work with governments to understand what drives economic, social and environmental change. We measure productivity and global flows of trade and investment. We analyse and compare data to predict future trends. We set international standards on a wide range of things, from agriculture and tax to the safety of chemicals.

We look, too, at issues that directly affect the lives of ordinary people, like how much they pay in taxes and social security, and how much leisure time they can take. We compare how different countries’ school systems are readying their young people for modern life, and how different countries’ pension systems will look after their citizens in old age.

Drawing on facts and real-life experience, we recommend policies designed to make the lives of ordinary people better. We work with business, through the Business and Industry Advisory Committee to the OECD, and with labour, through the Trade Union Advisory Committee. We have active contacts as well with other civil society organisations. The common thread of our work is a shared commitment to market economies backed by democratic institutions and focused on the wellbeing of all citizens. Along the way, we also set out to make life harder for the terrorists, tax dodgers, crooked businessmen and others whose actions undermine a fair and open society.

I got to the site by following a link to OECD.StatExtracts which is a beta page reported by Christophe Lalanne’s A bag of tweets / Feb 2012.

I am sure comments (helpful ones in particular) would be appreciated on the beta pages.

My personal suspicion is that eventually very little data will be transferred in bulk; most large data sets will support both pre-programmed and ad hoc processing requests. That is already quite common in astronomy (both optical and radio).

StatLib

Tuesday, February 28th, 2012

StatLib

From the webpage:

Welcome to StatLib, a system for distributing statistical software, datasets, and information by electronic mail, FTP and WWW. StatLib started out as an e-mail service and some of the organization still reflects that heritage. We hope that this document will give you sufficient guidance to navigate through the archives. For your convenience there are several sites around the world which serve as full or partial mirrors to StatLib.

An amazing source of software and data. Including sets of webpages for clustering analysis, etc.

Was mentioned in the first R-Podcast episode.

RHadoop – rmr – 1.2 released!

Tuesday, February 28th, 2012

RHadoop – rmr – 1.2 released!

From the Changelog:

  • Binary formats
  • Simpler, more powerful I/O format API
  • Native binary format with support for all R data types
  • Worked around an R bug that made large reduces very slow.
  • Backend specific parameters to modify things like number of reducers at the hadoop level
  • Automatic library loading in mappers and reducers
  • Better data frame conversions
  • Adopted a uniform.naming.convention
  • New package options API

If you are using R with Hadoop, this is a project you need to watch.

The R-Podcast

Tuesday, February 28th, 2012

The R-Podcast

From the about page:

Whether you have experience with commercial statistical software such as SAS or SPSS and want to learn R, or getting into statistical computing for the first time, the R-Podcast will provide you with valuable information and advice that will help you to tap into the power of R. Our intent is to start with the basic concepts that can be a struggle for those new to R and statistical computing. We will give practical advice on how to take advantage of R’s capabilities to accomplish innovative and robust data analyses. Along the way we will highlight the additional tools and packages that greatly enhance the experience of using R, and highlight resources that can help people become experts with R. While this podcast is not meant to be a series of lectures on statistics, we will use freely and publicly available data sets to illustrate both basic statistical analyses as well as state-of-the-art algorithms to show how powerful and robust R can be for analyzing today’s explosion of data. In addition to the audio podcast, we will also produce screencasts for hands-on demonstrations for those topics that are best explained via video.

Your host:

The host of the R-Podcast is Eric Nantz, a statistician working in the life sciences industry who has been using R since 2004. Eric quickly developed a passion for using and learning more about R due in large part to the brilliant and exciting R community and its free and open source heritage, much like his favorite operating system Linux. Currently he is running Linux Mint and Ubuntu on many of his computers at home, each running R of course!

Hosts podcasts and resources about R.

Only two podcasts so far but it sounds like an interesting project.

R Tutorials from Universities Around the World

Monday, February 27th, 2012

R Tutorials from Universities Around the World by Pairach Piboonrungroj.

A nice listing of R tutorials.

Fourteen so far, if you know of others, please contact the author.

Cassandra Radical NoSQL Scalability

Monday, February 27th, 2012

Cassandra Radical NoSQL Scalability by Tim Berglund.

From the description:

Cassandra is a scalable, highly available, column-oriented data store in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala and more companies that have large, active data sets. The largest known Cassandra cluster has over 300 TB of data in over 400 machines.

This open source project managed by the Apache foundation offers a compelling combination of a rich data model, a robust deployment track record, and a sound architecture. This video presents Cassandra’s data model, works through its API in Java and Groovy, talks about how to deploy it and looks at use cases in which it is an appropriate data storage solution.

It explores the Amazon Dynamo project and Google’s BigTable and explains how its architecture helps us achieve the gold standard of scalability: horizontal scalability on commodity hardware. You will be ready to begin experimenting with Cassandra immediately and planning its adoption in your next project.

Take some time to look at CQL – Cassandra Query Language.

BTW, Berglund is a good presenter.

Best Online Tools to Make Simple HTML5 Coding

Monday, February 27th, 2012

Best Online Tools to Make Simple HTML5 Coding

From the post:

HTML which stands for Hypertext mark-up language is supposed to be the main language for web pages. Hence the purpose of a web browser is to read the HTML documents and then compose them into either audible or visible websites.

HTML or Hypertext mark-up language is the main languages used for web pages. It is written in the form of tags enclosed in angle brackets. It is written within the web page content. Mostly these HTML elements are used in pairs but some that are known as empty elements are used unpaired also.

HTML5 is a language that used to create web pages, its fifth revision of HTML, a core technology of internet and basic language of designing. This advanced technology has some new features and tags that presents website designs with special effects and awesome layouts. HTML5 adds many new syntactical features.

These include the <video>, <audio>, <header> and <canvas> elements, as well as the integration of SVG(Scalable Vector Graphics) content. Today on internet, thousand of websites are available on internet having attractive designs so I have choose some great piece work of designers. I hope you will love to see this beautiful collection.

Today we cover some best online tools that helps developers to make easy and simple HTML5 coding. In this list we featured HTML5 cheat sheets, HTML5 website generator, HTML5 demos and examples, etc. Visit this list and share your thought in our comment section below.

Even if you aren’t in the web design end of things, it doesn’t hurt to have a general idea of what can easily be done. And some idea of how to do it, if you are the only one available when something breaks.

Some interesting tools. Are there others you prefer?

Quick Search Reference Manual for Solr

Monday, February 27th, 2012

Quick Search Reference Manual for Solr by Mitch Pronschinske.

From the post:

A tweet from Eric Pugh informs us about a recent handy resource he contributed to that you Solr/Lucene users out there should look into.

It’s a Solr reference manual [pdf] based on some distilled references from Apache Solr 3 Enterprise Search Server from Packt Publishing, a book that Eric co-authored with David Smiley. The motivation behind this manual is simply to help with some of the logical inconsistencies in the parameter names (e.g. the query type parameter is “qt” but the query parser parameter is “defType”). The reference is meant to help you remember these oddities that are hard to codify in your brain.

Opportunities to improve upon documentation for software are nearly endless. If you like this effort, consider contributing your own.

How to create a crossword puzzle in LaTeX?

Monday, February 27th, 2012

How to create a crossword puzzle in LaTeX?

Brief discussion and pointers to packages to create crossword puzzles in LaTeX.

Could be a useful classroom device.

I suppose the harder question would be how to write a LaTeX macro that solves crossword puzzles. 😉

Found thanks to: Christophe Lalanne’s A bag of tweets / Feb 2012.

Barcodes: The Persistent Topology of Data

Monday, February 27th, 2012

Barcodes: The Persistent Topology of Data by Robert Ghrist.

Abstract:

This article surveys recent work of Carlsson and collaborators on applications of computational algebraic topology to problems of feature detection and shape recognition in high-dimensional data. The primary mathematical tool considered is a homology theory for point-cloud data sets — persistent homology — and a novel representation of this algebraic characterization — barcodes. We sketch an application of these techniques to the classification of natural images.

From the article:

1. The shape of data

When a topologist is asked, “How do you visualize a four-dimensional object?” the appropriate response is a Socratic rejoinder: “How do you visualize a three-dimensional object?” We do not see in three spatial dimensions directly, but rather via sequences of planar projections integrated in a manner that is sensed if not comprehended. We spend a significant portion of our first year of life learning how to infer three-dimensional spatial data from paired planar projections. Years of practice have tuned a remarkable ability to extract global structure from representations in a strictly lower dimension.

The inference of global structure occurs on much finer scales as well, with regards to converting discrete data into continuous images. Dot-matrix printers, scrolling LED tickers, televisions, and computer displays all communicate images via arrays of discrete points which are integrated into coherent, global objects. This also is a skill we have practiced from childhood. No adult does a dot-to-dot puzzle with anything approaching anticipation.

1.1. Topological data analysis.

Problems of data analysis share many features with these two fundamental integration tasks: (1) how does one infer high dimensional structure from low dimensional representations; and (2) how does one assemble discrete points into global structure.

Now are you interested?
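The zero-dimensional case gives the flavor with almost no machinery: grow balls around the points and record the radius at which connected components merge; each merge ends one bar of the barcode. A sketch for points on a line (the article's barcodes extend this idea to higher-dimensional homology of point clouds):

```python
from itertools import combinations

def barcode_0d(points):
    """Deaths of the 0-dimensional persistence bars of a 1-D point
    cloud, via union-find over pairwise distances in sorted order.
    n points yield n-1 finite bars plus one that never dies."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    deaths = []
    pairs = sorted((abs(points[i] - points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    for d, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # two components merge: one bar dies
    return deaths

# Three points: components merge at distances 1.0 and 2.0.
bars = barcode_0d([0.0, 1.0, 3.0])  # -> [1.0, 2.0]
```

Long bars signal real structure; short bars are noise — which is the point of the barcode representation.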

Reminds me of another paper on homology and higher dimensions that I need to finish writing up. Probably not today but later this week.

Found thanks to: Christophe Lalanne’s A bag of tweets / Feb 2012.

Multivariate Statistical Analysis: Old School

Monday, February 27th, 2012

Multivariate Statistical Analysis: Old School by John Marden.

From the preface:

The goal of this text is to give the reader a thorough grounding in old-school multivariate statistical analysis. The emphasis is on multivariate normal modeling and inference, both theory and implementation. Linear models form a central theme of the book. Several chapters are devoted to developing the basic models, including multivariate regression and analysis of variance, and especially the “both-sides models” (i.e., generalized multivariate analysis of variance models), which allow modeling relationships among individuals as well as variables. Growth curve and repeated measure models are special cases.

The linear models are concerned with means. Inference on covariance matrices covers testing equality of several covariance matrices, testing independence and conditional independence of (blocks of) variables, factor analysis, and some symmetry models. Principal components, though mainly a graphical/exploratory technique, also lends itself to some modeling.

Classification and clustering are related areas. Both attempt to categorize individuals. Classification tries to classify individuals based upon a previous sample of observed individuals and their categories. In clustering, there is no observed categorization, nor often even knowledge of how many categories there are. These must be estimated from the data.

Other useful multivariate techniques include biplots, multidimensional scaling, and canonical correlations.

The bulk of the results here are mathematically justified, but I have tried to arrange the material so that the reader can learn the basic concepts and techniques while plunging as much or as little as desired into the details of the proofs.

Practically all the calculations and graphics in the examples are implemented using the statistical computing environment R [R Development Core Team, 2010]. Throughout the notes we have scattered some of the actual R code we used. Many of the data sets and original R functions can be found in the file http://www.istics.net/r/multivariateOldSchool.r. For others we refer to available R packages.

This is “old school.” A preface that contains useful information and outlines what the reader may find? Definitely “old school.”

Found thanks to: Christophe Lalanne’s A bag of tweets / Feb 2012.
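Marden’s examples are in R (see the link in the preface above). Purely as an illustration of the book’s central theme, linear models with multivariate (multi-response) data, here is a minimal sketch in Python/NumPy; the dimensions and data below are invented, not taken from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# n observations, p predictors, q response variables (all made up)
n, p, q = 50, 3, 2
X = rng.normal(size=(n, p))
B_true = rng.normal(size=(p, q))              # true coefficient matrix
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))  # multivariate responses

# Least-squares estimate of the coefficient matrix: solves min ||Y - XB||_F.
# lstsq handles all q response columns at once.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

The point of the sketch is only that “multivariate” here means a matrix of responses rather than a vector, so the estimate is a p × q coefficient matrix recovered in one least-squares call.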

Biplots in Practice

Monday, February 27th, 2012

Biplots in Practice by Michael Greenacre.

I was rather disappointed by the pricing of the monographs on biplots cited in Christophe Lalanne’s Biplots. Particularly since most users would be new to biplots and reluctant to invest that kind of money in a monograph.

With a little searching I came across this volume by Michael Greenacre, which is described as follows:

Biplots in Practice is a comprehensive introduction to one of the most useful and versatile methods of multivariate data visualization: the biplot. The biplot extends the idea of a simple scatterplot of two variables to the case of many variables, with the objective of visualizing the maximum possible amount of information in the data. Research data are typically presented in the form of a rectangular table and the biplot takes its name from the fact that it visualizes the rows and the columns of this table in a common space. This book explains the specific interpretation of the biplot in many different areas of multivariate analysis, notably regression, generalized linear modelling, principal component analysis, log-ratio analysis, various forms of correspondence analysis and discriminant analysis. It includes applications in many different fields of the social and natural sciences, and provides three detailed case studies documenting how the biplot reveals structure in large complex data sets in genomics (where thousands of variables are commonly encountered), in social survey research (where many categorical variables are studied simultaneously) and ecological research (where relationships between two sets of variables are investigated).

It is available online as well as in print.

The R code and other supplemental materials are available at this site.

In terms of promoting biplots, I think this is a step in the right direction.
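Greenacre’s own code is in R, but the basic construction his description refers to — visualizing the rows and columns of a table in a common space — is easy to sketch. Here is an illustrative Python/NumPy version using the SVD of a centered data table; the toy numbers are invented:

```python
import numpy as np

# Toy data table: rows = individuals, columns = variables (made-up numbers)
X = np.array([[4.0, 2.0, 0.5],
              [3.5, 1.0, 1.0],
              [1.0, 3.0, 2.5],
              [0.5, 4.0, 3.0]])

# Center the columns, then factor via SVD: Xc = U S Vt
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# One common biplot choice: rows in principal coordinates (U*S),
# columns in standard coordinates (V), keeping the first 2 dimensions.
row_points = (U * s)[:, :2]
col_points = Vt.T[:, :2]

# The inner products of row and column points give the best rank-2
# approximation of the centered table -- that is what the plot displays.
approx = row_points @ col_points.T
```

Plotting `row_points` and `col_points` on the same axes (e.g., with matplotlib) gives the biplot itself; other scalings of the singular values between rows and columns are possible, which is exactly the kind of choice the monograph discusses.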

Biplots

Monday, February 27th, 2012

Biplots

A very entertaining and informative account by Christophe Lalanne of his pursuit of biplots from a question about display using Lisp to statistics with R.

Entertaining for professional researchers, who experience the joy of one source unravelling into another on a daily basis.

Instructive for those not yet professional researchers, by demonstrating the riches that await just beyond where most people stop searching.

Not to mention a wealth of pointers to resources on biplots.

Everyone has a Graph Store

Monday, February 27th, 2012

Everyone has a Graph Store by Danny Ayers.

Try this thought experiment.

For practical purposes we often assume that everyone has a computer, a reasonable Internet connection and a modern Web browser. We know it’s an inaccurate assumption, but it provides conceptual targets for technology in terms of people and environment.

Ok, now add to that list a Graph Store: a flexible database to which information can easily be added, and which can be easily queried. The data can also be easily shared over the Cloud. The data is available for any applications that might want to use it. The database is schemaless, agnostic about what you put in it: the data could be about contacts, descriptions of people & their relationships (i.e. a Social Graph), it could be about places or events, products, technical information, whatever. It can contain private information, it can contain information that you’re happy to share. You control your own store and can let other people access as much or as little of its contents as you like (which they can do easily over the cloud). You can access other people’s store in the same way, according to their preferences. It’s both a Personal Knowledgebase and a Federated Public Knowledgebase.

So, make the assumption: everyone has a Graph Store. Now what do you want to do with yours? What can your friends and colleagues do with theirs? How can you use other people’s information to improve your quality of life, and vice versa? What new tools can be developed to help them take advantage of their stores? How can you get rich quick on this? What other questions are there..?

When I do this thought experiment, all I come up with is Facebook. So I am not very encouraged.

Perhaps Danny is expecting a natural clumping of useful comments and insights. That is certainly possible, but then clumpings around Jim Jones and Jimmy Swaggart are possible too.

Or perhaps he expects a process of collective commenting and consideration to lead to useful results. American Idol isn’t strong evidence that mass participation produces good results. Neither are American election results.

Your thought experiment results may vary so feel free to report them.

Graphs are a great idea. Asking everyone to write down their thoughts in a graph store, not so great.
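Skepticism about adoption aside, the kind of store Danny describes is simple to sketch. Here is a toy schemaless “graph store” in Python, holding (subject, predicate, object) triples with wildcard queries; the class, its API, and the data are all invented for illustration, not any particular product:

```python
class GraphStore:
    """A minimal schemaless store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [t for t in sorted(self.triples)
                if (subject is None or t[0] == subject)
                and (predicate is None or t[1] == predicate)
                and (obj is None or t[2] == obj)]

store = GraphStore()
store.add("alice", "knows", "bob")
store.add("alice", "lives-in", "atlanta")
store.add("bob", "knows", "carol")

# Who does alice know?
print(store.query(subject="alice", predicate="knows"))
# → [('alice', 'knows', 'bob')]
```

Because nothing constrains what the strings mean, it is “agnostic about what you put in it,” which is both the appeal of the idea and, as noted above, the problem: the hard part is getting people to put anything useful in.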