Archive for May, 2012

From Tweets to Results: How to obtain, mine, and analyze Twitter data

Thursday, May 31st, 2012

From Tweets to Results: How to obtain, mine, and analyze Twitter data by Derek Ruths (McGill University).


Since its creation in 2006, Twitter has become one of the most popular microblogging platforms in the world. By virtue of its popularity, the relative structural simplicity of Twitter posts, and a tendency towards relaxed privacy settings, Twitter has also become a popular data source for research on a range of topics in sociology, psychology, political science, and anthropology. Yet despite its widespread use in the research community, there are many pitfalls when working with Twitter data.

In this day-long workshop, we will lead participants through the entire Twitter-based research pipeline: from obtaining Twitter data all the way through performing some of the sophisticated analyses that have been featured in recent high-profile publications. In the morning, we will cover the nuts and bolts of obtaining and working with a Twitter dataset including: using the Twitter API, the firehose, and rate limits; strategies for storing and filtering Twitter data; and how to publish your dataset for other researchers to use. In the afternoon, we will delve into techniques for analyzing Twitter content including constructing retweet, mention, and follower networks; measuring the sentiment of tweets; and inferring the gender of users from their profiles and unstructured text.

We assume that participants will have little to no prior experience with mining Twitter or other social network datasets. As the workshop will be interactive, participants are encouraged to bring a laptop. Code examples and exercises will be given in Python, thus participants should have some familiarity with the language. However, all concepts and techniques covered will be language-independent, so any individual with some background in scripting or programming will benefit from the workshop.
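The afternoon topics lend themselves to small sketches. As an illustration of one of them, constructing a mention network, here is a minimal Python sketch over invented tweets (the names and tweet texts are mine, not from the workshop):

```python
import re
from collections import Counter

# Invented sample data: (author, tweet text) pairs.
tweets = [
    ("alice", "Great talk by @bob on Twitter mining! cc @carol"),
    ("bob",   "Thanks @alice!"),
    ("carol", "RT @alice: Great talk by @bob on Twitter mining!"),
]

MENTION = re.compile(r"@(\w+)")

def mention_network(tweets):
    """Weighted, directed edges: (author, mentioned user) -> mention count."""
    edges = Counter()
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            edges[(author, mentioned.lower())] += 1
    return edges

edges = mention_network(tweets)
```

Retweet and follower networks follow the same pattern, with edges drawn from retweet metadata or follower lists instead of @-mentions.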

Any plans to use Twitter feeds for your topic maps?

I first saw a reference to this workshop at: Do you haz teh (twitter) codez? by Ricard Nielson.

How do you measure the impact of tagging on retrieval?

Thursday, May 31st, 2012

How do you measure the impact of tagging on retrieval? by Tony Russell-Rose.

From the post:

A client of mine wants to measure the difference between manual tagging and auto-classification on unstructured documents, focusing in particular on its impact on retrieval (i.e. relevance ranking). At the moment they are considering two contrasting approaches:

See Tony’s post for details.
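One common way to quantify the retrieval impact (an assumption on my part, not necessarily either of the approaches Tony describes) is to run the same queries against a manually tagged index and an auto-classified index, then score both rankings against shared relevance judgments with a metric such as precision@k:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results judged relevant."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant_ids) / k

# Hypothetical rankings for one query under the two indexing strategies,
# scored against the same relevance judgments.
relevant = {"d1", "d3", "d7"}
manual_run = ["d1", "d3", "d2", "d7", "d9"]
auto_run   = ["d2", "d1", "d9", "d3", "d7"]

p_manual = precision_at_k(manual_run, relevant, 3)  # 2/3
p_auto   = precision_at_k(auto_run, relevant, 3)    # 1/3
```

Averaged over a representative query set, the difference between the two scores is the "impact on retrieval" the client is after; graded metrics such as NDCG work the same way.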

What do you think?

Reverse engineering targeted emails from 2012 Campaign

Thursday, May 31st, 2012

Reverse engineering targeted emails from 2012 Campaign

Nathan Yau writes:

After noticing the Obama campaign was sending variations of an email to voters, ProPublica identified six distinct types with certain demographics and showed the differences. It was called the Message Machine. Now ProPublica is taking it a step further, hoping to dissect every email from all 2012 campaigns.
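The core technical step in such a dissection is grouping near-duplicate email variants. A crude sketch of the idea, using word-shingle overlap (the email texts are invented, and ProPublica's actual method may differ):

```python
from itertools import combinations

def shingles(text, n=3):
    """Set of n-word shingles from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Set overlap: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Invented email bodies standing in for campaign-email variants.
emails = {
    "e1": "Friend, we need your help before the midnight deadline tonight",
    "e2": "Friend, we need your support before the midnight deadline tonight",
    "e3": "Join us for dinner with the President next month",
}

# Pair up emails whose shingle overlap exceeds a threshold: crude variant detection.
THRESHOLD = 0.4
sets = {k: shingles(v) for k, v in emails.items()}
pairs = [(a, b) for a, b in combinations(sorted(sets), 2)
         if jaccard(sets[a], sets[b]) >= THRESHOLD]
# e1 and e2 differ by one word and group together; e3 stands alone.
```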

Fewer emails than in e-discovery or email archives.

Same or different tools/techniques?

Sarcastic Computers?

Thursday, May 31st, 2012

You may have seen the headline: Could Sarcastic Computers Be in Our Future? New Math Model Can Help Computers Understand Inference.

And the lead for the article sounds promising:

In a new paper, the researchers describe a mathematical model they created that helps predict pragmatic reasoning and may eventually lead to the manufacture of machines that can better understand inference, context and social rules.

Language is so much more than a string of words. To understand what someone means, you need context.

Consider the phrase, “Man on first.” It doesn’t make much sense unless you’re at a baseball game. Or imagine a sign outside a children’s boutique that reads, “Baby sale — One week only!” You easily infer from the situation that the store isn’t selling babies but advertising bargains on gear for them.

Present these widely quoted scenarios to a computer, however, and there would likely be a communication breakdown. Computers aren’t very good at pragmatics — how language is used in social situations.

But a pair of Stanford psychologists has taken the first steps toward changing that.

Context being one of those things you can use semantic mapping techniques to capture, I was interested.

Jack Park pointed me to a public PDF of the article: Predicting pragmatic reasoning in language games

Be sure to read the entire file.

A blue square, a blue circle, a green square.

Not exactly a general model for context and inference.
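Still, the model is simple enough to reconstruct in a few lines. This is my own sketch of the paper's game with the three objects above (uniform priors; not the authors' code):

```python
from fractions import Fraction

# Three objects, four words, as in the paper's example context.
objects = {
    "blue_square":  {"blue", "square"},
    "blue_circle":  {"blue", "circle"},
    "green_square": {"green", "square"},
}

def extension(word):
    """Objects a word is true of."""
    return [o for o, feats in objects.items() if word in feats]

def speaker(obj):
    """P(word | object): words true of the object, weighted by specificity 1/|extension|."""
    weights = {w: Fraction(1, len(extension(w))) for w in objects[obj]}
    total = sum(weights.values())
    return {w: p / total for w, p in weights.items()}

def listener(word):
    """P(object | word), assuming a uniform prior over objects."""
    scores = {o: speaker(o)[word] for o in extension(word)}
    total = sum(scores.values())
    return {o: p / total for o, p in scores.items()}
```

Hearing "blue", this listener favors the blue square (3/5) over the blue circle (2/5), because a speaker meaning the circle could have said the more informative "circle". That is the pragmatic inference, in three objects and four words.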

WikiLeaks as Wakeup Call?

Thursday, May 31st, 2012

Must be a slow news week. Federal Computer Week is recycling WikiLeaks as a “wakeup” call.

In case you have forgotten (or is that why the story is coming back up?), Robert Gates (Sec. of Defense) found that WikiLeaks did not disclose sensitive intelligence sources or methods.

Hardly “…a security breach of epic proportions…” as claimed by the State Department.

If you want to claim WikiLeaks was a “wakeup call,” make it a wakeup call about “data dumpster” techniques for sharing intelligence data.

“Here are all our reports. Good luck finding something, anything.”

That approach has “security breach” written all over it. Useless for analysis, but easy to copy in bulk.

What about this says “potential security breach” to you?

Best methods for sharing intelligence vary depending on the data, security requirements and a host of other factors. Take WikiLeaks as motivation (if lacking before) to strive for useful intelligence sharing.

Not sharing for the sake of saying you are sharing.

Mathematical Reasoning Group

Thursday, May 31st, 2012

Mathematical Reasoning Group

From the homepage:

The Mathematical Reasoning Group is a distributed research group based in the Centre for Intelligent Systems and their Applications, a research institute within the School of Informatics at the University of Edinburgh. We are a community of informaticists with interests in theorem proving, program synthesis and artificial intelligence. There is a more detailed overview of the MRG and a list of people. You can also find out how to join the MRG.

I was chasing down proceedings from prior “Large Heterogeneous Data” workshops (damn, that’s a fourth name), when I ran across this jewel as the location of some of the archives.

Has lots of other interesting papers, software, activities.

Sing out if you see something you think needs to appear on this blog.

Large Heterogeneous Data 2012

Thursday, May 31st, 2012

Workshop on Discovering Meaning On the Go in Large Heterogeneous Data 2012 (LHD-12)

Important Dates

  • Deadline for paper submission: July 31, 2012
  • Author notification: August 21, 2012
  • Deadline for camera-ready: September 10, 2012
  • Workshop date: November 11th or 12th, 2012

Take the time to read the workshop description.

A great summary of the need for semantic mappings, not more semantic fascism.

From the call for papers:

An interdisciplinary approach is necessary to discover and match meaning dynamically in a world of increasingly large data sources. This workshop aims to bring together practitioners from academia, industry and government for interaction and discussion. This will be a half-day workshop which primarily aims to initiate discussion and debate. It will involve

  • A panel discussion focussing on these issues from an industrial and governmental point of view. Membership to be confirmed, but we expect a representative from Scottish Government and from Google, as well as others.
  • Short presentations grouped into themed panels, to stimulate debate not just about individual contributions but also about the themes in general.

Workshop Description

The problem of semantic alignment – that of two systems failing to understand one another when their representations are not identical – occurs in a huge variety of areas: Linked Data, database integration, e-science, multi-agent systems, information retrieval over structured data; anywhere, in fact, where semantics or a shared structure are necessary but centralised control over the schema of the data sources is undesirable or impractical. Yet this is increasingly a critical problem in the world of large scale data, particularly as more and more of this kind of data is available over the Web.

In order to interact successfully in an open and heterogeneous environment, being able to dynamically and adaptively integrate large and heterogeneous data from the Web “on the go” is necessary. This may not be a precise process but a matter of finding a good enough integration to allow interaction to proceed successfully, even if a complete solution is impossible.

Considerable success has already been achieved in the field of ontology matching and merging, but the application of these techniques – often developed for static environments – to the dynamic integration of large-scale data has not been well studied.

Presenting the results of such dynamic integration to both end-users and database administrators – while providing quality assurance and provenance – is not yet a feature of many deployed systems. To make matters more difficult, on the Web there are massive amounts of information available online that could be integrated, but this information is often chaotically organised, stored in a wide variety of data-formats, and difficult to interpret.

This area has been of interest in academia for some time, and is becoming increasingly important in industry and – thanks to open data efforts and other initiatives – to government as well. The aim of this workshop is to bring together practitioners from academia, industry and government who are involved in all aspects of this field: from those developing, curating and using Linked Data, to those focusing on matching and merging techniques.

Topics of interest include, but are not limited to:

  • Integration of large and heterogeneous data
  • Machine-learning over structured data
  • Ontology evolution and dynamics
  • Ontology matching and alignment
  • Presentation of dynamically integrated data
  • Incentives and human computation over structured data and ontologies
  • Ranking and search over structured and semi-structured data
  • Quality assurance and data-cleansing
  • Vocabulary management in Linked Data
  • Schema and ontology versioning and provenance
  • Background knowledge in matching
  • Extensions to knowledge representation languages to better support change
  • Inconsistency and missing values in databases and ontologies
  • Dynamic knowledge construction and exploitation
  • Matching for dynamic applications (e.g., p2p, agents, streaming)
  • Case studies, software tools, use cases, applications
  • Open problems
  • Foundational issues

Applications and evaluations on data-sources that are from the Web and Linked Data are particularly encouraged.

Several years from now, how will you find this conference (and its proceedings)?

  • Large Heterogeneous Data 2012
  • Workshop on Discovering Meaning On the Go in Large Heterogeneous Data 2012
  • LHD-12

Just curious.

Joint International Workshop on Entity-oriented and Semantic Search

Thursday, May 31st, 2012

1st Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012

Important Dates:

  • Submissions Due: July 2, 2012
  • Notification of Acceptance: July 23, 2012
  • Camera Ready: August 1, 2012
  • Workshop date: August 16th, 2012

Located at the 35th ACM SIGIR Conference, Portland, Oregon, USA, August 12–16, 2012.

From the homepage of the workshop:

About the Workshop:

The workshop encompasses various tasks and approaches that go beyond the traditional bag-of-words paradigm and incorporate an explicit representation of the semantics behind information needs and relevant content. This kind of semantic search, based on concepts, entities and relations between them, has attracted attention both from industry and from the research community. The workshop aims to bring people from different communities (IR, SW, DB, NLP, HCI, etc.) and backgrounds (both academics and industry practitioners) together, to identify and discuss emerging trends, tasks and challenges. This joint workshop is a sequel of the Entity-oriented Search and Semantic Search Workshop series held at different conferences in previous years.


The workshop aims to gather all works that discuss entities along three dimensions: tasks, data and interaction. Tasks include entity search (search for entities or documents representing entities), relation search (search entities related to an entity), as well as more complex tasks (involving multiple entities, spatio-temporal relations inclusive, involving multiple queries). In the data dimension, we consider (web/enterprise) documents (possibly annotated with entities/relations), Linked Open Data (LOD), as well as user generated content. The interaction dimension gives room for research into user interaction with entities, also considering how to display results, as well as whether to aggregate over multiple entities to construct entity profiles. The workshop especially encourages submissions on the interface of IR and other disciplines, such as the Semantic Web, Databases, Computational Linguistics, Data Mining, Machine Learning, or Human Computer Interaction. Examples of topics of interest include (but are not limited to):

  • Data acquisition and processing (crawling, storage, and indexing)
  • Dealing with noisy, vague and incomplete data
  • Integration of data from multiple sources
  • Identification, resolution, and representation of entities (in documents and in queries)
  • Retrieval and ranking
  • Semantic query modeling (detecting, modeling, and understanding search intents)
  • Novel entity-oriented information access tasks
  • Interaction paradigms (natural language, keyword-based, and hybrid interfaces) and result representation
  • Test collections and evaluation methodology
  • Case studies and applications

We particularly encourage formal evaluation of approaches using previously established evaluation benchmarks: Semantic Search Challenge 2010, Semantic Search Challenge 2011, TREC Entity Search Track.

All workshops are special to someone. This one sounds more special than most. Collocated with the ACM SIGIR 2012 meeting. Perhaps that’s the difference.

Knowledge Extraction and Consolidation from Social Media

Thursday, May 31st, 2012

Knowledge Extraction and Consolidation from Social Media (KECSM 2012), November 11–12, 2012, Boston, USA.

Important dates

  • Jul 31, 2012: submission deadline full & short papers
  • Aug 21, 2012: notifications for research papers
  • Sep 10, 2012: camera-ready papers due
  • Oct 05, 2012: submission deadline poster & demo abstracts
  • Oct 10, 2012: notifications posters & demos

From the website:

The workshop aims to become a highly interactive research forum for exploring innovative approaches for extracting and correlating knowledge from degraded social media by exploiting the Web of Data. While the workshop’s general focus is on the creation of well-formed and well-interlinked structured data from highly unstructured Web content, its interdisciplinary scope will bring together researchers and practitioners from areas such as the semantic and social Web, text mining and NLP, multimedia analysis, data extraction and integration, and ontology and data mapping. The workshop will also look into innovative applications that exploit extracted knowledge in order to produce solutions to domain-specific needs.

We will welcome high-quality papers about current trends in the areas listed in the following, non-exhaustive list of topics. We will seek application-oriented, as well as more theoretical papers and position papers.

Knowledge detection and extraction (content perspective)

  • Knowledge extraction from text (NLP, text mining)
  • Dealing with scalability and performance issues with regard to large amounts of heterogeneous content
  • Multilinguality issues
  • Knowledge extraction from multimedia (image and video analysis)
  • Sentiment detection and opinion mining from text and audiovisual content
  • Detection and consideration of temporal and dynamics aspects
  • Dealing with degraded Web content

Knowledge enrichment, aggregation and correlation (data perspective)

  • Modelling of events and entities such as locations, organisations, topics, opinions
  • Representation of temporal and dynamics-related aspects
  • Data clustering and consolidation
  • Data enrichment based on linked data/semantic web
  • Using reference datasets to structure, cluster and correlate extracted knowledge
  • Evaluation of automatically extracted data

Exploitation of automatically extracted knowledge/data (application perspective)

  • Innovative applications which make use of automatically extracted data (e.g. for recommendation or personalisation of Web content)
  • Semantic search in annotated Web content
  • Entity-driven navigation of user-generated content
  • Novel navigation and visualisation of extracted knowledge/graphs and associated Web resources

I like the sound of “consolidation.” An unspoken or tacit goal of any knowledge gathering. Not much use in scattered pieces on the shop floor.

Collocated with the 11th International Semantic Web Conference (ISWC2012)

Informationsvisualisierung [Information Visualization]

Wednesday, May 30th, 2012

Informationsvisualisierung [Information Visualization] by Dr. Silvia Miksch.

Course outline and excellent source of reading materials on information visualization.

I originally saw links to a couple of chapters on information visualization in Christophe Lalanne’s Bag of Tweets for April 2012.

Hoping for an entire book, I walked the URL path back to the page cited above, which links to the chapters.

Excellent collection of readings.

How to Stay Current in Bioinformatics/Genomics [Role for Topic Maps as Filters?]

Wednesday, May 30th, 2012

How to Stay Current in Bioinformatics/Genomics by Stephen Turner.

From the post:

A few folks have asked me how I get my news and stay on top of what’s going on in my field, so I thought I’d share my strategy. With so many sources of information begging for your attention, the difficulty is not necessarily finding what’s interesting, but filtering out what isn’t. What you don’t read is just as important as what you do, so when it comes to things like RSS, Twitter, and especially e-mail, it’s essential to filter out sources where the content consistently fails to be relevant or capture your interest. I run a bioinformatics core, so I’m more broadly interested in applied methodology and study design rather than any particular phenotype, model system, disease, or method. With that in mind, here’s how I stay current with things that are relevant to me. Please leave comments with what you’re reading and what you find useful that I omitted here.

Here is a concrete example of the information feeds used to stay current on bioinformatics/genomics.

A topic map mantra has been: “All the information about a subject in one place.”

Should that change to: “Current information about subject(s) ….,” rather than aggregation, topic maps as a filtering strategy?

I think of filters as “subtractive” but that is only one view of filtering.

Can have “additive” filters as well.
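Both styles are easy to picture in code. A minimal sketch over a hypothetical feed of (title, tags) items:

```python
# Invented feed items: (title, set of tags).
feed = [
    ("New RNA-seq normalization method", {"genomics", "statistics"}),
    ("Conference hotel booking open",    {"logistics"}),
    ("GWAS replication pitfalls",        {"genomics"}),
]

def subtractive(items, blocked):
    """Subtractive filter: drop items carrying any blocked tag."""
    return [title for title, tags in items if not (tags & blocked)]

def additive(items, wanted):
    """Additive filter: keep only items carrying a wanted tag."""
    return [title for title, tags in items if tags & wanted]

subtractive(feed, {"logistics"})  # drops the hotel item
additive(feed, {"genomics"})      # keeps the two genomics items
```

A topic map used as an additive filter would play the role of the `wanted` set, with merged subjects standing in for bare tags.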

Take a look at the information feeds Stephen is using.

Would you use topic maps as “additive” or “subtractive” filters?

Printable, Math and Physics Flash Cards

Wednesday, May 30th, 2012

Printable, Math and Physics Flash Cards by Jason Underdown.

From the introduction:

Click on the links below to download PDF files containing double-sided flash cards suitable for printing on common business card printer paper. If you don’t have or don’t want to buy special business card paper, I have also included versions which include a grid. You can use scissors or a paper cutter to create your cards.

The definitions and theorems of mathematics constitute the body of the discipline. To become conversant in mathematics, you simply must become so familiar with certain concepts and facts that you can recall them without thought. Making these flash cards has been a great help in getting me closer to that point. I hope they help you too. If you find any errors please contact me at the email address below.

Some of the decks are works in progress and thus incomplete, but if you know how to use LaTeX, the source files are also provided, so you can add your own flash cards. If you do create new flash cards, please share them back with me. You can contact me at the address below. Special thanks to Andrew Budge who created the “flashcards” LaTeX class which handles the formatting details.

Quite delightful!

What areas do you think are missing for IR, statistics, search?

As a markup hand, XML, XSLT, XPath 2.0 spring to mind.

I suspect you would learn as much about an area authoring cards as you will from using them.

If you make a set, please post and send a note.
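If you do add your own cards, a new card might look roughly like this (assuming the “flashcards” LaTeX class mentioned above; the paper-stock option and styling command are illustrative, so check the class documentation for exact names):

```latex
% Sketch of a single flash card using the "flashcards" document class.
\documentclass[avery5371,frame]{flashcards}
\cardfrontstyle{headings}
\begin{document}
\begin{flashcard}{Cauchy sequence}
  A sequence $(x_n)$ such that for every $\varepsilon > 0$ there is an $N$
  with $|x_m - x_n| < \varepsilon$ for all $m, n \ge N$.
\end{flashcard}
\end{document}
```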

First seen in Christophe Lalanne’s Bag of Tweets for May 2012.

D3 Tutorials

Wednesday, May 30th, 2012

D3 Tutorials by Scott Murray.

I was following a link to the last tutorial in this series on axes when I discovered this resource.

Starts from a description of the tutorials:

  • Brief
  • Focused, each addressing a single topic
  • Modular, so you can reference only the topics relevant to your goals
  • Complete, with sample code illustrating each topic
  • Dynamic, updated and expanded as needed
  • Free, licensed so you can use the code however you wish

and ends sixteen (16) lessons later with a tutorial on axes, or as Scott puts it:

Let’s add horizontal and vertical axes, so we can do away with the horrible red numbers cluttering up our chart.

More tutorials are on the way but I am sure that Scott would appreciate your questions and encouragement.

I saw the reference to the axes tutorial in Christophe Lalanne’s Bag of Tweets for May 2012.

Human-Computer Interaction Lab – Tech Papers

Wednesday, May 30th, 2012

Human-Computer Interaction Lab – Tech Papers

Twenty-five (25) years worth of research papers, presentations, software and other materials from the Human-Computer Interaction Lab at the University of Maryland.

I discovered the tech report site in the musings of Kim Rees, Thoughts on the HCIL symposium.

From the overview of the HCIL:

The Human-Computer Interaction lab has a long, rich history of transforming the experience people have with new technologies. From understanding user needs, to developing and evaluating those technologies, the lab’s faculty, staff, and students have been leading the way in HCI research and teaching.

We believe it is critical to understand how the needs and dreams of people can be reflected in our future technologies. To this end, the HCIL develops advanced user interfaces and design methodology. Our primary activities include collaborative research, publication and the sponsorship of open houses, workshops and symposiums.

I mentioned the tech reports, don’t neglect the video reports, presentations and projects while you are browsing this site.

If I were making a limited set of sites to search for human-computer interface issues, this would be one of them. Yes?

Less Junk

Wednesday, May 30th, 2012

From the about page:

Less Junk is a search engine that aims to help sift through the junk on the internet. Let’s face it, there is way too much on the internet, and sometimes, you can’t find good information with a regular search engine. Less Junk searches only the top 5000 sites in the world based on user votes, so you know that you’re only searching the good stuff on the internet. A typical search can return literally millions of results, so you can see why this is helpful in a lot of cases. Our goal isn’t to replace the big name search engines, but rather to supplement them, and take off where they left off. Less Junk brings the social, human element to an industry that is defined by “crawlers” and “robots”.

A crude measure for “junk”: falling outside the top 5,000 sites by user votes. Is it as useful as more elaborate and expensive measures of quality? Looking at the all-time vote totals:

  • Apple 13 votes
  • Facebook 17 votes
  • Microsoft 13 votes
  • Yahoo 14 votes
  • Youtube 23 votes

It looks like a social site that hasn’t quite gone social. If you know what I mean.

You have to wonder about Youtube getting 23 votes as “less junk” in any contest.

Still, limiting the range of search content, perhaps not by voting, may be a good idea.

Thoughts on criteria other than social popularity for limiting the range of material to be searched?
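Whatever the criterion, the mechanics are simple to sketch: given ordinary search results, keep only hits from an approved list of sites. The whitelist and result URLs below are invented:

```python
from urllib.parse import urlparse

# Invented whitelist and result list.
whitelist = {"example.edu", "example.org"}

results = [
    "https://example.edu/paper.pdf",
    "https://spam.example.net/buy-now",
    "https://example.org/tutorial",
]

def restrict(urls, allowed):
    """Keep only results whose host is on the allowed list."""
    return [u for u in urls if urlparse(u).hostname in allowed]

restrict(results, whitelist)  # drops the spam.example.net hit
```

Swapping votes for citation counts, editorial curation, or a topic map of trusted subjects only changes how `whitelist` is built, not the filter itself.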


Titan

Wednesday, May 30th, 2012


Alpha Release Coming June 5, 2012

From the homepage:

Titan is a distributed graph database optimized for storing and processing large-scale graphs within a multi-machine cluster.

If the names Marko A. Rodriguez or Matthias Broecheler mean anything to you, June 5th can’t come soon enough!

Graph/Network World View

Wednesday, May 30th, 2012

Or should I say: graph/network view of the world?

There are many graph databases available; for an extensive list, see Graph Database at Wikipedia.

I ask because, popular as graph/network software is, it may not be the best choice for you. Or your data.

With Facebook at 900 million users, social networks are on the tip of the tongue of just about everyone.

What if you are interested in a subset of a social network? Or a subset of characteristics in a social network? Or a “social network” with strict limits on what can appear?

Sounds like the schema for a relational database, doesn’t it? It has specified properties, relationships (foreign keys), tables, etc.

Is it surprising a schema can be viewed as a network? Albeit one with predefined limits and contours?
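To make that concrete, here is a minimal sketch of a made-up relational schema viewed as a directed graph, with tables as nodes and foreign keys as labeled edges:

```python
# Invented schema: (source table, target table) -> foreign-key column.
schema_edges = {
    ("orders",    "customers"): "customer_id",
    ("orders",    "products"):  "product_id",
    ("customers", "regions"):   "region_id",
}

def neighbors(table):
    """Tables reachable from `table` by following one foreign key."""
    return sorted(dst for (src, dst) in schema_edges if src == table)

neighbors("orders")  # ['customers', 'products']
```

The graph was always there; the relational engine simply fixes its shape in advance and optimizes for traversals (joins) along those predefined edges.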

But a schema, the relational kind, can be implemented and optimized for particular operations. Those may be operations that are of interest to you.

Or not.

You may have a need for a greater range of operations or different operations than are supported by a relational schema.

One of the NoSQL database offerings, viewed as a network “from a certain point of view,” may be more appropriate.

Or you may need one of the graph databases I listed earlier or that you find elsewhere.

A “graph” is an abstraction onto which you can map relationships, characteristics and capabilities.

A graph/network database comes with built-in relationships, characteristics and capabilities, chosen by its implementers.

Just like relational, NoSQL, New SQL and other databases.

So, saying “graph or network” doesn’t mean your requirements are going to be met.

Comparing your requirements to assumed relationships, characteristics and capabilities is up to you.

Introductions to Graph Databases and Theory

Wednesday, May 30th, 2012

Introductions to Graph Databases and Theory

Links to a couple of introductory videos on graphs and to Graph Theory and Complex Networks: An Introduction by Maarten van Steen.

Life sciences/bioinformatics orientation.

A blog to revisit for graphs + life sciences + bioinformatics.

Graph Theory and Complex Networks: An Introduction

Wednesday, May 30th, 2012

Graph Theory and Complex Networks: An Introduction by Maarten van Steen.

From the webpage:

GTCN aims to explain the basics of graph theory that are needed at an introductory level for students in computer or information sciences. To motivate students and to show that even these basic notions can be extremely useful, the book also aims to provide an introduction to the modern field of network science.

I take the starting-point that mathematics for most students is unnecessarily intimidating. Explicit attention is paid in the first chapters to mathematical notations and proof techniques, emphasizing that the notations form the biggest obstacle, not the mathematical concepts themselves. Taking this approach has allowed me to gradually prepare students for using tools that are necessary to put graph theory to work: complex networks.

In the second part of the book the student learns about random networks, small worlds, the structure of the Internet and the Web, and social networks. Again, everything is discussed at an elementary level, but such that in the end students indeed have the feeling that they:

  1. Have learned how to read and understand the basic mathematics related to graph theory
  2. Understand how basic graph theory can be applied to optimization problems such as routing in communication networks
  3. Know a bit more about this sometimes mystical field of small worlds and random networks.
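Point 2 deserves a concrete illustration. The classic example of graph theory applied to routing in communication networks is Dijkstra's shortest-path algorithm; a minimal sketch over an invented weighted network:

```python
import heapq

# Invented network: node -> {neighbor: link cost}.
graph = {
    "a": {"b": 1, "c": 4},
    "b": {"c": 2, "d": 5},
    "c": {"d": 1},
    "d": {},
}

def shortest_path_cost(graph, start, goal):
    """Dijkstra: cheapest total edge weight from start to goal."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")

shortest_path_cost(graph, "a", "d")  # a -> b -> c -> d, cost 4
```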

The full text of Graph Theory and Complex Networks (GTCN) is available as a “personalized” download (“personalized for” at the top of each page and “your email address” at the bottom of each page) or from Amazon for $25.00.

Additional course materials are also available at this site.

You will be amused to read about the difficulty of graph/network notation:

It is also not that difficult, as most notations come directly from set theory.

That’s reassuring. 😉

GTCN offers suggestions for translating mathematical notation into English. A useful skill, here and elsewhere.

I ran across this resource at: Introductions to Graph Databases and Theory.

Analysing spatial point patterns in R

Tuesday, May 29th, 2012

Analysing spatial point patterns in R by Adrian Baddeley. (ebook)

If that doesn’t sound immediately appealing, consider the following from section 1.1 Types of Data:

1.1.1 Points

A point pattern dataset gives the locations of objects/events occurring in a study region.

[graphic omitted]

The points could represent trees, animal nests, earthquake epicentres, petty crimes, domiciles of new cases of influenza, galaxies, etc.

The points might be situated in a region of the two-dimensional (2D) plane, or on the Earth’s surface, or a 3D volume, etc. They could be points in space-time (e.g. earthquake epicentre location and time). The software presented here is only applicable to 2D point patterns (but we’re working on it).

1.1.2 Marks

The points may have extra information called marks attached to them. The mark represents an “attribute” of the point…

Now, think of something that doesn’t occupy a spatial point pattern (outside of digital memory).

😉 It is that universal.
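The book works in R, but the data structure itself is language-neutral. A minimal sketch of a marked 2D point pattern (invented tree data) in Python:

```python
import random

# A marked point pattern: random event locations in a unit-square study
# region, each point carrying a mark (here, an invented species label).
random.seed(42)

species = ["oak", "pine"]
pattern = [
    {"x": random.random(), "y": random.random(), "mark": random.choice(species)}
    for _ in range(50)
]

def in_window(p, xmin, xmax, ymin, ymax):
    """Is the point inside a rectangular sub-window of the study region?"""
    return xmin <= p["x"] <= xmax and ymin <= p["y"] <= ymax

oaks = [p for p in pattern if p["mark"] == "oak"]
left_half = [p for p in pattern if in_window(p, 0.0, 0.5, 0.0, 1.0)]
```

Selecting by mark and by sub-window, as above, is the starting point for the intensity and correlation analyses the book develops.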

First seen in Christophe Lalanne’s Bag of Tweets for May 2012.

Tuesday, May 29th, 2012

By my count, thirty-nine (39) visualization tools.

You are sure to find something you like and can use.


First seen in Christophe Lalanne’s Bag of Tweets for May 2012.

Statistics for Genomics (Spring 2012)

Tuesday, May 29th, 2012

Statistics for Genomics (Spring 2012) by Rafael Irizarry.

Rafael is in the process of posting lectures from his statistics for genomics course online.


RafaLab’s Facebook page

Twitter feed

Good way to learn R, statistics and a good bit about genomics.

Destination: Montreal!

Tuesday, May 29th, 2012

If you remember the Saturday afternoon sci-fi movies, Destination: …., then you will appreciate the title for this post. 😉

Tommie Usdin and company just posted: Balisage 2012 Call for Late-breaking News, written in torn bodice style:

The peer-reviewed part of the Balisage 2012 program has been scheduled (and will be announced in a few days). A few slots on the Balisage program have been reserved for presentation of “Late-breaking” material.

Proposals for late-breaking slots must be received by June 15, 2012. Selection of late-breaking proposals will be made by the Balisage conference committee, instead of being made in the course of the regular peer-review process.

If you have a presentation that should be part of Balisage, please send a proposal message as plain-text email to

In order to be considered for inclusion in the final program, your proposal message must supply the following information:

  • The name(s) and affiliations of all author(s)/speaker(s)
  • The email address of the presenter
  • The title of the presentation
  • An abstract of 100-150 words, suitable for immediate distribution
  • Disclosure of when and where, if some part of this material has already been presented or published
  • An indication as to whether the presenter is comfortable giving a conference presentation and answering questions in English about the material to be presented
  • Your assurance that all authors are willing and able to sign the Balisage Non-exclusive Publication Agreement with respect to the proposed presentation

In order to be in serious contention for inclusion in the final program, your proposal should probably be either a) really late-breaking (it happened in the last month or two) or b) a paper, an extended paper proposal, or a very long abstract with references. Late-breaking slots are few and the competition is fiercer than for peer-reviewed papers. The more we know about your proposal, the better we can appreciate the quality of your submission.

Please feel encouraged to provide any other information that could aid the conference committee as it considers your proposal, such as a detailed outline, samples, code, and/or graphics. We expect to receive far more proposals than we can accept, so it’s important that you send enough information to make your proposal convincing and exciting. (This material may be attached to the email message, if appropriate.)

The conference committee reserves the right to make editorial changes in your abstract and/or title for the conference program and publicity. (emphasis added to last sentence)

Read that last sentence again!

The conference committee reserves the right to make editorial changes in your abstract and/or title for the conference program and publicity.

The conference committee might change your abstract and/or title to say something …. controversial? ….attention getting? ….CNN / Slashdot worthy?

Bring it on!

Submit late breaking proposals!



Tuesday, May 29th, 2012

I stumbled upon CUBRID via its Important Facts to Know about CUBRID page, where the first entry reads:

Naming Conventions:

The name of this DBMS is CUBRID, written in capital letters, and not Cubrid. We would appreciate much if you followed this naming conventions. It should be fairly simple to remember, itsn’t it!?

Got my attention!

Not that there is a lack of projects with “attitude” on the Net, but here is a project with “attitude” that expresses it cleverly. Not just offensively.

Features of CUBRID:

Here are the key features that make CUBRID the most optimized open source database management system:

First time I have seen CUBRID.

Does promise a release supporting sharding in June 2012.

The documentation posits extensions to the relational data model:

Extending the Relational Data Model


For the relational data model, it is not allowed that a single column has multiple values. In CUBRID, however, you can create a column with several values. For this purpose, collection data types are provided in CUBRID. The collection data type is mainly divided into SET, MULTISET and LIST; the types are distinguished by duplicated availability and order.

  • SET : A collection type that does not allow the duplication of elements. Elements are stored without duplication after being sorted regardless of their order of entry.
  • MULTISET : A collection type that allows the duplication of elements. The order of entry is not considered.
  • LIST : A collection type that allows the duplication of elements. Unlike with SET and MULTISET, the order of entry is maintained.
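The three collection types differ in exactly two properties: whether duplicates are allowed and whether entry order is kept. Python (not CUBRID SQL) has close analogues of each, which makes the distinction easy to see:

```python
from collections import Counter

entries = ["b", "a", "b", "c"]

# SET: no duplicates, stored sorted regardless of entry order.
cubrid_set = sorted(set(entries))        # ['a', 'b', 'c']

# MULTISET: duplicates allowed, entry order not significant.
cubrid_multiset = Counter(entries)       # counts: b=2, a=1, c=1

# LIST: duplicates allowed, entry order preserved.
cubrid_list = list(entries)              # ['b', 'a', 'b', 'c']
```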


Inheritance is a concept to reuse columns and methods of a parent table in those of child tables. CUBRID supports reusability through inheritance. By using inheritance provided by CUBRID, you can create a parent table with some common columns and then create child tables inherited from the parent table with some unique columns added. In this way, you can create a database model which can minimize the number of columns.
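A rough analogy in Python (not CUBRID syntax): the parent “table” declares the common columns once, and each child adds only its own. All names below are invented for illustration.

```python
class Person:                          # parent table: shared columns
    def __init__(self, name, email):
        self.name = name
        self.email = email

class Employee(Person):                # child table: adds salary
    def __init__(self, name, email, salary):
        super().__init__(name, email)
        self.salary = salary

class Student(Person):                 # child table: adds gpa
    def __init__(self, name, email, gpa):
        super().__init__(name, email)
        self.gpa = gpa

e = Employee("Ada", "ada@example.org", 90000)
```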


In a relational database, the reference relationship between tables is defined as a foreign key. If the foreign key consists of multiple columns or the size of the key is significantly large, the performance of join operations between tables will be degraded. However, CUBRID allows the direct use of the physical address (OID) where the records of the referred table are located, so you can define the reference relationship between tables without using join operations.

That is, in an object-oriented database, you can create a composition relation where one record has a reference value to another by using the column displayed in the referred table as a domain (type), instead of referring to the primary key column from the referred table.
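The contrast can be sketched in plain Python with invented data: a foreign key stores an identifier that must be resolved through another table (the join-like step), while an OID-style reference stores the referred record directly.

```python
# Foreign-key style: resolving the customer needs a second lookup.
customers_by_id = {42: {"name": "Ada"}}
order_fk = {"item": "book", "customer_id": 42}
fk_customer = customers_by_id[order_fk["customer_id"]]

# OID style: the order holds a direct reference, no lookup required.
ada = {"name": "Ada"}
order_oid = {"item": "book", "customer": ada}
oid_customer = order_oid["customer"]
```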

Suggestions/comments on what to try first?

GitHub Social Graphs with Groovy and GraphViz

Tuesday, May 29th, 2012

GitHub Social Graphs with Groovy and GraphViz

From the post:

Using the GitHub API, Groovy and GraphViz to determine, interpret and render a graph of the relationships between GitHub users based on the watchers of their repositories. The end result can look something like this.

[Image omitted. I started to embed the image, but at the narrow width of my blog it just didn’t look good. See the post for the full size version.]

A must see for all Groovy fans!
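The same pipeline (GitHub API → graph → GraphViz) can be sketched in Python as well. Here `watchers` stands in for data you would fetch from the GitHub API — the user names are invented — and the function emits GraphViz DOT text:

```python
# Maps a repository owner to the users watching their repos;
# in the real pipeline this would come from the GitHub API.
watchers = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
}

def to_dot(watchers):
    """Emit a GraphViz DOT digraph of watcher -> owner edges."""
    lines = ["digraph github {"]
    for owner, users in watchers.items():
        for user in users:
            lines.append(f'  "{user}" -> "{owner}";')
    lines.append("}")
    return "\n".join(lines)

print(to_dot(watchers))
```

Piping the output through GraphViz (`dot -Tpng`) renders the graph.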

For an alternative, see:

Mining GitHub – Followers in Tinkerpop

Pointers to social graphs for GitHub using other tools appreciated!


Tuesday, May 29th, 2012


A tool for exploring texts on a non-word basis.

Or in the words of the project:

ProseVis is a visualization tool developed as part of a use case supported by the Andrew W. Mellon Foundation through a grant titled “SEASR Services,” in which we seek to identify other features than the “word” to analyze texts. These features comprise sound including parts-of-speech, accent, phoneme, stress, tone, break index.

ProseVis allows a reader to map the features extracted from the OpenMary Text-to-speech System and predictive classification data to the “original” text. We developed this project with the ultimate goal of facilitating a reader’s ability to analyze and disseminate the results in human readable form. Research has shown that mapping the data to the text in its original form allows for the kind of human reading that literary scholars engage: words in the context of phrases, sentences, lines, stanzas, and paragraphs (Clement 2008). Recreating the context of the page not only allows for the simultaneous consideration of multiple representations of knowledge or readings (since every reader’s perspective on the context will be different) but it also allows for a more transparent view of the underlying data. If a human can see the data (the syllables, the sounds, the parts-of-speech) within the context in which they are used to reading, with the data mapped back onto the full text, then the reader is empowered within this familiar context to read what might otherwise be an unfamiliar tabular representation of the text. For these reasons, we developed ProseVis as a reader interface to allow scholars to work with the data in a language or context in which we are used to saying things about the world.
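OpenMary’s real output is phonemic and far richer than this, but the core move — attaching a non-word feature to each word and mapping it back onto the text in reading order — can be illustrated with a toy Python stand-in. The vowel-group syllable estimate below is a naive heuristic invented here, not part of ProseVis or OpenMary:

```python
import re

def estimate_syllables(word):
    """Naive syllable estimate: count groups of adjacent vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def annotate(text):
    """Map the feature back onto each word, in reading order."""
    return [(w, estimate_syllables(w)) for w in text.split()]
```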

Textual analysis tools are “smoking gun” detectors.

A CEO is unlikely to make inappropriate comments in a spreadsheet or data feed. Emails, on the other hand… 😉

Big or little data, the goal is to have the “right” data.

Network diagrams simplified

Tuesday, May 29th, 2012

Network diagrams simplified

Kim Rees at Flowing Data writes of a new network visualization technique by Cody Dunne:

In essence, he aggregates leaf nodes into a fan glyph that describes the underlying data in its size, arc, and color. Span nodes are similarly captured into crescent glyphs. The result is an easy to read, high level look at the network. You can easily compare different sections of the network, understand areas that may have been occluded by the lines in a traditional diagram, and see relationships far more quickly.
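The aggregation step behind the fan glyphs can be sketched in a few lines of Python: leaf nodes (degree-1 nodes) are collapsed into a per-hub count, which the glyph’s size would then encode. The edge data below is invented; the full technique also handles span/connector nodes and glyph rendering, which this sketch omits.

```python
from collections import defaultdict

edges = [("hub1", "a"), ("hub1", "b"), ("hub1", "c"),
         ("hub1", "hub2"), ("hub2", "d")]

# Compute node degrees.
degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

fan = defaultdict(int)   # hub -> number of leaves collapsed into it
kept = []                # edges between non-leaf nodes survive as-is
for u, v in edges:
    if degree[v] == 1:
        fan[u] += 1
    elif degree[u] == 1:
        fan[v] += 1
    else:
        kept.append((u, v))
```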

The explanation is useful but I think you will find the visualizations impressive!

Check Kim’s post for images and links to more materials.

The Anatomy of Search Technology: Crawling using Combinators [blekko – part 2]

Monday, May 28th, 2012

The Anatomy of Search Technology: Crawling using Combinators by Greg Lindahl.

From the post:

This is the second guest post (part 1) of a series by Greg Lindahl, CTO of blekko, the spam free search engine. Previously, Greg was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.

What’s so hard about crawling the web?

Web crawlers have been around as long as the Web has — and before the web, there were crawlers for gopher and ftp. You would think that 25 years of experience would render crawling a solved problem, but the vast growth of the web and new inventions in the technology of webspam and other unsavory content results in a constant supply of new challenges. The general difficulty of tightly-coupled parallel programming also rears its head, as the web has scaled from millions to 100s of billions of pages.

In part 2, you learn why you were supposed to pay attention to combinators in part 1.

Want to take a few minutes to refresh on part 1?

Crawler problems still exist but you may have some new approaches to try.

The Anatomy of Search Technology: blekko’s NoSQL database [part 1]

Monday, May 28th, 2012

The Anatomy of Search Technology: blekko’s NoSQL database by Greg Lindahl.

From the post:

This is a guest post by Greg Lindahl, CTO of blekko, the spam free search engine that had over 3.5 million unique visitors in March. Greg Lindahl was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.

Imagine that you’re crazy enough to think about building a search engine. It’s a huge task: the minimum index size needed to answer most queries is a few billion webpages. Crawling and indexing a few billion webpages requires a cluster with several petabytes of usable disk — that’s several thousand 1 terabyte disks — and produces an index that’s about 100 terabytes in size.

Greg starts with the storage aspects of the blekko search engine before taking on crawling in part 2 of this series.

Pay special attention to the combinators. You will be glad you did.
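The idea behind combinators, stripped to a sketch: instead of read-modify-write, a writer supplies a value plus an associative, commutative combine function, so writes from many crawler processes can be merged in any order. This dict-backed store and its two combine functions are invented here to illustrate the concept; blekko’s actual combinators are described in the posts.

```python
def add(old, new):
    """Combine by summation -- e.g. counting inbound links."""
    return old + new

def union(old, new):
    """Combine by set union -- e.g. accumulating anchor texts."""
    return old | new

class CombinatorStore:
    def __init__(self):
        self.cells = {}

    def write(self, key, value, combine):
        """Merge `value` into the cell instead of overwriting it."""
        if key in self.cells:
            self.cells[key] = combine(self.cells[key], value)
        else:
            self.cells[key] = value

store = CombinatorStore()
store.write("inlinks/example.com", 1, add)
store.write("inlinks/example.com", 1, add)
store.write("anchors/example.com", {"click here"}, union)
```

Because the combine functions are associative and commutative, the final cell values are independent of the order in which the writes arrive.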

5 Hidden Skills for Big Data Scientists

Monday, May 28th, 2012

5 Hidden Skills for Big Data Scientists by Matthew Hurst.

Matthew outlines five (5) skills for data scientists:

  1. Be Clear: Is Your Problem Really A Big Data Problem?
  2. Communicating About Your Data
  3. Invest in Interactive Analytics, not Reporting
  4. Understand the Role and Quality of Human Evaluations of Data
  5. Spend Time on the Plumbing

Turn off your email/cellphone and spend a few minutes jotting down your ideas on these points.

Then compare your ideas/comments to Matthew’s.

Not a question of better/worse but of forming a habit of thinking about data.