Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 10, 2013

Tom Sawyer and Crowdsourcing

Filed under: Crowd Sourcing,Marketing — Patrick Durusau @ 3:15 pm

Crowdsource from your Community the Tom Sawyer Way – Community Nuggets Vol.1 (video by Dave Olson)

Crowdsource From Your Community – the Tom Sawyer Way (article by Connor Meakin)

Deeply impressive video/article.

More of the nuts and bolts of the social side of crowdsourcing.

The side that determines how successful (or not) a crowdsourcing effort will be.

It makes me wonder how to adapt the lessons of crowdsourcing both for the development of topic maps and for topic map standardization.

Suggestions/comments?

Bayesian Reasoning and Machine Learning (update)

Filed under: Bayesian Models,Machine Learning — Patrick Durusau @ 3:15 pm

Bayesian Reasoning and Machine Learning by David Barber.

I first posted about this work at: Bayesian Reasoning and Machine Learning in 2011.

The current draft (the one that corresponds to the Cambridge University Press hard copy) is dated January 9, 2013.

If you use the online version and have the funds, please order a hard copy to encourage the publisher to continue to make published texts available online.

Titan 0.3.0

Filed under: Graphs,Titan — Patrick Durusau @ 3:15 pm

Titan 0.3.0 (roadmap) by Matthias Broecheler.

From the post:

Just wanted to share with you an update on the Titan roadmap. We re-prioritized a bunch of features and decided that it was about time to remove some technical debt in the Titan core module. This turned into a major rewrite of Titan’s internals which opened the door to adding some great new features. With that many changes, Titan 0.3.0 will be backwards incompatible, so we decided to do a 0.2.1 release first, which includes a bunch of bugfixes, the multi-module refactoring and other changes that we have added to master over the last two months. Titan 0.2.1-SNAPSHOT has been deployed to Sonatype and will be released in two weeks.

Titan 0.3.0-SNAPSHOT currently lives in the “indexing” branch which indicates one of the major new features that will be coming in Titan 0.3.0: full-text indexing, numeric range indexing, and geospatial indexing for both vertices and edges. These advanced indexing capabilities are provided by ElasticSearch (http://www.elasticsearch.org/) and Lucene (http://lucene.apache.org/) which are now integrated into Titan and available as Titan modules. Similarly to storage backends, Titan abstracts external indexes, which allows it to interface with arbitrary indexing solutions. We chose Lucene for this initial release because it’s the most popular and most mature indexing system in the open source domain. Like BerkeleyDB, it is designed for single machine use. ElasticSearch is a fairly young but quickly maturing open source project built on top of Lucene that scales to multiple servers and is robust against failure. Hence, it is an ideal partner for Cassandra or HBase.

….

Since a lot of people have asked for this feature, I thought you might want to take a look at Titan 0.3.0-SNAPSHOT and play around with it to give us some feedback. Note that Titan 0.3.0 is not yet stable as we are still tinkering with the interface and sorting out some hyper threading issues.
Other things that are new in 0.3.0:

  • use “unique” in type definitions to mark labels and keys as functional (i.e. unique(Direction.OUT)). That allows us to retire the mathematical term “functional”.
  • complete rewrite of the caching engine which is now much better about caching vertex centric query results
  • better byte representation and lazy de-serialization for better performance
  • better query optimization and query rewriting for both vertex centric queries and global graph queries
  • Edge no longer extends Vertex. Access to unidirectional edges through get/setProperty
  • Properties on vertices can have properties on them (mind boggling…) which is very useful for versioning, timestamping, etc.

The “properties on vertices can have properties on them” feature reminds me of scope in topic maps.
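
If you are curious what the unique(Direction.OUT) style in the first bullet might look like in practice, here is a rough sketch against the TypeMaker builder. Treat it as an illustration only: the method names follow the post and the older 0.2 API, and the released 0.3.0 signatures may differ.

```java
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.TitanKey;
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Vertex;

public class TitanTypeSketch {
    public static void main(String[] args) {
        // Local, BerkeleyDB-backed graph; the directory is an arbitrary choice.
        TitanGraph graph = TitanFactory.open("/tmp/titan-sketch");

        // unique(Direction.OUT) replaces the old "functional" flag:
        // a vertex may carry at most one outgoing value for this key.
        TitanKey name = graph.makeType()
                .name("name")
                .dataType(String.class)
                .unique(Direction.OUT)
                .makePropertyKey();

        Vertex saturn = graph.addVertex(null);
        saturn.setProperty("name", "saturn");

        // Close the graph.
        graph.shutdown();
    }
}
```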

SPMF

Filed under: Algorithms,Data Mining — Patrick Durusau @ 3:14 pm

SPMF: A Sequential Pattern Mining Framework

From the webpage:

SPMF is an open-source data mining platform written in Java.

It is distributed under the GPL v3 license.

It offers implementations of 52 data mining algorithms for:

  • sequential pattern mining,
  • association rule mining,
  • frequent itemset mining,
  • sequential rule mining,
  • clustering

It can be used as a standalone program with a user interface or from the command line. Moreover, the source code of each algorithm can be integrated in other Java software.

The documentation consists entirely of examples of using SPMF for data mining tasks.

The algorithms page details the fifty-two (52) algorithms of SPMF with references to the literature.

I first saw this at: SPMF: Sequential Pattern Mining Framework.

Using and abusing evidence

Filed under: Government,Medical Informatics,Transparency — Patrick Durusau @ 3:14 pm

New thematic series: Using and abusing evidence by Adrian Aldcroft.

From the post:

Scientific evidence plays an important role in guiding medical laws and policies, but how evidence is represented, and often misrepresented, warrants careful consideration. A new cross-journal thematic series headed by Genome Medicine, Using and abusing evidence in science and health policy, explores the application of evidence in healthcare law and policy in an attempt to uncover how evidence from research is translated into the public sphere. Other journals involved in the series include BMC Medical Ethics, BMC Public Health, BMC Medical Genomics, BMC Psychiatry, and BMC Medicine.

Articles already published include an argument for reframing the obesity epidemic through the use of the term caloric overconsumption, an examination of bioethics in popular science literature, and a look at the gap between reality and public perception when discussing the potential of stem cell therapies. Other published articles look at the quality of informed consent in pediatric research and evidence for genetic discrimination in the life insurance industry. More articles will be added to the series as they are published.

Articles published in this series were invited from delegates at the meeting “Using and Abusing Evidence in Science and Health Policy” held in Banff, Alberta, on May 30th-June 1st, 2012. We hope the publication of the article collection will contribute to the understanding of the ethical and political implications associated with the application of evidence in research and politics.

A useful series, but I wonder how effective the identification of “abuse” of evidence will be without identifying its abusers.

Or without making the case for “abuse” of evidence in a compelling manner.

For example, changing “obesity” to “caloric overconsumption” (Addressing the policy cacophony does not require more evidence: an argument for reframing obesity as caloric overconsumption) carries the day if and only if one presumes a regulatory environment with the goal of improving public health.

The near-toxic levels of high-fructose corn syrup in the average American diet demonstrate that the goals of food regulation in the United States have little to do with public health and welfare.

Identification of who makes such policies, who benefits, and who is harmed (“obesity” being a case in point) could go a long way towards creating a different regulatory environment.

March 9, 2013

The god Architecture

Filed under: Database,DHash,god Architecture,Redis — Patrick Durusau @ 3:51 pm

The god Architecture

From the overview:

god is a scalable, performant, persistent, in-memory data structure server. It allows massively distributed applications to update and fetch common data in a structured and sorted format.

Its main inspirations are Redis and Chord/DHash. Like Redis it focuses on performance, ease of use and a small, simple yet powerful feature set, while from the Chord/DHash projects it inherits scalability, redundancy, and transparent failover behaviour.

This is a general architectural overview aimed at somewhat technically inclined readers interested in how and why god does what it does.

To try it out right now, install Go, git, Mercurial and gcc, go get github.com/zond/god/god_server, run god_server, browse to http://localhost:9192/.

For API documentation, go to http://go.pkgdoc.org/github.com/zond/god.

For the source, go to https://github.com/zond/god

I know, “in memory” means it’s not “web scale” but to be honest, I have a lot of data needs that aren’t “web scale.”

There, I’ve said it. Some (most?) important data is not “web scale.”

And when it is, I only have to check my spam filter for options to deal with “web scale” data.

The set operations in particular look quite interesting.

Enjoy!

I first saw this in Nat Torkington’s Four short links: 1 March 2013.

Elasticsearch OpenNLP Plugin

Filed under: ElasticSearch,Natural Language Processing — Patrick Durusau @ 3:50 pm

Elasticsearch OpenNLP Plugin

From the webpage:

This plugin uses the opennlp project to extract named entities from an indexed field. This means, when a certain field of a document is indexed, you can extract entities like persons, dates and locations from it automatically and store them in additional fields.

Extracting entities into roles perhaps?
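
If you want a feel for what the plugin is doing under the hood, here is a minimal sketch that calls OpenNLP directly (not the plugin’s API). The model file names are assumptions; download the pre-trained English tokenizer and person-name models before running.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class EntityExtractionSketch {
    public static void main(String[] args) throws Exception {
        // Model paths are assumptions: use the pre-trained English models from the OpenNLP project.
        try (InputStream tokenStream = new FileInputStream("en-token.bin");
             InputStream nerStream = new FileInputStream("en-ner-person.bin")) {

            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokenStream));
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(nerStream));

            String[] tokens = tokenizer.tokenize(
                "Patrick Durusau wrote about Elasticsearch on March 9, 2013.");

            // Each span covers the token range of one detected person name.
            Span[] names = finder.find(tokens);
            for (Span span : names) {
                StringBuilder name = new StringBuilder();
                for (int i = span.getStart(); i < span.getEnd(); i++) {
                    name.append(tokens[i]).append(' ');
                }
                System.out.println(span.getType() + ": " + name.toString().trim());
            }
            finder.clearAdaptiveData();
        }
    }
}
```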

Learning from Big Data: 40 Million Entities in Context

Filed under: BigData,Disambiguation,Entities,Entity Resolution — Patrick Durusau @ 3:50 pm

Learning from Big Data: 40 Million Entities in Context by Dave Orr, Amar Subramanya, and Fernando Pereira, Google Research.

A fuller explanation of the Wikilinks Corpus from Google:

When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.

To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages — over 100 times bigger than the next largest corpus (about 100,000 documents, see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.

Suggestions for using the data? The authors have those as well:

What might you do with this data? Well, we’ve already written one ACL paper on cross-document co-reference (and received lots of requests for the underlying data, which partly motivates this release). And really, we look forward to seeing what you are going to do with it! But here are a few ideas:

  • Look into coreference — when different mentions mention the same entity — or entity resolution — matching a mention to the underlying entity
  • Work on the bigger problem of cross-document coreference, which is how to find out if different web pages are talking about the same person or other entity
  • Learn things about entities by aggregating information across all the documents they’re mentioned in
  • Type tagging tries to assign types (they could be broad, like person, location, or specific, like amusement park ride) to entities. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia.
  • Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.

Those all sound like topic map tasks to me, especially if you capture your coreference results for merging with other coreference results.
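
As a toy illustration of the aggregation idea, here is a sketch that collects the anchor texts used for each Wikipedia entity. The tab-separated layout (page URL, anchor text, Wikipedia URL) is hypothetical, not the corpus’s actual file format, so adjust the parsing to whatever the release actually ships.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MentionAggregator {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: one mention per line as "pageUrl \t anchorText \t wikipediaUrl".
        Map<String, Set<String>> namesPerEntity = new HashMap<String, Set<String>>();
        BufferedReader reader = new BufferedReader(new FileReader("mentions.tsv"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length < 3) continue;
            String anchorText = fields[1];
            String entity = fields[2];
            Set<String> names = namesPerEntity.get(entity);
            if (names == null) {
                names = new HashSet<String>();
                namesPerEntity.put(entity, names);
            }
            names.add(anchorText);
        }
        reader.close();

        // Each entity's set of anchor texts is a crude list of its known names,
        // exactly the kind of coreference result one might merge into a topic map.
        for (Map.Entry<String, Set<String>> entry : namesPerEntity.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```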

The history of Hadoop:…

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:50 pm

The history of Hadoop: From 4 nodes to the future of data by Derrick Harris.

From the post:

Depending on how one defines its birth, Hadoop is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.

Alone, Hadoop is a software market that IDC predicts will be worth $813 million in 2016 (although that number is likely very low), but it’s also driving a big data market the research firm predicts will hit more than $23 billion by 2016. Since Cloudera launched in 2008, Hadoop has spawned dozens of startups and spurred hundreds of millions of dollars in venture capital investment.

In this four-part series, we’ll explain everything anyone concerned with information technology needs to know about Hadoop. Part I is the history of Hadoop from the people who willed it into existence and took it mainstream. Part II is more graphic: a map of the now-large and complex ecosystem of companies selling Hadoop products. Part III is a look into the future of Hadoop that should serve as an opening salvo for much of the discussion at our Structure: Data conference March 20-21 in New York. Finally, Part IV will highlight some of the best Hadoop applications and seminal moments in Hadoop history, as reported by GigaOM over the years.

Whether you hope for insight into what makes a software paradigm successful or just want to enrich your knowledge of Hadoop’s history, this is a great place to start!

Enjoy!

Research Data Symposium – Columbia

Research Data Symposium – Columbia.

Posters from the Research Data Symposium, held at Columbia University, February 27, 2013.

Subject to the limitations of the poster genre but useful as a quick overview of current projects and directions.

Introduction to Apache HBase Snapshots

Filed under: HBase — Patrick Durusau @ 11:57 am

Introduction to Apache HBase Snapshots by Matteo Bertozzi.

From the post:

The current (4.2) release of CDH — Cloudera’s 100% open-source distribution of Apache Hadoop and related projects (including Apache HBase) — introduced a new HBase feature, recently landed in trunk, that allows an admin to take a snapshot of a specified table.

Prior to CDH 4.2, the only way to back up or clone a table was to use Copy/Export Table, or after disabling the table, copy all the hfiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table but with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.

In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.

Here are a few of the use cases for HBase snapshots:

  • Recovery from user/application errors
    • Restore/Recover from a known safe state.
    • View previous snapshots and selectively merge the difference into production.
    • Save a snapshot right before a major application upgrade or change.
  • Auditing and/or reporting on views of data at specific time
    • Capture monthly data for compliance purposes.
    • Run end-of-day/month/quarter reports.
  • Application testing
    • Test schema or application changes on data similar to that in production from a snapshot and then throw it away. For example: take a snapshot, create a new table from the snapshot content (schema plus data), and manipulate the new table by changing the schema, adding and removing rows, and so on. (The original table, the snapshot, and the new table remain mutually independent.)
  • Offloading of work
    • Take a snapshot, export it to another cluster, and run your MapReduce jobs. Since the export snapshot operates at HDFS level, you don’t slow down your main HBase cluster as much as CopyTable does.

Under “application testing” I would include access to your HBase data by non-experts. Gives them something to tinker with and preserves the integrity of your production data.
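
If you want to try the feature, here is a minimal sketch of the snapshot workflow using the HBaseAdmin client API. The table and snapshot names are invented and the configuration is read from your usual hbase-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            // Take a snapshot of a live table: no disable, no data copy.
            admin.snapshot("sales_snapshot_20130309", "sales");

            // Materialize the snapshot as an independent table for testing.
            admin.cloneSnapshot("sales_snapshot_20130309", "sales_test");

            // Roll the original table back to the snapshot (table must be disabled first).
            admin.disableTable("sales");
            admin.restoreSnapshot("sales_snapshot_20130309");
            admin.enableTable("sales");
        } finally {
            admin.close();
        }
    }
}
```

The cross-cluster export mentioned in the post is handled by a separate MapReduce-based export tool rather than through HBaseAdmin.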

NetflixGraph

Filed under: Graphs,NetflixGraph,Networks — Patrick Durusau @ 11:57 am

NetflixGraph: Compact in-memory representation of directed graph data by Drew Koszewnik.

From the post:

Your memory footprint just shrank

NetflixGraph is a compact in-memory data structure used to represent directed graph data. You can use NetflixGraph to vastly reduce the size of your application’s memory footprint, potentially by an order of magnitude or more. If your application is I/O bound, you may be able to remove that bottleneck by holding your entire dataset in RAM. You’ll likely be very surprised by how little memory is actually required to represent your data.

NetflixGraph provides an API to translate your data into a graph format, compress that data in memory, then serialize the compressed in-memory representation of the data so that it may be easily transported across your infrastructure.

Definitely a high priority for the coming weekend!

Solr + “flash sale site”

Filed under: Lucene,Solr — Patrick Durusau @ 11:56 am

How Solr powers search on America’s largest flash sale site by Ade Trenaman.

The post caught my attention with “flash sale,” which I had to look up. 😉

Even after discovering it means “deal of the day,” I found the slides interesting.

Especially the commentary on synonym lists!

What someone else considers to be a synonym may not be one for your audience.
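
For anyone who has not looked at one lately, the synonyms.txt consumed by Solr’s SynonymFilterFactory is just a text file. The entries below are invented for a flash-sale audience, which is exactly the point: another audience might reject every line of it.

```
# comma-separated terms are treated as equivalent
flash sale, daily deal, deal of the day

# "=>" maps the left-hand terms onto the right-hand term (one direction only)
tee, tee shirt => t-shirt
```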

…Wikilinks Corpus With 40M Mentions And 3M Entities

Filed under: Corpus Linguistics,Disambiguation,Entities,Entity Resolution — Patrick Durusau @ 11:56 am

Google Research Releases Wikilinks Corpus With 40M Mentions And 3M Entities by Frederic Lardinois.

From the post:

Google Research just launched its Wikilinks corpus, a massive new data set for developers and researchers that could make it easier to add smart disambiguation and cross-referencing to their applications. The data could, for example, make it easier to find out if two web sites are talking about the same person or concept, Google says. In total, the corpus features 40 million disambiguated mentions found within 10 million web pages. This, Google notes, makes it “over 100 times bigger than the next largest corpus,” which features fewer than 100,000 mentions.

For Google, of course, disambiguation is something that is a core feature of the Knowledge Graph project, which allows you to tell Google whether you are looking for links related to the planet, car or chemical element when you search for ‘mercury,’ for example. It takes a large corpus like this one and the ability to understand what each web page is really about to make this happen.

Details follow on how to create this data set.

Very cool!

The only caution is that your entities, those specific to your enterprise, are unlikely to appear, even in 40M mentions.

But the Wikilinks Corpus + your entities, now that is something with immediate ROI for your enterprise.

Graph Partitioning and Expanders (April 2013)

Filed under: Graph Partitioning,Graphs,Networks — Patrick Durusau @ 11:56 am

Graph Partitioning and Expanders by Professor Luca Trevisan.

From the description:

In this research-oriented graduate course, we will study algorithms for graph partitioning and clustering, constructions of expander graphs, and analysis of random walks. These are three topics that build on the same mathematical background and that have several important connections: for example it is possible to find graph clusters via random walks, and it is possible to use the linear programming approach to graph partitioning as a way to study random walks.

We will study spectral graph theory, which explains how certain combinatorial properties of graphs are related to the eigenvalues and eigenvectors of the adjacency matrix, and we will use it to describe and analyze spectral algorithms for graph partitioning and clustering. Spectral graph theory will recur as an important tool in the rest of the course. We will also discuss other approaches to graph partitioning via linear programming and semidefinite programming. Then we will study constructions of expander graphs, which are graphs with very strong pseudorandomness properties, which are useful in many applications, including in cryptography, in complexity theory, in algorithms and data structures, and in coding theory. Finally, we will study the mixing time of random walks, a problem that comes up in several applications, including the analysis of the convergence time of certain randomized algorithms, such as the Metropolis algorithm.

Workload

about 8 hours per week

Prerequisites

linear algebra, discrete probability, and algorithms

The Instructor

Luca Trevisan is a professor of computer science at Stanford University. Before joining Stanford in 2010, Luca taught at Columbia University and at the University of California, Berkeley.

Luca’s research is in theoretical computer science, and he has worked on average-case complexity theory, pseudorandomness and derandomization, hardness of approximation, probabilistically checkable proofs, and approximation algorithms. In the past three years he has been working on spectral graph theory and its applications to graph algorithms.

Luca received the STOC’97 Danny Lewin award, the 2000 Oberwolfach Prize, and the 2000 Sloan Fellowship. He was an invited speaker at the 2006 International Congress of Mathematicians in Madrid.

Not for the faint of heart!

But on the other hand, if you want to be on the cutting edge of graph development….
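
If you want a taste of the spectral connection before committing eight hours a week, one standard result from the course’s territory is Cheeger’s inequality, which (stated here from memory) pins the conductance of a graph between functions of the second smallest eigenvalue of its normalized Laplacian:

```latex
% L = I - D^{-1/2} A D^{-1/2} is the normalized Laplacian,
% \phi(G) the conductance (how cheaply G can be cut into two pieces),
% \lambda_2 the second smallest eigenvalue of L.
\frac{\lambda_2}{2} \;\le\; \phi(G) \;\le\; \sqrt{2\,\lambda_2}
```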

March 8, 2013

Six Degrees of Francis Bacon…

Filed under: EU,Graphics,Visualization — Patrick Durusau @ 5:26 pm

Six Degrees of Francis Bacon, a 17th century social network by Nathan Yau.

From the post:

[Image: network of Francis Bacon]

Nathan points us to a project to determine the relationships of Francis Bacon:

Six Degrees of Francis Bacon.

Imagine that instead of collecting “door pass” data in the Man Bites Dog story about influence of special interests in the EU Parliament, the study collected financial, social, education, and other relationships with members of the EU Parliament and the favors it bestows.

Same outcome? Or different?

Databases & Dragons

Filed under: MongoDB,Software — Patrick Durusau @ 5:17 pm

Databases & Dragons by Kristina Chodorow.

From the post:

Here are some exercises to battle-test your MongoDB instance before going into production. You’ll need a Database Master (aka DM) to make bad things happen to your MongoDB install and one or more players to try to figure out what’s going wrong and fix it.

Should be of interest if you are taking MongoDB into production.

The idea should also be of interest if you are developing other software to go into production.

Most software (not all) works fine with expected values, other components responding correctly, etc.

But those are the very conditions your software may not encounter in production.

Where’s your “databases & dragons” test for your software?

Man Bites Dog Story (EU Interest Groups and Legislation)

Filed under: Government,Government Data — Patrick Durusau @ 5:04 pm

Interest groups and the making of legislation

From the post:

How are the activities of interest groups related to the making of legislation? Does mobilization of interest groups lead to more legislation in the future? Alternatively, does the adoption of new policies motivate interest groups to get active? Together with Dave Lowery, Brendan Carroll and Joost Berkhout, we tackle these questions in the case of the European Union. What we find is that there is no discernible signal in the data indicating that the mobilization of interest groups and the volume of legislative production over time are significantly related. Of course, absence of evidence is not the same as evidence of absence, so a link might still exist, as suggested by theory, common wisdom and existing studies of the US (e.g. here). But using quite a comprehensive set of model specifications we can’t find any link in our time-series sample. The abstract of the paper is below and as always you can find at my website the data, the analysis scripts, and the pre-print full text. On a side note – I am very pleased that we managed to publish what is essentially a negative finding. As everyone seems to agree, discovering which phenomena are not related might be as important as discovering which phenomena are. Still, there are few journals that would apply this principle in their editorial policy. So kudos to the journal Interest Groups and Advocacy.

Abstract
Different perspectives on the role of organized interests in democratic politics imply different temporal sequences in the relationship between legislative activity and the influence activities of organized interests. Unfortunately, lack of data has greatly limited any kind of detailed examination of this temporal relationship. We address this problem by taking advantage of the chronologically very precise data on lobbying activity provided by the door pass system of the European Parliament and data on EU legislative activity collected from EURLEX. After reviewing the several different theoretical perspectives on the timing of lobbying and legislative activity, we present a time-series analysis of the co-evolution of legislative output and interest groups for the period 2005-2011. Our findings show that, contrary to what pluralist and neo-corporatist theories propose, interest groups neither lead nor lag bursts in legislative activity in the EU.

You can read an earlier version of the paper at: Timing is Everything? Organized Interests and the Timing of Legislative Activity. (I say earlier version because the title is the same but the abstract is slightly different.)

A post or so ago, in Untangling algorithmic illusions from reality in big data, the point was made that biases in data collection can make a significant difference in results.

The “negative” finding in this paper is an example of that hazard.

From the paper:

The European Parliament maintains a door pass system for lobbyists. Everyone entering the Parliament’s premises as a lobbyist is expected to register on this list ….

Now there’s a serious barrier to any special interest group that wants to influence the EU Parliament!

Certainly no special interest group would be so devious and under-handed as to meet with members of the EU Parliament away from the Parliament’s premises.

Say, in exotic vacation spots/spas? Or at meetings of financial institutions? Or just in the normal course of their day to day affairs?

The U.S. registers lobbyists, but like the EU “door pass” system, it is the public side of influence.

People with actual influence don’t have to rely on anything as crude as lobbyists to ensure their goals are met.

The data you collect may exclude the most important data.

Unless it is your goal for it to be excluded, then carry on.

Crossfilter

Filed under: Data Mining,Dataset,Filters,Javascript,Top-k Query Processing — Patrick Durusau @ 4:34 pm

Crossfilter: Fast Multidimensional Filtering for Coordinated Views

From the webpage:

Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Crossfilter uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Crossfilter works, see the API reference.

See the webpage for an impressive demonstration with a 5.3 MB dataset.

Is there a trend towards “big data” manipulation on clusters and “less big data” in browsers?

Will be interesting to see how the benchmarks for “big” and “less big” move over time.

I first saw this in Nat Torkington’s Four Short links: 4 March 2013.

Untangling algorithmic illusions from reality in big data

Filed under: Algorithms,BigData — Patrick Durusau @ 3:46 pm

Untangling algorithmic illusions from reality in big data by Alex Howard.

From the post:

Microsoft principal researcher Kate Crawford (@katecrawford) gave a strong talk at last week’s Strata Conference in Santa Clara, Calif. about the limits of big data. She pointed out potential biases in data collection, questioned who may be excluded from it, and hammered home the constant need for context in conclusions. Video of her talk is embedded below:

See Alex’s post for the video and the interview that follows.

Both are simply golden.

How important are biases in data collection?

Consider the classic example:

Are you in favor of convicted felons owning firearms?

90%+ of those surveyed say they favor gun control.

Are you in favor of gun control?

A much lower percentage say they favor gun control.

The numbers are from memory, from surveys that are probably forty years old, but the lesson is to watch the question being asked.

A survey that doesn’t expose its questions, how people were contacted, at what time of day, just to name a few factors, isn’t worthy of comment.

Why FoundationDB Might Be All Its Cracked Up To Be

Filed under: FoundationDB,NoSQL,SQL — Patrick Durusau @ 3:30 pm

Why FoundationDB Might Be All Its Cracked Up To Be by Doug Turnbull.

From the post:

When I first heard about FoundationDB, I couldn’t imagine how it could be anything but vaporware. Seemed like Unicorns crapping happy rainbows to solve all your problems. As I’m learning more about it though, I realize it could actually be something ground breaking.

NoSQL: Lets Review…

So, I need to step back and explain one reason NoSQL databases have been revolutionary. In the days of yore, we used to normalize all our data across multiple tables on a single database living on a single machine. Unfortunately, Moore’s law eventually crapped out and maybe more importantly hard drive space stopped increasing massively. Our data and demands on it only kept growing. We needed to start trying to distribute our database across multiple machines.

Turns out, it’s hard to maintain transactionality in a distributed, heavily normalized SQL database. As such, a lot of NoSQL systems have emerged with simpler features, many promoting a model based around some kind of single row/document/value that can be looked up/inserted with a key. Transactionality for these systems is limited to a single key-value entry (“row” in Cassandra/HBase or “document” in Mongo/Couch — we’ll just call them rows here). Rows are easily stored in a single node, although we can replicate this row to multiple nodes. Despite being replicated, it turns out transactionally working with single rows in distributed NoSQL is easier than guaranteeing transactionality of an SQL query visiting potentially many SQL tables in a distributed system.

There are deep design ramifications/limitations to the transactional nature of rows. First, you always try to cram a lot of data related to the row’s key into a single row, ending up with massive rows of hierarchical or flat data that all relates to the row key. This lets you cover as much data as possible under the row-based transactionality guarantee. Second, as you only have a single key to use from the system, you must choose very wisely what your key will be. You may need to think hard about how your data will be looked up through its whole life; it can be hard to go back. Additionally, if you need to look up on a secondary value, you better hope that your database is friendly enough to have a secondary key feature or otherwise you’ll need to maintain a secondary row for storing the relationship. Then you have the problem of working across two rows, which doesn’t fit in the transactionality guarantee. Third, you might lose the ability to perform a join across multiple rows. In most NoSQL data stores, joining is discouraged and denormalization into large rows is the encouraged best practice.

FoundationDB Is Different

FoundationDB is a distributed, sorted key-value store with support for arbitrary transactions across multiple key-values — multiple “rows” — in the database.

As Doug points out, there is much left to be known.

Still, exciting to have something new to investigate.

Model Matters: Graphs, Neo4j and the Future

Filed under: Graphs,Modeling,Neo4j — Patrick Durusau @ 2:58 pm

Model Matters: Graphs, Neo4j and the Future by Tareq Abedrabbo.

From the post:

As part of our work, we often help our customers choose the right datastore for a project. There are usually a number of considerations involved in that process, such as performance, scalability, the expected size of the data set, and the suitability of the data model to the problem at hand.

This blog post is about my experience with graph database technologies, specifically Neo4j. I would like to share some thoughts on when Neo4j is a good fit but also what challenges Neo4j faces now and in the near future.

I would like to focus on the data model in this blog post, which for me is the crux of the matter. Why? Simply because if you don’t choose the appropriate data model, there are things you won’t be able to do efficiently and other things you won’t be able to do at all. Ultimately, all the considerations I mentioned earlier influence each other and it boils down to finding the most acceptable trade-off rather than picking a database technology for one specific feature one might fancy.

So when is a graph model suitable? In a nutshell when the domain consists of semi-structured, highly connected data. That being said, it is important to understand that semi-structured doesn’t imply an absence of structure; there needs to be some order in your data to make any domain model purposeful. What it actually means is that the database doesn’t enforce a schema explicitly at any given point in time. This makes it possible for entities of different types to cohabit – usually in different dimensions – in the same graph without the need to make them all fit into a single rigid structure. It also means that the domain can evolve and be enriched over time when new requirements are discovered, mostly with no fear of breaking the existing structure.

Effectively, you can start taking a more fluid view of your domain as a number of superimposed layers or dimensions, each one representing a slice of the domain, and each layer can potentially be connected to nodes in other layers.

More importantly, the graph becomes the single place where the full domain representation can be consolidated in a meaningful and coherent way. This is something I have experienced on several projects, because modeling for the graph gives developers the opportunity to think about the domain in a natural and holistic way. The alternative is often a data-centric approach, that usually results from integrating different data flows together into a rigidly structured form which is convenient for databases but not for the domain itself.

Interesting review of the current and some projected capabilities of Neo4j.

I am particularly sympathetic to starting with the data users have, as opposed to starting with a model written in software and shoehorning the user’s data to fit the model.

Can be done, has been done (for decades), and works quite well in some cases.

But not all cases.

neo4j: Make properties relationships [Associations As First Class Citizens?]

Filed under: Graphs,Neo4j,Networks — Patrick Durusau @ 2:43 pm

neo4j: Make properties relationships by Mark Needham.

From the post:

I spent some of the weekend working my way through Jim, Ian & Emil‘s book ‘Graph Databases‘ and one of the things that they emphasise is that graphs allow us to make relationships first class citizens in our model.

Looking back on a couple of the graphs that I modelled last year I realise that I didn’t quite get this and, although the graphs I modelled had some relationships, a lot of the time I was defining things as properties on nodes.

While it’s fine to do this I think we lose some of the power of a graph and it’s not necessarily obvious what we’ve lost until we model a property as a relationship and see what possibilities open up.

For example in my football graph I wanted to record the date of matches and initially stored this as a property on the match before realising that modelling it as a relationship might open up some interesting queries.
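
To see the difference in code rather than prose, here is a small sketch against the embedded Neo4j Java API (1.8/1.9 era) that models the match date both ways. The store path, property names and relationship type are all invented for the example.

```java
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class MatchDateSketch {
    public static void main(String[] args) {
        GraphDatabaseService db =
            new GraphDatabaseFactory().newEmbeddedDatabase("target/football-db");
        Transaction tx = db.beginTx();
        try {
            // Property style: the date is locked inside the match node.
            Node matchAsProperty = db.createNode();
            matchAsProperty.setProperty("type", "match");
            matchAsProperty.setProperty("date", "2012-08-18");

            // Relationship style: the date becomes a node other matches can share.
            Node matchAsRelationship = db.createNode();
            matchAsRelationship.setProperty("type", "match");
            Node date = db.createNode();
            date.setProperty("value", "2012-08-18");
            matchAsRelationship.createRelationshipTo(
                date, DynamicRelationshipType.withName("PLAYED_ON"));

            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}
```

Once the date is a node of its own, a question like “every match played on 2012-08-18” becomes a one-hop traversal instead of a property scan over every match.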

Reading Mark’s post illustrates the power of using associations to model “properties” in topic maps.

In Neo4j, relationships are first class citizens.

Unfortunately, we can’t say the same for associations in topic maps.

You may recall that associations in a topic map are restricted in the information they can carry.

If you want to add a name to an association, for example, you have to reify the association with a topic. Which means you have the association and a topic for the association, representing the same subject.

Not to mention a lot of machinery overhead for something fairly simple.

I am aware that the TMDM and XTM were fashioned to follow the original version of ISO 13250, the origin of reification in topic maps.

However, simply because all buggies had whips at one point is no reason to design cars with whip holders.

The time has come to revisit reification and in my view, revise both the TMDM and XTM to remove it.

And to make associations and occurrences first class citizens in both the TMDM and XTM.

Comments/suggestions?

hadoop illuminated (book)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 2:06 pm

hadoop illuminated by Mark Kerzner and Sujee Maniyam.

Largely a subjective judgment but I think the explanations of Hadoop are getting better.

Oh, the deep/hard stuff is still there, but the on ramp for getting to that point has become easier.

This book is a case in point.

I first saw this in a tweet by Computer Science.

Adding Value through graph analysis…

Filed under: Faunus,Graphs,Titan — Patrick Durusau @ 6:17 am

Adding Value through graph analysis using Titan and Faunus by Matthias Broecheler.

Alludes to Titan 0.3.0 release but the latest I saw at the Titan site was 0.2.0. Perhaps 0.3.0 will be along presently.

I don’t recall seeing Titan listed in the Literature Survey of Graph Databases so I have sent the author a note about including Titan in any updates to the survey.

BTW, I would not take the ages on slide 35 seriously. 😉

March 7, 2013

Open Source for Cybersecurity?

Filed under: Cybersecurity,Open Source,Security — Patrick Durusau @ 5:15 pm

A couple of weeks ago I posted: Crowdsourcing Cybersecurity: A Proposal (Part 1) and Crowdsourcing Cybersecurity: A Proposal (Part 2), concluding that publicity (not secrecy) about security flaws would enhance cybersecurity.

Then this week I read:

A classic open source koan is that “with many eyes, all bugs become shallow.” In IT security, is it that with many eyes, all worms become shallow?

Burton: What the Department of Defense said was if someone has malicious intent and the code isn’t available, they’ll have some way of getting the code. But if it is available and everyone has access to it, then any vulnerabilities that are there are much more likely to be corrected than before they’re exploited.

(From Alex Howard’s interview of CFPB (Consumer Financial Protection Bureau) CIO Chris Willey (@ChrisWilleyDC) and acting deputy CIO Matthew Burton (@MatthewBurton), reported in Open source is interoperable with smarter government at the CFPB.)

If the “white hats” aren’t going to recognize the benefits of crowdsourcing cybersecurity, perhaps it is time for the “black hats” to take up the mantle of crowdsourcing.

Perhaps that will force the “white hats” to adopt better security measures than “security by secrecy.”

Public mappings of security flaws anyone?


Update: DARPA to Turn Off Funding for Hackers Pursuing Cybersecurity Research

The Pentagon is scuttling a program that awards grants to reformed hackers and security professionals for short-term research with game-changing potential, according to cybersecurity firm Kaspersky Lab.

That’s the ticket. If we don’t know it, it must not be known.

Million Song Dataset in Minutes!

Filed under: Hadoop,MapReduce,Mortar,Pig,Python — Patrick Durusau @ 3:50 pm

Million Song Dataset in Minutes! (Video)

Actually 5:35 as per the video.

The summary of the video reads:

Created Web Project [zero install]

Loaded data from S3

Developed in Pig and Python [watch for the drop down menus of pig fragments]

ILLUSTRATE’d our work [perhaps the most impressive feature, tests code against sample of data]

Ran on Hadoop [drop downs to create a cluster]

Downloaded results [50 “densest songs”, see the video]

It’s not all “hands free” or without intellectual effort on your part.

But, a major step towards a generally accessible interface for Hadoop/MapReduce data processing.

MortarData2013

Filed under: Hadoop,MapReduce,Mortar,Pig — Patrick Durusau @ 3:36 pm

MortarData2013

Mortar has its own YouTube channel!

Unlike the History Channel, the MortarData2013 channel is educational and entertaining.

I leave it to you to guess whether those two adjectives apply to the History Channel. (Hint: Thirty (30) minutes of any Vikings episode should help you answer.)

Not a lot of content at the moment, but what is there is worth watching. I am going to cover one of the videos in a separate post.

Data Warehousing and Big Data Papers by Peter Bailis

Filed under: BigData,Data Warehouse — Patrick Durusau @ 3:16 pm

Quick and Dirty (Incomplete) List of Interesting, Mostly Recent Data Warehousing and Big Data Papers by Peter Bailis

Alex Popescu reports some twenty-seven (27) papers and links gathered by Peter Bailis on Data Warehousing and Big Data!

Enjoy!

PersistIT [B+ Tree]

Filed under: B+Tree,Data Structures,Java — Patrick Durusau @ 3:04 pm

PersistIT: A fast, transactional, Java B+Tree library

From the webpage:

Akiban PersistIT is a key/value data storage library written in Java™. Key features include:

  • Support for highly concurrent transaction processing with multi-version concurrency control
  • Optimized serialization and deserialization mechanism for Java primitives and objects
  • Multi-segment keys to enable a natural logical key hierarchy
  • Support for long records
  • Implementation of a persistent SortedMap
  • Extensive management capability including command-line and GUI tools

For more information

I mention this primarily because of the multi-segment keys, which I suspect could be useful for type hierarchies.

Possibly other uses as well but that is the first one that came to mind.
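
Here is a rough sketch of that idea, loosely modeled on the project’s HelloWorld example as I remember it. The volume and tree names are assumptions and initialization depends on your persistit.properties.

```java
import com.persistit.Exchange;
import com.persistit.Persistit;

public class TypeHierarchyKeys {
    public static void main(String[] args) throws Exception {
        Persistit db = new Persistit();
        try {
            // Reads persistit.properties (buffer sizes, volume definitions, etc.).
            db.initialize();

            // An Exchange is the handle for reading/writing one tree in one volume.
            Exchange ex = db.getExchange("demo", "types", true);

            // Multi-segment key: each append() adds one level of the hierarchy.
            ex.getKey().clear().append("vehicle").append("car").append("sedan");
            ex.getValue().put("a type entry for vehicle/car/sedan");
            ex.store();

            // Keys sort segment by segment, so all "vehicle"/"car" entries are adjacent,
            // which keeps subtree traversals of the hierarchy cheap.
            ex.getKey().clear().append("vehicle").append("car").append("sedan");
            ex.fetch();
            System.out.println(ex.getValue().getString());
        } finally {
            db.close();
        }
    }
}
```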

