Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 16, 2013

Cassandra project chair: We’re taking on Oracle (Cassandra 2.0)

Filed under: Cassandra,NoSQL — Patrick Durusau @ 6:35 pm

Cassandra project chair: We’re taking on Oracle by Paul Krill.

From the post:

Apache Cassandra is an open source, NoSQL database accommodating large-scale workloads and attracting a lot of attention, having been deployed in such organizations as Netflix, eBay, and Twitter. It was developed at Facebook, which open-sourced it in 2008, and its database can be deployed across multiple data centers and in cloud environments.

Jonathan Ellis is the chair of the project at Apache, and he serves as chief technical officer at DataStax, which has built a business around Cassandra. InfoWorld Editor-at-Large Paul Krill spoke with Ellis at the company’s recent Cassandra Summit 2013 conference in San Francisco, where Ellis discussed efforts to make the database easier to use and how it has become a viable competitor to Oracle’s relational database technology.

InfoWorld: What is the biggest value-add for Cassandra?

Ellis: It’s driving the Web applications. We’re the ones who power Netflix, Spotify. Cassandra is actually powering the applications directly. It lets you scale to millions of operations per second and software-as-a-service, machine-generated data, Web applications. Those are all really hot spots for Cassandra.

Cassandra 2.0 is targeted for the end of July, 2013. Lightweight transactions and triggers are on the menu.
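
If you are curious what the lightweight transactions will look like, here is a minimal sketch (my illustration, not from the interview) of a CQL compare-and-set using IF NOT EXISTS, issued from Python with the DataStax cassandra-driver. The keyspace, table, and column names are placeholders.

    # Hypothetical sketch: a Cassandra 2.0 lightweight transaction from Python.
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])        # assumes a local Cassandra node
    session = cluster.connect('demo')       # assumes a 'demo' keyspace exists

    # Insert the row only if no row with this username exists yet.
    result = session.execute(
        "INSERT INTO users (username, email) VALUES (%s, %s) IF NOT EXISTS",
        ('jdoe', 'jdoe@example.com'))

    # was_applied reports whether the conditional write won.
    print(result.was_applied)

    cluster.shutdown()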

Open Data Certificates

Filed under: Open Data — Patrick Durusau @ 6:17 pm

Open Data Certificates

From the website:

1. Publish your data

Good news! You’ve already done this bit (or you’re about to). Now let’s make your data easier for people to find, use and share.

2. Check it with our questionnaire

Our helpful questions act like a checklist. They explain your options about how to publish good open data and give you clear and recognised targets to aim for.

3. Share it with a certificate

Your answers determine which of our four certificates you generate. Each one means success in a unique way and demonstrates you are a leader in open data.

From the questionnaire:

This self-assessment questionnaire generates an open data certificate and badge you can publish to tell people all about this open data. We also use your answers to learn how organisations publish open data.

When you answer these questions it demonstrates your efforts to comply with relevant UK legislation. You should also check which other laws and policies apply to your sector, especially if you’re outside the UK (which these questions don’t cover).

The self-assessment aspect of the certificate seems problematic to me.

Too many bad experiences with SDOs that rely on self-assessment in place of independent review.

Having said that, the checklist will help those interested in producing quality data products.

Perhaps there is a commercial opportunity in assessing open data sets?

Music Information Research Based on Machine Learning

Filed under: Machine Learning,Music,Music Retrieval — Patrick Durusau @ 3:38 pm

Music Information Research Based on Machine Learning by Masataka Goto and Kazuyoshi Yoshii.

From the webpage:

Music information research has been gaining a lot of attention since 2000, when the general public started listening to music on computers in daily life. It is widely known as an important research field, and new researchers are continually joining it worldwide. Academically, one reason so many researchers are involved is that the essential unresolved issue is understanding complex musical audio signals, which convey content by forming a temporal structure while multiple sounds are interrelated. Additionally, there are still appealing unresolved issues that have not been touched yet, and the field is a treasure trove of research topics that could be tackled with state-of-the-art machine learning techniques.

This tutorial is intended for an audience interested in the application of machine learning techniques to such music domains. Audience members who are not familiar with music information research are welcome, and researchers working on music technologies are likely to find something new to study.

First, the tutorial serves as a showcase of music information research. The audience can enjoy and study many state-of-the-art demonstrations of music information research based on signal processing and machine learning. This tutorial highlights timely topics such as active music listening interfaces, singing information processing systems, web-related music technologies, crowdsourcing, and consumer-generated media (CGM).

Second, this tutorial explains the music technologies behind the demonstrations. The audience can learn how to analyze and understand musical audio signals, process singing voices, and model polyphonic sound mixtures. As a new approach to advanced music modeling, this tutorial introduces unsupervised music understanding based on nonparametric Bayesian models.

Third, this tutorial provides a practical guide to getting started in music information research. The audience can try available research tools such as music feature extraction, machine learning, and music editors. Music databases and corpora are then introduced. As a hint towards research topics, this tutorial also discusses open problems and grand challenges that the audience members are encouraged to tackle.

In the future, music technologies, together with image, video, and speech technologies, are expected to contribute toward all-around media content technologies based on machine learning.

Download tutorial slides.

Always nice to start the week with something different.
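
As a small taste of the “music feature extraction” tools the tutorial points to, here is a minimal sketch using the librosa Python library (my choice for illustration, not one of the authors’ tools) to pull MFCC features from an audio file. The file path is a placeholder.

    # Hypothetical sketch: extract MFCC features as input for machine learning.
    import librosa

    # Point this at any local audio file.
    y, sr = librosa.load("example.wav", sr=22050)

    # 13 mel-frequency cepstral coefficients per analysis frame,
    # a common low-level feature in music information retrieval.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    print(mfcc.shape)   # (13, number_of_frames)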

I first saw this in a tweet by Masataka Goto.

Storing and visualizing LinkedIn with Neo4j and sigma.js

Filed under: Graphs,Neo4j,Sigma.js,Visualization — Patrick Durusau @ 3:19 pm

Storing and visualizing LinkedIn with Neo4j and sigma.js by Bob Briody.

From the post:

In this post I am going to present a way to:

  • load a LinkedIn network via the LinkedIn developer API into Neo4j using Python
  • serve the network from Neo4j using Node.js, Express.js, and Cypher
  • display the network in the browser using sigma.js

Bob remarks that his method for deduping relationships would not scale to very large networks.
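
Bob’s own code is in his post. Purely as an illustration of one write-time approach to deduping, here is a sketch using Cypher’s MERGE clause from Python with the official Neo4j driver; the labels, property names, and connection details are my assumptions, not Bob’s.

    # Hypothetical sketch: MERGE creates nodes and relationships only if they
    # do not already exist, deduping connections at write time.
    # Assumes a recent (5.x) neo4j Python driver and a local Neo4j server.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    def connect_people(tx, a_id, b_id):
        tx.run(
            "MERGE (a:Person {linkedin_id: $a_id}) "
            "MERGE (b:Person {linkedin_id: $b_id}) "
            "MERGE (a)-[:CONNECTED_TO]->(b)",
            a_id=a_id, b_id=b_id)

    with driver.session() as session:
        session.execute_write(connect_people, "123", "456")

    driver.close()

Whether MERGE holds up on very large networks is exactly the scaling question Bob raises.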

Pointers to how LinkedIn deals with that problem?

I first saw this in a tweet by Peter Neubauer.

June 15, 2013

Making Life Hard For The NSA

Filed under: NSA,Security — Patrick Durusau @ 4:51 pm

Google has found a unique way to get back at the NSA for its snooping activities. Make snooping exponentially harder!

Or at least that is how I read: Introducing Project Loon: Balloon-powered Internet access:

The Internet is one of the most transformative technologies of our lifetimes. But for 2 out of every 3 people on earth, a fast, affordable Internet connection is still out of reach. And this is far from being a solved problem.

There are many terrestrial challenges to Internet connectivity—jungles, archipelagos, mountains. There are also major cost challenges. Right now, for example, in most of the countries in the southern hemisphere, the cost of an Internet connection is more than a month’s income.

Solving these problems isn’t simply a question of time: it requires looking at the problem of access from new angles. So today we’re unveiling our latest moonshot from Google[x]: balloon-powered Internet access.

We believe that it might actually be possible to build a ring of balloons, flying around the globe on the stratospheric winds, that provides Internet access to the earth below. It’s very early days, but we’ve built a system that uses balloons, carried by the wind at altitudes twice as high as commercial planes, to beam Internet access to the ground at speeds similar to today’s 3G networks or faster. As a result, we hope balloons could become an option for connecting rural, remote, and underserved areas, and for helping with communications after natural disasters. The idea may sound a bit crazy—and that’s part of the reason we’re calling it Project Loon—but there’s solid science behind it.

If successful (30 balloons went up this week), some of the consequences might be:

  • Cheaper than satellites, enabling global private networks.
  • Lower altitude than satellites for private video surveillance.
  • Bandwidth will increase along with demand, increasing electronic chaff.
  • There will be no main router stations or pipes for easy interception of signals.
  • Anti-balloon weapons, and people who hijack anti-balloon weapons.
  • How will the NSA, the CIA, and China react to a near total loss of control over communications?

By the end of the decade, high-speed wireless communication may be too cheap to meter.

Let’s see the NSA collect all that data!

Starting with the Needle

Filed under: NSA,Security — Patrick Durusau @ 4:26 pm

Sebastian Rotella writes in Defenders of NSA Surveillance Omit Most of Mumbai Plotter’s Story:

James Clapper, the director of national intelligence, said a data collection program by the National Security Agency helped stop an attack on a Danish newspaper for which Headley did surveillance. And Sen. Dianne Feinstein, D-Calif., the Senate intelligence chairwoman, also called Headley's capture a success.

But a closer examination of the case, drawn from extensive reporting by ProPublica, shows that the government surveillance only caught up with Headley after the U.S. had been tipped by British intelligence. And even that victory came after seven years in which U.S. intelligence failed to stop Headley as he roamed the globe on missions for Islamic terror networks and Pakistan's spy agency.

Supporters of the sweeping U.S. surveillance effort say it's needed to build a haystack of information in which to find a needle that will stop a terrorist. In Headley's case, however, it appears the U.S. was handed the needle first — and then deployed surveillance that led to the arrest and prosecution of Headley and other plotters.

As ProPublica has previously documented, Headley's case shows an alarming litany of breakdowns in the U.S. counterterror system that allowed him to play a central role in the massacre of 166 people in Mumbai, among them six Americans.

(emphasis added)

Quoting the lawyer for a co-defendant of Headley, Sebastian writes:

Swift called the case a dramatic example of the limits of the U.S. counterterror system because both high-tech and human resources failed to prevent the Mumbai attacks.

“You have to know what you are looking for and what you are looking at,” Swift said. “Headley’s the classic example. They missed Mumbai completely.”

I would say having to know:

  1. What you are looking for, and
  2. What you are looking at.

are fairly severe limitations on a heterogeneous data system.

It’s the only way such a system can work but then I’m not an advocate for it.

On the other hand, starting from terrorists who are pointed out or captured will help illustrate how little difference such a system makes.

Not to mention showing all the cases where agencies failed to share information or use information they already had.

So maybe there is an upside to the NSA data project after all.

It can be used to hold the NSA, CIA, etc. responsible for their many failures.

Perhaps the NSA is documenting a path to its own demise.

Indexing web sites in Solr with Python

Filed under: Indexing,Python,Solr — Patrick Durusau @ 3:44 pm

Indexing web sites in Solr with Python by Martijn Koster.

From the post:

In this post I will show a simple yet effective way of indexing web sites into a Solr index, using Scrapy and Python.

We see a lot of advanced Solr-based applications, with sophisticated custom data pipelines that combine data from multiple sources, or that have large scale requirements. Equally we often see people who want to start implementing search in a minimally-invase way, using existing websites as integration points rather than implementing a deep integration with particular CMSes or databases which may be maintained by other groups in an organisation. While crawling websites sounds fairly basic, you soon find that there are gotchas, with the mechanics of crawling, but more importantly, with the structure of websites. If you simply parse the HTML and index the text, you will index a lot of text that is not actually relevant to the page: navigation sections, headers and footers, ads, links to related pages. Trying to clean that up afterwards is often not effective; you’re much better off preventing that cruft going into the index in the first place. That involves parsing the content of the web page, and extracting information intelligently. And there’s a great tool for doing this: Scrapy. In this post I will give a simple example of its use. See Scrapy’s tutorial for an introduction and further information.

Good practice with Solr, not to mention your search activities are yours to keep private if you like. 😉

Streaming IN Hadoop: Yahoo! release Storm-YARN

Filed under: Hadoop YARN,MapReduce,Storm,Yahoo! — Patrick Durusau @ 2:31 pm

Streaming IN Hadoop: Yahoo! release Storm-YARN by Jim Walker.

From the post:

Over the past year, customers have told us they want to store all their data in one place and interact with it in multiple ways… they want to use Hadoop, but in order to do so, it needs to extend beyond batch. It also needs to be interactive and real-time (among others).

This is the entire principle behind YARN, which together with others in the community, Arun Murthy and the team at Hortonworks have been working on for more than 5 years! The YARN based architecture of Hadoop 2.0 is hugely significant and we have been working closely with many partners to incorporate it into their applications.

Storm-YARN Released as Open Source

Yahoo! has been testing Hadoop 2 and its YARN-based architecture for quite some time. All the while they have worked on the convergence of the streaming framework Storm with Hadoop. This work has resulted in a YARN based version of Storm that will radically improve performance and resource management for streaming.

The release blog post from Yahoo.

Processing of data, even big data, is approaching “interactive and real-time,” although I suspect definitions of those terms vary. What is “interactive” for an automated trader might be too fast for a human trader.

What I haven’t seen is concurrent development on the handling of the semantics of big data.

After the initial hysteria over the scope of NSA snooping, it appears that only where the NSA was given the identity of a suspect (and not always then) was its data gathering of any use.

In topic map terms, the semantic impedance between the data systems was too great for useful manipulation of the data sets as one.

Streaming in Hadoop is welcome news, but until we can robustly manage the semantics of data in streams, much gold is going to pass uncollected from those streams.

Hortonworks Sandbox (1.3): Stinger, Visualizations and Virtualization

Filed under: BigData,Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 2:13 pm

Hortonworks Sandbox: Stinger, Visualizations and Virtualization by Cheryle Custer.

From the post:

A couple of weeks ago, we released several new Hadoop tutorials showcasing real-life use cases, and you can read about them here. Today, we’re delighted to bring you the newest release of the Hortonworks Sandbox 1.3. The Hortonworks Sandbox allows you to go from Zero to Big Data in 15 Minutes through step-by-step hands-on Hadoop tutorials. The Sandbox is a fully functional single node personal Hadoop environment, where you can add your own data sets, validate your Hadoop use cases and build a small proof-of-concept.

Update of your favorite way to explore Hadoop!

Get the sandbox here.

Upcoming Data Viz Contests, Summer 2013

Filed under: Graphics,Visualization — Patrick Durusau @ 12:54 pm

Upcoming Data Viz Contests, Summer 2013 by Ben Jones.

Ben has a list of five (5) data visualization contests for the summer, with prizes ranging from TBA/registration to $9,000.00.

Would be good PR and some summer cash!

Visit Ben’s post for the details.

June 14, 2013

NoSQL Matters 2013 (Videos/Slides)

Filed under: NoSQL — Patrick Durusau @ 5:12 pm

NoSQL Matters 2013 (Video/Slides)

A great set of videos but in no particular order in the original listing. I ordered them by the author’s last name for quick scanning.

Unless otherwise noted, the titles link to videos at Vimeo. Abstracts follow each title when available.

Enjoy!

Pavlo Baron – 100% Big Data, 0% Hadoop, 0% Java

“If your data is big enough, Hadoop it!” That’s simply not true – there is much more behind this term than just a tool. In this talk I will show one possible, practically working approach and the corresponding selection of tools that help collect, mine, move around, store and provision large, unstructured data amounts. Completely without Hadoop. And even completely without Java.

Pere Urbón Bayes – From Tables to Graph. Recommendation Systems, a Graph Database Use Case (No video, slides)

Recommendation engines have changed a lot during the last years and the last big change is NoSQL, especially Graph Databases. With this presentation we intend to show how to build a Graph Processing technology, based on our experience in doing that for environments like Digital Libraries and Movies and Digital Media. First, we will introduce the state of the art on context-aware Recommendation Engines, with special interest in how people are using Graph Processing (NoSQL) systems to scale this kind of solution. After an introduction to the ecosystem, the next step is to have something to work with. So we will show the audience how to build a Recommendation Engine in a few steps.

The demonstration part will use the following technology stack: Sinatra as a simple web framework, Ruby as the programming language, and OrientDB, Neo4j, Redis, etc. as the NoSQL technology stack. The result of our demonstration will be a simple engine, accessible through a REST API, to play with and extend, so that attendees can learn by doing.

In the end our audience will have a full introduction to the field of Recommendation Engines, with special interest in Graph Processing (NoSQL) systems. Based on our experience building this technology for large-scale architectures, we think the best way to learn this is by doing it and having an example to play with.

Nic Caine – Leveraging InfiniteGraph for Big Data
Slides

Practical insight! Graph databases could help institutions engaged in research & development in healthcare and life sciences manage Big Data sets. Researchers gain entry to a new level of healthcare and pharmaceutical data analytics.

This talk explains the challenges developers face in detecting relationships and similar cross-link interactions within data analysis. Graph database technology can give answers to questions nobody has asked before!

William Candillon – JSONiq – The SQL of NoSQL
Slides

SQL has been good to the relational world, but what about query languages in the NoSQL space?

We introduce JSONiq: the SQL of NoSQL.

Like SQL, JSONiq enables developers to leverage the same productive high-level language across a variety of products.

But this is not your grandma’s SQL; it supports novel concepts purposely designed for flexible data.

Moreover, JSONiq is a highly optimizable language to query and update NoSQL stores.

We show how JSONiq can be used on top of products such as MongoDB, CouchBase, and DynamoDB.

Aparna Chaudhary – Look Ma! No more Blobs
Slides

GridFS is a storage mechanism for persisting large objects in MongoDB. The talk will cover a use case of content management using MongoDB. During the talk I would explain why we chose MongoDB over traditional relational database to store XML files. The talk would be accompanied by a live demo using Spring Data & GridFS.

Sean Cribbs – Data Structures in Riak
Slides

Since the beginning, Riak has supported high write-availability using Dynamo-style multi-valued keys – also known as conflicts or siblings. The tradeoff for this type of availability is that the application must include logic to resolve conflicting updates. While it is convenient to say that the application can reason best about conflicts, ad hoc resolution is error-prone and can result in surprising anomalies, like the reappearing item problem in Dynamo’s shopping cart.

What is needed is a more formal and general approach to the problem of conflict resolution for complex data structures. Luckily, there are some formal strategies in recent literature, including Conflict-Free Replicated Data Types (CRDTs) and BloomL lattices. We’ll review these strategies and cover some recent work we’ve done toward adding automatically-convergent data structures to Riak.
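
As a toy illustration of the convergence idea behind CRDTs (my sketch, not Riak code), here is a grow-only counter: each replica increments only its own slot, and merging takes the element-wise maximum, so replicas converge no matter what order updates arrive in.

    # Hypothetical sketch of a G-Counter, one of the simplest CRDTs.
    class GCounter:
        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.counts = {}            # replica_id -> count

        def increment(self, amount=1):
            self.counts[self.replica_id] = (
                self.counts.get(self.replica_id, 0) + amount)

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Element-wise max is commutative, associative and idempotent,
            # so replicas converge without coordination.
            for rid, count in other.counts.items():
                self.counts[rid] = max(self.counts.get(rid, 0), count)

    a, b = GCounter("a"), GCounter("b")
    a.increment(3); b.increment(5)
    a.merge(b); b.merge(a)
    assert a.value() == b.value() == 8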

David Czarnecki – Real-World Redis
Slides

Redis is a data structure server, yet all too often it is used for simple data caching. This is not because its internal data structures are not powerful but, I believe, because they require libraries which wrap the functionality into something meaningful for modeling a particular problem or domain. In this talk, we will cover 3 important use cases for Redis data structures that are drawn from real-world experience and production applications handling millions of users and GB of data:

Leaderboards – also known as scoreboards or high score tables – used in video games or in gaming competition sites

Relationships (e.g. friendships) – used in “social” sites

Activity feeds – also known as timelines in “social” sites

The talk will cover these use cases in detail and the development of libraries around each separate use case. Particular attention for each service will be devoted to service failover, scaling and performance issues.
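
For the leaderboard case, a Redis sorted set does most of the work. A minimal sketch with the redis-py client (the key and player names are my placeholders, not from Czarnecki’s libraries):

    # Hypothetical sketch: a game leaderboard backed by a Redis sorted set.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Record (or update) scores; ZADD keeps the set ordered by score.
    r.zadd("leaderboard", {"alice": 3500, "bob": 4200, "carol": 2900})

    # Top 10 players, highest score first.
    top10 = r.zrevrange("leaderboard", 0, 9, withscores=True)

    # A single player's rank (0-based) and score.
    rank = r.zrevrank("leaderboard", "alice")
    score = r.zscore("leaderboard", "alice")
    print(top10, rank, score)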

Lucas Dohmen – Rapid API Development with Foxx
Slides

Foxx is a feature of the upcoming version of the free and open source NoSQL database ArangoDB. It allows you to build APIs directly on top of the database and therefore skip the middleman (Rails, Django or whatever your favorite web framework is). This can for example be used to build a backend for Single Page Web Applications. It is designed with simplicity and the specific use case of modern client-side MVC frameworks in mind featuring tools like an asset delivery system.

Stefan Edlich – NoSQL in 5+ years
Slides

Currently it’s getting harder and harder to keep track of all the movements in the SQL, NoSQL and NewSQL world. Furthermore, even polyglot persistence can have many meanings and peculiarities. This talk shows possible directions where NoSQL might move in this decade. We will discuss database model, integration and storage aspects together with some of the hottest systems in the market that lead the way today. We conclude with some survival strategies for us as users / companies in this messy world.

Benjamin Engber – How to Compare NoSQL Databases: Determining True Performance and Recoverability Metrics For Real-World Use Cases
Slides

One of the major challenges in choosing an appropriate NoSQL solution is finding reliable information as to how a particular database performs for a given use case. Stories of high-profile system failures abound and are tossed around with widely varying benchmark numbers that seem to have no bearing on how tuned systems behave outside the lab. Many of the profiling tools and studies out there use deeply flawed tools or methodologies. Getting meaningful data out of published benchmark studies is difficult, and running internal benchmarks even more so.

In this presentation we discuss how to perform tuned benchmarking across a number of NoSQL solutions (Couchbase, Aerospike, MongoDB, Cassandra, HBase, others) and to do so in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results.

We also discuss how to extend benchmarking tests to simulate different kinds of failure scenarios to help evaluate the maintainability and recoverability of different systems. This requires carefully constructed tests and significant knowledge of the underlying databases — the talk will help evaluators overcome the common pitfalls and time sinks involved in trying to measure this.

Lastly we discuss the YCSB benchmarking tool, its significant limitations, and the significant extensions and supplementary tools Thumbtack has created to provide distributed load generation and failure simulation.

Ralf S. Engelschall – Polyglot Persistence: Boon and Bane for Software Architects
Slides

For two decades, RDBMSs have been the established single standard for every type of data storage in business information systems. In the last few years, the NoSQL movement brought us a myriad of interesting alternative data storage approaches. Now Polyglot Persistence tells us to benefit from the combination of multiple data storage approaches in a “best tool for the job” manner, including the combination of RDBMS and NoSQL.

This presentation addresses the following questions: What does Polyglot Persistence look like in practice when implementing a full-size business information system with it? What interesting use cases for the persistence layer exist here? What technical challenges are we faced with? What does the persistence layer architecture look like when using multiple storage backends? What are the alternative technical solutions for implementing Polyglot Persistence?

Martin Fowler – NoSQL Distilled to an hour

NoSQL databases offer a significant change to how enterprise applications are built, challenging the two-decade hegemony of relational databases. The question people face is whether NoSQL databases are an appropriate choice, either for new projects or to introduce to existing projects. I’ll give a rapid introduction to NoSQL databases: where they came from, the nature of the data models they use, and the different way you have to think about consistency. From this I’ll outline in what kinds of circumstances you should consider using them, why they will not make relational databases obsolete, and the important consequence of polyglot persistence.

Uwe Friedrichsen – How to survive in a BASE world
Slides

NoSQL, Big Data and scale-out in general are leaving the hype plateau and starting to become enterprise reality. This usually means no more ACID transactions, but BASE transactions instead. When confronted with BASE, many developers just shrug and think “Okay, no more SQL, but that’s basically it, isn’t it?”. They are terribly wrong!

BASE transactions no longer guarantee data consistency at all times, a property we became so used to in the ACID years that we barely think about it anymore. But if we continue to design and implement our applications as if there still were ACID transactions, system crashes and corrupt data will become your daily companions.

This session gives a quick introduction into the challenges of BASE transactions and explains how to design and implement a BASE-aware application using real code examples. Additionally we extract some concrete patterns in order to preserve the ideas in a concise way. Let’s get ready to survive in a BASE world!

Lars George – HBase Schema Design
Slides

HBase is the Hadoop database, a random-access store that can easily scale to petabytes of data. It employs common logical concepts, such as rows and tables. But the lack of transactions and the simple CRUD API, combined with a nearly schema-less data layout, require a deeper understanding of its inner workings and how these affect performance. The talk will discuss the architecture behind HBase and lead into practical advice on how to structure data inside HBase to gain the best possible performance.

Lars George – Introduction to Hadoop
Slides

Apache Hadoop is the most popular solution for today’s big data problems. By connecting multiple servers, Hadoop provides a redundant and distributed platform to store and process large amounts of data. This presentation will introduce the architecture of Hadoop and the various interfaces to import and export data into it. Finally, a range of tools for accessing the data will be presented, including a NoSQL layer called HBase, a scripting language layer called Pig, and also a good old SQL approach through Hive.

Kris Geusebroek – Creating Neo4j Graph Databases with Hadoop
Slides

When exploring very large raw datasets containing massive interconnected networks, it is sometimes helpful to extract your data, or a subset thereof, into a graph database like Neo4j. This allows you to easily explore and visualize networked data to discover meaningful patterns. When your graph has 100M+ nodes and 1000M+ edges, using the regular Neo4j import tools will make the import very time-intensive (as in many hours to days).

In this talk, I’ll show you how we used Hadoop to scale the creation of very large Neo4j databases by distributing the load across a cluster and how we solved problems like creating sequential row ids and position-dependent records using a distributed framework like Hadoop.

Tugdual Grall – Introduction to Map Reduce with Couchbase 2.0
Slides

MapReduce allows systems to delegate query processing to different machines in parallel. Couchbase 2.0 allows developers to use Map Reduce to query JSON-based data over a large volume of data (and servers). Couchbase 2.0 features incremental map reduce, which provides powerful aggregates and summaries, even with large datasets, for distributed real-time analytic use cases. In this session you will learn:
– What is MapReduce?
– How incremental map-reduce works in Couchbase Server 2.0?
– How to utilize incremental map-reduce for real-time analytics?
– Common use cases this feature addresses?
In this session you will see in demonstration how you can create new Map Reduce function and use it into your application.

Alex Hall – Processing a Trillion Cells per Mouse Click
Slides

Column-oriented database systems have been a real game changer for the industry in recent years. Highly tuned and performant systems have evolved that provide users with the possibility of answering ad hoc queries over large datasets in an interactive manner. In this paper we present the column-oriented datastore developed as one of the central components of PowerDrill (internal Google data-analysis project).

It combines the advantages of columnar data layout with other known techniques (such as using composite range partitions) and extensive algorithmic engineering on key data structures. The main goal of the latter being to reduce the main memory footprint and to increase the efficiency in processing typical user queries. In this combination we achieve large speed-ups. These enable a highly interactive Web UI where it is common that a single mouse click leads to processing a trillion values in the underlying dataset.

Randall Hauch – Elastic, consistent and hierarchical data storage with ModeShape 3
Slides

ModeShape 3 is an elastic, strongly-consistent hierarchical database that supports queries, full-text search, versioning, events, locking and use of schema-rich or schema-less constraints. It’s perfect for storing files and hierarchically structured data that will be accessed by navigation or queries. You can choose where (if at all) you want ModeShape to enforce your schema, but your structure and schema can always evolve as your needs change. Sequencers make it easy to extract structure from stored files, and federation can bring into your database information from external systems. It’s fast, sits on top of an Infinispan data grid, and is open source. Learn about the benefits of ModeShape 3, and how to deploy and use it to store your own data.

Michael Hausenblas – Apache Drill In-Depth Dissection
Slides

The Apache Drill project ostensibly has goals that make it look a lot like Dremel in that the mainline use case involves SQL or SQL-like queries applied to a large distributed data store, possibly organized in a columnar format.

In fact, however, Drill is a highly flexible architecture that allows it to serve many needs. Moreover, Drill has standardized internal API’s which allow easy extension for experimentation with parallel query evaluation. This is achieved by defining a standard logical query data flow language with a standardized and very flexible JSON syntax. Operators can be added to this framework very easily with the only assumption being that operators have inputs that are record sequences and a single output consisting of a record sequence. A SQL to logical query translator and the operators necessary to evaluate these queries are part of the standard Drill, but alternative syntax is easily added and alternative semantics are easily substituted.

This talk will describe the overall architecture of Drill, report on the progress in building an open source development community and show how Drill can be used to do machine learning, how Drill can be embedded in a language like Scala or Groovy, and how new syntax components can be added to support a language like Pig. This will be done by a description of how new parsers and operators are added. In addition, I will provide a description of how Drill uses Optiq to do cost-based query optimization.

Michael Hunger – Cypher Query Language and Neo4j
Slides

The Neo4j graph database is all about relationships. It allows you to model domains of connected data easily. Querying using an imperative API is cumbersome, so we decided to develop a query language more suited to querying graph data and focused on readability.

Taking inspiration from SQL, SparQL and others and using Scala to implement it turned out to be a good decision. Cypher has become one of the things people love about Neo4j. So in the talk we’ll introduce the language and its applicability for graph querying. We will focus on the expressive declaration of patterns, conditions and projections as well as the updating capabilities.

Michael Hunger – Intro to Neo4j or Domain Modeling with graphs
Slides (No abstract)

Dr. Stefan Kaes & Sebastian Röbke – Triple R – Riak, Redis and RabbitMQ at XING
Slides

This talk will focus on how the team at XING, a popular social network for business professionals, rebuilt the activity stream for xing.com with Riak.

We will present interesting challenges we had to solve, like:
* Finding the proper data structures for storing very large activity streams
* Dealing with eventual consistency
* Migrating legacy data
* Setting up Riak in production

Piotr Kolaczkowski – Scalable Full-Text Search with DataStax Enterprise
Slides

Cassandra is the scalable NoSQL database powering modern applications at companies like Netflix, eBay, Disney, and hundreds of others. Solr is the popular enterprise search platform from the Apache Lucene project, offering powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, and much more. DataStax Enterprise platform integrates them both in such a way that data stored in Cassandra can be accessed and searched using Solr, and data stored in Solr can be accessed through Cassandra. This talk will describe the high level architecture of the Datastax Enterprise Search as well as specific algorithms responsible for massive scalability and good load balancing of Solr-enabled clusters.

Steffen Krause – DynamoDB – on-demand NoSQL scaling as a service
Slides

Scaling a distributed NoSQL database and making it resilient to failure can be hard. With Amazon DynamoDB, you just specify the desired throughput, consistency level and upload your data. DynamoDB does all the heavy lifting for you. Come to this session to get an overview of an automated, self-managed key-value store that can seamlessly scale to hundreds of thousands of operations per second.
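
The “specify the desired throughput” part looks roughly like this in Python with the boto3 library (my choice of client; table and attribute names are placeholders, and AWS credentials plus a region are assumed to be configured):

    # Hypothetical sketch: create a DynamoDB table with provisioned throughput
    # and write an item to it.
    import boto3

    dynamodb = boto3.resource("dynamodb")

    table = dynamodb.create_table(
        TableName="events",
        KeySchema=[
            {"AttributeName": "user_id", "KeyType": "HASH"},
            {"AttributeName": "ts", "KeyType": "RANGE"},
        ],
        AttributeDefinitions=[
            {"AttributeName": "user_id", "AttributeType": "S"},
            {"AttributeName": "ts", "AttributeType": "N"},
        ],
        # The desired read/write capacity is all you specify up front.
        ProvisionedThroughput={"ReadCapacityUnits": 100,
                               "WriteCapacityUnits": 100},
    )
    table.wait_until_exists()

    table.put_item(Item={"user_id": "u1", "ts": 1, "payload": "hello"})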

Hannu Krosing – PostSQL – using PostgreSQL as a better NoSQL
Slides

This talk will describe using PostgreSQL’s superior ACID data engine “in a NoSQL way”, that is, using PostgreSQL’s support for JSON and other dynamic/freeform types and embedded languages (pl/python, pl/jsv8) for data processing near the data.

Also, scaling the Skype way using pl/proxy sharding language, pgbouncer connection pooler and skytools data moving and transforming multitool is demonstrated. Performance comparisons to popular NoSQL databases are also shown.
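
A minimal sketch of the “PostgreSQL as NoSQL” idea (my illustration, not Krosing’s demo): store free-form JSON documents in a json column and query into them with the ->> operator, here from Python via psycopg2. The DSN, table, and field names are placeholders.

    # Hypothetical sketch: schemaless documents inside PostgreSQL.
    import json
    import psycopg2

    conn = psycopg2.connect("dbname=demo user=demo")    # placeholder DSN
    cur = conn.cursor()

    cur.execute("CREATE TABLE IF NOT EXISTS docs "
                "(id serial PRIMARY KEY, body json)")
    cur.execute("INSERT INTO docs (body) VALUES (%s)",
                (json.dumps({"name": "Hannu", "city": "Tallinn"}),))

    # ->> extracts a JSON field as text, so ordinary SQL predicates apply.
    cur.execute("SELECT body ->> 'name' FROM docs WHERE body ->> 'city' = %s",
                ("Tallinn",))
    print(cur.fetchall())

    conn.commit()
    cur.close()
    conn.close()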

Fabian Lange – There and Back Again: A Cloud’s Tale
Slides

Want to see how we built our cloud-based document management solution CenterDevice? This talk will cover the joy and pain of working with MongoDB, Elastic Search, Gluster FS, Rabbit MQ, Java and more. I will show how we build, deploy, run and monitor our very own cloud. You will learn what we learned, what worked and where hype met reality. Warning: no unicorns or pixie dust.

Dennis Meyer & Uwe Seiler – Map/Reduce in Action: Large Scale Reporting Based on Hadoop and Vertica
Slides

German-based ADTECH GmbH, an AOL Inc. Company, is a leading international supplier of digital marketing solutions delivering approx. 6 billion advertisements on a daily basis for customers in over 25 countries. Every ad delivery needs to be logged for reasons of billing but additionally ADTECH’s customers want to know as much detail as possible about those deliveries. Until recently the reporting part of ADTECH’s infrastructure was based on a custom C++ reporting solution used in combination with multiple databases. With ever-increasing traffic the performance was reaching its limits, especially for the customer-critical end-of-month reporting period. Furthermore changes to custom reports were complex and time consuming due to the highly intermeshed architecture. The delays in producing these customizations were a source of customer dissatisfaction. To solve these issues ADTECH, in partnership with codecentric AG, made the move to a more flexible and performant architecture based on the Apache Hadoop ecosystem. With this new approach all details about the ad deliveries are logged using a combination of Avro and Flume, stored into HDFS and then aggregated using Map/Reduce with Pig before being stored in the NoSQL datastore Vertica. This talk aims to give an overview of the architecture and explain the architectural decisions made with a strong focus on the lessons learned.

Konstantin Osipov – Persistent message queues with Tarantool/Box
Slides

Tarantool/Box is an open-source, persistent, transactional in-memory database with rich support of Lua as a stored procedures and extensions language.

In the diverse world of NoSQL, Tarantool/Box owns a niche of a database efficiently running important parts of Web application logic, thus smartly enabling Web application to scale. In this talk I’ll present several new use cases in which Tarantool/Box plays a special role, and our new features implemented to support them. In particular, I’ll explore the problem of persistent transactional message queues and the role they play in highly available software. I will demonstrate how Tarantool/Box can be used as a reliable message queue server, but customized with parts of application-specific logic.

I’ll showcase Tarantool/Box features designed to support message queues:

– inter-stored-procedure communication channels, to effectively exchange messages between task producers and consumers
– triggers, fired upon system events, such as connect or disconnect, and their usage in an efficient queue implementation
– new database index types: bitmap, partial and functional indexes, necessary to implement very large queues with minimal memory footprint.

Panel discussion (No abstract)

Mahesh Paolini-Subramanya – NoSQL the Telco way
Slides

Being a decent-sized Telecommunications provider, we process a lot of calls (hundreds/second), and need to keep track of all the events on each call. Technically speaking, this is “A Lot” of data – data that our clients (and our own people!) want real-time access to in a myriad of ways. We’ve ended up going through quite a few NoSQL stores in our quest to satisfy everyone – and the way we do things now has very little to do with where we started out. Join me as I describe our experience and what we’ve learned, focusing on the Big 4, viz.

– The “solution-oriented” nature of NoSQL repeatedly changed our understanding of our problem-space – sometimes drastically.
– The system behavior, particularly the failure modes, was significantly different at scale
– The software model kept getting overhauled – regardless of how much we planned ahead
– We came to value agility – the ability to change direction – above all (yes, even at a Telco!)

Eric Redmond – Distributed Patterns You Should Know
Slides

Do you use Merkle trees? How about hash rings? Vector clocks? How many message patterns do you know? In this increasingly decentralized world, a firm grasp of the pantheon of data structures and patterns used to manage decentralized systems is no longer a luxury, it’s a necessity. You may never implement a pipeline, but chances are, you already use them.
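
One of the patterns Eric names, the consistent hash ring, fits in a few lines of Python. This sketch (mine, not from the talk) maps keys to nodes so that adding or removing a node moves only a small fraction of the keys.

    # Hypothetical sketch of a consistent hash ring with virtual nodes.
    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes, replicas=100):
            self.replicas = replicas
            self.ring = {}          # hash position -> node name
            self.sorted_keys = []
            for node in nodes:
                self.add_node(node)

        def _hash(self, key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def add_node(self, node):
            # Virtual nodes smooth out the key distribution.
            for i in range(self.replicas):
                h = self._hash(f"{node}:{i}")
                self.ring[h] = node
                bisect.insort(self.sorted_keys, h)

        def get_node(self, key):
            # First ring position clockwise from the key's hash.
            h = self._hash(key)
            idx = bisect.bisect(self.sorted_keys, h) % len(self.sorted_keys)
            return self.ring[self.sorted_keys[idx]]

    ring = HashRing(["riak1", "riak2", "riak3"])
    print(ring.get_node("user:42"))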

Larysa Visengeriyeva – Introduction to Predictive Analytics for NoSQL Nerds (Slides, no video)

Rolling out and running a NoSQL database is only half the battle. It’s obvious that NoSQL databases are used more and more in companies and start-ups where there is a huge need to dig up “big data” treasures. This requires a profound knowledge of mathematics, statistics, AI, data mining and machine learning, where experts are rare. This talk will give an overview of the most important concepts mentioned before. Furthermore, tools, techniques and experiences for a successful data analysis will be introduced. Finally this talk closes with a practical implementation for analyzing text – following the ‘IBM Watson’ idea.

Matthew Revell – Building scalability into your next project with Riak
Slides

Change is the one thing you can guarantee will happen in your next project. Fixing your schema before you’ve even launched means taking a massive gamble over what your users will want and how you’re going to deliver it.

The new generation of schema-free databases gives you the flexibility to learn what your users need and what your application should do. As a bonus, databases such as Riak give you huge scalability, meaning that you needn’t fear success.

Matthew introduces the new world of schema-free/NoSQL databases and focuses on how Riak can make your next project truly web-scale.

Martin Schoenert – CAP and Architectural Consequences
Slides

The blunt formulation of the CAP theorem states that any database system can achieve only 2 of the 3 properties: consistency, availability and partition-tolerance. We look at it more closely and see that this formulation is misleading, because there is not a single big design decision but several smaller ones for the design of a database system. We then concentrate on the architectural consequences for massively distributed database systems and argue that such systems must place restrictions on consistency and functionality.

Jan Steemann – Query Languages for Document Stores
Slides

SQL is the standard and established way to query relational databases.

As the name “NoSQL” suggests, NoSQL databases have gone another way, coming up with several approaches to querying, e.g. access by key, map/reduce, and even their own full-featured query languages.

We surely don’t want the super-fast key/value store to require us to use a full-blown query language and slow us down – but in several other cases querying using a language can still be convenient. This is especially the case in document stores, which have a wide range of use cases and allow us to look at different aspects of the same data.

As there isn’t yet an established standard for querying document databases, the talk will showcase some of the existing implementations such as UNQL, AQL, and jsoniq. Additionally, related topics such as graph query languages will be covered.

Kai Wähner – Big Data beyond Hadoop – How to integrate ALL your data
Slides

Big data represents a significant paradigm shift in enterprise technology. Big data radically changes the nature of the data management profession as it introduces new concerns about the volume, velocity and variety of corporate data.

Apache Hadoop is the open source de facto standard for implementing big data solutions on the Java platform. Hadoop consists of its kernel, MapReduce, and the Hadoop Distributed Filesystem (HDFS). A challenging task is to send all data to Hadoop for processing and storage (and then get it back to your application later), because in practice data comes from many different applications (SAP, Salesforce, Siebel, etc.) and databases (File, SQL, NoSQL), uses different technologies and concepts for communication (e.g. HTTP, FTP, RMI, JMS), and consists of different data formats using CSV, XML, binary data, or other alternatives.

This session shows the powerful combination of Apache Hadoop and Apache Camel to solve this challenging task. Learn how to use every thinkable kind of data with Hadoop – without a lot of complex or redundant boilerplate code. Besides supporting the integration of all different technologies and data formats, Apache Camel also offers an easy, standardized DSL to transform, split or filter incoming data using the Enterprise Integration Patterns (EIP). Therefore, Apache Hadoop and Apache Camel are a perfect match for processing big data on the Java platform.

Simon Willnauer – With a hammer in your hand…

ElasticSearch combines the power of Apache Lucene (NoSQL since 2001) and the movement of distributed, scalable high-performance NoSQL solutions into an easy-to-use, schema-free search engine that can serve full-text search requests, key-value lookups, schema-free analytics requests, facets or even suggestions in real time. This talk will give an introduction to the key features of ElasticSearch with live examples.

The talk won’t be an exhaustive feature presentation but rather an overview of what and how ElasticSearch can do for you.
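
A minimal sketch of the “easy to use” part, indexing and searching a document from Python with the elasticsearch client (index name and document are my placeholders, not from the talk):

    # Hypothetical sketch: index a document and run a full-text query.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # No schema required up front; the mapping is inferred from the document.
    es.index(index="talks", id=1, document={
        "speaker": "Simon Willnauer",
        "title": "With a hammer in your hand...",
        "tags": ["elasticsearch", "lucene", "nosql"],
    })
    es.indices.refresh(index="talks")    # make the document searchable now

    result = es.search(index="talks", query={"match": {"tags": "lucene"}})
    print(result["hits"]["total"])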

Randall Wilson – A Billion Person Family Tree with MongoDB
Slides

FamilySearch maintains a collaborative family tree with a billion individuals in it. The tree is edited in real time by thousands of concurrent users. Recent experiments to move the tree from a relational database to MongoDB yielded a huge gain in performance. This presentation reviews the lessons learned throughout this experience, including how to deal with things such as operations that previously depended on transactional integrity. It also shares some insights into the experience gained by testing against Riak and other solutions.

I first saw this in a tweet by Eugene Dvorkin.

BrightstarDB 1.3 now available

Filed under: BrightstarDB,LINQ,NoSQL,SPARQL — Patrick Durusau @ 3:22 pm

BrightstarDB 1.3 now available

From the post:

We are pleased to announce the release of BrightstarDB 1.3. This is the first “official” release of BrightstarDB under the open-source MIT license. All of the documentation and notices on the website should now have been updated to remove any mention of commercial licensing. To be clear: BrightstarDB is not dual licensed, the MIT license applies to all uses of BrightstarDB, commercial or non-commercial. If you spot something we missed in the docs that might indicate otherwise please let us know.

The main focus of this release has been to tidy up the licensing and use of third-party closed-source applications in the build process, but we also took the opportunity to extend the core RDF APIs to provide better support for named graphs within BrightstarDB stores. This release also incorporates the most recent version of dotNetRDF providing us with updated Turtle parsing and improved SPARQL query performance over the previous release.

Just to tempt you into looking further, the features are:

  • Schema-Free Triple Store
  • High Performance
  • LINQ & OData Support
  • Historical Data Access
  • Transactional (ACID)
  • NoSQL Entity Framework
  • SPARQL Support
  • Automatic Indexing

From Kal Ahmed and Graham Moore if you don’t recognize the software.

NAACL ATL 2013

2013 Conference of the North American Chapter of the Association for Computational Linguistics

The NAACL conference wraps up tomorrow in Atlanta but in case you are running low on summer reading materials:

Proceedings for the 2013 NAACL and *SEM conferences. Not quite 180MB but close.

Scanning the accepted papers will give you an inkling of what awaits.

Enjoy!

NSA shows how big ‘big data’ can be

Filed under: BigData,NSA,Security — Patrick Durusau @ 1:55 pm

NSA shows how big ‘big data’ can be by Frank Konkel.

If big data was cheap and easy and always resulted in an abundance of relevant insights, every agency and organization would do it.

The fact that so few federal agencies are engaging this new technology – zero out of 17 in a recent Meritalk survey – only highlights the challenges inherent with what recent intelligence leaks show the National Security Agency is trying to do.

NSA reportedly collects the daily phone records of hundreds of millions of customers from the largest providers in the nation, as well as a wealth of online information about individuals from Internet companies like Facebook, Microsoft, Google and others.

To put the NSA’s big data problems into perspective, Facebook’s 1 billion worldwide users alone generate 500 terabytes of information per day – about as much data as a digital library containing all books ever written in any language. Worldwide, humans generate 6.1 trillion text messages annually, and Americans alone make billions of phone calls each year.

Even if the NSA takes in only a small percentage of the metadata generated daily by those major companies and carriers in its efforts to produce foreign signals intelligence and thwart terrorists, the information contained therein would be a vast sea of data.

Frank’s line, “If big data was cheap and easy and always resulted in an abundance of relevant insights, every agency and organization would do it,” bears repeating.

Especially in light of misleading news stories like: Intercepted communications called critical in terror investigations by Tim Lister and Paul Cruickshank.

Sure, starting from someone already known or under surveillance, intercepting communications can be valuable.

But that wasn’t what the NSA was doing, at least from the court order to monitor all Verizon customers. Unless they are all in cahoots with terrorists? Seems unlikely.

So why is the NSA gathering data it can’t effectively analyze?

Suggestions?

Kamala Cloud 2.0!

Filed under: Cloud Computing,Kamala,Topic Map Software,Topic Maps — Patrick Durusau @ 9:42 am

Description:

Kamala is a knowledge platform for organizations and people to link their data and share their knowledge. Key features of Kamala: smart suggestions, semantic search and efficient filtering. These help you perfect your knowledge model and give you powerful, reusable search results.

Model your domain knowledge in the cloud.

Website: http://kamala-cloud.com
Kamala: http://kamala.mssm.nl
Morpheus: http://en.mssm.nl

Excellent!

I understand more Kamala videos are coming next week.

An example of how to advertise topic maps. Err, a good example of how to advertise topic maps! 😉

You will see Gabriel Hopmans put in a cameo appearance in the video.

Congratulations to the Kamala Team!

Details, discussion, criticisms, etc., to follow.

June 13, 2013

Loopy Lattices Redux

Filed under: Faunus,Graphs,Networks,Titan — Patrick Durusau @ 4:45 pm

Loopy Lattices Redux by Marko A. Rodriguez.

Comparison of Titan and Faunus counting the number of paths in a 20 x 20 lattice.

Interesting from a graph-theoretic perspective, but since the count can be determined analytically, I am not sure of the utility of being able to count the paths.
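
For the record, the analytic count: assuming the traversal counts monotone paths from one corner of the 20 x 20 grid to the opposite corner, the answer is the central binomial coefficient C(40, 20).

    # The count a graph traversal arrives at can be computed directly:
    # choose which 20 of the 40 steps go "right".
    from math import comb

    print(comb(40, 20))   # 137846528820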

In some ways this reminds me of Counting complex disordered states by efficient pattern matching: chromatic polynomials and Potts partition functions by Marc Timme, Frank van Bussel, Denny Fliegner and Sebastian Stolzenberg, New Journal of Physics 11 (2009) 023001.

The question Timme and colleagues were investigating was the coloring of nodes in a graph which depended upon the coloring of other nodes. For a chess board sized graph, the calculation is estimated to take billions of years. The technique developed here takes less than seven (7) seconds for a chess board sized graph.

Traditionally, assigning a color to a vertex required knowledge of the entire graph. Here, instead of assigning a color, the color that should be assigned is represented by a formula stating the unknowns. Once all the nodes have such a formula:

The computation of the chromatic polynomial has been reduced to a process of alternating expansion of expressions and symbolically replacing terms in an appropriate order. In the language of computer science, these operations are represented as the expanding, matching and sorting of patterns, making the algorithm suitable for computer algebra programs optimized for pattern matching.

What isn’t clear is whether a similar technique could be applied to merging conditions where the merging state of a proxy depends upon, potentially, all other proxies.

Visual Data Web

Filed under: Graphics,Semantic Web,Visualization — Patrick Durusau @ 12:37 pm

Visual Data Web

From the website:

This website provides an overview of our attempts at a more visual Data Web.

The term Data Web refers to the evolution of a mainly document-centric Web toward a more data-oriented Web. In its narrow sense, the term describes pragmatic approaches of the Semantic Web, such as RDF and Linked Data. In a broader sense, it also includes less formal data structures, such as microformats, microdata, tagging, and folksonomies.

The term Visual Data Web reflects our goal of making the Data Web visually more experienceable, also for average Web users with little to no knowledge about the underlying technologies. This website presents developments, related publications, and current activities to generate new ideas, methods, and tools that help make the Data Web more easily accessible, more visible, and thus more attractive.

The recent NSA scandal underlined the smallness of “web scale.” The NSA data was orders of magnitude greater than “web scale.”

Still, experimenting with visualization, even on “web scale” data, may lead to important lessons on visualization.

I first saw this in a tweet by Stian Danenbarger.

Hafslund Sesam — an archive on semantics

Filed under: RDF,Topic Maps — Patrick Durusau @ 12:27 pm

Hafslund Sesam — an archive on semantics by Lars Marius Garshol and Axel Borge.

Abstract:

Sesam is an archive system developed for Hafslund, a Norwegian energy company. It achieves the often-sought but rarely-achieved goal of automatically enriching metadata by using semantic technologies to extract and integrate business data from business applications. The extracted data is also indexed with a search engine together with the archived documents, allowing true enterprise search.

A curious paper that requires careful reading.

Since the paper makes technology choices, it’s only appropriate to start with the requirements:

The system must handle 1000 users, although not necessarily simultaneously.

Initial calculations of data size assumed 1.4 million customers and 1 million electric meters with 30-50 properties each. Including various other data gave a rough estimate on the order of 100 million statements.

The archive must be able to receive up to 2 documents per second over an interval of many hours, in order to handle about 100,000 documents a day during peak periods. The documents would mostly be paper forms recording electric meter readings.

To inherit metadata tags automatically requires running queries to achieve transitive closure. Assuming on average 10 queries for each document, the system must be able to handle 20 queries per second on 100 million statements.

In the next section, the authors concede that the fourth requirement, “RDF data integration,” was unrealistic, so it was dropped:

The canonical approach to RDF data integration is currently query federation of SPARQL queries against a set of heterogeneous data sources, often using R2RML. Given the size of the data set, the generic nature of the transitive closure queries, and the number of data sources to be supported, we considered achieving 20 queries per second with query federation unrealistic.

Which leaves only:

The system must handle 1000 users, although not necessarily simultaneously.

Initial calculations of data size assumed 1.4 million customers and 1 million electric meters with 30-50 properties each. Including various other data gave a rough estimate on the order of 100 million statements.

The archive must be able to receive up to 2 documents per second over an interval of many hours, in order to handle about 100,000 documents a day during peak periods. The documents would mostly be paper forms recording electric meter readings.

as the requirements to be met.

I mention that because of the following technology choice statement:

To write generic code we must use a schemaless data representation, which must also be standards-based. The only candidates were Topic Maps [ISO13250-2] and RDF. The available Topic Maps implementations would not be able to handle the query throughput at the data sizes required. Testing of the Virtuoso triple store indicated that it could handle the workload just fine. RDF thus appeared to be the only suitable technology.

But there is no query throughput requirement. At least not for the storage mechanism. For deduplication in the ERP system (section 3.5), the authors choose to follow neither topic maps nor RDF but a much older technology, record linkage.
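
The paper doesn’t spell out the matching rules, but a minimal record linkage sketch (the field names, weights, and threshold below are my own, purely for illustration) shows the basic pattern: normalize the fields, score pairwise similarity, and link records that score above a threshold.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in [0, 1]; real record linkage uses better metrics."""
    def norm(s):
        return s.lower().replace(" ", "")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def match_score(rec1, rec2, weights):
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    return sum(w * similarity(rec1.get(f, ""), rec2.get(f, ""))
               for f, w in weights.items()) / total

# Hypothetical customer records from two source systems.
erp = {"name": "Ola Nordmann", "address": "Storgata 1, Oslo", "phone": "22 11 22 33"}
crm = {"name": "Nordmann, Ola", "address": "Storgata 1, 0155 Oslo", "phone": "22112233"}

weights = {"name": 0.5, "address": 0.3, "phone": 0.2}   # invented weights
if match_score(erp, crm, weights) > 0.75:                # invented threshold
    print("Probable duplicate: link the records")
```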

The other query mechanism is a Recommind search engine, which reportedly cannot index and search at the same time (section 4.1).

If I am reading the paper correctly, data from the various sources are stored as received, and owl:sameAs statements are used to map the data to the archive’s schema.

I puzzle at that point, because RDF is simply a data format and OWL merely a means of stating mappings.

Given the semantic vagaries of owl:sameAs (Semantic Drift and Linked Data/Semantic Web), I have to wonder about the longer term maintenance of owl:sameAs mappings?

There is no expression of a reason for a “sameAs,” a reason that might prompt a future maintainer of the system to follow (or not follow) some particular “sameAs.”
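
Nothing in RDF prevents recording such a reason. A minimal sketch with rdflib, using plain RDF reification (the ex:reason property and the identifiers below are my own inventions, not anything from the paper), might look like this:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF

EX = Namespace("http://example.com/")   # hypothetical namespace

g = Graph()
erp_customer = EX["erp/customer/1234"]    # invented identifiers
crm_customer = EX["crm/customer/98765"]

# The bare assertion, which is all owl:sameAs gives you.
g.add((erp_customer, OWL.sameAs, crm_customer))

# A recorded reason, via RDF reification, so a future maintainer
# can decide whether to trust this particular sameAs.
stmt = EX["mapping/1"]
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, erp_customer))
g.add((stmt, RDF.predicate, OWL.sameAs))
g.add((stmt, RDF.object, crm_customer))
g.add((stmt, EX.reason, Literal("Matched on customer number in ERP and CRM")))

print(g.serialize(format="turtle"))
```

The same could be done with named graphs; the point is only that a “sameAs” can carry its justification.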

Still, the project was successful and that counts for more than using any single technology to the exclusion of all others.

The comments on the performance of topic map options do make me mindful of the lack of benchmark data sets for topic maps.

Clojure Cookbook

Filed under: Clojure,Programming — Patrick Durusau @ 9:39 am

Clojure Cookbook

From the webpage:

Clojure Cookbook is coming

And we need your help.

We want this O’Reilly cookbook to be a comprehensive resource containing the collective wisdom of Clojurists from every domain. That’s why we want to write it together, as a community.

Share some code. Explain it. Be a part of Clojure history.

Clojure Cookbook

Cool!

GitHub clojure-cookbook.

Post-Prism Data Science Venn Diagram

Filed under: Humor,NSA,Security — Patrick Durusau @ 9:23 am

[Image: Post-Prism Data Science Venn Diagram]

Joel Grus posted an updated version of Drew Conway’s Data Science Venn Diagram.

June 12, 2013

HyperGraphDB (Google+)

Filed under: Graphs,Hypergraphs — Patrick Durusau @ 3:45 pm

HyperGraphDB (Google+)

Jack Park forwarded a note pointing out this Google+ group for HyperGraphDB.

Not a lot of traffic, must be waiting on you!

See also: HyperGraphDB 1.2 Final for links to the software.

Stealth-mode 28msec wants to build a Tower of Babel for databases

Filed under: ETL,JSONiq,Zorba — Patrick Durusau @ 3:31 pm

Stealth-mode 28msec wants to build a Tower of Babel for databases by Derrick Harris.

From the post:

28msec is not your average database startup but, then again, neither is its mission. The company — still in stealth mode (until our Structure Launchpad event on June 20) after about seven years of existence — has created a data-processing platform that it says can take and analyze data from any source, and then deliver the results in real time.

The company took so long to officially launch, CEO Eric Kish told me, because it took such a long time to build. The 28msec history goes like this: The early investors are database industry veterans (one was employee No. 6 at Oracle) who, at some point in 2006, envisioned an explosion in data formats and databases. Their solution was to create a platform able to extract data from any of these sources, transform it into a standard format, and then let users analyze it using a single query language that looks a lot like the SQL they already know. 28msec is based on the open source JSONiq and Zorba query languages and will be available as a cloud service.
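
28msec hasn’t published the details yet, but the “extract, transform into a standard format, query with one language” pipeline is classic ETL. A toy sketch (sources and field names invented for illustration) of the normalize-then-query idea:

```python
import csv
import io
import json

def from_csv(text):
    """Normalize CSV rows into plain dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def from_json_lines(text):
    """Normalize newline-delimited JSON into plain dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Two invented sources with different shapes.
crm_csv = "customer,city\nAcme,Oslo\nGlobex,Bergen\n"
billing_jsonl = '{"customer": "Acme", "amount": 1200}\n{"customer": "Globex", "amount": 800}'

records = from_csv(crm_csv) + from_json_lines(billing_jsonl)

# Once everything is a uniform record, one query language can span the
# sources (JSONiq in 28msec's case; a list comprehension stands in here).
print([r for r in records if r.get("customer") == "Acme"])
```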

Alex Popescu points to his Main difference between Hadapt and Microsoft Polybase, HAWQ, SQL-H to underline the point that we all know ETL works; the question is what is required to optimize it.

I first saw this at Alex Popescu’s 28msec – query data from any source in real time.

PS: Should I send a note along to the NSA or just assume they are listening? 😉

State of the OpenStreetMap [Watching the Watchers?]

Filed under: Crowd Sourcing,OpenStreetMap — Patrick Durusau @ 3:13 pm

State of the OpenStreetMap by Nathan Yau.

[Image: OpenStreetMap]

Nathan reminds us to review the OpenStreetMap Data Report, which includes a dynamic map showing changes as they are made.

OpenStreetMap has exceeded 1,000,000 users and 1,000 mappers contribute every day.

I wonder if OpenStreetMap would be interested in extending its Features to include a “seen-at” tag?

So people could upload geotagged cellphone photos of watchers.

With names, if known; if not, perhaps other users could supply them.
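
Purely hypothetical, but since OSM features are nodes with key=value tags, a “seen-at” observation might look something like this (every key below is invented; nothing like it exists in the OSM feature catalogue):

```python
# A hypothetical OSM-style node recording a "seen-at" observation.
# Every key below is invented for illustration; OSM has no such tags.
seen_at_node = {
    "lat": 51.5007,
    "lon": -0.1246,
    "tags": {
        "seen-at": "yes",
        "seen-at:date": "2013-06-12",
        "seen-at:photo": "https://example.com/photos/watcher-4711.jpg",
        "seen-at:name": "unknown",
    },
}

def supply_name(node, name):
    """Let a later contributor fill in a name for an unnamed watcher."""
    if node["tags"].get("seen-at:name") == "unknown":
        node["tags"]["seen-at:name"] = name
    return node

supply_name(seen_at_node, "name supplied by another user")
```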

How does name analysis work?

Filed under: Names,Natural Language Processing — Patrick Durusau @ 2:51 pm

How does name analysis work? by Pete Warden.

From the post:

Over the last few months, I’ve been doing a lot more work with name analysis, and I’ve made some of the tools I use available as open-source software. Name analysis takes a list of names, and outputs guesses for the gender, age, and ethnicity of each person. This makes it incredibly useful for answering questions about the demographics of people in public data sets. Fundamentally though, the outputs are still guesses, and end-users need to understand how reliable the results are, so I want to talk about the strengths and weaknesses of this approach.

The short answer is that it can never work any better than a human looking at somebody else’s name and guessing their age, gender, and race. If you saw Mildred Hermann on a list of names, I bet you’d picture an older white woman, whereas Juan Hernandez brings to mind an Hispanic man, with no obvious age. It should be obvious that this is not always reliable for individuals (I bet there are some young Mildreds out there) but as the sample size grows, the errors tend to cancel each other out.

The algorithms themselves work by looking at data that’s been released by the US Census and the Social Security agency. These data sets list the popularity of 90,000 first names by gender and year of birth, and 150,000 family names by ethnicity. I then use these frequencies as the basis for all of the estimates. Crucially, all the guesses depend on how strong a correlation there is between a particular name and a person’s characteristics, which varies for each property. I’ll give some estimates of how strong these relationships are below, and I link to some papers with more rigorous quantitative evaluations below.

Not 100% accurate, as Pete points out, but an interesting starting point. Plus links to more formal analysis.
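
To make the mechanics concrete, here is a toy version of the frequency-lookup idea (the frequencies below are invented; the real tools use the full Census and Social Security tables Pete describes):

```python
# Invented frequencies for illustration; the real tools use the full
# US Census and Social Security name tables Pete links to.
FIRST_NAME_STATS = {
    "mildred": {"gender": ("F", 0.99), "median_birth_year": 1928},
    "juan":    {"gender": ("M", 0.99), "median_birth_year": 1975},
}
SURNAME_ETHNICITY = {
    "hermann":   {"white": 0.90, "hispanic": 0.02},
    "hernandez": {"white": 0.05, "hispanic": 0.93},
}

def guess(full_name):
    """Guess gender, birth cohort, and ethnicity from name frequencies alone."""
    first, last = full_name.lower().split()
    first_stats = FIRST_NAME_STATS.get(first, {})
    eth = SURNAME_ETHNICITY.get(last, {})
    return {
        "gender": first_stats.get("gender"),
        "median_birth_year": first_stats.get("median_birth_year"),
        "most_likely_ethnicity": max(eth, key=eth.get) if eth else None,
    }

print(guess("Mildred Hermann"))
print(guess("Juan Hernandez"))
```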

For Example

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 1:55 pm

For Example by Mike Bostock.

[Image: Montage]

I am a big fan of examples. Not a surprise, right? If you follow me on Twitter, or my projects over the last few years (or asked D3 questions on Stack Overflow), you’ve likely seen some of my example visualizations, maps and explanations.

I use examples so often that I created bl.ocks.org to make it easier for me to share them. It lets you quickly post code and share examples with a short URL. Your code is displayed below; it’s view source by default. And it’s backed by GitHub Gist, so examples have a git repository for version control, and are forkable, cloneable and commentable.

I initially conceived this talk as an excuse to show all my examples. But with more than 600, I’d have only 4.5 seconds per slide. A bit overwhelming. So instead I’ve picked a few favorites that I hope you’ll enjoy. You should find this talk entertaining, even if it fails to be insightful.

This talk does have a point, though. Examples are lightweight and informal; they can often be made in a few minutes; they lack the ceremony of polished graphics or official tools. Yet examples are a powerful medium of communication that is capable of expressing big ideas with immediate impact. And Eyeo is a unique opportunity for me to talk directly to all of you that are doing amazing things with code, data and visualization. So, if I can accomplish one thing here, it should be to get you to share more examples. In short, to share my love of examples with you.

Mike’s post is full of excellent D3 graphics. You owe it to yourself to review all of them in full.

I first saw this at Nat Torkington’s Four short links: 11 June 2013.

NYU Large Scale Machine Learning Class Notes

Filed under: Machine Learning — Patrick Durusau @ 1:41 pm

NYU Large Scale Machine Learning Class Notes by John Langford.

John has posted the class notes from the large scale machine learning class he co-taught with Yann LeCun.

Catch the videos here.

Building A Visual Planetary Time Machine

Filed under: Geography,Mapping,Maps,Visualization — Patrick Durusau @ 1:30 pm

Building A Visual Planetary Time Machine by Randy Sargent, Google/Carnegie Mellon University; Matt Hancher and Eric Nguyen, Google; and Illah Nourbakhsh, Carnegie Mellon University.

From the post:

When a societal or scientific issue is highly contested, visual evidence can cut to the core of the debate in a way that words alone cannot — communicating complicated ideas that can be understood by experts and non-experts alike. After all, it took the invention of the optical telescope to overturn the idea that the heavens revolved around the earth.

Last month, Google announced a zoomable and explorable time-lapse view of our planet. This time-lapse Earth enables you explore the last 29 years of our planet’s history — from the global scale to the local scale, all across the planet. We hope this new visual dataset will ground debates, encourage discovery, and shift perspectives about some of today’s pressing global issues.

This project is a collaboration between Google’s Earth Engine team, Carnegie Mellon University’s CREATE Lab, and TIME Magazine — using nearly a petabyte of historical record from USGS’s and NASA’s Landsat satellites. And in this post, we’d like to give a little insight into the process required to build this time-lapse view of our planet.

Great imaging and a benchmark to compare future progress in this area.

Within three to five (3-5) years, this should be doable as a senior CS project, with graduate students and advanced hackers using higher-resolution “spy” satellite images.

In five to eight (5-8) years, software packages will appear for the average consumer, processing on the local “grid.”

In eight to ten (8-10) years, mostly due to the long product cycle, it appears in MS Office XXI. 😉

If not sooner!

Easy mapping with Map Stack

Filed under: Interface Research/Design,Mapping,Maps,Usability — Patrick Durusau @ 11:25 am

Easy mapping with Map Stack by Nathan Yau.

[Image: Map Stack]

Nathan writes:

It seems like the technical side of map-making, the part that requires code or complicated software installations, fades a little more every day. People get to focus more on actual map-making than on server setup. Map Stack by Stamen is the most recent tool to help you do this.

(…)

It’s completely web-based, and you edit your maps via a click interface. Pick what you want (or use Stamen’s own stylish themes) and save an image. For the time being, the service is open only from 11am to 5pm PST, so just come back later if it happens to be closed.

Over 3,000 maps have been made over the last four days! Examples.

Now to see semantic mapping interfaces improve.
