Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 16, 2012

Mining of Massive Datasets [Revised – Mining Large Graphs Added]

Filed under: BigData,Data Analysis,Data Mining — Patrick Durusau @ 7:04 pm

Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman.

Version 1.0 errata frozen as of June 4, 2012.

Version 1.1 adds Jure Leskovec as a co-author and adds a chapter on mining large graphs.

Both versions can be downloaded as chapters or as entire text.

Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js

Filed under: Hadoop,MongoDB,node-js,Pig — Patrick Durusau @ 6:53 pm

Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js by Russell Jurney.

From the post:

Series Introduction

Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.

Working code for this post as well as setup instructions for the tools we use are available at https://github.com/rjurney/enron-node-mongo and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.

Introduction

In this post we’ll be using Hadoop, Pig, mongo-hadoop, MongoDB and Node.js to turn Avro records into a web service. We do so to illustrate Pig’s ability to act as glue between distributed systems, and to show how easy it is to publish data from Hadoop to the web.

I was tempted to add ‘duct tape’ as a category. But there could only be one entry. 😉

Take an early weekend and have some fun with this tomorrow. August will be over sooner than you think.
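
If you want to see the shape of that “glue” without standing up Pig at all, here is a minimal Python sketch of the same Avro-to-MongoDB step. The avro and pymongo packages, a local MongoDB instance, and the downloaded enron.avro file are assumptions on my part, not part of Russell’s post:

```python
# pip install avro pymongo
from avro.datafile import DataFileReader
from avro.io import DatumReader
from pymongo import MongoClient

# Assumptions: enron.avro (from the link above) sits in the working directory
# and a MongoDB server is listening on localhost:27017.
reader = DataFileReader(open("enron.avro", "rb"), DatumReader())
records = [dict(record) for record in reader]   # read every Avro record into memory
reader.close()

client = MongoClient("localhost", 27017)
emails = client["enron"]["emails"]
emails.insert_many(records)
print(emails.count_documents({}), "documents loaded")
```

Pig and the mongo-hadoop connector do the same job in a couple of lines of Pig Latin and scale it across a cluster, which is the point of Russell’s series.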

Getting a Big Neo4j Test Box for Cheap!

Filed under: Hosting,Neo4j — Patrick Durusau @ 4:11 pm

Getting a Big Neo4j Test Box for Cheap! by Max De Marzi.

From the post:

When embarking on a new Neo4j project, one of the things you have to figure out is where to run it. Most of the time the answer is just your laptop. Other times, using Heroku works great. However, if you are at the stage of your testing where you have billions of nodes and relationships, you need something a little bigger.

If you are not ready to commit to purchasing a 100k server for testing, then I suggest you borrow one for a short time. You can try to spin up an Amazon EC2 instance, the high memory large ones go up to 60 gigs of RAM. But what if you need more? Lots more?

The answer for me has been WebNX.com. It’s a data center out of Los Angeles that has great deals on big boxes.

I won’t ruin the surprise for you!

Read Max’s post and then go to WebNX.com.

The prices are only going to get better.

Using the Cloudant Data Layer for Windows Azure

Filed under: Cloud Computing,Windows Azure — Patrick Durusau @ 4:00 pm

Using the Cloudant Data Layer for Windows Azure by Doug Mahugh.

From the post:

If you need a highly scalable data layer for your cloud service or application running on Windows Azure, the Cloudant Data Layer for Windows Azure may be a great fit. This service, which was announced in preview mode in June and is now in beta, delivers Cloudant’s “database as a service” offering on Windows Azure.

From Cloudant’s data layer you’ll get rich support for data replication and synchronization scenarios such as online/offline data access for mobile device support, a RESTful Apache CouchDB-compatible API, and powerful features including full-text search, geo-location, federated analytics, schema-less document collections, and many others. And perhaps the greatest benefit of all is what you don’t get with Cloudant’s approach: you’ll have no responsibility for provisioning, deploying, or managing your data layer. The experts at Cloudant take care of those details, while you stay focused on building applications and cloud services that use the data layer.

….

For an example of how to use the Cloudant Data Layer, see the tutorial “Using the Cloudant Data Layer for Windows Azure,” which takes you through the steps needed to set up an account, create a database, configure access permissions, and develop a simple PHP-based photo album application that uses the database to store text and images.

Not that you need a Cloudant Data Layer for a photo album but it will help get your feet wet with cloud computing.
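
If you want to poke at the API before working through the tutorial, the CouchDB-compatible REST interface boils down to a handful of HTTP calls. A minimal sketch with the requests library; the account name, credentials, and database name are placeholders, not anything from Doug’s post:

```python
# pip install requests
import requests

# Placeholders: substitute your Cloudant account name and credentials.
ACCOUNT = "example"
AUTH = ("example", "password")
BASE = "https://%s.cloudant.com" % ACCOUNT

# Create a database (CouchDB-style PUT), add a document, read it back.
requests.put(BASE + "/photos", auth=AUTH)

doc = {"caption": "feet, getting wet", "album": "cloud-computing"}
created = requests.post(BASE + "/photos", json=doc, auth=AUTH).json()

fetched = requests.get(BASE + "/photos/" + created["id"], auth=AUTH).json()
print(fetched["caption"])
```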

A Provably Correct Scalable Concurrent Skip List

Filed under: Data Structures,Lock-Free Algorithms,Scalability,Skip List — Patrick Durusau @ 3:19 pm

From High Scalability, a report on the paper A Provably Correct Scalable Concurrent Skip List.

From the post:

In MemSQL Architecture we learned one of the core strategies MemSQL uses to achieve their need for speed is lock-free skip lists. Skip lists are used to efficiently handle range queries. Making the skip-lists lock-free helps eliminate contention and make writes fast.

If this all sounds a little pie-in-the-sky then here’s a very good paper on the subject that might help make it clearer: A Provably Correct Scalable Concurrent Skip List.

The cited paper is by Maurice Herlihy, Yossi Lev, Victor Luchangco, and Nir Shavit. The authors shared Sun Microsystems as an employer, so you know the paper is dated.

For more background on lock-free data structures, including Keir Fraser’s “Practical lock freedom” dissertation, see: Practical lock-free data structures.
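
If skip lists themselves are new to you, here is a toy, single-threaded Python sketch of the data structure the paper makes lock-free. It shows why range queries are cheap (walk the bottom level), but says nothing about the concurrency machinery that is the paper’s real contribution:

```python
import random

class Node:
    """A node carries a key and one forward pointer per level it appears on."""
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)

class SkipList:
    """Toy single-threaded skip list: sorted keys, O(log n) expected search."""
    MAX_LEVEL = 16
    P = 0.5

    def __init__(self):
        self.level = 0
        self.head = Node(None, self.MAX_LEVEL)

    def _random_level(self):
        lvl = 0
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * (self.MAX_LEVEL + 1)
        node = self.head
        # Walk down from the top level, remembering the last node seen per level.
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        if lvl > self.level:
            self.level = lvl
        new = Node(key, lvl)
        for i in range(lvl + 1):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def range(self, lo, hi):
        """Yield keys in [lo, hi) -- the range-query case MemSQL cares about."""
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < lo:
                node = node.forward[i]
        node = node.forward[0]
        while node and node.key < hi:
            yield node.key
            node = node.forward[0]

if __name__ == "__main__":
    sl = SkipList()
    for k in [30, 10, 50, 20, 40]:
        sl.insert(k)
    print(sl.contains(20), list(sl.range(15, 45)))  # True [20, 30, 40]
```

The hard part the paper addresses is making insert and contains safe under concurrent writers without locks, which this sketch does not attempt.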

HBase Replication: Operational Overview

Filed under: Hadoop,HBase — Patrick Durusau @ 2:16 pm

HBase Replication: Operational Overview by Himanshu Vashishtha

From the post:

This is the second blogpost about HBase replication. The previous blogpost, HBase Replication Overview, discussed use cases, architecture and different modes supported in HBase replication. This blogpost is from an operational perspective and will touch upon HBase replication configuration, and key concepts for using it — such as bootstrapping, schema change, and fault tolerance.

The sort of post that makes you long for one or more mini-clusters. 😉

Building LinkedIn’s Real-time Activity Data Pipeline

Filed under: Aggregation,Analytics,Data Streams,Kafka,Systems Administration — Patrick Durusau @ 1:21 pm

Building LinkedIn’s Real-time Activity Data Pipeline by Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. (pdf)

Abstract:

One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processes each day. We discuss the origins of this system, missteps on the path to real-time, and the design and engineering problems we encountered along the way.

More details on Kafka (see Choking Cassandra Bolt).

What if you think about message feeds as being pipelines that are large enough to see and configure?

Chip level pipelines are more efficient but harder to configure.

Perhaps passing messages is efficient and flexible enough for a class of use cases.
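
To make the publish-subscribe model concrete, here is a minimal sketch using the present-day kafka-python client. The broker address and topic name are assumptions, and LinkedIn’s actual pipeline obviously involves far more than this:

```python
# pip install kafka-python
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # assumption: a local broker is running
TOPIC = "activity-events"   # hypothetical topic name

# Producer side: an application publishes activity events to a topic.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(10):
    producer.send(TOPIC, value=("page_view user=%d" % i).encode("utf-8"))
producer.flush()

# Consumer side: any number of downstream systems subscribe independently.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",   # start from the beginning of the log
    consumer_timeout_ms=5000,       # stop iterating once the topic is drained
)
for message in consumer:
    print(message.offset, message.value.decode("utf-8"))
```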

Typology Oberseminar talk and Speed up of retrieval by a factor of 1000

Filed under: Graphity,Graphs — Patrick Durusau @ 12:05 pm

Typology Oberseminar talk and Speed up of retrieval by a factor of 1000 by René Pickhardt.

From the post:

Almost 2 months ago I talked in our oberseminar about Typology. Most readers of my blog will already know the project which was initially implemented by my students Till and Paul. I am just about to share some slides with you. They explain on one hand how the systems works and on the other hand give some overview of the related work.
As you can see from the slides we are planning to submit our results to the SIGIR conference. So one year after my first blogpost on graphity, which developed into a full paper for socialcom2012 (graphity blog post and blog post for source code), there is the yet informal typology blog post with the slides about the Typology Oberseminar talk and 3 months left for our SIGIR submission. I expect this time the submission will not be such a hassle as graphity since I should have learnt some lessons and also have a good student who is helping me with the implementation of all the tests.

Time remains for you to make suggestions!

August 15, 2012

The Statistical Sleuth (second edition) in R

Filed under: R,Statistics — Patrick Durusau @ 7:59 pm

The Statistical Sleuth (second edition) in R by Nick Horton.

For those of you who teach, or are interested in seeing an illustrated series of analyses, there is a new compendium of files to help describe how to fit models for the extended case studies in the Second Edition of the Statistical Sleuth: A Course in Methods of Data Analysis (2002), the excellent text by Fred Ramsey and Dan Schafer. If you are using this book, or would like to see straightforward ways to undertake analyses in R for intro and intermediate statistics courses, these may be of interest.

This originally appeared at SAS and R.

Announcing Percona Server 5.6 Alpha

Filed under: MySQL,Percona Server — Patrick Durusau @ 2:45 pm

Announcing Percona Server 5.6 Alpha by Stewart Smith

From the post:

We are very happy to announce our first alpha of Percona Server 5.6. Based on MySQL 5.6.5 and all the improvements contained within, this is the first step towards a full Percona Server 5.6 release.

Binaries are available to download from our downloads site here: http://www.percona.com/downloads/Percona-Server-5.6/Percona-Server-5.6.5-alpha60.0/

We will post binaries to our EXPERIMENTAL repositories later; we’re undergoing final testing to ensure that it won’t cause problems for those running Percona Server < 5.6 from EXPERIMENTAL.

Percona Server 5.6.5-alpha60.0 does not contain all the features of Percona Server 5.5. We are going to “release early, release often” as we add features from Percona Server 5.5. As such, our documentation will not be complete for a little while yet and these release notes are currently the best source of information – please bear with us.

Go ahead, take a walk on the wild side! 😉

Where to start with text mining

Filed under: Digital Research,Text Mining — Patrick Durusau @ 2:40 pm

Where to start with text mining by Ted Underwood.

From the post:

This post is less a coherent argument than an outline of discussion topics I’m proposing for a workshop at NASSR2012 (a conference of Romanticists). But I’m putting this on the blog since some of the links might be useful for a broader audience. Also, we won’t really cover all this material, so the blog post may give workshop participants a chance to explore things I only gestured at in person.

In the morning I’ll give a few examples of concrete literary results produced by text mining. I’ll start the afternoon workshop by opening two questions for discussion: first, what are the obstacles confronting a literary scholar who might want to experiment with quantitative methods? Second, how do those methods actually work, and what are their limits?

I’ll also invite participants to play around with a collection of 818 works between 1780 and 1859, using an R program I’ve provided for the occasion. Links for these materials are at the end of this post.

Something to pass along to any humanities scholars you know who aren’t already into text mining.

I first saw this at: primer for digital humanities.

Designing Open Projects

Filed under: Project Management — Patrick Durusau @ 2:12 pm

Designing Open Projects: Lessons From Internet Pioneers (PDF) by David Witzel.

From the foreword:

A key insight underpinning Witzel’s tips is that this is not a precise methodology to be followed. Instead, an open project approach should be viewed as a mindset. Leaders have to discern whether the challenges they are facing can best be solved using a closed or open approach, defined as follows:

  • A closed project has a defined staff, budget, and outcome; and uses hierarchy and logic models to direct activities. It is particularly appropriate for problems with known solutions and stable environments, such as the development of a major highway project.
  • An open project is useful to address challenges where the end may not be clear, the environment is rapidly changing, and/or the coordinating entity doesn’t have the authority or resources to directly create needed change. In these open projects, new stakeholders can join at will, roles are often informal, resources are shared, and actions and decisions are distributed throughout the system.

Witzel’s report provides guideposts on how to use an open project approach on appropriate large-scale efforts. We hope this report serves as an inspiration and practical guide to federal managers as they address the increasingly complex challenges facing our country that reach across federal agency—and often state, local, nonprofit, and private sector—boundaries.

I can think of examples of semantic integration projects that would work better with either model.

What factors would you consider before putting your next semantic integration project into one category or the other?

I first saw this at: Four short links: 15 August 2012 by Nat Torkington

Mining the astronomical literature

Filed under: Astroinformatics,Data Mining — Patrick Durusau @ 1:58 pm

Mining the astronomical literature (A clever data project shows the promise of open and freely accessible academic literature) by Alasdair Allan.

From the post:

There is a huge debate right now about making academic literature freely accessible and moving toward open access. But what would be possible if people stopped talking about it and just dug in and got on with it?

NASA’s Astrophysics Data System (ADS), hosted by the Smithsonian Astrophysical Observatory (SAO), has quietly been working away since the mid-’90s. Without much, if any, fanfare amongst the other disciplines, it has moved astronomers into a world where access to the literature is just a given. It’s something they don’t have to think about all that much.

The ADS service provides access to abstracts for virtually all of the astronomical literature. But it also provides access to the full text of more than half a million papers, going right back to the start of peer-reviewed journals in the 1800s. The service has links to online data archives, along with reference and citation information for each of the papers, and it’s all searchable and downloadable.

(graphic omitted)

The existence of the ADS, along with the arXiv pre-print server, has meant that most astronomers haven’t seen the inside of a brick-built library since the late 1990s.

It also makes astronomy almost uniquely well placed for interesting data mining experiments, experiments that hint at what the rest of academia could do if they followed astronomy’s lead. The fact that the discipline’s literature has been scanned, archived, indexed and catalogued, and placed behind a RESTful API makes it a treasure trove, both for hypothesis generation and sociological research.

That’s the trick, isn’t it? “…if they followed astronomy’s lead.”

The technology used by the astronomical community has been equally available to other scientific, technical, medical and humanities disciplines.

Instead of ADS, for example, the humanities have JSTOR. JSTOR is supported by funds that originate with the public but the public has no access.

An example of how a data project reflects the character of the community that gave rise to it.

Astronomers value sharing of information and data, therefore their projects reflect those values.

Other projects reflect other values.

Not a question of technology but one of fundamental values.
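
Values aside, the ADS service is easy to try for yourself. Here is a minimal sketch against the present-day ADS search API; the endpoint, token handling, and field names follow the current API documentation as I understand it, so treat them as assumptions:

```python
# pip install requests
import requests

ADS_API = "https://api.adsabs.harvard.edu/v1/search/query"
TOKEN = "YOUR_ADS_TOKEN"   # placeholder; issued from your ADS account settings

resp = requests.get(
    ADS_API,
    headers={"Authorization": "Bearer " + TOKEN},
    params={
        "q": 'abs:"gravitational lensing"',        # search abstracts
        "fl": "bibcode,title,year,citation_count",  # fields to return
        "rows": 5,
        "sort": "citation_count desc",
    },
)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["year"], doc.get("citation_count", 0), doc["title"][0])
```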

BiologicalNetworks

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 10:25 am

BiologicalNetworks

From the webpage:

BiologicalNetworks research environment enables integrative analysis of:

  • Interaction networks, metabolic and signaling pathways together with transcriptomic, metabolomic and proteomic experiments data
  • Transcriptional regulation modules (modular networks)
  • Genomic sequences including gene regulatory regions (e.g. binding sites, promoter regions) and respective transcription factors, as well as NGS data
  • Comparative genomics, homologous/orthologous genes and phylogenies
  • 3D protein structures and ligand binding, small molecules and drugs
  • Multiple ontologies including GeneOntology, Cell and Tissue types, Diseases, Anatomy and taxonomies

BiologicalNetworks backend database (IntegromeDB) integrates >1000 curated data sources (from the NAR list) for thousands of eukaryotic, prokaryotic and viral organisms and millions of public biomedical, biochemical, drug, disease and health-related web resources.

Correction: As of 3 July 2012, “IntegromeDB’s index reaches 1 Billion (biomedical resources links) milestone.”

IntegromeDB collects all the biomedical, biochemical, drug and disease related data available in the public domain and brings you the most relevant data for your search. It provides you with an integrative view on the genomic, proteomic, transcriptomic, genetic and functional information featuring gene/protein names, synonyms and alternative IDs, gene function, orthologies, gene expression, pathways and molecular (protein-protein, TF-gene, genetic, etc.) interactions, mutations and SNPs, disease relationships, drugs and compounds, and many others. Explore and enjoy!

Sounds a lot like a topic map, doesn’t it?

One interesting feature is Inconsistency in the integrated data.

The data sets are available for download as RDF files.

How would you:

  • Improve the consistency of integrated data?
  • Enable crowd participation in curation of data?
  • Enable the integration of data files into other data systems?
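
As a starting point on the first question, here is a minimal rdflib sketch that loads one of the RDF dumps (the file name is a placeholder) and flags subjects carrying conflicting labels, one crude signal of inconsistency in the integrated data:

```python
# pip install rdflib
from collections import defaultdict
from rdflib import Graph
from rdflib.namespace import RDFS

g = Graph()
# Hypothetical file name; substitute one of the downloaded IntegromeDB dumps.
g.parse("integromedb_dump.rdf", format="xml")

# Group rdfs:label values by subject and flag subjects with conflicting labels.
labels = defaultdict(set)
for subject, label in g.subject_objects(RDFS.label):
    labels[subject].add(str(label).strip().lower())

for subject, names in labels.items():
    if len(names) > 1:
        print(subject, sorted(names))
```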

August 14, 2012

Lucene Core 4.0-BETA and Solr 4.0-BETA Available

Filed under: Lucene,Solr — Patrick Durusau @ 6:10 pm

I stopped by the Lucene site to check for upgrades and found:

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.0-BETA and Apache Solr 4.0-BETA

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html
and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Highlights of the Lucene release include:

  • IndexWriter.tryDeleteDocument can sometimes delete by document ID, for higher performance in some applications.
  • New experimental postings formats: BloomFilteringPostingsFormat uses a bloom filter to sometimes avoid disk seeks when looking up terms, DirectPostingsFormat holds all postings as simple byte[] and int[] for very fast performance at the cost of very high RAM consumption.
  • CJK analysis improvements: JapaneseIterationMarkCharFilter normalizes Japanese iteration marks, added unigram+bigram support to CJKBigramFilter.
  • Improvements to Scorer navigation API (Scorer.getChildren) to support all queries, useful for determining which portions of the query matched.
  • Analysis improvements: factories for creating Tokenizer, TokenFilter, and CharFilter have been moved from Solr to Lucene’s analysis module, less memory overhead for StandardTokenizer and Snowball filters.

  • Improved highlighting for multi-valued fields.
  • Various other API changes, optimizations and bug fixes.

Highlights of the Solr release include:

  • Added a Collection management API for Solr Cloud.
  • Solr Admin UI now clearly displays failures related to initializing SolrCores.
  • Updatable documents can create a document if it doesn’t already exist, or you can force that the document must already exist.
  • Full delete-by-query support for Solr Cloud.
  • Default to NRTCachingDirectory for improved near-realtime performance.
  • Improved Solrj client performance with Solr Cloud: updates are only sent to leaders by default.
  • Various other API changes, optimizations and bug fixes.
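
If you want to kick the tires on the Solr beta from Python, here is a minimal sketch using the pysolr client against the stock example core; the URL and field names are my assumptions based on the example schema, not anything from the release notes:

```python
# pip install pysolr
import pysolr

# Assumption: a Solr 4.0-BETA example core running locally.
solr = pysolr.Solr("http://localhost:8983/solr/collection1", timeout=10)

# Index a couple of documents and commit.
solr.add([
    {"id": "doc-1", "name": "Lucene 4.0-BETA release notes"},
    {"id": "doc-2", "name": "Solr 4.0-BETA release notes"},
], commit=True)

# Query them back.
for result in solr.search("name:release"):
    print(result["id"])

# Delete-by-query (this release adds full support for it under Solr Cloud).
solr.delete(q="name:Lucene", commit=True)
```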

A Direct Mapping of Relational Data to RDF

Filed under: RDB,RDF — Patrick Durusau @ 3:36 pm

A Direct Mapping of Relational Data to RDF from the RDB2RDF Working Group.

From the news:

The need to share data with collaborators motivates custodians and users of relational databases (RDB) to expose relational data on the Web of Data. This document defines a direct mapping from relational data to RDF. This definition provides extension points for refinements within and outside of this document. Comments are welcome through 15 September. (emphasis added)

Comments to: public-rdb2rdf-comments@w3.org.

Subscribe (prior to commenting).
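
To get a feel for what a direct mapping produces, here is a minimal rdflib sketch that turns one relational row into triples. It loosely follows the shape of the W3C mapping (a row IRI built from the primary key, one predicate per column) but is not a conformant implementation, and the base IRI, table, and row are made up for illustration:

```python
# pip install rdflib
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# A hypothetical relational row: table "people", primary key "id".
table, pk, row = "people", "id", {"id": 7, "name": "Ada", "dept": "R&D"}

BASE = Namespace("http://example.com/base/")   # assumption: the mapping base IRI
g = Graph()

# One subject IRI per row, one predicate per column, plus an rdf:type for the table.
subject = BASE["%s/%s=%s" % (table, pk, row[pk])]
g.add((subject, RDF.type, BASE[table]))
for column, value in row.items():
    g.add((subject, BASE["%s#%s" % (table, column)], Literal(value)))

print(g.serialize(format="turtle"))
```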

R2RML: RDB to RDF Mapping Language

Filed under: Analytics,BigData,R2RML,RDB,RDF — Patrick Durusau @ 3:29 pm

R2RML: RDB to RDF Mapping Language from the RDB2RDF Working Group.

From the news:

This document describes R2RML, a language for expressing customized mappings from relational databases to RDF datasets. Such mappings provide the ability to view existing relational data in the RDF data model, expressed in a structure and target vocabulary of the mapping author’s choice. R2RML mappings are themselves RDF graphs and written down in Turtle syntax. R2RML enables different types of mapping implementations. Processors could, for example, offer a virtual SPARQL endpoint over the mapped relational data, or generate RDF dumps, or offer a Linked Data interface. Comments are welcome through 15 September. (emphasis added)

Subscribe (prior to commenting).

Comments to: public-rdb2rdf-comments@w3.org.

Prior Art Finder

Filed under: Patents — Patrick Durusau @ 2:31 pm

Improving Google Patents with European Patent Office patents and the Prior Art Finder by Jon Orwant, Engineering Manager (Google Research).

From the post:

At Google, we’re constantly trying to make important collections of information more useful to the world. Since 2006, we’ve let people discover, search, and read United States patents online. Starting this week, you can do the same for the millions of ideas that have been submitted to the European Patent Office, such as this one.

Typically, patents are granted only if an invention is new and not obvious. To explain why an invention is new, inventors will usually cite prior art such as earlier patent applications or journal articles. Determining the novelty of a patent can be difficult, requiring a laborious search through many sources, and so we’ve built a Prior Art Finder to make this process easier. With a single click, it searches multiple sources for related content that existed at the time the patent was filed.

Maybe the USPTO will add “Have you used Google’s Prior Art Finder?” to the patent examination form.

The vocabulary issue remains, but at least this is a start in the right direction.

Tracking Down an Epidemic’s Source

Filed under: Networks,Record Linkage — Patrick Durusau @ 12:49 pm

Tracking Down an Epidemic’s Source (Physics 5, 89 (2012) | DOI: 10.1103/Physics.5.89)

From the post:

Epidemiologists often have to uncover the source of a disease outbreak with only limited information about who is infected. Mathematical models usually assume a complete dataset, but a team reporting in Physical Review Letters demonstrates how to find the source with very little data. Their technique is based on the principles used by telecommunication towers to pinpoint cell phone users, and they demonstrate its effectiveness with real data from a South African cholera outbreak. The system could also work with other kinds of networks to help governments locate contamination sources in water systems or find the leaders in a network of terrorist contacts.

A rumor can spread across a user network on Twitter, just as a disease spreads throughout a network of personal contacts. But there’s a big difference when it comes to tracking down the source: online social networks have volumes of time-stamped data, whereas epidemiologists usually have information from only a fraction of the infected individuals.

To address this problem, Pedro Pinto and his colleagues at the Swiss Federal Institute of Technology in Lausanne (EPFL) developed a model based on the standard network picture for epidemics. Individuals are imagined as points, or “nodes,” in a plane, connected by a network of lines. Each node has several lines connecting it to other nodes, and each node can be either infected or uninfected. In the team’s scenario, all nodes begin the process uninfected, and a single source node spreads the infection from neighbor to neighbor, with a random time delay for each transmission. Eventually, every node becomes infected and records both its time of infection and the identity of the infecting neighbor.

To trace back to the source using data from a fraction of the nodes, Pinto and his colleagues adapted methods used in wireless communications networks. When three or more base stations receive a signal from one cell phone, the system can measure the difference in the signal’s arrival time at each base station to triangulate a user’s position. Similarly, Pinto’s team combined the arrival times of the infection at a subset of “observer” nodes to find the source. But in the infection network, a given arrival time could correspond to multiple transmission paths, and the time from one transmission to the next varies randomly. To improve their chances of success, the team used the fact that the source had to be one of a finite set of nodes, unlike a cell phone user, who could have any of an infinite set of coordinates within the coverage area.

Summarizes: Pedro C. Pinto, Patrick Thiran, and Martin Vetterli, “Locating the Source of Diffusion in Large-Scale Networks,” Phys. Rev. Lett. 109, 068702 (2012).

One wonders whether participation in multiple networks (some social, some electronic, some organizational) would be amenable to record linkage techniques.

Leaks from government could be tracked using only one type of network, but the result is likely to be incomplete and misleading.
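
The triangulation idea is easy to play with. Here is a toy networkx sketch that scores every node by how consistent the observers’ infection times are with its distances to them; Pinto and colleagues use a proper estimator on top of this intuition, so treat the scoring rule below as my simplification, not theirs:

```python
# pip install networkx
import random
import networkx as nx

random.seed(1)
G = nx.erdos_renyi_graph(200, 0.03, seed=1)   # stand-in contact network

# Simulate a spread from a hidden source: infection time = hop distance + noise.
true_source = 42
dist = nx.shortest_path_length(G, source=true_source)
infect_time = {v: d + random.uniform(0.0, 0.3) for v, d in dist.items()}

# Only a small fraction of nodes report their infection times ("observers").
observers = random.sample(sorted(infect_time), 15)

def score(candidate):
    """Toy fit: variance of (observed time - distance from candidate) offsets."""
    d = nx.shortest_path_length(G, source=candidate)
    offsets = [infect_time[o] - d[o] for o in observers if o in d]
    if len(offsets) < 2:
        return float("inf")
    mean = sum(offsets) / len(offsets)
    return sum((x - mean) ** 2 for x in offsets) / len(offsets)

best = min(G.nodes, key=score)
print("true source:", true_source, "estimated source:", best)
```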

Mono integrates Entity Framework

Filed under: .Net,ADO.Net Entity Framework,C#,ORM — Patrick Durusau @ 10:42 am

Mono integrates Entity Framework

From the post:

The fourth preview release of version 2.11 of Mono, the open source implementation of Microsoft’s C# and .NET platform, is now available. Version 2.11.3 integrates Microsoft’s ADO.NET Entity Framework which was released as open source, under the Apache 2.0 licence, at the end of July. The Entity Framework is the company’s object-relational mapper (ORM) for the .NET Framework. This latest alpha version of Mono 2.11 has also been updated in order to match async support in .NET 4.5.

Just in case you are not familiar with the MS ADO.Net Entity Framework:

The ADO.NET Entity Framework enables developers to create data access applications by programming against a conceptual application model instead of programming directly against a relational storage schema. The goal is to decrease the amount of code and maintenance required for data-oriented applications. Entity Framework applications provide the following benefits:

  • Applications can work in terms of a more application-centric conceptual model, including types with inheritance, complex members, and relationships.
  • Applications are freed from hard-coded dependencies on a particular data engine or storage schema.
  • Mappings between the conceptual model and the storage-specific schema can change without changing the application code.
  • Developers can work with a consistent application object model that can be mapped to various storage schemas, possibly implemented in different database management systems.
  • Multiple conceptual models can be mapped to a single storage schema.
  • Language-integrated query (LINQ) support provides compile-time syntax validation for queries against a conceptual model.

Does the source code at Entity Framework at CodePlex need extension to:

  • Discover when multiple conceptual models are mapped against a single storage schema?
  • Discover when parts of conceptual models vary in name only? (to avoid duplication of models)
  • Compare/contrast types with inheritance, complex members, and relationships?

If those sound like topic map type questions, they are.

There are always going to be subjects that need mappings to work with newer systems or different understandings of old ones.

Let’s stop pretending we’re going to reach the promised land and keep our compasses close at hand.

Neo4J Internals (update)

Filed under: Graphs,Neo4j — Patrick Durusau @ 10:15 am

Max De Marzi points to the slides for a presentation by Tobias Lindaaker on Neo4j internals (January, 2012).

When you see the linked lists you will understand why Neo4j doesn’t offer “merging” in the TMDM sense of the word. The overhead would be impressive.

See slide 9 (the way slideshare counts) for example.

I don’t know if these slides go with the January presentation (above), a presentation in May of 2012 (date at the bottom of the slides), or some other presentation. Dating presentation slides would help keep that straight.

August 13, 2012

Are You An IT Hostage?

As I promised last week in From Overload to Impact: An Industry Scorecard on Big Data Business Challenges [Oracle Report], here is the key finding that is missing from Oracle’s summary:

Executives’ Biggest Data Management Gripes:*

#1 Don’t have the right systems in place to gather the information we need (38%)

#2 Can’t give our business managers access to the information they need; need to rely on IT (36%)

Ask your business managers: Do they feel like IT hostages?

You are likely to be surprised at the answers you get.

IT’s vocabulary acts as an information clog.

A clog that impedes the flow of information in your organization.

Information that can improve the speed and quality of business decision making.

The critical point is: Information clogs are bad for business.

Do you want to borrow my plunger?

Lessons from organizing the kitchen cabinet

Filed under: Graphics,Visualization — Patrick Durusau @ 6:04 pm

Lessons from organizing the kitchen cabinet by Kaiser Fung.

From the post:

The first thing we know about kitchen cabinets is that they are not large enough. If you live in a small city apartment, you’re always looking for ways to maximize your space. If your McMansion has a huge kitchen, you’ll run out of space all the same, after splurging on the breadmaker, and the ice-cream maker, and the panini grill, and containers for garlic, onions, different shapes of pastas, and the peelers for apples, garlic, carrots, the egg-separator, the foam-maker, and so on.

Another thing we know is that no matter how many and how large the cabinets are, there is not enough premium space, by which we mean front-facing space within arm’s reach. What has this to do with graphs and charts? We’ll find out soon enough.

I won’t spoil the surprise for you.

You will enjoy the foregrounding of choices that seem “obvious” to us but no doubt were unseen by others.

Neo4j 1.8.M07 – Sharing is Caring

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 4:51 pm

Neo4j 1.8.M07 – Sharing is Caring

Neo4j 1.8.M07 announcement with a couple of the highlights.

Available immediately, Neo4j 1.8 Milestone 7 sets the stage for responsible data sharing. We’re open source. Naturally we’re mindful about supporting…

Open (Meta) Data

Way back when Neo4j 1.2.M01 was released we introduced the Usage Data Collector (UDC), an optional component which would help us understand how running instances of Neo4j were being used, by reporting back anonymous context information: operating system, runtime, region of the Earth, that kind of thing.

Of course, the source code for the UDC is open and available for inspection. Now, we’re taking some steps to make the meta-data itself available, to make that data useful for everyone in the community, and to do so while being uber sensitive to the slightest hint of privacy concerns.

We’re kinda excited about this, actually. Stay tuned to learn more about what we’re doing, how you can be involved, and how it will be awesome for the community.

Create Unique Data

Earlier in the 1.8 branch, we introduced the RELATE clause, a powerful blend of MATCH and CREATE. With it, you could insist that a pattern of data should exist in the graph, and RELATE would perform the least creations required to uniquely satisfy the pattern.

In discussion, we kept saying things like “uniquely creates” to describe it, finally realizing that we should name the thing with the much more obvious CREATE UNIQUE.

Don’t be the last one on your block to have the latest Neo4j release!
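
If you want to try CREATE UNIQUE against a milestone server from Python, here is a minimal sketch that posts to the 1.8-era Cypher HTTP endpoint. The endpoint path, the reference node, and the query reflect that era’s dialect as I recall it, so treat the details as assumptions and check them against the 1.8 docs:

```python
# pip install requests
import json
import requests

# Assumptions: a local Neo4j 1.8 milestone server with the legacy Cypher endpoint.
CYPHER_URL = "http://localhost:7474/db/data/cypher"

# CREATE UNIQUE only creates what is missing to satisfy the pattern.
query = """
START root = node(0)
CREATE UNIQUE root-[:TAGGED]->(t {name: {tag}})
RETURN t
"""

resp = requests.post(
    CYPHER_URL,
    headers={"Content-Type": "application/json"},
    data=json.dumps({"query": query, "params": {"tag": "graphs"}}),
)
resp.raise_for_status()
print(resp.json()["data"])
```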

Summarize Opinions with a Graph – Part 1

Filed under: Graphs,Neo4j,Opinions — Patrick Durusau @ 4:36 pm

Summarize Opinions with a Graph – Part 1 by Max De Marzi.

From the post:

How does the saying go? Opinions are like bellybuttons, everybody’s got one? So let’s say you have an opinion that NOSQL is not for you. Maybe you read my blog and think this Graph Database stuff is great for recommendation engines and path finding and maybe some other stuff, but you got really hard problems and it can’t help you.

I am going to try to show you that a graph database can help you solve your really hard problems if you can frame your problem in terms of a graph. Did I say “you”? I meant anybody, especially Ph.D. students. One trick is to search for “graph based approach to” and your problem.

I’ll give you an example. The other day I ran into “Opinosis: A Graph Based Approach to Abstractive Summarization of Highly Redundant Opinions” by Kavita Ganesan, ChengXiang Zhai and Jiawei Han at the University of Illinois at Urbana-Champaign.

I think you are going to like this. Max’s work is always interesting but this post is particularly so.

Has implications beyond opinion gathering.
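
To give a flavor of the word-graph idea at the heart of Opinosis (and only the flavor; the paper and Max’s Neo4j version do much more), here is a toy Python sketch that builds a word-adjacency graph from redundant sentences and walks its heaviest path to propose a summary phrase:

```python
# Toy word-adjacency graph in the spirit of Opinosis (not the full algorithm):
# nodes are words, directed edges count how often one word follows another
# across many redundant sentences; heavy paths suggest summary phrases.
from collections import Counter, defaultdict

reviews = [
    "the battery life is very good",
    "the battery life is amazing",
    "battery life is very good indeed",
    "good screen but the battery life is good",
]

edges = Counter()
starts = Counter()
for sentence in reviews:
    words = sentence.split()
    starts[words[0]] += 1
    for a, b in zip(words, words[1:]):
        edges[(a, b)] += 1

# Greedily follow the heaviest outgoing edge to produce one candidate summary.
out = defaultdict(list)
for (a, b), w in edges.items():
    out[a].append((w, b))

word = starts.most_common(1)[0][0]
phrase, seen = [word], {word}
while out[word]:
    w, nxt = max(out[word])
    if w < 2 or nxt in seen:   # stop on weak edges or loops
        break
    phrase.append(nxt)
    seen.add(nxt)
    word = nxt

print(" ".join(phrase))   # e.g. "the battery life is very good"
```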

CDH3 update 5 is now available

Filed under: Avro,Cloudera,Flume,Hadoop,HDFS,Hive — Patrick Durusau @ 4:17 pm

CDH3 update 5 is now available by Arvind Prabhakar

From the post:

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of the CDH3 platform and provides a considerable number of bug fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:

  • Flume 1.2.0 – Provides a durable file channel and many more features over the previous release.
  • Hive AvroSerDe – Replaces the Haivvreo SerDe and provides robust support for Avro data format.
  • WebHDFS – A full read/write REST API to HDFS.

Maintenance release. Installation is good practice before major releases.
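
The WebHDFS addition is the easiest of the three to try from outside the cluster. A minimal sketch with requests; the NameNode host, port, user, and paths are placeholders, and WebHDFS has to be enabled (dfs.webhdfs.enabled) in your HDFS configuration:

```python
# pip install requests
import requests

# Placeholders: substitute your NameNode host and an HDFS user.
NAMENODE = "http://namenode.example.com:50070"
USER = "hdfs"

# List a directory.
resp = requests.get(
    NAMENODE + "/webhdfs/v1/user/hdfs",
    params={"op": "LISTSTATUS", "user.name": USER},
)
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"], entry["length"])

# Read a file (WebHDFS redirects the OPEN call to a DataNode; requests follows it).
data = requests.get(
    NAMENODE + "/webhdfs/v1/user/hdfs/example.txt",
    params={"op": "OPEN", "user.name": USER},
)
print(data.text[:200])
```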

Caleydo Project

Filed under: Bioinformatics,Graphics,Graphs,Networks,Visualization — Patrick Durusau @ 4:04 pm

Caleydo Project

From the webpage:

Caleydo is an open source visual analysis framework targeted at biomolecular data. The biggest strength of Caleydo is the visualization of interdependencies between multiple datasets. Caleydo can load tabular data and groupings/clusterings. You can explore relationships between multiple groupings, between different datasets and see how your data maps onto pathways.

Caleydo has been successfully used to analyze mRNA, miRNA, methylation, copy number variation, mutation status and clinical data as well as other dataset types.

The screenshot from mybiosoftware.com really caught my attention:

(Caleydo screenshot omitted)

Targets biomolecular data but may have broader applications.

Machine Learning Throwdown, [Part 1, 2, 3, 4, 5, 6 (complete)]

Filed under: Machine Learning — Patrick Durusau @ 3:52 pm

Machine Learning Throwdown, Part 1 – Introduction by Nick Wilson.

From the post:

Hi, I’m Nick the intern. The fine folks at BigML brought me on board for the summer to drink their coffee, eat their snacks, and compare their service to similar offerings from other companies. I have a fair amount of software engineering experience but limited machine learning skills beyond some introductory classes. Prior to beginning this internship, I had no experience with the services I am going to talk about. Since BigML aims to make machine learning easy for non-experts like myself, I believe I am in a great position to provide feedback on these types of services. But please, take what I say with a grain of salt. I’ll try to stay impartial but it’s not easy when BigML keeps dumping piles of money and BigML credits on my doorstep to ensure a favorable outcome.

From my time at BigML, it has become clear that everyone here is a big believer in the power of machine learning to extract value from data and build intelligent systems. Unfortunately, machine learning has traditionally had a high barrier to entry. The BigML team is working hard to change this; they want anyone to be able to gain valuable insights and predictive power from their data.

It turns out BigML is not the only player in this game. How does it stack up against the competition? This is the first in a series of blog posts where I compare BigML to a few other services offering machine learning capabilities. These services vary in multiple ways including the level of expertise required, the types of models that can be created, and the ease with which they can be integrated into your business.

You need to make decisions about services using your own data and requirements, but Nick’s posts are as good a place to start as any.

The series will be even more useful if it provokes counter-posts on other blogs, not so much disputing trivia as outlining each service’s best approach as opposed to the others.

Could be quite educational.

Series continues with:

Machine Learning Throwdown, Part 2 – Data Preparation

Machine Learning Throwdown, Part 3 – Models

Machine Learning Throwdown, Part 4 – Predictions

Machine Learning Throwdown, Part 5 – Miscellaneous

Machine Learning Throwdown, Part 6 – Summary

The series is now complete.

New version of data-visualising D3 JavaScript library

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 3:37 pm

New version of data-visualising D3 JavaScript library

From the post:

Version 2.10 of the open source D3 (Data-Driven Documents) JavaScript library features a good dozen additions or enhancements and can be used to visualise datasets both statically and interactively using HTML, CSS, and SVG. Like jQuery, the cross-browser library features generic DOM element selection, dynamic property annotation, event processing, transitions and transformations. D3’s web site includes more than 200 impressive examples, some from large media firms and institutions.

New features for present or future topic map visualizations!

Lucid Imagination becomes LucidWorks [Man Bites Dog Story]

Filed under: Lucene,LucidWorks,Solr — Patrick Durusau @ 3:27 pm

Lucid Imagination becomes LucidWorks

Soft news, except for the note about the soon-to-appear SearchHub.org (September 2012).

And the company listening to users refer to it as LucidWorks and deciding to change its name from Lucid Imagination to LucidWorks.

Sort of a man-bites-dog story, don’t you think?

Hurray for LucidWorks!

Makes me curious about the SearchHub.org site. Likely to listen to users there as well.
