Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 16, 2012

The RDF Data Cube Vocabulary

Filed under: RDF Data Cube Vocabulary,Statistics — Patrick Durusau @ 7:11 pm

The RDF Data Cube Vocabulary

A new draft from the W3C, adapting existing data cube vocabularies into an RDF representation.

The proposal re-uses several other vocabularies that I will be covering separately.

There are several open issues so read carefully.


What do you make of: The RDF Data Cube Vocabulary? I haven’t run diffs on it, yet.
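If you want to see the shape of the vocabulary before wading into the draft, here is a minimal sketch with rdflib. The qb: namespace is the one the draft uses; the dataset, dimensions and measure properties are invented purely for illustration.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

QB = Namespace("http://purl.org/linked-data/cube#")  # Data Cube vocabulary namespace
EX = Namespace("http://example.org/stats/")          # hypothetical dataset namespace

g = Graph()
g.bind("qb", QB)
g.bind("ex", EX)

dataset = EX["unemployment"]        # hypothetical dataset
obs = EX["unemployment/obs1"]       # one observation in it

g.add((dataset, RDF.type, QB.DataSet))
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, dataset))

# Invented dimension and measure properties, just to show the shape of an observation.
g.add((obs, EX.refArea, Literal("UK")))
g.add((obs, EX.refPeriod, Literal("2011")))
g.add((obs, EX.unemploymentRate, Literal(8.4)))

print(g.serialize(format="turtle"))
```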

Mule Summits

Filed under: Data Integration,Mule — Patrick Durusau @ 12:42 pm

Mule Summits

From the webpage:

Mule Summit brings together the core Mule development team and Mule users for the premier event of the year for anyone involved in integration. It offers a great opportunity to learn about key product developments, influence the product roadmap, and interact with other Mule users who share their best practices.

Locations announced in Europe and the U.S.

Cheaper than a conference + the core development team. What more could you want? (OK, Dallas isn’t Amsterdam but you chose to live there.) 😉

Mule Webinars

Filed under: Data Integration,Mule — Patrick Durusau @ 12:41 pm

Mule Webinars

I was looking for a web listing of Mule Summits and ran across this archive of prior Mule webinars.

Tried to pick my/your favorites and finally just put in the entire list. Enjoy!

Lucene Revolution 2012

Filed under: Lucene — Patrick Durusau @ 8:03 am

Lucene Revolution 2012

Best advertising for the conference:

Presentations/videos from Lucene Revolution 2011.

Agenda for Lucene Revolution 2012.

Boston, May 7 – 10, The Royal Sonesta


The ad department thought otherwise:

Top 5 Reasons You Need To Attend Lucene Revolution!

  • Learn from the Best
    Meet, socialize, collaborate, ask questions and network with fellow Lucene / Solr enthusiasts. A large contingent of the project committers will be in Boston to discuss your questions in real-time.
  • Innovate with Search
    From field-collapsing to flexible indexing to integration with NoSQL technologies, you get the freshest thinking on solving the deepest, most interesting problems in open source search and big data.
  • Get connected in the community
    The power of open source is demolishing traditional barriers and forging new opportunity for killer code and new killer search apps — and this is the place to meet the people doing it.
  • Fun…
    We’ve scheduled in adequate time for fun at the conference! Networking breaks, Stump-the-Chump, and a big conference party at the Boston Museum of Science!
  • A Bargain
    Save money with packaged deals on accelerated two-day, hands-on training workshops, coupled with conference sessions on real-world implementations from Solr/Lucene experts throughout the world.

Not traveling so depending on your blogs and tweets to capture the conference!

April 15, 2012

Announcing Fech 1.0

Filed under: Data Mining,Government Data,News — Patrick Durusau @ 7:15 pm

Announcing Fech 1.0 by Derek Willis.

From the post:

Fech now retrieves a whole lot more campaign finance data.

We’re excited to announce the 1.0 release of Fech, our Ruby library for parsing Federal Election Commission electronic campaign filings. Fech 1.0 now covers all of the current form types that candidates and committees submit. Originally developed to parse presidential committee filings, Fech now can be used for almost any kind of report (Senate candidates file on paper, so Fech can’t help there). The updated documentation, made with Github Pages, has a full listing of the supported formats.

Now it’s possible to use Fech to parse the pre-election filings of candidates receiving contributions of $1,000 or more — one way to see the late money in politics — or to dig through political party and political action committee reports to see how committees spend their funds. At The Times, Fech now plays a much greater role in powering our Campaign Finance API and in interactives that make use of F.E.C. data.

The additions to Fech include the ability to compare two filings and examine the differences between them. Since the F.E.C. requires that amendments replace the entire original filing, the comparison feature is especially useful for seeing what has changed between an original filing and an amendment to it. Another feature allows users to pass in a specific quote character (or parse a filing’s data without one at all) in order to avoid errors parsing comma-separated values that occasionally appear in filings.

Kudos to The New York Times for developing software, and Fech in particular, that gives the average person access to “public” information. Without meaningful access, it can hardly qualify as “public,” can it?

Something the U.S. Senate should keep in mind as it remains mired in 19th century pomp and privilege. Or diplomats. The other remaining class of privilege. Transparency is coming.


Update: Fech 1.1 Released.

The D Programming Language

Filed under: D Language — Patrick Durusau @ 7:15 pm

The D Programming Language by Walter Bright.

Description:

The D Programming Language combines modeling power, modern convenience, and native efficiency into one powerful language. D embodies many new ideas in programming languages along with traditional proven techniques.

See also: Dlang.org

Computer Algorithms: Morris-Pratt String Searching

Filed under: Algorithms,Searching,String Matching — Patrick Durusau @ 7:15 pm

Computer Algorithms: Morris-Pratt String Searching

From the post:

We saw that neither brute force string searching nor Rabin-Karp string searching is effective. However, in order to improve some algorithm, first we need to understand its principles in detail. We know already that brute force string matching is slow and we tried to improve it somehow by using a hash function in the Rabin-Karp algorithm. The problem is that Rabin-Karp has the same complexity as brute force string matching, which is O(mn).

Obviously we need a different approach, but to come up with a different approach let’s see what’s wrong with brute force string searching. Indeed, by taking a closer look at its principles we can answer the question.

In brute force matching we checked each character of the text with the first character of the pattern. In case of a match we shifted the comparison between the second character of the pattern and the next character of the text. The problem is that in case of a mismatch we must go several positions back in the text. Well in fact this technique can’t be optimized.
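If you want to see the fix in code before reading the post, here is a minimal Python sketch of Morris-Pratt: the failure table records, for each matched prefix of the pattern, how far we can fall back on a mismatch, so the text pointer never moves backwards.

```python
def failure_table(pattern):
    """For each prefix length i, the length of the longest proper
    prefix of pattern[:i] that is also a suffix of it."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    return fail

def morris_pratt(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    fail = failure_table(pattern)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k]          # fall back in the pattern, never in the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1     # match ends at position i
    return -1

print(morris_pratt("abracadabra", "cad"))  # 4
```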

Good posts on string matching!

Facebook Search- The fall of the machines

Filed under: Facebook,Search Engines,Searching — Patrick Durusau @ 7:15 pm

Facebook Search- The fall of the machines by Ajay Ohri.

Ajay gives five numbered reasons and then one more for preferring Facebook searching.

I hardly ever visit Facebook (I do have an account) and certainly don’t search using it.

But we could trade stories, rumors, etc. all day.

How would we test Facebook versus other search engines?

Or for that matter, how would we test search engines in general?

When we say search A got a “better” result using search engine Z, by what measure do we mean “better?”
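One conventional answer from information retrieval: fix a set of relevance judgments and compare engines on measures such as precision and recall at a cutoff. A toy sketch, with made-up result lists and judgments:

```python
def precision_recall_at_k(results, relevant, k):
    """Precision and recall of the top-k results against a set of judged-relevant items."""
    top_k = results[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result lists from two engines and a hand-built judgment set.
relevant = {"d1", "d3", "d7"}
engine_a = ["d1", "d2", "d3", "d4", "d5"]
engine_b = ["d2", "d4", "d6", "d8", "d1"]

print(precision_recall_at_k(engine_a, relevant, 5))  # (0.4, 0.666...)
print(precision_recall_at_k(engine_b, relevant, 5))  # (0.2, 0.333...)
```

Which measure counts as “better” is itself a judgment call, which is rather the point.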

Importing UK Weather Data from Azure Marketplace into PowerPivot

Filed under: Azure Marketplace,PowerPivot — Patrick Durusau @ 7:15 pm

Importing UK Weather Data from Azure Marketplace into PowerPivot by Chris Webb.

From the post:

I don’t always agree with everything Rob Collie says, much as I respect him, but his recent post on the Windows Azure Marketplace (part of which used to be known as the Azure Datamarket) had me nodding my head. The WAM has been around for a while now and up until recently I didn’t find anything much there that I could use in my day job; I had the distinct feeling it was going to be yet another Microsoft white elephant. The appearance of the DateStream date dimension table (see here for more details) was for me a turning point, and a month ago I saw something really interesting: detailed weather data for the UK from the Met Office (the UK’s national weather service) is now available there too. OK, it’s not going to be very useful for anyone outside the UK, but the UK is my home market and for some of my customers the ability to do things like use weather forecasts to predict footfall in shops will be very useful. It’s exactly the kind of data that analysts want to find in a data market, and if the WAM guys can add other equally useful data sets they should soon reach the point where WAM is a regular destination for all PowerPivot users.

Importing this weather data into PowerPivot isn’t completely straightforward though – the data itself is quite complex. The Datamarket guys are working on some documentation for it but in the meantime I thought I’d blog about my experiences; I need to thank Max Uritsky and Ziv Kaspersky for helping me out on this.

I don’t live in the UK nor do I use PowerPivot but I suspect readers of this blog may fall into either category or both. In any event, learning more about data sources, import and even software is always a useful thing.

All of those are likely to be sources you will need or encounter when authoring a topic map.

Interesting that while Amazon is striving to bring “big data” processing skills to everyone, the importing of data remains a roadblock for some users. Standard exports for particular data sets may become a commodity.

TIBCO ActiveSpaces – Community Edition Soon (2.0.1)

Filed under: ActiveSpaces,NoSQL — Patrick Durusau @ 7:13 pm

TIBCO ActiveSpaces

From the webpage:

There is increasing pressure on IT to reduce reliance on costly transactional systems and to process increasing streams of data and events in real time.

TIBCO ActiveSpaces® Enterprise Edition provides an infrastructure for building highly scalable, fault-tolerant distributed applications. It combines the features and performance of databases, caching systems, and messaging software to support very large, highly volatile data sets and event-driven applications. It enables organizations to off-load transaction-heavy systems and allows developers to concentrate on business logic rather than the complexities of distributing, scaling, and making applications autonomously fault-tolerant.

TIBCO ActiveSpaces Enterprise Edition is a distributed peer-to-peer in-memory data grid, a form of virtual shared memory that leverages a distributed hash table with configurable replication. This approach means the capacity of the space scales automatically as nodes join and leave. Replication assures fault-tolerance from node failure as the space autonomously re-replicates and re-distributes lost data.

I saw this at KDnuggets and had to investigate.

While poking about the site I found: Coming soon: ActiveSpaces Community edition! by Jean-Noel Moyne, which says:

I am proud to be able to announce that along with the upcoming ActiveSpaces Enterprise Edition version 2.0.1 we will also be releasing a new ‘Community Edition’ of ActiveSpaces 2.0.1.

The community edition will be available for download free of charge, giving everyone a chance to evaluate ActiveSpaces for themselves.

The community edition is the full-featured version of ActiveSpaces and is limited only by the fact that you cannot use it in production, as it is supported through the community of users and not by TIBCO Software, and by the fact that you can have a maximum of four members in each metaspace that your process connects to.

Stay tuned for more details about this coming soon!

Looking forward to learning more about ActiveSpaces!
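This is not ActiveSpaces code, but if “distributed hash table with configurable replication” is an unfamiliar phrase, a generic sketch of the idea looks something like this (the node names and replica count are arbitrary):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: maps each key to a primary node plus replicas."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def nodes_for(self, key):
        """Primary owner of key plus (replicas - 1) distinct successors on the ring."""
        hashes = [h for h, _ in self.ring]
        start = bisect.bisect(hashes, self._hash(key)) % len(self.ring)
        owners = []
        i = start
        while len(owners) < min(self.replicas, len(self.ring)):
            node = self.ring[i][1]
            if node not in owners:
                owners.append(node)
            i = (i + 1) % len(self.ring)
        return owners

ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replicas=2)
print(ring.nodes_for("customer:42"))   # e.g. ['node-c', 'node-d']
```

When a node joins or leaves, only the keys that hash near it move, which is why capacity can scale as members come and go.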

Constructing Case-Control Studies With Hadoop

Filed under: Bioinformatics,Biomedical,Giraph,Hadoop,Medical Informatics — Patrick Durusau @ 7:13 pm

Constructing Case-Control Studies With Hadoop by Josh Wills.

From the post:

San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.

Case-Control Studies

A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). They are useful for exploratory analysis because they are relatively cheap to perform, and have led to many important discoveries- most famously, the link between smoking and lung cancer.

Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend days performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may only be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check to be sure that we didn’t assign a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. In either case, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.

Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.

A great walk-through on constructing a case-control study, including the use of the Apache Giraph library.
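As a minimal, single-machine illustration of the constraint Josh describes (each control may be matched to at most one case), and ignoring the distributed solution the post actually develops, a greedy pass might look like this; the records and stratification key are invented:

```python
from collections import defaultdict

def greedy_match(cases, controls, key, per_case=2):
    """Greedily assign each case up to `per_case` controls sharing the same
    stratification key (e.g. age band + gender), using each control at most once."""
    pool = defaultdict(list)
    for c in controls:
        pool[key(c)].append(c)

    matches = {}
    for case in cases:
        bucket = pool[key(case)]
        matches[case["id"]] = [bucket.pop()["id"] for _ in range(min(per_case, len(bucket)))]
    return matches

# Hypothetical records with a crude stratification key.
key = lambda r: (r["age"] // 10, r["gender"])
cases = [{"id": "case1", "age": 54, "gender": "F"}]
controls = [{"id": "ctrl%d" % i, "age": 50 + i, "gender": "F"} for i in range(5)]

print(greedy_match(cases, controls, key))  # e.g. {'case1': ['ctrl4', 'ctrl3']}
```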

“Verdict First, Then The Trial”

Filed under: Data Analysis,Exploratory Data Analysis — Patrick Durusau @ 7:13 pm

No, not the Trayvon Martin case, but rather the lack of “exploratory data analysis” in business environments.

From Business Intelligence Ain’t Over Until Exploratory Data Analysis Sings, where Wayne Kernochan reviews the rise of statistical analysis in businesses and then says:

And yet there is a glaring gap in this picture – or at least a gap that should be glaring. This gap might be summed up as Alice in Wonderland’s “verdict first, then the trial.” Both the business and the researcher start with their own narrow picture of what the customer or research subject should look like, and the analytics and statistics that accompany such hypotheses are designed to narrow in on a solution rather than expand due to unexpected data. Thus, the business/researcher is likely to miss key customer insights, psychological and otherwise.

Pile on top of this the “not invented here” syndrome characteristic of most enterprises, and the “confirmation bias” that recent research has shown to be prevalent among individuals and organizations, and you have a real analytical problem on your hands. (emphasis added)

I don’t know if I would call it “a real analytical problem” so much as I would call it “business as usual.”

There may be a real coming shortage of people who can turn the crank to make the usual analysis come out the other end.

Can you imagine the shortage of people who possess the analytical skills and initiative to do more than the usual analysis?

The ability to recognize when two or more departments have different vocabularies for the same things is one indicator of possible analytical talent.

What are some others? (Thinking you can also use these to find topic map authors for your business/organization.)

MongoDB Hadoop Connector Announced

Filed under: Hadoop,MongoDB — Patrick Durusau @ 7:13 pm

MongoDB Hadoop Connector Announced

From the post:

10gen is pleased to announce the availability of our first GA release of the MongoDB Hadoop Connector, version 1.0. This release was a long-term goal, and represents the culmination of over a year of work to bring our users a solid integration layer between their MongoDB deployments and Hadoop clusters for data processing. Available immediately, this connector supports many of the major Hadoop versions and distributions from 0.20.x and onwards.

The core feature of the Connector is to provide the ability to read MongoDB data into Hadoop MapReduce jobs, as well as writing the results of MapReduce jobs out to MongoDB. Users may choose to use MongoDB reads and writes together or separately, as best fits each use case. Our goal is to continue to build support for the components in the Hadoop ecosystem which our users find useful, based on feedback and requests.

For this initial release, we have also provided support for:

  • writing to MongoDB from Pig (thanks to Russell Jurney for all of his patches and improvements to this feature)
  • writing to MongoDB from the Flume distributed logging system
  • using Python to MapReduce to and from MongoDB via Hadoop Streaming.

Hadoop Streaming was one of the toughest features for the 10gen team to build. To that end, look for a more technical post on the MongoDB blog in the next week or two detailing the issues we encountered and how to utilize this feature effectively.

Question: Is anyone working on a matrix of Hadoop connectors and their capabilities? A summary resource on Hadoop connectors might be of value.
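As a reminder of the plumbing Hadoop Streaming provides (this is the generic streaming pattern, not the connector’s own BSON-aware helpers), a mapper and reducer are just scripts that read stdin and write tab-separated key/value lines:

```python
#!/usr/bin/env python
# mapper.py -- emit (key, 1) for each JSON record read from stdin
import json
import sys

for line in sys.stdin:
    try:
        record = json.loads(line)
    except ValueError:
        continue  # skip malformed lines
    print("%s\t%d" % (record.get("status", "unknown"), 1))
```

And the matching reducer:

```python
#!/usr/bin/env python
# reducer.py -- sum counts per key; streaming delivers input sorted by key
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))
```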

Information Retrieval: Berkeley School of Information

Filed under: Information Retrieval — Patrick Durusau @ 7:12 pm

Information Retrieval: Berkeley School of Information

The PDFs are password protected (on the outline) but the course slides are available.

Good slides by the way. Particularly the illustrations.

The course used one of the mini-TREC data sets.

If you are not familiar with TREC, you should be.

April 14, 2012

Procedural Reflection in Programming Languages Volume 1

Filed under: Lisp,Reflection,Scala — Patrick Durusau @ 6:28 pm

Procedural Reflection in Programming Languages Volume 1

Brian Cantwell Smith’s dissertation that is the base document for reflection in programming languages.

Abstract:

We show how a computational system can be constructed to “reason”, effectively and consequentially, about its own inferential processes. The analysis proceeds in two parts. First, we consider the general question of computational semantics, rejecting traditional approaches, and arguing that the declarative and procedural aspects of computational symbols (what they stand for, and what behaviour they engender) should be analysed independently, in order that they may be coherently related. Second, we investigate self-referential behaviour in computational processes, and show how to embed an effective procedural model of a computational calculus within that calculus (a model not unlike a meta-circular interpreter, but connected to the fundamental operations of the machine in such a way as to provide, at any point in a computation, fully articulated descriptions of the state of that computation, for inspection and possible modification). In terms of the theories that result from these investigations, we present a general architecture for procedurally reflective processes, able to shift smoothly between dealing with a given subject domain, and dealing with their own reasoning processes over that domain.

An instance of the general solution is worked out in the context of an applicative language. Specifically, we present three successive dialects of LISP: 1-LISP, a distillation of current practice, for comparison purposes; 2-LISP, a dialect constructed in terms of our rationalised semantics, in which the concept of elevation is rejected in favour of independent notions of simplification and reference, and in which the respective categories of notation, structure, semantics, and behaviour are strictly aligned; and 3-LISP, an extension of 2-LISP endowed with reflective powers. (Warning: Hand copied from an image PDF. Typing errors may have occurred.)

I think reflection as it is described here is very close to Newcomb’s notion of composite subject identities, which are themselves composed of composite subject identities.

It has me wondering what a general-purpose identification language with reflection would look like.

VoltDB Version 2.5

Filed under: NoSQL,VoltDB — Patrick Durusau @ 6:28 pm

VoltDB Version 2.5

VoltDB 2.5 has arrived with:

Database Replication. As I’d previously described here, Database Replication is the headline feature of 2.5 (until recently, we referred to the feature as WAN replication). It allows VoltDB databases to be automatically replicated within and across data centers. Available in the VoltDB Enterprise Edition, Database Replication ensures that every database transaction applied to a VoltDB database is asynchronously applied to a defined replica database. Following a catastrophic crash, you can immediately promote the database replica to be the master and redirect all traffic to that cluster. Once the original master has been recovered, you can quickly and easily reverse the process.

In addition to serving disaster recovery needs, you can also use Database Replication to maintain a hot standby database (i.e., to eliminate service windows when you’re doing systems maintenance) and for workload optimization where, for example, write traffic is directed to the master VoltDB database, and read traffic is directed to the replica.

Performance improvements. Version 2.5 includes performance improvements to the VoltDB SQL planner, which benefit all VoltDB products. In addition, we eliminated some unnecessary cluster messaging for single-node deployments, which reduces average transaction latencies to around 1ms for our VoltOne product.

Functional enhancements. In 2.5 we expanded VoltDB’s SQL support and extended support for distributed joins. We also added new administrative options for managing database snapshots and controlling the behavior of command logging activities.

Updated Node.js support. As Andy Wilson describes here, VoltDB 2.5 includes an updated client library for the Node.js programming framework. This driver, which was originally created by community member Jacob Wright, includes performance optimizations, bug fixes and modifications that align the driver with Node.js coding standards.

It may already exist (pointer please!) but with new versions of databases, if not entirely new databases, appearing on a regular basis, a common test suite of data would be a good thing to have. Nothing heavy, say 50 GB uncompressed of CSV files with varying structures.
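If such a suite does not already exist, a starting point is easy to sketch: reproducible CSV files of arbitrary size with a mix of column types. The schema below is invented purely for illustration.

```python
import csv
import random

def write_test_csv(path, rows, seed=42):
    """Write a reproducible CSV of synthetic orders for load-testing a database."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_id", "amount", "region", "ts"])
        for i in range(rows):
            writer.writerow([
                i,
                rng.randint(1, 1_000_000),
                round(rng.uniform(1, 500), 2),
                rng.choice(["NA", "EU", "APAC"]),
                1_334_000_000 + rng.randint(0, 86_400 * 30),  # epoch seconds, roughly April 2012
            ])

write_test_csv("orders.csv", rows=100_000)
```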

Thoughts?

Martin Odersky: Reflection and Compilers

Filed under: Compilers,Reflection,Scala — Patrick Durusau @ 6:27 pm

Martin Odersky: Reflection and Compilers

From the description:

Reflection and compilers do tantalizingly similar things. Yet, in mainstream, statically typed languages the two have been only loosely coupled, and generally share very little code. In this talk I explore what happens if one sets out to overcome their separation.

The first half of the talk addresses the challenge how reflection libraries can share core data structures and algorithms with the language’s compiler without having compiler internals leaking into the standard library API. It turns out that a component system based on abstract types and path-dependent types is a good tool to solve this challenge. I’ll explain how the “multiple cake pattern” can be fruitfully applied to expose the right kind of information.

The second half of the talk explores what one can do when strong, mirror-based reflection is a standard tool. In particular, the compiler itself can use reflection, leading to a particular system of low-level macros that rewrite syntax trees. One core property of these macros is that they can express staging, by rewriting a tree at one stage to code that produces the same tree at the next stage. Staging lets us implement type reification and general LINQ-like functionality. What’s more, staging can also be applied to the macro system itself, with the consequence that a simple low-level macro system can produce a high-level hygienic one, without any extra effort from the language or compiler.

Ignore the comments about the quality of the sound and video. It looks like substantial improvements have been made or I am less sensitive to those issues. Give it a try and see what you think.

Strikes me as being very close to Newcomb’s thoughts on subject identity being composed of other subject identities.

Such that you could have subject representatives that “merge” together and then themselves form the basis for merging other subject representatives.

Suggestions of literature on reflection, its issues and implementations? (Donated books welcome as well. Contact for physical delivery address.)

CloudSpokes Coding Challenge Winners – Build a DynamoDB Demo

Filed under: Amazon DynamoDB,Amazon Web Services AWS,Contest,Dynamo — Patrick Durusau @ 6:27 pm

CloudSpokes Coding Challenge Winners – Build a DynamoDB Demo

From the post:

Last November CloudSpokes was invited to participate in the DynamoDB private beta. We spent some time kicking the tires, participating in the forums and developing use cases for their Internet-scale NoSQL database service. We were really excited about the possibilities of DynamoDB and decided to crowdsource some challenge ideas from our 38,000 strong developer community. Needless to say, the release generated quite a bit of buzz.

When Amazon released DynamoDB in January, we launched our CloudSpokes challenge Build an #Awesome Demo with Amazon DynamoDB along with a blog post and a sample ”Kiva Loan Browser Demo” application to get people started. The challenge requirements were wide open and all about creating the coolest application using Amazon DynamoDB. We wanted to see what the crowd could come up with.

The feedback we received from numerous developers was extremely positive. The API was very straightforward and easy to work with. The SDKs and docs, as usual, were top-notch. Developers were able to get up to speed fast as DynamoDB’s simple storage and query methods were easy to grasp. These methods allowed developers to store and access data items with a flexible number of attributes using the simple “Put” or “Get” verbs that they are familiar with. No surprise here, but we had a number of comments regarding the speed of both read and write operations.

When our challenge ended a week later we were pleasantly surprised with the applications and chose to highlight the following top five:

I don’t think topic maps has 38,000 developers but challenges do seem to pull people out of the woodwork.

Any thoughts on what would make interesting/attractive challenges? Other than five figure prizes? 😉
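For a sense of how simple the “Put or Get verbs” mentioned above really are, here is a hedged sketch using today’s Python SDK (boto3); the table name, key schema and attributes are made up, and the table is assumed to already exist.

```python
import boto3

# Assumes AWS credentials and region are configured, and that a table named
# "kiva-loans" with partition key "loan_id" already exists (both names are invented).
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("kiva-loans")

# Store one item, then read it back by key.
table.put_item(Item={"loan_id": "L-1001", "country": "Kenya", "amount": 350})

response = table.get_item(Key={"loan_id": "L-1001"})
print(response.get("Item"))
```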

Faceting & result grouping

Filed under: Faceted Search,Facets,Lucene,Solr — Patrick Durusau @ 6:27 pm

Faceting & result grouping by Martijn van Groningen

From the post:

Result grouping and faceting are in essence two different search features. Faceting counts the number of hits for specific field values matching the current query. Result grouping groups documents together with a common property and places these documents under a group. These groups are used as the hits in the search result. Usually result grouping and faceting are used together and a lot of times the results get misunderstood.

The main reason is that when using grouping, people expect that a hit is represented by a group. Faceting isn’t aware of groups and thus the computed counts represent documents and not groups. This different behaviour can be very confusing. A lot of questions on the Solr user mailing list are about this exact confusion.

In the case that result grouping is used with faceting users expect grouped facet counts. What does this mean? This means that when counting the number of matches for a specific field value the grouped faceting should check whether the group a document belongs to isn’t already counted before. This is best illustrated with some example documents.

Examples follow that make the distinction between groups and facets in Lucene and Solr clear. Not to mention specific suggestions on configuration of your service.
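A quick way to see the difference Martijn describes is to run the same query with and without grouped facet counts. A sketch against a local Solr instance (the core and field names are made up; group.facet is the parameter that switches facet counts from documents to groups):

```python
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr/collection1/select"  # assumed local core

def color_facet_counts(grouped):
    """Facet counts for a hypothetical 'color' field, with or without grouped faceting."""
    params = urllib.parse.urlencode({
        "q": "*:*",
        "wt": "json",
        "group": "true",
        "group.field": "product_id",   # hypothetical grouping field
        "facet": "true",
        "facet.field": "color",        # hypothetical facet field
        "group.facet": "true" if grouped else "false",
    })
    with urllib.request.urlopen(SOLR + "?" + params) as resp:
        data = json.load(resp)
    return data["facet_counts"]["facet_fields"]["color"]

print("per-document counts:", color_facet_counts(False))
print("per-group counts:   ", color_facet_counts(True))
```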

Neo4j + last.fm in Java

Filed under: Graphs,Neo4j,Similarity — Patrick Durusau @ 6:27 pm

Neo4j + last.fm in Java by Niklas Lindblad.

Video that walks through the use of Neo4j and the music service last.fm. (Full code: https://niklaslindblad.se/kod/Database.java.html)

Watching the import, I was reminded of Iron Maiden, a group I had not thought of for years.

Directional “similarity” relationships/edges. Can also store “measures” of similarity along the edges.

Question: For subject identity purposes, can we do more than boolean yes/no?

That is, for some purposes, if a subject representative is >= .90 similar, can we merge it with another node?

Or for other purposes we might use a different measure of similarity?

Is it possible for boolean and measure-of-similarity merging to coexist in the same topic map?
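As a thought experiment on the non-boolean case, here is a minimal sketch in which subject representatives whose similarity meets a purpose-specific threshold are merged transitively; different thresholds then yield different topic maps from the same measurements. The similarity scores below are invented.

```python
def merge_by_threshold(similarities, threshold):
    """Group subject representatives whose pairwise similarity >= threshold.
    `similarities` maps (a, b) pairs to a score in [0, 1]."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for (a, b), score in similarities.items():
        find(a), find(b)                    # register both representatives
        if score >= threshold:
            union(a, b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())

sims = {("iron_maiden", "iron maiden (band)"): 0.95,
        ("iron_maiden", "iron maiden (device)"): 0.40}

print(merge_by_threshold(sims, 0.90))   # merges only the band spellings
print(merge_by_threshold(sims, 0.30))   # merges everything
```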

How R Searches and Finds Stuff

Filed under: R — Patrick Durusau @ 6:26 pm

How R Searches and Finds Stuff by Suraj Gupta.

From the post:

Or…
How to push oneself down the rabbit hole of environments, namespaces, exports, imports, frames, enclosures, parents, and function evaluation?

Motivation

There are a few reasons to bother reading this post:

  1. Rabbit hole avoidance
    You have avoided the above mentioned topics thus far, but now it’s time to dive in. Unfortunately you speak English, unlike the R help manuals which speak “Hairy C” (imagine a somewhat hairy native C coder from the 80s who’s really smart but grunts a lot…not the best communicator).
  2. R is acting a fool
    Your function used to work, now it spits an error. Absolutely nothing about this particular function has changed. You vaguely remember installing a new package, but what does that matter? Unfortunately my friend, it does matter.
  3. R is finding the wrong thing
    You attached the matlab package and call sum() on a numeric matrix. The result is a vector of column sums, not a length 1 numeric. This messes up everything. What were you thinking trying to make R act like Matlab? Matlab is for losers (and rich people).
  4. You want R to find something else
    You like a package’s plotting function. If you could intercept one call within the function and use your own calculation, it would be perfect. This seems like black magic to you, but something is strange about maintaining a full copy of the function just to apply your little tweak. Welcome to the dark arts.
  5. Package authoring
    You have authored a package. How does your kid play with the other kids in the playground?

Apologies for being quite so “practical,” but the rise of R as a data mining/analysis language means that people are also using it for the creation of topic maps. I do have to cover where the “rubber hits the road” every now and again. 😉

7 Big Winners in the U.S. Big Data Drive

Filed under: BigData,Data,Funding — Patrick Durusau @ 6:25 pm

7 Big Winners in the U.S. Big Data Drive by Nicole Hemsoth.

As we pointed out in Big Data is a Big Deal, the U.S. government is ponying up $200 million in new data projects.

Nicole covers seven projects that are of particular interest:

  1. DARPA’s XDATA – See XDATA for details – Closes May 30, 2012.
  2. SDAV Institute (DOE) – SDAV: Scalable Data Management, Analysis and Visualization (has a toolkit and other resources I need to cover separately)
  3. Biological and Environmental Research Program (BER) has created the Atmospheric Radiation Measurement (ARM) Climate Research Facility. Lots of data.
  4. John Wesley Powell Center for Analysis and Synthesis (USGS). Data + tools.
  5. PURVAC Purdue University – Homeland Security
  6. Biosense 2.0 – CDC project
  7. Machine Reading (DARPA) – usual goals:

    developing learning systems that process natural text and insert the resulting semantic representation into a knowledge base rather than relying on expensive and time-consuming current processes for knowledge representation that require expert and associated knowledge engineers to hand-craft information.

I suppose one lesson to be learned is how quickly the bulk of $200 million can be sucked up by current projects.

The second lesson is to become an ongoing (large ongoing) research project so that you too can suck up new funding.

The third lesson is to use the ostensible goals of these projects as actual goals for your own projects. The difference between trying to reach a goal and actually reaching it may make all the difference.

Everything You Wanted to Know About Data Mining but Were Afraid to Ask

Filed under: Data Mining,Marketing — Patrick Durusau @ 6:24 pm

Everything You Wanted to Know About Data Mining but Were Afraid to Ask by Alexander Furnas.

Interesting piece from the Atlantic that you can use to introduce a client to the concepts of data mining. And at the same time, use as the basis for discussing topic maps.

For example, Furnas says:

For the most part, data mining tells us about very large and complex data sets, the kinds of information that would be readily apparent about small and simple things. For example, it can tell us that “one of these things is not like the other” a la Sesame Street or it can show us categories and then sort things into pre-determined categories. But what’s simple with 5 datapoints is not so simple with 5 billion datapoints.

Topic maps being more about things that are “like the other” so that we can have them all in one place. Or at least all the information about them in one place.

See, that wasn’t hard.

The editorial and technical side of it, how information is gathered for useful presentation to a user, is hard.

But the client, like someone watching cable TV, is more concerned with the result than how it arrived.

Perhaps a different marketing strategy, results first.

Thoughts?

Rules and Rituals

Filed under: Marketing — Patrick Durusau @ 6:24 pm

Rules and Rituals by Basab Pradhan.

From the post:

Matt Richtel investigates the mystery of why laptops and not iPads need to be pulled out of bags for the X-Ray machine at airport security.

From the New York Times

What’s the distinction between the devices? Similar shapes, many similar functions, the tablet is thinner but not by much. Is the iPad a lower security risk? What about the punier laptop-like gadgets, the netbooks and ultrabooks? What about my smartphone?

Richtel contacts the TSA and security experts, but doesn’t really get a good answer. The TSA said that it had its reasons but declined to share them saying that “the agency didn’t want to betray any secrets.” Another security expert called it “security theater”, implying that making passengers go through some inconvenience makes it look like the government is taking their security seriously!

A very amusing post on rules that concludes:

The only way to keep business agile is to constantly subject its rules to the sunlight of logic. Why do we have this rule in place? Did we make this rule when the conditions were different from what they are today? Do we completely understand the costs of this rule and have we weighed them against the benefits? Does anyone even remember why we have this rule?

Like zero based budgeting, we should be talking about zero-based rules.

Most of us would agree with Basab on the TSA and the use of the “sunshine of logic” with regards to airport security.

At least at first blush.

But it is a good illustration that the “sunshine of logic” is always from a particular perspective.

As a former frequent air traveler, my view was and is that the TSA is a public band-aid of little utility. In Atlanta, it is simply a job-creation mechanism for sisters, cousins and other relatives. Now that groping children is part of their job, no doubt the pool of job applicants has increased.

From the perspective of people who like groping children, the “sunshine of logic” for the TSA is entirely different. The TSA exists to provide them with employment and legitimate reasons to grope children.

From the perspective of the politicians who created the TSA, the “sunshine of logic” for the TSA is that they are doing something about terrorism (a farcical claim to you or I but I have heard it claimed by politicians).

Bottom line is that if I get to declare where “zero” starts, I’m pretty comfortable with “zero-based rules.” (You may be more or less comfortable.)

April 13, 2012

The inevitable perversion of measurement

Filed under: Measurement — Patrick Durusau @ 4:48 pm

The inevitable perversion of measurement

From the post:

Supposedly one of the tactics in the fight against obesity is to change how we measure obesity (from BMI to DXA): that’s the key message in an LA Times article (link).

This is a great read if only because it covers many common problems of measurement systems. In thinking about invented metrics, such as SAT scores, employee performance ratings and teacher ratings, bear in mind they only have names because we gave them names.

Measuring things always leads to perverse behavior. Here are some examples straight out of this article:

The list of “perversions” include:

1. The metric, even if accurately measured, has no value

2. Blame the failure of a program on the metric

3. A metric becomes more complicated over time

If I am looking for “perversion” I am likely to skip this channel. 😉

On the other hand, the post does list some of the issues relative to our attempts at measurement.

Measurement is an important component for the judging of similarity and sameness.

Can you find/point out other posts addressing issues with measurement? (perverse or not)

Percona Toolkit 2.1 with New Online Schema Change Tool

Filed under: MySQL,Percona Server,Schema — Patrick Durusau @ 4:47 pm

Percona Toolkit 2.1 with New Online Schema Change Tool by Baron Schwartz.

From the post:

I’m proud to announce the GA release of version 2.1 of Percona Toolkit. Percona Toolkit is the essential suite of administrative tools for MySQL.

With this release we introduce a new version of pt-online-schema-change, a tool that enables you to ALTER large tables with no blocking or downtime. As you know, MySQL locks tables for most ALTER operations, but pt-online-schema-change performs the ALTER without any locking. Client applications can continue reading and writing the table with no interruption.

With this new version of the tool, one of the most painful things anyone experiences with MySQL is significantly alleviated. If you’ve ever delayed a project’s schedule because the release involved an ALTER, which had to be scheduled in the dead of the night on Sunday, and required overtime and time off, you know what I mean. A schema migration is an instant blocker in the critical path of your project plan. No more!

Certainly a useful feature for MySQL users.

Not to mention being another step towards data models being a matter of how you choose to view the data for some particular purpose. Not quite there, yet, but that day is coming.

In a very real sense, the “normalization” of data and the data models we have built into SQL systems were compensation for the shortcomings of our computing platforms. That we have continued to do so in the face of increases in computing resources that make it unnecessary is evidence of shortcomings on our part.

Julia: a new language for technical computing

Filed under: Julia,R — Patrick Durusau @ 4:47 pm

Julia: a new language for technical computing

From the post:

Julia is a new open-source language for high-performance technical computing, created by Jeff Bezanson, Stefan Karpinski, Viral Shah and Alan Edelman and first announced in February. Their motivation for creating a new language was, they say, “greed”:

We are power Matlab users. Some of us are Lisp hackers. Some are Pythonistas, others Rubyists, still others Perl hackers. There are those of us who used Mathematica before we could grow facial hair. There are those who still can’t grow facial hair. We’ve generated more R plots than any sane person should. C is our desert island programming language.

We love all of these languages; they are wonderful and powerful. For the work we do — scientific computing, machine learning, data mining, large-scale linear algebra, distributed and parallel computing — each one is perfect for some aspects of the work and terrible for others. Each one is a trade-off.

We are greedy: we want more.

The post includes pointers to articles and a vocabulary comparison of Julia and R. It recalls the recent complaint that a user might know an operation in R but not its Julia equivalent, and my suggestion that a “lite” topic map application might be useful in that context.

Seminar: Five Years On

Filed under: Library,Linked Data,Semantic Web — Patrick Durusau @ 4:45 pm

Seminar: Five Years On

British Library
April 26, 2012 – April 27, 2012

From the webpage:

April 2012 marks the fifth anniversary of the Data Model Meeting at the British Library, London attended by participants interested in the fit between RDA: Resource Description and Access and the models used in other metadata communities, especially those working in the Semantic Web environment. This meeting, informally known as the “London Meeting”, has proved to be a critical point in the trajectory of libraries from the traditional data view to linked data and the Semantic Web.

DCMI-UK in cooperation with DCMI International as well as others will co-sponsor a one-day seminar on Friday 27 April 2012 to describe progress since 2007, mark the anniversary, and look to further collaboration in the future.

Speakers will include participants at the 2007 meeting and other significant players in library data and the Semantic Web. Papers from the seminar will be published by DCMI and available freely online.

The London Meeting stimulated significant development of Semantic Web representations of the major international bibliographic metadata models, including IFLA’s Functional Requirements family and the International Standard Bibliographic Description (ISBD), and MARC as well as RDA itself. Attention is now beginning to focus on the management and sustainability of this activity, and the development of high-level semantic and data structures to support library applications.

Would appreciate a note if you are in London for this meeting. Thanks!

How to get Solr Up and Running On OpenShift

Filed under: OpenShift,Solr — Patrick Durusau @ 4:44 pm

How to get Solr Up and Running On OpenShift by Shekhar Gulati.

A brief and to-the-point guide to getting Solr running on OpenShift.

Curious: do you know of a listing of all the public clouds? It occurs to me that a listing of recent installation instructions for Solr, for example, on each of them would be a good thing.

Comments?

Classifier Technology and the Illusion of Progress

Filed under: Classification,Classifier — Patrick Durusau @ 4:44 pm

Classifier Technology and the Illusion of Progress by David J. Hand.

This was pointed to in Simply Statistics for 8 April 2012:

Abstract:

A great many tools have been developed for supervised classification, ranging from early methods such as linear discriminant analysis through to modern developments such as neural networks and support vector machines. A large number of comparative studies have been conducted in attempts to establish the relative superiority of these methods. This paper argues that these comparisons often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion. In particular, simple methods typically yield performance almost as good as more sophisticated methods, to the extent that the difference in performance may be swamped by other sources of uncertainty that generally are not considered in the classical supervised classification paradigm.

The original pointer didn’t mention there were four published comments and a formal rejoinder:

Comment: Classifier Technology and the Illusion of Progress by Jerome H. Friedman.

Comment: Classifier Technology and the Illusion of Progress–Credit Scoring by Ross W. Gayler.

Elaboration on Two Points Raised in “Classifier Technology and the Illusion of Progress” by Robert C. Holte.

Comment: Classifier Technology and the Illusion of Progress by Robert A. Stine.

Rejoinder: Classifier Technology and the Illusion of Progress by David J. Hand.

Enjoyable reading, one and all!
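Hand’s central claim, that simple methods get you most of the way, is easy to probe on your own data. A quick sketch with scikit-learn and a synthetic dataset (the models and parameters are arbitrary choices, not a reproduction of the paper’s experiments):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem with a handful of informative features.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

simple = LogisticRegression(max_iter=1000)
fancy = RandomForestClassifier(n_estimators=200, random_state=0)

for name, model in [("logistic regression", simple), ("random forest", fancy)]:
    scores = cross_val_score(model, X, y, cv=5)
    print("%-20s accuracy: %.3f" % (name, scores.mean()))
```

Whether the gap you see is worth the extra machinery is exactly the question Hand raises.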
