Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 25, 2012

Polarity in Type Theory

Filed under: Category Theory,Types — Patrick Durusau @ 2:33 pm

Polarity in Type Theory by Robert Harper.

From the post:

There have recently arisen some misguided claims about a supposed opposition between functional and object-oriented programming. The claims amount to a belated recognition of a fundamental structure in type theory first elucidated by Jean-Marc Andreoli, and developed in depth by Jean-Yves Girard in the context of logic, and by Paul Blain-Levy and Noam Zeilberger in the context of programming languages. In keeping with the general principle of computational trinitarianism, the concept of polarization has meaning in proof theory, category theory, and type theory, a sure sign of its fundamental importance.

Polarization is not an issue of language design, it is an issue of type structure. The main idea is that types may be classified as being positive or negative, with the positive being characterized by their structure and the negative being characterized by their behavior. In a sufficiently rich type system one may consider, and make effective use of, both positive and negative types. There is nothing remarkable or revolutionary about this, and, truly, there is nothing really new about it, other than the terminology. But through the efforts of the above-mentioned researchers, and others, we have learned quite a lot about the importance of polarization in logic, languages, and semantics. I find it particularly remarkable that Andreoli’s work on proof search turned out to also be of deep significance for programming languages. This connection was developed and extended by Zeilberger, on whose dissertation I am basing this post.

The simplest and most direct way to illustrate the ideas is to consider the product type, which corresponds to conjunction in logic. There are two possible ways that one can formulate the rules for the product type that from the point of view of inhabitation are completely equivalent, but from the point of view of computation are quite distinct. Let us first state them as rules of logic, then equip these rules with proof terms so that we may study their operational behavior. For the time being I will refer to these as Method 1 and Method 2, but after we examine them more carefully, we will find more descriptive names for them.
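
If you want a concrete hook for the positive/negative distinction before diving in, here is a loose illustration in Python rather than type theory (my own analogy, not code from Harper's post): a "positive" pair is given by its structure and built eagerly, while a "negative" pair is given by its behavior under the projections, so its components are not computed until they are observed.

def compute_left():
    print("computing left")
    return 1

def compute_right():
    print("computing right")
    return 2

# "Positive" product: characterized by its structure; both components
# are evaluated up front to build the value.
positive_pair = (compute_left(), compute_right())

class NegativePair:
    """'Negative' product: characterized by how it answers the projections."""
    def __init__(self, left_thunk, right_thunk):
        self._left, self._right = left_thunk, right_thunk
    def fst(self):
        return self._left()
    def snd(self):
        return self._right()

lazy_pair = NegativePair(compute_left, compute_right)
print(lazy_pair.fst())  # only "computing left" runs here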

Best read in the morning with a fresh cup of coffee (or whenever you do your best work).

Can’t talk about equivalence without types. (Well, not interchangeably.)

Avoiding Public Confessions of Ignorance

Filed under: Government,Marketing,Topic Maps — Patrick Durusau @ 2:15 pm

I saw White House Follows No 10 To Github-first open source development, with text that reads in part:

Yesterday the White House got some justifiable praise for open sourcing its online petitioning platform, We The People, using a Github repository. In a blog post Macon Phillips, Director of Digital Strategy, said:

“Now anybody, from other countries to the smallest organizations to civic hackers, can take this code and put it to their own use.

One of the most exciting prospects of open sourcing We the People is getting feedback, ideas and code contributions from the public. There is so much that can be done to improve this system, and we only benefit by being able to more easily collaborate with designers and engineers around the country – and the world.”

If you don’t know the details of the U.S. government and open source, see: Open Source in the U.S. Government.

History is “out there,” and not all that hard to find.

Can topic maps help government officials avoid public confessions of ignorance?

Introduction to Recommendations with Map-Reduce and mrjob [Ode to Similarity, Music]

Filed under: MapReduce,Music,Music Retrieval,Similarity — Patrick Durusau @ 10:56 am

Introduction to Recommendations with Map-Reduce and mrjob by Marcel Caraciolo.

From the post:

In this post I will present how we can use the map-reduce programming model for making recommendations. Recommender systems are quite popular among shopping sites and social networks these days. How do they do it? Generally, the user interaction data available from items and products in shopping sites and social networks are enough information to build a recommendation engine using classic techniques such as Collaborative Filtering.
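
Since mrjob is a Python library, a short sketch shows the shape of such a job. This is my own toy example, not code from the post: it computes item co-occurrence counts, the raw material for item-based recommendations, assuming tab-separated user_id/item_id input.

from itertools import combinations
from mrjob.job import MRJob
from mrjob.step import MRStep

class ItemCooccurrence(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_user_items,
                   reducer=self.reducer_emit_pairs),
            MRStep(reducer=self.reducer_sum_counts),
        ]

    def mapper_user_items(self, _, line):
        # assumed input format: user_id<TAB>item_id
        user, item = line.strip().split("\t")
        yield user, item

    def reducer_emit_pairs(self, user, items):
        # every unordered pair of items seen by the same user
        for a, b in combinations(sorted(set(items)), 2):
            yield (a, b), 1

    def reducer_sum_counts(self, pair, counts):
        yield pair, sum(counts)

if __name__ == "__main__":
    ItemCooccurrence.run()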

Usual recommendation post except for the emphasis on multiple tests of similarity.

Useful, because simply reporting that two (or more) items are “similar” isn’t all that helpful, at least not until you know the basis for the comparison.

And until you have some expectation that a similar notion of “similarity” works for your audience.

For example, I read an article this morning about a “new” invention that will change the face of sheet music publishing, in three to five years: Invention Will Strike a Chord With Musicians.

Despite the article never using terms like “markup,” “HyTime,” “SGML,” “XML,” “Music Encoding Initiative (MEI),” or “MusicXML,” the invention seemed quite “similar” to all of those to me. That may not be the “typical” experience but it is mine.

If you don’t want to wait three to five years for the sheet music revolution, you can check out MusicXML. It has been reported that more than 150 applications support MusicXML. Oh, that would be today, not three to five years from now.

You might want to pass the word along in the music industry before the next “revolution” in sheet music starts up.

Heritrix

Filed under: Webcrawler — Patrick Durusau @ 10:33 am

Heritrix

From the wiki page:

This is the public wiki for the Heritrix archival crawler project.

Heritrix is the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project.

Heritrix (sometimes spelled heretrix, or misspelled or mis-said as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

All topical contributions to this wiki (corrections, proposals for new features, new FAQ items, etc.) are welcome! Register using the link near the top-right corner of this page.

A tool for creating a customized search collection, or reference code for a web crawler project of your own.

I first saw this at Pete Warden’s Five Short Links for 24 August 2012.

August 24, 2012

Fall Lineup: Protest Monitoring, Bin Laden Letters Analysis, … [Defensive Big Data (DBD)]

Filed under: Data Mining,Predictive Analytics — Patrick Durusau @ 4:33 pm

Protest Monitoring, Bin Laden Letters Analysis, and Building Custom Applications

OK, not “Fall Lineup” in the TV sense. 😉

Webinars from Recorded Future in September, 2012.

All start at 11 AM EST.

These webinars should help you learn how data mining looks for clues, or how not to leave clues.

Is the term “Defensive Big Data” (DBD) in common usage?

Think of using Mahout to analyze email traffic so you can reshape your own emails to resemble the messages that are routinely ignored.

Learning Mahout : Collaborative Filtering [Recommend Your Preferences?]

Filed under: Collaboration,Filters,Machine Learning,Mahout — Patrick Durusau @ 3:52 pm

Learning Mahout : Collaborative Filtering by Sujit Pal.

From the post:

My Mahout in Action (MIA) book has been collecting dust for a while now, waiting for me to get around to learning about Mahout. Mahout is evolving quite rapidly, so the book is a bit dated now, but I decided to use it as a guide anyway as I work through the various modules in the currently GA 0.7 distribution.

My objective is to learn about Mahout initially from a client perspective, ie, find out what ML modules (eg, clustering, logistic regression, etc) are available, and which algorithms are supported within each module, and how to use them from my own code. Although Mahout provides non-Hadoop implementations for almost all its features, I am primarily interested in the Hadoop implementations. Initially I just want to figure out how to use it (with custom code to tweak behavior). Later, I would like to understand how the algorithm is represented as a (possibly multi-stage) M/R job so I can build similar implementations.

I am going to write about my progress, mainly in order to populate my cheat sheet in the sky (ie, for future reference). Any code I write will be available in this GitHub (Scala) project.

The first module covered in the book is Collaborative Filtering. Essentially, it is a technique of predicting preferences given the preferences of others in the group. There are two main approaches – user based and item based. In case of user-based filtering, the objective is to look for users similar to the given user, then use the ratings from these similar users to predict a preference for the given user. In case of item-based recommendation, similarities between pairs of items are computed, then preferences predicted for the given user using a combination of the user’s current item preferences and the similarity matrix.
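
To make the user-based case concrete, here is a tiny NumPy sketch (my own illustration, not Mahout code and not from Sujit's post): find the users most similar to the target user, then predict the missing rating as a similarity-weighted average of their ratings.

import numpy as np

# rows = users, columns = items, 0 = no rating (toy data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def predict(user, item, k=2):
    # cosine similarity between the target user and every other user
    norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(ratings[user])
    sims = ratings @ ratings[user] / np.where(norms == 0, 1, norms)
    sims[user] = -1                       # exclude the user themselves
    neighbors = np.argsort(sims)[-k:]     # k most similar users
    rated = [n for n in neighbors if ratings[n, item] > 0]
    if not rated:
        return 0.0
    weights = sims[rated]
    return float(np.dot(weights, ratings[rated, item]) / weights.sum())

print(predict(user=0, item=2))  # predicted rating for an item user 0 has not rated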

While you are working your way through this post, keep in mind: Collaborative filtering with GraphChi.

Question: What if you are an outlier?

Telephone marketing interviews with me get shortened by responses like: “X? Is that a TV show?”

How would you go about piercing the marketing veil to recommend your preferences?

Now that is a product to which even I might subscribe. (But don’t advertise on TV, I won’t see it.)

Big Data on Heroku – Hadoop from Treasure Data

Filed under: Cloud Computing,Hadoop,Heroku — Patrick Durusau @ 3:32 pm

Big Data on Heroku – Hadoop from Treasure Data by Istvan Szegedi.

From the post:

This time I write about Heroku and Treasure Data Hadoop solution – I found it really to be a ‘gem’ in the Big Data world.

Heroku is a cloud platform as a service (PaaS) owned by Salesforce.com. Originally it started with supporting Ruby as its main programming language but it has been extended to Java, Scala, Node.js, Python and Clojure, too. It also supports a long list of addons including – among others – RDBMS and NoSQL capabilities and Hadoop-based data warehouse developed by Treasure Data.

Not to leave the impression that your only cloud option is AWS.

I don’t know of any apples-to-apples comparisons of cloud services/storage plus cost.

Do you?

Process a Million Songs with Apache Pig

Filed under: Amazon Web Services AWS,Cloudera,Data Mining,Hadoop,Pig — Patrick Durusau @ 3:22 pm

Process a Million Songs with Apache Pig by Justin Kestelyn.

From the post:

The following is a guest post kindly offered by Adam Kawa, a 26-year old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.

The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218GB, processing it using one machine may take a very long time.

Definitely, a much more interesting and efficient approach is to use multiple machines and process the songs in parallel by taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple commands) or automatically using Cloudera Manager Free Edition (which is Cloudera’s recommended approach). Both CDH and Cloudera Manager are freely downloadable here. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lemere on how to use it and pay as little as $1, and here is my presentation about Elastic MapReduce given at the second meeting of Warsaw Hadoop User Group).

An example of offering the reader their choice of implementation detail, on or off a cloud. 😉

Suspect that is going to become increasingly common.

Giant Network Links All Known Compounds and Reactions

Filed under: Cheminformatics,Networks — Patrick Durusau @ 2:01 pm

Let’s start with the “popular” version: Scientists Create Chemical ‘Brain’: Giant Network Links All Known Compounds and Reactions

From the post:

Northwestern University scientists have connected 250 years of organic chemical knowledge into one giant computer network — a chemical Google on steroids. This “immortal chemist” will never retire and take away its knowledge but instead will continue to learn, grow and share.

A decade in the making, the software optimizes syntheses of drug molecules and other important compounds, combines long (and expensive) syntheses of compounds into shorter and more economical routes and identifies suspicious chemical recipes that could lead to chemical weapons.

“I realized that if we could link all the known chemical compounds and reactions between them into one giant network, we could create not only a new repository of chemical methods but an entirely new knowledge platform where each chemical reaction ever performed and each compound ever made would give rise to a collective ‘chemical brain,'” said Bartosz A. Grzybowski, who led the work. “The brain then could be searched and analyzed with algorithms akin to those used in Google or telecom networks.”

Called Chematica, the network comprises some seven million chemicals connected by a similar number of reactions. A family of algorithms that searches and analyzes the network allows the chemist at his or her computer to easily tap into this vast compendium of chemical knowledge. And the system learns from experience, as more data and algorithms are added to its knowledge base.

Details and demonstrations of the system are published in three back-to-back papers in the Aug. 6 issue of the journal Angewandte Chemie.

Well, true enough, except for the “share” part. Chematica is in the process of being commercialized.

If you are interested in the non-“popular” version:

Rewiring Chemistry: Algorithmic Discovery and Experimental Validation of One-Pot Reactions in the Network of Organic Chemistry (pages 7922–7927) by Dr. Chris M. Gothard, Dr. Siowling Soh, Nosheen A. Gothard, Dr. Bartlomiej Kowalczyk, Dr. Yanhu Wei, Dr. Bilge Baytekin and Prof. Bartosz A. Grzybowski. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202155.

Abstract:

Computational algorithms are used to identify sequences of reactions that can be performed in one pot. These predictions are based on over 86 000 chemical criteria by which the putative sequences are evaluated. The “raw” algorithmic output is then validated experimentally by performing multiple two-, three-, and even four-step sequences. These sequences “rewire” synthetic pathways around popular and/or important small molecules.

Parallel Optimization of Synthetic Pathways within the Network of Organic Chemistry (pages 7928–7932) by Dr. Mikołaj Kowalik, Dr. Chris M. Gothard, Aaron M. Drews, Nosheen A. Gothard, Alex Weckiewicz, Patrick E. Fuller, Prof. Bartosz A. Grzybowski and Prof. Kyle J. M. Bishop. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202209.

Abstract:

Finding a needle in a haystack: The number of possible synthetic pathways leading to the desired target of a synthesis can be astronomical (10^19 within five synthetic steps). Algorithms are described that navigate through the entire known chemical-synthetic knowledge to identify optimal synthetic pathways. Examples are provided to illustrate single-target optimization and parallel optimization of syntheses leading to multiple targets.

Chemical Network Algorithms for the Risk Assessment and Management of Chemical Threats (pages 7933–7937) by Patrick E. Fuller, Dr. Chris M. Gothard, Nosheen A. Gothard, Alex Weckiewicz and Prof. Bartosz A. Grzybowski. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202210.

Abstract:

A network of chemical threats: Current regulatory protocols are insufficient to monitor and block many short-route syntheses of chemical weapons, including those that start from household products. Network searches combined with game-theory algorithms provide an effective means of identifying and eliminating chemical threats. (Picture: an algorithm-detected pathway that yields sarin (bright red node) in three steps from unregulated substances.)

Do you see any potential semantic issues in such a network, arising as our understanding of reactions changes?

Recall that semantics isn’t simply a question of yesterday, today and tomorrow, but also of tomorrows 10, 50, 100 or more years from now.

We may fancy our present understanding as definitive, but it is just a fancy.

Going Beyond the Numbers:…

Filed under: Analytics,Text Analytics,Text Mining — Patrick Durusau @ 1:39 pm

Going Beyond the Numbers: How to Incorporate Textual Data into the Analytics Program by Cindi Thompson.

From the post:

Leveraging the value of text-based data by applying text analytics can help companies gain competitive advantage and an improved bottom line, yet many companies are still letting their document repositories and external sources of unstructured information lie fallow.

That’s no surprise, since the application of analytics techniques to textual data and other unstructured content is challenging and requires a relatively unfamiliar skill set. Yet applying business and industry knowledge and starting small can yield satisfying results.

Capturing More Value from Data with Text Analytics

There’s more to data than the numerical organizational data generated by transactional and business intelligence systems. Although the statistics are difficult to pin down, it’s safe to say that the majority of business information for a typical company is stored in documents and other unstructured data sources, not in structured databases. In addition, there is a huge amount of business-relevant information in documents and text that reside outside the enterprise. To ignore the information hidden in text is to risk missing opportunities, including the chance to:

  • Capture early signals of customer discontent.
  • Quickly target product deficiencies.
  • Detect fraud.
  • Route documents to those who can effectively leverage them.
  • Comply with regulations such as XBRL coding or redaction of personally identifiable information.
  • Better understand the events, people, places and dates associated with a large set of numerical data.
  • Track competitive intelligence.

To be sure, textual data is messy and poses difficulties.

But, as Cindi points out, there are golden benefits in those hills of textual data.

Foundations of Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 10:29 am

Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar.

From the description:

This graduate-level textbook introduces fundamental concepts and methods in machine learning. It describes several important modern algorithms, provides the theoretical underpinnings of these algorithms, and illustrates key aspects for their application. The authors aim to present novel theoretical tools and concepts while giving concise proofs even for relatively advanced topics.

Foundations of Machine Learning fills the need for a general textbook that also offers theoretical details and an emphasis on proofs. Certain topics that are often treated with insufficient attention are discussed in more detail here; for example, entire chapters are devoted to regression, multi-class classification, and ranking. The first three chapters lay the theoretical foundation for what follows, but each remaining chapter is mostly self-contained. The appendix offers a concise probability review, a short introduction to convex optimization, tools for concentration bounds, and several basic properties of matrices and norms used in the book.

The book is intended for graduate students and researchers in machine learning, statistics, and related areas; it can be used either as a textbook or as a reference text for a research seminar.

Before I lay out $70 for a copy, I would appreciate comments on how this differs from, say, Christopher M. Bishop’s Pattern Recognition and Machine Learning (2007, 2nd printing). Five (5) years will make some difference, but how much?

LucidChart

Filed under: Graphics,Visualization — Patrick Durusau @ 10:06 am

LucidChart

Not new, but I just saw it reviewed in today’s Scout Report.

Limited to Google Chrome, but what do you expect from folks who support “open” standards? (Seems like we should not have to teach the value of interoperability again and again.)

Written in HTML5 so there is hope someone will create a trans-browser service of this sort.

Even in this browser inhibited form, could be useful for another run at a graphic language for topic maps.

Proposals could be floated, trimmed, extended, etc. to see what communities of practice emerge. If any. Personally I suspect that like the domains they model, icons are going to be domain specific. Or even language specific.

Witness UML. It’s OK if you want to speak UML. Most bankers prefer “banking,” insurance clerks “insurance,” government officials “regulations/oppression” (depending on your point of view), etc.

Me? Nice of you to ask but I’m with the guy/girl with the checkbook. Whatever they want to speak is my preference as well.

Better table search through Machine Learning and Knowledge

Filed under: Kernel Methods,Searching,Support Vector Machines,Tables — Patrick Durusau @ 9:25 am

Better table search through Machine Learning and Knowledge by Johnny Chen.

From the post:

The Web offers a trove of structured data in the form of tables. Organizing this collection of information and helping users find the most useful tables is a key mission of Table Search from Google Research. While we are still a long way away from the perfect table search, we made a few steps forward recently by revamping how we determine which tables are “good” (one that contains meaningful structured data) and which ones are “bad” (for example, a table that holds the layout of a Web page). In particular, we switched from a rule-based system to a machine learning classifier that can tease out subtleties from the table features and enables rapid quality improvement iterations. This new classifier is a support vector machine (SVM) that makes use of multiple kernel functions which are automatically combined and optimized using training examples. Several of these kernel combining techniques were in fact studied and developed within Google Research [1,2].
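
As a rough sketch of the idea (not Google’s code; the feature vectors, weights and kernels are my own toy assumptions), here is what an SVM with a fixed combination of two kernels over hand-built table features might look like in scikit-learn. Proper multiple kernel learning would fit the combination weights from training data rather than fixing them.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

def combined_kernel(X, Y, alpha=0.6):
    # weighted sum of two kernels; alpha is fixed here, learned in real MKL
    return alpha * rbf_kernel(X, Y, gamma=0.5) + (1 - alpha) * linear_kernel(X, Y)

# toy feature vectors for tables: [rows, cols, fraction numeric cells, header score]
X_train = np.array([[50, 4, 0.8, 0.9],   # "good" data table
                    [3, 2, 0.0, 0.1],    # layout table
                    [120, 6, 0.7, 0.8],
                    [2, 3, 0.1, 0.0]])
y_train = np.array([1, 0, 1, 0])         # 1 = good, 0 = layout/bad

clf = SVC(kernel=combined_kernel)
clf.fit(X_train, y_train)
print(clf.predict(np.array([[80, 5, 0.75, 0.85]])))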

Important work on tables from Google Research.

Important in part because you can compare your efforts on accessible tables to theirs, to gain insight into what you are, or aren’t, doing “right.”

For any particular domain, you should be able to do better than a general solution.

BTW, I disagree on the “good” versus “bad” table distinction. I suspect that tables that hold the layout of web pages, say for a CMS, are more consistent than database tables of comparable size. And that data may or may not be important to you.

Important versus non-important data for a particular set of requirements is a defensible distinction.

“Good” versus “bad” tables is not.

Solr vs. ElasticSearch: Part 1 – Overview

Filed under: ElasticSearch,Solr,SolrCloud — Patrick Durusau @ 8:18 am

Solr vs. ElasticSearch: Part 1 – Overview by Rafał Kuć.

From the post:

A good Solr vs. ElasticSearch coverage is long overdue. We make good use of our own Search Analytics and pay attention to what people search for. Not surprisingly, lots of people are wondering when to choose Solr and when ElasticSearch.

As the Apache Lucene 4.0 release approaches and with it Solr 4.0 release as well, we thought it would be beneficial to take a deeper look and compare the two leading open source search engines built on top of Lucene – Apache Solr and ElasticSearch. Because the topic is very wide and can go deep, we are publishing our research as a series of blog posts starting with this post, which provides the general overview of the functionality provided by both search engines.

Rafał gets this series of posts off to a good start!

PS: Solr vs. ElasticSearch: Part 2 – Data Handling

August 23, 2012

VLDB 2012 Ice Breaker v0.1

Filed under: Conferences,CS Lectures — Patrick Durusau @ 7:34 pm

VLDB 2012 Ice Breaker v0.1

Apologies for the sudden silence but I started working on an “in house” version of the VLDB program listing to use with this blog.

I reworked the program to add links for all authors to the DBLP Computer Science Bibliography.

The links resolve but that doesn’t mean they are “correct.” As I work through the program I will be correcting any links that can be made more specific.

As usual, comments and suggestions are welcome!

August 22, 2012

A bit “lite” today

Filed under: Marketing — Patrick Durusau @ 6:48 pm

Apologies but postings are a bit “lite” today.

It will probably be late tomorrow, but stay tuned for one of the reasons today has been a “lite” day.

It’s not a bad reason and I think you will like the outcome, or at least find it useful.

Distributed GraphLab: …

Filed under: Amazon Web Services AWS,GraphLab,Graphs — Patrick Durusau @ 6:44 pm

Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud by Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein.

Abstract:

While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.

We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.

A gem from my first day as a member of the GraphLab and GraphChi group on LinkedIn!

This rocks!

VLDB 2012 Advance Program

Filed under: CS Lectures,Database — Patrick Durusau @ 6:42 pm

VLDB 2012 Advance Program

I took this text from the conference homepage:

VLDB is a premier annual international forum for data management and database researchers, vendors, practitioners, application developers, and users. The conference will feature research talks, tutorials, demonstrations, and workshops. It will cover current issues in data management, database and information systems research. Data management and databases remain among the main technological cornerstones of emerging applications of the twenty-first century.

I can’t think of a better summary of the papers, tutorials, etc., that you will find here.

I could easily lose the better part of a week just skimming abstracts.

Suggestions/comments?

August 21, 2012

Predictive Models: Build once, Run Anywhere

Filed under: Machine Learning,Prediction,Predictive Analytics — Patrick Durusau @ 2:59 pm

Predictive Models: Build once, Run Anywhere

From the post:

We have released a new version of our open source Python bindings. This new version aims at showing how the BigML API can be used to build predictive models capable of generating predictions locally or remotely. You can get full access to the code at Github and read the full documentation at Read the Docs.

The complete list of updates includes (drum roll, please):

Development Mode

We recently introduced a free sandbox to help developers play with BigML on smaller datasets without being concerned about credits. In the new Python bindings you can use BigML in development mode, and all datasets and models smaller than 1MB can be created for free:

from bigml.api import BigML

api = BigML(dev_mode=True)

A “sandbox” for your machine learning experiments!
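
To round that out, here is a hedged sketch of the build-once, predict-anywhere flow the post describes. The file name and field name are made up, and the exact calls should be checked against the bindings’ documentation.

from bigml.api import BigML
from bigml.model import Model

api = BigML(dev_mode=True)

source = api.create_source("data/iris.csv")   # upload training data
api.ok(source)                                # wait until the resource is ready
dataset = api.create_dataset(source)
api.ok(dataset)
model = api.create_model(dataset)
api.ok(model)

# remote prediction, computed on BigML's side
remote = api.create_prediction(model, {"petal length": 4.2})

# local prediction: download the model once, then predict offline
local_model = Model(model)
print(local_model.predict({"petal length": 4.2}))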

Apache to Drill for big data in Hadoop

Filed under: BigData,Drill,Hadoop — Patrick Durusau @ 1:58 pm

Apache to Drill for big data in Hadoop

From the post:

A new Apache Incubator proposal should see the Drill project offering a new open source way to interactively analyse large scale datasets on distributed systems. Drill is inspired by Google’s Dremel but is designed to be more flexible in terms of supported query languages. Dremel has been in use by Google since 2006 and is now the engine that powers Google’s BigQuery analytics.

The project is being led at Apache by developers from MapR where the early Drill development was being done. Also contributing are Drawn To Scale and Concurrent. Requirements and design documentation will be contributed to the project by MapR. Hadoop is good for batch queries, but by allowing quicker queries of huge data sets, those data sets can be better explored. The Drill technology, like the Google Dremel technology, does not replace MapReduce or Hadoop systems. It works alongside them, offering a system which can analyse the output of the batch processing system and its pipelines, or be used to rapidly prototype larger scale computations.

Drill is comprised of a query language layer with parser and execution planner, a low latency execution engine for executing the plan, nested data formats for data storage and a scalable data source layer. The query language layer will focus on Drill’s own query language, DrQL, and the data source layer will initially use Hadoop as its source. The project overall will closely integrate with Hadoop, storing its data in Hadoop and supporting the Hadoop FileSystem and HBase and supporting Hadoop data formats. Apache’s Hive project is also being considered as the basis for the DrQL.

The developers hope that by developing in the open at Apache, they will be able to create and establish Drill’s own APIs and ensure a robust, flexible architecture which will support a broad range of data sources, formats and query languages. The project has been accepted into the incubator and so far has an empty subversion repository.

Q: Is anyone working on/maintaining a map between the various Hadoop-related query languages?

Getting Started with R and Hadoop

Filed under: BigData,Hadoop,R — Patrick Durusau @ 1:47 pm

Getting Started with R and Hadoop by David Smith.

From the post:

Last week's meeting of the Chicago area Hadoop User Group (a joint meeting with the Chicago R User Group, and sponsored by Revolution Analytics) focused on crunching Big Data with R and Hadoop. Jeffrey Breen, president of Atmosphere Research Group, frequently deals with large data sets in his airline consulting work, and R is his "go-to tool for anything data-related". His presentation, "Getting Started with R and Hadoop" focuses on the RHadoop suite of packages, and especially the rmr package to interface R and Hadoop. He lists four advantages of using rmr for big-data analytics with R and Hadoop:

  • Well-designed API: code only needs to deal with basic R objects
  • Very flexible I/O subsystem: handles common formats like CSV, and also allows complex line-by-line parsing
  • Map-Reduce jobs can easily be daisy-chained to build complex workflows
  • Concise code compared to other ways of interfacing R and Hadoop (the chart below compares the number of lines of code required to implement a map-reduce analysis using different systems) 

Slides, detailed examples, presentation, pointers to other resources.

Other than processing your data set, doesn’t look like it leaves much out. 😉

Ironic that we talk about “big data” sets when the Concept Annotation in the CRAFT corpus took two and one-half years (that’s 30 months for you mythic developer types) to tag ninety-seven (97) medical articles.

That’s an average of a little over three (3) articles per month.

And I am sure the project leads would concede that more could be done.

Maybe “big” data should include some notion of “complex” data?

Cliff Bleszinski’s Game Developer Flashcards

Filed under: Discourse,Games,Programming,Standards — Patrick Durusau @ 1:22 pm

Cliff Bleszinski’s Game Developer Flashcards by Cliff Bleszinski.

From the post:

As of this summer, I’ll have been making games for 20 years professionally. I’ve led the design on character mascot platform games, first-person shooters, single-player campaigns, multiplayer experiences, and much more. I’ve worked with some of the most amazing programmers, artists, animators, writers, and producers around. Throughout this time period, I’ve noticed patterns in how we, as creative professionals, tend to communicate.

I’ve learned that while developers are incredibly intelligent, they can sometimes be a bit insecure about how smart they are compared to their peers. I’ve seen developer message boards tear apart billion-dollar franchises, indie darlings, and everything in between by overanalyzing and nitpicking. We always want to prove that we thought of an idea before anyone else, or we will cite a case in which an idea has been attempted, succeeded, failed, or been played out.

In short, this article identifies communication techniques that are often used in discussions, arguments, and debates among game developers in order to “win” said conversations.

Written in a “game development” context but I think you can recognize some of these patterns in standards work, ontology development and other areas as well.

I did not transpose/translate it into standards lingo, reasoning that it would be easier to see the mote in someone else’s eye than the plank in our own. 😉

Only partially in jest.

Listening to others is hard, listening to ourselves (for patterns like these), is even harder.

I first saw this at: Nat Torkington’s Four short links: 21 August 2012.

GraphLab and GraphChi group on LinkedIn

Filed under: GraphChi,GraphLab — Patrick Durusau @ 1:09 pm

GraphLab and GraphChi group on LinkedIn by Igor Carron.

From the post:

Danny just started the GraphLab and GraphChi group on LinkedIn. If you want to be part of that disruptive discussion, you may want to join.

OK, I just hit “join.” What about you?

Streaming Data Mining Tutorial slides (and more)

Filed under: Data Mining,Stream Analytics — Patrick Durusau @ 1:02 pm

Streaming Data Mining Tutorial slides (and more) by Igor Carron.

From the post:

Jelani Nelson and Edo Liberty just released an important tutorial they gave at KDD 12 on the state of the art and practical algorithms used in mining streaming data, entitled: Streaming Data Mining. I personally marvel at the development of these deep algorithms which, because of the large data stream constraints, get to redefine what it means to do seemingly simple functions such as counting in the Big Data world. Here are some slides that got my interest, but the 111 pages are worth the read:

Pointers to more slides and videos follow.
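
To give a flavor of what “counting” means under streaming constraints, here is a small sketch of the classic Misra-Gries frequent-items algorithm (my illustration, not taken from the tutorial): it finds heavy-hitter candidates in one pass while keeping far fewer counters than there are distinct items.

def misra_gries(stream, k):
    # keeps at most k-1 counters instead of one per distinct item
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # decrement everything; drop counters that hit zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters  # candidates for items with frequency > n/k

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, k=3))  # 'a' survives as the heavy hitter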

Linked Lists in Datomic [Herein of tolog and Neo4j]

Filed under: Datomic,Prolog,tolog — Patrick Durusau @ 10:59 am

Linked Lists in Datomic by Joachim Hofer.

From the post:

As my last contact with Prolog was over ten years ago, I think it’s time for some fun with Datomic and Datalog. In order to learn to know Datomic better, I will attempt to implement linked lists as a Datomic data structure.

First, I need a database “schema”, which in Datomic means that I have to define a few attributes. I’ll define one :content/name (as a string) for naming my list items, and also the attributes for the list data structure itself, namely :linkedList/head and :linkedList/tail (both are refs):

You may or may not know that tolog, a topic map query language, was inspired in part by Datalog. Understanding Datalog could lead to new insights into tolog.

The other reason to mention this post is that Neo4j uses linked lists as part of its internal data structure.

If I am reading slide 9 (Neo4j Internals (update)) correctly, relationships are hard-coded to have start/end nodes (singletons).

Not going to squeeze hyperedges out of that data structure.

What if you replaced the start/end node values with key/value pairs that serve as membership criteria for the hyperedge?

Even if most edges still had only a start and end node meeting the membership criteria, that would free you up to have hyperedges when needed.
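
Here is a rough sketch of what I have in mind (purely hypothetical Python, nothing to do with Neo4j’s actual internals): an edge that records key/value membership criteria instead of a fixed start/end pair, so any number of nodes can qualify.

class HyperEdge:
    def __init__(self, label, **criteria):
        self.label = label
        self.criteria = criteria  # key/value pairs a node must match

    def members(self, nodes):
        # every node satisfying all criteria belongs to the hyperedge
        return [n for n in nodes
                if all(n.get(k) == v for k, v in self.criteria.items())]

nodes = [
    {"name": "Alice", "role": "employee", "dept": "sales"},
    {"name": "Bob",   "role": "employee", "dept": "sales"},
    {"name": "Carol", "role": "customer"},
]

edge = HyperEdge("works_in_sales", role="employee", dept="sales")
print([n["name"] for n in edge.members(nodes)])  # ['Alice', 'Bob']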

Will have to look at the implementation details on hyperedges/nodes to see. Suspect others have found better solutions.

Putting WorldCat Data Into A Triple Store

Filed under: Library,Linked Data,RDF,WorldCat — Patrick Durusau @ 10:32 am

Putting WorldCat Data Into A Triple Store by Richard Wallis.

From the post:

I can not really get away with making a statement like “Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them” and then not following it up.

I made it in my previous post Get Yourself a Linked Data Piece of WorldCat to Play With in which I was highlighting the release of a download file containing RDF descriptions of the 1.2 million most highly held resources in WorldCat.org – to make the cut, a resource had to be held by more than 250 libraries.

So here for those that are interested is a step by step description of what I did to follow my own encouragement to load up the triples and start playing.
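
If you want to poke at a slice of the dump before committing to a triple store, here is a small rdflib sketch (my own, not from Richard’s post; the file name, the N-Triples format and the schema.org predicate are assumptions about the sample you extract):

from rdflib import Graph

g = Graph()
g.parse("worldcat-sample.nt", format="nt")   # a small slice of the download file

query = """
    SELECT ?work ?name WHERE {
        ?work <http://schema.org/name> ?name .
    } LIMIT 10
"""
for work, name in g.query(query):
    print(work, name)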

Have you loaded the WorldCat linked data into a triple store?

Some other storage mechanism?

Amazon Glacier: Archival Storage for One Penny Per GB Per Month

Filed under: Amazon Web Services AWS,Storage — Patrick Durusau @ 10:18 am

Amazon Glacier: Archival Storage for One Penny Per GB Per Month by Jeff Barr.

From the post:

I’m going to bet that you (or your organization) spend a lot of time and a lot of money archiving mission-critical data. No matter whether you’re currently using disk, optical media or tape-based storage, it’s probably a more complicated and expensive process than you’d like which has you spending time maintaining hardware, planning capacity, negotiating with vendors and managing facilities.

True?

If so, then you are going to find our newest service, Amazon Glacier, very interesting. With Glacier, you can store any amount of data with high durability at a cost that will allow you to get rid of your tape libraries and robots and all the operational complexity and overhead that have been part and parcel of data archiving for decades.

Glacier provides – at a cost as low as $0.01 (one US penny, one one-hundredth of a dollar) per Gigabyte, per month – extremely low cost archive storage. You can store a little bit, or you can store a lot (Terabytes, Petabytes, and beyond). There’s no upfront fee and you pay only for the storage that you use. You don’t have to worry about capacity planning and you will never run out of storage space. Glacier removes the problems associated with under or over-provisioning archival storage, maintaining geographically distinct facilities and verifying hardware or data integrity, irrespective of the length of your retention periods.

One caveat: you don’t have immediate access to your data (it is called “Glacier” for a reason). But it is still an impressive price.
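
For a sense of that asymmetry, here is a hedged sketch using the boto3 Glacier client (the vault name and file are placeholders, and boto3 itself is my assumption, not part of Jeff’s post): uploading is a single call, while getting an archive back means initiating a retrieval job and coming back hours later.

import boto3

glacier = boto3.client("glacier")
glacier.create_vault(accountId="-", vaultName="cold-archive")  # "-" = the credentialed account

# uploading is immediate
with open("backup-2012-08.tar.gz", "rb") as body:
    archive = glacier.upload_archive(accountId="-", vaultName="cold-archive", body=body)

# retrieval is not: you start a job and poll it, typically for hours
job = glacier.initiate_job(
    accountId="-",
    vaultName="cold-archive",
    jobParameters={"Type": "archive-retrieval",
                   "ArchiveId": archive["archiveId"]},
)
print(job["jobId"])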

Unless you are monitoring nuclear missile launch signatures or are a day trader, do you really need arbitrary and random access to all your data?

Or is that a requirement because you read some other department or agency was getting “real time” big data?

UC Irvine Medical Center: Improving Quality of Care with Apache Hadoop

Filed under: Hadoop,Health care — Patrick Durusau @ 9:21 am

UC Irvine Medical Center: Improving Quality of Care with Apache Hadoop by Charles Boicey.

From the post:

With a single observation in early 2011, the Hadoop strategy at UC Irvine Medical Center started. While using Twitter, Facebook, LinkedIn and Yahoo we came to the conclusion that healthcare data, although domain specific, is structurally not much different from a tweet, Facebook posting or LinkedIn profile, and that the environment powering these applications should be able to do the same with healthcare data.

In healthcare, data shares many of the same qualities as that found in the large web properties. Each has a seemingly infinite volume of data to ingest and it is all types and formats across structured, unstructured, video and audio. We also noticed the near zero latency in which data was not only ingested but also rendered back to users was important. Intelligence was also apparent in that algorithms were employed to make suggestions such as people you may know.

We started to draw parallels to the challenges we were having with the typical characteristics of Big Data: volume, velocity and variety.

The start of a series on Hadoop in health care.

I am more interested in the variety question than volume or velocity but for practical applications, all three are necessary considerations.

From further within the post:

We saw this project as vehicle for demonstrating the value of Applied Clinical Informatics and promoting the translational effects of rapidly moving from “code side to bedside”. (emphasis added)

Just so you know to add the string “Applied Clinical Informatics” to your literature searches in this area.

The wheel will be re-invented often enough without your help.

August 20, 2012

Fast Set Intersection in Memory [Foul! They Peeked!]

Filed under: Algorithms,Memory,Set Intersection,Sets — Patrick Durusau @ 4:06 pm

Fast Set Intersection in Memory by Bolin Ding and Arnd Christian König.

Abstract:

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with totally n elements, we will show how to compute their intersection in expected time O(n / sqrt(w) + kr), where r is the intersection size and w is the number of bits in a machine-word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state of the art techniques for both synthetic and real data sets and workloads.
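
For orientation, here is the textbook baseline the paper improves on, not the authors’ word-packing algorithm: intersect k sorted sets by walking the smallest one and binary-searching the rest.

from bisect import bisect_left

def contains(sorted_list, x):
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

def intersect(sets):
    # sort each set, then probe the smallest against the others
    sets = sorted((sorted(s) for s in sets), key=len)
    smallest, rest = sets[0], sets[1:]
    return [x for x in smallest if all(contains(s, x) for s in rest)]

print(intersect([{1, 3, 5, 7, 9}, {3, 4, 5, 9}, {5, 9, 11}]))  # [5, 9]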

Important not only for the algorithm but how they arrived at it.

They peeked at the data.

Imagine that.

Not trying to solve the set intersection problem in the abstract but looking at data you are likely to encounter.

I am all for the pure theory side of things but there is something to be said for less airy (dare I say windy?) solutions. 😉

I first saw this at Theoretical Computer Science: Most efficient algorithm to compute set difference?

Topic Map Based Publishing

Filed under: Marketing,Publishing,Topic Map Software,Topic Maps — Patrick Durusau @ 10:21 am

After asking for ideas on publishing cheat sheets this morning, I have one to offer as well.

One problem with traditional cheat sheets is deciding what any particular user wants in a cheat sheet.

Another problem is how to expand the content of a cheat sheet.

And what if you want to sell the content? How does that work?

I don’t have a working version (yet) but here is my thinking on how topic maps could power a “cheat sheet” that meets all those requirements.

Solving the problem of what content to include seems critical to me. It is the make or break point in terms of attracting paying customers for a cheat sheet.

Content of no interest is as deadly as poor quality content. Either way, paying customers will vote with their feet.

The first step is to allow customers to “build” their own cheat sheet from some list of content. In topic map terminology, they specify an association between themselves and a set of topics to appear in “their” cheat sheet.

Most of the cheat sheets that I have seen (and I have printed out more than a few) are static artifacts. WYSIWYG artifacts. What’s there is all there is, and there ain’t no more.

Works for some things, but what if what you need to know lies just beyond the edge of the cheat sheet? That’s the bad thing about static artifacts: they have edges.

Beyond letting customers build their own cheat sheet, the only limits to a topic map based cheat sheet are those imposed by lack of payment or interest. 😉

You may not need troff syntax examples on a daily basis but there are times when they could come in quite handy. (Don’t laugh. Liam Quin got hired on the basis of the troff typesetting of his resume.)

The second step is to have a cheat sheet that can expand or contract based on the immediate needs of the user. Sometimes more or less content, depending on their need. Think of an expandable “nutshell” reference.

A WYWIWYG (What You Want Is What You Get) approach as opposed to WWWTSYIWYG (What We Want To Sell You Is What You Get) (any publishers come to mind?).

What’s more important? Your needs or the needs of your publisher?

Finally, how to “sell” the content? The value-add?

Here’s one model: The user buys a version of the cheat sheet, which has embedded links to additional content. Links that, when the user authenticates to a server, are treated as subject identifiers. Subject identifiers that cause merging to occur with topics on the server and deliver additional content. Each user subject identifier can be auto-generated on purchase and so is uniquely tied to a particular login.
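
A toy sketch of the merging step (my own illustration, with made-up identifiers): topics merge when they share a subject identifier, and the merged topic picks up the additional content the server holds for that purchase.

def merge_topics(client_topic, server_topics):
    # merge any server topic sharing a subject identifier with the client topic
    merged = dict(client_topic)
    for topic in server_topics:
        if set(topic["subject_identifiers"]) & set(client_topic["subject_identifiers"]):
            merged.setdefault("content", []).extend(topic.get("content", []))
    return merged

client_topic = {
    "name": "grep cheat entry",
    "subject_identifiers": ["http://example.com/sid/purchase-42/grep"],
}
server_topics = [
    {"subject_identifiers": ["http://example.com/sid/purchase-42/grep"],
     "content": ["extended grep examples", "regex reference"]},
]
print(merge_topics(client_topic, server_topics))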

The user can freely distribute the version of the cheat sheet they purchased, free advertising for you. But the additional content requires a separate purchase by the new user.

What blind alleys, potholes and other hazards/dangers am I failing to account for in this scenario?
