Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 7, 2012

Biff (Bloom Filter) Codes:…

Filed under: Bloom Filters,Error Correction,Set Reconciliation — Patrick Durusau @ 9:17 am

Biff (Bloom Filter) Codes: Fast Error Correction for Large Data Sets by M. Mitzenmacher and George Varghese.

Abstract:

Large data sets are increasingly common in cloud and virtualized environments. For example, transfers of multiple gigabytes are commonplace, as are replicated blocks of such sizes. There is a need for fast error-correction or data reconciliation in such settings even when the expected number of errors is small.

Motivated by such cloud reconciliation problems, we consider error-correction schemes designed for large data, after explaining why previous approaches appear unsuitable. We introduce Biff codes, which are based on Bloom filters and are designed for large data. For Biff codes with a message of length L and E errors, the encoding time is O(L), decoding time is O(L + E) and the space overhead is O(E). Biff codes are low-density parity-check codes; they are similar to Tornado codes, but are designed for errors instead of erasures. Further, Biff codes are designed to be very simple, removing any explicit graph structures and based entirely on hash tables. We derive Biff codes by a simple reduction from a set reconciliation algorithm for a recently developed data structure, invertible Bloom lookup tables. While the underlying theory is extremely simple, what makes this code especially attractive is the ease with which it can be implemented and the speed of decoding. We present results from a prototype implementation that decodes messages of 1 million words with thousands of errors in well under a second.
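Since the abstract stresses that the construction is "based entirely on hash tables," here is a toy invertible Bloom lookup table in Python showing the insert/peel mechanics the decoder relies on. This is my own minimal sketch, not the authors' code; the cell count, hash choices, and integer-keyed interface are arbitrary assumptions.

```python
import hashlib

class IBLT:
    """Toy invertible Bloom lookup table over integer keys: supports insert,
    delete, and listing the surviving entries by 'peeling' pure cells."""

    def __init__(self, m=200, k=3):
        self.m, self.k = m, k
        # each cell: [count, xor of keys, xor of key checksums]
        self.cells = [[0, 0, 0] for _ in range(m)]

    def _indexes(self, key):
        return [int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]

    def _checksum(self, key):
        return int(hashlib.sha256(f"chk:{key}".encode()).hexdigest(), 16) & 0xFFFFFFFF

    def _update(self, key, delta):
        for i in self._indexes(key):
            cell = self.cells[i]
            cell[0] += delta
            cell[1] ^= key
            cell[2] ^= self._checksum(key)

    def insert(self, key):
        self._update(key, +1)

    def delete(self, key):
        self._update(key, -1)

    def list_entries(self):
        """Peel cells holding exactly one key (count is +1 or -1 and the
        checksum matches) until no pure cell remains."""
        inserted, deleted = set(), set()
        progress = True
        while progress:
            progress = False
            for cell in self.cells:
                count, key, chk = cell
                if count in (1, -1) and chk == self._checksum(key):
                    (inserted if count == 1 else deleted).add(key)
                    self._update(key, -count)   # remove it from all its cells
                    progress = True
        return inserted, deleted

t = IBLT()
for key in (11, 22, 33):
    t.insert(key)
t.delete(22)
print(t.list_entries())   # ({11, 33}, set()) with high probability
```

The peeling loop is the same idea the decoder in the paper exploits: as long as some cell holds exactly one item, you can recover it and strip it out, which usually exposes more pure cells.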

I followed this paper’s citation on set reconciliation to the paper covered below: What’s the Difference? Efficient Set Reconciliation without Prior Context.

I suspect this line of work is far from finished and that you will find immediate and future uses for it in topic map applications.

August 6, 2012

What’s the Difference? Efficient Set Reconciliation without Prior Context

Filed under: Distributed Systems,P2P,Set Reconciliation,Sets,Topic Map Software — Patrick Durusau @ 4:56 pm

What’s the Difference? Efficient Set Reconciliation without Prior Context by David Eppstein, Michael T. Goodrich, Frank Uyeda, and George Varghese.

Abstract:

We describe a synopsis structure, the Difference Digest, that allows two nodes to compute the elements belonging to the set difference in a single round with communication overhead proportional to the size of the difference times the logarithm of the keyspace. While set reconciliation can be done efficiently using logs, logs require overhead for every update and scale poorly when multiple users are to be reconciled. By contrast, our abstraction assumes no prior context and is useful in networking and distributed systems applications such as trading blocks in a peer-to-peer network, and synchronizing link-state databases after a partition.

Our basic set-reconciliation method has a similarity with the peeling algorithm used in Tornado codes [6], which is not surprising, as there is an intimate connection between set difference and coding. Beyond set reconciliation, an essential component in our Difference Digest is a new estimator for the size of the set difference that outperforms min-wise sketches [3] for small set differences.

Our experiments show that the Difference Digest is more efficient than prior approaches such as Approximate Reconciliation Trees [5] and Characteristic Polynomial Interpolation [17]. We use Difference Digests to implement a generic KeyDiff service in Linux that runs over TCP and returns the sets of keys that differ between machines.
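To see why estimating the size of the difference matters before reconciling, here is a deliberately naive sampling estimator in Python. It is not the strata estimator or the min-wise sketch discussed in the paper, only a sketch of the idea that a small summary can estimate |A Δ B|; the hash and sampling rate are arbitrary, and its poor accuracy on small differences is exactly the gap the paper's estimator closes.

```python
import hashlib

def _h(x):
    """Map an element to a pseudo-random value in [0, 1) using a fixed hash."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def sample_summary(items, p=0.05):
    """Keep only elements whose hash falls below p. Both hosts use the same
    hash, so a shared element is kept by both sides or by neither."""
    return {x for x in items if _h(x) < p}

def estimate_difference(summary_a, summary_b, p=0.05):
    """Shared elements cancel out of the symmetric difference of the samples,
    so its size divided by p is an unbiased estimate of |A symmetric-diff B|."""
    return len(summary_a ^ summary_b) / p

A = set(range(0, 100_000))
B = set(range(1_000, 101_000))        # A and B differ in 2,000 elements
estimate = estimate_difference(sample_summary(A), sample_summary(B))
print(f"estimated difference size: {estimate:.0f} (true value 2000)")
```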

Distributed topic maps anyone?

From Solr to elasticsearch [Clarity as a Value?]

Filed under: ElasticSearch,JSON,Solr — Patrick Durusau @ 4:39 pm

From Solr to elasticsearch by Rob Young.

From the post:

Search is right at the center of GOV.UK. It’s the main focus of the homepage and it appears in the corner of every single page. Many of our recent and upcoming apps such as licence finder also rely heavily on search. So, making sure we have the right tool for the job is vital. Recently we decided to begin switching away from Solr to elasticsearch for our search server. Rob Young, a developer at GDS explains in some detail the basis for our decisions – the usual disclaimers about this being quite technical apply.

I am sure there are points to be made for both Solr and ElasticSearch. No doubt much religious debate will follow this decision.

What interested me was the claim that:

Just about the most important feature of any search engine is the ability to query it. Both Solr and elasticsearch expose their query APIs over HTTP but they do so in quite different ways. Solr queries are made up of two and three letter URL parameters, while elasticsearch queries are clear, self documenting JSON objects passed in the HTTP body.
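For readers who have not seen the two styles side by side, here is roughly the contrast the post is drawing, sketched in Python with the requests library. The servers, index name, and field names are hypothetical, and this is a simplified illustration rather than the actual queries from the GOV.UK post.

```python
import requests

# Solr: the query is packed into short URL parameters.
# (assumes a local Solr instance; adjust the URL and core to taste)
solr_params = {
    "q": "title:licence",   # main query
    "rows": 10,             # number of results
    "wt": "json",           # response format
}
solr_response = requests.get("http://localhost:8983/solr/select", params=solr_params)

# elasticsearch: the query is a JSON document sent in the request body.
# (assumes a local elasticsearch instance and a hypothetical 'documents' index)
es_body = {
    "query": {"match": {"title": "licence"}},
    "size": 10,
}
es_response = requests.post("http://localhost:9200/documents/_search", json=es_body)
```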

It is possible, as the example in the post shows, to have “…clear, self documenting JSON objects….” in ElasticSearch, but isn’t clarity in that case optional?

Or at least in the eyes of its user?

Not to downplay the importance of being “…clear and self-documenting…” but to make it clear that it is a design choice. A good one in my opinion, but a design choice nonetheless.

That the clarity in this case occurs in JSON is an accident of expression.

neo4j: Creating a custom index with neo4j.rb

Filed under: Neo4j,Neo4j.rb,Neography — Patrick Durusau @ 4:13 pm

neo4j: Creating a custom index with neo4j.rb by Mark Needham

From the post:

As I mentioned in my last post I’ve been playing around with the TFL Bus stop location and routes API and one thing I wanted to do was load all the bus stops into a neo4j database using the neo4j.rb gem.

I initially populated the database via neography but it was taking around 20 minutes each run and I figured it’d probably be much quicker to populate it directly rather than using the REST API.

You might want to mark/update your copy of the Neo4j documentation to account for what Mark discovers about custom indexes.

Writing a modular GPGPU program in Java

Filed under: CUDA,GPU,Java — Patrick Durusau @ 4:05 pm

Writing a modular GPGPU program in Java by Masayuki Ioki, Shumpei Hozumi, and Shigeru Chiba.

Abstract:

This paper proposes a Java to CUDA runtime program translator for scientific-computing applications. Traditionally, these applications have been written in Fortran or C without using a rich modularization mechanism. Our translator enables those applications to be written in Java and run on GPGPUs while exploiting a rich modularization mechanism in Java. This translator dynamically generates optimized CUDA code from a Java program given at bytecode level when the program is running. By exploiting dynamic type information given at translation, the translator devirtualizes dynamic method dispatches and flattens objects into simple data representation in CUDA. To do this, a Java program must be written to satisfy certain constraints.

This paper also shows that the performance overheads due to Java and WootinJ are not significantly high.
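The "flattening objects into simple data representation" step can be pictured without CUDA or WootinJ at all. Here is a rough Python/NumPy sketch of the general idea, turning an array of objects into flat, contiguous arrays of primitives, the kind of layout a GPU kernel (or generated CUDA code) actually works on; the Particle class and its fields are invented for illustration.

```python
import numpy as np

class Particle:
    """Ordinary object-oriented representation: one heap object per particle."""
    def __init__(self, x, y, mass):
        self.x, self.y, self.mass = x, y, mass

particles = [Particle(float(i), float(2 * i), 1.0) for i in range(1_000)]

# Flattened, structure-of-arrays form: three contiguous buffers instead of
# a thousand scattered objects with headers and pointers.
xs     = np.array([p.x for p in particles],    dtype=np.float32)
ys     = np.array([p.y for p in particles],    dtype=np.float32)
masses = np.array([p.mass for p in particles], dtype=np.float32)

# A kernel-style computation then becomes a simple element-wise expression
# over the flat arrays.
weighted_radius_sq = masses * (xs ** 2 + ys ** 2)
print(weighted_radius_sq[:3])
```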

Just in case you are starting to work on topic map processing routines for GPGPUs.

Something to occupy your time during the “dog days” of August.

Channel 9’s JavaScript Fundamentals Series

Filed under: Javascript — Patrick Durusau @ 3:57 pm

Channel 9’s JavaScript Fundamentals Series

I hesitated before making this post.

In part because of concern for how it would “look” to post on deeply theoretical language issues and JavaScript on the same day, in the same week, or even on the same blog.

Then I remembered that the point of this blog is to convey useful information to users, authors, and designers of topic maps and systems that deal with semantic diversity. How any particular post “looks” to anyone isn’t relevant to that purpose.

If a post is too simple for you, look the other way. 😉

As much a comment to myself as anyone else!

Enjoy!

A Big Data Revolution in Astrophysics

Filed under: Astroinformatics,BigData — Patrick Durusau @ 3:34 pm

A Big Data Revolution in Astrophysics by Ian Armas Foster.

Ian writes:

Humanity has been studying the stars for as long as it has been able to gaze at them. The study of stars has led to one revelation after another: that the planet is round, that we are not the center of the universe; it has also spawned Einstein’s general theory of relativity.

As more powerful telescopes are developed, more is learned about the wild happenings in space, including black holes, binary star systems, the movement of galaxies, and even the detection of the Cosmic Microwave Background, which may hint at the beginnings of the universe.

However, all of these discoveries were made relatively slowly, relying on the relaying of information to other stations whose observatories may not be active for several hours or even days—a process that carries a painful amount of time between image retrieval and potential discovery recognition.

Solving these problems would be huge for astrophysics. According to Peter Nugent, Senior Staff Scientist of Berkeley’s National Laboratory, big data is on its way to doing just that. Nugent has been the expert voice on this issue following his experiences with an ambitious project known as the Palomar Transient Factory.

It’s a good post and is likely to get you interested in astronomical (both senses) data problems.

Quibble: Why no links to the Palomar Transient Factory? This happens too often at many sites to be an oversight. We are all writing in hyperlink-capable media. Yes? Why the poverty of hyperlinks?

BTW:

  • Palomar Transient Factory, and
  • Access public spectra (WISEASS)

I don’t mind if you visit other sites. I write to facilitate your use of resources on the WWW. Maybe that’s the difference.

There and Back Again

Filed under: CS Lectures,Programming,Types — Patrick Durusau @ 3:18 pm

There and Back Again by Robert Harper.

From the post:

Last fall it became clear to me that it was “now or never” time for completing Practical Foundations for Programming Languages, so I put just about everything else aside and made the big push to completion. The copy editing phase is now complete, the cover design (by Scott Draves) is finished, and its now in the final stages of publication. You can even pre-order a copy on Amazon; it’s expected to be out in November.

I can already think of ways to improve it, but at some point I had to declare victory and save some powder for future editions. My goal in writing the book is to organize as wide a body of material as I could manage in a single unifying framework based on structural operational semantics and structural type systems. At over 600 pages the manuscript is at the upper limit of what one can reasonably consider a single book, even though I strived for concision throughout.

Quite a lot of the technical development is original, and does not follow along traditional lines. For example, I completely decouple the concepts of assignment, reference, and storage class (heap or stack) from one another, which makes clear that one may have references to stack-allocated assignables, or make use of heap-allocated assignables without having references to them. As another example, my treatment of concurrency, while grounded in the process calculus tradition, coheres with my treatment of assignables, but differs sharply from conventional accounts (and suffers none of their pathologies in the formulation of equivalence).

From the preface:

Types are the central organizing principle of the theory of programming languages. Language features are manifestations of type structure. The syntax of a language is governed by the constructs that define its types, and its semantics is determined by the interactions among those constructs. The soundness of a language design—the absence of ill-defined programs—follows naturally.

The purpose of this book is to explain this remark. A variety of programming language features are analyzed in the unifying framework of type theory. A language feature is defined by its statics, the rules governing the use of the feature in a program, and its dynamics, the rules defining how programs using this feature are to be executed. The concept of safety emerges as the coherence of the statics and the dynamics of a language.

In this way we establish a foundation for the study of programming languages. But why these particular methods? The main justification is provided by the book itself. The methods we use are both precise and intuitive, providing a uniform framework for explaining programming language concepts. Importantly, these methods scale to a wide range of programming language concepts, supporting rigorous analysis of their properties. Although it would require another book in itself to justify this assertion, these methods are also practical in that they are directly applicable to implementation and uniquely effective as a basis for mechanized reasoning. No other framework offers as much.
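As a standard, textbook-style illustration of that statics/dynamics pairing (my example, not one lifted from the book): the static rule below says when a function application is well typed, and the dynamic rule says how it executes; safety is the claim that the two cohere, so well-typed programs do not get stuck.

```latex
% Statics: typing rule for function application.
\frac{\Gamma \vdash e_1 : \tau_2 \rightarrow \tau \qquad \Gamma \vdash e_2 : \tau_2}
     {\Gamma \vdash e_1\, e_2 : \tau}
\qquad
% Dynamics: a function applied to a value steps by substitution.
(\lambda x{:}\tau_2.\, e)\, v \;\longmapsto\; [v/x]\, e
```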

Now that Robert has lunged across the author’s finish line, which one of us will incorporate his thinking into our own?

Twitter’s Scalding – Scala and Hadoop hand in hand

Filed under: Hadoop,Scalding — Patrick Durusau @ 10:59 am

Twitter’s Scalding – Scala and Hadoop hand in hand by Istvan Szegedi.

From the post:

If you have read the paper published by Google’s Jeffrey Dean and Sanjay Ghemawat (MapReduce: Simplified Data Processing on Large Clusters), they revealed that their work was inspired by the concept of functional languages: “Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages….Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”

Given the fact that Scala is a programming language that combines object-oriented and functional programming and runs on the JVM, it is a fairly natural evolution to introduce Scala into the Hadoop environment. That is what Twitter engineers did. (See more on how Scala is used at Twitter: “Twitter on Scala” and “The Why and How of Scala at Twitter“). Scala has powerful support for mapping, filtering, and pattern matching (regular expressions), so it is a pretty good fit for MapReduce jobs.
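The Lisp-style map and reduce primitives the Google quote points to are easy to see in miniature. A toy, single-machine sketch in plain Python (not Scalding, not Hadoop) that counts words the "functional" way:

```python
from functools import reduce
from collections import Counter

lines = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# map: turn each input line into a bag of per-line word counts
mapped = map(lambda line: Counter(line.split()), lines)

# reduce: merge the per-line counts into a single global count
word_counts = reduce(lambda left, right: left + right, mapped, Counter())

print(word_counts.most_common(3))   # [('the', 3), ('quick', 2), ('dog', 2)]
```

In Hadoop, and therefore in Scalding, the same two steps are simply distributed: the map runs over splits of the input across the cluster and the reduce merges the partial results.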

Another guide to Scalding.

r3 redistribute reduce reuse

Filed under: MapReduce,Python,Redis — Patrick Durusau @ 10:30 am

r3 redistribute reduce reuse

From the project homepage:

r³ is a map-reduce engine written in python using redis as a backend

r³ is a map reduce engine written in python using a redis backend. Its purpose is to be simple.

r³ has only three concepts to grasp: input streams, mappers and reducers.

You need to visit this project. It is simple, efficient and effective.

I found this by following r³ – A quick demo of usage, which I saw at Demoing the Python-Based Map-Reduce R3 Against GitHub Data on Alex Popescu’s myNoSQL.

Apache HBase (DZone Refcard)

Filed under: HBase — Patrick Durusau @ 8:50 am

Apache HBase (DZone Refcard) by Otis Gospodnetic and Alex Baranau.

From the webpage:

The Essential HBase Cheat Sheet

HBase is the Hadoop database: a distributed, scalable Big Data store that lets you host very large tables — billions of rows multiplied by millions of columns — on clusters built with commodity hardware. Just as Google Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

If you aren’t familiar with the DZone Refcards, they are a valuable resource.

August 5, 2012

Elastisch, a Clojure client for ElasticSearch

Filed under: Clojure,ElasticSearch — Patrick Durusau @ 6:15 pm

Elastisch, a Clojure client for ElasticSearch

From about this guide:

This guide covers ElasticSearch indexing capabilities in depth, explains how Elastisch presents them in the API and how some of the key features are commonly used.

This guide covers:

  • What is indexing in the context of full text search
  • What kind of features ElasticSearch has w.r.t. indexing, how Elastisch exposes them in the API
  • Mapping types and how they define how the data is indexed by ElasticSearch
  • How to define mapping types with Elastisch
  • Lucene built-in analyzers, their characteristics, what different kind of analyzers are good for.
  • Other topics related to indexing and working with indexes

An extensive introduction to ElasticSearch.

If you are not familiar with ElasticSearch already, now might be a good time.

Thinking Functionally with Haskell [Privileging Users or System Designers]

Filed under: Functional Programming,Haskell — Patrick Durusau @ 4:36 pm

Thinking Functionally with Haskell by Paul Callaghan.

From the post:

In which we begin an exploration into the Haskell language and dive deeply into functional programming.

Ever wondered how functional programmers think? I aim to give you a glimpse into the programming style and mindset of experienced functional programmers, so you can see why we are so passionate about what we do. We’ll also discuss some wider ideas about programming, such as making our languages fit the problem and not the other way round, and how this affects language design.

Few of these ideas get the exposure they deserve in textbooks or tutorials, and in my view they are essential for coming to grips with a functional language and using it productively in real apps.

Syntax and semantics, the meat and veg of most books and university courses, are ok for basic language use, but to really master a language that embodies a paradigm that is new to you, you need to know about the deeper pragmatic ideas. Let’s see if we can do something about that.

I used Lisp for a few years before university, then switched to Haskell and have been using it for around 20 years. However, inspired by learning about Rails and Ruby when revamping a tired web technology course, I changed career to do full-time Rails work, and have spent the last four years having fun on a variety of apps, including Spree (#2 committer at one point) and recently a big bespoke lab management system.

Ruby feels like naughty fun for a Haskell programmer. Many of the ideas are very similar, like the very natural use of blocks and lambdas and having lots of scope for bending the rules. I really enjoy programming in Ruby, though some times I do get homesick and pine for a bit more oomph.

Most of this article will refer to Haskell, though many of the ideas do apply to other, similar languages as well. Haskell has a few advantages and a good balance of features. Haskell has its weaknesses too, and I hope to explore these in due course.

I rather like the “…making our languages fit the problem and not the other way round…” phrase.

Or to phrase it for topic maps: “….taking subject identity as defined by users, not defining it for them….”

Perhaps I should ask: Do you want to privilege users or system designers?

The R-Podcast Episode 9: Adventures in Data Munging Part 1

Filed under: Data Mining,R — Patrick Durusau @ 4:11 pm

The R-Podcast Episode 9: Adventures in Data Munging Part 1

From the post:

It’s great to be back with a new episode after an eventful break! This episode begins a series on my adventures in data munging, a.k.a data processing. I discuss three issues that demonstrate the flexibility and versatility R brings for recoding messy values, importing inconsistent data files, and pinpointing problematic observations and variables. We also have an extended listener feedback segment with an audio installment of the “pitfalls” of R contributed by listener Frans. I hope you enjoy this episode and keep passing along your feedback to theRcast(at)gmail.com and stop by the forums as well!

What do you think about the format?

Other than for Atlanta-area commuters, it seems a bit overlong to me.

And for some topics, such as teaching syntax, it is best to be able to “see” the examples.

More Fun with Hadoop In Action Exercises (Pig and Hive)

Filed under: Hadoop,Hive,MapReduce,Pig — Patrick Durusau @ 3:50 pm

More Fun with Hadoop In Action Exercises (Pig and Hive) by Sujit Pal.

From the post:

In my last post, I described a few Java based Hadoop Map-Reduce solutions from the Hadoop in Action (HIA) book. According to the Hadoop Fundamentals I course from Big Data University, part of being a Hadoop practitioner also includes knowing about the many tools that are part of the Hadoop ecosystem. The course briefly touches on the following four tools – Pig, Hive, Jaql and Flume.

Of these, I decided to focus (at least for the time being) on Pig and Hive (for the somewhat stupid reason that the HIA book covers these too). Both of these are high-level DSLs that produce sequences of Map-Reduce jobs. Pig provides a data flow language called PigLatin, and Hive provides a SQL-like language called HiveQL. Both tools provide a REPL shell, and both can be extended with UDFs (User Defined Functions). The reason they coexist in spite of so much overlap is because they are aimed at different users – Pig appears to be aimed at the programmer types and Hive at the analyst types.

The appeal of both Pig and Hive lies in the productivity gains – writing Map-Reduce jobs by hand gives you control, but it takes time to write. Once you master Pig and/or Hive, it is much faster to generate sequences of Map-Reduce jobs. In this post, I will describe three use cases (the first of which comes from the HIA book, and the other two I dreamed up).

More useful Hadoop exercise examples.

Algorithms for Modern Massive Data Sets [slides]

Filed under: Algorithms,BigData,GraphLab,Sparse Data — Patrick Durusau @ 3:40 pm

Algorithms for Modern Massive Data Sets [slides]

Igor Carron writes:

In case you have to take your mind off tomorrow’s suspense-filled and technologically challenging landing of Curiosity on Mars (see 7 minutes of Terror, a blockbuster taking place on Mars this Summer), Michael Mahoney, Alex Shkolnik, Gunnar Carlsson, and Petros Drineas, the organizers of the Workshop on Algorithms for Modern Massive Data Sets (MMDS 2012), just made available the slides of the meeting. Other relevant meeting slides include those of the Coding, Complexity, and Sparsity Workshop and of the GraphLab workshop. Don’t grind your teeth too much tomorrow and Go Curiosity!

Igor has helpfully listed four conference days of links for the algorithms workshop. When you finish there, take a look at the other two slide listings.

Journal of the American Medical Informatics Association (JAMIA)

Filed under: Bioinformatics,Informatics,Medical Informatics,Pathology Informatics — Patrick Durusau @ 10:53 am

Journal of the American Medical Informatics Association (JAMIA)

Aims and Scope

JAMIA is AMIA‘s premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA’s articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.

Another informatics journal to whitelist for searching.

Content is freely available after twelve (12) months.

Cancer, NLP & Kaiser Permanente Southern California (KPSC)

Filed under: Bioinformatics,Medical Informatics,Pathology Informatics,Uncategorized — Patrick Durusau @ 10:38 am

Kaiser Permanente Southern California (KPSC) deserves high marks for the research in:

Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm by Justin A. Strauss, et al.

Abstract:

Objective: Significant limitations exist in the timely and complete identification of primary and recurrent cancers for clinical and epidemiologic research. A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to address this problem.

Materials and methods: SCENT employs hierarchical classification rules to identify and extract information from electronic pathology reports. Reports are analyzed and coded using a dictionary of clinical concepts and associated SNOMED codes. To assess the accuracy of SCENT, validation was conducted using manual review of pathology reports from a random sample of 400 breast and 400 prostate cancer patients diagnosed at Kaiser Permanente Southern California. Trained abstractors classified the malignancy status of each report.

Results: Classifications of SCENT were highly concordant with those of abstractors, achieving κ of 0.96 and 0.95 in the breast and prostate cancer groups, respectively. SCENT identified 51 of 54 new primary and 60 of 61 recurrent cancer cases across both groups, with only three false positives in 792 true benign cases. Measures of sensitivity, specificity, positive predictive value, and negative predictive value exceeded 94% in both cancer groups.

Discussion: Favorable validation results suggest that SCENT can be used to identify, extract, and code information from pathology report text. Consequently, SCENT has wide applicability in research and clinical care. Further assessment will be needed to validate performance with other clinical text sources, particularly those with greater linguistic variability.

Conclusion: SCENT is proof of concept for SAS-based natural language processing applications that can be easily shared between institutions and used to support clinical and epidemiologic research.
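The core move the abstract describes, matching report text against a dictionary of clinical concepts and emitting codes, can be sketched in a few lines of Python. This is a toy illustration only; the phrases are invented, the codes are placeholders rather than real SNOMED CT identifiers, and it has none of SCENT's hierarchical classification rules.

```python
import re

# Hypothetical phrase-to-code dictionary; the codes are placeholders,
# not real SNOMED CT identifiers.
CONCEPTS = {
    r"invasive ductal carcinoma": "CODE-IDC",
    r"invasive lobular carcinoma": "CODE-ILC",
    r"no (evidence|residual) (of )?malignancy": "CODE-NEG",
}

def code_report(text):
    """Return the set of concept codes whose phrase patterns occur in the report."""
    lowered = text.lower()
    return {code for pattern, code in CONCEPTS.items() if re.search(pattern, lowered)}

report = ("Final diagnosis: Invasive ductal carcinoma, grade 2. "
          "No residual malignancy at the superior margin.")
print(code_report(report))   # {'CODE-IDC', 'CODE-NEG'}
```

Even this tiny example hints at why hierarchical rules matter: the negated finding about one margin triggers a code that a flat dictionary lookup cannot tell apart from a report-level negative.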

Before I forget:

Data sharing statement: SCENT is freely available for non-commercial use and modification. Program source code and requisite support files may be downloaded from: http://www.kp-scalresearch.org/research/tools_scent.aspx

Topic map promotion point: Application was built to account for linguistic variability, not to stamp it out.

Tools built to fit users are more likely to succeed, don’t you think?

Journal of Pathology Informatics (JPI)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Pathology Informatics — Patrick Durusau @ 10:09 am

Journal of Pathology Informatics (JPI)

About:

The Journal of Pathology Informatics (JPI) is an open access peer-reviewed journal dedicated to the advancement of pathology informatics. This is the official journal of the Association for Pathology Informatics (API). The journal aims to publish broadly about pathology informatics and freely disseminate all articles worldwide. This journal is of interest to pathologists, informaticians, academics, researchers, health IT specialists, information officers, IT staff, vendors, and anyone with an interest in informatics. We encourage submissions from anyone with an interest in the field of pathology informatics. We publish all types of papers related to pathology informatics including original research articles, technical notes, reviews, viewpoints, commentaries, editorials, book reviews, and correspondence to the editors. All submissions are subject to peer review by the well-regarded editorial board and by expert referees in appropriate specialties.

Another site to add to your whitelist of sites to search for informatics information.

> 4,000 Ways to say “You’re OK” [Breast Cancer Diagnosis]

The feasibility of using natural language processing to extract clinical information from breast pathology reports by Julliette M Buckley, et al.

Abstract:

Objective: The opportunity to integrate clinical decision support systems into clinical practice is limited due to the lack of structured, machine readable data in the current format of the electronic health record. Natural language processing has been designed to convert free text into machine readable data. The aim of the current study was to ascertain the feasibility of using natural language processing to extract clinical information from >76,000 breast pathology reports.

Approach and Procedure: Breast pathology reports from three institutions were analyzed using natural language processing software (Clearforest, Waltham, MA) to extract information on a variety of pathologic diagnoses of interest. Data tables were created from the extracted information according to date of surgery, side of surgery, and medical record number. The variety of ways in which each diagnosis could be represented was recorded, as a means of demonstrating the complexity of machine interpretation of free text.

Results: There was widespread variation in how pathologists reported common pathologic diagnoses. We report, for example, 124 ways of saying invasive ductal carcinoma and 95 ways of saying invasive lobular carcinoma. There were >4000 ways of saying invasive ductal carcinoma was not present. Natural language processor sensitivity and specificity were 99.1% and 96.5% when compared to expert human coders.

Conclusion: We have demonstrated how a large body of free text medical information such as seen in breast pathology reports, can be converted to a machine readable format using natural language processing, and described the inherent complexities of the task.

The advantages of using current language practices include:

  • No new vocabulary needs to be developed.
  • No adoption curve for a new vocabulary.
  • No training required for users to introduce the new vocabulary
  • Works with historical data.

and I am sure there are others.

Add natural language usage to your topic map for immediately useful results for your clients.

Machine Learning — Introduction

Filed under: Machine Learning — Patrick Durusau @ 4:28 am

Machine Learning — Introduction by Jeremy Kun.

These days an absolutely staggering amount of research and development work goes into the very coarsely defined field of “machine learning.” Part of the reason why it’s so coarsely defined is because it borrows techniques from so many different fields. Many problems in machine learning can be phrased in different but equivalent ways. While they are often purely optimization problems, such techniques can be expressed in terms of statistical inference, have biological interpretations, or have a distinctly geometric and topological flavor. As a result, machine learning has come to be understood as a toolbox of techniques as opposed to a unified theory.

It is unsurprising, then, that such a multitude of mathematics supports this diversified discipline. Practitioners (that is, algorithm designers) rely on statistical inference, linear algebra, convex optimization, and dabble in graph theory, functional analysis, and topology. Of course, above all else machine learning focuses on algorithms and data.

The general pattern, which we’ll see over and over again as we derive and implement various techniques, is to develop an algorithm or mathematical model, test it on datasets, and refine the model based on specific domain knowledge. The first step usually involves a leap of faith based on some mathematical intuition. The second step commonly involves a handful of established and well understood datasets (often taken from the University of California at Irvine’s machine learning database, and there is some controversy over how ubiquitous this practice is). The third step often seems to require some voodoo magic to tweak the algorithm and the dataset to complement one another.

It is this author’s personal belief that the most important part of machine learning is the mathematical foundation, followed closely by efficiency in implementation details. The thesis is that natural data has inherent structure, and that the goal of machine learning is to represent this and utilize it. To make true progress, one must represent and analyze structure abstractly. And so this blog will focus predominantly on mathematical underpinnings of the algorithms and the mathematical structure of data.
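The develop/test/refine pattern Jeremy describes looks roughly like this in practice. A minimal scikit-learn sketch using one of the standard UCI-derived datasets he mentions; the model choice and split are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1. develop a mathematical model (here: a plain linear classifier)
model = LogisticRegression(max_iter=1000)

# 2. test it on an established, well-understood dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# 3. refine: inspect the errors, adjust features or the model, and repeat
```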

Jeremy is starting a series of posts on machine learning that should prove to be useful.

While I would disagree about “inherent structure[s]” in data, we do treat data as though that were the case. Careful attention to those structures, inherent or not, is the watchword of useful analysis.

August 4, 2012

Fun With Hadoop In Action Exercises (Java)

Filed under: Hadoop,Java — Patrick Durusau @ 7:01 pm

Fun With Hadoop In Action Exercises (Java) by Sujit Pal.

From the post:

As some of you know, I recently took some online courses from Coursera. Having taken these courses, I have come to the realization that my knowledge has some rather large blind spots. So far, I have gotten most of my education from books and websites, and I have tended to cherry pick subjects which I need at the moment for my work, as a result of which I tend to ignore stuff (techniques, algorithms, etc) that fall outside that realm. Obviously, this is Not A Good Thing™, so I have begun to seek ways to remedy that.

I first looked at Hadoop years ago, but never got much beyond creating proof of concept Map-Reduce programs (Java and Streaming/Python) for text mining applications. Lately, many subprojects (Pig, Hive, etc) have come up in order to make it easier to deal with large amounts of data using Hadoop, about which I know nothing. So in an attempt to ramp up relatively quickly, I decided to take some courses at BigData University.

The course uses BigInsights (IBM’s packaging of Hadoop) which run only on Linux. VMWare images are available, but since I have a Macbook Pro, that wasn’t much use to me without a VMWare player (not free for Mac OSX). I then installed VirtualBox and tried to run a Fedora 10 64-bit image on it, and install BigInsights on Fedora, but it failed. I then tried to install Cloudera CDH4 (Cloudera’s packaging of Hadoop) on it (its a series of yum commands), but that did not work out either. Ultimately I decided to ditch VirtualBox altogether and do a pseudo-distributed installation of the stock Apache Hadoop (1.0.3) direct on my Mac following instructions on Michael Noll’s page.

The Hadoop Fundamentals I course which I was taking covers quite a few things, but I decided to stop and actually read all of Hadoop in Action (HIA) in order to get a more thorough coverage. I had purchased it some years before as part of Manning’s MEAP (Early Access) program, so its a bit dated (examples are mostly in the older 0.19 API), but its the only Hadoop book I possess, and the concepts are explained beautifully, and its not a huge leap to mentally translate code from the old API to the new, so it was well worth the read.

I also decided to tackle the exercises (in Java for now) and post my solutions on GitHub. Three reasons. First, it exposes me to a more comprehensive set of scenarios than I have had previously, and forces me to use techniques and algorithms that I wont otherwise. Second, hopefully some of my readers can walk circles around me where Hadoop is concerned, and they would be kind enough to provide criticism and suggestions for improvement. And third, there may be some who would benefit from having the HIA examples worked out. So anyway, here they are, my solutions to selected exercises from Chapters 4 and 5 of the HIA book for your reading pleasure.

Much good content follows!

This will be useful to a large number of people.

As well as setting a good example.

Genetic algorithms: a simple R example

Filed under: Genetic Algorithms,Merging,Subject Identity — Patrick Durusau @ 6:49 pm

Genetic algorithms: a simple R example by Bart Smeets.

From the post:

A genetic algorithm (GA) is a search heuristic. GAs can generate a vast number of possible model solutions and use these to evolve towards an approximation of the best solution of the model. Hereby it mimics evolution in nature.

GA generates a population, the individuals in this population (often called chromosomes) have a given state. Once the population is generated, the state of these individuals is evaluated and graded on their value. The best individuals are then taken and crossed-over – in order to hopefully generate ‘better’ offspring – to form the new population. In some cases the best individuals in the population are preserved in order to guarantee ‘good individuals’ in the new generation (this is called elitism).

The GA site by Marek Obitko has a great tutorial for people with no previous knowledge on the subject.
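If you would rather see the loop than read about it, here is a minimal Python sketch of the same generate/evaluate/select/crossover/mutate cycle on a toy OneMax problem. It is not the R example from Bart's post, and the parameter values are arbitrary.

```python
import random

GENOME_LEN, POP_SIZE, GENERATIONS = 30, 40, 60
MUTATION_RATE, ELITE = 0.02, 2

def fitness(genome):
    # OneMax: count the 1-bits; the optimum is all ones.
    return sum(genome)

def crossover(a, b):
    cut = random.randint(1, GENOME_LEN - 1)
    return a[:cut] + b[cut:]

def mutate(genome):
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in genome]

def tournament(pop, k=3):
    # select the fittest of k randomly chosen individuals
    return max(random.sample(pop, k), key=fitness)

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]

for generation in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    next_gen = population[:ELITE]                      # elitism
    while len(next_gen) < POP_SIZE:
        child = crossover(tournament(population), tournament(population))
        next_gen.append(mutate(child))
    population = next_gen

best = max(population, key=fitness)
print("best fitness:", fitness(best), "of", GENOME_LEN)
```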

As the size of data stores increases, the cost of personal judgement on each subject identity test will as well. Genetic algorithms may be one way of creating subject identity tests in such situations.

In any event, it won’t harm anyone to be aware of the basic contours of the technique.

I first saw this at R-Bloggers.

Geometric properties of graph layouts optimized for greedy navigation

Filed under: Geometry,Graphs,Navigation — Patrick Durusau @ 3:56 pm

Geometric properties of graph layouts optimized for greedy navigation by Sang Hoon Lee and Petter Holme.

The graph layouts used for complex network studies have mainly been developed to improve visualization. If we interpret the layouts in metric spaces such as Euclidean ones, however, the embedded spatial information can be a valuable cue for various purposes. In this work, we focus on the navigational properties of spatial graphs. We use a recent user-centric navigation protocol to explore spatial layouts of complex networks that are optimal for navigation. These layouts are generated with a simple simulated annealing optimization technique. We compare these layouts to others targeted at better visualization. We discuss the spatial statistical properties of the layouts optimized for better navigability and their implications.
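The greedy navigation protocol at the center of the paper is simple to state: at each step, forward to the neighbor whose layout position is closest to the target, and give up if no neighbor is closer than the current node. A rough sketch (my own illustration, not the authors' protocol or code; the graph and layout are toy data):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def greedy_route(adj, pos, source, target):
    """Greedy navigation on a laid-out graph.

    adj: dict node -> list of neighbors
    pos: dict node -> (x, y) coordinates from the layout
    Returns the path if greedy forwarding reaches the target, else None.
    """
    path, current = [source], source
    while current != target:
        if not adj[current]:
            return None
        best = min(adj[current], key=lambda n: dist(pos[n], pos[target]))
        if dist(pos[best], pos[target]) >= dist(pos[current], pos[target]):
            return None          # stuck in a local minimum of the layout
        path.append(best)
        current = best
    return path

# toy example: a path graph laid out on a line
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
pos = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0)}
print(greedy_route(adj, pos, 0, 3))   # [0, 1, 2, 3]
```

A layout "optimized for greedy navigation" is then one in which routes like this rarely hit the local-minimum case.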

Despite my misgivings about metric spaces, to say nothing of Euclidean ones, for some data, this looks particularly useful.

If you had the optimal layout for navigation of a graph, how would you recognize it? Aside from voicing your preference or choice?

Difficult question but one that the authors are pursuing.

It may be that measurement of “navigability” is possible.

Even if we have to accept that hidden factors are behind the “navigability” measurement.

Graph-rewriting Package

Filed under: Graphs,Haskell,Rewriting — Patrick Durusau @ 3:32 pm

Graph-rewriting Package

A Haskell based graph rewriting package I encountered recently.

You can find more information at Jon Rachel’s webpage. And the Wikipedia page on graph rewriting. (The Wikipedia page also has pointers to a number of graph rewriting software packages.)

For the definition of port graph grammars, see Charles Stewart, Reducibility Between Classes of Port Graph Grammar (2001).

Scalding

Filed under: MapReduce,Scala,Scalding — Patrick Durusau @ 3:07 pm

Scalding: Powerful & Concise MapReduce Programming

Description:

Scala is a functional programming language on the JVM. Hadoop uses a functional programming model to represent large-scale distributed computation. Scala is thus a very natural match for Hadoop.

In this presentation to the San Francisco Scala User Group, Dr. Oscar Boykin and Dr. Argyris Zymnis from Twitter give us some insight on Scalding DSL and provide some example jobs for common use cases.

Twitter uses Scalding for data analysis and machine learning, particularly in cases where we need more than sql-like queries on the logs, for instance fitting models and matrix processing. It scales beautifully from simple, grep-like jobs all the way up to jobs with hundreds of map-reduce pairs.

The Alice example failed (it counted the different forms of “Alice” differently). I am reading a regex book, so that may have made the problem more obvious to me.

Lesson: Test code/examples before presentation. 😉

See the Github repository: https://github.com/twitter/scalding.

Both Scalding and the presentation are worth your time.

Using the flickr XML/API as a source of RSS feeds

Filed under: Data,XML,XSLT — Patrick Durusau @ 2:07 pm

Using the flickr XML/API as a source of RSS feeds by Pierre Lindenbaum.

Pierre has created an XSLT stylesheet to transform XML from flickr into an RSS feed.

Something for your data harvesting recipe box.

FBI’s Sentinel Project: 5 Lessons Learned[?]

Filed under: Government,Government Data,Knowledge Management,Project Management — Patrick Durusau @ 1:57 pm

FBI’s Sentinel Project: 5 Lessons Learned [?] by John Foley.

John writes of lessons learned from the Sentinel Project, which replaces the $170 million disaster that was the Virtual Case File system.

Lessons you need to avoid applying to your information management projects, whether you use topic maps or not.

2. Agile development gets things done. The next big shift in strategy was Fulgham’s decision in September 2010 to wrest control of the project from prime contractor Lockheed Martin and use agile development to accelerate software deliverables. The thinking was that a hands-on, incremental approach would be faster because functionality would be developed, and adjustments made, in two-week “sprints.” The FBI missed its target date for finishing that work–September 2011–but it credits the agile methodology with ultimately getting the job done.

Missing a completion date by ten (10) months does not count as a success for most projects. Moreover, note how they define “success:”

this week’s announcement that Sentinel, as of July 1, became available to all FBI employees is a major achievement.

Available to all FBI employees? I would think use by all FBI employees would be the measure of success. Yes?

Can you think of a success measure other than use by employees?

3. Commercial software plays an important role. Sentinel is based in part on commercial software, a fact that’s often overlooked because of all the custom coding and systems integration involved. Under the hood are EMC’s Documentum document management software, Oracle databases, IBM’s WebSphere middleware, Microsoft’s SharePoint, and Entrust’s PKI technology. Critics who say that Sentinel would have gone more smoothly if only it had been based on off-the-shelf software seem unaware that, in fact, it is.

Commercial software? Sounds like a software Frankenstein to me. I wonder if they simply bought software based on the political clout of the vendors and then wired it together? That’s what it sounds like. Do you have access to the system documentation? That could prove to be an interesting read.

I can imagine legacy systems wired together with these components, but if you are building a clean system, why the cut-and-paste from different vendors?

4. Agile development is cheaper, too. Sentinel came in under its $451 million budget. The caveat is that the FBI’s original cost estimate for Sentinel was $425 million, but that was before Fulgham and Johnson took over, and they stayed within the budget they were given. The Inspector General might quibble with how the FBI accounts for the total project cost, having pointed out in the past that its tally didn’t reflect the agency’s staff costs. But the FBI wasn’t forced to go to Congress with its hand out. Agile development wasn’t only faster, but also cheaper.

Right, let’s simply lie to the prospective client about the true cost of development for a project. Their staff, who already have full time duties, can just tough it out and give us the review/feedback that we need to build a working system. Right.

This is true for IT projects in general and topic map projects in particular. Clients will have to resource the project properly from the beginning, not just with your time but with the time of their staff and subject matter experts.

A good topic map, read: a useful topic map, is going to reflect contributions from the client’s staff. You need to make the case to decision makers that the staff contributions are just as important as their present day-to-day tasks.

BTW, if agile development were so useful, people would simply be using it, like C, Java, and C++.

Do you see marketing pieces for C, Java, C++?

Successful approaches/languages are used, not advertised.

August 3, 2012

20% of users – 80% of security breaches?

Filed under: Search Engines,Security — Patrick Durusau @ 6:26 pm

I was reading about the Google Hacking Diggity Project today when it occurred to me to ask:

Are 20% of users responsible for 80% of security breaches?

I ask because:

The Google Hacking Diggity Project is a research and development initiative dedicated to investigating the latest techniques that leverage search engines, such as Google and Bing, to quickly identify vulnerable systems and sensitive data in corporate networks. This project page contains downloads and links to our latest Google Hacking research and free security tools. Defensive strategies are also introduced, including innovative solutions that use Google Alerts to monitor your network and systems.

OK, but that just means you are playing catch-up on security breaches. You aren’t ever getting ahead. Discovering weaknesses this way is, at best, hoping you find them before others do.

If you couple a topic map with your security scans, you can track users as they move from department to department, anticipating the next security breach.

And/or provide management with the ability to avoid security breaches in the first place.

I first saw this at KDNuggets.

Olympic rings as data symbols

Filed under: Graphics,Visualization — Patrick Durusau @ 3:46 pm

Olympic rings as data symbols by Nathan Yau.

Nathan reports on the use of Olympic rings as indicators for vital statistics.

The colors are:

Oceania, Africa, Europe, Asia, and the Americas.

Several of the comments on the video suggest better labeling/legend for colors.

Once you “get” the colors, it is a very effective graphic.

Lesson for topic map interfaces:

Being clever != Being Clear (at least not always)

