Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 28, 2012

Displaying Your Data in Google Earth Using R2G2

Filed under: Google Earth,Visualization — Patrick Durusau @ 10:39 am

Displaying Your Data in Google Earth Using R2G2

From the post:

Have you ever wanted to easily visualize your ecology data in Google Earth? R2G2 is a new package for R, available via R CRAN and formally described in this Molecular Ecology Resources article, which provides a user-friendly bridge between R and the Google Earth interface. Here, we will provide a brief introduction to the package, including a short tutorial, and then encourage you to try it out with your own data!

Nils Arrigo, with some help from Loren Albert, Mike Barker, and Pascal Mickelson (one of the contributors to Recology), has created a set of R tools to generate KML files to view data with geographic components. Instead of just telling you what the tools can do, though, we will show you a couple of examples using publically available data. Note: a number of individual files are linked to throughout the tutorial below, but just in case you would rather download all the tutorial files in one go, have at it (tutorial zip file).

Among the basic tools in R2G2 is the ability to place features—like dots, shapes, or images (including plots you produced in R)— that represent discrete observations at specific geographical locations. For example, in the figure below, we show the migratory path of a particular turkey vulture in autumn of three successive years (red = 2009; blue = 2010; green = 2011).

Google Earth image with three successive years of a particular turkey vulture's migration

If researchers can track and visualize a single turkey vulture's migration across two continents, then tracking and visualizing the paths, routes, and routines of other entities should be a matter of data collection.
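R2G2 does this work from inside R. To make the target format concrete, here is a minimal hand-rolled KML writer in Python — not R2G2, just an illustration of the kind of file it generates, with invented coordinates:

```python
# Minimal sketch: write a KML file with one placemark per observation.
# Coordinates below are invented for illustration; R2G2 automates this from R.

observations = [
    ("vulture-2009", -75.16, 39.95),   # (name, longitude, latitude)
    ("vulture-2010", -66.10, 18.40),
    ("vulture-2011", -58.40, -34.60),
]

placemarks = "\n".join(
    "  <Placemark>\n"
    "    <name>{}</name>\n"
    "    <Point><coordinates>{},{},0</coordinates></Point>\n"
    "  </Placemark>".format(name, lon, lat)
    for name, lon, lat in observations
)

kml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
    '<Document>\n{}\n</Document>\n</kml>\n'.format(placemarks)
)

with open("observations.kml", "w") as f:
    f.write(kml)  # open the result in Google Earth
```

Open the resulting file in Google Earth and each observation appears as a placemark at its coordinates.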

Faceted classification – Drill Up/Down, Out?

Filed under: Faceted Search,Facets — Patrick Durusau @ 10:19 am

Faceted classification

I use search facets in a number of contexts everyday.

But today this summary from Wikipedia struck me differently than most days:

A faceted classification system allows the assignment of an object to multiple characteristics (attributes), enabling the classification to be ordered in multiple ways, rather than in a single, predetermined, taxonomic order. A facet comprises “clearly defined, mutually exclusive, and collectively exhaustive aspects, properties or characteristics of a class or specific subject”.[1] For example, a collection of books might be classified using an author facet, a subject facet, a date facet, etc. (From Faceted classification at Wikipedia.)

My general experience is that facets are used to narrow search results. That is, the result set is progressively narrowed to fewer and fewer items.

At the same time, a choice of facets can be discarded, returning to a broader result set.

So facets can move the searcher up and down in search result size, but within the bounds of the initial result set.

Has anyone experimented with adding facets from a broader pool? Say, all the items in a database, not just those items returned by an initial search query?

Enabling the user to “drill out” from what we think of as the initial result set?

That would raise questions about managing facets over a changing underlying set, while still letting the user broaden or narrow the result set in the more traditional way.
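A toy sketch of the difference, in Python with invented data: traditional drill-down computes facet counts over the current result set, while "drilling out" computes them over the whole collection so a facet choice can broaden beyond the initial query:

```python
from collections import Counter

# Invented book records: each has facet values for author, subject, decade.
collection = [
    {"author": "Ranganathan", "subject": "classification", "decade": "1930s"},
    {"author": "Ranganathan", "subject": "mathematics",    "decade": "1950s"},
    {"author": "Svenonius",   "subject": "classification", "decade": "2000s"},
    {"author": "Svenonius",   "subject": "cataloging",     "decade": "1990s"},
]

def facet_counts(items, facet):
    """Count facet values over a set of items."""
    return Counter(item[facet] for item in items)

# Initial query: books about classification.
results = [item for item in collection if item["subject"] == "classification"]

# Traditional drill-down: facets computed only over the current results.
print(facet_counts(results, "author"))      # Counter({'Ranganathan': 1, 'Svenonius': 1})

# "Drill out": facets computed over the whole collection, so choosing one
# can *broaden* the result set beyond the boundaries of the initial query.
print(facet_counts(collection, "decade"))   # includes decades absent from `results`
```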

High Availability Search with SolrCloud

Filed under: Faceted Search,Facets,Solr,SolrCloud — Patrick Durusau @ 9:41 am

High Availability Search with SolrCloud by Brent Lemons.

Brent explains that using embedded ZooKeeper is useful for testing/learning SolrCloud, but high availability requires more.

That is, separate installations of SolrCloud and ZooKeeper, each deployed as a high availability application.

He walks through the steps to create and test such an installation.

If you have or expect to have a high availability search requirement, Brent’s post will be helpful.

Visualizing Networks: Beyond the Hairball

Filed under: Graphics,Networks,Visualization — Patrick Durusau @ 9:13 am

Visualizing Networks: Beyond the Hairball by Lynn Cherny.

Impressive slide set on visualizing networks that concludes with a good set of additional resources.

The sort of slide set that makes you regret not seeing the presentation live.

Mentions J. Bertin’s Semiology of Graphics. It is back in print if you have a serious interest in using graphics for communication.

I first saw this in a tweet by Peter Neubauer.

Tips and Tricks for Cypher

Filed under: Cypher,Neo4j — Patrick Durusau @ 8:51 am

some tips and tricks for mutable Cypher by Aseem Kishore.

Aseem has a nice post on mutable Cypher and issues he has encountered.

Will save you time solving the same problems.

Algorithms for Massive Data Sets

Filed under: Algorithms,BigData,CS Lectures — Patrick Durusau @ 8:43 am

Algorithms for Massive Data Sets by Inge Li Gørtz and Philip Bille.

From the course description:

A student who has met the objectives of the course will be able to:

  • Describe an algorithm in a comprehensible manner, i.e., accurately, concise, and unambiguous.
  • Prove correctness of algorithms.
  • Analyze, evaluate, and compare the performance of algorithms in models of computation relevant to massive data sets.
  • Analyze, evaluate, and compare the quality and reliability of solutions.
  • Apply and extend relevant algorithmic techniques for massive data sets.
  • Design algorithms for problems related to massive data sets.
  • Lookup and apply relevant research literature for problems related to massive data sets.
  • Systematically identify and analyze problems and make informed choices for solving the problems based on the analysis.
  • Argue clearly for the choices made when solving a problem.

Papers, slides and exercises provided for these topics:

Week 1: Introduction and Hashing: Chained, Universal, and Perfect.

Week 2: Predecessor Data Structures: x-fast tries and y-fast tries.

Week 3: Decremental Connectivity in Trees: Cluster decomposition, Word-Level Parallelism.

Week 4: Nearest Common Ancestors: Distributed data structures, Heavy-path decomposition, alphabetic codes.

Week 5: Amortized analysis and Union-Find.

Week 6: Range Reporting: Range Trees, Fractional Cascading, and kD Trees.

Week 7: Persistent data structures.

Week 8: String matching.

Week 9: String Indexing: Dictionaries, Tries, Suffix trees, and Suffix Sorting.

Week 10: Introduction to approximation algorithms. TSP, k-center, vertex cover.

Week 11: Approximation algorithms: Set Cover, stable matching.

Week 12: External Memory: I/O Algorithms, Cache-Oblivious Algorithms, and Dynamic Programming.

Just reading the papers will improve your big data skills.
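As a taste of the Week 5 material, here is a minimal union-find sketch in Python with path compression and union by rank; the course papers cover the amortized analysis that makes the structure nearly constant time per operation:

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path compression: point x (and its ancestors) directly at the root.
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

uf = UnionFind(5)
uf.union(0, 1)
uf.union(3, 4)
print(uf.find(1) == uf.find(0))  # True
print(uf.find(2) == uf.find(0))  # False
```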

October 27, 2012

zip-code-data-hacking

Filed under: Geographic Data,Geographic Information Retrieval,Government Data — Patrick Durusau @ 7:09 pm

zip-code-data-hacking by Neil Kodner.

From the readme file:

sourcing publicly available files, generate useful zip code-county data.

My goal is to be able to map zip codes to county FIPS codes, without paying. So far, I’m able to produce county fips codes for 41456 counties out of a list of 42523 zip codes.

I was able to find a zip code database from unitedstateszipcodes.org, each zip code had a county name but not a county FIPS code. I was able to find County FIPS codes on the census.gov site through some google hacking.

The data files are in the data directory – I’ll eventuall add code to make sure the latest data files are retrieved at runtime. I didn’t do this yet because I didn’t want to hammer the sites while I was quickly iterating – a local copy did just fine.

In case you are wondering why this mapping from zip codes to county FIPS codes is important:

Federal information processing standards codes (FIPS codes) are a standardized set of numeric or alphabetic codes issued by the National Institute of Standards and Technology (NIST) to ensure uniform identification of geographic entities through all federal government agencies. The entities covered include: states and statistically equivalent entities, counties and statistically equivalent entities, named populated and related location entities (such as, places and county subdivisions), and American Indian and Alaska Native areas. (From: Federal Information Processing Standard (FIPS))

Using zip code based data against federal agency data (keyed by FIPS) requires this mapping.
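A rough sketch of the join Neil is building, in Python — the file names and column layouts here are hypothetical, so see his repository for the real data files:

```python
import csv

# Hypothetical inputs: a zip-code table carrying county names and states,
# and a census table mapping county name + state to a FIPS code.
zip_to_county = {}          # zip code -> (state, county name)
with open("zip_code_database.csv") as f:
    for row in csv.DictReader(f):
        zip_to_county[row["zip"]] = (row["state"], row["county"].lower())

county_to_fips = {}         # (state, county name) -> FIPS code
with open("census_county_fips.csv") as f:
    for row in csv.DictReader(f):
        county_to_fips[(row["state"], row["county"].lower())] = row["fips"]

# The join: zip code -> county FIPS, noting the zips that fail to match.
zip_to_fips, unmatched = {}, []
for zip_code, key in zip_to_county.items():
    fips = county_to_fips.get(key)
    if fips:
        zip_to_fips[zip_code] = fips
    else:
        unmatched.append(zip_code)

print(len(zip_to_fips), "matched;", len(unmatched), "unmatched")
```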

I suspect Neil would appreciate your assistance.

I first saw this at Pete Warden’s Five Short Links.

A Common Crawl Experiment

Filed under: BigData,Common Crawl — Patrick Durusau @ 6:54 pm

A Common Crawl Experiment by Pavel Repin.

An introduction to the Common Crawl project.

Starts you off slow, with 4 billion pages. 😉

You are limited only by your imagination.

I first saw this at Pete Warden’s Five Short Links.

Designing good MapReduce algorithms

Filed under: Algorithms,BigData,Hadoop,MapReduce — Patrick Durusau @ 6:28 pm

Designing good MapReduce algorithms by Jeffrey D. Ullman.

From the introduction:

If you are familiar with “big data,” you are probably familiar with the MapReduce approach to implementing parallelism on computing clusters [1]. A cluster consists of many compute nodes, which are processors with their associated memory and disks. The compute nodes are connected by Ethernet or switches so they can pass data from node to node.

Like any other programming model, MapReduce needs an algorithm-design theory. The theory is not just the theory of parallel algorithms—MapReduce requires we coordinate parallel processes in a very specific way. A MapReduce job consists of two functions written by the programmer, plus some magic that happens in the middle:

  1. The Map function turns each input element into zero or more key-value pairs. A “key” in this sense is not unique, and it is in fact important that many pairs with a given key are generated as the Map function is applied to all the input elements.
  2. The system sorts the key-value pairs by key, and for each key creates a pair consisting of the key itself and a list of all the values associated with that key.
  3. The Reduce function is applied, for each key, to its associated list of values. The result of that application is a pair consisting of the key and whatever is produced by the Reduce function applied to the list of values. The output of the entire MapReduce job is what results from the application of the Reduce function to each key and its list.

When we execute a MapReduce job on a system like Hadoop [2], some number of Map tasks and some number of Reduce tasks are created. Each Map task is responsible for applying the Map function to some subset of the input elements, and each Reduce task is responsible for applying the Reduce function to some number of keys and their associated lists of values. The arrangement of tasks and the key-value pairs that communicate between them is suggested in Figure 1. Since the Map tasks can be executed in parallel and the Reduce tasks can be executed in parallel, we can obtain an almost unlimited degree of parallelism—provided there are many compute nodes for executing the tasks, there are many keys, and no one key has an unusually long list of values

A very important feature of the Map-Reduce form of parallelism is that tasks have the blocking property [3]; that is, no Map or Reduce task delivers any output until it has finished all its work. As a result, if a hardware or software failure occurs in the middle of a MapReduce job, the system has only to restart the Map or Reduce tasks that were located at the failed compute node. The blocking property of tasks is essential to avoid restart of a job whenever there is a failure of any kind. Since Map-Reduce is often used for jobs that require hours on thousands of compute nodes, the probability of at least one failure is high, and without the blocking property large jobs would never finish.

There is much more to the technology of MapReduce. You may wish to consult a free online text that covers MapReduce and a number of its applications [4].

Warning: This article may change your interest in the design of MapReduce algorithms.

Ullman’s stories of algorithm tradeoffs provide motivation to evaluate (or reevaluate) your own design tradeoffs.
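To make Ullman's three steps concrete, here is a toy, single-machine simulation of a MapReduce word count in Python; Hadoop performs the same Map / group-by-key / Reduce sequence, only split across parallel tasks:

```python
from collections import defaultdict

def map_fn(document):
    # Step 1: each input element becomes zero or more key-value pairs.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Step 3: reduce each key's list of values to a single output pair.
    return (key, sum(values))

def mapreduce(inputs, map_fn, reduce_fn):
    # Step 2: the system groups ("shuffles") all pairs by key.
    groups = defaultdict(list)
    for element in inputs:
        for key, value in map_fn(element):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

documents = ["big data needs big clusters", "big ideas need small examples"]
print(mapreduce(documents, map_fn, reduce_fn))
# [('big', 3), ('clusters', 1), ('data', 1), ...]
```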

Sketching and streaming algorithms for processing massive data [Heraclitus and /dev/null]

Filed under: BigData,Stream Analytics — Patrick Durusau @ 4:43 pm

Sketching and streaming algorithms for processing massive data by Jelani Nelson.

From the introduction:

Several modern applications require handling data so massive that traditional algorithmic models do not provide accurate means to design and evaluate efficient algorithms. Such models typically assume that all data fits in memory, and that running time is accurately modeled as the number of basic instructions the algorithm performs. However in applications such as online social networks, large-scale modern scientific experiments, search engines, online content delivery, and product and consumer tracking for large retailers such as Amazon and Walmart, data too large to fit in memory must be analyzed. This consideration has led to the development of several models for processing such large amounts of data: The external memory model [1, 2] and cache-obliviousness [3, 4], where one aims to minimize the number of blocks fetched from disk; property testing [5], where it is assumed the data is so massive that we do not wish to even look at it all and thus aim to minimize the number of probes made into the data; and massively parallel algorithms operating in such systems as MapReduce and Hadoop [6, 7]. Also in some applications, data arrives in a streaming fashion and must be processed on the fly. Such cases arise, for example, with packet streams in network traffic monitoring, or query streams arriving at a Web-based service such as a search engine.

In this article we focus on this latter streaming model of computation, where a given algorithm must make one pass over a data set to then compute some function. We pursue such streaming algorithms, which use memory that is sublinear in the amount of data, since we assume the data is too large to fit in memory. Sometimes it becomes useful to consider algorithms that are allowed not just one, but a few passes over the data, in cases where the data set lives on disk and the number of passes may dominate the overall running time. We also occasionally discuss sketches. A sketch is with respect to some function f, and a sketch of a data set x is a compressed representation of x from which one can compute f(x). Of course under this definition f(x) is itself a valid sketch of x, but we often require more of our sketch than just being able to compute f(x). For example, we typically require that it should be possible for the sketch to be updated as more data arrives, and sometimes we also require sketches of two different data sets that are prepared independently can be compared to compute some function of the aggregate data, or similarity or difference measures across different data sets.

Our goal in this article is not to be comprehensive in our coverage of streaming algorithms. Rather, we discuss in some detail a few surprising results in order to convince the reader that it is possible to obtain some non-trivial algorithms within this model. Those interested in learning more about this area are encouraged to read the surveys [8, 9], or view the notes online for streaming courses taught by Chakrabarti at Dartmouth [10], Indyk at MIT [11], and McGregor at UMass Amherst [12].
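One standard example of a "sketch" in Nelson's sense is the count-min sketch: a small table of counters that can be updated as the stream arrives and merged across independently prepared sketches. A minimal Python version, with width and depth chosen arbitrarily here:

```python
import hashlib

class CountMinSketch:
    """Tiny count-min sketch: approximate counts in sublinear space."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hashed bucket per row; md5 stands in for a proper hash family.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Never underestimates; collisions can only inflate the count.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for packet_src in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"]:
    cms.add(packet_src)
print(cms.estimate("10.0.0.1"))  # 3 (possibly more if buckets collide)
```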

Projections of data growth are outdated nearly as soon as they are uttered.

Suffice it to say that whatever data we are called upon to process today will be larger next year. How much larger depends on the domain, the questions to be answered and a host of other factors. But it will be larger.

We need to develop methods of subject recognition for when, like Heraclitus, we cannot ever step in the same stream twice.

If we miss it on the first pass, there isn’t going to be a second one. Next stop for some data streams is going to be /dev/null.

What approaches are you working on?

What’s Wrong with Probabilistic Databases? (multi-part)

Filed under: Probabilistic Database — Patrick Durusau @ 4:26 pm

Oliver Kennedy has started a series of posts on probabilistic databases:

What’s Wrong with Probabilistic Databases? (part 1): General introduction to probabilistic databases.

What’s Wrong With Probabilistic Databases? (part 2): Using probabilistic databases with noisy data.

I will be following this series here.

Other materials for introduction to probabilistic databases?


Update: What’s Wrong with Probabilistic Databases? (Part 3): Using probabilistic databases with modeled data.

SPARQL and Big Data (and NoSQL) [Identifying Winners and Losers – Cui Bono?]

Filed under: BigData,NoSQL,SPARQL — Patrick Durusau @ 3:19 pm

SPARQL and Big Data (and NoSQL) by Bob DuCharme.

From the post:

How to pursue the common ground?

I think it’s obvious that SPARQL and other RDF-related technologies have plenty to offer to the overlapping worlds of Big Data and NoSQL, but this doesn’t seem as obvious to people who focus on those areas. For example, the program for this week’s Strata conference makes no mention of RDF or SPARQL. The more I look into it, the more I see that this flexible, standardized data model and query language align very well with what many of those people are trying to do.

But, we semantic web types can’t blame them for not noticing. If you build a better mouse trap, the world won’t necessarily beat a path to your door, because they have to find out about your mouse trap and what it does better. This requires marketing, which requires talking to those people in language that they understand, so I’ve been reading up on Big Data and NoSQL in order to better appreciate what they’re trying to do and how.

A great place to start is the excellent (free!) booklet Planning for Big Data by Edd Dumbill. (Others contributed a few chapters.) For a start, he describes data that “doesn’t fit the strictures of your database architectures” as a good candidate for Big Data approaches. That’s a good start for us. Here are a few longer quotes that I found interesting, starting with these two paragraphs from the section titled “Ingesting and Cleaning” after a discussion about collecting data from multiple different sources (something else that RDF and SPARQL are good at):

Bob has a very good point: marketing “…requires talking to those people in language that they understand….”

That is, no matter how “good” we think a solution may be, it won’t interest others until we explain it in terms they “get.”

But “marketing” requires more than a lingua franca.

Once an offer is made and understood, it must interest the other person. Or it is very poor marketing.

We may think that any sane person would jump at the chance to reduce the time and expense of data cleaning. But that isn’t necessarily the case.

I once made a proposal that would have substantially reduced the time and expense of maintaining membership records, records (kept in hard copy) that spanned decades and were growing every year. I made the proposal thinking it would be well received.

Hardly. I was called into my manager's office and got a lecture on how the department in question had more staff, a larger budget, etc., than any other department. They had no interest whatsoever in my proposal, and I was told not to presume to offer further advice. (Years later my suggestion was adopted when budget issues forced the issue.)

Efficient information flow interested me, but not management.

Bob and the rest of us need to ask the traditional question: Cui bono? (To whose benefit?)

Semantic technologies, just like any other, have winners and losers.

To effectively market our wares, we need to identify both.

October 26, 2012

Linked Data Platform 1.0

Filed under: Linked Data,LOD — Patrick Durusau @ 7:05 pm

Linked Data Platform 1.0

From the working draft:

A set of best practices and simple approach for a read-write Linked Data architecture, based on HTTP access to web resources that describe their state using RDF.

Just in case you are keeping up with the Linked Data effort.

I first saw this at Semanticweb.com.

Metamarkets open sources distributed database Druid

Filed under: Distributed Systems,Druid,NoSQL — Patrick Durusau @ 6:56 pm

Metamarkets open sources distributed database Druid by Elliot Bentley.

From the post:

It’s no secret that the latest challenge for the ‘big data’ movement is moving from batch processing to real-time analysis. Metamarkets, who provide “Data Science-as-a-Service” business analytics, last year revealed details of in-house distributed database Druid – and have this week released it as an open source project.

Druid was designed to solve the problem of a database which allows multi-dimensional queries on data as and when it arrives. The company originally experimented with both relational and NoSQL databases, but concluded they were not fast enough for their needs and so rolled out their own.

The company claims that Druid’s scan speed is “33M rows per second per core”, able to ingest “up to 10K incoming records per second per node”. An earlier blog post outlines how the company managed to achieve scan speeds of 26B records per second using horizontal scaling. It does this via a distributed architecture, column orientation and bitmap indices.

It was exciting to read about Druid last year.

Now to see how exciting Druid is in fact!

Source code: https://github.com/metamx/druid
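To get an intuition for why column orientation plus bitmap indices make such scans fast (this is only the idea, nothing like Druid's actual implementation), consider a toy Python version:

```python
# Toy column store: each column is a list; a bitmap index maps each distinct
# value to the set of row ids holding it (real systems use compressed bitmaps).
rows = [
    {"country": "US", "browser": "chrome",  "clicks": 3},
    {"country": "DE", "browser": "firefox", "clicks": 1},
    {"country": "US", "browser": "firefox", "clicks": 7},
    {"country": "US", "browser": "chrome",  "clicks": 2},
]

columns = {name: [r[name] for r in rows] for name in rows[0]}

def bitmap_index(column):
    index = {}
    for row_id, value in enumerate(column):
        index.setdefault(value, set()).add(row_id)
    return index

country_idx = bitmap_index(columns["country"])
browser_idx = bitmap_index(columns["browser"])

# "WHERE country = 'US' AND browser = 'chrome'" is just a bitmap intersection,
# and the aggregate only touches the one column it actually needs.
matching = country_idx["US"] & browser_idx["chrome"]
print(sum(columns["clicks"][i] for i in matching))  # 5
```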

Information Diffusion on Twitter by @snikolov

Filed under: Gephi,Graphs,Networks,Pig,Tweets — Patrick Durusau @ 6:33 pm

Information Diffusion on Twitter by @snikolov by Marti Hearst.

From the post:

Today Stan Nikolov, who just finished his masters at MIT in studying information diffusion networks, walked us through one particular theoretical model of information diffusion which tries to predict under what conditions an idea stops spreading based on a network’s structure (from the popular Easley and Kleinberg Network book). Stan also gathered a huge amount of Twitter data, processed it using Pig scripts, and graphed the results using Gephi. The video lecture below shows you some great visualizations of the spreading behavior of the data!

(video omitted)

The slides in his Lecture Notes let you see the Pig scripts in more detail.

Another deeply awesome lecture from Marti’s class on Twitter and big data.

Also an example of the level of analysis that a Twitter stream will need to withstand to avoid “imperial entanglements.”

Neo4j 1.9.M01 – Self-managed HA

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 5:00 pm

Neo4j 1.9.M01 – Self-managed HA by Peter Neubauer.

Welcome everyone to the first Milestone of the Neo4j 1.9 releases! In this release we’re presenting our new HA solution and a set of excellent improvements to our query language, Cypher.

Peter hits the highlights of the first milestone release for Neo4j 1.9.

I suggest you grab the software first and read Peter’s summary while you “play along.” 😉

BigML creates a marketplace for Predictive Models

Filed under: Data,Machine Learning,Prediction,Predictive Analytics — Patrick Durusau @ 4:42 pm

BigML creates a marketplace for Predictive Models by Ajay Ohri.

From the post:

BigML has created a marketplace for selling Datasets and Models. This is a first (?) as the closest market for Predictive Analytics till now was Rapid Miner’s marketplace for extensions (at http://rapidupdate.de:8180/UpdateServer/faces/index.xhtml)

From http://blog.bigml.com/2012/10/25/worlds-first-predictive-marketplace/

SELL YOUR DATA

You can make your Dataset public. Mind you: the Datasets we are talking about are BigML’s fancy histograms. This means that other BigML users can look at your Dataset details and create new models based on this Dataset. But they can not see individual records or columns or use it beyond the statistical summaries of the Dataset. Your Source will remain private, so there is no possibility of anyone accessing the raw data.

SELL YOUR MODEL

Now, once you have created a great model, you can share it with the rest of the world. For free or at any price you set. Predictions are paid for in BigML Prediction Credits. The minimum price is ‘Free’ and the maximum price indicated is 100 credits.

Having a public, digital marketplace for data and data analysis has been proposed by many and attempted by more than just a few.

Data is bought and sold today, but not by the digital equivalent of small shop keepers. The shop keepers who changed the face of Europe.

Data is bought and sold today by the digital equivalent of the great feudal lords. Complete with castles (read silos).

Will BigML give rise to a new mercantile class?

Or just as importantly, will you be a member of it or bound to the estate of a feudal lord?

First Steps with NLTK

Filed under: Machine Learning,NLTK,Python — Patrick Durusau @ 3:18 pm

First Steps with NLTK by Sujit Pal.

From the post:

Most of what I know about NLP is as a byproduct of search, ie, find named entities in (medical) text and annotating them with concept IDs (ie node IDs in our taxonomy graph). My interest in NLP so far has been mostly as a user, like using OpenNLP to do POS tagging and chunking. I’ve been meaning to learn a bit more, and I did take the Stanford Natural Language Processing class from Coursera. It taught me a few things, but still not enough for me to actually see where a deeper knowledge would actually help me. Recently (over the past month and a half), I have been reading the NLTK Book and the NLTK Cookbook in an effort to learn more about NLTK, the Natural Language Toolkit for Python.

This is not the first time I’ve been through the NLTK book, but it is the first time I have tried working out all the examples and (some of) the exercises (available on GitHub here), and I feel I now understand the material a lot better than before. I also realize that there are parts of NLP that I can safely ignore at my (user) level, since they are not either that baked out yet or because their scope of applicability is rather narrow. In this post, I will describe what I learned, where NLTK shines, and what one can do with it.

You will find the structured listing of links into the NLTK PyDocs very useful.
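If you want to follow along, the entry point is small. A minimal NLTK session in Python, assuming NLTK is installed (the model names passed to the downloader vary between NLTK versions, so adjust as needed):

```python
import nltk

# One-time model downloads; names here are for recent NLTK releases.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "Sujit annotates medical text with concept IDs from a taxonomy graph."
tokens = nltk.word_tokenize(sentence)     # tokenization
tagged = nltk.pos_tag(tokens)             # part-of-speech tagging
print(tagged[:3])                         # e.g. [('Sujit', 'NNP'), ('annotates', 'VBZ'), ('medical', 'JJ')]

# Simple noun-phrase chunking with a regexp grammar, as in the NLTK book.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
```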

Google Web Toolkit 2.5 with leaner code

Filed under: Ajax,GWT,Javascript — Patrick Durusau @ 12:55 pm

Google Web Toolkit 2.5 with leaner code

From the post:

According to its developers, version 2.5 of the Google Web Toolkit (GWT), a Java-based open source web framework for Ajax applications, offers significant performance improvements. Apparently, the overall code base has been reduced by 20 per cent, and the download size of the sample application dropped 39 per cent.

GWT is built around a Java-to-JavaScript compiler that allows developers to almost exclusively use Java when writing an application’s client and server code. The user interface code is translated into JavaScript and deployed to the browser when required. The technology recently became a discussion topic when Google introduced its Dart alternative to JavaScript; however, Google has assured the GWT community that it will continue to develop GWT for the foreseeable future.

Ready to improve your delivery of content?

Redis 2.6.2 Released!

Filed under: NoSQL,Redis — Patrick Durusau @ 12:30 pm

Redis 2.6.2 Released!

From the introduction to Redis:

Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

You can run atomic operations on these types, like appending to a string; incrementing the value in a hash; pushing to a list; computing set intersection, union and difference; or getting the member with highest ranking in a sorted set.

In order to achieve its outstanding performance, Redis works with an in-memory dataset. Depending on your use case, you can persist it either by dumping the dataset to disk every once in a while, or by appending each command to a log.

Redis also supports trivial-to-setup master-slave replication, with very fast non-blocking first synchronization, auto-reconnection on net split and so forth.

Other features include a simple check-and-set mechanism, pub/sub and configuration settings to make Redis behave like a cache.

You can use Redis from most programming languages out there.

Redis is written in ANSI C and works in most POSIX systems like Linux, *BSD, OS X without external dependencies. Linux and OSX are the two operating systems where Redis is developed and more tested, and we recommend using Linux for deploying. Redis may work in Solaris-derived systems like SmartOS, but the support is best effort. There is no official support for Windows builds, although you may have some options.
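Those operations map directly onto client calls. A minimal sketch with the redis-py client — a local server is assumed, and exact method signatures vary a bit between client versions:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Strings: append and read back.
r.set("greeting", "hello")
r.append("greeting", " world")
print(r.get("greeting"))                      # b'hello world'

# Hashes: increment a field atomically.
r.hincrby("page:home", "visits", 1)

# Lists: push onto a queue and read a range.
r.rpush("jobs", "resize-image", "send-email")
print(r.lrange("jobs", 0, -1))

# Sets: intersection of two tag sets.
r.sadd("tags:post1", "redis", "nosql")
r.sadd("tags:post2", "redis", "cache")
print(r.sinter("tags:post1", "tags:post2"))   # {b'redis'}
```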

The “in-memory” nature of Redis will be a good excuse for more local RAM. 😉

I noticed the most recent release of Redis at Alex Popescu’s myNoSQL.

Open Source Natural Language Spell-Checker [Disambiguation at the point of origin.]

Automattic Open Sources Natural Language Spell-Checker After the Deadline by Jolie O’Dell.

I am sure the original headline made sense to its author, but I wonder how a natural language processor would react to it?

My reaction, being innocent of any prior knowledge of the actors or the software, was: What deadline? I read it as a report of a missed deadline.

It is almost a “who’s on first” type headline. The software’s name is “After the Deadline.”

That confusion resolved, I read:

Matt Mullenweg has just announced on his blog that WordPress parent company Automattic is open sourcing After the Deadline, a natural-language spell-checking plugin for WordPress and TinyMCE that was only recently ushered into the Automattic fold.

Scarcely seven weeks after its acquisition was announced, After the Deadline’s core technology is being released under the GPL. Moreover, writes Mullenweg, “There’s also a new jQuery API that makes it easy to integrate with any text area.”

Interested parties can check out this demo or read the tech overview and grab the source code here.

I can use spelling/grammar suggestions. Particularly since I make the same mistakes over and over again.

Does that also mean I talk about the same subjects/entities over and over again? Or at least a limited range of subjects/entities?

Imagine a user configurable subject/entity “checker” that annotated recognized subjects/entities with an <a> element. Enabling the user to accept/reject the annotation.

Disambiguation at the point of origin.

The title of the original article could become:

“<a href="http://automattic.com/">Automattic</a> Open Sources Natural Language Spell-Checker <a href="http://www.afterthedeadline.com/">After the Deadline</a>”

Seems less ambiguous to me.

Certainly less ambiguous to a search engine.

You?
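A toy version of such a checker is easy to imagine. Here is a minimal Python sketch that wraps known entity strings in <a> elements using a hypothetical, user-supplied dictionary; a real tool would add the accept/reject step at the point of origin:

```python
import re

# Hypothetical user-configurable dictionary: entity string -> identifying URL.
entities = {
    "Automattic": "http://automattic.com/",
    "After the Deadline": "http://www.afterthedeadline.com/",
}

def annotate(text, entities):
    """Wrap each recognized entity in an <a> element pointing at its identifier."""
    # Longest names first so "After the Deadline" wins over any shorter overlap.
    for name in sorted(entities, key=len, reverse=True):
        link = '<a href="{}">{}</a>'.format(entities[name], name)
        text = re.sub(r"\b{}\b".format(re.escape(name)), link, text)
    return text

headline = "Automattic Open Sources Natural Language Spell-Checker After the Deadline"
print(annotate(headline, entities))
```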

100 most popular Machine Learning talks at VideoLectures.Net

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 9:17 am

100 most popular Machine Learning talks at VideoLectures.Net by Davor Orlič.

A treasure trove of lectures on machine learning.

If there is a sort order to this collection (title, author, length, subject), it escapes me.

Even browsing, you will find more than enough material to fill the coming weekend (and beyond).

October 25, 2012

Ditch Traditional Wireframes

Filed under: Design,Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 4:18 pm

Ditch Traditional Wireframes by Sergio Nouvel.

From the post:

Wireframes have played an increasingly leading role in the modern Web development process. They provide a simple way of validating user interface and layout and are cheaper and faster to produce than a final visual comp. However, most of the methods and techniques used to create them are far from being efficient, contradicting the principles and values that made wireframing useful in first place.

While this article is not about getting rid of the wireframing process itself, now is a good time for questioning and improving some of the materials and deliverables that have become de facto standards in the UX field. To make this point clear, let's do a quick review of the types of wireframes commonly used.

Especially appropriate since I mentioned the Health Design Challenge [$50K in Prizes – Deadline 30th Nov 2012] earlier today. You are likely to be using one or more of these techniques for your entry.

Hopefully Sergio’s comments will make your usage more productive and effective!

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight…(Hortonworks Inside)

Filed under: Hadoop,HDInsight,Hortonworks,Microsoft — Patrick Durusau @ 4:02 pm

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight at Strata NYC (Hortonworks Inside) by Russell Jurney.

From the post:

You don’t see many demos like the one given by Shawn Bice (Microsoft) today in the Regent Parlor of the New York Hilton, at Strata NYC. “Drive Smarter Decisions with Microsoft Big Data,” was different.

For starters – everything worked like clockwork. Live demos of new products are notorious for failing on-stage, even if they work in production. And although Microsoft was presenting about a Java-based platform at a largely open-source event… it was standing room only, with the crowd overflowing out the doors.

Shawn demonstrated working with Apache Hadoop from Excel, through Power Pivot, to Hive (with sampling-driven early results!?) and out to import third party data-sets. To get the full effect of what he did, you’re going to have to view a screencast or try it out but to give you the idea of what the first proper interface on Hadoop feels like…

My thoughts on reading Russell’s post:

  • A live product demo that did not fail? Really?
  • Is that tattoo copyrighted?
  • Oh, yes, +1!, big data has become real for millions of users.

How’s that for a big data book, tutorial, consulting, semantic market explosion?

Insisting on beautiful maps

Filed under: Cartography,Mapping,Maps — Patrick Durusau @ 3:00 pm

Insisting on beautiful maps by Nathan Yau.

Nathan calls our attention to the publication of:

the Atlas of Design, published by the North American Cartographic Information Society,….

It should definitely be on the short list of books for the holiday season!

Why Microsoft is committed to Hadoop and Hortonworks

Filed under: BigData,Hadoop,Hortonworks,Microsoft — Patrick Durusau @ 2:53 pm

Why Microsoft is committed to Hadoop and Hortonworks (a guest post at Hortonworks by Microsoft’s Dave Campbell).

From the post:

Last February at Strata Conference in Santa Clara we shared Microsoft’s progress on Big Data, specifically working to broaden the adoption of Hadoop with the simplicity and manageability of Windows and enabling customers to easily derive insights from their structured and unstructured data through familiar tools like Excel.

Hortonworks is a recognized pioneer in the Hadoop Community and a leading contributor to the Apache Hadoop project, and that’s why we’re excited to announce our expanded partnership with Hortonworks to give customers access to an enterprise-ready distribution of Hadoop that is 100 percent compatible with Windows Server and Windows Azure. To provide customers with access to this Hadoop compatibility, yesterday we also released new previews of Microsoft HDInsight Server for Windows and Windows Azure HDInsight Service, our Hadoop-based solutions for Windows Server and Windows Azure.

With this expanded partnership, the Hadoop community will reap the following benefits of Hadoop on Windows:

  • Insights to all users from all data:….
  • Enterprise-ready Hadoop with HDInsight:….
  • Simplicity of Windows for Hadoop:….
  • Extend your data warehouse with Hadoop:….
  • Seamless Scale and Elasticity of the Cloud:….

This is a very exciting milestone, and we hope you’ll join us for the ride as we continue partnering with Hortonworks to democratize big data. Download HDInsight today at Microsoft.com/BigData.

See Dave’s post for the details on “benefits of Hadoop on Windows” and then, like the man says:

Download HDInsight today at Microsoft.com/BigData.

Enabling Big Data Insight for Millions of Windows Developers [Your Target Audience?]

Filed under: Azure Marketplace,BigData,Hadoop,Hortonworks,Microsoft — Patrick Durusau @ 2:39 pm

Enabling Big Data Insight for Millions of Windows Developers by Shaun Connolly.

From the post:

At Hortonworks, we fundamentally believe that, in the not-so-distant future, Apache Hadoop will process over half the world’s data flowing through businesses. We realize this is a BOLD vision that will take a lot of hard work by not only Hortonworks and the open source community, but also software, hardware, and solution vendors focused on the Hadoop ecosystem, as well as end users deploying platforms powered by Hadoop.

If the vision is to be achieved, we need to accelerate the process of enabling the masses to benefit from the power and value of Apache Hadoop in ways where they are virtually oblivious to the fact that Hadoop is under the hood. Doing so will help ensure time and energy is spent on enabling insights to be derived from big data, rather than on the IT infrastructure details required to capture, process, exchange, and manage this multi-structured data.

So how can we accelerate the path to this vision? Simply put, we focus on enabling the largest communities of users interested in deriving value from big data.

You don’t have to wonder long what Shaun is reacting to:

Today Microsoft unveiled previews of Microsoft HDInsight Server and Windows Azure HDInsight Service, big data solutions that are built on Hortonworks Data Platform (HDP) for Windows Server and Windows Azure respectively. These new offerings aim to provide a simplified and consistent experience across on-premise and cloud deployment that is fully compatible with Apache Hadoop.

Enabling big data insight isn’t the same as capturing those insights for later use or re-use.

May just be me, but that sounds like a great opportunity for topic maps.

Bringing semantics to millions of Windows developers, that is.

Service-Oriented Distributed Knowledge Discovery

Filed under: Distributed Systems,Knowledge Discovery — Patrick Durusau @ 10:50 am

Service-Oriented Distributed Knowledge Discovery by Domenico Talia and Paolo Trunfio, University of Calabria, Rende, Italy.

The publisher’s summary reads:

A new approach to distributed large-scale data mining, service-oriented knowledge discovery extracts useful knowledge from today’s often unmanageable volumes of data by exploiting data mining and machine learning distributed models and techniques in service-oriented infrastructures. Service-Oriented Distributed Knowledge Discovery presents techniques, algorithms, and systems based on the service-oriented paradigm. Through detailed descriptions of real software systems, it shows how the techniques, models, and architectures can be implemented.

The book covers key areas in data mining and service-oriented computing. It presents the concepts and principles of distributed knowledge discovery and service-oriented data mining. The authors illustrate how to design services for data analytics, describe real systems for implementing distributed knowledge discovery applications, and explore mobile data mining models. They also discuss the future role of service-oriented knowledge discovery in ubiquitous discovery processes and large-scale data analytics.

Highlighting the latest achievements in the field, the book gives many examples of the state of the art in service-oriented knowledge discovery. Both novices and more seasoned researchers will learn useful concepts related to distributed data mining and service-oriented data analysis. Developers will also gain insight on how to successfully use service-oriented knowledge discovery in databases (KDD) frameworks.

The idea of service-oriented data mining/analysis is very compatible with topic maps as marketable information sets.

It is not yet available through any of my usual channels, but I would be cautious about $89.95 for 230 pages of text.

More comments to follow when I have a chance to review the text.

I first saw this at KDNuggets.

Data Preparation: Know Your Records!

Filed under: Data,Data Quality,Semantics — Patrick Durusau @ 10:25 am

Data Preparation: Know Your Records! by Dean Abbott.

From the post:

Data preparation in data mining and predictive analytics (dare I also say Data Science?) rightfully focuses on how the fields in ones data should be represented so that modeling algorithms either will work properly or at least won’t be misled by the data. These data preprocessing steps may involve filling missing values, reigning in the effects of outliers, transforming fields so they better comply with algorithm assumptions, binning, and much more. In recent weeks I’ve been reminded how important it is to know your records. I’ve heard this described in many ways, four of which are:
  • the unit of analysis
  • the level of aggregation
  • what a record represents
  • unique description of a record

A bit further on, Dean reminds us:

What isn’t always obvious is when our assumptions about the data result in unexpected results. What if we expect the unit of analysis to be customerID/Session but there are duplicates in the data? Or what if we had assumed customerID/Session data but it was in actuality customerID/Day data (where ones customers typically have one session per day, but could have a dozen)? (emphasis added)

Obvious once Dean says it, but how often do you question assumptions about data?

Do you know what impact incorrect assumptions about data will have on your operations?

If you investigate your assumptions about data, where do you record your observations?

Or will you repeat the investigation with every data dump from a particular source?

Describing data “in situ” could benefit you six months from now or your successor. (The data and or its fields would be treated as subjects in a topic map.)
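A quick way to test a unit-of-analysis assumption before modeling is to check it directly. A small sketch with pandas, where the file and column names are hypothetical:

```python
import pandas as pd

# Hypothetical clickstream extract; we *assume* one row per customerID/session.
df = pd.read_csv("sessions.csv")

assumed_grain = ["customerID", "sessionID"]

duplicates = df[df.duplicated(subset=assumed_grain, keep=False)]
if duplicates.empty:
    print("Data is at the assumed customerID/session grain.")
else:
    # The assumption fails: either true duplicates, or the records are really
    # at another grain (e.g. customerID/day rolled up over several sessions).
    print(f"{len(duplicates)} rows violate the assumed grain; sample:")
    print(duplicates.sort_values(assumed_grain).head())
```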

8th German Conference on Chemoinformatics [GCC 2012]

Filed under: Cheminformatics — Patrick Durusau @ 10:09 am

8th German Conference on Chemoinformatics [GCC 2012]

From the post:

The 8th German Conference on Chemoinformatics takes place in Goslar, Germany next month, and we are pleased to announce that once again, Journal of Cheminformatics will be the official publishing partner and poster session sponsor.

The conference runs from November 11th–13th and covers a wide range of topics around cheminformatics and chemical information including: Chemoinformatics and Drug Discovery; Molecular Modelling; Chemical Information, Patents and Databases; and Computational Material Science and Nanotechnology.

This will be the fourth year that Journal of Cheminformatics has been involved with the conference, and abstracts from the previous three meetings are freely available via the journal website.

The prior meeting abstracts are a very rich source of materials that merit your attention.
