Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 28, 2012

Once Upon A Subject Clearly…

Filed under: Identity,Marketing,Subject Identity — Patrick Durusau @ 4:22 pm

As I was writing up the GWAS Central post, the question occurred to me: does their mapping of identifiers take something away from topic maps?

My answer is no and I would like to say why if you have a couple of minutes. 😉 Seriously! It isn’t going to take that long. However long it has taken me to reach this point.

Every time we talk, write or otherwise communicate about a subject, we have at the same time identified that subject. Makes sense. We want whoever we are talking to, writing to or communicating with to understand what we are talking about. Hard to do if we don’t identify what subject(s) we are talking about.

We do it all day, every day. In public, in private, in semi-public places. 😉 And we use words to do it. To identify the subjects we are talking about.

For the most part, or at least fairly often, we are understood by other people. Not always, but most of the time.

The problem comes in when we start to gather up information from different people who may (or may not) use words differently than we do. So there is a much larger chance that we don’t mean the same thing by the same words. Or we may use different words to mean the same thing.

Words, which were our reliable servants for the most part, become far less reliable.

To counter that unreliability, we can create groups of words, mappings if you like, to keep track of what words go where. But, to do that, we have to use words, again.

Start to see the problem? We always use words to clear up our difficulties with words. And there isn’t any universal stopping place. The Cyc advocates would have us stop there, the SUMO crowd would have us stop over there, the Semantic Web folks yet somewhere else, and of course the topic map mavens at yet one or more places.

For some purposes, any one or more of those mappings may be adequate. A mapping is only as good, and good for only as long, as it is useful.

History tells us that every mapping will be replaced with other mappings. We would do well to understand and document the words we are using as part of our mappings, as well as we are able.

But if words are used to map words, where do we stop? My suggestion would be to stop as we always have, wherever looks convenient. So long as the mapping suits your present purposes, what more would you ask of it?

I am quite content to have such stopping places because it means we will always have more starting places for the next round of mapping!

Ironic isn’t it? We create mappings to make sense out of words and our words lay the foundation for others to do the same.

GWAS Central

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:22 pm

GWAS Central

From the website:

GWAS Central (previously the Human Genome Variation database of Genotype-to-Phenotype information) is a database of summary level findings from genetic association studies, both large and small. We actively gather datasets from public domain projects, and encourage direct data submission from the community.

GWAS Central is built upon a basal layer of Markers that comprises all known SNPs and other variants from public databases such as dbSNP and the DBGV. Allele and genotype frequency data, plus genetic association significance findings, are added on top of the Marker data, and organised the same way that investigations are reported in typical journal manuscripts. Critically, no individual level genotypes or phenotypes are presented in GWAS Central – only group level aggregated (summary level) data. The largest unit in a data submission is a Study, which can be thought of as being equivalent to one journal article. This may contain one or more Experiments, one or more Sample Panels of test subjects, and one or more Phenotypes. Sample Panels may be characterised in terms of various Phenotypes, and they also may be combined and/or split into Assayed Panels. The Assayed Panels are used as the basis for reporting allele/genotype frequencies (in `Genotype Experiments`) and/or genetic association findings (in ‘Analysis Experiments’). Environmental factors are handled as part of the Sample Panel and Assayed Panel data structures.

Although I mentioned GWAS some time ago, I saw it mentioned in Christophe Lalanne’s Bag of Tweets for March 2012 and, on taking another look, thought I should mention it again.

In part because, as the project description above makes clear, this is an aggregation-level site, not one that reaches into the details of studies, which may or may not matter for some researchers. That aggregation leaves a gap for analysis of the underlying data, plus mapping it to other data!
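If it helps to see the shape of a submission, here is a rough sketch of the hierarchy in the description quoted above as Python dataclasses. The class and field names are my own guesses for illustration, not GWAS Central’s actual schema:

```python
# Rough sketch of the GWAS Central submission hierarchy described above.
# Names and fields are illustrative only, not the project's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Marker:                      # basal layer: SNPs and other variants
    accession: str                 # e.g. a dbSNP rs identifier

@dataclass
class Phenotype:
    description: str

@dataclass
class SamplePanel:                 # a group of test subjects
    name: str
    phenotypes: List[Phenotype] = field(default_factory=list)

@dataclass
class AssayedPanel:                # sample panels combined and/or split
    source_panels: List[SamplePanel]

@dataclass
class Experiment:                  # "genotype" (frequencies) or "analysis" (associations)
    kind: str
    panel: AssayedPanel
    marker_results: dict           # marker accession -> summary-level numbers

@dataclass
class Study:                       # roughly equivalent to one journal article
    title: str
    experiments: List[Experiment] = field(default_factory=list)
    sample_panels: List[SamplePanel] = field(default_factory=list)
    phenotypes: List[Phenotype] = field(default_factory=list)
```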

Openfmri.org

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:22 pm

Openfmri.org

From the webpage:

OpenfMRI.org is a project dedicated to the free and open sharing of functional magnetic resonance imaging (fMRI) datasets, including raw data.

Now that’s a data set you don’t see every day!

Not to mention being one that would be ripe to link into medical literature, hospital/physician records, etc.

First seen in Christophe Lalanne’s Bag of Tweets for March, 2012.

Kibana

Filed under: ElasticSearch,Kibana,logstash — Patrick Durusau @ 4:22 pm

Kibana

From the webpage:

You have logs. Billions of lines of data. You shipped it, dated it, parsed it and stored it. Now what do you do with it? Now you make sense of it. Kibana helps you do that. Kibana is an alternative browser-based interface for Logstash and ElasticSearch that allows you to efficiently search, graph, analyze and otherwise make sense of a mountain of logs.

Any thoughts of what data you would map to such an interface? Or map to the aggregations that it offers?

NASA-GISS Datasets and Images

Filed under: Dataset,NASA — Patrick Durusau @ 4:22 pm

NASA-GISS Datasets and Images

Data and image sets from the Goddard Institute for Space Studies.

A number of interesting data/image sets along with links to similar material.

If you are looking for data sets to integrate with other public data sets, definitely worth a look.

Visualizing a set of Hiveplots with Neo4j

Filed under: Gremlin,Hive Plots,Neo4j — Patrick Durusau @ 4:21 pm

Visualizing a set of Hiveplots with Neo4j by Max De Marzi.

Max writes:

If you want to learn more about Hive Plots, take a look at his website and this presentation (it is quite large at 20 MB). I cannot do it justice in this short blog post, and in all honesty haven’t had the time to study it properly.

Today I just want to give you a little taste of Hiveplots. I am going to visualize the github graphs of nine languages you might not have heard of: Boo, Dylan, Factor, Gosu, Mirah, Nemerle, Nu, Parrot, Self. I’m not going to show you how to create the graph this time, because this is real data we are using. You can take a look at it in the data folder on github.

The graph is basically: (Language)–(Repository)–(User). There are two relationships between Repository and User, wrote and forked.

Hive plots are an effort by Martin Krzywinski to enable viewers of a graph visualization to distinguish between two or more graphs and to recognize key features of those graphs. His website is: http://www.hiveplot.com/.

Graph Theory in LaTeX 2

Filed under: Graphs,TeX/LaTeX,Visualization — Patrick Durusau @ 4:21 pm

Graph Theory in LaTeX 2: Combinatorial graphs drawn using LaTeX

Great examples of the use of the LaTeX graph packages can be found at Altermundus.

You need to see the examples to appreciate how they would look in a paper or professional publication.

Visualization of Hyperedges in Fixed Graph Layouts

Filed under: Graphs,Hypergraphs — Patrick Durusau @ 4:21 pm

Visualization of Hyperedges in Fixed Graph Layouts by Martin Junghans.

Abstract:

Graphs and their visualizations are widely used to communicate the structure of complex data in a formal way. Hypergraphs are dedicated to represent real-world data as they allow to relate multiple objects with each other. However, existing graph drawing techniques lack the ability to embed hyperedges into fixed two-dimensional graph layouts. We utilize a set of curves to visualize hyperedges and employ an energy-based technique to position them in the layout. By avoiding node occlusion and cluster intersections we are able to preserve the expressiveness of the given graph layout. Additionally, we investigate techniques to reduce the visual complexity of hypergraph drawings. A comprehensive evaluation using real-world data sets demonstrates the suitability of the proposed hyperedge layout techniques.

A thesis I ran across today while researching the display of hyperedges.

Graphs are being used for the storage/analysis/visualization of data. Given the history of hypergraphs in CS research, hypergraphs aren’t far behind. Now would be the time to get ahead of the curve, however briefly.

Designing User Experiences for Imperfect Data

Filed under: Data Quality,Interface Research/Design,Search Interface,Searching — Patrick Durusau @ 4:21 pm

Designing User Experiences for Imperfect Data by Matthew Hurst.

Matthew writes:

Any system that uses some sort of inference to generate user value is at the mercy of the quality of the input data and the accuracy of the inference mechanism. As neither of these can be guaranteed to be perfect, users of the system will inevitably come across incorrect results.

In web search we see this all the time with irrelevant pages being surfaced. In the context of track // microsoft, I see this in the form of either articles that are incorrectly added to the wrong cluster, or articles that are incorrectly assigned to no cluster, becoming orphans.

It is important, therefore, to take these imperfections into account when building the interface. This is not necessarily a matter of pretending that they don’t exist, or tricking the user. Rather it is a problem of eliciting an appropriate reaction to error. The average user is not conversant in error margins and the like, and thus tends to over-weight errors leading to the perception of poorer quality in the good stuff.

I am not real sure how Matthew finds imperfect data but I guess I will just have to take his word for it. 😉

Seriously, I think he is spot on in observing that expecting users to hunt-n-peck through search results is wearing a bit thin. That is going to be particularly so when better search systems make the hidden cost of hunt-n-peck visible.

Do take the time to visit his track // microsoft site.

Now imagine your own subject specific and dynamic website. Or even search engine. Could be that search engines for “everything” are the modern day dinosaurs. Big, clumsy, fairly crude.

Data in an Alien Context: Kepler Visualization Source Code

Filed under: Astroinformatics,Graphics,Visualization — Patrick Durusau @ 4:20 pm

Data in an Alien Context: Kepler Visualization Source Code

Jer Thorp released a visualization of the exoplanets discovered by the Kepler project last year and has updated that visualization to include an additional 1091 candidates. He has also released the source code for his visualization.

Imagine a marriage of Jer’s visualization with additional information as it is discovered by different projects, using different techniques and formats. Topic maps anyone?

March 27, 2012

Hive, Pig, Scalding, Scoobi, Scrunch and Spark

Filed under: Hive,Pig,Scalding,Scoobi,Scrunch,Spark — Patrick Durusau @ 7:18 pm

Hive, Pig, Scalding, Scoobi, Scrunch and Spark by Sami Badawi.

From the post:

Comparison of Hadoop Frameworks

I had to do simple processing of log files in a Hadoop cluster. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data. There are several high level Hadoop frameworks that make Hadoop programming easier. Here is the list of Hadoop frameworks I tried:

  • Pig
  • Scalding
  • Scoobi
  • Hive
  • Spark
  • Scrunch
  • Cascalog

The task was to read log files, join them with other data and do some statistics on arrays of doubles. Programming this without Hadoop is simple, but it caused me some grief with Hadoop.

This blog post is not a full review, but my first impression of these Hadoop frameworks.
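To make the task concrete, here is a minimal sketch of the non-Hadoop version of the job Sami describes: read log files, join with other data, compute statistics on arrays of doubles. File names and field layout are invented for illustration:

```python
# A minimal, Hadoop-free sketch of the task described above: read log files,
# join them with other data, and compute statistics over arrays of doubles.
# File names and field layout are made up for illustration.
import csv
import statistics
from collections import defaultdict

# "other data": a lookup table mapping an id to a category
lookup = {}
with open("categories.csv") as f:
    for row in csv.reader(f):
        lookup[row[0]] = row[1]

# read the logs, join on id, and collect the double-valued measurements
values_by_category = defaultdict(list)
with open("events.log") as f:
    for line in f:
        record_id, value = line.split()[:2]
        category = lookup.get(record_id, "unknown")
        values_by_category[category].append(float(value))

# simple statistics per joined group
for category, values in sorted(values_by_category.items()):
    print(category, len(values), statistics.mean(values), statistics.pstdev(values))
```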

Everyone has a favorite use case.

How does your use case fare with different frameworks for Hadoop? (We won’t ever know if you don’t say.)

Neo4j REST API Tutorial

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:18 pm

Neo4j REST API Tutorial

From the post:

Using the Neo4j REST API

After starting the Neo4j server, load the HTTP console by clicking here. The HTTP console uses the Neo4j REST API to interact with the database. Even though you can use the HTTP shell for manually interacting with the database, it is best used for prototyping the REST calls your app would be making to the database. Unless Neo4j provides bindings for your language (Java, Python, Ruby), you will most probably be using the REST API to talk to the database.

For your reference, the Neo4j documentation is located at http://docs.neo4j.org/chunked/stable/.

Let’s see how some of the common operations can be performed in Neo4j Nodes and Relationships using the REST API.

Excellent post on the Neo4j REST API.

The one improvement I would suggest is one of presentation. Code/responses that run off the viewable page aren’t very helpful.
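If you want to try the calls from a script rather than the HTTP console, here is a rough sketch using Python’s requests library against a locally running server. The endpoints follow the Neo4j 1.x REST API as I understand it, so check the documentation linked above for your version:

```python
# Rough sketch of the kind of REST calls the tutorial walks through, using the
# requests library against a locally running Neo4j 1.x server. Check the Neo4j
# documentation for the exact endpoints of your version.
import requests

BASE = "http://localhost:7474/db/data"

# create two nodes with properties
alice = requests.post(BASE + "/node", json={"name": "Alice"}).json()
bob = requests.post(BASE + "/node", json={"name": "Bob"}).json()

# relate them; each node's "self" entry is its URI in the REST API
rel = requests.post(alice["self"] + "/relationships",
                    json={"to": bob["self"], "type": "KNOWS"}).json()

print(rel["self"])
```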

First seen at Alex Popescu’s myNoSQL.

Neo4j 1.7.M02 – Cache Cachet, Matching Matchers, and Debian Debs

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 7:18 pm

Neo4j 1.7.M02 – Cache Cachet, Matching Matchers, and Debian Debs by Peter Neubauer.

Just the highlights to tempt you into reading the full post:

Neo4j 1.7 Milestone 2 introduces a trio of interesting advances: a new cache scheme, targeted pattern matching in Cypher, and Debian install packages. Faster, smarter, and more accessible.

Atomic Array Cache – GC resistant, 10x faster, 10x more capacity

Under the hood, Neo4j runs on the JVM (that’s the ‘J’ in ‘Neo4j’). And as every java developer knows: the Garbage Collector is your friend, the Garbage Collector is your enemy. The Garbage Collector (GC) helpfully relieves the developer from worrying about memory management. Unhelpfully, garbage collection introduces unpredictability in an application’s responsiveness. While partial garbage collection can be a nuisance, full garbage collection can be disastrous, pausing an application for uncomfortably long durations (from a few to far too many seconds). Sadly, there are no really good solutions to consistently avoid full GC pauses or to even prevent the GC from kicking in at inconvenient times.

….

Cypher matching matchers

By describing what you want and leaving the how-to-do-it up to someone else, a declarative language like Cypher presents many opportunities for optimizations. Our brilliant Michael Hunger recently joined Cypher master Andres Taylor for a hard look at pattern matching, initiating a classification of common use cases. Then the duo targeted different pattern matcher implementations to fit each type of query. The results are promising.

….

Got Debian? apt-get neo4j

Ops people rejoice, because Neo4j is now a simple `apt-get` away for any Debian-based Linux distro (like the ever popular Ubuntu). The debian installer is now part of the regular build and deploy chain, pushing out to our own debian repository. See http://debian.neo4j.org/ for details, following these steps to install:

….

Download Neo4j 1.7.M02 today!

Cypher Query Language

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 7:18 pm

Cypher Query Language

Slides by Max De Marzi at Chicago Graph Database Meet-up.

Great set of slides.

Illustrates the syntax and its “ascii art” perspective before moving into more complex queries.
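If you haven’t seen Cypher, the “ascii art” means you draw the pattern you want to match right in the query: nodes in parentheses, relationships as arrows. Here is a rough sketch, sent to a local server over the REST API from Python. The endpoint and exact syntax vary by Neo4j version, so treat it as illustrative:

```python
# Illustrative only: posting a Cypher query to the (legacy) REST endpoint of a
# local Neo4j server. The pattern (n)-[:KNOWS]->(friend) is the "ascii art":
# you draw the graph shape you want to match.
import requests

query = """
START n=node({node_id})
MATCH (n)-[:KNOWS]->(friend)
RETURN friend.name
"""

resp = requests.post("http://localhost:7474/db/data/cypher",
                     json={"query": query, "params": {"node_id": 1}})
for row in resp.json()["data"]:
    print(row[0])
```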

Saw this first at Alex Popescu’s myNoSQL.

Data Mining Bitly

Filed under: Bitly,Data Mining — Patrick Durusau @ 7:17 pm

I was reading a highly entertaining post by Nathan Yau, What News Sites People are Reading, by State, that had the following quote:

Bitly’s dataset, wrangled by data scientists Hilary Mason and Anna Smith, consists of every click on every Bitly link on the Web. Bitly makes its data available publicly—just add ‘+’ to the end of any Bitly link to see how many clicks it’s gotten.

It’s a little more complicated than that but not by much.

From the Bitly help page:

Beyond basics: Capturing data and using metrics

How do I see how many times a bitly link was clicked on?

Every bitly link has an info page, which reveals the number of related clicks and other relevant data. You can get to the info page in a few different ways. For example, to view the info page for the bitly link http://bit.ly/CUjV

You can also use the sidebar bookmarklet to instantly get information for your bitly link, or you can see basic information about all of your links on your Manage page.

What do the numbers “x” out of “x” mean next to my links?

The numbers next to your links might say “8 out of 8” or “14 out of 648,” or something else. The top number is the number of clicks that your bitly link specifically generated, for example: 30. The bottom number is the total number of bitly clicks generated for all bitly links created for that URL as a whole, for example: 100. So if you see “30 out of 100” next to your link, that means the bitly link you created generated 30 clicks and 70 clicks were generated by other bitly links (from other bitly users) to that URL.

Why does the number on top always match the number of total clicks, even when I’m not the one who was responsible for the clicks?

The numbers displayed are total decodes (not total click-throughs), which JavaScript measures on the page. Decodes can be caused by bots or applications, like browser plug-ins, which expand the underlying URL without causing a click-through. If you download a browser plug-in that automatically expands short URLs, for example, it looks a lot like a human user to an analytics program. Absent JavaScript on the page, it’s hard to distinguish between a decode and an intentional click-through. Ultimately, bitly complements rather than replaces JavaScript-based analytics utilities such as Google Analytics or Chartbeat.

If someone else shortens the same URL, do we both see the same number of clicks?

It depends on whether a user is signed in. bitly tracks the total number of clicks pointing to a single long link. Signed-in bitly users receive a unique bitly link that lets them track clicks and other data separately, while still seeing totals for all bitly links pointing to the same long link. But users who are not signed in all share the same bitly link.

Is all bitly tracking data publically available? Where can I view it?

To learn more about the life of any given bitly url, simply add a “+” sign to the end of that link and you will be directed to a page with that link’s statistics.

The permanent 301 redirects of bitly mean that multiple bitly urls can point towards a single webpage.

Sounds like having multiple identifiers, doesn’t it?

What’s more, I can create a bitly redirect for a webpage and then by adding “+” to the end, see if there are other redirects for that page.
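A sketch of the “+” trick in Python, in case you want to script it (the helper function is mine, not part of any bitly library):

```python
# A small sketch of the "+" trick described above: given any bitly link,
# its public info page is the same URL with a "+" appended.
import requests

def info_page(bitly_url: str) -> str:
    """Return the URL of the public stats page for a bitly link."""
    return bitly_url.rstrip("/") + "+"

short = "http://bit.ly/CUjV"
stats_url = info_page(short)          # -> http://bit.ly/CUjV+
page = requests.get(stats_url)        # the stats page itself (HTML)
print(stats_url, page.status_code)
```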

Scientific Visualization Studio (NASA)

Filed under: Data,Visualization — Patrick Durusau @ 7:17 pm

Scientific Visualization Studio (NASA)

From the website:

The mission of the Scientific Visualization Studio is to facilitate scientific inquiry and outreach within NASA programs through visualization. To that end, the SVS works closely with scientists in the creation of visualization products, systems, and processes in order to promote a greater understanding of Earth and Space Science research activities at Goddard Space Flight Center and within the NASA research community.

All the visualizations created by the SVS (currently totalling over 4,200) are accessible to you through this Web site. More recent animations are provided as MPEG-4s, MPEG-2s, and MPEG-1s. Some animations are available in high definition as well as NTSC format. Where possible, the original digital images used to make these animations have been made accessible. Lastly, high and low resolution stills, created from the visualizations, are included, with previews for selective downloading.

A data visualization site that may have visualizations that work as content for your topic maps and/or give you creative ideas for visualizing the data reported by your topic maps.

For example, consider: Five-Year Average Global Temperature Anomalies from 1880 to 2011. Easier than reporting all the underlying data. This may work for some subjects and less well for others.

Publicly available large data sets for database research

Filed under: Data,Dataset — Patrick Durusau @ 7:17 pm

Publicly available large data sets for database research by Daniel Lemire.

Daniel summarizes large (> 20 GB) data sets that may be useful for database research.

If you know of any data sets that have been overlooked or that become available, please post a note on this entry at Daniel’s blog.

Result Grouping Made Easier

Filed under: Lucene — Patrick Durusau @ 7:17 pm

Result Grouping Made Easier

From the post:

Lucene has had result grouping for a while now, as a contrib in Lucene 3.x and as a module in the upcoming 4.0 release. In both releases the actual grouping is performed with Lucene Collectors. As a Lucene user you need to use several of these Collectors in searches. However, these Collectors have many constructor arguments, so using grouping in pure Lucene apps can become quite cumbersome. The example below illustrates this.

(code omitted)

In the above example basic grouping with caching is used and also the group count is retrieved. As you can see there is quite a lot of coding involved. Recently a grouping convenience utility has been added to the Lucene grouping module to alleviate this problem. As the code example below illustrates, using the GroupingSearch utility is much easier than interacting with actual grouping collectors.

Normally the document count is returned as the hit count. However, in the situation where groups rather than documents are being used as hits, the document count will not work for pagination. For this reason the group count can be used to get correct pagination. The group count returns the number of unique groups matching the query, and it can in this case be used as the hit count since the individual hits are groups.
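The pagination point is easier to see with a toy example. This is not Lucene’s API, just the idea in plain Python:

```python
# Not Lucene's API -- just the pagination idea from the quote, in plain Python.
# When groups (not documents) are the "hits", the page count must come from
# the number of unique groups, not from the number of matching documents.
from collections import OrderedDict

# pretend these are matching documents, each with a grouping field
hits = [
    {"id": 1, "author": "ann"}, {"id": 2, "author": "bob"},
    {"id": 3, "author": "ann"}, {"id": 4, "author": "cid"},
    {"id": 5, "author": "bob"}, {"id": 6, "author": "ann"},
]

groups = OrderedDict()
for doc in hits:
    groups.setdefault(doc["author"], []).append(doc)

doc_count = len(hits)        # 6 -- wrong basis for paging over groups
group_count = len(groups)    # 3 -- the correct "hit count" for pagination

page_size = 2
pages = -(-group_count // page_size)   # ceiling division
print(doc_count, group_count, pages)
```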

There are really two lessons here.

The first lesson is that if you need the GroupingSearch utility, use it.

Second is that Lucene is evolving rapidly enough that if you are a regular user, you need to be monitoring developments and releases carefully.

Tommie says: Balisage Submissions Due in 25 Days!

Filed under: Conferences — Patrick Durusau @ 7:17 pm

Tommie Usdin wrote to say:

It is time to stop thinking that you should get started on your Balisage paper and START writing. A successful Balisage submission is:

  • fresh
  • interesting
  • well thought out
  • carefully written.

This can’t be done in an hour. Not even by someone as smart and creative as you are!

If you have any questions about Balisage, if you want to bounce your paper concept off someone for a little pre-submission feedback, or if we can help you in any way, please write to info@balisage.net.

The Balisage Call for Participation is at: http://www.balisage.net/Call4Participation.html

A symposium on Quality Assurance and Quality Control in XML will precede Balisage this year. Read about it at: http://balisage.net/QA-QC/, and consider submitting a paper for the symposium, too.

Help make Balisage the conference you want it to be.

For the calendar challenged, that means 20 April 2012.

Let’s be honest with each other. Every year we cross the border into Canada thinking “This will be the year some model of my gender choice invites me to stay in Montreal.”

As a presenter (that is a person who submits a very good paper that is accepted), you will increase your odds of being noticed by a model.

How much?

Presenters who have overstayed in Canada are in violation of Canadian immigration laws, so I can’t name names. You understand. Let’s say as opposed to being an attendee (which is a lot of fun), your odds really go up.

😉

Seriously, if you are interested in the next wave of markup techniques and strategies, Balisage is the one conference to attend all year.

There are conferences where there is a lot of hand waving (wringing?) about the future, which are good for a few laughs.

Balisage is where the future of markup is being made real, one paper at a time.

PhD proposal on distributed graph data bases

Filed under: Graph Databases,Graphs — Patrick Durusau @ 7:16 pm

PhD proposal on distributed graph data bases by René Pickhardt.

From the post:

Over the last week we had our off-campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss our research ideas with a post doc and other PhD candidates for more than 6 hours. This led to the structure, todos and timetable of my PhD proposal. This has to be finalized over the next couple of days but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology.

If you have the time, please read René’s proposal and comment on it.

Although I am no stranger to multi-year research projects, ;-), I must admit to pausing when I read:

Here I will name at least the related work in the following fields:

  • graph processing (Signal Collect, Pregel,…)
  • graph theory (especially data structures and algorithms)
  • (dynamic/adaptive) graph partitioning
  • distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
  • redundancy vs fault tolerance
  • network programming (protocols, latency vs bandwidth)
  • data bases (ACID, multiple user access, …)
  • graph data base query languages (SPARQL, Gremlin, Cypher,…)
  • Social Network and graph analysis and modelling.

Unless René is planning on citing only the most recent work in each area, describing related work and establishing how it is related to “distributed graph data bases” will consume the projected time period for his dissertation work.

Each of the areas listed is a complete field unto itself and has many PhD sized research problems related to “distributed graph data bases.”

Almost all PhD proposals start with breathtaking scope, but the ones that make a real contribution (and are completed) identify specific problems that admit of finite research programs.

I think René should revise his proposal to focus on some particular aspect of “distributed graph data bases.” I suspect even the history of one aspect of such databases will expand fairly rapidly upon detailed consideration.

The need for a larger, global perspective on “distributed graph data bases” will still be around after René finishes a less encompassing dissertation. I promise.

What is your advice?

March 26, 2012

The unreasonable necessity of subject experts

Filed under: Data Mining,Domain Expertise,Subject Experts — Patrick Durusau @ 6:40 pm

The unreasonable necessity of subject experts – Experts make the leap from correct results to understood results by Mike Loukides.

From the post:

One of the highlights of the 2012 Strata California conference was the Oxford-style debate on the proposition “In data science, domain expertise is more important than machine learning skill.” If you weren’t there, Mike Driscoll’s summary is an excellent overview (full video of the debate is available here). To make the story short, the “cons” won; the audience was won over to the side that machine learning is more important. That’s not surprising, given that we’ve all experienced the unreasonable effectiveness of data. From the audience, Claudia Perlich pointed out that she won data mining competitions on breast cancer, movie reviews, and customer behavior without any prior knowledge. And Pete Warden (@petewarden) made the point that, when faced with the problem of finding “good” pictures on Facebook, he ran a data mining contest at Kaggle.

A good impromptu debate necessarily raises as many questions as it answers. Here’s the question that I was left with. The debate focused on whether domain expertise was necessary to ask the right questions, but a recent Guardian article,”The End of Theory,” asked a different but related question: Do we need theory (read: domain expertise) to understand the results, the output of our data analysis? The debate focused on a priori questions, but maybe the real value of domain expertise is a posteriori: after-the-fact reflection on the results and whether they make sense. Asking the right question is certainly important, but so is knowing whether you’ve gotten the right answer and knowing what that answer means. Neither problem is trivial, and in the real world, they’re often closely coupled. Often, the only way to know you’ve put garbage in is that you’ve gotten garbage out.

By the same token, data analysis frequently produces results that make too much sense. It yields data that merely reflects the biases of the organization doing the work. Bad sampling techniques, overfitting, cherry picking datasets, overly aggressive data cleaning, and other errors in data handling can all lead to results that are either too expected or unexpected. “Stupid Data Miner Tricks” is a hilarious send-up of the problems of data mining: It shows how to “predict” the value of the S&P index over a 10-year period based on butter production in Bangladesh, cheese production in the U.S., and the world sheep population.

An interesting post and debate. Both worth the time to read/watch.

I am not surprised the “cons” won, saying that machine learning is more important than subject expertise, but not for the reasons Mike gives.

True enough, data is said to be “unreasonably” effective, but when judged against what?

When asked, 90% of all drivers think they are better than average drivers. If I remember averages, there is something wrong with that result. 😉

The trick, according to Daniel Kahneman, is that drivers create an imaginary average and then say they are better than that.

I wonder what “average” data is being evaluated against?

Attain Apache Solr Coding Chops

Filed under: Solr — Patrick Durusau @ 6:37 pm

Attain Apache Solr Coding Chops by Peter Wolanin and Chris Pliakas.

From the description:

This session is for those who are excited by the great power of Apache Solr search for Drupal but want to understand how to customize their search and use Solr to power parts of the site. Join us for a technical deep dive into the world of Apache Solr search integration focused on the expanded possibilities in Drupal 7, as well as the improving overlap between the front-end components that interface with Solr and other search back ends like the PHP Zend lucene api.

Drupal oriented but still a good review of various options for using Solr.

We’re Not Very Good Statisticians

Filed under: Analytics,Statistics — Patrick Durusau @ 6:36 pm

We’re Not Very Good Statisticians by Steve Miller.

From the post:

I’ve received several emails/comments about my recent series of blogs on Duncan Watts’ interesting book “Everything is Obvious: *Once You Know the Answer — How Common Sense Fails Us.” Watts’ thesis is that the common sense that generally guides us well for life’s simple, mundane tasks often fails miserably when decisions get more complicated.

Three of the respondents suggested I take a look at “Thinking Fast and Slow,” by psychologist Daniel Kahneman, who along with the late economist Amos Tversky, was awarded the Nobel Prize in Economic Sciences for “seminal work in psychology that challenged the rational model of judgment and decision making.”

Steve’s post and the ones to follow are worth a close read.

When data, statistical or otherwise, agrees with me, I take that as a sign to evaluate it very carefully. Your mileage may vary.

10 Free, Standalone and Easy to Use UML Editors

Filed under: Graphics,UML — Patrick Durusau @ 6:36 pm

10 Free, Standalone and Easy to Use UML Editors by Çağdaş Başaraner.

From the post:

Below is a compilation of UML drawing & editing tools which are:

  • Free (and most of them are open source),
  • Standalone (not installed as plug-in or add-in),
  • Easy to download and install,
  • No need for registration and activation keys,
  • Fast to start and use.

Note: The last 2 editors are text-based web UML tools.

Curious, rather than creating a separate graphic language for topic maps, would it be useful to annotate/extend the existing UML language with topic map constructs?

Levenshtein distance in C++ and code profiling in R

Filed under: Levenshtein Distance,Lexical Analyzer,Stemming — Patrick Durusau @ 6:36 pm

Levenshtein distance in C++ and code profiling in R by Dzidorius Martinaitis.

From the post:

At work, the client requested that the existing search engine accept singular and plural forms equally, e.g. “partner” and “partners” would lead to the same result.

The first option is stemming. In that case, the search engine would use the root of a word, e.g. “partn”. However, stemming has many weaknesses: two different words might have the same root, a user can misspell the root of the word, and except for English and a few other languages it is not that trivial to implement stemming.

Levenshtein distance comes as the second option. The algorithm is simple – you have two words and you calculate the difference between them. You can insert, delete or replace any character, but it will cost you. Let’s imagine a user enters “Levenstin distances” into the search engine and expects to find relevant information. However, he just made 2 errors by misspelling the author’s name and he used the plural form of “distance”. If the search engine accepts 3 errors, the user will get relevant information.

The challenge comes when you have a dictionary of terms (e.g. more than 1 million) and you want to get similar terms based on Levenshtein distance. You can visit every entry in the dictionary (very costly) or you can push the dictionary into a trie. Do you need a proof of the cost? There we go:

Appreciate the comparison of approaches based on data, but wondering why “professional” stemming, like you find in Solr, was not investigated. Will post a comment asking and report back.

You are likely to encounter this sort of issue in almost all topic map authoring activities.
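If you want to experiment with the distance itself before digging into the C++ version, here is a minimal Python sketch of the dynamic programming computation. The trie-based dictionary search is the interesting optimization in the post and is not shown here:

```python
# Minimal dynamic-programming Levenshtein distance; the trie-based dictionary
# search described in the post is the real optimization and is not shown here.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Levenstin distances", "Levenshtein distance"))  # 3
```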

Busting 10 Myths about Hadoop

Filed under: Hadoop — Patrick Durusau @ 6:36 pm

Busting 10 Myths about Hadoop – Hadoop is still misunderstood by many BI professionals by Philip Russom.

Philip counters the ten myths with ten facts (explanations are in his post):

  • Fact #1. Hadoop consists of multiple products.
  • Fact #2. Hadoop is open source but available from vendors, too.
  • Fact #3. Hadoop is an ecosystem, not a single product.
  • Fact #4. HDFS is a file system, not a database management system (DBMS).
  • Fact #5. Hive resembles SQL but is not standard SQL.
  • Fact #6. Hadoop and MapReduce are related but don’t require each other.
  • Fact #7. MapReduce provides control for analytics, not analytics per se.
  • Fact #8. Hadoop is about data diversity, not just data volume.
  • Fact #9. Hadoop complements a DW; it’s rarely a replacement.
  • Fact #10. Hadoop enables many types of analytics, not just Web analytics.

If you are unclear on any of these points, please see Philip’s post. (And/or sign up for Hadoop training.)

The Difference Between Interaction and Association

Filed under: Mathematics,Statistics — Patrick Durusau @ 6:35 pm

The Difference Between Interaction and Association by Karen Grace-Martin.

From the post:

It’s really easy to mix up the concepts of association (a.k.a. correlation) and interaction. Or to assume if two variables interact, they must be associated. But it’s not actually true.

In statistics, they have different implications for the relationships among your variables, especially when the variables you’re talking about are predictors in a regression or ANOVA model.

Association

Association between two variables means the values of one variable relate in some way to the values of the other. Association is usually measured by correlation for two continuous variables and by cross tabulation and a Chi-square test for two categorical variables.

Unfortunately, there is no nice, descriptive measure for association between one categorical and one continuous variable, but either one-way analysis of variance or logistic regression can test an association (depending upon whether you think of the categorical variable as the independent or the dependent variable).

Essentially, association means the values of one variable generally co-occur with certain values of the other.

Interaction

Interaction is different. Whether two variables are associated says nothing about whether they interact in their effect on a third variable. Likewise, if two variables interact, they may or may not be associated.

An interaction between two variables means the effect of one of those variables on a third variable is not constant—the effect differs at different values of the other.

You will most likely be using statistics, or at least discussing topic maps with analysts who use statistics, so be prepared to distinguish “association” in the statistics sense from association as you use it in the topic maps sense. They are pronounced the same way. 😉

Depending upon the subject matter of your topic map, you may well be describing “interaction,” but again, not in the sense that Karen illustrates in her post.

The world of semantics is a big place so be careful out there.
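A quick numeric sketch of the distinction, with simulated data and made-up variable names: x1 and x2 are uncorrelated (no association), yet the effect of x1 on y flips sign depending on x2 (an interaction):

```python
# x1 and x2 are (nearly) uncorrelated -- no association -- yet the effect of
# x1 on y depends on the value of x2, i.e. they interact. Data is simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 * x2 + rng.normal(scale=0.1, size=n)   # pure interaction effect

# association: correlation between x1 and x2 is ~0
print("corr(x1, x2):", round(np.corrcoef(x1, x2)[0, 1], 3))

# interaction: the slope of y on x1 differs at different values of x2
low, high = x2 < 0, x2 >= 0
slope_low = np.polyfit(x1[low], y[low], 1)[0]
slope_high = np.polyfit(x1[high], y[high], 1)[0]
print("slope of y on x1 when x2 < 0: ", round(slope_low, 2))
print("slope of y on x1 when x2 >= 0:", round(slope_high, 2))
```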

Accountable Government – Stopping Improper Payments

Filed under: Marketing — Patrick Durusau @ 6:35 pm

Accountable Government – Stopping Improper Payments by Kimberley Williams.

Kimberley cites a couple of the usual food-stamp and unemployment-fraud cases to show the need for data integration.

I find it curious that small-fry fraud (food stamps, welfare, unemployment) is nearly always cited as the basis for better financial controls in government.

The question that needs to be asked is: what is the ROI for stopping small-fry fraud? Answering it would require a reliable estimate of the fraud versus the expense of the better financial controls needed to stop it. If the expense is greater than the fraud, why bother?

On the other hand, defense contractor fraud may justify data integration and attempts at better financial controls. For example, for the fiscal year 2009, the Defense Criminal Investigative Service recovered $2,077,282,746. That’s billions with a B.

That is money recovered, not estimated fraud.

From the following year, just in case you want to personalize the narrative a bit:

A South Carolina defense contractor has agreed to pay the U.S. government more than $1 million to resolve fraud allegations related to a contract with the Defense Department.

U.S. Attorney Bill Nettles said Wednesday the Defense Department paid nearly $435,000 to Columbia-based FN Manufacturing LLC to mentor minority-owned companies. But the government says FN never provided some of the mentoring and contracted out some of the services, an action that violated the company’s contract.

FN is a subsidiary of FN Herstal of Belgium. The company makes the popular M-16 rifle, which is carried by almost every soldier. (Source: http://www2.wspa.com/news/2010/aug/04/defense-contractor-fined-ar-661483/)

or,

Defense company BAE Systems PLC said yesterday it would pay fines totaling more than $400 million after reaching settlements with Britain’s anti-fraud agency and the U.S. Justice Department to end decades-long corruption investigations into the company.

The world’s No. 2 defense contractor said that under its agreement with Washington, it would plead guilty to one criminal charge of conspiring to make false statements to the U.S. government over regulatory filings in 2000 and 2002. The agreement was subject to court approval, it said.

In Britain, it said it would plead guilty to one charge of breach of duty to keep proper accounting records about payments it made to a former marketing adviser in Tanzania in relation to the sale of a military radar system in 1999.

The bulk of the fines would be paid to the U.S. authorities. In Britain, BAE will be paying penalties of 30 million pounds ($46.9 million), including a charity payment to Tanzania.

BAE said it “regrets the lack of rigor in the past” and “accepts full responsibility for these past shortcomings.” (Source: http://www.capecodonline.com/apps/pbcs.dll/article?AID=/20100206/BIZ/2060310)


Measuring User Retention with Hadoop and Hive

Filed under: Hadoop,Hive,Marketing — Patrick Durusau @ 6:35 pm

Measuring User Retention with Hadoop and Hive by Daniel Russo.

From the post:

The Hadoop ecosystem is comprised of numerous technologies that can work together to provide a powerful and scalable mechanism for analyzing and deriving insight from large quantities of data.

In an effort to showcase the flexibility and raw power of queries that can be performed over large datasets stored in Hadoop, this post is written to demonstrate an example use case. The specific goal is to produce data related to user retention, an important metric for all product companies to analyze and understand.

Motivation: Why User Retention?

Broadly speaking, when equipped with the appropriate tools and data, we can enable our team and our customers to better understand the factors that drive user engagement and to ultimately make decisions that deliver better products to market.

User retention measures speak to the core of product quality by answering a crucial question about how the product resonates with users. In the case of apps (mobile or otherwise), that question is: “how many days does it take for users to stop using (or uninstall) the app?”

Pinch Media (now Flurry) delivered a formative presentation early in the AppStore’s history. Among numerous insights collected from their dataset was the following slide, which detailed patterns in user retention across all apps implementing their tracking SDK:

I mention this example because:

  • User retention is the measure of an app’s success or failure.*
  • Hadoop and Hive skill sets are good ones to pick up.

* I have a pronounced fondness for requirements and the documenting of the same. Others prefer unit/user/interface/final tests. Still others prefer formal proofs of “correctness.” All pale beside the test of “user retention.” If users keep using an application, what other measure would be meaningful?
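For anyone who wants to see the metric itself, stripped of the Hadoop/Hive machinery in Daniel’s post, here is a toy day-N retention calculation with made-up events:

```python
# A toy day-N retention calculation over (user_id, days_since_install) events.
# This only illustrates the metric, not the Hive/Hadoop pipeline described in
# the post; field names and data are made up.
from collections import defaultdict

events = [            # (user_id, days since that user installed the app)
    ("u1", 0), ("u1", 1), ("u1", 7),
    ("u2", 0), ("u2", 1),
    ("u3", 0),
]

days_seen = defaultdict(set)
for user, day in events:
    days_seen[user].add(day)

total_users = len(days_seen)
for day in (1, 7, 30):
    retained = sum(1 for active in days_seen.values() if day in active)
    print(f"day {day} retention: {retained}/{total_users} = {retained/total_users:.0%}")
```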

Online SVD/PCA resources

Online SVD/PCA resources by Danny Bickson.

From the post:

Last month I was visiting the Toyota Technological Institute in Chicago, where I was generously hosted by Tamir Hazan and Joseph Keshet. I heard some interesting stuff about large-scale SVM from Joseph Keshet, which I reported here. Additionally I met with Raman Arora, who is working on online SVD. I asked Raman to summarize the state-of-the-art research on online SVD and here is what I got from him:

A very rich listing of resources on singular value decomposition and principal component analysis.
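As a reminder of the batch case the online methods are improving on, here is a minimal sketch of PCA via the SVD in numpy, using random data for illustration:

```python
# A minimal reminder of how (batch) PCA falls out of the SVD, as context for
# the online variants listed above. Data here is random, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # 500 samples, 10 features
Xc = X - X.mean(axis=0)                 # center the data

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                          # principal directions (rows)
explained_variance = S**2 / (len(Xc) - 1)
scores = Xc @ Vt.T                       # data projected onto the components

print(components.shape, explained_variance[:3])
```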
