Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 9, 2012

Cricinfo StatsGuru Database for Statistical and Graphical Analysis

Filed under: Data,ESPN — Patrick Durusau @ 4:32 pm

Cricinfo StatsGuru Database for Statistical and Graphical Analysis by Ajay Ohri.

From the post:

However, ESPN has unleashed the API (including both free and premium) for Developers at http://developer.espn.com/docs.

and especially these sports http://developer.espn.com/docs/headlines#parameters

[parameters omitted]

What puzzled me at first was the title, and then Ajay jumping right in to illustrate the use of the parameters before I understood what sport was being described.

Ok, grin, laugh, whatever. 😉 I did not recognize Cricinfo.

I am sure that many of you will find creative ways to incorporate sports information into your topic maps.

April 8, 2012

Data and the Liar’s Paradox

Filed under: Data,Data Quality,Marketing — Patrick Durusau @ 4:20 pm

Data and the Liar’s Paradox by Jim Harris.

Jim writes:

“This statement is a lie.”

That is an example of what is known in philosophy and logic as the Liar’s Paradox because if “this statement is a lie” is true, then the statement is false, which would in turn mean that it’s actually true, but this would mean that it’s false, and so on in an infinite, and paradoxical, loop of simultaneous truth and falsehood.

I have never been a fan of the data management concept known as the Single Version of the Truth, and I often quote Bob Kotch, via Tom Redman’s excellent book, Data Driven: “For all important data, there are too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. This does not imply malfeasance on anyone’s part; it is simply a fact of life. Getting everyone to work from a Single Version of the Truth may be a noble goal, but it is better to call this the One Lie Strategy than anything resembling truth.”

More business/data quality reading.

Imagine my chagrin after years of studying literary criticism in graduate seminary classes (don’t ask, it’s a long and boring story) to discover that business types already know “truth” is a relative thing.

What does that mean for topic maps?

I would argue that with careful design we can capture several points of view, using one point of view as our vantage point.

That is in contrast to strategies that can only capture a single point of view: their own.

Capturing multiple viewpoints will be a hot topic when “big data” starts to hit the “big fan.”

March 27, 2012

Scientific Visualization Studio (NASA)

Filed under: Data,Visualization — Patrick Durusau @ 7:17 pm

Scientific Visualization Studio (NASA)

From the website:

The mission of the Scientific Visualization Studio is to facilitate scientific inquiry and outreach within NASA programs through visualization. To that end, the SVS works closely with scientists in the creation of visualization products, systems, and processes in order to promote a greater understanding of Earth and Space Science research activities at Goddard Space Flight Center and within the NASA research community.

All the visualizations created by the SVS (currently totalling over 4,200) are accessible to you through this Web site. More recent animations are provided as MPEG-4s, MPEG-2s, and MPEG-1s. Some animations are available in high definition as well as NTSC format. Where possible, the original digital images used to make these animations have been made accessible. Lastly, high and low resolution stills, created from the visualizations, are included, with previews for selective downloading.

A visualization of data site that may have visualizations that work with your topic maps as content and/or give you creative ideas for visualizing data reported by your topic maps.

For example, consider: Five-Year Average Global Temperature Anomalies from 1880 to 2011. Easier than reporting all the underlying data. This may work for some subjects and less well for others.

Publicly available large data sets for database research

Filed under: Data,Dataset — Patrick Durusau @ 7:17 pm

Publicly available large data sets for database research by Daniel Lemire.

Daniel summarizes large (> 20 GB) data sets that may be useful for database research.

If you know of any data sets that have been overlooked or that become available, please post a note on this entry at Daniel’s blog.

March 23, 2012

Spruce Up Your Data Visualization Skills

Filed under: Data,Visualization — Patrick Durusau @ 7:23 pm

Spruce Up Your Data Visualization Skills

Juice Analytics has released five (5) new design videos on its resources page.

If its Design Principles page is completed with a webpage for every design principle down the right-hand side of the page, it will be a formidable design resource.

Design suggestion: rather than making users look on two different pages for design resources, why not combine the white papers/resources page with the design principles page? I could not tell which one would lead to the videos except for the links in the blog post.

Good principles and videos.

Innovation History via 6,000 Pages of Annual Reports

Filed under: Data,Visualization — Patrick Durusau @ 7:23 pm

Innovation History via 6,000 Pages of Annual Reports

Nathan Yau from FlowingData reports on a visualization of all the GE annual reports from 1892 until 2011.

Selecting keywords lights up pages with those words.

Billed as tracing the evolution of innovation, but I am not sure I would go that far.

Interesting visualization but not every visualization, even an interesting one, is useful.

Fathom Information Design is responsible for a number of unusual visualizations.

March 22, 2012

Vista Stares Deep Into the Cosmos:…

Filed under: Astroinformatics,Data,Dataset — Patrick Durusau @ 7:42 pm

Vista Stares Deep Into the Cosmos: Treasure Trove of New Infrared Data Made Available to Astronomers

From the post:

The European Southern Observatory’s VISTA telescope has created the widest deep view of the sky ever made using infrared light. This new picture of an unremarkable patch of sky comes from the UltraVISTA survey and reveals more than 200 000 galaxies. It forms just one part of a huge collection of fully processed images from all the VISTA surveys that is now being made available by ESO to astronomers worldwide. UltraVISTA is a treasure trove that is being used to study distant galaxies in the early Universe as well as for many other science projects.

ESO’s VISTA telescope has been trained on the same patch of sky repeatedly to slowly accumulate the very dim light of the most distant galaxies. In total more than six thousand separate exposures with a total effective exposure time of 55 hours, taken through five different coloured filters, have been combined to create this picture. This image from the UltraVISTA survey is the deepest [1] infrared view of the sky of its size ever taken.

The VISTA telescope at ESO’s Paranal Observatory in Chile is the world’s largest survey telescope and the most powerful infrared survey telescope in existence. Since it started work in 2009 most of its observing time has been devoted to public surveys, some covering large parts of the southern skies and some more focused on small areas. The UltraVISTA survey has been devoted to the COSMOS field [2], an apparently almost empty patch of sky which has already been extensively studied using other telescopes, including the NASA/ESA Hubble Space Telescope [3]. UltraVISTA is the deepest of the six VISTA surveys by far and reveals the faintest objects.

Another six (6) terabytes of images, just in case you are curious.

And the rate of acquisition of astronomical data is only increasing.

Clever insights into how to more efficiently process and analyze the resulting data are surely welcome.

March 21, 2012

A graphical overview of your MySQL database

Filed under: Data,Database,MySQL — Patrick Durusau @ 3:30 pm

A graphical overview of your MySQL database by Christophe Ladroue.

From the post:

If you use MySQL, there’s a default schema called ‘information_schema‘ which contains lots of information about your schemas and tables among other things. Recently I wanted to know whether a table I use for storing the results of a large number experiments was any way near maxing out. To cut a brief story even shorter, the answer was “not even close” and could be found in ‘information_schema.TABLES‘. Not being one to avoid any opportunity to procrastinate, I went on to write a short script to produce a global overview of the entire database.

information_schema.TABLES contains the following fields: TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, AVG_ROW_LENGTH and MAX_DATA_LENGTH (and a few others). We can first have a look at the relative sizes of the schemas with the MySQL query “SELECT TABLE_SCHEMA,SUM(DATA_LENGTH) SCHEMA_LENGTH FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA”.

Christophe includes R code to generate graphics that you will find useful in managing (or just learning about) MySQL databases.
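As a minimal sketch of the same idea (assuming the RMySQL and ggplot2 packages; the connection details are placeholders, and this is not Christophe’s script):

  # Pull relative schema sizes from information_schema and plot them.
  # Assumes RMySQL and ggplot2; host, user and password are placeholders.
  library(RMySQL)
  library(ggplot2)

  con <- dbConnect(MySQL(), host = "localhost", user = "user", password = "secret")
  sizes <- dbGetQuery(con, "
    SELECT TABLE_SCHEMA, SUM(DATA_LENGTH) AS SCHEMA_LENGTH
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA != 'information_schema'
    GROUP BY TABLE_SCHEMA")
  dbDisconnect(con)

  # Bar chart of schema sizes in megabytes
  ggplot(sizes, aes(x = TABLE_SCHEMA, y = SCHEMA_LENGTH / 2^20)) +
    geom_bar(stat = "identity") +
    labs(x = "Schema", y = "Size (MB)")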

While the parts of the schema Christophe is displaying graphically are obviously subjects, the graphical display pushed me in another direction.

If we can visualize the schema of a MySQL database, then shouldn’t we be able to visualize the database structures a bit closer to the metal?

And if we can visualize those database structures, shouldn’t we be able to represent them and the relationships between them as a graph?

Or perhaps better, can we “view” those structures and relationships “on demand” as a graph?

That is in fact what is happening when we display a table at the command prompt for MySQL. It is a “display” of information, not a report of information.

I don’t know enough about the internal structures of MySQL or PostgreSQL to start such a mapping. But ignorance is curable, at least that is what they say. 😉

I have another post today that suggests a different take on conversion methodology.

R Data Import/Export

Filed under: Data,R — Patrick Durusau @ 3:30 pm

R Data Import/Export

After posting about the Excel -> R utility, I started to wonder about R -> Excel and in researching that question, I ran across this page.

Here is the table of contents as of 2012-02-29:

  • Acknowledgements
  • Introduction
  • Spreadsheet-like data
  • Importing from other statistical systems
  • Relational databases
  • Binary files
  • Connections
  • Network interfaces
  • Reading Excel spreadsheets
  • References
  • Function and variable index
  • Concept index

Enjoy!
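If you want to try a few of the cases the manual covers before reading it, here is a minimal sketch (the file and table names are placeholders, not examples from the manual):

  # Spreadsheet-like data: comma-separated text
  df <- read.csv("measurements.csv", stringsAsFactors = FALSE)
  write.csv(df, "measurements-out.csv", row.names = FALSE)

  # Importing from other statistical systems (SPSS here) via the 'foreign'
  # package that ships with R
  library(foreign)
  survey <- read.spss("survey.sav", to.data.frame = TRUE)

  # Relational databases via DBI, using RSQLite as the driver
  library(DBI)
  library(RSQLite)
  con <- dbConnect(SQLite(), "example.sqlite")
  experiments <- dbGetQuery(con, "SELECT * FROM experiments")
  dbDisconnect(con)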

Reading Excel data is easy with JGR and XLConnect

Filed under: Data,Excel — Patrick Durusau @ 3:30 pm

Reading Excel data is easy with JGR and XLConnect

From the post:

Despite the fact that Excel is the most widespread application for data manipulation and (perhaps) analysis, R’s support for the xls and xlsx file formats has left a lot to be desired. Fortunately, the XLConnect package has been created to fill this void, and now JGR 1.7-8 includes integration with XLConnect package to load .xls and .xlsx documents into R.

For JGR, see: http://rforge.net/JGR/
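A minimal sketch of the XLConnect route (the workbook and sheet names are placeholders):

  # Read a worksheet from an .xlsx file into a data frame with XLConnect.
  # Requires Java; file and sheet names below are placeholders.
  library(XLConnect)

  wb <- loadWorkbook("results.xlsx")
  df <- readWorksheet(wb, sheet = "Sheet1")

  # Or in a single call:
  df2 <- readWorksheetFromFile("results.xlsx", sheet = 1)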

March 20, 2012

Wheel Re-invention: Change Data Capture systems

Filed under: Change Data,Data,Databus — Patrick Durusau @ 3:53 pm

LinkedIn: Creating a Low Latency Change Data Capture System with Databus

Siddharth Anand, a senior member of LinkedIn’s Distributed Data Systems team, writes:

Having observed two high-traffic web companies solve similar problems, I cannot help but notice a set of wheel-reinventions. Some of these problems are difficult and it is truly unfortunate for each company to solve its problems separately. At the same time, each company has had to solve these problems due to an absence of a reliable open-source alternative. This clearly has implications for an industry dominated by fast-moving start-ups that cannot build 50-person infrastructure development teams or dedicate months away from building features.

Siddharth goes on to address a particular re-invention of the wheel: change data capture systems.

And he has a solution to this wheel re-invention problem: Databus. (Not good for all situations but worth your time to read carefully, along with following the other resources.)

From the post:

Databus is an innovative solution in this space.

It offers the following features:

  • Pub-sub semantics
  • In-commit-order delivery guarantees
  • Commits at the source are grouped by transaction
    • ACID semantics are preserved through the entire pipeline
  • Supports partitioning of streams
    • Ordering guarantees are then per partition
  • Like other messaging systems, offers very low latency consumption for recently-published messages
  • Unlike other messaging systems, offers arbitrarily-long look-back with no impact to the source
  • High Availability and Reliability

From counting citations to measuring usage (help needed!)

Filed under: Citation Indexing,Classification,Data — Patrick Durusau @ 3:52 pm

From counting citations to measuring usage (help needed!)

Daniel Lemire writes:

We sometimes measure the caliber of a researcher by how many research papers he wrote. This is silly. While there is some correlation between quantity and quality — people like Einstein tend to publish a lot — it can be gamed easily. Moreover, several major researchers have published relatively few papers: John Nash has about two dozen papers in Scopus. Even if you don’t know much about science, I am sure you can think of a few writers who have written only a couple of books but are still world famous.

A better measure is the number of citations a researcher has received. Google Scholar profiles display the citation record of researchers prominently. It is a slightly more robust measure, but it is still silly because 90% of citations are shallow: most authors haven’t even read the paper they are citing. We tend to cite famous authors and famous venues in the hope that some of the prestige will get reflected.

But why stop there? We have the technology to measure the usage made of a cited paper. Some citations are more significant: for example it can be an extension of the cited paper. Machine learning techniques can measure the impact of your papers based on how much following papers build on your results. Why isn’t it done?

Daniel wants to distinguish important papers that cite his papers from ho-hum papers that cite him. (my characterization, not his)

That isn’t happening now so Daniel has teamed up with Peter Turney and Andre Vellino to gather data from published authors (that would be you), to use in investigating this problem.

Topic maps of scholarly and other work face the same problem. How do you distinguish the important from the less so? For that matter, what criteria do you use? If an author who cites you wins the Nobel Prize for work that doesn’t cite you, does the importance of your paper go up? Stay the same? Go down? 😉

It is an important issue so if you are a published author, see Daniel’s post and contribute to the data gathering.

March 7, 2012

Datomic

Filed under: Data,Database,Datomic — Patrick Durusau @ 5:43 pm

Alex Popescu (myNoSQL) has a couple of posts on resources for Datomic.

Intro Videos to Datomic and Datomic Datalog

and,

Datomic: Distributed Database Designed to Enable Scalable, Flexible and Intelligent Applications, Running on Next-Generation Cloud Architectures

I commend the materials you will find there, the white paper in particular, which has the following section:

ATOMIC DATA – THE DATOM

Once you are storing facts, it becomes imperative to choose an appropriate granularity for facts. If you want to record the fact that Sally likes pizza, how best to do so? Most databases require you to update either the Sally record or document, or the set of foods liked by Sally, or the set of likers of pizza. These kind of representational issues complicate and rigidify applications using relational and document models. This can be avoided by recording facts as independent atoms of information. Datomic calls such atomic facts ‘datoms‘. A datom consists of an entity, attribute, value and transaction (time). In this way, any of those sets can be discovered via query, without embedding them into a structural storage model that must be known by applications.

In some views of granularity, the datom “atom” looks like a four-atom molecule to me. 😉 Not to mention that entities/attributes and values can have relationships that don’t involve each other.
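To make the granularity point concrete, here is the “Sally likes pizza” example sketched as plain entity/attribute/value/transaction tuples in an R data frame (the identifiers and attribute names are made up for illustration, not Datomic syntax):

  # Facts as atoms: each row is one datom (entity, attribute, value, transaction).
  # Entity and transaction ids are invented for illustration.
  datoms <- data.frame(
    entity      = c(42L, 42L, 42L),
    attribute   = c("person/name", "person/likes", "person/likes"),
    value       = c("Sally", "pizza", "sushi"),
    transaction = c(1001L, 1001L, 1002L),
    stringsAsFactors = FALSE
  )

  # Any "set" is recovered by query rather than stored as a structure:
  subset(datoms, attribute == "person/likes" & value == "pizza")  # likers of pizza
  subset(datoms, entity == 42L & attribute == "person/likes")     # foods Sally likes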

March 2, 2012

An essay on why programmers need to learn statistics

Filed under: Data,Statistics — Patrick Durusau @ 8:05 pm

An essay on why programmers need to learn statistics from Simply Statistics.

Truly an amazing post!

But it doesn’t apply just to programmers, anyone evaluating data needs to understand statistics and perhaps more importantly, have the ability to know when the data isn’t quite right. Math is correct but the data is too clean, too good, too …., something that makes you uneasy with the data.

Consider the Duke Saga for example.

February 28, 2012

StatLib

Filed under: Data,Dataset,Statistics — Patrick Durusau @ 8:41 pm

StatLib

From the webpage:

Welcome to StatLib, a system for distributing statistical software, datasets, and information by electronic mail, FTP and WWW. StatLib started out as an e-mail service and some of the organization still reflects that heritage. We hope that this document will give you sufficient guidance to navigate through the archives. For your convenience there are several sites around the world which serve as full or partial mirrors to StatLib.

An amazing source of software and data, including sets of webpages for clustering analysis, etc.

Was mentioned in the first R-Podcast episode.

February 22, 2012

The Data Hub

Filed under: Data,Dataset — Patrick Durusau @ 4:48 pm

The Data Hub

From the about page:

What was the average price of a house in the UK in 1935? When will India’s projected population overtake that of China? Where can you see publicly-funded art in Seattle? Data to answer many, many questions like these is out there on the Internet somewhere – but it is not always easy to find.

the Data Hub is a community-run catalogue of useful sets of data on the Internet. You can collect links here to data from around the web for yourself and others to use, or search for data that others have collected. Depending on the type of data (and its conditions of use), the Data Hub may also be able to store a copy of the data or host it in a database, and provide some basic visualisation tools.

I covered the underlying software in CKAN – the Data Hub Software.

If your goal is to simply make data sets available with a minimal amount of metadata, this may be the software for you.

If your goal is to make data sets available with enough metadata to make robust use of them, you need to think again.

There is an impressive amount of data sets at this site.

But junk yards have an impressive number of wrecked cars.

Doesn’t help you find the car with the part you need. (Think data formats, semantics, etc.)

Look But Don’t Touch

Filed under: Data,Geographic Data,Government Data,Transparency — Patrick Durusau @ 4:48 pm

I would describe the Atlanta GIS Data Catalog as a Look But Don’t Touch system. A contrast to the efforts of DC at transparency.

From the webpage:

GIS Data Catalog

Atlanta GIS creates and maintains many GIS data sets (also known as “layers” because of the way they are layered one on top another to create a map) and collects others from external sources, mostly other government agencies. Each layer represents some class of geographic feature. The features represented can be physical, such as roads, buildings and streams, or they can be conceptual, such as neighbor boundaries, property lines and the locations of crimes.

The GIS Data Catalog is an on-line compilation of information on GIS layers used by the City. The catalog allows you to quickly locate GIS data by searching by keyword. You can also view metadata for each data layer in the catalog. All data in the catalog represent the best and most current GIS data maintained or used by the city. The city’s GIS metadata is maintained in conformance with a standard defined by the Federal Geographic Data Committee (FGDC).

The data layers themselves are not available for download from the catalog. Data can be requested by contacting the originating department or agency. More specific contact information is available within the metadata for many data layers. (emphasis added)

I am sure most agencies would supply the data on request, but why require the request?

To add a request processing position to the agency payroll and to have procedures for processing requests, along with meetings on request granting, plus an appeals process if the request is rejected, with record keeping for all of the foregoing plus more?

That doesn’t sound like transparent government or effective use of tax dollars to me.

District of Columbia – Data Catalog

Filed under: Data,Government Data,Open Data,Transparency — Patrick Durusau @ 4:48 pm

District of Columbia – Data Catalog

This is an example of a city moving towards transparency.

A large number of data sets to access (485 as of today), with live feeds to some data streams.

Eurostat

Filed under: Data,Dataset,Government Data,Statistics — Patrick Durusau @ 4:48 pm

Eurostat

From the “about” page:

Eurostat’s mission: to be the leading provider of high quality statistics on Europe.

Eurostat is the statistical office of the European Union situated in Luxembourg. Its task is to provide the European Union with statistics at European level that enable comparisons between countries and regions.

This is a key task. Democratic societies do not function properly without a solid basis of reliable and objective statistics. On one hand, decision-makers at EU level, in Member States, in local government and in business need statistics to make those decisions. On the other hand, the public and media need statistics for an accurate picture of contemporary society and to evaluate the performance of politicians and others. Of course, national statistics are still important for national purposes in Member States whereas EU statistics are essential for decisions and evaluation at European level.

Statistics can answer many questions. Is society heading in the direction promised by politicians? Is unemployment up or down? Are there more CO2 emissions compared to ten years ago? How many women go to work? How is your country’s economy performing compared to other EU Member States?

International statistics are a way of getting to know your neighbours in Member States and countries outside the EU. They are an important, objective and down-to-earth way of measuring how we all live.

I have seen Eurostat mentioned, usually negatively, by data aggregation services. I visited Eurostat today and found it quite useful.

For the non-data professional, there are graphs and other visualizations of popular data.

For the data professional, there are bulk downloads of data and other technical information.

I am sure there is room for improvement, but specific feedback is required to make that happen. (It has been my experience that positive, specific feedback works best. Find something nice to say and then suggest a change to improve the outcome.)

February 15, 2012

Unstructured data is a myth

Filed under: Data,Data Mining — Patrick Durusau @ 8:33 pm

Unstructured data is a myth by Ram Subramanyam Gopalan.

From the post:

Couldn’t resist that headline! But seriously, if you peel the proverbial onion enough, you will see that the lack of tools to discover / analyze the structure of that data is the truth behind the opaqueness that is implied by calling the data “unstructured”.

This article will give you a firm basis for arguing against casual use of “unstructured” data as a phrase.

One point that stands above the others is that all the so-called “unstructured” data is generated by some process, automated or otherwise. That you may be temporarily ignorant of that process doesn’t mean that the data is “unstructured.” Worth reading, more than once.

February 8, 2012

Extending Data Beyond the Database – The Notion of “State”

Filed under: Data,Database — Patrick Durusau @ 5:12 pm

Extending Data Beyond the Database – The Notion of “State” by David Loshin

From the post:

In my last post, I essentially suggested that there is a difference between merging two static data sets and merging static data sets with dynamic ones. It is worth providing a more concrete example to demonstrate what I really mean by this idea: let’s say you had a single analytical database containing customer profile information (we’ll call this data set “Profiles”), but at the same time had access to a stream of web page transactions performed by individuals identified as customers (we can refer to this one as “WebStream”).

The challenge is that the WebStream data set may contain information with different degrees of believability. If an event can be verified as the result of a sequence of web transactions within a limited time frame, the resulting data should lead to an update of the Profiles data set. On the other hand, if the sequence does not take place, or takes place over an extended time frame, there is not enough “support” for the update and therefore the potential modification is dropped. For example, if a visitor places a set of items into a shopping cart and completes a purchase, the customer’s preferences are updated based on the items selected and purchased. But if the cart is abandoned and not picked up within 2 hours, the customer’s preferences may not be updated.

Because the update is conditional on a number of different variables, the system must hold onto some data until it can be determined whether or not the preferences are updated. We can refer to this as maintaining some temporary state that either resolves into a modification to the Profiles data set or is thrown out after 2 hours.

Are your data sets static or dynamic? And if dynamic, how do you delay merging until some other criterion is met?
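A toy sketch of that kind of delayed, conditional merge (the column names, timestamps and two-hour window below are mine, not David’s):

  # Hold WebStream events as temporary state; fold them into Profiles only if a
  # purchase confirms them within a two-hour window, drop them otherwise.
  profiles <- data.frame(customer   = c("A", "B"),
                         preference = c("books", "garden"),
                         stringsAsFactors = FALSE)

  pending <- data.frame(customer  = c("A", "B"),
                        item      = c("espresso machine", "rake"),
                        added     = as.POSIXct(c("2012-02-08 10:00", "2012-02-08 07:00")),
                        purchased = c(TRUE, FALSE),
                        stringsAsFactors = FALSE)

  now      <- as.POSIXct("2012-02-08 11:30")
  age_secs <- as.numeric(difftime(now, pending$added, units = "secs"))
  window   <- 2 * 3600                     # two hours

  confirmed <- pending$purchased & age_secs <= window

  # Confirmed events update the profile; anything older than the window is dropped.
  profiles$preference[match(pending$customer[confirmed], profiles$customer)] <-
    pending$item[confirmed]
  pending <- pending[!confirmed & age_secs <= window, ]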

The first article David refers to is: Data Quality and State.

Interesting that as soon as we step away from static files and data, the world explodes in complexity. Add to that dynamic notions of identity and recognition and complexity seems like an inadequate term for what we face.

Be mindful those are just slices of what people automatically process all day long. Fix your requirements and build to spec. Leave the “real world” to wetware.

February 7, 2012

Finding Data on the Internet

Filed under: Data,Data Source,R — Patrick Durusau @ 4:31 pm

Finding Data on the Internet

From the post:

What I would like is a nice list of all of credible sources on the Internet for finding data to use with R projects. I know that this is a crazy idea, not well formulated (what are data after all) and loaded with absurd computational and theoretical challenges. (Why can’t I just google “data R” and get what I want?) So, what can I do? As many people are also out there doing, I can begin to make lists (in many cases lists of lists) on a platform that is stable enough to survive and grow, and perhaps encourage others to help with the effort.

Here follows a list of data sources that may easily be imported into R. If an (R) appears after source this means that the data are already in R format or there exist R commands for directly importing the data from R. (See http://www.quantmod.com/examples/intro/ for some code.) Otherwise, I have limited the list to data sources for which there is a reasonably simple process for importing csv files. What follows is a list of data sources organized into categories that are not mutually exclusive but which reflect what’s out there.

Useful listing of data sources for R, but you could use them with any SQL, NoSQL, SQL-NoSQL hybrid, or topic map as well.
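As a reminder of how little R is needed once a source publishes CSV, a minimal sketch (the URL is a placeholder, not one from the list):

  # Pull a CSV straight off the web into a data frame.
  url <- "https://example.org/some-open-dataset.csv"   # placeholder URL
  df  <- read.csv(url, stringsAsFactors = FALSE)
  str(df)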

Title probably should be: “Data Found on the Internet.” Finding data is a more difficult proposition.

Curious: Is there a “data crawler” that attempts to crawl websites of governments and the usual suspects for new data sets?

February 2, 2012

IMDb Alternative Interfaces

Filed under: Data,Dataset,IMDb — Patrick Durusau @ 3:39 pm

IMDb Alternative Interfaces.

From the webpage:

This page describes various alternate ways to access The Internet Movie Database locally by holding copies of the data directly on your system. See more about using our data on the Non-Commercial Licensing page.

It’s an interesting data set and I am sure its owners would not mind your sending them a screencast of some improved access you have created to their data.

That might actually be an interesting model for developing better interfaces to data served up to the public anyway. Release it for strictly personal use and see who does the best job with it. A screencast would not disclose any of your source code or processes, protecting the interest of the software author.

Just a thought.

First noticed this on PeteSearch.

January 29, 2012

Munging, Modeling and Visualizing Data with R

Filed under: Data,Modeling,R,Visualization — Patrick Durusau @ 9:17 pm

Munging, Modeling and Visualizing Data with R by Xavier Léauté.

With a title like that, how could I resist?

From the post:

Yesterday evening Romy Misra from visual.ly invited us to teach an introductory workshop to R for the San Francisco Data Mining meetup. Todd Holloway was kind enough to host the event at Trulia headquarters.

R can be a little daunting for beginners, so I wanted to give everyone a quick overview of its capabilities and enough material to get people started. Most importantly, the objective of this interactive session was to give everyone some time to try out some simple examples that would be useful in the future.

I hope everyone enjoyed learning some fun and easy ways to slice, model and visualize data, and that I piqued their interest enough to start exploring datasets on their own.

Slides and sample scripts follow.

First seen at Christophe Lalanne’s Bag of Tweets for January 2012.

January 27, 2012

Analytics with MongoDB (commercial opportunity here)

Filed under: Analytics,Data,Data Analysis,MongoDB — Patrick Durusau @ 4:35 pm

Analytics with MongoDB

Interesting enough slide deck on analytics with MongoDB.

Relies on custom programming and then closes with this punchline (along with others, slide #41):

  • If you’re a business analyst you have a problem
    • better be BFF with some engineer 🙂

I remember when word processing required a lot of “dot” commands and editing markup languages with little or no editor support. Twenty years (has it been that long?) later and business analysts are doing word processing, markup and damned near print shop presentation without working close to the metal.

Can anyone name any products that have made large sums of money making it possible for business analysts and others to perform those tasks?

If so, ask yourself if you would like to have a piece of the action that frees business analysts from script kiddie engineers?

Even if a general application is out of reach at present, imagine writing access routines for common public data sites.

Create a market for the means to import and access particular data sets.
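For a sense of how short such an access routine can be, here is a hedged sketch using the mongolite R package (one of several MongoDB clients for R, not the approach in the slides; the collection, database and field names are invented):

  # Let an analyst run a simple aggregation against MongoDB directly from R.
  # Collection, database and field names are invented for illustration.
  library(mongolite)

  events <- mongo(collection = "pageviews", db = "analytics",
                  url = "mongodb://localhost")

  # Page views per page, sorted, with no custom engineering in the loop
  views_per_page <- events$aggregate('[
    {"$group": {"_id": "$page", "views": {"$sum": 1}}},
    {"$sort":  {"views": -1}}
  ]')
  head(views_per_page)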

January 7, 2012

Statistical Rules of Thumb, Part III – Always Visualize the Data

Filed under: Data,Marketing,Statistics,Visualization — Patrick Durusau @ 4:05 pm

Statistical Rules of Thumb, Part III – Always Visualize the Data

From the post:

As I perused Statistical Rules of Thumb again, as I do from time to time, I came across this gem. (note: I live in CA, so get no money from these amazon links).

Van Belle uses the term “Graph” rather than “Visualize”, but it is the same idea. The point is to visualize in addition to computing summary statistics. Summaries are useful, but can be deceiving; any time you summarize data you will lose some information unless the distributions are well behaved. The scatterplot, histogram, box and whiskers plot, etc. can reveal ways the summaries can fool you. I’ve seen these as well, especially variables with outliers or that are bi- or tri-modal.
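A quick illustration of the point in R (simulated data; similar means and medians, very different shapes):

  # Two simulated samples with similar summary statistics but different shapes;
  # the histograms show what the summaries hide.
  set.seed(1)
  unimodal <- rnorm(1000, mean = 0, sd = 2)
  bimodal  <- c(rnorm(500, mean = -2), rnorm(500, mean = 2))

  summary(unimodal)
  summary(bimodal)          # the summaries look much the same

  par(mfrow = c(1, 2))
  hist(unimodal, main = "Unimodal", xlab = "")
  hist(bimodal,  main = "Bimodal",  xlab = "")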

What techniques do you use in visualizing topic maps? Such as hiding topics or associations? Or coloring schemes that appear to work better than others? Or do you integrate the information delivered by the topic map with other visualizations? Such as street maps, blueprints or floor plans?

Post seen at: Data Mining and Predictive Analytics

January 3, 2012

List of cities/states with open data – help me find more!

Filed under: Data,Government Data — Patrick Durusau @ 5:13 pm

List of cities/states with open data – help me find more!

A plea from “Simply Statistics” to increase its listing of cities with open data.

Mostly American and Canadian, with a few others, Berlin for example, suggested in comments.

I haven’t looked (yet) but since European libraries lead the charge in many ways to have greater access to their collections (my recollection, yours may differ), I would expect to find European cities and authorities also ahead on the race to publish public data.

Pointers from European readers? (Or I can look them up later this week, just not today.)

What the Sumerians can teach us about data

Filed under: Data,Data Management,Marketing — Patrick Durusau @ 5:08 pm

What the Sumerians can teach us about data

Pete Warden writes:

I spent this afternoon wandering the British Museum’s Mesopotamian collection, and I was struck by what the humanities graduates in charge of the displays missed. The way they told the story, the Sumerian’s biggest contribution to the world was written language, but I think their greatest achievement was the invention of data.

Writing grew out of pictograms that were used to tally up objects or animals. Historians and other people who write for a living treat that as a primitive transitional use, a boring stepping-stone to the final goal of transcribing speech and transmitting stories. As a data guy, I’m fascinated by the power that being able to capture and transfer descriptions of the world must have given the Sumerians. Why did they invent data, and what can we learn from them?

Although Pete uses the term “Sumerians” to cover a very wide span of peoples, languages and history, I think his comment:

Gathering data is not a neutral act, it will alter the power balance, usually in favor of the people collecting the information.

is right on the mark.

There is an aspect of data management that we can learn from the Ancient Near East (not just the Sumerians).

Preservation of access.

It isn’t enough to simply preserve data. You can ask NASA preservation of data. (Houston, We Erased The Apollo 11 Tapes)

Particularly with this attitude:

“We’re all saddened that they’re not there. We all wish we had 20-20 hindsight,” says Dick Nafzger, a TV specialist at NASA’s Goddard Space Flight Center in Maryland, who helped lead the search team.

“I don’t think anyone in the NASA organization did anything wrong,” Nafzger says. “I think it slipped through the cracks, and nobody’s happy about it.”

Didn’t do anything wrong?

You do know the leading cause of sysadmins being fired is failure to maintain proper backups? I would hold everyone standing near a crack responsible. It would not bring the missing tapes back, but it would make future generations more careful.

Considering that was only a few decades ago, how do we read ancient texts for which we have no key in English?

The ancients preserved access to their data by way of trilingual inscriptions: inscriptions in three different languages, all saying the same thing. If you know only one of the languages, you can work towards understanding the other two.

A couple of examples:

Van Fortress, with an inscription of Xerxes the Great.

Behistun Inscription, with an inscription in Old Persian, Elamite, and Babylonian.

BTW, the final image in Pete’s post is much later than the Sumerians and is one of the first cuneiform artifacts to be found. (Taylor’s Prism) It describes King Sennacherib’s military victories and dates from about 691 B.C. It is written in Neo-Assyrian cuneiform script. That script is used in primers and introductions to Akkadian.

Can I guess how many mappings you have of your ontologies or database schemas? I suppose the first question should be whether they are documented at all. Then follow up with the question about mapping to other ontologies or schemas, such as an industry standard schema or set of terms.

If that sounds costly, consider the cost of migration/integration without documentation/mapping. Topic maps can help with the mapping aspects of such a project.

January 1, 2012

Strata: Making Data Work – Update

Filed under: Conferences,Data — Patrick Durusau @ 5:59 pm

Strata: Making Data Work – Update

Data sessions for the Strata Conference, February 28 – March 1, 2012, Santa Clara, California.

Too many to list or effectively summarize.

Conference homepage.

Big Data: It’s Not How Big It Is, It’s How You Use It

Filed under: BigData,Data — Patrick Durusau @ 5:58 pm

Big Data: It’s Not How Big It Is, It’s How You Use It

If you are thinking about how this year will shape up, this is a post to keep in mind.

At least keep in mind its points about “big data” not being as meaningful as your performance with a particular data set in a given context. The data may be “big” in someone’s view, but the important point is a particular result with the data, whether it is “big” or “small.”

I am less concerned with the notion that a transition to a data/information economy can be managed. It makes for interesting water cooler talk but not much more than that.

Remember, history is written by survivors, not pre-revolution visionaries.

Be a survivor, use data, big or small, to your best advantage (or that of your client).

