Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 19, 2012

FindTheData

Filed under: Data,Data Source,Dataset — Patrick Durusau @ 7:03 pm

FindTheData

From the about page:

At FindTheData, we present you with the facts stripped of any marketing influence so that you can make quick and informed decisions. We present the facts in easy-to-use tables with smart filters, so that you can decide what is best.

Too often, marketers and pay-to-play sites team up to present carefully crafted advertisements as objective “best of” lists. As a result, it has become difficult and time consuming to distinguish objective information from paid placements. Our goal is to become a trusted source in assisting you in life’s important decisions.

FindTheData is organized into 9 broad categories

Each category includes dozens of Comparisons from smartphones to dog breeds. Each Comparison consists of a variety of listings and each listing can be sorted by several key filters or compared side-by-side.

Traditional search is a great hammer but sometimes you need a wrench.

Currently search can find any piece of information across hundreds of billions of Web pages, but when you need to make a decision whether it’s choosing the right college or selecting the best financial advisor, you need information structured in an easily comparable format. FindTheData does exactly that. We help you compare apples-to-apples data, side-by-side, on a wide variety of products & services.

If you think in the same categories as the authors, sorta like using LCSH, you are in like Flynn. If you don’t, well, your mileage may vary.

While some people may find it convenient to have tables and sorts pre-set for them, it would be nice to be able to download the data files.

Still, you may find it useful to browse for datasets that are new to you.

November 4, 2012

Towards Social Discovery…

Filed under: Common Crawl,Data,Social Networks — Patrick Durusau @ 4:14 pm

Towards Social Discovery – New Content Models; New Data; New Toolsets by Matthew Berk, Founder of Lucky Oyster.

From the post:

When I first came across the field of information retrieval in the 80′s and early 90′s (back when TREC began), vectors were all the rage, and the key units were terms, texts, and corpora. Through the 90′s and with the advent of hypertext and later the explosion of the Web, that metaphor shifted to pages, sites, and links, and approaches like HITS and PageRank leveraged hyperlinking between documents and sites as key proxies for authority and relevance.

Today we’re at a crossroads, as the nature of the content we seek to leverage through search and discovery has shifted once again, with a specific gravity now defined by entities, structured metadata, and (social) connections. In particular, and based on my work with Common Crawl data specifically, content has shifted in three critical ways:

No, I won’t even summarize his three points. It’s short and quite well written.

Read his post and then consider: Where do topic maps fit into his “crossroads?”

November 2, 2012

CALIFA First Data Release

Filed under: Astroinformatics,BigData,Data — Patrick Durusau @ 6:35 am

CALIFA (Calar Alto Legacy Integral Field spectroscopy Area survey) First Data Release

From the webpage:

The Calar Alto Legacy Integral Field Area survey is one of the largest IFS surveys performed to date. At its completion it will comprise 600 galaxies, observed with the PMAS spectrograph in the PPAK mode, covering the full spatial extent of these galaxies up to two effective radii. The wavelength range between 3700 and 7500 Å is sampled with two spectroscopic configurations, a high resolution mode (V1200, R~1700, 3700-4200 Å), and a low resolution mode (V500, R~850, 3750-7500 Å). A detailed explanation of the survey is given in the CALIFA Presentation Article (Sánchez et al. 2012).

The first CALIFA Data Release (DR1) provides to the public the fully reduced and quality control tested datacubes of 100 objects in both setups (V500 and V1200). Each datacube contains ~2000 individual spectra, thus in total this DR comprises ~400,000 individual spectra. The details of the data included in this DR are described in the CALIFA DR1 Article (Husemann et al. 2012). The complete list of the DR1 objects for which we deliver data can be found in the following webpage.

The main characteristics of the galaxies included in the full CALIFA mother sample, a subset of which are delivered in this DR, will be given in the CALIFA sample characterization article (Walcher et al. in prep.). This article will provide detailed information of the photometric, morphological and environmental properties of the galaxies, and a description of the statistical properties of the full sample.

The non-technical explanation:

Galaxies are the large-scale building blocks of the cosmos. Their visible ingredients include between millions and hundreds of billions of stars as well as clouds of gas and dust. “Understanding the dynamical processes within and between galaxies that have shaped the way they are today is a key part of understanding our wider cosmic environment.”, explains Dr. Glenn van de Ven, a member of the managing board of the CALIFA survey and staff scientist at the Max Planck Institute for Astronomy (MPIA).

Traditionally, when it came to galaxies, astronomers had to choose between different observational techniques. They could, for instance, take detailed images with astronomical cameras showing the various features of a galaxy as well as their spatial relations, but they could not at the same time perform detailed analyses of the galaxy’s light, that is “obtain a galaxy spectrum”. Taking spectra required a different kind of instrument known as a spectrograph, which, as a downside, would only provide very limited information about the galaxy’s spatial structure.

An increasingly popular observational technique, integral field spectroscopy (IFS), combines the best of both worlds. The IFS instrument PMAS mounted at the Calar Alto Observatory’s 3.5 metre telescope uses 350 optical fibres to guide light from a corresponding number of different regions of a galaxy image into a spectrograph. In this way, astronomers are not restricted to analysing the galaxy as a whole – they can analyse the light coming from many different specific parts of a galaxy. The results are detailed maps of galaxy properties such as their chemical composition, and of the motions of their stars and their gas.

For the CALIFA survey, more than 900 galaxies in the local Universe, namely at distances between 70 and 400 million light years from the Milky Way, were selected from the northern sky to fully fit into the field-of-view of PMAS. They include all possible types, from roundish elliptical to majestic spiral galaxies, similar to our own Milky Way and the Andromeda galaxy. The allocated observation time will allow for around 600 of the pre-selected galaxies to be observed in depth.

From: CALIFA survey publishes intimate details of 100 galaxies

Either way, I thought you would find it an interesting “big data” set to consider over the weekend.

Or if you are an amateur astronomer with a cloudy weekend, something to expand your horizons.
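PS: If you do grab one of the datacubes, here is a minimal sketch of pulling a single spectrum out of it with Python and astropy. The file name and the WCS keywords are my assumptions, so check the DR1 documentation against what you actually download.

    # Minimal sketch: open a CALIFA V500 datacube and extract one spectrum.
    # File name and header keywords are assumptions; adjust to the real files.
    from astropy.io import fits
    import numpy as np

    with fits.open("NGC0001.V500.rscube.fits") as hdul:
        hdul.info()                      # list the extensions in the cube
        flux = hdul[0].data              # expected shape: (wavelength, y, x)
        header = hdul[0].header

        # Rebuild the wavelength axis from the WCS keywords, if present.
        nwave = flux.shape[0]
        wave = header["CRVAL3"] + header["CDELT3"] * np.arange(nwave)

        # One spaxel's spectrum, taken from the middle of the field of view.
        spectrum = flux[:, flux.shape[1] // 2, flux.shape[2] // 2]
        print(wave[:5], spectrum[:5])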

October 31, 2012

Building apps with rail data

Filed under: Data,Programming,Visualization — Patrick Durusau @ 3:19 pm

Building apps with rail data by Jamie Andrews.

From the post:

Recently we ran a hack day Off the Rails to take the best rail data and see what can be built with it. I remain stunned at the range and quality of the output, particularly because of the complexity of the subject matter, and the fact that a lot of the developers hadn’t built any train-related software before.

So check out all of the impressive, useful and fun train hacks, and marvel at what can be done when data is opened and great minds work together…

The hacks really are impressive so I will just list the titles and hope that induces you to visit Jamie’s post:

Hack #1 – Trainspot.in… FourSquare for trains
Hack #2 – Journey planner maps with lines that follow the tracks
Hack #3 – Scenic railways
Hack #4 – Realtime Dutch trains
Hack #5 – ChooChooTune
Hack #6 – Realtimetrains
Hack #7 – Follow the rails
Hack #8 – [cycling in the UK]
Hack #9 – [I’ll meet you half-way?]
Hack #10 – [train delays, most delayed]

(No titles given for 8-10 so I made up titles.)

Wikidata

Filed under: Data,Wikidata — Patrick Durusau @ 11:30 am

Wikidata

From the webpage:

Wikidata is a free knowledge base that can be read and edited by humans and machines alike. It is for data what Wikimedia Commons is for media files: it centralizes access and management of structured data, such as interwiki references and statistical information. Wikidata contains data in all languages for which there are Wikimedia projects.

Not fully operational but still quite interesting.

Particularly the re-use of information aspects.

Re-use of data being one advantage commonly found in topic maps.
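If you would rather see the structured data than read about it, the API already answers simple requests. A rough Python sketch (the item id is arbitrary and the response layout may still change while Wikidata matures):

    # Fetch one Wikidata item and print its labels in every language.
    # Illustrative only; the response layout is an assumption on my part.
    import json
    import urllib.request

    url = ("https://www.wikidata.org/w/api.php"
           "?action=wbgetentities&ids=Q42&format=json")
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode("utf-8"))

    entity = data["entities"]["Q42"]
    for lang, label in entity.get("labels", {}).items():
        print(lang, label["value"])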

October 29, 2012

Characterizing a new dataset

Filed under: Data,R — Patrick Durusau @ 2:48 pm

Characterizing a new dataset by Ronald Pearson.

From the post:

In my last post, I promised a further examination of the spacing measures I described there, and I still promise to do that, but I am changing the order of topics slightly.  So, instead of spacing measures, today’s post is about the DataframeSummary procedure to be included in the ExploringData package, which I also mentioned in my last post and promised to describe later.  My next post will be a special one on Big Data and Data Science, followed by another one about the DataframeSummary procedure (additional features of the procedure and the code used to implement it), after which I will come back to the spacing measures I discussed last time.

A task that arises frequently in exploratory data analysis is the initial characterization of a new dataset.  Ideally, everything we could want to know about a dataset should come from the accompanying metadata, but this is rarely the case.  As I discuss in Chapter 2 of Exploring Data in Engineering, the Sciences, and Medicine, metadata is the available “data about data” that (usually) accompanies a data source.  In practice, however, the available metadata is almost never as complete as we would like, and it is sometimes wrong in important respects.  This is particularly the case when numeric codes are used for missing data, without accompanying notes describing the coding.  An example, illustrating the consequent problem of disguised missing data is described in my paper The Problem of Disguised Missing Data.  (It should be noted that the original source of one of the problems described there – a comment in the UCI Machine Learning Repository header file for the Pima Indians diabetes dataset that there were no missing data records – has since been corrected.)

A rich post on using R to explore data sets.

The observation that ‘metadata is the available “data about data”’ should remind us that we use subjects to talk about other subjects. There isn’t any place to stand where subjects are not all around us.

Some metadata may be unspoken or missing, as Ronald observes, but that doesn’t make it any less important.

How do you record your discoveries about data sets for future re-use?

Or merge them with discoveries by others about the same data sets?
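Pearson works in R, but the first-look idea translates. A back-of-the-envelope pandas version (the file, column names and sentinel codes are mine, not his):

    # Rough analogue of an initial dataset characterization: compare declared
    # types with inferred types and count values that often turn out to be
    # disguised missing data. Everything named here is hypothetical.
    import pandas as pd

    df = pd.read_csv("new_dataset.csv")

    print(df.dtypes)        # what pandas infers vs. what the metadata claims
    print(df.describe())    # implausible ranges are worth writing down

    suspects = [0, 99, 999, -1]
    for col in df.select_dtypes("number").columns:
        counts = {s: int((df[col] == s).sum()) for s in suspects}
        if any(counts.values()):
            print(col, counts)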

October 26, 2012

BigML creates a marketplace for Predictive Models

Filed under: Data,Machine Learning,Prediction,Predictive Analytics — Patrick Durusau @ 4:42 pm

BigML creates a marketplace for Predictive Models by Ajay Ohri.

From the post:

BigML has created a marketplace for selling Datasets and Models. This is a first (?) as the closest market for Predictive Analytics till now was Rapid Miner’s marketplace for extensions (at http://rapidupdate.de:8180/UpdateServer/faces/index.xhtml)

From http://blog.bigml.com/2012/10/25/worlds-first-predictive-marketplace/

SELL YOUR DATA

You can make your Dataset public. Mind you: the Datasets we are talking about are BigML’s fancy histograms. This means that other BigML users can look at your Dataset details and create new models based on this Dataset. But they can not see individual records or columns or use it beyond the statistical summaries of the Dataset. Your Source will remain private, so there is no possibility of anyone accessing the raw data.

SELL YOUR MODEL

Now, once you have created a great model, you can share it with the rest of the world. For free or at any price you set. Predictions are paid for in BigML Prediction Credits. The minimum price is ‘Free’ and the maximum price indicated is 100 credits.

Having a public, digital marketplace for data and data analysis has been proposed by many and attempted by more than just a few.

Data is bought and sold today, but not by the digital equivalent of small shopkeepers. The shopkeepers who changed the face of Europe.

Data is bought and sold today by the digital equivalent of the great feudal lords. Complete with castles (read silos).

Will BigML give rise to a new mercantile class?

Or just as importantly, will you be a member of it or bound to the estate of a feudal lord?

October 25, 2012

Data Preparation: Know Your Records!

Filed under: Data,Data Quality,Semantics — Patrick Durusau @ 10:25 am

Data Preparation: Know Your Records! by Dean Abbott.

From the post:

Data preparation in data mining and predictive analytics (dare I also say Data Science?) rightfully focuses on how the fields in one’s data should be represented so that modeling algorithms either will work properly or at least won’t be misled by the data. These data preprocessing steps may involve filling missing values, reining in the effects of outliers, transforming fields so they better comply with algorithm assumptions, binning, and much more. In recent weeks I’ve been reminded how important it is to know your records. I’ve heard this described in many ways, four of which are:
the unit of analysis
the level of aggregation
what a record represents
unique description of a record

A bit further on Dean reminds us:

What isn’t always obvious is when our assumptions about the data result in unexpected results. What if we expect the unit of analysis to be customerID/Session but there are duplicates in the data? Or what if we had assumed customerID/Session data but it was in actuality customerID/Day data (where one’s customers typically have one session per day, but could have a dozen)? (emphasis added)

Obvious once Dean says it, but how often do you question assumptions about data?

Do you know what impact incorrect assumptions about data will have on your operations?

If you investigate your assumptions about data, where do you record your observations?

Or will you repeat the investigation with every data dump from a particular source?

Describing data “in situ” could benefit you six months from now, or your successor. (The data and/or its fields would be treated as subjects in a topic map.)
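One quick way to test the “unit of analysis” assumption Dean describes is to check whether your assumed key is actually unique. A pandas sketch (column names are hypothetical):

    # Does each (customerID, sessionID) pair really occur once? If not, the
    # assumed unit of analysis does not hold. Column names are made up.
    import pandas as pd

    df = pd.read_csv("sessions.csv")
    assumed_key = ["customerID", "sessionID"]

    dupes = df[df.duplicated(subset=assumed_key, keep=False)]
    unique_keys = df[assumed_key].drop_duplicates().shape[0]

    print(f"{len(df)} rows, {unique_keys} unique keys")
    if not dupes.empty:
        print("Assumed unit of analysis does not hold; sample offenders:")
        print(dupes.sort_values(assumed_key).head(10))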

October 23, 2012

Open Data vs. Private Data?

Filed under: Data,Open Data — Patrick Durusau @ 4:38 am

Why Government Should Care Less About Open Data and More About Data by Andrea Di Maio.

From the post:

Among the flurry of activities and deja-vu around open data that governments worldwide, in all tiers are pursuing to increase transparency and fuel a data economy, I found something really worth reading in a report that was recently published by the Danish government.

“Good Basic Data for Everyone – A Driver for Growth and Efficiency” takes a different spin than many others by saying that:

Basic data is the core information authorities use in their day-to-day case processing. Basic data is e.g. data on individuals, businesses, properties, addresses and geography. This information, called basic data, is reused throughout the public sector. Reuse of high-quality data is an essential basis for public authorities to perform their tasks properly and efficiently. Basic data can include personal data.

While most of the categories are open data, the novelty is that for the first time personal and open data is seen for what it is, i.e. data. The document suggests the development of a Data Distributor, which would be responsible for conveying data from different data sources to its consumers, both inside and outside government. The document also assumes that personal data may be ultimately distributed via a common public-sector data distributor.

Besides what is actually written in the document, this opens the door for a much needed shift from service orientation to data orientation in government service delivery. Stating that data must flow freely across organizational boundaries, irrespective of the type of data (and of course within appropriate policy constraints) is hugely important to lay the foundations for effective integration of services and processes across agencies, jurisdictions, tiers and constituencies.

Combining this with some premises of the US Digital Strategy, which highlights an information layer distinct from a platform layer, which is in turn distinct from a presentation layer, one starts seeing a move toward the centrality of data, which may finally lead to the emergence of citizen data stores that would put control of service access and integration in the hands of individuals.

If there is novelty in the Danish approach, it comes from being “open data.” That is, all citizens can draw equally on the “basic data” for whatever purpose.

Property records, geographic, geological and other maps, plus addresses were combined long ago in the United States as “private data.”

Despite the data being collected at taxpayer expense, private industry sells access to collated public data.

Open data may provide businesses with collated public data at a lower cost, but at an expense to the public.

What is known as a false dilemma: We can buy back data government collected on our behalf or we can pay government to collect and collate it for the few.


The “individual being in charge of their data” is too obvious a fiction to delay us here. It isn’t true now and there are no signs it will become true. If you doubt that, restrict the distribution of your credit report. Post a note when you accomplish that task.

October 11, 2012

IBM Redbooks

Filed under: Books,Data,Marketing,Topic Maps — Patrick Durusau @ 2:22 pm

IBM Redbooks

You can look at this resource one of two ways:

First, as a great source of technical information about mostly IBM products and related technologies.

Second, as a starting point of IBM content for mining and navigation using a topic map.

May not be of burning personal interest to you, but to IBM clients, consultants and customers?

Here’s one pitch:

How much time do you spend searching the WWW and IBM sites for answers to IBM software questions? In a week? In a month?

Try (TM4IBM-Product-Name) for a week or a month. Then you do the time math.

(I would host a little time keeping applet to “assist” with the record keeping.)

Using (Spring Data) Neo4j for the Hubway Data Challenge [Boston Biking]

Filed under: Challenges,Data,Dataset,Graphs,Neo4j,Networks,Spring — Patrick Durusau @ 12:33 pm

Using (Spring Data) Neo4j for the Hubway Data Challenge by Michael Hunger.

From the post:

Using Spring Data Neo4j it was incredibly easy to model and import the Hubway Challenge dataset into a Neo4j graph database, to make it available for advanced querying and visualization.

The Challenge and Data

Tonight @graphmaven pointed me to the boston.com article about the Hubway Data Challenge.

(graphics omitted)

Hubway is a bike sharing service which is currently expanding worldwide. In the Data challenge they offer the CSV-data of their 95 Boston stations and about half a million bike rides up until the end of September. The challenge is to provide answers to some posted questions and develop great visualizations (or UI’s) for the Hubway data set. The challenge is also supported by MAPC (Metropolitan Area Planning Council).

Useful import tips for data into Neo4j and on modeling this particular dataset.

Not to mention the resulting database as well!
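If you want a feel for the trips before loading them into Neo4j, a rough pandas pass over the challenge CSV gives you a station-to-station edge list. The column names below are my assumptions; check the headers in the actual download.

    # Build an edge list of (start station, end station, trip count) from the
    # Hubway trips CSV. Column names are assumed, not taken from the dataset.
    import pandas as pd

    trips = pd.read_csv("hubway_trips.csv")

    edges = (trips.groupby(["start_station", "end_station"])
                  .size()
                  .reset_index(name="trips")
                  .sort_values("trips", ascending=False))

    print(edges.head(10))   # the busiest station pairs, i.e. the heaviest edges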

PS: From the challenge site:

Submission will open here on Friday, October 12, 2012.

Deadline

MIDNIGHT (11:59 p.m.) on Halloween,
Wednesday, October 31, 2012.

Winners will be announced on Wednesday, November 7, 2012.

Prizes:

  • A one-year Hubway membership
  • Hubway T-shirt
  • Bern helmet
  • A limited edition Hubway System Map—one of only 61 installed in the original Hubway stations.

For other details, see the challenge site.

October 10, 2012

Interesting large scale dataset: D4D mobile data [Deadline: October 31, 2012]

Filed under: Data,Data Mining,Dataset,Graphs,Networks — Patrick Durusau @ 4:19 pm

Interesting large scale dataset: D4D mobile data by Danny Bickson.

From the post:

I got the following from Prof. Scott Kirkpatrick.

Write a 250-word research project and get access within a week to the largest ever released mobile phone datasets: datasets based on 2.5 billion records, calls and text messages exchanged between 5 million anonymous users over 5 months.

Participation rules: http://www.d4d.orange.com/

Description of the datasets: http://arxiv.org/abs/1210.0137

The “Terms and Conditions” by Orange allows the publication of results obtained from the datasets even if they do not directly relate to the challenge.

Cash prizes for winning participants and an invitation to present the results at the NetMob conference to be held May 2-3, 2013 at the Media Lab at MIT (www.netmob.org).

Deadline: October 31, 2012

Looking to exercise your graph software? Compare to other graph software? Do interesting things with cell phone data?

This could be your chance!

October 8, 2012

Wolfram Data Summit 2012 Presentations [Elves and Hypergraphs = Topic Maps?]

Filed under: Combinatorics,Conferences,Data,Data Mining — Patrick Durusau @ 1:39 pm

Wolfram Data Summit 2012 Presentations

Presentations have been posted from the Wolfram Data Summit 2012:

I looked at:

“The Trouble with House Elves: Computational Folkloristics, Classification, and Hypergraphs” Timothy Tangherlini, Professor, UCLA; James Abello, Research Professor, DIMACS – Rutgers University

first. 😉

Would like to see a video of the presentation. Pointers anyone?

Close as I can imagine to being a topic map without using the phrase “topic map.”

Others?

Thursday, September 8

  • Presentation “Who’s Bigger? A Quantitative Analysis of Historical Fame” Steven Skiena, Professor, Stony Brook University
  • Presentation “Academic Data: A Funder’s Perspective” Myron Gutmann, Assistant Director, Social, Behavioral & Economic Sciences, National Science Foundation (NSF)
  • Presentation “Who Owns the Law?” Ed Walters, CEO, Fastcase, Inc.
  • Presentation “An Initiative to Improve Academic and Commercial Data Sharing in Cancer Research” Charles Hugh-Jones, Vice President, Head, Medical Affairs North America, Sanofi
  • Presentation “The Trouble with House Elves: Computational Folkloristics, Classification, and Hypergraphs” Timothy Tangherlini, Professor, UCLA; James Abello, Research Professor, DIMACS – Rutgers University
  • Presentation “Rethinking Digital Research” Kaitlin Thaney, Manager, External Partnerships, Digital Science
  • Presentation “Building and Learning from Social Networks” Chris McConnell, Principal Software Development Lead, Microsoft Research FUSE Labs
  • Presentation “Keeping Repositories in Synchronization: NISO/OAI ResourceSync Project” Todd Carpenter, Executive Director, NISO
  • Presentation “A New, Searchable SDMX Registry of Country-Level Health, Education, and Financial Data” Chris Dickey, Director, Research and Innovations, DevInfo Support Group
  • Presentation “Dryad’s Evolving Proof of Concept and the Metadata Hook” Jane Greenberg, Professor, School of Information and Library Science (SILS), University of North Carolina at Chapel Hill
  • Presentation “How the Associated Press Tabulates and Distributes Votes in US Elections” Brian Scanlon, Director of Election Services, The Associated Press
  • Presentation “How Open Is Open Data?” Ian White, President, Urban Mapping, Inc.
  • Presentation “No More Tablets of Stone: Enabling the User to Weight Our Data and Shape Our Research” Toby Green, Head of Publishing, Organisation for Economic Co-operation and Development (OECD)
  • Presentation “Sharing and Protecting Confidential Data: Real-World Examples” Timothy Mulcahy, Principal Research Scientist, NORC at the University of Chicago
  • Presentation “Language Models That Stimulate Creativity” Matthew Huebert, Programmer/Designer, BrainTripping
  • Presentation “The Analytic Potential of Long-Tail Data: Sharable Data and Reuse Value” Carole Palmer, Center for Informatics Research in Science & Scholarship, University of Illinois at Urbana-Champaign
  • Presentation “Evolution of the Storage Brain—Using History to Predict the Future” Larry Freeman, Senior Technologist, NetApp, Inc.

Friday, September 9

  • Presentation “Devices, Data, and Dollars” John Burbank, President, Strategic Initiatives, The Nielsen Company
  • Presentation “Pulling Structured Data Out of Unstructured” Greg Lindahl, CTO, blekko
  • Presentation “Mining Consumer Data for Insights and Trends” Rohit Chauhan, Group Executive, MasterCard Worldwide
  • Presentation “Data Quality and Customer Behavioral Modeling” Daniel Krasner, Chief Data Scientist, Sailthru/KFit Solutions
  • No presentation available. “Human-Powered Analysis with Crowdsourcing and Visualization” Edwin Chen, Data Scientist, Twitter
  • Presentation “Leveraging Social Media Data as Real-Time Indicators of X” Maria Singson, Vice President, Country and Industry Research & Forecasting, IHS; Chris Hansen, Director, IHS; Dan Bergstresser, Chief Economist, Janys Analytics
  • No presentation available. “Visualizations in Yelp” Jim Blomo, Engineering Manager, Data-Mining, Yelp
  • Presentation “The Digital Footprints of Human Activity” Stanislav Sobolevsky, MIT SENSEable City Lab
  • Presentation “Unleash Your Research: The Wolfram Data Repository” Matthew Day, Manager, Data Repository, Wolfram Alpha LLC
  • Presentation “Quantifying Online Discussion: Unexpected Conclusions from Mass Participation” Sascha Mombartz, Creative Director, Urtak
  • Presentation “Statistical Physics for Non-physicists: Obesity Spreading and Information Flow in Society” Hernán Makse, Professor, City College of New York
  • Presentation “Neuroscience Data: Past, Present, and Future” Chinh Dang, CTO, Allen Institute for Brain Science
  • Presentation “Finding Hidden Structure in Complex Networks” Yong-Yeol Ahn, Assistant Professor, Indiana University Bloomington
  • Presentation “Data Challenges in Health Monitoring and Diagnostics” Anthony Smart, Chief Science Officer, Scanadu
  • No presentation available. “Datascience Automation with Wolfram|Alpha Pro” Taliesin Beynon, Manager and Development Lead, Wolfram Alpha LLC
  • Presentation “How Data Science, the Web, and Linked Data Are Changing Medicine” Joanne Luciano, Research Associate Professor, Rensselaer Polytechnic Institute
  • Presentation “Unstructured Data and the Role of Natural Language Processing” Philip Resnik, Professor, University of Maryland
  • Presentation “A Framework for Measuring Social Quality of Content Based on User Behavior” Nanda Kishore, CTO, ShareThis, Inc.
  • Presentation “The Science of Social Data” Hilary Mason, Chief Scientist, bitly
  • Presentation “Big Data for Small Languages” Laura Welcher, Director of Operations, The Rosetta Project
  • Presentation “Moving from Information to Insight” Anthony Scriffignano, Senior Vice President, Worldwide Data & Insight, Dun and Bradstreet

PS: I saw this in Christophe Lalanne’s A bag of tweets / September 2012 and reformatted the page to make it easier to consult.

October 5, 2012

Storing Topic Map Data at $136/TB

Filed under: Data,Storage — Patrick Durusau @ 3:30 pm

Steve Streza describes his storage system in My Giant Hard Drive: Building a Storage Box with FreeNAS.

At his prices, about $136/TB for 11 TB of storage.

Large enough for realistic simulations of data mining or topic mapping. When you want to step up to production, spin up services on one of the clouds.

Not sure it will last you several years as Steve projects but it should last long enough to be worth the effort.

From the post:

For many years, I’ve had a lot of hard drives being used for data storage. Movies, TV shows, music, apps, games, backups, documents, and other data have been moved between hard drives and stored in inconsistent places. This has always been the cheap and easy approach, but it has never been really satisfying. And with little to no redundancy, I’ve suffered a non-trivial amount of data loss as drives die and files get lost. Now, I’m not alone to have this problem, and others have figured out ways of solving it. One of the most interesting has been in the form of a computer dedicated to one thing: storing data, and lots of it. These computers are called network-attached storage, or NAS, computers. A NAS is a specialized computer that has lots of hard drives, a fast connection to the local network, and…that’s about it. It doesn’t need a high-end graphics card, or a 20-inch monitor, or other things we typically associate with computers. It just sits on the network and quietly serves and stores files. There are off-the-shelf boxes you can buy to do this, such as machines made by Synology or Drobo, and you can assemble one yourself for the job.

I’ve been considering making a NAS for myself for over a year, but kept putting it off due to expense and difficulty. But a short time ago, I finally pulled the trigger on a custom assembled machine for storing data. Lots of it; almost 11 terabytes of storage, in fact. This machine is made up of 6 hard drives, and is capable of withstanding a failure on two of them without losing a single file. If any drives do fail, I can replace them and keep on working. And these 11 terabytes act as one giant hard drive, not as 6 independent ones that have to be organized separately. It’s an investment in my storage needs that should grow as I need it to, and last several years.

October 1, 2012

PDS – Planetary Data System [The Mother Lode]

Filed under: Astroinformatics,Data — Patrick Durusau @ 4:35 pm

PDS – Planetary Data System

From the webpage:

The PDS archives and distributes scientific data from NASA planetary missions, astronomical observations, and laboratory measurements. The PDS is sponsored by NASA’s Science Mission Directorate. Its purpose is to ensure the long-term usability of NASA data and to stimulate advanced research

Tools, data, guides, etc.

Quick searches include:

  • Mercury
  • Venus
  • Mars
  • Jupiter
  • Saturn
  • Uranus, Neptune, Pluto
  • Rings
  • Asteroids
  • Comets
  • Planetary Dust
  • Earth’s Moon
  • Solar Wind

The ordering here makes a little more sense to me. What about you?

A nice way to teach scientific, mathematical and computer literacy without making it seem like work. 😉

Planetary Data System – Geosciences Node

Filed under: Astroinformatics,Data,Geographic Data — Patrick Durusau @ 3:22 pm

Sounds like SciFi, yes? SciFi? No!

After seeing Google add some sea bed material to Google Maps, I started to wonder about radar based maps of other places. Like the Moon.

I remember the excitement Ranger 7 images generated. And that in grainy newspaper reproductions.

With just a little searching, I came across the PDS (Planetary Data System) Geosciences Node (Washington University in St. Louis).

From the web page:

The Geosciences Node of NASA’s Planetary Data System (PDS) archives and distributes digital data related to the study of the surfaces and interiors of terrestrial planetary bodies. We work directly with NASA missions to help them generate well-documented, permanent data archives. We provide data to NASA-sponsored researchers along with expert assistance in using the data. All our archives are online and available to the public to download free of charge.

Which includes:

  • Mars
  • Venus
  • Mercury
  • Moon
  • Earth (test data for other planetary surfaces)
  • Asteroids
  • Gravity Models

Even after checking the FAQ, I can’t explain the ordering of these entries. Order from the Sun doesn’t work. Neither does order by distance from Earth. Nor alphabetical sort order. Suggestions?

In any event, enjoy the data set!

September 22, 2012

Datasets! Datasets! Get Your Datasets Here!

Filed under: Data,Dataset — Patrick Durusau @ 3:59 pm

Datasets from René Pichardt’s group:

The project KONECT (Koblenz Network Collection) has extracted and made available four new network datasets based on information in the English Wikipedia, using data from the DBpedia project. The four network datasets are: The bipartite network of writers and their works (113,000 nodes and 122,000 edges) The bipartite network of producers and the works they […]

Assume you have a knowledge base containing entities and their properties or relations with other entities. For instance, think of a knowledge base about movies, actors and directors. For the movies you have structured knowledge about the title and the year they were made in, while for the actors and directors you might have their […]

The Institute for Web Science and Technologies (WeST) at the University of Koblenz-Landau is making available a new series of datasets: The Wikipedia hyperlink networks in the eight largest Wikipedia languages: http://konect.uni-koblenz.de/networks/wikipedia_link_en – English http://konect.uni-koblenz.de/networks/wikipedia_link_de – German http://konect.uni-koblenz.de/networks/wikipedia_link_fr – French http://konect.uni-koblenz.de/networks/wikipedia_link_ja – Japanese http://konect.uni-koblenz.de/networks/wikipedia_link_it – Italian http://konect.uni-koblenz.de/networks/wikipedia_link_pt – Portuguese http://konect.uni-koblenz.de/networks/wikipedia_link_ru – Russian The largest dataset, […]

I found an article about ohloh, a directory created by Black Duck Software with over 500,000 open source projects. They offer a RESTful API and the data is available under the Creative Commons Attribution 3.0 licence. An interesting aspect is Kudos. With a Kudo, an ohloh user can thank another user for his or her contribution, so […]

I started to mention these earlier in the week but decided they needed a separate post.
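If you grab one of the KONECT networks, loading it is straightforward, assuming the usual KONECT layout of whitespace-separated source/target pairs with ‘%’ comment lines. A networkx sketch (the local file name is hypothetical, and the full English Wikipedia link network will want a lot of memory):

    # Load a KONECT edge list into networkx and report its size.
    # File name and format details are assumptions; check the README
    # that ships with the dataset.
    import networkx as nx

    G = nx.read_edgelist(
        "out.wikipedia_link_en",
        comments="%",
        create_using=nx.DiGraph(),
        nodetype=int,
        data=False,
    )

    print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")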

September 13, 2012

The US poverty map in 2011 [Who Defines Poverty?]

Filed under: Data,Semantics — Patrick Durusau @ 4:18 pm

The US poverty map in 2011 by Simon Rogers.

From the post:

New figures from the US census show that 46.2 million Americans live in poverty and another 48.6m have no health insurance. In Maryland, the median income is $68,876, in Kentucky it is $39,856, some $10,054 below the US average. Click on each state below to see the data – or use the dropdown to see the map change

As always an interesting presentation of data (along with access to the raw data).

But what about “poverty” in the United States versus “poverty” in other places?

The World Bank’s “Poverty” page reports in part:

  • Poverty headcount ratio at $1.25 a day (PPP) (% of population)
    • East Asia & Pacific
    • Europe & Central Asia
    • Latin America & Caribbean
    • Middle East & North Africa
    • South Asia
    • Sub-Saharan Africa
  • Poverty headcount ratio at $2 a day (PPP) (% of population)
    • East Asia & Pacific
    • Europe & Central Asia
    • Latin America & Caribbean
    • Middle East & North Africa
    • South Asia
    • Sub-Saharan Africa

What area is missing from this list?

Can you say: “North America?”

The poverty rate per day for North America is an important comparison point in discussions of global trade, environment and similar issues.

Can you point me towards more comprehensive comparison data?


PS: $2 per day is $730 annual. $1.25 per day is $456.25 annual.

Europeana opens up data on 20 million cultural items

Filed under: Archives,Data,Dataset,Europeana,Library,Museums — Patrick Durusau @ 3:25 pm

Europeana opens up data on 20 million cultural items by Jonathan Gray (Open Knowledge Foundation):

From the post:

Europe‘s digital library Europeana has been described as the ‘jewel in the crown’ of the sprawling web estate of EU institutions.

It aggregates digitised books, paintings, photographs, recordings and films from over 2,200 contributing cultural heritage organisations across Europe – including major national bodies such as the British Library, the Louvre and the Rijksmuseum.

Today [Wednesday, 12 September 2012] Europeana is opening up data about all 20 million of the items it holds under the CC0 rights waiver. This means that anyone can reuse the data for any purpose – whether using it to build applications to bring cultural content to new audiences in new ways, or analysing it to improve our understanding of Europe’s cultural and intellectual history.

This is a coup d’etat for advocates of open cultural data. The data is being released after a grueling and unenviable internal negotiation process that has lasted over a year – involving countless meetings, workshops, and white papers presenting arguments and evidence for the benefits of openness.

That is good news!

A familiar issue that it overcomes:

To complicate things even further, many public institutions actively prohibit the redistribution of information in their catalogues (as they sell it to – or are locked into restrictive agreements with – third party companies). This means it is not easy to join the dots to see which items live where across multiple online and offline collections.

Oh, yeah! That was one of Google’s reasons for pulling the plug on the Open Knowledge Graph. Google had restrictive agreements so you can only connect the dots with Google products. (I think there is a name for that, let me think about it. Maybe an EU prosecutor might know it. You could always ask.)

What are you going to be mapping from this collection?

Prison Polling [If You Don’t Ask, You Won’t Know]

Filed under: Data,Design,Statistics — Patrick Durusau @ 9:44 am

Prison Polling by Carl Bialik.

From the post:

My print column examines the argument of a book out this week that major federal surveys are missing an important part of the population by not polling prisoners.

“We’re missing 1% of the population,” said Becky Pettit, a University of Washington sociologist and author of the book, “Invisible Men.” “People might say, ‘That’s not a big deal.’” But it is for some groups, she writes — particularly young black men. And for young black men, especially those without a high-school diploma, official statistics paint a rosier picture than reality on factors such as employment and voter turnout.

“Because many surveys skip institutionalized populations, and because we incarcerate lots of people, especially young black men with low levels of education, certain statistics can look rosier than if we included” prisoners in surveys, said Jason Schnittker, a sociologist at the University of Pennsylvania. “Whether you regard the impact as ‘massive’ depends on your perspective. The problem of incarceration tends to get swept under the rug in lots of different ways, rendering the issue invisible.”

A reminder that assumptions are cooked into data long before it reaches us for analysis.

If we don’t ask questions about data collection, we may be passing on results that don’t serve the best interests of our clients.

So for population data, ask (among other things):

  • Who was included/excluded?
  • How were the included selected?
  • On what basis were people excluded?
  • Where are the survey questions?
  • By what means were the questions asked? (phone, web, in person)
  • Time of day of survey?

and I am sure there are others.

Don’t be impressed by protests that your questions are irrelevant or the source has already “accounted” for that issue.

Right.

When someone protests you don’t need to know, you know where to push. Trust me on that one.

September 8, 2012

Women’s representation in media:… [Counting and Evaluation]

Filed under: Data,Dataset,News — Patrick Durusau @ 10:46 am

Women’s representation in media: the best data on the subject to date

From the post:

In the first of a series of datablog posts looking at women in the media, we present one year of every article published by the Guardian, Telegraph and Daily Mail, with each article tagged by section, gender, and social media popularity.

(images omitted)

The Guardian datablog has joined forces with J. Nathan Matias of the MIT media lab and data scientist Lynn Cherny to collect what is, to our knowledge, the most comprehensive, high-resolution dataset available on news content by gender and audience interest.

The dataset covers July 2011 to June 2012. The post describes the data collection and some rough counts by gender, etc. More analysis to follow.

The data itself should not be affected by a claim like this one from the post:

Opinion sections can shape a society’s opinions and therefore are an important measure of women’s voices in society.

It isn’t clear how those claims go together.

Anything being possible, the statement that “…opinion sections can shape a society’s opinions…” is trivially true.

But even if true (an unwarranted assumption), how does that lead to it being “…an important measure of women’s voices in society[?]”

Could be true and have nothing to do with measuring “…women’s voices in society.”

Could be false and have nothing to do with measuring “…women’s voices in society.”

As well as the other possibilities.

Just because we can count something doesn’t imbue it with relevance for something else that is harder to evaluate.

Women’s voices in society are important. Let’s not demean them by grabbing the first thing we can count as their measure.

August 10, 2012

First BOSS Data: 3-D Map of 500,000 Galaxies, 100,000 Quasars

Filed under: Astroinformatics,Data,Science — Patrick Durusau @ 9:02 am

First BOSS Data: 3-D Map of 500,000 Galaxies, 100,000 Quasars

From the post:

The Third Sloan Digital Sky Survey (SDSS-III) has issued Data Release 9 (DR9), the first public release of data from the Baryon Oscillation Spectroscopic Survey (BOSS). In this release BOSS, the largest of SDSS-III’s four surveys, provides spectra for 535,995 newly observed galaxies, 102,100 quasars, and 116,474 stars, plus new information about objects in previous Sloan surveys (SDSS-I and II).

“This is just the first of three data releases from BOSS,” says David Schlegel of the U.S. Department of Energy’s Lawrence Berkeley National Laboratory (Berkeley Lab), an astrophysicist in the Lab’s Physics Division and BOSS’s principal investigator. “By the time BOSS is complete, we will have surveyed more of the sky, out to a distance twice as deep, for a volume more than five times greater than SDSS has surveyed before — a larger volume of the universe than all previous spectroscopic surveys combined.”

Spectroscopy yields a wealth of information about astronomical objects including their motion (called redshift and written “z”), their composition, and sometimes also the density of the gas and other material that lies between them and observers on Earth. The BOSS spectra are now freely available to a public that includes amateur astronomers, astronomy professionals who are not members of the SDSS-III collaboration, and high-school science teachers and their students.

The new release lists spectra for galaxies with redshifts up to z = 0.8 (roughly 7 billion light years away) and quasars with redshifts between z = 2.1 and 3.5 (from 10 to 11.5 billion light years away). When BOSS is complete it will have measured 1.5 million galaxies and at least 150,000 quasars, as well as many thousands of stars and other “ancillary” objects for scientific projects other than BOSS’s main goal.

For data access, software tools, tutorials, etc., see: http://sdss3.org/

Interesting data set but also instructive for the sharing of data and development of tools for operations on shared data. You don’t have to have a local supercomputer to process the data. Dare I say a forerunner of the “cloud?”

Be the alpha geek at your local astronomy club this weekend!
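PS: If you want to sanity-check the “roughly 7 billion light years” figure for z = 0.8, astropy’s cosmology module will do it in a few lines. The cosmological parameters below are generic assumptions, not the ones BOSS itself adopts.

    # Lookback time for a few BOSS redshifts; the value in Gyr corresponds to
    # how many billions of years the light has been traveling.
    from astropy.cosmology import FlatLambdaCDM

    cosmo = FlatLambdaCDM(H0=70, Om0=0.3)   # assumed parameters

    for z in (0.8, 2.1, 3.5):
        print(f"z = {z}: lookback time ~ {cosmo.lookback_time(z):.1f}")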

August 4, 2012

Using the flickr XML/API as a source of RSS feeds

Filed under: Data,XML,XSLT — Patrick Durusau @ 2:07 pm

Using the flickr XML/API as a source of RSS feeds by Pierre Lindenbaum.

Pierre has created an XSLT stylesheet to transform XML from flickr into an RSS feed.

Something for your data harvesting recipe box.
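If XSLT processors aren’t part of your toolchain, lxml will run a stylesheet like Pierre’s from Python. Both file names here are placeholders.

    # Apply an XSLT stylesheet to a flickr API response and print the RSS.
    # "flickr2rss.xsl" and "flickr_api_response.xml" are hypothetical names.
    from lxml import etree

    transform = etree.XSLT(etree.parse("flickr2rss.xsl"))
    result = transform(etree.parse("flickr_api_response.xml"))

    print(str(result))   # the generated RSS feed as text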

July 27, 2012

London 2012 Olympic athletes: the full list

Filed under: Data,Dataset — Patrick Durusau @ 4:10 am

London 2012 Olympic athletes: the full list

Simon Rogers of the Guardian reports scraping together the full list of Olympic athletes into a single data set.

Simon says:

We’ve just scratched the surface of this dataset – you can download it below. What can you do with it?

I would ask the question somewhat differently: Having the data set, what can you reliably add to it?

Aggregate data analysis is interesting but then so is aggregated data on the individual athletes.

PS: If you do something interesting with the data set, be sure to let the Guardian know.

July 26, 2012

How to Track Your Data: Rule-Based Data Provenance Tracing Algorithms

Filed under: Data,Provenance — Patrick Durusau @ 3:43 pm

How to Track Your Data: Rule-Based Data Provenance Tracing Algorithms by Zhang, Qing Olive; Ko, Ryan K L; Kirchberg, Markus; Suen, Chun-Hui; Jagadpramana, Peter; Lee, Bu Sung.

Abstract:

As cloud computing and virtualization technologies become mainstream, the need to be able to track data has grown in importance. Having the ability to track data from its creation to its current state or its end state will enable the full transparency and accountability in cloud computing environments. In this paper, we showcase a novel technique for tracking end-to-end data provenance, a meta-data describing the derivation history of data. This breakthrough is crucial as it enhances trust and security for complex computer systems and communication networks. By analyzing and utilizing provenance, it is possible to detect various data leakage threats and alert data administrators and owners; thereby addressing the increasing needs of trust and security for customers’ data. We also present our rule-based data provenance tracing algorithms, which trace data provenance to detect actual operations that have been performed on files, especially those under the threat of leaking customers’ data. We implemented the cloud data provenance algorithms into an existing software with a rule correlation engine, show the performance of the algorithms in detecting various data leakage threats, and discuss technically its capabilities and limitations.

Interesting work but data provenance isn’t solely a cloud computing, virtualization issue.

Consider the ongoing complaints in Washington, D.C. on who leaked what to who and why?

All posturing to one side, that is a data provenance and subject identity based issue.

The sort of thing where a topic map application could excel.
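To make the idea concrete (this is a toy illustration, not the authors’ algorithm): record each file operation as a provenance entry, then trace a file’s derivation chain back to its sources.

    # Toy provenance trace: walk a log of file operations backwards from a
    # target file to its original source. All data here is made up.
    provenance_log = [
        {"op": "create", "target": "raw.csv",    "source": None},
        {"op": "copy",   "target": "work.csv",   "source": "raw.csv"},
        {"op": "derive", "target": "report.pdf", "source": "work.csv"},
    ]

    def trace(target, log):
        """Return the chain of operations that produced `target`."""
        chain, current = [], target
        while current is not None:
            entry = next((e for e in log if e["target"] == current), None)
            if entry is None:
                break
            chain.append(entry)
            current = entry["source"]
        return chain

    for step in trace("report.pdf", provenance_log):
        print(step["op"], step["target"], "<-", step["source"])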

July 25, 2012

London Olympics: download the full schedule as open data

Filed under: Data — Patrick Durusau @ 7:00 pm

London Olympics: download the full schedule as open data

From the Guardian, where so much useful data gathers.

July 20, 2012

Data Jujitsu: The art of turning data into product

Filed under: Data,Marketing,Topic Maps — Patrick Durusau @ 11:00 am

Data Jujitsu: The art of turning data into product: Smart data scientists can make big problems small by DJ Patil.

From the post:

Having worked in academia, government and industry, I’ve had a unique opportunity to build products in each sector. Much of this product development has been around building data products. Just as methods for general product development have steadily improved, so have the ideas for developing data products. Thanks to large investments in the general area of data science, many major innovations (e.g., Hadoop, Voldemort, Cassandra, HBase, Pig, Hive, etc.) have made data products easier to build. Nonetheless, data products are unique in that they are often extremely difficult, and seemingly intractable for small teams with limited funds. Yet, they get solved every day.

How? Are the people who solve them superhuman data scientists who can come up with better ideas in five minutes than most people can in a lifetime? Are they magicians of applied math who can cobble together millions of lines of code for high-performance machine learning in a few hours? No. Many of them are incredibly smart, but meeting big problems head-on usually isn’t the winning approach. There’s a method to solving data problems that avoids the big, heavyweight solution, and instead, concentrates building something quickly and iterating. Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small.

We call this Data Jujitsu: the art of using multiple data elements in clever ways to solve iterative problems that, when combined, solve a data problem that might otherwise be intractable. It’s related to Wikipedia’s definition of the ancient martial art of jujitsu: “the art or technique of manipulating the opponent’s force against himself rather than confronting it with one’s own force.”

How do we apply this idea to data? What is a data problem’s “weight,” and how do we use that weight against itself? These are the questions that we’ll work through in the subsequent sections.

To start, for me, a good definition of a data product is a product that facilitates an end goal through the use of data. It’s tempting to think of a data product purely as a data problem. After all, there’s nothing more fun than throwing a lot of technical expertise and fancy algorithmic work at a difficult problem. That’s what we’ve been trained to do; it’s why we got into this game in the first place. But in my experience, meeting the problem head-on is a recipe for disaster. Building a great data product is extremely challenging, and the problem will always become more complex, perhaps intractable, as you try to solve it.

Before investing in a big effort, you need to answer one simple question: Does anyone want or need your product? If no one wants the product, all the analytical work you throw at it will be wasted. So, start with something simple that lets you determine whether there are any customers. To do that, you’ll have to take some clever shortcuts to get your product off the ground. Sometimes, these shortcuts will survive into the finished version because they represent some fundamentally good ideas that you might not have seen otherwise; sometimes, they’ll be replaced by more complex analytic techniques. In any case, the fundamental idea is that you shouldn’t solve the whole problem at once. Solve a simple piece that shows you whether there’s an interest. It doesn’t have to be a great solution; it just has to be good enough to let you know whether it’s worth going further (e.g., a minimum viable product).

Here’s the question to ask for an open source topic map project:

Does anyone want or need your product?

Ouch!

A few of us, not enough to make a small market, like to have topic maps as interesting computational artifacts.

For a more viable (read larger) market, we need to sell data products topic maps can deliver.

How we create or deliver that product, hypergraphs, elves chained to desks, quantum computers or even magic, doesn’t matter to any sane end user.

What matters is the utility of the data product for some particular need or task.

No, I don’t know what data product to suggest. If I did, it would have been the first thing I would have said.

Suggestions?

PS: Read DJ’s post in full. Every other day or so until you have a successful, topic map based, data product.

July 19, 2012

Following Even More of the Money

Filed under: Data,Politics — Patrick Durusau @ 3:27 pm

Following Even More of the Money By Derek Willis.

From the post:

Since we last rolled out new features in the Campaign Finance API, news organizations such as ProPublica and Mother Jones have used them to build interactive features about presidential campaigns, Super PACs and their funders. As the November election approaches, we’re announcing some additions and improvements to the API. We hope these enhancements will help others create web applications and graphics that help explain the connections between money and elections. This round of updates does not include any deprecations or backwards-incompatible changes, which is why we’re not changing the version number.

Welcome news from the NY Times on campaign finance data.

I can’t say that I follow their logic on version numbering but they are a news organization, not a software development house. 😉

July 18, 2012

Data Mining Projects (Ping Chen)

Filed under: Data,Data Mining — Patrick Durusau @ 6:59 pm

Data Mining Projects

From the webpage:

This is the website for the Data Mining CS 4319 class projects. Here you will find all of the information and data files you will need to begin working on the project you have selected for this semester. Please click on the link on the left hand side corresponding to your project to begin. Development of the projects hosted in this website is funded by NSF Award DUE 0737408.

Projects with resources and files are:

  • Netflix
  • Word Relevance Measures
  • Identify Time
  • Orbital Debris Analysis
  • Oil Exploration
  • Environmental Data Analysis
  • Association Rule Pre-Processing
  • Neural Network-Based Financial Market Forecasting
  • Identify Locations From a Webpage
  • Co-reference Resolution
  • Email Visualization

Now there is a broad selection of data mining projects!

BTW, be careful of the general Netflix file. It is 665 MB so don’t attempt it on airport WiFi.

I first saw this at KDNuggets.

PS: I can’t swear to the dates of the class but the grant ran from 2008 to 2010.

July 11, 2012

Importing public data with SAS instructions into R

Filed under: Data,Government Data,Parsing,Public Data,R — Patrick Durusau @ 2:28 pm

Importing public data with SAS instructions into R by David Smith.

From the post:

Many public agencies release data in a fixed-format ASCII (FWF) format. But with the data all packed together without separators, you need a “data dictionary” defining the column widths (and metadata about the variables) to make sense of them. Unfortunately, many agencies make such information available only as a SAS script, with the column information embedded in a PROC IMPORT statement.

David reports on the SAScii package from Anthony Damico.

You still have to parse the files but it gets you one step closer to having useful information.
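If you end up on the Python side instead of R, pandas covers the fixed-width part once you know the column positions, which is exactly what those SAS scripts encode. The positions and names below are hypothetical.

    # Read a fixed-width public data file with explicit column positions.
    # colspecs and names are placeholders for what the SAS script describes.
    import pandas as pd

    colspecs = [(0, 2), (2, 6), (6, 14)]     # (start, end) byte offsets
    names = ["state", "year", "income"]

    df = pd.read_fwf("public_agency_file.dat", colspecs=colspecs, names=names)
    print(df.head())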
