Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 16, 2013

Deep Inside: A Study of 10,000 Porn Stars and Their Careers

Filed under: Data,Data Mining,Porn — Patrick Durusau @ 4:49 pm

Deep Inside: A Study of 10,000 Porn Stars and Their Careers by Jon Millward.

From the post:

For the first time, a massive data set of 10,000 porn stars has been extracted from the world’s largest database of adult films and performers. I’ve spent the last six months analyzing it to discover the truth about what the average performer looks like, what they do on film, and how their role has evolved over the last forty years.

I can now name the day I became aware of the Internet Adult Film Database: today!

When you get through grinning, go take a look at the post. This is serious data analysis.

Complete with an idealized porn star face composite from the most popular porn stars.

Improve your trivia skills: What two states in the United States have one porn star each in the Internet Adult Film Database? (Jon has a map of the U.S. with distribution of porn stars.)

A full report with more details about the analysis is forthcoming.

I first saw this at Porn star demographics by Nathan Yau.

February 13, 2013

datacatalogs.org [San Francisco, for example]

Filed under: Data,Dataset,Government,Government Data — Patrick Durusau @ 2:25 pm

datacatalogs.org

From the homepage:

a comprehensive list of open data catalogs curated by experts from around the world.

Cited in Simon Rogers’ post: Competition: visualise open government data and win $2,000.

As of today, 288 registered data catalogs.

The reservation I have about “open” government data is that when it is “open,” it’s not terribly useful.

I am sure there is useful “open” government data but let me give you an example of non-useful “open” government data.

Consider San Francisco, CA and cases of police misconduct against its citizens.

A really interesting data visualization would be to plot those incidents against the neighborhoods of San Francisco, with the neighborhoods colored by economic status.

The maps of San Francisco are available at DataSF, specifically, Planning Neighborhoods.
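
To make the idea concrete, here is a minimal sketch of the spatial join such a visualization would need, assuming incident-level data with locations existed. The library choice (geopandas) and the file names are my assumptions, not anything DataSF publishes:

    import geopandas as gpd

    # File names and the incident file itself are assumptions for illustration only.
    hoods = gpd.read_file("planning_neighborhoods.shp")        # DataSF neighborhood polygons
    incidents = gpd.read_file("misconduct_incidents.geojson")  # incident points (not actually published)

    # Tag each incident with the neighborhood polygon that contains it.
    joined = gpd.sjoin(incidents, hoods, how="inner", predicate="within")

    # Count incidents per neighborhood and draw a simple choropleth.
    hoods["incidents"] = joined.groupby("index_right").size().reindex(hoods.index).fillna(0)
    hoods.plot(column="incidents", legend=True)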

What about the police data?

I found summaries like: OCC Caseload/Disposition Summary – 1993-2009

Which listed:

  • Opened
  • Closed
  • Pending
  • Sustained

Not exactly what is needed for neighborhood by neighborhood mapping.

Note: No police misconduct since 2009 according to these data sets. (I find that rather hard to credit.)

How would you vote on this data set from San Francisco?

Open, Opaque, Semi-Transparent?

February 7, 2013

Seamless Astronomy

Filed under: Astroinformatics,Data,Data Integration,Integration — Patrick Durusau @ 10:33 am

Seamless Astronomy: Linking scientific data, publications, and communities

From the webpage:

Seamless integration of scientific data and literature

Astronomical data artifacts and publications exist in disjointed repositories. The conceptual relationship that links data and publications is rarely made explicit. In collaboration with ADS and ADSlabs, and through our work in conjunction with the Institute for Quantitative Social Science (IQSS), we are working on developing a platform that allows data and literature to be seamlessly integrated, interlinked, mutually discoverable.

Projects:

  • ADS All-Sky Survey (ADSASS)
  • Astronomy Dataverse
  • WorldWide Telescope (WWT)
  • Viz-e-Lab
  • Glue
  • Study of the impact of social media and networking sites on scientific dissemination
  • Network analysis and visualization of astronomical research communities
  • Data citation practices in Astronomy
  • Semantic description and annotation of scientific resources

A project with large amounts of data for integration.

Moreover, unlike the U.S. Intelligence Community, they are working towards data integration, not resisting it.

I first saw this in Four short links: 6 February 2013 by Nat Torkington.

February 5, 2013

Doing More with the Hortonworks Sandbox

Filed under: Data,Dataset,Hadoop,Hortonworks — Patrick Durusau @ 2:01 pm

Doing More with the Hortonworks Sandbox by Cheryle Custer.

From the post:

The Hortonworks Sandbox was recently introduced garnering incredibly positive response and feedback. We are as excited as you, and gratified that our goal providing the fastest onramp to Apache Hadoop has come to fruition. By providing a free, integrated learning environment along with a personal Hadoop environment, we are helping you gain those big data skills faster. Because of your feedback and demand for new tutorials, we are accelerating the release schedule for upcoming tutorials. We will continue to announce new tutorials via the Hortonworks blog, opt-in email and Twitter (@hortonworks).

While you wait for more tutorials, Cheryle points to some data sets to keep you busy:

For advice, see the Sandbox Forums.

BTW, while you are munging across different data sets, be sure to notice any semantic impedance if you try to merge some data sets.

If you don’t want everyone in your office doing that merging one-off, you might want to consider topic maps.

Design and document a merge between data sets once, run many times.

Even if your merging requirements change, just change that part of the map; don’t re-create the entire map.
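
As a rough illustration of “design once, run many times,” here is a minimal sketch of a declarative merge specification applied to records from two different silos. The field names are invented for the example:

    # A merge specification written once, applied to every silo that shows up.
    MERGE_SPEC = {
        "person_id": ["employee_no", "staff_id", "emp_number"],
        "hire_date": ["start_date", "date_of_hire"],
    }

    def normalize(record, spec=MERGE_SPEC):
        """Map source-specific field names onto the shared vocabulary."""
        out = {}
        for target, aliases in spec.items():
            for name in [target] + aliases:
                if name in record:
                    out[target] = record[name]
                    break
        return out

    hr = {"employee_no": "E-42", "start_date": "2011-03-01"}
    payroll = {"emp_number": "E-42", "date_of_hire": "2011-03-01"}
    print(normalize(hr) == normalize(payroll))  # True: one spec, many data sets

When requirements change, only MERGE_SPEC changes; the scripts that consume the normalized records do not.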

What if mapping companies recreated their maps for every new street?

Or would it be better to add the new street to an existing map?

If that looks obvious, try the extra-bonus question:

Which model, new map or add new street, do you use for schema migration?

Bill Gates is naive, data is not objective [Neither is Identification]

Filed under: Data,Identity — Patrick Durusau @ 10:54 am

Bill Gates is naive, data is not objective by Cathy O’Neil.

From the post:

In his recent essay in the Wall Street Journal, Bill Gates proposed to “fix the world’s biggest problems” through “good measurement and a commitment to follow the data.” Sounds great!

Unfortunately it’s not so simple.

Gates describes a positive feedback loop when good data is collected and acted on. It’s hard to argue against this: given perfect data-collection procedures with relevant data, specific models do tend to improve, according to their chosen metrics of success. In fact this is almost tautological.

As I’ll explain, however, rather than focusing on how individual models improve with more data, we need to worry more about which models and which data have been chosen in the first place, why that process is successful when it is, and – most importantly – who gets to decide what data is collected and what models are trained.

Cathy makes a compelling case for data not being objective and concludes:

Don’t be fooled by the mathematical imprimatur: behind every model and every data set is a political process that chose that data and built that model and defined success for that model.

Sounds a lot like identifying subjects.

No identification is objective. They all occur as part of social processes and are bound by those processes.

No identification is “better” than another one, although in some contexts, particular identifications may be more useful than others.

I first saw this in Four short links: 4 February 2013 by Nat Torkington.

February 3, 2013

Case study: million songs dataset

Filed under: Data,Dataset,GraphChi,Graphs,Machine Learning — Patrick Durusau @ 6:58 pm

Case study: million songs dataset by Danny Bickson.

From the post:

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested we should implement rankings based on item similarity.

Thanks to Clive suggestion, we have now an implementation of Fabio Aiolli’s cost function as explained in the paper: A Preliminary Study for a Recommender System for the Million Songs Dataset, which is the winning method in this contest.

Following are detailed instructions on how to utilize GraphChi CF toolkit on the million songs dataset data, for computing user ratings out of item similarities. 

Just in case you need some data for practice with your GraphChi installation. 😉

Seriously, nice way to gain familiarity with the data set.

What value you extract from it is up to you.
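
For readers who want the flavor of item-similarity recommendation before opening the GraphChi toolkit, here is a toy Python sketch. The play data is made up and the co-occurrence similarity is deliberately simplified; it is not Aiolli’s actual cost function:

    from collections import defaultdict
    from itertools import combinations

    # Toy play data: user -> set of songs. Real MSD data has the same shape, just bigger.
    plays = {
        "u1": {"songA", "songB", "songC"},
        "u2": {"songA", "songB"},
        "u3": {"songB", "songC", "songD"},
    }

    count = defaultdict(int)   # how many users played each song
    co = defaultdict(int)      # how many users played each pair of songs
    for songs in plays.values():
        for s in songs:
            count[s] += 1
        for a, b in combinations(sorted(songs), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1

    def sim(a, b):
        """Cosine-style similarity from co-occurrence counts."""
        return co[(a, b)] / (count[a] * count[b]) ** 0.5

    def recommend(user, k=2):
        """Score unseen songs by their similarity to the songs the user already played."""
        seen = plays[user]
        scores = {c: sum(sim(c, s) for s in seen) for c in count if c not in seen}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    print(recommend("u2"))  # ['songC', 'songD']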

January 29, 2013

Bad News From UK: … brows up, breasts down

Filed under: Data,Dataset,Humor,Medical Informatics — Patrick Durusau @ 6:51 pm

UK plastic surgery statistics 2012: brows up, breasts down by Ami Sedghi.

From the post:

Despite a recession and the government launching a review into cosmetic surgery following the breast implant scandal, plastic surgery procedures in the UK were up last year.

A total of 43,172 surgical procedures were carried out in 2012 according to the British Association of Aesthetic Plastic Surgeons (BAAPS), an increase of 0.2% on the previous year. Although there wasn’t a big change for overall procedures, anti-ageing treatments such as eyelid surgery and face lifts saw double digit increases.

Breast augmentation (otherwise known as ‘boob jobs’) were still the most popular procedure overall although the numbers dropped by 1.6% from 2011 to 2012. Last year’s stats took no account of the breast implant scandal so this is the first release of figures from BAAPS to suggest what impact the scandal has had on the popular procedure.

Just for comparison purposes:

Country   Procedures   Population    Percent of Population Treated
UK        43,172       62,641,000    0.069%
US        9,200,000    313,914,000   2.93%
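
For anyone who wants to check the arithmetic, the rates follow directly from the procedure and population figures above:

    uk_procedures, uk_population = 43172, 62641000
    us_procedures, us_population = 9200000, 313914000

    print("UK: {:.3%}".format(uk_procedures / uk_population))  # 0.069%
    print("US: {:.3%}".format(us_procedures / us_population))  # 2.931%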

Perhaps beauty isn’t one of the claimed advantages of socialized medicine?

January 25, 2013

Chemical datuments as scientific enablers

Filed under: Cheminformatics,Data,Identification,Topic Maps — Patrick Durusau @ 8:17 pm

Chemical datuments as scientific enablers by Henry S Rzepa. (Journal of Cheminformatics 2013, 5:6 doi:10.1186/1758-2946-5-6)

Abstract:

This article is an attempt to construct a chemical datument as a means of presenting insights into chemical phenomena in a scientific journal. An exploration of the interactions present in a small fragment of duplex Z-DNA and the nature of the catalytic centre of a carbon-dioxide/alkene epoxide alternating co-polymerisation is presented in this datument, with examples of the use of three software tools, one based on Java, the other two using Javascript and HTML5 technologies. The implications for the evolution of scientific journals are discussed.

From the background:

Chemical sciences are often considered to stand at the crossroads of paths to many disciplines, including molecular and life sciences, materials and polymer sciences, physics, mathematical and computer sciences. As a research discipline, chemistry has itself evolved over the last few decades to focus its metaphorical microscope on both far larger and more complex molecular systems than previously attempted, as well as uncovering a far more subtle understanding of the quantum mechanical underpinnings of even the smallest of molecules. Both these extremes, and everything in between, rely heavily on data. Data in turn is often presented in the form of visual or temporal models that are constructed to illustrate molecular behaviour and the scientific semantics. In the present article, I argue that the mechanisms for sharing both the underlying data, and the (semantic) models between scientists need to evolve in parallel with the increasing complexity of these models. Put simply, the main exchange mechanism, the scientific journal, is accepted [1] as seriously lagging behind in its fitness for purpose. It is in urgent need of reinvention; one experiment in such was presented as a data-rich chemical exploratorium [2]. My case here in this article will be based on my recent research experiences in two specific areas. The first involves a detailed analysis of the inner kernel of the Z-DNA duplex using modern techniques for interpreting the electronic properties of a molecule. The second recounts the experiences learnt from modelling the catalysed alternating co-polymerisation of an alkene epoxide and carbon dioxide.

Effective sharing of data, in scientific journals or no, requires either a common semantic (we know that’s uncommon) or a mapping between semantics (how many times must we repeat the same mappings, separately?).

Embedding notions of subject identity and mapping between identifications in chemical datuments could increase the reuse of data, as well as its longevity.
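
A minimal sketch of what embedding subject identity can look like in practice: one record for one compound, reachable from any of its identifiers. The structure is mine and the identifiers are shown for illustration:

    # One subject (carbon dioxide), several identifiers; any of them resolves to the same record.
    SUBJECT = {
        "label": "carbon dioxide",
        "identifiers": {
            "name:en": "carbon dioxide",
            "smiles": "O=C=O",
            "inchi": "InChI=1S/CO2/c2-1-3",
            "cas": "124-38-9",
        },
    }

    INDEX = {value: SUBJECT for value in SUBJECT["identifiers"].values()}

    def resolve(identifier):
        return INDEX.get(identifier)

    assert resolve("O=C=O") is resolve("124-38-9")  # different identifiers, same subject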

January 23, 2013

Data Warfare: Big Data As Another Battlefield

Filed under: Data,Marketing,Topic Maps — Patrick Durusau @ 7:40 pm

Stacks get hacked: The inevitable rise of data warfare by Alistair Croll.

A snippet from Alistair’s post:

First, technology is good. Then it gets bad. Then it gets stable.

Geeks often talk about “layer 8.” When an IT operator sighs resignedly that it’s a layer 8 problem, she means it’s a human’s fault. It’s where humanity’s rubber meets technology’s road. And big data is interesting precisely because it’s the layer 8 protocol. It’s got great power, demands great responsibility, and portends great risk unless we do it right. And just like the layers beneath it, it’s going to get good, then bad, then stable.

Other layers of the protocol stack have come under assault by spammers, hackers, and activists. There’s no reason to think layer 8 won’t as well. And just as hackers find a clever exploit to intercept and spike an SSL session, or trick an app server into running arbitrary code, so they’ll find an exploit for big data.

The term “data warfare” might seem a bit hyperbolic, so I’ll try to provide a few concrete examples. I’m hoping for plenty more in the Strata Online Conference we’re running next week, which has a stellar lineup of people who have spent time thinking about how to do naughty things with information at scale.

Alistair has interesting example cases but layer 8 warfare has been the norm for years.

Big data is just another battlefield.

Consider the lack of sharing within governmental agencies.

How else would you explain: U.S. Government’s Fiscal Years 2012 and 2011 Consolidated Financial Statements, a two hundred and seventy page report from the Government Accountability Office (GAO), detailing why it can’t audit the government due to problems at the Pentagon and elsewhere?

It isn’t like double entry accounting was invented last year and accounting software is all that buggy.

Forcing the Pentagon and others to disgorge accounting data would be a first step.

The second step would be to map the data with its original identifiers, so it would be possible to return to that same location as last year and, if the data is missing, to ask where it is now, with enough specifics to have teeth.

Let the Pentagon keep its self-licking ice cream cone accounting systems.

But attack it with mapping of data and semantics to create audit trails into that wasteland.

Data warfare is a given. The only question is whether you intend to win or lose.

January 17, 2013

Complete Guardian Dataset Listing!

Filed under: Data,Dataset,News — Patrick Durusau @ 7:28 pm

All our datasets: the complete index by Chris Cross.

From the post:

Lost track of the hundreds of datasets published by the Guardian Datablog since it began in 2009? Thanks to ScraperWiki, this is the ultimate list and resource. The table below is live and updated every day – if you’re still looking for that ultimate dataset, the chance is we’ve already done it. Click below to find out

I am simply in awe of the number of datasets produced by the Guardian since 2009.

A few of the more interesting titles include:

You will find things in the hundreds of datasets you have wondered about and other things you can’t imagine wondering about. 😉

Enjoy!

Virtual Astronomical Observatory – 221st AAS Meeting

Filed under: Astroinformatics,Data — Patrick Durusau @ 7:24 pm

The Virtual Astronomical Observatory (VAO) at the 221st AAS Meeting

From the post:

The VAO is funded to provide a computational infrastructure for virtual astronomy. When complete, it will enable astronomers to discover and access data in archives worldwide, allow them to share and publish datasets, and support analysis of data through an “ecosystem” of interoperable tools.

Nine out of twelve posters are available for download, including:

Even if you live in an area of severe light pollution, the heavens may only be an IP address away.

Enjoy!

January 16, 2013

Free Datascience books

Filed under: Data,Data Mining,Data Science — Patrick Durusau @ 7:55 pm

Free Datascience books by Carl Anderson

From the post:

I’ve been impressed in recent months by the number and quality of free datascience/machine learning books available online. I don’t mean free as in some guy paid for a PDF version of an O’Reilly book and then posted it online for others to use/steal, but I mean genuine published books with a free online version sanctioned by the publisher. That is, “the publisher has graciously agreed to allow a full, free version of my book to be available on this site.”
Here are a few in my collection:

Any you would like to add to the list?

I first saw this in Four short links: 1 January 2013 by Nat Torkington.

January 14, 2013

1 Billion Videos = No Reruns

Filed under: Data,Entertainment,Social Media,Social Networks — Patrick Durusau @ 8:38 pm

Viki Video: 1 Billion Videos in 150 languages Means Never Having to Say Rerun by Greg Bates.

From the post:

Tired of American TV? Tired of TV in English? Escape to Viki, the leading global TV and movie network, which provides videos with crowd sourced translations in 150 languages. The Viki API allows your users to browse more than 1 billion videos by genre, country, and language, plus search across the entire database. The API uses OAuth2.0 authentication, REST, with responses in either JSON or XML.

The Viki Platform Google Group.

Now this looks like a promising data set!

A couple of use cases for topic maps come to mind:

  • An entry in an OPAC points patrons from the catalog to videos in this database.
  • An entry returned from the database maps to a book in the local library collection (via WorldCat); this is the use case more likely to appeal to me.

What use cases do you see?

Connecting the Dots with Data Mashups (Webinar – 15th Jan. 2013)

Filed under: Data,Graphics,Mashups,Visualization — Patrick Durusau @ 1:53 pm

Connecting the Dots with Data Mashups (Webinar – 15th Jan. 2013)

From the webpage:

The Briefing Room with Lyndsay Wise and Tableau Software

While Big Data continues to grab headlines, most information managers know there are many more “small” data sets that are becoming more valuable for gaining insights. That’s partly because business users are getting savvier at mixing and matching all kinds of data, big and small. One key success factor is the ability to create compelling visualizations that clearly show patterns in the data.

Register for this episode of The Briefing Room to hear Analyst Lyndsay Wise share insights about best practices for designing data visualization mashups. She’ll be briefed by Ellie Fields of Tableau Software who will demonstrate several different business use cases in which such mashups have proven critical for generating significant business value.

Particularly interesting is the use cases part of the presentation.

Topic maps, after all, are re-usable and reliable mashups.

Finding places that like mashups+ (aka, topic maps) is a good marketing move.

PS: It took several minutes to discover a link for the webinar that did not have lots of tracking garbage attached to it. I am considering not listing events without clean URLs to registration materials. What do you think?

January 10, 2013

App-lifying USGS Earth Science Data

Filed under: Challenges,Contest,Data,Geographic Data,Science — Patrick Durusau @ 1:49 pm

App-lifying USGS Earth Science Data

Challenge Dates:

Submissions: January 9, 2013 at 9:00am EST – Ends April 1, 2013 at 11:00pm EDT.

Public Voting: April 5, 2013 at 5:00pm EDT – Ends April 25, 2013 at 11:00pm EDT.

Judging: April 5, 2013 at 5:00pm EDT – Ends April 25, 2013 at 11:00pm EDT.

Winners Announced: April 26, 2013 at 5:00pm EDT.

From the webpage:

USGS scientists are looking for your help in addressing some of today’s most perplexing scientific challenges, such as climate change and biodiversity loss. To do so requires a partnership between the best and the brightest in Government and the public to guide research and identify solutions.

The USGS is seeking help via this platform from many of the Nation’s premier application developers and data visualization specialists in developing new visualizations and applications for datasets.

USGS datasets for the contest consist of a range of earth science data types, including:

  • several million biological occurrence records (terrestrial and marine);
  • thousands of metadata records related to research studies, ecosystems, and species;
  • vegetation and land cover data for the United States, including detailed vegetation maps for the National Parks; and
  • authoritative taxonomic nomenclature for plants and animals of North America and the world.

Collectively, these datasets are key to a better understanding of many scientific challenges we face globally. Identifying new, innovative ways to represent, apply, and make these data available is a high priority.

Submissions will be judged on their relevance to today’s scientific challenges, innovative use of the datasets, and overall ease of use of the application. Prizes will be awarded to the best overall app, the best student app, and the people’s choice.

Of particular interest for the topic maps crowd:

Data used – The app must utilize a minimum of 1 DOI USGS Core Science and Analytics (CSAS) data source, though they need not include all data fields available in a particular resource. A list of CSAS databases and resources is available at: http://www.usgs.gov/core_science_systems/csas/activities.html. The use of data from other sources in conjunction with CSAS data is encouraged.

CSAS has a number of very interesting data sources. Classifications, thesauri, data integration, metadata and more.

The contest wins you recognition and bragging rights, not to mention visibility for your approach.

Common Crawl URL Index

Filed under: Common Crawl,Data,WWW — Patrick Durusau @ 1:48 pm

Common Crawl URL Index by Lisa Green.

From the post:

We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io graciously donated his time and skills to creating this valuable tool. You can read his guest blog post below and be sure to check out the triv.io site to learn more about how they help groups solve big data problems.

From Scott’s post:

If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, that first starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster , would agree.

Which is great news! However if you wanted to extract only a small subset, say every page from Wikipedia you still would have to pay that few hundred dollars. The individual pages are randomly distributed in over 200,000 archive files, which you must download and unzip each one to find all the Wikipedia pages. Well you did, until now.

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).

What research project would you want to do first?
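
For a sense of how a URL index makes domain, subdomain, and TLD queries cheap, here is a toy sketch built on reversed hostnames. It illustrates the idea only and says nothing about the actual format of the index Scott built:

    import bisect

    def reverse_host(url):
        """Reverse the hostname so related pages sort together: org.wikipedia.en/..."""
        host, _, path = url.partition("://")[2].partition("/")
        return ".".join(reversed(host.split("."))) + "/" + path

    # A toy index: sorted reversed URLs mapped to (archive file, offset). All entries are made up.
    index = sorted([
        (reverse_host("http://en.wikipedia.org/wiki/Topic_map"), ("arc-00123.gz", 4096)),
        (reverse_host("http://en.wikipedia.org/wiki/Data"), ("arc-00007.gz", 128)),
        (reverse_host("http://example.com/about"), ("arc-00001.gz", 0)),
    ])

    def lookup(prefix):
        """Every entry whose reversed URL starts with prefix: a domain, subdomain or TLD query."""
        keys = [key for key, _ in index]
        i = bisect.bisect_left(keys, prefix)
        while i < len(index) and index[i][0].startswith(prefix):
            yield index[i]
            i += 1

    print(list(lookup("org.wikipedia.")))  # every archived page under *.wikipedia.org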

Stop Hosting Data and Code on your Lab Website

Filed under: Archives,Data — Patrick Durusau @ 1:45 pm

Stop Hosting Data and Code on your Lab Website by Stephen Turner.

From the post:

It’s happened to all of us. You read about a new tool, database, webservice, software, or some interesting and useful data, but when you browse to http://institution.edu/~home/professorX/lab/data, there’s no trace of what you were looking for.

THE PROBLEM

This isn’t an uncommon problem. See the following two articles:

Schultheiss, Sebastian J., et al. “Persistence and availability of web services in computational biology.” PLoS one 6.9 (2011): e24914. 

Wren, Jonathan D. “404 not found: the stability and persistence of URLs published in MEDLINE.” Bioinformatics 20.5 (2004): 668-672.

The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Web Server Issue between 2003 and 2009:

  • Only 72% were still available at the published address.
  • The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
  • The authors could only confirm positive functionality for 45%.
  • Only 274 of the 872 corresponding authors answered an email.
  • Of these 78% said a service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher had moved on to a permanent position.

The Wren et al. paper found that of 1630 URLs identified in Pubmed abstracts, only 63% were consistently available. That rate was far worse for anonymous login FTP sites (33%).
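
If you want to run a similar availability check over your own link lists, here is a minimal sketch, assuming the Python requests library; the URLs are placeholders:

    import requests

    # Placeholder URLs; substitute the links from a paper or lab page you care about.
    urls = [
        "http://example.com/tool",
        "http://example.org/~lab/data.tar.gz",
    ]

    for url in urls:
        try:
            status = requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException as exc:
            status = type(exc).__name__
        print(status, url)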

Is this a problem for published data in the topic map community?

What data should we be archiving? Discussion lists? Blogs? Public topic maps?

What do you think of Stephen’s solution?

December 26, 2012

New EU Data Portal [Transparency/Innovation?]

Filed under: Data,Data Source,EU,Transparency — Patrick Durusau @ 2:30 pm

EU Commission unwraps public beta of open data portal with 5800+ datasets, ahead of Jan 2013 launch by Robin Wauters.

The EU Data Portal.

From the post:

Good news for open data lovers in the European Union and beyond: the European Commission on Christmas Eve quietly pushed live the public beta version of its all-new open data portal.

For the record: open data is general information that can be freely used, re-used and redistributed by anyone. In this case, it concerns all the information that public bodies in the European Union produce, collect or pay for (it’s similar to the United States government’s Data.gov).

This could include geographical data, statistics, meteorological data, data from publicly funded research projects, and digitised books from libraries.

The post also quotes the portal website as saying:

This portal is about transparency, open government and innovation. The European Commission Data Portal provides access to open public data from the European Commission. It also provides access to data of other Union institutions, bodies, offices and agencies at their request.

The published data can be downloaded by everyone interested to facilitate reuse, linking and the creation of innovative services. Moreover, this Data Portal promotes and builds literacy around Europe’s data.

Eurostat is the largest data contributor so signs of “transparency” should be there, if anywhere.

The first twenty (20) data sets from Eurostat are:

  • Quarterly cross-trade road freight transport by type of transport (1 000 t, Mio Tkm)
  • Turnover by residence of client and by employment size class for div 72 and 74
  • Generation of waste by sector
  • Standardised incidence rate of accidents at work by economic activity, severity and age
  • At-risk-of-poverty rate of older people, by age and sex (Source: SILC)
  • Telecommunication services: Access to networks (1 000)
  • Production of environmentally harmful chemicals, by environmental impact class
  • Fertility indicators
  • Area under wine-grape vine varieties broken down by vine variety, age of the vines and NUTS 2 regions – Romania
  • Severe material deprivation rate by most frequent activity status (population aged 18 and over)
  • Government bond yields, 10 years’ maturity – monthly data
  • Material deprivation for the ‘Economic strain’ and ‘Durables’ dimensions, by number of item (Source: SILC)
  • Participation in non-formal taught activities within (or not) paid hours by sex and working status
  • Number of persons by working status within households and household composition (1 000)
  • Percentage of all enterprises providing CVT courses, by type of course and size class
  • EU Imports from developing countries by income group
  • Extra-EU imports of feedingstuffs: main EU partners
  • Production and international trade of foodstuffs: Fresh fish and fish products
  • General information about the enterprises
  • Agricultural holders

When I think of government “transparency,” I think of:

  • Who is making the decisions?
  • What are their relationships to the people asking for the decisions? School, party, family, social, etc.
  • What benefits are derived from the decisions?
  • Who benefits from those decisions?
  • What are the relationships between those who benefit and those who decide?
  • Remembering it isn’t the “EU” that makes a decision for good or ill for you.

    Some named individual or group of named individuals, with input from other named individuals, with who they had prior relationships, made those decisions.

    Transparency in government would name the names and relationships of those individuals.

    BTW, I would be very interested to learn what sort of “innovation” you can derive from any of the first twenty (20) data sets listed above.

    The holidays may have exhausted my imagination because I am coming up empty.

    Educated Guesses Decorated With Numbers

    Filed under: Data,Data Analysis,Open Data — Patrick Durusau @ 1:48 pm

    Researchers Say Much to Be Learned from Chicago’s Open Data by Sam Cholke.

    From the post:

    HYDE PARK — Chicago is a vain metropolis, publishing every minute detail about the movement of its buses and every little skirmish in its neighborhoods. A team of researchers at the University of Chicago is taking that flood of data and using it to understand and improve the city.

    “Right now we have more data than we’re able to make use of — that’s one of our motivations,” said Charlie Catlett, director of the new Urban Center for Computation and Data at the University of Chicago.

    Over the past two years the city has unleashed a torrent of data about bus schedules, neighborhood crimes, 311 calls and other information. Residents have put it to use, but Catlett wants his team of computational experts to get a crack at it.

    “Most of what is happening with public data now is interesting, but it’s people building apps to visualize the data,” said Catlett, a computer scientist at the university and Argonne National Laboratory.

    Catlett and a collection of doctors, urban planners and social scientists want to analyze that data so to solve urban planning puzzles in some of Chicago’s most distressed neighborhoods and eliminate the old method of trial and error.

    “Right now we look around and look for examples where something has worked or appeared to work,” said Keith Besserud, an architect at Skidmore, Owings and Merrill's Blackbox Studio and part of the new center. “We live in a city, so we think we understand it, but it’s really not seeing the forest for the trees, we really don’t understand it.”

    Besserud said urban planners have theories but lack evidence to know for sure when greater density could improve a neighborhood, how increased access to public transportation could reduce unemployment and other fundamental questions.

    “We’re going to try to break down some of the really tough problems we’ve never been able to solve,” Besserud said. “The issue in general is the field of urban design has been inadequately served by computational tools.”

    In the past, policy makers would make educated guesses. Catlett hopes the work of the center will better predict such needs using computer models, and the data is only now available to answer some fundamental questions about cities.

    …(emphasis added)

    Some city services may be improved by increased data, such as staging ambulances near high density shooting locations based upon past experience.

    That isn’t the same as “planning” to reduce the incidence of unemployment or crime by urban planning.

    If you doubt that statement, consider the vast sums of economic data available for the past century.

    Despite that array of data, there are no universally acclaimed “truths” or “policies” for economic planning.

    The temptation to say “more data,” “better data,” “better integration of data,” etc. will solve problem X is ever present.

    Avoid disappointing your topic map customers.

    Make sure a problem is one data can help solve before treating it like one.

    I first saw this in a tweet by Tim O’Reilly.

    December 25, 2012

    Quandl [> 2 million financial/economic datasets]

    Filed under: Data,Dataset,Time Series — Patrick Durusau @ 4:19 pm

    Quandl (alpha)

    From the homepage:

    Quandl is a collaboratively curated portal to over 2 million financial and economic time-series datasets from over 250 sources. Our long-term mission is to make all numerical data on the internet easy to find and easy to use.

    Interesting enough, but the details from the “about” page are even more so:

    Our Vision

    The internet offers a rich collection of high quality numerical data on thousands of subjects. But the potential of this data is not being reached at all because the data is very difficult to actually find. Furthermore, it is also difficult to extract, validate, format, merge, and share.

    We have a solution: We’re building an intelligent search engine for numerical data. We’ve developed technology that lets people quickly and easily add data to Quandl’s index. Once this happens, the data instantly becomes easy to find and easy to use because it gains 8 essential attributes:

    • Findability: Quandl is essentially a search engine for numerical data. Every search result on Quandl is an actual data set that you can use right now. Once data from anywhere on the internet becomes known to Quandl, it becomes findable by search and (soon) by browse.
    • Structure: Quandl is a universal translator for data formats. It accepts numerical data no matter what format it happens to be published in and then delivers it in any format you request. When you find a dataset on Quandl, you’ll be able to export it anywhere you want, in any format you want.
    • Validity: Every dataset on Quandl has a simple link back to the same data on the publisher’s web site, which gives you 100% certainty on validity.
    • Fusibility: Any data set on Quandl is totally compatible with any and all other data on Quandl. You can merge multiple datasets on Quandl quickly and easily (coming soon).
    • Permanence: Once a dataset is on Quandl, it stays there forever. It is always up-to-date and available at a permanent, unchanging URL.
    • Connectivity: Every dataset on Quandl is accessible by a simple API. Whether or not the original publisher offered an API no longer matters because Quandl always does. Quandl is the universal API for numerical data on the internet.
    • Recency: Every single dataset on Quandl is guaranteed to be the most recent version of that data, retrieved afresh directly from the original publisher.
    • Utility: Data on Quandl is organized and presented for maximum utility: actual data is examinable immediately; the data is graphed (properly); description, attribution, units, and export tools are clear and concise.

    I have my doubts about the “fusibility” claims. You can check the US Leading Indicators data list and note that “level” and “units” use different units of measurement. Other semantic issues lurk just beneath the surface.
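
    A tiny example of the kind of semantic issue I mean: two series describing the “same” indicator, one published in thousands and one in raw units, fuse without complaint. The series and numbers below are invented:

        import pandas as pd

        # Invented numbers: the "same" indicator from two sources, in different units.
        series_a = pd.Series([780, 792], index=["2012-10", "2012-11"])        # thousands
        series_b = pd.Series([781000, 792500], index=["2012-10", "2012-11"])  # raw units

        fused = pd.concat({"source_a": series_a, "source_b": series_b}, axis=1)
        print(fused)                     # merges cleanly, compares meaninglessly
        print(fused["source_b"] / 1000)  # only now are the columns comparable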

    Still, the name of the engine does not begin with “B” or “G” and illustrates there is enormous potential for curated data collections.

    Come to think of it, topic maps are curated data collections.

    Are you in need of a data curator?

    I first saw this in a tweet by Gregory Piatetsky.

    December 8, 2012

    Applying “Lateral Thinking” to Data Quality

    Filed under: Data,Data Quality — Patrick Durusau @ 7:08 pm

    Applying “Lateral Thinking” to Data Quality by Ken O’Connor.

    From the post:

    I am a fan of Edward De Bono, the originator of the concept of Lateral Thinking. One of my favourite examples of De Bono’s brilliance, relates to dealing with the worldwide problem of river pollution.

    [Image: river discharge pipe]

    De Bono suggested “each factory must be downstream of itself” – i.e. Require factories’ water inflow pipes to be just downstream of their outflow pipes.

    Suddenly, the water quality in the outflow pipe becomes a lot more important to the factory. Apparently several countries have implemented this idea as law.

    What has this got to do with data quality?

    By applying the same principle to data entry, all downstream data users will benefit, and information quality will improve.

    How could this be done?

    So how do you move the data input pipe just downstream of the data outflow pipe?

    Before you take a look at Ken’s solution, take a few minutes to brain storm about how you would do it.

    This is important for semantic technologies because there aren’t enough experts to go around, meaning non-expert users will do a large portion of the work.

    Comments/suggestions?

    December 7, 2012

    Astronomy Resources [Zillman]

    Filed under: Astroinformatics,Data — Patrick Durusau @ 6:38 pm

    Astronomy Resources by Marcus P. Zillman.

    From the post:

    Astronomy Resources (AstronomyResources.info) is a Subject Tracer™ Information Blog developed and created by the Virtual Private Library™. It is designed to bring together the latest resources and sources on an ongoing basis from the Internet for astronomical resources which are listed below….

    With some caveats, this may be of interest.

    First, the level of content is uneven. It ranges from professional surveys (suitable for topic map explorations) to more primary/secondary education type materials. Nothing against the latter but the mix is rather jarring.

    Second, I didn’t test every link but for example AstroGrid is a link to a project that was completed two years ago (2010).

    Just in case you stumble across any of the “white papers” at http://www.whitepapers.us/, also by Marcus P. Zillman, do verify resources before citing them to others.

    December 1, 2012

    MOA Massively Online Analysis

    Filed under: BigData,Data,Hadoop,Machine Learning,S4,Storm,Stream Analytics — Patrick Durusau @ 8:02 pm

    MOA Massively Online Analysis : Real Time Analytics for Data Streams

    From the homepage:

    What is MOA?

    MOA is an open source framework for data stream mining. It includes a collection of machine learning algorithms (classification, regression, and clustering) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.

    What can MOA do for you?

    MOA performs BIG DATA stream mining in real time, and large scale machine learning. MOA can be easily used with Hadoop, S4 or Storm, and extended with new mining algorithms, and new stream generators or evaluation measures. The goal is to provide a benchmark suite for the stream mining community. Details.

    Short tutorials and a manual are available. Enough to get started but you will need additional resources on machine learning if it isn’t already familiar.

    A small niggle about documentation: many projects have files named “tutorial” or, in this case, “Tutorial1” or “Manual.” Those files are easier to discover and save if the project name and version are prepended, e.g. “Moa-2012-08-tutorial1” or “Moa-2012-08-manual.”

    If data streams are in your present or future, definitely worth a look.
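
    For readers new to stream mining, the evaluation style MOA is built around (prequential, or test-then-train) can be sketched in a few lines. This is a generic illustration in plain scikit-learn, not MOA’s own API:

        from sklearn.datasets import make_classification
        from sklearn.linear_model import SGDClassifier

        # Test-then-train over a simulated stream of 5,000 instances.
        X, y = make_classification(n_samples=5000, random_state=0)
        model = SGDClassifier(random_state=0)

        correct = seen = 0
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            if i > 0:                                  # test on the new instance first...
                correct += int(model.predict([x_i])[0] == y_i)
                seen += 1
            model.partial_fit([x_i], [y_i], classes=[0, 1])  # ...then train on it
        print("prequential accuracy: {:.3f}".format(correct / seen))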

    November 27, 2012

    For Attribution… [If One Identifier/URL isn’t enough]

    Filed under: Citation Practices,Data,Data Attribution — Patrick Durusau @ 4:12 pm

    For Attribution — Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop by Paul F. Uhlir.

    From the preface:

    The growth of electronic publishing of literature has created new challenges, such as the need for mechanisms for citing online references in ways that can assure discoverability and retrieval for many years into the future. The growth in online datasets presents related, yet more complex challenges. It depends upon the ability to reliably identify, locate, access, interpret and verify the version, integrity, and provenance of digital datasets.

    Data citation standards and good practices can form the basis for increased incentives, recognition, and rewards for scientific data activities that in many cases are currently lacking in many fields of research. The rapidly-expanding universe of online digital data holds the promise of allowing peer-examination and review of conclusions or analysis based on experimental or observational data, the integration of data into new forms of scholarly publishing, and the ability for subsequent users to make new and unforeseen uses and analyses of the same data – either in isolation, or in combination with other datasets.

    The problem of citing online data is complicated by the lack of established practices for referring to portions or subsets of data. As funding sources for scientific research have begun to require data management plans as part of their selection and approval processes, it is important that the necessary standards, incentives, and conventions to support data citation, preservation, and accessibility be put into place.

    Of particular interest are the four questions that shaped this workshop:

    1. What is the status of data attribution and citation practices in the natural and social (economic and political) sciences in United States and internationally?

    2. Why is the attribution and citation of scientific data important and for what types of data? Is there substantial variation among disciplines?

    3. What are the major scientific, technical, institutional, economic, legal, and socio-cultural issues that need to be considered in developing and implementing scientific data citation standards and practices? Which ones are universal for all types of research and which ones are field or context specific?

    4. What are some of the options for the successful development and implementation of scientific data citation practices and standards, both across the natural and social sciences and in major contexts of research?

    The workshop did not presume a solution (is that a URL in your pocket?) but explores the complex nature of attribution and citation.

    Michael Sperberg-McQueen remarks:

    Longevity: Finally, there is the question of longevity. It is well known that the half-life of citations is much higher in humanities than in the natural sciences. We have been cultivating a culture of citation of referencing for about 2,000 years in the West since the Alexandrian era. Our current citation practice may be 400 years old. The http scheme, by comparison, is about 19 years old. It is a long reach to assume, as some do, that http URLs are an adequate mechanism for all citations of digital (and non-digital!) objects. It is not unreasonable for scholars to be skeptical of the use of URLs to cite data of any long-term significance, even if they are interested in citing the data resources they use. [pp. 63-64]

    What I find the most attractive about topic maps is you can have:

    • A single URL as a citation/identifier.
    • Multiple URLs as citations/identifiers (for the same data resource).
    • Multiple URLs and/or other forms of citations/identifiers as they develop(ed) over time for the same data resource.

    Why the concept of multiple citations/identifiers (quite common in biblical studies) for a single resource is so difficult I cannot explain.
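
    For the doubters, a minimal sketch of merging on any shared identifier; the identifiers below are invented, but the mechanics are the whole trick:

        def merge_topics(records):
            """Single-pass merge of records that share at least one identifier."""
            merged = []
            for rec in records:
                ids, names = set(rec["ids"]), set(rec["names"])
                hit = next((m for m in merged if m["ids"] & ids), None)
                if hit:
                    hit["ids"] |= ids
                    hit["names"] |= names
                else:
                    merged.append({"ids": ids, "names": names})
            return merged

        # The same dataset cited three ways over time: legacy FTP URL, DOI, landing page.
        records = [
            {"ids": {"ftp://archive.example.org/ds1"}, "names": {"Survey DS1 (1998 release)"}},
            {"ids": {"doi:10.1234/ds1", "ftp://archive.example.org/ds1"}, "names": {"Survey DS1"}},
            {"ids": {"doi:10.1234/ds1", "http://data.example.org/ds1"}, "names": {"DS1 landing page"}},
        ]
        print(len(merge_topics(records)))  # 1 -- one subject, three identifiers

    A production merge would also need to handle transitive matches across passes, but the principle is the same.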

    SINAInnovation: Innovation and Data

    Filed under: Bioinformatics,Cloudera,Data — Patrick Durusau @ 2:26 pm

    SINAInnovation: Innovation and Data by Jeffrey Hammerbacher.

    From the description:

    Cloudera Co-founder Jeff Hammerbacher speaks about data and innovation in the biology and medicine fields.

    Interesting presentation, particularly on creating structures for innovation.

    One of his insights I would summarize as “break early, rebuild fast.” His term for it was “lower batch size.” Try new ideas and when they fail, try a new one.

    I do wonder about his goal to “Lower the cost of data storage and processing to zero.”

    It may get to be “too cheap to meter” but that isn’t the same thing as being zero. Somewhere in the infrastructure, someone is paying bills for storage and processing.

    I mention that because some political parties think that infrastructure can exist without ongoing maintenance and care.

    Failing infrastructures don’t lead to innovation.


    SINAInnovation description:

    SINAInnovations was a three-day conference at The Mount Sinai Medical Center that examined all aspects of innovation and therapeutic discovery within academic medical centers, from how it can be taught and fostered within academia, to how it can accelerate drug discovery and the commercialization of emerging biotechnologies.

    November 26, 2012

    Climate Data Guide:…

    Filed under: Climate Data,Data — Patrick Durusau @ 5:35 am

    Climate Data Guide: Climate data strengths, limitations and applications

    From the homepage:

    Like an insider’s guidebook to an unexplored country, the Climate Data Guide provides the key insights needed to select the data that best align with your goals, including critiques of data sets by experts from the research community. We invite you to learn from their insights and share your own.

    There are one hundred and eleven data sets on this site as of today, some satellite-based, others from other sources.

    Another resource that you may want to map together with other resources.

    Produced by the National Center for Atmospheric Research.

    Public FLUXNET Dataset Information

    Filed under: Climate Data,Data — Patrick Durusau @ 5:33 am

    Public FLUXNET Dataset Information

    From the webpage:

    Flux and meteorological data, collected world‐wide, are submitted to this central database (www.fluxdata.org). These data are: a) checked for quality; b) gaps are filled; c) value-added products, like ecosystem photosynthesis and respiration, are produced; and d) daily and annual sums, or averages, are computed [Agarwal et al., 2010]. The resulting datasets are available through this site for data synthesis. This page provides information about the FLUXNET synthesis datasets, the sites that contributed data, how to use the datasets, and the synthesis efforts using the datasets.

    I encountered this while searching for more information on biological flux data and thought I should pass it along.

    If you are interested in climate data, definitely a stop you want to make!

    November 24, 2012

    The Seventh Law of Data Quality

    Filed under: Data,Data Quality — Patrick Durusau @ 12:02 pm

    The Seventh Law of Data Quality by Jim Harris.

    Jim’s series on the “laws” of data quality can be recommended without reservation. There are links to each one in his coverage of the seventh law.

    The seventh law of data quality reads:

    Determine the business impact of data quality issues BEFORE taking any corrective action in order to properly prioritize data quality improvement efforts.

    I would modify that slightly to make it applicable to data issues more broadly as:

    Determine the business impact of a data issue BEFORE addressing it at all.

    Your data may be completely isolated in silos, but without a business purpose to be served by freeing them, why bother?

    And that purpose should have a measurable ROI.

    In the absence of a business purpose and a measurable ROI, keep both hands on your wallet.

    November 21, 2012

    Archive of datasets bundled with R

    Filed under: Data,Dataset,R — Patrick Durusau @ 12:19 pm

    Archive of datasets bundled with R by Nathan Yau.

    From the post:

    R comes with a lot of datasets, some with the core distribution and others with packages, but you’d never know which ones unless you went through all the examples found at the end of help documents. Luckily, Vincent Arel-Bundock cataloged 596 of them in an easy-to-read page, and you can quickly download them as CSV files.

    Many of the datasets are dated, going back to the original distribution of R, but it’s a great resource for teaching or if you’re just looking for some data to play with.

    A great find! Thanks Nathan and to Vincent for pulling it together!
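
    If you work in Python rather than R, the same catalog is reachable through statsmodels; this assumes statsmodels is installed and you have network access:

        import statsmodels.api as sm

        # Fetch one of the catalogued R datasets directly from Python.
        iris = sm.datasets.get_rdataset("iris")
        print(iris.data.head())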

    November 20, 2012

    Wikipedia:Database download

    Filed under: Data,Wikipedia — Patrick Durusau @ 3:30 pm

    Wikipedia:Database download

    From the webpage:

    Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

    I know you are already aware of this as a data source but every time I want to confirm something about it, I have a devil of a time finding it at Wikipedia.

    If I remember that I wrote about it here, perhaps it will be easier to find. 😉

    What I need to do is get one of those multi-terabyte network appliances for Christmas. Then I could copy large data sets whose structures I consult more often than they need updating. (Like the next one I am about to mention.)
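
    When a dump does arrive, here is a minimal sketch for streaming page titles out of it without loading the whole file; the file name is an assumption:

        import bz2
        import xml.etree.ElementTree as ET

        # Stream page titles out of a multi-gigabyte dump without loading it all.
        with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as dump:
            for _, elem in ET.iterparse(dump):
                if elem.tag.endswith("}title"):
                    print(elem.text)
                elem.clear()  # discard parsed elements to keep memory use flat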

