Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 31, 2014

North American Slave Narratives

Filed under: Biography,Data,Narrative — Patrick Durusau @ 3:33 pm

North American Slave Narratives

A chronological listing of autobiographies, from 1740 to 1999.

There are two hundred and four (204) autobiographies in total, and a large number of them are available online.

A class project to weave these together with court records, journals, newspapers and the like would be a good use case for topic maps.

May 30, 2014

Hello Again

Filed under: Archives,Data,Documentation — Patrick Durusau @ 3:42 pm

We Are Now In Command of the ISEE-3 Spacecraft by Keith Cowing.

From the post:

The ISEE-3 Reboot Project is pleased to announce that our team has established two-way communication with the ISEE-3 spacecraft and has begun commanding it to perform specific functions. Over the coming days and weeks our team will make an assessment of the spacecraft’s overall health and refine the techniques required to fire its engines and bring it back to an orbit near Earth.

First Contact with ISEE-3 was achieved at the Arecibo Radio Observatory in Puerto Rico. We would not have been able to achieve this effort without the gracious assistance provided by the entire staff at Arecibo. In addition to the staff at Arecibo, our team included simultaneous listening and analysis support by AMSAT-DL at the Bochum Observatory in Germany, the Space Science Center at Morehead State University in Kentucky, and the SETI Institute’s Allen Telescope Array in California.

How’s that for engineering and documentation?

So, maybe good documentation isn’t such a weird thing after all. 😉

May 29, 2014

100+ Interesting Data Sets for Statistics

Filed under: Data,Statistics — Patrick Durusau @ 6:28 pm

100+ Interesting Data Sets for Statistics by Robert Seaton.

From the summary:

Looking for interesting data sets? Here’s a list of more than 100 of the best stuff, from dolphin relationships to political campaign donations to death row prisoners.

If we have data, let’s look at data. If all we have are opinions, let’s go with mine.

—Jim Barksdale

The list was compiled using Robert’s definition of “interesting” but I will be surprised if you don’t agree in most cases.

Curated collections of pointers to data sets come to mind as a possible information product.

Enjoy!

I first saw this in a tweet by Aatish Bhatia.

May 27, 2014

Data as Code. Code as Data:…

Filed under: Clojure,Data,Functional Programming,Semantic Web — Patrick Durusau @ 7:06 pm

Data as Code. Code as Data: Tighther Semantic Web Development Using Clojure by Frédérick Giasson.

From the post:

I have been professionally working in the field of the Semantic Web for more than 7 years now. I have been developing all kinds of Ontologies. I have been integrating all kinds of datasets from various sources. I have been working with all kinds of tools and technologies using all kinds of technology stacks. I have been developing services and user interfaces of all kinds. I have been developing a set of 27 web services packaged as the Open Semantic Framework and re-implemented the core Drupal modules to work with RDF data as I wanted them to. I did write hundreds of thousands of lines of code with one goal in mind: leveraging the ideas and concepts of the Semantic Web to make me, other developers, ontologists and data-scientists work more accurately and efficiently with any kind of data.

However, even after doing all that, I was still feeling a void: a disconnection between how I was thinking about data and how I was manipulating it using the programming languages I was using, the libraries I was leveraging and the web services that I was developing. Everything is working, and is working really well; I did gain a lot of productivity in all these years. However, I was still feeling that void, that disconnection between the data and the programming language.

Frédérick promises to walk us through serializing RDF data into Clojure code.

Doesn’t that sound interesting?

Hmmm, will we find that data has semantics? And subjects that the data represents?

Can’t say, don’t know. But I am very interested in finding out how far Frédérick will go with “Data as Code. Code as Data.”
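Not Clojure, but to give a flavor of the “data as plain data structures” side of the idea, here is a minimal Python sketch using rdflib. The tiny Turtle snippet and the predicate filter are my own illustration, not anything from Frédérick’s post.

from rdflib import Graph, URIRef

# A tiny, made-up Turtle document: two people and who they know.
TURTLE = """
@prefix ex: <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

ex:alice foaf:name "Alice" ; foaf:knows ex:bob .
ex:bob   foaf:name "Bob" .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")

# Once parsed, the graph is just an iterable of (subject, predicate, object)
# tuples -- plain data we can filter and reshape like any other collection.
FOAF_KNOWS = URIRef("http://xmlns.com/foaf/0.1/knows")
knows = [(str(s), str(o)) for s, p, o in g if p == FOAF_KNOWS]

print(knows)  # [('http://example.org/alice', 'http://example.org/bob')]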

May 23, 2014

Convert Existing Data into Parquet

Filed under: Data,Parquet — Patrick Durusau @ 7:19 pm

Convert Existing Data into Parquet by Uri Laserson.

From the post:

Learn how to convert your data to the Parquet columnar format to get big performance gains.

Using a columnar storage format for your data offers significant performance advantages for a large subset of real-world queries. (Click here for a great introduction.)

Last year, Cloudera, in collaboration with Twitter and others, released a new Apache Hadoop-friendly, binary, columnar file format called Parquet. (Parquet was recently proposed for the ASF Incubator.) In this post, you will get an introduction to converting your existing data into Parquet format, both with and without Hadoop.

Actually, between Uri’s post and my pointing to it, Parquet has been accepted into the ASF Incubator!

All the more reason to start following this project.
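If you want to try the no-Hadoop route Uri mentions, here is a minimal sketch of one way to do it with the pyarrow library (my choice of tool, not one the post prescribes); the file names and the “city” column are placeholders.

import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read an existing CSV file into an Arrow table (column names and types
# are inferred from the file; "input.csv" is a placeholder path).
table = pv.read_csv("input.csv")

# Write the same data back out as a Parquet file with snappy compression.
pq.write_table(table, "output.parquet", compression="snappy")

# Reading it back is columnar, so a single-column scan only touches that
# column's pages on disk ("city" is an assumed column for illustration).
cities = pq.read_table("output.parquet", columns=["city"]).column("city")
print(cities)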

Enjoy!

Early Canadiana Online

Filed under: Data,Language,Library — Patrick Durusau @ 6:50 pm

Early Canadiana Online

From the webpage:

These collections contain over 80,000 rare books, magazines and government publications from the 1600s to the 1940s.

This rare collection of documentary heritage will be of interest to scholars, genealogists, history buffs and anyone who enjoys reading about Canada’s early days.

The Early Canadiana Online collection of rare books, magazines and government publications has over 80,000 titles (3,500,000 pages) and is growing. The collection includes material published from the time of the first European settlers to the first four decades of the 20th Century.

You will find books written in 21 languages including French, English, 10 First Nations languages and several European languages, Latin and Greek.

Every online collection such as this one increases the volume of information that is accessible and also increases the difficulty of finding related information for any given subject. But the latter is such a nice problem to have!

I first saw this in a tweet from Lincoln Mullen.

May 22, 2014

Nomad and Historic Information

Filed under: Archives,Data,Documentation — Patrick Durusau @ 10:55 am

You may remember Nomad from the Star Trek episode The Changeling. Not quite on that scale, but NASA has signed an agreement to allow citizen scientists to “wake up” a thirty-five (35) year old spacecraft this coming August.

NASA has given a green light to a group of citizen scientists attempting to breathe new scientific life into a more than 35-year old agency spacecraft.

The agency has signed a Non-Reimbursable Space Act Agreement (NRSAA) with Skycorp, Inc., in Los Gatos, California, allowing the company to attempt to contact, and possibly command and control, NASA’s International Sun-Earth Explorer-3 (ISEE-3) spacecraft as part of the company’s ISEE-3 Reboot Project. This is the first time NASA has worked such an agreement for use of a spacecraft the agency is no longer using or ever planned to use again.

The NRSAA details the technical, safety, legal and proprietary issues that will be addressed before any attempts are made to communicate with or control the 1970’s-era spacecraft as it nears the Earth in August.

“The intrepid ISEE-3 spacecraft was sent away from its primary mission to study the physics of the solar wind, extending its mission of discovery to study two comets,” said John Grunsfeld, astronaut and associate administrator for the Science Mission Directorate at NASA headquarters in Washington. “We have a chance to engage a new generation of citizen scientists through this creative effort to recapture the ISEE-3 spacecraft as it zips by the Earth this summer.” From NASA Signs Agreement with Citizen Scientists Attempting to Communicate with Old Spacecraft.

Do you have any thirty-five (35) year old software you would like to start re-using? 😉

What information should you have captured for that software?

The crowdfunding is in “stretch mode,” working towards $150,000. Support at: ISEE-3 Reboot Project by Space College, Skycorp, and SpaceRef.

May 16, 2014

APIs for Scholarly Resources

Filed under: Data,Library — Patrick Durusau @ 7:58 pm

APIs for Scholarly Resources

From the webpage:

APIs, short for application programming interface, are tools used to share content and data between software applications. APIs are used in a variety of contexts, but some examples include embedding content from one website into another, dynamically posting content from one application to display in another application, or extracting data from a database in a more programmatic way than a regular user interface might allow.

Many scholarly publishers, databases, and products offer APIs to allow users with programming skills to more powerfully extract data to serve a variety of research purposes. With an API, users might create programmatic searches of a citation database, extract statistical data, or dynamically query and post blog content.

Below is a list of commonly used scholarly resources at MIT that make their APIs available for use. If you have programming skills and would like to use APIs in your research, use the table below to get an overview of some available APIs.

If you have any questions or know of an API you would like to see included in this list, please contact Mark Clemente, Library Fellow for Scholarly Publishing and Licensing in the MIT Libraries (contact information at the bottom of this page).

A nice listing of scholarly resources with public APIs and your opportunity to contribute back to this listing with APIs that you discover.
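To give a taste of what these APIs look like in practice, here is a small sketch against the CrossRef works API, one of the public scholarly APIs; the query string is just an example.

import requests

# Query the public CrossRef REST API for works matching a phrase.
# (CrossRef is one example of a scholarly API; adjust the query to taste.)
resp = requests.get(
    "https://api.crossref.org/works",
    params={"query": "topic maps", "rows": 5},
    timeout=30,
)
resp.raise_for_status()

for item in resp.json()["message"]["items"]:
    titles = item.get("title") or ["<untitled>"]
    doi = item.get("DOI", "")
    print(f"{doi}  {titles[0]}")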

Sadly, as far as I know (subject to your corrections), the ACM Digital Library has no public API.

Not all that surprising considering the other shortcomings of the ACM Digital Library interface. For example, you can only save items (their citations) to a binder one item at a time. Customer service will opine they have had this request before but no, you can’t contact the committee that makes decisions about Digital Library features. Nor will they tell you who is on that committee. Sounds like the current White House, doesn’t it?

I first saw this in a tweet by Scott Chamberlain.

April 28, 2014

Humanitarian Data Exchange

Filed under: Data,Data Analysis — Patrick Durusau @ 7:06 pm

Humanitarian Data Exchange

From the webpage:

A project by the United Nations Office for the Coordination of Humanitarian Affairs to make humanitarian data easy to find and use for analysis.

HDX will include a dataset repository, based on open-source software, where partners can share their data spreadsheets and make it easy for others to find and use that data.

HDX brings together a Common Humanitarian Dataset that can be compared across countries and crises, with tools for analysis and visualization.

HDX promotes community data standards (e.g. the Humanitarian Exchange Language) for sharing operational data across a network of actors.

Data from diverse sources always creates opportunities to use topic maps.

The pilot countries include Colombia, Kenya and Yemen, so semantic diversity is a reasonable expectation.

BTW, they are looking for volunteers. Opportunities range from data science, development, and visualization to the creation of data standards.

March 27, 2014

1939 Register

Filed under: Census Data,Data — Patrick Durusau @ 1:11 pm

1939 Register

From the webpage:

The 1939 Register is being digitised and will be published within the next two years.

It will provide valuable information about over 30 million people living in England and Wales at the start of World War Two.

What is the 1939 Register?

The British government took a record of the civilian population shortly after the outbreak of World War Two. The information was used to issue identity cards and organise rationing. It was also used to set up the National Health Service.

Explanations are one of the perils of picking very obvious/intuitive names for projects. 😉

The data should include:

Data will be provided only where the individual is recorded as deceased (or where clear evidence of death can be provided by the applicant) and will include:

  • National Registration number
  • Address
  • Surname
  • First Forename
  • Other Forename(s)/Initial(s)
  • Date of Birth
  • Sex
  • Marital Status
  • Occupation

That list is as per the 1939 Register Service, a government office that charges money to search what one assumes are analog records. (Yikes!)

The reason I mention the 1939 Register Service is the statement:

Is any other data available?

If you wish to request additional information under the Freedom of Information Act 2000, please email enquiries@hscic.gov.uk or contact us using the postal address below, marking the letter for the Higher Information Governance Officer (Southport).

Which implies to me there is more data to be had, but the 1911Census.org.uk says not.

Well, assuming you don’t include:

“If member of armed forces or reserves,” which was column G on the original form.

Hard to say why that would be omitted.

It will be interesting to see if the original and then “updated” cards are digitized.

In some of the background reading I did on this data, I learned that some mothers omitted their sons from the registration cards (one assumes to avoid military service), but when rationing began based on the registration cards, they filed updated cards to include their sons.

I suspect the 1939 data will be mostly of historical interest but wanted to mention it because people will be interested in it.

CSV on the Web

Filed under: CSV,Data,W3C — Patrick Durusau @ 10:58 am

CSV on the Web Use Cases and Requirements, and Model for Tabular Data and Metadata Published

I swear, that really is the title.

Two recent drafts of interest:

The CSV on the Web: Use Cases and Requirements collects use cases that are at the basis of the work of the Working Group. A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. The Working Group aim to specify technologies that provide greater interoperability for data dependent applications on the Web when working with tabular datasets comprising single or multiple files using CSV, or similar, format. This document lists a first set of use cases compiled by the Working Group that are considered representative of how tabular data is commonly used within data dependent applications. The use cases observe existing common practice undertaken when working with tabular data, often illustrating shortcomings or limitations of existing formats or technologies. This document also provides a first set of requirements derived from these use cases that have been used to guide the specification design.

The Model for Tabular Data and Metadata on the Web outlines a basic data model, or infoset, for tabular data and metadata about that tabular data. The document contains first drafts for various methods of locating metadata: one of the outputs the Working Group is chartered to produce is a metadata vocabulary and standard method(s) to find such metadata. It also contains some non-normative information about a best practice syntax for tabular data, for mapping into that data model, to contribute to the standardisation of CSV syntax by IETF (as a possible update of RFC4180).

I guess they mean to use CSV as it exists? What a radical concept. 😉
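Using CSV as it exists looks something like the pairing below: a CSV file plus a small metadata document describing its columns, which a program can use to check headers and cast types. The metadata keys are only CSVW-flavored illustrations on my part, not the exact vocabulary the Working Group will standardize.

import csv
import io

# An invented CSV file and a CSVW-flavored metadata description of it.
CSV_DATA = "country,year,population\nCA,1939,11267000\nGB,1939,47760000\n"
METADATA = {
    "tableSchema": {
        "columns": [
            {"name": "country", "datatype": "string"},
            {"name": "year", "datatype": "integer"},
            {"name": "population", "datatype": "integer"},
        ]
    }
}

CASTS = {"string": str, "integer": int}

def read_with_metadata(text, metadata):
    """Parse CSV text, checking headers against the metadata and casting types."""
    cols = metadata["tableSchema"]["columns"]
    reader = csv.DictReader(io.StringIO(text))
    assert reader.fieldnames == [c["name"] for c in cols], "header mismatch"
    for row in reader:
        yield {c["name"]: CASTS[c["datatype"]](row[c["name"]]) for c in cols}

for record in read_with_metadata(CSV_DATA, METADATA):
    print(record)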

What next?

Could use an updated specification for the COBOL data format in which many government data sets are published (even now).

That last statement isn’t entirely in jest. There are a lot of COBOL-formatted files on government websites in particular.

March 23, 2014

New Book on Data and Power

Filed under: Data,Government,NSA,Privacy,Security — Patrick Durusau @ 6:23 pm

New Book on Data and Power by Bruce Schneier.

From the post:

I’m writing a new book, with the tentative title of Data and Power.

While it’s obvious that the proliferation of data affects power, it’s less clear how it does so. Corporations are collecting vast dossiers on our activities on- and off-line — initially to personalize marketing efforts, but increasingly to control their customer relationships. Governments are using surveillance, censorship, and propaganda — both to protect us from harm and to protect their own power. Distributed groups — socially motivated hackers, political dissidents, criminals, communities of interest — are using the Internet to both organize and effect change. And we as individuals are becoming both more powerful and less powerful. We can’t evade surveillance, but we can post videos of police atrocities online, bypassing censors and informing the world. How long we’ll still have those capabilities is unclear.

Understanding these trends involves understanding data. Data is generated by all computing processes. Most of it used to be thrown away, but declines in the prices of both storage and processing mean that more and more of it is now saved and used. Who saves the data, and how they use it, is a matter of extreme consequence, and will continue to be for the coming decades.

Data and Power examines these trends and more. The book looks at the proliferation and accessibility of data, and how it has enabled constant surveillance of our entire society. It examines how governments and corporations use that surveillance data, as well as how they control data for censorship and propaganda. The book then explores how data has empowered individuals and less-traditional power blocs, and how the interplay among all of these types of power will evolve in the future. It discusses technical controls on power, and the limitations of those controls. And finally, the book describes solutions to balance power in the future — both general principles for society as a whole, and specific near-term changes in technology, business, laws, and social norms.
….

Bruce says a table of contents should appear in “a couple of months” and he is going to be asking “for volunteers to read and comment on a draft version.”

I assume from the description that Bruce is going to try to connect a fairly large number of dots.

Such as who benefits from the Code of Federal Regulations (CFRs) not having an index? The elimination of easier access to the CFRs is a power move. Someone with a great deal of power wants to eliminate the chance of someone gaining power from following information in the CFRs.

I am not a conspiracy theorist but there are only two classes of people in any society, people with more power than you and people with less. Every sentient person wants to have more and no one will voluntarily take less. Among chickens they call it the “pecking order.”

In human society, the “pecking order” is enforced by uncoordinated and largely unconscious following of cultural norms. No conspiracy, just the way we are. But there are cases, the CFR indexes being one of them, where someone is clearly trying to disadvantage others. Who and for what reasons remains unknown.

Data enhancing the Royal Society of…

Filed under: Cheminformatics,Data,ETL — Patrick Durusau @ 4:12 pm

Data enhancing the Royal Society of Chemistry publication archive by Antony Williams.

Abstract:

The Royal Society of Chemistry has an archive of hundreds of thousands of published articles containing various types of chemistry related data – compounds, reactions, property data, spectral data etc. RSC has a vision of extracting as much of these data as possible and providing access via ChemSpider and its related projects. To this end we have applied a combination of text-mining extraction, image conversion and chemical validation and standardization approaches. The outcome of this project will result in new chemistry related data being added to our chemical and reaction databases and in the ability to more tightly couple web-based versions of the articles with these extracted data. The ability to search across the archive will be enhanced as a result. This presentation will report on our progress in this data extraction project and discuss how we will ultimately use similar approaches in our publishing pipeline to enhance article markup for new publications.

The data mining Antony details on the Royal Society of Chemistry is impressive!

But as Antony notes at slide #30, it isn’t a long term solution:

We should NOT be mining data out of future publications (emphasis added)

I would say the same thing for metadata/subject identities in data. For some data and some subjects, we can, after the fact, reconstruct properties to identify the subjects they represent.

Data/text mining would be more accurate and easier if subjects were identified at the time of authoring. Perhaps even automatically or at least subject to a user’s approval.

That would be more accurate than researchers, removed from an author by time, distance and even profession, trying to guess what subject an author may have meant.

Better semantic authoring support now, will reduce the cost and improve the accuracy of data mining in the future.

Quickly create a 100k Neo4j graph data model…

Filed under: Data,Graphs,Neo4j — Patrick Durusau @ 2:54 pm

Quickly create a 100k Neo4j graph data model with Cypher only by Michael Hunger.

From the post:

We want to run some test queries on an existing graph model but have no sample data at hand and also no input files (CSV,GraphML) that would provide it.

Why not create it quickly on our own just using Cypher? First I thought about using Cypher to generate CSV files and loading them back, but it is much easier.

The domain is simple (:User)-[:OWN]->(:Product) but good enough for collaborative filtering or demographic analysis.
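Michael does everything in Cypher. As a rough companion sketch, here is how you might drive similar statements from Python with the official Neo4j driver; the connection details and the statements themselves are my own illustration, scaled well below 100k.

from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

STATEMENTS = [
    # Create users and products, then randomly connect them with :OWN.
    "UNWIND range(1, 1000) AS uid CREATE (:User {id: uid})",
    "UNWIND range(1, 200) AS pid CREATE (:Product {id: pid})",
    """MATCH (u:User), (p:Product)
       WHERE rand() < 0.01
       CREATE (u)-[:OWN]->(p)""",
]

with driver.session() as session:
    for stmt in STATEMENTS:
        session.run(stmt).consume()

driver.close()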

Admittedly a “simple” domain, but I’m curious: how would you rank sample data?

We can all probably recognize “simple” domains but what criteria should we use to rank more complex sample data?

Suggestions?

March 22, 2014

Institute of Historical Research (Podcasts)

Filed under: Data,History — Patrick Durusau @ 9:51 am

Institute of Historical Research (Podcasts)

From the webpage:

Since 2009 the IHR has produced over 500 podcasts, encompassing not only its acclaimed and unique seminar series, but also one-off talks and conferences. All of these recordings are freely available here to stream or download, and can be searched, or browsed by date, event, or subject. In many cases abstracts and other material accompanying the talks can also be found.

These recordings, particularly those taken from seminars where historians are showcasing their current research, provide a great opportunity to listen to experts in all fields of history discuss their work in progress. If you have any questions relating to the podcasts found here, please contact us.

I don’t know what you like writing topic maps about but I suspect you can find some audio podcast resources here.

I am disappointed that “ancient” history has so few, but more recent history, the 16th century onward, has much better coverage.

The offerings range from the expected:

Goethe’s Erotic Poetry and the Libertine Spectre

Big Flame 1970-1984. A history of a revolutionary socialist organisation

to the obscure:

Chinese and British Gift Giving in the Macartney Embassy of 1793

Learning from the Experience of a Town in Peru’s Central Andes, 1931-1948

Makes me wonder if there is linked data that covers the subjects in these podcasts.

This illustrates one problem with “universal” solutions. It is fairly trivial to cover all the “facts” in Wikipedia, but that is such a small portion of all available facts. Useful, but still a small set of facts.

Enjoy!

March 20, 2014

PLUS

Filed under: Data,Neo4j,Provenance — Patrick Durusau @ 7:36 pm

PLUS

From the webpage:

PLUS is a system for capturing and managing provenance information, originally created at the MITRE Corporation.

Data provenance is “information that helps determine the derivation history of a data product…[It includes] the ancestral data product(s) from which this data product evolved, and the process of transformation of these ancestral data product(s).”

Uses Neo4j for storage.

Includes an academic bibliography of related papers.

Provenance answers the question: where has your data been, what has happened to it, and with whom?
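If provenance graphs are new to you, a derivation history can be as simple as data products pointing back at the products and processes they came from. The sketch below is my own toy illustration, not PLUS’s data model.

from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A node in a toy provenance graph: what it is and where it came from."""
    name: str
    process: str = "original"                    # the transformation that produced it
    parents: list = field(default_factory=list)  # ancestral data products

def lineage(product, depth=0):
    """Walk back through the derivation history of a data product."""
    print("  " * depth + f"{product.name}  (via {product.process})")
    for parent in product.parents:
        lineage(parent, depth + 1)

raw = DataProduct("sensor_readings.csv")
clean = DataProduct("readings_clean.csv", "drop nulls + unit conversion", [raw])
report = DataProduct("monthly_report.pdf", "aggregate by month", [clean])

lineage(report)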

March 19, 2014

Podcast: Thinking with Data

Filed under: Data,Data Analysis,Data Science — Patrick Durusau @ 1:39 pm

Podcast: Thinking with Data: Data tools are less important than the way you frame your questions by Jon Bruner.

From the description:

Max Shron and Jake Porway spoke with me at Strata a few weeks ago about frameworks for making reasoned arguments with data. Max’s recent O’Reilly book, Thinking with Data, outlines the crucial process of developing good questions and creating a plan to answer them. Jake’s nonprofit, DataKind, connects data scientists with worthy causes where they can apply their skills.

Curious if you agree with Max that data tools are “mature”?

Certainly better than they were when I was an undergraduate in political science but measuring sentiment was a current topic even then. 😉

And the controversy of tools versus good questions isn’t a new one either.

To his credit, Max does acknowledge decades of discussion of rhetoric and thinking as helpful in this area.

For you research buffs, any pointers to prior tools versus good questions debates? (Think sociology/political science in the 1970s to date. It’s a recurring theme.)

I first saw this in a tweet by Mike Loukides.

March 17, 2014

Peyote and the International Plant Names Index

Filed under: Agriculture,Data,Names,Open Access,Open Data,Science — Patrick Durusau @ 1:30 pm

International Plant Names Index

What a great resource to find as we near Spring!

From the webpage:

The International Plant Names Index (IPNI) is a database of the names and associated basic bibliographical details of seed plants, ferns and lycophytes. Its goal is to eliminate the need for repeated reference to primary sources for basic bibliographic information about plant names. The data are freely available and are gradually being standardized and checked. IPNI will be a dynamic resource, depending on direct contributions by all members of the botanical community.

I entered the first plant name that came to mind: Peyote.

No “hits”?

Wikipedia gives Peyote’s binomial name as: Lophophora williamsii (think synonym).*

Searching on Lophophora williamsii, I got three (3) “hits.”

Had I bothered to read the FAQ before searching:

10. Can I use IPNI to search by common (vernacular) name?

No. IPNI does not include vernacular names of plants as these are rarely formally published. If you are looking for information about a plant for which you only have a common name you may find the following resources useful. (Please note that these links are to external sites which are not maintained by IPNI)

I understand the need to specialize in one form of names, but “formally published” means that without a useful synonym list, the general public faces an additional burden in accessing publicly funded research results.

Even with a synonym list there is an additional burden, because you have to look up terms in the list, then read the text with that understanding, and then go back to the synonym list again.

What would dramatically increase public access to publicly funded research would be a specialized synonym list for publications that transposes the jargon in articles into selected sets of synonyms. It would not be as precise or grammatical as the original, but it would allow the reading public to get a sense of even very technical research.
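The mechanics of such a synonym list are trivial, which is part of the point. A toy sketch (my own vernacular-to-binomial pairs, not an IPNI product):

# A toy vernacular-name to binomial-name synonym table (illustrative only).
SYNONYMS = {
    "peyote": "Lophophora williamsii",
    "maize": "Zea mays",
    "english oak": "Quercus robur",
}

def to_binomial(common_name):
    """Map a common name to the binomial name a database like IPNI expects."""
    return SYNONYMS.get(common_name.strip().lower())

query = "Peyote"
binomial = to_binomial(query)
print(binomial or f"No synonym recorded for {query!r}")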

That could be a way to hitch topic maps to the access-to-publicly-funded-data bandwagon.

Thoughts?

I first saw this in a tweet by Bill Baker.

* A couple of other fun facts from Wikipedia on Peyote: 1. Its conservation status is listed as “apparently secure,” and 2. Wikipedia has photos of Peyote “in the wild.” I suppose saying “Peyote growing in a pot” would raise too many questions.

March 12, 2014

AntWeb

Filed under: Data,R,Science — Patrick Durusau @ 7:46 pm

AntWeb by rOpenSci.

From the webpage:

AntWeb is a repository of ant specimen records maintained by the California Academy of Sciences. From the website’s description:

AntWeb is the world’s largest online database of images, specimen records, and natural history information on ants. It is community driven and open to contribution from anyone with specimen records, natural history comments, or images.

Resources

An R wrapper for the AntWeb API.

Listing functions + descriptions:

  • aw_data – Search for data by taxonomic level, full species name, a bounding box, habitat, elevation or type
  • aw_unique – Obtain a list of unique levels by various taxonomic ranks
  • aw_images – Search photos by type or time since added
  • aw_coords – Search for specimens by location and radius
  • aw_code – Search for a specimen by record number
  • aw_map – Map georeferenced data

Doesn’t hurt to have a few off-beat data sets at your command. You never know when someone’s child will need help with a science fair project, etc.

PS: I did resist the temptation to list this post under “bugs.”

March 11, 2014

30,000 comics, 7,000 series – How’s Your Collection?

Filed under: Data,History,Social Sciences — Patrick Durusau @ 4:53 pm

Marvel Comics opens up its metadata for amazing Spider-Apps by Alex Dalenberg.

From the post:

It’s not as cool as inheriting superpowers from a radioactive spider, but thanks to Marvel Entertainment’s new API, you can now build Marvel Comics apps to your heart’s content.

That is, as long as you’re not making any money off of them. Nevertheless, it’s a comic geek’s dream. The Disney-owned company is opening up the data trove from its 75-year publishing history, including cover art, characters and comic book crossover events, for developers to tinker with.

That’s metadata for more than 30,000 comics and 7,000 series.

Marvel Developer.
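For the curious, calling the API looks roughly like the sketch below. The developer portal documents a timestamp-plus-MD5-hash signing scheme for server-side calls; the keys here are obviously placeholders.

import hashlib
import time

import requests

PUBLIC_KEY = "your-public-key"    # placeholders: register at the Marvel
PRIVATE_KEY = "your-private-key"  # Developer portal to get real keys

ts = str(int(time.time()))
# Server-side requests are signed with md5(ts + private key + public key).
digest = hashlib.md5((ts + PRIVATE_KEY + PUBLIC_KEY).encode()).hexdigest()

resp = requests.get(
    "https://gateway.marvel.com/v1/public/comics",
    params={"ts": ts, "apikey": PUBLIC_KEY, "hash": digest, "limit": 5},
    timeout=30,
)
resp.raise_for_status()

for comic in resp.json()["data"]["results"]:
    print(comic["title"])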

I know, another one of those non-commercial use licenses. I mean, Marvel paid for all of this content and then has the gall to not just give it away for free. What is the world coming to?

😉

Personally I think Marvel has the right to allow as much or as little access to their data as they please. If you come up with a way to make money using this content, ask Marvel for commercial permissions. I deeply suspect they will be more than happy to accommodate any reasonable request.

The uses for comic book zealots are obvious, but aren’t you curious about the comic books your parents read? Or that your grandparents read?

Speaking of contemporary history, a couple of other cultural goldmines are the Playboy Cover to Cover Hard Drive – Every Issue From 1953 to 2010 and Rolling Stone.

I don’t own either one, so I don’t know how hard it would be to get the content into machine-readable format.

Still, both would be a welcome contrast to mainstream news sources.

I first saw this in a tweet by Bob DuCharme.

March 10, 2014

Hubble Source Catalog

Filed under: Astroinformatics,Data — Patrick Durusau @ 4:51 pm

Beta Version 0.3 of the Hubble Source Catalog

From the post:

The Hubble Source Catalog (HSC) is designed to optimize science from the Hubble Space Telescope by combining the tens of thousands of visit-based source lists in the Hubble Legacy Archive (HLA) into a single master catalog.

Search with Summary Form now (one row per match)
Search with Detailed Form now (one row per source)

Beta Version 0.3 of the HSC contains members of the WFPC2, ACS/WFC, WFC3/UVIS and WFC3/IR Source Extractor source lists in HLA version DR7.2 (data release 7.2) that are considered to be valid detections because they have flag values less than 5 (see more flag information).

The crossmatching process involves adjusting the relative astrometry of overlapping images so as to minimize positional offsets between closely aligned sources in different images. After correction, the astrometric residuals of crossmatched sources are significantly reduced, to typically less than 10 mas. In addition, the catalog includes source nondetections. The crossmatching algorithms and the properties of the initial (Beta 0.1) catalog are described in Budavari & Lubow (2012).
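The general positional crossmatch idea (not the HSC team’s actual pipeline) is easy to sketch with astropy: match each source to its nearest neighbour in another list and keep the pairs closer than some tolerance. The coordinates below are made up.

import numpy as np
from astropy import units as u
from astropy.coordinates import SkyCoord

# Two invented source lists with small positional offsets between them.
list_a = SkyCoord(ra=[10.0000, 10.0020] * u.deg, dec=[41.0000, 41.0010] * u.deg)
list_b = SkyCoord(ra=[10.0000, 10.0020, 12.0] * u.deg,
                  dec=[41.0000, 41.0010, 40.0] * u.deg)

# For each source in list_a, find its nearest neighbour in list_b.
idx, sep2d, _ = list_a.match_to_catalog_sky(list_b)

# Keep only pairs within a 10 milliarcsecond tolerance (the typical
# post-correction residual quoted above; these numbers are illustrative).
tolerance = 10 * u.mas
matched = np.where(sep2d < tolerance)[0]
for i in matched:
    print(f"A[{i}] matches B[{idx[i]}] at {sep2d[i].to(u.mas).value:.2f} mas")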

If you need training with this data set, see: A Hubble Source Catalog (HSC) Walkthrough.

March 7, 2014

Introducing the ProPublica Data Store

Filed under: Data,News,Reporting — Patrick Durusau @ 8:07 pm

Introducing the ProPublica Data Store by Scott Klein and Ryann Grochowski Jones.

From the post:

We work with a lot of data at ProPublica. It's a big part of almost everything we do — from data-driven stories to graphics to interactive news applications. Today we're launching the ProPublica Data Store, a new way for us to share our datasets and for them to help sustain our work.

Like most newsrooms, we make extensive use of government data — some downloaded from "open data" sites and some obtained through Freedom of Information Act requests. But much of our data comes from our developers spending months scraping and assembling material from web sites and out of Acrobat documents. Some data requires months of labor to clean or requires combining datasets from different sources in a way that's never been done before.

In the Data Store you'll find a growing collection of the data we've used in our reporting. For raw, as-is datasets we receive from government sources, you'll find a free download link that simply requires you agree to a simplified version of our Terms of Use. For datasets that are available as downloads from government websites, we've simply linked to the sites to ensure you can quickly get the most up-to-date data.

For datasets that are the result of significant expenditures of our time and effort, we're charging a reasonable one-time fee: In most cases, it's $200 for journalists and $2,000 for academic researchers. Those wanting to use data commercially should reach out to us to discuss pricing. If you're unsure whether a premium dataset will suit your purposes, you can try a sample first. It's a free download of a small sample of the data and a readme file explaining how to use it.

The datasets contain a wealth of information for researchers and journalists. The premium datasets are cleaned and ready for analysis. They will save you months of work preparing the data. Each one comes with documentation, including a data dictionary, a list of caveats, and details about how we have used the data here at ProPublica.

A data store you can feel good about supporting!

I first saw this at Nathan Yau’s ProPublica opened a data store.

March 5, 2014

Q

Filed under: Data,Language — Patrick Durusau @ 8:27 pm

Q by Bernard Lambeau.

From the webpage:

Q is a data language. For now, it is limited to a data definition language (DDL). Think “JSON/XML schema”, but the correct way. Q comes with a dedicated type system for defining data and a theory, called information contracts, for interoperability with programming and data exchange languages.

I am sure this will be useful but limited since it doesn’t extend to disclosing the semantics of data or the structures that contain data.

Unfortunate but it seems like the semantics of data are treated as: “…you know what the data means…,” which is rather far from the truth.

Sometimes some people may know what the data “means,” but that is hardly a sure thing.

My favorite example is the pyramids being built in front of hundreds of thousands of people over decades; because everyone “…knew how it was done…,” no one bothered to write it down.

Now H2 can consult with “ancient astronaut theorists” (I’m not lying, that is what they called their experts) about the building of the pyramids.

Do you want your data to be interpreted by the data equivalent of an “ancient astronaut theorist?” If not, you had better give some consideration to documenting the semantics of your data.

I first saw this in a tweet by Carl Anderson.

On Data and Performance

Filed under: Art,Data,Topic Maps — Patrick Durusau @ 4:46 pm

On Data and Performance by Jer Thorp.

From the post:

Data live utilitarian lives. From the moment they are conceived, as measurements of some thing or system or person, they are conscripted to the cause of being useful. They are fed into algorithms, clustered and merged, mapped and reduced. They are graphed and charted, plotted and visualized. A rare datum might find itself turned into sound, or, more seldom, manifested as a physical object. Always, though, the measure of the life of data is in its utility. Data that are collected but not used are condemned to a quiet life in a database. They dwell in obscure tables, are quickly discarded, or worse (cue violin) – labelled as ‘exhaust’.

Perhaps this isn’t the only role for a datum? To be operated on? To be useful?

Over the last couple of years, with my collaborators Ben Rubin & Mark Hansen, we’ve been investigating the possibility of using data as a medium for performance. Here, data becomes the script, or the score, and in turn technologies that we typically think of as tools become instruments, and in some cases performers.

The most recent manifestation of these explorations is a performance called A Thousand Exhausted Things, which we recently staged at The Museum of Modern Art, with the experimental theater group Elevator Repair Service. In this performance, the script is MoMA’s collections database, an eighty year-old, 120k object strong archive. The instruments are a variety of custom-written natural language processing algorithms, which are used to turn the text of the database (largely the titles of artworks) into a performable form.

The video would have been far more effective had it included the visualization at all times alongside the script and actors.

The use of algorithms to create a performance from the titles of works reminds me of Stanley Fish’s How to Recognize a Poem When You See One. From my perspective, the semantics you “see” in data are the semantics you expect to see. What else would they be?

What I find very powerful about topic maps is that different semantics can reside side by side for the same data.

I first saw this in a tweet by blprnt.

March 4, 2014

PLOS’ Bold Data Policy

Filed under: Data,Open Access,Open Data,Public Data — Patrick Durusau @ 11:32 am

PLOS’ Bold Data Policy by David Crotty.

From the post:

If you pay any attention at all to scholarly publishing, you’re likely aware of the current uproar over PLOS’ recent announcement requiring all article authors to make their data publicly available. This is a bold move, and a forward-looking policy from PLOS. It may, for many reasons, have come too early to be effective, but ultimately, that may not be the point.

Perhaps the biggest practical problem with PLOS’ policy is that it puts an additional time and effort burden on already time-short, over-burdened researchers. I think I say this in nearly every post I write for the Scholarly Kitchen, but will repeat it again here: Time is a researcher’s most precious commodity. Researchers will almost always follow the path of least resistance, and not do anything that takes them away from their research if it can be avoided.

When depositing NIH-funded papers in PubMed Central was voluntary, only 3.8% of eligible papers were deposited, not because people didn’t want to improve access to their results, but because it wasn’t required and took time and effort away from experiments. Even now, with PubMed Central deposit mandatory, only 20% of what’s deposited comes from authors. The majority of papers come from journals depositing on behalf of authors (something else for which no one seems to give publishers any credit, Kent, one more for your list). Without publishers automating the process on the author’s behalf, compliance would likely be vastly lower. Lightening the burden of the researcher in this manner has become a competitive advantage for the journals that offer this service.

While recognizing the goal of researchers to do more experiments, isn’t this reminiscent of the lack of documentation for networks and software?

The creators of networks and software want to get on with the work they enjoy, and documentation is not part of that work.

The problem with the semantics of research data, much as with network and software semantics, is that there is no one else to ask about them. If researchers don’t document those semantics as they perform experiments, then they will have to spend the time at publication gathering that information together.

I sense an opportunity here for software to assist researchers in capturing semantics as they perform experiments, so that production of semantically annotated data at the end of an experiment can be largely a clerical task, subject to review by the actual researchers.
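As a purely hypothetical illustration of what capturing semantics during an experiment could mean, imagine a small helper that refuses to save a data file unless every column comes with a description and units, and that writes that documentation out alongside the data. Everything below is invented for the example.

import csv
import json

def save_with_semantics(path, rows, column_docs):
    """Write rows to CSV only if every column has documentation, and drop a
    sidecar JSON file recording that documentation next to the data."""
    columns = list(rows[0].keys())
    undocumented = [c for c in columns if c not in column_docs]
    if undocumented:
        raise ValueError(f"No semantics recorded for columns: {undocumented}")

    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

    with open(path + ".meta.json", "w") as f:
        json.dump(column_docs, f, indent=2)

rows = [{"temp": 21.3, "ph": 7.1}]
docs = {
    "temp": {"description": "bath temperature", "units": "degrees Celsius"},
    "ph": {"description": "solution pH", "units": "dimensionless"},
}
save_with_semantics("run_042.csv", rows, docs)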

The minimal semantics that need to be captured for different types of research will vary. That is all the more reason to research and document those semantics before anyone writes a complex monolith of semantics into which existing semantics must be shoehorned.

The reasoning: if we don’t know the semantics of data, it is more cost effective to pipe it to /dev/null.

I first saw this in a tweet by ChemConnector.

March 1, 2014

R and the Weather

Filed under: Data,R,Weather Data — Patrick Durusau @ 6:13 pm

R and the Weather by Joseph Rickert.

From the post:

The weather is on everybody’s mind these days: too much ice and snow east of the Rockies and no rain to speak of in California. Ram Narasimhan has made it a little easier for R users to keep track of what’s going on and also get a historical perspective. His new R package weatherData makes it easy to download weather data from various stations around the world collecting data. Here is a time series plot of the average temperature recorded at SFO last year with the help of weatherData’s getWeatherForYear() function. It is really nice that the function returns a data frame of hourly data with the Time variable as class POSIXct.

Everyone is still talking about winter weather but summer isn’t far off and with that comes hurricane season.

You can capture a historical perspective that goes beyond the highest and lowest temperature for a particular day.

Enjoy!

I first saw this in The week in stats (Feb. 10th edition).

Introducing OData

Filed under: Data — Patrick Durusau @ 5:55 pm

Introducing OData by David Chappell.

From the post:

Describing OData

Our world is awash in data. Vast amounts exist today, and more is created every year. Yet data has value only if it can be used, and it can be used only if it can be accessed by applications and the people who use them.

Allowing this kind of broad access to data is the goal of the Open Data Protocol, commonly called just OData. This paper provides an introduction to OData, describing what it is and how it can be applied. The goal is to illustrate why OData is important and how your organization might use it.

The Problem: Accessing Diverse Data in a Common Way

There are many possible sources of data. Applications collect and maintain information in databases, organizations store data in the cloud, and many firms make a business out of selling data. And just as there are many data sources, there are many possible clients: Web browsers, apps on mobile devices, business intelligence (BI) tools, and more. How can this varied set of clients access these diverse data sources?

One solution is for every data source to define its own approach to exposing data. While this would work, it leads to some ugly problems. First, it requires every client to contain unique code for each data source it will access, a burden for the people who write those clients. Just as important, it requires the creators of each data source to specify and implement their own approach to getting at their data, making each one reinvent this wheel. And with custom solutions on both sides, there’s no way to create an effective set of tools to make life easier for the people who build clients and data sources.

Thinking about some typical problems illustrates why this approach isn’t the best solution. Suppose a Web application wishes to expose its data to apps on mobile phones, for instance. Without some common way to do this, the Web application must implement its own idiosyncratic approach, forcing every client app developer that needs its data to support this. Or think about the need to connect various BI tools with different data sources to answer business questions. If every data source exposes data in a different way, analyzing that data with various tools is hard — an analyst can only hope that her favorite tool supports the data access mechanism she needs to get at a particular data source.

Defining a common approach makes much more sense. All that’s needed is agreement on a way to model data and a protocol for accessing that data — the implementations can differ. And given the Web-oriented world we live in, it would make sense to build this technology with existing Web standards as much as possible. This is exactly the approach taken by OData.

I’ve been looking for an introduction to OData that is more than an elevator speech but less than all the details. I think this one fits the bill.

I was looking because OData Version 4.0 and OData JSON Format Version 4.0 (OData TC at OASIS) recently became OASIS standards.

However you wish to treat data post-acquisition, as in a topic map, is your concern. Obtaining data, however, will be made easier through the use of OData.
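To see what that “common way” looks like on the wire, here is a minimal sketch of an OData query from Python. The service URL is a placeholder, but $select, $filter and $top are standard OData system query options, and OData’s JSON format returns matching entities under a “value” key.

import requests

# Placeholder OData v4 service root; substitute a real service.
SERVICE = "https://example.com/odata/v4/"

# OData exposes entity sets over plain HTTP and uses system query options
# ($select, $filter, $top, ...) to shape the response.
resp = requests.get(
    SERVICE + "Products",
    params={
        "$select": "Name,Price",
        "$filter": "Price lt 20",
        "$top": 5,
    },
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()

# Matching entities come back under the "value" key of the JSON payload.
for product in resp.json()["value"]:
    print(product["Name"], product["Price"])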

February 26, 2014

Data Access for the Open Access Literature: PLOS’s Data Policy

Filed under: Data,Open Access,Open Data,Public Data — Patrick Durusau @ 5:44 pm

Data Access for the Open Access Literature: PLOS’s Data Policy by Theo Bloom.

From the post:

Data are any and all of the digital materials that are collected and analyzed in the pursuit of scientific advances. In line with Open Access to research articles themselves, PLOS strongly believes that to best foster scientific progress, the underlying data should be made freely available for researchers to use, wherever this is legal and ethical. Data availability allows replication, reanalysis, new analysis, interpretation, or inclusion into meta-analyses, and facilitates reproducibility of research, all providing a better ‘bang for the buck’ out of scientific research, much of which is funded from public or nonprofit sources. Ultimately, all of these considerations aside, our viewpoint is quite simple: ensuring access to the underlying data should be an intrinsic part of the scientific publishing process.

PLOS journals have requested data be available since their inception, but we believe that providing more specific instructions for authors regarding appropriate data deposition options, and providing more information in the published article as to how to access data, is important for readers and users of the research we publish. As a result, PLOS is now releasing a revised Data Policy that will come into effect on March 1, 2014, in which authors will be required to include a data availability statement in all research articles published by PLOS journals; the policy can be found below. This policy was developed after extensive consultation with PLOS in-house professional and external Academic Editors and Editors in Chief, who are practicing scientists from a variety of disciplines.

We now welcome input from the larger community of authors, researchers, patients, and others, and invite you to comment before March. We encourage you to contact us collectively at data@plos.org; feedback via Twitter and other sources will also be monitored. You may also contact individual PLOS journals directly.

That is a large step towards verifiable research and was taken by PLOS in December of 2013.

That has been supplemented with details that do not change the December announcement in: PLOS’ New Data Policy: Public Access to Data by Liz Silva, which reads in part:

A flurry of interest has arisen around the revised PLOS data policy that we announced in December and which will come into effect for research papers submitted next month. We are gratified to see a huge swell of support for the ideas behind the policy, but we note some concerns about how it will be implemented and how it will affect those preparing articles for publication in PLOS journals. We’d therefore like to clarify a few points that have arisen and once again encourage those with concerns to check the details of the policy or our FAQs, and to contact us with concerns if we have not covered them.

I think the bottom line is: Don’t Panic, Ask.

There are always going to be unanticipated details or concerns but as time goes by and customs develop for how to solve those issues, the questions will become fewer and fewer.

Over time, and not that much time, our history of arrangements other than open access is going to puzzle present and future generations of researchers.

715 New Worlds

Filed under: Astroinformatics,Data — Patrick Durusau @ 4:13 pm

NASA’s Kepler Mission Announces a Planet Bonanza, 715 New Worlds by Michele Johnson and J.D. Harrington.

From the post:

NASA’s Kepler mission announced Wednesday the discovery of 715 new planets. These newly-verified worlds orbit 305 stars, revealing multiple-planet systems much like our own solar system.

Nearly 95 percent of these planets are smaller than Neptune, which is almost four times the size of Earth. This discovery marks a significant increase in the number of known small-sized planets more akin to Earth than previously identified exoplanets, which are planets outside our solar system.

“The Kepler team continues to amaze and excite us with their planet hunting results,” said John Grunsfeld, associate administrator for NASA’s Science Mission Directorate in Washington. “That these new planets and solar systems look somewhat like our own, portends a great future when we have the James Webb Space Telescope in space to characterize the new worlds.”

Since the discovery of the first planets outside our solar system roughly two decades ago, verification has been a laborious planet-by-planet process. Now, scientists have a statistical technique that can be applied to many planets at once when they are found in systems that harbor more than one planet around the same star.

What have you discovered lately? 😉

The papers: http://www.nasa.gov/ames/kepler/digital-press-kit-kepler-planet-bonanza.

More about Kepler: http://www.nasa.gov/kepler.

Great discoveries but what else is in the Kepler data that no one is looking for?

February 22, 2014

Latest Kepler Discoveries

Filed under: Astroinformatics,Data — Patrick Durusau @ 9:01 pm

NASA Hosts Media Teleconference to Announce Latest Kepler Discoveries

NASA Kepler Teleconference: 1 p.m. EST, Wednesday, Feb. 26, 2014.

From the post:

NASA will host a news teleconference at 1 p.m. EST, Wednesday, Feb. 26, to announce new discoveries made by its planet-hunting mission, the Kepler Space Telescope.

The briefing participants are:

Douglas Hudgins, exoplanet exploration program scientist, NASA’s Astrophysics Division in Washington

Jack Lissauer, planetary scientist, NASA’s Ames Research Center, Moffett Field, Calif.

Jason Rowe, research scientist, SETI Institute, Mountain View, Calif.

Sara Seager, professor of planetary science and physics, Massachusetts Institute of Technology, Cambridge, Mass.

Launched in March 2009, Kepler was the first NASA mission to find Earth-size planets in or near the habitable zone — the range of distance from a star in which the surface temperature of an orbiting planet might sustain liquid water. The telescope has since detected planets and planet candidates spanning a wide range of sizes and orbital distances. These findings have led to a better understanding of our place in the galaxy.

The public is invited to listen to the teleconference live via UStream, at: http://www.ustream.tv/channel/nasa-arc

Questions can be submitted on Twitter using the hashtag #AskNASA.

Audio of the teleconference also will be streamed live at: http://www.nasa.gov/newsaudio

A link to relevant graphics will be posted at the start of the teleconference on NASA’s Kepler site: http://www.nasa.gov/kepler

If you aren’t mining Kepler data, this may be the inspiration to get you started!

Someone is going to discover a planet of the right size in the “Goldilocks zone.” It certainly won’t be you if you don’t try.

That would make a nice bullet on your data scientist resume: Discovered first Earth-sized planet in Goldilocks zone….

