Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 21, 2014

Sexual Predators in Chat Rooms

Filed under: Data,GraphLab,Graphs — Patrick Durusau @ 9:01 pm

Weird dataset: identifying sexual predators in chat rooms by Danny Bickson.

From the post:

To all of the bored data scientists who are looking for interesting demo. (Alternatively, to all the startups who want to do a fraud detection demo). I stumbled upon this weird dataset which was part of PAN 2012 conference: identifying sexual predators in chat rooms.

I wouldn’t say you have to be bored to check out this dataset.

At least it is a worthy cause.

For that matter, don’t you wonder why Atlanta, GA, for example, is a sex trafficking hub in the United States? Or rather, why hasn’t law enforcement been able to stop the trafficking?

The last time I went out of the country, you had to come back in one person at a time. So we have the location, control of the area, target groups for exploitation, … what am I missing here in terms of catching traffickers?

Sex traffickers don’t wear big orange badges saying “Sex Trafficker,” but is that really necessary?

Maybe law enforcement should make better use of the computing cycles wasted on chasing illusory terrorists and focus on real criminals coming in and out of the country at Hartsfield-Jackson Atlanta International Airport.

February 20, 2014

Free FORMOSAT-2 Satellite Imagery

Filed under: Data,Image Processing — Patrick Durusau @ 2:22 pm

Free FORMOSAT-2 Satellite Imagery

Proposals due by March 31, 2014.

From the post:

ISPRS WG VI/5 is delighted to announce the call for proposals for free FORMOSAT-2 satellite data. Sponsored by the National Space Organization, National Applied Research Laboratories (NARLabs-NSPO) and jointly supported by the Chinese Taipei Society of Photogrammetry and Remote Sensing and the Center for Space and Remote Sensing Research (CSRSR), National Central University (NCU) of Taiwan, this research announcement provides an opportunity for researchers to carry out advanced researches and applications in their fields of interest using archived and/or newly acquired FORMOSAT-2 satellite images.

FORMOSAT-2 has a unique daily-revisiting capability to acquire images at a nominal ground resolution of 2 meters (panchromatic) or 8 meters (multispectral). The images are suitable for different researches and applications, such as land-cover and environmental monitoring, agriculture and natural resources studies, oceanography and coastal zone researches, disaster investigation and mitigation support, and others. Basic characteristics of FORMOSAT-2 are listed in Section III of this document and detailed information about FORMOSAT-2 is available at
<http://www.nspo.org.tw>.

Interested individuals are invited to submit a proposal according to the guidelines listed below. All topics and fields of application are welcome, especially proposals aiming for addressing issues related to the Societal Beneficial Areas of GEO/GEOSS (Group on Earth Observations/Global Earth Observation System of Systems, Figure 1). Up to 10 proposals will be selected by a reviewing committee. Each selected proposal will be granted 10 archived images (subject to availability) and/or data acquisition requests (DAR) free of charge. Proposals that include members of ISPRS Student Consortium or other ISPRS affiliated personnels as principal investigator (PI) or coinvestigators (CI) will be given higher priorities, so be sure to indicate ISPRS affiliations in the cover sheet of the proposal.

Let’s see, 2 meters: that’s smaller than the average meth lab. Yes? I have read of trees dying around long-term meth labs; a die-off like that should span more than 2 meters. What other environmental clues point to the production of methamphetamine?

Has your locality thought about data crunching to supplement its traditional law enforcement efforts?

A better investment than small towns buying tanks.

I first saw this in a tweet by TH Schee.

February 16, 2014

Data as Magic?

Filed under: Data,Graphics — Patrick Durusau @ 7:52 pm

An example of why data will not end debate by Kaiser Fung.

From the post:

One oft-repeated “self-evident” tenet of Big Data is that data end all debate. Except if you have ever worked for a real company (excluding those ruled by autocrats), and put data on the table, you know that the data do not end anything.

Reader Ben M. sent me to this blog post by Benedict Evans, showing a confusing chart showing how Apple has “passed” Microsoft. Evans used to be a stock analyst before moving to Andreessen Horowitz, a VC (venture capital) business. He has over 25,000 followers on Twitter.
….

Evans responded to many of these comments by complaining that readers are not getting his message. That’s an accurate statement, and it has everything to do with the looseness of his data. This reminds me of Gelman’s statistical parable. The blogger here is not so much interested in how strong his evidence is but more interested in evangelizing the morale behind the story.

A highly entertaining post as always.

Gelman’s “statistical parable” describes stories that cite numbers which, if you stop to think about them, are quite unreasonable. Gelman’s example was a statistic attributing 1/4 of the deaths at a hospital to record-keeping errors. Probably not true.

The point being that people bolster a narrative with numbers in the interest of advancing the story, with little concern for the “accuracy” of the numbers.

Other examples include: RIAA numbers on musical piracy, software piracy, OMB budget numbers, TSA terrorist threat numbers, etc.

I put “accuracy” in quotes because recognizing a “statistical parable” depends on where you sit. If you are on the side with shaky numbers, the question of accuracy is an annoying detail. If you oppose the side with shaky numbers, it is evidence they can’t make a case without manufactured evidence.

I take Kaiser’s point to be that data is not magic. Even strong (in some traditional sense) data is not magic.

Data is at best one tool of persuasion that you can enlist for your cause, whatever that may be. Ignore other tools of persuasion at your own peril.

February 15, 2014

On Being a Data Skeptic

Filed under: Data,Skepticism — Patrick Durusau @ 11:00 am

On Being a Data Skeptic by Cathy O’Neil. (pdf)

From Skeptic, Not Cynic:

I’d like to set something straight right out of the gate. I’m not a data cynic, nor am I urging other people to be. Data is here, it’s growing, and it’s powerful. I’m not hiding behind the word “skeptic” the way climate change “skeptics” do, when they should call themselves deniers.

Instead, I urge the reader to cultivate their inner skeptic, which I define by the following characteristic behavior. A skeptic is someone who maintains a consistently inquisitive attitude toward facts, opinions, or (especially) beliefs stated as facts. A skeptic asks questions when confronted with a claim that has been taken for granted. That’s not to say a skeptic brow-beats someone for their beliefs, but rather that they set up reasonable experiments to test those beliefs. A really excellent skeptic puts the “science” into the term “data science.”

In this paper, I’ll make the case that the community of data practitioners needs more skepticism, or at least would benefit greatly from it, for the following reason: there’s a two-fold problem in this community. On the one hand, many of the people in it are overly enamored with data or data science tools. On the other hand, other people are overly pessimistic about those same tools.

I’m charging myself with making a case for data practitioners to engage in active, intelligent, and strategic data skepticism. I’m proposing a middle-of-the-road approach: don’t be blindly optimistic, don’t be blindly pessimistic. Most of all, don’t be awed. Realize there are nuanced considerations and plenty of context and that you don’t necessarily have to be a mathematician to understand the issues.
….

It’s a scant 26 pages, cover and all, but “On Being a Data Skeptic” is well worth your time.

I particularly liked Cathy’s coverage of issues such as “People Get Addicted to Metrics,” which ends with separate asides to “nerds” and “business people.” Different cultures have different ways of “hearing” the same content. Rather than trying to straddle those communities, Cathy gave them separate messages.

You will find her predator/prey model particularly interesting.

On the whole, I would say her predator/prey analysis should not be limited to modeling. See what you think.

February 12, 2014

Islamic Finance: A Quest for Publically Available Bank-level Data

Filed under: Data,Finance Services,Government,Government Data — Patrick Durusau @ 9:38 pm

Islamic Finance: A Quest for Publically Available Bank-level Data by Amin Mohseni-Cheraghlou.

From the post:

Attend a seminar or read a report on Islamic finance and chances are you will come across a figure between $1 trillion and $1.6 trillion, referring to the estimated size of the global Islamic assets. While these aggregate global figures are frequently mentioned, publically available bank-level data have been much harder to come by.

Considering the rapid growth of Islamic finance, its growing popularity in both Muslim and non-Muslim countries, and its emerging role in global financial industry, especially after the recent global financial crisis, it is imperative to have up-to-date and reliable bank-level data on Islamic financial institutions from around the globe.

To date, there is a surprising lack of publically available, consistent and up-to-date data on the size of Islamic assets on a bank-by-bank basis. In fairness, some subscription-based datasets, such Bureau Van Dijk’s Bankscope, do include annual financial data on some of the world’s leading Islamic financial institutions. Bank-level data are also compiled by The Banker’s Top Islamic Financial Institutions Report and Ernst & Young’s World Islamic Banking Competitiveness Report, but these are not publically available and require subscription premiums, making it difficult for many researchers and experts to access. As a result, data on Islamic financial institutions are associated with some level of opaqueness, creating obstacles and challenges for empirical research on Islamic finance.

The recent opening of the Global Center for Islamic Finance by World Bank Group President Jim Young Kim may lead to exciting venues and opportunities for standardization, data collection, and empirical research on Islamic finance. In the meantime, the Global Financial Development Report (GFDR) team at the World Bank has also started to take some initial steps towards this end.

I can think of two immediate benefits from publicly available data on Islamic financial institutions:

First, hopefully it will increase demands for meaningful transparency in Western financial institutions.

Second, it will blunt government hand waving and propaganda about the purposes of Islamic financial institutions, which, on a par with financial institutions everywhere, want to remain solvent, serve the needs of their customers, and play active roles in their communities. Nothing more sinister than that.

Perhaps the best way to vanquish suspicion is with transparency. Except for the fringe cases who treat a lack of evidence as proof of secret evildoing.

Sistine Chapel full 360°

Filed under: Data,History — Patrick Durusau @ 9:00 pm

Sistine Chapel full 360°

It’s not like being there, but then visitors can’t “zoom” in as you can with this display.

If you could capture one perspective, current or historical, for the Sistine Chapel, what would it be?

If you are ever in Rome, the hours in line, and the exhibits you will see along the way, are worth it to finish in the Sistine Chapel.

I first saw this in a tweet by Merete Sanderhoff.

February 10, 2014

Parallel Data Generation Framework

Filed under: Benchmarks,Data — Patrick Durusau @ 11:06 am

Parallel Data Generation Framework

From the webpage:

The Parallel Data Generation Framework (PDGF) is a generic data generator for database benchmarking. Its development started at the University of Passau at the group of Prof. Dr. Harald Kosch.

PDGF was designed to take advantage of today’s multi-core processors and large clusters of computers to generate large amounts of synthetic benchmark data very fast. PDGF uses a fully computational approach and is a pure Java implementation which makes it very portable.

I mention this to ask whether you are aware of methods for generating unstructured text with known characteristics, such as the number of entities and their representations in the data set.

A “natural” dataset, say blog posts or emails, etc., can be probed to determine its semantic characteristics but I am interested in generation of a dataset with known semantic characteristics.
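
Here is a minimal Python sketch of the kind of generator I have in mind. Every entity, alias, and filler template below is invented purely for illustration, but because you choose them up front, the resulting corpus has exactly the entity counts and representation mix you specified, along with the ground truth to check analysis tools against.

```python
import random

# Known entities and their surface representations (aliases).
# Both are invented purely for illustration.
ENTITIES = {
    "acme_corp": ["Acme Corp", "Acme Corporation", "ACME"],
    "jane_doe": ["Jane Doe", "J. Doe", "Ms. Doe"],
}

# Filler templates; {e} is replaced by one representation of an entity.
TEMPLATES = [
    "According to a recent filing, {e} announced quarterly results.",
    "Sources close to {e} declined to comment on the matter.",
    "The report was later attributed to {e} by several outlets.",
]

def generate_corpus(mentions_per_entity=50, seed=42):
    """Generate sentences with a known number of mentions per entity,
    keeping the ground truth (entity id and alias) for every sentence."""
    rng = random.Random(seed)
    records = []
    for entity_id, aliases in ENTITIES.items():
        for _ in range(mentions_per_entity):
            alias = rng.choice(aliases)
            sentence = rng.choice(TEMPLATES).format(e=alias)
            records.append({"text": sentence, "entity": entity_id, "alias": alias})
    rng.shuffle(records)
    return records

if __name__ == "__main__":
    corpus = generate_corpus()
    print(len(corpus), "sentences,", len(ENTITIES), "entities with known aliases")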

Thoughts?

I first saw this in a tweet by Stefano Bertolo.

February 7, 2014

Welsh Newspapers Online – 27 new publications

Filed under: Data,History,News — Patrick Durusau @ 4:29 pm

Welsh Newspapers Online – 27 new publications

From the post:

There is great excitement today as we release 27 publications (200,000 pages) from the Library’s rich collection on Welsh Newspapers Online.

Take a trip back in time from the comfort of your home or office and discover millions of freely available articles published before 1919.

The resource now allows you to search and read over 630,000 pages from almost 100 newspaper publications from the National Library’s collection, and this will grow to over 1 million pages as more publications are added during 2014. Among the latest titles are Y Negesydd, Caernarvon and Denbigh Herald, Glamorgan Gazette, Carmarthen Journal, Welshman, and Rhondda Leader, not forgetting Y Drych, the weekly newspaper for the Welsh diaspora in America.

The resource also includes some publications that were digitised for The Welsh Experience of World War One project.

Browse the resource and discover unique information on a variety of subjects, including family history, local history and much more that was once difficult to find unless the researcher was able to browse through years of heavy volumes.

The linguistic diversity of the WWW just took a step in the right direction thanks to the National Library of Wales.

Can a realization that recorded texts are semantically diverse (diachronically and synchronically) be far behind?

I cringe every time the U.S. Supreme Court treats historical language as transparent to a “plain reading.”

Granted, I have an agenda to advance by emphasizing the historical context of the language, just as they do with a facile reading devoid of historical context.

Still, I think my approach requires less suspension of disbelief than theirs.

Lessons From “Behind The Bloodshed”

Filed under: Data,Data Mining,Visualization — Patrick Durusau @ 12:22 pm

Lessons From “Behind The Bloodshed”

From the post:

Source has published a fantastic interview with the makers of Behind The Bloodshed, a visual narrative about mass killings produced by USA Today.

The entire interview with Anthony DeBarros is definitely worth a read but here are some highlights and commentary.

A synopsis of data issues in the production of “Behind The Bloodshed.”

Great visuals, as you would expect from USA Today.

A good illustration of simplifying a series of complex events for persuasive purposes.

That’s not a negative comment.

What other purpose would communication have if not to “persuade” others to act and/or believe as we wish?

I first saw this in a tweet by Bryan Connor.

Twitter Data Grants [Following 0 Followers 524,870 + 1]

Filed under: Data,Tweets — Patrick Durusau @ 9:43 am

Introducing Twitter Data Grants by Raffi Krikorian.

Deadline: March 15, 2014

From the post:

Today we’re introducing a pilot project we’re calling Twitter Data Grants, through which we’ll give a handful of research institutions access to our public and historical data.

With more than 500 million Tweets a day, Twitter has an expansive set of data from which we can glean insights and learn about a variety of topics, from health-related information such as when and where the flu may hit to global events like ringing in the new year. To date, it has been challenging for researchers outside the company who are tackling big questions to collaborate with us to access our public, historical data. Our Data Grants program aims to change that by connecting research institutions and academics with the data they need.

….

If you’d like to participate, submit a proposal here no later than March 15th. For this initial pilot, we’ll select a small number of proposals to receive free datasets. We can do this thanks to Gnip, one of our certified data reseller partners. They are working with us to give selected institutions free and easy access to Twitter datasets. In addition to the data, we will also be offering opportunities for the selected institutions to collaborate with Twitter engineers and researchers.

We encourage those of you at research institutions using Twitter data to send in your best proposals. To get updates and stay in touch with the program: visit research.twitter.com, make sure to follow @TwitterEng, or email data-grants@twitter.com with questions.

You may want to look at Twitter Engineering to see what has been of recent interest.

Tracking social media during the Arab Spring to separate journalists from participants could be interesting.

BTW, a factoid for today: @TwitterEng had 524,870 followers and 0 following when I first saw the grant page. Now they have 524,871 followers and 0 following. 😉

There’s another question: Who has the best following/follower ratio? Any patterns there?

I first saw this in a tweet by Gregory Piatetsky.

February 1, 2014

Academic Torrents!

Filed under: Data,Open Access,Open Data — Patrick Durusau @ 4:02 pm

Academic Torrents!

From the homepage:

Currently making 1.67TB of research data available.

Sharing data is hard. Emails have size limits, and setting up servers is too much work. We’ve designed a distributed system for sharing enormous datasets – for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds. Contact us at joecohen@cs.umb.edu.

Some data sets you have probably already seen, but perhaps several you have not, like the crater data set for Mars!

Enjoy!

I first saw this in a tweet by Tony Ojeda.

January 31, 2014

Open Science Leaps Forward! (Johnson & Johnson)

Filed under: Bioinformatics,Biomedical,Data,Medical Informatics,Open Data,Open Science — Patrick Durusau @ 11:15 am

In Stunning Win For Open Science, Johnson & Johnson Decides To Release Its Clinical Trial Data To Researchers by Matthew Herper.

From the post:

Drug companies tend to be secretive, to say the least, about studies of their medicines. For years, negative trials would not even be published. Except for the U.S. Food and Drug Administration, nobody got to look at the raw information behind those studies. The medical data behind important drugs, devices, and other products was kept shrouded.

Today, Johnson & Johnson is taking a major step toward changing that, not only for drugs like the blood thinner Xarelto or prostate cancer pill Zytiga but also for the artificial hips and knees made for its orthopedics division or even consumer products. “You want to know about Listerine trials? They’ll have it,” says Harlan Krumholz of Yale University, who is overseeing the group that will release the data to researchers.

….

Here’s how the process will work: J&J has enlisted The Yale School of Medicine’s Open Data Access Project (YODA) to review requests from physicians to obtain data from J&J products. Initially, this will only include products from the drug division, but it will expand to include devices and consumer products. If YODA approves a request, raw, anonymized data will be provided to the physician. That includes not just the results of a study, but the results collected for each patient who volunteered for it with identifying information removed. That will allow researchers to re-analyze or combine that data in ways that would not have been previously possible.

….

Scientists can make a request for data on J&J drugs by going to www.clinicaltrialstudytransparency.com.

The ability to “…re-analyze or combine that data in ways that would not have been previously possible…” is the public benefit of Johnson & Johnson’s sharing of data.

With any luck, this will be the start of a general trend among drug companies.

Mappings of the semantics of such data sets should be contributed back to the Yale School of Medicine’s Open Data Access Project (YODA), to further enhance re-use of these data sets.

January 28, 2014

Open Microscopy Environment

Filed under: Biology,Data,Image Processing,Microscopy — Patrick Durusau @ 5:23 pm

Open Microscopy Environment

From the webpage:

OME develops open-source software and data format standards for the storage and manipulation of biological microscopy data. It is a joint project between universities, research establishments, industry and the software development community.

Where you will find:

OMERO: OMERO is client-server software for visualization, management and analysis of biological microscope images.

Bio-Formats: Bio-Formats is a Java library for reading and writing biological image files. It can be used as an ImageJ plugin, Matlab toolbox, or in your own software.

OME-TIFF Format: A TIFF-based image format that includes the OME-XML standard.

OME Data Model: A common specification for storing details of microscope set-up and image acquisition.

More data formats for sharing of information. And for integration with other data.
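
To make the integration point concrete, here is a minimal Python sketch that pulls the embedded OME-XML out of an OME-TIFF file alongside the pixel data. It assumes the tifffile library, and the file name is a placeholder; I have not verified it against OMERO or Bio-Formats output.

```python
import xml.etree.ElementTree as ET

import tifffile  # pip install tifffile

def read_ome_metadata(path):
    """Return (pixel array shape, OME-XML string) for an OME-TIFF file."""
    with tifffile.TiffFile(path) as tif:
        if not tif.is_ome:
            raise ValueError(f"{path} does not carry OME-XML metadata")
        ome_xml = tif.ome_metadata   # the embedded OME-XML document
        data = tif.asarray()         # image pixels as a numpy array
    return data.shape, ome_xml

if __name__ == "__main__":
    shape, ome_xml = read_ome_metadata("example.ome.tif")  # placeholder path
    root = ET.fromstring(ome_xml)
    print("pixel array shape:", shape)
    print("OME root element:", root.tag)
```

The OME-XML carries the microscope set-up and acquisition details from the OME Data Model, which is exactly the semantics you would want to map when integrating with other data.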

Not only does data continue to expand but so does the semantics associated with it.

We have “big data” tools for the data per se. Have you seen any tools capable of managing the diverse semantics of “big data?”

Me neither.

I first saw this in a tweet by Paul Groth.

January 24, 2014

Biodiversity Information Serving Our Nation (BISON)

Filed under: Biodiversity,Data — Patrick Durusau @ 6:41 pm

Biodiversity Information Serving Our Nation (BISON)

From the about tab:

Researchers collect species occurrence data, records of an organism at a particular time in a particular place, as a primary or ancillary function of many biological field investigations. Presently, these data reside in numerous distributed systems and formats (including publications) and are consequently not being used to their full potential. As a step toward addressing this challenge, the Core Science Analytics and Synthesis (CSAS) program of the US Geological Survey (USGS) is developing Biodiversity Information Serving Our Nation (BISON), an integrated and permanent resource for biological occurrence data from the United States.

BISON will leverage the accumulated human and infrastructural resources of the long-term USGS investment in research and information management and delivery.

If that sounds impressive, consider the BISON statistics as of December 31, 2013:

Total Records: 126,357,352
Georeferenced: 120,394,780
Taxa: 315,663
Data Providers: 307

Searches are by scientific or common name, and ITIS-enabled searching is on by default. Just in case you are curious:

BISON has integrated taxonomic information provided by the Integrated Taxonomic Information System (ITIS) allowing advanced search capability in BISON. With the integration, BISON users have the ability to search more completely across species records. Searches can now include all synonyms and can be conducted hierarchically by genera and higher taxa levels using ITIS enabled queries. Binding taxonomic structure to search terms will make possible broad searches on species groups such as Salmonidae (salmon, trout, char) or Passeriformes (cardinals, tanagers, etc) as well as on all of the many synonyms and included taxa (there are 60 for Poa pratensis – Kentucky Bluegrass – alone).

Clue: With sixty (60) names, the breakfast of champions since 1875.

I wonder if Watson would have answered: “What is Kentucky Bluegrass?” on Jeopardy. The first Kentucky Derby was run on May 17, 1875.

BISON also offers developer tools and BISON Web Services.
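
If you want to poke at the web services programmatically, something like the following should be close. Fair warning: the endpoint URL, parameter names, and response fields are my assumptions for illustration, not taken from the BISON documentation, so check the developer pages before relying on them.

```python
import requests  # pip install requests

# NOTE: the endpoint URL, parameter names, and response fields below are
# assumptions for illustration only; check the BISON developer documentation
# for the actual web service interface.
BISON_SEARCH = "https://bison.usgs.gov/api/search.json"

def occurrence_summary(scientific_name):
    """Query the (assumed) BISON search service for a species name."""
    params = {"species": scientific_name, "type": "scientific_name"}
    resp = requests.get(BISON_SEARCH, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    payload = occurrence_summary("Poa pratensis")  # Kentucky Bluegrass
    # Inspect the payload keys to find record counts, providers, etc.
    print(sorted(payload.keys()))
```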

January 21, 2014

Wellcome Images

Filed under: Data,Data Integration,Library,Museums — Patrick Durusau @ 5:47 pm

Thousands of years of visual culture made free through Wellcome Images

From the post:

We are delighted to announce that over 100,000 high resolution images including manuscripts, paintings, etchings, early photography and advertisements are now freely available through Wellcome Images.

Drawn from our vast historical holdings, the images are being released under the Creative Commons Attribution (CC-BY) licence.

This means that they can be used for commercial or personal purposes, with an acknowledgement of the original source (Wellcome Library, London). All of the images from our historical collections can be used free of charge.

The images can be downloaded in high-resolution directly from the Wellcome Images website for users to freely copy, distribute, edit, manipulate, and build upon as you wish, for personal or commercial use. The images range from ancient medical manuscripts to etchings by artists such as Vincent Van Gogh and Francisco Goya.

The earliest item is an Egyptian prescription on papyrus, and treasures include exquisite medieval illuminated manuscripts and anatomical drawings, from delicate 16th century fugitive sheets, whose hinged paper flaps reveal hidden viscera to Paolo Mascagni’s vibrantly coloured etching of an ‘exploded’ torso.

Other treasures include a beautiful Persian horoscope for the 15th-century prince Iskandar, sharply sketched satires by Rowlandson, Gillray and Cruikshank, as well as photography from Eadweard Muybridge’s studies of motion. John Thomson’s remarkable nineteenth century portraits from his travels in China can be downloaded, as well a newly added series of photographs of hysteric and epileptic patients at the famous Salpêtrière Hospital

Semantics, or should I say semantic confusion, is never far away. While viewing an image of Gladstone as Scrooge:

[Image: Gladstone as Scrooge]

When “search by keyword” offered “colonies,” I assumed it meant the colonies of the UK at the time.

Imagine my surprise when among other images, Wellcome Images offered:

[Image: petri dish]

The keyword search had found fourteen petri dish images, three images of Batavia, seven maps of India (salt, leprosy), one half-naked woman being held down, and the Gladstone image from earlier.

About what one expects from search these days but we could do better. Much better.

I first saw this in a tweet by Neil Saunders.

January 20, 2014

Data with a Soul…

Filed under: Data,Social Networks,Social Sciences — Patrick Durusau @ 5:33 pm

Data with a Soul and a Few More Lessons I Have Learned About Data by Enrico Bertini.

From the post:

I don’t know if this is true for you but I certainly used to take data for granted. Data are data, who cares where they come from. Who cares how they are generated. Who cares what they really mean. I’ll take these bits of digital information and transform them into something else (a visualization) using my black magic and show it to the world.

I no longer see it this way. Not after attending a whole three days event called the Aid Data Convening; a conference organized by the Aid Data Consortium (ARC) to talk exclusively about data. Not just data in general but a single data set: the Aid Data, a curated database of more than a million records collecting information about foreign aid.

The database keeps track of financial disbursements made from donor countries (and international organizations) to recipient countries for development purposes: health and education, disasters and financial crises, climate change, etc. It spans a time range between 1945 up to these days and includes hundreds of countries and international organizations.

Aid Data users are political scientists, economists, social scientists of many sorts, all devoted to a single purpose: understand aid. Is aid effective? Is aid allocated efficiently? Does aid go where it is more needed? Is aid influenced by politics (the answer is of course yes)? Does aid have undesired consequences? Etc.

Isn’t that incredibly fascinating? Here is what I have learned during these few days I have spent talking with these nice people.
….

This fits quite well with the resources I mention in Lap Dancing with Big Data.

Making the Aid Data your own will require time and personal effort to understand and master it.

By that point, however, you may care about the data and the people it represents. Just be forewarned.

Lap Dancing With Big Data

Filed under: BigData,Data,Data Analysis — Patrick Durusau @ 4:27 pm

Real scientists make their own data by Sean J. Taylor.

From the first list in the post:

4. If you are the creator of your data set, then you are likely to have a great understanding the data generating process. Blindly downloading someone’s CSV file means you are much more likely to make assumptions which do not hold in the data.

A good point among many good points.

Sean provides guidance on how you can collect data, not just have it dumped on you.

Or as Kaiser Fung says in the post that led me to Sean’s:

In theory, the availability of data should improve our ability to measure performance. In reality, the measurement revolution has not taken place. It turns out that measuring performance requires careful design and deliberate collection of the right types of data — while Big Data is the processing and analysis of whatever data drops onto our laps. Ergo, we are far from fulfilling the promise.

So, do you make your own data?

Or do you lap dance with data?

I know which one I aspire to.

You?

January 18, 2014

How to Query the StackExchange Databases

Filed under: Data,Subject Identity,Topic Maps — Patrick Durusau @ 8:29 pm

How to Query the StackExchange Databases by Brent Ozar.

From the post:

During next week’s Watch Brent Tune Queries webcast, I’m using my favorite demo database: Stack Overflow. The Stack Exchange folks are kind enough to make all of their data available via BitTorrent for Creative Commons usage as long as you properly attribute the source.

There’s two ways you can get started writing queries against Stack’s databases – the easy way and the hard way.
….

I’m sure you have never found duplicate questions or answers on StackExchange.

But just in case such a thing existed, detecting and merging duplicates from StackExchange would be a good exercise in data analysis, subject identification, etc.

😉
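
If you want to try that exercise, here is a minimal sketch of a first pass: normalize question titles and flag pairs above a similarity threshold as merge candidates. The input format is an assumption, just (id, title) pairs such as you might pull from the Posts table of a data dump; a real pass would also compare bodies, tags, and answers.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(title):
    """Crude normalization: lowercase and collapse whitespace."""
    return " ".join(title.lower().split())

def duplicate_candidates(questions, threshold=0.9):
    """Yield (id_a, id_b, score) for question titles that look like duplicates.

    `questions` is an iterable of (id, title) pairs, e.g. questions pulled
    from the Posts table of a Stack Exchange dump. Pairwise comparison is
    O(n^2), fine for a sketch but not for the full database.
    """
    normalized = [(qid, normalize(title)) for qid, title in questions]
    for (id_a, t_a), (id_b, t_b) in combinations(normalized, 2):
        score = SequenceMatcher(None, t_a, t_b).ratio()
        if score >= threshold:
            yield id_a, id_b, score

if __name__ == "__main__":
    sample = [
        (1, "How do I merge duplicate topics?"),
        (2, "How to merge duplicate topics"),
        (3, "What is a topic map?"),
    ]
    for a, b, score in duplicate_candidates(sample):
        print(f"questions {a} and {b} look alike (score {score:.2f})")
```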

BTW, Brent’s webinar is 21 January 2014, or next Tuesday (as of this post).

Enjoy!

January 13, 2014

The myth of the aimless data explorer

Filed under: Bias,Data — Patrick Durusau @ 7:14 pm

The myth of the aimless data explorer by Enrico Bertini.

From the post:

There is a sentence I have heard or read multiple times in my journey into (academic) visualization: visualization is a tool people use when they don’t know what question to ask to their data.

I have always taken this sentence as a given and accepted it as it is. Good, I thought, we have a tool to help people come up with questions when they have no idea what to do with their data. Isn’t that great? It sounded right or at least cool.

But as soon as I started working on more applied projects, with real people, real problems, real data they care about, I discovered this all excitement for data exploration is just not there. People working with data are not excited about “playing” with data, they are excited about solving problems. Real problems. And real problems have questions attached, not just curiosity. There’s simply nothing like undirected data exploration in the real world.

I think Enrico misses the reason why people use/like the phrase: visualization is a tool people use when they don’t know what question to ask to their data.

Visualization privileges the “data” as the source of whatever result is displayed by the visualization.

It’s not me! That’s what the data says!

Hardly. Someone collected the data. Not at random, stuffing whatever bits came along in a bag. Someone cleaned the data with some notion of what “clean” meant. Someone chose the data that is now being called upon for a visualization. And those are clumsy summaries that collapse many distinct steps into only three.

To put it another way, data never exists without choices being made. And it is the sum of those choices that influences the visualizations that are even possible from some data set.

The short term for what Enrico overlooks is bias.

I would recast his title to read: The myth of the objective data explorer.

Having said that, I don’t mean that all bias is bad.

If I were collecting data on Ancient Near Eastern (ANE) languages, I would of necessity be excluding the language traditions of the entire Western Hemisphere. It could even be that data from the native cultures of the Western Hemisphere will be lost while I am preserving data from the ANE.

So we have bias and a bad outcome, from someone’s point of view because of that bias. Was that a bad thing? I would argue not.

It isn’t even possible to collect all the potential data that could be collected. We all make value judgments about the data we choose to collect and what we choose to ignore.

Rather than pretending that we possess objectivity in any meaningful sense, we are better off to state our biases to the extent we know them. At least others will be forewarned that we are just like them.

January 12, 2014

Porn capital of the porn nation

Filed under: Data,Porn,R — Patrick Durusau @ 9:09 pm

Porn capital of the porn nation by Gianluca Baio.

From the post:

The other day I was having a quick look to the newspapers and I stumbled on this article. Apparently, Pornhub (a website whose mission should be pretty clear) have analysed the data on their customers and found out that the town of Ware (Hertfordshire) has more demand for online porn than any other UK town. According to PornHub, a Ware resident will last 10 minutes 37 seconds (637 seconds) on its adult website, compared with the world average time of 8 minutes 56 seconds (just 536 seconds).

Gianluca walks you through data available from the Guardian with R, so you can reach your own conclusions.

I need to install Tableau Public before I can download the data set. Will update this post tomorrow.

Enjoy!

Update:

I installed Tableau Public on a Windows XP VM and then downloaded the data file. It turns out the public version of Tableau has no open-local-file option, but if you double-click on the file, it will load and open.

Amusing but limited data set. Top five searches, etc.

The Porn Hub Stats page has other reports from the Porn Hub stats crew.

No data downloads for stats, tags, etc., although I did post a message to them asking about that sort of data.

I have just started playing with it, but Tableau appears to be a really nice data visualization tool.

Musopen

Filed under: Data,Music — Patrick Durusau @ 8:53 pm

Musopen

From the webpage:

Musopen (www.musopen.org) is a 501(c)(3) non-profit focused on improving access and exposure to music by creating free resources and educational materials. We provide recordings, sheet music, and textbooks to the public for free, without copyright restrictions. Put simply, our mission is to set music free.

The New Grove Dictionary of Music and Musicians it’s not, but losing our musical heritage did not happen overnight.

Nor will winning it back.

Contribute to and support Musopen.

Everpix-Intelligence [Failed Start-up Data Set]

Filed under: Data,Dataset — Patrick Durusau @ 11:09 am

Everpix-Intelligence

From the webpage:

About Everpix

Everpix was started in 2011 with the goal of solving the Photo Mess, an increasingly real pain point in people’s life photo collections, through ambitious engineering and user experience. Our startup was angel and VC funded with $2.3M raised over its lifetime.

After 2 years of research and product development, and although having a very enthousiastic user base of early adopters combined with strong PR momentum, we didn’t succeed in raising our Series A in the highly competitive VC funding market. Unable to continue operating our business, we had to announce our upcoming shutdown on November 5th, 2013.

High-Level Metrics

At the time of its shutdown announcement, the Everpix platform had 50,000 signed up users (including 7,000 subscribers) with 400 millions photos imported, while generating subscription sales of $40,000 / month during the last 3 months (i.e. enough money to cover variable costs, but not the fixed costs of the business).

Complete Dataset

Building a startup is about taking on a challenge and working countless hours on solving it. Most startups do not make it but rarely do they reveal the story behind, leaving their users often frustrated. Because we wanted the Everpix community to understand some of the dynamics in the startup world and why we had to come to such a painful ending, we worked closely with a reporter from The Verge who chronicled our last couple weeks. The resulting article generated extensive coverage and also some healthy discussions around some of our high-level metrics and financials. There was a lot more internal data we wanted to share but it wasn’t the right time or place.

With the Everpix shutdown behind us, we had the chance to put together a significant dataset covering our business from fundraising to metrics. We hope this rare and uncensored inside look at the internals of a startup will benefit the startup community.

Here are some example of common startup questions this dataset helps answering:

  • What are investment terms for consecutive convertible notes and an equity seed round? What does the end cap table look like? (see here)
  • How does a Silicon Valley startup spend its raised money during 2 years? (see here)
  • What does a VC pitch deck look like? (see here)
  • What kinds of reasons do VCs give when they pass? (see here)
  • What are the open rate and click rate of transactional and marketing emails? (see here)
  • What web traffic do various news websites generate? (see here and here)
  • What are the conversion rate from product landing page to sign up for new visitors? (see here)
  • How fast do people purchase a subscription after signing up to a freemium service? (see here and here)
  • Which countries have higher suscription rates? (see here and here)

The dataset is organized as follow:

Every IT startup, but especially data-oriented startups, should work with this data set before launch.

I thought the comments from VCs were particularly interesting.

I would summarize those comments as:

  1. There is a problem.
  2. You have a great idea to solve the problem.
  3. Will consumers pay you to solve the problem?

What evidence do you have on #3?

Bearing in mind that “should,” “ought to,” “the value is obvious,” etc., are wishes, not evidence.

I first saw this in a tweet by Emil Eifrem.

January 11, 2014

Winter 2013 Crawl Data Now Available

Filed under: Common Crawl,Data,Nutch — Patrick Durusau @ 7:42 pm

Winter 2013 Crawl Data Now Available by Lisa Green.

From the post:

The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013 (see previous blog post for more detail on that dataset). The new dataset was collected at the end of 2013, contains approximately 2.3 billion webpages and is 148TB in size. The new data is located in the aws-publicdatasets at /common-crawl/crawl-data/CC-MAIN-2013-48/

In 2013, we made changes to our crawling and post-processing systems. As detailed in the previous blog post, we switched file formats to the international standard WARC and WAT files. We also began using Apache Nutch to crawl – stay tuned for an upcoming blog post on our use of Nutch. The new crawling method relies heavily on the generous data donations from blekko and we are extremely grateful for blekko’s ongoing support!

In 2014 we plan to crawl much more frequently and publish fresh datasets at least once a month.

Data to play with now and the promise of more to come! Can’t argue with that!
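
Once you have one of the WARC files from /common-crawl/crawl-data/CC-MAIN-2013-48/ on local disk, a minimal Python sketch for walking its records might look like this (using the warcio library; the file name is a placeholder).

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

def iter_pages(warc_path, limit=5):
    """Print the target URI and body size of the first few response records."""
    with open(warc_path, "rb") as stream:
        seen = 0
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request and metadata records
            uri = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(uri, len(body), "bytes")
            seen += 1
            if seen >= limit:
                break

if __name__ == "__main__":
    # Placeholder name: substitute a file downloaded from the CC-MAIN-2013-48 crawl.
    iter_pages("CC-MAIN-2013-48-segment.warc.gz")
```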

Learning more about Common Crawl’s use of Nutch will be fun as well.

January 10, 2014

…Customizable Test Data with Python

Filed under: Data,Python — Patrick Durusau @ 5:15 pm

A Tool to Generate Customizable Test Data with Python by Alec Noller.

From the post:

Sometimes you need a dataset to run some tests – just a bunch of data, anything – and it can be unexpectedly difficult to find something that works. There are some useful and readily-available options out there; for example, Matthew Dubins has worked with the Enron email dataset and a complete list of 9/11 victims.

However, if you have more specific needs, particularly when it comes to format and fitting within the structure of a database, and you want to customize your dataset to test one thing or another in particular, take a look at this Python package called python-testdata used to generate customizable test data. It can be set up to generate names in various forms, companies, addresses, emails, and more. The Github also includes some help to get started, as well as examples for use cases.

I hesitated when I first saw this given the overabundance of free data.

But then with “free” data, if it is large enough, you will have to rely on sampling to gauge the performance of software.

Introducing the hazards and dangers of strange data may not be acceptable in all cases.
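
I have not verified python-testdata’s own API, so the sketch below uses the Faker library instead to show the same idea: repeatable, structured test records with names, companies, addresses, and emails that fit a database schema.

```python
from faker import Faker  # pip install Faker

def make_rows(n=10, seed=1234):
    """Generate n synthetic customer rows with a fixed seed for repeatability."""
    fake = Faker()
    Faker.seed(seed)
    rows = []
    for i in range(n):
        rows.append({
            "id": i + 1,
            "name": fake.name(),
            "company": fake.company(),
            "address": fake.address().replace("\n", ", "),
            "email": fake.email(),
        })
    return rows

if __name__ == "__main__":
    for row in make_rows(3):
        print(row)
```

Fix the seed and you get the same rows every run, which is exactly the controlled behavior you want when sampling a large “free” dataset would introduce the hazards mentioned above.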

January 9, 2014

The Rain Project:…

Filed under: Climate Data,Data,Weather Data — Patrick Durusau @ 7:30 pm

The Rain Project: An R-based Open Source Analysis of Publicly Available Rainfall Data by Gopi Goteti.

From the post:

Rainfall data used by researchers in academia and industry does not always come in the same format. Data is often in atypical formats and in extremely large number of files and there is not always guidance on how to obtain, process and visualize the data. This project attempts to resolve this issue by serving as a hub for the processing of such publicly available rainfall data using R.

The goal of this project is to reformat rainfall data from their native format to a consistent format, suitable for use in data analysis. Within this project site, each dataset is intended to have its own wiki. Eventually, an R package would be developed for each data source.

Currently R code is available to process data from three sources – Climate Prediction Center (global coverage), US Historical Climatology Network (USA coverage) and APHRODITE (Asia/Eurasia and Middle East).

The project home page is here – http://rationshop.github.io/rain_r/

Links to the original sources:

Climate Prediction Center

US Historical Climatology Network

APHRODITE

There are five (5) other sources listed at the project home page “to be included in the future.”

All of these datasets were “transparent” to someone, once upon a time.

Restoring them to transparency is a good deed.

Preventing datasets from going dark is an even better one.

January 7, 2014

Small Crawl

Filed under: Common Crawl,Data,Webcrawler,WWW — Patrick Durusau @ 7:40 pm

meanpath Jan 2014 Torrent – 1.6TB of crawl data from 115m websites

From the post:

October 2012 was the official kick off date for development of meanpath – our source code search engine. Our goal was to crawl as much of the web as we could using mostly open source software and a decent (although not Google level) financial investment. Outside of many substantial technical challenges, we also needed to acquire a sizeable list of seed domains as the starting block for our crawler. Enter Common Crawl which is an open crawl of the web that can be accessed and analysed by everyone. Of specific interest to us was the Common Crawl URL Index which we combined with raw domain zone files and domains from the Internet Census 2012 to create our master domain list.

We are firm supporters of open access to information which is why we have chosen to release a free crawl of over 115 million sites. This index contains only the front page HTML, robots.txt, favicons, and server headers of every crawlable .com, .net, .org, .biz, .info, .us, .mobi, and .xxx that were in the 2nd of January 2014 zone file. It does not execute or follow JavaScript or CSS so is not 100% equivalent to what you see when you click on view source in your browser. The crawl itself started at 2:00am UTC 4th of January 2014 and finished the same day.

Get Started:
You can access the meanpath January 2014 Front Page Index in two ways:

  1. Bittorrent – We have set up a number of seeds that you can download from using this descriptor. Please seed if you can afford the bandwidth and make sure you have 1.6TB of disk space free if you plan on downloading the whole crawl.
  2. Web front end – If you are not interested in grappling with the raw crawl files you can use our web front end to do some sample searches.

Data Set Statistics:

  1. 149,369,860 seed domains. We started our crawl with a full zone file list of all domains in the .com (112,117,307), .net (15,226,877), .org (10,396,351), .info (5,884,505), .us (1,804,653), .biz (2,630,676), .mobi (1,197,682) and .xxx (111,809) top level domains (TLD) for a total of 149,369,860 domains. We have a much larger set of domains that cover all TLDs but very few allow you to download a zone file from the registrar so we cannot guarantee 100% coverage. For statistical purposes having a defined 100% starting point is necessary.
  2. 115,642,924 successfully crawled domains. Of the 149,369,860 domains only 115,642,924 were able to be crawled which is a coverage rate of 77.42%
  3. 476 minutes of crawling. It took us a total of 476 minutes to complete the crawl which was done in 5 passes. If a domain could not be crawled in the first pass we tried 4 more passes before giving up (those excluded by robots.txt are not retried). The most common reason domains are not able to be crawled is a lack of any valid A record for domain.com or www.domain.com
  4. 1,500GB of uncompressed data. This has been compressed down to 352.40gb using gzip for ease of download.

I just scanned the Net for 2TB hard drives and the average runs between $80 and $100. There doesn’t seem to be much difference between internal and external.

The only issue I foresee is that some ISPs limit downloads. You can always tunnel to another box using SSH but that requires enough storage on the other box as well.

Be sure to check out meanpath’s search capabilities.

Perhaps the day of boutique search engines is getting closer!

January 4, 2014

…The re3data.org Registry

Filed under: Data,Data Repositories — Patrick Durusau @ 5:15 pm

Making Research Data Repositories Visible: The re3data.org Registry by Heinz Pampel, et al.

Abstract:

Researchers require infrastructures that ensure a maximum of accessibility, stability and reliability to facilitate working with and sharing of research data. Such infrastructures are being increasingly summarized under the term Research Data Repositories (RDR). The project re3data.org–Registry of Research Data Repositories–has begun to index research data repositories in 2012 and offers researchers, funding organizations, libraries and publishers an overview of the heterogeneous research data repository landscape. In July 2013 re3data.org lists 400 research data repositories and counting. 288 of these are described in detail using the re3data.org vocabulary. Information icons help researchers to easily identify an adequate repository for the storage and reuse of their data. This article describes the heterogeneous RDR landscape and presents a typology of institutional, disciplinary, multidisciplinary and project-specific RDR. Further the article outlines the features of re3data.org, and shows how this registry helps to identify appropriate repositories for storage and search of research data.

A great summary of progress so far but pay close attention to:

In the following, the term research data is defined as digital data being a (descriptive) part or the result of a research process. This process covers all stages of research, ranging from research data generation, which may be in an experiment in the sciences, an empirical study in the social sciences or observations of cultural phenomena, to the publication of research results. Digital research data occur in different data types, levels of aggregation and data formats, informed by the research disciplines and their methods. With regards to the purpose of access for use and re-use of research data, digital research data are of no value without their metadata and proper documentation describing their context and the tools used to create, store, adapt, and analyze them [7]. (emphasis added)

If you think about that for a moment, you will realize the same requirement applies to the “metadata and proper documentation …. and the tools….” themselves. The need for explanation does not go away because something carries the label “metadata” or “documentation.”

Not that we can ever avoid semantic opaqueness, but depending on the value of the data, we can push it further away in some cases than in others.

An article that will repay a close reading.

I first saw this in a tweet by Stuart Buck.

January 3, 2014

Data Without Meaning? [Dark Data]

Filed under: Data,Data Analysis,Data Mining,Data Quality,Data Silos — Patrick Durusau @ 5:47 pm

I was reading IDC: Tons of Customer Data Going to Waste by Beth Schultz when I saw:

As much as companies understand the need for data and analytics and are evolving their relationships with both, they’re really not moving quickly enough, Schaub suggested during an IDC webinar earlier this week about the firm’s top 10 predictions for CMOs in 2014. “The aspiration is know that customer, and know what the customer wants at every single touch point. This is going to be impossible in today’s siloed, channel orientation.”

Companies must use analytics to help take today’s multichannel reality and recreate “the intimacy of the corner store,” she added.

Yes, great idea. But as IDC pointed out in the prediction I found most disturbing — especially with how much we hear about customer analytics — gobs of data go unused. In 2014, IDC predicted, “80% of customer data will be wasted due to immature enterprise data ‘value chains.’ ” That has to set CMOs to shivering, and certainly IDC found it surprising, according to Schaub.

Neither is all that surprising: not the 80%, and not the cause, “immature enterprise data ‘value chains.’”

What did surprise me was:

IDC’s data group researchers say that some 80% of data collected has no meaning whatsoever, Schaub said.

I’m willing to bet the wasted 80% of consumer data and the “no meaning” 80% of consumer data are the same 80%.

Think about it.

If your information chain isn’t associating meaning with the data you collect, the data may as well be streaming to /dev/null.

The data isn’t without meaning, you just failed to capture it. Not the same thing as having “no meaning.”

Failing to capture meaning along with data is one way to produce what I call “dark data.”

I first saw this in a tweet by Gregory Piatetsky.

