Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 15, 2014

Data-Driven Discovery Initiative

Filed under: BigData,Data Science,Funding — Patrick Durusau @ 10:03 am

Data-Driven Discovery Initiative

Pre-Applications Due February 24, 2014 by 5 pm Pacific Time.

15 Awards at $1,500,000 each, at $200K-$300K/year for five years.

From the post:

Our Data-Driven Discovery Initiative seeks to advance the people and practices of data-intensive science, to take advantage of the increasing volume, velocity, and variety of scientific data to make new discoveries. Within this initiative, we’re supporting data-driven discovery investigators – individuals who exemplify multidisciplinary, data-driven science, coalescing natural sciences with methods from statistics and computer science.

These innovators are striking out in new directions and are willing to take risks with the potential of huge payoffs in some aspect of data-intensive science. Successful applicants must make a strong case for developments in the natural sciences (biology, physics, astronomy, etc.) or science enabling methodologies (statistics, machine learning, scalable algorithms, etc.), and applicants that credibly combine the two are especially encouraged. Note that the Science Program does not fund disease targeted research.

It is anticipated that the DDD initiative will make about 15 awards at ~$1,500,000 each, at $200K-$300K/year for five years.

Pre-applications are due Monday, February 24, 2014 by 5 pm Pacific Time. To begin the pre-application process, click the “Apply Here” button above. We expect to extend invitations for full applications in April 2014. Full applications will be due five weeks after the invitation is sent, currently anticipated for mid-May 2014.

Apply Here

If you are interested in leveraging topic maps in your application, give me a call!

As far as I know, topic maps remain the only technology that documents the basis for merging distinct representations of the same subject.

Mappings, such as you find in Talend and other enterprise data management technologies, are great, so long as you don’t care why a particular mapping was done.

And in many cases, it may not matter. When you export a one-time mailing list for a media campaign, the list is discarded after use, so who cares?

In other cases, where labor intensive work is required to discover the “why” of a prior mapping, documenting that “why” would be useful.

Topic maps can document as much or as little of the semantics of your data and data processing stack as you desire. Topic maps can’t make legacy data and data semantic issues go away, but they can make those issues manageable.
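To make that concrete, here is a toy sketch in Python (my illustration, not actual topic map machinery; the records and the merge rule are hypothetical) of a merge that records its own basis:

```python
def merge(rec_a, rec_b, basis):
    """Merge two representations of the same subject, keeping the 'why.'

    A toy sketch, not TMDM software: the merged record carries a
    provenance note documenting the basis for the merge.
    """
    merged = {**rec_a, **rec_b}
    merged.setdefault("_merge_basis", []).append(basis)
    return merged

# Two representations of the same author (hypothetical records).
rec_a = {"name": "J. Smith", "orcid": "0000-0002-1825-0097"}
rec_b = {"name": "Jane Smith", "orcid": "0000-0002-1825-0097",
         "email": "jsmith@example.org"}
combined = merge(rec_a, rec_b, basis="identical ORCID on both records")
```

Six months later, the _merge_basis entry answers the “why was this mapping done?” question that a bare mapping leaves open.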

February 12, 2014

Specializations On Coursera

Filed under: CS Lectures,Data Science — Patrick Durusau @ 4:31 pm

Specializations On Coursera

Coursera is offering sequences of courses that result in certificates in particular areas.

For example, Johns Hopkins is offering a certificate in Data Science: nine courses at $49.00 each, or $490 for a specialization certificate.

I first saw this in a post by Stephen Turner, Coursera Specializations: Data Science, Systems Biology, Python Programming.

January 17, 2014

Data-Driven Discovery Initiative

Filed under: BigData,Data Science — Patrick Durusau @ 4:12 pm

Data-Driven Discovery Initiative

Pre-applications due: Monday, February 24, 2014 by 5 pm Pacific Time

From the webpage:

Our Data-Driven Discovery Initiative seeks to advance the people and practices of data-intensive science, to take advantage of the increasing volume, velocity, and variety of scientific data to make new discoveries. Within this initiative, we’re supporting data-driven discovery investigators – individuals who exemplify multidisciplinary, data-driven science, coalescing natural sciences with methods from statistics and computer science.

These innovators are striking out in new directions and are willing to take risks with the potential of huge payoffs in some aspect of data-intensive science. Successful applicants must make a strong case for developments in the natural sciences (biology, physics, astronomy, etc.) or science enabling methodologies (statistics, machine learning, scalable algorithms, etc.), and applicants that credibly combine the two are especially encouraged. Note that the Science Program does not fund disease targeted research.

It is anticipated that the DDD initiative will make about 15 awards at ~$1,500,000 each, at $200K-$300K/year for five years.

Be aware, you must be an employee of a PhD-granting institution or a private research institute in the United States to apply.

January 2, 2014

Data Science Apprenticeship

Filed under: Data Science — Patrick Durusau @ 3:06 pm

Data Science Apprenticeship by Vincent Granville.

The status of this program is as follows:

Stage 1 (Available now): DIY (do-it-yourself) for self-learners: material is available for free throughout DSC, including data sets and projects to work on. No registration required, get started here.

Stage 2 (April 2014): Participants will purchase my Wiley book (DSC members get a discount) as well as our data science cheat sheet to get jump started.

Stage 3: Projects will be evaluated for a fee, and a certification delivered.

Be aware that the book from Wiley appears to be a collection of blog posts.

Nothing against blogging, 😉 , but recycled blog posts don’t have a good narrative flow, and they usually aren’t comprehensive examinations of an entire area.

There is no projected date for Stage 3 but I’m watching for updates.

December 30, 2013

6000 Companies Hiring Data Scientists

Filed under: Data Science — Patrick Durusau @ 4:19 pm

6000 Companies Hiring Data Scientists by Vincent Granville.

From the post:

Search engines (Google, Microsoft), social networks (Twitter, Facebook, LinkedIn), financial institutions, Amazon, Apple, eBay, the health care industry, engineering companies (Boeing, Intel, Oil industry), retail analytics, mobile analytics, marketing agencies, data science vendors (for instance, Pivotal, Teradata, Tableau, SAS, Alpine Labs), environment, utilities, government and defense routinely hire data scientists, though the job title is sometimes different. Traditional companies (manufacturing) tend to call them operations research analysts.

The comprehensive list of > 6,000 companies isn’t as helpful as you might imagine.

It’s an Excel spreadsheet with two (2) columns: the first is the company name and the second is the number of connections on LinkedIn.

I was thinking the list might be useful both for employment and for marketing data services.

In its present form, not useful at all.

But data scientists or wannabe data scientists should not accept less than useful data as a given.

What would you want to see added to the data? How would you make that a reality?

Recalling that we want maximum accuracy with a minimum amount of manual effort.

I’m going to be thinking along those lines. Suggestions welcome!

175 Analytic and Data Science Web Sites

Filed under: Data Science,Nutch — Patrick Durusau @ 4:07 pm

175 Analytic and Data Science Web Sites by Vincent Granville.

From the post:

Following is a list (in alphabetical order) of top domains related to analytics, data science or big data, based on input from Data Science central members. These top domains were cited by at least 4 members. Some of them are pure data science web sites, while others are more general (but still tech-oriented) with strong emphasis on data issues at large, or regular data science content.

I created 175-DataSites-2013.txt from Vincent’s listing formatted as a Nutch seed text.

I would delete some of the entries prior to crawling.

For example, http://harvard.edu.

Lots of interesting content, but if you are looking for data-centric resources, I would be more specific.
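If you want to prune a seed list programmatically before handing it to Nutch, a few lines of Python will do (a sketch; the filenames and the “too broad” set are examples only):

```python
# Drop overly broad domains from a Nutch seed list (one URL per line).
too_broad = {"http://harvard.edu"}

with open("175-DataSites-2013.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with open("seeds-filtered.txt", "w") as out:
    for url in urls:
        if url not in too_broad:
            out.write(url + "\n")
```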

Scala as a platform…

Filed under: Data Science,Programming,Scala — Patrick Durusau @ 3:19 pm

Scala as a platform for statistical computing and data science by Darren Wilkinson.

From the post:

There has been a lot of discussion on-line recently about languages for data analysis, statistical computing, and data science more generally. I don’t really want to go into the detail of why I believe that all of the common choices are fundamentally and unfixably flawed – language wars are so unseemly. Instead I want to explain why I’ve been using the Scala programming language recently and why, despite being far from perfect, I personally consider it to be a good language to form a platform for efficient and scalable statistical computing. Obviously, language choice is to some extent a personal preference, implicitly taking into account subjective trade-offs between features different individuals consider to be important. So I’ll start by listing some language/library/ecosystem features that I think are important, and then explain why.

A feature wish list

It should:

  • be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks
  • be free, open-source and platform independent
  • be fast and efficient
  • have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra
  • have a strong type system, and be statically typed with good compile-time type checking and type safety
  • have reasonable type inference
  • have a REPL for interactive use
  • have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)
  • have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design
  • allow imperative programming for those (rare) occasions where it makes sense
  • be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications

The not-very-surprising punch-line is that Scala ticks all of those boxes and that I don’t know of any other languages that do. But before expanding on the above, it is worth noting a couple of (perhaps surprising) omissions. For example:

  • have excellent data viz capability built-in
  • have vast numbers of statistical routines in the standard library

Darren reviews Scala on each of these points.

Although he still uses R and Python, Darren has hopes for future development of Scala into a full featured data mining platform.

Perhaps his checklist will contribute the requirements needed to make that one of the futures of Scala.

I first saw this in Christophe Lalanne’s A bag of tweets / December 2013.

December 29, 2013

Part D Fraud

Filed under: BigData,Data Science — Patrick Durusau @ 4:11 pm

‘Let the Crime Spree Begin’: How Fraud Flourishes in Medicare’s Drug Plan by Tracy Weber and Charles Ornstein.

From the post:

With just a handful of prescriptions to his name, psychiatrist Ernest Bagner III was barely a blip in Medicare’s vast drug program in 2009.

But the next year he began churning them out at a furious rate. Not just the psych drugs expected in his specialty, but expensive pills for asthma and high cholesterol, heartburn and blood clots.

By the end of 2010, Medicare had paid $3.8 million for Bagner’s drugs — one of the highest tallies in the country. His prescriptions cost the program another $2.6 million the following year, records analyzed by ProPublica show.

Bagner, 46, says there’s just one problem with this accounting: The prescriptions aren’t his. “All of that stuff you have is false,” he said.

By his telling, someone stole his identity while he worked at a strip-mall clinic in Hollywood, Calif., then forged his signature on prescriptions for hundreds of Medicare patients he’d never seen. Whoever did it, he’s been told, likely pilfered those drugs and resold them.

“These people make more money off my name than I do,” said Bagner, who now works as a disability evaluator and says he no longer prescribes medications.

Today, credit card companies routinely scan their records for fraud, flagging or blocking suspicious charges as they happen. Yet Medicare’s massive drug program has a process so convoluted and poorly managed that fraud flourishes, giving rise to elaborate schemes that quickly siphon away millions of dollars.

Frustrated investigators for law enforcement, insurers and pharmacy chains say they don’t see evidence that Medicare officials are doing much to stop it.

“It’s kind of a black hole,” said Alanna Lavelle, director of investigations for WellPoint Inc., which provides drug coverage to about 1.4 million people in the program, known as Part D.
….

One of the problems that enables so much fraud is:

Part D is vulnerable because it requires insurance companies to pay for prescriptions issued by any licensed prescriber and filled by any willing pharmacy within 14 days. Insurers generally must cover even suspicious claims before investigating, an approach called “pay and chase.” By comparison, these same insurers have more time to review questionable medication claims for patients in their non-Medicare plans.

I wonder if the government would pay on a percentage of fraud reduction for a case like this?

Setting up the data streams from pharmacies would be the hardest part.

But once that was in place, it would be a matter of getting some good average prescription data and crunching the numbers.
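As a toy illustration of that number crunching (the data layout is hypothetical, and a real screen would control for specialty, patient mix and drug prices):

```python
import statistics

def flag_prescribers(claims, multiple=10):
    """Flag prescribers whose annual drug costs exceed a multiple
    of the median. claims: dict of prescriber id -> annual drug cost."""
    median = statistics.median(claims.values())
    return {pid: cost for pid, cost in claims.items()
            if cost > multiple * median}

claims = {"dr_a": 3_800_000, "dr_b": 40_000, "dr_c": 55_000,
          "dr_d": 48_000, "dr_e": 61_000}
print(flag_prescribers(claims))  # {'dr_a': 3800000}
```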

There would still be some minor fraud but nothing in the totals that are discussed in this article.

A topic map would be useful for some of the more sophisticated fraud schemes.

I make that sound easy and it would not be. There are financial/economic and social interests being served by the current Part D structures. And questions such as “How much fraud will you tolerate in order to get senior citizens their drugs?” will need good answers.

Still, even routine data science tools and reporting should be able to lessen the financial hemorrhaging under Part D.

December 26, 2013

The Open-Source Data Science Masters – Curriculum

Filed under: Data Science — Patrick Durusau @ 4:00 pm

The Open-Source Data Science Masters – Curriculum by Clare Corthell.

An interesting mixture of online courses, books, software tools, etc.

Fully mastering all of the material mentioned would probably equal or exceed an MS in Data Science.

Probably.

I say “probably” because data sets, algorithms, processing models, and the like all have built-in assumptions that impact the results.

In a master’s program worthy of the name, the assumptions of common methods of data analysis would be taught, alongside how to recognize/discover assumptions in data and/or methodologies.

In lieu of a formal course of that nature, I suggest How to Lie with Statistics by Darrell Huff and How to Lie with Maps by Mark Monmonier.

Data Mining is more general than either of those two works so a “How to Lie with Data Mining” would not be amiss.

Or even a “Data Mining Lies Yearbook (year)” that annotates stories, press releases, articles, presentations with their questionable assumptions and/or choices.

Bearing in mind that incompetence is a far more common explanation of lies than malice.

November 22, 2013

Data Scientist Foundations:…

Filed under: Data Science — Patrick Durusau @ 7:28 pm

Data Scientist Foundations: The Hard and Human Skills You Need

I didn’t see anything that wasn’t obvious if you stopped to think about it.

On the other hand, having a list will make it easier to identify gaps in your skills and/or training.

Something to keep in mind for the new year.

Make the resolutions you draw from this list the ones you really keep.

November 7, 2013

16 Reasons Data Scientists are Difficult to Manage

Filed under: Data Science,Humor — Patrick Durusau @ 7:22 pm

16 Reasons Data Scientists are Difficult to Manage

No spoilers. Go read Amy’s post.

I think it would have worked better as:

Data Scientist Scoring Test.

With values associated with the answers.

You?

November 5, 2013

Hadoop for Data Science: A Data Science MD Recap

Filed under: Data Science,Hadoop — Patrick Durusau @ 2:02 pm

Hadoop for Data Science: A Data Science MD Recap by Matt Motyka.

From the post:

On October 9th, Data Science MD welcomed Dr. Donald Miner as its speaker to talk about doing data science work and how the hadoop framework can help. To start the presentation, Don was very clear about one thing: hadoop is bad at a lot of things. It is not meant to be a panacea for every problem a data scientist will face.

With that in mind, Don spoke about the benefits that hadoop offers data scientists. Hadoop is a great tool for data exploration. It can easily handle filtering, sampling and anti-filtering (summarization) tasks. When speaking about these concepts, Don expressed the benefits of each and included some anecdotes that helped to show real world value. He also spoke about data cleanliness in a very Baz Luhrmann Wear Sunscreen sort of way, offering that as his biggest piece of advice.

What?

Hadoop is not a panacea for every data problem????

😉

Don’t panic when you start the video. The ads, etc., take almost seven (7) minutes but Dr. Miner is on the way.


Update: Slides for Hadoop for Data Science. Enjoy!

October 21, 2013

7 command-line tools for data science

Filed under: Data Mining,Data Science,Extraction — Patrick Durusau @ 4:54 pm

7 command-line tools for data science by Jeroen Janssens.

From the post:

Data science is OSEMN (pronounced as awesome). That is, it involves Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data. As a data scientist, I spend quite a bit of time on the command-line, especially when there's data to be obtained, scrubbed, or explored. And I'm not alone in this. Recently, Greg Reda discussed how the classics (e.g., head, cut, grep, sed, and awk) can be used for data science. Prior to that, Seth Brown discussed how to perform basic exploratory data analysis in Unix.

I would like to continue this discussion by sharing seven command-line tools that I have found useful in my day-to-day work. The tools are: jq, json2csv, csvkit, scrape, xml2json, sample, and Rio. (The home-made tools scrape, sample, and Rio can be found in this data science toolbox.) Any suggestions, questions, comments, and even pull requests are more than welcome.

Jeroen covers:

  1. jq – sed for JSON
  2. json2csv – convert JSON to CSV
  3. csvkit – suite of utilities for converting to and working with CSV
  4. scrape – HTML extraction using XPath or CSS selectors
  5. xml2json – convert XML to JSON
  6. sample – when you’re in debug mode
  7. Rio – making R part of the pipeline

There are fourteen (14) more suggested by readers at the bottom of the post.

Some definite additions to the tool belt here.
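For a rough feel of what a tool like json2csv automates, here is the core of the conversion in plain Python (a sketch for flat, uniform records, not the tool itself):

```python
import csv
import json
import sys

# Read a JSON array of flat objects on stdin, write CSV on stdout.
records = json.load(sys.stdin)
writer = csv.DictWriter(sys.stdout, fieldnames=sorted(records[0]))
writer.writeheader()
writer.writerows(records)
```

The real tools earn their keep on nested JSON, huge files and malformed input.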

I first saw this in Pete Warden’s Five Short Links, October 19, 2013.

October 18, 2013

Introduction to Data Science at Columbia University

Filed under: Computer Science,Data Science — Patrick Durusau @ 3:01 pm

Introduction to Data Science at Columbia University by Dr. Rachel Schutt.

The link points to a blog for the course. Includes entries for the same class last year.

Search for “INTRODUCTION TO DATA SCIENCE VERSION 2.0” and that should put you out at the first class for Fall 2013.

Personally I would read the entries for 2012 as well.

Hard to know when a chance remark from a student will provoke new ideas from other students or even professors.

That is one of the things I like the most about teaching: being challenged by insights and views that hadn’t occurred to me.

Not that I always agree with them but it is a nice break from talking to myself. 😉

I first saw this at: Columbia Intro to Data Science 2.0.

October 16, 2013

Research Methodology [How Good Is Your Data?]

Filed under: Data Collection,Data Quality,Data Science — Patrick Durusau @ 3:42 pm

The presenters in a recent webinar took great pains to point out all the questions a user should be asking about data.

Questions like: How representative was the surveyed population? How representative is the data? How were the survey questions tested? What were the selection biases? It was like a flashback to empirical methodology in a political science course I took years ago.

It hadn’t occurred to me that some users of data (or “big data” if you prefer) might not have empirical methodology reflexes.

That would account for people who use Survey Monkey and think the results aren’t a reflection of themselves.

Doesn’t have to be. A professional survey person could use the same technology and possibly get valid results.

But the ability to hold a violin doesn’t mean you can play one.

Resources that you may find useful:

Political Science Scope and Methods

Description:

This course is designed to provide an introduction to a variety of empirical research methods used by political scientists. The primary aims of the course are to make you a more sophisticated consumer of diverse empirical research and to allow you to conduct advanced independent work in your junior and senior years. This is not a course in data analysis. Rather, it is a course on how to approach political science research.

Berinsky, Adam. 17.869 Political Science Scope and Methods, Fall 2010. (MIT OpenCourseWare: Massachusetts Institute of Technology), http://ocw.mit.edu/courses/political-science/17-869-political-science-scope-and-methods-fall-2010 (Accessed 16 Oct, 2013). License: Creative Commons BY-NC-SA

Qualitative Research: Design and Methods

Description:

This course is intended for graduate students planning to conduct qualitative research in a variety of different settings. Its topics include: Case studies, interviews, documentary evidence, participant observation, and survey research. The primary goal of this course is to assist students in preparing their (Masters and PhD) dissertation proposals.

Locke, Richard. 17.878 Qualitative Research: Design and Methods, Fall 2007. (MIT OpenCourseWare: Massachusetts Institute of Technology), http://ocw.mit.edu/courses/political-science/17-878-qualitative-research-design-and-methods-fall-2007 (Accessed 16 Oct, 2013). License: Creative Commons BY-NC-SA

Introduction to Statistical Method in Economics

Description:

This course is a self-contained introduction to statistics with economic applications. Elements of probability theory, sampling theory, statistical estimation, regression analysis, and hypothesis testing. It uses elementary econometrics and other applications of statistical tools to economic data. It also provides a solid foundation in probability and statistics for economists and other social scientists. We will emphasize topics needed in the further study of econometrics and provide basic preparation for 14.32. No prior preparation in probability and statistics is required, but familiarity with basic algebra and calculus is assumed.

Bennett, Herman. 14.30 Introduction to Statistical Method in Economics, Spring 2006. (MIT OpenCourseWare: Massachusetts Institute of Technology), http://ocw.mit.edu/courses/economics/14-30-introduction-to-statistical-method-in-economics-spring-2006 (Accessed 16 Oct, 2013). License: Creative Commons BY-NC-SA

Every science program, social or otherwise, will offer some type of research methods course. The ones I have listed are only the tip of a very large iceberg of courses and literature.

With a little effort you can acquire an awareness of what wasn’t said about data collection, processing or analysis.

October 14, 2013

Data Science Association

Filed under: Data Science — Patrick Durusau @ 3:24 pm

Data Science Association

From the homepage:

The Data Science Association is a non-profit professional group that offers education, professional certification, a “Data Science Code of Professional Conduct” and conferences / meetups to discuss data science (e.g. predictive / prescriptive analytics, algorithm design and execution, applied machine learning, statistical modeling, and data visualization). Our members are professionals, students, researchers, academics and others with a deep interest in data science and related technologies.

From the news/blog it looks like the Data Science Association came online in late March of 2013.

Rather sparse in terms of resources, although there is a listing of videos of indeterminate length. I say “indeterminate length” because on Firefox, Chrome and IE running in a virtual box, the video listing does not scroll. It appears to have content located below the bottom of my screen. I checked by reducing the browser window and yes, there is content “lower” down the list.

The code of conduct is quite long but I thought you might be interested in the following passages:

(g) A data scientist shall use reasonable diligence when designing, creating and implementing algorithms to avoid harm. The data scientist shall disclose to the client any real, perceived or hidden risks from using the algorithm. After full disclosure, the client is responsible for making the decision to use or not use the algorithm. If a data scientist reasonably believes an algorithm will cause harm, the data scientist shall take reasonable remedial measures, including disclosure to the client, and including, if necessary, disclosure to the proper authorities. The data scientist shall take reasonable measures to persuade the client to use the algorithm appropriately.

(h) A data scientist shall use reasonable diligence when designing, creating and implementing machine learning systems to avoid harm. The data scientist shall disclose to the client any real, perceived or hidden risks from using a machine learning system. After full disclosure, the client is responsible for making the decision to use or not use the machine learning system. If a data scientist reasonably believes the machine learning system will cause harm, the data scientist shall take reasonable remedial measures, including disclosure to the client, and including, if necessary, disclosure to the proper authorities. The data scientist shall take reasonable measures to persuade the client to use the machine learning system appropriately.

I would much prefer Canon 4: A Lawyer Should Preserve The Confidences And Secrets of a Client:

EC 4-1

Both the fiduciary relationship existing between lawyer and client and the proper functioning of the legal system require the preservation by the lawyer of confidences and secrets of one who has employed or sought to employ the lawyer. A client must feel free to discuss anything with his or her lawyer and a lawyer must be equally free to obtain information beyond that volunteered by the client. A lawyer should be fully informed of all the facts of the matter being handled in order for the client to obtain the full advantage of our legal system. It is for the lawyer in the exercise of independent professional judgment to separate the relevant and important from the irrelevant and unimportant. The observance of the ethical obligation of a lawyer to hold inviolate the confidences and secrets of a client not only facilitates the full development of facts essential to proper representation of the client but also encourages non-lawyers to seek early legal assistance. (NEW YORK LAWYER’S CODE OF PROFESSIONAL RESPONSIBILITY)

You could easily fit “data scientist” and “data science” as appropriate in that passage.

Playing the role of Jiminy Cricket or moral conscience of a client seems problematic to me.

In part because there are professionals (priests, rabbis, imams, ministers) who are better trained to recognize and counsel on moral issues.

But in part because of the difficulty of treating all clients equally. Are you more concerned about the “harm” that may be done by a client of Middle Eastern extraction than one from New York (Timothy McVeigh)?

Or putting in extra effort to detect “harm” because the government doesn’t like someone?

Personally I think the government has too many snitches and/or potential snitches as it is. Data scientists should not be too quick to join that crowd.

September 29, 2013

Foundations of Data Science

Foundations of Data Science by John Hopcroft and Ravindran Kannan.

From the introduction:

Computer science as an academic discipline began in the 60’s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context free languages, and computability. In the 70’s, algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks, which are by far the largest such structures, presents both opportunities and challenges for theory.

While traditional areas of computer science are still important and highly skilled individuals are needed in these areas, the majority of researchers will be involved with using computers to understand and make usable massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods.

In draft form but impressive!

Current chapters:

  1. Introduction
  2. High-Dimensional Space
  3. Random Graphs
  4. Singular Value Decomposition (SVD)
  5. Random Walks and Markov Chains
  6. Learning and the VC-dimension
  7. Algorithms for Massive Data Problems
  8. Clustering
  9. Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation
  10. Other Topics [Rankings, Hare System for Voting, Compressed Sensing and Sparse Vectors]
  11. Appendix

I am certain the authors would appreciate comments and suggestions concerning the text.

I first saw this in a tweet by CompSciFact.

September 24, 2013

Data Science: Not Just for Big Data (Webinar)

Filed under: Data Science — Patrick Durusau @ 3:55 pm

Data Science: Not Just for Big Data

October 16th at 11am EST

From the webpage:

These days, data science and big data have become synonymous phrases. But data doesn’t have to be big for data science to unlock big value.

Join Kalido CTO Darren Peirce as he hosts David Smith, Data Scientist at Revolution Analytics and Gregory Piatetsky, Editor of KDNuggets, two of today’s most influential data scientists, for an open-panel discussion. They’ll discuss why the value of the insights is not directly proportional to the size of a dataset.

If you are wondering whether data science can give your business an edge this may be the most important hour you’ll spend all week.

Confirmation that having the right data isn’t the same thing as having “big data.”

The NSA can mine all the telephone traffic if it wants. Mining the telephone traffic of security risks, a much smaller data set, is likely to be more productive.

See you at the webinar!

I first saw this at: Upcoming Data Science Webinar.

August 31, 2013

Do You Mansplain Topic Maps?

Filed under: Data Science,Marketing,Topic Maps — Patrick Durusau @ 3:54 pm

Selling Data Science: Common Language by Sean Gonzalez.

From the post:

What do you think of when you say the word “data”? For data scientists this means SO MANY different things from unstructured data like natural language and web crawling to perfectly square excel spreadsheets. What do non-data scientists think of? Many times we might come up with a slick line for describing what we do with data, such as, “I help find meaning in data” but that doesn’t help sell data science. Language is everything, and if people don’t use a word on a regular basis it will not have any meaning for them. Many people aren’t sure whether they even have data let alone if there’s some deeper meaning, some insight, they would like to find. As with any language barrier the goal is to find common ground and build from there.

You can’t blame people, the word “data” is about as abstract as you can get, perhaps because it can refer to so many different things. When discussing data casually, rather than mansplain what you believe data is or what it could be, it’s much easier to find examples of data that they are familiar with and preferably are integral to their work. (emphasis added)

Well? Your answer here:______.

Let’s recast that last clause to read:

…it’s much easier to find examples of subjects they are familiar with and preferably are integral to their work.

So that the conversation is about their subjects and what they want to say about them.

As a potential customer, I would find that more compelling.

You?

May 23, 2013

Data Science eBook by Analyticbridge – 2nd Edition

Filed under: Data Science — Patrick Durusau @ 7:27 am

Data Science eBook by Analyticbridge – 2nd Edition by Vincent Granville.

From the post:

This 2nd edition has more than 200 pages of pure data science, far more than the first edition. This new version of our very popular book will soon be available for download: we will make an announcement when it is officially published.

Sixty-two (62) new contributions split between data science recipes, data science discussions, data science resources.

If you can’t wait for the ebook, links to the contributions are given at Vincent’s post.

One post in particular caught my attention: How to reverse engineer Google?

The project sounds interesting but why not reverse engineer CNN or WSJ or NYT coverage?

Watch the stories that appear most often and most visibly to determine what you need to do to get coverage.

It may not have anything to do with your core competency, but then neither does gaming page rankings by Google.

It’s just that your business model is then selling your service to people even less informed than you are.

Do be careful because some events covered by CNN, WSJ and the NYT are considered illegal in some jurisdictions.

May 21, 2013

“Practical Data Science with R” MEAP (ordered)

Filed under: Data Science,R — Patrick Durusau @ 11:17 am

Big News! “Practical Data Science with R” MEAP launched! by John Mount.

From the post:

Nina Zumel and I ( John Mount ) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered Manning Early Access Program (MEAP) which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.


Deal of the Day May 21 2013: Half off Practical Data Science with R. Use code dotd0521au.

I ordered the “Practical Data Science with R” MEAP today, based on my other Manning MEAP experiences.

You?

May 10, 2013

Harvard Stat 221

Filed under: CS Lectures,Data Science,Mathematics,Statistics — Patrick Durusau @ 6:36 pm

Harvard Stat 221 “Statistical Computing and Visualization”: by Sergiy Nesterko.

From the post:

Stat 221 is Statistical Computing and Visualization. It’s a graduate class on analyzing data without losing scientific rigor, and communicating your work. Topics span the full cycle of a data-driven project including project setup, design, implementation, and creating interactive user experiences to communicate ideas and results. We covered current theory and philosophy of building models for data, computational methods, and tools such as d3js, parallel computing with MPI, R.

See Sergiy’s post for the lecture slides from this course.

April 28, 2013

Algorithms Every Data Scientist Should Know: Reservoir Sampling

Filed under: Algorithms,Data Science,Reservoir Sampling — Patrick Durusau @ 12:12 pm

Algorithms Every Data Scientist Should Know: Reservoir Sampling by Josh Wills.

Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:

Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.

Remember: Stay Calm.

The second thing to do is to think deeply about the question. Assume that you are talking to a good person who has read Daniel Tunkelang’s excellent advice about interviewing data scientists. This means that this interview question probably originated in a real problem that this data scientist has encountered in her work. Therefore, a simple answer like, “I would put all of the items in a list and then select one at random once the stream ended,” would be a bad thing for you to say, because it would mean that you didn’t think deeply about what would happen if there were more items in the stream than would fit in memory (or even on disk!) on a single computer.

The third thing to do is to create a simple example problem that allows you to work through what should happen for several concrete instances of the problem. The vast majority of humans do a much better job of solving problems when they work with concrete examples instead of abstractions, so making the problem concrete can go a long way toward helping you find a solution.

In addition to great interview advice, Josh also provides a useful overview of reservoir sampling.
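For reference, the k = 1 case fits in a few lines of Python (a sketch of the classic answer, not Josh’s code):

```python
import random

def reservoir_sample(stream):
    """Pick one item uniformly at random from a stream of unknown
    length, in a single pass and constant memory."""
    chosen = None
    for i, item in enumerate(stream, start=1):
        # Keep the new item with probability 1/i; after n items,
        # each has survived with probability exactly 1/n.
        if random.randrange(i) == 0:
            chosen = item
    return chosen

print(reservoir_sample(range(1_000_000)))
```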

Whether reservoir sampling will be useful to you depends on your test for subject identity.

I tend to think of subject identity as being very precise but that isn’t necessarily the case.

Or should I say that precision of subject identity is a matter of requirements?

For some purposes, it may be sufficient to know the gender of attendees, as a subject, within some margin of statistical error. With enough effort we could know that more precisely but the cost may be prohibitive.

Think of any test for subject identity as being located on a continuum of subject identification, where the notion of “precision” itself is up for definition.
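Two toy identity tests at different points on that continuum (a sketch only):

```python
import unicodedata

def exact_test(a, b):
    """High precision: byte-for-byte identity."""
    return a == b

def normalized_test(a, b):
    """Lower precision: ignore case and diacritics."""
    def norm(s):
        decomposed = unicodedata.normalize("NFKD", s)
        stripped = "".join(c for c in decomposed
                           if not unicodedata.combining(c))
        return stripped.casefold()
    return norm(a) == norm(b)

print(exact_test("Müller", "Muller"))       # False
print(normalized_test("Müller", "Muller"))  # True, under the looser test
```

Which test is “right” depends on your requirements.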

Russia’s warning about one of the Boston Marathon bombers, a warning that used his name as he spelled it, not as captured by the US intelligence community, was a case of a mistaken level of precision.

Most likely the result of an analyst schooled in an English-only curriculum.

April 20, 2013

The Amateur Data Scientist and Her Projects

Filed under: Authoring Topic Maps,Data Science — Patrick Durusau @ 10:33 am

The Amateur Data Scientist and Her Projects by Vincent Granville.

From the post:

With so much data available for free everywhere, and so many open tools, I would expect to see the emergence of a new kind of analytic practitioner: the amateur data scientist.

Just like the amateur astronomer, the amateur data scientist will significantly contribute to the art and science, and will eventually solve mysteries. Could the Boston bomber be found thanks to thousands of amateurs analyzing publicly available data (images, videos, tweets, etc.) with open source tools? After all, amateur astronomers have been able to detect exoplanets and much more.

Also, just like the amateur astronomer only needs one expensive tool (a good telescope with data recording capabilities), the amateur data scientist only needs one expensive tool (a good laptop and possibly subscription to some cloud storage/computing services).

Amateur data scientists might earn money from winning Kaggle contests, working on problems such as identifying a botnet, explaining the stock market flash crash, defeating Google page-ranking algorithms, helping find new complex molecules to fight cancer (analytical chemistry), predicting solar flares and their intensity. Interested in becoming an amateur data scientist? Here’s a first project for you, to get started:

Amateur data scientist, I rather like the sound of that.

And would be an intersection of interests and talents, just like professional data scientists.

Vincent’s example of posing entry level problems is a model I need to follow for topic maps.

Amateur topic map authors?

Data Science Markets [Marketing]

Filed under: Data Science,Marketing,Topic Maps — Patrick Durusau @ 8:38 am

Data Visualization: The Data Industry by Sean Gonzalez.

From the post:

In any industry you either provide a service or a product, and data science is no exception. Although the people who constitute the data science workforce are in many cases rebranded from statistician, physicist, algorithm developer, computer scientist, biologist, or anyone else who has had to systematically encode meaning from information as the product of their profession, data scientists are unique from these previous professions in that they operate across verticals as opposed to diving ever deeper down the rabbit hole.

Sean identifies five (5) market segments in data science and a visualization product for each one:

  1. New Recruits
  2. Contributors
  3. Distillers
  4. Consultants
  5. Traders

See Sean’s post for the details.

Have you identified market segments and the needs they have for topic map based data and/or software?

Yes, I said their needs.

You may want a “…more just, verdant, and peaceful world” but that’s hardly a common requirement.

Starting with a potential customer’s requirements is more likely to result in a sale.

Data Computation Fundamentals [Promoting Data Literacy]

Filed under: Data,Data Science,R — Patrick Durusau @ 8:08 am

Data Computation Fundamentals by Daniel Kaplan and Libby Shoop.

From the first lesson:

Teaching the Grammar of Data

Twenty years ago, science students could get by with a working knowledge of a spreadsheet program. Those days are long gone, says Danny Kaplan, DeWitt Wallace Professor of Mathematics and Computer Science. “Excel isn’t going to cut it,” he says. “In today’s world, students can’t escape big data. Though it won’t be easy to teach it, it will only get harder as they move into their professional training.”

To that end, Kaplan and computer science professor Libby Shoop have developed a one-credit class called Data Computation Fundamentals, which is being offered beginning this semester. Though Kaplan doesn’t pretend the course can address all the complexities of specific software packages, he does hope it will provide a framework that students can apply when they come across databases or data-reliant programs in biology, chemistry, and physics. “We believe we can give students that grammar of data that they need to use these modern capabilities,” he says.

Not quite “have developed.” Should say, “are developing, in conjunction with a group of about 20 students.”

Data literacy impacts the acceptance and use of data and tools for using data.

Teaching people to read and write is not a threat to commercial authors.

By the same token, teaching people to use data is not a threat to competent data analysts.

Help the authors and yourself by reviewing the course and offering comments for its improvement.

I first saw this at: A Course in Data and Computing Fundamentals.

April 11, 2013

Cargo Cult Data Science [Cargo Cult Semantics?]

Filed under: Data Science,Semantics — Patrick Durusau @ 3:30 pm

Cargo Cult Data Science by Jim Harris.

From the post:

Last week, Phil Simon blogged about being wary of snake oil salesman who claim to be data scientists. In this post, I want to explore a related concept, namely being wary of thinking that you are performing data science by mimicking what data scientists do.

The American theoretical physicist Richard Feynman coined the term cargo cult science to refer to practices that have the semblance of being scientific, but do not in fact follow the scientific method.

As Feynman described his analogy, “in the South Seas there is a cult of people. During the war they saw airplanes land with lots of materials, and they want the same thing to happen now. So they’ve arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas—he’s the controller—and they wait for the airplanes to land. They’re doing everything right. The form is perfect. But it doesn’t work. No airplanes land. So I call these things Cargo Cult Science, because they follow all the apparent precepts and forms of scientific investigation, but they’re missing something essential, because the planes don’t land.”

Feynman’s description of the runway and controller reminds me of attempts to create systems with semantic “understanding.”

We load them up with word lists, thesauri, networks of terms, the equivalent of runways.

We give them headphones (ontologies) with bars of bamboo (syntax) sticking out of them.

And after all that, semantic understanding continues to elude us.

Maybe those efforts are missing something essential? (Like us?)

PyData and More Tools…

Filed under: Data Science,PyData,Python — Patrick Durusau @ 3:14 pm

PyData and More Tools for Getting Started with Python for Data Scientists by Sean Murphy.

From the post:

It would turn out that people are very interested in learning more about python and our last post, “Getting Started with Python for Data Scientists,” generated a ton of comments and recommendations. So, we wanted to give back those comments and a few more in a new post. As luck would have it, John Dennison, who helped co-author this post (along with Abhijit), attended both PyCon and PyData and wanted to sneak in some awesome developments he learned at the two conferences.

I make out at least seventeen (17) different Python resources, libraries, etc.

Enough to keep you busy for more than a little while. 😉

March 21, 2013

Training a New Generation of Data Scientists

Filed under: Cloudera,CS Lectures,Data Science — Patrick Durusau @ 2:26 pm

Training a New Generation of Data Scientists by Ryan Goldman.

From the post:

Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of existing external talent will require companies of all sizes to find, develop, and train their people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.

Join us for the premiere of Training a New Generation of Data Scientists on Tuesday, March 26, at 2pm ET/11am PT. In this video, Cloudera’s Senior Director of Data Science, Josh Wills, will discuss what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh will answer your questions about data science, Hadoop, and Cloudera’s Introduction to Data Science: Building Recommender Systems course.

This could be fun!

And if nothing else, it will give you the tools to distinguish legitimate training, like Cloudera’s, from “How to make $millions in real estate” style training, sold by the guy who makes money selling lectures and books.

As “hot” as data science is, you don’t have to look far to find that sort of training.

Snowflake Data Science [Three R’s of Topic Maps?]

Filed under: BigData,Data Science,Topic Maps — Patrick Durusau @ 1:37 pm

Much of today’s statistical modeling and predictive analytics is beautiful but unique. It’s impossible to repeat, it’s snowflake data science. (Matt Wood, principal data scientist for Amazon Web Services)

Think about that for a moment.

Snowflakes are unique. Can the same be said about your data science projects?

Would that explain the 80% figure of data science time being spent on cleaning, ETL, and similar tasks with data?

Is it that data never gets clean or are you cleaning the same data over and over again?

Barb Darrow reported in: From Amazon’s top data geek: data has got to be big — and reproducible:

The next frontier is making that data reproducible, said Matt Wood, principal data scientist for Amazon Web Services, at GigaOM’s Structure:Data 2013 event Wednesday.

In short, it’s great to get a result from your number crunching, but if the result is different next time out, there’s a problem. No self-respecting scientist would think of submitting the findings for a trial or experiment unless she is able to show that it will be the same after multiple runs.

“Much of today’s statistical modeling and predictive analytics is beautiful but unique. It’s impossible to repeat, it’s snowflake data science.” Wood told attendees in New York. “Reproducibility becomes a key arrow in the quiver of the data scientist.”

The next frontier is making sure that people can reproduce, reuse and remix their data which provides a “tremendous amount of value,” Wood noted. (emphasis added)

I like that: Reproduce, Reuse, Remix data.

That’s going to require robust and granular handling of subject identity.

The three R’s of topic maps.

Yes?
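Even a toy example shows how little “reproduce” demands up front: pin the randomness and record a digest of what you read (a sketch, not a provenance system; the filename is hypothetical):

```python
import hashlib
import random

def reproducible_sample(path, k, seed=42):
    """Draw the same k-line sample on every run. Seeding the RNG makes
    the draw repeatable; the SHA-256 digest lets a rerun verify it
    read the same input data."""
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()
    lines = data.decode().splitlines()
    return digest, random.Random(seed).sample(lines, k)

digest, sample = reproducible_sample("observations.txt", k=10)
```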
