Data Science « Another Word For It

December 9, 2016

Data Science and Protests During the Age of Trump [How To Brick A School Bus…]

Filed under: Censorship,Data Science,Government,Politics,Protests — Patrick Durusau @ 3:48 pm

Pre-inauguration suppression of free speech/protests is underway for the Trump regime. (CNN link as subject identifier for Donald J. Trump, even though it fails to mention he looks like a cheeto in a suit.)

Women’s March on Washington barred from Lincoln Memorial by Amber Jamieson and Jessica Glenza.

From the post:

…
For the thousands hoping to echo the civil rights and anti-Vietnam rallies at Lincoln Memorial by joining the women’s march on Washington the day after Donald Trump’s inauguration: time to readjust your expectations.

The Women’s March won’t be held at the Lincoln Memorial.

That’s because the National Park Service, on behalf of the Presidential Inauguration Committee, filed documents securing large swaths of the national mall and Pennsylvania Avenue, the Washington Monument and the Lincoln Memorial for the inauguration festivities. None of these spots will be open for protesters.

The NPS filed a “massive omnibus blocking permit” for many of Washington DC’s most famous political locations for days and weeks before and after the inauguration on 20 January, said Mara Verheyden-Hilliard, a constitutional rights litigator and the executive director of the Partnership for Civil Justice Fund.
…

I contacted Amber Jamieson for more details on the permits and she forwarded two links (thanks Amber!):

Press Conference: Mass Protests Will Go Forward During Inauguration, which had the second link she forwarded:

PresidentialInauguralCommittee12052016.pdf, the permit requests made by the National Park Service on behalf of the Presidential Inaugural Committee.

Start with where protests are “permitted” to see what has been lost.

A grim read but 36 CFR 7.96 says in part:

…
3 (i) White House area. No permit may be issued authorizing demonstrations in the White House area, except for the White House sidewalk, Lafayette Park and the Ellipse. No permit may be issued authorizing special events, except for the Ellipse, and except for annual commemorative wreath-laying ceremonies relating to the statutes in Lafayette Park.
…
(emphasis added, material hosted by the Legal Information Institute (LII))

Summary: In White House area, protesters have only three places for permits to protest:

White House sidewalk
Lafayette Park
Ellipse

White House sidewalk / Lafayette Park (except North-East Quadrant) – Application 16-0289

Dates:

Set-up dates starting 11/1/2016 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 3/1/2017 11:59 pm

Closes:

…
All of Lafayette Park except for its northeast quadrant pursuant to 36 CFR 7.96 (g)(4)(iii)(A). The initial areas of Lafayette Park and the White House Sidewalk that will be needed for construction set-up, and which will to be closed to ensure public safety, is detailed in the attached map. The attached map depicts the center portion of the White House Sidewalk as well as a portion of the southern oval of Lafayette Park. The other remaining areas in Lafayette Park and the White House Sidewalk that will be needed for construction set-up, will be closed as construction set-up progresses into these other areas, which will also then be delineated by fencing and sign age to ensure public safety.
…

Two of the three possible protest sites in the White House closed by Application 16-0289.

Ellipse – Application 17-0001

Dates:

Set-up dates starting 01/6/2017 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 2/17/2017 11:59 pm

These dates are at variance with those for the White House sidewalk and Lafayette Park (shorter).

Closes:

Ellipse, a fitty-two acre park, as depicted by Google Maps:

Plans for the Ellipse?

…
Purpose of Activity: In connection with the Presidential Inaugural Ceremonies, this application is for use of the Ellipse by PIC, in the event that PIC seeks its use for Inaugural ceremonies and any necessary staging, which is expected to be:

A) In the event that PIC seeks the use of the Ellipse for pre- and/or post- Inaugural ceremonies, the area will be used for staging the event(s), staging of media to cover and/or broadcast the event, and if possible for ticketed and/or public viewing; and/or

B) In the event that PIC seeks the use of the Ellipse for the Inaugural ceremony and Inaugural parade staging, the area will be used to stage the various parade elements, for media to cover and/or broadcast the event, and if possible for ticketed and/or public viewing.
…

The PIC has no plans to use the Ellipse but has reserved it no doubt to deny its use to others.

Those two applications close three out of three protest sites in the White House area. The PIC went even further to reach out and close off other potential protest sites.

Other permits granted to the PIC include:

Misc. Areas – Application 16-0357

Ten (10) misc. areas identified by attached maps for PIC activities.

Arguably legitimate since the camp followers, sycophants and purveyors of “false news” need somewhere to be during the festivities.

National Mall -> Trump Mall – Application 17-0002

The National Mall will become Trump Mall for the following dates:

Set-up dates starting 01/6/2017 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 1/30/2017 11:59 pm

Closes:

…
Plan for Proposed Activity: Consistent with NPS regulations at 36 CFR 7.96{g)(4)(iii)(C), this application seeks, in connection with the Presidential Inaugural Ceremonies, the area of the National Mall between 14th – 4th Streets, for the exclusive use of the Joint Task Force Headquarters (JTFHQ) on Inaugural Day for the assembly, staging, security and weather protection of the pre-Inaugural parade components and floats on Inaugural Day between 14th – 7th Streets. It also includes the placement of jumbotrons and sound towers by the Architect of the Capitol or the Joint Congressional Committee on Inaugural Ceremonies so that the Inaugural Ceremony may be observed by the Joint Congressional Committee’s ticketed standing room ticket holders between 4th – 3rd streets and the general public, which will be located on the National Mall between 7th – 4th Streets. Further, a 150-foot by 200-foot area on the National Mall just east of 7th Street, will be for the exclusive use of the Presidential Inaugural Committee for television and radio media broadcasts on Inaugural Day.
…

In the plans thus far, no mention of the main card or where the ring plus cage will be erected on Trump Mall. (that’s sarcasm, not “fake news”)

Most Other Places – Application 17-0003

If you read 36 CFR 7.96 carefully, you noticed there are places always prohibited to protesters:

…
(ii) Other park areas. Demonstrations and special events are not allowed in the following other park areas:

(A) The Washington Monument, which means the area enclosed within the inner circle that surrounds the Monument’s base, except for the official annual commemorative Washington birthday ceremony.

(B) The Lincoln Memorial, which means that portion of the park area which is on the same level or above the base of the large marble columns surrounding the structure, and the single series of marble stairs immediately adjacent to and below that level, except for the official annual commemorative Lincoln birthday ceremony.

(C) The Jefferson Memorial, which means the circular portion of the Jefferson Memorial enclosed by the outermost series of columns, and all portions on the same levels or above the base of these columns, except for the official annual commemorative Jefferson birthday ceremony.

(D) The Vietnam Veterans Memorial, except for official annual Memorial Day and Veterans Day commemorative ceremonies.
…

What about places just outside the already restricted areas?

Dates:

Set-up dates starting 01/6/2017 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 2/10/2017 11:59 pm

Closes:

…
The Lincoln Memorial area, as more fully detailed as the park area bordered by 23rd Street, Daniel French Drive and Independence Avenue, Henry Bacon Drive and Constitution Avenue, Constitution Avenue between 15th & 23rd Streets, Constitution Gardens to include Area #5 outside of the Vietnam Veteran’s Memorial restricted area, the Lincoln Memorial outside of its restricted area, the Lincoln Memorial Plaza and Reflecting Pool Area, JFK Hockey Field, park area west of Lincoln Memorial between French Drive, Henry Bacon Drive, Parking Lots A, Band C, East and West Potomac Park, Memorial Bridge, Memorial Circle and Memorial Drive, the World War II Memorial. The Washington Monument Grounds as more fully depicted as the park area bounded by 14th & 15th Streets and Madison Drive and Independence Avenue.
…

Not to use but to prevent its use by others:

…
Purpose of Activity: In connection with the Presidential Inaugural Ceremonies, this application is for use of the Lincoln Memorial areas and Washington Monument grounds by PIC, in the event that PIC seeks its use for the Inaugural related ceremonies and any necessary staging, which is expected to be:

A) In the event that PIC seeks the use of the Lincoln Memorial areas for a pre-and/or post Inaugural ceremonies, the area will be used for staging the event(s), staging of media to cover and/or broadcast the event, and for ticketed and/or public viewing.

B) In the event that PIC seeks to use the Washington Monument grounds for a public overflow area to view the Inaugural ceremony and/ or parade, the area will be used for the public who will observe the activities through prepositioned jumbotrons and sound towers.
…

Next Steps

For your amusement, all five applications contain the following question answered No:

Do you have any reason to believe or any information indicating that any individual, group or organization might seek to disrupt the activity for which this application is submitted?

I would venture to say someone hasn’t been listening. 😉

Among the data science questions raised by this background information are:

How best to represent these no free speech and/or no free assembly zones on a map?
What data sets do you need to make protesters effective under these restrictions?
What questions would you ask of those data sets?
How to decide between viral/spontaneous action versus publicly known but lawful conduct, up until the point it becomes unlawful?

If you use any of this information, please credit Amber Jamieson, Jessica Glenza and the Partnership for Civil Justice Fund as the primary sources.

See further news from the Partnership for Civil Justice Fund at: Your Right of Resistance.

Tune in next Monday for: How To Brick A School Bus, Data Science Helps Park It.

PS: “The White House Sidewalk is the sidewalk between East and West Executive Avenues, on the south side Pennsylvania Avenue, N.W.” From OMB Control No. 1024-0021 – Application for a Permit to Conduct a Demonstration or Special Event in Park Areas and a Waiver of Numerical Limitations on Demonstrations for White House Sidewalk and/or Lafayette Park

Comments (1)

November 24, 2016

Learning R programming by reading books: A book list

Filed under: Data Science,Programming,R — Patrick Durusau @ 11:10 am

Learning R programming by reading books: A book list by Liang-Cheng Zhang.

From the post:

Despite R’s popularity, it is still very daunting to learn R as R has no click-and-point feature like SPSS and learning R usually takes lots of time. No worries! As self-R learner like us, we constantly receive the requests about how to learn R. Besides hiring someone to teach you or paying tuition fees for online courses, our suggestion is that you can also pick up some books that fit your current R programming level. Therefore, in this post, we would like to share some good books that teach you how to learn programming in R based on three levels: elementary, intermediate, and advanced levels. Each level focuses on one task so you will know whether these books fit your needs. While the following books do not necessarily focus on the task we define, you should focus the task when you reading these books so you are not lost in contexts.
…

Books and reading form the core of my most basic prejudice: Literacy is the doorway to unlimited universes.

A prejudice so strong that I have to work hard at realizing non-literates live in and sense worlds not open to literates. Not less complex, not poorer, just different.

But book lists in particular appeal to that prejudice and since my blog is read by literates, I’m indulging that prejudice now.

I do have a title to add to the list: Practical Data Science with R by Nina Zumel and John Mount.

Judging from the other titles listed, Practical Data Science with R falls in the intermediate range. Should not be your first R book but certainly high on the list for your second R book.

Avoid the rush! Start working on your Amazon wish list today! 😉

Comments Off

November 19, 2016

Python Data Science Handbook

Filed under: Data Science,Programming,Python — Patrick Durusau @ 5:27 pm

Python Data Science Handbook (Github)

From the webpage:

Jupyter notebook content for my OReilly book, the Python Data Science Handbook.

See also the free companion project, A Whirlwind Tour of Python: a fast-paced introduction to the Python language aimed at researchers and scientists.

This repository will contain the full listing of IPython notebooks used to create the book, including all text and code. I am currently editing these, and will post them as I make my way through. See the content here:
…

Enjoy!

Comments Off

November 2, 2016

How To Use Twitter to Learn Data Science (or anything)

Filed under: Data Science,Twitter — Patrick Durusau @ 7:55 pm

How To Use Twitter to Learn Data Science (or anything) by Data Science Renee.

Judging from the date on the post (May 2016), Renee’s enthusiasm for Twitter came before her recently breaking 10,000 followers on Twitter. (Congratulations!)

The one thing I don’t see Renee mentioning is the use of your own Twitter account to gain experience with a whole range of data mining tools.

Your Twitter feed will quickly out-strip your ability to “keep up,” so how do you propose to deal with that problem?

Renee suggests limiting examination of your timeline (in part), but have you considered using machine learning to assist you?

Or visualizing your areas of interests or people that you follow?

Indexing resources pointed to in tweets?

NLP processing of tweets?

Every tool of data science that you will be using for clients is relevant to your own Twitter feed.

What better way to learn tools than using them on content that interests you?

Enjoy!

BTW, follow Data Science Renee for a broad range of data science tools and topics!

Comments Off

October 30, 2016

How To Use Data Science To Write And Sell More Books (Training Amazon)

Filed under: Books,Data Science — Patrick Durusau @ 12:50 pm

From the description:

Chris Fox is the bestselling author of science fiction and dark fantasy, as well as non-fiction books for authors including Write to Market, 5000 words per hour and today we’re talking about his next book, Six Figure Author: Using data to sell books.

Show Notes What Amazon data science, and machine learning, are and how authors can use them. How Amazon differs from the other online book retailers and how authors can train Amazon to sell more books. What to look for to find a voracious readership. Strategically writing to market and how to know what readers are looking for. On Amazon ads and when they are useful. Tips on writing faster. The future of writing, including virtual reality and AI help with story.

Joanna Penn of The Creative Penn interviews Chris Fox

Some of the highlights:

Training Amazon To Work For You

…What you want to do is figure out, with as much accuracy as possible, who your target audience is.

And when you start selling your book, the number of sales is not nearly as important as who you sell your book to, because each of those sales to Amazon represents a customer profile.

If you can convince them that people who voraciously read in your genre are going to love this book and you sell a couple of hundred copies to people like that, Amazon’s going to take it and run with it. You’ve now successfully trained them about who your audience is because you used good data and now they’re able to easily sell your book.

If, on the other hand, you and your mom buys a copy and your friend at the coffee shop buys a copy, and people who aren’t necessarily into that genre are all buying it, Amazon gets really lost and confused.
…

Easier said than done but how’s that for taking advantage of someone else’s machine learning?

Chris also has tips for not “polluting” your Amazon sales data.

Discovering and Writing to a Market

…
How do you find a sub-category or a smaller niche within the Amazon ecosystem? What are the things to look for in order to find a voracious readership?

Chris: What I do is I start looking at the rankings of the number 1, the number 20, 40, 60, 80 and 100 books. You can tell based on where those books are ranked, how many books in the genre are selling. If the number one book is ranked in the top 100 in the store and so is the 20th book, then you’ve found one of the hottest genres on Amazon.

If you find that by the time you get down to number 40, the rank is dropping off sharply, that suggests that not enough books are being produced in that genre and it might be a great place for you to jump in and make a name for yourself. (emphasis in original)
…

I know, I know, this is a tough one. Especially for me.

As I have pointed out here on multiple occasions, “terrorism” is largely a fiction of both government and media.

However, if you look at the top 100 paid sellers on terrorism at Amazon, the top fifty (50) don’t have a single title that looks like it denies terrorism is a problem.

🙁

Which I take to mean, in terms of selling books, services, or data, the terrorism is coming for us all gravy train is the profitable line.

Or at least to indulge in analysis on the basis of “…if the threat of terrorism is real…” and let readers supply their own answers to that question.

There are other valuable tips and asides, so watch the video or read the transcript: How To Use Data Science To Write And Sell More Books With Chris Fox.

PS: As of today, there are 292 podcasts by Jonna Penn.

Comments Off

October 23, 2016

October 18, 2016

Threatening the President: A Signal/Noise Problem

Filed under: Data Science,Information Science — Patrick Durusau @ 6:27 pm

https://www.youtube.com/watch?v=SoQTaM50tKg

Even if you can’t remember why the pointy end of a pencil is important, you too can create national news.

This bit of noise reminded me of an incident when I was in high school where some similar type person bragged in a local bar about assassinating then President Nixon*. Was arrested and sentenced to several years in prison.

At the time I puzzled briefly over the waste of time and effort in such a prosecution and then promptly forgot it.

Until this incident with the overly “clever” Trump supporter.

To get us off on the same foot:

18 U.S. Code § 871 – Threats against President and successors to the Presidency

(a) Whoever knowingly and willfully deposits for conveyance in the mail or for a delivery from any post office or by any letter carrier any letter, paper, writing, print, missive, or document containing any threat to take the life of, to kidnap, or to inflict bodily harm upon the President of the United States, the President-elect, the Vice President or other officer next in the order of succession to the office of President of the United States, or the Vice President-elect, or knowingly and willfully otherwise makes any such threat against the President, President-elect, Vice President or other officer next in the order of succession to the office of President, or Vice President-elect, shall be fined under this title or imprisoned not more than five years, or both.

(b) The terms “President-elect” and “Vice President-elect” as used in this section shall mean such persons as are the apparent successful candidates for the offices of President and Vice President, respectively, as ascertained from the results of the general elections held to determine the electors of President and Vice President in accordance with title 3, United States Code, sections 1 and 2. The phrase “other officer next in the order of succession to the office of President” as used in this section shall mean the person next in the order of succession to act as President in accordance with title 3, United States Code, sections 19 and 20.

Commonplace threatening letters, calls, etc., aren’t documented for the public but President Barack Obama has a Wikipedia page devoted to the more significant ones: Assassination threats against Barack Obama.

Just as no one knows you are a dog on the internet, no one can tell by looking at a threat online if you are still learning how to use a pencil or are a more serious opponent.

Leaving to one side that a truly serious opponent allows actions to announce their presence or goal.

The treatment of even idle bar threats as serious is an attempt to improve the signal-to-noise ratio:

…

In analog and digital communications, signal-to-noise ratio, often written S/N or SNR, is a measure of signal strength relative to background noise. The ratio is usually measured in decibels (dB) using a signal-to-noise ratio formula. If the incoming signal strength in microvolts is V_s, and the noise level, also in microvolts, is V_n, then the signal-to-noise ratio, S/N, in decibels is given by the formula: S/N = 20 log₁₀(V_s/V_n)

If V_s = V_n, then S/N = 0. In this situation, the signal borders on unreadable, because the noise level severely competes with it. In digital communications, this will probably cause a reduction in data speed because of frequent errors that require the source (transmitting) computer or terminal to resend some packets of data.

…

I’m guessing the reasoning is the more threats that go unspoken, the less chaff the Secret Service has to winnow in order to uncover viable threats.

One assumes they discard physical mail with return addresses of prisons, mental hospitals, etc., or at most request notice of the release of such people from state custody.

Beyond that, they don’t appear to be too picky about credible threats, noting that in one case an unspecified “death ray” was going to be used against President Obama.

The EuroNews description of that case must be shared:

Two American men have been arrested and charged with building a remote-controlled X-ray machine intended for killing Muslims and other perceived enemies of the U.S.

Following a 15-month investigation launched in April 2012, Glenn Scott Crawford and Eric J. Feight are accused of developing the device, which the FBI has described as “mobile, remotely operated, radiation emitting and capable of killing human targets silently and from a distance with lethal doses of radiation”.
…

Sure, right. I will post a copy of the 67-page complaint, which uses terminology rather loosely, to say the least, in a day or so. Suffice it to say that the defendants never acquired a source for the needed radioactivity production.

On the order of having a complete nuclear bomb but not nuclear material to make it into a nuclear bomb. You would be in more danger from the conventional explosive degrading than the bomb as a nuclear weapon.

Those charged with defending public officials want to deter the making of threats, so as to improve the signal/noise ratio.

The goal of those attacking public officials is a signal/noise ratio of exactly 0.0.

Viewing threats from an information science perspective suggests various strategies for either side. (Another dividend of studying information science.)

*They did find a good picture of Nixon for the White House page. Doesn’t look as much like a weasel as he did in real life. Gimp/Photoshop you think?

Comments Off

October 13, 2016

Becoming a Data Scientist:

Filed under: Data Science — Patrick Durusau @ 7:51 pm

Becoming a Data Scientist: Advice From My Podcast Guests

Out-gassing from political candidates has kept pushing this summary by Renée Teate back in my queue. Well, fixing that today!

René has created more data science resources than I can easily mention so in addition to this guide, I will mention only two:

Data Science Renee @BecomingDataSci, a Twitter account that will soon break into the rarefied air of > 10,000 followers. Not yet, but you may be the one that puts her over the top!

Looking for women to speak at data science conferences? Renée maintains Women in Data Science, which today has 815 members.

Sorry, three, her blog: Becoming a Data Scientist.

That should keep you busy/distracted until the political noise subsides. 😉

Comments Off

October 1, 2016

Data Science Toolbox

Filed under: Data Science,Education,Teaching — Patrick Durusau @ 8:31 pm

Data Science Toolbox

From the webpage:

Start doing data science in minutes

As a data scientist, you don’t want to waste your time installing software. Our goal is to provide a virtual environment that will enable you to start doing data science in a matter of minutes.

As a teacher, author, or organization, making sure that your students, readers, or members have the same software installed is not straightforward. This open source project will enable you to easily create custom software and data bundles for the Data Science Toolbox.

A virtual environment for data science

The Data Science Toolbox is a virtual environment based on Ubuntu Linux that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run the Data Science Toolbox either locally (using VirtualBox and Vagrant) or in the cloud (using Amazon Web Services).

We aim to offer a virtual environment that contains the software that is most commonly used for data science while keeping it as lean as possible. After a fresh install, the Data Science Toolbox contains the following software:

Python, with the following packages: IPython Notebook, NumPy, SciPy, matplotlib, pandas, scikit-learn, and SymPy.

R, with the following packages: ggplot2, plyr, dplyr, lubridate, zoo, forecast, and sqldf.

dst, a command-line tool for installing additional bundles on the Data Science Toolbox (see next section).

Let us know if you want to see something added to the Data Science Toolbox.

…

Great resource for doing or teaching data science!

And an example of using a VM to distribute software in a learning environment.

Comments Off

September 4, 2016

Data Science Series [Starts 9 September 2016 but not for *nix users]

Filed under: Data Science — Patrick Durusau @ 8:20 pm

The BD2K Guide to the Fundamentals of Data Science Series

From the webpage:

…
Every Friday beginning September 9, 2016
9am – 10am Pacific Time

Working jointly with the BD2K Centers-Coordination Center (BD2KCCC) and the NIH Office of Data Science, the BD2K Training Coordinating Center (TCC) is spearheading this virtual lecture series on the data science underlying modern biomedical research. Beginning in September 2016, the seminar series will consist of regularly scheduled weekly webinar presentations covering the basics of data management, representation, computation, statistical inference, data modeling, and other topics relevant to “big data” biomedicine. The seminar series will provide essential training suitable for individuals at all levels of the biomedical community. All video presentations from the seminar series will be streamed for live viewing, recorded, and posted online for future viewing and reference. These videos will also be indexed as part of TCC’s Educational Resource Discovery Index (ERuDIte), shared/mirrored with the BD2KCCC, and with other BD2K resources.

View all archived videos on our YouTube channel:
https://www.youtube.com/channel/UCKIDQOa0JcUd3K9C1TS7FLQ

Please join our weekly meetings from your computer, tablet or smartphone.
https://global.gotomeeting.com/join/786506213
You can also dial in using your phone.
United States +1 (872) 240-3311
Access Code: 786-506-213
First GoToMeeting? Try a test session: http://help.citrix.com/getready

…

Of course, running Ubuntu, when I follow the “First GoToMeeting? Try a test session,” I get this result:

…
OS not supported

…

Long-Term Fix: Upgrade your computer.

You or your IT Admin will need to upgrade your computer’s operating system in order to install our desktop software at a later date.
…

Since this is most likely a lecture format, could just stream the video and use WebConf as a Q/A channel.

Of course, that would mean losing the various technical difficulties, licensing fees, etc., all of which are distractions from the primary goal of the project.

But who wants that?

PS: Most *nix users won’t be interested except to refer others but still, over engineered solutions to simple issues should not be encouraged.

Comments Off

August 29, 2016

DataScience+ (R Tutorials)

Filed under: Data Science,R — Patrick Durusau @ 2:25 pm

DataScience+

From the webpage:

We share R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!

I encountered DataScience+ while running down David Kun’s RDBL post.

As of today, there are 120 tutorials with 451,129 reads.

That’s impressive! Whether you are looking for tutorials or you are looking to post your R tutorial where it will be appreciated.

Enjoy!

Comments Off

August 21, 2016

The Ethics of Data Analytics

Filed under: Data Analysis,Data Science,Ethics,Graphics,Statistics,Visualization — Patrick Durusau @ 4:00 pm

The Ethics of Data Analytics by Kaiser Fung.

Twenty-one slides on ethics by Kaiser Fung, author of: Junk Charts (data visualization blog), and Big Data, Plainly Spoken (comments on media use of statistics).

Fung challenges you to reach your own ethical decisions and acknowledges there are a number of guides to such decision making.

Unfortunately, Fung does not include professional responsibility requirements, such as the now out-dated Canon 7 of the ABA Model Code Of Professional Responsibility:

A Lawyer Should Represent a Client Zealously Within the Bounds of the Law

That canon has a much storied history, which is capably summarized in Whatever Happened To ‘Zealous Advocacy’? by Paul C. Sanders.

In what became known as Queen Caroline’s Case, the House of Lords sought to dissolve the marriage of King George the IV

to Queen Caroline

on the grounds of her adultery. Effectively removing her as queen of England.

Queen Caroline was represented by Lord Brougham, who had evidence of a secret prior marriage by King George the IV to Catholic (which was illegal), Mrs Fitzherbert.

Brougham’s speech is worth your reading in full but the portion most often cited for zealous defense reads as follows:

…
I once before took leave to remind your lordships — which was unnecessary, but there are many whom it may be needful to remind — that an advocate, by the sacred duty of his connection with his client, knows, in the discharge of that office, but one person in the world, that client and none other. To save that client by all expedient means — to protect that client at all hazards and costs to all others, and among others to himself — is the highest and most unquestioned of his duties; and he must not regard the alarm, the suffering, the torment, the destruction, which he may bring upon any other; nay, separating even the duties of a patriot from those of an advocate, he must go on reckless of the consequences, if his fate it should unhappily be, to involve his country in confusion for his client.
…

The name Mrs. Fitzherbert never slips Lord Brougham’s lips but the House of Lords has been warned that may not remain to be the case, should it choose to proceed. The House of Lords did grant the divorce but didn’t enforce it. Saving fact one supposes. Queen Caroline died less than a month after the coronation of George IV.

For data analysis, cybersecurity, or any of the other topics I touch on in this blog, I take the last line of Lord Brougham’s speech:

To save that client by all expedient means — to protect that client at all hazards and costs to all others, and among others to himself — is the highest and most unquestioned of his duties; and he must not regard the alarm, the suffering, the torment, the destruction, which he may bring upon any other; nay, separating even the duties of a patriot from those of an advocate, he must go on reckless of the consequences, if his fate it should unhappily be, to involve his country in confusion for his client.

as the height of professionalism.

Post-engagement of course.

If ethics are your concern, have that discussion with your prospective client before you are hired.

Otherwise, clients have goals and the task of a professional is how to achieve them. Nothing more.

Comments Off

August 19, 2016

Contributing to StackOverflow: How Not to be Intimidated

Filed under: Data Science,Programming — Patrick Durusau @ 12:43 pm

Contributing to StackOverflow: How Not to be Intimidated by Ksenia Coulter.

From the post:

StackOverflow is an essential resource for programmers. Whether you run into a bizarre and scary error message or you’re blanking on something you should know, StackOverflow comes to the rescue. Its popularity with coders spurred many jokes and memes. (Programming to be Officially Renamed “Googling Stackoverflow,” a satirical headline reads).

(image omitted)

While all of us are users of StackOverflow, contributing to this knowledge base can be very intimidating, especially to beginners or to non-traditional coders who many already feel like they don’t belong. The fact that an invisible barrier exists is a bummer because being an active contributor not only can help with your job search and raise your profile, but also make you a better programmer. Explaining technical concepts in an accessible way is difficult. It is also well-established that teaching something solidifies your knowledge of the subject. Answering StackOverflow questions is great practice.

All of the benefits of being an active member of StackOverflow were apparent to me for a while, but I registered an account only this week. Let me walk you t[h]rough thoughts that hindered me. (Chances are, you’ve had them too!)
…

I plead guilty to using StackOverFlow but not contributing back to it.

Another “intimidation” to avoid is thinking you must have the complete and killer answer to any question.

That can and does happen, but don’t wait for a question where you can supply such an answer.

Jump in! (Advice to myself as well as any readers.)

Comments Off

August 17, 2016

Pandas

Filed under: Data Science,Pandas,Python — Patrick Durusau @ 8:19 pm

Pandas by Reuven M. Lerner.

From the post:

Serious practitioners of data science use the full scientific method, starting with a question and a hypothesis, followed by an exploration of the data to determine whether the hypothesis holds up. But in many cases, such as when you aren’t quite sure what your data contains, it helps to perform some exploratory data analysis—just looking around, trying to see if you can find something.

And, that’s what I’m going to cover here, using tools provided by the amazing Python ecosystem for data science, sometimes known as the SciPy stack. It’s hard to overstate the number of people I’ve met in the past year or two who are learning Python specifically for data science needs. Back when I was analyzing data for my PhD dissertation, just two years ago, I was told that Python wasn’t yet mature enough to do the sorts of things I needed, and that I should use the R language instead. I do have to wonder whether the tables have turned by now; the number of contributors and contributions to the SciPy stack is phenomenal, making it a more compelling platform for data analysis.

In my article “Analyzing Data“, I described how to filter through logfiles, turning them into CSV files containing the information that was of interest. Here, I explain how to import that data into Pandas, which provides an additional layer of flexibility and will let you explore the data in all sorts of ways—including graphically. Although I won’t necessarily reach any amazing conclusions, you’ll at least see how you can import data into Pandas, slice and dice it in various ways, and then produce some basic plots.
…

Of course, scientific articles are written as though questions drop out of the sky and data is interrogated for the answer.

Aside from being rhetoric to badger others with, does anyone really think that is how science operates in fact?

Whether you have delusions about how science works in fact or not, you will find that Pandas will assist you in exploring data.

Comments Off

June 12, 2016

Ten Simple Rules for Effective Statistical Practice

Filed under: Data Science,Science,Statistics — Patrick Durusau @ 7:17 pm

Ten Simple Rules for Effective Statistical Practice by Robert E. Kass, Brian S. Caffo, Marie Davidian, Xiao-Li Meng, Bin Yu, Nancy Reid (Ciation: Kass RE, Caffo BS, Davidian M, Meng X-L, Yu B, Reid N (2016) Ten Simple Rules for Effective Statistical Practice. PLoS Comput Biol 12(6): e1004961. doi:10.1371/journal.pcbi.1004961)

From the post:

Several months ago, Phil Bourne, the initiator and frequent author of the wildly successful and incredibly useful “Ten Simple Rules” series, suggested that some statisticians put together a Ten Simple Rules article related to statistics. (One of the rules for writing a PLOS Ten Simple Rules article is to be Phil Bourne [1]. In lieu of that, we hope effusive praise for Phil will suffice.)
…

I started to copy out the “ten simple rules,” sans the commentary but that would be a disservice to my readers.

Nodding past a ten bullet point listing isn’t going to make your statistics more effective.

Re-write the commentary on all ten rules to apply them to every project. The focusing of the rules on your work will result in specific advice and examples for your field.

Who knows? Perhaps you will be writing a ten simple rule article in your specific field, sans Phil Bourne as a co-author. (Do be sure and cite Phil.)

PS: For the curious: Ten Simple Rules for Writing a PLOS Ten Simple Rules Article by Harriet Dashnow, Andrew Lonsdale, Philip E. Bourne.

Comments Off

April 27, 2016

Reboot Your $100+ Million F-35 Stealth Jet Every 10 Hours Instead of 4 (TM Fusion)

Filed under: Cybersecurity,Data Fusion,Data Integration,Data Science,Topic Maps — Patrick Durusau @ 10:07 am

Pentagon identifies cause of F-35 radar software issue

From the post:

The Pentagon has found the root cause of stability issues with the radar software being tested for the F-35 stealth fighter jet made by Lockheed Martin Corp, U.S. Defense Acquisition Chief Frank Kendall told a congressional hearing on Tuesday.

Last month the Pentagon said the software instability issue meant the sensors had to be restarted once every four hours of flying.

Kendall and Air Force Lieutenant General Christopher Bogdan, the program executive officer for the F-35, told a Senate Armed Service Committee hearing in written testimony that the cause of the problem was the timing of “software messages from the sensors to the main F-35” computer. They added that stability issues had improved to where the sensors only needed to be restarted after more than 10 hours.

“We are cautiously optimistic that these fixes will resolve the current stability problems, but are waiting to see how the software performs in an operational test environment,” the officials said in a written statement.
… (emphasis added)

At $100+ Million plane that requires rebooting every ten hours? I’m not a pilot but that sounds like a real weakness.

The precise nature of the software glitch isn’t described but you can guess one of the problems from Lockheed Martin’s, Software You Wish You Had: Inside the F-35 Supercomputer:

…
The human brain relies on five senses—sight, smell, taste, touch and hearing—to provide the information it needs to analyze and understand the surrounding environment.

Similarly, the F-35 relies on five types of sensors: Electronic Warfare (EW), Radar, Communication, Navigation and Identification (CNI), Electro-Optical Targeting System (EOTS) and the Distributed Aperture System (DAS). The F-35 “brain”—the process that combines this stellar amount of information into an integrated picture of the environment—is known as sensor fusion.

At any given moment, fusion processes large amounts of data from sensors around the aircraft—plus additional information from datalinks with other in-air F-35s—and combines them into a centralized view of activity in the jet’s environment, displayed to the pilot.

In everyday life, you can imagine how useful this software might be—like going out for a jog in your neighborhood and picking up on real-time information about obstacles that lie ahead, changes in traffic patterns that may affect your route, and whether or not you are likely to pass by a friend near the local park.

F-35 fusion not only combines data, but figures out what additional information is needed and automatically tasks sensors to gather it—without the pilot ever having to ask.
… (emphasis added)

The fusion of data from other in-air F-35s is a classic topic map merging of data problem.

You have one subject, say an anti-aircraft missile site, seen from up to four (in the F-35 specs) F-35s. As is the habit of most physical objects, it has only one geographic location but the fusion computer for the F-35 doesn’t come up with than answer.

Kris Osborn writes in Software Glitch Causes F-35 to Incorrectly Detect Targets in Formation:

…
“When you have two, three or four F-35s looking at the same threat, they don’t all see it exactly the same because of the angles that they are looking at and what their sensors pick up,” Bogdan told reporters Tuesday. “When there is a slight difference in what those four airplanes might be seeing, the fusion model can’t decide if it’s one threat or more than one threat. If two airplanes are looking at the same thing, they see it slightly differently because of the physics of it.”

For example, if a group of F-35s detect a single ground threat such as anti-aircraft weaponry, the sensors on the planes may have trouble distinguishing whether it was an isolated threat or several objects, Bogdan explained.

As a result, F-35 engineers are working with Navy experts and academics from John’s Hopkins Applied Physics Laboratory to adjust the sensitivity of the fusion algorithms for the JSF’s 2B software package so that groups of planes can correctly identify or discern threats.

“What we want to have happen is no matter which airplane is picking up the threat – whatever the angles or the sensors – they correctly identify a single threat and then pass that information to all four airplanes so that all four airplanes are looking at the same threat at the same place,” Bogdan said.
…

Unless Bogdan is using “sensitivity” in a very unusual sense, that doesn’t sound like the issue with the fusion computer of the F-35.

Rather the problem is the fusion computer has no explicit doctrine of subject identity to use when it is merging data from different F-35s, whether it be two, three, four or even more F-35s. The display of tactical information should be seamless to the pilot and without human intervention.

I’m sure members of Congress were impressed with General Bogdan using words like “angles” and “physics,” but the underlying subject identity issue isn’t hard to address.

At issue is the location of a potential target on the ground. Within some pre-defined metric, anything located within a given area is the “same target.”

The Air Force has already paid for this type of analysis and the mathematics of what is called Circular Error Probability (CEP) has been published in Use of Circular Error Probability in Target Detection by William Nelson (1988).

You need to use the “current” location of the detecting aircraft, allowances for inaccuracy in estimating the location of the target, etc., but once you call out the subject identity as an issue, its a matter of making choices of how accurate you want the subject identification to be.

Before you forward this to Gen. Bogdan as a way forward on the fusion computer, realize that CEP is only one aspect of target identification. But, calling the subject identity of targets out explicitly, enables reliable presentation of single/multiple targets to pilots.

Your call, confusing displays or a reliable, useful display.

PS: I assume military subject identity systems would not be running XTM software. Same principles apply even if the syntax is different.

Comments Off

April 25, 2016

Women in Data Science (~632) – Twitter List

Filed under: Data Science,Twitter — Patrick Durusau @ 10:07 am

Data Science Renee has a twitter list of approximately 632 women in data science.

I say “approximately” because when I first saw her post about the list it had 630 members. When I looked this AM, it had 632 members. By the time you look, that number will be different again.

If you are making a conscious effort to seek a diversity of speakers for your next data science conference, it should be on your list of sources.

Enjoy!

Comments Off

April 11, 2016

4330 Data Scientists and No Data Science Renee

Filed under: Data Science,Web Scrapers — Patrick Durusau @ 4:22 pm

After I posted 1880 Big Data Influencers in CSV File, I got a tweet from Data Science Renee pointing out that her name wasn’t in the list.

Renee does a lot more on “data science” and not so much on “big data,” which sounded like a plausible explanation.

Even if “plausible,” I wanted to know if there was some issue with my scrapping of Right Relevance.

Knowing that Renee’s influence score for “data science” is 81, I set the query to scrape the list between 65 and 98, just to account for any oddities in being listed.

The search returned 1832 entries. Search for Renee, nada, no got. Here’s the 1832-data-science-list.

In an effort to scrape all the listings, which should be 10,375 influencers, I set the page delay up to Ted Cruz reading speed. Ten entries every 72,000 milliseconds. 😉

That resulted in 4330-data-science-list.

No joy, no Renee!

It isn’t clear to me why my scraping fails before recovering the entire data set but in any reasonable sort order, a listing of roughly 10K data scientists should have Renee in the first 100 entries, much less the first 1,000 or even first 4K.

Something is clearly amiss with the data but what?

Check me on the first ten entries for data science as the search term but I find:

Hilary Mason
Kirk Borne – no data science
Nathan Yau
Gregory Piatetsky – no data science
Randy Olson
Jeff Hammerbacher – no data science
Chris Dixon @cdixon – no data science
dj patil @dpatil
Doug Laney – no data science
Big Data Science no data science

The notation, “no data science,” means that entry does not have a label for data science. Odd considering that my search was specifically for influencers in “data science.” The same result obtains if you choose one of the labels instead of searching. (I tried.)

Clearly all of these people could be listed for “data science,” but if I am searching for that specific category, why is that missing from six of the first ten “hits?”

As far as Data Science Renee, I can help you with that to a degree. Follow @BecomingDataSci, or @DataSciGuide, @DataSciLearning & @NewDataSciJobs. Visit her website: http://t.co/zv9NrlxdHO. Podcasts, interviews, posts, just a hive of activity.

On the mysteries of Right Relevance and its data I’m not sure what to say. I posted feedback a week ago mentioning the issue with scraping and ordering, but haven’t heard back.

The site has a very clever idea but looking in from the outside with a sample size of 1, I’m not impressed with its delivery on that idea.

Issues I don’t know about with Web Scraper?

If you have contacts with Right Relevance could you gently ping them for me? Thanks!

Comments Off

April 7, 2016

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy

Filed under: BigData,Data Science,Ethics,History,Mathematics — Patrick Durusau @ 9:19 pm

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil.

$math-weapons$

From the description at Amazon:

We live in the age of the algorithm. Increasingly, the decisions that affect our lives—where we go to school, whether we get a car loan, how much we pay for health insurance—are being made not by humans, but by mathematical models. In theory, this should lead to greater fairness: Everyone is judged according to the same rules, and bias is eliminated. But as Cathy O’Neil reveals in this shocking book, the opposite is true. The models being used today are opaque, unregulated, and uncontestable, even when they’re wrong. Most troubling, they reinforce discrimination: If a poor student can’t get a loan because a lending model deems him too risky (by virtue of his race or neighborhood), he’s then cut off from the kind of education that could pull him out of poverty, and a vicious spiral ensues. Models are propping up the lucky and punishing the downtrodden, creating a “toxic cocktail for democracy.” Welcome to the dark side of Big Data.

Tracing the arc of a person’s life, from college to retirement, O’Neil exposes the black box models that shape our future, both as individuals and as a society. Models that score teachers and students, sort resumes, grant (or deny) loans, evaluate workers, target voters, set parole, and monitor our health—all have pernicious feedback loops. They don’t simply describe reality, as proponents claim, they change reality, by expanding or limiting the opportunities people have. O’Neil calls on modelers to take more responsibility for how their algorithms are being used. But in the end, it’s up to us to become more savvy about the models that govern our lives. This important book empowers us to ask the tough questions, uncover the truth, and demand change.

Even if you have qualms about Cathy’s position, you have to admit that is a great book cover!

When I was in law school, I had F. Hodge O’Neal for corporation law. He is the O’Neal in O’Neal and Thompson’s Oppression of Minority Shareholders and LLC Members, Rev. 2d.

The publisher’s blurb is rather generous in saying:

Cited extensively, O’Neal and Thompson’s Oppression of Minority Shareholders and LLC Members shows how to take appropriate steps to protect minority shareholder interests using remedies, tactics, and maneuvers sanctioned by federal law. It clarifies the underlying cause of squeeze-outs and suggests proven arrangements for avoiding them.

You could read Oppression of Minority Shareholders and LLC Members that way but when corporate law is taught with war stories from the antics of the robber barons forward, you get the impression that isn’t why people read it.

Not that I doubt Cathy’s sincerity, on the contrary, I think she is very sincere about her warnings.

Where I disagree with Cathy is in thinking democracy is under greater attack now or that inequality is any greater problem than before.

If you read The Half Has Never Been Told: Slavery and the Making of American Capitalism by Edward E. Baptist:

carefully, you will leave it with deep uncertainty about the relationship of American government, federal, state and local to any recognizable concept of democracy. Or for that matter to the “equality” of its citizens.

Unlike Cathy as well, I don’t expect that shaming people is going to result in “better” or more “honest” data analysis.

What you can do is arm yourself to do battle on behalf of your “side,” both in terms of exposing data manipulation by others and concealing your own.

Perhaps there is room in the marketplace for a book titled: Suppression of Unfavorable Data. More than hiding data, what data to not collect? How to explain non-collection/loss? How to collect data in the least useful ways?

You would have to write it as a how to avoid these very bad practices but everyone would know what you meant. Could be the next business management best seller.

Comments Off

March 1, 2016

Avoid “Complete,” “Data Science,” in Titles

Filed under: Data Science — Patrick Durusau @ 10:06 pm

A Complete Tutorial to learn Data Science in R from Scratch by Manish Saraswat.

This is a useful tutorial but it isn’t:

Complete
Does NOT cover all of Data Science

But, this tutorial was tweeted and has been retweeted at least seven times that I know of, possibly more.

Using vague and/or inaccurate terms in titles makes tutorials more difficult to find.

That alone should be reason enough to use better titles.

A more accurate title would be:

R for Predictive Modeling, From Installation to Modeling

That captures the use of R, that the main focus is on predictive modeling and that it will start with the installation of R and proceed to modeling.

Not a word said about all of “data science,” or being “complete,” whatever that means in a discipline with daily advances on multiple fronts.

Just a little effort on the part of authors could improve the lives of all of us desperately searching to find their work.

Yes?

Comments Off

February 21, 2016

Streaming 101 & 102 – [Stream Processing with Batch Identities?]

Filed under: Data Science,Data Streams,Stream Analytics,Streams — Patrick Durusau @ 11:09 am

The world beyond batch: Streaming 101 by Tyler Akidau.

From part 1:

Streaming data processing is a big deal in big data these days, and for good reasons. Amongst them:

Businesses crave ever more timely data, and switching to streaming is a good way to achieve lower latency.

The massive, unbounded data sets that are increasingly common in modern business are more easily tamed using a system designed for such never-ending volumes of data.

Processing data as they arrive spreads workloads out more evenly over time, yielding more consistent and predictable consumption of resources.

Despite this business-driven surge of interest in streaming, the majority of streaming systems in existence remain relatively immature compared to their batch brethren, which has resulted in a lot of exciting, active development in the space recently.

…

Since I have quite a bit to cover, I’ll be splitting this across two separate posts:

Streaming 101: This first post will cover some basic background information and clarify some terminology before diving into details about time domains and a high-level overview of common approaches to data processing, both batch and streaming.

The Dataflow Model: The second post will consist primarily of a whirlwind tour of the unified batch + streaming model used by Cloud Dataflow, facilitated by a concrete example applied across a diverse set of use cases. After that, I’ll conclude with a brief semantic comparison of existing batch and streaming systems.

The world beyond batch: Streaming 102

…

In this post, I want to focus further on the data-processing patterns from last time, but in more detail, and within the context of concrete examples. The arc of this post will traverse two major sections:

Streaming 101 Redux: A brief stroll back through the concepts introduced in Streaming 101, with the addition of a running example to highlight the points being made.

Streaming 102: The companion piece to Streaming 101, detailing additional concepts that are important when dealing with unbounded data, with continued use of the concrete example as a vehicle for explaining them.

By the time we’re finished, we’ll have covered what I consider to be the core set of principles and concepts required for robust out-of-order data processing; these are the tools for reasoning about time that truly get you beyond classic batch processing.

…

You should also catch the paper by Tyler and others, The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing.

Cloud Dataflow, known as Beam at the Apache incubator, offers a variety of operations for combining and/or merging collections of values in data.

I mention that because I would hate to hear of you doing stream processing with batch identities. You know, where you decide on some fixed set of terms and those are applied across dynamic data.

Hmmm, fixed terms applied to dynamic data. Doesn’t even sound right does it?

Sometimes, fixed terms (read schema, ontology) are fine but in linguistically diverse environments (read real life), that isn’t always adequate.

Enjoy the benefits of stream processing but don’t artificially limit them with batch identities.

I first saw this in a tweet by Bob DuCharme.

Comments Off

February 15, 2016

People NOT Technology Produce Data ROI

Filed under: BigData,Data,Data Science,Data Silos — Patrick Durusau @ 4:00 pm

Too many tools… not enough carpenters! by Nicholas Hartman.

From the webpage:

Don’t let your enterprise make the expensive mistake of thinking that buying tons of proprietary tools will solve your data analytics challenges.

tl;dr = The enterprise needs to invest in core data science skills, not proprietary tools.

Most of the world’s largest corporations are flush with data, but frequently still struggle to achieve the vast performance increases promised by the hype around so called “big data.” It’s not that the excitement around the potential of harvesting all that data was unwarranted, but rather these companies are finding that translating data into information and ultimately tangible value can be hard… really hard.

In your typical new tech-based startup the entire computing ecosystem was likely built from day one around the need to generate, store, analyze and create value from data. That ecosystem was also likely backed from day one with a team of qualified data scientists. Such ecosystems spawned a wave of new data science technologies that have since been productized into tools for sale. Backed by mind-blowingly large sums of VC cash many of these tools have set their eyes on the large enterprise market. A nice landscape of such tools was recently prepared by Matt Turck of FirstMark Capital (host of Data Driven NYC, one of the best data science meetups around).

Consumers stopped paying money for software a long time ago (they now mostly let the advertisers pay for the product). If you want to make serious money in pure software these days you have to sell to the enterprise. Large corporations still spend billions and billions every year on software and data science is one of the hottest areas in tech right now, so selling software for crunching data should be a no-brainer! Not so fast.

The problem is, the enterprise data environment is often nothing like that found within your typical 3-year-old startup. Data can be strewn across hundreds or thousands of systems that don’t talk to each other. Devices like mainframes are still common. Vast quantities of data are generated and stored within these companies, but until recently nobody ever really envisioned ever accessing — let alone analyzing — these archived records. Often, it’s not initially even clear how the all data generated by these systems directly relates to a large blue chip’s core business operations. It does, but a lack of in-house data scientists means that nobody is entirely even sure what data is really there or how it can be leveraged.
…

I would delete “proprietary” from the above because non-proprietary tools create data problems just as easily.

Thus I would re-write the second quote as:

Tools won’t replace skilled talent, and skilled talent doesn’t typically need many particular tools.

I substituted “particular” tools to avoid religious questions about particular non-proprietary tools.

Understanding data, recognizing where data integration is profitable and where it is a dead loss, creating tests to measure potential ROI, etc., are all tasks of a human data analyst and not any proprietary or non-proprietary tool.

That all enterprise data has some intrinsic value that can be extracted if it were only accessible is an article of religious faith, not business ROI.

If you want business ROI from data, start with human analysts and not the latest buzzwords in technological tools.

Comments Off

February 9, 2016

Agile Data Science [Free Download]

Filed under: Data Science,Hadoop — Patrick Durusau @ 8:44 pm

Agile Data Science by Russell Jurney.

From the preface:

I wrote this book to get over a failed project and to ensure that others do not repeat my mistakes. In this book, I draw from and reflect upon my experience building analytics applications at two Hadoop shops.

Agile Data Science has three goals: to provide a how-to guide for building analyticsapplications with big data using Hadoop; to help teams collaborate on big data projectsin an agile manner; and to give structure to the practice of applying Agile Big Data analytics in a way that advances the field.

From 2013 and data science has moved quite a bit in the meantime but the principles Russell illustrates remain sound and people do still use Hadoop.

Depending on what you gave up for Lent, you should have enough non-work time to work through Agile Data Science by the end of Lent.

Maybe this year you will have something to show for the forty days of Lent. 😉

Comments Off

February 8, 2016

The Danger of Ad Hoc Data Silos – Discrediting Government Experts

Filed under: Data Science,Data Silos,Government — Patrick Durusau @ 8:48 am

This Canadian Lab Spent 20 Years Ruining Lives by Tess Owen.

From the post:

Four years ago, Yvonne Marchand lost custody of her daughter.

Even though child services found no proof that she was a negligent parent, that didn’t count for much against the overwhelmingly positive results from a hair test. The lab results said she was abusing alcohol on a regular basis and in enormous quantities.

The test results had all the trappings of credible forensic science, and was presented by a technician from the Motherisk Drug Testing Laboratory at Toronto’s Sick Kids Hospital, Canada’s foremost children’s hospital.

“I told them they were wrong, but they didn’t believe me. Nobody would listen,” Marchand recalls.

Motherisk hair test results indicated that Marchand had been downing 48 drinks a day, for 90 days. “If you do the math, I would have died drinking that much” Marchand says. “There’s no way I could function.”

The court disagreed, and determined Marchand was unfit to have custody of her daughter.

…

Some parents, like Marchand, pursued additional hair tests from independent labs in a bid to fight their cases. Marchand’s second test showed up as negative. But, because the lab technician couldn’t testify as an expert witness, the second test was thrown out by the court.

Marchand says the entire process was very frustrating. She says someone should have noticed a pattern when parents repeatedly presented hair test results from independent labs which completely contradicted Motherisk results. Alarm bells should have gone off sooner.

…

Tess’ post and a 366-page report make it clear that Motherisk has impaired the fairness of a large number of child-protection service cases.

Child services, the courts, state representatives, the only one would would have been aware of contradictions of Motherisk results over multiple cases, had not interest in “connecting the dots.”

Each case, with each attorney, was an ad hoc data silo that could not present the pattern necessary to challenge the systematic poor science from Motherisk.

The point is that not all data silos are in big data or nation-state sized intelligence services. Data silos can and do regularly have tragic impact upon ordinary citizens.

Privacy would be an issue but mechanisms need to be developed where lawyers and other advocates can share notice of contradiction of state agencies so that patterns such as by Motherisk can be discovered, documented and hopefully ended sooner rather than later.

BTW, there is an obvious explanation for why:

“No forensic toxicology laboratory in the world uses ELISA testing the way [Motherisk] did.”

Child services did not send hair samples to Motherisk to decide whether or not to bring proceedings.

Child services had already decided to remove children and sent hair samples to Motherisk to bolster their case.

How bright did Motherisk need to be to realize that positive results were expected outcome?

Does your local defense bar collect data on police/state forensic experts and their results?

Looking for suggestions?

Comments Off

February 6, 2016

Clojure for Data Science [Caution: Danger of Buyer’s Regret]

Filed under: Clojure,Data Science,Functional Programming,Programming — Patrick Durusau @ 10:15 pm

Clojure for Data Science by Mike Anderson.

From the webpage:

Presentation given at the Jan 2016 Singapore Clojure Users’ Group

You will have to work at the presentation because there is no accompanying video, but the effort will be well spent.

Before you review these slides or pass them onto others, take fair warning that you may experience “buyer’s regret” with regard to your current programming language/paradigm (if not already Clojure).

However powerful and shiny your present language seems now, its luster will be dimmed after scanning over this slides.

Don’t say you weren’t warned ahead of time!

BTW, if you search for “clojure for data science” (with the quotes) you will find among other things:

Clojure for Data Science Progressing by Henry Garner (Packt)

Repositories for the Clojure for Data Science Processing book.

@cljds Clojure Data Science twitter feed (Henry Garner). VG!

Clojure for Data Science Some 151 slides by Henry Garner.

Plus:

Planet Clojure, a metablog that collects posts from other Clojure blogs.

As a close friend says from time to time, “clojure for data science,”

“G*****s well.” 😉

Enjoy!

Comments Off

February 4, 2016

The Ethical Data Scientist

Filed under: Data Science,Ethics — Patrick Durusau @ 7:42 pm

The Ethical Data Scientist by Cathy O’Neil.

From the post:

….
After the financial crisis, there was a short-lived moment of opportunity to accept responsibility for mistakes with the financial community. One of the more promising pushes in this direction was when quant and writer Emanuel Derman and his colleague Paul Wilmott wrote the Modeler’s Hippocratic Oath, which nicely sums up the list of responsibilities any modeler should be aware of upon taking on the job title.

The ethical data scientist would strive to improve the world, not repeat it. That would mean deploying tools to explicitly construct fair processes. As long as our world is not perfect, and as long as data is being collected on that world, we will not be building models that are improvements on our past unless we specifically set out to do so.

At the very least it would require us to build an auditing system for algorithms. This would be not unlike the modern sociological experiment in which job applications sent to various workplaces differ only by the race of the applicant—are black job seekers unfairly turned away? That same kind of experiment can be done directly to algorithms; see the work of Latanya Sweeney, who ran experiments to look into possible racist Google ad results. It can even be done transparently and repeatedly, and in this way the algorithm itself can be tested.

The ethics around algorithms is a topic that lives only partly in a technical realm, of course. A data scientist doesn’t have to be an expert on the social impact of algorithms; instead, she should see herself as a facilitator of ethical conversations and a translator of the resulting ethical decisions into formal code. In other words, she wouldn’t make all the ethical choices herself, but rather raise the questions with a larger and hopefully receptive group.
…

First, the link for the Modeler’s Hippocratic Oath takes you to a splash page at Wiley for Derman’s book: My Life as a Quant: Reflections on Physics and Finance.

The Financial Modelers’ Manifesto (PDF) and The Financial Modelers’ Manifesto (HTML), are valid links as of today.

I commend the entire text of The Financial Modelers’ Manifesto to you for repeated reading but for present purposes, let’s look at the Modelers’ Hippocratic Oath:

~ I will remember that I didn’t make the world, and it doesn’t satisfy my equations.

~ Though I will use models boldly to estimate value, I will not be overly impressed by mathematics.

~ I will never sacrifice reality for elegance without explaining why I have done so.

~ Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.

~ I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension

It may just be me but I don’t see a charge being laid on data scientists to be the ethical voices in organizations using data science.

Do you see that charge?

To to put it more positively, aren’t other members of the organization, accountants, engineers, lawyers, managers, etc., all equally responsible for spurring “ethical conversations?” Why is this a peculiar responsibility for data scientists?

I take a legal ethics view of the employer – employee/consultant relationship. The client is the ultimate arbiter of the goal and means of a project, once advised of their options.

Their choice may or may not be mine but I haven’t ever been hired to play the role of Jiminy Cricket.

It’s heady stuff to be responsible for bringing ethical insights to the clueless but sometimes the clueless have ethical insights on their on, or not.

Data scientists can and should raise ethical concerns but no more or less than any other member of a project.

As you can tell from reading this blog, I have very strong opinions on a wide variety of subjects. That said, unless a client hires me to promote those opinions, the goals of the client, by any legal means, are my only concern.

PS: Before you ask, no, I would not work for Donald Trump. But that’s not an ethical decision. That’s simply being a good citizen of the world.

Comments Off

January 30, 2016

9 “Laws” for Data Mining [Be Careful With #5]

Filed under: Data Mining,Data Science — Patrick Durusau @ 10:13 pm

9 “Laws” for Data Mining

A Forbes piece on “laws” for data mining, that are equally applicable to data science.

Being Forbes, technology is valuable because it has value for business, not because “everyone is doing it,” “it’s really cool technology,” “it’s a graph,” or “it will bring all humanity to a new plane of existence.”

To be honest, Forbes is a welcome relief some days.

But even Forbes stumbles, as with law #5:

5. There are always patterns: In practice, your data always holds useful information to support decision-making and action.

What? “…your data always holds useful information to support decision-making and action.”

That’s as nutty as the “new plane of existence” stuff.

When I say “nutty,” I mean that in a professional sense. The term apohenia was coined to label the tendency to see meaningful patterns in random data. (Yes, that includes your data.) Apophenia.

The original work described the “…onset of delusional thinking in pyschosis.”

No doubt you will find patterns in your data but that the patterns “…holds useful information to support decision-making and action” isn’t a given.

That is an echo of the near fanatic belief that if businesses used big data, they would be more profitable.

Most of the other “laws” are more plausible than #5, but even there, don’t abandon your judgement even if Forbes says that something is so.

I first saw this in a tweet by Data Science Renee.

Comments Off

January 22, 2016

Improve Your Data Literacy: 16 Blogs to Follow in 2016

Filed under: Data Science — Patrick Durusau @ 2:49 pm

Improve Your Data Literacy: 16 Blogs to Follow in 2016 by Cedric Lombion.

From the post:

Learning data literacy is a never-ending process. Going to workshops and hands-on practice are important, but to really become acquainted with the “culture” of data literacy, you’ll have to do a lot of reading. Don’t worry, we’ve got your back: below is a curated list of 16 blogs to follow in 2016 if you want to: improve your data-visualisation skills; see the best examples of data journalism; discover the methodology behind the best data-driven projects; and pick-up some essential tips for working with data.
…

There are aggregated feeds to add to Feedly but it would have been more convenience to have one collection for all the feeds.

As you add feeds to Feedly or elsewhere, you will quickly find there are more feeds and stories than hours in the day.

The open question is how much data curation is required to make a viable publication? There are lots of lists, some with more or less comments, but what level of detail is required to create a financially viable publication?

Comments Off

December 20, 2015

Data Science Ethics: Who’s Lying to Hillary Clinton?

Filed under: Data Science,Ethics — Patrick Durusau @ 8:19 pm

The usual ethics example for data science involves discrimination against some protected class. Discrimination on race, religion, ethnicity, etc., most if not all of which is already illegal.

That’s not a question of ethics, that’s a question of staying out of jail.

A better ethics example is to ask: Who’s lying to Hillary Clinton about back doors for encryption?

I ask because in the debate on December 19, 2015, Hillary says:

Secretary Clinton, I want to talk about a new terrorist tool used in the Paris attacks, encryption. FBI Director James Comey says terrorists can hold secret communications which law enforcement cannot get to, even with a court order.

You’ve talked a lot about bringing tech leaders and government officials together, but Apple CEO Tim Cook said removing encryption tools from our products altogether would only hurt law-abiding citizens who rely on us to protect their data. So would you force him to give law enforcement a key to encrypted technology by making it law?

CLINTON: I would not want to go to that point. I would hope that, given the extraordinary capacities that the tech community has and the legitimate needs and questions from law enforcement, that there could be a Manhattan-like project, something that would bring the government and the tech communities together to see they’re not adversaries, they’ve got to be partners.

It doesn’t do anybody any good if terrorists can move toward encrypted communication that no law enforcement agency can break into before or after. There must be some way. I don’t know enough about the technology, Martha, to be able to say what it is, but I have a lot of confidence in our tech experts.

And maybe the back door is the wrong door, and I understand what Apple and others are saying about that. But I also understand, when a law enforcement official charged with the responsibility of preventing attacks — to go back to our early questions, how do we prevent attacks — well, if we can’t know what someone is planning, we are going to have to rely on the neighbor or, you know, the member of the mosque or the teacher, somebody to see something.

CLINTON: I just think there’s got to be a way, and I would hope that our tech companies would work with government to figure that out. Otherwise, law enforcement is blind — blind before, blind during, and, unfortunately, in many instances, blind after.

So we always have to balance liberty and security, privacy and safety, but I know that law enforcement needs the tools to keep us safe. And that’s what i hope, there can be some understanding and cooperation to achieve.
…

Who do you think has told Secretary Clinton there is a way to have secure encryption and at the same time enable law enforcement access to encrypted data?

That would be a data scientist or someone posing as a data scientist. Yes?

I assume you have read: Keys Under Doormats: Mandating Insecurity by Requiring Government Access to All Data and Communications by H. Abelson, R. Anderson, S. M. Bellovin, J. Benaloh, M. Blaze, W. Diffie, J. Gilmore, M. Green, S. Landau, P. G. Neumann, R. L. Rivest, J. I. Schiller, B. Schneier, M. Specter, D. J. Weitzner.

Abstract:

Twenty years ago, law enforcement organizations lobbied to require data and communication services to engineer their products to guarantee law enforcement access to all data. After lengthy debate and vigorous predictions of enforcement channels “going dark,” these attempts to regulate security technologies on the emerging Internet were abandoned. In the intervening years, innovation on the Internet flourished, and law enforcement agencies found new and more effective means of accessing vastly larger quantities of data. Today, there are again calls for regulation to mandate the provision of exceptional access mechanisms. In this article, a group of computer scientists and security experts, many of whom participated in a 1997 study of these same topics, has convened to explore the likely effects of imposing extraordinary access mandates.

We have found that the damage that could be caused by law enforcement exceptional access requirements would be even greater today than it would have been 20 years ago. In the wake of the growing economic and social cost of the fundamental insecurity of today’s Internet environment, any proposals that alter the security dynamics online should be approached with caution. Exceptional access would force Internet system developers to reverse “forward secrecy” design practices that seek to minimize the impact on user privacy when systems are breached. The complexity of today’s Internet environment, with millions of apps and globally connected services, means that new law enforcement requirements are likely to introduce unanticipated, hard to detect security flaws. Beyond these and other technical vulnerabilities, the prospect of globally deployed exceptional access systems raises difficult problems about how such an environment would be governed and how to ensure that such systems would respect human rights and the rule of law.

Whether you agree on policy grounds about back doors to encryption or not, is there any factual doubt that back doors to encryption leave users insecure?

That’s an important point because Hillary’s data science advisers should have clued her in that her position is factually false. With or without a “Manhattan Project.”

Here are the ethical questions with regard to Hillary’s position on back doors for encryption:

Did Hillary’s data scientist(s) tell her that access by the government to encrypted data means no security for users?
What ethical obligations do data scientists have to advise public office holders or candidates that their positions are at variance with known facts?
What ethical obligations do data scientists have to caution their clients when they persist in spreading mis-information, in this case about encryption?
What ethical obligations do data scientists have to expose their reports to a client outlining why the client’s public position is factually false?

Many people will differ on the policy question of access to encrypted data but that access to encrypted data weakens the protection for all users is beyond reasonable doubt.

If data scientists want to debate ethics, at least make it about an issue with consequences. Especially for the data scientists.

Questions with no risk aren’t ethics questions, they are parlor entertainment games.

PS: Is there an ethical data scientist in the Hillary Clinton campaign?

Comments (1)

December 16, 2015

20 Big Data Repositories You Should Check Out [Data Source Checking?]

Filed under: BigData,Data,Data Science — Patrick Durusau @ 11:43 am

20 Big Data Repositories You Should Check Out by Vincent Granville.

Vincent lists some additional sources along with a link to Bernard Marr’s original selection.

One of the issues with such lists is that they are rarely maintained.

For example, Bernard listed:

Topsy http://topsy.com/

Free, comprehensive social media data is hard to come by – after all their data is what generates profits for the big players (Facebook, Twitter etc) so they don’t want to give it away. However Topsy provides a searchable database of public tweets going back to 2006 as well as several tools to analyze the conversations.

But if you follow http://topsy.com/, you will find it points to:

Use Search on your iPhone, iPad, or iPod touch

With iOS 9, Search lets you look for content from the web, your contacts, apps, nearby places, and more. Powered by Siri, Search offers suggestions and updates results as you type.

That sucks doesn’t it? Expecting to be able to search public tweets back to 2006, along with analytical tools and what you get is a kiddie guide to search on a malware honeypot.

For a fuller explanation or at least the latest news on Topsy, check out: Apple shuts down Twitter analytics service Topsy by Sam Byford, dated December 16, 2015 (that’s today as I write this post).

So, strike Topsy off your list of big data sources.

Rather than bare lists, what big data needs is a curated list of big data sources that does more than list sources. Those sources need to be broken down to data sets to enable big data searchers to find all the relevant data sets and retrieve only those that remain accessible.

Like “link checking” but for big data resources. Data Source Checking?

That would be the “go to” place for big data sets and as bad as I hate advertising, a high traffic area for advertising to make it cost effective if not profitable.

Comments Off

« Newer Posts — Older Posts »

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 9, 2016

White House sidewalk / Lafayette Park (except North-East Quadrant) – Application 16-0289

Ellipse – Application 17-0001

Misc. Areas – Application 16-0357

National Mall -> Trump Mall – Application 17-0002

Most Other Places – Application 17-0003

Next Steps

November 24, 2016

November 19, 2016

November 2, 2016

October 30, 2016

October 23, 2016

October 18, 2016

October 13, 2016

October 1, 2016

September 4, 2016

August 29, 2016

August 21, 2016

August 19, 2016

August 17, 2016

June 12, 2016

April 27, 2016

April 25, 2016

April 11, 2016

April 7, 2016

March 1, 2016

February 21, 2016

February 15, 2016

February 9, 2016

February 8, 2016

February 6, 2016

February 4, 2016

January 30, 2016

January 22, 2016

December 20, 2015

December 16, 2015