Archive for the ‘Data Science’ Category

Practical advice for analysis of large, complex data sets [IC tl;dr]

Saturday, November 11th, 2017

Practical advice for analysis of large, complex data sets by Patrick Riley.

From the post:

For a number of years, I led the data science team for Google Search logs. We were often asked to make sense of confusing results, measure new phenomena from logged behavior, validate analyses done by others, and interpret metrics of user behavior. Some people seemed to be naturally good at doing this kind of high quality data analysis. These engineers and analysts were often described as “careful” and “methodical”. But what do those adjectives actually mean? What actions earn you these labels?

To answer those questions, I put together a document shared Google-wide which I optimistically and simply titled “Good Data Analysis.” To my surprise, this document has been read more than anything else I’ve done at Google over the last eleven years. Even four years after the last major update, I find that there are multiple Googlers with the document open any time I check.

Why has this document resonated with so many people over time? I think the main reason is that it’s full of specific actions to take, not just abstract ideals. I’ve seen many engineers and analysts pick up these habits and do high quality work with them. I’d like to share the contents of that document in this blog post.

Great post and should be read and re-read until it becomes second nature.

I wave off the intelligence community (IC) with tl;dr because intelligence conclusions are policy and not fact, artifacts.

The best data science practices in the world have no practical application in intelligence circles, unless they support the desired conclusions.

Rather than sully data science, intelligence communities should publish their conclusions and claim the evidence cannot be shared.

Before you leap to defend the intelligence community, recall their lying about mass surveillance of Americans, lying about weapons of mass destruction in Iraq, numerous lies about US activities in Vietnam (before 50K+ Americans and millions of Vietnamese were killed).

The question to ask about American intelligence community reports isn’t whether they are lies (they are), but rather why they are lying?

For those interested in data driven analysis, follow Riley’s advice.

Data Munging with R (MEAP)

Monday, November 6th, 2017

Data Munging with R (MEAP) by Dr. Jonathan Carroll.

From the description:

Data Munging with R shows you how to take raw data and transform it for use in computations, tables, graphs, and more. Whether you already have some programming experience or you’re just a spreadsheet whiz looking for a more powerful data manipulation tool, this book will help you get started. You’ll discover the ins and outs of using the data-oriented R programming language and its many task-specific packages. With dozens of practical examples to follow, learn to fill in missing values, make predictions, and visualize data as graphs. By the time you’re done, you’ll be a master munger, with a robust, reproducible workflow and the skills to use data to strengthen your conclusions!

Five (5) out of eleven (11) parts available now under the Manning Early Access Program (MEAP). Chapter one, Introducing Data and the R Language is free.

Even though everyone writes books from front to back (or at least claim to), it would be nice to see a free “advanced” chapter every now and again. There’s not much you can say about an introductory chapter other than it’s an introductory chapter. That’s no different here.

I suspect you will get a better idea about Dr. Carroll’s writing from his blog, Irregularly Scheduled Programming or by following him on Twitter: @carroll_jono.

Building Data Science with JS – Lifting the Curtain on Game Reviews

Saturday, October 7th, 2017

Building Data Science with JS by Tim Ermilov.

Three videos thus far:

Building Data Science with JS – Part 1 – Introduction

Building Data Science with JS – Part 2 – Microservices

Building Data Science with JS – Part 3 – RabbitMQ and OpenCritic microservice

Tim starts with the observation that the percentage of users assigning a score to a game isn’t very helpful. It tells you nothing about the content of the game and/or the person rating it.

In subject identity terms, each level, mighty, strong, weak, fair, collapses information about the game and a particular reviewer into a single summary subject. OpenCritic then displays the percent of reviewers who are represented by that summary subject.

The problem with the summary subject is that one critic may have down rated the game for poor content, another for sexism and still another for bad graphics. But a user only knows for reasons unknown, a critic whose past behavior is unknown, evaluated unknown content and assigned it a rating.

A user could read all the reviews, study the history of each reviewer, along with the other movies they have evaluated, but Ermilov proposes a more efficient means to peak behind the curtain of game ratings. (part 1)

In part 2, Ermilov designs a microservice based application to extract, process and display game reviews.

If you thought the first two parts were slow, you should enjoy Part 3. 😉 Ermilov speeds through a number of resources, documents, JS libraries, not to mention his source code for the project. You are likely to hit pause during this video.

Some links you will find helpful for Part 3:

AMQP 0-9-1 library and client for Node.JS – Channel-oriented API reference

AMQP 0-9-1 library and client for Node.JS (Github)

https://github.com/BuildingXwithJS

https://github.com/BuildingXwithJS/building-data-science-with-js

Microwork – simple creation of distributed scalable microservices in node.js with RabbitMQ (simplifies use of AMQP)

node-unfluff – Automatically extract body content (and other cool stuff) from an html document

OpenCritic

RabbitMQ. (Recommends looking at the RabbitMQ tutorials.)

Women in Data Science (~1200) – Potential Speaker List

Sunday, September 24th, 2017

When I last posted about Data Science Renee‘s twitter list of women in data science in had ~632 members.

That was in April of 2016.

As of today, the list has 1,203 members! By the time you look, that number will be different again.

I call this a “potential speaker list” because not every member may be interested in your conference or have the time to attend.

Have you made a serious effort to recruit women speakers if you have not consulted this list and others like it?

Serious question.

Do you have a serious answer?

brename – data munging tool

Thursday, August 31st, 2017

brename — a practical cross-platform command-line tool for safely batch renaming files/directories via regular expression

Renaming files is a daily activity when data munging. Wei Shen has created a batch renaming tool with these features:

  • Cross-platform. Supporting Windows, Mac OS X and Linux.
  • Safe. By checking potential conflicts and errors.
  • File filtering. Supporting including and excluding files via regular expression.
    No need to run commands like find ./ -name "*.html" -exec CMD.
  • Renaming submatch with corresponding value via key-value file.
  • Renaming via ascending integer.
  • Recursively renaming both files and directories.
  • Supporting dry run.
  • Colorful output. Screenshots:

Binaries are available for Linux, OS X and Windows, both 32 and 64-bit versions.

Linux has a variety of batch file renaming options but I didn’t see any short-comings in brename that jumped out at me.

You?

HT, Stephen Turner.

Are You Investing in Data Prep or Technology Skills?

Wednesday, August 30th, 2017

Kirk Borne posted for #wisdomwednesday:

New technologies are my weakness.

What about you?

What if we used data driven decision making?

Different result?

John Carlisle Hunts Bad Science (you can too!)

Tuesday, June 6th, 2017

Carlisle’s statistics bombshell names and shames rigged clinical trials by Leonid Schneider.

From the post:

John Carlisle is a British anaesthesiologist, who works in a seaside Torbay Hospital near Exeter, at the English Channel. Despite not being a professor or in academia at all, he is a legend in medical research, because his amazing statistics skills and his fearlessness to use them exposed scientific fraud of several of his esteemed anaesthesiologist colleagues and professors: the retraction record holder Yoshitaka Fujii and his partner Yuhji Saitoh, as well as Scott Reuben and Joachim Boldt. This method needs no access to the original data: the number presented in the published paper suffice to check if they are actually real. Carlisle was fortunate also to have the support of his journal, Anaesthesia, when evidence of data manipulations in their clinical trials was found using his methodology. Now, the editor Carlisle dropped a major bomb by exposing many likely rigged clinical trial publications not only in his own Anaesthesia, but in five more anaesthesiology journals and two “general” ones, the stellar medical research outlets NEJM and JAMA. The clinical trials exposed in the latter for their unrealistic statistics are therefore from various fields of medicine, not just anaesthesiology. The medical publishing scandal caused by Carlisle now is perfect, and the elite journals had no choice but to announce investigations which they even intend to coordinate. Time will show how seriously their effort is meant.

Carlisle’s bombshell paper “Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals” was published today in Anaesthesia, Carlisle 2017, DOI: 10.1111/anae.13962. It is accompanied by an explanatory editorial, Loadsman & McCulloch 2017, doi: 10.1111/anae.13938. A Guardian article written by Stephen Buranyi provides the details. There is also another, earlier editorial in Anaesthesia, which explains Carlisle’s methodology rather well (Pandit, 2012).

… (emphasis in original)

Cutting to the chase, Carlisle found 90 papers with statistical patterns unlikely to occur by chance in 5,087 clinical trials.

There is a wealth of science papers to be investigated, Sarah Boon, in 21st Century Science Overload points out (2016) there are 2.5 million new scientific papers published every year, in 28,100 active scholarly peer-reviewed journals (2014).

Since Carlisle has done eight (8) journals, that leaves ~28,092 for your review. 😉

Happy hunting!

PS: I can easily imagine an exercise along these lines being the final project for a data mining curriculum. You?

Ethics, Data Scientists, Google, Wage Discrimination Against Women

Sunday, May 28th, 2017

Accused of underpaying women, Google says it’s too expensive to get wage data by Sam Levin.

From the post:

Google argued that it was too financially burdensome and logistically challenging to compile and hand over salary records that the government has requested, sparking a strong rebuke from the US Department of Labor (DoL), which has accused the Silicon Valley firm of underpaying women.

Google officials testified in federal court on Friday that it would have to spend up to 500 hours of work and $100,000 to comply with investigators’ ongoing demands for wage data that the DoL believes will help explain why the technology corporation appears to be systematically discriminating against women.

Noting Google’s nearly $28bn annual income as one of the most profitable companies in the US, DoL attorney Ian Eliasoph scoffed at the company’s defense, saying, “Google would be able to absorb the cost as easy as a dry kitchen sponge could absorb a single drop of water.”

Disclosure: I assume Google is resisting disclosure because it has in fact has a history of engaging in discrimination against women. It may or may not be discriminating this month/year, but if known, the facts will support the government’s claim. The $100,000 alleged cost is chump change to prove such a charge groundless. Resistance signals the charge has merit.

Levin’s post gives me reason to doubt Google will prevail on this issue or on the merits in general. Read it in full.

My question is what of the ethical obligations of data scientists at Google?

Should data scientists inside Google come forward with the requested information?

Should data scientists inside Google stage a work slow down to protest Googles’ resistance?

Exactly what should ethical data scientists do when their employer is the 500 pound gorilla in their field?

Do you think Google executives need a memo from their data scientists cluing them in on the ethical issues here?

Possibly not, this is old fashioned gender discrimination.

Google’s resistance signals to all of its mid-level managers that gender based discrimination will be defended.

Does that really qualify for “Don’t be evil?”

DataBASIC

Thursday, February 16th, 2017

DataBASIC

Not for you but an interesting resource for introducing children to working with data.

Includes WordCounter, WTFcsv, SameDiff and ConnectTheDots.

The network template is a csv file with a header, two fields separated by commas.

Pick the right text/examples and you could have a class captivated pretty quickly.

Enjoy!

Missing The Beltway Blockade? Considering Blockading A Ball?

Wednesday, January 11th, 2017

For one reason or another, you may not be able to participate in a Beltway Blockade January 20, 2017, see:

Don’t Panic!

You can still enjoy a non-permitted protest and contribute to the least attended inauguration in history!

2017 Presidential Inaugural Balls

The list is short on location information for many of the scheduled balls but the Commander in Chief’s Ball, Presidential Inaugural Ball, Mid-Atlantic Inauguration Ball, Midwest Inaugural Ball, Western Inaugural Ball, and the Neighborhood Inaugural Ball, are all being held at the: Walter E. Washington Convention Center.

Apologies but I haven’t looked up prior attendance records but just based on known scheduling, disruption in the area of Walter E. Washington Convention Center looks like it will pay the highest returns.

For the balls with location information and/or location information that I can discover, I will post a fuller list with Google Map links tomorrow.

Oh, for inside protesting, here are floor plans of the Walter E. Washington Convention Center.

Those are the official, posted floor plans.

Should that link go dark, let me know. I have a backup copy of them. 😉

The Best And Worst Data Stories Of 2016

Sunday, January 1st, 2017

The Best And Worst Data Stories Of 2016 by Walt Hickey.

From the post:

It’s time once again to dole out FiveThirtyEight’s Data Awards, our annual (OK, we’ve done it once before) chance to honor those who did remarkably good stuff with data, to shame those who did remarkably bad stuff with data, and to acknowledge the key numbers that help describe what went down over the past year. As always, these are based on the considered analysis of an esteemed panel of judges, by which I mean that I pestered people around the FiveThirtyEight offices until they gave me some suggestions.

I had to list this under both data science and humor. 😉

What “…bad stuff with data…” stories do you know and how will you avoid being listed in 2017? (Assuming there is another listing.)

I suspect we learn more from data fail stories than ones that report success.

You?

Enjoy!

Getting Started in Open Source: A Primer for Data Scientists

Saturday, December 31st, 2016

Getting Started in Open Source: A Primer for Data Scientists by Rebecca Bilbro.

From the post:

The phrase "open source” evokes an egalitarian, welcoming niche where programmers can work together towards a common purpose — creating software to be freely available to the public in a community that sees contribution as its own reward. But for data scientists who are just entering into the open source milieu, it can sometimes feel like an intimidating place. Even experienced, established open source developers like Jon Schlinkert have found the community to be less than welcoming at times. If the author of more than a thousand projects, someone whose scripts are downloaded millions of times every month, has to remind himself to stay positive, you might question whether the open source community is really the developer Shangri-la it would appear to be!

And yet, open source development does have a lot going for it:

  • Users have access to both the functionality and the methodology of the software (as opposed to just the functionality, as with proprietary software).
  • Contributors are also users, meaning that contributions track closely with user stories, and are intrinsically (rather than extrinsically) motivated.
  • Everyone has equal access to the code, and no one is excluded from making changes (at least locally).
  • Contributor identities are open to the extent that a contributor wants to take credit for her work.
  • Changes to the code are documented over time.

So why start a blog post for open source noobs with a quotation from an expert like Jon, especially one that paints such a dreary picture? It's because I want to show that the bar for contributing is… pretty low.

Ask yourself these questions: Do you like programming? Enjoy collaborating? Like learning? Appreciate feedback? Do you want to help make a great open source project even better? If your answer is 'yes' to one or more of these, you're probably a good fit for open source. Not a professional programmer? Just getting started with a new programming language? Don't know everything yet? Trust me, you're in good company.

Becoming a contributor to an open source project is a great way to support your own learning, to get more deeply involved in the community, and to share your own unique thoughts and ideas with the world. In this post, we'll provide a walkthrough for data scientists who are interested in getting started in open source — including everything from version control basics to advanced GitHub etiquette.

Two of Rebecca’s points are more important than the rest:

  • the bar for contributing is low
  • contributing builds community and a sense of ownership

Will 2017 be the year you move from the sidelines of open source and into the game?

Data Science, Protests and the Washington Metro – Feasibility

Friday, December 30th, 2016

Steven Nelson writes of plans to block DC traffic:


Protest plans often are overambitious and it’s unclear if there will be enough bodies or sacrificial vehicles to block roadways, or people willing to risk arrest by doing so, though Carrefour says the group has coordinated housing for a large number of out-of-town visitors and believes preliminary signs point to massive turnout.
….(Anti-Trump Activists Plan Road-Blocking ‘Clusterf–k’ for Inauguration)

Looking at a map of the ninety-one (91) Metro rail stations, you may feel discouragement at Steven’s question of “enough bodies or sacrificial vehicles to block roadways….”

www-wmata-com-rail-stations-460

(Screenshot of map from https://www.wmata.com/schedules/maps/, Rail maps selected, 30 December 2016.)

Steve’s question and data science

Steven’s question is a good one and it’s one data science and public data can address.

For a feel of the larger problem of blockading all 91 Metro Rail stations, download and view/print this color map of Metro stations from the Washington Metropolitan Area Transit Authority.

For every station where you don’t see:

metro-parking-460

you will need to move protesters to those locations. As you already know, moving protesters in a coordinated way is a logistical and resource intensive task.

Just so you know, there are forty-three (43) stations with no parking lots.

Data insight: If you look at the Metro Rail map: color map of Metro stations, you will notice that all the stations with parking are located at the outer stations of the Metro.

That’s no accident. The Metro Rail system is designed to move people into and out of the city, which of necessity means, if you block access to the stations with parking lots, you have substantially impeded access into the city.

Armed with that insight, the total of Metro Rail stations to be blocked drops to thirty-eight (38). Not a great number but less than half of the starting 91.

Blocking 38 Metro Rail Stations Still Sounds Like A Lot

You’re right.

Blocking all 38 Metro Rail stations with parking lots is a protest organizer’s pipe dream.

It’s in keeping with seeing themselves as proclaiming “Peace! Land! Bread!” to huddled masses.

Data science and public data won’t help block all 38 stations but it can help with strategic selection of stations based on your resources.

Earlier this year, Dan Malouff posted: All 91 Metro stations, ranked by ridership.

If you put that data into a spreadsheet, eliminate the 43 stations with no parking lots, you can then sort the parking lot stations by their daily ridership.

Moreover, you can keep a running total of the riders in order to calculate the percentage of Metro Rail riders blocked (assuming 100% blockage) as you progress down the list of stations.

The total daily ridership for those stations is 183,535.

You can review my numbers and calculations with a copy of Metro-Rail-Ridership-Station-Percentage.xls

Strategic Choice of Metro Rail Stations

Consider this excerpt from the spreadsheet:

Station Avg. # Count % of Total.
Silver Spring 12269 12269 6.68%
Shady Grove 11732 24001 13.08%
Vienna 10005 34006 18.53%
Fort Totten 7543 41549 22.64%
Wiehle 7306 48855 26.62%
New Carrollton 7209 56064 30.55%
Huntington 7002 63066 34.36%
Franconia-Springfield 6821 69887 38.08%
Anacostia 6799 76686 41.78%
Glenmont 5881 82567 44.99%
Greenbelt 5738 88305 48.11%
Rhode Island Avenue 5727 94032 51.23%
Branch Avenue 5449 99481 54.20%
Takoma 5329 104810 57.11%
Grosvenor 5206 110016 59.94%

The average ridership as reported by Dan Malouff in All 91 Metro stations, ranked by ridership comes to: 652,183. Of course, that includes people who rode from one station to transfer to another one. (I’m investigating ways/data to separate those out.)

As you can see, blocking only the first four stations Silver Spring, Shady Grove, Vienna and Fort Totten, is almost 23% of the traffic from stations with parking lots. It’s not quite 10% of the total ridership on a day but certainly noticeable.

The other important point to notice is that with public data and data science, the problem has been reduced from 91 potential stations to 4.

A reduction of more than an order of magnitude.

Not a bad payoff for using public data and data science.


That’s all I have for you now, but I can promise that deeper analysis of metro DC public data sets reveals event locations that impact both the “beltway” as well as Metro Rail lines.

More on that and maps for the top five (5) locations, a little over 25% of the stations with parking traffic, next week!

If you can’t make it to #DisruptJ20 protests, want to protest early or want to support research on data science and protests, consider a donation.

Disclaimer: I am exploring the potential of data science for planning protests. What you choose to do or not to do and when, is entirely up to you.

neveragain.tech [Or at least not any further]

Friday, December 16th, 2016

neveragain.tech [Or at least not any further]

Write a list of things you would never do. Because it is possible that in the next year, you will do them. —Sarah Kendzior [1]

We, the undersigned, are employees of tech organizations and companies based in the United States. We are engineers, designers, business executives, and others whose jobs include managing or processing data about people. We are choosing to stand in solidarity with Muslim Americans, immigrants, and all people whose lives and livelihoods are threatened by the incoming administration’s proposed data collection policies. We refuse to build a database of people based on their Constitutionally-protected religious beliefs. We refuse to facilitate mass deportations of people the government believes to be undesirable.

We have educated ourselves on the history of threats like these, and on the roles that technology and technologists played in carrying them out. We see how IBM collaborated to digitize and streamline the Holocaust, contributing to the deaths of six million Jews and millions of others. We recall the internment of Japanese Americans during the Second World War. We recognize that mass deportations precipitated the very atrocity the word genocide was created to describe: the murder of 1.5 million Armenians in Turkey. We acknowledge that genocides are not merely a relic of the distant past—among others, Tutsi Rwandans and Bosnian Muslims have been victims in our lifetimes.

Today we stand together to say: not on our watch, and never again.

I signed up but FYI, the databases we are pledging to not build, already exist.

The US Census Bureau collects information on race, religion and national origin.

The Statistical Abstract of the United States: 2012 (131st Edition) Section 1. Population confirms the Census Bureau has this data:

Population tables are grouped by category as follows:

  • Ancestry, Language Spoken At Home
  • Elderly, Racial And Hispanic Origin Population Profiles
  • Estimates And Projections By Age, Sex, Race/Ethnicity
  • Estimates And Projections–States, Metropolitan Areas, Cities
  • Households, Families, Group Quarters
  • Marital status And Living Arrangements
  • Migration
  • National Estimates And Projections
  • Native And Foreign-Born Populations
  • Religion

To be fair, the privacy principles of the Census Bureau state:

Respectful Treatment of Respondents: Are our efforts reasonable and did we treat you with respect?

  • We promise to ensure that any collection of sensitive information from children and other sensitive populations does not violate federal protections for research participants and is done only when it benefits the public good.

Disclosure: I like the US Census Bureau. Left to their own devices, I don’t have any reasonable fear of their mis-using the data in question.

But that’s the question isn’t it? Will the US Census Bureau be left to its own policies and traditions?

I view the various “proposed data collection policies” of the incoming administrations as intentional distractions. While everyone is focused on Trump’s Theater of the Absurd, appointments and policies at the US Census Bureau, may achieve the same ends.

Sign the pledge yes, but use FOIA requests, personal contacts with Census staff, etc., to keep track of the use of dangerous data at the Census Bureau and elsewhere.


Instructions for adding your name to the pledge are found at: https://github.com/neveragaindottech/neveragaindottech.github.io/.

Assume Census Bureau staff are committed to their privacy and appropriate use policies. A friendly approach will be far more productive than a confrontational or suspicious one. Let’s work with them to maintain their agency’s long history of data security.

How To Brick A School Bus, Data Science Helps Park It (Part 2)

Wednesday, December 14th, 2016

Immediate reactions to How To Brick A School Bus, Data Science Helps Park It (Part 1) include:

  • Blocking a public street with a bricked school bus is a crime.
  • Publicly committing a crime isn’t on your bucket list.
  • School buses are expensive.
  • Turning over a school bus is dangerous.

All true and all likely to diminish any enthusiasm for participation.

Bright yellow school buses bricked and blocking transportation routes attract the press like flies to …, well, you know, but may not be your best option.

Alternatives to a Bricked School Bus

Despite the government denying your right to assemble near the inauguration on January 20, 2017 in Washington, D.C., what other rights could lead to a newsworthy result?

You have the right to travel, although the Supreme Court has differed on the constitutional basis for that right. (Constitution of the United States of America: Analysis and Interpretation, 14th Admendment, page 1834, footnote 21).

You also have the right to be inattentive, which I suspect is secured 9th Amendment:

The enumeration in the Constitution, of certain rights, shall not be construed to deny or disparage others retained by the people.

If we put the right to travel together with the right to be inattentive (or negligent), then it stands to reason that your car could run out of gas on the highways normally used to attend an inauguration.

Moreover, we know from past cases, that drivers have not been held to be negligent simply for running out of gas, even at the White House.

Where to Run Out of Gas?

Interesting question and the one that originally had me reaching for historic traffic data.

It does exist, yearly summaries (Virginia), Inrix (Washington, DC), Traffic Volume Maps (District Department of Transportation), and others.

But we don’t want to be like the data scientist who used GPS and satellite data to investigate why you can’t get a taxi in Singapore when it rains. Starting Data Analysis with Assumptions Crunching large amounts of data discovered that taxis in Singapore stop moving when it rains.

Interesting observation but not the answer to the original question. Asking a local taxi driver, it was discovered that draconian traffic liability laws are the reason taxi drivers pull over when it rains. Not a “big data” question at all.

What Do We Know About DC Metro Traffic Congestion?

Let’s review what is commonly known about DC metro traffic congestion:

D.C. tops list of nation’s worst traffic gridlock (2015), Study ranks D.C. traffic 2nd-worst in U.S. (2016), DC Commuters Abandon Metro, Making Already Horrible Traffic Even Worse (metro repairs make traffic far worse).

At the outset, we know that motor vehicle traffic is a chaotic system, so small changes, such as addition impediment of traffic flow by cars running out of gas, can have large effects. Especially on a system that teeters on the edge of gridlock every day.

The loss of Metro usage has a cascading impact on metro traffic (from above). Which means blockage of access to Metro stations will exacerbate the impact of blockages on the highway system.

Time and expense could be spent on overly precise positioning of out-of-gas cars, but a two part directive is just as effective if not more so:

  • Go to Metro stations ingresses.
  • Go to any location on traffic map that is not red.

Here’s a sample traffic map that has traffic cameras:

fox5dc-map-460

From Fox5 DC but it is just one of many.

The use of existing traffic maps removes the need to construct the same and enable chaotic participation, which means you quite innocently ran out of gas and did not at any time contact and/or conspire with others to run out of gas.

Conspiracy is a crime and you should always avoid committing crimes.

General Comments

You may be wondering if authorities being aware of a theoretical discussion of people running out of gas will provoke effective counter measures?

I don’t think so and here’s why: What would be the logical response of an authority? Position more tow trucks? Setup temporary refueling stations?

Do you think the press will be interested in those changes? Such that not only do you have the additional friction of the additional equipment but the press buzzing about asking about the changes?

An authorities best strategy would be to do nothing at all but that advice is rarely taken. At the very best, local authorities will make transportation even more fragile in anticipation someone might run out of gas.

The numbers I hear tossed about as additional visitors, some activities are expecting more than 100,000 (Women’s March on Washington), so even random participation in running out of gas should have a significant impact.

What if they held the inauguration to empty bleachers?

Data Science Traditionalists – Don’t Re-invent the Wheel

Nudging a chaotic traffic system into gridlock, for hours if not more than a day, may not strike you as traditional data science.

Perhaps not but please don’t re-invent the wheel.

If you want to be more precise, perhaps to block particular activities or locations, let me direct you to the Howard University Transportation Safety Data Center.

They have the Traffic Count Database System (TCDS). Two screen shots that don’t do it justice:

tdc1-460

tdc2-460

From their guide to the system:

The Traffic Count Database System (TCDS) module is a powerful tool for the traffic engineer or planner to organize an agency’s traffic count data. It allows you to upload data from a traffic counter; view graphs, lists and reports of historic traffic count data; search for count data using either the database or the Google map; and print or export data to your desktop.

This guide is for users who are new to the TCDS system. It will provide you with the tools to carry out many common tasks. Any features not discussed in this guide are considered advanced features. If you have further questions, feel free to explore the online help guide or to contact the staff at MS2 for assistance.

I have referred to the inauguration of president-elect Donald J. Trump but the same lessons are applicable, with local modifications, to many other locations.

PS: Nothing should be construed as approval and/or encouragement that you break local laws in any venue. Those vary from jurisdiction to jurisdiction and what are acceptable risks and consequences are entirely your decision.

If you do run out of gas in or near Washington, DC on January 20, 2017, be polite to first-responders, including police officers. If you don’t realize your real enemies lie elsewhere, then you too have false class consciousness.

If you are tail-gating on the “Beltway,” offer responders a soft drink (they are on duty) and a hot dog.

Reporting in Aleppo: Can data science help?

Wednesday, December 14th, 2016

Reporting in Aleppo: Can data science help? by Nausicaa Renner. (Columbia Journalism Review)

from the post:

In war zones, reporting is hard to come by. Nowhere is this truer than in Syria, where many international journalists are banned, and more than one hundred journalists have been killed since the war began in early 2011. A deal was made on Tuesday between the Syrian government and the rebels allowing civilians and rebels to evacuate eastern Aleppo, but after years of bloody conflict, clarity is still hard to come by.

Is there a way for data science to give access to understudied war zones? A project at the Center for Spatial Research at Columbia University, partly funded by the Tow Center for Digital Journalism, uses what information we do have to “link eyes in the sky with algorithms and ears on the ground” in Aleppo.

The Center overlaid satellite images from 2012 to 2016 to create a map showing how Aleppo has changed: Destroyed buildings were identified by discrepancies in the images from year to year. Visualization can also put things in perspective; at a seminar the Center held, one student created a map showing how little the front lines of Aleppo have moved—a stark expression of the futility of war.

As of this AM, I saw reports that the ceasefire mentioned in this post failed.

The content is horrific but using the techniques described in The Twitterverse of Donald Trump to harvest Aleppo videos and images could preserve a record of the fall of Aleppo. Would mapping geo-locations to a map of Aleppo help document/confirm reports of atrocities?

Unlike the wall of silence around US military operations, there is a great deal of first-hand data and opportunities for analysis and confirmation. (It’s hard to analyze or confirm a press briefing document.)

Data Science and Protests During the Age of Trump [How To Brick A School Bus…]

Friday, December 9th, 2016

Pre-inauguration suppression of free speech/protests is underway for the Trump regime. (CNN link as subject identifier for Donald J. Trump, even though it fails to mention he looks like a cheeto in a suit.)

Women’s March on Washington barred from Lincoln Memorial by Amber Jamieson and Jessica Glenza.

From the post:


For the thousands hoping to echo the civil rights and anti-Vietnam rallies at Lincoln Memorial by joining the women’s march on Washington the day after Donald Trump’s inauguration: time to readjust your expectations.

The Women’s March won’t be held at the Lincoln Memorial.

That’s because the National Park Service, on behalf of the Presidential Inauguration Committee, filed documents securing large swaths of the national mall and Pennsylvania Avenue, the Washington Monument and the Lincoln Memorial for the inauguration festivities. None of these spots will be open for protesters.

The NPS filed a “massive omnibus blocking permit” for many of Washington DC’s most famous political locations for days and weeks before and after the inauguration on 20 January, said Mara Verheyden-Hilliard, a constitutional rights litigator and the executive director of the Partnership for Civil Justice Fund.

I contacted Amber Jamieson for more details on the permits and she forwarded two links (thanks Amber!):

Press Conference: Mass Protests Will Go Forward During Inauguration, which had the second link she forwarded:

PresidentialInauguralCommittee12052016.pdf, the permit requests made by the National Park Service on behalf of the Presidential Inaugural Committee.

Start with where protests are “permitted” to see what has been lost.

A grim read but 36 CFR 7.96 says in part:


3 (i) White House area. No permit may be issued authorizing demonstrations in the White House area, except for the White House sidewalk, Lafayette Park and the Ellipse. No permit may be issued authorizing special events, except for the Ellipse, and except for annual commemorative wreath-laying ceremonies relating to the statutes in Lafayette Park.

(emphasis added, material hosted by the Legal Information Institute (LII))

Summary: In White House area, protesters have only three places for permits to protest:

  • White House sidewalk
  • Lafayette Park
  • Ellipse

White House sidewalk / Lafayette Park (except North-East Quadrant) – Application 16-0289

Dates:

Set-up dates starting 11/1/2016 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 3/1/2017 11:59 pm

Closes:


All of Lafayette Park except for its northeast quadrant pursuant to 36 CFR 7.96 (g)(4)(iii)(A). The initial areas of Lafayette Park and the White House Sidewalk that will be needed for construction set-up, and which will to be closed to ensure public safety, is detailed in the attached map. The attached map depicts the center portion of the White House Sidewalk as well as a portion of the southern oval of Lafayette Park. The other remaining areas in Lafayette Park and the White House Sidewalk that will be needed for construction set-up, will be closed as construction set-up progresses into these other areas, which will also then be delineated by fencing and sign age to ensure public safety.

Two of the three possible protest sites in the White House closed by Application 16-0289.

Ellipse – Application 17-0001

Dates:

Set-up dates starting 01/6/2017 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 2/17/2017 11:59 pm

These dates are at variance with those for the White House sidewalk and Lafayette Park (shorter).

Closes:

Ellipse, a fitty-two acre park, as depicted by Google Maps:

ellipse-460

Plans for the Ellipse?


Purpose of Activity: In connection with the Presidential Inaugural Ceremonies, this application is for use of the Ellipse by PIC, in the event that PIC seeks its use for Inaugural ceremonies and any necessary staging, which is expected to be:

A) In the event that PIC seeks the use of the Ellipse for pre- and/or post- Inaugural ceremonies, the area will be used for staging the event(s), staging of media to cover and/or broadcast the event, and if possible for ticketed and/or public viewing; and/or ­

B) In the event that PIC seeks the use of the Ellipse for the Inaugural ceremony and Inaugural parade staging, the area will be used to stage the various parade elements, for media to cover and/or broadcast the event, and if possible for ticketed and/or public viewing.

The PIC has no plans to use the Ellipse but has reserved it no doubt to deny its use to others.

Those two applications close three out of three protest sites in the White House area. The PIC went even further to reach out and close off other potential protest sites.

Other permits granted to the PIC include:

Misc. Areas – Application 16-0357

Ten (10) misc. areas identified by attached maps for PIC activities.

Arguably legitimate since the camp followers, sycophants and purveyors of “false news” need somewhere to be during the festivities.

National Mall -> Trump Mall – Application 17-0002

The National Mall will become Trump Mall for the following dates:

Set-up dates starting 01/6/2017 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 1/30/2017 11:59 pm

Closes:


Plan for Proposed Activity: Consistent with NPS regulations at 36 CFR 7.96{g)(4)(iii)(C), this application seeks, in connection with the Presidential Inaugural Ceremonies, the area of the National Mall between 14th – 4th Streets, for the exclusive use of the Joint Task Force Headquarters (JTFHQ) on Inaugural Day for the assembly, staging, security and weather protection of the pre-Inaugural parade components and floats on Inaugural Day between 14th – 7th Streets. It also includes the placement of jumbotrons and sound towers by the Architect of the Capitol or the Joint Congressional Committee on Inaugural Ceremonies so that the Inaugural Ceremony may be observed by the Joint Congressional Committee’s ticketed standing room ticket holders between 4th – 3rd streets and the general public, which will be located on the National Mall between 7th – 4th Streets. Further, a 150-foot by 200-foot area on the National Mall just east of 7th Street, will be for the exclusive use of the Presidential Inaugural Committee for television and radio media broadcasts on Inaugural Day.

In the plans thus far, no mention of the main card or where the ring plus cage will be erected on Trump Mall. (that’s sarcasm, not “fake news”)

Most Other Places – Application 17-0003

If you read 36 CFR 7.96 carefully, you noticed there are places always prohibited to protesters:


(ii) Other park areas. Demonstrations and special events are not allowed in the following other park areas:

(A) The Washington Monument, which means the area enclosed within the inner circle that surrounds the Monument’s base, except for the official annual commemorative Washington birthday ceremony.

(B) The Lincoln Memorial, which means that portion of the park area which is on the same level or above the base of the large marble columns surrounding the structure, and the single series of marble stairs immediately adjacent to and below that level, except for the official annual commemorative Lincoln birthday ceremony.

(C) The Jefferson Memorial, which means the circular portion of the Jefferson Memorial enclosed by the outermost series of columns, and all portions on the same levels or above the base of these columns, except for the official annual commemorative Jefferson birthday ceremony.

(D) The Vietnam Veterans Memorial, except for official annual Memorial Day and Veterans Day commemorative ceremonies.

What about places just outside the already restricted areas?

Dates:

Set-up dates starting 01/6/2017 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 2/10/2017 11:59 pm

Closes:


The Lincoln Memorial area, as more fully detailed as the park area bordered by 23rd Street, Daniel French Drive and Independence Avenue, Henry Bacon Drive and Constitution Avenue, Constitution Avenue between 15th & 23rd Streets, Constitution Gardens to include Area #5 outside of the Vietnam Veteran’s Memorial restricted area, the Lincoln Memorial outside of its restricted area, the Lincoln Memorial Plaza and Reflecting Pool Area, JFK Hockey Field, park area west of Lincoln Memorial between French Drive, Henry Bacon Drive, Parking Lots A, Band C, East and West Potomac Park, Memorial Bridge, Memorial Circle and Memorial Drive, the World War II Memorial. The Washington Monument Grounds as more fully depicted as the park area bounded by 14th & 15th Streets and Madison Drive and Independence Avenue.

Not to use but to prevent its use by others:


Purpose of Activity: In connection with the Presidential Inaugural Ceremonies, this application is for use of the Lincoln Memorial areas and Washington Monument grounds by PIC, in the event that PIC seeks its use for the Inaugural related ceremonies and any necessary staging, which is expected to be:

A) In the event that PIC seeks the use of the Lincoln Memorial areas for a pre-and/or post Inaugural ceremonies, the area will be used for staging the event(s), staging of media to cover and/or broadcast the event, and for ticketed and/or public viewing.

B) In the event that PIC seeks to use the Washington Monument grounds for a public overflow area to view the Inaugural ceremony and/ or parade, the area will be used for the public who will observe the activities through prepositioned jumbotrons and sound towers.

Next Steps

For your amusement, all five applications contain the following question answered No:

Do you have any reason to believe or any information indicating that any individual, group or organization might seek to disrupt the activity for which this application is submitted?

I would venture to say someone hasn’t been listening. 😉

Among the data science questions raised by this background information are:

  • How best to represent these no free speech and/or no free assembly zones on a map?
  • What data sets do you need to make protesters effective under these restrictions?
  • What questions would you ask of those data sets?
  • How to decide between viral/spontaneous action versus publicly known but lawful conduct, up until the point it becomes unlawful?

If you use any of this information, please credit Amber Jamieson, Jessica Glenza and the Partnership for Civil Justice Fund as the primary sources.

See further news from the Partnership for Civil Justice Fund at: Your Right of Resistance.

Tune in next Monday for: How To Brick A School Bus, Data Science Helps Park It.

PS: “The White House Sidewalk is the sidewalk between East and West Executive Avenues, on the south side Pennsylvania Avenue, N.W.” From OMB Control No. 1024-0021 – Application for a Permit to Conduct a Demonstration or Special Event in Park Areas and a Waiver of Numerical Limitations on Demonstrations for White House Sidewalk and/or Lafayette Park

Learning R programming by reading books: A book list

Thursday, November 24th, 2016

Learning R programming by reading books: A book list by Liang-Cheng Zhang.

From the post:

Despite R’s popularity, it is still very daunting to learn R as R has no click-and-point feature like SPSS and learning R usually takes lots of time. No worries! As self-R learner like us, we constantly receive the requests about how to learn R. Besides hiring someone to teach you or paying tuition fees for online courses, our suggestion is that you can also pick up some books that fit your current R programming level. Therefore, in this post, we would like to share some good books that teach you how to learn programming in R based on three levels: elementary, intermediate, and advanced levels. Each level focuses on one task so you will know whether these books fit your needs. While the following books do not necessarily focus on the task we define, you should focus the task when you reading these books so you are not lost in contexts.

Books and reading form the core of my most basic prejudice: Literacy is the doorway to unlimited universes.

A prejudice so strong that I have to work hard at realizing non-literates live in and sense worlds not open to literates. Not less complex, not poorer, just different.

But book lists in particular appeal to that prejudice and since my blog is read by literates, I’m indulging that prejudice now.

I do have a title to add to the list: Practical Data Science with R by Nina Zumel and John Mount.

Judging from the other titles listed, Practical Data Science with R falls in the intermediate range. Should not be your first R book but certainly high on the list for your second R book.

Avoid the rush! Start working on your Amazon wish list today! 😉

Python Data Science Handbook

Saturday, November 19th, 2016

Python Data Science Handbook (Github)

From the webpage:

Jupyter notebook content for my OReilly book, the Python Data Science Handbook.

pdsh-cover

See also the free companion project, A Whirlwind Tour of Python: a fast-paced introduction to the Python language aimed at researchers and scientists.

This repository will contain the full listing of IPython notebooks used to create the book, including all text and code. I am currently editing these, and will post them as I make my way through. See the content here:

Enjoy!

How To Use Twitter to Learn Data Science (or anything)

Wednesday, November 2nd, 2016

How To Use Twitter to Learn Data Science (or anything) by Data Science Renee.

Judging from the date on the post (May 2016), Renee’s enthusiasm for Twitter came before her recently breaking 10,000 followers on Twitter. (Congratulations!)

The one thing I don’t see Renee mentioning is the use of your own Twitter account to gain experience with a whole range of data mining tools.

Your Twitter feed will quickly out-strip your ability to “keep up,” so how do you propose to deal with that problem?

Renee suggests limiting examination of your timeline (in part), but have you considered using machine learning to assist you?

Or visualizing your areas of interests or people that you follow?

Indexing resources pointed to in tweets?

NLP processing of tweets?

Every tool of data science that you will be using for clients is relevant to your own Twitter feed.

What better way to learn tools than using them on content that interests you?

Enjoy!

BTW, follow Data Science Renee for a broad range of data science tools and topics!

How To Use Data Science To Write And Sell More Books (Training Amazon)

Sunday, October 30th, 2016

From the description:

Chris Fox is the bestselling author of science fiction and dark fantasy, as well as non-fiction books for authors including Write to Market, 5000 words per hour and today we’re talking about his next book, Six Figure Author: Using data to sell books.

Show Notes What Amazon data science, and machine learning, are and how authors can use them. How Amazon differs from the other online book retailers and how authors can train Amazon to sell more books. What to look for to find a voracious readership. Strategically writing to market and how to know what readers are looking for. On Amazon ads and when they are useful. Tips on writing faster. The future of writing, including virtual reality and AI help with story.

Joanna Penn of The Creative Penn interviews Chris Fox

Some of the highlights:

Training Amazon To Work For You

…What you want to do is figure out, with as much accuracy as possible, who your target audience is.

And when you start selling your book, the number of sales is not nearly as important as who you sell your book to, because each of those sales to Amazon represents a customer profile.

If you can convince them that people who voraciously read in your genre are going to love this book and you sell a couple of hundred copies to people like that, Amazon’s going to take it and run with it. You’ve now successfully trained them about who your audience is because you used good data and now they’re able to easily sell your book.

If, on the other hand, you and your mom buys a copy and your friend at the coffee shop buys a copy, and people who aren’t necessarily into that genre are all buying it, Amazon gets really lost and confused.

Easier said than done but how’s that for taking advantage of someone else’s machine learning?

Chris also has tips for not “polluting” your Amazon sales data.

Discovering and Writing to a Market


How do you find a sub-category or a smaller niche within the Amazon ecosystem? What are the things to look for in order to find a voracious readership?

Chris: What I do is I start looking at the rankings of the number 1, the number 20, 40, 60, 80 and 100 books. You can tell based on where those books are ranked, how many books in the genre are selling. If the number one book is ranked in the top 100 in the store and so is the 20th book, then you’ve found one of the hottest genres on Amazon.

If you find that by the time you get down to number 40, the rank is dropping off sharply, that suggests that not enough books are being produced in that genre and it might be a great place for you to jump in and make a name for yourself. (emphasis in original)

I know, I know, this is a tough one. Especially for me.

As I have pointed out here on multiple occasions, “terrorism” is largely a fiction of both government and media.

However, if you look at the top 100 paid sellers on terrorism at Amazon, the top fifty (50) don’t have a single title that looks like it denies terrorism is a problem.

🙁

Which I take to mean, in terms of selling books, services, or data, the terrorism is coming for us all gravy train is the profitable line.

Or at least to indulge in analysis on the basis of “…if the threat of terrorism is real…” and let readers supply their own answers to that question.

There are other valuable tips and asides, so watch the video or read the transcript: How To Use Data Science To Write And Sell More Books With Chris Fox.

PS: As of today, there are 292 podcasts by Jonna Penn.

Data Science for Political and Social Phenomena [Special Interest Search Interface]

Sunday, October 23rd, 2016

Data Science for Political and Social Phenomena by Chris Albon.

From the webpage:

I am a data scientist and quantitative political scientist. I specialize in the technical and organizational aspects of applying data science to political and social issues.

Years ago I noticed a gap in the existing data literature. On one side was data science, with roots in mathematics and computer science. On the other side were the social sciences, with hard-earned expertise modeling and predicting complex human behavior. The motivation for this site and ongoing book project is to bridge that gap: to create a practical guide to applying data science to political and social phenomena.

Chris has organized three hundred and twenty-eight pages on Data Wrangling, Python, R, etc.

If you like learning from examples, this is the site for you!

Including this site, what other twelve (12) sites would you include in a Python/R Data Science search interface?

That is an interface that has indexed only that baker’s dozen of sites. So you don’t spend time wading through “the G that is not named” search results.

Serious question.

Not that I would want to maintain such a beast for external use, but having a local search engine tuned to your particular interests could be nice.

Threatening the President: A Signal/Noise Problem

Tuesday, October 18th, 2016

Even if you can’t remember why the pointy end of a pencil is important, you too can create national news.

This bit of noise reminded me of an incident when I was in high school where some similar type person bragged in a local bar about assassinating then President Nixon*. Was arrested and sentenced to several years in prison.

At the time I puzzled briefly over the waste of time and effort in such a prosecution and then promptly forgot it.

Until this incident with the overly “clever” Trump supporter.

To get us off on the same foot:

18 U.S. Code § 871 – Threats against President and successors to the Presidency

(a) Whoever knowingly and willfully deposits for conveyance in the mail or for a delivery from any post office or by any letter carrier any letter, paper, writing, print, missive, or document containing any threat to take the life of, to kidnap, or to inflict bodily harm upon the President of the United States, the President-elect, the Vice President or other officer next in the order of succession to the office of President of the United States, or the Vice President-elect, or knowingly and willfully otherwise makes any such threat against the President, President-elect, Vice President or other officer next in the order of succession to the office of President, or Vice President-elect, shall be fined under this title or imprisoned not more than five years, or both.

(b) The terms “President-elect” and “Vice President-elect” as used in this section shall mean such persons as are the apparent successful candidates for the offices of President and Vice President, respectively, as ascertained from the results of the general elections held to determine the electors of President and Vice President in accordance with title 3, United States Code, sections 1 and 2. The phrase “other officer next in the order of succession to the office of President” as used in this section shall mean the person next in the order of succession to act as President in accordance with title 3, United States Code, sections 19 and 20.

Commonplace threatening letters, calls, etc., aren’t documented for the public but President Barack Obama has a Wikipedia page devoted to the more significant ones: Assassination threats against Barack Obama.

Just as no one knows you are a dog on the internet, no one can tell by looking at a threat online if you are still learning how to use a pencil or are a more serious opponent.

Leaving to one side that a truly serious opponent allows actions to announce their presence or goal.

The treatment of even idle bar threats as serious is an attempt to improve the signal-to-noise ratio:

In analog and digital communications, signal-to-noise ratio, often written S/N or SNR, is a measure of signal strength relative to background noise. The ratio is usually measured in decibels (dB) using a signal-to-noise ratio formula. If the incoming signal strength in microvolts is Vs, and the noise level, also in microvolts, is Vn, then the signal-to-noise ratio, S/N, in decibels is given by the formula: S/N = 20 log10(Vs/Vn)

If Vs = Vn, then S/N = 0. In this situation, the signal borders on unreadable, because the noise level severely competes with it. In digital communications, this will probably cause a reduction in data speed because of frequent errors that require the source (transmitting) computer or terminal to resend some packets of data.

I’m guessing the reasoning is the more threats that go unspoken, the less chaff the Secret Service has to winnow in order to uncover viable threats.

One assumes they discard physical mail with return addresses of prisons, mental hospitals, etc., or at most request notice of the release of such people from state custody.

Beyond that, they don’t appear to be too picky about credible threats, noting that in one case an unspecified “death ray” was going to be used against President Obama.

The EuroNews description of that case must be shared:

Two American men have been arrested and charged with building a remote-controlled X-ray machine intended for killing Muslims and other perceived enemies of the U.S.

Following a 15-month investigation launched in April 2012, Glenn Scott Crawford and Eric J. Feight are accused of developing the device, which the FBI has described as “mobile, remotely operated, radiation emitting and capable of killing human targets silently and from a distance with lethal doses of radiation”.

Sure, right. I will post a copy of the 67-page complaint, which uses terminology rather loosely, to say the least, in a day or so. Suffice it to say that the defendants never acquired a source for the needed radioactivity production.

On the order of having a complete nuclear bomb but not nuclear material to make it into a nuclear bomb. You would be in more danger from the conventional explosive degrading than the bomb as a nuclear weapon.

Those charged with defending public officials want to deter the making of threats, so as to improve the signal/noise ratio.

The goal of those attacking public officials is a signal/noise ratio of exactly 0.0.

Viewing threats from an information science perspective suggests various strategies for either side. (Another dividend of studying information science.)

*They did find a good picture of Nixon for the White House page. Doesn’t look as much like a weasel as he did in real life. Gimp/Photoshop you think?

Becoming a Data Scientist:

Thursday, October 13th, 2016

Becoming a Data Scientist: Advice From My Podcast Guests

Out-gassing from political candidates has kept pushing this summary by Renée Teate back in my queue. Well, fixing that today!

René has created more data science resources than I can easily mention so in addition to this guide, I will mention only two:

Data Science Renee @BecomingDataSci, a Twitter account that will soon break into the rarefied air of > 10,000 followers. Not yet, but you may be the one that puts her over the top!

Looking for women to speak at data science conferences? Renée maintains Women in Data Science, which today has 815 members.

Sorry, three, her blog: Becoming a Data Scientist.

That should keep you busy/distracted until the political noise subsides. 😉

Data Science Toolbox

Saturday, October 1st, 2016

Data Science Toolbox

From the webpage:

Start doing data science in minutes

As a data scientist, you don’t want to waste your time installing software. Our goal is to provide a virtual environment that will enable you to start doing data science in a matter of minutes.

As a teacher, author, or organization, making sure that your students, readers, or members have the same software installed is not straightforward. This open source project will enable you to easily create custom software and data bundles for the Data Science Toolbox.

A virtual environment for data science

The Data Science Toolbox is a virtual environment based on Ubuntu Linux that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run the Data Science Toolbox either locally (using VirtualBox and Vagrant) or in the cloud (using Amazon Web Services).

We aim to offer a virtual environment that contains the software that is most commonly used for data science while keeping it as lean as possible. After a fresh install, the Data Science Toolbox contains the following software:

  • Python, with the following packages: IPython Notebook, NumPy, SciPy, matplotlib, pandas, scikit-learn, and SymPy.
  • R, with the following packages: ggplot2, plyr, dplyr, lubridate, zoo, forecast, and sqldf.
  • dst, a command-line tool for installing additional bundles on the Data Science Toolbox (see next section).

Let us know if you want to see something added to the Data Science Toolbox.

Great resource for doing or teaching data science!

And an example of using a VM to distribute software in a learning environment.

Data Science Series [Starts 9 September 2016 but not for *nix users]

Sunday, September 4th, 2016

The BD2K Guide to the Fundamentals of Data Science Series

From the webpage:


Every Friday beginning September 9, 2016
9am – 10am Pacific Time

Working jointly with the BD2K Centers-Coordination Center (BD2KCCC) and the NIH Office of Data Science, the BD2K Training Coordinating Center (TCC) is spearheading this virtual lecture series on the data science underlying modern biomedical research. Beginning in September 2016, the seminar series will consist of regularly scheduled weekly webinar presentations covering the basics of data management, representation, computation, statistical inference, data modeling, and other topics relevant to “big data” biomedicine. The seminar series will provide essential training suitable for individuals at all levels of the biomedical community. All video presentations from the seminar series will be streamed for live viewing, recorded, and posted online for future viewing and reference. These videos will also be indexed as part of TCC’s Educational Resource Discovery Index (ERuDIte), shared/mirrored with the BD2KCCC, and with other BD2K resources.

View all archived videos on our YouTube channel:
https://www.youtube.com/channel/UCKIDQOa0JcUd3K9C1TS7FLQ


Please join our weekly meetings from your computer, tablet or smartphone.
https://global.gotomeeting.com/join/786506213
You can also dial in using your phone.
United States +1 (872) 240-3311
Access Code: 786-506-213
First GoToMeeting? Try a test session: http://help.citrix.com/getready

Of course, running Ubuntu, when I follow the “First GoToMeeting? Try a test session,” I get this result:


OS not supported

Long-Term Fix: Upgrade your computer.

You or your IT Admin will need to upgrade your computer’s operating system in order to install our desktop software at a later date.

Since this is most likely a lecture format, could just stream the video and use WebConf as a Q/A channel.

Of course, that would mean losing the various technical difficulties, licensing fees, etc., all of which are distractions from the primary goal of the project.

But who wants that?

PS: Most *nix users won’t be interested except to refer others but still, over engineered solutions to simple issues should not be encouraged.

DataScience+ (R Tutorials)

Monday, August 29th, 2016

DataScience+

From the webpage:

We share R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!

I encountered DataScience+ while running down David Kun’s RDBL post.

As of today, there are 120 tutorials with 451,129 reads.

That’s impressive! Whether you are looking for tutorials or you are looking to post your R tutorial where it will be appreciated.

Enjoy!

The Ethics of Data Analytics

Sunday, August 21st, 2016

The Ethics of Data Analytics by Kaiser Fung.

Twenty-one slides on ethics by Kaiser Fung, author of: Junk Charts (data visualization blog), and Big Data, Plainly Spoken (comments on media use of statistics).

Fung challenges you to reach your own ethical decisions and acknowledges there are a number of guides to such decision making.

Unfortunately, Fung does not include professional responsibility requirements, such as the now out-dated Canon 7 of the ABA Model Code Of Professional Responsibility:

A Lawyer Should Represent a Client Zealously Within the Bounds of the Law

That canon has a much storied history, which is capably summarized in Whatever Happened To ‘Zealous Advocacy’? by Paul C. Sanders.

In what became known as Queen Caroline’s Case, the House of Lords sought to dissolve the marriage of King George the IV

George IV 1821 color

to Queen Caroline

CarolineOfBrunswick1795

on the grounds of her adultery. Effectively removing her as queen of England.

Queen Caroline was represented by Lord Brougham, who had evidence of a secret prior marriage by King George the IV to Catholic (which was illegal), Mrs Fitzherbert.

Portrait of Mrs Maria Fitzherbert, wife of George IV

Brougham’s speech is worth your reading in full but the portion most often cited for zealous defense reads as follows:


I once before took leave to remind your lordships — which was unnecessary, but there are many whom it may be needful to remind — that an advocate, by the sacred duty of his connection with his client, knows, in the discharge of that office, but one person in the world, that client and none other. To save that client by all expedient means — to protect that client at all hazards and costs to all others, and among others to himself — is the highest and most unquestioned of his duties; and he must not regard the alarm, the suffering, the torment, the destruction, which he may bring upon any other; nay, separating even the duties of a patriot from those of an advocate, he must go on reckless of the consequences, if his fate it should unhappily be, to involve his country in confusion for his client.

The name Mrs. Fitzherbert never slips Lord Brougham’s lips but the House of Lords has been warned that may not remain to be the case, should it choose to proceed. The House of Lords did grant the divorce but didn’t enforce it. Saving fact one supposes. Queen Caroline died less than a month after the coronation of George IV.

For data analysis, cybersecurity, or any of the other topics I touch on in this blog, I take the last line of Lord Brougham’s speech:

To save that client by all expedient means — to protect that client at all hazards and costs to all others, and among others to himself — is the highest and most unquestioned of his duties; and he must not regard the alarm, the suffering, the torment, the destruction, which he may bring upon any other; nay, separating even the duties of a patriot from those of an advocate, he must go on reckless of the consequences, if his fate it should unhappily be, to involve his country in confusion for his client.

as the height of professionalism.

Post-engagement of course.

If ethics are your concern, have that discussion with your prospective client before you are hired.

Otherwise, clients have goals and the task of a professional is how to achieve them. Nothing more.

Contributing to StackOverflow: How Not to be Intimidated

Friday, August 19th, 2016

Contributing to StackOverflow: How Not to be Intimidated by Ksenia Coulter.

From the post:

StackOverflow is an essential resource for programmers. Whether you run into a bizarre and scary error message or you’re blanking on something you should know, StackOverflow comes to the rescue. Its popularity with coders spurred many jokes and memes. (Programming to be Officially Renamed “Googling Stackoverflow,” a satirical headline reads).

(image omitted)

While all of us are users of StackOverflow, contributing to this knowledge base can be very intimidating, especially to beginners or to non-traditional coders who many already feel like they don’t belong. The fact that an invisible barrier exists is a bummer because being an active contributor not only can help with your job search and raise your profile, but also make you a better programmer. Explaining technical concepts in an accessible way is difficult. It is also well-established that teaching something solidifies your knowledge of the subject. Answering StackOverflow questions is great practice.

All of the benefits of being an active member of StackOverflow were apparent to me for a while, but I registered an account only this week. Let me walk you t[h]rough thoughts that hindered me. (Chances are, you’ve had them too!)

I plead guilty to using StackOverFlow but not contributing back to it.

Another “intimidation” to avoid is thinking you must have the complete and killer answer to any question.

That can and does happen, but don’t wait for a question where you can supply such an answer.

Jump in! (Advice to myself as well as any readers.)

Pandas

Wednesday, August 17th, 2016

Pandas by Reuven M. Lerner.

From the post:

Serious practitioners of data science use the full scientific method, starting with a question and a hypothesis, followed by an exploration of the data to determine whether the hypothesis holds up. But in many cases, such as when you aren’t quite sure what your data contains, it helps to perform some exploratory data analysis—just looking around, trying to see if you can find something.

And, that’s what I’m going to cover here, using tools provided by the amazing Python ecosystem for data science, sometimes known as the SciPy stack. It’s hard to overstate the number of people I’ve met in the past year or two who are learning Python specifically for data science needs. Back when I was analyzing data for my PhD dissertation, just two years ago, I was told that Python wasn’t yet mature enough to do the sorts of things I needed, and that I should use the R language instead. I do have to wonder whether the tables have turned by now; the number of contributors and contributions to the SciPy stack is phenomenal, making it a more compelling platform for data analysis.

In my article “Analyzing Data“, I described how to filter through logfiles, turning them into CSV files containing the information that was of interest. Here, I explain how to import that data into Pandas, which provides an additional layer of flexibility and will let you explore the data in all sorts of ways—including graphically. Although I won’t necessarily reach any amazing conclusions, you’ll at least see how you can import data into Pandas, slice and dice it in various ways, and then produce some basic plots.

Of course, scientific articles are written as though questions drop out of the sky and data is interrogated for the answer.

Aside from being rhetoric to badger others with, does anyone really think that is how science operates in fact?

Whether you have delusions about how science works in fact or not, you will find that Pandas will assist you in exploring data.