Archive for the ‘Data Science’ Category

DataBASIC

Thursday, February 16th, 2017

DataBASIC

Not for you but an interesting resource for introducing children to working with data.

Includes WordCounter, WTFcsv, SameDiff and ConnectTheDots.

The network template (for ConnectTheDots) is a CSV file with a header row and two fields per line, separated by a comma.
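As a minimal sketch (the field names and rows here are my own, not the official databasic.io template), such an edge list might look like:

    source,target
    Alice,Bob
    Bob,Carol
    Alice,Carol

Each row is one edge; ConnectTheDots draws the nodes and links from nothing more than that.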

Pick the right text/examples and you could have a class captivated pretty quickly.

Enjoy!

Missing The Beltway Blockade? Considering Blockading A Ball?

Wednesday, January 11th, 2017

For one reason or another, you may not be able to participate in a Beltway Blockade on January 20, 2017, see:

Don’t Panic!

You can still enjoy a non-permitted protest and contribute to the least attended inauguration in history!

2017 Presidential Inaugural Balls

The list is short on location information for many of the scheduled balls but the Commander in Chief’s Ball, Presidential Inaugural Ball, Mid-Atlantic Inauguration Ball, Midwest Inaugural Ball, Western Inaugural Ball, and the Neighborhood Inaugural Ball are all being held at the Walter E. Washington Convention Center.

Apologies, I haven’t looked up prior attendance records, but based on known scheduling alone, disruption in the area of the Walter E. Washington Convention Center looks like it will pay the highest returns.

For the balls with published locations, and/or locations I can discover, I will post a fuller list with Google Map links tomorrow.

Oh, for inside protesting, here are floor plans of the Walter E. Washington Convention Center.

Those are the official, posted floor plans.

Should that link go dark, let me know. I have a backup copy of them. 😉

The Best And Worst Data Stories Of 2016

Sunday, January 1st, 2017

The Best And Worst Data Stories Of 2016 by Walt Hickey.

From the post:

It’s time once again to dole out FiveThirtyEight’s Data Awards, our annual (OK, we’ve done it once before) chance to honor those who did remarkably good stuff with data, to shame those who did remarkably bad stuff with data, and to acknowledge the key numbers that help describe what went down over the past year. As always, these are based on the considered analysis of an esteemed panel of judges, by which I mean that I pestered people around the FiveThirtyEight offices until they gave me some suggestions.

I had to list this under both data science and humor. 😉

What “…bad stuff with data…” stories do you know and how will you avoid being listed in 2017? (Assuming there is another listing.)

I suspect we learn more from data fail stories than ones that report success.

You?

Enjoy!

Getting Started in Open Source: A Primer for Data Scientists

Saturday, December 31st, 2016

Getting Started in Open Source: A Primer for Data Scientists by Rebecca Bilbro.

From the post:

The phrase “open source” evokes an egalitarian, welcoming niche where programmers can work together towards a common purpose — creating software to be freely available to the public in a community that sees contribution as its own reward. But for data scientists who are just entering into the open source milieu, it can sometimes feel like an intimidating place. Even experienced, established open source developers like Jon Schlinkert have found the community to be less than welcoming at times. If the author of more than a thousand projects, someone whose scripts are downloaded millions of times every month, has to remind himself to stay positive, you might question whether the open source community is really the developer Shangri-la it would appear to be!

And yet, open source development does have a lot going for it:

  • Users have access to both the functionality and the methodology of the software (as opposed to just the functionality, as with proprietary software).
  • Contributors are also users, meaning that contributions track closely with user stories, and are intrinsically (rather than extrinsically) motivated.
  • Everyone has equal access to the code, and no one is excluded from making changes (at least locally).
  • Contributor identities are open to the extent that a contributor wants to take credit for her work.
  • Changes to the code are documented over time.

So why start a blog post for open source noobs with a quotation from an expert like Jon, especially one that paints such a dreary picture? It's because I want to show that the bar for contributing is… pretty low.

Ask yourself these questions: Do you like programming? Enjoy collaborating? Like learning? Appreciate feedback? Do you want to help make a great open source project even better? If your answer is 'yes' to one or more of these, you're probably a good fit for open source. Not a professional programmer? Just getting started with a new programming language? Don't know everything yet? Trust me, you're in good company.

Becoming a contributor to an open source project is a great way to support your own learning, to get more deeply involved in the community, and to share your own unique thoughts and ideas with the world. In this post, we'll provide a walkthrough for data scientists who are interested in getting started in open source — including everything from version control basics to advanced GitHub etiquette.

Two of Rebecca’s points are more important than the rest:

  • the bar for contributing is low
  • contributing builds community and a sense of ownership

Will 2017 be the year you move from the sidelines of open source and into the game?

Data Science, Protests and the Washington Metro – Feasibility

Friday, December 30th, 2016

Steven Nelson writes of plans to block DC traffic:


Protest plans often are overambitious and it’s unclear if there will be enough bodies or sacrificial vehicles to block roadways, or people willing to risk arrest by doing so, though Carrefour says the group has coordinated housing for a large number of out-of-town visitors and believes preliminary signs point to massive turnout.
….(Anti-Trump Activists Plan Road-Blocking ‘Clusterf–k’ for Inauguration)

Looking at a map of the ninety-one (91) Metro rail stations, you may feel discouraged by Steven’s question of “enough bodies or sacrificial vehicles to block roadways….”

[Screenshot: WMATA Metro rail station map]

(Screenshot of map from https://www.wmata.com/schedules/maps/, Rail maps selected, 30 December 2016.)

Steven’s question and data science

Steven’s question is a good one and it’s one data science and public data can address.

For a feel of the larger problem of blockading all 91 Metro Rail stations, download and view/print this color map of Metro stations from the Washington Metropolitan Area Transit Authority.

For every station where you don’t see:

[Image: Metro parking icon]

you will need to move protesters to those locations. As you already know, moving protesters in a coordinated way is a logistical and resource-intensive task.

Just so you know, there are forty-three (43) stations with no parking lots.

Data insight: If you look at the color map of Metro stations, you will notice that the stations with parking lie at the outer ends of the Metro lines.

That’s no accident. The Metro Rail system is designed to move people into and out of the city, which of necessity means that if you block access to the stations with parking lots, you have substantially impeded access into the city.

Armed with that insight, the number of Metro Rail stations to be blocked drops to thirty-eight (38). Still a large number, but less than half of the starting 91.

Blocking 38 Metro Rail Stations Still Sounds Like A Lot

You’re right.

Blocking all 38 Metro Rail stations with parking lots is a protest organizer’s pipe dream.

It’s in keeping with seeing themselves as proclaiming “Peace! Land! Bread!” to huddled masses.

Data science and public data won’t help block all 38 stations but it can help with strategic selection of stations based on your resources.

Earlier this year, Dan Malouff posted: All 91 Metro stations, ranked by ridership.

If you put that data into a spreadsheet and eliminate the stations with no parking lots, you can then sort the remaining stations by their daily ridership.

Moreover, you can keep a running total of the riders in order to calculate the percentage of Metro Rail riders blocked (assuming 100% blockage) as you progress down the list of stations.
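If you prefer code to a spreadsheet, the same calculation is a few lines of pandas. A minimal sketch, assuming you have saved the parking-lot stations to a CSV; the file and column names are my own, not Dan Malouff’s:

    import pandas as pd

    # Hypothetical input: one row per Metro station with a parking lot,
    # with its average daily ridership from Malouff's ranking.
    df = pd.read_csv("metro_parking_stations.csv")  # columns: station, avg_riders

    # Sort by ridership, then accumulate a running total and percentage.
    df = df.sort_values("avg_riders", ascending=False).reset_index(drop=True)
    df["running_total"] = df["avg_riders"].cumsum()
    df["pct_of_total"] = (df["running_total"] / df["avg_riders"].sum() * 100).round(2)

    print(df.head(15))  # compare with the excerpt below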

The total daily ridership for those stations is 183,535.

You can review my numbers and calculations with a copy of Metro-Rail-Ridership-Station-Percentage.xls

Strategic Choice of Metro Rail Stations

Consider this excerpt from the spreadsheet:

Station                 Avg. Riders   Running Total   % of Total
Silver Spring                 12269           12269        6.68%
Shady Grove                   11732           24001       13.08%
Vienna                        10005           34006       18.53%
Fort Totten                    7543           41549       22.64%
Wiehle                         7306           48855       26.62%
New Carrollton                 7209           56064       30.55%
Huntington                     7002           63066       34.36%
Franconia-Springfield          6821           69887       38.08%
Anacostia                      6799           76686       41.78%
Glenmont                       5881           82567       44.99%
Greenbelt                      5738           88305       48.11%
Rhode Island Avenue            5727           94032       51.23%
Branch Avenue                  5449           99481       54.20%
Takoma                         5329          104810       57.11%
Grosvenor                      5206          110016       59.94%

The total average daily ridership, as reported by Dan Malouff in All 91 Metro stations, ranked by ridership, comes to 652,183. Of course, that includes people who rode from one station to transfer to another. (I’m investigating ways/data to separate those out.)

As you can see, blocking only the first four stations, Silver Spring, Shady Grove, Vienna and Fort Totten, captures almost 23% of the traffic from stations with parking lots. That’s not quite 10% of total daily ridership, but certainly noticeable.

The other important point to notice is that with public data and data science, the problem has been reduced from 91 potential stations to 4.

A reduction of more than an order of magnitude.

Not a bad payoff for using public data and data science.


That’s all I have for you now, but I can promise that deeper analysis of metro DC public data sets reveals event locations that impact both the “beltway” as well as Metro Rail lines.

More on that, plus maps for the top five (5) locations, a little over 25% of the traffic at stations with parking, next week!

If you can’t make it to #DisruptJ20 protests, want to protest early or want to support research on data science and protests, consider a donation.

Disclaimer: I am exploring the potential of data science for planning protests. What you choose to do or not to do and when, is entirely up to you.

neveragain.tech [Or at least not any further]

Friday, December 16th, 2016

neveragain.tech [Or at least not any further]

Write a list of things you would never do. Because it is possible that in the next year, you will do them. —Sarah Kendzior [1]

We, the undersigned, are employees of tech organizations and companies based in the United States. We are engineers, designers, business executives, and others whose jobs include managing or processing data about people. We are choosing to stand in solidarity with Muslim Americans, immigrants, and all people whose lives and livelihoods are threatened by the incoming administration’s proposed data collection policies. We refuse to build a database of people based on their Constitutionally-protected religious beliefs. We refuse to facilitate mass deportations of people the government believes to be undesirable.

We have educated ourselves on the history of threats like these, and on the roles that technology and technologists played in carrying them out. We see how IBM collaborated to digitize and streamline the Holocaust, contributing to the deaths of six million Jews and millions of others. We recall the internment of Japanese Americans during the Second World War. We recognize that mass deportations precipitated the very atrocity the word genocide was created to describe: the murder of 1.5 million Armenians in Turkey. We acknowledge that genocides are not merely a relic of the distant past—among others, Tutsi Rwandans and Bosnian Muslims have been victims in our lifetimes.

Today we stand together to say: not on our watch, and never again.

I signed up, but FYI: the databases we are pledging not to build already exist.

The US Census Bureau collects information on race, religion and national origin.

The Statistical Abstract of the United States: 2012 (131st Edition) Section 1. Population confirms the Census Bureau has this data:

Population tables are grouped by category as follows:

  • Ancestry, Language Spoken At Home
  • Elderly, Racial And Hispanic Origin Population Profiles
  • Estimates And Projections By Age, Sex, Race/Ethnicity
  • Estimates And Projections–States, Metropolitan Areas, Cities
  • Households, Families, Group Quarters
  • Marital status And Living Arrangements
  • Migration
  • National Estimates And Projections
  • Native And Foreign-Born Populations
  • Religion

To be fair, the privacy principles of the Census Bureau state:

Respectful Treatment of Respondents: Are our efforts reasonable and did we treat you with respect?

  • We promise to ensure that any collection of sensitive information from children and other sensitive populations does not violate federal protections for research participants and is done only when it benefits the public good.

Disclosure: I like the US Census Bureau. Left to their own devices, I don’t have any reasonable fear of their misusing the data in question.

But that’s the question, isn’t it? Will the US Census Bureau be left to its own policies and traditions?

I view the various “proposed data collection policies” of the incoming administration as intentional distractions. While everyone is focused on Trump’s Theater of the Absurd, appointments and policies at the US Census Bureau may achieve the same ends.

Sign the pledge, yes, but use FOIA requests, personal contacts with Census staff, etc., to keep track of the use of dangerous data at the Census Bureau and elsewhere.


Instructions for adding your name to the pledge are found at: https://github.com/neveragaindottech/neveragaindottech.github.io/.

Assume Census Bureau staff are committed to their privacy and appropriate use policies. A friendly approach will be far more productive than a confrontational or suspicious one. Let’s work with them to maintain their agency’s long history of data security.

How To Brick A School Bus, Data Science Helps Park It (Part 2)

Wednesday, December 14th, 2016

Immediate reactions to How To Brick A School Bus, Data Science Helps Park It (Part 1) include:

  • Blocking a public street with a bricked school bus is a crime.
  • Publicly committing a crime isn’t on your bucket list.
  • School buses are expensive.
  • Turning over a school bus is dangerous.

All true and all likely to diminish any enthusiasm for participation.

Bright yellow school buses bricked and blocking transportation routes attract the press like flies to …, well, you know, but may not be your best option.

Alternatives to a Bricked School Bus

Despite the government denying your right to assemble near the inauguration on January 20, 2017 in Washington, D.C., what other rights could lead to a newsworthy result?

You have the right to travel, although the Supreme Court has differed on the constitutional basis for that right. (Constitution of the United States of America: Analysis and Interpretation, 14th Amendment, page 1834, footnote 21).

You also have the right to be inattentive, which I suspect is secured by the 9th Amendment:

The enumeration in the Constitution, of certain rights, shall not be construed to deny or disparage others retained by the people.

If we put the right to travel together with the right to be inattentive (or negligent), then it stands to reason that your car could run out of gas on the highways normally used to attend an inauguration.

Moreover, we know from past cases that drivers have not been held negligent simply for running out of gas, even at the White House.

Where to Run Out of Gas?

Interesting question and the one that originally had me reaching for historic traffic data.

It does exist: yearly summaries (Virginia), Inrix (Washington, DC), Traffic Volume Maps (District Department of Transportation), and others.

But we don’t want to be like the data scientist who used GPS and satellite data to investigate why you can’t get a taxi in Singapore when it rains (Starting Data Analysis with Assumptions). Crunching large amounts of data revealed that taxis in Singapore stop moving when it rains.

An interesting observation, but not an answer to the original question. Asking a local taxi driver revealed that draconian traffic liability laws are the reason taxi drivers pull over when it rains. Not a “big data” question at all.

What Do We Know About DC Metro Traffic Congestion?

Let’s review what is commonly known about DC metro traffic congestion:

D.C. tops list of nation’s worst traffic gridlock (2015), Study ranks D.C. traffic 2nd-worst in U.S. (2016), DC Commuters Abandon Metro, Making Already Horrible Traffic Even Worse (metro repairs make traffic far worse).

At the outset, we know that motor vehicle traffic is a chaotic system, so small changes, such as the additional impediment to traffic flow of cars running out of gas, can have large effects, especially on a system that teeters on the edge of gridlock every day.

The loss of Metro usage has a cascading impact on metro traffic (see above), which means that blocking access to Metro stations will exacerbate the impact of blockages on the highway system.

Time and expense could be spent on overly precise positioning of out-of-gas cars, but a two-part directive is just as effective, if not more so:

  • Go to Metro station ingresses.
  • Go to any location on traffic map that is not red.

Here’s a sample traffic map that has traffic cameras:

[Screenshot: Fox5 DC traffic map with traffic cameras]

From Fox5 DC but it is just one of many.

Using existing traffic maps removes the need to construct your own and enables chaotic participation, meaning you quite innocently ran out of gas and did not at any time contact and/or conspire with others to run out of gas.

Conspiracy is a crime and you should always avoid committing crimes.

General Comments

You may be wondering whether authorities, aware of a theoretical discussion of people running out of gas, will mount effective countermeasures.

I don’t think so and here’s why: What would be the logical response of an authority? Position more tow trucks? Set up temporary refueling stations?

Do you think the press will be interested in those changes? Then not only do you have the friction of the additional equipment, but also the press buzzing about, asking about the changes.

An authority’s best strategy would be to do nothing at all, but that advice is rarely taken. At best, local authorities will make transportation even more fragile in anticipation that someone might run out of gas.

Judging from the visitor numbers I hear tossed about, some activities are expecting more than 100,000 participants (Women’s March on Washington), so even random participation in running out of gas should have a significant impact.

What if they held the inauguration to empty bleachers?

Data Science Traditionalists – Don’t Re-invent the Wheel

Nudging a chaotic traffic system into gridlock, for hours if not more than a day, may not strike you as traditional data science.

Perhaps not but please don’t re-invent the wheel.

If you want to be more precise, perhaps to block particular activities or locations, let me direct you to the Howard University Transportation Safety Data Center.

They have the Traffic Count Database System (TCDS). Two screen shots that don’t do it justice:

[Screenshots: Traffic Count Database System interface]

From their guide to the system:

The Traffic Count Database System (TCDS) module is a powerful tool for the traffic engineer or planner to organize an agency’s traffic count data. It allows you to upload data from a traffic counter; view graphs, lists and reports of historic traffic count data; search for count data using either the database or the Google map; and print or export data to your desktop.

This guide is for users who are new to the TCDS system. It will provide you with the tools to carry out many common tasks. Any features not discussed in this guide are considered advanced features. If you have further questions, feel free to explore the online help guide or to contact the staff at MS2 for assistance.

I have referred to the inauguration of president-elect Donald J. Trump but the same lessons are applicable, with local modifications, to many other locations.

PS: Nothing should be construed as approval and/or encouragement that you break local laws in any venue. Laws vary from jurisdiction to jurisdiction, and what counts as acceptable risks and consequences is entirely your decision.

If you do run out of gas in or near Washington, DC on January 20, 2017, be polite to first-responders, including police officers. If you don’t realize your real enemies lie elsewhere, then you too have false class consciousness.

If you are tailgating on the “Beltway,” offer responders a soft drink (they are on duty) and a hot dog.

Reporting in Aleppo: Can data science help?

Wednesday, December 14th, 2016

Reporting in Aleppo: Can data science help? by Nausicaa Renner. (Columbia Journalism Review)

From the post:

In war zones, reporting is hard to come by. Nowhere is this truer than in Syria, where many international journalists are banned, and more than one hundred journalists have been killed since the war began in early 2011. A deal was made on Tuesday between the Syrian government and the rebels allowing civilians and rebels to evacuate eastern Aleppo, but after years of bloody conflict, clarity is still hard to come by.

Is there a way for data science to give access to understudied war zones? A project at the Center for Spatial Research at Columbia University, partly funded by the Tow Center for Digital Journalism, uses what information we do have to “link eyes in the sky with algorithms and ears on the ground” in Aleppo.

The Center overlaid satellite images from 2012 to 2016 to create a map showing how Aleppo has changed: Destroyed buildings were identified by discrepancies in the images from year to year. Visualization can also put things in perspective; at a seminar the Center held, one student created a map showing how little the front lines of Aleppo have moved—a stark expression of the futility of war.

As of this AM, I saw reports that the ceasefire mentioned in this post failed.

The content is horrific but using the techniques described in The Twitterverse of Donald Trump to harvest Aleppo videos and images could preserve a record of the fall of Aleppo. Would mapping geo-locations to a map of Aleppo help document/confirm reports of atrocities?

Unlike the wall of silence around US military operations, there is a great deal of first-hand data and opportunities for analysis and confirmation. (It’s hard to analyze or confirm a press briefing document.)

Data Science and Protests During the Age of Trump [How To Brick A School Bus…]

Friday, December 9th, 2016

Pre-inauguration suppression of free speech/protests is underway for the Trump regime. (CNN link as subject identifier for Donald J. Trump, even though it fails to mention he looks like a cheeto in a suit.)

Women’s March on Washington barred from Lincoln Memorial by Amber Jamieson and Jessica Glenza.

From the post:


For the thousands hoping to echo the civil rights and anti-Vietnam rallies at Lincoln Memorial by joining the women’s march on Washington the day after Donald Trump’s inauguration: time to readjust your expectations.

The Women’s March won’t be held at the Lincoln Memorial.

That’s because the National Park Service, on behalf of the Presidential Inauguration Committee, filed documents securing large swaths of the national mall and Pennsylvania Avenue, the Washington Monument and the Lincoln Memorial for the inauguration festivities. None of these spots will be open for protesters.

The NPS filed a “massive omnibus blocking permit” for many of Washington DC’s most famous political locations for days and weeks before and after the inauguration on 20 January, said Mara Verheyden-Hilliard, a constitutional rights litigator and the executive director of the Partnership for Civil Justice Fund.

I contacted Amber Jamieson for more details on the permits and she forwarded two links (thanks Amber!):

Press Conference: Mass Protests Will Go Forward During Inauguration, which had the second link she forwarded:

PresidentialInauguralCommittee12052016.pdf, the permit requests made by the National Park Service on behalf of the Presidential Inaugural Committee.

Start with where protests are “permitted” to see what has been lost.

A grim read but 36 CFR 7.96 says in part:


(i) White House area. No permit may be issued authorizing demonstrations in the White House area, except for the White House sidewalk, Lafayette Park and the Ellipse. No permit may be issued authorizing special events, except for the Ellipse, and except for annual commemorative wreath-laying ceremonies relating to the statues in Lafayette Park.

(emphasis added, material hosted by the Legal Information Institute (LII))

Summary: In the White House area, protesters have only three places for permits to protest:

  • White House sidewalk
  • Lafayette Park
  • Ellipse

White House sidewalk / Lafayette Park (except North-East Quadrant) – Application 16-0289

Dates:

Set-up dates starting 11/1/2016 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 3/1/2017 11:59 pm

Closes:


All of Lafayette Park except for its northeast quadrant pursuant to 36 CFR 7.96 (g)(4)(iii)(A). The initial areas of Lafayette Park and the White House Sidewalk that will be needed for construction set-up, and which will be closed to ensure public safety, are detailed in the attached map. The attached map depicts the center portion of the White House Sidewalk as well as a portion of the southern oval of Lafayette Park. The other remaining areas in Lafayette Park and the White House Sidewalk that will be needed for construction set-up will be closed as construction set-up progresses into these other areas, which will also then be delineated by fencing and signage to ensure public safety.

Two of the three possible protest sites in the White House area are closed by Application 16-0289.

Ellipse – Application 17-0001

Dates:

Set-up dates starting 01/6/2017 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 2/17/2017 11:59 pm

These dates are at variance with those for the White House sidewalk and Lafayette Park (the break-down window is shorter).

Closes:

The Ellipse, a fifty-two acre park, as depicted by Google Maps:

[Screenshot: Google Maps view of the Ellipse]

Plans for the Ellipse?


Purpose of Activity: In connection with the Presidential Inaugural Ceremonies, this application is for use of the Ellipse by PIC, in the event that PIC seeks its use for Inaugural ceremonies and any necessary staging, which is expected to be:

A) In the event that PIC seeks the use of the Ellipse for pre- and/or post-Inaugural ceremonies, the area will be used for staging the event(s), staging of media to cover and/or broadcast the event, and if possible for ticketed and/or public viewing; and/or

B) In the event that PIC seeks the use of the Ellipse for the Inaugural ceremony and Inaugural parade staging, the area will be used to stage the various parade elements, for media to cover and/or broadcast the event, and if possible for ticketed and/or public viewing.

The PIC has no plans to use the Ellipse but has reserved it no doubt to deny its use to others.

Those two applications close three out of three protest sites in the White House area. The PIC went even further to reach out and close off other potential protest sites.

Other permits granted to the PIC include:

Misc. Areas – Application 16-0357

Ten (10) misc. areas identified by attached maps for PIC activities.

Arguably legitimate since the camp followers, sycophants and purveyors of “false news” need somewhere to be during the festivities.

National Mall -> Trump Mall – Application 17-0002

The National Mall will become Trump Mall for the following dates:

Set-up dates starting 01/6/2017 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 1/30/2017 11:59 pm

Closes:


Plan for Proposed Activity: Consistent with NPS regulations at 36 CFR 7.96(g)(4)(iii)(C), this application seeks, in connection with the Presidential Inaugural Ceremonies, the area of the National Mall between 14th – 4th Streets, for the exclusive use of the Joint Task Force Headquarters (JTFHQ) on Inaugural Day for the assembly, staging, security and weather protection of the pre-Inaugural parade components and floats on Inaugural Day between 14th – 7th Streets. It also includes the placement of jumbotrons and sound towers by the Architect of the Capitol or the Joint Congressional Committee on Inaugural Ceremonies so that the Inaugural Ceremony may be observed by the Joint Congressional Committee’s ticketed standing room ticket holders between 4th – 3rd streets and the general public, which will be located on the National Mall between 7th – 4th Streets. Further, a 150-foot by 200-foot area on the National Mall just east of 7th Street, will be for the exclusive use of the Presidential Inaugural Committee for television and radio media broadcasts on Inaugural Day.

In the plans thus far, no mention of the main card or where the ring plus cage will be erected on Trump Mall. (that’s sarcasm, not “fake news”)

Most Other Places – Application 17-0003

If you read 36 CFR 7.96 carefully, you noticed there are places always prohibited to protesters:


(ii) Other park areas. Demonstrations and special events are not allowed in the following other park areas:

(A) The Washington Monument, which means the area enclosed within the inner circle that surrounds the Monument’s base, except for the official annual commemorative Washington birthday ceremony.

(B) The Lincoln Memorial, which means that portion of the park area which is on the same level or above the base of the large marble columns surrounding the structure, and the single series of marble stairs immediately adjacent to and below that level, except for the official annual commemorative Lincoln birthday ceremony.

(C) The Jefferson Memorial, which means the circular portion of the Jefferson Memorial enclosed by the outermost series of columns, and all portions on the same levels or above the base of these columns, except for the official annual commemorative Jefferson birthday ceremony.

(D) The Vietnam Veterans Memorial, except for official annual Memorial Day and Veterans Day commemorative ceremonies.

What about places just outside the already restricted areas?

Dates:

Set-up dates starting 01/6/2017 6:00 am ending 1/19/2017
Activity dates starting 1/20/2017 ending 1/20/2017
Break-down dates starting 1/21/2017 ending 2/10/2017 11:59 pm

Closes:


The Lincoln Memorial area, as more fully detailed as the park area bordered by 23rd Street, Daniel French Drive and Independence Avenue, Henry Bacon Drive and Constitution Avenue, Constitution Avenue between 15th & 23rd Streets, Constitution Gardens to include Area #5 outside of the Vietnam Veteran’s Memorial restricted area, the Lincoln Memorial outside of its restricted area, the Lincoln Memorial Plaza and Reflecting Pool Area, JFK Hockey Field, park area west of Lincoln Memorial between French Drive, Henry Bacon Drive, Parking Lots A, B and C, East and West Potomac Park, Memorial Bridge, Memorial Circle and Memorial Drive, the World War II Memorial. The Washington Monument Grounds as more fully depicted as the park area bounded by 14th & 15th Streets and Madison Drive and Independence Avenue.

Not to use but to prevent its use by others:


Purpose of Activity: In connection with the Presidential Inaugural Ceremonies, this application is for use of the Lincoln Memorial areas and Washington Monument grounds by PIC, in the event that PIC seeks its use for the Inaugural related ceremonies and any necessary staging, which is expected to be:

A) In the event that PIC seeks the use of the Lincoln Memorial areas for pre- and/or post-Inaugural ceremonies, the area will be used for staging the event(s), staging of media to cover and/or broadcast the event, and for ticketed and/or public viewing.

B) In the event that PIC seeks to use the Washington Monument grounds for a public overflow area to view the Inaugural ceremony and/ or parade, the area will be used for the public who will observe the activities through prepositioned jumbotrons and sound towers.

Next Steps

For your amusement, all five applications contain the following question answered No:

Do you have any reason to believe or any information indicating that any individual, group or organization might seek to disrupt the activity for which this application is submitted?

I would venture to say someone hasn’t been listening. 😉

Among the data science questions raised by this background information are:

  • How best to represent these no free speech and/or no free assembly zones on a map? (see the sketch after this list)
  • What data sets do you need to make protesters effective under these restrictions?
  • What questions would you ask of those data sets?
  • How to decide between viral/spontaneous action versus publicly known but lawful conduct, up until the point it becomes unlawful?
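On the first question, a minimal sketch using folium; the corner coordinates below are rough, hand-estimated stand-ins for the National Mall closure (14th – 4th Streets), not boundaries traced from the permit maps:

    import folium

    # Approximate corners of the National Mall closure (14th to 4th Streets);
    # replace with boundaries traced from the attached permit maps.
    mall_closure = [
        (38.8920, -77.0320),  # NW corner, near 14th Street
        (38.8920, -77.0160),  # NE corner, near 4th Street
        (38.8875, -77.0160),  # SE corner
        (38.8875, -77.0320),  # SW corner
    ]

    m = folium.Map(location=[38.8895, -77.0250], zoom_start=14)
    folium.Polygon(
        locations=mall_closure,
        color="red",
        fill=True,
        fill_opacity=0.3,
        popup="Application 17-0002: National Mall, 14th-4th Streets",
    ).add_to(m)
    m.save("no_protest_zones.html")

One polygon per permit application, layered on the same map, would give protesters a single picture of where permits cannot issue.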

If you use any of this information, please credit Amber Jamieson, Jessica Glenza and the Partnership for Civil Justice Fund as the primary sources.

See further news from the Partnership for Civil Justice Fund at: Your Right of Resistance.

Tune in next Monday for: How To Brick A School Bus, Data Science Helps Park It.

PS: “The White House Sidewalk is the sidewalk between East and West Executive Avenues, on the south side of Pennsylvania Avenue, N.W.” From OMB Control No. 1024-0021 – Application for a Permit to Conduct a Demonstration or Special Event in Park Areas and a Waiver of Numerical Limitations on Demonstrations for White House Sidewalk and/or Lafayette Park

Learning R programming by reading books: A book list

Thursday, November 24th, 2016

Learning R programming by reading books: A book list by Liang-Cheng Zhang.

From the post:

Despite R’s popularity, it is still very daunting to learn R, as R has no point-and-click interface like SPSS and learning R usually takes lots of time. No worries! As self-taught R learners ourselves, we constantly receive requests about how to learn R. Besides hiring someone to teach you or paying tuition fees for online courses, our suggestion is that you can also pick up some books that fit your current R programming level. Therefore, in this post, we would like to share some good books that teach you how to program in R at three levels: elementary, intermediate, and advanced. Each level focuses on one task so you will know whether these books fit your needs. While the following books do not necessarily focus on the task we define, you should focus on that task when reading these books so you are not lost in context.

Books and reading form the core of my most basic prejudice: Literacy is the doorway to unlimited universes.

A prejudice so strong that I have to work hard at realizing non-literates live in and sense worlds not open to literates. Not less complex, not poorer, just different.

But book lists in particular appeal to that prejudice and since my blog is read by literates, I’m indulging that prejudice now.

I do have a title to add to the list: Practical Data Science with R by Nina Zumel and John Mount.

Judging from the other titles listed, Practical Data Science with R falls in the intermediate range. Should not be your first R book but certainly high on the list for your second R book.

Avoid the rush! Start working on your Amazon wish list today! 😉

Python Data Science Handbook

Saturday, November 19th, 2016

Python Data Science Handbook (Github)

From the webpage:

Jupyter notebook content for my O’Reilly book, the Python Data Science Handbook.

[Image: Python Data Science Handbook cover]

See also the free companion project, A Whirlwind Tour of Python: a fast-paced introduction to the Python language aimed at researchers and scientists.

This repository will contain the full listing of IPython notebooks used to create the book, including all text and code. I am currently editing these, and will post them as I make my way through. See the content here:

Enjoy!

How To Use Twitter to Learn Data Science (or anything)

Wednesday, November 2nd, 2016

How To Use Twitter to Learn Data Science (or anything) by Data Science Renee.

Judging from the date on the post (May 2016), Renee’s enthusiasm for Twitter came before her recently breaking 10,000 followers on Twitter. (Congratulations!)

The one thing I don’t see Renee mentioning is the use of your own Twitter account to gain experience with a whole range of data mining tools.

Your Twitter feed will quickly outstrip your ability to “keep up,” so how do you propose to deal with that problem?

Renee suggests limiting examination of your timeline (in part), but have you considered using machine learning to assist you?

Or visualizing your areas of interests or people that you follow?

Indexing resources pointed to in tweets?

NLP processing of tweets?

Every tool of data science that you will be using for clients is relevant to your own Twitter feed.
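As a minimal sketch, assuming a registered Twitter app and the tweepy package (the credentials below are placeholders), you could start by counting the hashtags in your own timeline:

    import tweepy
    from collections import Counter

    # Placeholder credentials: register an app with Twitter to get real ones.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth)

    # Count hashtags across recent tweets in your home timeline.
    hashtags = Counter()
    for tweet in api.home_timeline(count=200):
        for tag in tweet.entities.get("hashtags", []):
            hashtags[tag["text"].lower()] += 1

    print(hashtags.most_common(20))

Swap the Counter for scikit-learn, a visualization library or an indexer and the same loop becomes the front end for the machine learning, visualization and NLP experiments above.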

What better way to learn tools than using them on content that interests you?

Enjoy!

BTW, follow Data Science Renee for a broad range of data science tools and topics!

How To Use Data Science To Write And Sell More Books (Training Amazon)

Sunday, October 30th, 2016

From the description:

Chris Fox is the bestselling author of science fiction and dark fantasy, as well as non-fiction books for authors including Write to Market, 5000 words per hour and today we’re talking about his next book, Six Figure Author: Using data to sell books.

Show Notes:

  • What Amazon data science, and machine learning, are and how authors can use them.
  • How Amazon differs from the other online book retailers and how authors can train Amazon to sell more books.
  • What to look for to find a voracious readership.
  • Strategically writing to market and how to know what readers are looking for.
  • On Amazon ads and when they are useful.
  • Tips on writing faster.
  • The future of writing, including virtual reality and AI help with story.

Joanna Penn of The Creative Penn interviews Chris Fox

Some of the highlights:

Training Amazon To Work For You

…What you want to do is figure out, with as much accuracy as possible, who your target audience is.

And when you start selling your book, the number of sales is not nearly as important as who you sell your book to, because each of those sales to Amazon represents a customer profile.

If you can convince them that people who voraciously read in your genre are going to love this book and you sell a couple of hundred copies to people like that, Amazon’s going to take it and run with it. You’ve now successfully trained them about who your audience is because you used good data and now they’re able to easily sell your book.

If, on the other hand, you and your mom buy a copy and your friend at the coffee shop buys a copy, and people who aren’t necessarily into that genre are all buying it, Amazon gets really lost and confused.

Easier said than done but how’s that for taking advantage of someone else’s machine learning?

Chris also has tips for not “polluting” your Amazon sales data.

Discovering and Writing to a Market


How do you find a sub-category or a smaller niche within the Amazon ecosystem? What are the things to look for in order to find a voracious readership?

Chris: What I do is I start looking at the rankings of the number 1, the number 20, 40, 60, 80 and 100 books. You can tell based on where those books are ranked, how many books in the genre are selling. If the number one book is ranked in the top 100 in the store and so is the 20th book, then you’ve found one of the hottest genres on Amazon.

If you find that by the time you get down to number 40, the rank is dropping off sharply, that suggests that not enough books are being produced in that genre and it might be a great place for you to jump in and make a name for yourself. (emphasis in original)
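To make that eyeball test concrete, here is a toy Python encoding of it; the thresholds are my guesses at rules of thumb, not Chris’s actual numbers:

    # ranks: Amazon store rank of the books at positions 1, 20, 40, 60, 80, 100
    # in a category's bestseller list.
    def genre_heat(ranks):
        if ranks[1] <= 100 and ranks[20] <= 100:
            return "one of the hottest genres on Amazon"
        if ranks[40] > 10 * ranks[20]:
            return "sharp drop-off by #40: a possibly underserved niche"
        return "somewhere in between: dig further"

    print(genre_heat({1: 80, 20: 95, 40: 3000, 60: 8000, 80: 15000, 100: 40000}))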

I know, I know, this is a tough one. Especially for me.

As I have pointed out here on multiple occasions, “terrorism” is largely a fiction of both government and media.

However, if you look at the top 100 paid sellers on terrorism at Amazon, the top fifty (50) don’t have a single title that looks like it denies terrorism is a problem.

🙁

Which I take to mean that, in terms of selling books, services, or data, the “terrorism is coming for us all” gravy train is the profitable line.

Or at least to frame analysis in terms of “…if the threat of terrorism is real…” and let readers supply their own answers to that question.

There are other valuable tips and asides, so watch the video or read the transcript: How To Use Data Science To Write And Sell More Books With Chris Fox.

PS: As of today, there are 292 podcasts by Joanna Penn.

Data Science for Political and Social Phenomena [Special Interest Search Interface]

Sunday, October 23rd, 2016

Data Science for Political and Social Phenomena by Chris Albon.

From the webpage:

I am a data scientist and quantitative political scientist. I specialize in the technical and organizational aspects of applying data science to political and social issues.

Years ago I noticed a gap in the existing data literature. On one side was data science, with roots in mathematics and computer science. On the other side were the social sciences, with hard-earned expertise modeling and predicting complex human behavior. The motivation for this site and ongoing book project is to bridge that gap: to create a practical guide to applying data science to political and social phenomena.

Chris has organized three hundred and twenty-eight pages on Data Wrangling, Python, R, etc.

If you like learning from examples, this is the site for you!

Including this site, which other twelve (12) sites would you include in a Python/R Data Science search interface?

That is, an interface that has indexed only that baker’s dozen of sites, so you don’t spend time wading through “the G that is not named” search results.

Serious question.

Not that I would want to maintain such a beast for external use, but having a local search engine tuned to your particular interests could be nice.
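By way of illustration, a minimal sketch of such a local engine using the Whoosh library; the sample page URL and text are placeholders, and fetching pages from your chosen sites is left as an exercise:

    import os
    from whoosh import index
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    # One-time setup: define the schema and create the index directory.
    schema = Schema(url=ID(stored=True), content=TEXT)
    os.makedirs("indexdir", exist_ok=True)
    ix = index.create_in("indexdir", schema)

    # Add one document per page; in practice, crawl your baker's dozen of sites.
    writer = ix.writer()
    writer.add_document(url="https://chrisalbon.com/some-pandas-page",
                        content="Indexing and slicing pandas dataframes ...")
    writer.commit()

    # Query the local index instead of a general-purpose search engine.
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse("pandas slicing")
        for hit in searcher.search(query):
            print(hit["url"])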

Threatening the President: A Signal/Noise Problem

Tuesday, October 18th, 2016

Even if you can’t remember why the pointy end of a pencil is important, you too can create national news.

This bit of noise reminded me of an incident from my high school days, when a similar person bragged in a local bar about assassinating then-President Nixon*. He was arrested and sentenced to several years in prison.

At the time I puzzled briefly over the waste of time and effort in such a prosecution and then promptly forgot it.

Until this incident with the overly “clever” Trump supporter.

To get us off on the same foot:

18 U.S. Code § 871 – Threats against President and successors to the Presidency

(a) Whoever knowingly and willfully deposits for conveyance in the mail or for a delivery from any post office or by any letter carrier any letter, paper, writing, print, missive, or document containing any threat to take the life of, to kidnap, or to inflict bodily harm upon the President of the United States, the President-elect, the Vice President or other officer next in the order of succession to the office of President of the United States, or the Vice President-elect, or knowingly and willfully otherwise makes any such threat against the President, President-elect, Vice President or other officer next in the order of succession to the office of President, or Vice President-elect, shall be fined under this title or imprisoned not more than five years, or both.

(b) The terms “President-elect” and “Vice President-elect” as used in this section shall mean such persons as are the apparent successful candidates for the offices of President and Vice President, respectively, as ascertained from the results of the general elections held to determine the electors of President and Vice President in accordance with title 3, United States Code, sections 1 and 2. The phrase “other officer next in the order of succession to the office of President” as used in this section shall mean the person next in the order of succession to act as President in accordance with title 3, United States Code, sections 19 and 20.

Commonplace threatening letters, calls, etc., aren’t documented for the public but President Barack Obama has a Wikipedia page devoted to the more significant ones: Assassination threats against Barack Obama.

Just as no one knows you are a dog on the internet, no one can tell by looking at a threat online if you are still learning how to use a pencil or are a more serious opponent.

Leaving aside that a truly serious opponent lets actions announce their presence or goal.

The treatment of even idle bar threats as serious is an attempt to improve the signal-to-noise ratio:

In analog and digital communications, signal-to-noise ratio, often written S/N or SNR, is a measure of signal strength relative to background noise. The ratio is usually measured in decibels (dB) using a signal-to-noise ratio formula. If the incoming signal strength in microvolts is Vs, and the noise level, also in microvolts, is Vn, then the signal-to-noise ratio, S/N, in decibels is given by the formula: S/N = 20 log10(Vs/Vn)

If Vs = Vn, then S/N = 0. In this situation, the signal borders on unreadable, because the noise level severely competes with it. In digital communications, this will probably cause a reduction in data speed because of frequent errors that require the source (transmitting) computer or terminal to resend some packets of data.
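A quick check of the quoted formula, with arbitrary values of my own:

    from math import log10

    def snr_db(vs, vn):
        # Signal-to-noise ratio in decibels, per the formula quoted above.
        return 20 * log10(vs / vn)

    print(snr_db(10, 1))  # strong signal: 20.0 dB
    print(snr_db(1, 1))   # signal equals noise: 0.0 dB, borderline unreadable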

I’m guessing the reasoning is the more threats that go unspoken, the less chaff the Secret Service has to winnow in order to uncover viable threats.

One assumes they discard physical mail with return addresses of prisons, mental hospitals, etc., or at most request notice of the release of such people from state custody.

Beyond that, they don’t appear to be too picky about credible threats, noting that in one case an unspecified “death ray” was going to be used against President Obama.

The EuroNews description of that case must be shared:

Two American men have been arrested and charged with building a remote-controlled X-ray machine intended for killing Muslims and other perceived enemies of the U.S.

Following a 15-month investigation launched in April 2012, Glenn Scott Crawford and Eric J. Feight are accused of developing the device, which the FBI has described as “mobile, remotely operated, radiation emitting and capable of killing human targets silently and from a distance with lethal doses of radiation”.

Sure, right. I will post a copy of the 67-page complaint, which uses terminology rather loosely, to say the least, in a day or so. Suffice it to say that the defendants never acquired a source for the needed radioactivity production.

On the order of having a complete nuclear bomb, but no nuclear material to make it a nuclear weapon. You would be in more danger from the conventional explosive degrading than from the bomb as a nuclear weapon.

Those charged with defending public officials want to deter the making of threats, so as to improve the signal/noise ratio.

The goal of those attacking public officials is a signal/noise ratio of exactly 0.0.

Viewing threats from an information science perspective suggests various strategies for either side. (Another dividend of studying information science.)

*They did find a good picture of Nixon for the White House page. Doesn’t look as much like a weasel as he did in real life. Gimp/Photoshop you think?

Becoming a Data Scientist: Advice From My Podcast Guests

Thursday, October 13th, 2016

Becoming a Data Scientist: Advice From My Podcast Guests

Out-gassing from political candidates has kept pushing this summary by Renée Teate back in my queue. Well, fixing that today!

Renée has created more data science resources than I can easily mention, so in addition to this guide, I will mention only two:

Data Science Renee @BecomingDataSci, a Twitter account that will soon break into the rarefied air of > 10,000 followers. Not yet, but you may be the one that puts her over the top!

Looking for women to speak at data science conferences? Renée maintains Women in Data Science, which today has 815 members.

Sorry, three, her blog: Becoming a Data Scientist.

That should keep you busy/distracted until the political noise subsides. 😉

Data Science Toolbox

Saturday, October 1st, 2016

Data Science Toolbox

From the webpage:

Start doing data science in minutes

As a data scientist, you don’t want to waste your time installing software. Our goal is to provide a virtual environment that will enable you to start doing data science in a matter of minutes.

As a teacher, author, or organization, making sure that your students, readers, or members have the same software installed is not straightforward. This open source project will enable you to easily create custom software and data bundles for the Data Science Toolbox.

A virtual environment for data science

The Data Science Toolbox is a virtual environment based on Ubuntu Linux that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run the Data Science Toolbox either locally (using VirtualBox and Vagrant) or in the cloud (using Amazon Web Services).

We aim to offer a virtual environment that contains the software that is most commonly used for data science while keeping it as lean as possible. After a fresh install, the Data Science Toolbox contains the following software:

  • Python, with the following packages: IPython Notebook, NumPy, SciPy, matplotlib, pandas, scikit-learn, and SymPy.
  • R, with the following packages: ggplot2, plyr, dplyr, lubridate, zoo, forecast, and sqldf.
  • dst, a command-line tool for installing additional bundles on the Data Science Toolbox (see next section).

Let us know if you want to see something added to the Data Science Toolbox.

Great resource for doing or teaching data science!

And an example of using a VM to distribute software in a learning environment.

Data Science Series [Starts 9 September 2016 but not for *nix users]

Sunday, September 4th, 2016

The BD2K Guide to the Fundamentals of Data Science Series

From the webpage:


Every Friday beginning September 9, 2016
9am – 10am Pacific Time

Working jointly with the BD2K Centers-Coordination Center (BD2KCCC) and the NIH Office of Data Science, the BD2K Training Coordinating Center (TCC) is spearheading this virtual lecture series on the data science underlying modern biomedical research. Beginning in September 2016, the seminar series will consist of regularly scheduled weekly webinar presentations covering the basics of data management, representation, computation, statistical inference, data modeling, and other topics relevant to “big data” biomedicine. The seminar series will provide essential training suitable for individuals at all levels of the biomedical community. All video presentations from the seminar series will be streamed for live viewing, recorded, and posted online for future viewing and reference. These videos will also be indexed as part of TCC’s Educational Resource Discovery Index (ERuDIte), shared/mirrored with the BD2KCCC, and with other BD2K resources.

View all archived videos on our YouTube channel:
https://www.youtube.com/channel/UCKIDQOa0JcUd3K9C1TS7FLQ


Please join our weekly meetings from your computer, tablet or smartphone.
https://global.gotomeeting.com/join/786506213
You can also dial in using your phone.
United States +1 (872) 240-3311
Access Code: 786-506-213
First GoToMeeting? Try a test session: http://help.citrix.com/getready

Of course, running Ubuntu, when I follow the “First GoToMeeting? Try a test session,” I get this result:


OS not supported

Long-Term Fix: Upgrade your computer.

You or your IT Admin will need to upgrade your computer’s operating system in order to install our desktop software at a later date.

Since this is most likely a lecture format, they could just stream the video and use WebConf as a Q/A channel.

Of course, that would mean losing the various technical difficulties, licensing fees, etc., all of which are distractions from the primary goal of the project.

But who wants that?

PS: Most *nix users won’t be interested except to refer others, but still, over-engineered solutions to simple issues should not be encouraged.

DataScience+ (R Tutorials)

Monday, August 29th, 2016

DataScience+

From the webpage:

We share R tutorials from scientists at academic and scientific institutions with the goal of giving everyone in the world access to free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!

I encountered DataScience+ while running down David Kun’s RDBL post.

As of today, there are 120 tutorials with 451,129 reads.

That’s impressive, whether you are looking for tutorials or looking to post your R tutorial where it will be appreciated.

Enjoy!

The Ethics of Data Analytics

Sunday, August 21st, 2016

The Ethics of Data Analytics by Kaiser Fung.

Twenty-one slides on ethics by Kaiser Fung, author of: Junk Charts (data visualization blog), and Big Data, Plainly Spoken (comments on media use of statistics).

Fung challenges you to reach your own ethical decisions and acknowledges there are a number of guides to such decision making.

Unfortunately, Fung does not include professional responsibility requirements, such as the now outdated Canon 7 of the ABA Model Code of Professional Responsibility:

A Lawyer Should Represent a Client Zealously Within the Bounds of the Law

That canon has a much storied history, which is capably summarized in Whatever Happened To ‘Zealous Advocacy’? by Paul C. Sanders.

In what became known as Queen Caroline’s Case, the House of Lords sought to dissolve the marriage of King George IV

[Portrait: George IV, 1821]

to Queen Caroline

[Portrait: Caroline of Brunswick, 1795]

on the grounds of her adultery, effectively removing her as queen of England.

Queen Caroline was represented by Lord Brougham, who had evidence of a secret prior marriage of King George IV to a Catholic (which was illegal), Mrs Fitzherbert.

[Portrait: Mrs Maria Fitzherbert, wife of George IV]

Brougham’s speech is worth reading in full, but the portion most often cited for zealous defense reads as follows:


I once before took leave to remind your lordships — which was unnecessary, but there are many whom it may be needful to remind — that an advocate, by the sacred duty of his connection with his client, knows, in the discharge of that office, but one person in the world, that client and none other. To save that client by all expedient means — to protect that client at all hazards and costs to all others, and among others to himself — is the highest and most unquestioned of his duties; and he must not regard the alarm, the suffering, the torment, the destruction, which he may bring upon any other; nay, separating even the duties of a patriot from those of an advocate, he must go on reckless of the consequences, if his fate it should unhappily be, to involve his country in confusion for his client.

The name Mrs. Fitzherbert never passes Lord Brougham’s lips, but the House of Lords has been warned that may not remain the case, should it choose to proceed. The House of Lords did grant the divorce but didn’t enforce it. Saving face, one supposes. Queen Caroline died less than a month after the coronation of George IV.

For data analysis, cybersecurity, or any of the other topics I touch on in this blog, I take the last line of Lord Brougham’s speech:

To save that client by all expedient means — to protect that client at all hazards and costs to all others, and among others to himself — is the highest and most unquestioned of his duties; and he must not regard the alarm, the suffering, the torment, the destruction, which he may bring upon any other; nay, separating even the duties of a patriot from those of an advocate, he must go on reckless of the consequences, if his fate it should unhappily be, to involve his country in confusion for his client.

as the height of professionalism.

Post-engagement of course.

If ethics are your concern, have that discussion with your prospective client before you are hired.

Otherwise, clients have goals and the task of a professional is to achieve them. Nothing more.

Contributing to StackOverflow: How Not to be Intimidated

Friday, August 19th, 2016

Contributing to StackOverflow: How Not to be Intimidated by Ksenia Coulter.

From the post:

StackOverflow is an essential resource for programmers. Whether you run into a bizarre and scary error message or you’re blanking on something you should know, StackOverflow comes to the rescue. Its popularity with coders spurred many jokes and memes. (Programming to be Officially Renamed “Googling Stackoverflow,” a satirical headline reads).

(image omitted)

While all of us are users of StackOverflow, contributing to this knowledge base can be very intimidating, especially to beginners or to non-traditional coders who may already feel like they don’t belong. The fact that an invisible barrier exists is a bummer because being an active contributor not only can help with your job search and raise your profile, but also make you a better programmer. Explaining technical concepts in an accessible way is difficult. It is also well-established that teaching something solidifies your knowledge of the subject. Answering StackOverflow questions is great practice.

All of the benefits of being an active member of StackOverflow were apparent to me for a while, but I registered an account only this week. Let me walk you t[h]rough thoughts that hindered me. (Chances are, you’ve had them too!)

I plead guilty to using StackOverflow without contributing back to it.

Another “intimidation” to avoid is thinking you must have the complete and killer answer to any question.

That can and does happen, but don’t wait for a question where you can supply such an answer.

Jump in! (Advice to myself as well as any readers.)

Pandas

Wednesday, August 17th, 2016

Pandas by Reuven M. Lerner.

From the post:

Serious practitioners of data science use the full scientific method, starting with a question and a hypothesis, followed by an exploration of the data to determine whether the hypothesis holds up. But in many cases, such as when you aren’t quite sure what your data contains, it helps to perform some exploratory data analysis—just looking around, trying to see if you can find something.

And, that’s what I’m going to cover here, using tools provided by the amazing Python ecosystem for data science, sometimes known as the SciPy stack. It’s hard to overstate the number of people I’ve met in the past year or two who are learning Python specifically for data science needs. Back when I was analyzing data for my PhD dissertation, just two years ago, I was told that Python wasn’t yet mature enough to do the sorts of things I needed, and that I should use the R language instead. I do have to wonder whether the tables have turned by now; the number of contributors and contributions to the SciPy stack is phenomenal, making it a more compelling platform for data analysis.

In my article “Analyzing Data“, I described how to filter through logfiles, turning them into CSV files containing the information that was of interest. Here, I explain how to import that data into Pandas, which provides an additional layer of flexibility and will let you explore the data in all sorts of ways—including graphically. Although I won’t necessarily reach any amazing conclusions, you’ll at least see how you can import data into Pandas, slice and dice it in various ways, and then produce some basic plots.

Of course, scientific articles are written as though questions drop out of the sky and data is interrogated for the answer.

Aside from being rhetoric to badger others with, does anyone really think that is how science operates in fact?

Whether you have delusions about how science works in fact or not, you will find that Pandas will assist you in exploring data.
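
As a minimal sketch of the workflow Lerner describes, assuming a hypothetical logs.csv with ip, timestamp and bytes columns (the file and column names are my invention, not his):

```python
import pandas as pd

# Load a CSV distilled from logfiles (hypothetical file and column names).
df = pd.read_csv("logs.csv", parse_dates=["timestamp"])

# First look around: shape, types, and a few rows.
print(df.shape)
print(df.dtypes)
print(df.head())

# Summary statistics for a numeric column.
print(df["bytes"].describe())

# Slice and dice: top ten clients by total bytes served.
print(df.groupby("ip")["bytes"].sum().nlargest(10))

# A basic plot: requests per hour.
ax = df.set_index("timestamp").resample("H").size().plot()
ax.set_ylabel("requests per hour")
```

Even this much, load, look, summarize, group, plot, is often enough to surface the questions worth formalizing later.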

Ten Simple Rules for Effective Statistical Practice

Sunday, June 12th, 2016

Ten Simple Rules for Effective Statistical Practice by Robert E. Kass, Brian S. Caffo, Marie Davidian, Xiao-Li Meng, Bin Yu, Nancy Reid (Citation: Kass RE, Caffo BS, Davidian M, Meng X-L, Yu B, Reid N (2016) Ten Simple Rules for Effective Statistical Practice. PLoS Comput Biol 12(6): e1004961. doi:10.1371/journal.pcbi.1004961)

From the post:

Several months ago, Phil Bourne, the initiator and frequent author of the wildly successful and incredibly useful “Ten Simple Rules” series, suggested that some statisticians put together a Ten Simple Rules article related to statistics. (One of the rules for writing a PLOS Ten Simple Rules article is to be Phil Bourne [1]. In lieu of that, we hope effusive praise for Phil will suffice.)

I started to copy out the “ten simple rules,” sans the commentary, but that would be a disservice to my readers.

Nodding past a ten-bullet-point listing isn’t going to make your statistics more effective.

Rewrite the commentary on all ten rules to apply them to each of your projects. Focusing the rules on your own work will result in specific advice and examples for your field.

Who knows? Perhaps you will be writing a ten simple rules article in your specific field, sans Phil Bourne as a co-author. (Do be sure to cite Phil.)

PS: For the curious: Ten Simple Rules for Writing a PLOS Ten Simple Rules Article by Harriet Dashnow, Andrew Lonsdale, Philip E. Bourne.

Reboot Your $100+ Million F-35 Stealth Jet Every 10 Hours Instead of 4 (TM Fusion)

Wednesday, April 27th, 2016

Pentagon identifies cause of F-35 radar software issue

From the post:

The Pentagon has found the root cause of stability issues with the radar software being tested for the F-35 stealth fighter jet made by Lockheed Martin Corp, U.S. Defense Acquisition Chief Frank Kendall told a congressional hearing on Tuesday.

Last month the Pentagon said the software instability issue meant the sensors had to be restarted once every four hours of flying.

Kendall and Air Force Lieutenant General Christopher Bogdan, the program executive officer for the F-35, told a Senate Armed Service Committee hearing in written testimony that the cause of the problem was the timing of “software messages from the sensors to the main F-35 computer.” They added that stability issues had improved to where the sensors only needed to be restarted after more than 10 hours.

“We are cautiously optimistic that these fixes will resolve the current stability problems, but are waiting to see how the software performs in an operational test environment,” the officials said in a written statement.
… (emphasis added)

A $100+ million plane that requires rebooting every ten hours? I’m not a pilot, but that sounds like a real weakness.

The precise nature of the software glitch isn’t described, but you can guess one of the problems from Lockheed Martin’s Software You Wish You Had: Inside the F-35 Supercomputer:


The human brain relies on five senses—sight, smell, taste, touch and hearing—to provide the information it needs to analyze and understand the surrounding environment.

Similarly, the F-35 relies on five types of sensors: Electronic Warfare (EW), Radar, Communication, Navigation and Identification (CNI), Electro-Optical Targeting System (EOTS) and the Distributed Aperture System (DAS). The F-35 “brain”—the process that combines this stellar amount of information into an integrated picture of the environment—is known as sensor fusion.

At any given moment, fusion processes large amounts of data from sensors around the aircraft—plus additional information from datalinks with other in-air F-35s—and combines them into a centralized view of activity in the jet’s environment, displayed to the pilot.

In everyday life, you can imagine how useful this software might be—like going out for a jog in your neighborhood and picking up on real-time information about obstacles that lie ahead, changes in traffic patterns that may affect your route, and whether or not you are likely to pass by a friend near the local park.

F-35 fusion not only combines data, but figures out what additional information is needed and automatically tasks sensors to gather it—without the pilot ever having to ask.
… (emphasis added)

The fusion of data from other in-air F-35s is a classic topic map data-merging problem.

You have one subject, say an anti-aircraft missile site, seen from up to four (in the F-35 specs) F-35s. As is the habit of most physical objects, it has only one geographic location, but the F-35’s fusion computer doesn’t come up with that answer.

Kris Osborn writes in Software Glitch Causes F-35 to Incorrectly Detect Targets in Formation:


“When you have two, three or four F-35s looking at the same threat, they don’t all see it exactly the same because of the angles that they are looking at and what their sensors pick up,” Bogdan told reporters Tuesday. “When there is a slight difference in what those four airplanes might be seeing, the fusion model can’t decide if it’s one threat or more than one threat. If two airplanes are looking at the same thing, they see it slightly differently because of the physics of it.”

For example, if a group of F-35s detect a single ground threat such as anti-aircraft weaponry, the sensors on the planes may have trouble distinguishing whether it was an isolated threat or several objects, Bogdan explained.

As a result, F-35 engineers are working with Navy experts and academics from Johns Hopkins Applied Physics Laboratory to adjust the sensitivity of the fusion algorithms for the JSF’s 2B software package so that groups of planes can correctly identify or discern threats.

“What we want to have happen is no matter which airplane is picking up the threat – whatever the angles or the sensors – they correctly identify a single threat and then pass that information to all four airplanes so that all four airplanes are looking at the same threat at the same place,” Bogdan said.

Unless Bogdan is using “sensitivity” in a very unusual sense, that doesn’t sound like the issue with the fusion computer of the F-35.

Rather, the problem is that the fusion computer has no explicit doctrine of subject identity to use when merging data from different F-35s, whether two, three, four or even more of them. The display of tactical information should be seamless to the pilot, without human intervention.

I’m sure members of Congress were impressed with General Bogdan using words like “angles” and “physics,” but the underlying subject identity issue isn’t hard to address.

At issue is the location of a potential target on the ground. Within some pre-defined metric, anything located within a given area is the “same target.”

The Air Force has already paid for this type of analysis and the mathematics of what is called Circular Error Probability (CEP) has been published in Use of Circular Error Probability in Target Detection by William Nelson (1988).

You need to use the “current” location of the detecting aircraft, allow for inaccuracy in estimating the location of the target, and so on, but once you call out subject identity as an issue, it’s a matter of choosing how accurate you want the subject identification to be.
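
To make that concrete, here is a minimal sketch of CEP-style merging, assuming each aircraft reports an estimated target position plus an error radius on a local flat grid (the data layout, radii and greedy clustering choice are my illustrative assumptions, not anything from Nelson’s paper or the F-35 program):

```python
import math

def same_target(a, b):
    """Treat two reports as the same subject if their error circles overlap.

    Each report is (x, y, cep_radius) in meters on a local flat grid;
    a real system would use geodetic coordinates and proper CEP statistics.
    """
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= a[2] + b[2]

def merge_reports(reports):
    """Greedy single-link clustering: each cluster becomes one displayed target."""
    clusters = []
    for r in reports:
        for cluster in clusters:
            if any(same_target(r, member) for member in cluster):
                cluster.append(r)
                break
        else:
            clusters.append([r])
    # One averaged position per cluster for the pilot's display.
    return [(sum(m[0] for m in c) / len(c), sum(m[1] for m in c) / len(c))
            for c in clusters]

# Four aircraft report what may be the same SAM site, from different angles.
reports = [(100.0, 200.0, 30.0), (112.0, 195.0, 30.0),
           (105.0, 210.0, 30.0), (98.0, 204.0, 30.0)]
print(merge_reports(reports))  # one merged target, not four
```

The identity test lives in one explicit function, which is the point: tighten or loosen same_target and you are choosing how accurate the subject identification should be, rather than leaving that choice buried in fusion code.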

Before you forward this to Gen. Bogdan as a way forward on the fusion computer, realize that CEP is only one aspect of target identification. But calling out the subject identity of targets explicitly enables reliable presentation of single or multiple targets to pilots.

Your call: confusing displays or a reliable, useful display.

PS: I assume military subject identity systems would not be running XTM software. Same principles apply even if the syntax is different.

Women in Data Science (~632) – Twitter List

Monday, April 25th, 2016

Data Science Renee has a Twitter list of approximately 632 women in data science.

I say “approximately” because when I first saw her post about the list it had 630 members. When I looked this AM, it had 632 members. By the time you look, that number will be different again.

If you are making a conscious effort to seek a diversity of speakers for your next data science conference, it should be on your list of sources.

Enjoy!

4330 Data Scientists and No Data Science Renee

Monday, April 11th, 2016

After I posted 1880 Big Data Influencers in CSV File, I got a tweet from Data Science Renee pointing out that her name wasn’t in the list.

Renee does a lot more on “data science” and not so much on “big data,” which sounded like a plausible explanation.

Even if “plausible,” I wanted to know if there was some issue with my scraping of Right Relevance.

Knowing that Renee’s influence score for “data science” is 81, I set the query to scrape the list between 65 and 98, just to account for any oddities in being listed.

The search returned 1832 entries. Search for Renee, nada, no got. Here’s the 1832-data-science-list.

In an effort to scrape all the listings, which should be 10,375 influencers, I set the page delay up to Ted Cruz reading speed. Ten entries every 72,000 milliseconds. 😉

That resulted in 4330-data-science-list.

No joy, no Renee!

It isn’t clear to me why my scraping fails before recovering the entire data set, but in any reasonable sort order a listing of roughly 10K data scientists should have Renee in the first 100 entries, much less the first 1,000 or even the first 4K.
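
For the curious, here is a minimal sketch of the paced, checkpointed scrape I was attempting, against a purely hypothetical endpoint (Right Relevance’s actual pages and markup are not reflected here; every URL, parameter and field name is an assumption):

```python
import csv
import time

import requests

BASE = "https://example.com/influencers"  # hypothetical endpoint, not Right Relevance's
DELAY = 72.0  # seconds per page of ten entries, per the pacing above

def scrape(pages, out_path="influencers.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "score"])
        for page in range(1, pages + 1):
            resp = requests.get(BASE, params={"page": page}, timeout=30)
            resp.raise_for_status()
            for row in resp.json():  # assumes each page returns a JSON list of records
                writer.writerow([row["name"], row["score"]])
            print(f"page {page} done")  # checkpoint so a stall is visible
            time.sleep(DELAY)  # pace requests so the server isn't hammered

scrape(pages=1038)  # 10,375 influencers at ten per page
```

Even a scrape paced this politely stopped well short of the full listing, which points at the site rather than the scraper.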

Something is clearly amiss with the data but what?

Check me on the first ten entries with “data science” as the search term, but I find:

  • Hilary Mason
  • Kirk Borne – no data science
  • Nathan Yau
  • Gregory Piatetsky – no data science
  • Randy Olson
  • Jeff Hammerbacher – no data science
  • Chris Dixon @cdixon – no data science
  • dj patil @dpatil
  • Doug Laney – no data science
  • Big Data Science – no data science

The notation “no data science” means that entry does not have a label for data science. Odd, considering that my search was specifically for influencers in “data science.” The same result obtains if you choose one of the labels instead of searching. (I tried.)

Clearly all of these people could be listed for “data science,” but if I am searching for that specific category, why is the label missing from six of the first ten “hits”?

As far as finding Data Science Renee goes, I can help you to a degree. Follow @BecomingDataSci, or @DataSciGuide, @DataSciLearning and @NewDataSciJobs. Visit her website: http://t.co/zv9NrlxdHO. Podcasts, interviews, posts, just a hive of activity.

On the mysteries of Right Relevance and its data I’m not sure what to say. I posted feedback a week ago mentioning the issue with scraping and ordering, but haven’t heard back.

The site has a very clever idea but looking in from the outside with a sample size of 1, I’m not impressed with its delivery on that idea.

Are there issues with Web Scraper that I don’t know about?

If you have contacts with Right Relevance could you gently ping them for me? Thanks!

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy

Thursday, April 7th, 2016

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil.

[Book cover: Weapons of Math Destruction]

From the description at Amazon:

We live in the age of the algorithm. Increasingly, the decisions that affect our lives—where we go to school, whether we get a car loan, how much we pay for health insurance—are being made not by humans, but by mathematical models. In theory, this should lead to greater fairness: Everyone is judged according to the same rules, and bias is eliminated. But as Cathy O’Neil reveals in this shocking book, the opposite is true. The models being used today are opaque, unregulated, and uncontestable, even when they’re wrong. Most troubling, they reinforce discrimination: If a poor student can’t get a loan because a lending model deems him too risky (by virtue of his race or neighborhood), he’s then cut off from the kind of education that could pull him out of poverty, and a vicious spiral ensues. Models are propping up the lucky and punishing the downtrodden, creating a “toxic cocktail for democracy.” Welcome to the dark side of Big Data.

Tracing the arc of a person’s life, from college to retirement, O’Neil exposes the black box models that shape our future, both as individuals and as a society. Models that score teachers and students, sort resumes, grant (or deny) loans, evaluate workers, target voters, set parole, and monitor our health—all have pernicious feedback loops. They don’t simply describe reality, as proponents claim, they change reality, by expanding or limiting the opportunities people have. O’Neil calls on modelers to take more responsibility for how their algorithms are being used. But in the end, it’s up to us to become more savvy about the models that govern our lives. This important book empowers us to ask the tough questions, uncover the truth, and demand change.

Even if you have qualms about Cathy’s position, you have to admit that is a great book cover!

When I was in law school, I had F. Hodge O’Neal for corporation law. He is the O’Neal in O’Neal and Thompson’s Oppression of Minority Shareholders and LLC Members, Rev. 2d.

The publisher’s blurb is rather generous in saying:

Cited extensively, O’Neal and Thompson’s Oppression of Minority Shareholders and LLC Members shows how to take appropriate steps to protect minority shareholder interests using remedies, tactics, and maneuvers sanctioned by federal law. It clarifies the underlying cause of squeeze-outs and suggests proven arrangements for avoiding them.

You could read Oppression of Minority Shareholders and LLC Members that way, but when corporate law is taught with war stories from the antics of the robber barons forward, you get the impression that isn’t why people read it.

Not that I doubt Cathy’s sincerity; on the contrary, I think she is very sincere about her warnings.

Where I disagree with Cathy is in thinking that democracy is under greater attack now, or that inequality is any greater a problem, than before.

If you read The Half Has Never Been Told: Slavery and the Making of American Capitalism by Edward E. Baptist:

[Book cover: The Half Has Never Been Told]

carefully, you will leave it with deep uncertainty about the relationship of American government (federal, state and local) to any recognizable concept of democracy, or for that matter to the “equality” of its citizens.

Also unlike Cathy, I don’t expect that shaming people is going to result in “better” or more “honest” data analysis.

What you can do is arm yourself to do battle on behalf of your “side,” both in terms of exposing data manipulation by others and concealing your own.

Perhaps there is room in the marketplace for a book titled: Suppression of Unfavorable Data. More than hiding data, what data to not collect? How to explain non-collection/loss? How to collect data in the least useful ways?

You would have to write it as a guide to avoiding these very bad practices, but everyone would know what you meant. It could be the next business management best seller.

Avoid “Complete,” “Data Science,” in Titles

Tuesday, March 1st, 2016

A Complete Tutorial to learn Data Science in R from Scratch by Manish Saraswat.

This is a useful tutorial, but it:

  1. Is not complete
  2. Does not cover all of data science

But this tutorial was tweeted and has been retweeted at least seven times that I know of, possibly more.

Using vague and/or inaccurate terms in titles makes tutorials more difficult to find.

That alone should be reason enough to use better titles.

A more accurate title would be:

R for Predictive Modeling, From Installation to Modeling

That captures the use of R, that the main focus is on predictive modeling and that it will start with the installation of R and proceed to modeling.

Not a word said about all of “data science,” or being “complete,” whatever that means in a discipline with daily advances on multiple fronts.

Just a little effort on the part of authors could improve the lives of all of us desperately searching to find their work.

Yes?

Streaming 101 & 102 – [Stream Processing with Batch Identities?]

Sunday, February 21st, 2016

The world beyond batch: Streaming 101 by Tyler Akidau.

From part 1:

Streaming data processing is a big deal in big data these days, and for good reasons. Amongst them:

  • Businesses crave ever more timely data, and switching to streaming is a good way to achieve lower latency.
  • The massive, unbounded data sets that are increasingly common in modern business are more easily tamed using a system designed for such never-ending volumes of data.
  • Processing data as they arrive spreads workloads out more evenly over time, yielding more consistent and predictable consumption of resources.

Despite this business-driven surge of interest in streaming, the majority of streaming systems in existence remain relatively immature compared to their batch brethren, which has resulted in a lot of exciting, active development in the space recently.

Since I have quite a bit to cover, I’ll be splitting this across two separate posts:

  1. Streaming 101: This first post will cover some basic background information and clarify some terminology before diving into details about time domains and a high-level overview of common approaches to data processing, both batch and streaming.
  2. The Dataflow Model: The second post will consist primarily of a whirlwind tour of the unified batch + streaming model used by Cloud Dataflow, facilitated by a concrete example applied across a diverse set of use cases. After that, I’ll conclude with a brief semantic comparison of existing batch and streaming systems.

The world beyond batch: Streaming 102

In this post, I want to focus further on the data-processing patterns from last time, but in more detail, and within the context of concrete examples. The arc of this post will traverse two major sections:

  • Streaming 101 Redux: A brief stroll back through the concepts introduced in Streaming 101, with the addition of a running example to highlight the points being made.
  • Streaming 102: The companion piece to Streaming 101, detailing additional concepts that are important when dealing with unbounded data, with continued use of the concrete example as a vehicle for explaining them.

By the time we’re finished, we’ll have covered what I consider to be the core set of principles and concepts required for robust out-of-order data processing; these are the tools for reasoning about time that truly get you beyond classic batch processing.

You should also catch the paper by Tyler and others, The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing.

Cloud Dataflow, known as Beam at the Apache incubator, offers a variety of operations for combining and/or merging collections of values in data.

I mention that because I would hate to hear of you doing stream processing with batch identities. You know, where you decide on some fixed set of terms and those are applied across dynamic data.

Hmmm, fixed terms applied to dynamic data. Doesn’t even sound right, does it?

Sometimes, fixed terms (read schema, ontology) are fine, but in linguistically diverse environments (read real life), that isn’t always adequate.

Enjoy the benefits of stream processing but don’t artificially limit them with batch identities.
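
As a toy illustration of the difference, here is a minimal sketch assuming a stream of place-name events and a mutable synonym table (all names here are illustrative, not from Dataflow or Beam):

```python
from collections import defaultdict

# Mutable synonym table: surface terms -> canonical subject.
# A "batch identity" system would freeze this before the stream starts.
synonyms = {"NYC": "New York", "New York City": "New York"}

def resolve(term):
    return synonyms.get(term, term)

def process(stream):
    """Count events per subject, resolving identity record by record."""
    counts = defaultdict(int)
    for record in stream:
        counts[resolve(record)] += 1
        yield dict(counts)

results = process(iter(["NYC", "New York", "Big Apple", "New York City"]))
print(next(results))                  # {'New York': 1}
print(next(results))                  # {'New York': 2}
synonyms["Big Apple"] = "New York"    # a new identity learned mid-stream
print(next(results))                  # {'New York': 3}
print(next(results))                  # {'New York': 4}
```

Because identity resolution is a function consulted per record, the vocabulary can grow while the stream runs; a schema fixed before the first record arrives would have counted “Big Apple” as a separate subject forever.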

I first saw this in a tweet by Bob DuCharme.

People NOT Technology Produce Data ROI

Monday, February 15th, 2016

Too many tools… not enough carpenters! by Nicholas Hartman.

From the webpage:

Don’t let your enterprise make the expensive mistake of thinking that buying tons of proprietary tools will solve your data analytics challenges.

tl;dr = The enterprise needs to invest in core data science skills, not proprietary tools.

Most of the world’s largest corporations are flush with data, but frequently still struggle to achieve the vast performance increases promised by the hype around so called “big data.” It’s not that the excitement around the potential of harvesting all that data was unwarranted, but rather these companies are finding that translating data into information and ultimately tangible value can be hard… really hard.

In your typical new tech-based startup the entire computing ecosystem was likely built from day one around the need to generate, store, analyze and create value from data. That ecosystem was also likely backed from day one with a team of qualified data scientists. Such ecosystems spawned a wave of new data science technologies that have since been productized into tools for sale. Backed by mind-blowingly large sums of VC cash many of these tools have set their eyes on the large enterprise market. A nice landscape of such tools was recently prepared by Matt Turck of FirstMark Capital (host of Data Driven NYC, one of the best data science meetups around).

Consumers stopped paying money for software a long time ago (they now mostly let the advertisers pay for the product). If you want to make serious money in pure software these days you have to sell to the enterprise. Large corporations still spend billions and billions every year on software and data science is one of the hottest areas in tech right now, so selling software for crunching data should be a no-brainer! Not so fast.

The problem is, the enterprise data environment is often nothing like that found within your typical 3-year-old startup. Data can be strewn across hundreds or thousands of systems that don’t talk to each other. Devices like mainframes are still common. Vast quantities of data are generated and stored within these companies, but until recently nobody ever really envisioned ever accessing — let alone analyzing — these archived records. Often, it’s not initially even clear how all the data generated by these systems directly relates to a large blue chip’s core business operations. It does, but a lack of in-house data scientists means that nobody is entirely even sure what data is really there or how it can be leveraged.

I would delete “proprietary” from the above because non-proprietary tools create data problems just as easily.

Thus I would rewrite the second quote as:

Tools won’t replace skilled talent, and skilled talent doesn’t typically need many particular tools.

I substituted “particular” to avoid religious questions about specific non-proprietary tools.

Understanding data, recognizing where data integration is profitable and where it is a dead loss, creating tests to measure potential ROI, etc., are all tasks of a human data analyst and not any proprietary or non-proprietary tool.

The notion that all enterprise data has some intrinsic value, extractable if only the data were accessible, is an article of religious faith, not business ROI.

If you want business ROI from data, start with human analysts and not the latest buzzwords in technological tools.