Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 18, 2015

Buzzfeed uses R for Data Journalism

Filed under: Journalism,News,R,Reporting — Patrick Durusau @ 11:36 am

Buzzfeed uses R for Data Journalism by David Smith.

From the post:

Buzzfeed isn't just listicles and cat videos these days. Science journalist Peter Aldhous recently joined Buzzfeed's editorial team, after stints at Nature, Science and New Scientist magazines. He brings with him his data journalism expertise and R programming skills to tell compelling stories with data on the site. His stories, like this one on the rates of terrorism incidents in the USA, often include animated maps or interactive charts created with R. 

Data journalists and would-be data journalists should be following the use of R and Python at Buzzfeed.

You don’t have to read Buzzfeed (I have difficulty with its concept of “news”), as David points out a way to follow all the Buzzfeed projects that make it to GitHub.

See David’s post for other great links.

Enjoy!

‘Linked data can’t be your goal. Accomplish something’

Filed under: Linked Data,Marketing,Semantic Web — Patrick Durusau @ 11:08 am

Tim Strehle points to his post: Jonathan Rochkind: Linked Data Caution, which is a collection of quotes from Linked Data Caution (Jonathan Rochkind).

In the process, Tim creates his own quote, inspired by Rochkind:

‘Linked data can’t be your goal. Accomplish something’

Which is easy to generalize to:

‘***** can’t be your goal. Accomplish something’

Whether your hobby horse is linked data, graphs, NoSQL, big data, or even topic maps, technological artifacts are just and only that, artifacts.

Unless and until such artifacts accomplish something, they are curios, relics venerated by pockets of the faithful.

Perhaps marketers in 2016 should be told:

Skip the potential benefits of your technology. Show me what it has accomplished (past tense) for users similar to me.

With that premise, you could weed through four or five vendors in a morning. 😉

December 17, 2015

Why Big Data Fails to Detect Terrorists

Filed under: Astroinformatics,BigData,Novelty,Outlier Detection,Security — Patrick Durusau @ 10:15 pm

Kirk Borne tweeted a link to his presentation, Big Data Science for Astronomy & Space and more specifically to slides 24 and 25 on novelty detection, surprise discovery.

Casting about for more resources to point out, I found Novelty Detection in Learning Systems by Stephen Marsland.

The abstract for Stephen’s paper:

Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen. It is a useful technique in cases where an important class of data is under-represented in the training set. This means that the performance of the network will be poor for those classes. In some circumstances, such as medical data and fault detection, it is often precisely the class that is under-represented in the data, the disease or potential fault, that the network should detect. In novelty detection systems the network is trained only on the negative examples where that class is not present, and then detects inputs that do not fit into the model that it has acquired, that is, members of the novel class.

This paper reviews the literature on novelty detection in neural networks and other machine learning techniques, as well as providing brief overviews of the related topics of statistical outlier detection and novelty detection in biological organisms.

The rest of the paper is very good and worth your time to read but we need not venture beyond the abstract to demonstrate why big data cannot, by definition, detect terrorists.

The root of the terrorist detection problem summarized in the first sentence:

Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen.

So, what are the inputs of a terrorist that differ from the inputs usually seen?

That’s a simple enough question.

Previously committing a terrorist suicide attack is a definite tell but it isn’t a useful one.

Obviously the TSA doesn’t know, because it has never caught a terrorist, despite its profiling and wannabe psychics watching travelers.

You can churn big data 24×7 but if you don’t have a baseline of expected inputs, no input is going to stand out from the others.
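
To make that concrete, here is a minimal novelty-detection sketch in Python (assuming scikit-learn and toy data, nothing from Marsland's paper). The model is fit only on "normal" inputs, so the only thing it can ever flag is a point that deviates from that baseline:

# A minimal novelty-detection sketch (assumes scikit-learn; toy data only).
# The model sees only "normal" inputs, then flags anything that deviates.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Baseline: inputs "usually seen" (say, ordinary purchase amounts and call counts).
normal = rng.normal(loc=[50.0, 10.0], scale=[5.0, 2.0], size=(1000, 2))

# New observations to score -- two ordinary, one unusual.
new = np.array([[52.0, 11.0], [48.0, 9.0], [300.0, 90.0]])

model = OneClassSVM(nu=0.05).fit(normal)
print(model.predict(new))  # +1 = fits the baseline, -1 = novel

The third point gets flagged only because it differs from the baseline the model was given. An attacker whose measurable inputs look like everyone else's never earns a -1, which is the point of the post.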

The San Bernardino attackers were not detected because their inputs didn't vary enough for the couple to stand out.

Even if they had been selected for close and unconstitutional monitoring of their e-traffic, bank accounts, social media, phone calls, etc., there is no evidence that current data techniques would have detected them.

Before you invest or continue paying for big data to detect terrorists, ask the simple questions:

What is your baseline from which variance will signal a terrorist?

How often has it worked?

Once you have a dead terrorist, you can start from the dead terrorist and search your big data, but that’s an entirely different starting point.

Given the weeks, months and years of finger pointing following a terrorist attack, speed really isn’t an issue.

#IntelGroup

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 5:05 pm

#IntelGroup

From the about page:

The control of information is something the elite always does, particularly in a despotic form of government. Information, knowledge, is power. If you can control information, you can control people.” -Tom Clancy

Intelgroup was was created to first and foremost help amplify and spread the message of Anonymous wide and far. Like Anonymous Intelgroup started off as an idea and thru hard work and a lot of lulz We have become a well respected source for global information. Known for our exclusive one on one interviews with Acitivists , Hackers and Victims of Police Brutality , as well as in depth looks at Anonymous Operations. We Constantly keep you intellectually involved in the movement. And we continue to evolve as often as Information evolves . We Are Not Mainstream Media . We are “Think For Yourself Media” Welcome To Our Page . Welcome to Intelgroup.

Follow Us On Twitter : @AnonIntelGroup

Like Our Page On Facebook : https://www.facebook.com/IntelGroup

We Are Also On Instagram : @intelgroup

Check out our YouTube Channel : http://youtube.com/anonintelgroup1

I’m all for more and not less information about government and its activities.

And everyone has to pick the battles they want to fight.

What puzzles me is the disparity in reports of government insecurity, say the Office of Personnel Management, and the silence on the full 6,000+ page Senate Report on torture.

The most recent figure I could find for the Senate is 6,097 people on staff, as of 2009 (Vital Statistics on Congress).

Out of over 6,000 potential sources, none of the news organizations, hacktivists, etc. was able to obtain a copy of the full report?

That seems too remarkable to credit.

Even more remarkable is the near perfect security of all members of Congress, federal agencies and PACs.

I can’t imagine it is a lack of malfeasance and corruption that accounts for the lack of leaks.

What’s your explanation for the lack of leaks?

My Bad – You Are Not! 747 Edits Away From Using XML Tools

Filed under: XPath,XQuery,XSLT — Patrick Durusau @ 4:11 pm

The original, unedited post is below, but in response to comments I checked the XQuery, XPath, and XSLT and XQuery Serialization 3.1 files in Chrome (CTRL-U) before saving them.

All the empty elements were properly closed.

I then saved the files and re-opened in Emacs, to discover that Chrome had stripped the “/” from the empty elements, which then caused BaseX to complain. It was an accurate complaint but the files I was tossing against BaseX were not the files as published by the W3C.

So now I need to file a bug report on Chrome, Version 47.0.2526.80 (64-bit) on Ubuntu, for mangling closed empty elements.


You could tell in XQuery, XPath, XSLT and XQuery Serialization 3.1, New Candidate Recommendations! that I was really excited to see the new drafts hit the street.

Me and my big mouth.

I grabbed copies of all three and tossed the XQuery draft against an XQuery that creates a list of all the paths in it. Simple enough.

The results weren’t.

Here is the first error message:

[FODC0002] “file:/home/patrick/working/w3c/XQuery3.1.html” (Line 68): The element type “link” must be terminated by the matching end-tag “</link>”.

Ouch!

I corrected that and running the query a second time I got:

[FODC0002] “file:/home/patrick/working/w3c/XQuery3.1.html” (Line 68): The element type “meta” must be terminated by the matching end-tag “</meta>”.

The <meta> elements appear on lines three and four.

On the third try:

[FODC0002] “file:/home/patrick/working/w3c/XQuery3.1.html” (Line 69): The element type “img” must be terminated by the matching end-tag “</img>”.

There are 3 <img> elements that are not closed.

I’m getting fairly annoyed at this point.

Fourth try:

[FODC0002] “file:/home/patrick/working/w3c/XQuery3.1.html” (Line 78): The element type “br” must be terminated by the matching end-tag “</br>”.

Of course at this point I revert to grep and discover there are 353 <br> elements that are not closed.

Sigh, nothing to do but correct and soldier on.

Fifth attempt.

[FODC0002] “file:/home/patrick/working/w3c/XQuery3.1.html” (Line 17618): The element type “hr” must be terminated by the matching end-tag “</hr>”.

There are 2 <hr> elements that are not closed.

A total of 361 edits in order to use XML-based tools with the most recent XQuery 3.1 Candidate draft.

The most recent XPath 3.1 has 238 empty elements that aren’t closed (same elements as XQuery 3.1).

The XSLT and XQuery Serialization 3.1 draft has 149 empty elements that aren’t closed, same as the other but with the addition of four <col> elements that weren’t closed.

Grand total: 747 edits in order to use XML tools.

Not an editorial but a production problem. A rather severe one it seems to me.

Anyone who wants to use XML tools on these drafts will have to perform the same edits.
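
If you would rather not make those edits by hand, a short script can close the void elements before an XML parser ever sees the file. Here is a sketch in Python, assuming the straightforward markup found in these drafts (no stray ">" inside the attribute values of the affected elements); it is not a general HTML-to-XHTML converter:

# Sketch: close HTML void elements (link, meta, img, br, hr, col) so that
# XML tools will accept the file. Assumes the simple markup in the W3C
# drafts; not a general HTML-to-XHTML converter.
import re
import sys

VOID_ELEMENTS = ("link", "meta", "img", "br", "hr", "col")

def close_void_elements(html: str) -> str:
    # Match e.g. <br> or <img src="...">, but skip tags already ending in "/>".
    pattern = re.compile(
        r"<(%s)\b([^>]*?)(?<!/)>" % "|".join(VOID_ELEMENTS),
        re.IGNORECASE,
    )
    return pattern.sub(r"<\1\2/>", html)

if __name__ == "__main__":
    text = open(sys.argv[1], encoding="utf-8").read()
    open(sys.argv[2], "w", encoding="utf-8").write(close_void_elements(text))

Something like python close_voids.py XQuery3.1.html XQuery3.1-fixed.html (the script name and usage are illustrative) should then load without the end-tag complaints described above.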

What’s New for 2016 MeSH

Filed under: MeSH,Thesaurus,Topic Maps,Vocabularies — Patrick Durusau @ 3:41 pm

What’s New for 2016 MeSH by Jacque-Lynne Schulman.

From the post:

MeSH is the National Library of Medicine controlled vocabulary thesaurus which is updated annually. NLM uses the MeSH thesaurus to index articles from thousands of biomedical journals for the MEDLINE/PubMed database and for the cataloging of books, documents, and audiovisuals acquired by the Library.

MeSH experts/users will need to absorb the details but some of the changes include:

Overview of Vocabulary Development and Changes for 2016 MeSH

  • 438 Descriptors added
  • 17 Descriptor terms replaced with more up-to-date terminology
  • 9 Descriptors deleted
  • 1 Qualifier (Subheading) deleted

and,

MeSH Tree Changes: Uncle vs. Nephew Project

In the past, MeSH headings were loosely organized in trees and could appear in multiple locations depending upon the importance and specificity. In some cases the heading would appear two or more times in the same tree at higher and lower levels. This arrangement led to some headings appearing as a sibling (uncle) next to the heading under which they were treed as a nephew. In other cases a heading was included at a top level so it could be seen more readily in printed material. We reviewed these headings in MeSH and removed either the Uncle or Nephew depending upon the judgement of our Internal and External reviewers. There were over 1,000 tree changes resulting from this work, many of which will affect search retrieval in MEDLINE/PubMed and the NLM Catalog.

and,

MeSH Scope Notes

MeSH had a policy that each descriptor should have a scope note regardless of how obvious its meaning. There were many legacy headings that were created without scope notes before this rule came into effect. This year we initiated a project to write scope notes for all existing headings. Thus far 481 scope notes to MeSH were added and the project continues for 2017 MeSH.

Echoes of Heraclitus:

It is not possible to step twice into the same river according to Heraclitus, or to come into contact twice with a mortal being in the same state. (Plutarch, quoting Heraclitus)

Semantics and the words we use to invoke them are always in a state of flux. Sometimes more, sometimes less.

The lesson here is that anyone who says you can have a fixed and stable vocabulary is not only selling something, they are selling you a broken something. If not broken on the day you start to use it, then fairly soon thereafter.

It took time for me to come to the realization that the same is true about information systems that attempt to capture changing semantics at any given point.

Topic maps in the sense of ISO 13250-2, for example, can capture and map changing semantics, but if and only if you are willing to accept its data model.

Which is good as far as it goes, but what if I want a different data model? That is, to still capture changing semantics and map between them, but with a different data model.

We may have a use case to map back to ISO 13250-2 or to some other data model. The point being that we should not privilege any data model or syntax in advance, at least not absolutely.

Not only do communities change but their preferences for technologies change as well. It seems just a bit odd to be selling an approach on the basis of capturing change only to build a dike to prevent change in your implementation.

Yes?

XQuery, XPath, XSLT and XQuery Serialization 3.1, New Candidate Recommendations!

Filed under: W3C,XPath,XQuery,XSLT — Patrick Durusau @ 11:10 am

As I forecast 😉 earlier this week, new Candidate Recommendations for:

XQuery 3.1: An XML Query Language

XML Path Language (XPath) 3.1

XSLT and XQuery Serialization 3.1

have hit the streets for your review and comments!

Comments due by 2016-01-31.

That’s forty-five days, minus the ones spent with drugs/sex/rock-n-roll over the holidays and recovering from same.

Say something shy of forty-four actual working days (my endurance isn’t what it once was) for the review process.

What tools, techniques are you going to use to review this latest set of candidates?

BTW, some people review software and check only fixes; for standards I start at the beginning, go to the end, then stop. (Or the reverse for backward proofing.)

My estimates on days spent with drugs/sex/rock-n-roll are approximate only and your experience may vary.

US Court Opinion Links!

Filed under: Law,Law - Sources — Patrick Durusau @ 10:48 am

I was reading an account of the opinion in Authors Guild vs. Google this morning, but the link pointed to a paywall site. Sigh, court opinions are in the public domain, so why not point to a public copy?

Let me make that easier for members of the media, at least for the Supreme Court and the Circuit Courts of Appeal:

You can get a more complete list, which includes District and Bankruptcy courts, from Court Website Links.

The only value-add that I offer is a direct link to finding opinions.

One click on “opinions” plus search gives you a public link to the public opinion.

Is that too much to ask?

PS: Websites of the Circuit Courts vary widely. District judges consider themselves just short of demi-gods so you can imagine the conversations with Circuit Courts judges on web options. Not quite fire by night and clouds by day but almost.

December 16, 2015

Vulnerable Printers on the Net

Filed under: Cybersecurity,Security — Patrick Durusau @ 10:42 pm

Bob Gourley tweeted:

Using Shodan I found 40,471 vulnerable printers connected to the Net. (Of the 5 at GMU 1 is low on yellow toner).

[Image: screenshot of Bob Gourley's Shodan search results]

In case you don’t know Shodan, check it out! Search for “things” on the Insecure Internet of Things (IIoT).

I guess the question for 2016 is going to be: Are you a White or Black Hat on the Insecure Internet of Things (IIoT)?

O’Reilly Web Design Site

Filed under: CSS3,Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 9:37 pm

O’Reilly Web Design Site

O’Reilly has launched a new website devoted to website design.

Organized by paths, what I have encountered so far is “free” for the price of registration.

I have long ignored web design much the same way others ignore the need for documentation. Perhaps there is more similarity there than I would care to admit.

It’s never too late to learn so I am going to start pursuing some of the paths at the O’Reilly Web Design site.

Suggestions or comments concerning your experience with this site are welcome.

Enjoy!

Privacy Alert! – CISA By Friday (18 December 2015) Time to Raise Hell!

Filed under: Cybersecurity,Government,Privacy,Security — Patrick Durusau @ 9:19 pm

Lawmakers Have Snuck CISA Into a Bill That Is Guaranteed to Become a Law by Jason Koebler.

From the post:

To anyone who has protested the sweeping, vague, and privacy-killing iterations of the Cybersecurity Information Sharing and Protection Act or the Cybersecurity Information Sharing Act over the last several years, sorry, lawmakers have heard you, and they have ignored you.

That sounds bleak, but lawmakers have stripped the very bad CISA bill of almost all of its privacy protections and have inserted the full text of it into a bill that is essentially guaranteed to be passed and will certainly not be vetoed by President Obama.

CISA allows private companies to pass your personal information and online goings-on to the federal government and local law enforcement if it suspects a “cybersecurity threat,” a term so broadly defined that it can apply to “anomalous patterns of communication” and can be used to gather information about just about any crime, cyber or not.

At 2 AM Wednesday morning, Speaker of the House Paul Ryan unveiled a 2000-page budget bill that will fund the federal government well into next year. The omnibus spending bill, as it’s usually referred to, is the result of countless hours of backroom dealings and negotiations between Republicans and Democrats.

Without the budget bill (or a short-term emergency measure), the government shuts down, as it did in 2013 for 16 days when lawmakers couldn’t reach a budget deal. It contains dozens of measures that make the country run, and once it’s released and agreed to, it’s basically a guarantee to pass. Voting against it or vetoing it is politically costly, which is kind of the point: Republicans get some things they want, Democrats get some things they want, no one is totally happy but they live with it anyway. This is how countless pieces of bad legislation get passed in America—as riders on extremely important pieces of legislation that are politically difficult to vote against.

See Jason’s post for the full story, but you get the gist of it: your privacy rights will be terminated to a large degree this coming Friday.

I don’t accept Jason’s fatalism, however.

There still remains time for members of Congress to strip the rider from the budget bill, but only if everyone raises hell with their representatives and senators between now and Friday.

We need to overload every switchboard in the 202 area code with legitimate, personal calls to representatives and senators. Fill up every voice mail box, every online message storage, etc.

Those of you with personal phone numbers, put them to good use. Call now!

This may not make any difference, but, members of Congress can’t say they weren’t warned before taking this fateful step.

When Congress signals it doesn’t care about our privacy, then we damned sure don’t have to care about theirs.

Avoiding the Trap of Shallow Narratives

Filed under: Journalism,Narrative,News,Reporting — Patrick Durusau @ 8:42 pm

Avoiding the Trap of Shallow Narratives by Tiff Fehr.

From the post:


When we elevate immediate reactions to the same level as more measured narratives, we spring a trap on ourselves and our readers. I believe by the end of 2016, we will know if a “trap” is the right description. 2016 is going to be turbulent for news and news-reading audiences, which will add to the temptation to chase traffic via social-focused follow-on stories, and perhaps more of clickbait’s “leftover rehash.” Maybe we’ll even tweak them so they’re not “a potential letdown,” too: “Nine Good Things in the SCOTUS Brawl at the State of the Union.”

A great read on a very serious problem, if your goal is to deliver measured narratives of current events to readers.

Shallow narratives are not a problem if your goals are:

  • First, even if wrong, is better than being second
  • Headlines are judged by “click-through” rates
  • SEO drives the vocabulary of stories

This isn’t a new issue. Before social media, broadcast news was too short to present any measured narrative. It could signal events that needed measured narrative but it wasn’t capable of delivering it.

No one watched the CBS Evening News with Walter Cronkite to see a measured narrative about the Vietnam War. For that you consulted Foreign Affairs or any number of other history/policy sources.

That’s not a dig at broadcast journalism in general or CBS/Cronkite in particular. Each medium has its limits and Cronkite knew those limits as well as anyone. He would NOT have warned off anyone who was seeking "measured narrative" to supplement his reports.

The article I mentioned earlier about affective computing, We Know How You Feel from the New Yorker, qualifies as a measured narrative.

As an alternative, consider the shallow narrative: Mistrial in Freddie Gray Death. Testimony started December 2nd and the entire story is compressed into 1,564 words? Really?

Would anyone consider that to be a “measured narrative?” Well, other than its authors and colleagues who might fear a similar evaluation of their work?

You can avoid the trap of shallow narratives but that will depend upon the forum you choose for your content. Pick something like CNN and there isn’t anything but shallow narrative. Or at least that is the experience to date.

Your choice of forum has as much to do with avoiding shallow narrative as any other factor.

Choose wisely.

We Know How You Feel [A Future Where Computers Remain Imbeciles]

We Know How You Feel by Raffi Khatchadourian.

From the post:

Three years ago, archivists at A.T. & T. stumbled upon a rare fragment of computer history: a short film that Jim Henson produced for Ma Bell, in 1963. Henson had been hired to make the film for a conference that the company was convening to showcase its strengths in machine-to-machine communication. Told to devise a faux robot that believed it functioned better than a person, he came up with a cocky, boxy, jittery, bleeping Muppet on wheels. “This is computer H14,” it proclaims as the film begins. “Data program readout: number fourteen ninety-two per cent H2SOSO.” (Robots of that era always seemed obligated to initiate speech with senseless jargon.) “Begin subject: Man and the Machine,” it continues. “The machine possesses supreme intelligence, a faultless memory, and a beautiful soul.” A blast of exhaust from one of its ports vaporizes a passing bird. “Correction,” it says. “The machine does not have a soul. It has no bothersome emotions. While mere mortals wallow in a sea of emotionalism, the machine is busy digesting vast oceans of information in a single all-encompassing gulp.” H14 then takes such a gulp, which proves overwhelming. Ticking and whirring, it begs for a human mechanic; seconds later, it explodes.

The film, titled “Robot,” captures the aspirations that computer scientists held half a century ago (to build boxes of flawless logic), as well as the social anxieties that people felt about those aspirations (that such machines, by design or by accident, posed a threat). Henson’s film offered something else, too: a critique—echoed on television and in novels but dismissed by computer engineers—that, no matter a system’s capacity for errorless calculation, it will remain inflexible and fundamentally unintelligent until the people who design it consider emotions less bothersome. H14, like all computers in the real world, was an imbecile.

Today, machines seem to get better every day at digesting vast gulps of information—and they remain as emotionally inert as ever. But since the nineteen-nineties a small number of researchers have been working to give computers the capacity to read our feelings and react, in ways that have come to seem startlingly human. Experts on the voice have trained computers to identify deep patterns in vocal pitch, rhythm, and intensity; their software can scan a conversation between a woman and a child and determine if the woman is a mother, whether she is looking the child in the eye, whether she is angry or frustrated or joyful. Other machines can measure sentiment by assessing the arrangement of our words, or by reading our gestures. Still others can do so from facial expressions.

Our faces are organs of emotional communication; by some estimates, we transmit more data with our expressions than with what we say, and a few pioneers dedicated to decoding this information have made tremendous progress. Perhaps the most successful is an Egyptian scientist living near Boston, Rana el Kaliouby. Her company, Affectiva, formed in 2009, has been ranked by the business press as one of the country’s fastest-growing startups, and Kaliouby, thirty-six, has been called a “rock star.” There is good money in emotionally responsive machines, it turns out. For Kaliouby, this is no surprise: soon, she is certain, they will be ubiquitous.

This is a very compelling look at efforts that have, in practice, made computers more responsive to the emotions of users, with the goal of influencing users based upon the emotions that are detected.

Sound creepy already?

The article is fairly long but a great insight into progress already being made and that will be made in the not too distant future.

However, "emotionally responsive machines" remain the same imbeciles as they were in the story of H14. That is to say, they can only "recognize" emotions much as they can "recognize" color. To be sure, such a machine "learns," but its reaction upon recognition remains a matter of programming and/or training.

The next wave of startups will create programmable emotional images of speakers, pushing the privacy arms race another step down the road. If I were investing in startups, I would concentrate on those working to defeat emotionally responsive computers.

If you don’t want to wait for a high tech way to defeat emotionally responsive computers, may I suggest a fairly low tech solution:

Wear a mask!

One of my favorites:

[Image: Egyptian Guy Fawkes mask]

(From https://commons.wikimedia.org/wiki/Category:Masks_of_Guy_Fawkes. There are several unusual images there.)

Or choose any number of other masks at your nearest variety store.

A hard mask that conceals your eyes and movement of your face will defeat any “emotionally responsive computer.”

If you are concerned about your voice giving you away, search for “voice changer” for over 4 million “hits” on software to alter your vocal characteristics. Much of it for free.

Defeating “emotionally responsive computers” remains like playing checkers against an imbecile. If you lose, it’s your own damned fault.

PS: If you have a Max Headroom type TV and don’t want to wear a mask all the time, consider this solution for its camera:

[Image: cutting tool]

Any startups yet based on defeating the Internet of Things (IoT)? Predicting 2016/17 will be the year for those to take off.

20 Big Data Repositories You Should Check Out [Data Source Checking?]

Filed under: BigData,Data,Data Science — Patrick Durusau @ 11:43 am

20 Big Data Repositories You Should Check Out by Vincent Granville.

Vincent lists some additional sources along with a link to Bernard Marr’s original selection.

One of the issues with such lists is that they are rarely maintained.

For example, Bernard listed:

Topsy http://topsy.com/

Free, comprehensive social media data is hard to come by – after all their data is what generates profits for the big players (Facebook, Twitter etc) so they don’t want to give it away. However Topsy provides a searchable database of public tweets going back to 2006 as well as several tools to analyze the conversations.

But if you follow http://topsy.com/, you will find it points to:

Use Search on your iPhone, iPad, or iPod touch

With iOS 9, Search lets you look for content from the web, your contacts, apps, nearby places, and more. Powered by Siri, Search offers suggestions and updates results as you type.

That sucks, doesn’t it? You expect to be able to search public tweets back to 2006, along with analytical tools, and what you get is a kiddie guide to search on a malware honeypot.

For a fuller explanation or at least the latest news on Topsy, check out: Apple shuts down Twitter analytics service Topsy by Sam Byford, dated December 16, 2015 (that’s today as I write this post).

So, strike Topsy off your list of big data sources.

Rather than bare lists, what big data needs is a curated list of big data sources that does more than list sources. Those sources need to be broken down to data sets to enable big data searchers to find all the relevant data sets and retrieve only those that remain accessible.

Like “link checking” but for big data resources. Data Source Checking?

That would be the "go to" place for big data sets and, as much as I hate advertising, a high-traffic area for advertising to make it cost effective if not profitable.
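
As a rough illustration of what automated "data source checking" could look like, here is a minimal sketch in Python (standard library only; the source list is hypothetical) that flags sources which no longer resolve or which now redirect somewhere else entirely:

# Sketch of "data source checking": flag sources that vanish or get redirected.
# Standard library only; the source list below is hypothetical.
import urllib.request
import urllib.error

SOURCES = {
    "Topsy": "http://topsy.com/",
    "data.gov": "https://www.data.gov/",
}

def check(name, url):
    req = urllib.request.Request(url, headers={"User-Agent": "source-check/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            status, final = resp.status, resp.geturl()
    except urllib.error.HTTPError as err:
        return "%s: HTTP error %s" % (name, err.code)
    except urllib.error.URLError as err:
        return "%s: UNREACHABLE (%s)" % (name, err.reason)
    note = "" if final.rstrip("/") == url.rstrip("/") else " (redirected to %s)" % final
    return "%s: HTTP %s%s" % (name, status, note)

for name, url in SOURCES.items():
    print(check(name, url))

A real curated list would add per-dataset checks (row counts, schema hashes, last-modified dates), but even a redirect check like this would have caught Topsy's disappearance.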

December 15, 2015

eSymposium on Hacktivism (Defeating Hactivists)

Filed under: Cybersecurity,Security — Patrick Durusau @ 8:57 pm

eSymposium on Hacktivism

This showed up in my inbox today with the following description:

These vigilante-style, politically motivated attacks are meant to embarass executives by publicizing their secret dealings. What can authorities do to go after those behind these illegal activities, and how can corporations better protect themselves so incidents such as those that happened at the NSA, RSA, Twitter, PayPal, Sony, Pfizer, the FBI, a number of police forces, the U.S. military and many other entities, doesn’t happen to them? We’ll take a deep dive.

Registration requires far more information than I am willing to disclose for a “free” eSymposium so someone else will have to fill in the details.

I can offer advice on defeating hacktivists:

  1. Don’t make deals that will embarrass you if made public.
  2. Don’t lie when questioned about any deal, information, etc.
  3. Don’t cheat or steal.
  4. Don’t do #1-#3 to cover for someone else’s incompetence or dishonesty.

Is any of that unclear?

As far as I can tell, that is a 100% foolproof defense against hacktivists.

Questions?

A Day in the Life of Americans

Filed under: Graphics,Visualization — Patrick Durusau @ 3:38 pm

A Day in the Life of Americans – This is how America runs by Nathan Yau.

You are accustomed to seeing complex graphs which are proclaimed to hold startling insights:

[Image: A Day in the Life of Americans chart]

Nathan’s post starts off that way but you are quickly drawn into a visual presentation of the daily activities of Americans as the clock runs from 4:00 AM.

Nathan has produced a number of stunning visualizations over the years but well, here’s his introduction:

From two angles so far, we’ve seen how Americans spend their days, but the views are wideout and limited in what you can see.

I can tell you that about 40 percent of people age 25 to 34 are working on an average day at three in the afternoon. I can tell you similar numbers for housework, leisure, travel, and other things. It’s an overview.

What I really want to see is closer to the individual and a more granular sense of how each person contributes to the patterns. I want to see how a person’s entire day plays out. (As someone who works from home, I’m always interested in what’s on the other side.)

So again I looked at microdata from the American Time Use Survey from 2014, which asked thousands of people what they did during a 24-hour period. I used the data to simulate a single day for 1,000 Americans representative of the population — to the minute.

More specifically, I tabulated transition probabilities for one activity to the other, such as from work to traveling, for every minute of the day. That provided 1,440 transition matrices, which let me model a day as a time-varying Markov chain. The simulations below come from this model, and it’s kind of mesmerizing.

Not only is it "mesmerizing," it's informative as well. To a degree.

Did you know that 74% of 1,000 average Americans are asleep when Jimmy Fallon comes on at 11:30 EST? 😉
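
If you want to tinker with the underlying idea, here is a toy sketch of a time-varying Markov chain in Python — made-up transition probabilities and only three activities, nothing like the ATUS microdata Nathan used:

# Toy sketch of a time-varying Markov chain "day": made-up probabilities,
# three activities, a different transition matrix depending on the minute.
import random

ACTIVITIES = ["sleep", "work", "leisure"]

def transition_probs(minute, state):
    # Hypothetical: night minutes favor sleep, office hours favor work.
    hour = minute // 60
    if hour < 7 or hour >= 23:
        table = {"sleep": [0.98, 0.01, 0.01], "work": [0.30, 0.60, 0.10], "leisure": [0.40, 0.10, 0.50]}
    elif 9 <= hour < 17:
        table = {"sleep": [0.20, 0.70, 0.10], "work": [0.01, 0.97, 0.02], "leisure": [0.05, 0.50, 0.45]}
    else:
        table = {"sleep": [0.70, 0.05, 0.25], "work": [0.05, 0.55, 0.40], "leisure": [0.05, 0.10, 0.85]}
    return table[state]

def simulate_day(start="sleep"):
    state, day = start, []
    for minute in range(1440):                      # 24 hours x 60 minutes
        state = random.choices(ACTIVITIES, weights=transition_probs(minute, state))[0]
        day.append(state)
    return day

day = simulate_day()
print("asleep at 11:30 pm:", day[23 * 60 + 30] == "sleep")

Nathan's version estimates one transition matrix per minute from the survey microdata and simulates 1,000 people at once; the sketch only shows the mechanics.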

What you find here and elsewhere on Nathan’s site is the result of a very talented person who practices data visualization every day.

For me, the phrase, “a day in the life,” will always be associated with:

How does your average day compare to the average day? Or the average day in your office to the average day?

Readings in Database Systems, 5th Edition (Kindle Stuffer)

Filed under: Computer Science,Database — Patrick Durusau @ 2:28 pm

Readings in Database Systems, 5th Edition, Peter Bailis, Joseph M. Hellerstein, Michael Stonebraker, editors.

From the webpage:

  1. Preface [HTML] [PDF]
  2. Background introduced by Michael Stonebraker [HTML] [PDF]
  3. Traditional RDBMS Systems introduced by Michael Stonebraker [HTML] [PDF]
  4. Techniques Everyone Should Know introduced by Peter Bailis [HTML] [PDF]
  5. New DBMS Architectures introduced by Michael Stonebraker [HTML] [PDF]
  6. Large-Scale Dataflow Engines introduced by Peter Bailis [HTML] [PDF]
  7. Weak Isolation and Distribution introduced by Peter Bailis [HTML] [PDF]
  8. Query Optimization introduced by Joe Hellerstein [HTML] [PDF]
  9. Interactive Analytics introduced by Joe Hellerstein [HTML] [PDF]
  10. Languages introduced by Joe Hellerstein [HTML] [PDF]
  11. Web Data introduced by Peter Bailis [HTML] [PDF]
  12. A Biased Take on a Moving Target: Complex Analytics
    by Michael Stonebraker [HTML] [PDF]
  13. A Biased Take on a Moving Target: Data Integration
    by Michael Stonebraker [HTML] [PDF]

Complete Book: [HTML] [PDF]

Readings Only: [HTML] [PDF]

Previous Editions: [HTML]

Citations to the “reading” do not present themselves as hyperlinks but they are.

If you are giving someone a Kindle this Christmas, consider pre-loading Readings in Database Systems, along with the readings as a Kindle stuffer.

December 14, 2015

35 Lines XQuery versus 604 of XSLT: A List of W3C Recommendations

Filed under: BaseX,Saxon,XML,XQuery,XSLT — Patrick Durusau @ 10:16 pm

Use Case

You should be familiar with the W3C Bibliography Generator. You can insert one or more URLs and the generator produces correctly formatted citations for W3C work products.

It’s quite handy but requires a URL to produce a useful response. I need authors to use correctly formatted W3C citations and asking them to find URLs and to generate correct citations was a bridge too far. Simply didn’t happen.

My current attempt is to produce a list of correctly formatted W3C citations in HTML. Authors can use CTRL-F in their browsers to find citations. (Time will tell if this is a successful approach or not.)

Goal: An HTML page of correctly formatted W3C Recommendations, sorted by title (ignoring case because W3C Recommendations are not consistent in their use of case in titles). “Correctly formatted” meaning that it matches the output from the W3C Bibliography Generator.

Resources

As a starting point, I viewed the source of http://www.w3.org/2002/01/tr-automation/tr-biblio.xsl, the XSLT script that generates the XHTML page with its responses.

The first XSLT script imports two more XSLT scripts, http://www.w3.org/2001/08/date-util.xslt and http://www.w3.org/2001/10/str-util.xsl.

I’m not going to reproduce the XSLT here, but can say that starting with <stylesheet> and ending with </stylesheet>, inclusive, I came up with 604 lines.

You will need to download the file used by the W3C Bibliography Generator, tr.rdf.

XQuery Script

I have used the XQuery script successfully with BaseX 8.3, eXide 2.1.3 and SaxonHE9-6-0-7J.

Here’s the prolog:

declare default element namespace "http://www.w3.org/2001/02pd/rec54#";
declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace dc = "http://purl.org/dc/elements/1.1/"; 
declare namespace doc = "http://www.w3.org/2000/10/swap/pim/doc#";
declare namespace contact = "http://www.w3.org/2000/10/swap/pim/contact#";
declare namespace functx = "http://www.functx.com";
declare function functx:substring-after-last
($string as xs:string?, $delim as xs:string) as xs:string?
{
if (contains ($string, $delim))
then functx:substring-after-last(substring-after($string, $delim), $delim)
else $string
};

The prolog declares the namespaces and functx:substring-after-last, taken from Priscilla Walmsley’s excellent FunctX XQuery Functions site.

<html>
<head>XQuery Generated W3C Recommendation List</head>
<body>
<ul class="ul">

Start the HTML page and the unordered list that will contain the list items.

{
for $rec in doc("tr.rdf")//REC
    order by upper-case($rec/dc:title)

If you sort W3C Recommendations by dc:title and don’t specify upper-case, rdf:PlainLiteral: A Datatype for RDF Plain Literals,
rdf:PlainLiteral: A Datatype for RDF Plain Literals (Second Edition), and xml:id Version 1.0, appear at the end of the list sorted by title. Dirty data isn’t limited to databases.

return <li class="li">
  <a href="{string($rec/@rdf:about)}"> {string($rec/dc:title)} </a>, 
   { for $auth in $rec/editor
   return
   if (contains(string($auth/contact:fullName), "."))
   then (concat(string($auth/contact:fullName), ","))
   else (concat(concat(concat(substring(substring-before(string($auth/\
   contact:fullName), ' '), 0, 2), ". "), (substring-after(string\
   ($auth/contact:fullName), ' '))), ","))}

Watch for the line continuation marker “\”.

We begin by grabbing the URL and title for an entry and then confront dirty author data. The standard author listing by the W3C creates an initial plus a period for the author’s first name and then concatenates the rest of the author’s name to that initial plus period.

Problem: There is one entry for authors that already has initials, T.V. Raman, so I had to account for that one entry (as does the XSLT).

{if (count ($rec/editor) >= 2) then " Editors," else " Editor,"}
W3C Recommendation, 
{fn:format-date(xs:date(string($rec/dc:date)), "[MNn] [D], [Y]") }, 
{string($rec/@rdf:about)}. <a href="{string($rec/doc:versionOf/\
@rdf:resource)}">Latest version</a> \
available at {string($rec/doc:versionOf/@rdf:resource)}.
<br/>[Suggested label: <strong>{functx:substring-after-last(upper-case\
(replace(string($rec/doc:versionOf/@rdf:resource), '/$', '')), "/")}\
</strong>]<br/></li>} </ul></body></html>

Nothing remarkable here, except that I snipped the concluding “/” off of the values from doc:versionOf/@rdf:resource so I could use functx:substring-after-last to create the token for a suggested label.

Comments / Omissions

I depart from the XSLT in one case. It calls http://www.w3.org/2002/01/tr-automation/known-tr-editors.rdf here:

<!-- Special casing for when we have the name in Original Script (e.g. in \
Japanese); currently assume that the order is inversed in this case... -->

<xsl:when test="document('http://www.w3.org/2002/01/tr-automation/\
known-tr-editors.rdf')/rdf:RDF/*[contact:lastNameInOriginalScript=\
substring-before(current(),' ')]">

But that refers to only one case:

<REC rdf:about="http://www.w3.org/TR/2003/REC-SVG11-20030114/">
<dc:date>2003-01-14</dc:date>
<dc:title>Scalable Vector Graphics (SVG) 1.1 Specification</dc:title>

Where Jun Fujisawa appears as an editor.

Recalling my criteria for “correctness” being the output of the W3C Bibliography Generator:

[Image: W3C Bibliography Generator citation for the SVG 1.1 Recommendation]

Preparing for this post made me discover at least one bug in the XSLT that was supposed to report the name in original script:

<xsl:when test="document('http://www.w3.org/2002/01/tr-automation/\
known-tr-editors.rdf')/rdf:RDF/*[contact:lastNameInOriginalScript=\
substring-before(current(),' ')]">

Whereas the entry in http://www.w3.org/2002/01/tr-automation/known-tr-editors.rdf reads:

<rdf:Description>
<rdf:type rdf:resource="http://www.w3.org/2000/10/swap/pim/contact#Person"/>
<firstName>Jun</firstName>
<firstNameInOriginalScript>藤沢 淳</firstNameInOriginalScript>
<lastName>Fujisawa</lastName>
<sortName>Fujisawa</sortName>
</rdf:Description>

Since the W3C Bibliography Generator doesn’t produce the name in original script, neither do I. When the W3C fixes its output, I will have to amend this script to pick up that entry.

String

While writing this query I found text(), fn:string() and fn:data() by Dave Cassels. Recommended reading. The weakness of text() is that if markup is inserted inside your target element after you write the query, you will get unexpected results. The use of fn:string() avoids that sort of surprise.

Recommendations Only

Unlike the W3C Bibliography Generator, my script as written only generates entries for Recommendations. It would be trivial to modify the script to include drafts, notes, etc., but I chose to not include material that should not be used as normative citations.

I can see the usefulness of the bibliography generator for works in progress but external to the W3C, citing Recommendations is the better course.

Contra Search

The SpecRef project has a searchable interface to all the W3C documents. If you search for XQuery, the interface returns 385 “hits.”

Contrast that with using CTRL-F on the list of recommendations generated by the XQuery script: controlling for case, XQuery produced only 23 "hits."

There are reasons for using search, but users repeatedly mining results of searches that could be captured (it was called curation once upon a time) is wasteful.

Reading

I can’t recommend Priscilla Walmsley’s XQuery, 2nd Edition strongly enough.

There is one danger to Walmsley’s book. You will be so ready to start using XQuery after the first ten chapters it’s hard to find the time to read the remaining ones. Great stuff!

You can download the XQuery file, tr.rdf and the resulting html file at: 35LinesOfXQuery.zip.

Congress.gov Enhancements: Quick Search, Congressional Record Index, and More

Filed under: Government,Government Data,XML,XQuery — Patrick Durusau @ 9:12 pm

New End of Year Congress.gov Enhancements: Quick Search, Congressional Record Index, and More by Andrew Weber.

From the post:

In our quest to retire THOMAS, we have made many enhancements to Congress.gov this year.  Our first big announcement was the addition of email alerts, which notify users of the status of legislation, new issues of the Congressional Record, and when Members of Congress sponsor and cosponsor legislation.  That development was soon followed by the addition of treaty documents and better default bill text in early spring; improved search, browse, and accessibility in late spring; user driven feedback in the summer; and Senate Executive Communications and a series of Two-Minute Tip videos in the fall.

Today’s update on end of year enhancements includes a new Quick Search for legislation, the Congressional Record Index (back to 1995), and the History of Bills from the Congressional Record Index (available from the Actions tab).  We have also brought over the State Legislature Websites page from THOMAS, which has links to state level websites similar to Congress.gov.

Text of legislation from the 101st and 102nd Congresses (1989-1992) has been migrated to Congress.gov. The Legislative Process infographic that has been available from the homepage as a JPG and PDF is now available in Spanish as a JPG and PDF (translated by Francisco Macías). Margaret and Robert added Fiscal Year 2003 and 2004 to the Congress.gov Appropriations Table. There is also a new About page on the site for XML Bulk Data.

The Quick Search provides a form-based search with fields similar to those available from the Advanced Legislation Search on THOMAS.  The Advanced Search on Congress.gov is still there with many additional fields and ways to search for those who want to delve deeper into the data.  We are providing the new Quick Search interface based on user feedback, which highlights selected fields most likely needed for a search.

There’s an impressive summary of changes!

Speaking of practicing programming, are you planning on practicing XQuery on congressional data in the coming year?

Fixing Bugs In Production

Filed under: Humor,Privacy,Programming,Security — Patrick Durusau @ 8:48 pm

MΛHDI posted this to twitter and it is too good not to share:

Amusing now but what happens when the illusion of “static data” disappears and economic activity data is streamed from every transaction point?

Your code and analysis will need to specify the time boundaries of the data that underlie your analysis. Depending on the level of your analysis, it may quickly become outdated as new data streams in for further analysis.
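
A trivial sketch of what "specifying the time boundaries" looks like in code — hypothetical transaction records, with the window carried along in the result so the analysis can be re-run as new data streams in:

# Trivial sketch: the same "analysis" gives different answers depending on
# the declared time window. The transaction records are hypothetical.
from datetime import datetime

transactions = [
    {"when": datetime(2015, 12, 1, 9, 30), "amount": 40.00},
    {"when": datetime(2015, 12, 10, 14, 5), "amount": 250.00},
    {"when": datetime(2015, 12, 15, 11, 45), "amount": 15.00},
]

def spend(records, start, end):
    # Carry the window in the result, so the answer documents its own bounds.
    amounts = [r["amount"] for r in records if start <= r["when"] < end]
    return {"start": start, "end": end, "count": len(amounts), "total": sum(amounts)}

print(spend(transactions, datetime(2015, 12, 1), datetime(2015, 12, 8)))
print(spend(transactions, datetime(2015, 12, 1), datetime(2015, 12, 16)))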

To do the level of surveillance that law enforcement longs for in the San Bernardino attack, you would need real time sales transaction data for the last 5 years, plus bank records and “see something say something” reports on 322+ million citizens of the United States.

Now imagine fixing bugs in that production code, when arrest and detention, if not more severe consequences await.

Essential 2016 Trends Overload Here!

Filed under: Forecasting,Journalism,News,Reporting — Patrick Durusau @ 7:55 pm

OUTLOOK ’16 /// Essential trends for 2016 by Ezra Eeman.

From the post:

The world is changing. In small iterations and disruptive shifts. New platforms emerge, new technologies shape our behaviour. The need for new business models and new talent was never more urgent. In order to stay ahead newsroom leaders and journalists have to look forward and understand the emerging trends that are/will be changing our daily lives.

OUTLOOK ‘16/// is a broad & growing collection of trend reports about media, technology and society. The selection is handmade in collaboration with VRT Start-Up. The order is random.

There are 17 trend reports as I write this post and no doubt more will be added as 2016 inches ever closer.


This is a growing collection. New relevant reports will be added when they are released. A Dutch version of this collection can be found here.

Want to suggest a great trend report for this collection? Mail us at journalismtools.mailbox@gmail.com

Hat tip to Journalism Tools for doing the collecting, which saves all of us the time of mining for 2016 trend reports.

Are you going to read all seventeen reports? Or whatever the ultimate number of trend reports?

How would you curate all seventeen+ reports to enable others to quickly survey the results and dip in and out of them?

PS: The “4 min read” is deceptive. You can scan the titles of all 17 trend reports in 4 minutes but be woefully short of reading time for all of them.

Data Science Lessons [Why You Need To Practice Programming]

Filed under: Data Science,Programming,Python — Patrick Durusau @ 7:30 pm

Data Science Lessons by Shantnu Tiwari.

Shantnu has authored several programming books using Python and has a series of videos (with more forthcoming) on doing data science with Python.

Shantnu had me when he used data from the Hubble Space telescope in his Introduction to Pandas with Practical examples.

The videos build one upon another and new users will appreciate that not every move is the correct one. 😉

If I had to pick one video to share, of those presently available, it would be:

Why You Need To Practice Programming.

It’s not new advice but it certainly is advice that needs repeating.

This anecdote is told about Pablo Casals (world famous cellist):

When Casals (then age 93) was asked why he continued to practice the cello three hours a day, he replied, “I’m beginning to notice some improvement.”

What are you practicing three hours a day?

XQuery, XPath, XSLT and XQuery Serialization 3.1 (Back-to-Front) Drafts (soon!)

Filed under: W3C,XPath,XQuery,XSLT — Patrick Durusau @ 4:04 pm

XQuery, XPath, XSLT and XQuery Serialization 3.1 (Back-to-Front) Drafts will be published quite soon so I wanted to give you a heads up on your holiday reading schedule.

This is deep enough in the review cycle that a back-to-front reading is probably your best approach.

You have read the drafts and corrections often enough by this point that you read the first few words of a paragraph and you “know” what it says so you move on. (At the very least I can report that happens to me.)

By back-to-front reading I mean to start at the end of each draft and read the last sentence and then the next to last sentence and so on.

The back-to-front process does two things:

  1. You are forced to read each sentence on its own.
  2. It prevents skimming and filling in errors with silent corrections (unknown to your conscious mind).

The back-to-front method is quite time consuming, so it's fortunate these drafts are due to appear just before a series of holidays in a large number of places.

I hesitate to mention it but there is another way to proof these drafts.

If you have XML experienced visitors, you could take turns reading the drafts to each other. It was a technique used by copyists many years ago where one person read and two others took down the text. The two versions were then compared to each other and the original.

Even with a great reading voice, I’m not certain many people would be up to that sort of exercise.

PS: I will post on the new drafts as soon as they are published.

No Sign of Terrorist Attack = Conflict in Government Priorities

Filed under: Journalism,News,Newspeak,Reporting — Patrick Durusau @ 3:09 pm

Egypt Says Investigators Found No Sign Of ‘Terrorist Act’ In Russian Plane Crash by Eyder Peralta.

Despite early conclusions by the Russians, the United States, Britain, and claims by the Islamic State, the Egyptian government has concluded there is no sign of a terrorist attack in the downing of a Russian passenger plane in Egypt last October.

Reporters and citizens alike should view claims of “terrorist” and “not a terrorist” attack with heavy additions of salt.

In this particular case, Egypt wants to avoid further damage to its revenue from tourism, which is reported down by 10% over last year.

U.S. intelligence and law enforcement agencies are desperately trying to find a terrorist connection for the shooters in San Bernardino. See also: Everything we know about the San Bernardino terror attack investigation so far.

From the second Los Angeles Times story:

Why their plot wasn’t detected

Farook and Malik were unknown to law enforcement until the day of the shooting, but the reason for that isn’t yet clear.

The FBI is focusing on how they missed the couple’s secret radicalization and Farook’s apparent comments to an associate as early as 2011 that he was considering a terrorist attack.

Two people out of a current population of 322,332,500 (as of 14:58 EST today) not being known to law enforcement doesn’t surprise me.

Does it surprise you?

As for threats of a terrorist attack years ago, if they arrested everyone who shouts "kill the umpire/referee, etc.," there would not be anyone left to stand guard in the prisons.

The reason for treating the deaths of co-workers at a holiday party as an act of terrorism is that it furthers the budget agendas of law enforcement and intelligence communities.

Imagine the surveillance that would be required to gather and be aware of a random statement from four years ago from otherwise unremarkable individuals.

Your system would have to connect a statement to a co-worker to the purchases of the weapons, ammunition and other supplies and then ring the alert bell to clue officers in on a pending threat.

That’s possible in retrospect, but to prevent random acts your system would have to make all those connections in the absence of any reason to focus on these individuals in particular.

You know the old saying, when they criminalize guns, only criminals will have guns?

Same is true for privacy, when they criminalize privacy, only criminals will have privacy.

PS: Remember terrorism is a label used to question the loyalty, judgement of others and/or for furthering other agendas. You could substitute “belch” where you see it and still have 99% of the informative content of a message.

December 13, 2015

The Moral Failure of Computer Scientists [Warning: Scam Alert!]

Filed under: Computer Science,Ethics — Patrick Durusau @ 9:09 pm

The Moral Failure of Computer Scientists by Kaveh Waddell.

From the post:

Computer scientists and cryptographers occupy some of the ivory tower’s highest floors. Among academics, their work is prestigious and celebrated. To the average observer, much of it is too technical to comprehend. The field’s problems can sometimes seem remote from reality.

But computer science has quite a bit to do with reality. Its practitioners devise the surveillance systems that watch over nearly every space, public or otherwise—and they design the tools that allow for privacy in the digital realm. Computer science is political, by its very nature.

That’s at least according to Phillip Rogaway, a professor of computer science at the University of California, Davis, who has helped create some of the most important tools that secure the Internet today. Last week, Rogaway took his case directly to a roomful of cryptographers at a conference in Auckland, New Zealand. He accused them of a moral failure: By allowing the government to construct a massive surveillance apparatus, the field had abused the public trust. Rogaway said the scientists had a duty to pursue social good in their work.

He likened the danger posed by modern governments’ growing surveillance capabilities to the threat of nuclear warfare in the 1950s, and called upon scientists to step up and speak out today, as they did then.

I spoke to Rogaway about why cryptographers fail to see their work in moral terms, and the emerging link between encryption and terrorism in the national conversation. A transcript of our conversation appears below, lightly edited for concision and clarity.

I don’t disagree with Rogaway that all science and technology is political. I might use the term social instead but I agree, there are no neutral choices.

Having said that, I do disagree that Rogaway has the standing to pre-package a political stance colored as “morals” and denounce others as “immoral” if they disagree.

It is one of the oldest tricks in rhetoric but quite often effective, which is why people keep using it.

If Rogaway is correct that CS and technology are political, then his stance for a particular take on government, surveillance and cryptography is equally political.

Not that I disagree with his stance, but I don’t consider it be a moral choice.

Anything you can do to impede, disrupt or interfere with any government surveillance is fine by me. I won’t complain. But that’s because government surveillance, the high-tech kind, is a waste of time and effort.

Rogaway uses scientists who spoke out in the 1950’s about the threat of nuclear warfare as an example. Some example.

The Federation of American Scientists estimates that as of September 2015, there are approximately 15,800 nuclear weapons in the world.

Hmmm, doesn’t sound like their moral outrage was very effective does it?

There will be sessions, presentations, conferences, along with comped travel and lodging, publications for tenure, etc., but the sum of the discussion of morality in computer science will be largely the same.

The reason for the sameness of result is that discussions, papers, resolutions and the rest, aren’t nearly as important as the ethical/moral choices you make in the day to day practice as a computer scientist.

Choices in the practice of computer science make a difference, discussions of fictional choices don’t. It’s really that simple.*

*That’s not entirely fair. The industry of discussing moral choices without making any of them is quite lucrative and it depletes the bank accounts of those snared by it. So in that sense it does make a difference.

Data Science Learning Club

Filed under: Data Science,Education — Patrick Durusau @ 8:11 pm

Data Science Learning Club by Renee Teate.

From the Hello and welcome message:

I’m Renee Teate, the host of the Becoming a Data Scientist Podcast, and I started this club so data science learners can work on projects together. Please browse the activities and see what we’re up to!

What is the Data Science Learning Club?

This learning club was created as part of the Becoming a Data Scientist Podcast [coming soon!]. Each episode, there is a “learning activity” announced. Anyone can come here to the club forum to get details and resources, participate in the activity, and share their results.

Participants can use any technology and any programming language to do the activities, though I expect most will use python or R. No one is “teaching” how to do the activity, we’ll just share resources and all do the activity during the same time period so we can help each other out if needed.

How do I participate?

Just register for a free account, and start learning!

If you’re joining in a “live” activity during the 2 weeks after a podcast episode airs (the original “assignment” period listed in the forum description), then you can expect others to be doing the activity at the same time and helping each other out. If you’re working through the activities from the beginning after the original assignment period is over, you can browse the existing posts for help and you can still post your results. If you have trouble, feel free to post a question, but you may not get a timely response if the activity isn’t the current one.

  • If you are brand new to data science, you may want to start at activity 00 and work your way through each activity with the help of the information in posts by people that did it before you. I plan to make them increase in difficulty as we go along, and they may build on one another. You may be able to skip some activities without missing out on much, and also if you finish more than 1 activity every 2 weeks, you will be going faster than new activities are posted and will catch up.
  • If you know enough to have done most of the prior activities on your own, you don’t have to start from the beginning. Join the current activity (latest one posted) with the “live” group and participate in the activity along with us.
  • If you are more advanced, please join in anyway! You can work through activities for practice and help out anyone that is struggling. Show off what you can do and write tutorials to share!

If you have challenges during the activity and overcome them on your own, please post about it and share what you did in case others come across the same challenges. Once you have success, please post about your experience and share your good results! If you write a post or tutorial on your own blog, write a brief summary and post a link to it, and I’ll check it out and promote the most helpful ones.

The only “dues” for being a member of the club are to participate in as many activities as possible, share as much of your work as you can, give constructive feedback to others, and help each other out as needed!

I look forward to this series of learning activities, and I’ll be participating along with you!

Renee’s Data Science Learning Club is due to go live on December 14, 2015!

With the various free courses, Stack Overflow and similar resources, it will be interesting to see how this develops.

Hopefully recurrent questions will develop into tutorials culled from the discussions. That hasn’t happened with Stack Overflow, at least not that I am aware of, but perhaps it will happen here.

Stop by and see how the site develops!

December 12, 2015

Previously Unknown Hip Replacement Side Effect: Web Crawler Writing In Python

Filed under: Python,Search Engines,Searching,Webcrawler — Patrick Durusau @ 8:02 pm

Crawling the web with Python 3.x by Doug Mahugh.

From the post:

These days, most everyone is familiar with the concept of crawling the web: a piece of software that systematically reads web pages and the pages they link to, traversing the world-wide web. It’s what Google does, and countless tech firms crawl web pages to accomplish tasks ranging from searches to archiving content to statistical analyses and so on. Web crawling is a task that has been automated by developers in every programming language around, many times — for example, a search for web crawling source code yields well over a million hits.

So when I recently came across a need to crawl some web pages for a project I’ve been working on, I figured I could just go find some source code online and hack it into what I need. (Quick aside: the project is a Python library for managing EXIF metadata on digital photos. More on that in a future blog post.)

But I spent a couple of hours searching and playing with the samples I found, and didn’t get anywhere. Mostly because I’m working in Python version 3, and the most popular Python web crawling code is Scrapy, which is only available for Python 2. I found a few Python 3 samples, but they all seemed to be either too trivial (not avoiding re-scanning the same page, for example) or too needlessly complex. So I decided to write my own Python 3.x web crawler, as a fun little learning exercise and also because I need one.

In this blog post I’ll go over how I approached it and explain some of the code, which I posted on GitHub so that others can use it as well.

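If you want a feel for what such a crawler involves before reading Doug’s code, here is a minimal Python 3 sketch of the same idea. To be clear, this is not Doug’s implementation: it uses only the standard library, and a simple “seen” set keeps it from re-fetching pages, which is the re-scanning problem he mentions.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href values of anchor tags as a page is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    """Breadth-first crawl, confined to the starting site, no page fetched twice."""
    site = urlparse(start_url).netloc
    seen = set()                 # pages already fetched -- prevents re-scanning
    queue = deque([start_url])   # pages waiting to be fetched
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urlopen(url, timeout=10) as response:
                if "text/html" not in response.headers.get("Content-Type", ""):
                    continue
                html = response.read().decode("utf-8", errors="replace")
        except Exception as exc:
            print("skipped {}: {}".format(url, exc))
            continue
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urldefrag(urljoin(url, link)).url  # resolve and drop #fragments
            if urlparse(absolute).netloc == site and absolute not in seen:
                queue.append(absolute)
        print("crawled {} ({} links found)".format(url, len(parser.links)))
    return seen

if __name__ == "__main__":
    crawl("https://example.com/")

The queue gives a breadth-first traversal confined to the starting site; swapping it for a stack would make the crawl depth-first.
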
Doug has been writing publicly about his hip replacement surgery, so I don’t think this has any privacy issues. 😉

I am interested to see what he writes once he is fully recovered.

My contacts at the American Medical Association disavow any knowledge of hip replacement surgery driving patients to write in Python and/or to write web crawlers.

I suppose there could be liability implications, especially for C/C++ programmers who, following such surgery, lose all their programming skills except Python.

Still, glad to hear Doug has been making great progress and hope that it continues!

Why the Open Government Partnership Needs a Reboot [Governments Too]

Filed under: Government,Government Data,Open Government,Transparency — Patrick Durusau @ 7:31 pm

Why the Open Government Partnership Needs a Reboot by Steve Adler.

From the post:

The Open Government Partnership was created in 2011 as an international forum for nations committed to implementing Open Government programs for the advancement of their societies. The idea of open government started in the 1980s after CSPAN was launched to broadcast U.S. Congressional proceedings and hearings to the American public on TV. While the galleries above the House of Representatives and Senate had been “open” to the “public” (if you got permission from your representative to attend) for decades, never before had all public democratic deliberations been broadcast on TV for the entire nation to behold at any time they wished to tune in.

I am a big fan of OGP and feel that the ideals and ambition of this partnership are noble and essential to the survival of democracy in this millennium. But OGP is a startup, and every startup business or program faces a chasm it must cross from early adopters and innovators to early majority market implementation and OGP is very much at this crossroads today. It has expanded membership at a furious pace the past three years and it’s clear to me that expansion is now far more important to OGP than the delivery of the benefits of open government to the hundreds of millions of citizens who need transparent transformation.

OGP needs a reboot.

The structure of a system produces its own behavior. OGP needs a new organizational structure with new methods for evaluating national commitments. But that reboot needs to happen within its current mission. We should see clearly that the current structure is straining due to the rapid expansion of membership. There aren’t enough support unit resources to manage the expansion. We have to rethink how we manage national commitments and how we evaluate what it means to be an open government. It’s just not right that countries can celebrate baby steps at OGP events while at the same time passing odious legislation, sidestepping OGP accomplishments, buckling to corruption, and cracking down on journalists.

Unlike Steve, I didn’t and don’t have a lot of faith in governments being voluntarily transparent.

As I pointed out in Congress: More XQuery Fodder, sometime in 2016, full bill status data will be available for all legislation before the United States Congress.

That is a lot more data than is easily accessible now, but it is more smoke than fire.

With legislation status data, you can track the civics lesson progression of a bill through Congress, but that leaves you at least 3 to 4 degrees short of knowing who was behind the legislation.

Just a short list of what more would be useful:

  • Visitor/caller lists for everyone who spoke to a member of Congress or their staff, with the date and subject of each contact
  • All visits and calls tied to particular legislation and/or classes of legislation
  • All fund-raising calls made by members of Congress and/or their staffs, with date, results, and substance of each call
  • Representatives’ conversations with reconciliation committee members or their staffers about legislation and requested “corrections”
  • All conversations between a representative or a member of their staff and agency staff, identifying all parties and the substance of the conversation
  • Notes, proposals, and discussion records for all agency decisions

Current transparency proposals are sufficient to confuse the public with mounds of nearly useless data. None of it reflects the real decision-making processes of government.

Before someone shouts “privacy,” I would point out that no citizen has a right to privacy when their request is for a government representative to favor them over other citizens of the same government.

Real government transparency will require breaking open the mini star-chamber proceedings at every level of government, from the lowest to the highest.

What we need is a rebooting of governments.

Fun with ddR: Using Distributed Data Structures in R [Your Holiday Quiet Spot]

Filed under: Distributed Computing,Distributed Systems,R — Patrick Durusau @ 5:52 pm

Fun with ddR: Using Distributed Data Structures in R by Edward Ma and Vishrut Gupta (Hewlett Packard Enterprise).

From the post:

A few weeks ago, we revealed ddR (Distributed Data-structures in R), an exciting new project started by R-Core, Hewlett Packard Enterprise, and others that provides a fresh new set of computational primitives for distributed and parallel computing in R. The package sets the seed for what may become a standardized and easy way to write parallel algorithms in R, regardless of the computational engine of choice.

In designing ddR, we wanted to keep things simple and familiar. We expose only a small number of new user functions that are very close in semantics and API to their R counterparts. You can read the introductory material about the package here. In this post, we show how to use ddR functions.

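ddR itself is R-only and I have not tried to reproduce its API here. Purely as a conceptual analogy, the pattern the package exposes (data split into parts, a function mapped over the parts in parallel, results collected back) looks roughly like this sketch using Python’s standard multiprocessing module:

from multiprocessing import Pool

def part_mean(part):
    """Work applied independently to one partition of the data."""
    return sum(part) / len(part)

if __name__ == "__main__":
    data = list(range(1, 1001))
    # split the data into 4 "parts", the way a distributed structure holds partitions
    parts = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        # parallel map over the parts; results come back as a plain list
        partial_means = pool.map(part_mean, parts)
    print(partial_means)

The real package does far more than this, of course; see the post and its documentation for the actual functions.
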
Imagine that you are trapped after an indeterminate holiday meal in the TV room where A Christmas Story is playing for the fourth time that day.

You are at the point of saying/doing something that will offend the living members of your spouse’s family and generations to come.

What can you do?

Surely your powers of concentration exceed those of bridge players who claim to not see naked people cavorting about during bridge games.

Pull up the ddR post on your smartphone, read it and jump to the documentation and/or example programs.

You will have to be woken out of your reverie and handed your coat when it is time to go.

Well, maybe not exactly, but it beats the hell out of biting one of your smaller relatives.

DataGenetics (blog)

Filed under: Data Science,Mathematical Reasoning,Narrative,Reasoning — Patrick Durusau @ 5:09 pm

DataGenetics (blog) by Nick Berry.

I mentioned Nick’s post Estimating “known unknowns” but his blog merits more than a mention of that one post.

As of today, Nick has 217 posts that touch on topics relevant to data science, with illustrations that make them memorable. You will find yourself recalling those illustrations in discussions with data scientists, customers, and even data science interviewers.

Follow Berry’s posts long enough and you may acquire the skill of illustrating data science ideas and problems in straightforward prose.

Good luck!
