Archive for January, 2016

Introducing Kaggle Datasets [No Data Feudalism Here]

Saturday, January 23rd, 2016

Introducing Kaggle Datasets

From the post:

At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. That’s why we’ve created a home for high quality public datasets, Kaggle Datasets.

Kaggle Datasets has four core components:

  • Access: simple, consistent access to the data with clear licensing
  • Analysis: a way to explore the data without downloading it
  • Results: visibility to the previous work that’s been created on the data
  • Conversation: forums and comments for discussing the nuances of the data

Are you interested in publishing one of your datasets on kaggle.com/datasets? Submit a sample here.

Unlike some medievalists who publish in the New England Journal of Medicine, Kaggle not only makes the data sets freely available, but offers tools to help you along.

Kaggle will also assist you in making your own datasets available.

Bible vs. Quran – Who’s More Violent?

Friday, January 22nd, 2016

Bible vs. Quran – Text analysis answers: Is the Quran really more violent than the Bible? by Tom H. C. Anderson.

Tom’s series appears in three parts, all sharing the common title:

Part I: The Project

From part 1:

With the proliferation of terrorism connected to Islamic fundamentalism in the late-20th and early 21st centuries, the question of whether or not there is something inherently violent about Islam has become the subject of intense and widespread debate.

Even before 9/11—notably with the publication of Samuel P Huntington’s “Clash of Civilizations” in 1996—pundits have argued that Islam incites followers to violence on a level that sets it apart from the world’s other major religions.

The November 2015 Paris attacks and the politicking of a U.S. presidential election year—particularly candidate Donald Trump’s call for a ban on Muslim’s entering the country and President Obama’s response in the State of the Union address last week—have reanimated the dispute in the mainstream media, and proponents and detractors, alike, have marshalled “experts” to validate their positions.

To understand a religion, it’s only logical to begin by examining its literature. And indeed, extensive studies in a variety of academic disciplines are routinely conducted to scrutinize and compare the texts of the world’s great religions.

We thought it would be interesting to bring to bear the sophisticated data mining technology available today through natural language processing and unstructured text analytics to objectively assess the content of these books at the surface level.

So, we’ve conducted a shallow but wide comparative analysis using OdinText to determine with as little bias as possible whether the Quran is really more violent than its Judeo-Christian counterparts.

Part II: Emotional Analysis Reveals Bible is “Angriest”

From part 2:

In my previous post, I discussed our potentially hazardous plan to perform a comparative analysis using an advanced data mining platform—OdinText—across three of the most important texts in human history: The Old Testament, The New Testament and the Quran.

Author’s note: For more details about the data sources and methodology, please see Part I of this series.

The project was inspired by the ongoing public debate around whether or not terrorism connected with Islamic fundamentalism reflects something inherently and distinctly violent about Islam compared to other major religions.

Before sharing the first set of results with you here today, due to the sensitive nature of this topic, I feel obliged to reiterate that this analysis represents only a cursory, superficial view of just the texts, themselves. It is in no way intended to advance any agenda or to conclusively prove anyone’s point.

Part III – Violence, Mercy and Non-Believers – to appear soon.

This comparison may be an inducement for some to learn text/sentiment analysis, but I would view its results with a great deal of caution.

Two of the comments to the first post read:


(comment) If you’re not completing the analysis in the native language, you’re just analyzing the translators’ understanding and interpretation of the texts; this is very different than the actual texts.

(to which a computational linguist replies) Technically, that is certainly true. However, if you are looking at broad categories of sentiment or topic, as this analysis does, there should be little variation in the results between translations, or by using the original. As well, it could be argued that what is most of interest is the viewpoint of the interpreters of the text, hence the translations may be *more* of interest, to some extent. But I would not expect that this analysis would be very sensitive at all to variations in translation or even language.

I find the position taken by the computational linguist almost incomprehensible.

Not only do we lack anything approaching a full social context for any of the texts in their original languages, but terms that occur only once (hapaxes) number approximately 1,300 in the Hebrew Bible and over 3,500 in the New Testament. For a discussion of the Qur’ān, see: Hapaxes in the Qur’ān: identifying and cataloguing lone words (and loanwords) by Shawkat M. Toorawa. Toorawa includes a list of hapaxes for the Qur’ān, a discussion of why they are important and a comparison to other texts.
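If you want to see how many hapaxes turn up in a translation you have on hand, here is a minimal sketch in Python. The function name hapaxes and the crude tokenizer are my assumptions for illustration only. Serious work on Hebrew, Greek or Arabic needs a language-specific tokenizer and lemmatizer, which is rather the point: the count you get depends on the translation and the tooling.

```python
import re
from collections import Counter

def hapaxes(text):
    """Return the sorted list of words that occur exactly once in text."""
    # Crude tokenizer (an assumption): lowercase, then take runs of letters.
    # It does no lemmatization, so "curse" and "curses" count separately.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n == 1)

sample = "In the beginning God created the heaven and the earth"
print(hapaxes(sample))
# ['and', 'beginning', 'created', 'earth', 'god', 'heaven', 'in']
```

Run it over two English translations of the same book and compare the lists. Any difference is the translators talking, not the original text.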

Here is a quick example of where social context can change how you read a text:

23 The priest is to write these curses on a scroll and then wash them off into the bitter water. 24 He shall have the woman drink the bitter water that brings a curse, and this water will enter her and cause bitter suffering. 25 The priest is to take from her hands the grain offering for jealousy, wave it before the LORD and bring it to the altar. 26 The priest is then to take a handful of the grain offering as a memorial offering and burn it on the altar; after that, he is to have the woman drink the water. 27 If she has defiled herself and been unfaithful to her husband, then when she is made to drink the water that brings a curse, it will go into her and cause bitter suffering; her abdomen will swell and her thigh waste away, and she will become accursed among her people. (Numbers 5:23-27)

Does that sound sexist to you?

Interesting, because a Hebrew Bible professor of mine argued that it is one of the earliest pro-women passages in the text.

Think about the social context. There are no police and no domestic courts. Short of retribution from the wife’s family members, there are no constraints on what a husband can do to his wife. Even killing her wasn’t beyond the pale.

Given that context, setting up a test that no one can fail, in the presence of a priest, which also deters resorting to a violent remedy, sounds like it gets the wife out of a dangerous situation where the priest can say: “See, you were jealous for no reason, etc.”

There’s no guarantee that is the correct interpretation either, but it does accord with present understandings of law and custom at the time. The preservation of order in the community, no mean thing in the absence of an organized police force, was important.

The English words used in translations also have their own context, which may be resolved differently from those in the original languages.

As I said, interesting but consider with a great deal of caution.

Improve Your Data Literacy: 16 Blogs to Follow in 2016

Friday, January 22nd, 2016

Improve Your Data Literacy: 16 Blogs to Follow in 2016 by Cedric Lombion.

From the post:

Learning data literacy is a never-ending process. Going to workshops and hands-on practice are important, but to really become acquainted with the “culture” of data literacy, you’ll have to do a lot of reading. Don’t worry, we’ve got your back: below is a curated list of 16 blogs to follow in 2016 if you want to: improve your data-visualisation skills; see the best examples of data journalism; discover the methodology behind the best data-driven projects; and pick-up some essential tips for working with data.

There are aggregated feeds to add to Feedly, but it would have been more convenient to have one collection for all the feeds.

As you add feeds to Feedly or elsewhere, you will quickly find there are more feeds and stories than hours in the day.

The open question is how much data curation is required to make a viable publication. There are lots of lists, some with more or fewer comments, but what level of detail is required to create a financially viable publication?

How to verify images like a pro with Google Earth

Friday, January 22nd, 2016

How to verify images like a pro with Google Earth by Jenni Sargent.

From the post:

Google Earth offers much more than just satellite images. Find out how features like historical imagery, 3D buildings and measurement markers can help you confirm the exact location of an eyewitness photo or video.

Jenni has collected a number of guides and tips for you to make effective use of Google Earth as a verification tool.

Link rot mandates that you recheck links to verified images, but this is a very useful tool for building up a collection of verified images for later inclusion in subject-specific topic maps.

For example, you remember the Charlie Hebdo images showing all the government types as though they were standing in public? And later you saw the images proving that was a fraud? That gathering was on a separate street cleared of all others.

Problem is you are on deadline and can’t seem to pull up the image proving the fraud.

Multiply that by the number of times every day that you almost remember a resource but can’t seem to find the right one.

The delivered content might not be in topic map syntax; after all, the user wants the information, not a lesson on how it was delivered.

Something to consider.

Kindermädchen (Nanny) Court Protects Facebook Users – Hunting Down Original Sources

Friday, January 22nd, 2016

Facebook’s Friend Finder found unlawful by Germany’s highest court by Lisa Vaas.

From the post:

Reuters reports that a panel of the Federal Court of Justice has ruled that Facebook’s Friend Finder feature, used to encourage users to market the social media network to their contacts, constituted advertising harassment in a case that was filed in 2010 by the Federation of German Consumer Organisations (VZBV).

Friends Finder asks users for permission to snort the e-mail addresses of their friends or contacts from their address books, thereby allowing the company to send invitations to non-Facebook users to join up.

There was a time when German civil law and the reasoning of its courts were held in high regard. I regret to say it appears that may no longer be the case.

This decision on Facebook asking users to spread the use of Facebook is a good example.

From the Reuters account, it appears that sending of unsolicited email is the key to the court’s decision.

It’s difficult to say much more about the court’s decision because finding something other than re-tellings of the Reuters report is difficult.

You can start with the VZBV press release on the decision: Wegweisendes BGH-Urteil: Facebooks Einladungs-E-Mails waren unlautere Werbung, but it too is just a summary.

Unlike the Reuters report, it at least has: Auf anderen Webseiten Pressemitteilung des BGH, which takes you to: Bundesgerichtshof zur Facebook-Funktion “Freunde finden,” a press release by the court about its decision. 😉

The court’s press release offers: Siehe auch: Urteil des I. Zivilsenats vom 14.1.2016 – I ZR 65/14 –, which links to a registration facility to subscribe for a notice of the opinion of the court when it is published.

No promises on when the decision will appear. I subscribed today, January 22nd, and the decision was made on January 14, 2016.

I did check Aktuelle Entscheidungen des Bundesgerichtshofes (recent decisions), but it refers you back to the register for the opinion to appear in the future.

Without the actual decision, it’s hard to tell if the court is unaware of the “delete” key on German keyboards or if there is some other reason to inject itself into a common practice on social media sites.

I will post a link to the decision when it becomes available. (The German court makes its decisions available for free to the public and charges a document fee for profit making services, or so I understand the terms of the site.)

PS: For journalists, researchers, bloggers, etc. I consider it a best practice to always include pointers to original sources.

PPS: The German keyboard does include a delete key (Entf) if you had any doubts:

[Image: German T2 keyboard prototype, May 2012]

CyberLoafing vs. MeetingLoafing

Friday, January 22nd, 2016

A news clip advised this morning:

A new study by Matthew McCarter, associate professor of management at The University of Texas at San Antonio (UTSA), looks into the bane of managers in nearly every industry: employees slacking off by excessively surfing the Internet, an activity known as cyberloafing. (The Role of the Decision-Making Regime on Cooperation in a Workgroup Social Dilemma: An Examination of Cyberloafing)

The news summary includes this line:

Cyberloafing costs hundreds of millions of dollars in lost productivity in the United States annually.

If that sends a wave of panic through your managers, consider the following headline:

$37 billion per year in unnecessary meetings, what is your share?

Typical of management to strain at pennies on the ground, cyberloafing, while $100 bills are blowing just overhead, meetingloafing.

The former enables management to “manage” (read interfere with activities they don’t understand) and the latter points to management’s non-contribution to the service or product of the enterprise. You can guess which one management prefers.

Hire people who are creative/productive with a minimum of management and enable them to be creative/productive.

What’s so difficult about that?

(Unless you are in management and need to feel validated. If so, consult a priest or rabbi. They are in the business of validation.)

Parasitic Re-use of Data? Institutionalizing Toadyism.

Thursday, January 21st, 2016

Data Sharing by Dan L. Longo, M.D., and Jeffrey M. Drazen, M.D, N Engl J Med 2016; 374:276-277 January 21, 2016 DOI: 10.1056/NEJMe1516564.

This editorial in the New England Journal of Medicine advocates the following for re-use of medical data:


How would data sharing work best? We think it should happen symbiotically, not parasitically. Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested. What is learned may be beautiful even when seen from close up.

I had to check my calendar to make sure April the 1st hadn’t slipped up on me.

This is one of the most bizarre and malignant proposals on data re-use that I have seen.

If you have an original idea, you have to approach other researchers as a suppliant and ask them to benefit from your idea, possibly using their data in new and innovative ways?

Does that smack of a “good old boys/girls” club to you?

If anyone uses the term parasitic or parasite with regard to data re-use, be sure to respond with the question:

How much do dogs in the manger contribute to science?

That phenomenon is not unknown in the humanities nor in biblical studies. There was a wave of very disgusting dissertations that began with “…X entrusted me with this fragment of the Dead Sea Scrolls….”

I suppose those professors knew their ability to attract students based on merit versus their hoarding of original text fragments better than I did. You should judge them by their choices.

What Drives Compliance? Hint: The P Word Missing From Cybersecurity Discussions

Thursday, January 21st, 2016

Majority of Organizations Have False Sense of Data Security by David Weldon.

From the post:

A majority of organizations equate IT security compliance with actual strong defense, and are thereby leaving their data at risk to cyber incidents through a false sense of security.

That is the conclusion of the 2016 Vormetric Data Threat Report, released today by analyst firm 451 Research and Vormetric, a leader in enterprise data security.

The fourth annual report, which polled 1,100 senior IT security executives at large enterprises worldwide, details the rates of data breach and compliance failures, perceptions of threats to data, data security stances and IT security spending plans. The study looked at physical, virtual, big data and cloud environments.

The bad news: 91 percent of organizations are vulnerable to data threats by not taking IT security measures beyond what is required by industry standards or government regulation.

Compliance occurs 44 times in the report, the third and fourth times in:

We’re also seeing encouraging signs that data security is moving beyond serving as merely a compliance checkbox. Though compliance remains a top reason for both securing sensitive data and spending on data security products and services, implementing security best practices posted the largest gain across all regions.

Why would compliance be the top reason for data security measures?

I consulted Compliance Week, a leading compliance zine that featured on its enforcement blog: Court: Compliance Officers Must Ensure Compliance With AML Laws by Jaclyn Jaeger.

Here’s the lead paragraph from that story:

A federal district court this month upheld a $1 million fine imposed against the former chief compliance officer for MoneyGram International, finding that individual officers, including chief compliance officers, of financial institutions may be held responsible for ensuring compliance with the anti-money laundering provisions of the Bank Secrecy Act.

A $1 million fine is an incentive in favor of compliance.

A very large incentive.

Let’s compare the incentives for compliance versus cybersecurity:

Non-Compliance $1 million
Data Breach $0.00

I selected the first compliance penalty I saw, and such penalties run an entire range, some higher and some lower. The crucial point is that non-compliance carries penalties. Substantial ones in some cases.

Compare the iPhone “cookie theft bug” that took 18 months to fix. Penalty imposed on the vendor: $0.00.

Cybersecurity proposals without a stick are a waste of storage and more importantly, your time.

A Practical Guide to Graph Databases

Wednesday, January 20th, 2016

A Practical Guide to Graph Databases by Matthias Broecheler.

Slides from Graph Day 2016 @ Austin.

If you notice any of the “trash talking” on social media about graphs and graph databases, you will find slide 15 quite amusing.

Not everyone agrees on the relative position of graph products. 😉

I haven’t seen a video of Matthias’ presentation. If you happen across one, give me a ping. Thanks!

You Can Contribute Proof Reading! (Emacs/Elisp)

Wednesday, January 20th, 2016

I saw a tweet today by John Wiegley asking for volunteers to proof read the manual for Emacs 25.0.50 and Elisp 25.0.50.

Obtain the files: http://ftp.newartisans.com/pub/emacs/manuals/ (PDF and info formats)

Report bugs: M-x report-emacs-bug.

Is this the year you are going to make a contribution to an open source project?

Why Google Search Results Favor Democrats

Wednesday, January 20th, 2016

Why Google Search Results Favor Democrats by Daniel Trielli, Sean Mussenden, and Nicholas Diakopoulos.

From the post:

As early as 2010, researchers at Harvard University started finding evidence that Google’s search rankings were not so objective, favoring its own products over those of competitors. A Federal Trade Commission investigation into the conglomerate in 2012 also indicated evidence that the company was using its monopoly power to help its own businesses. So it’s no secret that Google search results aren’t a font of objective and unbiased information. Now, as we enter into prime-time politics season in the U.S., the searching for candidates is heating up.  So what do Google’s biased search results mean for the election and for democracy itself?

Google is not fair; it favors some candidates, and it opposes others. And so far, it seems to prefer Democrats.

Our crowdsourced analysis of Google search results on Dec. 1 for the names of 16 presidential candidates revealed that Democrats fared better than Republicans when it came to supportive and positive sites within the first page of results. Democrats had, on average, seven favorable search results in those top 10, whereas GOP candidates had only 5.9.

You should search for some of the presidential candidates for yourself. My experience wasn’t the one reported by Trielli and friends. At least for the major candidates (Sanders, Clinton, Trump, Rubio), the first ten results had campaign homepages, Twitter pages, Wikipedia articles, etc.

Clinton did have an article on the latest from her email scandal and there was a recycled (Aug. 2015) link to a New Yorker piece on Trump trying to tie him to white extremists.

Clinton’s email scandal is a “truther” issue and not of interest to any sane person and people are voting for Trump because he may be a white extremist. I don’t see how either link, although “negative,” hurt either candidate.

Trielli and company go on to say how people trust Google search results and therefore Google has some special obligation to play fair.

Google may feel that way, but the criterion for search isn’t truth but satisfaction of the user’s search request. In some cases that may include accurate, factual information, but only by happenstance.

I’m leery of anyone who wants to police the food I consume, the television channels (if any) I watch and to police search results on my behalf.

If you thought vendors have an agenda, you haven’t met many “officious intermeddlers” recently. 😉

While I appreciate the concern over search content, I prefer to judge those results on my own.

You?

Writing Clickbait TopicMaps?

Wednesday, January 20th, 2016

‘Shocking Celebrity Nip Slips’: Secrets I Learned Writing Clickbait Journalism by Kate Lloyd.

I’m sat at a desk in a glossy London publishing house. On the floors around me, writers are working on tough investigations and hard news. I, meanwhile, am updating a feature called “Shocking celebrity nip-slips: boobs on the loose.” My computer screen is packed with images of tanned reality star flesh as I write captions in the voice of a strip club announcer: “Snooki’s nunga-nungas just popped out to say hello!” I type. “Whoops! Looks like Kim Kardashian forgot to wear a bra today!”

Back in 2013, I worked for a women’s celebrity news website. I stumbled into the industry at a time when online editors were panicking: Their sites were funded by advertisers who demanded that as many people as possible viewed stories. This meant writing things readers loved and shared, but also resorting to shadier tactics. With views dwindling, publications like mine often turned to the gospel of search engine optimisation, also known as SEO, for guidance.

Like making a deal with a highly-optimized devil, relying heavily on SEO to push readers to websites has a high moral price for publishers. When it comes to female pop stars and actors, people are often more likely to search for the celebrity’s name with the words “naked,” “boobs,” “butt,” “weight,” and “bikini” than with the names of their albums or movies. Since 2008, “Miley Cyrus naked” has been consistently Googled more than “Miley Cyrus music,” “Miley Cyrus album,” “Miley Cyrus show,” and “Miley Cyrus Instagram.” Plus, “Emma Watson naked” has been Googled more than “Emma Watson movie” since she was 15. In fact, “Emma Watson feet” gets more search traffic than “Emma Watson style,” which might explain why one women’s site has a fashion feature called “Emma Watson is an excellent foot fetish candidate.”

If you don’t know what other people are searching for, try these two resources on Google Trends:

Hacking the Google Trends API (2014)

PyTrends – Pseudo API for Google Trends (Updated six days ago)

Depending on your sensibilities, you could collect content on celebrities into a topic map and when their searches spike, you can release links to the new material plus save readers the time of locating older content.

That might even be a viable business model.
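Spotting the spike is the easy part. Here is a minimal sketch, assuming you already have a list of weekly search-interest numbers for a name (however you obtained them; the figures below are invented for illustration). The function name spikes, the window of four weeks and the threshold of three standard deviations are all my assumptions, not anything prescribed by Google Trends or PyTrends.

```python
from statistics import mean, stdev

def spikes(series, window=4, threshold=3.0):
    """Return indexes whose value exceeds the trailing-window mean
    by more than `threshold` trailing standard deviations."""
    hits = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        # Floor sigma at 1.0 so a perfectly flat window does not
        # flag every tiny wiggle as a spike.
        if series[i] > mu + threshold * max(sigma, 1.0):
            hits.append(i)
    return hits

# Hypothetical weekly interest scores for one celebrity.
weekly_interest = [20, 22, 19, 21, 20, 95, 40, 22]
print(spikes(weekly_interest))  # [5]
```

Week 5 is when you release the links to new material, along with the older content you have already collected in the topic map.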

Thoughts?

The Semasiology of Open Source [How Do You Define Source?]

Wednesday, January 20th, 2016

The Semasiology of Open Source by Robert Lefkowitz (Then, VP Enterprise Systems & Architecture, AT&T Wireless) 2004. Audio file.

Robert’s keynote from the Open Source Convention (OSCON) 2004 in Portland, Oregon.

From the description:

Semasiology, n. The science of meanings or sense development (of words); the explanation of the development and changes of the meanings of words. Source: Webster’s Revised Unabridged Dictionary, 1996, 1998 MICRA, Inc. “Open source doesn’t just mean access to the source code.” So begins the Open Source Definition. What then, does access to the source code mean? Seen through the lens of an Enterprise user, what does open source mean? When is (or isn’t) it significant? And a catalogue of open source related arbitrage opportunities.

If you haven’t heard this keynote (I hadn’t), do yourself a favor and make time to listen to it.

I do have one complaint: It’s not long enough. 😉

Enjoy!

Visualizing The Impact Of Hacks

Tuesday, January 19th, 2016

I saw Lisa Vaas’ story Hyatt says 250 hotels were drained of credit card details, which has the usual recitation of the number of hotels and the approximate number of countries.

Numbers that we all read and just nod at, without their making a real impression on us.

How’s this for a restatement of the Hyatt hack’s impact:

[Image: world map with the countries affected by the Hyatt hack shown in green]

Every country you see in green was subject to the Hyatt hack from August 13, 2015 until December 8, 2015.

Now are you impressed?

Compare that to the roughly 35K disgruntled people who compose the Islamic State:

[Image: map with pins marking the Islamic State]

(Originally from: https://www.google.com/maps/d/viewer?mid=zDzQXfEc6tT8.k5aa_iAge_9E&hl=en)

That set of pins on little more than a dot is what all the hand wringing about the Islamic State is about.

At 35K members, the Islamic State doesn’t make up half of a good football crowd on any given Saturday.

Let’s use fewer numbers for hack reports and help make cybersecurity a priority.

BTW, I constructed the color map with Mapchart.net. It was actually quite nice.

PS: Several small islands, such as Aruba, were among the hacked countries but do not appear on the Mapchart.net map.

90% banking, payment, health apps – Are Insecure – Surprised?

Monday, January 18th, 2016

Most Health and Financial Mobile Apps Are Rife With Vulnerabilities by Tara Seals.

From the post:

When it comes to mobile app security, there appears to be a disparity between consumer confidence in the level of security incorporated into mobile health and finance apps, and the degree to which those apps are actually vulnerable to common hack techniques (code tampering and reverse-engineering). In turn this has clear implications for both patient safety and data security.

According to Arxan Technologies’ 5th Annual State of Application Security Report, the majority of app users and app executives believe their apps to be secure. A combined 84% of respondents said that the offerings are “adequately secure,” and 63% believe that app providers are doing “everything they can” to protect their mobile health and finance apps.

Yet, nearly all of the apps that Arxan assessed, (90% of them in fact, including popular banking and payment apps and government-approved health apps), proved to be vulnerable to at least two of the Open Web Application Security Project (OWASP) Mobile Top 10 Risks, which could result in privacy violations, theft of customer credentials and other malicious acts, including device tampering.

I’m not proud, I’ll admit to being surprised.

I thought 100% of banking, payment and health care apps would be found to be vulnerable.

Perhaps the 90% range was just on cursory review.

Seriously.

After decades of patch-after-vulnerability-found, with no financial incentives to change that practice, what did you expect?

The real surprise for me was anyone thinking off the shelf apps were secure at all. Ever.

Such users are not following the news or have a crack pipe as a security consultant.

Illusory Truth (Illusory Publication)

Monday, January 18th, 2016

On Known Unknowns: Fluency and the Neural Mechanisms of Illusory Truth by Wei-Chun Wang, et al. Journal of Cognitive Neuroscience, Posted Online January 14, 2016. (doi:10.1162/jocn_a_00923)

Abstract:

The “illusory truth” effect refers to the phenomenon whereby repetition of a statement increases its likelihood of being judged true. This phenomenon has important implications for how we come to believe oft-repeated information that may be misleading or unknown. Behavioral evidence indicates that fluency or the subjective ease experienced while processing a statement underlies this effect. This suggests that illusory truth should be mediated by brain regions previously linked to fluency, such as the perirhinal cortex (PRC). To investigate this possibility, we scanned participants with fMRI while they rated the truth of unknown statements, half of which were presented earlier (i.e., repeated). The only brain region that showed an interaction between repetition and ratings of perceived truth was PRC, where activity increased with truth ratings for repeated, but not for new, statements. This finding supports the hypothesis that illusory truth is mediated by a fluency mechanism and further strengthens the link between PRC and fluency.

Whether you are crowd sourcing authoring of a topic map, measuring sentiment or having content authored by known authors, you are unlikely to want it populated by illusory truths. That is, truths your sources would swear to but that are in fact false (from a certain point of view).

I would like to say more about what this article reports, but it is an “illusory publication” that resides behind a pay-wall, so I don’t know what it says in fact.

Isn’t that ironic? An article on illusory truth that cannot substantiate its own claims. It can only repeat them.

I first saw this in a tweet by Stefano Bertolo.

Map Of A Single Tweet – Not Suitable For Current Use

Sunday, January 17th, 2016

I encountered a color-coded map of a single Tweet today:

[Image: color-coded map of the metadata in a single Tweet]

The original is at: http://online.wsj.com/public/resources/documents/TweetMetadata.pdf.

I haven’t done a detailed comparison against the Twitter API documentation, but suffice it to say this map should not be cited and should be used only with caution.

I don’t think anything in the map is wrong, but it isn’t complete. It is missing, for example, possibly_sensitive, quoted_status_id, quoted_status_id_str, quoted_status and others.
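One way to start a detailed comparison is to diff the field names on the map against a tweet you have actually retrieved. A minimal sketch follows. The set of mapped field names and the trimmed tweet payload are both hypothetical stand-ins I made up for illustration; substitute the names transcribed from the PDF and a real API response.

```python
import json

def missing_fields(tweet, mapped_fields):
    """Return top-level keys present in the tweet but absent from the map."""
    return sorted(set(tweet) - set(mapped_fields))

# Hypothetical subset of the field names shown on the map.
mapped = {"id", "id_str", "text", "created_at", "user", "entities"}

# Hypothetical, heavily trimmed tweet payload.
tweet = json.loads("""
{"id": 1, "id_str": "1", "text": "hello", "created_at": "Sun Jan 17 2016",
 "user": {"screen_name": "example"}, "possibly_sensitive": false,
 "quoted_status_id": 2, "quoted_status_id_str": "2"}
""")

print(missing_fields(tweet, mapped))
# ['possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str']
```

This only checks top-level keys. Nested objects such as user or entities would need the same treatment applied recursively.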

Suggestions for an updated map of a single Tweet?

Even the out-dated map gives you a good idea of the richness of information that can be transmitted by a single tweet.

Makes me wonder who is using the 140 characters and/or additional data for open but secure communication?

Your Apple Malware Protection Is Good For 5 Minutes

Sunday, January 17th, 2016

Researcher Bypasses Apple’s Updated Malware Protection in ‘5 Minutes’ by Lorenzo Franceschi-Bicchierai.

From the post:

Apple’s Mac computers have long been considered safer than their Windows-powered counterparts—so much so that the common belief for a long time was that they couldn’t get viruses or malware. Even Apple adopted that cliche for marketing purposes.

The reality, however, is slightly different. Trojans have targeted Mac computers for years, and things don’t seem to be improving. In fact, cybercriminals created more malware targeting Macs in 2015 than in the past five years combined, according to one study. Since 2012, Apple has tried to protect users with Gatekeeper, a feature designed to block common threats such as fake antivirus products, infected torrent files, and fake Flash installers—all malicious software that Mac users might download while regularly browsing the internet.

But it looks like Gatekeeper’s walls aren’t as strong as they should be. Patrick Wardle, a security researcher who works for the security firm Synack, has been poking holes in Gatekeeper for months. In fact, Wardle is still finding ways to bypass Gatekeeper, even after Apple issued patches for two of the vulnerabilities he found last year.

As it is designed now, Gatekeeper checks apps downloaded from the internet to see if they are digitally signed by either Apple or a developer recognized by Apple. If so, Gatekeeper lets the app run on the machine. If not, Gatekeeper prevents the user from installing and executing the app.

That Apple and Wardle have been going back and forth for months, with Wardle sans the actual source code, is further evidence of the software quality you get with no liability for security flaws in software.

You would think that when a flaw was discovered in Gatekeeper, a full review would be undertaken to find and fix all of the security issues in Gatekeeper.

No, Apple fixed only the security issue(s) pointed out to it, and no others.

Would that change if there were legal liability for security defects?

There’s only one way to find out.

Teletext Time Travel [Extra Dirty Data]

Sunday, January 17th, 2016

Teletext Time Travel by Russ J. Graham.

From the post:

Transdiffusioner Jason Robertson has a complicated but fun project underway – recovering old teletext data from VHS cassettes.

Previously, it was possible – difficult but possible – to recover teletext from SVHS recordings, but they’re as rare as hen’s teeth as the format never really caught on. The data was captured by ordinary VHS but was never clear enough to get anything but a very few correct characters in amongst a massive amount of nonsense.

Technology is changing that. The continuing boom in processor power means it’s now possible to feed 15 minutes of smudged VHS teletext data into a computer and have it relentlessly compare the pages as they flick by at the top of the picture, choosing to hold characters that are the same on multiple viewing (as they’re likely to be right) and keep trying for clearer information for characters that frequently change (as they’re likely to be wrong).

I mention this so that the next time you complain about your “dirty data,” you will remember there is far dirtier data in the world!
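The idea in the quoted passage, holding characters that agree across multiple viewings and distrusting those that keep changing, can be sketched as a per-position majority vote. This is not Jason Robertson's software, which works from the captured video signal; it is only a toy illustration on already-decoded strings. The function name merge_captures, the min_agreement threshold and the sample captures are all my own assumptions.

```python
from collections import Counter


def merge_captures(captures, min_agreement=0.5, unknown="?"):
    """Merge several noisy captures of the same line by per-position vote.

    A character is kept only if more than `min_agreement` of the captures
    agree on it at that position; otherwise `unknown` is emitted.
    Expects at least one capture.
    """
    width = max(len(c) for c in captures)
    merged = []
    for i in range(width):
        # Characters seen at position i across all captures long enough.
        column = [c[i] for c in captures if i < len(c)]
        char, count = Counter(column).most_common(1)[0]
        merged.append(char if count / len(column) > min_agreement else unknown)
    return "".join(merged)


# Hypothetical noisy captures of one teletext header line.
noisy = ["BBC CEEFAX", "BBC CEEFQX", "B8C CEEFAX", "BBC CEEFAX"]
print(merge_captures(noisy))
```

The more captures of the same page you feed in, the more likely the stable characters are to outvote the smudges, which is why longer recordings help.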

The past and present of hypertext

Sunday, January 17th, 2016

The past and present of hypertext by Bob DuCharme.

From the post:

You know, links in the middle of sentences.

I’ve been thinking lately about the visionary optimism of the days when people dreamed of the promise of large-scale hypertext systems. I’m pretty sure they didn’t mean linkless content down the middle of a screen with columns of ads to the left and right of it, which is much of what we read off of screens these days. I certainly don’t want to start one of those rants of “the World Wide Web is deficient because it’s missing features X and Y, which by golly we had in the HyperThingie™ system that I helped design back in the 80s, and the W3C should have paid more attention to us” because I’ve seen too many of those. The web got so popular because Tim Berners-Lee found such an excellent balance between which features to incorporate and which (for example, central link management) to skip.

The idea of inline links, in which words and phrases in the middle of sentences link to other documents related to those words and phrases, was considered an exciting thing back when we got most of information from printed paper. A hypertext system had links between the documents stored in that system, and the especially exciting thing about a “world wide” hypertext system was that any document could link to any other document in the world.

But who does, in 2016? The reason I’ve been thinking more about the past and present of hypertext (a word that, sixteen years into the twenty-first century, is looking a bit quaint) is that since adding a few links to something I was writing at work recently, I’ve been more mindful of which major web sites include how many inline links and how many of those links go to other sites. For example, while reading the article Bayes’s Theorem: What’s the Big Deal? on Scientific American’s site recently, I found myself thinking “good for you guys, with all those useful links to other web sites right in the body of your article!”

My experience with contemporary hyperlinks has been like Bob’s. There are sites that cite only themselves but there are also sites that do point to external sources. Perhaps the most annoying failure to hyperlink is when a text mentions a document, report or agreement, and then fails to link the reader to that document.

The New York Times has a distinct and severe poverty of external links to original source materials. Some stories do have external links but not nearly all of them. Which surprises me for any news reporting site, much less the New York Times.

More hypertext linking would be great, but being able to compose documents from other documents, not our cut-n-paste of today but transclusion into a new document, that would be much better.

Voynich Manuscript:…

Sunday, January 17th, 2016

Voynich Manuscript: word vectors and t-SNE visualization of some patterns by Christian S. Perone.

From the post:

The Voynich Manuscript is a hand-written codex written in an unknown system and carbon-dated to the early 15th century (1404–1438). Although the manuscript has been studied by some famous cryptographers of the World War I and II, nobody has deciphered it yet. The manuscript is known to be written in two different languages (Language A and Language B) and it is also known to be written by a group of people. The manuscript itself is always subject of a lot of different hypothesis, including the one that I like the most which is the “culture extinction” hypothesis, supported in 2014 by Stephen Bax. This hypothesis states that the codex isn’t ciphered, it states that the codex was just written in an unknown language that disappeared due to a culture extinction. In 2014, Stephen Bax proposed a provisional, partial decoding of the manuscript, the video of his presentation is very interesting and I really recommend you to watch if you like this codex. There is also a transcription of the manuscript done thanks to the hard-work of many folks working on it since many moons ago.

Word vectors

My idea when I heard about the work of Stephen Bax was to try to capture the patterns of the text using word2vec. Word embeddings are created by using a shallow neural network architecture. It is a unsupervised technique that uses supervided learning tasks to learn the linguistic context of the words. Here is a visualization of this architecture from the TensorFlow site:

Proof that word vectors can be used to analyze unknown texts and manuscripts!

Enjoy!

PS: Glance at the right-hand column of Christian’s blog. If you are interested in data analysis using Python, he would be a great person to follow on Twitter: Christian S. Perone

For A Lack of Stoners, The Cybersecurity War Was Lost

Sunday, January 17th, 2016

Motherboard published The FBI Says It Can’t Find Hackers to Hire Because They All Smoke Pot by Max Cherney in May of 2014.

If you search for “marijuana FBI hiring,” the first two pages of results are a spate of articles dated 2014, with State Marijuana Laws Complicate Federal Job Recruitment by Matthew Rosenberg and Mark Mazzetti (NYT) being the only 2015 news.

By the third page of results you get a smattering of 2015 posts, mostly to the same effect as Rosenberg and Mazzetti, that the FBI in particular (3 years with no pot use) and many other federal agencies, have pushed their heads even further up their asses than before.

The one bright spot in government is the US Forest Service, which advertises that its positions “are not drug tested.”

From the stories I read, there appears to be no hope that the FBI and other law enforcement agencies will adopt saner hiring strategies any time soon. Even if they did, there would be the legacy of prejudice against new hires who are allowed to smoke pot and the old hires who wish they had been smoking pot.

Hopefully private industry won’t continue to make the same mistake or at least exempt IT/hacker services from drug testing.

CEOs and CIOs need to ask themselves: Do you want the best hires and the best work, or do you want to advance the social engineering fantasy of a kill-joy government drone?

What’s your call?

The Student, the Fish, and Agassiz [Viewing Is Not Seeing]

Saturday, January 16th, 2016

The Student, the Fish, and Agassiz by Samuel H. Scudder.

I was reminded of this story by Jenni Sargent’s Piecing together visual clues for verification.

Like Jenni, I assume that we can photograph, photo-copy or otherwise image anything of interest. Quickly.

But in quickly creating images, we also created the need to skim images, missing details that longer study would capture.

You should read the story in full but here’s enough to capture your interest:

It was more than fifteen years ago that I entered the laboratory of Professor Agassiz, and told him I had enrolled my name in the scientific school as a student of natural history. He asked me a few questions about my object in coming, my antecedents generally, the mode in which I afterwards proposed to use the knowledge I might acquire, and finally, whether I wished to study any special branch. To the latter I replied that while I wished to be well grounded in all departments of zoology, I purposed to devote myself specially to insects.

“When do you wish to begin?” he asked.

“Now,” I replied.

This seemed to please him, and with an energetic “Very well,” he reached from a shelf a huge jar of specimens in yellow alcohol.

“Take this fish,” he said, “and look at it; we call it a Haemulon; by and by I will ask what you have seen.”

In ten minutes I had seen all that could be seen in that fish, and started in search of the professor, who had, however, left the museum; and when I returned, after lingering over some of the odd animals stored in the upper apartment, my specimen was dry all over. I dashed the fluid over the fish as if to resuscitate it from a fainting-fit, and looked with anxiety for a return of a normal, sloppy appearance. This little excitement over, nothing was to be done but return to a steadfast gaze at my mute companion. Half an hour passed, an hour, another hour; the fish began to look loathsome. I turned it over and around; looked it in the face — ghastly; from behind, beneath, above, sideways, at a three-quarters view — just as ghastly. I was in despair; at an early hour, I concluded that lunch was necessary; so with infinite relief, the fish was carefully replaced in the jar, and for an hour I was free.

On my return, I learned that Professor Agassiz had been at the museum, but had gone and would not return for several hours. My fellow students were too busy to be disturbed by continued conversation. Slowly I drew forth that hideous fish, and with a feeling of desperation again looked at it. I might not use a magnifying glass; instruments of all kinds were interdicted. My two hands, my two eyes, and the fish; it seemed a most limited field. I pushed my fingers down its throat to see how sharp its teeth were. I began to count the scales in the different rows until I was convinced that that was nonsense. At last a happy thought struck me — I would draw the fish; and now with surprise I began to discover new features in the creature. Just then the professor returned.

“That is right,” said he, “a pencil is one of the best eyes. I am glad to notice, too, that you keep your specimen wet and your bottle corked.”

The student spends many more hours with the same fish but you need to read the account for yourself to fully appreciate it. There are other versions of the story which have been gathered here.

Two questions:

  • When was the last time you spent even ten minutes looking at a photograph or infographic?
  • When was the last time you tried drawing a copy of an image to make sure you are “seeing” all the detail an image has to offer?

I don’t offer myself as a model as “I can’t recall” is my answer to both questions.

In a world awash in images, shouldn’t we all be able to give a better answer than that?


Some additional resources on drawing versus photography:

Why We Should Draw More (and Photograph Less) – School of Life.

Why you should stop taking pictures on your phone – and learn to draw

The Elements of Drawing, in Three Letters to Beginners by John Ruskin

BTW, Ruskin was no Luddite of the mid-nineteenth century. He was an early adopter of photography to document the architecture of Venice.

How many images do you “view” in a day without really “seeing” them?

Piecing together visual clues for verification

Saturday, January 16th, 2016

Piecing together visual clues for verification by Jenni Sargent.

From the post:

When you start working to verify a photo or video, it helps to make a note of every clue you can find. What can you infer about the location from the architecture, for example? What information can you gather from signs and billboards? Are there any distinguishing landmarks or geographical features?

Piecing together and cross referencing these clues with existing data, maps and information can often give you the evidence that you need to establish where a photo or video was captured.

Jenni outlines seven (7) clues to look for in photos and her post includes a video plus an observation challenge!

Good luck with the challenge! Compare your results with one or more colleagues!

Building Web Apps Using Flask and Neo4j [O’Reilly Market Pricing For Content?]

Saturday, January 16th, 2016

Building Web Apps Using Flask and Neo4j

When I first saw this on one of my incoming feeds today I thought it might be of interest.

When I followed the link, I found an O’Reilly video, which broke out to be:

25:23 free minutes and 133:01 minutes for $59.99.

Rounding the paid running time down to two hours, that works out to about $30/hour for the video.
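For anyone who wants to check the arithmetic, here is a small sketch that computes the hourly price three ways: with the paid running time rounded down to whole hours (the figure used above), with the exact paid running time, and with the free minutes counted as well. The running times and price are the ones listed above; the helper name to_hours is my own.

```python
def to_hours(minutes, seconds):
    """Convert a running time given as minutes and seconds into hours."""
    return (minutes * 60 + seconds) / 3600


PRICE = 59.99                                  # listed price in dollars
paid_hours = to_hours(133, 1)                  # 133:01 of paid video
total_hours = paid_hours + to_hours(25, 23)    # plus 25:23 of free video

rounded_down = PRICE / int(paid_hours)   # paid time rounded down to whole hours
exact_paid = PRICE / paid_hours          # exact paid running time
exact_total = PRICE / total_hours        # free and paid minutes together

print(f"rounded down: ${rounded_down:.2f}/hour")
print(f"exact, paid only: ${exact_paid:.2f}/hour")
print(f"exact, free + paid: ${exact_total:.2f}/hour")
```

Without the rounding, the paid portion comes to roughly $27/hour, and a little under $23/hour if you count the free minutes as part of the purchase.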

When you compare that to other links I saw today:

What is the value proposition that sets the price on an O’Reilly video?

So far as I can tell, pricing for content on the Internet is similar to the pricing of seats on airlines.

Pricing for airline seats is beyond “arbitrary” or “capricious.” More akin to “absurd” and/or “whatever a credulous buyer will pay.”

Speculations on possible pricing models O’Reilly is using?

Suggestions on a viable pricing model for content?

End The Lack Of Diversity On The Internet Today!

Saturday, January 16th, 2016

Julia Evans tweeted earlier today:

“programmers are 0.66% of internet users, and build the software that everyone uses” – @heddle317

The strengths of having diversity on teams, including software teams, are well known and I won’t repeat those arguments here.

See: Why Diverse Teams Create Better Work, Diversity and Work Group Performance, More Diverse Personalities Mean More Successful Teams, Managing Groups and Teams/Diversity, or, How Diversity Makes Us Smarter, for five entry points into the literature on diversity.

With 0.66% of internet users writing software for everyone, do you see the lack of diversity?

One response is to turn people into “Linus Torvalds” so we have a broader diversity of people programming. Good thought but I don’t know of anyone who wants to be a Linus Torvalds. (Sorry Linus.)

There’s a great benefit to having more people master programming but long-term, it’s not a solution to the lack of diversity in the production of software for the Internet.

Even if the number of people writing software for the Internet went up ten-fold, that’s only 6.6% of the population of Internet users. Far too monotone to qualify as any type of diversity.

There is another way to increase diversity in the production of Internet software.

Warnings: You will have to express your intuitive experience in words. You will have to communicate your experiences to programmers. Some programmers will think they know a “better way” for you to experience the interface. Always remember your experience is the “user’s” experience, unlike theirs.

You can use, express comments on, track your comments and respond to comments from programmers, on software built for the Internet. Programmers won’t seek you or your comments out so volunteering is the only option.

Programmers have their views, but if software doesn’t meet the need, habits, customs of users, it’s useless.

Programmers can only learn the needs, habits and customs of users from you.

Are you going to help end this lack of diversity and help programmers write better software, or not?

Big data ROI in 2016: Put up, or shut up [IT vendors to “become” consumer-centric?]

Friday, January 15th, 2016

Big data ROI in 2016: Put up, or shut up by David Weldon.

From the post:

When it comes to data analytics investments, this is the year to put up, or shut up.

That is the take of analysts at Forrester Research, who collectively expect organizations to take a hard look at their data analytics investments so far, and see some very real returns on those investments. If strong ROI can’t be shown, some data initiatives may see the plug pulled on those projects.

These sober warnings emerge from Forrester’s top business trends forecast for 2016. Rather than a single study or survey on top trends, the Forrester forecast combines the results of 35 separate studies. Carrie Johnson, senior vice president of research at Forrester, discussed the highlights with Information Management, including the growing impatience at many organizations that big data produce big results, and where organizations truly are on the digital transformation journey.

“I think one of the key surprises is that folks in the industry assume that everyone is farther along than they are,” Johnson explains. “Whether it’s with digital transformation, or a transformation to become a customer-obsessed firm, there are very few companies pulling off those initiatives at a wholesale level very well. Worse, many companies in the year ahead will continue to flail a bit with one-off projects and bolt-on strategies, versus true differentiation through transformation.”

Asked why this misconception exists, Johnson notes that “Vendors do tend to paint a rosier picture of adoption in general because it behooves them. Also, every leader in an organization sees their problems, and then sees an article or sees the use of an app by a competitor and thinks, ‘my gosh, these companies are so far ahead of where we are.’ The reality may be that that app may have been an experiment by a really savvy team in the organization, but it’s not necessarily representative of a larger commitment by the organization, both financially and through resources.”

It’s not the first time you have heard data ROI discussed on this blog but when Forrester Research says it, it sounds more important. Moreover, their analysis is the result of thirty-five separate studies.

Empirical verification (the studies) is good to have, but you don’t need an MBA to realize that businesses that make decisions on some basis other than ROI aren’t businesses for very long. Or, at least not profitable businesses.

David’s conclusion makes it clear that your ROI is your responsibility:

The good news: “We believe that this is the year that IT leaders — and CIOs in particular … embrace a new way of investing in and running technology that is customer-centric….”

If lack of clarity and a defined ROI for IT is a problem at your business, well, it’s your money.

Data Is Not The Same As Truth:…

Friday, January 15th, 2016

Data Is Not The Same As Truth: Interpretation In The Big Data Era by Kalev Leetaru.

From the post:

One of the most common misconceptions of the “big data” world is that from data comes irrefutable truth. Yet, any given piece of data records only a small fragment of our existence and the same piece of data can often support multiple conclusions depending on how it is interpreted. What does this mean for some of the major trends of the data world?

The notion of data supporting multiple conclusions was captured perhaps most famously in 2013 with a very public disagreement between New York Times reporter John Broder and Elon Musk, CEO of Tesla after Broder criticized certain aspects of the vehicle’s performance during a test drive. Using remotely monitored telemetry data the company recorded during Broder’s test drive, Musk argued that Broder had taken certain steps to purposely minimize the car’s capabilities. Broder, in turn, cited the exact same telemetry data to support his original arguments. How could such polar opposite conclusions be supported by the exact same data?

Read Kalev’s post to see how the same data could support both sides of that argument, not to mention many others.

Recitation of that story from memory should be a requirement of every data science related program as a condition of graduation.

Internalizing that story might quiet some of the claims made for “bigdata” and software to process “bigdata.”

It may be the case that actionable insights can be gained from data, big or small, that your company collects.

However, absent an examination of the data and your needs for analysis of that data, the benefits of processing that data remain unknown.

Think of it this way:

Would you order a train full of a new material that has no known relationship to your current products with no idea how it could be used?

Until someone can make a concrete business case for the use of big or small data, that is exactly what an investment in big data processing is to you today.

If after careful analysis, the projected ROI from specific big data processing has been demonstrated to your satisfaction, go for it.

But until then, keep both hands on your wallet when you hear a siren’s song about big data.

10 tools for investigative reporting in 2016

Thursday, January 14th, 2016

10 tools for investigative reporting in 2016 by Sam Berkhead.

From the post:

2015 was a big year for investigative journalism.

From the revelations of YakunovychLeaks in Ukraine to the “Fatal Extraction” investigations across Africa, investigative journalists have been responsible for uncovering some of society’s biggest abuses and holding people in power accountable.

But with newsrooms worldwide downsizing their investigative staffs to cut costs, it’s becoming harder to allocate resources to the journalism itself.

Luckily, the Internet holds a vast amount of free tools, resources and open databases available for aspiring investigative reporters. Here’s IJNet’s roundup of the best tools for investigative journalism to use throughout 2016:

Whether you are an investigative reporter or simply a topic map author looking for information, the Internet is awash with data, but it’s not always easy to find or to use.

The tools that Sam covers here will give you a better chance at finding and using information you discover online.

Bear in mind, of course, that all data, online or not, comes from someone who has an interest in the data and what they think it shows.

New Clojure website is up!

Thursday, January 14th, 2016

Clojure

From the website:

Clojure is a robust, practical, and fast programming language with a set of useful features that together form a simple, coherent, and powerful tool.

The Clojure Programming Language

Clojure is a dynamic, general-purpose programming language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language, yet remains completely dynamic – every feature supported by Clojure is supported at runtime. Clojure provides easy access to the Java frameworks, with optional type hints and type inference, to ensure that calls to Java can avoid reflection.

Clojure is a dialect of Lisp, and shares with Lisp the code-as-data philosophy and a powerful macro system. Clojure is predominantly a functional programming language, and features a rich set of immutable, persistent data structures. When mutable state is needed, Clojure offers a software transactional memory system and reactive Agent system that ensure clean, correct, multithreaded designs.

I hope you find Clojure’s combination of facilities elegant, powerful, practical and fun to use.

Rich Hickey
author of Clojure and CTO Cognitect

I think the green color scheme in the header was intended to tie in with the download link for Clojure 1.7.0. 😉

If you haven’t downloaded Clojure 1.7.0, what are you waiting for?

Enjoy!