Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 14, 2015

How are recommendation engines built?

Filed under: Recommendation — Patrick Durusau @ 1:41 pm

How are recommendation engines built?

From the post:

The success of Amazon and Netflix has made recommendation systems not only common but also extremely popular. For many people, the recommendation system seems to be one of the easiest applications to understand; and a majority of us use them daily.

Haven’t you ever marveled at the ingenuity of a website offering the HDMI cable that goes with a television? Never been tempted by the latest trendy book about vampires? Been irritated by suggestions for diapers or baby powder though your child has been potty-trained for 3 months? Been annoyed to see flat screen TVs pop up on your browser every year with the approach of summer? The answer is, at least to me: “Yes, I have.”

But before cursing, every user should be aware of the difficulty of building an effective recommendation system! Below are some elements on how these systems are built (and ideas for how you can build your own).

A high-level view of some of the strategies that underlie recommendation engines. It won’t help you with the nuts and bolts of building a recommendation engine, but it can serve as a brief introduction.

Recommendation engines could be used with topic maps either to annoy users with guesses about what they would like to see next or, perhaps more usefully, in a topic map authoring context: to alert an author to closely similar material already in the topic map.
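To make the authoring use case concrete, here is a minimal sketch of the item-similarity idea that underlies many engines, assuming scikit-learn, an invented mini-corpus of topic descriptions, and an arbitrary threshold:

```python
# Minimal item-similarity sketch: flag existing topics that closely
# resemble a new draft, using TF-IDF vectors and cosine similarity.
# The corpus, draft text, and 0.3 threshold are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

existing_topics = {
    "recommender-systems": "Collaborative filtering and content-based recommendation engines.",
    "topic-map-authoring": "Workflows for adding and merging subjects in a topic map.",
    "hdmi-cables": "Accessories frequently bought together with televisions.",
}

draft = "Building a content-based recommendation engine with collaborative filtering."

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(list(existing_topics.values()) + [draft])

# Last row is the draft; compare it against every existing topic.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
for (name, _), score in zip(existing_topics.items(), scores):
    if score > 0.3:  # arbitrary cutoff for "closely similar"
        print(f"Possible duplicate subject: {name} (similarity {score:.2f})")
```

Anything scoring above the threshold gets flagged for the author to review before a duplicate topic is created.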

I first saw this in a tweet by Christophe Lalanne.

June 13, 2015

What Is Code?

Filed under: Computer Science,Programming — Patrick Durusau @ 8:33 pm

What Is Code? by Paul Ford.

A truly stunning exposition on programming and computers. Written for business users, it is a model of how to explain technical material to business readers. At some 38,000 words it is far from superficial, and it is consistently insightful.

I suggest you bookmark it and read it on a regular basis. It won’t improve your computer skills but it may improve your communication skills.

If you want to know more background on the piece, see: What Is Code?: A Q&A With Writer and Programmer Paul Ford by Ashley Feinberg

If you want to follow on Twitter: Paul Ford.

Business Linkage Analysis: An Overview

Filed under: Business Intelligence,Topic Maps — Patrick Durusau @ 8:18 pm

Business Linkage Analysis: An Overview by Bob Hayes.

From the post:

Customer feedback professionals are asked to demonstrate the value of their customer feedback programs. They are asked: Does the customer feedback program measure attitudes that are related to real customer behavior? How do we set operational goals to ensure we maximize customer satisfaction? Are the customer feedback metrics predictive of our future financial performance and business growth? Do customers who report higher loyalty spend more than customers who report lower levels of loyalty? To answer these questions, companies look to a process called business linkage analysis.

Business Linkage Analysis is the process of combining different sources of data (e.g., customer, employee, partner, financial, and operational) to uncover important relationships among important variables (e.g., call handle time and customer satisfaction). For our context, linkage analysis will refer to the linking of other data sources to customer feedback metrics (e.g., customer satisfaction, customer loyalty).

Business Case for Linkage Analyses

Based on a recent study on customer feedback programs best practices (Hayes, 2009), I found that companies who regularly conduct operational linkages analyses with their customer feedback data had higher customer loyalty (72nd percentile) compared to companies who do not conduct linkage analyses (50th percentile). Furthermore, customer feedback executives were substantially more satisfied with their customer feedback program in helping them manage customer relationships when linkage analyses (e.g., operational, financial, constituency) were a part of the program (~90% satisfied) compared to their peers in companies who did not use linkage analyses (~55% satisfied). Figure 1 presents the effect size for VOC operational linkage analyses.

Linkage analyses appear to have a positive impact on customer loyalty by providing executives the insights they need to manage customer relationships. These insights give loyalty leaders an advantage over loyalty laggards. Loyalty leaders apply linkage analyses results in a variety of ways to build a more customer-centric company: Determine the ROI of different improvement efforts, create customer-centric operational metrics (important to customers) and set employee training standards to ensure customer loyalty, to name a few. In upcoming posts, I will present specific examples of linkage analyses using customer feedback data.

Discovering linkages between factors hidden in different sources of data?

Or as Bob summarizes:

Business linkage analysis is the process of combining different sources of data to uncover important insights about the causes and consequence of customer satisfaction and loyalty. For VOC programs, linkage analyses fall into three general types: financial, operational, and constituency. Each of these types of linkage analyses provide useful insight that can help senior executives better manage customer relationships and improve business growth. I will provide examples of each type of linkage analyses in following posts.

More posts in this series:

Linking Financial and VoC Metrics

Linking Operational and VoC Metrics

Linking Constituency and VoC Metrics

BTW, VoC = voice of customer.

A large and important investment in data collection, linking, and analysis.
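To make the “linking” step concrete, here is a toy sketch of joining feedback metrics to financial data on a customer key and checking the correlation; all of the data and column names are invented:

```python
# Toy linkage analysis: join customer feedback metrics to financial data
# on a shared customer ID and check how satisfaction relates to spend.
# All data, column names, and the customer key are invented for illustration.
import pandas as pd

feedback = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "satisfaction": [9, 4, 7, 8, 3],   # survey score, 0-10
    "loyalty": [10, 5, 6, 9, 2],       # likelihood to recommend
})
financial = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "annual_spend": [1200, 300, 650, 980, 150],
})

linked = feedback.merge(financial, on="customer_id")

# Pearson correlation between the feedback metrics and spend.
print(linked[["satisfaction", "loyalty", "annual_spend"]].corr()["annual_spend"])
```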

Of course, you do have documentation for all the subjects that occur in your business linkage analysis? So that when that twenty-something who crunches all the numbers leaves, you won’t have to start from scratch? Yes?

Given the state of cybersecurity, I thought it better to ask than to guess.

Topic maps can save you from awkward questions about why the business linkage analysis reports are late, or why they aren’t coming at all until you can replace personnel and have them reconstruct the workflow.

Topic map based documentation is like insurance. You may not need it every day but after a mission critical facility burns to the ground, do you want to be the one to report that your insurance had lapsed?

Tipsheets & Links from 2015 IRE

Filed under: Journalism,News,Reporting — Patrick Durusau @ 3:30 pm

Tipsheets & Links from 2015 IRE

Program listing from the recent Investigative Reporters and Editors (IRE) conference, annotated with tipsheets and links.

I don’t find the format “Tipsheet from speaker (name)” all that useful. That is probably because I don’t recognize the usual haunts of each author, but I suspect the same is true for many others.

Anyone working on a subject index to the tipsheets?

Python Mode for Processing

Filed under: Processing,Python,Visualization — Patrick Durusau @ 3:20 pm

Python Mode for Processing

From the webpage:

You write Processing code. In Python.

Processing is a programming language, development environment, and online community. Since 2001, Processing has promoted software literacy within the visual arts and visual literacy within technology. Today, there are tens of thousands of students, artists, designers, researchers, and hobbyists who use Processing for learning, prototyping, and production.

Processing was initially released with a Java-based syntax, and with a lexicon of graphical primitives that took inspiration from OpenGL, Postscript, Design by Numbers, and other sources. With the gradual addition of alternative programming interfaces — including JavaScript, Python, and Ruby — it has become increasingly clear that Processing is not a single language, but rather, an arts-oriented approach to learning, teaching, and making things with code.

We are thrilled to make available this public release of the Python Mode for Processing, and its associated documentation. More is on the way! If you’d like to help us improve the implementation of Python Mode and its documentation, please find us on Github!
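For a taste of what a sketch looks like in Python Mode, here is a minimal example. It runs inside the Processing IDE with Python Mode installed, not as a standalone Python script:

```python
# A minimal Python Mode sketch: draws a trail of translucent circles
# that follow the mouse. Runs inside the Processing IDE (Python Mode),
# where size(), fill(), ellipse(), mouseX, and mouseY are provided.

def setup():
    size(640, 360)       # window size in pixels
    background(0)        # start with a black canvas
    noStroke()

def draw():
    fill(255, 40)                     # translucent white
    ellipse(mouseX, mouseY, 24, 24)   # circle at the mouse position
```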

A screen shot of part of one image from Dextro.org will give you a glimpse of the power of Processing:

[Screenshot: Processing example from Dextro.org]

BTW, this screen shot pales in comparison to the original image.

Enough said?

Why Technology Hasn’t Delivered More Democracy…

Filed under: Government — Patrick Durusau @ 2:51 pm

Why Technology Hasn’t Delivered More Democracy – New technologies offer important tools for empowerment — yet democracy is stagnating. What’s up? by Thomas Carothers.

From the post:

The current moment confronts us with a paradox. The first fifteen years of this century have been a time of astonishing advances in communications and information technology, including digitalization, mass-accessible video platforms, smart phones, social media, billions of people gaining internet access, and much else. These revolutionary changes all imply a profound empowerment of individuals through exponentially greater access to information, tremendous ease of communication and data-sharing, and formidable tools for networking. Yet despite these changes, democracy — a political system based on the idea of the empowerment of individuals — has in these same years become stagnant in the world. The number of democracies today is basically no greater than it was at the start of the century. Many democracies, both long-established ones and newer ones, are experiencing serious institutional debilities and weak public confidence.

How can we reconcile these two contrasting global realities — the unprecedented advance of technologies that facilitate individual empowerment and the overall lack of advance of democracy worldwide? To help answer this question, I asked six experts on political change, all from very different professional and national perspectives. Here are their responses, followed by a few brief observations of my own.

Thomas gives this summary of the varying perspectives:

The contributors’ answers to the puzzle of why the advance of new communication technologies in the past fifteen years has not produced any overall advance of democracy in the world boil down to three different lines: First, it’s too soon to see the full effects. Second, the positive potential effects are being partially outweighed or limited by other factors, including some larger countervailing trends on the international political stage for democracy, the ability of authoritarian governments to use the same technologies for their own anti-democratic purposes, and the only partial reach of these technologies in many countries. And third, technology does not solve some basic challenges of democracy building, above all, stirring citizens to engage in collective action and the establishment of effective representative institutions.

Democracy has yet to arrive in the United States, although Leonard Cohen says it is coming in Democracy. Everything from IBM’s ill-gotten gains supplying Nazi Germany in WWII to Monsanto’s war on crop diversity is immune from the effects of “democracy” as it is practiced in the United States, which is to say not at all. Citizens can make choices within artificial boundaries that prevent any imposition on the ruling class. Some “democracy.”

As for the reach of technology in poorer countries, I was reminded of this image: goats in Somalia that bear the cell phone numbers of their owners.

[Image: goats in Somalia marked with their owners’ cell phone numbers]

The third concern, that technology doesn’t solve other problems, such as collective action and “effective representative institutions,” comes as close as any to how I would answer the question.

You need only review the history of the second half of the 20th century to realize that oppressed people differ from their oppressors in only one way: they lack someone to oppress. Despite centuries of discrimination and oppression against Jews, Israel has practiced discrimination and oppression against Palestinians since its inception. Whatever justification makes that work for you, the fact remains that Palestinians are being oppressed by a formerly oppressed group.

You could substitute any other two names of religious or national groups and get the same result, so I am not singling out Israel for criticism.

The reason technology fails to deliver more democracy, or even make it more likely, is that technology can’t change us. If anything, it exacerbates some of our worse tendencies. Cyberbullying, sexual harassment, exploitation of children, and worse things are routinely enabled by technology.

Rather than chasing a false god of democracy in technology, we should look inward with a view to changing ourselves, if we are so minded. Not technology or others.

A Question of Trust [June 2015, “Anderson Report”]

Filed under: Government,Security — Patrick Durusau @ 1:42 pm

A Question of Trust – Report of the Investigatory Powers Review by David Anderson.

If you want to be informed about the current state of anti-terrorism efforts in the UK, you owe it to yourself to read the “Anderson report,” as the media terms it. Unfortunately, other than quoting their favorite snippets, the media sources I reviewed did not link to the original report. My trust for the media is on par with my trust for governments, so I went looking for, and found, the link to the original report that appears above.

Always remember that news accounts which quote from, but don’t link to, government documents are doing so for a reason. I leave the reason for not linking as an exercise for the reader.

To entice you to read further, the following statements appear in the executive summary:

RIPA [Regulation of Investigatory Powers Act 2000], obscure since its inception, has been patched up so many times as to make it incomprehensible to all but a tiny band of initiates. A multitude of alternative powers, some of them without statutory safeguards, confuse the picture further. This state of affairs is undemocratic, unnecessary and – in the long run – intolerable.

At three hundred and seventy-three (373) pages, A Question of Trust isn’t an easy read, but it is well worth your time. It sets a high bar for the examination of programs justified by claims of alleged dangers.

June 12, 2015

We’ll tell suspects they’re on spy radar, says Twitter:… [Scare Mongering Across the Pond]

Filed under: Security — Patrick Durusau @ 8:13 pm

We’ll tell suspects they’re on spy radar, says Twitter: Arrogance of social media bosses who vow to sabotage fight against terrorism by James Slack.

From the post:

Twitter will ‘tip off’ terror suspects and criminals if the security services ask for information about them, a bombshell report reveals.

The social media giant will keep an investigation secret only if compelled to do so by a court, Britain’s terror watchdog found.

It is one of a string of US tech companies which have decided that – in the wake of the Edward Snowden leaks – customer ‘privacy’ and protecting their ‘brand’ takes priority.

Twitter said its policy was to ‘notify users of requests for their account information, which includes a copy of the request, prior to disclosure unless we are prohibited from doing so’.

The astonishing revelations come in a report by David Anderson QC, warning that MI5, MI6, GCHQ and the police are losing their ability to electronically track terrorists and criminals.

In other developments:

  • MI5 warned that Britain was facing an ‘unprecedented’ threat from Islamist fanatics trained overseas.
  • A political row broke out over whether judges or the Home Secretary should sign surveillance warrants. It emerged that Theresa May had personally granted approval for 2,345 intrusive spying missions last year.
  • US traitor Snowden was condemned for undermining British national security.

Well, it is the Daily Mail. Yes?

It wasn’t that many years ago that there were actual acts of terrorism in the UK, and while the government acted repressively then, it was nowhere near as extreme as it is today, nor as repressive as the Daily Mail would apparently like to see. Perhaps you aren’t old enough to remember the “Troubles” in Northern Ireland.

Slack must be writing in hopes of being hired by one of the Republican candidates for the US presidency. His work has the strident, disconnected-from-reality tone that Republicans like so much.

If that sounds harsh, consider that the danger from terrorists in the UK is roughly equivalent to the danger of being stung to death by bees. See THE TERRORISM ACTS IN 2011.

Terrorist fear mongering is just a way to drive demand for more government funding for security services. In other words, your wallet is in immediate danger from the “war” on terrorism. You personally? Not so much.

Classic Papers in Programming Languages and Logic

Filed under: Logic,Programming — Patrick Durusau @ 7:34 pm

Classic Papers in Programming Languages and Logic, a course reading list: a set of twenty-nine (29) papers for a class that started September 9th and ended December 2nd.

If you are looking for guided reading on programming languages and logic, you have come to the right place.

I first saw this in a tweet by Alan MacDougall, who comments:

Maybe I’ll use these principles to devise a new language!

You have been warned. 😉

Tallinn Manual on the International Law Applicable to Cyber Warfare

Filed under: Cybersecurity,Law — Patrick Durusau @ 7:26 pm

Tallinn Manual on the International Law Applicable to Cyber Warfare by Professor Michael N. Schmitt.

Description from Amazon:

The product of a three-year project by twenty renowned international law scholars and practitioners, the Tallinn Manual identifies the international law applicable to cyber warfare and sets out ninety-five ‘black-letter rules’ governing such conflicts. It addresses topics including sovereignty, State responsibility, the jus ad bellum, international humanitarian law, and the law of neutrality. An extensive commentary accompanies each rule, which sets forth the rule’s basis in treaty and customary law, explains how the group of experts interpreted applicable norms in the cyber context, and outlines any disagreements within the group as to each rule’s application.

A bit pricey at $129.99 (hardcover), $41.59 (Kindle), or $58.48 (paperback), but it is rather specialized.

The conventional “laws of war” are designed to favor current sovereign states and their style of warfare. I would not expect eventual “laws of war” for cyberwarfare to be any different.

Open Sourcing Pinot: Scaling the Wall of Real-Time Analytics

Filed under: Analytics,Pinot — Patrick Durusau @ 7:01 pm

Open Sourcing Pinot: Scaling the Wall of Real-Time Analytics by Kishore Gopalakrishna.

From the post:

Last fall we introduced Pinot, LinkedIn’s real-time analytics infrastructure, that we built to allow us to slice and dice across billions of rows in real-time across a wide variety of products. Today we are happy to announce that we have open sourced Pinot. We’ve had a lot of interest in Pinot and are excited to see how it is adopted by the open source community.

We’ve been using it at LinkedIn for more than two years, and in that time, it has established itself as the de facto online analytics platform to provide valuable insights to our members and customers. At LinkedIn, we have a large deployment of Pinot storing 100’s of billions of records and ingesting over a billion records every day. Pinot serves as the backend for more than 25 analytics products for our customers and members. This includes products such as Who Viewed My Profile, Who Viewed My Posts and the analytics we offer on job postings and ads to help our customers be as effective as possible and get a better return on their investment.

In addition, more than 30 internal products are powered by Pinot. This includes XLNT, our A/B testing platform, which is crucial to our business – we run more than 400 experiments in parallel daily on it.

I am intrigued by:

For ease of use we decided to provide a SQL like interface. We support most SQL features including a SQL-like query language and a rich feature set such as filtering, aggregation, group by, order by, distinct. Currently we do not support joins in order to ensure predictable latency.

“SQL-like” always seems a bit vague to me. I will be looking at the details of the query language.
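For what it’s worth, here is the kind of query the feature list suggests, wrapped in Python. The table and column names are invented and the exact Pinot dialect may well differ, so treat it as illustration rather than documentation:

```python
# Illustrative only: the sort of query the post's feature list implies
# (filtering, aggregation, group by, order by). Table and column names
# are invented, and the actual Pinot syntax may differ; see the docs below.
query = """
SELECT memberCountry, COUNT(*) AS views
FROM profileViewEvents
WHERE viewDate BETWEEN '2015-06-01' AND '2015-06-07'
GROUP BY memberCountry
ORDER BY views DESC
LIMIT 10
"""
print(query)
```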

Grab the code and/or see the documentation.

Pwning F-35 – Safety Alert

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:08 pm

When I wrote Have You Ever Pwned an F-35?, I wasn’t aware the damned thing might catch on fire as it tries to take off.

The F-35 Just Catches on Fire Sometimes – And the Pentagon knew that for years by Kevin Knodell and Joseph Trevithick.

From the post:

The F-35 Lightning II is supposed to be America’s primary warplane for the next several decades. But here’s one big problem. The F-35 can catch on fire … just while trying to take off.

That’s what happened on June 23, 2014, when a fire swept through an F-35 fighter jet taxiing on a runway at Eglin Air Force Base in Florida. A new report by the U.S. Air Force’s Accident Investigation Board shines new light on what exactly happened with America’s hottest new warplane.

The military classifies the fire as a “Class A Mishap,” meaning an accident that causes death, permanent injury or costs $2 million or more in damage. According to the report, this particular incident cost the Department of Defense “in excess of” $50 million in damage.

And it could happen again.

See Kevin and Joseph’s post for the full details.

Be forewarned: if any marketer of hacking services claims to have made an F-35 burst into flames on take-off, remember the conclusion by Kevin and Joseph:

…even when the plane is entrusted with experienced and capable personnel, the F-35— which is to eventually cost American taxpayers more than a trillion dollars — will still occasionally catch on fire all by itself.

There is no known method for distinguishing a hack attack on an F-35 from it bursting into flames on its own. The F-35 appears to have more than software security issues.

TeX Live 2015

Filed under: TeX/LaTeX — Patrick Durusau @ 2:53 pm

TeX Live 2015 availability

From the webpage:

TeX Live 2015 is available for download now. It is also available on DVD from TUG and other user groups.

You can acquire TeX Live in many ways. For typical use, we recommend the first two:

Enjoyable weekend update task!

Semiotics (discovered while pair-surfing)

Filed under: Semiotics — Patrick Durusau @ 2:36 pm

I was pair-surfing (long distance) with a friend when the conversation turned to Umberto Eco and his recent fiction. I haven’t kept up with Eco like I should, having been distracted by big data, graphs and such.

It took only a quick web search to find Eco’s homepage and a list of all of his fiction.

While there, I ran across Semiotics, a page that lists topics in semiotics, seminal authors and active writers. Highly recommended.

Unicode 8 – Coming Next Week!

Filed under: Graph Analytics,Unicode — Patrick Durusau @ 1:52 pm

Unicode 8 will be released next week. Rick McGowan has posted directions to code charts for final review:

For the complete archival charts, as a single-file 100MB file, or as individual block files, please see the charts directory here:

http://www.unicode.org/Public/8.0.0/charts/

For the set of “delta charts” only, with highlighting for changes, please see:

http://www.unicode.org/charts/PDF/Unicode-8.0/

(NOTE: There is a known problem viewing the charts using the PDF Viewer plugin for Firefox on the Mac platform.)

And the 8.0 beta UCD files are also available for cross-reference:

http://www.unicode.org/Public/8.0.0/ucd/

The draft version page is here:

http://www.unicode.org/versions/Unicode8.0.0/

From the draft version homepage:

Unicode 8.0 adds a total of 7,716 characters, encompassing six new scripts and many new symbols, as well as character additions to several existing scripts. Notable character additions include the following:

  • A set of lowercase Cherokee syllables, forming case pairs with the existing Cherokee characters
  • A large collection of CJK unified ideographs
  • Emoji symbols and symbol modifiers for implementing skin tone diversity; see Unicode Emoji.
  • Georgian lari currency symbol
  • Letters to support the Ik language in Uganda, Kulango in the Côte d’Ivoire, and other languages of Africa
  • The Ahom script for support of the Tai Ahom language in India
  • Arabic letters to support Arwi—the Tamil language written in the Arabic script

Other important updates in Unicode Version 8.0 include:

  • Change in encoding model of New Tai Lue to visual order

Synchronization

Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and include updates for the repertoire additions made in Version 8.0, as well as other modifications:

If you have the time this weekend, take a quick look.
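If you want to see what your own Python build knows about the additions, here is a quick check of two of them, the Georgian lari sign (U+20BE) and the emoji skin tone modifiers (U+1F3FB through U+1F3FF), with the codepoints taken from the beta charts. Builds with a pre-8.0 Unicode database will simply report them as unnamed:

```python
# Quick check of what your Python's bundled Unicode database knows about
# two of the Unicode 8.0 additions mentioned above: the Georgian lari
# sign (U+20BE) and the emoji skin tone modifiers (U+1F3FB..U+1F3FF).
# Older Python builds (pre-8.0 database) will report them as unnamed.
import unicodedata

print("Unicode database version:", unicodedata.unidata_version)

for codepoint in [0x20BE] + list(range(0x1F3FB, 0x1F400)):
    char = chr(codepoint)
    name = unicodedata.name(char, "<not in this database>")
    print(f"U+{codepoint:04X} {name}")
```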

June 11, 2015

CISOs Say Hackers Could Gain Upper Hand By 2020 [What?]

Filed under: Cybersecurity,Security — Patrick Durusau @ 7:20 pm

CISOs Say Hackers Could Gain Upper Hand By 2020 by Jeff Goldman.

From the post:

A recent RAND Corporation study based on interviews with 18 chief information security officers (CISOs) found that while companies are spending more and more on cyber security tools, they aren’t confident their data is secure, and many CISOs believe attackers are gaining on corporate defenses.

Even though worldwide spending on cyber security is approaching $70 billion a year and is continuing to grow at 10 to 15 percent annually, many CISOs told RAND they believe hackers may gain the upper hand between two and five years from now.

[What?]

Really? “…[H]ackers may gain the upper hand between two and five years from now.”

I’m not sure where RAND found the CISOs interviewed for this study but it makes me question those studies of nuclear war that RAND did back in the day.

Just a few of the more noteworthy hacks:

Banking hack heist yields up to $1 billion

Massive breach at health care company Anthem Inc.

Sony Data Breach Worse Than Expected

Home Depot Breach Hit 56 Million Credit And Debit Cards

Why the “biggest government hack ever” got past the feds

Perhaps these hackers didn’t get the RAND report. They aren’t supposed to have the upper hand for another two to five years.

It could just be me, but given the track record of both large and small data breaches, I would say that hackers have the upper hand right now.

How about you?

RSA Cybersecurity Poverty Index™ 2015

Filed under: Cybersecurity,Security — Patrick Durusau @ 6:54 pm

RSA Cybersecurity Poverty Index™ 2015

From the overview:

Welcome to RSA’s inaugural Cybersecurity Poverty Index™.

The Cybersecurity Poverty Index is the result of an annual maturity self-assessment completed by organizations of all sizes, industries, and geographies across the globe. The assessment was created using the NIST Cybersecurity Framework (CSF). The 2015 assessment was completed by more than 400 security professionals across 61 countries.

Our goal in creating and conducting this global research initiative is two-fold. First, we want to provide a measure of the risk management and information security capabilities of the global population. As an industry leader and authority, we are often asked “why do damaging security incidents continue to occur?” We believe that a fundamental gap in capability is a major contributor, and hope that this research can illuminate and quantify that gap. Second, we wish to give organizations a way to benchmark their capabilities against peers and provide a globally recognized practical standard, with an eye towards identifying areas for improvement.

You are unlikely to find anything you don’t already “know” or at least suspect in this report. Still, I think it is worth reading in order to understand the depth of the cybersecurity problem.

As far as the “why do damaging security incidents continue to occur?” question goes, the “fundamental gap in capability” is a symptom, not an answer.

You need only read between the lines of the reports on the recent, catastrophic hack of OPM to realize that off-the-shelf techniques were used to breach security that was known, and publicly reported, to be faulty. The gap in that case wasn’t “capability” but the lack of an organizational imperative to take cybersecurity seriously and to allocate resources accordingly.

As long as cybersecurity remains a non-priority, as evidenced by its resourcing in corporate and government budgets, hackers will remain ten or more years ahead of those trying to secure data.

Anewstip

Filed under: News,Tweets,Twitter — Patrick Durusau @ 4:23 pm

Anewstip

From the webpage:

Find journalists by what they tweet

Powered by all the tweets since 2006 from more than 1 million journalists & media outlets.

Search for relevant journalists

Search through 1 billion+ real-time and historical tweets (since 2006, when Twitter was born) from 1 million+ journalists and media outlets, to find out all the relevant media contacts that have talked about your product, your business, your competitors, or any other keywords in your industry.

Searches can be limited to tweets, journalists and outlets.

The advanced search interface looks useful:

[Screenshot: Anewstip advanced search interface]

If you are mining twitter for news sources, this could prove to be very useful.

One caveat: news sources tend to be highly repetitive. If the New York Times says the OPM hack originated in China, a large number of news lemmings will repeat it without a word of doubt or criticism. That still amounts to one unknown source cited by the New York Times, no matter how many times it is repeated.

Don’t Think Open Access Is Important?…

Filed under: Open Access,Open Data — Patrick Durusau @ 2:39 pm

Don’t Think Open Access Is Important? It Might Have Prevented Much Of The Ebola Outbreak by Mike Masnick.

From the post:

For years now, we’ve been talking up the importance of open access to scientific research. Big journals like Elsevier have generally fought against this at every point, arguing that its profits are more important than some hippy dippy idea around sharing knowledge. Except, as we’ve been trying to explain, it’s that sharing of knowledge that leads to innovation and big health breakthroughs. Unfortunately, it’s often pretty difficult to come up with a concrete example of what didn’t happen because of locked up knowledge. And yet, it appears we have one new example that’s rather stunning: it looks like the worst of the Ebola outbreak from the past few months might have been avoided if key research had been open access, rather than locked up.

That, at least, appears to be the main takeaway of a recent NY Times article by the team in charge of drafting Liberia’s Ebola recovery plan. What they found was that the original detection of Ebola in Liberia was held up by incorrect “conventional wisdom” that Ebola was not present in that part of Africa:

Mike goes on to point out that knowledge about Ebola in Liberia was published in pay-per-view medical journals, which would have been prohibitively expensive for Liberian doctors.

He has a valid point but how often do primary care physicians consult research literature? And would they have the search chops to find research from 1982?

I am very much in favor of open access, but open access on its own doesn’t bring about actual access, or meaningful use of information once accessed.

The OPM Hacking Scandal Just Got Worse [Quality of Reporting About the Same]

Filed under: Cybersecurity,Security — Patrick Durusau @ 10:18 am

The OPM Hacking Scandal Just Got Worse by John Schindler.

From the post:

The other day I explained in detail how the mega-hack of the Office of Personnel Management’s internal servers looks like a genuine disaster for the U.S. Government, a setback that will have long-lasting and painful counterintelligence consequences. In particular I explained what the four million Americans whose records have been purloined may be in for:

Whoever now holds OPM’s records possesses something like the Holy Grail from a CI perspective. They can target Americans in their database for recruitment or influence. After all, they know their vices, every last one — the gambling habit, the inability to pay bills on time, the spats with former spouses, the taste for something sexual on the side (perhaps with someone of a different gender than your normal partner) — since all that is recorded in security clearance paperwork (to get an idea of how detailed this gets, you can see the form, called an SF86, here).

Do you have friends in foreign countries, perhaps lovers past and present? They know all about them. That embarrassing dispute with your neighbor over hedges that nearly got you arrested? They know about that too. Your college drug habit? Yes, that too. Even what your friends and neighbors said about you to investigators, highly personal and revealing stuff, that’s in the other side’s possession now.

The bad news keeps piling up with this story, including reports that OPM records may have appeared, for sale, on the “darknet.” Even more disturbing, if predictable, is a new report in the New York Times that case “investigators believe that the Chinese hackers who attacked the databases of the Office of Personnel Management may have obtained the names of Chinese relatives, friends and frequent associates of American diplomats and other government officials, information that Beijing could use for blackmail or retaliation.” (emphasis in original)

The fallout from the OPM hack does seem to be worsening, but the quality of reporting on the hack remains fairly constant, as in poor.

The New York Times continues to parrot the unofficial government line that China was behind the OPM hack. That could be true, but why would the Chinese government want to sell OPM records on the “darknet”? That seems to contradict the state-enterprise line. Yes?

The media has failed to follow up on who at OPM was responsible for security and what steps have been taken to hold them accountable for this rather remarkable data breach.

Unless and until the holders of data have “skin in the game” for data breaches, data security will not improve.

NumPy / SciPy / Pandas Cheat Sheet

Filed under: Numpy,Python — Patrick Durusau @ 9:53 am

NumPy / SciPy / Pandas Cheat Sheet, from Quandl.

Useful, but also an illustration of the tension between a true cheatsheet (one page, tiny print) and a legible but multi-page booklet.

I suspect the greatest benefit of a “cheatsheet” accrues to its author: the chores of selecting, typing, and correcting are the repetition that leads to memorization of the material.

I first saw this in a tweet by Kirk Borne.

June 10, 2015

How Entity-Resolved Data Dramatically Improves Analytics

Filed under: Entity Resolution,Merging,Topic Maps — Patrick Durusau @ 8:08 pm

How Entity-Resolved Data Dramatically Improves Analytics by Marc Shichman.

From the post:

In my last two blog posts, I’ve written about how Novetta Entity Analytics resolves entity data from multiple sources and formats, and why its speed and scalability are so important when analyzing large volumes of data. Today I’m going to discuss how analysts can achieve much better results than ever before by utilizing entity-resolved data in analytics applications.

When data from all available sources is combined and entities are resolved, individual records about a real-world entity’s transactions, actions, behaviors, etc. are aggregated and assigned to that person, organization, location, automobile, ship or any other entity type. When an application performs analytics on this entity-resolved data, the results offer much greater context than analytics on the unlinked, unresolved data most applications use today.

Analytics that present a complete view of all actions of an individual entity are difficult to deliver today as they can require many time-consuming and expensive manual processes. With entity-resolved data, complete information about each entity’s specific actions and behaviors is automatically linked so applications can perform analytics quickly and easily. Below are some examples of how applications, such as enterprise search, data warehouse and link analysis visualization, can employ entity-resolved data from Novetta Entity Analytics to provide more powerful analytics.

Marc isn’t long on specifics of how Novetta Entity Analytics works in his prior posts, but I think we can all agree with his recitation of the benefits of entity resolution in this post.

Once we know the resolution of an entity, or its subject identity as we would say in topic maps, the payoffs are immediate and worthwhile. Search results are more relevant, aggregated (merged) data speeds up queries, and multiple links are simplified as they are merged.
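A toy sketch of that merging payoff, with invented records and a deliberately crude matching rule, shows how per-entity aggregation falls out once records are resolved:

```python
# Toy entity resolution: records from two sources refer to the same
# real-world person under slightly different names. Resolving them on a
# normalized key lets analytics aggregate per entity instead of per record.
# Names, the matching rule, and the data are invented for illustration.
import pandas as pd

records = pd.DataFrame({
    "source": ["crm", "billing", "crm", "billing"],
    "name": ["Jane A. Smith", "SMITH, JANE", "Bob Jones", "JONES, BOB"],
    "amount": [120.0, 80.0, 45.0, 30.0],
})

def resolve_key(name):
    """Crude resolution rule: last name + first initial, lowercased."""
    parts = name.replace(",", " ").replace(".", " ").split()
    if "," in name:                     # "SMITH, JANE" style
        last, first = parts[0], parts[1]
    else:                               # "Jane A. Smith" style
        first, last = parts[0], parts[-1]
    return f"{last.lower()}_{first[0].lower()}"

records["entity"] = records["name"].map(resolve_key)

# Aggregated (merged) view: one row per resolved entity.
print(records.groupby("entity")["amount"].sum())
```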

How we would get there varies but Marc does a good job of describing the benefits!

Open Source Intelligence Techniques:… (review)

Filed under: Intelligence,Open Source Intelligence — Patrick Durusau @ 7:59 pm

Open Source Intelligence Techniques: Resources for Searching and Analyzing Online Information by CyberWarrior.

From the post:

Author Michael Bazzell has been well known and respected in government circles for his ability to locate personal information about any target through Open Source Intelligence (OSINT). In this book, he shares his methods in great detail. Each step of his process is explained throughout sixteen chapters of specialized websites, application programming interfaces, and software solutions. Based on his live and online video training at IntelTechniques.com, over 250 resources are identified with narrative tutorials and screen captures.

This book will serve as a reference guide for anyone that is responsible for the collection of online content. It is written in a hands-on style that encourages the reader to execute the tutorials as they go. The search techniques offered will inspire analysts to “think outside the box” when scouring the internet for personal information.

On the flip side, Open Source Intelligence Techniques is must reading for anyone who is charged with avoiding disclosure of information that can be matched with other open source intelligence.

How many people has your agency outed today?

Apply Today To Become A 2016 Knight-Mozilla Fellow!

Filed under: Journalism,News,Reporting — Patrick Durusau @ 7:48 pm

Apply Today To Become A 2016 Knight-Mozilla Fellow! by Dan Sinker.

From the post:

Today we kick off our fifth search for Knight-Mozilla Fellows. It’s been a remarkably fast five years filled with some of the most remarkable, talented, and inspiring people we’ve had the pleasure of getting to know. And now you (yes you!) can join them by applying to become a 2016 Knight-Mozilla Fellow today.

Knight-Mozilla Fellows spend ten months building, experimenting, and collaborating at the intersection of journalism and the open web. Our fellows have worked on projects that liberate data from inside PDFs, stitch together satellite imagery to demonstrate the erosion of coastline in Louisiana, help journalists understand the security implications of their work, improve our understanding of site metrics, and much much more.

We want our fellows to spend time tackling real-world problems, and so for the last five years we have embedded our fellows in some of the best newsrooms in the world. For 2016 we have an amazing lineup of fellowship hosts once again: the Los Angeles Times Data Desk, NPR, Vox Media, Frontline, Correct!v, and The Coral Project (a collaboration between the New York Times, the Washington Post, and Mozilla).

Our fellows are given the freedom to try new things, follow their passions, and write open-source code that helps to drive journalism (and the web) forward. Over the course of the next two and a half months you’ll be hearing a lot more from us about the many exciting opportunities a Knight-Mozilla Fellowship brings, the impact journalism code has had on the web, and why you (yes you!) should apply to become a 2016 Knight-Mozilla Fellow. Or, you can just apply now.

A bit mainstream for my taste but there has to be a mainstream to give the rest of us something to rebel against. 😉

Can’t argue with the experience you can gain and the street cred that goes along with it. Not to mention having your name on cutting edge work.

Give it serious consideration and pass word of the opportunity along.

Announcing SparkR: R on Spark [Spark Summit next week – free live streaming]

Filed under: Conferences,R,Spark — Patrick Durusau @ 7:37 pm

Announcing SparkR: R on Spark by Shivaram Venkataraman.

From the post:

I am excited to announce that the upcoming Apache Spark 1.4 release will include SparkR, an R package that allows data scientists to analyze large datasets and interactively run jobs on them from the R shell.

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the runtime is single-threaded and can only process data sets that fit in a single machine’s memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark and using Spark’s distributed computation engine allows us to run large scale data analysis from the R shell.

That’s the short news; go to the Spark Summit for the full story. (Code Databricks20 gets a 20% discount.) That’s next week, June 15–17, in San Francisco, so you need to act quickly.

BTW, you can register for free live streaming!

Looking forward to this!

The Economics of Reproducibility in Preclinical Research

Filed under: Biomedical,Research Methods,Researchers — Patrick Durusau @ 4:15 pm

The Economics of Reproducibility in Preclinical Research by Leonard P. Freedman, Iain M. Cockburn, and Timothy S. Simcoe. PLOS Biology, published June 9, 2015. DOI: 10.1371/journal.pbio.1002165.

Abstract:

Low reproducibility rates within life science research undermine cumulative knowledge production and contribute to both delays and costs of therapeutic drug development. An analysis of past studies indicates that the cumulative (total) prevalence of irreproducible preclinical research exceeds 50%, resulting in approximately US$28,000,000,000 (US$28B)/year spent on preclinical research that is not reproducible—in the United States alone. We outline a framework for solutions and a plan for long-term improvements in reproducibility rates that will help to accelerate the discovery of life-saving therapies and cures.

The authors find four categories of irreproducibility:

(1) study design, (2) biological reagents and reference materials, (3) laboratory protocols, and (4) data analysis and reporting.

But they only address “(1) study design” and “(2) biological reagents and reference materials.”

Once again, documentation doesn’t make the cut. 🙁

I find that curious because, judging just from the flood of social media data, people in general spend a good part of every day capturing and transmitting information. Where is the pain point between that activity and formal documentation that makes the latter anathema?

Documentation, among other things, could lead to higher reproducibility rates for medical and other research areas, to say nothing of saving data scientists time puzzling out data and/or programmers debugging old code.

BASE – Bielefeld Academic Search Engine

Filed under: OAI,Search Engines — Patrick Durusau @ 1:48 pm

BASE – Bielefeld Academic Search Engine

From the post:

BASE is one of the world’s most voluminous search engines especially for academic open access web resources. BASE is operated by Bielefeld University Library.

As the open access movement grows and prospers, more and more repository servers come into being which use the “Open Archives Initiative Protocol for Metadata Harvesting” (OAI-PMH) for providing their contents. BASE collects, normalises, and indexes these data. BASE provides more than 70 million documents from more than 3,000 sources. You can access the full texts of about 70% of the indexed documents. The index is continuously enhanced by integrating further OAI sources as well as local sources. Our OAI-PMH Blog communicates information related to harvesting and aggregating activities performed for BASE.
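The harvesting side of OAI-PMH is just HTTP plus XML. A minimal sketch of a ListRecords request follows; the endpoint URL is a placeholder for a real repository’s OAI base URL, and resumption-token paging is omitted:

```python
# Minimal OAI-PMH harvesting sketch: fetch one page of Dublin Core records
# from a repository's OAI endpoint and print the titles. The base URL is a
# placeholder; substitute the OAI-PMH endpoint of a real repository.
# Resumption-token paging is left out to keep the sketch short.
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"   # placeholder endpoint
params = "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(BASE_URL + params) as response:
    tree = ET.parse(response)

ns = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for record in tree.iterfind(".//oai:record", ns):
    title = record.find(".//dc:title", ns)
    if title is not None:
        print(title.text)
```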

One particularly appealing aspect of the search interface is the ability to “boost” open access documents, request verbatim search, request additional word forms, and invoke multilingual synonyms (Eurovoc Thesaurus).

I first saw this in a tweet by Amanda French.

The Political One Percent of the One Percent:…

Filed under: Government,Government Data,Politics — Patrick Durusau @ 1:21 pm

The Political One Percent of the One Percent: Megadonors fuel rising cost of elections in 2014 by Peter Olsen-Phillips, Russ Choma, Sarah Bryner, and Doug Weber.

From the post:

In the 2014 elections, 31,976 donors — equal to roughly one percent of one percent of the total population of the United States — accounted for an astounding $1.18 billion in disclosed political contributions at the federal level. Those big givers — what we have termed the “Political One Percent of the One Percent” — have a massively outsized impact on federal campaigns.

They’re mostly male, tend to be city-dwellers and often work in finance. Slightly more of them skew Republican than Democratic. A small subset — barely five dozen — earned the (even more) rarefied distinction of giving more than $1 million each. And a minute cluster of three individuals contributed more than $10 million apiece.

The last election cycle set records as the most expensive midterms in U.S. history, and the country’s most prolific donors accounted for a larger portion of the total amount raised than in either of the past two elections.

The $1.18 billion they contributed represents 29 percent of all fundraising that political committees disclosed to the Federal Election Commission in 2014. That’s a greater share of the total than in 2012 (25 percent) or in 2010 (21 percent).

It’s just one of the main takeaways in the latest edition of the Political One Percent of the One Percent, a joint analysis of elite donors in America by the Center for Responsive Politics and the Sunlight Foundation.

BTW, although the report says conservatives “edged their liberal opponents,” the Republicans raised $553 million and the Democrats raised $505 million from donors on the one percent of the one percent list. The $48 million difference isn’t rounding-error sized, but once you break half a billion dollars, it doesn’t seem as large as it might otherwise.

As far as I can tell, the report does not reproduce the addresses of the one percent of the one percent donors. For that you need to use the advanced search option at the FEC, put 8810 (no dollar sign needed) in the first “amount range” box, set the date range to 2014 to 2015, and then search. It is quite a long list, so you may want to do it by state.

To get individual location information, follow the transaction number at the end of each record returned by your query, which returns a PDF page. Somewhere on that page will be the address information for the donor.

As far as campaign finance goes, the report indicates you need to find another way to influence the political process. Any donation much below the one percent of the one percent minimum, i.e., $8,810, isn’t going to buy you any influence. In fact, you are subsidizing the cost of a campaign that benefits the big donors the most. If big donors want to buy those campaigns, let them support the entire campaign.

In a sound bite: Don’t subsidize major political donors with small contributions.

Once you have identified the one percent of one percent donors, you can start to work out the other relationships between those donors and the levers of power.

The challenge of combining 176 x #otherpeoplesdata…

Filed under: Biodiversity,Biology,Github,Integration,Open Data — Patrick Durusau @ 10:39 am

The challenge of combining 176 x #otherpeoplesdata to create the Biomass And Allometry Database by Daniel Falster, Rich FitzJohn, Remko Duursma, and Diego Barneche.

From the post:

Despite the hype around "big data", a more immediate problem facing many scientific analyses is that large-scale databases must be assembled from a collection of small independent and heterogeneous fragments — the outputs of many and isolated scientific studies conducted around the globe.

Collecting and compiling these fragments is challenging at both political and technical levels. The political challenge is to manage the carrots and sticks needed to promote sharing of data within the scientific community. The politics of data sharing have been the primary focus for debate over the last 5 years, but now that many journals and funding agencies are requiring data to be archived at the time of publication, the availability of these data fragments is increasing. But little progress has been made on the technical challenge: how can you combine a collection of independent fragments, each with its own peculiarities, into a single quality database?

Together with 92 other co-authors, we recently published the Biomass And Allometry Database (BAAD) as a data paper in the journal Ecology, combining data from 176 different scientific studies into a single unified database. We built BAAD for several reasons: i) we needed it for our own work ii) we perceived a strong need within the vegetation modelling community for such a database and iii) because it allowed us to road-test some new methods for building and maintaining a database ^1.

Until now, every other data compilation we are aware of has been assembled in the dark. By this we mean, end-users are provided with a finished product, but remain unaware of the diverse modifications that have been made to components in assembling the unified database. Thus users have limited insight into the quality of methods used, nor are they able to build on the compilation themselves.

The approach we took with BAAD is quite different: our database is built from raw inputs using scripts; plus the entire work-flow and history of modifications is available for users to inspect, run themselves and ultimately build upon. We believe this is a better way for managing lots of #otherpeoplesdata and so below share some of the key insights from our experience.

The highlights of the project:

1. Script everything and rebuild from source

2. Establish a data-processing pipeline

  • Don’t modify raw data files
  • Encode meta-data as data, not as code
  • Establish a formal process for processing and reviewing each data set

3. Use version control (git) to track changes and code sharing website (github) for effective collaboration

4. Embrace Openness

5. A living database

There was no mention of reconciliation of nomenclature for species. I checked some of the individual reports, such as Report for study: Satoo1968, which does mention:

Other variables: M.I. Ishihara, H. Utsugi, H. Tanouchi, and T. Hiura conducted formal search of reference databases and digitized raw data from Satoo (1968). Based on this reference, meta data was also created by M.I. Ishihara. Species name and family names were converted by M.I. Ishihara according to the following references: Satake Y, Hara H (1989a) Wild flower of Japan Woody plants I (in Japanese). Heibonsha, Tokyo; Satake Y, Hara H (1989b) Wild flower of Japan Woody plants II (in Japanese). Heibonsha, Tokyo. (Emphasis in original)

I haven’t surveyed all the reports, but it appears that “conversion” of species and family names occurred prior to the data entering the pipeline.

Not an unreasonable choice, but it does mean that we cannot use the original names, as recorded, as search terms against the literature that existed at the time of the original observations.

Normalization of data often leads to loss of information. Not necessarily, but it often does.
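One way to avoid that loss is to carry the verbatim name through the pipeline alongside the normalized one. A sketch, with an invented synonym table and data:

```python
# Sketch of a pipeline step that normalizes species names without
# discarding the verbatim names as recorded in the source study.
# The synonym table and example names are invented for illustration.
import csv
import io

SYNONYMS = {
    "Abies mayriana": "Abies sachalinensis",   # hypothetical old -> accepted name
}

raw_csv = io.StringIO(
    "species,height_m\n"
    "Abies mayriana,12.3\n"
    "Abies sachalinensis,14.1\n"
)

rows = []
for row in csv.DictReader(raw_csv):
    rows.append({
        "species_original": row["species"],                       # keep as recorded
        "species_accepted": SYNONYMS.get(row["species"], row["species"]),
        "height_m": float(row["height_m"]),
    })

for row in rows:
    print(row)
```

Keeping both columns costs almost nothing and preserves the ability to search the older literature under the names the original observers actually used.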

I first saw this in a tweet by Dr. Mike Whitfield.

spaCy: Industrial-strength NLP

Filed under: Natural Language Processing — Patrick Durusau @ 9:55 am

spaCy: Industrial-strength NLP by Matthew Honnibal.

From the post:

spaCy is a new library for text processing in Python and Cython. I wrote it because I think small companies are terrible at natural language processing (NLP). Or rather: small companies are using terrible NLP technology.

To do great NLP, you have to know a little about linguistics, a lot about machine learning, and almost everything about the latest research. The people who fit this description seldom join small companies. Most are broke — they’ve just finished grad school. If they don’t want to stay in academia, they join Google, IBM, etc.

The net result is that outside of the tech giants, commercial NLP has changed little in the last ten years. In academia, it’s changed entirely. Amazing improvements in quality. Orders of magnitude faster. But the academic code is always GPL, undocumented, unuseable, or all three. You could implement the ideas yourself, but the papers are hard to read, and training data is exorbitantly expensive. So what are you left with? A common answer is NLTK, which was written primarily as an educational resource. Nothing past the tokenizer is suitable for production use.

I used to think that the NLP community just needed to do more to communicate its findings to software engineers. So I wrote two blog posts, explaining how to write a part-of-speech tagger and parser. Both were well received, and there’s been a bit of interest in my research software — even though it’s entirely undocumented, and mostly unuseable to anyone but me.

So six months ago I quit my post-doc, and I’ve been working day and night on spaCy since. I’m now pleased to announce an alpha release.

If you’re a small company doing NLP, I think spaCy will seem like a minor miracle. It’s by far the fastest NLP software ever released. The full processing pipeline completes in 7ms per document, including accurate tagging and parsing. All strings are mapped to integer IDs, tokens are linked to embedded word representations, and a range of useful features are pre-calculated and cached.

Matthew uses an example based on Stephen King’s admonition “the adverb is not your friend,“ which immediately brought to mind the utility of tagging all adverbs and adjectives in a standards draft and then generating comments that identify the parent <p> element and the offending phrase.
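A small sketch of that idea using the current spaCy API (the 2015 alpha API differed), assuming the en_core_web_sm model is installed:

```python
# Flag adverbs and adjectives in a draft paragraph, in the spirit of
# King's "the adverb is not your friend". Uses the current spaCy API and
# assumes the en_core_web_sm model is installed; the 2015 alpha differed.
import spacy

nlp = spacy.load("en_core_web_sm")

draft = ("Implementations really should very carefully validate the "
         "extremely long input before processing it.")

doc = nlp(draft)
for token in doc:
    if token.pos_ in ("ADV", "ADJ"):
        print(f"{token.text:12} {token.pos_:4} in: {token.sent.text}")
```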

I haven’t verified the performance comparisons, but as you know, the real question is how well spaCy works on your data, workflow, etc.

Thanks to Matthew for the reminder of On Writing: A Memoir of the Craft by Stephen King. Documentation will never be as gripping as a King novel, but it shouldn’t be painful to read.

I first saw this in a tweet by Jason Baldridge.

