Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 11, 2016

Tackling Zika

Filed under: Bioinformatics,Medical Informatics,Open Access,Open Data — Patrick Durusau @ 1:43 pm

F1000Research launches rapid, open, publishing channel to help scientists tackle Zika

From the post:

ZAO provides a platform for scientists and clinicians to publish their findings and source data on Zika and its mosquito vectors within days of submission, so that research, medical and government personnel can keep abreast of the rapidly evolving outbreak.

The channel provides diamond-access: it is free to access and articles are published free of charge. It also accepts articles on other arboviruses such as Dengue and Yellow Fever.

The need for the channel is clearly evidenced by a recent report on the global response to the Ebola virus by the Harvard-LSHTM (London School of Hygiene & Tropical Medicine) Independent Panel.

The report listed ‘Research: production and sharing of data, knowledge, and technology’ among its 10 recommendations, saying: “Rapid knowledge production and dissemination are essential for outbreak prevention and response, but reliable systems for sharing epidemiological, genomic, and clinical data were not established during the Ebola outbreak.”

Dr Megan Coffee, an infectious disease clinician at the International Rescue Committee in New York, said: “What’s published six months, or maybe a year or two later, won’t help you – or your patients – now. If you’re working on an outbreak, as a clinician, you want to know what you can know – now. It won’t be perfect, but working in an information void is even worse. So, having a way to get information and address new questions rapidly is key to responding to novel diseases.”

Dr. Coffee is also a co-author of an article published in the channel today, calling for rapid mobilisation and adoption of open practices in an important strand of the Zika response: drug discovery – http://f1000research.com/articles/5-150/v1.

Sean Ekins, of Collaborative Drug Discovery, and lead author of the article, which is titled ‘Open drug discovery for the Zika virus’, said: “We think that we would see rapid progress if there was some call for an open effort to develop drugs for Zika. This would motivate members of the scientific community to rally around, and centralise open resources and ideas.”

Another co-author of the article, Lucio Freitas-Junior of the Brazilian Biosciences National Laboratory, added: “It is important to have research groups working together and sharing data, so that scarce resources are not wasted in duplication. This should always be the case for neglected diseases research, and even more so in the case of Zika.”

Rebecca Lawrence, Managing Director, F1000, said: “One of the key conclusions of the recent Harvard-LSHTM report into the global response to Ebola was that rapid, open data sharing is essential in disease outbreaks of this kind and sadly it did not happen in the case of Ebola.

“As the world faces its next health crisis in the form of the Zika virus, F1000Research has acted swiftly to create a free, dedicated channel in which scientists from across the globe can share new research and clinical data, quickly and openly. We believe that it will play a valuable role in helping to tackle this health crisis.”

###

For more information:

Andrew Baud, Tala (on behalf of F1000), +44 (0) 20 3397 3383 or +44 (0) 7775 715775

Excellent news for researchers, but a direct link to the new channel would have been helpful as well: Zika & Arbovirus Outbreaks (ZAO).

See this post: The Zika & Arbovirus Outbreaks channel on F1000Research by Thomas Ingraham.

News organizations should note that as of today, 11 February 2016, ZAO offers 9 articles, 16 posters and 1 set of slides. Those numbers are likely to increase rapidly.

Oh, did I mention the ZAO channel is free?

Unlike at some journals, payment, prestige, and privilege are not prerequisites for publication.

Useful research on Zika & Arboviruses is the only requirement.

I know, sounds like a dangerous precedent but defeating a disease like Zika will require taking risks.

February 10, 2016

“Butts In Seats” Management At The FBI

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 5:48 pm

The FBI Wants $38 More Million to Buy Encryption-Breaking Technology by Lorenzo Franceschi-Bicchierai.

From the post:

For more than a year, FBI Director James Comey has been publicly complaining about how much of a hard time his agents, as well as local and state cops, are having when they encounter encryption during their investigations.

Now, the FBI is asking for more money to break encryption when needed.

In its budget request for next year, the FBI asked for $38.3 more million on top of the $31 million already requested last year to “develop and acquire” tools to get encrypted data, or to unmask internet users who hide behind a cloak of encryption. This money influx is designed to avoid “going dark,” an hypothetical future where the rise of encryption technologies make it impossible for cops and feds to track criminal suspects, or to access and intercept the information or data they need to solve crimes and investigations.

Great story, and the total requested by the FBI comes to $69.3 million.

From further in the post:


Julian Sanchez, one of the authors of a recent report on going dark, which concluded that technology is actually helping law enforcement rather than hindering it, is skeptical that the FBI even needs all this money.

“$38.3 million is a hefty chunk of change to dole out for a ‘problem’ the FBI has so steadfastly refused to publicly quantify in any meaningful way,” he told me. “First let’s see some hard numbers about how often encryption is a serious obstacle to investigations and what the alternatives are; then maybe we’ll be in a position to know how much it’s reasonable to spend addressing the issue.”

But to be fair to Director Comey, neither the FBI nor anyone else possesses a metric that would justify spending $1 on breaking encryption any more than $1 million or $1 billion.

Those numbers simply don’t exist. How do we know that?

I’m willing to concede that the publicists for the FBI are probably dishonest, but they’re not stupid.

If there were any evidence, even evidence that had to be perverted to support the case for encryption-breaking research, it would be on a flashing banner on the FBI website.

What you are seeing from Director Comey is a “butts in seats” management style.

The number of “butts” you can get into seats, whether your own or contractors’, increases the prestige of your department and the patronage you can dispense. You may think those are not related to the mission of the department.

You would be right but so what? What made you think that appropriations have any relationship to the mission of the department? The core mission of the department is to survive and increase its influence. Mission is something you put on flyers. Nothing more.

I don’t mean to denigrate the staff who keep their heads down and do their jobs as best they can despite interventions from their political masters, but they aren’t the ones who set policy or waste funds on “butts in seats” management plans.

Congress needs to empower inspectors general and the Government Accountability Office to vet agency budget proposals prior to submission. Unsubstantiated items should be deleted from those proposals and not restored by Congress during the budgetary process.

It’s called evidence-based management, for anyone unfamiliar with the practice.

Build your own neural network classifier in R

Filed under: Classifier,Neural Networks,R — Patrick Durusau @ 5:14 pm

Build your own neural network classifier in R by Jun Ma.

From the post:

Image classification is one important field in Computer Vision, not only because so many applications are associated with it, but also a lot of Computer Vision problems can be effectively reduced to image classification. The state of art tool in image classification is Convolutional Neural Network (CNN). In this article, I am going to write a simple Neural Network with 2 layers (fully connected). First, I will train it to classify a set of 4-class 2D data and visualize the decision boundary. Second, I am going to train my NN with the famous MNIST data (you can download it here: https://www.kaggle.com/c/digit-recognizer/download/train.csv) and see its performance. The first part is inspired by CS 231n course offered by Stanford: http://cs231n.github.io/, which is taught in Python.

One suggestion, based on some unrelated reading: don’t copy-n-paste the code.

Key in the code so you will get accustomed to your typical typing mistakes, which are no doubt different from mine!

Plus, you will develop muscle memory in your fingers and the code will either “look right” or not.
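To give you a flavor of what you will be keying in, here is a minimal sketch, mine rather than Jun’s code, of the same general structure in base R: a two-layer (one hidden layer) network with ReLU units and a softmax output, trained by plain gradient descent on a toy four-class 2D dataset.

```r
# Minimal sketch of a two-layer neural network classifier in R.
# Not Jun Ma's code, just the general structure his post walks through:
# forward pass, softmax loss, backpropagation, gradient update.

set.seed(42)

# Toy data: 4 classes of 2D points, 50 points per class, in Gaussian blobs.
K <- 4; n_per_class <- 50
centers <- matrix(c(2, 2, -2, 2, -2, -2, 2, -2), ncol = 2, byrow = TRUE)
X <- do.call(rbind, lapply(1:K, function(k)
  cbind(rnorm(n_per_class, centers[k, 1], 0.7),
        rnorm(n_per_class, centers[k, 2], 0.7))))
y <- rep(1:K, each = n_per_class)
N <- nrow(X)

# One-hot encode the labels.
Y <- matrix(0, N, K); Y[cbind(1:N, y)] <- 1

# Network dimensions and parameter initialization.
H <- 20                                   # hidden units
W1 <- matrix(rnorm(2 * H, sd = 0.1), 2, H); b1 <- rep(0, H)
W2 <- matrix(rnorm(H * K, sd = 0.1), H, K); b2 <- rep(0, K)

lr <- 0.5                                 # learning rate
for (i in 1:500) {
  # Forward pass: affine -> ReLU -> affine -> softmax.
  hidden <- pmax(X %*% W1 + matrix(b1, N, H, byrow = TRUE), 0)
  scores <- hidden %*% W2 + matrix(b2, N, K, byrow = TRUE)
  exp_s  <- exp(scores - apply(scores, 1, max))   # stabilized softmax
  probs  <- exp_s / rowSums(exp_s)

  # Backward pass (gradient of the cross-entropy loss).
  dscores <- (probs - Y) / N
  dW2 <- t(hidden) %*% dscores; db2 <- colSums(dscores)
  dhidden <- dscores %*% t(W2); dhidden[hidden <= 0] <- 0
  dW1 <- t(X) %*% dhidden;      db1 <- colSums(dhidden)

  # Gradient descent update.
  W2 <- W2 - lr * dW2; b2 <- b2 - lr * db2
  W1 <- W1 - lr * dW1; b1 <- b1 - lr * db1
}

# Training accuracy on the toy data.
pred <- max.col(probs)
cat("training accuracy:", mean(pred == y), "\n")
```

From there, Jun’s post scales the same idea up to the MNIST data.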

Enjoy!

PS: For R, Jun’s blog looks like one you need to start following!

First Pirate – Sci-Hub?

Filed under: Open Access,Open Science,Publishing — Patrick Durusau @ 4:23 pm

Sci-Hub romanticizes itself as:

Sci-Hub the first pirate website in the world to provide mass and public access to tens of millions of research papers. (from the about page)

I agree with:

…mass and public access to tens of millions of research papers

But Sci-Hub is hardly:

…the first pirate website in the world

I don’t remember the first gate-keeping publisher that went from stealing from the public in print to stealing from the public online.

With careful enough research I’m sure we could track that down but I’m not sure it matters at this point.

What we do know is that academic research is funded by the public, edited and reviewed by volunteers (to the extent it is reviewed at all), and then kept from the vast bulk of humanity for profit and status (gate-keeping).

It’s heady stuff to think of yourself as a bold and swashbuckling pirate, going to stick it “…to the man.”

However, gate-keeping publishers have developed stealing from the public to an art form. If you don’t believe me, take a brief look at the provisions in the Trans-Pacific Partnership that protect traditional publisher interests.

Recovering what has been stolen from the public isn’t theft at all, it’s restoration!

Use Sci-Hub, support Sci-Hub, spread the word about Sci-Hub.

Allow gate-keeping publishers to slowly, hopefully painfully, wither as opportunities for exploiting the public grow fewer and farther between.

PS: You need to read: Meet the Robin Hood of Science by Simon Oxenham to get the full background on Sci-Hub and an extraordinary person, Alexandra Elbakyan.

Bringing the CIA to Heel

Filed under: Government,Politics,Security — Patrick Durusau @ 3:19 pm

Cory Doctorow reports in CIA boss flips out when Ron Wyden reminds him that CIA spied on the Senate, that John Brennan, CIA Director, had a tantrum when asked about the CIA spying on the Senate Select Committee on Intelligence.

See Cory’s post for the details and an amusing video of the incident.

The easiest way to bring the CIA to heel is for the Senate to publicly release all classified documents that come into its possession and to de-criminalize leaks from any U.S. government agency.

Starting at least with the Pentagon Papers and probably before, every government leak has demonstrated the incompetence and fundamental dishonesty of members of the U.S. government.

We should stop even pretending to listen to the fanciful list of horrors that “will result” if classified information is leaked.

Some post-9/11 torturers might face retribution but since U.S. authorities won’t pursue the cases, is that a bad thing?

I am untroubled by the claims, “…but we did it for you/country/flag….” That is as self-serving as “…orders are orders….” And to my mind, just as reprehensible.

February 9, 2016

Barred From Home

Filed under: Journalism,News,Reporting — Patrick Durusau @ 9:15 pm

Barred From Home by Sarah Ryley and Barry Paddock of the New York Daily News and Christine Lee, special to ProPublica.

From the webpage:

To settle nuisance abatement actions with the New York Police Department, residents often must agree to strict provisions such as banning specific family members for life, warrantless searches, and automatically forfeiting their leases if accused of wrongdoing in the future. The News and ProPublica identified 297 people who were either barred from homes or gave up their tenancy to settle actions filed during 2013 and the first half of 2014. More than half were never convicted of a crime as a result of the underlying police investigation that triggered the case. Here are their stories. Read Story.

This story needs to be spread as widely as possible and its research and reporting techniques emulated just as widely.

After spending a good part of another lifetime listening to stories of physical abuse of women, children, prisoners and the mentally ill, and considering “unsound” solutions to their problems, I would have thought this story would not impact me so.

It is quite visceral and haunting. You will leave it wishing the “…take it from my cold dead hands…” types could “enjoy” this level of oppression by the government. Can’t help but wonder if their response would be as brave as their talk.

The fantasy of government oppression doesn’t hold a candle to the horrors described in this story.

BTW, since these are civil proceedings, guess what?

No right to an appointed attorney. No assistance at all.

Read the story and find out if your locality has similar ordinances. Another type of government abuse can be found in “protective” services cases.

Agile Data Science [Free Download]

Filed under: Data Science,Hadoop — Patrick Durusau @ 8:44 pm

Agile Data Science by Russell Jurney.

From the preface:

I wrote this book to get over a failed project and to ensure that others do not repeat my mistakes. In this book, I draw from and reflect upon my experience building analytics applications at two Hadoop shops.

Agile Data Science has three goals: to provide a how-to guide for building analytics applications with big data using Hadoop; to help teams collaborate on big data projects in an agile manner; and to give structure to the practice of applying Agile Big Data analytics in a way that advances the field.

It’s from 2013 and data science has moved quite a bit in the meantime, but the principles Russell illustrates remain sound and people do still use Hadoop.

Depending on what you gave up for Lent, you should have enough non-work time to work through Agile Data Science by the end of Lent.

Maybe this year you will have something to show for the forty days of Lent. 😉

Not-so-secret atomic tests:… […how an earlier era viewed citizens’ rights and safety.]

Filed under: Government,Politics,Security — Patrick Durusau @ 8:25 pm

Not-so-secret atomic tests: Why the photographic film industry knew what the American public didn’t by Tim Barribeau.

From the post:

It’s one of the dark marks of the U.S. Government in the 20th century — a complete willingness to expose unwitting citizens to dangerous substances in the name of scientific advancement. It happened with the Tuskegee syphilis experiment, with the MKUltra mind control project and with the atomic bomb testing of the 1940s and 50s. The Atomic Energy Commission (AEC) knew that dangerous levels of fallout were being pumped into the atmosphere, but didn’t bother to tell anyone. Well, anyone except the photographic film industry, that is.

Photographic film is particularly radiosensitive — that’s the reason why you see dosimeters made from the stuff, as they can be used to detect gamma, X-ray and beta particles. But in 1946, Kodak customers started complaining about film they had bought coming out fogged.


Kodak complained to the Atomic Energy Commission and that Government agency agreed to give Kodak advanced information on future tests, including ‘expected distribution of radioactive material in order to anticipate local contamination.’

In fact, the Government warned the entire photographic industry and provided maps and forecasts of potential contamination. Where, I ask, were the maps for dairy farmers? Where were the warnings to parents of children in these areas? So here we are, Mr. Chairman. The Government protected rolls of film, but not the lives of our kids. There is something wrong with this picture.

Senator Harkin’s remarks about dairy farms and children reveals the dark side of this story. It’s not enough that the AEC was knowingly releasing fallout into American skies, but that one of the side effects they were aware of was that it could enter the food supply, and potentially cause long term health problems. The I-131 would fall on the ground, be eaten by cattle through radioactive feed, and through their milk, be passed on to the public. Your thyroid needs iodine to function, so it builds up stores of iodine from the environment, and high concentrations of I-131 are directly linked to higher risks of radiogenic thyroid cancer — especially from exposure during childhood. And that’s exactly what happened to thousands of American children.

It turns out there’s a relatively easy way to prevent thyroid cancer after exposure to I-131 — standard iodine supplements will do. But if you’re unaware of the fallout, you wouldn’t know to take the countermeasure. The atmospheric tests have been linked to up to 75,000 cases of thyroid cancer in the U.S. alone. To this day, the National Cancer Institute runs a program to help people identify if they were exposed, and between 1951 and 1962, it was an awful lot of people.

[Image: radiation map]

If the story weren’t disturbing enough, consider the closing note from the editor:

[Ed. note: This piece ranges far from our normal digital photography fare, but we found it an interesting historical note on a moment in time when the photo industry, military development and public health all intersected, and on how an earlier era viewed citizens’ rights and safety.]

Really?

The atomic test piece was published in 2013.

In 2015/16, it was discovered that the entire city of Flint, Michigan, had been deliberately poisoned by its state government. New information is appearing on a daily basis as the crisis continues.

The present era has little concern for citizens, their rights and safety. If you don’t believe that, consider all the reports of bad water elsewhere that have begun to surface. Mark Ruffalo: We’re Heading Toward a National Water Crisis.

To demonstrate her lack of concern for the citizens of Flint, Hillary Clinton wants to incorporate them in the planning of the recovery process. To “empower” them.

Pure BS. Every citizen in Flint wants potable drinking water and safe water for their families to use for bathing, laundry, etc. Empowerment isn’t going to do any of those things.

Let’s stop harming people first and play the privilege/power shell game later, if we have to play it at all.

Perjurer’s Report: Worldwide Threat Assessment…

Filed under: Government,Security — Patrick Durusau @ 7:35 pm

Worldwide Threat Assessment of the US Intelligence Community by James R. Clapper, Director of National Intelligence.

From the introduction:

Chairman Burr, Vice Chairman Feinstein, Members of the Committee, thank you for the invitation to offer the United States Intelligence Community’s 2016 assessment of threats to US national security. My statement reflects the collective insights of the Intelligence Community’s extraordinary men and women, whom I am privileged and honored to lead. We in the Intelligence Community are committed every day to provide the nuanced, multidisciplinary intelligence that policymakers, warfighters, and domestic law enforcement personnel need to protect American lives and America’s interests anywhere in the world.

The order of the topics presented in this statement does not necessarily indicate the relative importance or magnitude of the threat in the view of the Intelligence Community.

Information available as of February 3, 2016 was used in the preparation of this assessment.

You may remember that in March of 2013, Director Clapper deliberately perjured himself before this self-same committee.

It’s entirely possible that some truths appear in the assessment Clapper presented, but those are either inadvertent or appear where a lie could not improve the story.

One of the difficulties of government agents lying when it suits their purposes is that other members of government and/or the public have no means to distinguish self-serving lies from an occasional truth.

If your interests are served by the threat assessment, make what use of it you will, being mindful that leaks may suddenly discredit both it and any proposal you advance based upon it.

Topic Maps: On the Cusp of Success (Curate in Place/Death of ETL?)

Filed under: Data Integration,Semantics,Topic Maps — Patrick Durusau @ 7:10 pm

The Bright Future of Semantic Graphs and Big Connected Data by Alex Woodie.

From the post:

Semantic graph technology is shaping up to play a key role in how organizations access the growing stores of public data. This is particularly true in the healthcare space, where organizations are beginning to store their data using so-called triple stores, often defined by the Resource Description Framework (RDF), which is a model for storing metadata created by the World Wide Web Consortium (W3C).

One person who’s bullish on the prospects for semantic data lakes is Shawn Dolley, Cloudera’s big data expert for the health and life sciences market. Dolley says semantic technology is on the cusp of breaking out and being heavily adopted, particularly among healthcare providers and pharmaceutical companies.

“I have yet to speak with a large pharmaceutical company where there’s not a small group of IT folks who are working on the open Web and are evaluating different technologies to do that,” Dolley says. “These are visionaries who are looking five years out, and saying we’re entering a world where the only way for us to scale….is to not store it internally. Even with Hadoop, the data sizes are going to be too massive, so we need to learn and think about how to federate queries.”

By storing healthcare and pharmaceutical data as semantic triples using graph databases such as Franz’s AllegroGraph, it can dramatically lower the hurdles to accessing huge stores of data stored externally. “Usually the primary use case that I see for AllegroGraph is creating a data fabric or a data ecosystem where they don’t have to pull the data internally,” Dolley tells Datanami. “They can do seamless queries out to data and curate it as it sits, and that’s quite appealing.

….

This is leading-edge stuff, and there are few mission-critical deployments of semantic graph technologies being used in the real world. However, there are a few of them, and the one that keeps popping up is the one at Montefiore Health System in New York City.

Montefiore is turning heads in the healthcare IT space because it was the first hospital to construct a “longitudinally integrated, semantically enriched” big data analytic infrastructure in support of “next-generation learning healthcare systems and precision medicine,” according to Franz, which supplied the graph database at the heart of the health data lake. Cloudera’s free version of Hadoop provided the distributed architecture for Montefiore’s semantic data lake (SDL), while other components and services were provided by tech big wigs Intel (NASDAQ: INTC) and Cisco Systems (NASDAQ: CSCO).

This approach to building an SDL will bring about big improvements in healthcare, says Dr. Parsa Mirhaji MD. PhD., the director of clinical research informatics at Einstein College of Medicine and Montefiore Health System.

“Our ability to conduct real-time analysis over new combinations of data, to compare results across multiple analyses, and to engage patients, practitioners and researchers as equal partners in big-data analytics and decision support will fuel discoveries, significantly improve efficiencies, personalize care, and ultimately save lives,” Dr. Mirhaji says in a press release. (emphasis added)

If I hadn’t known better, reading passages like:

the only way for us to scale….is to not store it internally

learn and think about how to federate queries

seamless queries out to data and curate it as it sits

I would have sworn I was reading a promotion piece for topic maps!

Of course, it doesn’t mention how to discover valuable data not written in your terminology, but you have to hold something back for the first presentation to the CIO.

The growth of data sets too large for ETL is icing on the cake for topic maps.

Why ETL when the data “appears” as I choose to view it? My topic map may be quite small, at least in relation to the data set proper.
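Here is a toy sketch, in R rather than RDF, of what “curate it as it sits” can look like when terminologies differ: each source keeps its own column names and vocabulary, and a small mapping table (playing the role of the topic map) reconciles them at query time instead of via ETL. All of the names and codes below are invented for illustration.

```r
# A toy sketch of "curate in place": two sources keep their own terminology and
# stay where they are; a small mapping table reconciles them at query time.

source_a <- data.frame(patient = c("p1", "p2"),
                       dx = c("MI", "CVA"),
                       stringsAsFactors = FALSE)
source_b <- data.frame(subject_id = c("p2", "p3"),
                       diagnosis = c("stroke", "heart attack"),
                       stringsAsFactors = FALSE)

# The mapping is tiny relative to the data it reconciles.
subject_map <- data.frame(
  local_term = c("MI", "heart attack", "CVA", "stroke"),
  subject    = c("myocardial_infarction", "myocardial_infarction",
                 "cerebrovascular_accident", "cerebrovascular_accident"),
  stringsAsFactors = FALSE)

normalize <- function(df, id_col, term_col) {
  merged <- merge(df, subject_map, by.x = term_col, by.y = "local_term")
  data.frame(id = merged[[id_col]], subject = merged$subject,
             stringsAsFactors = FALSE)
}

# A federated "query" is just the union of the normalized views; no copies kept.
rbind(normalize(source_a, "patient", "dx"),
      normalize(source_b, "subject_id", "diagnosis"))
```

The mapping table is the only thing you maintain; the sources never move.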

[Image: computer-money]

OK, truth-in-advertising moment, it won’t be quite that easy!

And I don’t take small bills. 😉 Diamonds, other valuable commodities, foreign deposit arrangements can be had.

People are starting to think in a “topic mappish” sort of way. Or at least a way where topic maps deliver what they are looking for.

That’s the key: What do they want?

Then use a topic map to deliver it.

$19 Billion in “Protection Money” and Not One Incentive For Secure Code

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 6:19 pm

Protecting U.S. Innovation From Cyberthreats by Barack Obama (current President of the United States).

From the statement:

More than any other nation, America is defined by the spirit of innovation, and our dominance in the digital world gives us a competitive advantage in the global economy. However, our advantage is threatened by foreign governments, criminals and lone actors who are targeting our computer networks, stealing trade secrets from American companies and violating the privacy of the American people.

Networks that control critical infrastructure, like power grids and financial systems, are being probed for vulnerabilities. The federal government has been repeatedly targeted by cyber criminals, including the intrusion last year into the Office of Personnel Management in which millions of federal employees’ personal information was stolen. Hackers in China and Russia are going after U.S. defense contractors. North Korea’s cyberattack on Sony in 2014 destroyed data and disabled thousands of computers. With more than 100 million Americans’ personal data compromised in recent years—including credit-card information and medical records—it isn’t surprising that nine out of 10 Americans say they feel like they’ve lost control of their personal information.

These cyberthreats are among the most urgent dangers to America’s economic and national security. That’s why, over the past seven years, we have boosted cybersecurity in government—including integrating and quickly sharing intelligence about cyberthreats—so we can act on threats even faster. We’re sharing more information to help companies defend themselves. We’ve worked to strengthen protections for consumers and students, guard the safety of children online, and uphold privacy and civil liberties. And thanks to bipartisan support in Congress, I signed landmark legislation in December that will help bolster cooperation between government and industry.

That’s why, today, I’m announcing our new Cybersecurity National Action Plan, backed by my proposal to increase federal cybersecurity funding by more than a third, to over $19 billion. This plan will address both short-term and long-term threats, with the goal of providing every American a basic level of online security.

First, I’m proposing a $3 billion fund to kick-start an overhaul of federal computer systems. It is no secret that too often government IT is like an Atari game in an Xbox world. The Social Security Administration uses systems and code from the 1960s. No successful business could operate this way. Going forward, we will require agencies to increase protections for their most valued information and make it easier for them to update their networks. And we’re creating a new federal position, Chief Information Security Officer—a position most major companies have already adopted—to drive these changes across government.

The Social Security Administration is no doubt running systems and code from the 1960s, which is no doubt why you so seldom hear its name in data breach stories.

Social Security Numbers, sure, those flooded from the Office of Personnel Management, but that wasn’t the fault of the Social Security Administration.

To be fair, the SSA has experienced data breaches, but self-inflicted ones like leaking information on 14,000 “live” people in a list of 90 million deceased Americans.

In case you are wondering, in round numbers that means SSA staff erred in about 0.016% of the records on that list.
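For anyone who wants to check the arithmetic (the 14,000 and 90 million figures are from the breach story above):

```r
# Error rate implied by the SSA death-list story: 14,000 living people
# listed among roughly 90 million deceased.
14000 / 90e6 * 100   # about 0.0156 percent
```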

I should be so careful! So should you! 😉

That’s a remarkably low error rate. Consider that a batter is “hot” if they hit more than 3 times out of 10.

Sorry, back to the main story.

President Obama’s “protection money” will delay the onset of incentives for producing secure code and systems.

Following the money, vendors/contractors will pursue strategies that layer more insecure code on top of already insecure code. After all, that’s what the President is paying for and that’s what he is going to get.

Pay close attention to any attempt to “upgrade” the information systems at the Social Security Administration. The net effect will be to bring the SSA to a modern level of insecurity.

The more code produced by the Cybersecurity National Action Plan, the more attack surfaces for hackers.

There is an upside to the President’s plan.

The surplus of hacking opportunities will doom some hackers to cycles of indecision and partial hacks. They will jump from one breach story to another.

How to calculate an ROI on surplus hacking opportunities isn’t clear. Suggestions?

Baby Blue’s Manual of Legal Citation [Public Review Ends 15 March 2016]

Filed under: Law,Legal Informatics — Patrick Durusau @ 5:42 pm

The Baby Blue’s Manual of Legal Citation is available for your review and comments:

The manuscript currently resides at https://law.resource.org/pub/us/code/blue/. The manuscript is created from an HTML source file. Transformations of this source file are available in PDF and Word formats. You may submit point edits by editing the html source (from which we will create a diff) or using Word with Baby Blue’s Manual of Legal Citation track changes enabled. You may also provide comments on the PDF or Word documents, or as free-form text. Comments may be submitted before March 15, 2016 to:

Carl Malamud
Public.Resource.Org, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472 USA
carl@media.org

Comment early and often!

More to follow.

February 8, 2016

Fast search of thousands of short-read sequencing experiments [NEW! Sequence Bloom Tree]

Filed under: Bioinformatics,Genomics — Patrick Durusau @ 8:02 pm

Fast search of thousands of short-read sequencing experiments by Brad Solomon & Carl Kingsford.

Abstract from the “official” version at Nature Biotechnology (2016):

The amount of sequence information in public repositories is growing at a rapid rate. Although these data are likely to contain clinically important information that has not yet been uncovered, our ability to effectively mine these repositories is limited. Here we introduce Sequence Bloom Trees (SBTs), a method for querying thousands of short-read sequencing experiments by sequence, 162 times faster than existing approaches. The approach searches large data archives for all experiments that involve a given sequence. We use SBTs to search 2,652 human blood, breast and brain RNA-seq experiments for all 214,293 known transcripts in under 4 days using less than 239 MB of RAM and a single CPU. Searching sequence archives at this scale and in this time frame is currently not possible using existing tools.

That will set you back $32 for the full text and PDF.

Or, you can try the unofficial version:

Abstract:

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases.

We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH for the breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. While SBT can handle any set of reads, we demonstrate the effectiveness of SBT by searching a large collection of blood, brain, and breast RNA-seq files for all 214,293 known human transcripts to identify tissue-specific transcripts.

The implementation used in the experiments below is in C++ and is available as open source at http://www.cs.cmu.edu/~ckingsf/software/bloomtree.
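If you want a feel for the building block before reading the paper, here is a toy Bloom filter over k-mers in R. It is only a sketch of the idea, not the authors’ C++ implementation: an SBT puts one such filter per experiment at the leaves of a binary tree, with each internal node holding the bitwise OR of its children, so a query can prune whole subtrees that cannot contain enough of the query’s k-mers.

```r
# A toy Bloom filter over k-mers, the building block behind Sequence Bloom Trees.

make_bloom <- function(m = 1e5, seeds = c(31, 131, 1313)) {
  list(bits = logical(m), m = m, seeds = seeds)
}

# Polynomial hash of a string for a given seed, mapped into 1..m.
bloom_hash <- function(s, seed, m) {
  h <- 0
  for (v in utf8ToInt(s)) h <- (h * seed + v) %% m
  h + 1
}

# All k-mers (substrings of length k) of a sequence.
kmers <- function(seq, k = 5) {
  substring(seq, 1:(nchar(seq) - k + 1), k:nchar(seq))
}

bloom_add <- function(bf, seq, k = 5) {
  for (km in kmers(seq, k))
    for (sd in bf$seeds) bf$bits[bloom_hash(km, sd, bf$m)] <- TRUE
  bf
}

# Fraction of the query's k-mers that *may* be present (false positives are
# possible, false negatives are not). An SBT thresholds this fraction to decide
# whether an experiment, or a whole subtree, could contain the query sequence.
bloom_query <- function(bf, query, k = 5) {
  hits <- sapply(kmers(query, k), function(km) {
    idx <- sapply(bf$seeds, function(sd) bloom_hash(km, sd, bf$m))
    all(bf$bits[idx])
  })
  mean(hits)
}

bf <- make_bloom()
bf <- bloom_add(bf, "ACGTACGTGGTTAACCGGTTACGT")
bloom_query(bf, "ACGTGGTTAACC")   # near 1: probably present
bloom_query(bf, "TTTTTTTTTTTT")   # near 0: definitely absent
```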

You will probably be interested in review comments by C. Titus Brown, Thoughts on Sequence Bloom Trees.

As of today, the exact string “Sequence Bloom Tree” gathers only 207 “hits” so the literature is still small enough to be read.

Don’t delay overlong pursuing this new search technique!

I first saw this in a tweet by Stephen Turner.

The 2016 cyber security roadmap – [Progress on Security B/C of Ransomware?]

Filed under: Cybersecurity,Security — Patrick Durusau @ 5:25 pm

The 2016 cyber security roadmap by Chloe Green.

From the post:

2014 was heralded as the ‘year of the data breach’ – but we’d seen nothing yet. From unprecedented data theft to crippling hacktivism attacks and highly targeted state-sponsored hacks, 2015 has been the bleakest year yet for the cyber security of businesses and organisations.

High profile breaches at Ashley Madison, TalkTalk and JD Wetherspoons have brought the protection of personal and enterprise data into the public consciousness.

In the war against cybercrime, companies are facing off against ever more sophisticated and crafty approaches, while the customer data they hold grows in value, and those that fail to protect it find themselves increasingly in the media and legislative spotlight with nowhere to hide.

We asked a panel of leading industry experts to highlight the major themes for enterprise cyber security in 2016 and beyond.

There isn’t a lot of comfort coming from industry experts these days. Some advice on mitigating strategies and a warning that ransomware is about to come into its own in 2016. I believe the phrase was “…corporate and not consumer rates…” for ransoms.

A surge in ransomware may be a good thing for the software industry. It would attach a cost to insecure software and practices.

When ransomware extracts commercially unacceptable costs from users of software, users will demand better software from developers.

Financial incentives all the way around. Incentives for hackers to widely deploy ransomware, incentives for software users to watch their bottom line and, last but not least, incentives for developers to implement more robust testing and development processes.

Ransomware may do what reams of turgid prose in journals, conference presentations, books and classrooms have failed to do. Ransomware can create financial incentives for software users to demand better software engineering and testing. Not to mention liability for defects in software.

Faced with financial demands, the software industry will be forced to adopt better software development processes. Those unable to produce sufficiently secure (no software being perfect) software will collapse under the weight of falling sales or liability litigation.

Hackers will be forced to respond to improvements in software quality, for their own financial gain, creating a virtuous circle of improving software security.

A Gentle Introduction to Category Theory (Feb 2016 version)

Filed under: Category Theory,Mathematics — Patrick Durusau @ 4:19 pm

A Gentle Introduction to Category Theory (Feb 2016 version) by Peter Smith.

From the preface:

This Gentle Introduction is work in progress, developing my earlier ‘Notes on Basic Category Theory’ (2014–15).

The gadgets of basic category theory fit together rather beautifully in multiple ways. Their intricate interconnections mean, however, that there isn’t a single best route into the theory. Different lecture courses, different books, can quite appropriately take topics in very different orders, all illuminating in their different ways. In the earlier Notes, I roughly followed the order of somewhat over half of the Cambridge Part III course in category theory, as given in 2014 by Rory Lucyshyn-Wright (broadly following a pattern set by Peter Johnstone; see also Julia Goedecke’s notes from 2013). We now proceed rather differently. The Cambridge ordering certainly has its rationale; but the alternative ordering I now follow has in some respects a greater logical appeal. Which is one reason for the rewrite.

Our topics, again in different arrangements, are also covered in (for example) Awodey’s good but uneven Category Theory and in Tom Leinster’s terrific – and appropriately titled – Basic Category Theory. But then, if there are some rightly admired texts out there, not to mention various sets of notes on category theory available online (see here), why produce another introduction to category theory?

I didn’t intend to! My goal all along has been to get to understand what light category theory throws on logic, set theory, and the foundations of mathematics. But I realized that I needed to get a lot more securely on top of basic category theory if I was eventually to pursue these more philosophical issues. So my earlier Notes began life as detailed jottings for myself, to help really fix ideas: and then – as can happen – the writing has simply taken on its own momentum. I am still concentrating mostly on getting the technicalities right and presenting them in a pleasing order: I hope later versions will contain more motivational/conceptual material.

What remains distinctive about this Gentle Introduction, for good or ill, is that it is written by someone who doesn’t pretend to be an expert who usually operates at the very frontiers of research in category theory. I do hope, however, that this makes me rather more attuned to the likely needs of (at least some) beginners. I go rather slowly over ideas that once gave me pause, spend more time than is always usual in motivating key ideas and constructions, and I have generally aimed to be as clear as possible (also, I assume rather less background mathematics than Leinster or even Awodey). We don’t get terribly far: however, I hope that what is here may prove useful to others starting to get to grips with category theory. My own experience certainly suggests that initially taking things at a rather gentle pace as you work into a familiarity with categorial ways of thinking makes later adventures exploring beyond the basics so very much more manageable.

Check the Category Theory – Reading List, also by Peter Smith, to make sure you have the latest version of this work.

Be an active reader!

If you spot issues with the text:

Corrections, please, to ps218 at cam dot ac dot uk.

At the category theory reading page Peter mentions having retired after forty years in academia.

Writing an introduction to category theory! What a great way to spend retirement!

(Well, different people have different tastes.)

International Conference on Learning Representations – Accepted Papers

Filed under: Data Structures,Learning,Machine Learning — Patrick Durusau @ 3:41 pm

International Conference on Learning Representations – Accepted Papers

From the conference overview:

It is well understood that the performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied. The rapidly developing field of representation learning is concerned with questions surrounding how we can best learn meaningful and useful representations of data. We take a broad view of the field, and include in it topics such as deep learning and feature learning, metric learning, kernel learning, compositional models, non-linear structured prediction, and issues regarding non-convex optimization.

Despite the importance of representation learning to machine learning and to application areas such as vision, speech, audio and NLP, there was no venue for researchers who share a common interest in this topic. The goal of ICLR has been to help fill this void.

That should give you an idea of the range of data representations/features that you will encounter in the eighty (80) papers accepted for the conference.

ICLR 2016 will be held May 2-4, 2016 in the Caribe Hilton, San Juan, Puerto Rico.

Time to review How To Read A Paper!

Enjoy!

I first saw this in a tweet by Hugo Larochelle.

Governments Race To Bottom On Privacy Rights

Filed under: Government,Privacy,Security — Patrick Durusau @ 2:30 pm

British spies want to be able to suck data out of US Internet giants by Cory Doctorow.

Cory points out that a recent US/UK agreement subjects U.S. citizens to surveillance under British laws that no one understands and that don’t require even a fig leaf of judicial approval.

The people of the United States fought one war to free themselves of arbitrary and capricious British rule. Declaration of Independence.

Is the stage being set for a war to enforce the constitution that resulted from the last war the United States waged against the UK?

Data from the World Health Organization API

Filed under: Medical Informatics,R,Visualization — Patrick Durusau @ 11:28 am

Data from the World Health Organization API by Peter’s stats stuff – R.

From the post:

Eric Persson released yesterday a new WHO R package which allows easy access to the World Health Organization’s data API. He’s also done a nice vignette introducing its use.

I had a play and found it was easy access to some interesting data. Some time down the track I might do a comparison of this with other sources, the most obvious being the World Bank’s World Development Indicators, to identify relative advantages – there’s a lot of duplication of course. It’s a nice problem to have, too much data that’s too easy to get hold of. I wish we’d had that problem when I studied aid and development last century – I vividly remember re-keying numbers from almanac-like hard copy publications, and pleased we were to have them too!

Here’s a plot showing country-level relationships between the latest data of three indicators – access to contraception, adolescent fertility, and infant mortality – that help track the Millennium Development Goals.

With visualizations and R code!
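If you want to poke at the package before reading Peter’s post, a first session might look like the sketch below. If memory serves, the package exposes get_codes() and get_data(); the indicator code here is only an example, so check it against the catalogue before trusting it.

```r
# A minimal first session with the WHO package (install.packages("WHO") from CRAN).
# The indicator code below is illustrative; browse the output of get_codes() for
# the codes you actually want, and check str() for the column layout returned.
library(WHO)

codes <- get_codes()          # catalogue of available indicators
head(codes)

# Fetch one indicator by its code. "MDG_0000000001" is used here as an example
# (infant mortality, in the catalogues I have seen); verify it via get_codes().
imr <- get_data("MDG_0000000001")
str(imr)

# From here the data frame can be joined, filtered and plotted like any other,
# e.g. with ggplot2 as in Peter's post.
```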

A nice way to start off your data mining week!

Enjoy!

I first saw this in a tweet by Christophe Lalanne.

Does HonestSociety.com Not Advertise With Google? (Rigging Search Results)

Filed under: Advertising,Search Engines,Searching — Patrick Durusau @ 11:27 am

I ask about Honestsociety.com because when I search on Google with the string:

honest society member

I get 82,100,000 “hits” and the first page is entirely honor society stuff.

No, “did you mean,” or “displaying results for…”, etc.

Not a one.

The top of the second page of results did have a webpage that mentions honestsociety.com, but not their home site.

I can’t recall seeing an Honestsociety ad with Google and thought perhaps one of you might.

Lacking such ads, my seat-of-the-pants explanation for “honest society member” returning the non-responsive “honor society” listings isn’t very generous.

What anomalies have you observed in Google (or other) search results?

What searches would you use to test ranking in search results by advertiser with Google versus non-advertiser with Google?

Rigging Searches

For my part, it isn’t a question of whether search results are rigged or not, but rather whether they are rigged the way I or my client prefers.

Or to say it in a positive way: All searches are rigged. If you think otherwise, you haven’t thought very deeply about the problem.

Take library searches for example. Do you think they are “fair” in some sense of the word?

Hmmm, would you agree that the collection practices of a library will give a user an impression of the literature on a subject?

So the search itself isn’t “rigged,” but the data underlying the results certainly influences the outcome.

If you let me pick the data, I can guarantee whatever search result you want to present. Ditto for the search algorithms.
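If that sounds abstract, here is a toy demonstration: the same query and the same scoring function, run over two hand-picked collections, produce different top results. The collections and the scoring are invented for illustration.

```r
# Same query, same (naive) scoring, two different corpora.
score <- function(doc, query) {
  q <- strsplit(tolower(query), "\\s+")[[1]]
  d <- strsplit(tolower(doc), "\\s+")[[1]]
  sum(q %in% d)                 # count query terms present in the document
}

query <- "honest society member"

collection_a <- c("honor society member directory",
                  "national honor society chapters",
                  "join the honor society today")
collection_b <- c("honest society member benefits",
                  "honestsociety.com member reviews",
                  "honor society member directory")

sort(sapply(collection_a, score, query = query), decreasing = TRUE)
sort(sapply(collection_b, score, query = query), decreasing = TRUE)
```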

The best we can do is make our choices with regard to the data and algorithms explicit, so that others accept our “rigged” data or choose to “rig” it differently.

The Danger of Ad Hoc Data Silos – Discrediting Government Experts

Filed under: Data Science,Data Silos,Government — Patrick Durusau @ 8:48 am

This Canadian Lab Spent 20 Years Ruining Lives by Tess Owen.

From the post:

Four years ago, Yvonne Marchand lost custody of her daughter.

Even though child services found no proof that she was a negligent parent, that didn’t count for much against the overwhelmingly positive results from a hair test. The lab results said she was abusing alcohol on a regular basis and in enormous quantities.

The test results had all the trappings of credible forensic science, and were presented by a technician from the Motherisk Drug Testing Laboratory at Toronto’s Sick Kids Hospital, Canada’s foremost children’s hospital.

“I told them they were wrong, but they didn’t believe me. Nobody would listen,” Marchand recalls.

Motherisk hair test results indicated that Marchand had been downing 48 drinks a day, for 90 days. “If you do the math, I would have died drinking that much” Marchand says. “There’s no way I could function.”

The court disagreed, and determined Marchand was unfit to have custody of her daughter.

Some parents, like Marchand, pursued additional hair tests from independent labs in a bid to fight their cases. Marchand’s second test showed up as negative. But, because the lab technician couldn’t testify as an expert witness, the second test was thrown out by the court.

Marchand says the entire process was very frustrating. She says someone should have noticed a pattern when parents repeatedly presented hair test results from independent labs which completely contradicted Motherisk results. Alarm bells should have gone off sooner.

Tess’ post and a 366-page report make it clear that Motherisk has impaired the fairness of a large number of child-protection service cases.

Child services, the courts and state representatives, the only ones who would have been aware of contradictions in Motherisk results across multiple cases, had no interest in “connecting the dots.”

Each case, with each attorney, was an ad hoc data silo that could not present the pattern necessary to challenge the systematic poor science from Motherisk.

The point is that not all data silos are in big data or nation-state sized intelligence services. Data silos can and do regularly have tragic impact upon ordinary citizens.

Privacy would be an issue, but mechanisms need to be developed by which lawyers and other advocates can share notice of contradictions in state agency evidence, so that patterns such as Motherisk’s can be discovered, documented and, hopefully, ended sooner rather than later.

BTW, there is an obvious explanation for why:

“No forensic toxicology laboratory in the world uses ELISA testing the way [Motherisk] did.”

Child services did not send hair samples to Motherisk to decide whether or not to bring proceedings.

Child services had already decided to remove children and sent hair samples to Motherisk to bolster their case.

How bright did Motherisk need to be to realize that positive results were the expected outcome?

Does your local defense bar collect data on police/state forensic experts and their results?

Looking for suggestions?

February 7, 2016

Interpretation Under Ambiguity [First Cut Search Results]

Filed under: Ambiguity,Semantic Diversity,Semantic Inconsistency,Semantics — Patrick Durusau @ 5:20 pm

Interpretation Under Ambiguity by Peter Norvig.

From the paper:

Introduction

This paper is concerned with the problem of semantic and pragmatic interpretation of sentences. We start with a standard strategy for interpretation, and show how problems relating to ambiguity can confound this strategy, leading us to a more complex strategy. We start with the simplest of strategies:

Strategy 1: Apply syntactic rules to the sentence to derive a parse tree, then apply semantic rules to get a translation into some logical form, and finally do a pragmatic interpretation to arrive at the final meaning.

Although this strategy completely ignores ambiguity, and is intended as a sort of strawman, it is in fact a commonly held approach. For example, it is approximately the strategy assumed by Montague grammar, where `pragmatic interpretation’ is replaced by `model theoretic interpretation.’ The problem with this strategy is that ambiguity can strike at the lexical, syntactic, semantic, or pragmatic level, introducing multiple interpretations. The obvious way to counter this problem is as follows:

Strategy 2: Apply syntactic rules to the sentence to derive a set of parse trees, then apply semantic rules to get a set of translations in some logical form, discarding any inconsistent formulae. Finally compute pragmatic interpretation scores for each possibility, to arrive at the `best’ interpretation (i.e. `most consistent’ or `most likely’ in the given context).

In this framework, the lexicon, grammar, and semantic and pragmatic interpretation rules determine a mapping between sentences and meanings. A string with exactly one interpretation is unambiguous, one with no interpretation is anomalous, and one with multiple interpretations is ambiguous. To enumerate the possible parses and logical forms of a sentence is the proper job of a linguist; to then choose from the possibilities the one “correct” or “intended” meaning of an utterance is an exercise in pragmatics or Artificial Intelligence.

One major problem with Strategy 2 is that it ignores the difference between sentences that seem truly ambiguous to the listener, and those that are only found to be ambiguous after careful analysis by the linguist. For example, each of (1-3) is technically ambiguous (with could signal the instrument or accompanier case, and port could be a harbor or the left side of a ship), but only (3) would be seen as ambiguous in a neutral context.

(1) I saw the woman with long blond hair.
(2) I drank a glass of port.
(3) I saw her duck.

Lotfi Zadeh (personal communication) has suggested that ambiguity is a matter of degree. He assumes each interpretation has a likelihood score attached to it. A sentence with a large gap between the highest and second ranked interpretation has low ambiguity; one with nearly-equal ranked interpretations has high ambiguity; and in general the degree of ambiguity is inversely proportional to the sharpness of the drop-off in ranking. So, in (1) and (2) above, the degree of ambiguity is below some threshold, and thus is not noticed. In (3), on the other hand, there are two similarly ranked interpretations, and the ambiguity is perceived as such. Many researchers, from Hockett (1954) to Jackendoff (1987), have suggested that the interpretation of sentences like (3) is similar to the perception of visual illusions such as the Necker cube or the vase/faces or duck/rabbit illusion. In other words, it is possible to shift back and forth between alternate interpretations, but it is not possible to perceive both at once. This leads us to Strategy 3:

Strategy 3: Do syntactic, semantic, and pragmatic interpretation as in Strategy 2. Discard the low-ranking interpretations, according to some threshold function. If there is more than one interpretation remaining, alternate between them.

Strategy 3 treats ambiguity seriously, but it leaves at least four problems untreated. One problem is the practicality of enumerating all possible parses and interpretations. A second is how syntactic and lexical preferences can lead the reader to an unlikely interpretation. Third, we can change our mind about the meaning of a sentence-“at first I thought it meant this, but now I see it means that.” Finally, our affectual reaction to ambiguity is variable. Ambiguity can go unnoticed, or be humorous, confusing, or perfectly harmonious. By `harmonious,’ I mean that several interpretations can be accepted simultaneously, as opposed to the case where one interpretation is selected. These problems will be addressed in the following sections.

Apologies for the long introductory quote, but I want to entice you to read Norvig’s essay in full and, if you have the time, the references that he cites.

It’s the literature you will have to master to use search engines and develop indexing strategies.

At least for one approach to search and indexing.

That within a language there is enough commonality for automated indexing or searching to be useful has been proven over and over again by Internet search engines.

But at the same time, the first twenty or so results typically leave you wondering what interpretation the search engine put on your words.

As I said, Peter’s approach is useful, at least for a first cut at search results.

The problem is that the first cut has become the norm for “success” of search results.

That works if I want to pay lawyers, doctors, teachers and others to find the same results as others have found before (past tense).

That cost doesn’t appear as a line item in any budget but repetitive “finding” of the same information over and over again is certainly a cost to any enterprise.

First cut on semantic interpretation, follow Norvig.

Saving re-finding costs and the cost of not-finding requires something more robust than one model to find words and in the search darkness bind them to particular meanings.

PS: See Peter@norvig.com for an extensive set of resources, papers, presentations, etc.

I first saw this in a tweet by James Fuller.

‘Avengers’ Comic Book Covers [ + MAD, National Lampoon]

Filed under: Art,Graphics,Visualization — Patrick Durusau @ 3:31 pm

50 Years of ‘Avengers’ Comic Book Covers Through Color by Jon Keegan.

From the post:

When Marvel’s “Avengers: Age of Ultron” opens in theaters next month, a familiar set of iconic colors will be splashed across movie screens world-wide: The gamma ray-induced green of the Hulk, Iron Man’s red and gold armor, and Captain America’s red, white and blue uniform.

How the Avengers look today differs significantly from their appearance in classic comic-book versions, thanks to advancements in technology and a shift to a more cinematic aesthetic. As Marvel’s characters started to appear in big-budget superhero films such as “X-Men” in 2000, the darker, muted colors of the movies began to creep into the look of the comics. Explore this shift in color palettes and browse more than 50 years of “Avengers” cover artwork below. Read more about this shift in color.

The fifty years of palettes are a real treat and should be used alongside your collection of the Avengers comics for the same time period. 😉

From what I could find quickly, you will have to purchase the forty year collection separately from more recent issues.

Of course, if you really want insight into American culture, you would order Absolutely MAD Magazine – 50+ Years.

MAD issues from 1952 to 2005 (17,500 pages in full color). Annotating those issues to include social context would be a massive but highly amusing project. And you would have to find a source for the issues that followed.

A more accessible collection that is easily as amusing as MAD would be the National Lampoon collection. Unfortunately, only 1970 – 1975 are online. 🙁

One of my personal favorites:

justice-lampoon

Visualization of covers is a “different” way to view all of these collections and, with no promises, could yield interesting comparisons to the contemporary events of the years in which they were published.
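If you want to experiment with a similar color view of MAD or National Lampoon covers, here is a minimal sketch of the extraction step, assuming you have cover scans as local image files and the Pillow library installed (both are my assumptions, nothing from the article):

    from PIL import Image

    def average_color(path):
        """Average RGB of one cover scan: a crude, single-swatch palette."""
        img = Image.open(path).convert("RGB").resize((64, 64))  # shrink for speed
        pixels = list(img.getdata())
        n = len(pixels)
        return tuple(sum(channel) // n for channel in zip(*pixels))

    # Example: one swatch per cover, ready to line up as a timeline.
    # swatches = [average_color(f"covers/{name}.jpg") for name in cover_files]

One averaged swatch per cover, ordered by publication date, is enough to see a palette shift over time.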

Mapping the commentary you will find in MAD and National Lampoon to the current events of the day, say to articles in the New York Times historical archive, would be a great history project for students and an education in social satire as well.

If anyone objects to the lack of “seriousness” in such a project, be sure to remind them that reading the leading political science journal of the 1960s, the American Political Science Review, would have left the casual reader with few clues that the United States was engaged in a war that would destroy the lives of millions in Vietnam.

In my experience, “serious” usually equates with “supports the current system of privilege and prejudice.”

You can be “serious” or you can choose to shape a new system of privilege and prejudice.

Your call.

February 6, 2016

Clojure for Data Science [Caution: Danger of Buyer’s Regret]

Filed under: Clojure,Data Science,Functional Programming,Programming — Patrick Durusau @ 10:15 pm

Clojure for Data Science by Mike Anderson.

From the webpage:

Presentation given at the Jan 2016 Singapore Clojure Users’ Group

You will have to work at the presentation because there is no accompanying video, but the effort will be well spent.

Before you review these slides or pass them on to others, take fair warning that you may experience “buyer’s regret” with regard to your current programming language/paradigm (if it isn’t already Clojure).

However powerful and shiny your present language seems now, its luster will be dimmed after scanning over these slides.

Don’t say you weren’t warned ahead of time!

BTW, if you search for “clojure for data science” (with the quotes) you will find among other things:

Clojure for Data Science by Henry Garner (Packt)

Repositories for the Clojure for Data Science book.

@cljds Clojure Data Science twitter feed (Henry Garner). VG!

Clojure for Data Science, some 151 slides by Henry Garner.

Plus:

Planet Clojure, a metablog that collects posts from other Clojure blogs.

As a close friend says from time to time, “clojure for data science” G*****s well. 😉

Enjoy!

Between the Words [Alternate Visualizations of Texts]

Filed under: Art,Literature,Visualization — Patrick Durusau @ 8:49 pm

Between the Words – Exploring the punctuation in literary classics by Nicholas Rougeux.

From the webpage:

Between the Words is an exploration of visual rhythm of punctuation in well-known literary works. All letters, numbers, spaces, and line breaks were removed from entire texts of classic stories like Alice’s Adventures in Wonderland, Moby Dick, and Pride and Prejudice—leaving only the punctuation in one continuous line of symbols in the order they appear in texts. The remaining punctuation was arranged in a spiral starting at the top center with markings for each chapter and classic illustrations at the center.

The posters are 24″ x 36″.
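The extraction step is easy to try on any public-domain text. A minimal sketch, assuming a local plain-text copy of the novel (the file name is my invention) and noting that string.punctuation covers only ASCII marks, so curly quotes and dashes would need to be added for a faithful reproduction:

    import string

    def punctuation_stream(text):
        """Drop letters, digits, spaces, and line breaks; keep punctuation, in order."""
        keep = set(string.punctuation)  # ASCII marks only; extend for curly quotes, etc.
        return "".join(ch for ch in text if ch in keep)

    with open("alice.txt", encoding="utf-8") as f:  # hypothetical local copy of the text
        marks = punctuation_stream(f.read())

    print(len(marks), marks[:80])  # how many marks, and the first stretch of the spiral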

Some small images to illustrate the concept:

achistmascarol

ataleoftwocities

aliceinwonderland

I’m not an art critic, but I can say that unusual or unexpected visualizations of data can lead to new insights. Or should I say, different insights than you may have previously held?

Seeing this visualization reminded me of a presentation too many years ago at Cambridge that argued the cantillation marks (think, crudely, “accents”) in the Hebrew Bible were a reliable guide to clause boundaries and reading.

FYI, the versification and divisions in the oldest known witnesses to the Hebrew Bible were added centuries after the text stabilized. There are generally accepted positions on the text but at best, they are just that, generally accepted positions.

Any number of alternative presentations of texts suggest themselves.

I haven’t performed the experiment, but for numeric data, reordering the data so as to force re-casting of formulas could be a way to explore presumptions that are glossed over in the “usual form.”

Not unlike copying a text by hand as opposed to typing or photocopying the text. Each step of performing the task with less deliberation increases the odds you will miss some decision that you are making unconsciously.

If you like these posters or know an English major/professor who may, pass this site along to them. (I have no interest, financial or otherwise, in this site but I like to encourage creative thinking.)

I first saw this in a tweet by Christopher Phipps.

Finding Roman Roads

Filed under: History,LiDAR — Patrick Durusau @ 8:15 pm

You (yes, you) can find Roman roads using data collected by lasers by Barbara Speed.

Barbara reports that, using Lidar data available from the UK Survey portal, David Rateledge was able to discover a Roman road between Ribchester and Lancaster.

She closes with:


The Environment Agency is planning to release 11 Terabytes (for Luddites: that’s an awful lot of data) worth of LIDAR information as part of the Department for Environment, Food and Rural Affairs’ open data initiative, available through this portal. Which means that any of us could download it and dig about for more lost roads.

That seems a bit thin on the advice side, if you are truly interested in using the data to find Roman roads and other sites.

An article posted under the headline ‘Lost’ Roman road is discovered doesn’t provide more on the technique, but it does point to Roman Roads in Lancashire. An interesting site, but no help on using the data.

I can’t comment on the ease of use or documentation but LiDAR tools are available at: Free LiDAR tools.
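If you want to dig about in the data yourself, one common first step is to compute a hillshade from a LIDAR-derived terrain model so that low, straight features (such as road aggers) stand out. A minimal sketch, assuming a GeoTIFF elevation tile plus the rasterio and numpy libraries, none of which are named in the articles above:

    import numpy as np
    import rasterio

    # Open a LIDAR-derived digital terrain model tile (hypothetical file name).
    with rasterio.open("lidar_dtm_tile.tif") as src:
        elev = src.read(1).astype(float)
        xres, yres = src.res

    # Slope and aspect from the elevation gradients.
    dy, dx = np.gradient(elev, yres, xres)
    slope = np.arctan(np.hypot(dx, dy))
    aspect = np.arctan2(-dx, dy)

    # Simple hillshade: sun from the northwest (315 degrees), 45 degrees above the horizon.
    azimuth, altitude = np.radians(315.0), np.radians(45.0)
    shade = (np.sin(altitude) * np.cos(slope) +
             np.cos(altitude) * np.sin(slope) * np.cos(azimuth - aspect))

    # Plot or save `shade`; long, straight, slightly raised lines are the candidates.

Raking the light from a few different azimuths and comparing the results is how embankments invisible on the ground become obvious on screen.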

See also my post on the OpenTopography Project.

How To Profit from Human Trafficking – Become a Trafficker or NGO

Filed under: Government,Journalism,News,Reporting — Patrick Durusau @ 5:27 pm

Special Report: Money and Lies in Anti-Human Trafficking NGOs by Anne Elizabeth Moore.

From the post:

The United States’ beloved – albeit disgraced – anti-trafficking advocate Somaly Mam has been waging a slow but steady return to glory since a Newsweek cover story in May 2014 led to her ousting from the Cambodian foundation that bore her name. The allegations in the article were not new; they’d been reported and corroborated in bits and pieces for years. The magazine simply pointed out that Mam’s personal narrative as a survivor of sex trafficking and the similar stories that emerged from both clients and staff at the non-governmental organization (NGO) she founded to assist survivors of sex trafficking, were often unverifiable, if not outright lies.

Panic ensued. Mam had helped establish, for US audiences, key plot points in the narrative of trafficking and its future eradication. Her story is that she was forced into labor early in life by someone she called “Grandfather,” who then sold off her virginity and forced her into a child marriage. Later she says she was sold to a brothel where she watched several contemporaries die in violence. Childhood friends and even family members couldn’t verify Mam’s recollection of events for Newsweek, but Mam has suggested that her story is typical of trafficking victims.

Mam has also cultivated a massive global network of anti-trafficking NGOs, funders and supporters, who have based their missions, donations and often life’s work on her emotional – but fabricated – tale. Some distanced themselves from the Cambodian activist last spring, including her long-time supporter at The New York Times, Nicholas Kristof, while others suggested that even if untrue, Mam’s stories were told in support of a worthy cause and were therefore true enough.

Moore characterizes NGOs organized to stop human trafficking as follows:


Considering their common mythical enemy – the nameless and faceless men portrayed in TV dramas who trade in nubile human girl stock – one would hope anti-trafficking organizations would unite in an effort to be less shady. With names reliant on metaphors of recovery, light and sanctuary, anti-trafficking groups project an image of transparency. Yet these groups have shown a remarkable lack of fiscal accountability and organizational consistency, often even eschewing an open acknowledgement of board members, professional affiliates and funding relationships. The problems with this evasion go beyond ethical considerations: A certain level of budgetary disclosure, for example, is a legal requirement for tax-exempt 501(c)(3) organizations. Yet anti-trafficking groups fold, move, restructure and reappear under new names with alarming frequency, making them almost as difficult to track as their supposed foes.

It is a very compelling article that will leave you with more questions about the finances of NGOs “opposing” human trafficking than answers.

The lack of answers isn’t Moore’s fault, the NGOs in question were designed to make obtaining answers difficult, if not impossible.

After you read the article, more than once to get the full impact, how would you (a rough tracking sketch follows the list):

  1. Track organizations in the article that: “…fold, move, restructure and reappear under new names with alarming frequency…”?
  2. How would you gather and share data on those organizations?
  3. How would you map what data is available on funding to Moore’s report?
  4. How would you make Moore’s snapshot of data subject to updating by later reporters?
  5. How would you track the individuals involved in the NGOs you track?
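For question 1, and as a start on question 2, here is a minimal, hypothetical sketch of a record that merges an organization’s identities whenever two records share a name or an employer identification number. Every field name is my assumption, not something drawn from Moore’s reporting:

    from dataclasses import dataclass, field

    @dataclass
    class OrgRecord:
        """One anti-trafficking organization, tracked across renames and restructurings."""
        names: set = field(default_factory=set)      # every name it has operated under
        eins: set = field(default_factory=set)       # IRS employer ID numbers, if known
        people: set = field(default_factory=set)     # board members, officers, affiliates
        sources: list = field(default_factory=list)  # a citation for every claim above

    def merge_if_same(a: OrgRecord, b: OrgRecord) -> bool:
        """Merge b into a when they share an EIN or a name; return True if merged."""
        if (a.eins & b.eins) or (a.names & b.names):
            a.names |= b.names
            a.eins |= b.eins
            a.people |= b.people
            a.sources += b.sources
            return True
        return False

Shared people, addresses, or filings are usually better merge keys than names, precisely because the names are what keep changing.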

The answers to those questions are applicable to human traffickers as well.

Consider it to be a “two-for.”

The Vietnam War: A Non-U.S. Photo Essay

Filed under: Government,Journalism,News,Reporting — Patrick Durusau @ 3:28 pm

1965-1975 Another Vietnam by Alex Q. Arbuckle.

From the post:

For much of the world, the visual history of the Vietnam War has been defined by a handful of iconic photographs: Eddie Adams’ image of a Viet Cong fighter being executed, Nick Ut’s picture of nine-year-old Kim Phúc fleeing a napalm strike, Malcolm Browne’s photo of Thích Quang Duc self-immolating in a Saigon intersection.

Many famous images of the war were taken by Western photographers and news agencies, working alongside American or South Vietnamese troops.

But the North Vietnamese and Viet Cong had hundreds of photographers of their own, who documented every facet of the war under the most dangerous conditions.

Almost all were self-taught, and worked for the Vietnam News Agency, the National Liberation Front, the North Vietnamese Army or various newspapers. Many sent in their film anonymously or under a nom de guerre, viewing themselves as a humble part of a larger struggle.

A timely reminder that Western media and government-approved photographs are evidence for only one side of any conflict.

Efforts by Twitter and Facebook to censor any narrative other than a Western one on the Islamic State should be very familiar to anyone who remembers the “Western view only” of media reports in the 1960s.

Censorship, whether during Vietnam or in opposition to the Islamic State, doesn’t make the “other” narrative go away. It cannot deny the facts known to residents in a war zone.

The only goal that censorship achieves, and not always, is to keep the citizens of the censoring powers in ignorance. So much for freedom of speech. You can’t talk about what you don’t know about.

The essay uses images from Another Vietnam: Pictures of the War from the Other Side. I checked at National Geographic, the publisher, and it isn’t listed in their catalog. Used or new, the book runs about $160.00 and contains 180 never-before-published photographs.

Questions come to mind:

Where are the other North Vietnam/Viet Cong photos now? Shouldn’t those be documented, digitized and placed online?

Where are the Islamic State’s photos and videos that are purged from Twitter and Facebook?

The media is repeating the same mistake with the Islamic State that it made during Vietnam.

No reader can decide between competing narratives in the face of only one narrative.

Nor can they avoid making the same mistakes as have been made in the past.

Vietnam is a very good example of such a mistake.

Replacing the choices of other cultures with our own is a mission doomed to failure (and defeat).

I first saw this in a tweet by Lars Marius Garshol.

Are You A Scientific Twitter User or Polluter?

Filed under: Science,Twitter — Patrick Durusau @ 11:22 am

Realscientists posted this image to Twitter:

science

Self-Scoring Test:

In the last week, how often have you retweeted without “read[ing] the actual paper” pointed to by a tweet?

How many times did you retweet in total?

Formula: (retweets without reading ÷ total retweets) × 100 = % of retweets without reading.
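The same arithmetic as a few lines of Python, in case you want to keep a weekly tally (the numbers in the example are invented):

    def pollution_rate(retweets_without_reading, total_retweets):
        """Percentage of your retweets passed along without reading the linked paper."""
        if total_retweets == 0:
            return 0.0
        return 100.0 * retweets_without_reading / total_retweets

    # Example: 7 of 20 retweets this week went out unread.
    print(pollution_rate(7, 20))  # 35.0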

No scale with superlatives because I don’t have numbers to establish a baseline for the “average” Twitter user.

I do know that I see click-bait, outdated, and factually wrong material retweeted by people who know better. That’s Twitter pollution.

Ask yourself: Am I a scientific Twitter user or a polluter?

Your call.

February 5, 2016

Is Twitter A Global Town Censor? (Data Project)

Filed under: Censorship,Free Speech,Government,Tweets,Twitter — Patrick Durusau @ 9:51 pm

Twitter Steps Up Efforts to Thwart Terrorists’ Tweets by Mike Isaac.

From the post:

For years, Twitter has positioned itself as a “global town square” that is open to discourse from all. And for years, extremist groups like the Islamic State have taken advantage of that stance, using Twitter as a place to spread their messages.

Twitter on Friday made clear that it was stepping up its fight to stem that tide. The social media company said it had suspended 125,000 Twitter accounts associated with extremism since the middle of 2015, the first time it has publicized the number of accounts it has suspended. Twitter also said it had expanded the teams that review reports of accounts connected to extremism, to remove the accounts more quickly.

“As the nature of the terrorist threat has changed, so has our ongoing work in this area,” Twitter said in a statement, adding that it “condemns the use of Twitter to promote terrorism.” The company said its collective moves had already produced results, “including an increase in account suspensions and this type of activity shifting off Twitter.”

The disclosure follows intensifying pressure on Twitter and other technology companies from the White House, presidential candidates like Hillary Clinton and government agencies to take more action to combat the digital practices of terrorist groups. The scrutiny has grown after mass shootings in Paris and San Bernardino, Calif., last year, because of concerns that radicalizations can be accelerated by extremist postings on the web and social media.

Just so you know what the Twitter rule is:

Violent threats (direct or indirect): You may not make threats of violence or promote violence, including threatening or promoting terrorism. (The Twitter Rules)

Here’s your chance to engage in real data science and help decide the question of whether Twitter has changed from global town square to global town censor.

Here’s the data gathering project:

Monitor all the Twitter streams for Republican and Democratic candidates for the U.S. presidency for tweets advocating violence/terrorism.

File requests with Twitter for those accounts to be suspended.

FYI: When you report a message (Reporting a Tweet or Direct Message for violations), it will disappear from your Messages inbox.

You must copy every tweet you report (accounts disappear as well) if you want to keep a record of your report.

Keep track of your reports and the tweet you copied before reporting.

Post the record of your reports and the tweets reported, plus any response from Twitter.

Suggestions on how to format these reports?
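As one possible starting point, here is a minimal sketch using only the Python standard library; the file name and field names are my choices, not anything Twitter provides:

    import csv
    import os
    from datetime import datetime, timezone

    FIELDS = ["reported_at", "account", "tweet_url", "tweet_text", "twitter_response"]

    def log_report(path, account, tweet_url, tweet_text, twitter_response=""):
        """Append one reported tweet to a local CSV archive before it disappears."""
        is_new = not os.path.exists(path)
        with open(path, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if is_new:
                writer.writeheader()
            writer.writerow({
                "reported_at": datetime.now(timezone.utc).isoformat(),
                "account": account,
                "tweet_url": tweet_url,
                "tweet_text": tweet_text,
                "twitter_response": twitter_response,
            })

    # Example:
    # log_report("reported_tweets.csv", "@example_account",
    #            "https://twitter.com/...", "copied tweet text here")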

Or would you rather not know what Twitter is deciding for you?

How much data needs to be collected to move on to part 2 of the project – data analysis?


Suggestions on who at Twitter to contact for a listing of the 125,000 accounts that were silenced, along with the Twitter history for each one? (Or for the entire history of silenced accounts at Twitter? Who gets censored, by topic, race, gender, location, etc., are all open questions.)

That could change the Twitter process from a black box to something with marginally more transparency. You would still have to guess at why any particular account was silenced.

If Twitter wants to take credit for censoring public discourse then the least it can do is be honest about who was censored and what they were saying to be censored.

Yes?

Ethical Data Scientists: Will You Support A False Narrative – “Community of Hope?”

Filed under: Government,Politics — Patrick Durusau @ 11:22 am

Google executive Anthony House advocates a false narrative, a “community of hope,” as a counter to truthful content from the Islamic State:

We should get the bad stuff down [online], but it’s also extremely important that people are able to find good information, that when people are feeling isolated, that when they go online, they find a community of hope, not a community of harm. (Google plans to fight extremist propaganda with AdWords)

Islamic State media is offering a community of hope. One based on facts, not a fantasy of Western planners.

The more immediate, but no less intractable, challenge is to change the reality on the ground in Syria and Iraq, so that ISIS’s narrative of Sunni Muslim persecution at the hands of the Assad regime and Iranian-backed Shiite militias commands less resonance among Sunnis. One problem in countering that narrative is that some of it happens to be true: Sunni Muslims are being persecuted in Syria and Iraq. This blunt empirical fact, just as much as ISIS’s success on the battlefield, and the rhetorical amplification and global dissemination of that success via ISIS propaganda, helps explain why ISIS has been so effective in recruiting so many foreign fighters to its cause. (Why It’s So Hard to Stop ISIS Propaganda)

The persecution of Sunni Muslims isn’t the only fact in the Islamic State narrative. Consider the following:

  • Muslim governments exist at the sufferance of the West. Ex. Afghanistan, Iran, Libya, Syria
  • Existing “Muslim” leaders are vassals of the West.
  • For more than a century the West has dictated the fate of Muslims in the Middle East.
  • The West supports oppression of the Palestinian people.
  • The West opposes democratic results in Muslim countries that don’t accord with its wishes.

We might disagree on the phrasing of those facts but can an ethical data scientist say they are not true?

Whatever the motivation of the West in each case, the West wants to decide the fate of Muslims.

Is the “community of hope” Google portrays to be based on false hopes or new realities on the ground?

There’s a question for all the “ethical” data scientists at Google.

Will you support a false narrative by Google for a “community of hope” to deter terrorism?
