Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 20, 2015

Spark Summit East Agenda (New York, March 18-19 2015)

Filed under: Conferences,Spark — Patrick Durusau @ 3:05 pm

Spark Summit East Agenda (New York, March 18-19 2015)

Registration

The plenary and track sessions are on day one. Databricks is offering three training courses on day two.

The track sessions were divided into developer, applications and data science tracks. To assist you in finding your favorite speakers, I have collapsed that listing and sorted it by the first listed speaker’s last name. I certainly hope all of these presentations will be video recorded!

Take good notes and blog about your favorite sessions! Ping me with a pointer to your post. Thanks!

I first saw this in a tweet by Helena Edelson.

Modelling Data in Neo4j: Labels vs. Indexed Properties

Filed under: Graphs,Neo4j — Patrick Durusau @ 2:15 pm

Modelling Data in Neo4j: Labels vs. Indexed Properties by Christophe Willemsen.

From the post:

A common question when planning and designing your Neo4j Graph Database is how to handle "flagged" entities. This could include users that are active, blog posts that are published, news articles that have been read, etc.

Introduction

In the SQL world, you would typically create a boolean|tinyint column; in Neo4j, the same can be achieved in the following two ways:

  • A flagged indexed property
  • A dedicated label

Having faced this design dilemma a number of times, we would like to share our experience with the two presented possibilities and some Cypher query optimizations that will help you take full advantage of the graph database.

Throughout the blog post, we'll use the following example scenario:

  • We have User nodes
  • User FOLLOWS other users
  • Each user writes multiple blog posts stored as BlogPost nodes
  • Some of the blog posts are drafted, others are published (active)

This post will help you make the best use of labels in Neo4j.
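To make the two options concrete, here is a minimal sketch of both approaches for the scenario above, using the official Python driver (the neo4j package) against a local instance. The connection details, the WROTE relationship, and the Published label / published property names are illustrative assumptions rather than taken from the post, and the index syntax is the current Neo4j form, not the 2.x form available in 2015.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Option 1: the "flag" is an indexed boolean property on BlogPost.
PROPERTY_QUERY = """
MATCH (u:User {name: $name})-[:WROTE]->(p:BlogPost {published: true})
RETURN p.title AS title
"""

# Option 2: the same state expressed as a dedicated label.
LABEL_QUERY = """
MATCH (u:User {name: $name})-[:WROTE]->(p:BlogPost:Published)
RETURN p.title AS title
"""

with driver.session() as session:
    # Index backing option 1; option 2 needs none, since labels are
    # effectively indexed by the database itself.
    session.run("CREATE INDEX blogpost_published IF NOT EXISTS "
                "FOR (p:BlogPost) ON (p.published)")
    for record in session.run(PROPERTY_QUERY, name="alice"):
        print("property:", record["title"])
    for record in session.run(LABEL_QUERY, name="alice"):
        print("label:", record["title"])

driver.close()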

Labels are semantically opaque, so if your Neo4j database uses “German” to label books written in German, you are SOL if you also need “German” for nationality.

That is a weakness of semantically opaque tokens. Having type properties on labels would push the semantic opaqueness to the next level.

pgcli [Inspiration for command line tool for XPath/XQuery?]

Filed under: PostgreSQL,XPath,XQuery — Patrick Durusau @ 1:22 pm

pgcli

From the webpage:

Pgcli is a command line interface for Postgres with auto-completion and syntax highlighting.

Postgres folks who don’t know about pgcli will be glad to see this post.

But, having spent several days with XPath/XQuery/FO 3.1 syntax, I can only imagine the joy a similar utility would bring to XML circles for use with command line XML tools.

Properly done, the increase in productivity would be substantial.
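To make the idea concrete, here is a minimal sketch of an XPath shell with auto-completion, using prompt_toolkit (the library pgcli itself is built on) and lxml. Caveats: lxml evaluates XPath 1.0 only, not 3.1, and the completion word list is a placeholder; this is a toy illustration, not a design.

import sys
from lxml import etree
from prompt_toolkit import PromptSession
from prompt_toolkit.completion import WordCompleter

# Placeholder vocabulary; a real tool would load function and axis
# names from the specs themselves.
COMPLETER = WordCompleter(
    ["ancestor::", "descendant::", "following-sibling::",
     "count(", "string(", "normalize-space(", "contains("],
    ignore_case=True)

def main(xml_file):
    doc = etree.parse(xml_file)
    session = PromptSession("xpath> ", completer=COMPLETER)
    while True:
        try:
            expr = session.prompt()
        except (EOFError, KeyboardInterrupt):
            break
        try:
            result = doc.xpath(expr)
        except etree.XPathError as err:
            print("error:", err)
            continue
        # Node-set results come back as a list; count(), string() etc. do not.
        for item in (result if isinstance(result, list) else [result]):
            print(item)

if __name__ == "__main__":
    main(sys.argv[1])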

The same applies for your favorite NoSQL query language. (Datomic?)

Will SQL users be the only ones with such a command line tool?

I first saw this in a tweet by elishowk.

Improved Fault-tolerance and Zero Data Loss in Spark Streaming

Filed under: Spark,Streams — Patrick Durusau @ 11:38 am

Improved Fault-tolerance and Zero Data Loss in Spark Streaming by Tathagata Das.

From the post:

Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its beginning, Spark Streaming has included support for recovering from failures of both driver and worker machines. However, for some data sources, input data could get lost while recovering from the failures. In Spark 1.2, we have added preliminary support for write ahead logs (also known as journaling) to Spark Streaming to improve this recovery mechanism and give stronger guarantees of zero data loss for more data sources. In this blog, we are going to elaborate on how this feature works and how developers can enable it to get those guarantees in Spark Streaming applications.

Background

Spark and its RDD abstraction is designed to seamlessly handle failures of any worker nodes in the cluster. Since Spark Streaming is built on Spark, it enjoys the same fault-tolerance for worker nodes. However, the demand of high uptimes of a Spark Streaming application require that the application also has to recover from failures of the driver process, which is the main application process that coordinates all the workers. Making the Spark driver fault-tolerant is tricky because it is an arbitrary user program with arbitrary computation patterns. However, Spark Streaming applications have an inherent structure in the computation — it runs the same Spark computation periodically on every micro-batch of data. This structure allows us to save (aka, checkpoint) the application state periodically to reliable storage and recover the state on driver restarts.

For sources like files, this driver recovery mechanism was sufficient to ensure zero data loss as all the data was reliably stored in a fault-tolerant file system like HDFS or S3. However, for other sources like Kafka and Flume, some of the received data that was buffered in memory but not yet processed could get lost. This is because of how Spark applications operate in a distributed manner. When the driver process fails, all the executors running in a standalone/yarn/mesos cluster are killed as well, along with any data in their memory. In case of Spark Streaming, all the data received from sources like Kafka and Flume are buffered in the memory of the executors until their processing has completed. This buffered data cannot be recovered even if the driver is restarted. To avoid this data loss, we have introduced write ahead logs in Spark Streaming in the Spark 1.2 release.

Solid piece on the principles and technical details you will need for zero data loss in Spark Streaming, with suggestions for changes that may be necessary to support zero data loss at no loss in throughput. The latter is a non-trivial consideration.
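For orientation, here is a minimal PySpark sketch of the two pieces the post describes: checkpointing application state to reliable storage for driver recovery, and enabling the receiver write ahead log added in Spark 1.2. The checkpoint path and the socket source are placeholders; a real deployment would use a receiver-based source such as Kafka or Flume.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///checkpoints/wal-demo"  # placeholder path

def create_context():
    conf = (SparkConf()
            .setAppName("wal-demo")
            # Write received data to a fault-tolerant log before processing.
            .set("spark.streaming.receiver.writeAheadLog.enable", "true"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=10)
    ssc.checkpoint(CHECKPOINT_DIR)  # periodic driver-state checkpoints
    # All DStream setup must happen here so it can be restored from checkpoint.
    lines = ssc.socketTextStream("localhost", 9999)
    lines.count().pprint()
    return ssc

# Recover from the checkpoint if one exists, otherwise build a fresh context.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()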

Curious, I understand that many systems require zero data loss, but do you have examples of systems where some data loss is acceptable? To what extent is data loss acceptable? (Given lost baggage rates, is airline baggage one of those?)

Modelling Plot: On the “conversional novel”

Filed under: Language,Literature,Text Analytics,Text Mining — Patrick Durusau @ 11:11 am

Modelling Plot: On the “conversional novel” by Andrew Piper.

From the post:

I am pleased to announce the acceptance of a new piece that will be appearing soon in New Literary History. In it, I explore techniques for identifying narratives of conversion in the modern novel in German, French and English. A great deal of new work has been circulating recently that addresses the question of plot structures within different genres and how we might or might not be able to model these computationally. My hope is that this piece offers a compelling new way of computationally studying different plot types and understanding their meaning within different genres.

Looking over recent work, in addition to Ben Schmidt’s original post examining plot “arcs” in TV shows using PCA, there have been posts by Ted Underwood and Matthew Jockers looking at novels, as well as a new piece in LLC that tries to identify plot units in fairy tales using the tools of natural language processing (frame nets and identity extraction). In this vein, my work offers an attempt to think about a single plot “type” (narrative conversion) and its role in the development of the novel over the long nineteenth century. How might we develop models that register the novel’s relationship to the narration of profound change, and how might such narratives be indicative of readerly investment? Is there something intrinsic, I have been asking myself, to the way novels ask us to commit to them? If so, does this have something to do with larger linguistic currents within them – not just a single line, passage, or character, or even something like “style” – but the way a greater shift of language over the course of the novel can be generative of affective states such as allegiance, belief or conviction? Can linguistic change, in other words, serve as an efficacious vehicle of readerly devotion?

While the full paper is available here, I wanted to post a distilled version of what I see as its primary findings. It’s a long essay that not only tries to experiment with the project of modelling plot, but also reflects on the process of model building itself and its place within critical reading practices. In many ways, its a polemic against the unfortunate binariness that surrounds debates in our field right now (distant/close, surface/depth etc.). Instead, I want us to see how computational modelling is in many ways conversional in nature, if by that we understand it as a circular process of gradually approaching some imaginary, yet never attainable centre, one that oscillates between both quantitative and qualitative stances (distant and close practices of reading).

Andrew writes of “…critical reading practices….” I’m not sure that technology will increase the use of “…critical reading practices…” but it certainly offers the opportunity to “read” texts in different ways.

I have done this with IT standards but never a novel: attempt reading the text from the back forwards, a sentence at a time. At least with a text you are proofing, it provides a radically different perspective than the more normal front to back. The first thing you notice is that it interrupts your reading/skimming speed, so you will catch more errors as well as nuances in the text.

Before you think that literary analysis is a bit far afield from “practical” application, remember that narratives (think literature) are what drive social policy and decision making.

Take the “war on terrorism” narrative that is so popular and unquestioned in the United States. Ask anyone inside the beltway in D.C. and they will blather on and on about the need to defend against terrorism. But there is an absolute paucity of terrorists, at least by deed, in the United States. Why does the narrative persist in the absence of any evidence to support it?

The various Red Scares in U.S. history were similar narratives that have never completely faded. They too had a radical disconnect between the narrative and the “facts on the ground.”

Piper doesn’t offer answers to those sorts of questions, but a deeper understanding of narrative, such as is found in novels, may lead to hints with profound policy implications.

Opportunistic “Information” on Sony Hack

Filed under: Cybersecurity,Security — Patrick Durusau @ 10:31 am

Why the US was so sure North Korea hacked Sony: it had a front-row seat by Lisa Vaas.

From the post:

We may finally know why the US was so confident about identifying North Korea’s hand in the Sony attack: it turns out the NSA had front-row seats to the cyber carnage, having infiltrated computers and networks of the country’s hackers years ago.

According to the New York Times, a recently released top-secret document traces the NSA’s infiltration back to 2010, when it piggybacked on South Korean “implants” on North Korea’s networks and “sucked back the data”.

The NSA didn’t find North Korea all that interesting, but that attitude changed as time went on, in part because the agency managed to intercept and repurpose a 0-day exploit – a “big win,” according to the document.

Stories like this one make me wonder if anyone follows hyperlinks embedded in posts?

The document, http://www.spiegel.de/media/media-35679.pdf is composed of war stories, one of which was to answer the question:

Is there “fifth party” collection?

“Fourth party collection” refers to passively or actively obtaining data from some other actor’s CNE activity against a target. Has there ever been an instance of NSA obtaining information from Actor One exploiting Actor Two’s CNE activity against a target that NSA, Actor One, and Actor Two all care about?

The response:

Yes. There was a project that I was working last year with regard to the South Korean CNE program. While we weren’t super interested in SK (things changed a bit when they started targeting us a bit more), we were interested in North Korea and SK puts a lot of resources against them. At that point, our access to NK was next to nothing but we were able to make some inroads into the SK CNE program. We found a few instances where there were NK officials with SK implants in their boxes, so we got on the exfil points, and sucked back the data. Thats forth party. (TS//SI//REL) However, some of the individuals that SK was targeting were also part of the NK CNE program. So I guess that would be the fifth party collect you were talking about. But once that started happening, we ramped up efforts to target NK ourselves (as you don’t want to rely on an untrusted actor to do your work for you.) But some of the work that was done there was to help us gain access. (TS//SI//REL) I know of another instance (I will be more vague because I believe there are more compartments involved and parts are probably NF) where there was an actor we were going against. We realized another actor was also going against them and having great success because of a 0 day they wrote. We got the 0 day out of passive and were able to re-purpose it. Big win. (TS//SI//REL) But they were all still referred to as a fourth party.

Origin: The document appears on the Free Snowden site under the title: ‘4th Party Collection’: Taking Advantage of Non-Partner Computer Network Exploitation Activity

Analysis:

There are a couple of claims in Lisa’s account that are easy to dismiss on the basis of the document itself:

Lisa says:

The NSA didn’t find North Korea all that interesting, but that attitude changed as time went on, in part because the agency managed to intercept and repurpose a 0-day exploit – a “big win,” according to the document.

Assuming that SK = South Korea and NK = North Korea, the document reports:

While we weren’t super interested in SK (things changed a bit when they started targeting us a bit more), we were interested in North Korea and SK puts a lot of resources against them. (Emphasis added)

I read that to say we weren’t “super interested” in South Korea until South Korea started targeting us more. Does anyone have an English reading of that which reaches a different conclusion?

Lisa also says that:

The NSA didn’t find North Korea all that interesting, but that attitude changed as time went on, in part because the agency managed to intercept and repurpose a 0-day exploit – a “big win,” according to the document.

The war story in question concludes the South Korea and North Korea account and then says:

I know of another instance (I will be more vague because I believe there are more compartments involved and parts are probably NF) where there was an actor we were going against. We realized another actor was also going against them and having great success because of a 0 day they wrote. We got the 0 day out of passive and were able to re-purpose it. Big win. (TS//SI//REL) But they were all still referred to as a fourth party. (emphasis added)

The “I know of another instance” signals to most readers a change in the narrative to start a different account from the one just concluded. In the second instance, only “actor” is used and there is no intimation that North Korea is one of those actors. Could certainly be but there is no apparent connection between the two accounts.

Moreover, there is nothing in the war story to indicate that a permanent monitoring presence was established in any network, capable of the sort of monitoring that Lisa characterizes as having “a front-row seat.”

Summary:

The leaking of this document is an attempt to exploit uncertainty about government claims concerning the Sony hack.

The document does not establish recovery of data from the North Korean network but only “…NK officials with SK implants in their boxes, so we got on the exfil points, and sucked back the data.”

Moreover, the document establishes that South Korea attempts to conduct CNE operations against the United States and is considered “…an untrusted actor….”

The zero-day exploit may have been used against North Korea; anything is possible. But this document gives no basis for concluding that it was.

Finally, this document does not establish any basis for concluding that the United States had achieved a network monitoring capability on North Korean CNE networks or operations.

It is bad enough that the United States government keeps inventing specious claims about the Sony hack. Let’s not assist it by manufacturing even less likely accounts.

January 19, 2015

D-Lib Magazine January/February 2015

Filed under: Data Science,Librarian/Expert Searchers,Library — Patrick Durusau @ 8:37 pm

D-Lib Magazine January/February 2015

From the table of contents (see the original toc for abstracts):

Editorials

2nd International Workshop on Linking and Contextualizing Publications and Datasets by Laurence Lannom, Corporation for National Research Initiatives

Data as “First-class Citizens” by Łukasz Bolikowski, ICM, University of Warsaw, Poland; Nikos Houssos, National Documentation Centre / National Hellenic Research Foundation, Greece; Paolo Manghi, Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Italy and Jochen Schirrwagen, Bielefeld University Library, Germany

Articles

Semantic Enrichment and Search: A Case Study on Environmental Science Literature by Kalina Bontcheva, University of Sheffield, UK; Johanna Kieniewicz and Stephen Andrews, British Library, UK; Michael Wallis, HR Wallingford, UK

A-posteriori Provenance-enabled Linking of Publications and Datasets via Crowdsourcing by Laura Drăgan, Markus Luczak-Rösch, Elena Simperl, Heather Packer and Luc Moreau, University of Southampton, UK; Bettina Berendt, KU Leuven, Belgium

A Framework Supporting the Shift from Traditional Digital Publications to Enhanced Publications by Alessia Bardi and Paolo Manghi, Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Italy

Science 2.0 Repositories: Time for a Change in Scholarly Communication by Massimiliano Assante, Leonardo Candela, Donatella Castelli, Paolo Manghi and Pasquale Pagano, Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Italy

Data Citation Practices in the CRAWDAD Wireless Network Data Archive by Tristan Henderson, University of St Andrews, UK and David Kotz, Dartmouth College, USA

A Methodology for Citing Linked Open Data Subsets by Gianmaria Silvello, University of Padua, Italy

Challenges in Matching Dataset Citation Strings to Datasets in Social Science by Brigitte Mathiak and Katarina Boland, GESIS — Leibniz Institute for the Social Sciences, Germany

Enabling Living Systematic Reviews and Clinical Guidelines through Semantic Technologies by Laura Slaughter, The Interventional Centre, Oslo University Hospital (OUS), Norway; Christopher Friis Berntsen and Linn Brandt, Internal Medicine Department, Innlandet Hospital Trust and MAGICorg, Norway and Chris Mavergames, Informatics and Knowledge Management Department, The Cochrane Collaboration, Germany

Data without Peer: Examples of Data Peer Review in the Earth Sciences by Sarah Callaghan, British Atmospheric Data Centre, UK

The Tenth Anniversary of Assigning DOI Names to Scientific Data and a Five Year History of DataCite by Jan Brase and Irina Sens, German National Library of Science and Technology, Germany and Michael Lautenschlager, German Climate Computing Centre, Germany

New Events

N E W S   &   E V E N T S

In Brief: Short Items of Current Awareness

In the News: Recent Press Releases and Announcements

Clips & Pointers: Documents, Deadlines, Calls for Participation

Meetings, Conferences, Workshops: Calendar of Activities Associated with Digital Libraries Research and Technologies

The quality of D-Lib Magazine meets or exceeds the quality claimed by pay-per-view publishers.

Enjoy!

29 GIFs Only ScalaCheck Witches Will Understand

Filed under: Programming,Scala — Patrick Durusau @ 8:08 pm

29 GIFs Only ScalaCheck Witches Will Understand by Kelsey Gilmore-Innis.

From the post:

Because your attention span. Stew O’Connor and I recently gave a talk on ScalaCheck, the property-based testing library for Scala. You can watch the video, or absorb it here in the Internet’s Truest Form. Here are 29 GIFs you have to be a total ScalaCheck witch to get:

1. ScalaCheck is black magick…

[GIF: genie bottle]

I’ll confess, I don’t get some of the images. But they are interesting enough that I am willing to correct that deficiency!

BTW, I think I had a pipe that looked like this one a very, very long time ago. I don’t remember the smoke being pink. 😉
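If property-based testing is new to you, the core idea behind ScalaCheck is easy to show. This is not ScalaCheck itself, but a rough Python analogue using the Hypothesis library: you state a property and the library generates, runs and shrinks the test cases for you.

from hypothesis import given, strategies as st

# Property: reversing a list twice gives back the original list.
# Hypothesis generates many random lists and shrinks any failing case.
@given(st.lists(st.integers()))
def test_reverse_twice_is_identity(xs):
    assert list(reversed(list(reversed(xs)))) == xs

if __name__ == "__main__":
    test_reverse_twice_is_identity()  # also runs as-is under pytest
    print("property held for all generated examples")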

More Power for Datomic Datalog: Negation, Disjunction, and Range Optimizations

Filed under: Clojure,Datomic — Patrick Durusau @ 7:52 pm

More Power for Datomic Datalog: Negation, Disjunction, and Range Optimizations by Stuart Halloway.

From the post:

Today’s Datomic release includes a number of enhancements to Datomic’s Datalog query language:

  • Negation, via the new not and not-join clauses
  • Disjunction (or) without using rules, via the new or and or-join clauses
  • Required rule bindings
  • Improved optimization of range predicates

Each is described below, and you can follow the examples from the mbrainz data set in Java or in Clojure.

The four hours of video in Datomic Training Videos may not be enough for you to appreciate this post but you will soon enough!

If you want a steep climb on queries, try Datomic Queries and Rules. Not for the faint of heart.

Enjoy!

Datomic Training Videos

Filed under: Clojure,Datomic,Functional Programming — Patrick Durusau @ 7:38 pm

Datomic Training Videos by Stu Halloway.

Part I: What is Datomic?

Part II: The Datomic Information Model

Part III: The Datomic Transaction Model

Part IV: The Datomic Query Model

Part V: The Datomic Time Model

Part VI: The Datomic Operational Model

About four (4) hours of videos with classroom materials, slides, etc.

OK, it’s not Downton Abbey but if you missed it last night you have a week to kill before it comes on TV again. May as well learn something while you wait. 😉

Pay particular attention to the time model in Datomic. Then ask yourself (intelligence community): Why can’t I do that with my database? (Insert your answer as a comment, leaving out classified details.)

A bonus question: What role should Stu play on Downton Abbey?

Hackers for hire? Hacker’s List – for those with no ethics or espionage skills

Filed under: Cybersecurity,Security — Patrick Durusau @ 7:07 pm

Hackers for hire? Hacker’s List – for those with no ethics or espionage skills by Lisa Vaas.

From the post:

Need to break the law, but lack the technology chops to do it yourself?

Now, as they say, there’s an app for that.

More precisely, there’s a market for it, launched in November, called Hacker’s List.

As of Monday morning, the site was down, either because it was mobbed by every tech journalist on the planet, or because a whole lot of people really, really want to do things like break into their lovers’ Facebook and Gmail accounts to sniff out cheaters.

Lisa confirms that the site is real and gives some of the listed jobs (far more job offers than job takers) and then concludes:


It’s just plain dismaying that spying on others, ruining their credibility, gaining unfair competitive advantage and even cracking a bank’s database could be so casually listed, as if any one of them were postings for a lost cat or a request for help in cleaning out the basement.

This is one of the few times when I’ve wished for a news story to turn out to be a prank.

Unfortunately, given even the briefest scan of Naked Security headlines concerning spyware or data breaches, it very likely is quite real.

My first thought when I read Lisa’s account was the arrest of a Georgia resident for hiring a “hit man” to kill his wife. Turns out the “hit man” was an undercover police officer. At least in Georgia, most of the “hit men” you hear about on the news are undercover police officers. So, caution is advised when undertaking work from an anonymous source. Assuming reasonable site security, they are in far less danger than you are. Yes?

My second thought was that I don’t share Lisa’s dismay over:

spying on others, ruining their credibility, gaining unfair competitive advantage and even cracking a bank’s database….

Not that I am advocating you should do any of those things, but the United States government and other governments around the world do that sort of thing every day.

Do you effectively oppose those things by having “ethics?”

I would be real careful before putting the “ethics” card into play. It may very well be played to your disadvantage.

Do terrorists use spam to shroud their secrets?

Filed under: Cybersecurity,Security — Patrick Durusau @ 4:50 pm

Do terrorists use spam to shroud their secrets? by Paul Ducklin.

Paul reviews a paper by Michael Wertheimer (NSA) about subversion of the random number generator Dual_EC_DRBG, which mentions, as an aside, terrorists using spam subject lines to escape scrutiny of their email. Paul makes the point that discarding spam may lead to discarding the intelligence you are seeking. Good read.

Hiding behind spam filtering seems like a low-hanging-fruit way to avoid scrutiny.

One assumption for new security thinking should be: All your communications are being intercepted.

The intercept assumption is necessary in light of highly probable interception of all digital traffic by the NSA. That assumption also prevents solutions based on optimism concerning the data available to the NSA and others.

How would you use spam to keep private things private?

XPath/XQuery/FO/XDM 3.1 Definitions – Deduped/Sorted/Some Comments! Version 0.1

Filed under: XML,XPath,XQuery — Patrick Durusau @ 10:11 am

My first set of the XPath/XQuery/FO/XDM 3.1 Definitions, deduped, sorted, along with some comments is now online!

XPath, XQuery, XQuery and XPath Functions and Operators, XDM – 3.1 – Sorted Definitions Draft

Let me emphasize this draft is incomplete and more comments are needed on the varying definitions.

I have included all definitions, including those that are unique or uniform. This should help with your review of those definitions as well.

I am continuing to work on this and other work products to assist in your review of these drafts.

Reminder: Tentative deadline for comments at the W3C is 13 February 2015.

January 18, 2015

TinkerPop is moving to Apache (Incubator)

Filed under: Graphs,TinkerPop — Patrick Durusau @ 9:10 pm

TinkerPop is moving to Apache (Incubator) by Marko A. Rodriguez.

From the post:

Over the last (almost) year, we have been working to get TinkerPop into a recognized software foundation — with our eyes primarily on The Apache Software Foundation. This morning, the voting was complete and TinkerPop will become an Apache Incubator project on Tuesday January 16th.

The primary intention of this move to Apache was to:

  1. Further guarantee vendor neutrality and vendor uptake.
  2. Better secure our developers and users legally.
  3. Grow our developer and user base.

I hope people see this as a positive and will bear with us as we go through the process of migrating our infrastructure over the month of February. Note that we will be doing our 3.0.0.M7 release on Monday (Jan 15th) with it being the last TinkerPop release. The next one (M8 or GA) will be an Apache release. Finally, note that we will be keeping this mailing list with a mirror being on Apache’s servers (that was a hard won battle :).

Take care and thank you for using of our software, The TinkerPop.

http://markorodriguez.com

[Image: TinkerPop/Apache graphic]

So long as Marko keeps doing cool graphics, it’s fine by me. 😉

More seriously, increasing visibility can’t help but drive TinkerPop to new heights. Or for graph software, would that be to new connections?

Learn Statistics and R online from Harvard

Filed under: R — Patrick Durusau @ 8:59 pm

Learn Statistics and R online from Harvard by David Smith.

Starts January 19 (tomorrow)

From the post:

Harvard University is offering a free 5-week on-line course on Statistics and R for the Life Sciences on the edX platform. The course promises you will learn the basics of statistical inference and the basics of using R scripts to conduct reproducible research. You’ll just need a background in basic math and programming to follow along and complete homework in the R language.

As a new course, I haven’t seen any of the content, but the presenters Rafael Irizarry and Michael Love are active contributors to the Bioconductor project, so it should be good. The course begins January 19 and registration is open through 27 April at the link below.

edX: Statistics and R for the Life Sciences

Apologies for the late notice!

Have you given any thought to an R for Voters course? Statistics using R on public data focused on current political issues? Something to think about. The talking heads on TV are already vetting possible candidates for 2016.

January 17, 2015

Obama backs call for tech backdoors [Government Frontdoors?]

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 8:37 pm

Obama backs call for tech backdoors

From the post:

President Obama wants a backdoor to track people’s social media messages.

The president on Friday came to the defense of British Prime Minister David Cameron’s call for tech companies to create holes in their technology to allow the government to track suspected terrorists or criminals.

“Social media and the Internet is the primary way in which these terrorist organizations are communicating,” Obama said during a press conference with Cameron on Friday.

“That’s not different from anybody else, but they’re good at it and when we have the ability to track that in a way that is legal, conforms with due process, rule of law and presents oversight, then that’s a capability that we have to preserve,” he said.

While Obama measured his comments, he voiced support for the views expressed by Cameron and FBI Director James Comey, who have worried about tech companies’ increasing trends towards building digital walls around users’ data that no one but them can access.

Rather than argue about tech backdoors someday, why not have government frontdoors?

ISPs can copy and direct all email traffic to and from .gov addresses to a big inbox on one of the cloud providers, so that the public can keep a closer eye on the activities of “our” government. Think of it as citizen oversight.

Surely no sensitive information about citizens finds its way into government email so we won’t need any filtering.

Petition your elected representatives for a government frontdoor, for federal, state and local governments. As taxpayers we own the accounts, just like a private employer owns its employees’ accounts. The owners of those accounts want access to them.

Now that would be open data that could make a real difference!

I first saw this in a tweet by Violet Blue.

PS: We also need phone records for office and cell phones of all government employees. Signals data I think they call it.

Bulk Collection of Signals Intelligence: Technical Options (2015)

Filed under: Intelligence,NSA — Patrick Durusau @ 8:07 pm

Bulk Collection of Signals Intelligence: Technical Options (2015)

Description:

The Bulk Collection of Signals Intelligence: Technical Options study is a result of an activity called for in Presidential Policy Directive 28, issued by President Obama in January 2014, to evaluate U.S. signals intelligence practices. The directive instructed the Office of the Director of National Intelligence (ODNI) to produce a report within one year “assessing the feasibility of creating software that would allow the intelligence community more easily to conduct targeted information acquisition rather than bulk collection.” ODNI asked the National Research Council (NRC) — the operating arm of the National Academy of Sciences and National Academy of Engineering — to conduct a study, which began in June 2014, to assist in preparing a response to the President. Over the ensuing months, a committee of experts appointed by the Research Council produced the report.

Believe it or not, you can’t copy-n-paste from the pre-publication PDF file. Truly irritating.

From the report:

Conclusion 1. There is no software technique that will fully substitute for bulk collection where it is relied on to answer queries about the past after new targets become known.

A key value of bulk collection is its record of past signals intelligence that may be relevant to subsequent investigations. If past events become interesting in the present, because intelligence-gathering priorities change to include detection of new kinds of threats or because of new events such as the discovery that an individual is a terrorist, historical events and the context they provide will be available for analysis only if they were previously collected. (Emphasis in the original)

The report dodges any questions about effectiveness or appropriateness of bulk collection of signals data. However, its number one conclusion provides all the ammunition one needs to establish that bulk signals intelligence gathering is a clear and present danger to the American people and any semblance of a democratic government.

Would deciding that all Muslims from the Middle East represented potential terrorist threats to the United States qualify as a change in intelligence-gathering priorities? So all the bulk signals data from Muslims and their contacts in the United States suddenly becomes fair game for the NSA to investigate?

I don’t think any practicing Muslim is a threat to any government, but you saw how quickly the French backslid into bigotry after Charlie Hebdo. Maybe they didn’t have that far to go. Not any further than large segments of the U.S. population.

Our National Research Council is too timid to voice an opinion other than to say that if you don’t preserve signals records you can’t consult them in the future. But whether there is any danger, or whether this is a good policy choice, they aren’t up for those questions.

The focus on signals intelligence makes you wonder how local and state police have operated all these years without bulk signals intelligence. How have they survived without it? Well, for one thing they are out in the communities they serve, not cooped up in cube farms with other people who don’t have any experience with the communities in question. Simply being a member of the community makes them aware of newcomers, changes in local activity, etc.

Traditional law enforcement doesn’t stop crime as a general rule because that would require too much surveillance and resources to be feasible. When a crime has been committed, law enforcement gathers evidence and in a very large (90%+) number of cases, captures the people responsible.

Which is an interesting parallel to the NSA, which has also not stopped any terrorist plots as far as anyone knows. Well, there was that case in the State of Georgia where two aging alcoholics were boasting about producing ricin and driving down I-285 throwing it out the window. The government got a convicted child molester to work as an informant to put those two very dangerous terrorists in jail. And I don’t think the NSA was in on that one anyway.

If the NSA has stopped a major terrorist plot, something that actually was going to be another 9/11, you know it would have been leaked long before now. The absence of such leaks is the best evidence for the lack of any viable terrorist threats in the United States that I can think of.

And what if we stop bulk signals data collection and there is another terrorist attack? So, what is your question? Bulk signals collection hasn’t stopped one so far so if we stop bulk signals collection and there is another terrorist attack, look at all the money we will have saved for the same result. Just as a policy matter, we shouldn’t spend money for no measurable result.

If you really think terrorism is a threat, take the money from bulk signal data collection and fund state and local police hiring, training and paying (long term, not just a grant) more local police officers out in their communities. That will do more to reduce the potential for all types of crimes, including those labeled as terrorism.

To put it another way, bulk signal data collection is a form of wealth sharing, wealth sharing from the public treasury to contractors. Wealth sharing that has been shown to be ineffectual against terrorism. Why continue it?

Facebook open sources tools for bigger, faster deep learning models

Filed under: Artificial Intelligence,Deep Learning,Facebook,Machine Learning — Patrick Durusau @ 6:55 pm

Facebook open sources tools for bigger, faster deep learning models by Derrick Harris.

From the post:

Facebook on Friday open sourced a handful of software libraries that it claims will help users build bigger, faster deep learning models than existing tools allow.

The libraries, which Facebook is calling modules, are alternatives for the default ones in a popular machine learning development environment called Torch, and are optimized to run on Nvidia graphics processing units. Among the modules are those designed to rapidly speed up training for large computer vision systems (nearly 24 times, in some cases), to train systems on potentially millions of different classes (e.g., predicting whether a word will appear across a large number of documents, or whether a picture was taken in any city anywhere), and an optimized method for building language models and word embeddings (e.g., knowing how different words are related to each other).

“‘[T]here is no way you can use anything existing” to achieve some of these results, said Soumith Chintala, an engineer with Facebook Artificial Intelligence Research.

How very awesome! Keeping abreast of the latest releases and papers on deep learning is turning out to be a real chore. Enjoyable, but a time sink nonetheless.

Derrick’s post and the release from Facebook have more details.

Apologies for the “lite” posting today but I have been proofing related specifications where one defines a term and the other uses the term, but doesn’t cite the other specification’s definition or give its own. Do those mean the same thing? Probably, but users outside the process may or may not realize that. Particularly in translation.

I first saw this in a tweet by Kirk Borne.

January 16, 2015

Humanities Open Book: Unlocking Great Books

Filed under: Funding,Government,Humanities — Patrick Durusau @ 8:38 pm

Humanities Open Book: Unlocking Great Books

Deadline: June 10, 2015

A new joint grant program by the National Endowment for the Humanities (NEH) and the Andrew W. Mellon Foundation seeks to give a second life to outstanding out-of-print books in the humanities by turning them into freely accessible e-books.

Over the past 100 years, tens of thousands of academic books have been published in the humanities, including many remarkable works on history, literature, philosophy, art, music, law, and the history and philosophy of science. But the majority of these books are currently out of print and largely out of reach for teachers, students, and the public. The Humanities Open Book pilot grant program aims to “unlock” these books by republishing them as high-quality electronic books that anyone in the world can download and read on computers, tablets, or mobile phones at no charge.

The National Endowment for the Humanities (NEH) and the Andrew W. Mellon Foundation are the two largest funders of humanities research in the United States. Working together, NEH and Mellon will give grants to publishers to identify great humanities books, secure all appropriate rights, and make them available for free, forever, under a Creative Commons license.

The new Humanities Open Book grant program is part of the National Endowment for the Humanities’ agency-wide initiative The Common Good: The Humanities in the Public Square, which seeks to demonstrate and enhance the role and significance of the humanities and humanities scholarship in public life.

“The large number of valuable scholarly books in the humanities that have fallen out of print in recent decades represents a huge untapped resource,” said NEH Chairman William Adams. “By placing these works into the hands of the public we hope that the Humanities Open Book program will widen access to the important ideas and information they contain and inspire readers, teachers and students to use these books in exciting new ways.”

“Scholars in the humanities are making increasing use of digital media to access evidence, produce new scholarship, and reach audiences that increasingly rely on such media for information to understand and interpret the world in which they live,” said Earl Lewis, President of the Andrew W. Mellon Foundation. “The Andrew W. Mellon Foundation is delighted to join NEH in helping university presses give new digital life to enduring works of scholarship that are presently unavailable to new generations of students, scholars, and general readers.”

The National Endowment for the Humanities and the Andrew W. Mellon Foundation will jointly provide $1 million to convert out-of-print books into EPUB e-books with a Creative Commons (CC) license, ensuring that the books are freely downloadable with searchable texts and in formats that are compatible with any e-reading device. Books proposed under the Humanities Open Book program must be of demonstrable intellectual significance and broad interest to current readers.

Application guidelines and a list of F.A.Q’s for the Humanities Open Book program are available online at www.NEH.gov. The application deadline for the first cycle of Humanities Open Book grants is June 10, 2015.

What great news to start a weekend!

If you decide to apply, remember that topic maps can support indexes for a single book, across books, or across books and other material. You could make a classic work in the humanities into a portal that opens onto work prior to its publication, at the time of its publication, or since. Something to set you apart from simply making the text available.

Key Court Victory Closer for IRS Open-Records Activist

Filed under: Government Data,Open Data — Patrick Durusau @ 8:12 pm

Key Court Victory Closer for IRS Open-Records Activist by Suzanne Perry.

From the post:

The open-records activist Carl Malamud has moved a step closer to winning his legal battle to give the public greater access to the wealth of information on Form 990 tax returns that nonprofits file.

During a hearing in San Francisco on Wednesday, U.S. District Judge William Orrick said he tentatively planned to rule in favor of Mr. Malamud’s group, Public. Resource. Org, which filed a lawsuit to force the Internal Revenue Service to release nonprofit tax forms in a format that computers can read. That would make it easier to conduct online searches for data about organizations’ finances, governance, and programs.

“It looks like a win for Public. Resource and for the people who care about electronic access to public documents,” said Thomas Burke, the group’s lawyer.

The suit asks the IRS to release Forms 990 in machine-readable format for nine nonprofits that had submitted their forms electronically. Under current practice, the IRS converts all Forms 990 to unsearchable image files, even those that have been filed electronically.

That’s a step in the right direction but not all that will be required.

Suzanne goes on to note that the IRS removes donor lists from the 990 forms.

Any number of organizations will object but I think the donor lists should be public information as well.

Making all donors public may discourage some people from donating to unpopular causes but that’s a hit I would be willing to take to know who owns the political non-profits. And/or who funds the NRA for example.

Data that isn’t open enough to know who is calling the shots at organizations isn’t open data, it’s an open data tease.

The golden ratio has spawned a beautiful new curve: the Harriss spiral

Filed under: Data Mining,Fractals — Patrick Durusau @ 7:51 pm

The golden ratio has spawned a beautiful new curve: the Harriss spiral by Alex Bellos.

[Image: Harriss spiral]

Yes, a new fractal!

See Alex’s post for the details.

The important lesson is this fractal has been patiently waiting to be discovered. What patterns are waiting to be discovered in your data?

I first saw this in a tweet by Lars Marius Garshol.

What Counts: Harnessing Data for America’s Communities

Filed under: Data Management,Finance Services,Government,Government Data,Politics — Patrick Durusau @ 5:44 pm

What Counts: Harnessing Data for America’s Communities Senior Editors: Naomi Cytron, Kathryn L.S. Pettit, & G. Thomas Kingsley. (new book, free pdf)

From: A Roadmap: How To Use This Book

This book is a response to the explosive interest in and availability of data, especially for improving America’s communities. It is designed to be useful to practitioners, policymakers, funders, and the data intermediaries and other technical experts who help transform all types of data into useful information. Some of the essays—which draw on experts from community development, population health, education, finance, law, and information systems—address high-level systems-change work. Others are immensely practical, and come close to explaining “how to.” All discuss the incredibly exciting opportunities and challenges that our ever-increasing ability to access and analyze data provide.

As the book’s editors, we of course believe everyone interested in improving outcomes for low-income communities would benefit from reading every essay. But we’re also realists, and know the demands of the day-to-day work of advancing opportunity and promoting well-being for disadvantaged populations. With that in mind, we are providing this roadmap to enable readers with different needs to start with the essays most likely to be of interest to them.

For everyone, but especially those who are relatively new to understanding the promise of today’s data for communities, the opening essay is a useful summary and primer. Similarly, the final essay provides both a synthesis of the book’s primary themes and a focus on the systems challenges ahead.

Section 2, Transforming Data into Policy-Relevant Information (Data for Policy), offers a glimpse into the array of data tools and approaches that advocates, planners, investors, developers and others are currently using to inform and shape local and regional processes.

Section 3, Enhancing Data Access and Transparency (Access and Transparency), should catch the eye of those whose interests are in expanding the range of data that is commonly within reach and finding ways to link data across multiple policy and program domains, all while ensuring that privacy and security are respected.

Section 4, Strengthening the Validity and Use of Data (Strengthening Validity), will be particularly provocative for those concerned about building the capacity of practitioners and policymakers to employ appropriate data for understanding and shaping community change.

The essays in section 5, Adopting More Strategic Practices (Strategic Practices), examine the roles that practitioners, funders, and policymakers all have in improving the ways we capture the multi-faceted nature of community change, communicate about the outcomes and value of our work, and influence policy at the national level.

There are of course interconnections among the essays in each section. We hope that wherever you start reading, you’ll be inspired to dig deeper into the book’s enormous richness, and will join us in an ongoing conversation about how to employ the ideas in this volume to advance policy and practice.

Thirty-one (31) essays by dozens of authors on data and its role in public policy making.

From the acknowledgements:

This book is a joint project of the Federal Reserve Bank of San Francisco and the Urban Institute. The Robert Wood Johnson Foundation provided the Urban Institute with a grant to cover the costs of staff and research that were essential to this project. We also benefited from the field-building work on data from Robert Wood Johnson grantees, many of whom are authors in this volume.

If you are pitching data and/or data projects where the Federal Reserve Bank of San Francisco/Urban Institute set the tone of policy-making conversations, this is a must read. It is likely to have an impact on other policy discussions as well, adjusted for local concerns and conventions. You could also use it to shape your local policy discussions.

I first saw this in There is no seamless link between data and transparency by Jennifer Tankard.

58 XML Specs Led the Big Parade!

Filed under: Standards,XML — Patrick Durusau @ 5:01 pm

Earlier this week I ferreted out most of the current XML specifications from the W3C site. I say “most” because I didn’t take the time to run down XML “related” standards such as SVG, etc. At some point I will spend the time to track down all the drafts, prior versions, and related materials.

But, for today, I have packaged up the fifty-eight (58) current XML standards in 58XMLRecs.tar.gz.

BTW, do realize that Extensible Stylesheet Language (XSL) Version 1.0 and XHTML™ Modularization 1.1 – Second Edition have table of contents only versions. I included the full HTML file versions in the package.

You can use grep or other search utilities to search prior XML work for definitions, productions, etc.
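If grep is not to hand, a few lines of Python over the unpacked files do the same job. This sketch looks for anchors whose name starts with “dt”, the definition convention used in the XPath/XQuery family of drafts (see the definition-extraction posts below); as noted there, markup varies from spec to spec, so the pattern may need adjusting.

import glob
import re
import sys

# Anchors of the form <a name="dt-..."> mark definitions in the
# XPath/XQuery family of drafts; other specs may use different markup.
DT_ANCHOR = re.compile(r'<a\s[^>]*name="(dt-[^"]+)"', re.IGNORECASE)

def scan(directory):
    for path in sorted(glob.glob(directory + "/*.html")):
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
        for match in DT_ANCHOR.finditer(text):
            print(path, match.group(1))

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else ".")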

Do you remember the compilation of XML standards that used the old MS Help application? The file format was a variation on RTF. Ring any bells? Anything like that available now?

Unicode 7.0 Core Specification (paperback)

Filed under: Unicode — Patrick Durusau @ 3:58 pm

[Image: The Unicode Standard, Version 7.0, core specification in two paperback volumes]

The Unicode 7.0 core specification is now available in paperback book form.

Responding to requests, the editorial committee has created a pair of modestly-priced print-on-demand volumes that contain the complete text of the core specification of Version 7.0 of the Unicode Standard.

The form-factor in this edition has been changed from US letter to 6×9 inch US trade paperback size, making the two volumes more compact than previous versions. The two volumes may be purchased separately or together. The cost for the pair is US$16.27, plus postage and applicable taxes. Please visit http://www.lulu.com/spotlight/unicode to order.

Note that these volumes do not include the Version 7.0 code charts, nor do they include the Version 7.0 Standard Annexes and Unicode Character Database, all of which are available only on the Unicode website, http://www.unicode.org/versions/Unicode7.0.0/.

Even with the aggressive pricing, I don’t see this getting onto the best seller list. 😉

It should be on the best seller list! The current version is the result of decades of work by Consortium staff and many volunteers.

Enjoy!

PS: Blog about this at your site and/or forward to your favorite mailing list. Typographers, programmers, editors and the computer literate should have a basic working knowledge of Unicode.

January 15, 2015

Computer Fraud and Abuse Act (“CFAA”) (Update)

Filed under: Cybersecurity,Law,Security — Patrick Durusau @ 8:05 pm

Obama’s proposed changes to the computer hacking statute: A deep dive by Orin Kerr.

From the post:

As part of the State of the Union rollout, President Obama has announced several new legislative proposals involving cybersecurity. One of the proposals is a set of amendments to the controversial Computer Fraud and Abuse Act (“CFAA”), the federal computer hacking statute. This post takes a close look at the main CFAA proposal. It starts with a summary of existing law; it then considers how the Administration’s proposal would change the law; and it concludes with my views on whether Congress should enact the changes.

My bottom line: My views are somewhat mixed, but on the whole I’m skeptical of the Administration’s proposal. On the downside, the proposal would make some punishments too severe, and it could expand liability in some undesirable ways. On the upside, there are some notable compromises in the Administration’s position. They’re giving up more than they would have a few years ago, and there are some promising ideas in there. If the House or Senate Judiciary Committees decides to work with this proposal, there’s room for a more promising approach if some language gets much-needed attention. On the other hand, if Congress does nothing with this proposal and just sits on it, letting the courts struggle with the current language, that wouldn’t necessarily be a bad thing.

Just as general awareness you need to read Orin’s take on the proposed amendments.

What I find disturbing from Orin’s analysis is that the administration has been pushing for increased punishments in this area for years.

I find that troubling because the purpose of longer sentences isn’t to put a guilty person in jail longer. Longer sentences enable you to threaten an innocent person with a long time in jail (does Aaron Swartz come to mind?).

You can force an innocent person to inform on their friends, to plead guilty to charges they aren’t guilty of, or to work undercover to entrap others.

No, I don’t think increased sentences are as benign as Orin seems to think.

Increased sentences are key to more overreaching by federal prosecutors in marginal cases.

If you had the alternatives of a guarantee of ninety-nine (99) years in jail versus entrapping someone in a crime as an informant, which one would you take?

The administration isn’t asking for that much of an increase here but you really don’t want to do federal time outside of one of the club feds for the rich doctors/lawyers that cheat on their taxes.

Data analysis to the MAX()

Filed under: Excel,Spreadsheets — Patrick Durusau @ 7:45 pm

Data analysis to the MAX() by Felienne Hermans

School: DelftX

Course Code: EX101x

Classes Start: 6 Apr 2015

Course Length: 8 weeks

Estimated effort: 4-6 hours per week

From the webpage:

EX101x is for all of those struggling with data analysis. That crazy spreadsheet from your boss? Megabytes of sensor data to analyze? Looking for a smart way visualize your data in order to make sense out of it? We’ve got you covered!

Using video lectures and hands-on exercises, we will teach you techniques and best practices that will boost your data analysis skills.

We will take a deep dive into data analysis with spreadsheets: PivotTables, VLOOKUPS, Named ranges, what-if analyses, making great graphs – all those will be covered in the first weeks of the course. After, we will investigate the quality of the spreadsheet model, and especially how to make sure your spreadsheet remains error-free and robust.

Finally, once we have mastered spreadsheets, we will demonstrate other ways to store and analyze data. We will also look into how Python, a programming language, can help us with analyzing and manipulating data in spreadsheets.

EX101x will be created using Excel 2013, but the course can be followed using another spreadsheet program as well.

The goal of this course is it to help you to overcome data analysis challenges in your work, research or studies. Therefore we encourage you to participate actively and to raise real data analysis problems that you face in our discussion forums.

Want to stay up to date with the latest news on EX101x and behind the scenes footage? We have a Twitter account -> @EX101x

If your boss is a spreadsheet user (most of them are), imagine being able to say that not only does Mahout say this is the answer, but so does their spreadsheet. 😉

I have never seen Felienne speak in person but I have seen enough videos to know she is a dynamic speaker. Enjoy!

0.1 Most important button in excel

Filed under: Excel — Patrick Durusau @ 7:33 pm

Felienne Hermans demonstrates the most important button in Excel. Hint, it helps you to understand formulas written by others.

After watching the video (1:20), can you answer this question:

Where is the debug button on your favorite software?

If there isn’t one, can you imagine it with one?

Bob DuCharme’s Treasure Trove

Filed under: XML — Patrick Durusau @ 7:16 pm

Bob DuCharme’s Treasure Trove

OK, Bob’s name for it is:

My involvement with RDF and XML technology:

So I took the liberty of spicing it up a bit! 😉

Seriously, I was trying to recall some half-remembered doggerel about xsl:sort when I stumbled (literally) over this treasure trove of Bob’s writing on markup technologies.

I have been a fan of Bob’s writing since his SGML CD book. You probably want something more recent for XML, etc., but it was a great book.

Whether you have a specific issue or just want to browse some literate writing on markup languages, this collection is a good place to start.

Enjoy!

Draft Sorted Definitions for XPath 3.1

Filed under: Standards,XPath — Patrick Durusau @ 7:02 pm

I have uploaded a draft of sorted definitions for XPath 3.1. See: http://www.durusau.net/publications/xpath-alldefs-sorted.html

I ran across an issue you may encounter in the future with W3C documents in general and these drafts in particular.

While attempting to sort on the title attribute of the a elements that mark each definition, I got the following error:

A sequence of more than one item is not allowed as the @select attribute of xsl:sort

Really?

The stylesheet was working with a subset of the items but not when I added more items to it.

Turns out one of the items I added reads:

<p>[<a name="dt-focus" id="dt-focus" title="focus" shape="rect">Definition</a>: The first three components of the <a title="dynamic context" href="#dt-dynamic-context" shape="rect">dynamic context</a> (context item, context position, and context size) are called the <b>focus</b> of the expression. ] The focus enables the processor to keep track of which items are being processed by the expression. If any component in the focus is defined, all components of the focus are defined.</p>

Ouch! The title attribute on the second a element was stepping into my sort select.

The solution:

<xsl:sort select="a[position()=1]/@title" data-type="text"/>

As we have seen already, markup in W3C specifications varies from author to author so a fixed set of stylesheets may or may not be helpful. Some XSLT snippets on the other hand are likely to turn out to be quite useful.

One of the requirements for the master deduped and sorted definitions is that I want to know the origin(s) of all the definitions. That is, if a definition occurs only in XQuery, I want to know that, as well as if it occurs only in XPath and XQuery, and so on.

Still thinking about the best way to make that easy to replicate. Mostly because you are going to encounter definition issues in any standard you proof.
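For what it is worth, one possible way to make the origin tracking replicable is to key each definition’s normalized text to the set of drafts it appears in. A minimal sketch, assuming per-spec definition files like those extracted above (the file names are placeholders):

import re
from collections import defaultdict

# Placeholder file names; substitute the extracted definition lists.
SOURCES = {
    "XPath 3.1": "xpath-31-defs.html",
    "XQuery 3.1": "xquery-31-defs.html",
    "F&O 3.1": "functions-31-defs.html",
}

TAGS = re.compile(r"<[^>]+>")

def normalize(paragraph):
    # Strip markup and collapse whitespace so identical definitions compare equal.
    return " ".join(TAGS.sub("", paragraph).split())

origins = defaultdict(set)
for spec, path in SOURCES.items():
    with open(path, encoding="utf-8") as f:
        for para in re.findall(r"<p>.*?</p>", f.read(), flags=re.DOTALL):
            origins[normalize(para)].add(spec)

for definition in sorted(origins):
    print(sorted(origins[definition]), definition[:80])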

Corrected Definitions Lists for XPath/XQuery/etc.

Filed under: Standards,XML,XPath,XQuery — Patrick Durusau @ 3:01 pm

In my extraction of the definitions yesterday I produced files that had HTML <p> elements embedded in other HTML <p> elements.

The corrected files are as follows:

These lists are unsorted and the paragraphs with multiple definitions are repeated for each definition. Helps me spot where I have multiple definitions that may be followed by non-normative prose, applicable to one or more definitions.

The XSLT code I used yesterday was incorrect:

<xsl:for-each select="//p/a[contains(@name, 'dt')]">
<p>
<xsl:copy-of select="ancestor::p"/>
</p>
</xsl:for-each>

And results in:

<p>
<p>[<a name="dt-expression-context" id="dt-expression-context" title="expression context" shape="rect">Definition</a>: The <b>expression
context</b> for a given expression consists of all the information
that can affect the result of the expression.]
</p>
</p>

Which is both ugly and incorrect.

When using xsl:copy-of for a p element, the surrounding p elements were unnecessary.

Thus (correctly):

<xsl:for-each select="//p/a[contains(@name, 'dt')]">
<xsl:copy-of select="ancestor::p"/>
</xsl:for-each>

I reproduced the corrected definition files above. Apologies for any inconvenience.

Work continues on the sorting and deduping.
