Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 7, 2014

Twitter and the Arab Spring

Filed under: Social Media,Tweets — Patrick Durusau @ 6:30 pm

You may remember that “effective use of social media” was claimed as a hallmark of the Arab Spring. (The Arab Spring and the impact of social media and Opening Closed Regimes: What Was the Role of Social Media During the Arab Spring?)

When evaluating such claims remember that your experience with social media may or may not represent the experience with social media elsewhere.

For example, Citizen Engagement and Public Services in the Arab World: The Potential of Social Media from Mohammed Bin Rashid School of Government (2014) reports:

Figure 23: Egypt 22.4% Facebook User Penetration

Figure 34: Egypt 1.26% Twitter user penetration rate.

Those figures are as of 2014. Figures for prior years are smaller.

That doesn’t sound like the level of social media penetration necessary to create and then drive a social movement like the Arab Spring.

You can find additional datasets and additional information at: http://www.arabsocialmediareport.com. Registration is free.

And check out: Mohammed Bin Rashid School of Government

I first saw this in a tweet by Peter W. Singer.

Coding for Lawyers

Filed under: Law,Programming — Patrick Durusau @ 4:14 pm

Coding for Lawyers by V. David Zvenyach.

From the FAQ:

What? Lawyers and Coding?

It’s true. Lawyers can code. In fact, if you’re a lawyer, the truth is that it’s easier than you think. I am a lawyer, and a coder. In the course of two years, I have gone from knowing essentially nothing to being a decent coder in several languages. This book is intended to drastically shorten that time for others who, like me, decide that they want to learn to code.

You have heard about all the public access to law projects that are encouraging people to determine their own legal rights by reading primary legal texts.

Now we have a lawyer who is striking back at the technical elite by teaching lawyers to code.

Turnabout is fair play I suppose. 😉

I suspect that professions, like lawyers, have learning experiences and styles that are not common to all groups. For example, the first chapter in this book starts off with regexes and uses case citations as an example for a regex. I rather doubt most introductory computer books would take that approach. But to a lawyer, comprehension is immediate and obvious. The terminology has changed but a lawyer knows instinctively how to parse such expressions.
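
For instance, a first cut at a case-citation regex (a hypothetical sketch of mine in Python, not taken from the book; real citation formats are far messier) might look like:

    # A toy case-citation matcher (hypothetical; real reporters and
    # citation formats are far more varied than this pattern allows).
    import re

    # Matches citations like "410 U.S. 113" or "347 F.2d 394":
    # volume number, reporter abbreviation, first page.
    CITATION = re.compile(r"\b(\d+)\s+(U\.S\.|F\.\d?d?|S\.\s?Ct\.)\s+(\d+)\b")

    text = "See Roe v. Wade, 410 U.S. 113 (1973), and 347 F.2d 394."
    for volume, reporter, page in CITATION.findall(text):
        print(volume, reporter, page)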

If you are a lawyer or know any lawyers, this is a project to follow, in part simply to learn coding but also to see how one approach uses domain-specific examples to teach coding.

I first saw this in a tweet by Adam Ziegler.

Matrix Methods & Applications (DRAFT)

Filed under: Mathematics,Matrix — Patrick Durusau @ 3:55 pm

Matrix Methods & Applications (DRAFT)

Stephen Boyd (Stanford) and Lieven Vandenberghe advise:

The textbook is still under very active development by Lieven Vandenberghe and Stephen Boyd, so be sure to download the newest version often. For now, we’ve posted a rough draft that does not include the exercises (which we’ll be adding). The first few chapters are in reasonable shape, but later ones are quite incomplete.

The 10 August 2014 draft has one hundred and twenty-two (122) pages so you can assume more material is coming.

I particularly like the “practical” suggested use cases.

The use cases create opportunities to illustrate the impact of data on supposedly “neutral” algorithms. Deeper knowledge of these algorithms will alert you to potential gaming of the data that lies behind “neutral” processing.
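
To make the point concrete, here is a small sketch (my own illustration, not from the draft) of how a single planted data point can swing a supposedly neutral least-squares fit:

    # One planted outlier swings an ordinary least-squares fit
    # (my illustration of "gaming the data", not an example from the draft).
    import numpy as np

    x = np.arange(10, dtype=float)
    y = 2.0 * x + 1.0                  # clean linear data: slope 2

    slope_clean = np.polyfit(x, y, 1)[0]

    y_gamed = y.copy()
    y_gamed[-1] += 50.0                # one "gamed" observation

    slope_gamed = np.polyfit(x, y_gamed, 1)[0]
    print(slope_clean, slope_gamed)    # ~2.0 vs. roughly 4.7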

Inspection of data is the equivalent of Mannie’s grandfather’s second rule: “Always cut cards.” (The Moon Is A Harsh Mistress)

Anyone who objects to inspection of data is hiding something. It may be their own incompetence with the data but you won’t know unless you inspect the data.

Results + algorithms + code + data = Maybe we will agree after inspection.

I first saw this in a tweet by fastml extra.

Computational Linguistics [09-2014]

Filed under: Computational Linguistics,Linguistics — Patrick Durusau @ 3:27 pm

Chris Callison-Burch tweets that Volume 40, Issue 3 (September 2014) of Computational Linguistics is now available in the ACL Anthology!

In the September issue:

J14-3001: Montserrat Marimon; Núria Bel; Lluís Padró
Squibs: Automatic Selection of HPSG-Parsed Sentences for Treebank Construction

J14-3002: Jürgen Wedekind
Squibs: On the Universal Generation Problem for Unification Grammars

J14-3003: Ahmed Hassan; Amjad Abu-Jbara; Wanchen Lu; Dragomir Radev
A Random Walk–Based Model for Identifying Semantic Orientation

J14-3004: Xu Sun; Wenjie Li; Houfeng Wang; Qin Lu
Feature-Frequency–Adaptive On-line Training for Fast and Accurate Natural Language Processing

J14-3005: Diarmuid Ó Séaghdha; Anna Korhonen
Probabilistic Distributional Semantics with Latent Variable Models

J14-3006: Joel Lang; Mirella Lapata
Similarity-Driven Semantic Role Induction via Graph Partitioning

J14-3007: Linlin Li; Ivan Titov; Caroline Sporleder
Improved Estimation of Entropy for Evaluation of Word Sense Induction

J14-3008: Cyril Allauzen; Bill Byrne; Adrià de Gispert; Gonzalo Iglesias; Michael Riley
Pushdown Automata in Statistical Machine Translation

J14-3009: Dan Jurafsky
Obituary: Charles J. Fillmore

All issues of 2014.

September 6, 2014

Jellyfish

Filed under: Duke,Levenshtein Distance,String Matching — Patrick Durusau @ 7:06 pm

Jellyfish by James Turk and Michael Stephens.

From the webpage:

Jellyfish is a Python library for doing approximate and phonetic matching of strings.

String comparison:

  • Levenshtein Distance
  • Damerau-Levenshtein Distance
  • Jaro Distance
  • Jaro-Winkler Distance
  • Match Rating Approach Comparison
  • Hamming Distance

Phonetic encoding:

  • American Soundex
  • Metaphone
  • NYSIIS (New York State Identification and Intelligence System)
  • Match Rating Codex
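
A minimal taste of the library (function names per its documentation around this release; later versions of Jellyfish rename some of these):

    # A quick taste of Jellyfish (function names as documented circa 2014;
    # newer releases rename some, e.g. jaro_winkler -> jaro_winkler_similarity).
    import jellyfish

    # String comparison: distances and similarity scores.
    print(jellyfish.levenshtein_distance("jellyfish", "smellyfish"))  # 2
    print(jellyfish.jaro_winkler("dixon", "dicksonx"))                # ~0.81

    # Phonetic encoding: names that sound alike encode alike.
    print(jellyfish.soundex("Jellyfish"))    # J412
    print(jellyfish.metaphone("Jellyfish"))  # JLFX
    print(jellyfish.nysiis("Jellyfish"))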

You might want to consider the string matching offered by Duke (written on top of Lucene):

String comparators

  • Levenshtein
  • WeightedLevenshtein
  • JaroWinkler
  • QGramComparator

Simple comparators

  • ExactComparator
  • DifferentComparator

Specialized comparators

  • GeopositionComparator
  • NumericComparator
  • PersonNameComparator

Phonetic comparators

  • SoundexComparator
  • MetaphoneComparator
  • NorphoneComparator

Token set comparators

  • DiceCoefficientComparator
  • JaccardIndexComparator

Enjoy!

Elasticsearch: The Definitive Guide

Filed under: ElasticSearch,Lucene — Patrick Durusau @ 6:52 pm

Elasticsearch: The Definitive Guide by Clinton Gormley and Zachary Tong.

From “why we wrote this book:”

We wrote this book because Elasticsearch needs a narrative. The existing reference documentation is excellent… as long as you know what you are looking for. It assumes that you are intimately familiar with information retrieval concepts, distributed systems, the query DSL and a host of other topics.

This book makes no such assumptions. It has been written so that a complete beginner — to both search and distributed systems — can pick it up and start building a prototype within a few chapters.

We have taken a problem based approach: this is the problem, how do I solve it, and what are the trade-offs of the alternative solutions? We start with the basics and each chapter builds on the preceding ones, providing practical examples and explaining the theory where necessary.

The existing reference documentation explains how to use features. We want this book to explain why and when to use various features.

An important guide/reference for Elasticsearch, but the “why” for this book is important as well.

Reference documentation is absolutely essential but so is documentation that eases the learning curve in order to promote adoption of software or a technology.

Read this both for Elasticsearch and as one model for writing a “why” and “when” book for other technologies.

Green Eggs…The New “Hello, World!”?

Filed under: Clojure — Patrick Durusau @ 6:36 pm

Green Eggs and Transducers by Carin Meier.

From the post:

A quick tour of Clojure Transducers with core.async with Dr. Seuss as a guide.

A quick guide to transducers in Clojure 1.7, using strings from Green Eggs and Ham.

Using the familiar (Green Eggs and Ham) to illustrate the new (transducers) is an idea that needs to catch on!

September 5, 2014

Clojure 1.7.0-alpha2

Filed under: Clojure — Patrick Durusau @ 7:55 pm

Clojure 1.7.0-alpha2 by Alex Miller.

From the post:

Clojure 1.7.0-alpha2 is now available.

Try it via

Download: http://central.maven.org/maven2/org/clojure/clojure/1.7.0-alpha2/
Download securely: https://repo1.maven.org/maven2/org/clojure/clojure/1.7.0-alpha2/
Leiningen: [org.clojure/clojure "1.7.0-alpha2"]

Highlights below, full change log here:
https://github.com/clojure/clojure/blob/master/changes.md

In case you want some excitement this weekend!

International Conference on Functional Programming 2014 – Update

Filed under: Functional Programming,Haskell — Patrick Durusau @ 4:28 pm

International Conference on Functional Programming 2014 – Papers and Videos by yallop.

Links to papers and videos for ICFP 2014.

As an added bonus, links to:

ICFP 2012

ICFP 2013

Haskell 2014 accepted papers

PLDI 2014 accepted papers

Just in time for the weekend!

I first saw this in a tweet by Alejandro Cabrera.

Named Data Networking – Privacy Or Property Protection?

Filed under: Cybersecurity,Network Security,Security — Patrick Durusau @ 2:11 pm

The Named Data Networking Consortium launched on September 3, 2014! (Important changes to be baked into Internet infrastructure. Read carefully.)

Huh? 😉

In case you haven’t heard:

Named Data Networking (NDN) is a Future Internet Architecture inspired by years of empirical research into network usage and a growing awareness of persistently unsolved problems of the current Internet (IP) architecture. Its central premise is that the Internet is primarily used as an information distribution network, a use that is not a good match for IP, and that the future Internet’s “thin waist” should be based on named data rather than numerically addressed hosts.

This project continues research on NDN started in 2010 under NSF’s FIA program. It applies the project team’s increasingly sophisticated understanding of NDN’s opportunities and challenges to two national priorities–Health IT and Cyberphysical Systems–to further the evolution of the architecture in the experimental, application-driven manner that proved successful in the first three years. In particular, our research agenda is organized to translate important results in architecture and security into library code that guides development for these environments and other key applications toward native NDN designs. It simultaneously continues fundamental research into the challenges of global scalability and broad opportunities for architectural innovation opened up by “simply” routing and forwarding data based on names.

Our research agenda includes: (1) Application design, exploring naming and application design patterns, support for rendezvous, discovery and bootstrapping, the role and design of in-network storage, and use of new data synchronization primitives; (2) Security and trustworthiness, providing basic building blocks of key management, trust management, and encryption-based access control for the new network, as well as anticipating and mitigating future security challenges faced in broad deployment; (3) Routing and forwarding strategy, developing and evaluating path-vector, link-state, and hyperbolic options for inter-domain routing, creating overall approaches to routing security and trust, as well as designing flexible forwarding and mobility support; (4) Scalable forwarding, aiming to support real-world deployment, evaluation and adoption via an operational, scalable forwarding platform; (5) Library and tool development, developing reference implementations for client APIs, trust and security, and new network primitives based on the team’s fundamental results, as well as supporting internal prototype development and external community efforts; (6) Social and economic impacts, considering the specific questions faced in our network environments as well as broader questions that arise in considering a “World on NDN.”

We choose Mobile Health and Enterprise Building Automation and Management Systems as specific instances of Health IT and Cyberphysical Systems to validate the architecture as well as drive new research. Domain experts for the former will be the Open mHealth team, a non-profit patient-centric ecosystem for mHealth, led by Deborah Estrin (Cornell) and Ida Sim (UCSF). For the latter, our experts will be UCLA Facilities Management, operators of the second largest Siemens building monitoring system on the West Coast. To guide our research on the security dimensions of these important environments and the NDN architecture more generally, we have convened a Security Advisory Council (NDN-SAC) to complement our own security and trust effort.

Intellectual Merit

The NDN architecture builds on lessons learned from the success of the IP architecture, preserving principles of the thin waist, hierarchical names, and the end-to-end principle. The design reflects a recognition of the major shift in the applications communication model: from the “where” (i.e., the host/location) to the “what” (i.e., the content). Architecting a communications infrastructure around this shift can radically simplify application designs to allow applications to communicate directly using the name of the content they desire and leave to the network to figure out how and where to retrieve it. NDN also recognizes that the biggest weakness in the current Internet architecture is lack of security, and incorporates a fundamental building block to improve security by requiring that all content be cryptographically signed.

Truly an impressive effort and one that will be exciting to watch!

You may want to start with: Named Data Networking: Motivation & Details as an introduction.

One of the features of NDN is that named data can be cached and delivered by a router separate from its point of origin. Any user can request the named data and the caching router only knows that it has been requested. Or in the words of the Motivation document:

Caching named data may raise privacy concerns. Today’s IP networks offer weak privacy protection. One can find out what is in an IP packet by inspecting the header or payload, and who requested the data by checking the destination address. NDN explicitly names the data, arguably making it easier for a network monitor to see what data is being requested. One may also be able to learn what data is requested through clever probing schemes to derive what is in the cache. However NDN removes entirely the information regarding who is requesting the data. Unless directly connected to the requesting host by a point-to-point link, a router will only know that someone has requested certain data, but will not know who originated the request. Thus the NDN architecture naturally offers privacy protection at a fundamentally different level than the current IP networks.

Which sounds attractive, until you notice that the earlier quote ends saying:

and incorporates a fundamental building block to improve security by requiring that all content be cryptographically signed (emphasis added)

If I am interpreting the current NDN statements correctly, routers will not accept or forward data packets that are not cryptographically signed.

Cryptographic signing of data packets, depending on how it is required and implemented, could eliminate anonymous hosting of data. Think about that for a moment. What data might not be made public if its transmission makes its originator identifiable?

Lots of spam no doubt, but also documents such as the recent Snowden leaks and other information flows embarrassing to governments.

Or to those who do not sow but seek to harvest, such as the RIAA.
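
To see the two properties side by side, here is a toy sketch (entirely my own illustration, nothing like the actual NDN packet format): requests carry only a data name, so the cache never learns who asked, but every published packet must carry a producer signature:

    # Toy model of NDN-style caching (my illustration, not the NDN protocol).
    # Requests name the data, not the requester; but all content is signed,
    # so the producer of the data is always identifiable.
    import hashlib

    def sign(content, producer_key):
        # Stand-in for a real cryptographic signature.
        return hashlib.sha256(producer_key.encode() + content).hexdigest()

    cache = {}  # name -> (content, signature)

    def publish(name, content, producer_key):
        cache[name] = (content, sign(content, producer_key))

    def request(name):
        # An Interest carries only the name; no requester identity here.
        entry = cache.get(name)
        return entry[0] if entry else None

    publish("/docs/report/2014", b"findings...", "producer-secret")
    print(request("/docs/report/2014"))  # served from cache, requester anonymous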

NDN is in the early stages, which is the best time to raise privacy, fair use and similar issues in its design.

September 4, 2014

T3TROS (ClojureScript)

Filed under: Clojure,ClojureScript,Functional Programming,Games — Patrick Durusau @ 6:00 pm

T3TROS (ClojureScript)

From the webpage:

We are re-creating Tetris™ in ClojureScript. We are mainly doing this for pleasure and to celebrate the 30th anniversary of its original release in 1984. Our remake will enable us to host a small, local tournament and to share a montage of the game’s history, with each level resembling a different version from its past. (We are working on the game at least once a week):

  • DevBlog 1 – data, collision, rotation, drawing
  • DevBlog 2 – basic piece control
  • DevBlog 3 – gravity, stack, collapse, hard-drop
  • DevBlog 4 – ghost piece, flash before collapse
  • DevBlog 5 – game over animation, score
  • DevBlog 6 – level speeds, fluid drop, improve collapse animation, etc.
  • DevBlog 7 – draw next piece, tilemap for themes
  • DevBlog 8 – allow connected tiles for richer graphics
  • DevBlog 9 – live board broadcasting
  • DevBlog 10 – chat room, more tilemaps, page layouts
  • DevBlog 11 – page routing, username

What could possibly go wrong with an addictive video game as the target of a programming exercise? 😉
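
For a taste of what “collision” in DevBlog 1 involves, here is a minimal board-collision test (a Python sketch of mine, not the T3TROS code):

    # Minimal Tetris-style collision test (a sketch of mine, not T3TROS code).
    # A piece is a list of (row, col) offsets; the board is a grid where
    # nonzero cells are already-settled blocks.
    EMPTY, FILLED = 0, 1

    def collides(board, piece, row, col):
        """True if the piece at (row, col) overlaps walls, floor, or stack."""
        rows, cols = len(board), len(board[0])
        for dr, dc in piece:
            r, c = row + dr, col + dc
            if c < 0 or c >= cols or r >= rows:      # side walls and floor
                return True
            if r >= 0 and board[r][c] != EMPTY:      # settled blocks
                return True
        return False

    S_PIECE = [(0, 1), (0, 2), (1, 0), (1, 1)]       # the "S" tetromino
    board = [[EMPTY] * 4 for _ in range(4)]
    board[3][1] = FILLED

    print(collides(board, S_PIECE, 2, 0))  # True: lands on the filled cell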

Shaun LeBron has posted Interactive Guide to Tetris in ClojureScript.

The interactive guide is very helpful!

Will echoes of Tetris™ tempt you into functional programming? What video classics will you produce?

Celebrity Nudes: Blaming and Shaming

Filed under: Privacy,Security — Patrick Durusau @ 10:58 am

Violet Blue, in Wake up: The celebrity nudes hack is everyone’s problem, follows her ten steps for victims to protect themselves with:

Telling victims that they “shouldn’t have done it” or “what did you expect” is pointless. Instead of blaming and shaming, how about some information people can really use to help them make the decisions that are right for them, and equipping them with tools to mitigate, minimize and even possibly avoid damage if something goes wrong?

Which is deeply ironic because both Violet Blue and a number of the comments blame/shame Apple for the security breach.

Blaming and shaming IT companies for security breaches is about as non-productive as any blaming and shaming can be.

As you probably know already, security breaches are not viewed as promotional opportunities, at least by the companies suffering the security breach.

Missing from most discussions of the hacked iCloud accounts are questions like:

  • How to improve iCloud security?
  • What improved security will cost?
  • Who will pay the cost (including inconvenience) of improved iCloud security?
  • …(and other issues)

Violet’s ten steps to help people protect themselves are OK, but if highly trained and security-conscious administrators share passwords with Edward Snowden, in violation of basic password security, lots of luck getting anyone to follow Violet’s ten rules.

Blaming and shaming IT companies for security breaches may play well to crowds, but it doesn’t get us any closer to solving security issues either from a technical (coding/system/authentication) or social (cost/inconvenience allocation) perspective.

PS: Perhaps Apple should have a warning on uploads to iCloud:

Digital data, such as iPhone photos, are at risk of being stolen and misused by others. Uploading/sharing/emailing digital data increases that risk exponentially. YHBW

September 3, 2014

A Web Magna Carta?

Filed under: Government,Politics,WWW — Patrick Durusau @ 4:40 pm

Crowdsourcing a Magna Carta for the Web at the Internet Governance Forum by Harry Halpin.

From the post:

At the Internet Governance Forum this week in Istanbul, we’ve been discussing how to answer the question posed by Tim Berners-Lee and the World Wide Web Foundation at the occasion of the 25th anniversary of the Web: What is the Web We Want? How can a “Magna Carta” for Web rights be crowd-sourced directly from the users of the Web itself?

A session on the Magna Carta (panel and Q&A) is part of the agenda this week at IGF on Thursday [4] September at 10:00 CET in Room 4 and folks can participate remotely over WebEx, IRC, and Twitter. Please tweet your questions about the Magna Carta with #webwewant to Twitter or join the channel #webwewant at irc.freenode.org. The session will be livestreamed.

Before you get too excited about a Magna Carta for Web rights, recall some of the major events in the history of the Magna Carta. Or see: Treasures in Full: Magna Carta (British Library), which includes the ability to read an image of the Magna Carta.

First, the agreement was an attempt to limit the powers of King John by a group of feudal barons, who wanted to protect their rights and property, not those of all subjects of King John. Moreover, both the king and the barons were willing to use force against the other in order to prevail.

The Magna Carta was renounced by King John and, after about three months, the First Barons’ War ensued.

I welcome the conversation but for a Magna Carta for the Web to succeed, sovereign states (read nations) must agree to enforceable limits on their power, much as King John did.

Twenty-five feudal barons, under article 61 of the Magna Carta (originally unnumbered), could enforce the Magna Carta:

Since, moreover, we have conceded all the above things (from reverence) for God, for the reform of our kingdom and the better quieting of the discord that has sprung up between us and our barons, and since we wish these things to flourish unimpaired and unshaken for ever, we constitute and concede to them the following guarantee:- namely, that the barons shall choose any twenty-five barons of the kingdom they wish, who with all their might are to observe, maintain and secure the observance of the peace and rights which we have conceded and confirmed to them by this present charter of ours; in this manner, that if we or our chief Justiciar or our bailiffs or any of our servants in any way do wrong to anyone, or transgress any of the articles of peace or security, and the wrong doing has been demonstrated to four of the aforesaid twenty-five barons, those four barons shall come to us or our chief Justiciar, (if we are out of the kingdom), and laying before us the grievance, shall ask that we will have it redressed without delay. And if we, or our chief Justiciar (should we be out of the kingdom) do not redress the grievance within forty days of the time when it was brought to the notice of us or our chief Justiciar (should we be out of the kingdom), the aforesaid four barons shall refer the case to the rest of the twenty-five barons and those twenty-five barons with the whole community of the land shall distrain and distress us in every way they can, namely by taking of castles, estates and possessions, and in such other ways as they can, excepting (attack on) our person and those of our queen and of our children until, in their judgment, satisfaction has been secured; and when satisfaction has been secured let them behave towards us as they did before. And let anyone in the country who wishes to do so take an oath to obey the orders of the said twenty-five barons in the execution of all the aforesaid matters and with them to oppress us to the best of his ability, and we publicly and freely give permission for the taking the oath to anyone who wishes to take it, and we will never prohibit anyone from taking it. [source: http://www.iamm.com/magnaarticles.htm]

To cut to the chase, the King in Article 61 agrees that the twenty-five barons could seize his castles, estates and possessions (excepting attacks on the king, queen, and their children) in order to force the king to follow the terms of the Magna Carta.

In modern terms, the barons could seize the Treasury Department, Congress, etc., but not take the President and his family hostage.

Do we have twenty-five feudal barons, by that I mean the global IT companies, willing to join together to enforce a Magna Carta for the Web on nations and principalities?

Without enforcers, a modern Magna Carta for the Web will be a pale imitation of its inspiration.

Data Sciencing by Numbers:…

Filed under: Data Mining,Text Analytics,Text Mining — Patrick Durusau @ 3:28 pm

Data Sciencing by Numbers: A Walk-through for Basic Text Analysis by Jason Baldridge.

From the post:

My previous post “Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics” discusses a simple exploration I did into algorithmically rating SXSW titles, most of which I did while on a plane trip last week. What I did was pretty basic, and to demonstrate that, I’m following up that post with one that explicitly shows you how you can do it yourself, provided you have access to a Mac or Unix machine.

There are three main components to doing what I did for the blog post:

  • Topic modeling code: the Mallet toolkit’s implementation of Latent Dirichlet Allocation
  • Language modeling code: the BerkeleyLM Java package for training and using n-gram language models
  • Unix command line tools for processing raw text files with standard tools and the topic modeling and language modeling code

I’ll assume you can use the Unix command line at least at a basic level, and I’ve packaged up the topic modeling and language modeling code in the Github repository maul to make it easy to try them out. To keep it really simple: you can download the Maul code and then follow the instructions in the Maul README. (By the way, by giving it the name “maul” I don’t want to convey that it is important or anything — it is just a name I gave the repository, which is just a wrapper around other people’s code.)
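
If you'd rather stay in one language than wire up Mallet and BerkeleyLM by hand, the topic modeling step looks roughly like this with gensim (a substitution of mine, not part of Jason's walkthrough):

    # The topic modeling step in a few lines with gensim's LDA
    # (a stand-in of mine for the Mallet step in Jason's walkthrough).
    from gensim import corpora, models

    titles = [
        "uncoding sxsw proposal titles with text analytics",
        "basic text analysis on the unix command line",
        "training n-gram language models for fun",
    ]
    docs = [t.split() for t in titles]

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)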

Jason’s post should help get you started doing data exercises. It is up to you if you continue those exercises and branch out to other data and new tools.

Like everything else, data exploration proficiency requires regular exercise.

Are you keeping a data exercise calendar?

I first saw this in a post by Jason Baldridge.

Titillating Titles:…

Filed under: Data Mining,Text Analytics — Patrick Durusau @ 3:13 pm

Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics by Jason Baldridge.

From the post:

The proposals for SXSW 2015 have been posted for several weeks now, and the community portion of the process ends this week on Friday, September 5. As a proposer myself for Are You In A Social Media Experiment?, I’ve been meaning to find a chance to look into the titles and see whether some straight-forward Unix commands, text analytics and natural language processing can reveal anything interesting about them.

People reportedly put a lot of thought into their titles since that is a big part of getting your proposal noticed in the community part of the voting process for panels. The creators of proposals for SXSW are given lots of feedback, including things like on their titles.

“Vague, non-descriptive language is a common mistake on titles — but if readers can’t comprehend the basic focus of your proposal without also reading the description, then you probably need to re-think your approach. If you can make the title witty and attention-getting, then wonderful. But please don’t let wit sidetrack you from the more significant goals of simple, accurate and succinct.”

In short, a title should stand out while remaining informative. It turns out that there has been research in computational linguistics into how to craft memorable quotes that is interesting with respect to standing out. Danescu-Niculescu-Mizil, Cheng, Kleinberg, and Lee’s (2012) “You had me at hello: How phrasing affects memorability” found that memorable movie quotes use less common words built on a scaffold of common syntactic patterns (BTW, the paper itself has great section titles). Chan, Lee and Pang (2014) go to the next step of building a model that predicts which of two versions of a tweet will have a better response (in terms of obtaining retweets) (see the demo).

Are you ready to take your titles beyond spell-check and grammar correction?

What if you could check your titles at least to make them more memorable? Would you do it?

Jason provides an example of how checking your title for “impact” may not be all that far-fetched.
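
As a toy version of such a check (my sketch, far cruder than the cited research): score a title by how uncommon its words are against a background word-frequency list, so titles built from rarer words stand out:

    # Crude title "distinctiveness" score (a toy of mine, far simpler than
    # the cited models): average negative log-frequency of a title's words
    # against a background corpus; rarer words push the score up.
    import math
    from collections import Counter

    background = Counter(
        "the quick brown fox jumps over the lazy dog "
        "the social media panel about the future of the web".split()
    )
    total = sum(background.values())

    def distinctiveness(title):
        words = title.lower().split()
        # Unseen words get a small floor count instead of zero.
        return sum(-math.log((background[w] or 0.5) / total)
                   for w in words) / len(words)

    print(distinctiveness("the social media panel"))          # common words
    print(distinctiveness("uncoding titillating proposals"))  # rarer, scores higher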

PS: Be sure to try the demo for “better” tweets.

FCC Net Neutrality Plan – 800,000 Comments

Filed under: Data Analysis,Data Mining,Government — Patrick Durusau @ 1:40 pm

What can we learn from 800,000 public comments on the FCC’s net neutrality plan? by Bob Lannon and Andrew Pendleton.

From the post:

On Aug. 5, the Federal Communications Commission announced the bulk release of the comments from its largest-ever public comment collection. We’ve spent the last three weeks cleaning and preparing the data and leveraging our experience in machine learning and natural language processing to try and make sense of the hundreds-of-thousands of comments in the docket. Here is a high-level overview, as well as our cleaned version of the full corpus which is available for download in the hopes of making further research easier.

A great story of cleaning dirty data. Beyond eliminating both Les Misérables and War and Peace as comments, the authors detected statements by experts, form letters, etc.

If you’re interested in doing your own analysis with this data, you can download our cleaned-up versions below. We’ve taken the six XML files released by the FCC and split them out into individual files in JSON format, one per comment, then compressed them into archives, one for each XML file. Additionally, we’ve taken several individual records from the FCC data that represented multiple submissions grouped together, and split them out into individual files (these JSON files will have hyphens in their filenames, where the value before the hyphen represents the original record ID). This includes email messages to openinternet@fcc.gov, which had been aggregated into bulk submissions, as well as mass submissions from CREDO Mobile, Sen. Bernie Sanders’ office and others. We would be happy to answer any questions you may have about how these files were generated, or how to use them.

All the code used in the project is available at: https://github.com/sunlightlabs/fcc-net-neutrality-comments
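
A minimal sketch of poking at the cleaned corpus (the directory and field names are my assumptions from the description above, so adjust to the actual archives):

    # Count exact-duplicate comment texts in the cleaned corpus (directory
    # and field names are assumptions based on the post's description).
    import json
    from collections import Counter
    from pathlib import Path

    texts = Counter()
    for path in Path("fcc-comments").glob("*.json"):    # placeholder directory
        with open(path) as f:
            comment = json.load(f)
        texts[comment.get("text", "").strip()] += 1     # assumed field name

    # Form letters surface as the same text submitted thousands of times.
    for text, count in texts.most_common(5):
        print(count, text[:60])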

I first saw this in a tweet by Scott Chamberlain.

Apache Cassandra 2.1.0-rc7

Filed under: Cassandra — Patrick Durusau @ 1:17 pm

Apache Cassandra 2.1.0-rc7 (Changes)

A new Apache Cassandra release candidate!

Downloads: http://cassandra.apache.org/download/

I like the generated list of changes, but as dead text, it is of limited usefulness. This works better for me:

2.1.0-rc7

  • Add frozen keyword and require UDT to be frozen (CASSANDRA-7857)
  • Track added sstable size correctly (CASSANDRA-7239)
  • (cqlsh) Fix case insensitivity (CASSANDRA-7834)
  • Fix failure to stream ranges when moving (CASSANDRA-7836)
  • Correctly remove tmplink files (CASSANDRA-7803)
  • (cqlsh) Fix column name formatting for functions, CAS operations, and UDT field selections (CASSANDRA-7806)
  • (cqlsh) Fix COPY FROM handling of null/empty primary key values (CASSANDRA-7792)
  • Fix ordering of static cells (CASSANDRA-7763)

Merged from 2.0:

  • Forbid re-adding dropped counter columns (CASSANDRA-7831)
  • Fix CFMetaData#isThriftCompatible() for PK-only tables (CASSANDRA-7832)
  • Always reject inequality on the partition key without token (CASSANDRA-7722)
  • Always send Paxos commit to all replicas (CASSANDRA-7479)
  • Don’t send schema change responses and events for no-op DDL statements (CASSANDRA-7600)
  • (Hadoop) fix cluster initialisation for a split fetching (CASSANDRA-7774)
  • Configure system.paxos with LeveledCompactionStrategy (CASSANDRA-7753)
  • Fix ALTER clustering column type from DateType to TimestampType when using DESC clustering order (CASSANDRA-7797)
  • Throw EOFException if we run out of chunks in compressed datafile (CASSANDRA-7664)
  • Fix PRSI handling of CQL3 row markers for row cleanup (CASSANDRA-7787)
  • Fix dropping collection when it’s the last regular column (CASSANDRA-7744)
  • Properly reject operations on list index with conditions (CASSANDRA-7499)
  • Make StreamReceiveTask thread safe and gc friendly (CASSANDRA-7795)
  • Validate empty cell names from counter updates (CASSANDRA-7798)

Merged from 1.2:

Being “on the web” should require more than access via the web. Whenever available, links to other web resources should be present as well.

Best Map/Reduce Explanation

Filed under: Hadoop,MapReduce — Patrick Durusau @ 10:45 am

Michael Klishin tweeted today: “The best map/reduce explanation ever: https://pbs.twimg.com/media/Bwj9KO5IcAAdl4H.png:large”

For your viewing pleasure:

[image: map/reduce explanation]

It does have “side effects” though.

Is it lunch time yet?
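
And if you prefer code to pictures, here is the word-count cliché in map/reduce style (a plain Python sketch of mine, no Hadoop required):

    # Word count in map/reduce style (plain Python, no framework).
    from collections import defaultdict
    from itertools import chain

    docs = ["map reduce explained", "map once reduce once"]

    # Map: emit (word, 1) pairs for every word in every document.
    mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

    # Shuffle: group values by key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: sum the counts for each word.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)  # {'map': 2, 'reduce': 2, 'explained': 1, 'once': 2}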

September 2, 2014

Maps Published on GOV.UK

Filed under: Mapping,Maps — Patrick Durusau @ 6:34 am

How to find all the maps published on GOV.UK by Giles Turnbull.

A “trick” you need to note for finding all the maps published on GOV.UK.

I don’t know of any comparable “trick” or even a single location for all the maps published by the United States government. If you were to include state and local governments, the problem would be even worse.

If you know of cross-agency map directories in the United States (or elsewhere), please sing out!

Thanks!

World’s Biggest Data Breaches

Filed under: Cybersecurity,Security — Patrick Durusau @ 6:28 am

World’s Biggest Data Breaches by David McCandless and Tom Evans.

Interactive visualization of data breaches with more than 30,000 records lost.

You can filter by organization or method of leak as well as alter the display with options such as “no of records stolen” and “data sensitivity.”

You can find the data for this visualization at: bit.ly/bigdatabreaches.

If data is stored, data breaches are going to occur. (full stop)

The only useful question to discuss is how much benefit is derived from the data versus the cost of security to prevent breaches and the cost of a potential breach.

Security is a cost issue and should be openly discussed as such. (As well as who will pay that cost.)

I first saw this in a tweet by Nicholas Thompson.

PS: Semantic transparency is also a cost issue. If I am writing for a group of close associates, I’m not going to include citations and/or documentation on issues and concepts that we all share. The more distant the potential audience, the greater the amount of work to make the same material transparent to the larger group. Unfortunately that front-loads the costs when the real beneficiaries may be some distance away.

September 1, 2014

How Could Language Have Evolved?

Filed under: Evolutionary,Language — Patrick Durusau @ 7:40 pm

How Could Language Have Evolved? by Johan J. Bolhuis, Ian Tattersall, Noam Chomsky, Robert C. Berwick.

Abstract:

The evolution of the faculty of language largely remains an enigma. In this essay, we ask why. Language’s evolutionary analysis is complicated because it has no equivalent in any nonhuman species. There is also no consensus regarding the essential nature of the language “phenotype.” According to the “Strong Minimalist Thesis,” the key distinguishing feature of language (and what evolutionary theory must explain) is hierarchical syntactic structure. The faculty of language is likely to have emerged quite recently in evolutionary terms, some 70,000–100,000 years ago, and does not seem to have undergone modification since then, though individual languages do of course change over time, operating within this basic framework. The recent emergence of language and its stability are both consistent with the Strong Minimalist Thesis, which has at its core a single repeatable operation that takes exactly two syntactic elements a and b and assembles them to form the set {a, b}.

Interesting that Chomsky and his co-authors have seized upon “hierarchical syntactic structure” as “the key distinguishing feature of language.”

Remember text as an Ordered Hierarchy of Content Objects (OHCO), which has made the rounds in markup circles since 1993. Its staying power was quite surprising since examples are hard to find outside of markup text encodings. Your average text prior to markup can be mapped to OHCO only with difficulty in most cases.

Syntactic structures are attributed to languages, so be mindful that any “hierarchical syntactic structure” is entirely of human origin, separate and apart from language.
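
The Merge operation from the abstract is simple enough to render literally (a toy of mine: take a and b, form the set {a, b}):

    # A literal toy rendering of Merge: take two syntactic objects and
    # form the set {a, b}; repeated application yields hierarchy.
    def merge(a, b):
        return frozenset((a, b))

    np_ = merge("the", "ball")   # {the, ball}
    vp = merge("kicked", np_)    # {kicked, {the, ball}}
    print(vp)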

Extracting images from scanned book pages

Filed under: Data Mining,Image Processing,Image Recognition — Patrick Durusau @ 7:14 pm

Extracting images from scanned book pages by Chris Adams.

From the post:

I work on a project which has placed a number of books online. Over the years we’ve improved server performance and worked on a fast, responsive viewer for scanned books to make our books as accessible as possible but it’s still challenging to help visitors find something of interest out of hundreds of thousands of scanned pages.

Trevor and I have discussed various ways to improve the situation and one idea which seemed promising was seeing how hard it would be to extract the images from digitized pages so we could present a visual index of an item. Trevor’s THATCamp CHNM post on Freeing Images from Inside Digitized Books and Newspapers got a favorable reception and since it kept coming up at work I decided to see how far I could get using OpenCV.

Everything you see below is open-source and comments are highly welcome. I created a book-illustration-detection branch in my image mining project (see my previous experiment reconstructing higher-resolution thumbnails from the masters) so feel free to fork it or open issues.
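
As a rough sketch of the general OpenCV approach (my simplification with assumed file names, not Chris’s code): threshold, dilate so illustrations merge into large blobs, and keep only the big contours:

    # Rough illustration-finding sketch with OpenCV (my simplification,
    # not Chris Adams's code; file names are placeholders).
    import cv2

    page = cv2.imread("page.png")  # a scanned page image (placeholder name)
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)

    # Binarize (inverted, so ink is white) with Otsu's threshold.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Dilate so nearby marks merge: text becomes thin strips,
    # halftone illustrations become large solid blobs.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))
    dilated = cv2.dilate(binary, kernel)

    # findContours returns (image, contours, hierarchy) in OpenCV 3
    # and (contours, hierarchy) in 2/4; [-2] works for all of them.
    contours = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]

    page_area = page.shape[0] * page.shape[1]
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h > 0.05 * page_area:  # skip small regions (usually text lines)
            cv2.imwrite("illustration_%d_%d.png" % (x, y), page[y:y+h, x:x+w])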

Just in case you are looking for a Fall project. 😉

Consider capturing the images and their contents in associations with authors, publishers, etc., to enable mining those associations for patterns.

