Another Word For It — Patrick Durusau on Topic Maps and Semantic Diversity

October 18, 2015

Requirements For A Twitter Client

Filed under: Curation,Data Mining,Twitter — Patrick Durusau @ 2:57 pm

Kurt Cagle writes of needed improvements to Twitter’s “Moments,” in Project Voyager and Moments: Close, but not quite there yet saying:

This week has seen a pair of announcements that are likely to significantly shake up social media as its currently known. Earlier this week, Twitter debuted its Moments, a news service where the highlights of the week are brought together into a curated news aggregator.

However, this is 2015. What is of interest to me – topics such as Data Science, Semantics, Astronomy, Climate Change and so forth – is likely not going to be of interest to others. Similarly, I really have no time for cute pictures of dogs (cats, maybe), the state of the World Series race, the latest political races or other “general” interest topics. In other words, I would rather curate content my way, even if the quality is not necessarily the highest, than have other people who I do not know decide to curate to the lowest possible denominator.

A very small change, on the other hand, could make a huge difference for Moments for myself and many others. Allow users to aggregate a set of hash tags under a single “Paper section banner” – #datascience, #data, #science, #visualization, #analytics, #stochastics, etc. – could all go under the Data Science banner. Even better yet, throw in a bit of semantics to find every topic within two hops topically to the central terms and use these (with some kind of weighting factor) as well. Rank these tweets according to fitness, then when I come to Twitter I can “read” my twitter paper just by typing in the appropriate headers (or have them auto-populate a list).

My exclusion list would include cats, shootings, bombings, natural disasters, general news and other ephemera that will be replaced by another screaming headline next week, if not tomorrow.

Starting with Kurt’s suggested improvements, a Twitter client should offer:

  • User-defined aggregation based on # tags (hashtags)
  • Learning semantics (Kurt’s two-hop expansion, for example)
  • Deduping tweets over a user-set period: day, week, month, or other
  • User-determined sorting of tweets by time/date, author, retweets, or favorites
  • Exclusion of tweets without URLs
  • Filtering of tweets by sender (even when matched by # tag), perhaps with regex support

I have looked but not found any Twitter client that comes even close.
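
To make the list above concrete, here is a rough sketch of the filtering, deduping, and sorting logic over already-fetched tweets. The tweet records and their field names are invented for illustration; this is not tied to any particular Twitter client library or API.

```python
import re
from datetime import datetime, timedelta

# Hypothetical tweet records; the field names are assumptions, not a real client API.
tweets = [
    {"id": 1, "text": "New #datascience pipeline https://example.com/post",
     "author": "alice", "retweets": 12, "favorites": 3,
     "created": datetime(2015, 10, 18, 9, 0)},
    {"id": 2, "text": "Cute cat picture #cats", "author": "bob",
     "retweets": 500, "favorites": 900, "created": datetime(2015, 10, 18, 10, 0)},
]

BANNER = {"datascience", "data", "science", "visualization", "analytics"}
EXCLUDE = re.compile(r"#cats?\b|shooting|bombing", re.IGNORECASE)

def hashtags(text):
    return {t.lower() for t in re.findall(r"#(\w+)", text)}

def my_paper(tweets, since_days=7):
    cutoff = datetime(2015, 10, 18) - timedelta(days=since_days)  # "now" pinned for the example
    seen_texts = set()                       # dedupe over the user-set period
    selected = []
    for tw in tweets:
        if tw["created"] < cutoff:
            continue
        if "http" not in tw["text"]:         # exclusion of tweets without URLs
            continue
        if EXCLUDE.search(tw["text"]):       # regex-based exclusion list
            continue
        if not (hashtags(tw["text"]) & BANNER):  # banner aggregation by # tags
            continue
        key = tw["text"].strip().lower()
        if key in seen_texts:
            continue
        seen_texts.add(key)
        selected.append(tw)
    # user-determined sort: retweets here, but author/date/favorites work the same way
    return sorted(selected, key=lambda tw: tw["retweets"], reverse=True)

print([tw["id"] for tw in my_paper(tweets)])   # -> [1]
```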

Other requirements?

October 17, 2015

How to Read a Paper

Filed under: Librarian/Expert Searchers,Research Methods,Researchers — Patrick Durusau @ 4:39 pm

How to Read a Paper by S. Keshav.

Abstract:

Researchers spend a great deal of time reading research papers. However, this skill is rarely taught, leading to much wasted effort. This article outlines a practical and efficient three-pass method for reading research papers. I also describe how to use this method to do a literature survey.

Sean Cribbs mentions this paper in The Refreshingly Rewarding Realm of Research Papers, but it is important enough for a separate post.

You should keep a copy of it at hand until the three-pass method becomes habit.

Other resources that Keshav mentions:

T. Roscoe, Writing Reviews for Systems Conferences

H. Schulzrinne, Writing Technical Articles

G.M. Whitesides, Whitesides’ Group: Writing a Paper (updated URL)

All three are fairly short and well worth your time to read and re-read.

Experienced writers as well!

After more than thirty years of professional writing I still benefit from well-written writing/editing advice.

Congressional PageRank… [How To Avoid Bribery Charges]

Filed under: Graphs,GraphX,Neo4j,PageRank,Spark — Patrick Durusau @ 3:25 pm

Congressional PageRank – Analyzing US Congress With Neo4j and Apache Spark by William Lyon.

From the post:

As we saw previously, legis-graph is an open source software project that imports US Congressional data from Govtrack into the Neo4j graph database. This post shows how we can apply graph analytics to US Congressional data to find influential legislators in Congress. Using the Mazerunner open source graph analytics project we are able to use Apache Spark GraphX alongside Neo4j to run the PageRank algorithm on a collaboration graph of US Congress.

While Neo4j is a powerful graph database that allows for efficient OLTP queries and graph traversals using the Cypher query language, it is not optimized for global graph algorithms, such as PageRank. Apache Spark is a distributed in-memory large-scale data processing engine with a graph processing framework called GraphX. GraphX with Apache Spark is very efficient at performing global graph operations, like the PageRank algorithm. By using Spark alongside Neo4j we can enhance our analysis of US Congress using legis-graph.

Excellent walk-through to get you started on analyzing influence in Congress with modern data analysis tools. Getting a good grip on all these tools will be valuable.
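
As a stand-alone illustration of the underlying idea (not the post’s actual Neo4j + Mazerunner + GraphX pipeline), here is PageRank run over a tiny, invented legislator collaboration graph with networkx:

```python
import networkx as nx

# Toy collaboration graph: an edge means two legislators co-sponsored bills,
# weighted by how many times. All names and weights are invented.
G = nx.Graph()
G.add_weighted_edges_from([
    ("Sen. A", "Sen. B", 12),
    ("Sen. A", "Sen. C", 7),
    ("Sen. B", "Sen. C", 3),
    ("Sen. C", "Sen. D", 1),
])

# PageRank treats heavily co-sponsoring legislators as "endorsing" each other.
ranks = nx.pagerank(G, weight="weight")

for name, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print("{}: {:.3f}".format(name, score))
```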

Political scientists, among others, have studied the question of influence in Congress for decades, so if you don’t want to repeat the results of others, begin by consulting the American Political Science Review for prior work in this area.

An article that reports counter-intuitive results is: The Influence of Campaign Contributions on the Legislative Process by Lynda W. Powell.

From the introduction:

Do campaign donors gain disproportionate influence in the legislative process? Perhaps surprisingly, political scientists have struggled to answer this question. Much of the research has not identified an effect of contributions on policy; some political scientists have concluded that money does not matter; and this bottom line has been picked up by reporters and public intellectuals.1 It is essential to answer this question correctly because the result is of great normative importance in a democracy.

It is important to understand why so many studies find no causal link between contributions and policy outcomes. (emphasis added)

Powell cites much of the existing work on the influence of donations on process, so her work makes a great starting point for further research.

As for the lack of a “causal link between contributions and policy outcomes,” I think the answer is far simpler than Powell suspects.

The existence of a quid-pro-quo, the exchange of value for a vote on a particular bill, is the essence of the crime of public bribery. For the details (in the United States), see: 18 U.S. Code § 201 – Bribery of public officials and witnesses

What isn’t public bribery is to donate funds to an office holder on a regular basis, unrelated to any particular vote or act on the part of that official. Think of it as bribery on an installment plan.

When U.S. officials, such as former Secretary of State Hillary Clinton complain of corruption in other governments, they are criticizing quid-pro-quo bribery and not installment plan bribery as it is practiced in the United States.

Regular contributions gain ready access to legislators and, not surprisingly, more votes will go in your favor than random chance would allow.

Regular contributions are more expensive than direct bribes but avoiding the “causal link” is essential for all involved.

@SwiftLang “better, more valuable…than Haskell”?

Filed under: Functional Programming,Haskell — Patrick Durusau @ 2:02 pm

Erik Meijer tweeted:

At this point, @SwiftLang is probably a better, and more valuable, vehicle for learning functional programming than Haskell.

Given Erik’s deep experience and knowledge of functional programming, such a tweet has to give you pause.

Less daunting was the “67 retweets 56 favorites” by the known users of SwiftLang. 😉

A more accurate statement would be:

At this point, @SwiftLang is probably a better, and more valuable, vehicle for learning functional programming than Haskell, if you program for iOS, OS X, and watchOS.

Yes?

„To See or Not to See“…

Filed under: Text Analytics,Text Encoding Initiative (TEI) — Patrick Durusau @ 1:47 pm

„To See or Not to See“ – an Interactive Tool for the Visualization and Analysis of Shakespeare Plays by Thomas Wilhelm, Manuel Burghardt, and Christian Wolff.

Abstract:

In this article we present a web-based tool for the visualization and analysis of quantitative characteristics of Shakespeare plays. We use resources from the Folger Digital Texts Library as input data for our tool. The Folger Shakespeare texts are annotated with structural markup from the Text Encoding Initiative (TEI). Our tool interactively visualizes which character says what and how much at a particular point in time, allowing customized interpretations of Shakespeare plays on the basis of quantitative aspects, without having to care about technical hurdles such as markup or programming languages.

I found the remarkable web tool described in this paper at: http://www.thomaswilhelm.eu/shakespeare/output/hamlet.html.

You can easily change plays (menu, top left) but note that “download source” refers to the processed plays themselves, not the XSL/T code that transformed the TEI markup. I think all the display code is JavaScript/CSS so you can scrape that from the webpage. I am more interested in the XSL/T applied to the original markup.
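
If you would rather recompute the “who says how much” numbers yourself than untangle the XSLT, a rough sketch with Python and lxml might look like the following. It assumes TEI drama conventions (sp elements with a who attribute or a speaker child) and a hypothetical local file name; check the actual Folger markup before trusting the counts.

```python
from collections import Counter
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"

def speech_counts(path):
    """Rough words-per-speaker tally for a TEI drama file.

    Assumes speeches are <sp> elements carrying a who attribute or a
    <speaker> child -- verify against the actual Folger markup before
    relying on the numbers.
    """
    tree = etree.parse(path)
    counts = Counter()
    for sp in tree.iter(TEI + "sp"):
        speaker_el = sp.find(TEI + "speaker")
        speaker = sp.get("who") or (speaker_el.text if speaker_el is not None else "UNKNOWN")
        text = " ".join(sp.itertext())          # crude: includes the speaker label itself
        counts[speaker.strip()] += len(text.split())
    return counts

if __name__ == "__main__":
    for speaker, words in speech_counts("hamlet.xml").most_common(10):  # hypothetical file
        print(speaker, words)
```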

In the paper the authors say that plays may have over “5000 lines of code” for their transformation with XSL/T.

I am very curious whether translating the XSLT code into XQuery would reduce the amount of code required.

I recently re-wrote the XSLT code for the W3C Bibliography Generator, limited to Recommendations, and the XQuery code was far shorter than the XSLT used by the W3C.

Look for a post on the XQuery I wrote for the W3C bibliography on Monday, 19 October 2015.

If you decide to cite this article:

Wilhelm, T., Burghardt, M. & Wolff, C. (2013). “To See or Not to See” – An Interactive Tool for the Visualization and Analysis of Shakespeare Plays. In Franken-Wendelstorf, R., Lindinger, E. & Sieck J. (eds): Kultur und Informatik – Visual Worlds & Interactive Spaces, Berlin (pp. 175-185). Glückstadt: Verlag Werner Hülsbusch.

Two of the resources mentioned in the article:

Folger Digital Texts Library

Text Encoding Initiative (TEI)

todonotes – Marking things to do in a LATEX document

Filed under: Editor,TeX/LaTeX — Patrick Durusau @ 1:06 pm

todonotes – Marking things to do in a LATEX document

From the webpage:

The package lets the user mark things to do later, in a simple and visually appealing way. The package takes several options to enable customization/finetuning of the visual appearance.

The feature of this package that grabbed my attention was the ability to easily create a list of todos “…like a table of contents or a list of figures.”

In any document longer than a couple of pages, that is going to be very handy.

Document Summarization via Markov Chains

Filed under: Algorithms,Markov Decision Processes,Summarization,Text Mining — Patrick Durusau @ 12:58 pm

Document Summarization via Markov Chains by Atabey Kaygun.

From the post:

Description of the problem

Today’s question is this: we have a long text and we want a machine generated summary of the text. Below, I will describe a statistical (hence language agnostic) method to do just that.

Sentences, overlaps and Markov chains.

In my previous post I described a method to measure the overlap between two sentences in terms of common words. Today, we will use the same measure, or a variation, to develop a discrete Markov chain whose nodes are labeled by individual sentences appearing in our text. This is essentially page rank applied to sentences.

Atabey says the algorithm (code supplied) works well on:

news articles, opinion pieces and blog posts.

Not so hot on Supreme Court decisions.
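
Atabey supplies his own code, but the general recipe — build a sentence-overlap graph, run PageRank over it, keep the top-ranked sentences — fits in a short sketch. The overlap measure here is deliberately naive and the sentence splitter is crude; treat it as an illustration, not a reimplementation of his method.

```python
import re
import networkx as nx

def summarize(text, n_sentences=3):
    # Naive sentence split; good enough for news prose, not legal opinions.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]

    # Edge weight = word overlap between two sentences (the Markov chain's
    # transition weights, up to normalization).
    G = nx.Graph()
    G.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            overlap = len(words[i] & words[j])
            if overlap:
                G.add_edge(i, j, weight=overlap)

    # PageRank over the overlap graph; the stationary distribution ranks sentences.
    ranks = nx.pagerank(G, weight="weight")
    top = sorted(sorted(ranks, key=ranks.get, reverse=True)[:n_sentences])
    return " ".join(sentences[i] for i in top)

# Example: print(summarize(open("article.txt").read()))  # hypothetical input file
```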

In commenting on a summary of a New York Times story, Obama Won’t Seek Access to Encrypted User Data, Atabey notes that the summary gives us no referent for “what frustrated him.”

If you consider the relevant paragraph from the New York Times story:

Mr. Comey had expressed alarm a year ago after Apple introduced an operating system that encrypted virtually everything contained in an iPhone. What frustrated him was that Apple had designed the system to ensure that the company never held on to the keys, putting them entirely in the hands of users through the codes or fingerprints they use to get into their phones. As a result, if Apple is handed a court order for data — until recently, it received hundreds every year — it could not open the coded information.

The reference is clear. Several other people are mentioned in the New York Times article but none rank high enough to appear in the summary.

It is not a sure bet, but with testing, attributing such dangling references to the people who rank high enough to appear in the summary may work well.

October 16, 2015

Planet Platform Beta & Open California:…

Planet Platform Beta & Open California: Our Data, Your Creativity by Will Marshall.

From the post:

At Planet Labs, we believe that broad coverage frequent imagery of the Earth can be a significant tool to address some of the world’s challenges. But this can only happen if we democratise access to it. Put another way, we have to make data easy to access, use, and buy. That’s why I recently announced at the United Nations that Planet Labs will provide imagery in support of projects to advance the Sustainable Development Goals.

Today I am proud to announce that we’re releasing a beta version of the Planet Platform, along with our imagery of the state of California under an open license.

The Planet Platform Beta will enable a pioneering cohort of developers, image analysts, researchers, and humanitarian organizations to get access to our data, web-based tools and APIs. The goal is to provide a “sandbox” for people to start developing and testing their apps on a stack of openly available imagery, with the goal of jump-starting a developer community; and collecting data feedback on Planet’s data, tools, and platform.

Our Open California release includes two years of archival imagery of the whole state of California from our RapidEye satellites and 2 months of data from the Dove satellite archive; and will include new data collected from both constellations on an ongoing basis, with a two-week delay. The data will be under an open license, specifically CC BY-SA 4.0. The spirit of the license is to encourage R&D and experimentation in an “open data” context. Practically, this means you can do anything you want, but you must “open” your work, just as we are opening ours. It will enable the community to discuss their experiments and applications openly, and thus, we hope, establish the early foundation of a new geospatial ecosystem.

California is our first Open Region, but shall not be the last. We will open more of our data in the future. This initial release will inform how we deliver our data set to a global community of customers.

Resolution for the Dove satellites is 3-5 meters; for the RapidEye satellites, 5 meters.

Not quite goldfish bowl or Venice Beach resolution but useful for other purposes.

Now would be a good time to become familiar with managing and annotating satellite imagery. Higher resolutions, public and private, are only a matter of time.

Markdeep

Filed under: Editor — Patrick Durusau @ 6:43 pm

Markdeep

From the webpage:

Markdeep is a technology for writing plain text documents that will look good in any web browser. It supports diagrams, common styling conventions, and equations as extensions of Markdown syntax.

Markdeep is free and easy to use. It doesn’t need a plugin, or Internet connection. There’s nothing to install. Just start writing in Vi, Notepad, Emacs, Visual Studio, Atom, or another editor! You don’t have to export, compile, or otherwise process your document. Here’s an example of a text editor and a browser viewing the same file simultaneously:

[Screenshots: the same Markdeep file shown as source in a text editor and rendered in a web browser]

Markdeep is ideal for design documents, specifications, README files, code documentation, lab reports, and technical web pages. Because the source is plain text, Markdeep works well with software development toolchains.

Markdeep was created by Morgan McGuire at Casual Effects with inspiration from John Gruber’s Markdown. The current 0.01 beta release is minified-only to find bugs and get feedback, but a full source version is coming soon after some more code cleanup.

You may find this useful, but I certainly disagree with writing “design documents, specifications, …, code documentation, lab reports, and technical web pages” in plain text.

Yes, that fits into software development toolchains but the more relevant question is why haven’t software development toolchains upgraded to use XML? Unadorned plain text is better than no documentation at all but the lack of structure makes it difficult to stitch your documentation together with other documentation.

Unless preventing transclusion of documents is a goal of your documentation process?

The XML world has made a poor showing of transclusion over the years. That was driven by the impoverished view that documents are the proper targets of URLs and not more granular targets within documents.

That “document as the target” view perpetuated an eternal cycle of every reader having to navigate the same document to find the same snippet that is of importance.

Perhaps XQuery can free us from that eternal cycle of repetition and waste.

A meaningful and explicit structure to documents is a step towards XQuery accomplishing just that.

Project Production Glossary

Filed under: Glossary,Law — Patrick Durusau @ 4:39 pm

Project Production Glossary sponsored by LTPI, Legal Technology Professionals Institute.

From the webpage:

The Legal Technology Professionals Institute Production Glossary is designed as an educational resource on terminology used in connection with producing electronically stored information. While a number of useful industry-wide glossaries exist, we could not find one that specifically discussed document production, nor one that discussed not only the “what”, but also the “why”, so we created one.

If you are using or creating topic maps in a legal context, this may be very useful.

Public comments are open.

55 Articles Every Librarian Should Read (Updated)

Filed under: Librarian/Expert Searchers,Library — Patrick Durusau @ 3:25 pm

55 Articles Every Librarian Should Read (Updated) by Christina Magnifico.

The articles cover a wide range of subjects, but remember the line:

“People become librarians because they know too much.”

A good starting place if you are looking for sparks for new ideas.

Enjoy!

scikit-learn 0.17b1 is out!

Filed under: Python,Scikit-Learn — Patrick Durusau @ 3:14 pm

scikit-learn 0.17b1 is out! by Olivier Grisel.

From the announcement:

The 0.17 beta release of scikit-learn has been uploaded to PyPI. As of now only the source tarball is available. I am waiting for the CI server to build the binary packages for the Windows and Mac OSX platform. They should be online tonight or tomorrow morning.

https://pypi.python.org/pypi/scikit-learn/0.17b1

Please test it as much as possible especially if you have a test suite for a project that has scikit-learn as a dependency.

If you find regressions from 0.16.1 please open issues on github and put `[REGRESSION]` in the title of the issue:

https://github.com/scikit-learn/scikit-learn/issues

Any bugfix will have to be merged to the master branch first and then we will do a cherrypick of the fix into the 0.17.X branch that will be used to generate 0.17.0 final, probably in less than 2 weeks.

Just in time for the weekend! 😉

Comment early and often.

Enjoy!

Google Book-Scanning Project Is Fair Use, 2nd Circ. Says

Filed under: Intellectual Property (IP) — Patrick Durusau @ 10:35 am

Google Book-Scanning Project Is Fair Use, 2nd Circ. Says by Bill Donahue.

From the post:

Law360, New York (October 16, 2015, 10:13 AM ET) — The Second Circuit ruled Friday that Google Inc.’s project to digitize and index millions of copyrighted books without permission was legal under the fair use doctrine, handing the tech giant a huge victory in a long-running fight with authors.

Coming more than a decade after the Authors Guild first sued over what would become “Google Books,” the appeals court’s opinion said that making the world’s books text-searchable — while not allowing users to read more than a snippet of text — was a sufficiently “transformative use” of the author’s content to be protected by the doctrine.

“Google’s making of a digital copy to provide a search function is a transformative use, which augments public knowledge by making available information about plaintiffs’ books without providing the public with a substantial substitute for matter protected by the plaintiffs’ copyright interests in the original works or derivatives of them,” the appeals court said.

Excellent!

Spread the good news!

I will update with a link to the opinion.


Apologies for the delayed update!

The Authors Guild vs. Google, Docket No. 13-4829-cv

October 15, 2015

Goodbye to True: Advancing semantics beyond the black and white

Filed under: Logic,Semantics — Patrick Durusau @ 8:10 pm

Goodbye to True: Advancing semantics beyond the black and white by Chris Welty.

Abstract:

The set-theoretic notion of truth proposed by Tarski is the basis of most work in machine semantics and probably has its roots in the work and influence of Aristotle. We take it for granted that the world can be described, not in shades of grey, but in terms of statements and propositions that are either true or false – and it seems most of western science stands on the same principle. This assumption at the core of our training as scientists should be questioned, because it stands in direct opposition to our human experience. Is there any statement that can be made that can actually be reduced to true or false? Only, it seems, in the artificial human-created realms of mathematics, games, and logic. We have been investigating a different mode of truth, inspired by results in Crowdsourcing, which allows for a highly dimensional notion of semantic interpretation that makes true and false look like a childish simplifying assumption.

Chris was the keynote speaker at the Third International Workshop on Linked Data for Information Extraction (LD4IE2015). (Proceedings)

I wasn’t able to find a video for that presentation but I did find “Chris Welty formerly IBM Watson Team – Cognitive Computing GDG North Jersey at MSU” from about ten months ago.

Great presentation on “cognitive computing.”

Enjoy!

CyGraph: Cybersecurity Situational Awareness…

Filed under: Cybersecurity,Graphs,Neo4j,Security — Patrick Durusau @ 4:06 pm

CyGraph: Cybersecurity Situational Awareness That’s More Scalable, Flexible & Comprehensive by Steven Noel. (MITRE Corporation, if you can’t tell from the title.)

From the post:

Preventing and reacting to attacks in cyberspace involves a complex and rapidly changing milieu of factors, requiring a flexible architecture for advanced analytics, queries and graph visualization.

Information Overload in Security Analytics

Cyber warfare is conducted in complex environments, with numerous factors contributing to attack success and mission impacts. Network topology, host configurations, vulnerabilities, firewall settings, intrusion detection systems, mission dependencies and many other elements all play important parts.

To go beyond rudimentary assessments of security posture and attack response, organizations need to merge isolated data into higher-level knowledge of network-wide attack vulnerability and mission readiness in the face of cyber threats.

Network environments are always changing, with machines added and removed, patches applied, applications installed, firewall rules changed, etc., all with potential impact on security posture. Intrusion alerts and anti-virus warnings need attention, and even seemingly benign events such as logins, service connections and file share accesses could be associated with adversary activity.

The problem is not lack of information, but rather the ability to assemble disparate pieces of information into an overall analytic picture for situational awareness, optimal courses of action and maintaining mission readiness.

CyGraph: Turning Cybersecurity Information into Knowledge

To address these challenges, researchers at the MITRE Corporation are developing CyGraph, a tool for cyber warfare analytics, visualization and knowledge management.

Graph databases, Neo4j being one of many, can be very useful in managing complex security data.

However, as I mentioned earlier today, one of the primary issues in cybersecurity is patch management, with a full 76% of applications remaining unpatched more than two years after vulnerabilities have been discovered. (Yet Another Flash Advisory (YAFA) [Patch Due 19 October 2015])

If you haven’t taken basic steps on an issue like patch management, as in evaluating and installing patches in a timely manner, a rush to get the latest information is misplaced.

Just in case you are wondering, if you do visit MITRE Corporation, you will find that a search for “CyGraph” comes up empty. Must not be quite to the product stage just yet.

Watch for name conflicts:

and others of course.

10,000 years of Cascadia earthquakes

Filed under: Interface Research/Design,Mapping,Maps — Patrick Durusau @ 3:15 pm

10,000 years of Cascadia earthquakes

From the webpage:

The chart shows all 40 major earthquakes in the Cascadia Subduction Zone that geologists estimate have occurred since 9845 B.C. Scientists estimated the magnitude and timing of each quake by examining soil samples at more than 50 undersea sites between Washington, Oregon and California.

This chart is followed by:

Core sample sites 1999-2009

U.S. Geological Survey scientists studied undersea core samples of soil looking for turbidites — deposits of sediments that flow along the ocean floor during large earthquakes. The samples were gathered from more than 50 sites during cruises in 1999, 2002 and 2009.

Great maps but apparently one has nothing to do with the other.

If you mouse over the red dot closest to San Francisco, a pop-up says: “ID M9907-50BC Water Depth in Feet 10925.1972.” I suspect that may mean the water depth for the sample but without more, I can’t really say.

The fatal flaw of the presentation is that the data of the second map is disconnected from the first. There may be some relationship between the two but it isn’t evident in the current presentation.

A good example of how not to display data sets on the same subject.

Visual Information Theory

Filed under: Information Theory,Shannon,Visualization — Patrick Durusau @ 2:47 pm

Visual Information Theory by Christopher Olah.

From the post:

I love the feeling of having a new way to think about the world. I especially love when there’s some vague idea that gets formalized into a concrete concept. Information theory is a prime example of this.

Information theory gives us precise language for describing a lot of things. How uncertain am I? How much does knowing the answer to question A tell me about the answer to question B? How similar is one set of beliefs to another? I’ve had informal versions of these ideas since I was a young child, but information theory crystallizes them into precise, powerful ideas. These ideas have an enormous variety of applications, from the compression of data, to quantum physics, to machine learning, and vast fields in between.

Unfortunately, information theory can seem kind of intimidating. I don’t think there’s any reason it should be. In fact, many core ideas can be explained completely visually!

Great visualization of the central themes of information theory!
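
The core quantities behind the post’s pictures are Shannon entropy and its relatives (cross-entropy, KL divergence). A few lines of Python make the definitions concrete; the example distributions are arbitrary.

```python
from math import log2

def entropy(p):
    """H(p) = -sum p_i log2 p_i, in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average code length when events follow p but the code is built for q."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits paid for using the wrong code: H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

fair = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(fair))                 # 2.0 bits
print(entropy(skewed))               # ~1.357 bits
print(kl_divergence(skewed, fair))   # ~0.643 bits
```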

Plus an interesting aside at the end of the post:

Claude Shannon’s original paper on information theory, A Mathematical Theory of Communication, is remarkably accessible. (This seems to be a recurring pattern in early information theory papers. Was it the era? A lack of page limits? A culture emanating from Bell Labs?)

Cover & Thomas’ Elements of Information Theory seems to be the standard reference. I found it helpful.

Cover & Thomas’ Elements of Information Theory

I don’t find Shannon’s “accessibility” all that remarkable; he was trying to be understood. Once a field matures and develops an insider jargon, trying to be understood is no longer “professional.” Witness the lack of academic credit for textbooks and other explanatory material as opposed to jargon-laden articles that may or may not be read by anyone other than proofreaders.

Clojure Remote February 11-12, 2016 — Online

Filed under: Clojure,Conferences,Functional Programming,Programming — Patrick Durusau @ 1:58 pm

Clojure Remote February 11-12, 2016 — Online

Important Dates:

  • Oct. 15 — Early-bird admission starts.
  • Oct. 30 — CFP opens, Regular admission rate begins
  • Dec. 31 — CFP closes
  • Jan. 15 — Schedule released
  • Feb. 11, 12 — The Conference!

From the webpage:

This Winter, Homegrown Labs presents Clojure Remote—Clojure’s first exclusively Remote conference. Join us anywhere; from your home, your office, or the coffee shop.

Over two days, you’ll join hundreds of other Clojurists online via crowdcast.io to enjoy up to two tracks of beginner to intermediate Clojure talks.

Clojure Remote will be held February 11th and 12th, 2016 from 2:00 PM UTC – 9:00 pm UTC.

The conference will be broadcast via crowdcast.io, where attendees can:

  • View talks live
  • Ask & up-vote questions
  • And chat with fellow attendees.

Clojure Remote attendees will miss:

  • Delays and frustrations of airport security and missed connections
  • Wedging themselves into grade school size airline seats
  • Taxi transportation where drivers speak every language but yours
  • Disease producing dry air in hotels
  • Expenses that could have gone towards new hardware or books

but, for virtual conferences to make progress, sacrifices have to be made. 😉

True, virtual conferences do lack some of the randomness and “press the flesh” opportunities of physical conferences but CS has been slow to take up the advantages of more frequent but shorter virtual or online conferences.

Musical Genres Classified Using the Entropy of MIDI Files

Filed under: Music,Music Retrieval,Shannon — Patrick Durusau @ 1:35 pm

Musical Genres Classified Using the Entropy of MIDI Files (Emerging Technology from the arXiv, October 15, 2015)

Music analysis

Communication is the process of reproducing a message created at one point in space at another point in space. It has been studied in depth by numerous scientists and engineers but it is the mathematical treatment of communication that has had the most profound influence.

To mathematicians, the details of a message are of no concern. All that matters is that the message can be thought of as an ordered set of symbols. Mathematicians have long known that this set is governed by fundamental laws first outlined by Claude Shannon in his mathematical theory of communication.

Shannon’s work revolutionized the way engineers think about communication but it has far-reaching consequences in other areas, too. Language involves the transmission of information from one individual to another and information theory provides a window through which to study and understand its nature. In computing, data is transmitted from one location to another and information theory provides the theoretical bedrock that allows this to be done most efficiently. And in biology, reproduction can be thought of as the transmission of genetic information from one generation to the next.

Music too can be thought of as the transmission of information from one location to another, but scientists have had much less success in using information theory to characterize music and study its nature.

Today, that changes thanks to the work of Gerardo Febres and Klaus Jaffé at Simon Bolivar University in Venezuela. These guys have found a way to use information theory to tease apart the nature of certain types of music and to automatically classify different musical genres, a famously difficult task in computer science.

One reason why music is so hard to study is that it does not easily translate into an ordered set of symbols. Music often consists of many instruments playing different notes at the same time. Each of these can have various qualities of timbre, loudness, and so on.

Music viewed by its Entropy content: A novel window for comparative analysis by Gerardo Febres and Klaus Jaffe.

Abstract:

Texts of polyphonic music MIDI files were analyzed using the set of symbols that produced the Fundamental Scale (a set of symbols leading to the Minimal Entropy Description). We created a space to represent music pieces by developing: (a) a method to adjust a description from its original scale of observation to a general scale, (b) the concept of higher order entropy as the entropy associated to the deviations of a frequency ranked symbol profile from a perfect Zipf profile. We called this diversity index the “2nd Order Entropy”. Applying these methods to a variety of musical pieces showed how the space “symbolic specific diversity-entropy – 2nd order entropy” captures some of the essence of music types, styles, composers and genres. Some clustering around each musical category is shown. We also observed the historic trajectory of music across this space, from medieval to contemporary academic music. We show that description of musical structures using entropy allows to characterize traditional and popular expressions of music. These classification techniques promise to be useful in other disciplines for pattern recognition, machine learning, and automated experimental design for example.

The process simplifies the data stream, much like you choose which subjects you want to talk about in a topic map.

Purists will object but realize that objection is because they have chosen a different (and much more complex) set of subjects to talk about in the analysis of music.

The important point is to realize we are always choosing different degrees of granularity of subjects and their identifications, for some specific purpose. Change that purpose and the degree of granularity will change.
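
As a toy version of the idea — ignoring the paper’s fundamental-scale machinery and its “2nd Order Entropy” — here is the empirical entropy of a symbol stream computed from frequency counts. The “MIDI-as-text” streams below are invented; the point is only that a repetitive stream scores lower than a varied one.

```python
from collections import Counter
from math import log2

def symbol_entropy(symbols):
    """Empirical entropy (bits/symbol) of a sequence of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Invented "MIDI-as-text" streams: a repetitive riff vs. a more varied passage.
riff   = list("CEGC" * 16)
varied = list("CDEFGABcdefgab" * 4 + "CEG" * 6)

print(round(symbol_entropy(riff), 3))    # low: highly predictable symbol use
print(round(symbol_entropy(varied), 3))  # higher: more diverse symbol use
```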

How is NSA breaking so much crypto?

Filed under: Cybersecurity,Encryption,Security — Patrick Durusau @ 10:39 am

How is NSA breaking so much crypto? by Alex Halderman and Nadia Heninger.

From the post:

There have been rumors for years that the NSA can decrypt a significant fraction of encrypted Internet traffic. In 2012, James Bamford published an article quoting anonymous former NSA officials stating that the agency had achieved a “computing breakthrough” that gave them “the ability to crack current public encryption.” The Snowden documents also hint at some extraordinary capabilities: they show that NSA has built extensive infrastructure to intercept and decrypt VPN traffic and suggest that the agency can decrypt at least some HTTPS and SSH connections on demand.

However, the documents do not explain how these breakthroughs work, and speculation about possible backdoors or broken algorithms has been rampant in the technical community. Yesterday at ACM CCS, one of the leading security research venues, we and twelve coauthors presented a paper that we think solves this technical mystery.

The key is, somewhat ironically, Diffie-Hellman key exchange, an algorithm that we and many others have advocated as a defense against mass surveillance. Diffie-Hellman is a cornerstone of modern cryptography used for VPNs, HTTPS websites, email, and many other protocols. Our paper shows that, through a confluence of number theory and bad implementation choices, many real-world users of Diffie-Hellman are likely vulnerable to state-level attackers.

For the nerds in the audience, here’s what’s wrong: If a client and server are speaking Diffie-Hellman, they first need to agree on a large prime number with a particular form. There seemed to be no reason why everyone couldn’t just use the same prime, and, in fact, many applications tend to use standardized or hard-coded primes. But there was a very important detail that got lost in translation between the mathematicians and the practitioners: an adversary can perform a single enormous computation to “crack” a particular prime, then easily break any individual connection that uses that prime.

How enormous a computation, you ask? Possibly a technical feat on a scale (relative to the state of computing at the time) not seen since the Enigma cryptanalysis during World War II. Even estimating the difficulty is tricky, due to the complexity of the algorithm involved, but our paper gives some conservative estimates. For the most common strength of Diffie-Hellman (1024 bits), it would cost a few hundred million dollars to build a machine, based on special purpose hardware, that would be able to crack one Diffie-Hellman prime every year.

Whether you prefer the blog summary or the heavier sledding of Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice, this is a must read.

This paper should provide a significant push towards better encryption techniques but also serve as a warning that no encryption method is absolute.

Implementations, users, advances in technology and techniques, resources, all play roles in determining the security of any particular encryption technique.
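
The attack targets the discrete-log precomputation for a widely shared prime, not the protocol’s logic. A toy (completely insecure, tiny-prime) Diffie-Hellman exchange shows where that shared prime sits:

```python
import random

# Toy parameters -- a real deployment would use a 2048-bit prime; the whole
# point of the paper is that many deployments share the same 1024-bit prime.
p = 23      # publicly shared prime (tiny here, so it is trivially breakable)
g = 5       # public generator

a = random.randrange(2, p - 1)   # Alice's secret
b = random.randrange(2, p - 1)   # Bob's secret

A = pow(g, a, p)                 # Alice sends g^a mod p
B = pow(g, b, p)                 # Bob sends g^b mod p

shared_alice = pow(B, a, p)      # (g^b)^a mod p
shared_bob   = pow(A, b, p)      # (g^a)^b mod p
assert shared_alice == shared_bob

# An attacker who has precomputed discrete logs for this particular p can
# recover a from A -- brute force is trivial for p = 23; the paper's point is
# that a nation-state can afford the analogous precomputation for one
# standardized 1024-bit prime and then break every connection that uses it.
a_recovered = next(x for x in range(1, p) if pow(g, x, p) == A)
assert pow(B, a_recovered, p) == shared_alice
```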

Yet Another Flash Advisory (YAFA) [Patch Due 19 October 2015]

Filed under: Cybersecurity,Security — Patrick Durusau @ 10:07 am

Adobe issues advisory for Flash vulnerability targeting government agencies by Doug Olenick.

From the post:

Adobe has issued a security advisory for an Adobe Flash Player zero-day exploit being used by the folks behind the Pawn Storm cyber espionage campaign to target foreign ministries worldwide.

The critical vulnerability (CVE-2015-7645) has been identified in Adobe Flash Player version 19.0.0.207 and earlier for Windows, Macintosh and Linux. The company expects to issue an update for the vulnerability during the week of Oct. 19. Adobe said in its advisory that a successful exploit could allow the attacker to take control of a vulnerable system.

Adobe is aware that the exploit is being used in limited targeted attacks.

Depending upon your target(s), don’t take the projected patch date too seriously.

The 2015 NTT Group Global Threat Intelligence Report reports that 76% of the vulnerabilities in its report were over two years old, and 9% were more than ten years old.

I didn’t find data on the patch-application curve for Adobe Flash. Assume a bump at release plus thirty days, after which the curve falls off rather steeply.

If you are defending against this latest in a series of Flash vulnerabilities, disable and then de-install Adobe Flash. That is the only long term “patch” known to cure all known and unknown Flash vulnerabilities. Plus it saves IT resources for some purpose other than patching bugware.

October 14, 2015

The Refreshingly Rewarding Realm of Research Papers

Filed under: Computer Science,Research Methods — Patrick Durusau @ 9:20 pm

From the description:

Sean Cribbs teaches us how to read and implement research papers – and translate what they describe into code. He covers examples of research implementations he’s been involved in and the relationships he’s built with researchers in the process.

A bit longer description at: http://chicago.citycode.io/sean-cribbs.html

Have you ever run into a thorny problem that makes your code slow or complicated, for which there is no obvious solution? Have you ever needed a data structure that your language’s standard library didn’t provide? You might need to implement a research paper!

While much of research in Computer Science doesn’t seem relevant to your everyday web application, all of those tools and techniques you use daily originally came from research! In this talk we’ll learn why you might want to read and implement research papers, how to read them for relevant information, and how to translate what they describe into code and test the results. Finally, we’ll discuss examples of research implementation I’ve been involved in and the relationships I’ve built with researchers in the process.

As you might imagine, I think this rocks!

Neo4j 2.3 RC1 is out!

Filed under: Graphs,Neo4j — Patrick Durusau @ 8:57 pm

I saw a tweet from Michael Hunger saying Neo4j 2.3 RC1 is out.

For development only – check here.

Comment early and often!

October 13, 2015

Rodeo 1.0: a Python IDE on your Desktop

Filed under: Programming,Python — Patrick Durusau @ 7:05 pm

Rodeo 1.0: a Python IDE on your Desktop by Greg.

From the post:

When we released our in-browser IDE for Python earlier this year, we couldn’t believe the response. Thousands of our readers all over the world saddled up and told their friends and colleagues to do the same (no more puns, we promise).

That reaction, as well as the endless search for hacks to make our lives easier, got us thinking about how to make Rodeo even better. Over the past few months, we’ve been working on Rodeo 1.0, a version of Rodeo that runs right on your desktop. Download the installers for Windows, OS X, or Linux here.

Something new for Python readers!

I grabbed the 64-bit version for Linux and will install it tomorrow.

Enjoy!

Data Journalism Tools

Filed under: Journalism,News,Reporting,Researchers — Patrick Durusau @ 6:48 pm

Data Journalism Tools

From the webpage:

This Silk is a structured database listing tools and resources that (data) journalists might want to include in their toolkit. We tried to cover the main steps of the ddj process: from data collection and scraping to data cleaning and enhancement; from analysis to data visualization and publishing. We’re trying to showcase especially tools that are free/freemium and open source, but you will find a bit of everything.

This Silk is updated regularly: we have collected a list of hundreds of tools, which we manually tag (are they open source tools? Free? for interactive datavizs?). Make sure you follow this Silk, so you won’t miss an update!

As of 13 October 2015, there are 120 tools listed.

Graphics have a strong showing but not overly so. There are tools for collaboration, web scraping, writing, etc.

Pitched toward journalists but librarians, researchers, bloggers, etc., will all find tools of interest at this site.

Researchers say SHA-1 will soon be broken… [Woe for OPM’s Caesar Cipher]

Filed under: Cryptography,Cybersecurity,Security — Patrick Durusau @ 2:40 pm

Researchers say SHA-1 will soon be broken, urge migration to SHA-2 by Teri Robinson.

In as little as three short months, the SHA-1 internet security standard used for digital signatures and set to be phased out by January 2017, could be broken by motivated hackers, a team of international researchers found, prompting security specialists to call for a ramping up of the migration to SHA-2.

“We just successfully broke the full inner layer of SHA-1,” Marc Stevens of Centrum Wiskunde & Informatica in the Netherlands, one of the cryptanalysts that tested the standard, said in a release. Stevens noted that the cost of exploiting SHA-1 has dropped enough to make it affordable to every day hackers. The researchers explained that in 2012 computer security and privacy specialist Bruce Schneier predicted that the cost of a SHA-1 attack would drop to $700,000 in 2015 and would decrease to an affordable $173,000 or so in 2018.

But the prices fell–and the opportunity rose–more quickly than predicted. “We now think that the state-of-the-art attack on full SHA-1 as described in 2013 may cost around 100,000 dollar renting graphics cards in the cloud,” said Stevens.

The silver lining in this dark cloud is that “every day hackers” can afford to spend “around $100,000 renting graphics cards in the cloud” to break SHA-1 encryption.

I had no idea that “every day hackers” had that sort of cash flow.

Certainly something that should be mentioned at the next career day at local high schools and when recruiting for college CS programs. 😉

Depending on your interests, the even brighter silver lining will be the continued use and even upgrade to SHA-1, such as with the OPM (Office of Personnel Management), long after the graphic card rental price has broken into the three digit range.
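
For code you control, the hash swap itself is usually the easy part; coordinating with everything that consumes the digests is the hard part. In Python the change is literally one word:

```python
import hashlib

data = b"message to be signed"

legacy = hashlib.sha1(data).hexdigest()     # 160-bit digest, being retired
current = hashlib.sha256(data).hexdigest()  # SHA-2 family replacement

print(legacy)
print(current)
```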

How to teach gerrymandering…

Filed under: Government,Mathematics — Patrick Durusau @ 2:26 pm

How to teach gerrymandering and its many subtle, hard problems by Cory Doctorow.

From the post:

Ben Kraft teaches a unit on gerrymandering — rigging electoral districts to ensure that one party always wins — to high school kids in his open MIT Educational Studies Program course. As he describes the problem and his teaching methodology, I learned that district-boundaries have a lot more subtlety and complexity than I’d imagined at first, and that there are some really chewy math and computer science problems lurking in there.

Kraft’s pedagogy is lively and timely and extremely relevant. It builds from a quick set of theoretical exercises and then straight into contemporary, real live issues that matter to every person in every democracy in the world. This would be a great unit to adapt for any high school civics course — you could probably teach it in middle school, too.

Certainly timely considering that congressional elections are ahead (in the United States) in 2016.
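
A tiny worked example makes the “same votes, different seats” point that drives the unit. The precinct numbers below are invented; only the arithmetic matters.

```python
# Nine equal-sized precincts; True = precinct leans toward party X.
precincts = [True, True, True, True, True, False, False, False, False]  # X carries 5 of 9

def seats_for_x(districting):
    """Count districts (lists of precinct indexes) where party X holds a majority."""
    wins = 0
    for district in districting:
        x_precincts = sum(precincts[i] for i in district)
        if x_precincts > len(district) / 2:
            wins += 1
    return wins

# The same nine precincts, carved into three 3-precinct districts two different ways.
proportional_map = [[0, 1, 5], [2, 3, 6], [4, 7, 8]]  # X wins 2 of 3 seats
packed_map       = [[0, 1, 2], [3, 5, 6], [4, 7, 8]]  # X packed into one district

print(seats_for_x(proportional_map))  # 2 -- roughly tracks X's 5/9 vote share
print(seats_for_x(packed_map))        # 1 -- the majority party loses the map
```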

Also a reminder that in real life situations, mathematics, algorithms, computers, etc., are never neutral.

The choices you make determine who will serve and who will eat.

It was ever thus and those who pretend otherwise are trying to hide their hand on the scale.

Tomas Petricek on The Against Method

Filed under: Language,Science,Scientific Computing — Patrick Durusau @ 1:57 pm

Tomas Petricek on The Against Method by Tomas Petricek.

From the webpage:

How is computer science research done? What we take for granted and what we question? And how do theories in computer science tell us something about the real world? Those are some of the questions that may inspire computer scientist like me (and you!) to look into philosophy of science. I’ll present the work of one of the more extreme (and interesting!) philosophers of science, Paul Feyerabend. In “Against Method”, Feyerabend looks at the history of science and finds that there is no fixed scientific methodology and the only methodology that can encompass the rich history is ‘anything goes’. We see (not only computer) science as a perfect methodology for building correct knowledge, but is this really the case? To quote Feyerabend:

“Science is much more ‘sloppy’ and ‘irrational’ than its methodological image.”

I’ll be mostly talking about Paul Feyerabend’s “Against Method”, but as a computer scientist myself, I’ll insert a number of examples based on my experience with theoretical programming language research. I hope to convince you that looking at philosophy of science is very much worthwhile if we want to better understand what we do and how we do it as computer scientists!

The video runs an hour and about eighteen minutes but is worth every minute of it. As you can imagine, I was particularly taken with Tomas’ emphasis on the importance of language. Tomas goes so far as to suggest that disagreements about “type” in computer science stem from fundamentally different understandings of the word “type.”

I was reminded of Stanley Fish‘s “Doing What Comes Naturally” (DWCN).

DWCN is a long and complex work, but in brief Fish argues that we are all members of various “interpretive communities,” and that each of those communities influences how we understand language as readers. That should be some assurance to those who fear intellectual anarchy and chaos, because our interpretations are always made within the context of an interpretive community.

Two caveats on Fish. As far as I know, Fish has never made the strong move and pointed out that his concept of “interpretive communities” is just as applicable to the natural sciences as it is to the social sciences. What passes as “objective” today is part and parcel of an interpretive community that has declared it so. Other interpretive communities can and do reach other conclusions.

The second caveat is more sad than useful. Post-9/11, Fish and a number of other critics who were accused of teaching cultural relativity of values felt it necessary to distance themselves from that position. While they could not say that all cultures have the same values (factually false), they did say that Western values, as opposed to those of “cowardly, murdering,” etc. others, were superior.

If you think there is any credibility to that post-9/11 position, you haven’t read enough Chomsky. 9/11 wasn’t 1/100,000th of the violence the United States has visited on civilians in other countries since the Korean War.

October 12, 2015

Data Portals

Filed under: Open Data — Patrick Durusau @ 7:59 pm

Data Portals

From the webpage:

A Comprehensive List of Open Data Portals from Around the World

Two things spring to mind:

First, the number of portals seems a bit lite given the rate of data accumulation.

Second, take a look at the geographic distribution of data portals. Asia and Northern Africa seem rather sparse don’t you think?
