Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 20, 2014

Chinese Open Access

Filed under: Open Access — Patrick Durusau @ 10:19 am

Chinese agencies announce open-access policies by Richard Van Noorden. (Nature | doi:10.1038/nature.2014.15255)

From the post:

China has officially joined the international push to make research papers free to read. On 15 May, the National Natural Science Foundation of China (NSFC), one of the country’s major basic-science funding agencies, and the Chinese Academy of Sciences (CAS), which funds and conducts research at more than 100 institutions, announced that researchers they support should deposit their papers into online repositories and make them publicly accessible within 12 months of publication.

That’s certainly good news for data aggregation providers, using topic maps or no, but Richard’s post closes on an odd note:

Another unresolved issue is whether the 12-month grace period should also apply to articles in the social sciences and humanities — which are within the CAS’s purview but face “special challenges”, Zhang says.

Richard identifies Zhang as:

Xiaolin Zhang, director of the National Science Library at the CAS in Beijing, says that another major research-funding agency, the national ministry of science and technology, is also researching open-access policies.

Perhaps Richard will follow up with Zhang on what he means by “special challenges” for articles in the social sciences and the humanities.

I can’t imagine anyone thinking they are likely to obtain a patent based on social science or humanities research.

Open access would increase the opportunity for people outside of major academic institutions to read the latest social science and humanities research. I don’t see the downside to greater access to such articles.

Do you?

May 19, 2014

The Internet Is Obsessed With Maps…

Filed under: Humor,Maps — Patrick Durusau @ 5:58 pm

The Internet Is Obsessed With Maps — Here’s Why It’s Gone Too Far by Mike Nudelman and Christina Sterbenz.

From the post:


“There’s something about maps that’s really authoritative and hard to question — we’re so used to seeing them …. But the more popular something becomes, the more people try to duplicate it without the expertise,” Fanning explained.

Condensing complex data into an easily and quickly digestible package often leads to oversimplification or, worse, misinformation. That becomes especially problematic when the map is viewed and shared far from its original context.

Of course, the maps in question are all visual but you could easily represent some topic maps as visual maps and capture that same sense of authority.

Have you faced issues of “oversimplification” or “misinformation” from use of a topic map? Any impact from a map being removed from its original context?

BTW, correction to the first map in the post. Tulane isn’t the “most desirable” college in Louisiana. LSU Baton Rouge is the “most desirable” college in Louisiana. 😉

Mastering Clojure Data Analysis Danger!

Filed under: Clojure,Programming — Patrick Durusau @ 4:35 pm

Mastering Clojure Data Analysis by Eric Rochester.

I marked this with Danger! because if you follow the download link in the video you will see:

Premium Content

Please complete an offer below to unlock the download link

The offers are very easy and take only a few minutes to complete.
Your support helps us to provide more premium content. Thanks!!!

Get your auto insurance quote today!
Get Free Bathroom Samples
Get free samples for your pet!
Get 4 Tickets to the Movie Theater of your choice!
Get the brand new Xbox One!

Wondering how harmful “Get Free Bathroom Samples” could be, I followed that one.

Here is what I found:

By entering your email address and clicking “submit”, you agree to receive emails from Lifescript and/or trusted third parties containing promotions and other special offers and that Lifescript may provide your email address and corresponding information to such parties. If you do not wish to continue receiving such emails, you may unsubscribe at any time.

* No purchase necessary. In order to qualify for the Free Samples you must complete the Lifescript Advantage registration page providing your name, address, gender, date of birth and email address and then review a series of offers. Upon completion of the review of the offers, you will have the opportunity to choose from a selection of free samples. You may choose as many as you like. When you “click to redeem” a free sample, you will be directed to a third-party website and may be requested to provide information or take other actions (for example, it may request that you “Like” a Facebook page)….

(emphasis added)

The only upside is that they don’t ask for your bank account and routing information. 😉

Avoid this scam and go to Mastering Clojure Data Analysis at Packt Publishing. Neither a TOC nor sample chapters are available at this time, but it is a worthy topic.

Who would you report such a scam to? Suggestions?

Game development in Clojure (with play-clj)

Filed under: Clojure,Games,Programming — Patrick Durusau @ 3:56 pm

Uses Light Table, so you will be getting an introduction to Light Table as well.

If you think about it, enterprise searches are very much treasure hunt adventures with poor graphics and no avatars. 😉

The Weird and Wonderful Characters of Clojure

Filed under: Clojure,Searching — Patrick Durusau @ 2:33 pm

The Weird and Wonderful Characters of Clojure by James Hughes.

From the post:

A reference collection of characters used in Clojure that are difficult to “google”. Descriptions sourced from various blogs, StackOverflow, Learning Clojure and the official Clojure docs – sources attributed where necessary. Use CTRL-F “Character: …” to search or type the symbols into the box below. Sections not in any particular order but related items are grouped for ease. If I’m wrong or missing anything worthy of inclusion tweet me @kouphax or mail me at james@yobriefca.se.

Definitely a candidate for your browser toolbar!

I first saw this in a tweet by Daniel Higginbotham.

May 18, 2014

Yahoo Betting on Apache Hive, Tez, and YARN

Filed under: Hadoop YARN,Hive,Tez — Patrick Durusau @ 8:01 pm

Yahoo Betting on Apache Hive, Tez, and YARN

With the usual caveats about test results:

On the other hand, Hive 0.13 query execution times were not only significantly better at higher volumes of data (Fig 3 and 4) but also executed successfully without failing. In our comparisons and observations with Shark, we saw most queries fail with the larger (10TB) dataset. These same queries ran successfully and much faster on Hive 0.13, allowing for better scale. This was extremely critical for us, as we needed a single query and BI solution on the Hadoop grid regardless of dataset size. The Hive solution resonates with our users, as they do not have to worry about learning multiple technologies and discerning which solution to use when. A common solution also results in cost and operational efficiencies from having to build, deploy, and maintain a single solution.

Successful 10TB query times and results should be enough to get your attention. Not that many of us have data in that range, today, but tomorrow, who can say?

Enjoy!

I first saw this in a tweet by Joshua Lande.

Lying to the Supreme Court?

Filed under: NSA,Security — Patrick Durusau @ 7:50 pm

Everyone should know just how much the government lied to defend the NSA by Trevor Timm.

From the post:

If you blinked this week, you might have missed the news: two Senators accused the Justice Department of lying about NSA warrantless surveillance to the US supreme court last year, and those falsehoods all but ensured that mass spying on Americans would continue. But hardly anyone seems to care – least of all those who lied and who should have already come forward with the truth.

Here’s what happened: just before Edward Snowden became a household name, the ACLU argued before the supreme court that the Fisa Amendments Act – one of the two main laws used by the NSA to conduct mass surveillance – was unconstitutional.

In a sharply divided opinion, the supreme court ruled, 5-4, that the case should be dismissed because the plaintiffs didn’t have “standing” – in other words, that the ACLU couldn’t prove with near-certainty that their clients, which included journalists and human rights advocates, were targets of surveillance, so they couldn’t challenge the law. As the New York Times noted this week, the court relied on two claims by the Justice Department to support their ruling: 1) that the NSA would only get the content of Americans’ communications without a warrant when they are targeting a foreigner abroad for surveillance, and 2) that the Justice Department would notify criminal defendants who have been spied on under the Fisa Amendments Act, so there exists some way to challenge the law in court.

It turns out that neither of those statements were true – but it took Snowden’s historic whistleblowing to prove it.

See Trevor’s piece for the details.

There is one upside to this outrage.

Would you want to be representing the Justice Department the next time it appears before the Supreme Court?

Whatever semantic games the Justice Department wants to play over whether it “lied” or simply didn’t reveal classified information, the bottom line is that the Justice Department deliberately lied to the Supreme Court.

You do know Rule #1 is to never knowingly suborn perjury. Right? Well, Rule #2 is to never lie to a judge. If you lose credibility with the court, that ends any effective representation on your part.

Of all the damage done by the national security mania that started under Bush II and continued under Obama, destroying the minimal standards of decency and trust among the three branches of government has been the worst. Certainly the three branches can disagree, and that is part of the checks and balances system. But to lose trust in one or more of the other branches, that is a very serious loss indeed.

It may not be too late for Congress, along with the Supreme Court, to find and excise the national security cancer that lies at the heart of the executive branch of government. Here’s to hoping they don’t wait too much longer.

“Dear Piece of Shit…”

Filed under: Patents — Patrick Durusau @ 2:43 pm

“Dear piece of shit…” Life360 CEO sends a refreshingly direct response to a patent troll by Paul Carr.

From the post:

Tale as old as time. Now that family social network Life360 is firmly established in the big leagues — with 33m registered families, and having raised $50m last week from ADT — it was inevitable that the patent trolls would come calling.

But where most CEOs are happy to let their lawyers set the tone of how they respond, Life360′s Chris Hulls has a more, uh, refreshing approach.

When Hulls received a letter from an attorney acting for Florida-based Advanced Ground Information Systems, inviting Life360 to “discuss” a “patent licensing agreement” for what AGIS claims is its pre-existing patent for displaying markers of people on a map, he decided to bypass his own attorneys and instead send an email reply straight out of David Mamet…

Dear Piece of Shit,…

Paul’s account of the demand by the patent troll Advanced Ground Information Systems and the response of Life360 CEO Chris Hulls is a masterpiece!

But it left me wondering, ok, so Life360 is stepping up to the plate, is there anything the rest of us can do other than cheer?

Not to discount the value of cheering but cheering is less filling and less satisfying than taking action.

The full complaint is here and the “Dear Piece of Shit” response appears in paragraph 11 of the complaint. The fact finder will be able to conclude that the “Dear Piece of Shit” response was sent; whether the court will take evidence on the plaintiff being a “piece of shit” remains unclear.

Let’s think about how to support Life360 as social network/graph people.

First, we all know about the six degrees of Kevin Bacon. Assuming that is generally true, that means someone reading this blog post is six degrees or less away from someone who is acting for or on behalf of Advanced Ground Information Systems (AGIS). Yes?

From the complaint we can identify the following people for AGIS:

  • Malcolm K. “Cap” Beyer, Jr. (paragraph 9 of the complaint)
  • Ury Fischer, Florida Bar No. 048534, E-mail: ufischer@lottfischer.com
  • Adam Diamond, Florida Bar No. 091008, E-mail: adiamond@lottfischer.com
  • Mark A. Hannemann, New York Bar No. 2770709, E-mail: mhannemann@kenyon.com
  • Thomas Makin, New York Bar No. 3953841, E-mail: tmakin@kenyon.com
  • Matthew Berkowitz, New York Bar No. 4397899, E-mail: mberkowitz@kenyon.com
  • Rose Cordero Prey, New York Bar No. 4326591, E-mail: rcordero@kenyon.com
  • Anne Elise Li, New York Bar No. 4480497, E-mail: ali@kenyon.com
  • Vincent Rubino, III, New York Bar No. 4557435, E-mail: vrubino@kenyon.com

Everyone with a “Bar No” is counsel for AGIS.

All that information appears in the public record of the pleadings filed on behalf of AGIS.

What isn’t known is who else works for AGIS.

Or who has connections to people who work for AGIS.

Obviously no one should contact or harass anyone in connection with a pending lawsuit, civil or criminal.

On the other hand, everyone within six degrees of separation of those acting on behalf of AGIS, retain their freedom of association rights.

Or should I say, their freedom of disassociation rights? Much in the same way that such rights were exercised concerning J. Bruce Ismay.

The USPTO, which recently issued a patent for taking a photograph against a white background, isn’t going to help fix the patent system.

Lawyers seeking:

C. An award to Plaintiff of the damages to which it is entitled under at least 35 U.S.C. § 284 for Defendant’s past infringement and any continuing or future infringement, including both compensatory damages and treble damages for defendants’ willful infringement;

D. A judgment and order requiring defendants to pay the costs of this action (including all disbursements), as well as attorneys’ fees;

aren’t going to fix the patent system.

Lawyers advising victims of patent troll litigation aren’t going to fix the patent system because settling is cheaper. It’s just a question of which costs more money, settlement or litigation. Understandable, but that leaves the trolls free to harass others.

If anyone is going to fix it, it will have to be victims like Life360 along with the lawful support of all non-trolls in the IP universe.

PS: If you have legal analysis or evidence that would be relevant to invalidation of the patents in question, please don’t hesitate to speak up.

12 Free (as in beer) Data Mining Books

12 Free (as in beer) Data Mining Books by Chris Leonard.

While all of these volumes could be shelved under “data mining” in a bookstore, I would break them out into smaller categories:

  • Bayesian Analysis/Methods
  • Data Mining
  • Data Science
  • Machine Learning
  • R
  • Statistical Learning

Didn’t want you to skip over Chris’ post because it was “just about data mining.” 😉

Check your hard drive to see what you are missing.

I first saw this in a tweet by Carl Anderson.

May 17, 2014

Banksy on Advertising

Filed under: Advertising — Patrick Durusau @ 7:36 pm

Banksy on Advertising

Gauge your own tolerance for risk before following Banksy.

I think the RIAA and others win because our individual tolerance for risk is so low.

We want to protest, take chances, etc., but you know, we might offend a potential future employer or one of their clients or some government wonk.

So long as a majority of us feel that way, the revolution is going to be delayed.

That suggests a solution to me.

You?

PS: Don’t take my RIAA example the wrong way. I think artists and others who contribute to the creative process should be compensated. The record industry with its executives and sycophants, etc., not so much. Music thrives in spite of the recording industry, not because of it.

Workload Matters: Why RDF Databases Need a New Design

Filed under: LOD,RDF,RDFa,Semantic Web — Patrick Durusau @ 7:23 pm

Workload Matters: Why RDF Databases Need a New Design by Güneş Aluç, M. Tamer Özsu, and Khuzaima Daudjee.

Abstract:

The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF is becoming widely utilized, RDF data management systems are being exposed to more diverse and dynamic workloads. Existing systems are workload-oblivious, and are therefore unable to provide consistently good performance. We propose a vision for a workload-aware and adaptive system. To realize this vision, we re-evaluate relevant existing physical design criteria for RDF and address the resulting set of new challenges.

The authors establish that RDF data management systems are in need of better processing models. However, they mention a “prototype” only in their conclusion and offer no evidence concerning their possible alternatives for RDF processing.

I don’t doubt the need for better RDF processing but I would think the first step would be to determine the goals of RDF processing, separate and apart from the RDF model.

Simply because we conceptualize data as being encoded in “triples,” does not mean that computers must process them as “triples.” They can if it is advantageous but not if there are better processing models.
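To make that concrete, here is a toy sketch of my own (not the authors’ prototype) of one well-known alternative physical design, vertical partitioning: the interface still speaks triples, but the storage is one two-column table per predicate, so a query that binds its predicate touches a single partition.

  from collections import defaultdict

  class PartitionedStore:
      def __init__(self):
          # predicate -> list of (subject, object) pairs
          self.by_predicate = defaultdict(list)

      def add(self, s, p, o):
          self.by_predicate[p].append((s, o))

      def match(self, s=None, p=None, o=None):
          # None acts as a wildcard, as in a SPARQL triple pattern
          preds = [p] if p is not None else list(self.by_predicate)
          for pred in preds:
              for subj, obj in self.by_predicate[pred]:
                  if (s is None or subj == s) and (o is None or obj == o):
                      yield (subj, pred, obj)

  store = PartitionedStore()
  store.add("ex:alice", "foaf:knows", "ex:bob")
  store.add("ex:alice", "foaf:name", "Alice")
  print(list(store.match(p="foaf:knows")))  # touches only the foaf:knows partition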

I first saw this in a tweet by Olaf Hartig.

Mapping Kidnappings in Nigeria (Updated)

Filed under: News,Reporting — Patrick Durusau @ 6:54 pm

Mapping Kidnappings in Nigeria (Updated) by Mona Chalabi.

From the post:

Editor’s note (May 16, 3:35 p.m.): This article contains many errors, some of them fundamental to the analysis.

The article repeatedly refers to the number and location of kidnappings. But the Global Database of Events, Language and Tone (GDELT) — the data source for the article — is a repository of media reports, not discrete events. As such, we should only have referred to “media reports of kidnappings,” not kidnappings.

This mistake led to other problems.

We should not have published an animated map showing “kidnappings” over time, or even “media reports of kidnappings” over time. Because we have no data on actual kidnappings, showing a time series requires normalizing the data to account for the increasing number of media reports overall. Thus, showing individual media reports is a mistake. The second map, showing “Kidnapping rate per 100,000 people, 1982-present,” has the same flaw.

This is a good example of why you should have a high degree of confidence in FiveThirtyEight.

Yes, the blog post admits to a number of errors but you should also note:

FiveThirtyEight made the correction before the original article. You can’t see the misinformation without seeing the correction.

FiveThirtyEight did not spend days or weeks in denial, only to have to confess in the end to being wrong. (Any recent American President would be a study in contrast.)

FiveThirtyEight tells us what went wrong. Good for them and us because now we are both aware of that type of error.

In the unlikely event that you should ever make a public mistake, ;-), please consider following the example of FiveThirtyEight.

I first saw this in a tweet by Christopher Phipps.

Types, and two approaches to problem solving

Filed under: Computer Science,Problem Solving,Types — Patrick Durusau @ 6:39 pm

Types, and two approaches to problem solving by Dan Piponi.

From the post:

There are two broad approaches to problem solving that I see frequently in mathematics and computing. One is attacking a problem via subproblems, and another is attacking a problem via quotient problems. The former is well known though I’ll give some examples to make things clear. The latter can be harder to recognise but there is one example that just about everyone has known since infancy.

I don’t want to spoil Dan’s surprise so all I can say is go read the post!

An intuitive appreciation for types may help you with the full monty of types.

Create Dataset of Users from the Twitter API

Filed under: Python,Tweets — Patrick Durusau @ 6:24 pm

Create Dataset of Users from the Twitter API by Ryan Swanson.

From the post:

This project provides an example of using python to pull user data from Twitter.

This project will create a dataset of the top 1000 twitter users for any given search topic.

As written, the project returns these values:

  1. handle – twitter username | string
  2. name – full name of the twitter user | string
  3. age – number of days the user has existed on twitter | number
  4. numOfTweets – number of tweets this user has created (includes retweets) | number
  5. hasProfile – 1 if the user has created a profile description, 0 otherwise | boolean
  6. hasPic – 1 if the user has setup a profile pic, 0 otherwise | boolean
  7. numFollowing – number of other twitter users, this user is following | number
  8. numOfFavorites – number of tweets the user has favorited | number
  9. numOfLists – number of public lists this user has been added to | number
  10. numOfFollowers – number of other users following this user | number

You need to read the Twitter documentation if you want to extend this project to capture other values, such as a list of followers or who someone is following (important for sketching communities), or for tracing tweets/retweets across a community.
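If you want to see the shape of such a pull before opening the project, here is a minimal sketch of the same idea using the tweepy library. This is not Ryan’s code; the credentials are placeholders and the field names mirror the list above.

  import tweepy
  from datetime import datetime

  # Placeholder credentials: substitute your own OAuth keys.
  auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
  auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
  api = tweepy.API(auth)

  rows = []
  for user in api.search_users("clojure"):  # first page of users matching the topic
      rows.append({
          "handle": user.screen_name,
          "name": user.name,
          "age": (datetime.utcnow() - user.created_at).days,
          "numOfTweets": user.statuses_count,
          "hasProfile": 1 if user.description else 0,
          # default_profile_image means the user never set a custom picture
          "hasPic": 0 if user.default_profile_image else 1,
          "numFollowing": user.friends_count,
          "numOfFavorites": user.favourites_count,
          "numOfLists": user.listed_count,
          "numOfFollowers": user.followers_count,
      })

  # For community sketching, api.followers_ids(screen_name=...) returns
  # the IDs needed to trace who follows whom.
  print(rows[:3])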

Enjoy!

Building a Recipe Search Site…

Filed under: ElasticSearch,Lucene,Search Engines,Solr — Patrick Durusau @ 4:32 pm

Building a Recipe Search Site with Angular and Elasticsearch by Adam Bard.

From the post:

Have you ever wanted to build a search feature into an application? In the old days, you might have found yourself wrangling with Solr, or building your own search service on top of Lucene — if you were lucky. But, since 2010, there’s been an easier way: Elasticsearch.

Elasticsearch is an open-source storage engine built on Lucene. It’s more than a search engine; it’s a true document store, albeit one emphasizing search performance over consistency or durability. This means that, for many applications, you can use Elasticsearch as your entire backend. Applications such as…

Think of this as a snapshot of the capabilities of most search solutions.

Which makes this a great baseline for answering the question: What does your app do that Elasticsearch + Angular cannot?

That’s a serious question.

Responses that don’t count include:

  1. My app is written in the Linear B programming language.
  2. My app uses a Post-Pre-NOSQL DB engine.
  3. My app will bring freedom and health to the WWW.
  4. (insert your reason)

You can say all those things if you like, but the convincing point for users is going to be exceeding their expectations about current solutions.

Do the best you can with Elasticsearch and Angular and use that as your basepoint for comparison.
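To make the baseline tangible, here is a minimal sketch (assuming an Elasticsearch node on localhost:9200) of indexing and searching one recipe over the plain REST API. Everything beyond this is where your app has to earn its keep.

  import json
  import requests

  ES = "http://localhost:9200"  # assumes a local Elasticsearch node

  # Index one recipe; refresh=true makes it searchable immediately.
  doc = {"title": "Garlic Soup", "ingredients": ["garlic", "stock", "bread"]}
  requests.put(ES + "/recipes/recipe/1?refresh=true", data=json.dumps(doc))

  # Full-text search on the ingredients field.
  query = {"query": {"match": {"ingredients": "garlic"}}}
  r = requests.post(ES + "/recipes/recipe/_search", data=json.dumps(query))
  for hit in r.json()["hits"]["hits"]:
      print(hit["_score"], hit["_source"]["title"])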

The Algebra of Algebraic Data Types

Filed under: Algebra,Data Types — Patrick Durusau @ 3:52 pm

Chris Taylor has a series of posts that correspond to a talk he gave in London (November 2012), video on YouTube and slides on Github.

Part 1.

Part 2.

Part 3.

Suggest you read the blog posts first and then follow the slides while listening to the video.

If you have been wondering about types in Haskell, this is a golden opportunity.
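For a taste of where the name comes from: counting the inhabitants |T| of a type T turns the type constructors into arithmetic, which is the correspondence Chris works out.

  |Void|        = 0                (no values)
  |()|          = 1                (exactly one value)
  |Bool|        = 1 + 1 = 2        (sums add)
  |Either a b|  = |a| + |b|
  |(a, b)|      = |a| * |b|        (products multiply)
  |Maybe a|     = 1 + |a|
  |a -> b|      = |b| ^ |a|        (functions exponentiate)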

May 16, 2014

APIs for Scholarly Resources

Filed under: Data,Library — Patrick Durusau @ 7:58 pm

APIs for Scholarly Resources

From the webpage:

APIs, short for application programming interface, are tools used to share content and data between software applications. APIs are used in a variety of contexts, but some examples include embedding content from one website into another, dynamically posting content from one application to display in another application, or extracting data from a database in a more programmatic way than a regular user interface might allow.

Many scholarly publishers, databases, and products offer APIs to allow users with programming skills to more powerfully extract data to serve a variety of research purposes. With an API, users might create programmatic searches of a citation database, extract statistical data, or dynamically query and post blog content.

Below is a list of commonly used scholarly resources at MIT that make their APIs available for use. If you have programming skills and would like to use APIs in your research, use the table below to get an overview of some available APIs.

If you have any questions or know of an API you would like to see included in this list, please contact Mark Clemente, Library Fellow for Scholarly Publishing and Licensing in the MIT Libraries (contact information at the bottom of this page).

A nice listing of scholarly resources with public APIs and your opportunity to contribute back to this listing with APIs that you discover.
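As a small illustration of what a public scholarly API buys you over a search box, here is a sketch against one such service, CrossRef’s REST API (no key required; the query term is arbitrary):

  import requests

  r = requests.get("http://api.crossref.org/works",
                   params={"query": "topic maps", "rows": 5})
  for item in r.json()["message"]["items"]:
      titles = item.get("title") or ["(untitled)"]
      print(item.get("DOI", "?"), "-", titles[0])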

Sadly, as far as I know (subject to your corrections), the ACM Digital Library has no public API.

Not all that surprising considering the other shortcomings of the ACM Digital Library interface. For example, you can only save items (their citations) to a binder one item at a time. Customer service will opine that they have had this request before but no, you can’t contact the committee that makes decisions about Digital Library features. Nor will they tell you who is on that committee. Sounds like the current White House, doesn’t it?

I first saw this in a tweet by Scott Chamberlain.

spurious correlations

Filed under: Humor,Statistics — Patrick Durusau @ 7:44 pm

spurious correlations

You need to put this site on your browser toolbar for meetings where “correlations” are likely to be discussed.

May save a lot of explaining and hand waving on your part about the nature of correlations and causation.

My favorite so far is:

Per capita consumption of cheese (US)

correlates with

Number of people who died by becoming entangled in their bed sheets

[Chart: per capita cheese consumption (US) vs. deaths by bedsheet entanglement]

Notice the number of people who died entangled in their bedsheets is 150X the number of Americans who died in domestic terror attacks in 2013. (Death rates from terrorism)

Makes you wonder how much money we are spending to make bedsheets safer for U.S. citizens only.

I first saw this in a tweet by Steven Strogatz.

Comparison of Corpora through Narrative Structure

Filed under: Computational Linguistics,Corpora,Narrative — Patrick Durusau @ 7:24 pm

Comparison of Corpora through Narrative Structure by Dan Simonson.

A very interesting slide deck from a presentation on how news coverage of police activity may have changed from before to after September 11th.

An early slide that caught my attention:

As a computational linguist, I can study 10^6 — instead of 10^0.6 — documents.

The sort of claim that clients might look upon with favor.

I first saw this in a tweet by Dominique Mariko.

There Should Be a Checklist for Maps

Filed under: Mapping,Maps — Patrick Durusau @ 7:07 pm

There Should Be a Checklist for Maps by Betsy Mason.

From the post:

Earlier this week, Stephanie Evergreen posted this great checklist for data visualizations. She and Ann Emery designed it to help social scientists understand the elements of a successful graph and offer guidance on how to make a graph better.

I’ve seen the list tweeted by data viz experts like Alberto Cairo and had it forwarded to me by a designer I used to work with. It got me thinking that a list like this for maps would be really useful. We’re beginners at mapmaking here at Map Lab, and we’d love a list like this to check our own maps against, and to help us evaluate maps we come across.
….

Betsy has located one such guide but is seeking your advice: what should be on a checklist for maps?

A checklist for maps, no disrespect intended towards data visualizations, is a very deep question. Maps, useful ones at any rate, reflect their author, purpose, intended audience, social context, technology for making the map, etc.

Suggestions? Comments?

A Distributed Systems Reading List

A Distributed Systems Reading List

From the introduction:

I often argue that the toughest thing about distributed systems is changing the way you think. The below is a collection of material I’ve found useful for motivating these changes.

Categories include:

  • Thought Provokers
  • Amazon
  • Google
  • eBay
  • Consistency Models
  • Theory
  • Languages and Tools
  • Infrastructure
  • Storage
  • Paxos Consensus
  • Other Consensus Papers
  • Gossip Protocols (Epidemic Behaviors)
  • P2P

Unless you think the knowledge in your domain is small enough to fit into a single system, I suggest you start reading about distributed systems this weekend.

Enjoy!

I first saw this in a tweet by FoundationDB.

Chas Emerick on CRDT’s

Filed under: CRDT — Patrick Durusau @ 2:28 pm

Several resources for Chas Emerick’s presentation of “A comprehensive study of Convergent and Commutative Replicated Data Types” at Papers We Love #4, May 15, 2014.

Slides.

Tom LaGatta’s Notes.

When the video of Chas’ presentation is posted I will update this post.

See also: Christopher Meiklejohn’s Time, Clocks, and the Ordering of Events in a Distributed System.

Abstract:

Whether you realize it or not, if you’ve built a rich web application in Ember.js, and you’re sending data between clients and a server, you’ve built a distributed system. This talk will discuss the challenges of building such a system, specifically the challenges related to preserving consistency when dealing with concurrent actors. We will begin with a primer on the various types of consistency, covering topics such as eventual consistency and causal consistency, and then move on to discuss recent industrial and academic research that aims to solve some of these problems without synchronization, specifically discussing operational transformations and convergent and commutative replicated data types.
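For the flavor of what the paper catalogs, here is a minimal sketch, in Python rather than the paper’s notation, of its simplest convergent (state-based) type, the grow-only counter:

  class GCounter:
      """Grow-only counter. Each replica bumps only its own slot; merge is
      elementwise max, which is commutative, associative, and idempotent,
      so replicas converge regardless of delivery order."""
      def __init__(self, replica_id):
          self.replica_id = replica_id
          self.counts = {}  # replica_id -> count

      def increment(self, n=1):
          self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

      def merge(self, other):
          for rid, c in other.counts.items():
              self.counts[rid] = max(self.counts.get(rid, 0), c)

      def value(self):
          return sum(self.counts.values())

  a, b = GCounter("a"), GCounter("b")
  a.increment(3); b.increment(2)
  a.merge(b); b.merge(a)
  assert a.value() == b.value() == 5  # converged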

May 15, 2014

Speak and learn with Spell Up, our latest Chrome Experiment

Filed under: Education,Language — Patrick Durusau @ 7:15 pm

Speak and learn with Spell Up, our latest Chrome Experiment by Xavier Barrade.

From the post:

As a student growing up in France, I was always looking for ways to improve my English, often with a heavy French-to-English dictionary in tow. Since then, technology has opened up a world of new educational opportunities, from simple searches to Google Translate (and our backpacks have gotten a lot lighter). But it can be hard to find time and the means to practice a new language. So when the Web Speech API made it possible to speak to our phones, tablets and computers, I got curious about whether this technology could help people learn a language more easily.

That’s the idea behind Spell Up, a new word game and Chrome Experiment that helps you improve your English using your voice—and a modern browser, of course. It’s like a virtual spelling bee, with a twist.

This rocks!

If Google is going to open source another project and support it, Spell Up should be it.

The machine pronunciation could use some work, or at least it seems that way to me. (My hearing may be a factor there.)

Thinking of the impact of Spell Up for less commonly taught languages.

A New Nation Votes

Filed under: Government,Government Data,Politics — Patrick Durusau @ 2:58 pm

A New Nation Votes: American Election Returns 1787-1825

From the webpage:

A New Nation Votes is a searchable collection of election returns from the earliest years of American democracy. The data were compiled by Philip Lampi. The American Antiquarian Society and Tufts University Digital Collections and Archives have mounted it online for you with funding from the National Endowment for the Humanities.

Currently there are 18,040 elections that have been digitized.

Interesting data set and certainly one that could be supplemented with all manner of other materials.

Among other things, the impact or lack thereof from extension of the voting franchise would make an interesting study.

Enjoy!

(String/text processing)++:…

Filed under: String Matching,Text Feature Extraction,Text Mining,Unicode — Patrick Durusau @ 2:49 pm

(String/text processing)++: stringi 0.2-3 released by Marek Gągolewski.

From the post:

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

stringi is a package providing (but definitely not limiting to) replacements for nearly all the character string processing functions known from base R. While developing the package we had high performance and portability of its facilities in our minds.

Here is a very general list of the most important features available in the current version of stringi:

  • string searching:
    • with ICU (Java-like) regular expressions,
    • ICU USearch-based locale-aware string searching (quite slow, but working properly e.g. for non-Unicode normalized strings),
    • very fast, locale-independent byte-wise pattern matching;
  • joining and duplicating strings;
  • extracting and replacing substrings;
  • string trimming, padding, and text wrapping (e.g. with Knuth's dynamic word wrap algorithm);
  • text transliteration;
  • text collation (comparing, sorting);
  • text boundary analysis (e.g. for extracting individual words);
  • random string generation;
  • Unicode normalization;
  • character encoding conversion and detection;

and many more.

Interesting, isn’t it, how CS keeps circling back to strings?

Enjoy!

Diving into HDFS

Filed under: Hadoop,HDFS — Patrick Durusau @ 2:22 pm

Diving into HDFS by Julia Evans.

From the post:

Yesterday I wanted to start learning about how HDFS (the Hadoop Distributed File System) works internally. I knew that

  • It’s distributed, so one file may be stored across many different machines
  • There’s a namenode, which keeps track of where all the files are stored
  • There are data nodes, which contain the actual file data

But I wasn’t quite sure how to get started! I knew how to navigate the filesystem from the command line (hadoop fs -ls /, and friends), but not how to figure out how it works internally.

Colin Marc pointed me to this great library called snakebite which is a Python HDFS client. In particular he pointed me to the part of the code that reads file contents from HDFS. We’re going to tear it apart a bit and see what exactly it does!

Be cautious reading Julia’s post!

Her enthusiasm can be infectious. 😉

Seriously, I take Julia’s posts as the way CS topics are supposed to be explored. While there is hard work, there is also the thrill of discovery. Not a bad approach to have.
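If you want to poke at HDFS the same way, a minimal snakebite session looks something like this (a sketch only; the namenode host and port are assumptions that vary by cluster):

  from snakebite.client import Client

  # Assumes a namenode on localhost:8020.
  client = Client("localhost", 8020)
  for entry in client.ls(["/"]):  # speaks the namenode protocol directly
      print(entry["path"], entry["length"])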

Improving GitHub for science

Filed under: Github,Identifiers,Identity — Patrick Durusau @ 1:53 pm

Improving GitHub for science

From the post:

GitHub is being used today to build scientific software that’s helping find Earth-like planets in other solar systems, analyze DNA, and build open source rockets.

Seeing these projects and all this momentum within academia has pushed us to think about how we can make GitHub a better tool for research. As scientific experiments become more complex and their datasets grow, researchers are spending more of their time writing tools and software to analyze the data they collect. Right now though, these efforts often happen in isolation.

Citable code for academic software

Sharing your work is good, but collaborating while also getting required academic credit is even better. Over the past couple of months we’ve been working with the Mozilla Science Lab and data archivers, Figshare and Zenodo, to make it possible to get a Digital Object Identifier (DOI) for any GitHub repository archive.

DOIs form the backbone of the academic reference and metrics system. With a DOI for your GitHub repository archive, your code becomes citable. Our newest Guide explains how to create a DOI for your repository.

A great step forward but, like an http: URI pointing to an entire resource, it is of limited utility.

Assume that I am using a DOI for a software archive and I want to point to and identify a code snippet in the archive that implements Fast Fourier Transform (FFT). My first task is to point to that snippet. A second task would be to create an association between the snippet and my annotation that it implements the Fast Fourier Transform. Yet a third task would be to gather up all the pointers that point to implementations of the Fast Fourier Transform (FFT).

For all of those tasks, I need to identify and point to a particular part of the underlying source code.
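GitHub’s own permalinks get partway toward the pointing task: pin the URL to a commit SHA and a line range, as in the hypothetical example below. But such a URL carries none of the DOI system’s resolution guarantees, and nothing in it associates those lines with an annotation like “implements FFT.”

  https://github.com/someuser/somerepo/blob/4f2a9c1/src/fft.py#L120-L160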

Unfortunately, a DOI is limited to identifying a single entity.

Each DOI® name is a unique “number”, assigned to identify only one entity. Although the DOI system will assure that the same DOI name is not issued twice, it is a primary responsibility of the Registrant (the company or individual assigning the DOI name) and its Registration Agency to identify uniquely each object within a DOI name prefix. (DOI Handbook)

How would you extend the DOIs being used by GitHub to identify code fragments within source code repositories?

I first saw this in a tweet by Peter Desmet.

Why BI Projects Fail

Filed under: BI,Project Management — Patrick Durusau @ 1:17 pm

Top reasons your Business Intelligence (BI) project will fail by Andrew Bourne.

Reasons 1) Data models are complex, 2) Dirty data, and 5) Decision-making errors from misinterpretation of information all have topic-map-like elements in them.

Andrew outlines the issues here and promises to take up each one separately and cover “…what to do about them.”

OK, I’m game.

There does seem to be a trend towards explanations for why “big data” projects are failing. As we saw in The Shrinking Big Data MarketPlace, a survey by VoltDB found that a full 72% of the respondents could not access or utilize the majority of their data.

I don’t view such reports as being “skeptical” about big data but as being realistic: the things necessary for any successful project (clear goals, hard work, good management) are just as necessary for BI projects.

I will be following Andrew’s post and report back on where he comes down on issues relevant to topic maps.

I first saw this in a tweet by Gregory Piatetsky.

Digital Libraries For Musicology

Filed under: Digital Library,Music,Music Retrieval — Patrick Durusau @ 12:53 pm

The 1st International Digital Libraries for Musicology workshop (DLfM 2014)

12th September 2014 (full day), London, UK

in conjunction with the ACM/IEEE Digital Libraries conference 2014

From the call for papers:

BACKGROUND

Many Digital Libraries have long offered facilities to provide multimedia content, including music. However there is now an ever more urgent need to specifically support the distinct multiple forms of music, the links between them, and the surrounding scholarly context, as required by the transformed and extended methods being applied to musicology and the wider Digital Humanities.

The Digital Libraries for Musicology (DLfM) workshop presents a venue specifically for those working on, and with, Digital Library systems and content in the domain of music and musicology. This includes Music Digital Library systems, their application and use in musicology, technologies for enhanced access and organisation of musics in Digital Libraries, bibliographic and metadata for music, intersections with music Linked Data, and the challenges of working with the multiple representations of music across large-scale digital collections such as the Internet Archive and HathiTrust.

IMPORTANT DATES

Paper submission deadline: 27th June 2014 (23:59 UTC-11)
Notification of acceptance: 30th July 2014
Registration deadline for one author per paper: 11th August 2014 (14:00 UTC)
Camera ready submission deadline: 11th August 2014 (14:00 UTC)

If you want a feel for the complexity of music as a retrieval subject, consult the various proposals at: Music markup languages, which are only some of the possible music encoding languages.

It is hard to say which domains are more “complex” than others in terms of encoding and subject identity, but it is safe to say that music falls towards the complex end of the scale. (sorry)

I first saw this in a tweet by Misanderasaurus Rex.
