Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 16, 2014

Stephen Wolfram Launching Today: Mathematica Online! (w/ secret pricing)

Filed under: Mathematica,Mathematics — Patrick Durusau @ 6:52 pm

Launching Today: Mathematica Online! by Stephen Wolfram.

From the post:

It’s been many years in the making, and today I’m excited to announce the launch of Mathematica Online: a version of Mathematica that operates completely in the cloud—and is accessible just through any modern web browser.

In the past, using Mathematica has always involved first installing software on your computer. But as of today that’s no longer true. Instead, all you have to do is point a web browser at Mathematica Online, then log in, and immediately you can start to use Mathematica—with zero configuration.

Some of the advantages that Stephen outlines:

  • Manipulate can be embedded in any web page
  • Files are stored in the Cloud to be accessed from anywhere or easily shared
  • Mathematica can now be used on mobile devices

What’s the one thing that isn’t obvious from Stephen’s post?

The pricing for access to Mathematica Online.

A Wolfram insider, proofing Stephen’s post, probably said: “Oh, shit! Our pricing information is secret! What do you say in the post?”

So Stephen writes:

But get Mathematica Online too (which is easy to do—through Premier Service Plus for individuals, or a site license add-on).

You do that, or at least try to. If you manage to hunt down Premier Service, you will find you need an activation key before you can even see the pricing information.

If you don’t have a copy of Mathematica, you aren’t going to be ordering Mathematica Online today.

Sad that such remarkable software has such poor marketing.

Shout out to Stephen: Lots of people are interested in using Mathematica Online or off. Byzantine marketing excludes waiting, would-be paying customers.

I first saw this in a tweet by Alex Popescu.

A $23 million venture fund for the government tech set

Filed under: Funding,Government,Government Data — Patrick Durusau @ 5:07 pm

A $23 million venture fund for the government tech set by Nancy Scola.

Nancy tells a compelling story of a new VC firm, GovTech, which is looking for startups focused on providing governments with better technology infrastructure.

Three facts from the story stand out:

“The U.S. government buys 10 eBays’ worth of stuff just to operate,” from software to heavy-duty trucking equipment.

…working with government might be a tortuous slog, but Bouganim says that he saw that behind that red tape lay a market that could be worth in the neighborhood of $500 billion a year.

What most people don’t realize is government spends nearly $74 billion on technology annually. As a point of comparison, the video game market is a $15 billion annual market.

See Nancy’s post for the full flavor of the story but it sounds like there is gold buried in government IT.

Another way to look at it is the government is already spending $74 billion a year on technology that is largely an object of mockery and mirth. Effective software may be sufficiently novel and threatening to either attract business or a buy-out.

While you are pondering possible opportunities, existing systems, their structures and data are “subjects” in topic map terminology. Which means topic maps can protect existing contracts and relationships, while delivering improved capabilities and data.

Promote topic maps as “in addition to” existing IT systems and you will encounter less resistance both from within and without the government.

Don’t be squeamish about associating with governments, of whatever side. Their money spends just like everyone else’s. You can ask AT&T and IBM about supporting both sides in a conflict.

I first saw this in a tweet by Mike Bracken.

User Onboarding

Filed under: Interface Research/Design,Usability,Users,UX — Patrick Durusau @ 4:27 pm

User Onboarding by Samuel Hulick.

From the webpage:

Want to see how popular web apps handle their signup experiences? Here’s every one I’ve ever reviewed, in one handy list.

I have substantially altered Samuel’s presentation to fit the list onto one screen and to open new tabs, enabling quick comparison of onboarding experiences.

Asana iOS Instagram OkCupid Slingshot
Basecamp InVision Optimizely Snapchat
Buffer LessAccounting Pinterest Trello
Evernote LiveChat Pocket Tumblr
Foursquare Mailbox for Mac Quora Twitter
GetResponse Meetup Shopify Vimeo
Gmail Netflix Slack WhatsApp

Writers become better by reading good writers.

Non-random good onboarding comes from studying previous good onboarding.

Enjoy!

I first saw this in a tweet by Jason Ziccardi.

Shanghai Library adds 2 million records to WorldCat…

Filed under: Chinese,Library,WorldCat — Patrick Durusau @ 9:58 am

Shanghai Library adds 2 million records to WorldCat to share its collection with the world.

From the post:

Shanghai Library, the largest public library in China and one of the largest libraries in the world, has contributed 2 million holdings to WorldCat, including some 770,000 unique bibliographic records, to share its collection worldwide.

These records, which represent books and journals published between 1911 and 2013, were loaded in WorldCat earlier this year. The contribution from Shanghai Library, an OCLC member since 1996, enhances the richness and depth of Chinese materials in WorldCat as well as the discoverability of these collections around the world.

“We are pleased to add Shanghai Library’s holdings to WorldCat, which is the global union catalog of library collections,” said Dr. Jianzhong Wu, Director, Shanghai Library. “Shanghai is a renowned, global city, and the library should be as well. With WorldCat, we not only raise the visibility of our collection to a global level but we also share our national heritage and identity with other libraries and their users through the OCLC WorldShare Interlibrary Loan service.”

“The leadership of Shanghai Library has a bold global vision,” says Andrew H. Wang, Vice President, OCLC Asia Pacific. “The addition of Shanghai Library’s holdings and unique records enriches coverage of the Chinese collection in WorldCat for researchers everywhere.”

I don’t have a feel for how many unique Chinese bibliographic records are online but 770,000 sounds like a healthy addition.

You may also be interested in: Online Resources for Chinese Studies in North American Libraries, compiled by Ming POON, Josephine SCHE, and Mi Chu WIENS (November 2004).

Given the compilation date, 2004, I ran the W3C Link Checker on http://www.loc.gov/rr/asian/china-bib/.

You can review the results at: http://www.durusau.net/publications/W3CLinkChecker:http:_www.loc.gov_rr_asian_china-bib_.html

Summary of results:

Code Occurrences What to do
(N/A) 6 The link was not checked due to robots exclusion rules. Check the link manually, and see also the link checker documentation on robots exclusion.
(N/A) 2 The hostname could not be resolved. Check the link for typos.
403 1 The link is forbidden! This needs fixing. Usual suspects: a missing index.html or Overview.html, or a missing ACL.
404 61 The link is broken. Double-check that you have not made any typo, or mistake in copy-pasting. If the link points to a resource that no longer exists, you may want to remove or fix the link.
500 5 This is a server side problem. Check the URI.

(emphasis added)

At a minimum, the broken links need to be corrected but updating the listing to include new resources would make a nice graduate student project.
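
If you want a quick local check before wading through the full W3C report, a minimal sketch along these lines (Python standard library only, no robots.txt handling, so treat it as a rough first pass) can flag candidates for repair. The URL list here is hypothetical:

    import urllib.request
    import urllib.error

    # Hypothetical sample of links pulled from the bibliography page.
    urls = [
        "http://www.loc.gov/rr/asian/china-bib/",
        "http://example.org/some-moved-resource.html",
    ]

    def check(url, timeout=10):
        """Return an HTTP status code or a short error label for url."""
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.getcode()
        except urllib.error.HTTPError as e:
            return e.code                      # e.g. 403, 404, 500
        except urllib.error.URLError as e:
            return f"unresolved ({e.reason})"  # DNS failures and the like

    for url in urls:
        print(check(url), url)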

I don’t have the background or language skills with Chinese resources to embark on such a project but would be happy to assist anyone who undertakes the task.

New Directions in Vector Space Models of Meaning

Filed under: Meaning,Natural Language Processing,Vector Space Model (VSM),Vectors — Patrick Durusau @ 8:50 am

New Directions in Vector Space Models of Meaning by Edward Grefenstette, Karl Moritz Hermann, Georgiana Dinu, and Phil Blunsom. (video)

From the description:

This is the video footage, aligned with slides, of the ACL 2014 Tutorial on New Directions in Vector Space Models of Meaning, by Edward Grefenstette (Oxford), Karl Moritz Hermann (Oxford), Georgiana Dinu (Trento) and Phil Blunsom (Oxford).

This tutorial was presented at ACL 2014 in Baltimore by Ed, Karl and Phil.

The slides can be found at http://www.clg.ox.ac.uk/resources.

Running time is 2:45:12 so you had better get a cup of coffee before you start.

Includes a review of distributional models of semantics.

The sound quality isn’t bad but the room acoustics are, so you will have to listen closely. Having the slides in front of you helps as well.

The semantics part starts to echo topic map theory with the realization that having a single token isn’t going to help you with semantics. Tokens don’t stand alone but in a context of other tokens. Each of which has some contribution to make to the meaning of a token in question.

Topic maps function in a similar way with the realization that identifying any subject of necessity involves other subjects, which have their own identifications. For some purposes, we may assume some subjects are sufficiently identified without specifying the subjects that in our view identify it, but that is merely a design choice that others may choose to make differently.
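
As a toy illustration of the distributional side of this (a token’s meaning estimated from the company it keeps), here is a minimal sketch that builds co-occurrence vectors from a few made-up sentences and compares them with cosine similarity:

    from collections import Counter
    from math import sqrt

    # Tiny made-up corpus; real work would use millions of tokens.
    sentences = [
        "the cat drinks milk",
        "the dog drinks water",
        "the cat chases the dog",
    ]

    def context_vector(target, window=2):
        """Count tokens appearing within `window` positions of `target`."""
        counts = Counter()
        for s in sentences:
            tokens = s.split()
            for i, tok in enumerate(tokens):
                if tok == target:
                    lo, hi = max(0, i - window), i + window + 1
                    counts.update(t for t in tokens[lo:hi] if t != target)
        return counts

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a if k in b)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    cat, dog, milk = (context_vector(w) for w in ("cat", "dog", "milk"))
    print("cat ~ dog :", round(cosine(cat, dog), 3))   # similar contexts, higher score
    print("cat ~ milk:", round(cosine(cat, milk), 3))  # less similar contexts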

Working through this tutorial and the cited references (one advantage to the online version) will leave you with a background in vector space models and the contours of the latest research.

I first saw this in a tweet by Kevin Safford.

KaTeX

Filed under: Javascript,TeX/LaTeX — Patrick Durusau @ 5:14 am

KaTeX

From the webpage:

  • Fast: KaTeX renders its math synchronously and doesn’t need to reflow the page.
  • Print quality: KaTeX’s layout is based on Donald Knuth’s TeX, the gold standard for math typesetting.
  • Self contained: KaTeX has no dependencies and can easily be bundled with your website resources.
  • Server side rendering: KaTeX produces the same output regardless of browser or environment, so you can pre-render expressions using Node.js and send them as plain HTML.

Is it just a matter of time before someone implements all of TeX in JavaScript and we have a complete typographic solution for the Web?

I first saw this in a tweet by John Resig.

September 15, 2014

GraphX: Graph Processing in a Distributed Dataflow Framework

Filed under: Distributed Computing,Graphs,GraphX — Patrick Durusau @ 7:25 pm

GraphX: Graph Processing in a Distributed Dataflow Framework by Joseph Gonzalez, Reynold Xin, Ankur Dave, Dan Crankshaw, Michael Franklin, Ion Stoica.

Abstract:

In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.

GraphX: Graph Processing in a Distributed Dataflow Framework (as PDF file)

The “other” systems for comparison were GraphLab and Giraph. Those systems were tuned in cooperation with experts in their use. These are some of the “fairest” benchmarks you are likely to see this year. Quite different from “shiny graph engine” versus lame or misconfigured system benchmarks.

Definitely the slow-read paper for this week!
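
To get a feel for the paper’s central move before you commit the time, here is a toy, single-machine sketch of one PageRank step expressed only with dataflow-style operations (map, join-like lookups, group-by). Plain Python, not GraphX’s actual API:

    from collections import defaultdict

    # Edge "table" (src, dst) and initial vertex "table" (id, rank).
    edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
    ranks = {"a": 1.0, "b": 1.0, "c": 1.0}

    def pagerank_step(ranks, edges, damping=0.85):
        out_degree = defaultdict(int)
        for src, _ in edges:            # group edges by src, count
            out_degree[src] += 1
        contribs = defaultdict(float)
        for src, dst in edges:          # join edges with ranks, map to contributions
            contribs[dst] += ranks[src] / out_degree[src]
        # group contributions by dst, sum, and apply damping
        return {v: (1 - damping) + damping * contribs[v] for v in ranks}

    for _ in range(10):
        ranks = pagerank_step(ranks, edges)
    print({v: round(r, 3) for v, r in ranks.items()})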

I first saw this in a tweet by Arnon Rotem-Gal-Oz.

A Guide To Who Hates Whom In The Middle East

Filed under: Associations,Mapping,Maps — Patrick Durusau @ 7:01 pm

A Guide To Who Hates Whom In The Middle East by John Brownlee.

John reviews an interactive visualization of players with an interest in the Middle East by David McCandless of Information is Beautiful.

The full interactive version of The Middle East Key players & notable relationships.

I would use this graphic with caution, mostly because if you select Jordan, it shows no relationship to Israel. As you know, Jordan signed a peace agreement with Israel twenty years ago and Israel recently agreed to sell gas to Jordan’s state-owned National Electric Power Co.

Nor does it show any relationship between Turkey and the United States. At the very least, the United States and Turkey have a complicated relationship. Would you include the reported pettiness of Senator John McCain towards Turkey in an enhanced map?

This is not to take anything away from a useful way to explore the web of relationships in the Middle East, but more in the nature of a request for a fuller story.

Uncovering Hidden Text on a 500-Year-Old Map That Guided Columbus

Filed under: Mapping,Maps — Patrick Durusau @ 4:20 pm

Uncovering Hidden Text on a 500-Year-Old Map That Guided Columbus by Greg Miller.

Martellus map

Christopher Columbus probably used the map above as he planned his first voyage across the Atlantic in 1492. It represents much of what Europeans knew about geography on the verge of discovering the New World, and it’s packed with text historians would love to read—if only the faded paint and five centuries of wear and tear hadn’t rendered most of it illegible.

But that’s about to change. A team of researchers is using a technique called multispectral imaging to uncover the hidden text. They scanned the map last month at Yale University and expect to start extracting readable text in the next few months, says Chet Van Duzer, an independent map scholar who’s leading the project, which was funded by the National Endowment for the Humanities.

The map was made in or around 1491 by Henricus Martellus, a German cartographer working in Florence. It’s not known how many were made, but Yale owns the only surviving copy. It’s a big map, especially for its time: about 4 by 6.5 feet. “It’s a substantial map, meant to be hung on a wall,” Van Duzer said.

Extracting the text is going to take some effort but expectations are that high resolution images will appear at the Beinecke Digital Library at Yale in 2015.

Greg covers a number of differences between the Martellus map (1491) and the Waldseemüller map (1507), as well as their places in historical context.

You should pass this post on to any friends who think Columbus “discovered” the world was round. I don’t see any end of the world markers on the Martellus map.

Do you?

Norwegian Ethnological Research [The Early Years]

Filed under: Data,Data Mining,Ethnological — Patrick Durusau @ 10:49 am

Norwegian Ethnological Research [The Early Years] by Lars Marius Garshol.

From the post:

The definitive book on Norwegian farmhouse ale is Odd Nordland’s “Brewing and beer traditions in Norway,” published in 1969. That book is now sadly totally unavailable, except from libraries. In the foreword Nordland writes that the book is based on a questionnaire issued by Norwegian Ethnological Research in 1952 and 1957. After digging a little I discovered that this material is actually still available at the institute. The questionnaire is number 35, running to 103 questions.

Because the questionnaire responses in general often contain descriptions of quite personal matters, access to the answers is restricted. However, by paying a quite stiff fee, describing the research I wanted to use the material for, and signing a legal agreement, I was sent a CD with all the answers to questionnaire 35. The contents are quite daunting: 1264 numbered JPEG files, with no metadata of any kind. The files are scans of individual pages of responses, plus one cover page for each Norwegian province. Most of the responses are handwritten, and legibility varies dramatically. Some, happily, are typewritten.

I appended “[The Early Years]” to the title because Lars has embarked on an adventure that can last as long as he remains interested.

Sixty-two year old survey results leave Lars wondering exactly what was meant in some cases. Keep that in mind the next time you search for word usage across centuries. Matching exact strings isn’t the same thing as matching the meanings attached to those strings.

You can imagine what gaps and ambiguities might exist when the time period stretches to centuries, if not millennia, and our knowledge of the languages is learned in a modern context.

The understanding we capture is our own, which hopefully has some connection to earlier witnesses. Recording that process is a uniquely human activity and one that I am glad Lars is sharing with a larger audience.

Looking forward to hearing about more results!

PS: Do you have a similar “data mining” story to share? Command line tool stories count, but so does work with non-electronic resources.

Open source datacenter computing with Apache Mesos

Filed under: Data Management,Data Repositories,Mesos — Patrick Durusau @ 9:26 am

Open source datacenter computing with Apache Mesos by Sachin P. Bappalige.

From the post:

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos is open source software originally developed at the University of California at Berkeley. It sits between the application layer and the operating system and makes it easier to deploy and manage applications in large-scale clustered environments more efficiently. It can run many applications on a dynamically shared pool of nodes. Prominent users of Mesos include Twitter, Airbnb, MediaCrossing, Xogito and Categorize.

Mesos leverages features of the modern kernel—”cgroups” in Linux, “zones” in Solaris—to provide isolation for CPU, memory, I/O, file system, rack locality, etc. The big idea is to make a large collection of heterogeneous resources. Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. It is a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources. The idea is to deploy multiple distributed systems to a shared pool of nodes in order to increase resource utilization. A lot of modern workloads and frameworks can run on Mesos, including Hadoop, Memcached, Ruby on Rails, Storm, JBoss Data Grid, MPI, Spark and Node.js, as well as various web servers, databases and application servers.

This introduction to Apache Mesos will give you a quick overview of what Mesos has to offer without getting bogged down in details. Details will come later, whether you want to run a datacenter using Mesos or to map a datacenter already being run with Mesos.
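
The resource-offer idea is easier to see in miniature. Below is a toy simulation of two-level scheduling in plain Python; it borrows Mesos vocabulary (offers, frameworks) but is not the Mesos API, and the node sizes and frameworks are invented:

    # Toy two-level scheduling: the "master" offers node resources,
    # each "framework" decides which offers to accept.
    nodes = {"node1": {"cpus": 8, "mem_gb": 32}, "node2": {"cpus": 4, "mem_gb": 16}}

    class Framework:
        def __init__(self, name, cpus, mem_gb):
            self.name, self.need = name, {"cpus": cpus, "mem_gb": mem_gb}
            self.placed_on = None

        def consider(self, node, resources):
            """Accept the offer only if it covers this framework's needs."""
            if self.placed_on is None and all(resources[k] >= v for k, v in self.need.items()):
                for k, v in self.need.items():
                    resources[k] -= v
                self.placed_on = node
                return True
            return False

    frameworks = [Framework("spark-job", 4, 16), Framework("web-app", 2, 4)]

    # The master's side: offer each node's remaining resources to frameworks in turn.
    for node, resources in nodes.items():
        for fw in frameworks:
            if fw.consider(node, resources):
                print(f"{fw.name} accepted offer on {node}, remaining: {resources}")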

Apache Storm 0.9 Training Deck and Tutorial

Filed under: Kafka,Storm — Patrick Durusau @ 7:10 am

Apache Storm 0.9 Training Deck and Tutorial by Michael G. Noll.

From the post:

Today I am happy to share an extensive training deck on Apache Storm version 0.9, which covers Storm’s core concepts, operating Storm in production, and developing Storm applications. I also discuss data serialization with Apache Avro and Twitter Bijection.

The training deck (130 slides) is aimed at developers, operations, and architects.

What the training deck covers

  1. Introducing Storm: history, Storm adoption in the industry, why Storm
  2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
  3. Operating Storm: architecture, hardware specs, deploying, monitoring
  4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps (with kafka-storm-starter), performance and scalability tuning
  5. Playing with Storm using Wirbelsturm

What a great way to start the week! Well, at least if you were intending to start learning about Storm this week.
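
If the core concepts in item 2 are new to you, a toy, in-process word count makes them concrete: a spout emits tuples, one bolt splits them, another keeps running counts. This is a plain Python sketch of the ideas only, not Storm’s Java API:

    from collections import Counter

    def sentence_spout():
        """Spout: emits a stream of tuples (here, just a few sentences)."""
        for s in ["storm processes streams", "streams of tuples", "tuples flow through bolts"]:
            yield s

    def split_bolt(sentences):
        """Bolt: splits each sentence tuple into word tuples."""
        for s in sentences:
            yield from s.split()

    def count_bolt(words):
        """Bolt: maintains running counts (state) over the word stream."""
        counts = Counter()
        for w in words:
            counts[w] += 1
        return counts

    # "Topology": wire spout -> split bolt -> count bolt.
    print(count_bolt(split_bolt(sentence_spout())))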

BTW, see Michael’s post for links to other resources, such as his tutorial on Kafka.

A Cambrian Explosion In AI Is Coming

Filed under: Artificial Intelligence,Topic Maps — Patrick Durusau @ 6:35 am

A Cambrian Explosion In AI Is Coming by Dag Kittlaus.

From the post:

However, done properly, this emerging conversational paradigm enables a new fluidity for achieving tasks in the digital realm. Such an interface requires no user manual, makes short work of complex tasks via simple conversational commands and, once it gets to know you, makes obsolete many of the most tedious aspects of using the apps, sites and services of today. What if you didn’t have to: register and form-fill; continuously express your preferences; navigate new interfaces with every new app; and the biggest one of them all, discover and navigate each single-purpose app or service at a time?

Let me repeat the last one.

When you can use AI as a conduit, as an orchestrating mechanism to the world of information and services, you find yourself in a place where services don’t need to be discovered by an app store or search engine. It’s a new space where users will no longer be required to navigate each individual application or service to find and do what they want. Rather they move effortlessly from one need to the next with thousands of services competing and cooperating to accomplish their desires and tasks simply by expressing their desires. Just by asking.

Need a babysitter tomorrow night in a jam? Just ask your assistant to find one and it will immediately present you with a near complete set of personalized options: it already knows where you live, knows how many kids you have and their ages, knows which of the babysitting services has the highest reputation and which ones cover your geographic area. You didn’t need to search and discover a babysitting app, download it, register for it, enter your location and dates you are requesting and so on.

Dag uses the time worn acronym AI (artificial intelligence), which covers any number of intellectual sins. For the scenarios that Dag describes, I propose a new acronym, UsI (user intelligence).

Take the babysitter example to make UsI concrete. The assistant has captured your current (it could change over time) identification of “babysitter” and uses that to find information with that identification. Otherwise searching for “babysitter” would return both useful and useless results, much like contemporary search engines do.

It is the capturing of your subject identifications, to use topic map language, that enables an assistant to “understand” the world as you do. Perhaps the reverse of “personalization” where an application attempts to guess your preferences for marketing purposes, this is “individualization” where the assistant becomes more like you and knows the usually unspoken facts that underlie your requests.

If I say, “check utility bill,” my assistant will already “know” that I mean for Covington, Georgia, not any of the other places I have resided and implicitly I mean the current (unpaid) bill.

The easier and faster it is for an assistant to capture UsI, the faster and more seamless it will become for users.

Specifying and inspecting properties that underlie identifications will play an important role in fueling a useful Cambrian explosion in UsI.
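
A rough sketch of what “capturing UsI” might look like as data: a per-user store of subject identifications, each a small bundle of properties, consulted before any search is issued. The properties and names below are invented for illustration:

    # Hypothetical per-user store of subject identifications (topic-map flavored):
    # each phrase the user utters maps to a subject plus identifying properties.
    user_identifications = {
        "utility bill": {
            "subject": "current unpaid utility bill",
            "properties": {"city": "Covington", "state": "Georgia", "status": "unpaid"},
        },
        "babysitter": {
            "subject": "child care provider",
            "properties": {"children": 2, "location": "home address", "min_rating": 4.5},
        },
    }

    def resolve(phrase):
        """Return the user's identification for a phrase, or fall back to a generic search."""
        ident = user_identifications.get(phrase.lower())
        if ident:
            return ident
        return {"subject": phrase, "properties": {}}   # generic, search-engine style

    print(resolve("utility bill"))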

Who wants a “babysitter” using your definition? Could have quite unexpected (to me) results. http://www.imdb.com/title/tt0796302/ (Be mindful of your corporate policies on what you can or can’t view at work.)

PS: Did I mention topic maps as collections of properties for identifications?

I first saw this in a tweet by Subject-centric.

September 14, 2014

Building Blocks for Theoretical Computer Science

Filed under: Computer Science,Mathematics,Programming — Patrick Durusau @ 7:06 pm

Building Blocks for Theoretical Computer Science by Margaret M. Fleck.

From the preface:

This book teaches two different sorts of things, woven together. It teaches you how to read and write mathematical proofs. It provides a survey of basic mathematical objects, notation, and techniques which will be useful in later computer science courses. These include propositional and predicate logic, sets, functions, relations, modular arithmetic, counting, graphs, and trees. And, finally, it gives a brief introduction to some key topics in theoretical computer science: algorithm analysis and complexity, automata theory, and computability.

To whet your interest:

Enjoy!

I first saw this in Nat Torkington’s Four short links: 11 September 2014.

Clojure-Resources

Filed under: Clojure,Programming — Patrick Durusau @ 6:42 pm

Clojure-Resources by Matthias Nehlsen.

From the post:

This is a compilation of links and resources for learning about Clojure, ClojureScript, Om and more broadly, LISP. It is still in very early stages. Please feel free to add resources by issuing a pull request (preferred) or by getting in touch. For now it is mostly a dump of my bookmarks, but I intend to go through them one by one and write a quick note about each one (or delete those that I don’t find useful after all). Totally unordered at this point.

As of today, one hundred and three (103) resources listed.

Enjoy!

The growing problem of “link rot” and best practices for media and online publishers

Filed under: Hypertext,WWW — Patrick Durusau @ 6:11 pm

The growing problem of “link rot” and best practices for media and online publishers by Leighton Walter Kille.

From the post:

The Internet is an endlessly rich world of sites, pages and posts — until it all ends with a click and a “404 not found” error message. While the hyperlink was conceived in the 1960s, it came into its own with the HTML protocol in 1991, and there’s no doubt that the first broken link soon followed.

On its surface, the problem is simple: A once-working URL is now a goner. The root cause can be any of a half-dozen things, however, and sometimes more: Content could have been renamed, moved or deleted, or an entire site could have evaporated. Across the Web, the content, design and infrastructure of millions of sites are constantly evolving, and while that’s generally good for users and the Web ecosystem as a whole, it’s bad for existing links.

In its own way, the Web is also a very literal-minded creature, and all it takes is a single-character change in a URL to break a link. For example, many sites have stopped using “www,” and even if their content remains the same, the original links may no longer work. The rise of CMS platforms such as WordPress have led to the fall of static HTML sites with their .htm and .html extensions, and with each relaunch, untold thousands of links die.

Even if a core URL remains the same, many sites frequently append login information or search terms to URLs, and those are ephemeral. And as the Web has grown, the problem has been complicated by Google and other search engines that crawl the Web and archive — briefly — URLs and pages. Many work, but their long-term stability is open to question.

Hmmm, link rot, do you think that impacts the Semantic Web? 😉

If you can have multiple IRIs for the same subject, well, you can have a different result.

Leighton has a number of suggestions to lessen your own link rot. For the link rot (as far as identifiers) of others, I suggest topic maps.
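
A minimal sketch of the topic map angle: treat the IRIs you have seen for a subject as aliases of one another, so that when one form rots, the others still resolve. The subject and IRIs below are invented examples:

    # Map every known IRI variant to a stable subject identifier (invented examples).
    subject_for_iri = {}

    def register(subject_id, iris):
        for iri in iris:
            subject_for_iri[iri] = subject_id

    register("subject:link-rot-article", [
        "http://www.example.org/reports/link-rot.html",
        "http://example.org/reports/link-rot.html",   # site dropped the "www"
        "https://example.org/reports/link-rot/",      # relaunch on a CMS, new path style
    ])

    def lookup(iri):
        """Resolve an IRI to its subject, trying a few mechanical variants as a fallback."""
        candidates = [iri,
                      iri.rstrip("/"),
                      iri.replace("https://", "http://"),
                      iri.replace("://www.", "://"),
                      iri.replace("://www.", "://").replace("https://", "http://")]
        for c in candidates:
            if c in subject_for_iri:
                return subject_for_iri[c]
        return None

    print(lookup("https://www.example.org/reports/link-rot.html"))  # -> subject:link-rot-article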

I first saw this at Full Text Reports as: Website linking: The growing problem of “link rot” and best practices for media and online publishers.

Pig is Flying: Apache Pig on Apache Spark (aka “Spork”)

Filed under: Pig,Spark — Patrick Durusau @ 4:19 pm

Pig is Flying: Apache Pig on Apache Spark by Mayur Rustagi.

From the post:

Analysts can talk about data insights all day (and night), but the reality is that 70% of all data analyst time goes into data processing and not analysis. At Sigmoid Analytics, we want to streamline this data processing pipeline so that analysts can truly focus on value generation and not data preparation.

We focus our efforts on three simple initiatives:

  • Make data processing more powerful
  • Make data processing more simple
  • Make data processing 100x faster than before

As a data mashing platform, the first key initiative is to combine the power and simplicity of Apache Pig on Apache Spark, making existing ETL pipelines 100x faster than before. We do that via a unique mix of our operator toolkit, called DataDoctor, and Spark.

DataDoctor is a high-level operator DSL on top of Spark. It has frameworks for non-symmetrical joins, sorting, grouping, and embedding native Spark functions. It hides a lot of complexity and makes it simple to implement data operators used in applications like Pig and Apache Hive on Spark.

For the uninitiated, Spark is open source Big Data infrastructure that enables distributed fault-tolerant in-memory computation. As the kernel for the distributed computation, it empowers developers to write testable, readable, and powerful Big Data applications in a number of languages including Python, Java, and Scala.

Introduction to and how to get started using Spork (Pig-on-Spark).
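
For a sense of the target layer, here is a minimal plain-PySpark sketch of the kind of dataflow (filter, map, group-by) that a Pig LOAD/FILTER/GROUP/COUNT script boils down to. It assumes a local Spark installation with PySpark available and uses made-up data; it is not Spork’s DataDoctor DSL:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "pig-style-etl-sketch")

    # Made-up "log" records: (user, action)
    records = sc.parallelize([
        ("alice", "click"), ("bob", "view"), ("alice", "view"),
        ("carol", "click"), ("alice", "click"),
    ])

    # Roughly: FILTER by action, GROUP by user, then COUNT, as dataflow operators.
    click_counts = (records
                    .filter(lambda r: r[1] == "click")
                    .map(lambda r: (r[0], 1))
                    .reduceByKey(lambda a, b: a + b))

    print(click_counts.collect())   # e.g. [('alice', 2), ('carol', 1)]
    sc.stop()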

I know, more proof that Phil Karlton was correct in saying:

There are only two hard things in Computer Science: cache invalidation and naming things.

😉

Astropy v0.4 Released

Filed under: Astroinformatics,Python — Patrick Durusau @ 3:58 pm

Astropy v0.4 Released by Erik Tollerud.

From the post:

This July, we performed the third major public release (v0.4) of the astropy package, a core Python package for Astronomy. Astropy is a community-driven package intended to contain much of the core functionality and common tools needed for performing astronomy and astrophysics with Python.

New and improved major functionality in this release includes:

  • A new astropy.vo.samp sub-package adapted from the previously standalone SAMPy package
  • A re-designed astropy.coordinates sub-package for celestial coordinates
  • A new ‘fitsheader’ command-line tool that can be used to quickly inspect FITS headers
  • A new HTML table reader/writer
  • Improved performance for Quantity objects
  • A re-designed configuration framework

Erik goes on to say that Astropy 1.0 should arrive by the end of the year!
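
If you want a two-minute taste of the units and coordinates machinery mentioned above, something along these lines should work (based on the documented astropy interfaces; double-check against the docs for your installed version):

    from astropy import units as u
    from astropy.coordinates import SkyCoord

    # Quantity objects carry units and convert between them.
    distance = 4.2 * u.lyr
    print(distance.to(u.km))

    # Celestial coordinates with the re-designed astropy.coordinates sub-package.
    m31 = SkyCoord(ra=10.6847 * u.deg, dec=41.2687 * u.deg, frame="icrs")
    m33 = SkyCoord(ra=23.4621 * u.deg, dec=30.6599 * u.deg, frame="icrs")
    print(m31.to_string("hmsdms"))
    print(m31.separation(m33))       # angular separation on the sky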

Enjoy!

Forty-four More Greek Manuscripts Online

Filed under: British Library,Manuscripts — Patrick Durusau @ 3:49 pm

Forty-four More Greek Manuscripts Online by James Freeman.

From the post:

We are delighted to announce another forty-four Greek manuscripts have been digitised. As always, we are most grateful to the Stavros Niarchos Foundation, the A. G. Leventis Foundation, Sam Fogg, the Sylvia Ioannou Foundation, the Thriplow Charitable Trust, the Friends of the British Library, and our other generous benefactors for contributing to the digitisation project. Happy exploring!

A random sampling:

Add MS 31921, Gospel Lectionary with ekphonetic notation (Gregory-Aland l 336), imperfect, 12th century, with some leaves supplied in the 14th century. Formerly in Blenheim Palace Library.

Add MS 34059, Gospel Lectionary (Gregory-Aland l 939), with ekphonetic neumes. 12th century.

Add MS 36660, Old Testament lectionary with ekphonetic notation, and fragments from a New Testament lectionary (Gregory-Aland l 1490). 12th century.

Add MS 37320, Four Gospels (Gregory-Aland 2290). 10th century, with additions from the 16th-17th century.

….

Burney MS 106, Sophocles, Ajax, Electra, Oedipus Tyrannus, Antigone; [Aeschylus], Prometheus Vinctus; Pindar, Olympia. End of the 15th century.

Burney MS 108, Aelian, Tactica; Leo VI, Tactica; Heron of Alexandria, Pneumatica, De automatis, with numerous diagrams. 1st quarter of the 16th century, possibly written at Venice.

Burney MS 109, Works by Theocritus, Hesiod, Pindar, Pythagoras and Aratus. 2nd half of the 14th century, Italy.

And many more!

Given the complex histories of the texts witnessed by these Greek manuscripts, their interpretations and commentaries, to say nothing of the history of the manuscripts per se, they are rich subjects that merit treatment with a topic map.

Be sure to visit the other treasures of the British Library. It is an exemplar of how an academic institution should function.

Army can’t track spending on $4.3b system to track spending, IG finds

Filed under: Government,Government Data — Patrick Durusau @ 2:30 pm

Army can’t track spending on $4.3b system to track spending, IG finds by Mark Flatten.

From the post:

More than $725 million was spent by the Army on a high-tech network for tracking supplies and expenses that failed to comply with federal financial reporting rules meant to allow auditors to track spending, according to an inspector general’s report issued Wednesday.

The Global Combat Support System-Army, a logistical support system meant to track supplies, spare parts and other equipment, was launched in 1997. In 2003, the program switched from custom software to a web-based commercial software system.

About $95 million was spent before the switch was made, according to the report from the Department of Defense IG.

As of this February, the Army had spent $725.7 million on the system, which is ultimately expected to cost about $4.3 billion.

The problem, according to the IG, is that the Army has failed to comply with a variety of federal laws that require agencies to standardize reporting and prepare auditable financial statements.

The report is full of statements like this one:

PMO personnel provided a system change request, which they indicated would correct four account attributes in July 2014. In addition, PMO personnel provided another system change request they indicated would correct the remaining account attribute (Prior Period Adjustment) in late FY 2015.

PMO = Project Management Office (in this case, of GCSS–Army).

The lack of identification of personnel speaking on behalf of the project or various offices pervades the report. Moreover, the same is true for twenty-seven (27) other reports on issues with this project.

If the sources of statements and information were identified in these reports, then it would be possible to track people across reports and to identify who has failed to follow up on representations made in the reports.

The first step towards accountability is identification of decision makers in audit reports.

Tracking decision makers from one position to another and linking them to specific decisions is a natural application of topic maps.

I first saw this in Links I Liked by Chris Blattman, September 7, 2014.

Cassandra Performance Testing with cstar_perf

Filed under: Cassandra,Performance — Patrick Durusau @ 6:32 am

Cassandra Performance Testing with cstar_perf by Ryan Mcguire.

From the post:

It’s frequently been reiterated on this blog that performance testing of Cassandra is often done incorrectly. In my role as a Cassandra test engineer at DataStax, I’ve certainly done it incorrectly myself, numerous times. I’m convinced that the only way to do it right, consistently, is through automation – there’s simply too many variables to keep track of when doing things by hand.

cstar_perf is an easy to use tool to run performance tests on Cassandra clusters. A brief outline of what it does for you:

  • Downloads and builds Cassandra source code.
  • Configures your cassandra.yaml and environment settings.
  • Bootstraps nodes on a real cluster.
  • Runs a series of test operations on multiple versions or configs.
  • Collects and aggregates cluster performance metrics.
  • Creates easy to read performance charts comparing multiple test configurations in one view.
  • Runs a web frontend for convenient test scheduling, monitoring and reporting.

A great tool for Cassandra developers and a reminder of the first requirement for performance testing: automation. How’s your performance testing?

I first saw this in a tweet by Jason Brown.

September 13, 2014

Why Use Google Maps When You Can Get GPS Directions On The Death Star Instead?

Filed under: Graphics,MapBox,Mapping,Maps — Patrick Durusau @ 5:05 pm

Why Use Google Maps When You Can Get GPS Directions On The Death Star Instead? by John Brownlee.

From the post:

Mapbox Studio is a toolkit that allows apps and websites to serve up their own custom-designed maps to users. Companies like Square, Pinterest, Foursquare, and Evernote can provide custom-skinned Mapboxes instead, changing map elements to better fit in with their brand.

But Mapbox can do far cooler stuff. It can blast you to Space Station Earth, a Mapbox that makes the entire planet look like the blinking, slate gray skin of the Star Wars Death Star.

Great if your target audience is Star Wars or similar science fiction fans, or if you can convince management that it will hold the attention of users longer.

Even routine tasks, like logging service calls answered, would be more enjoyable using an X-Wing fighter to destroy the location of the call after service has been completed. 😉

Open AI Resources

Filed under: Artificial Intelligence,Open Source — Patrick Durusau @ 10:54 am

Open AI Resources

From the about page:

We all go further when we all work together. That’s the promise of Open AIR, an open source collaboration hub for AI researchers. With the decline of university- and government-sponsored research and the rise of large search and social media companies’ insistence on proprietary software, the field is quickly privatizing. Open AIR is the antidote: it’s important for leading scientists and researchers to keep our AI research out in the open, shareable, and extensible by the community. Join us in our goal to keep the field moving forward, together, openly.

An impressive collection of open source AI software and data.

The categories are:

A number of the major players in AI research are part of this project, which bodes well for it being maintained into the future.

If you create or encounter any open AI resources not listed at Open AI Resources, please Submit a Resource.

I first saw this in a tweet by Ana-Maria Popescu.

CQL Under the Hood

Filed under: Cassandra,CQL - Cassandra Query Language — Patrick Durusau @ 10:26 am

CQL Under the Hood by Robbie Strickland.

Description:

As a reformed CQL critic, I’d like to help dispel the myths around CQL and extol its awesomeness. Most criticism comes from people like me who were early Cassandra adopters and are concerned about the SQL-like syntax, the apparent lack of control, and the reliance on a defined schema. I’ll pop open the hood, showing just how the various CQL constructs translate to the underlying storage layer–and in the process I hope to give novices and old-timers alike a reason to love CQL.

Slides from CassandraSummit 2014

Best viewed with a running instance of Cassandra.

Deep dive into understanding human language with Python

Filed under: Natural Language Processing,NLTK — Patrick Durusau @ 10:08 am

Deep dive into understanding human language with Python by Alyona Medelyan.

Abstract:

Whenever your data is text and you need to analyze it, you are likely to need Natural Language Processing algorithms that help make sense of human language. They will help you answer questions like: Who is the author of this text? What is his or her attitude? What is it about? What facts does it mention? Do I have similar texts like this one already? Where does it belong to?

This tutorial will cover several open-source Natural Language Processing Python libraries such as NLTK, Gensim and TextBlob, show you how they work and how you can use them effectively.

Level: Intermediate (knowledge of basic Python language features is assumed)

Pre-requisites: a Python environment with NLTK, Gensim and TextBlob already installed. Please make sure to run nltk.download() and install movie_reviews and stopwords (under Corpora), as well as POS model (under Models).

Code examples, data and slides from Alyona’s NLP tutorial at KiwiPyCon 2014.

Introduction to NLTK, Gensim and TextBlob.

Not enough to make you dangerous but enough to get you interested in natural language processing.
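
As a warm-up before the tutorial itself, here is the flavor of what NLTK and TextBlob give you out of the box (assumes both packages are installed, plus the NLTK tokenizer and tagger data mentioned in the pre-requisites):

    import nltk
    from textblob import TextBlob

    text = "Natural language processing with Python is surprisingly pleasant."

    # NLTK: tokenization and part-of-speech tagging.
    tokens = nltk.word_tokenize(text)
    print(nltk.pos_tag(tokens)[:5])

    # TextBlob: quick sentiment estimate (polarity in [-1, 1], subjectivity in [0, 1]).
    print(TextBlob(text).sentiment)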

Apache Kafka for Beginners

Filed under: Kafka — Patrick Durusau @ 9:53 am

Apache Kafka for Beginners by Gwen Shapira and Jeff Holoman.

From the post:

When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.

Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was founded, is the most well known user, there are many companies successfully using this technology.

So now that the word is out, it seems the world wants to know: What does it do? Why does everyone want to use it? How is it better than existing solutions? Do the benefits justify replacing existing systems and infrastructure?

In this post, we’ll try to answers those questions. We’ll begin by briefly introducing Kafka, and then demonstrate some of Kafka’s unique features by walking through an example scenario. We’ll also cover some additional use cases and also compare Kafka to existing solutions.

What is Kafka?

Kafka is one of those systems that is very simple to describe at a high level, but has an incredible depth of technical detail when you dig deeper. The Kafka documentation does an excellent job of explaining the many design and implementation subtleties in the system, so we will not attempt to explain them all here. In summary, Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable. (emphasis in original)

A great reference to use for your case to technical management about Kafka. In particular the line:

even a small three-node cluster can process close to a million events per second with an average latency of 3ms.

Sure, there are applications with more stringent processing requirements, but there are far more applications with less than a million events per second.

Does your topic map system get updated more than a million times a second?
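
If you want to kick the tires from Python, a minimal produce-and-consume round trip with the kafka-python package looks roughly like this (the topic name and broker address are placeholders, and the exact API varies between kafka-python versions):

    from kafka import KafkaProducer, KafkaConsumer

    # Produce a few events to a placeholder topic on a local broker.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for i in range(3):
        producer.send("demo-events", f"event {i}".encode("utf-8"))
    producer.flush()

    # Consume them back, starting from the earliest offset.
    consumer = KafkaConsumer("demo-events",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for message in consumer:
        print(message.offset, message.value.decode("utf-8"))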

First map of Rosetta’s comet

Filed under: Astroinformatics,Mapping,Maps — Patrick Durusau @ 6:15 am

First map of Rosetta’s comet

From the webpage:

Scientists have found that the surface of comet 67P/Churyumov-Gerasimenko — the target of study for the European Space Agency’s Rosetta mission — can be divided into several regions, each characterized by different classes of features. High-resolution images of the comet reveal a unique, multifaceted world.

ESA’s Rosetta spacecraft arrived at its destination about a month ago and is currently accompanying the comet as it progresses on its route toward the inner solar system. Scientists have analyzed images of the comet’s surface taken by OSIRIS, Rosetta’s scientific imaging system, and defined several different regions, each of which has a distinctive physical appearance. This analysis provides the basis for a detailed scientific description of 67P’s surface. A map showing the comet’s various regions is available at: http://go.nasa.gov/1pU26L2

“Never before have we seen a cometary surface in such detail,” says OSIRIS Principal Investigator Holger Sierks from the Max Planck Institute for Solar System Science (MPS) in Germany. In some of the images, one pixel corresponds to a scale of 30 inches (75 centimeters) on the nucleus. “It is a historic moment — we have an unprecedented resolution to map a comet,” he says.

The comet has areas dominated by cliffs, depressions, craters, boulders and even parallel grooves. While some of these areas appear to be quiet, others seem to be shaped by the comet’s activity, in which grains emitted from below the surface fall back to the ground in the nearby area.

The Rosetta mission:

Rosetta launched in 2004 and will arrive at comet 67P/Churyumov-Gerasimenko on 6 August. It will be the first mission in history to rendezvous with a comet, escort it as it orbits the Sun, and deploy a lander to its surface. Rosetta is an ESA mission with contributions from its member states and NASA. Rosetta’s Philae lander is provided by a consortium led by DLR, MPS, CNES and ASI.

Not to mention being your opportunity to watch semantic diversity develop from a known starting point.

Already the comet has two names: (1) 67P/Churyumov-Gerasimenko and (2) Rosetta’s comet. Can you guess which one will be used in the popular press?

Surface features will be described in different languages, which have different terms for features and the processes that formed them. Not to mention that even within a single natural language there can be diversity.

Semantic diversity is our natural state. Normalization is an abnormal state, perhaps that is why it is so elusive on a large scale.

September 12, 2014

A Greater Voice for Individuals in W3C – Tell Us What You Would Value [Deadline: 30 Sept 2014]

Filed under: Standards,WWW — Patrick Durusau @ 6:54 pm

A Greater Voice for Individuals in W3C – Tell Us What You Would Value by Coralie Mercier.

From the post:

How is the W3C changing as the world evolves?

Broadening in recent years the W3C focus on industry is one way. Another was the launch in 2011 of W3C Community Groups to make W3C the place for new standards. W3C has heard the call for increased affiliation with W3C, and making W3C more inclusive of the web community.

W3C responded through the development of a program for increasing developer engagement with W3C. Jeff Jaffe is leading a public open task force to establish a program which seeks to provide individuals a greater voice within W3C, and means to get involved and help shape web technologies through open web standards.

Since Jeff announced the version 2 of the Webizen Task Force, we focused on precise goals, success criteria and a selection of benefits, and we built a public survey.

The W3C is a membership based organisation supported by way of membership fees, as to form a common set of technologies, written to the specifications defined through the W3C, which the web is built upon.

The proposal (initially called Webizen but that name may change and we invite your suggestions in the survey), seeks to extend participation beyond the traditional forum of incorporated entities with an interest in supporting open web standards, through new channels into the sphere of individual participation, already supported through the W3C community groups.

Today the Webizen Task Force is releasing a survey which will identify whether or not sufficient interest exists. The survey asks if you are willing to become a W3C Webizen. It offers several candidate benefits and sees which ones are of interest; which ones would make it worthwhile to become Webizens.

I took the survey today and suggest that you do the same before 30 September 2014.

In part I took the survey because of one comment on the original post that reads:

What a crock of shit! The W3C is designed to not be of service to individuals, but to the corporate sponsors. Any ideas or methods to improve web standards should not be taken from sources other then the controlling corporate powers.

I do think that as a PR stunt the Webizen concept could be a good ploy to allow individuals to think they have a voice, but the danger is that they may be made to feel as if they should have a voice.

This could prove detrimental in the future.

I believe the focus of the organization should remain the same, namely as a organization that protects corporate interests and regulates what aspects of technology can be, and should be, used by individuals.

The commenter apparently believes in a fantasy world where those with the gold don’t make the rules.

I am untroubled by those with the gold making the rules, so long as the rest of us have the opportunity for persuasion, that is to be heard by those making the rules.

My suggestion at #14 of the survey reads:

The anti-dilution of “value of membership” position creates a group of second class citizens, which can only lead to ill feelings and no benefit to the W3C. It is difficult to imagine that IBM, Oracle, HP or any of the other “members” of the W3C are all that concerned with voting on W3C specifications. They are likely more concerned with participating in the development of those standards. Which they could do without being members should they care to submit public comments, etc.

In fact, “non-members” can contribute to any work currently under development. If their suggestions have merit, I rather doubt their lack of membership is going to impact acceptance of their suggestions.

Rather than emphasizing the “member” versus “non-member” distinction, I would create “voting member” and “working member” categories, with different membership requirements. “Voting members” would carry on as they are presently and vote on the administrative aspects of the W3C. “Working members” would consist of employees of “voting members,” “invited experts,” and others who meet some criteria for interest in and expertise at a particular specification activity. Like an “invited expert” but without the heavyweight machinery.

Emphasis on the different concerns of different classes of membership would go a long way to not creating a feeling of second class citizenship. Or at least it would minimize it more than the “in your face” type approach that appears to be the present position.

Being able to participate in teleconferences for example, should be sufficient for most working members. After all, if you have to win votes for a technical position, you haven’t been very persuasive in presenting your position.

Nothing against “voting members” at the W3C but I would rather be a “working member” any day.

How about you?

Take the Webizen survey.

Connected Histories: British History Sources, 1500-1900

Filed under: History,Search Engines,Searching — Patrick Durusau @ 4:24 pm

Connected Histories: British History Sources, 1500-1900

From the webpage:

Connected Histories brings together a range of digital resources related to early modern and nineteenth century Britain with a single federated search that allows sophisticated searching of names, places and dates, as well as the ability to save, connect and share resources within a personal workspace. We have produced this short video guide to introduce you to the key features.

Twenty-two remarkable resources can be searched by place, person, or keyword. Some of the sources require subscriptions but the vast majority do not. A summary of the resources would fail to do them justice so here is a list of the currently searchable resources:

As you probably assume, there is no binding point for any person, object, date or thing across all twenty-two resources with its associations to other persons, objects, dates or things.

As you explore Connected Histories, keep track of where you found information on a person, object, date or thing. Depending on the granularity of pointing, you might want to create a topic map to capture that information.
