Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 7, 2015

RawTherapee

Filed under: Image Processing,Topic Maps,Visualization — Patrick Durusau @ 4:41 pm

RawTherapee

From the RawPedia (Getting Started)

RawTherapee is a cross-platform raw image processing program, released under the GNU General Public License Version 3. It was originally written by Gábor Horváth of Budapest. Rather than being a raster graphics editor such as Photoshop or GIMP, it is specifically aimed at raw photo post-production. And it does it very well – at a minimum, RawTherapee is one of the most powerful raw processing programs available. Many of us would make bigger claims…

Every month or two there is a Play Raw competition, with an image and voting (plus commentary along the way).

Very impressive!

Thoughts on topic map competitions?

I first saw this in a tweet by Neil Saunders.

Fifty Words for Databases

Fifty Words for Databases by Phil Factor

From the post:

Almost every human endeavour seems simple from a distance: even database deployment. Reality always comes as a shock, because the closer you get to any real task, the more you come to appreciate the skills that are necessary to accomplish it.

One of the big surprises I have when I attend developer conferences is to be told by experts how easy it is to take a database from development and turn it into a production system, and then implement the processes that allow it to be upgraded safely. Occasionally, I’ve been so puzzled that I’ve drawn the speakers to one side after the presentation to ask them for the details of how to do it so effortlessly, mentioning a few of the tricky aspects I’ve hit. Invariably, it soon becomes apparent from their answers that their experience, from which they’ve extrapolated, is of databases the size of a spreadsheet with no complicated interdependencies, compliance issues, security complications, high-availability mechanisms, agent tasks, alerting systems, complex partitioning, queuing, replication, downstream analysis dependencies and so on about which you, the readers, know more than I. At the vast international enterprise where I once worked in IT, we had a coded insult for such people: ‘They’ve catalogued their CD collection in a database’. Unfair, unkind, but even a huge well-used ‘Big Data’ database dealing in social media is a tame and docile creature compared with a heavily-used OLTP trading system where any downtime or bug means figures for losses where you have to count the trailing zeros. The former has unique problems, of course, but the two types of database are so different.

I wonder if the problem is one of language. Just as the English have fifty ways of describing rainfall, and the Inuit have many ways of describing pack ice, it is about time that we created the language for a variety of databases from a mild drizzle (‘It is a soft morning to be sure’) to a cloud-burst. Until anyone pontificating about the database lifecycle can give their audience an indication of the type of database they’re referring to, we will continue to suffer the sort of misunderstandings that so frustrate the development process. Though I’m totally convinced that the development culture should cross-pollinate far more with the science of IT operations, it will need more than a DevOps group-hug; it will require a change in the technical language so that it can accurately describe the rich variety of databases in operational use and their widely-varying requirements. The current friction is surely due more to misunderstandings on both sides, because it is so difficult to communicate these requirements. Any suggestions for suitable descriptive words for types of database? (emphasis added)

If you have “descriptive words” to suggest to Phil, comment on his post.

Do so with the realization that your “descriptive words” may differ from my “descriptive words” for the same database, may mean a different database altogether, or may have nothing to do with databases at all (when viewed by others).

Yes, I have been thinking about identifiers, again, and will start off the coming week with a new series of posts on subject identification. I hope to include a proposal for a metric of subject identification.

March 6, 2015

Linear SVM Classifier on Twitter User Recognition

Filed under: Classification,Machine Learning,Python,Support Vector Machines — Patrick Durusau @ 6:52 pm

Linear SVM Classifier on Twitter User Recognition by Leon van Bokhorst.

From the post:

Support Vector Machines (SVM) are very useful and popular in data classification, regression and outlier detection. This advanced supervised machine learning algorithm can quickly become very complex and hard to understand, but can lead to great results. In the example we train a linear SVM to detect and predict who’s the writer of a tweet.

A nice weekend-type project: Python, an IPython notebook, 400 tweets (I think Leon is right, the sample is too small), but an opportunity to “arm up the switches and dial in the mils.”
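For readers who want the gist without opening the notebook, here is a minimal sketch of the same idea using scikit-learn rather than Leon’s exact code: TF-IDF features feeding a linear SVM, with made-up tweets and authors standing in for a real corpus.

```python
# Minimal sketch: tweet-author classification with a linear SVM.
# The tweets and author labels are hypothetical stand-ins for a real corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["just pushed a new commit", "training models all night",
          "coffee first, then code review", "tuning hyperparameters again"]
authors = ["alice", "bob", "alice", "bob"]   # one label per tweet

# TF-IDF turns each tweet into a sparse vector; LinearSVC finds the
# separating hyperplane between authors.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(tweets, authors)

print(model.predict(["pushed another commit before coffee"]))
```

With only 400 real tweets (let alone four fake ones) the decision boundary will be shaky, which is exactly Leon’s point about sample size.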

Enjoy!

While you are there, you should look around Leon’s blog. A number of interesting posts on statistics using Python.

Welcome to NDS Labs!

Filed under: BigData — Patrick Durusau @ 5:52 pm

Welcome to NDS Labs!

From the webpage:

Now what is it?

NDS Labs is an environment where developers can prototype tools and capabilities that help build out the NDS framework and services. Labs provides development teams with access to significant storage, machines that can run services, and useful tools for managing and manipulating data.

We have set up NDS Labs as a place to learn through building what is needed in a national data infrastructure. It’s an environment that enables a developer or a small team of developers to explore an innovative idea, prototype a service, or connect existing applications together to build out the NDS ecosystem.

Find out more about:

NDS Labs is just one way to join the NDS community.

Still in the early stages, formulating governance structures, etc. but certainly a deeply interesting project!

I first saw this in a tweet by Kirk Borne.

Snowden Archive

Filed under: Cybersecurity,Security — Patrick Durusau @ 5:14 pm

Snowden Archive

From the webpage:

This Archive is a complete collection of all documents that former NSA contractor Edward Snowden leaked in June 2013 to journalists Laura Poitras, Glenn Greenwald and Ewen MacAskill, and subsequently were published by news media, such as The Guardian, The New York Times, The Washington Post, Der Spiegel, Le Monde, El Mundo and The Intercept. The leaked documents and their coverage have raised significant public concerns and had a major impact on intelligence policy debates internationally over issues of freedom of expression, privacy, national security and democratic governance more broadly.

The Archive also contains some documents that the U.S. Government has published which are helpful in understanding the leaked documents. The Archive does not contain any documents that have not already been published in other sources.

The approximately 400 documents currently in the Archive are a small fraction of the estimated 50,000 documents Snowden turned over. Most of these will likely not be published, but as new documents are published, they will be added to the Archive.

Why did we build this archive?

Our aim in creating this archive is to provide a tool that would facilitate citizen and researcher access to these important documents. Indexes, document descriptions, links to original documents and to related news stories, glossary and comprehensive search features are all designed to enable a better understanding of state surveillance programs within the wider context of surveillance by the U.S. National Security Agency (NSA) along with its partners in the Five Eyes countries – U.K., Canada, Australia and New Zealand.

Our hope is that this resource will contribute to greater appreciation of the broad scope, intimate reach and profound implications of the global surveillance infrastructures and practices that Edward Snowden’s historic document leak reveals.

Visit the Archive.

Sigh, four hundred (400) documents out of the fifty thousand (50,000) Snowden is estimated to have taken.

The released 400 documents represent just 0.8% of the 50,000 figure, and you see what they have provoked.

Is it clear now why I prefer for 100% of the leaked documents to be published?

Even if criminal prosecutions are unlikely, at least those responsible could be banned from international travel.

Airbnb open sources SQL tool built on Facebook’s Presto database

Filed under: Facebook,Presto,SQL — Patrick Durusau @ 4:25 pm

Airbnb open sources SQL tool built on Facebook’s Presto database by Derrick Harris.

From the post:

Apartment-sharing startup Airbnb has open sourced a tool called Airpal that the company built to give more of its employees access to the data they need for their jobs. Airpal is built atop the Presto SQL engine that Facebook created in order to speed access to data stored in Hadoop.

Airbnb built Airpal about a year ago so that employees across divisions and roles could get fast access to data rather than having to wait for a data analyst or data scientist to run a query for them. According to product manager James Mayfield, it’s designed to make it easier for novices to write SQL queries by giving them access to a visual interface, previews of the data they’re accessing, and the ability to share and reuse queries.

It sounds a little like the types of tools we often hear about inside data-driven companies like Facebook, as well as the new SQL platform from a startup called Mode.

At this point, Mayfield said, “Over a third of all the people working at Airbnb have issued a query through Airpal.” He added, “The learning curve for SQL doesn’t have to be that high.”

From the GitHub page:

Airpal is a web-based, query execution tool which leverages Facebook’s PrestoDB to make authoring queries and retrieving results simple for users. Airpal provides the ability to find tables, see metadata, browse sample rows, write and edit queries, then submit queries all in a web interface. Once queries are running, users can track query progress and when finished, get the results back through the browser as a CSV (download it or share it with friends). The results of a query can be used to generate a new Hive table for subsequent analysis, and Airpal maintains a searchable history of all queries run within the tool.

Features

  • Optional Access Control
  • Syntax highlighting
  • Results exported to a CSV for download or a Hive table
  • Query history for self and others
  • Saved queries
  • Table finder to search for appropriate tables
  • Table explorer to visualize schema of table and first 1000 rows

Requirements

  • Java 7 or higher
  • MySQL database
  • Presto 0.77 or higher
  • S3 bucket (to store CSVs)
  • Gradle 2.2 or higher
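Airpal itself is a web interface, but the Presto engine underneath answers ordinary SQL from any client. As a hedged illustration, here is a sketch of querying Presto directly from Python with PyHive (my choice, not part of Airpal; the host, schema, and table names are hypothetical):

```python
# Sketch: querying Presto directly with PyHive (pip install pyhive).
# Connection details and the table being queried are hypothetical.
from pyhive import presto

conn = presto.connect(host="presto.example.com", port=8080, username="analyst")
cur = conn.cursor()
cur.execute("SELECT city, count(*) AS listings "
            "FROM hive.default.listings GROUP BY city LIMIT 10")
for row in cur.fetchall():
    print(row)
```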

I understand to some degree the need to make SQL “simpler” but fail to see how simpler controls translate into a solution. The controls may be obvious enough but if I don’t know the semantics of the column headers, the simplicity of the interface won’t be terribly helpful.

Or to put it another way, users seem to be assumed to know the semantics of the tables they encounter. True/False?

Galleries, Libraries, Archives, and Museums (GLAM CC Licensing)

Filed under: Archives,Library,Museums — Patrick Durusau @ 4:00 pm

Galleries, Libraries, Archives, and Museums (GLAM CC Licensing)

A very extensive list of galleries, libraries, archives, and museums (GLAM) that are using CC licensing.

A good resource to have at hand if you need to argue for CC licensing with your gallery, library, archive, or museum.

I first saw this in a tweet by Adrianne Russell.

Update: Resource List for March 5 Open Licensing Online Program

Data Visualization as a Communication Tool

Filed under: Graphics,Library,Visualization — Patrick Durusau @ 3:53 pm

Data Visualization as a Communication Tool by Susan [Gardner] Archambault, Joanne Helouvry, Bonnie Strohl, and Ginger Williams.

Abstract:

This paper provides a framework for thinking about meaningful data visualization in ways that can be applied to routine statistics collected by libraries. An overview of common data display methods is provided, with an emphasis on tables, scatter plots, line charts, bar charts, histograms, pie charts, and infographics. Research on “best practices” in data visualization design is presented as well as a comparison of free online data visualization tools. Different data display methods are best suited for different quantitative relationships. There are rules to follow for optimal data visualization design. Ten free online data visualization tools are recommended by the authors.

Good review of basic visualization techniques with an emphasis on library data. You don’t have to be in Tufte‘s league in order to make effective data visualizations.

Watch Hilary Mason discredit the cult of the algorithm

Filed under: Algorithms,Conferences — Patrick Durusau @ 3:26 pm

Watch Hilary Mason discredit the cult of the algorithm by Stacey Higginbotham.

From the post:

Want to see Hilary Mason, the CEO and founder at Fast Forward Labs, get fired up? Tell her about your new connected product and its machine learning algorithm that will help it anticipate your needs over time and behave accordingly. “That’s just a bunch of marketing bullshit,” said Mason when I asked her about these claims.

Mason actually builds algorithms and is well-versed in what they can and cannot do. She’s quick to dismantle the cult that has been built up around algorithms and machine learning as companies try to make sense of all the data they have coming in, and as they try to market products built on learning algorithms in the wake of Nest’s $3.2 billion sale to Google (I call those efforts faithware). She’ll do more of this during our opening session with Data collective co-managing partner Matt Ocko at Structure Data on March 18 in New York. You won’t want to miss it.

I won’t be there to see Hilary call “bullshit” on algorithms but you can be:

Structure Data, March 18-19, 2015, New York, NY.

Enjoy!

Chinese Tradition Inspires Machine Learning Advancements, Product Contributions

Filed under: Games,Machine Learning — Patrick Durusau @ 2:43 pm

Chinese Tradition Inspires Machine Learning Advancements, Product Contributions by George Thomas Jr.

From the post:

A new online Chinese riddle game is steeped in more than just tradition. In fact, the machine learning and artificial intelligence that fuels it derives from years of research that also helps drive Bing Search, Bing Translator, the Translator App for Windows Phone, and other products.

Launched in celebration of the Chinese New Year, Microsoft Chinese Character Riddle is based on the two-player game unique to Chinese traditional culture and part of the Chinese Lantern Festival. Developed by the Natural Language Computing Group in Microsoft Research's Beijing lab, the game not only quickly returns an answer to a user's riddle, but also works in reverse: when a user enters a single Chinese character as the intended answer, the system generates several riddles from which to choose.

"These innovations typically embody the strategic thought of Natural Language Processing 2.0, which is to collect big data on the Internet, to automatically build AI models using statistical machine learning methods, and to involve users in the innovation process by quickly getting their on-line feedback." Says Dr. Ming Zhou, Group Leader for Natural Language Computing Group and Principal Researcher at Microsoft Research Asia. "Thus the riddle system will continue to improve."

I don’t know any Chinese characters at all so others will need to judge the usefulness of this machine learning version. I did find a general resource on Riddles about Chinese Characters.

What other word or riddle games would pose challenges for machine learning?

I first saw this in a tweet by Microsoft Research.

Statistical Thinking?

Filed under: Government,Politics — Patrick Durusau @ 11:20 am

Quote: Wells/Wilks on Statistical Thinking

From the webpage:

“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write!”

Quote from the 1951 presidential address of mathematical statistician Samuel S. Wilks (1906 – 1964) to the American Statistical Association, found in JASA, Vol. 46, No. 253, pp. 1-18. Wilks was paraphrasing Herbert G. Wells (1866 – 1946) from his book Mankind in the Making. The full H.G. Wells quote reads:

“The great body of physical science, a great deal of the essential fact of financial science, and endless social and political problems are only accessible and only thinkable to those who have had a sound training in mathematical analysis, and the time may not be very remote when it will be understood that for complete initiation as an efficient citizen of one of the new great complex worldwide States that are now developing, it is as necessary to be able to compute, to think in averages and maxima and minima, as it is now to be able to read and write.”

Statistical thinking would enable most voters to make more informed choices. However, given the lack of statistical thinking skills in elected and appointed leaders, it seems unlikely that statistical thinking is widespread among voters.

For example:

Clapper: The Attacks We Didn’t Prevent In The Past Can’t Be Prevented In The Future If Section 215 Is Allowed To Die by Tim Cushing.

From the post:

Over a decade has passed since the 9/11 attacks, and the intelligence community still won’t let the attack it didn’t prevent be laid to rest. It is exhumed over and over again — its tattered remains waved in front of legislators and the public, accompanied by shouts of, “YOU SEE THIS?!? THIS IS WHAT HAPPENS WHEN WE DON’T GET OUR WAY!”

It’s grotesque and ghastly and — quite frankly — more than a little tiresome. The NSA’s Section 215 program is set to expire on June 1st and James Clapper is making statements in its defense — statements that read like someone attempting to sound more disappointed than angry. But this is James Clapper speaking, and all prior evidence points to him being unwilling to make any concessions on the domestic surveillance front.

See Tim’s article for quotes from Clapper and others; I just can’t bear to repeat their nonsense today. There is no evidence, none, that phone surveillance has stopped any attacks, yet Clapper keeps waving the fear flag to continue the program.

That is not only a failure of reasoning, but a failure of statistical thinking as well. If the incidence of terrorist attacks that have been stopped by phone surveillance is zero, how much of a justification is there for phone surveillance? I realize that is close to being a math question but give it a try. 😉 Most of the elected officials in the United States will get the answer wrong.

I first saw this in a tweet by Kirk Borne.

I appreciate Kirk pitching for the traditional model of an informed and intelligent citizenry that is in control of its government but I doubt that was ever more than a myth. It certainly isn’t true today.

Is Google Dazed and Confused?

Filed under: Business Intelligence — Patrick Durusau @ 10:37 am

I ask because, after a lot of strong talk about security, I read Google backtracks on Android 5.0 default encryption by Kevin C. Tofel, which suggests that Google is backing off its promise of encryption by default. Kevin has the details, but the changed requirement is attributed to performance issues, or so they say.

But don’t you think Google engineers, the same type of engineers who now routinely beat Atari games with an AI, knew there would be a performance hit from default encryption? Isn’t it a little late in the game to claim that “performance” issues took you by surprise?

The other odd bit of news on Google was Google performs U-turn on Blogger smut rule by Lee Munson.

From the post:


However, many of the people who use the service to publish explicit content complained that Blogger was a means of expressing themselves. Now, it seems like Google has listened to them.

The company will instead focus its attention on preventing the distribution of commercial porn, illegal content and videos and images that have been published without the consent of any persons featured within them.

You have to wonder if Google is getting bad results from machine learning algorithms for business strategies or if they are ignoring the machine learning algorithms?

After hosting porn (a/k/a personal “expression”) for more than a decade, what host would not expect massive pushback from changes to the rules? It should be easy enough to discover how many “expression” accounts exist on Blogger.

Focusing on illegal content or publications without consent makes sense because there is corporate liability that follows notice of its presence. But that’s nothing new.

With the election cycle about to begin, the term flip-flop comes to mind.

Thoughts on who at Google has that much clout?

‘FREAK’ Feature (not flaw) Undermines Security

Filed under: Cybersecurity,Security — Patrick Durusau @ 9:47 am

‘FREAK’ flaw undermines security for Apple and Google users, researchers discover by Craig Timberg.

From the post:

Technology companies are scrambling to fix a major security flaw that for more than a decade left users of Apple and Google devices vulnerable to hacking when they visited millions of supposedly secure Web sites, including Whitehouse.gov, NSA.gov and FBI.gov.

The flaw resulted from a former U.S. government policy that forbade the export of strong encryption and required that weaker “export-grade” products be shipped to customers in other countries, say the researchers who discovered the problem. These restrictions were lifted in the late 1990s, but the weaker encryption got baked into widely used software that proliferated around the world and back into the United States, apparently unnoticed until this year.

Researchers discovered in recent weeks that they could force browsers to use the weaker encryption, then crack it over the course of just a few hours. Once cracked, hackers could steal passwords and other personal information and potentially launch a broader attack on the Web sites themselves by taking over elements on a page, such as a Facebook “Like” button.

The problem illuminates the danger of unintended security consequences at a time when top U.S. officials, frustrated by increasingly strong forms of encryption on smartphones, have called for technology companies to provide “doors” into systems to protect the ability of law enforcement and intelligence agencies to conduct surveillance.

Falling back to weaker encryption was a feature, not a bug or flaw, when it was introduced into browsers. It enabled browsers to work with non-U.S. sites using weaker encryption as well as U.S. sites using stronger encryption. Had that feature not been in place, browsers would have discriminated against non-U.S. websites and merchants, drawing the usual complaints from the usual suspects.

Because the fallback capability was retained as a default, without a fresh design analysis of browser software, that now-legacy capability produces encryption that is subject to attack by computing resources that did not exist when the weaker encryption was originally mandated.

You can argue that the vulnerability introduced by the weaker encryption is an unintended consequence of earlier government mandates. But that skips over the responsibility of the browser development community for failing to remove a legacy capability that is no longer needed. As well as its failure to perform a security analysis on browser software in light of current computing capabilities.
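You can probe for the legacy behavior yourself. Here is a hedged sketch in Python: offer a server only export-grade cipher suites and see whether the handshake succeeds. It assumes a Python/OpenSSL build that still recognizes the EXPORT cipher list; builds that have removed export ciphers will raise an error from set_ciphers() itself, which is the outcome you want.

```python
# Probe a server for export-grade cipher support (the FREAK precondition).
# Assumes an OpenSSL build that still includes EXPORT ciphers; on builds
# that removed them, set_ciphers() raises ssl.SSLError instead.
import socket
import ssl

def accepts_export_ciphers(host, port=443):
    ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
    ctx.set_ciphers("EXPORT")            # offer only export-grade suites
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.cipher()      # (name, protocol, bits) if accepted
    except (ssl.SSLError, socket.error):
        return None                      # handshake refused: good news

print(accepts_export_ciphers("example.com"))
```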

Unless and until security becomes part of the software development culture, we are condemned as the sinners in the vestibule of Hell to follow whatever security flaw has captured our attention at the moment. Fixing flaws does not make software more secure, it only fixes the bug that was noticed.

March 5, 2015

Back on March 6th, 2015

Filed under: News — Patrick Durusau @ 7:15 pm

I deeply appreciate everyone who visits this blog, whether daily, weekly or at some other interval.

I was unexpectedly confined to a local hospital for tests late yesterday afternoon until late today. Which resulted in a few posts yesterday and only one today.

I probably won’t be at full force tomorrow but that is certainly my intention.

I used the time away from the keyboard to think of a number of potentially interesting topics and proposals.

As usual, when you are looking for something, you can’t find it and so the tests all proved negative. A good thing except for the lack of an explanation for other things. I will endeavor to keep unannounced interruptions to a bare minimum.

Hope everyone is having a great week!

ATLAS’ Higgs ML Challenge Data Open to Public

Filed under: Machine Learning,Physics — Patrick Durusau @ 7:09 pm

ATLAS’ Higgs ML Challenge Data Open to Public by David Rousseau.

From the post:

[Image: Higgs Machine Learning Challenge poster]

The dataset from the ATLAS Higgs Machine Learning Challenge has been released on the CERN Open Data Portal.

The Challenge, which ran from May to September 2014, was to develop an algorithm that improved the detection of the Higgs boson signal. The specific sample used simulated Higgs particles decaying into two tau particles inside the ATLAS detector. The downloadable sample was provided for participants at the host platform on Kaggle’s website. With almost 1,785 teams competing, the event was a huge success. Participants applied and developed cutting-edge Machine Learning techniques, which have been shown to be better than existing traditional high-energy physics tools.

The dataset was removed at the end of the Challenge but, due to high public demand, ATLAS, as organizer of the event, has decided to house it in the CERN Open Data Portal where it will be available permanently. The 60MB zipped ASCII file can be decoded without special software, and a few scripts are provided to help users get started. Detailed documentation for physicists and data scientists is also available. Thanks to the Digital Object Identifiers (DOIs) in CERN Open Data Portal, the dataset and accompanying material can be cited like any other paper.

The Challenge’s winner Gábor Melis, and recipients of the Special High Energy Physics meets Machine Learning Award, Tianqi Chen and Tong He, will be visiting CERN to deliver talks on their winning algorithms on 19 May.

If you missed your chance to test your mettle in the ATLAS’ Higgs ML Challenge, don’t despair! The data is available once again. How have ML techniques changed since the original challenge? How have your skills improved?
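If you do grab the data, here is a hedged sketch of a first look with pandas. The file name and column layout below follow the original Kaggle release (an EventId, 30 feature columns, a Weight, and a Label), so check them against the Open Data Portal copy:

```python
# First look at the Higgs ML challenge training set.
# File name and column names assume the original Kaggle release.
import pandas as pd

df = pd.read_csv("training.csv")
print(df.shape)                        # roughly 250,000 rows expected
print(df["Label"].value_counts())      # 's' = signal, 'b' = background

# In this dataset -999.0 marks missing values; make that explicit.
features = df.drop(columns=["EventId", "Weight", "Label"])
features = features.replace(-999.0, float("nan"))
print(features.isna().mean().sort_values(ascending=False).head())
```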

Enjoy!

March 4, 2015

The Ultimate Google Algorithm Cheat Sheet

Filed under: Algorithms,Search Engines — Patrick Durusau @ 1:07 pm

The Ultimate Google Algorithm Cheat Sheet by Neil Patel.

From the post:

Have you ever wondered why Google keeps pumping out algorithm updates?

Do you understand what Google algorithm changes are all about?

No SEO or content marketer can accurately predict what any future update will look like. Even some Google employees don’t understand everything that’s happening with the web’s most dominant search engine.

All this confusion presents a problem for savvy site owners: Since there are about 200 Google ranking factors, which ones should you focus your attention on after each major Google algorithm update?

Google has issued four major algorithm updates, named (in chronological order) Panda, Penguin, Hummingbird and Pigeon. In between these major updates, Google engineers also made some algorithm tweaks that weren’t that heavily publicized, but still may have had an impact on your website’s rankings in search results.

A very detailed post describing the history of changes to the Google algorithm, the impact of those changes and suggestions for how you can improve your search rankings.

The post alone is worth your attention but so are the resources that Neil relies upon, such as:

200 Google ranking factors, which really is a list of 200 Google ranking factors.

I first saw this in a tweet by bloggerink.

Take two steps back from journalism:… [Your six degrees to victims/perps]

Filed under: Journalism,News,Reporting — Patrick Durusau @ 11:36 am

Take two steps back from journalism: What are the editorial products we’re not building? by Jonathan Stray.

From the post:

The traditional goal of news is to say what just happened. That’s sort of what “news” means. But there are many more types of nonfiction information services, and many possibilities that few have yet explored.

I want to take two steps back from journalism, to see where it fits in the broader information landscape and try to imagine new things. First is the shift from content to product. A news source is more than the stories it produces; it’s also the process of deciding what to cover, the delivery system, and the user experience. Second, we need to include algorithms. Every time programmers write code to handle information, they are making editorial choices.

Imagine all the wildly different services you could deliver with a building full of writers and developers. It’s a category I’ve started calling editorial products.

In this frame, journalism is just one part of a broader information ecosystem that includes everything from wire services to Wikipedia to search engines. All of these products serve needs for factual information, and they all use some combination of professionals, participants, and software to produce and deliver it to users — the reporter plus the crowd and the algorithm. Here are six editorial products that journalists and others already produce, and six more that they could.

Jonathan’s existing editorial products list (with examples):

  • Record what just happened.
  • Locate pre-existing information.
  • Filter the information tsunami.
  • Give me background on this topic.
  • Expose wrongdoing.
  • Debunk rumors and lies.

A useful starting point to decide if a market is already saturated (or thought to be so) and how you could differentiate a new product in one of these areas. I’m not as certain as Jonathan that existing products perform well on locating pre-existing information or filtering the information tsunami. On the other hand, the low value of most queries may preclude a viable economic model for more accurate answers.

Jonathan’s potential editorial products list (with observations, VCs take note):

  • What can I do about it?
  • A moderated place for difficult discussions.
  • Personalized news that isn’t sort of terrible. [Terrible here refers to the algorithms that personalize the news.]
  • The online town hall.
  • Systematic government coverage.
  • Choose-your-own-adventure reporting.

A great starting point for discussing new editorial products. I suppose it is a refinement of “What can I do about it?” but I have a suggestion for a new editorial product: My Six Degrees.

Based on the idea of six degrees of separation (think Kevin Bacon), what if, for any news report, you could enter your identification and, based on various social media sources and other data, your separation from the persons in the report could be calculated and returned to you, with contact information for each step of the separation?
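The computation itself is nothing exotic: breadth-first search over a social graph. A toy sketch with entirely hypothetical people and edges:

```python
# Toy degrees-of-separation: breadth-first search over a social graph.
# The graph is hypothetical; a real product would assemble it from
# social media and other data sources.
from collections import deque

graph = {
    "you": ["ann", "raj"],
    "ann": ["you", "victim"],
    "raj": ["you", "mayor"],
    "victim": ["ann"],
    "mayor": ["raj"],
}

def separation(graph, start, target):
    """Return the chain of people linking start to target, or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for friend in graph.get(path[-1], []):
            if friend not in seen:
                seen.add(friend)
                queue.append(path + [friend])
    return None

print(separation(graph, "you", "victim"))  # ['you', 'ann', 'victim']: 2 degrees
```

The hard part of the product is not the search but building and maintaining the graph.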

That has the potential to make the news you hear from other products a good deal more personal. It wouldn’t be “…too bad somebody got robbed…”; it would be someone who is only two degrees of separation from you. The same revelation would follow when someone is arrested for the crime.

The same should be true for the faceless bureaucrats who run much of the world. You would not hear “…the parole board denied clemency for someone on death row…” but rather that X, Y, and Z, each a known number of degrees from you, denied clemency.

Could be a way to “personalize” the news in such a way as to motivate readers to take action.

Currently it would not work for everyone but there is enough data in the larger cities to “personalize” the news in a very meaningful way.

I first saw this in a tweet by Journalism Tools.

Futures of text

Filed under: Artificial Intelligence,Interface Research/Design,Messaging,UX — Patrick Durusau @ 10:31 am

Futures of text by Jonathan Libov.

From the post:

I believe comfort, not convenience, is the most important thing in software, and text is an incredibly comfortable medium. Text-based interaction is fast, fun, funny, flexible, intimate, descriptive and even consistent in ways that voice and user interface often are not. Always bet on text:

Text is the most socially useful communication technology. It works well in 1:1, 1:N, and M:N modes. It can be indexed and searched efficiently, even by hand. It can be translated. It can be produced and consumed at variable speeds. It is asynchronous. It can be compared, diffed, clustered, corrected, summarized and filtered algorithmically. It permits multiparty editing. It permits branching conversations, lurking, annotation, quoting, reviewing, summarizing, structured responses, exegesis, even fan fic. The breadth, scale and depth of ways people use text is unmatched by anything.

[Apologies, I lost some of Jonathan’s layout of the quote.]

Jonathan focuses on the use of text/messaging for interactions in a mobile environment, with many examples and suggestions for improvements along the way.

One observation that will have the fearful of an AI future (Elon Musk among others) running for the hills:

Messaging is the only interface in which the machine communicates with you much the same as the way you communicate with it. If some of the trends outlined in this post pervade, it would mark a qualitative shift in how we interact with computers. Whereas computer interaction to date has largely been about discrete, deliberate events — typing in the command line, clicking on files, clicking on hyperlinks, tapping on icons — a shift to messaging- or conversational-based UI’s and implicit hyperlinks would make computer interaction far more fluid and natural.

What’s more, messaging AI benefits from an obvious feedback loop: The more we interact with bots and messaging UI’s, the better it’ll get. That’s perhaps true for GUI as well, but to a far lesser degree. Messaging AI may get better at a rate we’ve never seen in the GUI world. Hold on tight.[Emphasis added.]

Think of it this way: a GUI locks you into the developer’s imagination. A text interface empowers the user’s and the AI’s imagination. I’m betting on the latter.

BTW, Jonathan ends with a great list of further reading on messaging and mobile applications.

Enjoy!

I first saw this in a tweet by Alyona Medelyan.

March 3, 2015

Understanding Natural Language with Deep Neural Networks Using Torch

Filed under: GPU,Natural Language Processing,Neural Networks — Patrick Durusau @ 7:00 pm

Understanding Natural Language with Deep Neural Networks Using Torch by Soumith Chintala and Wojciech Zaremba.

This is a deeply impressive article and a good introduction to Torch (scientific computing package with neural network, optimization, etc.)

In the preliminary materials, the authors illustrate one of the difficulties of natural language processing by machine:

For a machine to understand language, it first has to develop a mental map of words, their meanings and interactions with other words. It needs to build a dictionary of words, and understand where they stand semantically and contextually, compared to other words in their dictionary. To achieve this, each word is mapped to a set of numbers in a high-dimensional space, which are called “word embeddings”. Similar words are close to each other in this number space, and dissimilar words are far apart. Some word embeddings encode mathematical properties such as addition and subtraction (For some examples, see Table 1).

Word embeddings can either be learned in a general-purpose fashion before-hand by reading large amounts of text (like Wikipedia), or specially learned for a particular task (like sentiment analysis). We go into a little more detail on learning word embeddings in a later section.

You can already see the problem but just to call it out: the language usage in Wikipedia, for example, may or may not match the domain of interest. You could certainly use it as a general case, but it will produce very odd results when the text to be “understood” is in a regional version of a language where common words have meanings other than those you will find in Wikipedia.

Slang is a good example. In the 17th century, for example, “cab” was a term for a brothel. A more recent example: to take a “hit” has a meaning quite different from being struck by a boxer.
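The “close to each other in this number space” claim is plain vector geometry, which also shows why the slang problem bites: “cab” trained on Wikipedia will sit near “taxi”, while “cab” trained on 17th-century text would not. A minimal sketch with made-up three-dimensional vectors (real embeddings are learned and run to hundreds of dimensions):

```python
# Cosine similarity between toy word embeddings.
# Vectors are invented for illustration; real ones are learned from text.
import numpy as np

emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "cab":   np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["king"], emb["queen"]))  # near 1.0: similar words
print(cosine(emb["king"], emb["cab"]))    # much lower: dissimilar words
```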

“Understanding” natural language with machines is a great leap forward but one should never leap without looking.

Top 50 Data Science Resources:

Filed under: Data Science — Patrick Durusau @ 6:33 pm

Top 50 Data Science Resources: The Best Blogs, Forums, Videos and Tutorials to Learn All about Data Science

From the webpage:

The field of data science is constantly evolving and ever-advancing, with new technologies placing more valuable insights in the hands of modern enterprises. More data-driven organizations are hiring data scientists to drive their efforts to gather, analyze, and make use of Big Data in valuable ways.

Because the field of data science is so broad and sometimes challenging to navigate, we’ve compiled a list of 50 of the most helpful data science resources on the web. Whether you’re a student or new professional working in the field of data science, these resources are valuable for discovering the latest employment opportunities, finding tutorials for the processes and systems you’re using on a daily basis, learning hacks and tricks to boost your performance, and connecting with other professionals in your field.

Note: The following 50 resources are not ranked or rated in order of importance or value; rather, they are categorized to make it easy for you to locate the resources you need most. Click through to a specific category using the links in the Table of Contents below.

A useful list as far as it goes but like all such lists, it probably has resources you have already seen. And the next person who thinks a list of data science resources is a great idea will make yet another list.

I suspect that for web-based resources we can do a fair job of deduping lists of resources, but how do we create incentives to seek out or make more visible all the existing lists? And of course, having done that, how do we create incentives to combine those lists?

So far as I can tell, the nature and extent of incentives for such collaboration are either unknown or unpracticed. I’m betting on unknown. Thoughts on how to explore possible incentives? The worst we can do is remain with the status quo.

I first saw this in a tweet by Marcelo Domínguez.

“The Whole Is Greater Than the Sum of Its Parts”

Filed under: Crowd Sourcing — Patrick Durusau @ 6:04 pm

“The Whole Is Greater Than the Sum of Its Parts”: Optimization in Collaborative Crowdsourcing by Habibur Rahman, et al.

Abstract:

In this work, we initiate the investigation of optimization opportunities in collaborative crowdsourcing. Many popular applications, such as collaborative document editing, sentence translation, or citizen science resort to this special form of human-based computing, where, crowd workers with appropriate skills and expertise are required to form groups to solve complex tasks. Central to any collaborative crowdsourcing process is the aspect of successful collaboration among the workers, which, for the first time, is formalized and then optimized in this work. Our formalism considers two main collaboration-related human factors, affinity and upper critical mass, appropriately adapted from organizational science and social theories. Our contributions are (a) proposing a comprehensive model for collaborative crowdsourcing optimization, (b) rigorous theoretical analyses to understand the hardness of the proposed problems, (c) an array of efficient exact and approximation algorithms with provable theoretical guarantees. Finally, we present a detailed set of experimental results stemming from two real-world collaborative crowdsourcing applications using Amazon Mechanical Turk, as well as conduct synthetic data analyses on scalability and qualitative aspects of our proposed algorithms. Our experimental results successfully demonstrate the efficacy of our proposed solutions.

Heavy sledding but given the importance of crowd sourcing and the potential for any increase in productivity, well worth the effort!

I first saw this in a tweet by Dave Rubal.

Principles of Model Checking

Filed under: Design,Modeling,Software,Software Engineering — Patrick Durusau @ 5:15 pm

Principles of Model Checking by Christel Baier and Joost-Pieter Katoen. Foreword by Kim Guldstrand Larsen.

From the webpage:

Our growing dependence on increasingly complex computer and software systems necessitates the development of formalisms, techniques, and tools for assessing functional properties of these systems. One such technique that has emerged in the last twenty years is model checking, which systematically (and automatically) checks whether a model of a given system satisfies a desired property such as deadlock freedom, invariants, or request-response properties. This automated technique for verification and debugging has developed into a mature and widely used approach with many applications. Principles of Model Checking offers a comprehensive introduction to model checking that is not only a text suitable for classroom use but also a valuable reference for researchers and practitioners in the field.

The book begins with the basic principles for modeling concurrent and communicating systems, introduces different classes of properties (including safety and liveness), presents the notion of fairness, and provides automata-based algorithms for these properties. It introduces the temporal logics LTL and CTL, compares them, and covers algorithms for verifying these logics, discussing real-time systems as well as systems subject to random phenomena. Separate chapters treat such efficiency-improving techniques as abstraction and symbolic manipulation. The book includes an extensive set of examples (most of which run through several chapters) and a complete set of basic results accompanied by detailed proofs. Each chapter concludes with a summary, bibliographic notes, and an extensive list of exercises of both practical and theoretical nature.

The present IT structure has shown itself to be as secure as a sieve. Do you expect the “Internet of Things” to be any more secure?

If you are interested in secure or at least less buggy software, more formal analysis is going to be a necessity. This title will give you an introduction to the field.

It dates from 2008 so some updating will be required.
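For a flavor of what “systematically (and automatically) checks” means, here is a toy sketch, not from the book: exhaustive breadth-first exploration of the state space of two processes running a deliberately naive protocol, checking the mutual exclusion invariant.

```python
# Toy model checker: BFS over all reachable states of two processes,
# checking the invariant "never both in the critical section".
# The protocol here is deliberately naive, so a counterexample exists.
from collections import deque

IDLE, TRYING, CRITICAL = 0, 1, 2

def successors(state):
    """All interleavings: either process may take one step."""
    for i in (0, 1):
        s = list(state)
        if s[i] == IDLE:
            s[i] = TRYING
        elif s[i] == TRYING:      # naive: enters without checking its peer
            s[i] = CRITICAL
        else:                     # CRITICAL: leave the critical section
            s[i] = IDLE
        yield tuple(s)

def check(initial, invariant):
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state          # counterexample found
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None                   # invariant holds in every reachable state

bad = check((IDLE, IDLE), lambda s: s != (CRITICAL, CRITICAL))
print("counterexample:", bad)     # (2, 2): both processes critical at once
```

Real model checkers add temporal logics, fairness, and clever state-space reduction, but the exhaustive-exploration core is the same.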

I first saw this in a tweet by Reid Draper.

Reddit Terminal Viewer

Filed under: Python — Patrick Durusau @ 4:42 pm

Reddit Terminal Viewer by Michael Lazar.

From the webpage:

Reddit Terminal Viewer (RTV) is a lightweight browser for Reddit (www.reddit.com) built into a terminal window. RTV is built in Python and utilizes the curses library. It is compatible with a large range of terminal emulators on Linux and OSX systems.

Sometimes, text is all you need for fast browsing/searching.

The more graphical the Web becomes, the more useful text interfaces become. Is text the answer to graphic spam?

I first saw this in a tweet by Randy Olson.

The Matrix Cheatsheet

Filed under: Julia,Matrix,Numpy,R — Patrick Durusau @ 4:31 pm

The Matrix Cheatsheet by Sebastian Raschka.

Sebastian has created a spreadsheet of thirty (30) matrix tasks and compares the code for each in: MATLAB/Octave, Python NumPy, R, and Julia.

Given the prevalence of matrices in so many data science tasks, this can’t help but be useful.
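For a taste of the kind of one-liners the sheet lines up side by side, here are a few common tasks in their NumPy form (the task selection is mine, not copied from Sebastian’s spreadsheet):

```python
# A few typical matrix-cheatsheet tasks, NumPy flavor.
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.eye(2)                    # 2x2 identity matrix

print(A.T)                       # transpose
print(A @ B)                     # matrix product (np.dot(A, B) before Python 3.5)
print(np.linalg.inv(A))          # inverse
print(np.linalg.det(A))          # determinant
print(np.linalg.eig(A)[0])       # eigenvalues
```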

A bit longer treatment can be found at: The Matrix Cookbook.

I first saw this in a tweet by Yhat, Inc.

ComputerWorld’s R for Beginners Hands-On Guide

Filed under: Data Science,R — Patrick Durusau @ 4:18 pm

ComputerWorld’s R for Beginners Hands-On Guide by David Smith.

From the post:

Computerworld’s Sharon Machlis has done a great service for the R community — and especially for R novices — by creating the on-line Beginner’s Guide to R. You can read our overview of her guide from 2013 here, but it’s been regularly updated since then.

Now available in PDF format!

David also suggests that R beginners check out beginner’s tips for R from the Revolutions archive.

If you are using R, the Revolutions blog is on your browser toolbar. If you are learning R, the Revolutions blog should be on your browser toolbar.

SmileMiner [Conflicting Data Science Results?]

Filed under: Java,Machine Learning — Patrick Durusau @ 3:21 pm

SmileMiner

From the webpage:

SmileMiner (Statistical Machine Intelligence and Learning Engine) is a pure Java library of various state-of-art machine learning algorithms. SmileMiner is self contained and requires only Java standard library.

SmileMiner is well documented and you can browse the javadoc for more information. A basic tutorial is available on the project wiki.

To see SmileMiner in action, please download the demo jar file and then run java -jar smile-demo.jar.

  • Classification: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.
  • Regression: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, Ridge Regression.
  • Feature Selection: Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, Signal Noise ratio, Sum Squares ratio.
  • Clustering: BIRCH, CLARANS, DBScan, DENCLUE, Deterministic Annealing, K-Means, X-Means, G-Means, Neural Gas, Growing Neural Gas, Hierarchical Clustering, Sequential Information Bottleneck, Self-Organizing Maps, Spectral Clustering, Minimum Entropy Clustering.
  • Association Rule & Frequent Itemset Mining: FP-growth mining algorithm
  • Manifold learning: IsoMap, LLE, Laplacian Eigenmap, PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection
  • Multi-Dimensional Scaling: Classical MDS, Isotonic MDS, Sammon Mapping
  • Nearest Neighbor Search: BK-Tree, Cover Tree, KD-Tree, LSH
  • Sequence Learning: Hidden Markov Model.

Great to have another machine learning library but it reminded me of a question I read yesterday:

When two teams of data scientists report conflicting results, how does a manager choose between them?

There is a view, says Florian Zettelmeyer, the Nancy L. Ertle Professor of Marketing, that data science represents disembodied truth.

Zettelmeyer, himself a data scientist, fervently disagrees with that view.

“Data science fundamentally speaks to management decisions,” he said, “and management decisions are fundamentally political. There are agendas and there are winners and losers. As a result, different teams will often come up with different conclusions and it is the job of a manager to be able to call it. This requires a ‘working knowledge of data science.’”

Granting it is a promotion for the Kellogg School of Management but Zettelmeyer has a good point.

I’m not so sure that a “working knowledge of data science” is required to choose between different answers in data science. Knowledge of what their superiors are likely to accept is a more likely criterion.

A good machine learning library should give you enough options to approximate the expected answer.

I first saw this in a tweet by Bence Arato.

Light as Wave and Particle (Naming Issue?)

Filed under: Physics,Science — Patrick Durusau @ 2:50 pm

Scientists take the first ever photograph of light as both a wave and a particle by Kelly Dickerson.

[Image: light photographed as both a wave and a particle]

For the first time ever, scientists have snapped a photo of light behaving as both a wave and a particle at the same time.

The research was published on Monday in the journal Nature Communications.

Scientists know that light is a wave. That’s why light can bend around buildings and squeeze through tiny pinholes. Different wavelengths of light are why we can see different colors, and why everyone freaked out about that black and blue dress.

But all the characteristics and behaviors of a wave aren’t enough to explain everything that light does.

Naming issue?

Before this photo, light behaved as a wave or as a particle. Now we have a photo of light between those two states? Neither of the old terms is sufficient by itself.

Who is going to break the news to Cyc? 😉

I first saw this in a tweet by Reg Saddler.

KillrWeather

Filed under: Akka,Cassandra,Spark,Time Series — Patrick Durusau @ 2:31 pm

KillrWeather

From the post:

KillrWeather is a reference application (which we are constantly improving) showing how to easily leverage and integrate Apache Spark, Apache Cassandra, and Apache Kafka for fast, streaming computations in asynchronous Akka event-driven environments. This application focuses on the use case of time series data.

The site doesn’t give enough emphasis to the importance of time series data. Yes, weather is an easy example of time series data, but consider another incomplete listing of the uses of time series data:

A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Examples of time series are ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average. Time series are very frequently plotted via line charts. Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which involves temporal measurements.

(Time Series)
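If the definition feels abstract, the bread-and-butter operations are easy to show. A minimal pandas sketch on synthetic hourly measurements (resampling to daily means and a rolling average):

```python
# Basic time series operations on synthetic hourly temperature data.
import numpy as np
import pandas as pd

idx = pd.date_range("2015-03-01", periods=72, freq="H")   # 3 days, hourly
temps = pd.Series(
    20 + 5 * np.sin(np.arange(72) * 2 * np.pi / 24) + np.random.randn(72),
    index=idx)

print(temps.resample("D").mean())             # downsample: hourly -> daily mean
print(temps.rolling(window=6).mean().tail())  # smooth: 6-hour rolling average
```

KillrWeather’s Spark/Cassandra/Kafka stack exists to do this kind of computation continuously, at scale, on streams rather than a 72-row toy.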

Mastering KillrWeather will put you on the road to many other uses of time series data.

Enjoy!

I first saw this in a tweet by Chandra Gundlapalli.

March 2, 2015

How To Publish Open Data (in the UK)

Filed under: Humor,Open Data — Patrick Durusau @ 8:49 pm

http://www.owenboswarva.com/opendata/OD_Pub_DecisionTree.jpg

No way this will display properly so I just linked to it.

I don’t know about the UK but a very similar discussion takes place in academic circles before releasing data that fewer than a dozen people have asked to see, ever.

Enjoy!

I first saw this in a tweet by Irina Bolychevsky.

RAD – Outlier Detection on Big Data

Filed under: BigData,Outlier Detection — Patrick Durusau @ 8:35 pm

RAD – Outlier Detection on Big Data by Jeffrey Wong, Chris Colburn, Elijah Meeks, and Shankar Vedaraman.

From the post:

Outlier detection can be a pain point for all data driven companies, especially as data volumes grow. At Netflix we have multiple datasets growing by 10B+ record/day and so there’s a need for automated anomaly detection tools ensuring data quality and identifying suspicious anomalies. Today we are open-sourcing our outlier detection function, called Robust Anomaly Detection (RAD), as part of our Surus project.

As we built RAD we identified four generic challenges that are ubiquitous in outlier detection on “big data.”

  • High cardinality dimensions: High cardinality data sets – especially those with large combinatorial permutations of column groupings – makes human inspection impractical.
  • Minimizing False Positives: A successful anomaly detection tool must minimize false positives. In our experience there are many alerting platforms that “sound an alarm” that goes ultimately unresolved. The goal is to create alerting mechanisms that can be tuned to appropriately balance noise and information.
  • Seasonality: Hourly/Weekly/Bi-weekly/Monthly seasonal effects are common and can be mis-identified as outliers deserving attention if not handled properly. Seasonal variability needs to be ignored.
  • Data is not always normally distributed: This has been a particular challenge since Netflix has been growing over the last 24 months. Generally though, an outlier tool must be robust so that it works on data that is not normally distributed.

In addition to addressing the challenges above, we wanted a solution with a generic interface (supporting application development). We met these objectives with a novel algorithm encased in a wrapper for easy deployment in our ETL environment.
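RAD itself wraps a more sophisticated, novel algorithm; as a much simpler illustration of what “robust” buys you on data that is not normally distributed, compare a median/MAD score, which a single spike cannot drag around the way it drags a mean and standard deviation:

```python
# Median/MAD outlier score: not RAD's algorithm, just a small
# illustration of a robust detector on non-normal data.
import numpy as np

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 55.0])  # one spike

median = np.median(data)
mad = np.median(np.abs(data - median))      # median absolute deviation
robust_z = 0.6745 * (data - median) / mad   # comparable to z-scores

print(robust_z.round(2))
print("outliers:", data[np.abs(robust_z) > 3.5])   # flags only 55.0
```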

Looking for “suspicious anomalies” is always popular, in part because it implies someone has deliberately departed from “normal” behavior.

Certainly important, but as the FBI-staged terror plots we discussed earlier today show, the normal FBI “m.o.” is to stage terror plots; an anomaly would be a real terror plot, one not staged by the FBI.

The lesson: don’t assume outliers are departures from a desired norm. They can be, but they are not always.
