Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 18, 2015

“We live in constant fear of upsetting the WH (White House).”

Filed under: Government,Politics,Transparency — Patrick Durusau @ 5:44 pm

Administration sets record for withholding government files by Ted Bridis.

From the post:

The Obama administration set a record again for censoring government files or outright denying access to them last year under the U.S. Freedom of Information Act, according to a new analysis of federal data by The Associated Press.

The government took longer to turn over files when it provided any, said more regularly that it couldn’t find documents and refused a record number of times to turn over files quickly that might be especially newsworthy.

It also acknowledged in nearly 1 in 3 cases that its initial decisions to withhold or censor records were improper under the law — but only when it was challenged.

Its backlog of unanswered requests at year’s end grew remarkably by 55 percent to more than 200,000. It also cut by 375, or about 9 percent, the number of full-time employees across government paid to look for records. That was the fewest number of employees working on the issue in five years.

The government’s new figures, published Tuesday, covered all requests to 100 federal agencies during fiscal 2014 under the Freedom of Information law, which is heralded globally as a model for transparent government. They showed that despite disappointments and failed promises by the White House to make meaningful improvements in the way it releases records, the law was more popular than ever. Citizens, journalists, businesses and others made a record 714,231 requests for information. The U.S. spent a record $434 million trying to keep up. It also spent about $28 million on lawyers’ fees to keep records secret.

Ted does a great job detailing the secretive and paranoid Obama White House, up to and including a censored document in which the censors forgot to cover up the line:

“We live in constant fear of upsetting the WH (White House).”

Although I must confess that I don’t know if it is worse that President Obama and company are so non-transparent or that they lie about how transparent they are with such easy smiles. No shame, no embarrassment, they lie when the truth would do just as well.

Not that I think any other member of government does any better, but that is hardly an excuse.

The only legitimate solution that I see going forward is massive leaks from all parts of government. If you aren’t leaking, you are part of the problem.

Open Source Tensor Libraries For Data Science

Filed under: Data Science,Mathematics,Open Source,Programming — Patrick Durusau @ 5:20 pm

Let’s build open source tensor libraries for data science by Ben Lorica.

From the post:

Data scientists frequently find themselves dealing with high-dimensional feature spaces. As an example, text mining usually involves vocabularies comprised of 10,000+ different words. Many analytic problems involve linear algebra, particularly 2D matrix factorization techniques, for which several open source implementations are available. Anyone working on implementing machine learning algorithms ends up needing a good library for matrix analysis and operations.

But why stop at 2D representations? In a recent Strata + Hadoop World San Jose presentation, UC Irvine professor Anima Anandkumar described how techniques developed for higher-dimensional arrays can be applied to machine learning. Tensors are generalizations of matrices that let you look beyond pairwise relationships to higher-dimensional models (a matrix is a second-order tensor). For instance, one can examine patterns between any three (or more) dimensions in data sets. In a text mining application, this leads to models that incorporate the co-occurrence of three or more words, and in social networks, you can use tensors to encode arbitrary degrees of influence (e.g., “friend of friend of friend” of a user).
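The jump from matrices to tensors is easier to see in code than in prose. Here is a toy sketch (plain NumPy, made-up vocabulary and sentences, nothing from Anandkumar's talk) of a third-order word co-occurrence tensor:

```python
import numpy as np

# Toy third-order co-occurrence tensor: T[i, j, k] counts how often
# words i, j and k appear together in the same sentence.
vocab = ["topic", "map", "merge", "data"]
sentences = [
    ["topic", "map", "merge"],
    ["topic", "map", "data"],
    ["map", "merge", "data"],
]

index = {w: i for i, w in enumerate(vocab)}
T = np.zeros((len(vocab),) * 3)

for sent in sentences:
    ids = [index[w] for w in sent]
    for i in ids:
        for j in ids:
            for k in ids:
                if len({i, j, k}) == 3:      # only count genuine triples
                    T[i, j, k] += 1

# "Unfolding" the tensor into a matrix is the usual first step in
# tensor decompositions (e.g. CP/PARAFAC); this is the mode-0 unfolding.
T_unfolded = T.reshape(len(vocab), -1)
print(T.shape, T_unfolded.shape)   # (4, 4, 4) (4, 16)
```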

In case you are interested, Wikipedia has a list of software packages for tensor analysis.

Not mentioned by Wikipedia: Facebook open sourced TH++, a library for tensor analysis, last year, along with fblualib, which includes a bridge between Python and Lua (for running tensor analysis).

Uni10 wasn’t mentioned by Wikipedia either.

Good starting place: Big Tensor Mining, Carnegie Mellon Database Group.

Suggest you join an existing effort before you start duplicating existing work.

UK Bioinformatics Landscape

Filed under: Bioinformatics,Topic Maps — Patrick Durusau @ 4:16 pm

UK Bioinformatics Landscape

Two of the four known challenges in the UK bioinformatics landscape could be addressed by topic maps:

  • Data integration and management of omics data, enabling the use of “big data” across thematic areas and to facilitate data sharing;
  • Open innovation, or pre-competitive approaches in particular to data sharing and data standardisation

I say could be addressed by topic maps because I’m not sure what else you would use to address data integration issues, at least robustly. If you don’t mind paying to migrate data when terminology changes enough to impair your effectiveness, and continuing to pay for every future migration, I suppose that is one solution.

Given the choice, I suspect many people would like to exit the wheel of ETL.

A Comprehensive study of Convergent and Commutative Replicated Data Types

Filed under: CRDT — Patrick Durusau @ 3:47 pm

A Comprehensive study of Convergent and Commutative Replicated Data Types reviewed by Adrian Colyer.

From the post:

This paper introduces the concept of a CRDT, a “simple, theoretically sound approach to eventual consistency.” Let’s address one of the pressing distributed systems questions of our time right here: “what does CRDT stand for?” We’ve seen over the last couple of weeks that there are two fundamental approaches to replication: you can execute operations at a primary and replicate the resulting state, or you can replicate the operations themselves. If you’re replicating state, then given some convergence rules for state, you can create Convergent Replicated Data Types. If you’re replicating operations, then given operations carefully designed to commute, you can create Commutative Replicated Data Types. Conveniently both ‘convergent’ and ‘commutative’ begin with C, so we can call both of these CRDTs. In both cases, the higher order goal is to avoid the need for coordination by ensuring that actions taken independently can’t conflict with each other (and thus can be composed at a later point in time). Thus we might also call them Conflict-free Replicated Data Types.

Think of it a bit like this: early on languages gave us standard data type implementations for set, list, map, and so on. Then we saw the introduction of concurrent versions of collections and related data types. With CRDTs, we are seeing the birth of distributed collections and related data types. Eventually any self-respecting language/framework will come with a distributed collections library – Riak already supports CRDTs and Jonas has an Akka CRDT library in github at least. As you read through the paper, it’s tempting to think “oh, these are pretty straightforward to implement,” but pay attention to the section on garbage collection – a bit like we saw with Edelweiss, making production implementations with state that doesn’t grow unbounded makes things more difficult.
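To make the “convergent” half concrete, here is a minimal G-Counter sketch in Python. It follows the paper’s general description of state-based CRDTs, not any particular library:

```python
# A minimal state-based (convergent) G-Counter: a grow-only counter
# where each replica only ever increments its own slot.
class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas      # one slot per replica

    def increment(self):
        self.counts[self.replica_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # so replicas converge no matter what order merges arrive in.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```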

If you haven’t read A comprehensive study of Convergent and Commutative Replicated Data Types by Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski, this is a very useful and approachable introduction.

Enjoy!

Pyhipku

Filed under: Humor — Patrick Durusau @ 3:25 pm

Pyhipku

Purely in the interest of Spring and Cherry trees being in full bloom, a site that generates a haiku from your IP address. Ideal for afternoons that seem to go on forever. 😉

I saw this several days ago on Twitter but I honestly don’t remember where.

Use The Code Luke!

Filed under: Deep Learning,Machine Learning,Neural Networks — Patrick Durusau @ 2:41 pm

Hacker’s guide to Neural Networks by Andrej Karpathy.

From the post:

Hi there, I'm a CS PhD student at Stanford. I've worked on Deep Learning for a few years as part of my research and among several of my related pet projects is ConvNetJS – a Javascript library for training Neural Networks. Javascript allows one to nicely visualize what's going on and to play around with the various hyperparameter settings, but I still regularly hear from people who ask for a more thorough treatment of the topic. This article (which I plan to slowly expand out to lengths of a few book chapters) is my humble attempt. It's on web instead of PDF because all books should be, and eventually it will hopefully include animations/demos etc.

My personal experience with Neural Networks is that everything became much clearer when I started ignoring full-page, dense derivations of backpropagation equations and just started writing code. Thus, this tutorial will contain very little math (I don't believe it is necessary and it can sometimes even obfuscate simple concepts). Since my background is in Computer Science and Physics, I will instead develop the topic from what I refer to as a hacker's perspective. My exposition will center around code and physical intuitions instead of mathematical derivations. Basically, I will strive to present the algorithms in a way that I wish I had come across when I was starting out.

"…everything became much clearer when I started writing code."

You might be eager to jump right in and learn about Neural Networks, backpropagation, how they can be applied to datasets in practice, etc. But before we get there, I'd like us to first forget about all that. Let's take a step back and understand what is really going on at the core. Lets first talk about real-valued circuits.

I won’t say you don’t need more formal methods as well, but everyone learns in different ways. If doing the code first is better for you, here’s a treatment of deep learning from that perspective.
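If you want a taste of the code-first style before clicking through, here is a rough sketch (mine, not Karpathy's) of the "real-valued circuit" idea: a single multiply gate plus a numerical gradient:

```python
# One multiply gate, a numerical gradient, and a small step along it.
def forward(x, y):
    return x * y

def numerical_gradient(x, y, h=1e-4):
    # Nudge each input and see how much the output moves.
    dx = (forward(x + h, y) - forward(x, y)) / h
    dy = (forward(x, y + h) - forward(x, y)) / h
    return dx, dy

x, y = -2.0, 3.0
dx, dy = numerical_gradient(x, y)
step = 0.01
x, y = x + step * dx, y + step * dy   # step in the direction that raises the output
print(forward(x, y))                  # slightly larger than -6.0
```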

The last comments were approximately four (4) months ago. I am hopeful this work will continue.

Wandora tutorial – OCR extractor and Alchemy API Entity extractor

Filed under: Entity Resolution,OCR,Topic Map Software,Topic Maps,Wandora — Patrick Durusau @ 1:47 pm

From the description:

Video reviews the OCR (Optical Character Recognition) extractor and the Alchemy API Entity extractor of Wandora application. First, the OCR extractor is used to recognize text out of PNG images. Next the Alchemy API Entity extractor is used to recognize entities out of the text. Wandora is an open source tool for people who collect and process information, especially networked knowledge and knowledge about WWW resources. For more information see http://wandora.org.

A great demo of some of the many options of Wandora! (Wandora has more options than a Swiss army knife.)

It is an impressive demonstration.

If you aren’t familiar with Wandora, take a close look at it: http://wandora.org.

Interactive Intent Modeling: Information Discovery Beyond Search

Interactive Intent Modeling: Information Discovery Beyond Search by Tuukka Ruotsalo, Giulio Jacucci, Petri Myllymäki, Samuel Kaski.

From the post:

Combining intent modeling and visual user interfaces can help users discover novel information and dramatically improve their information-exploration performance.

Current-generation search engines serve billions of requests each day, returning responses to search queries in fractions of a second. They are great tools for checking facts and looking up information for which users can easily create queries (such as “Find the closest restaurants” or “Find reviews of a book”). What search engines are not good at is supporting complex information-exploration and discovery tasks that go beyond simple keyword queries. In information exploration and discovery, often called “exploratory search,” users may have difficulty expressing their information needs, and new search intents may emerge and be discovered only as they learn by reflecting on the acquired information [8,9,18]. This finding roots back to the “vocabulary mismatch problem” [13] that was identified in the 1980s but has remained difficult to tackle in operational information retrieval (IR) systems (see the sidebar “Background”). In essence, the problem refers to human communication behavior in which the humans writing the documents to be retrieved and the humans searching for them are likely to use very different vocabularies to encode and decode their intended meaning [8,21].

Assisting users in the search process is increasingly important, as everyday search behavior ranges from simple look-ups to a spectrum of search tasks [23] in which search behavior is more exploratory and information needs and search intents uncertain and evolving over time.

We introduce interactive intent modeling, an approach promoting resourceful interaction between humans and IR systems to enable information discovery that goes beyond search. It addresses the vocabulary mismatch problem by giving users potential intents to explore, visualizing them as directions in the information space around the user’s present position, and allowing interaction to improve estimates of the user’s search intents.

What!? All those years spent trying to beat users into learning complex search languages were in vain? Say it’s not so!

But, apparently it is so. All of the research on “vocabulary mismatch problem,” “different vocabularies to encode and decode their meaning,” has come back to bite information systems that offer static and author-driven vocabularies.

Users search best, no surprise, through vocabularies they recognize and understand.

I don’t know of any interactive topic maps in the sense used here but that doesn’t mean that someone isn’t working on one.

A shift in this direction could do wonders for the results of searches.
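For what it's worth, the feedback loop the authors describe is easy to caricature in a few lines. This is only my toy illustration, not their system; the keywords and update rule are invented:

```python
# Keyword weights stand in for an estimate of the user's intent;
# explicit feedback nudges the estimate before the next retrieval round.
intent = {"search": 0.5, "exploration": 0.5, "ranking": 0.5}

def update(intent, liked, disliked, rate=0.2):
    for kw in liked:
        intent[kw] = intent.get(kw, 0.0) + rate * (1.0 - intent.get(kw, 0.0))
    for kw in disliked:
        intent[kw] = intent.get(kw, 0.0) * (1.0 - rate)
    return intent

# The user signals interest in "exploration" and "discovery", less in "ranking".
intent = update(intent, liked=["exploration", "discovery"], disliked=["ranking"])
print(sorted(intent.items(), key=lambda kv: -kv[1]))
```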

Full rules for protecting net neutrality released by FCC

Filed under: Government,Politics,Topic Maps — Patrick Durusau @ 1:08 pm

Full rules for protecting net neutrality released by FCC by Lisa Vaas

From the post:

The US Federal Communications Commission (FCC) on Thursday laid down 400 pages’ worth of details on how it plans to regulate broadband providers as a public utility.

These are the rules – and their legal justifications – meant to protect net neutrality.

Hardly the first word on net neutrality but it is a good centering point for much of the discussion that will follow. Think of using the document as a gateway into the larger discussion. A gateway that can lead you to interesting financial interests and relationships.

In response to provider claims about slow development of faster access and services, I would remind providers that the government built the Internet, it could certainly build another one. It could even contract out to Google to build one for it.

A WPA type project managed for quality purposes by Google. Then the government could lease equal access to its TB pipe. Changes the dynamics when it isn’t providers holding consumers hostage but a large competitor pushing against large providers.

PS: To anyone who thinks government competing with private business is “unfair,” given the conduct of private business, I wonder what you are using as a basis for comparison?

March 17, 2015

On Lemmings and PageRank

Filed under: PageRank,Searching,Software — Patrick Durusau @ 4:04 pm

Solving Open Source Discovery by Andrew Nesbitt.

From the post:

Today I’m launching Libraries.io, a project that I’ve been working on for the past couple of months.

The intention is to help developers find new open source libraries, modules and frameworks and keep track of ones they depend upon.

The world of open source software depends on a lot of open source libraries. We are standing on the shoulders of giants, which helps us to reach further than we could otherwise.

The problem with platforms like Rubygems and NPM is there are so many libraries, with hundreds of new ones added every day. Trying to find the right library can be overwhelming.

How do you find libraries that help you solve problems? How do you then know which of those libraries are worth using?

Andrew substitutes dependencies for links in a page rank algorithm and then:

Within Libraries.io I’ve aggregated over 700,000 projects, written in 130 languages from across 22 package managers, including dependencies, releases, license information and source code repository information. This results in a rich index of almost every open source library available for use today.

Follow me on Twitter at @teabass and @librariesio for updates. Discussion on Hacker News: https://news.ycombinator.com/item?id=9211084.

Is Libraries.io going to be useful? Yes!

Is Libraries.io a fun way to explore projects? Yes!

Is Libraries.io a great alternative to current source search options? Yes!

Is Libraries.io the solution to open source discovery? Less clear.

I say that because PageRank, whether using hyperlinks or dependencies, results in a lemming view of the world in question.

Wikipedia reports this is an image of a lemming:

[Image: a lemming]

I, on the other hand, bear a passing resemblance to this image:

[Image: photo of Patrick Durusau]

I offer those images as evidence that I am not a lemming! 😉

The opinions and usages of others can be of interest, but I follow work and people of interest to me, not because they are of interest to others. Otherwise I would be following Lady Gaga on Twitter, for example. To save you the trouble of downloading her forty-five million (45M) followers, I hereby attest that I am not one of them.

Make no mistake, Andrew’s work should be used, followed, supported, improved, but as another view of an important data set, not a solution.
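If you want to see what “dependencies as links” means mechanically, here is a bare-bones PageRank sketch over an invented dependency graph. It is not Andrew's code:

```python
# Power-iteration PageRank where edges point from a package to the
# libraries it depends on; heavily depended-upon packages rank highest.
deps = {
    "app-a":    ["left-pad", "http-lib"],
    "app-b":    ["left-pad"],
    "http-lib": ["left-pad"],
    "left-pad": [],
}

def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, targets in graph.items():
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

print(pagerank(deps))   # left-pad ends up with the highest score
```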

I first saw this in a tweet by Arfon Smith.

U.S. State Department: Email Alert!

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:21 pm

I would have quoted some interesting material from SANS NewsBites Vol. 17 Num. 021 : U.S. State Department Email Goes Dark but the terms of the newsletter prohibit reposting to websites. You can subscribe for yourself at http://portal.sans.org/.

Without quoting their material, I can say that if you are having difficulty hacking U.S. State Department emails servers, it isn’t you.

Justin Fishel and Lee Ferran report in State Dept. Shuts Down Email After Cyber Attack that the State Department has taken down large parts of its unclassified email system. This just months after another shutdown for similar reasons last November. No date has been set for restoration of service.

Attempting to configure a network connection only to discover that the Ethernet cable is unplugged doesn’t make for a happy day. Hacking a down system is likely quite similar. You’ll just have to catch State between purges of its email system.

The SANS NewsBites newsletter is a great read!

Enjoy!

Internet Search as a Crap Shoot in a Black Box

Filed under: Search Algorithms,Search Interface,Searching — Patrick Durusau @ 2:57 pm

The post, Google To Launch New Doorway Page Penalty Algorithm by Barry Schwartz reminded me that Internet search is truly a crap shoot in a black box.

Google has over two hundred (200) factors that are known (or suspected) to play a role in its search algorithms and their ranking of results.

Even if you memorized the 200, if you are searching you don’t know how those factors will impact pages with information you want to see. (Unless you want to drive web traffic, the 200 factors are a curiosity and not much more.)

When you throw your search terms, like dice, into the Google black box, you don’t know how they will interact with the unknown results of the ranking algorithms.

To make matters worse, yes, worse, the Google algorithms change over time. Some major, some not quite so major. But every change stands a chance to impact any ad hoc process you have adopted for finding information.

A good number of you won’t remember print indexes but one of their attractive features (in hindsight) was that the indexing was uniform, at least within reasonable limits, for decades. If you learned how to effectively use the printed index, you could always find information using that technique, without fear that the familiar results would simply disappear.

Perhaps that is a commercial use case for the Common Crawl data. Imagine a disclosed ranking algorithm that could be applied to create a custom ranking for a subset of the data against which to perform searches. So the ranking against which you are searching is known and can be explored.

It would not have the very latest data but that’s difficult to extract from Google since it apparently tosses the information about when it first encountered a page. Or at the very least doesn’t make it available to users. At least as an option, being able to pick the most recent resources matching a search would be vastly superior to the page-rank orthodoxy at Google.

Not to single Google out too much because I haven’t encountered other search engines that are more transparent. They may exist but I am unaware of them.
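To make the idea a little less abstract, here is a toy sketch of what a disclosed, user-tunable ranking function might look like. The documents, fields and weights are all invented:

```python
# Every ranking factor is visible and every weight is user-adjustable,
# e.g. w_recency can be turned up to favour recently discovered pages.
from datetime import date

docs = [
    {"url": "a", "term_hits": 4, "first_seen": date(2015, 3, 1)},
    {"url": "b", "term_hits": 9, "first_seen": date(2012, 6, 1)},
]

def score(doc, w_hits=1.0, w_recency=2.0, today=date(2015, 3, 17)):
    age_years = (today - doc["first_seen"]).days / 365.0
    recency = 1.0 / (1.0 + age_years)          # newer pages score higher
    return w_hits * doc["term_hits"] + w_recency * recency

for doc in sorted(docs, key=score, reverse=True):
    print(doc["url"], round(score(doc), 2))
```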

World Register of Marine Introduced Species (WRIMS)

Filed under: Biology,Science — Patrick Durusau @ 1:18 pm

World Register of Marine Introduced Species (WRIMS)

From the post:

WRIMS – a database of introduced and invasive alien marine species – has officially been released to the public. It includes more than 1,400 marine species worldwide, compiled through the collaboration with international initiatives and study of almost 2,500 publications.

WRIMS lists the known alien marine species worldwide, with an indication of the region in which they are considered to be alien. In addition, the database lists whether a species is reported to have ecological or economic impacts and thus considered invasive in that area. Each piece of information is linked to a source publication or a specialist database, allowing users to retrace the information or get access to the full source for more details.

Users can search for species within specific groups, and generate species lists per geographic region, thereby taking into account their origin (alien or origin unknown or uncertain) and invasiveness (invasive, of concern, uncertain …). For each region the year of introduction or first report has been documented where available. In the past, species have sometimes erroneously been labelled as ‘alien in region X’. This information is also stored in WRIMS, clearly indicating that this was an error. Keeping track of these kinds of errors or misidentifications can greatly help researchers and policy makers in dealing with alien species.

WRIMS is a subset of the World Register of Marine Species (WoRMS): the taxonomy of the species is managed by the taxonomic editor community of WoRMS, whereas the alien-related information is managed by both the taxonomic editors and the thematic editors within WRIMS. Just like its umbrella-database WoRMS, WRIMS is dynamic: a team of editors is not only keeping track of new reports of alien species, they also scan existing literature and databases to complete the general distribution range of each alien species in WRIMS.

Are there aliens in your midst? 😉

Exactly the sort of resource that if I don’t capture it now, I will never be able to find it again.

Enjoy!

Can Spark Streaming survive Chaos Monkey?

Filed under: Software,Software Engineering,Spark — Patrick Durusau @ 12:57 pm

Can Spark Streaming survive Chaos Monkey? by Bharat Venkat, Prasanna Padmanabhan, Antony Arokiasamy, Raju Uppalap.

From the post:

Netflix is a data-driven organization that places emphasis on the quality of data collected and processed. In our previous blog post, we highlighted our use cases for real-time stream processing in the context of online recommendations and data monitoring. With Spark Streaming as our choice of stream processor, we set out to evaluate and share the resiliency story for Spark Streaming in the AWS cloud environment. A Chaos Monkey based approach, which randomly terminated instances or processes, was employed to simulate failures.

Spark on Amazon Web Services (AWS) is relevant to us as Netflix delivers its service primarily out of the AWS cloud. Stream processing systems need to be operational 24/7 and be tolerant to failures. Instances on AWS are ephemeral, which makes it imperative to ensure Spark’s resiliency.

If Spark were a commercial product, this is where you would see, in bold: not a vendor report, but a report from a customer.

You need to see the post for the details but so you know what to expect:

  • Driver (Process): Client Mode: the entire application is killed. Cluster Mode with supervise: the Driver is restarted on a different Worker node.
  • Master (Process): Single Master: the entire application is killed. Multi Master: a STANDBY master is elected ACTIVE.
  • Worker Process (Process): All child processes (executor or driver) are also terminated and a new worker process is launched.
  • Executor (Process): A new executor is launched by the Worker process.
  • Receiver (Thread(s)): Same as the Executor, as receivers are long-running tasks inside the Executor.
  • Worker Node (Node): Worker, Executor and Driver processes run on Worker nodes and the behaviour is the same as killing them individually.

I can think of few things more annoying than software that works, sometimes. If you want users to rely upon you, then your service will have to be reliable.

A performance post by Netflix is rumored to be in the offing!

Enjoy!

…unique radicalization process now taking place in the digital era…

Filed under: News,Politics,Reporting — Patrick Durusau @ 12:22 pm

Social and news media, violent extremism, ISIS and online speech: Research review

The Journalist’s Resource is produced by Harvard’s Shorenstein Center on Media, Politics and Public Policy and is a must read site if you are a journalist or if you are interested in high quality background on current news stories. Having said that, you need to read the background materials themselves, in addition to the summary given by these reports.

It is often claimed, but always absent evidence, that the social media campaigns of ISIS and other terrorist groups are “successful” and “attracting supporters,” so much so that governments pressure social media companies to censor their content.

From the review:

A March 2015 report from the Brookings Institution estimates that there are at least 46,000 Twitter accounts run by supporters of the Islamic State (also known as ISIS or ISIL), a group of violent extremists that currently occupies parts of Syria and Iraq. This group has also taken to posting violent videos and recruiting materials on digital platforms, posing a dilemma for Silicon Valley companies — YouTube, Google, Twitter, Facebook and the like — as well as traditional news publishers. Facebook, for example, has grappled with whether or not to allow videos of beheadings to be viewed on its platform, and on March 16, 2015, again modified its “community standards.”

Although rising connectivity has helped make these problems more acute in the past few years, terrorism analysts have long been theorizing about an international media war and a globalized insurgency. The RAND Corporation has documented the unique radicalization process now taking place in the digital era. The dilemmas are personal for many organizations: ISIS has not only executed journalists but has even threatened employees of Twitter who seek to block accounts threatening violence.

For news media, there are hard questions about when exactly propaganda is itself newsworthy and when reporting on it serves a larger public purpose that justifies allowing access to a mass audience and amplifying a violent message, however well contextualized. This has led to questions about whether the slick production and deft use of media by ISIS is indeed just a form of “gaming” journalists. Reporting on terrorism in a globalized media environment has been the subject of much debate and research since the Sept. 11, 2001, attacks; the press has faced steady criticism for focusing too much on relatively rare violent acts while neglecting other aspects of the Muslim world, and for hyping threats and helping to sow fear.

I applaud the resources that the review assembles, including the RAND report which is cited for the proposition:

The RAND Corporation has documented the unique radicalization process now taking place in the digital era.

There’s only one problem with using that report as a source. See if you can spot it from the abstract:

This paper presents the results from exploratory primary research into the role of the internet in the radicalisation of 15 terrorists and extremists in the UK. In recent years, policymakers, practitioners and the academic community have begun to examine how the internet influences the process of radicalisation: how a person comes to support terrorism and forms of extremism associated with terrorism. This study advances the evidence base in the field by drawing on primary data from a variety of sources: evidence presented at trial, computer registries of convicted terrorists, interviews with convicted terrorists and extremists, as well as police senior investigative officers responsible for terrorist investigations. The 15 cases were identified by the research team together with the UK Association of Chief Police Officers (ACPO) and UK Counter Terrorism Units (CTU). The research team gathered primary data relating to five extremist cases (the individuals were part of the Channel programme, a UK government intervention aimed at individuals identified by the police as vulnerable to violent extremism), and ten terrorist cases (convicted in the UK), all of which were anonymised. Our research supports the suggestion that the internet may enhance opportunities to become radicalised and provide a greater opportunity than offline interactions to confirm existing beliefs. However, our evidence does not necessarily support the suggestion that the internet accelerates radicalisation or replaces the need for individuals to meet in person during their radicalisation process. Finally, we didn’t find any supporting evidence for the concept of self-radicalisation through the internet. (emphasis added)

Oops! “…didn’t find any supporting evidence for the concept of self-radicalisation through the internet.”

Unlike some experts and reporters, the RAND researchers themselves call the fifteen subjects a “convenience sample” and caution against drawing conclusions based on so small a sample. But that is fifteen (15) more than were spoken to in the Brookings Institution study which is now all the rage on ISIS and Twitter.

The authors go on to point out:

The consensus is that self-radicalisation is extremely rare, if possible at all (Bermingham et al., 2009; Change Institute, 2008; Precht, 2008; Saddiq, 2010; Stevens and Neumann, 2009; Yasin, 2011 (Rand report, page 20))

Of course, the authors did base their study on primary evidence and not what might play well on the evening news. That is the most likely explanation of the difference between their conclusions and those of governments and social media companies who are goading each other towards more censorship.

I don’t doubt for a minute that media, social and otherwise, plays some role in the political positions people adopt. The footage of air strikes taking the lives of women and children is as painful for some as the videos of family members being beheaded are for others.

The callous indifference of Western governments to human suffering, in pursuit of their goals and policies, is a more effective recruitment tool for terrorists than any ISIS could invent. Not to mention it has the advantage of being true.

Or to put it another way, the answer to the suffering of Palestinians, Syrians, Iraqis, etc., isn’t “Yes, but….” Conditioning a solution to human suffering on political ends is enough of an answer for everyone to choose sides.

March 16, 2015

ADS: The Next Generation Search Platform

Filed under: Astroinformatics,Bibliography,Searching — Patrick Durusau @ 6:28 pm

ADS: The Next Generation Search Platform by Alberto Accomazzi et al.

Abstract:

Four years after the last LISA meeting, the NASA Astrophysics Data System (ADS) finds itself in the middle of major changes to the infrastructure and contents of its database. In this paper we highlight a number of features of great importance to librarians and discuss the additional functionality that we are currently developing. Starting in 2011, the ADS started to systematically collect, parse and index full-text documents for all the major publications in Physics and Astronomy as well as many smaller Astronomy journals and arXiv e-prints, for a total of over 3.5 million papers. Our citation coverage has doubled since 2010 and now consists of over 70 million citations. We are normalizing the affiliation information in our records and, in collaboration with the CfA library and NASA, we have started collecting and linking funding sources with papers in our system. At the same time, we are undergoing major technology changes in the ADS platform which affect all aspects of the system and its operations. We have rolled out and are now enhancing a new high-performance search engine capable of performing full-text as well as metadata searches using an intuitive query language which supports fielded, unfielded and functional searches. We are currently able to index acknowledgments, affiliations, citations, funding sources, and to the extent that these metadata are available to us they are now searchable under our new platform. The ADS private library system is being enhanced to support reading groups, collaborative editing of lists of papers, tagging, and a variety of privacy settings when managing one’s paper collection. While this effort is still ongoing, some of its benefits are already available through the ADS Labs user interface and API at this http URL

Now for a word from the people who were using “big data” before it was a buzz word!

The focus here is on smaller data, publications, but it makes a good read.

I have been following the work on Solr proper and am interested in learning more about the extensions created to Solr by ADS.

Enjoy!

I first saw this in a tweet by Kirk Borne.

Bias? What Bias?

Filed under: Bias,Facebook,Social Media,Social Sciences,Twitter — Patrick Durusau @ 6:09 pm

Scientists Warn About Bias In The Facebook And Twitter Data Used In Millions Of Studies by Brid-Aine Parnell.

From the post:

Social media like Facebook and Twitter are far too biased to be used blindly by social science researchers, two computer scientists have warned.

Writing in today’s issue of Science, Carnegie Mellon’s Juergen Pfeffer and McGill’s Derek Ruths have warned that scientists are treating the wealth of data gathered by social networks as a goldmine of what people are thinking – but frequently they aren’t correcting for inherent biases in the dataset.

If folks didn’t already know that scientists were turning to social media for easy access to the pat statistics on thousands of people, they found out about it when Facebook allowed researchers to adjust users’ news feeds to manipulate their emotions.

Both Facebook and Twitter are such rich sources for heart pounding headlines that I’m shocked, shocked that anyone would suggest there is bias in the data! 😉

Not surprisingly, people participate in social media for reasons entirely of their own and quite unrelated to the interests or needs of researchers. Particular types of social media attract different demographics than other types. I’m not sure how you could “correct” for those biases, unless you wanted to collect better data for yourself.

Not that there are any bias-free data sets, but some biases are so obvious that they hardly warrant mentioning. Except that institutions like the Brookings Institution bump and grind on Twitter data until they can prove the significance of terrorist social media. Brookings knows better, but terrorism is a popular topic.

Not to make data carry all the blame, the test most often applied to data is:

Will this data produce a result that merits more funding and/or will please my supervisor?

I first saw this in a tweet by Persontyle.

Sharing and the IoT?

Filed under: IoT - Internet of Things,Privacy — Patrick Durusau @ 4:47 pm

Walter Adamson writes in Why the Internet of Things is about the data, not the ‘Thing’:

Wouldn’t it also be nice if you could learn the following about yourself and your lifestyle:

  • when you haven’t had a good enough sleep to undertake hard physical exertion without risking more fatigue;
  • when you seem to have an identifiable chronic bad sleep pattern that needs attention from an expert;
  • when your heart is healthy, and when it is needing attention;
  • your level of real fitness, and how your activity patterns are changing it for better or worse;
  • your real level of exertion, and which exercises/activities give you best fitness benefits;
  • When you are in danger of over-exercising and weakening your immune system;
  • how you compare to your peers and community and what you can learn from them?
Sharing the data shares the goodness

I’m sorry, I am old enough to have had any number of bad habits and poor lifestyle choices over the years. Deeply enjoyed all of them.

The very last thing I needed was my watch, TV, or car whining at me about my choices.

Adamson’s vision of the Internet of Things scenario is a nightmare where you may not live to be 100 but you will feel like it.

PS: You should cultivate good health habits, in moderation, but be mindful that no one says on their death bed: “I’m sorry I had such a good time.”

Computing the optimal road trip across the U.S.

Filed under: Mapping,Maps,Python — Patrick Durusau @ 4:31 pm

Computing the optimal road trip across the U.S. by Randal S. Olson.

From the webpage:

This notebook provides the methodology and code used in the blog post, Computing the optimal road trip across the U.S..

This is a nice surprise for a Monday!

The original post goes into the technical details and is quite good.

Grafter RDF Utensil

Filed under: Clojure,DSL,Graphs,Linked Data,RDF — Patrick Durusau @ 4:11 pm

Grafter RDF Utensil

From the homepage:

[Image: Grafter logo]

Easy Data Curation, Creation and Conversion

Grafter’s DSL makes it easy to transform tabular data from one tabular format to another. We also provide ways to translate tabular data into Linked Graph Data.

Data Formats

Grafter currently supports processing CSV and Excel files, with additional formats including Geo formats & shape files coming soon.

Separation of Concerns

Grafter’s design has a clear separation of concerns, disentangling tabular data processing from graph conversion and loading.

Incanter Interoperability

Grafter uses Incanter’s datasets, making it easy to incorporate advanced statistical processing features with your data transformation pipelines.

Stream Processing

Grafter transformations build on Clojure’s laziness, meaning you can process large amounts of data without worrying about memory.

Linked Data Templates

Describe the linked data you want to create using simple templates that look and feel like Turtle.

Even if Grafter wasn’t a DSL, written in Clojure, producing graph output, I would have been compelled to mention it because of the cool logo!
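Grafter itself is Clojure and I won't pretend to reproduce its DSL here, but the shape of the task (tabular rows in, triples out) is easy to sketch. A generic illustration in Python with made-up rows, not Grafter's API:

```python
import csv, io

# Turn each CSV row into a Turtle-style triple using a simple template,
# the same general shape of transformation Grafter's DSL automates.
csv_data = "name,population\nGlasgow,596550\nEdinburgh,495360\n"

template = (
    "<http://example.org/city/{name}> "
    "<http://example.org/ontology/population> \"{population}\" .\n"
)

turtle = ""
for row in csv.DictReader(io.StringIO(csv_data)):
    turtle += template.format(**row)

print(turtle)
```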

Enjoy!

I first saw this in a tweet by ClojureWerkz.

Max Kuhn’s Talk on Predictive Modeling

Filed under: Modeling,Prediction,Predictive Analytics — Patrick Durusau @ 3:53 pm

Max Kuhn’s Talk on Predictive Modeling

From the post:

Max Kuhn, Director of Nonclinical Statistics of Pfizer and also the author of Applied Predictive Modeling joined us on February 17, 2015 and shared his experience with Data Mining with R.

Max is a nonclinical statistician who has been applying predictive models in the diagnostic and pharmaceutical industries for over 15 years. He is the author and maintainer for a number of predictive modeling packages, including: caret, C50, Cubist and AppliedPredictiveModeling. He blogs about the practice of modeling on his website at http://appliedpredictivemodeling.com/blog

Excellent! (You may need to adjust the sound on the video.)

Support your local user group, particularly those generous enough to post videos and slides for their speakers. It makes a real difference to those unable to travel for one reason or another.

I first saw this in a tweet by NYC Data Science.

Flock: Hybrid Crowd-Machine Learning Classifiers

Filed under: Authoring Topic Maps,Classifier,Crowd Sourcing,Machine Learning,Topic Maps — Patrick Durusau @ 3:09 pm

Flock: Hybrid Crowd-Machine Learning Classifiers by Justin Cheng and Michael S. Bernstein.

Abstract:

We present hybrid crowd-machine learning classifiers: classification models that start with a written description of a learning goal, use the crowd to suggest predictive features and label data, and then weigh these features using machine learning to produce models that are accurate and use human-understandable features. These hybrid classifiers enable fast prototyping of machine learning models that can improve on both algorithm performance and human judgment, and accomplish tasks where automated feature extraction is not yet feasible. Flock, an interactive machine learning platform, instantiates this approach. To generate informative features, Flock asks the crowd to compare paired examples, an approach inspired by analogical encoding. The crowd’s efforts can be focused on specific subsets of the input space where machine-extracted features are not predictive, or instead used to partition the input space and improve algorithm performance in subregions of the space. An evaluation on six prediction tasks, ranging from detecting deception to differentiating impressionist artists, demonstrated that aggregating crowd features improves upon both asking the crowd for a direct prediction and off-the-shelf machine learning features by over 10%. Further, hybrid systems that use both crowd-nominated and machine-extracted features can outperform those that use either in isolation.

Let’s see, suggest predictive features (subject identifiers in the non-topic map technical sense) and label data (identify instances of a subject), sounds a lot easier than some of the tedium I have seen for authoring a topic map.

I particularly like the “inducing” of features versus relying on a crowd to suggest identifying features. I suspect that would work well in a topic map authoring context, sans the machine learning aspects.
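For a rough feel of what “crowd features next to machine features” means in practice, here is a sketch with invented data (scikit-learn assumed). It is not the Flock platform, just the hybrid idea:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Machine-extracted features and crowd-nominated features sit side by
# side in one model; all numbers below are made up.
machine_features = np.array([[0.2, 1.3], [0.9, 0.4], [0.1, 1.1], [0.8, 0.2]])
# Crowd answers to a question like "does the painting use visible brush strokes?"
crowd_features   = np.array([[1], [0], [1], [0]])
labels           = np.array([1, 0, 1, 0])

X = np.hstack([machine_features, crowd_features])
model = LogisticRegression().fit(X, labels)
print(model.predict(np.hstack([[[0.15, 1.2]], [[1]]])))  # expected: [1]
```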

This paper is being presented this week, CSCW 2015, so you aren’t too far behind. 😉

How would you structure an inducement mechanism for authoring a topic map?

Chemical databases: curation or integration by user-defined equivalence?

Filed under: Cheminformatics,Chemistry,InChl,Subject Identity — Patrick Durusau @ 2:52 pm

Chemical databases: curation or integration by user-defined equivalence? by Anne Hersey, Jon Chambers, Louisa Bellis, A. Patrícia Bento, Anna Gaulton, John P. Overington.

Abstract:

There is a wealth of valuable chemical information in publicly available databases for use by scientists undertaking drug discovery. However finite curation resource, limitations of chemical structure software and differences in individual database applications mean that exact chemical structure equivalence between databases is unlikely to ever be a reality. The ability to identify compound equivalence has been made significantly easier by the use of the International Chemical Identifier (InChI), a non-proprietary line-notation for describing a chemical structure. More importantly, advances in methods to identify compounds that are the same at various levels of similarity, such as those containing the same parent component or having the same connectivity, are now enabling related compounds to be linked between databases where the structure matches are not exact.

The authors identify a number of reasons why databases of chemical identifications have different structures recorded for the same chemicals. One problem is that there is no authoritative source for chemical structures so upon publication, authors publish those aspects most relevant to their interest. Or publish images and not machine readable representations of a chemical. To say nothing of the usual antics with simple names and their confusions. But there are software limitations, business rules and other sources of a multiplicity of chemical structures.

Suffice it to say that the authors make a strong case for why there are multiple structures for any given chemical now and why that is going to continue.

The authors openly ask whether it is time to ask users for their assistance in mapping this diversity of structures:

Is it now time to accept that however diligent database providers are, there will always be differences in structure representations and indeed some errors in the structures that cannot be fixed with a realistic level of resource? Should we therefore turn our attention to encouraging the use and development of tools that enable the mapping together of related compounds rather than concentrate our efforts on ever more curation?

You know my answer to that question.

What’s yours?
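As a string-level illustration of “equivalence at a chosen level,” here is a sketch that compares InChIs layer by layer. The two example InChIs (acetic acid and its anion) are written from memory, so treat them as assumptions:

```python
# After the "InChI=1S" header, layers are separated by "/": molecular
# formula, connectivity (c...), hydrogens (h...), charge (p...), and so on.
acetic_acid = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
acetate     = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1"

def layers(inchi):
    return inchi.split("/")[1:]

def same_up_to(inchi_a, inchi_b, n_layers):
    # Compare only the first n layers, ignoring later ones.
    return layers(inchi_a)[:n_layers] == layers(inchi_b)[:n_layers]

print(same_up_to(acetic_acid, acetate, 3))   # True: same skeleton and hydrogens
print(acetic_acid == acetate)                # False: exact string match fails
```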

I first saw this in a tweet by John P. Overington.

A Compendium of Clean Graphs in R

Filed under: R,Visualization — Patrick Durusau @ 9:17 am

A Compendium of Clean Graphs in R by Eric-Jan Wagenmakers and Quentin Gronau.

From the post:

Every data analyst knows that a good graph is worth a thousand words, and perhaps a hundred tables. But how should one create a good, clean graph? In R, this task is anything but easy. Many users find it almost impossible to resist the siren song of adding grid lines, including grey backgrounds, using elaborate color schemes, and applying default font sizes that makes the text much too small in relation to the graphical elements. As a result, many R graphs are an aesthetic disaster; they are difficult to parse and unfit for publication.

In contrast, a good graph obeys the golden rule: “create graphs unto others as you want them to create graphs unto you”. This means that a good graph is a simple graph, in the Einsteinian sense that a graph should be made as simple as possible, but not simpler. A good graph communicates the main message effectively, without fuss and distraction. In addition, a good graph balances its graphical and textual elements – large symbols demand an increase in line width, and these together require an increase in font size.

In order to reduce the time needed to find relevant R code, we have constructed a compendium of clean graphs in R. This compendium, available at http://shinyapps.org/apps/RGraphCompendium/index.html, can also be used for teaching or as inspiration for improving one’s own graphs. In addition, the compendium provides a selective overview of the kind of graphs that researchers often use; the graphs cover a range of statistical scenarios and feature contributions of different data analysts. We do not wish to presume the graphs in the compendium are in any way perfect; some are better than others, and overall much remains to be improved. The compendium is undergoing continual refinement. Nevertheless, we hope the graphs are useful in their current state.

This rocks! A tribute to the authors, R and graphics!

A couple samples to whet your appetite:

[Images: two sample graphs from the compendium]

BTW, the images in the compendium have Show R-Code buttons!

Enjoy!

DIY Web Server

Filed under: Python,Software,WWW — Patrick Durusau @ 8:01 am

Let’s Build A Web Server. Part 1. by Ruslan Spivak.

From the post:

Out for a walk one day, a woman came across a construction site and saw three men working. She asked the first man, “What are you doing?” Annoyed by the question, the first man barked, “Can’t you see that I’m laying bricks?” Not satisfied with the answer, she asked the second man what he was doing. The second man answered, “I’m building a brick wall.” Then, turning his attention to the first man, he said, “Hey, you just passed the end of the wall. You need to take off that last brick.” Again not satisfied with the answer, she asked the third man what he was doing. And the man said to her while looking up in the sky, “I am building the biggest cathedral this world has ever known.” While he was standing there and looking up in the sky the other two men started arguing about the errant brick. The man turned to the first two men and said, “Hey guys, don’t worry about that brick. It’s an inside wall, it will get plastered over and no one will ever see that brick. Just move on to another layer.”1

The moral of the story is that when you know the whole system and understand how different pieces fit together (bricks, walls, cathedral), you can identify and fix problems faster (errant brick).

What does it have to do with creating your own Web server from scratch?

I believe to become a better developer you MUST get a better understanding of the underlying software systems you use on a daily basis and that includes programming languages, compilers and interpreters, databases and operating systems, web servers and web frameworks. And, to get a better and deeper understanding of those systems you MUST re-build them from scratch, brick by brick, wall by wall. (emphasis in original)

You probably don’t want to try this with an office suite package but for a basic web server this could be fun!
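To give a flavour of where Part 1 is headed, here is a minimal socket-level HTTP server in Python. It is my sketch of the idea, not Ruslan's code:

```python
import socket

# Listen on a TCP socket, read one request, send a fixed HTTP response.
HOST, PORT = "", 8888

listen_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listen_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listen_socket.bind((HOST, PORT))
listen_socket.listen(1)
print("Serving HTTP on port %s ..." % PORT)

while True:
    client_connection, client_address = listen_socket.accept()
    request = client_connection.recv(1024)
    print(request.decode("utf-8", errors="replace"))

    http_response = b"HTTP/1.1 200 OK\r\n\r\nHello, World!\n"
    client_connection.sendall(http_response)
    client_connection.close()
```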

More installments to follow.

Enjoy!

March 15, 2015

Researchers just built a free, open-source version of Siri

Filed under: Artificial Intelligence,Computer Science,Machine Learning — Patrick Durusau @ 8:05 pm

Researchers just built a free, open-source version of Siri by Jordan Novet.

From the post:

Major tech companies like Apple and Microsoft have been able to provide millions of people with personal digital assistants on mobile devices, allowing people to do things like set alarms or get answers to questions simply by speaking. Now, other companies can implement their own versions, using new open-source software called Sirius — an allusion, of course, to Apple’s Siri.

Today researchers from the University of Michigan are giving presentations on Sirius at the International Conference on Architectural Support for Programming Languages and Operating Systems in Turkey. Meanwhile, Sirius also made an appearance on Product Hunt this morning.

“Sirius … implements the core functionalities of an IPA (intelligent personal assistant) such as speech recognition, image matching, natural language processing and a question-and-answer system,” the researchers wrote in a new academic paper documenting their work. The system accepts questions and commands from a mobile device, processes information on servers, and provides audible responses on the mobile device.

Read the full academic paper (PDF) to learn more about Sirius. Find Sirius on GitHub here.

Opens up the possibility of an IPA (intelligent personal assistant) that has custom intelligence. Are your day-to-day tasks Apple cookie-cutter tasks or do they go beyond that?

The security implications are interesting as well. What if your IPA “reads” on a news stream that you have been arrested? Or if you fail to check in within some time window?

I first saw this in a tweet by Data Geek.

Teaching and Learning Data Visualization: Ideas and Assignments

Filed under: Graphics,Statistics,Visualization — Patrick Durusau @ 7:32 pm

Teaching and Learning Data Visualization: Ideas and Assignments by Deborah Nolan, Jamis Perrett.

Abstract:

This article discusses how to make statistical graphics a more prominent element of the undergraduate statistics curricula. The focus is on several different types of assignments that exemplify how to incorporate graphics into a course in a pedagogically meaningful way. These assignments include having students deconstruct and reconstruct plots, copy masterful graphs, create one-minute visual revelations, convert tables into `pictures’, and develop interactive visualizations with, e.g., the virtual earth as a plotting canvas. In addition to describing the goals and details of each assignment, we also discuss the broader topic of graphics and key concepts that we think warrant inclusion in the statistics curricula. We advocate that more attention needs to be paid to this fundamental field of statistics at all levels, from introductory undergraduate through graduate level courses. With the rapid rise of tools to visualize data, e.g., Google trends, GapMinder, ManyEyes, and Tableau, and the increased use of graphics in the media, understanding the principles of good statistical graphics, and having the ability to create informative visualizations is an ever more important aspect of statistics education.

You will find a number of ideas in this paper to use in teaching and learning visualization.

I understand that visualizing a table can, with the proper techniques, display relationships that are otherwise difficult to notice.

On the other hand, due to our limited abilities to distinguish colors, graphs can conceal information that would otherwise be apparent from a table.

Not an objection to visualizing tables but a caution that details can get lost in visualization as well as being highlighted for the viewer.

Distilling the Knowledge in a Neural Network

Filed under: Machine Learning,Neural Networks — Patrick Durusau @ 7:19 pm

Distilling the Knowledge in a Neural Network by Geoffrey Hinton, Oriol Vinyals, Jeff Dean.

Abstract:

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

The technique described appears very promising but I suspect the paper’s importance lies in another discovery by its authors:

Many insects have a larval form that is optimized for extracting energy and nutrients from the environment and a completely different adult form that is optimized for the very different requirements of traveling and reproduction. In large-scale machine learning, we typically use very similar models for the training stage and the deployment stage despite their very different requirements: For tasks like speech and object recognition, training must extract structure from very large, highly redundant datasets but it does not need to operate in real time and it can use a huge amount of computation. Deployment to a large number of users, however, has much more stringent requirements on latency and computational resources. The analogy with insects suggests that we should be willing to train very cumbersome models if that makes it easier to extract structure from the data.

The sparse results of machine learning haven’t been due to the difficulty of machine learning but to our limited conceptions of it.

Consider the recent rush of papers and promising results with deep learning. Compare that to years of labor spent on trying to specify rules and logic for machine reasoning. The verdict isn’t in, yet, but I suspect that formal logic is too sparse and pinched to support robust machine reasoning.

Like Google’s Pinball Wizard with Atari games: so long as it wins, does its method matter? What if it isn’t expressible in first-order logic?

It will be very ironic after the years of debate over “logical” entities if computers must become less logical and more like us in order to advance machine reasoning projects.
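For the central mechanism, here is a tiny sketch (NumPy, invented logits) of the temperature-softened softmax that produces the “soft targets” a student model is trained on:

```python
import numpy as np

# Softening a teacher's logits with a temperature lets the student learn
# from the relative probabilities of the "wrong" classes as well.
def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [9.0, 4.0, 1.0]      # made-up scores for three classes
print(softmax(teacher_logits, temperature=1.0))   # nearly one-hot
print(softmax(teacher_logits, temperature=5.0))   # soft targets for the student
```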

I first saw this in a tweet by Andrew Beam.

Artificial Neurons and Single-Layer Neural Networks…

Artificial Neurons and Single-Layer Neural Networks – How Machine Learning Algorithms Work Part 1 by Sebastian Raschka.

From the post:

This article offers a brief glimpse of the history and basic concepts of machine learning. We will take a look at the first algorithmically described neural network and the gradient descent algorithm in context of adaptive linear neurons, which will not only introduce the principles of machine learning but also serve as the basis for modern multilayer neural networks in future articles.

Machine learning is one of the hottest and most exciting fields in the modern age of technology. Thanks to machine learning, we enjoy robust email spam filters, convenient text and voice recognition, reliable web search engines, challenging chess players, and, hopefully soon, safe and efficient self-driving cars.

Without any doubt, machine learning has become a big and popular field, and sometimes it may be challenging to see the (random) forest for the (decision) trees. Thus, I thought that it might be worthwhile to explore different machine learning algorithms in more detail by not only discussing the theory but also by implementing them step by step.
To briefly summarize what machine learning is all about: “[Machine learning is the] field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959). Machine learning is about the development and use of algorithms that can recognize patterns in data in order to make decisions based on statistics, probability theory, combinatorics, and optimization.

The first article in this series will introduce perceptrons and the adaline (ADAptive LINear NEuron), which fall into the category of single-layer neural networks. The perceptron is not only the first algorithmically described learning algorithm [1], but it is also very intuitive, easy to implement, and a good entry point to the (re-discovered) modern state-of-the-art machine learning algorithms: Artificial neural networks (or “deep learning” if you like). As we will see later, the adaline is a consequent improvement of the perceptron algorithm and offers a good opportunity to learn about a popular optimization algorithm in machine learning: gradient descent.

Starting point for what appears to be a great introduction to neural networks.
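If you want something to run while you read, here is a bare-bones perceptron sketch on a made-up AND-style problem. It is not Sebastian's code:

```python
import numpy as np

# Classic perceptron learning rule: nudge the weights whenever the
# thresholded prediction disagrees with the target label.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(X.shape[1])
b = 0.0
eta = 0.1                              # learning rate

for epoch in range(10):
    for xi, target in zip(X, y):
        prediction = int(np.dot(xi, w) + b > 0)
        update = eta * (target - prediction)
        w += update * xi
        b += update

print(w, b)
print([int(np.dot(xi, w) + b > 0) for xi in X])   # -> [0, 0, 0, 1]
```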

While you are at Sebastian’s blog, it is very much worthwhile to look around. You will be pleasantly surprised.

March 14, 2015

The Data Engineering Ecosystem: An Interactive Map

Filed under: BigData,Data Pipelines,Visualization — Patrick Durusau @ 6:58 pm

The Data Engineering Ecosystem: An Interactive Map by David Drummond and John Joo.

From the post:

Companies, non-profit organizations, and governments are all starting to realize the huge value that data can provide to customers, decision makers, and concerned citizens. What is often neglected is the amount of engineering required to make that data accessible. Simply using SQL is no longer an option for large, unstructured, or real-time data. Building a system that makes data usable becomes a monumental challenge for data engineers.

There is no plug and play solution that solves every use case. A data pipeline meant for serving ads will look very different from a data pipeline meant for retail analytics. Since there are unlimited permutations of open-source technologies that can be cobbled together, it can be overwhelming when you first encounter them. What do all these tools do and how do they fit into the ecosystem?

Insight Data Engineering Fellows face these same questions when they begin working on their data pipelines. Fortunately, after several iterations of the Insight Data Engineering Program, we have developed this framework for visualizing a typical pipeline and the various data engineering tools. Along with the framework, we have included a set of tools for each category in the interactive map.

This looks quite handy if you are studying for a certification test and need to know the components and a brief bit about each one.

For engineering purposes, it would be even better if you could connect your pieces together and then map the data flows through the pipelines. That is, where did the data previously held in table X go during each step, and what operations were performed on it? Not to mention being able to track an individual datum through the process.

Is there a tool that I haven’t seen or overlooked that allows that type of insight into a data pipeline? With subject identities of course for the various subjects along the way.
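The bookkeeping that question asks about is simple enough to sketch. Table names here are invented and this is not a reference to any existing tool:

```python
# Record, for every transformation, what it read, what it wrote and
# what it did, then follow a table forward through the recorded steps.
lineage = []

def step(name, inputs, outputs, operation):
    lineage.append({"step": name, "inputs": inputs,
                    "outputs": outputs, "operation": operation})

step("ingest", ["table_x"],                ["raw_events"],   "copy")
step("clean",  ["raw_events"],             ["clean_events"], "drop null user_id")
step("join",   ["clean_events", "users"],  ["sessions"],     "join on user_id")

def where_did_it_go(table):
    return [s for s in lineage if table in s["inputs"]]

for s in where_did_it_go("raw_events"):
    print(s["step"], "->", s["outputs"], "(", s["operation"], ")")
```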
