Unsurprisingly, on the Datablog we often write articles about data when we have data. But some topics, like pornography, aren’t conducive to statistical analysis, no matter how important many claim they are.
Despite these challenges, a report released today has sought to assess children and young people’s exposure to pornography and understand its impact. Led by Middlesex University and commissioned by the Children’s Commissioner, this was a rapid evidence assessment – completed in the space of just three months as part of a much larger ongoing inquiry into child sexual exploitation.
The report found that a “significant proportion of children and young people are exposed to or access pornography”, and that this is linked to “unrealistic attitudes about sex” as well as “less progressive gender role attitudes (e.g. male dominance and female submission)”.
Though the report makes these and other important conclusions, you’ll notice that numbers are conspicuously absent in its language. One reason is that its findings were not based on primary research but a literature review that began with 41,000 identified sources and concluded by using 276 of those that were deemed relevant.
The post doesn’t even mention that we will know pornography when we see it.
Perhaps that is part of the problem of measurement.
Rather than processing a trillion triples, the next big data measure should be indexing all the pornography on the WWW over some time period.
This is the first in a series of blog posts that discuss the usage of a graph database like Neo4j to store, compute and visualize a variety of software metrics and other types of software analytics (method call hierarchies, transitive closure, critical path analysis, volatility & code quality). Follow-up posts by different contributors will be linked from this one.
Everyone who works in software development comes across software metrics at some point.
Whether out of curiosity about the quality or complexity of the code we’ve written, or out of a real interest in improving quality and reducing technical debt, there are many reasons.
In general there are many ways of approaching this topic, from just gathering and rendering statistics in diagrams to visualizing the structure of programs and systems.
There are a number of commercial and free tools available that compute software metrics and help expose the current trend in your project’s development.
Software metrics can cover different areas. Computing cyclomatic complexity, analysing dependencies or call traces is probably easy, using static analysis to find smaller or larger issues is more involved and detecting code smells can be an interesting challenge in AST parsing.
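As a rough illustration of the "probably easy" end of that range, here is a minimal sketch of an approximate cyclomatic complexity count using Python's standard ast module. The choice of which node types count as decision points, and the names cyclomatic_complexity and SAMPLE, are assumptions made for this example; it is not the approach of the quoted post, which uses Neo4j.

```python
import ast

# Node types treated as decision points (an illustrative choice, not a standard).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 plus the number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" has three operands and adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical sample code to measure.
SAMPLE = """
def classify(n):
    if n < 0 and n % 2:
        return "negative odd"
    for i in range(n):
        if i == 3:
            return "has three"
    return "other"
"""

print(cyclomatic_complexity(SAMPLE))  # 1 base + if + and + for + if = 5
```

Dedicated metric tools differ in exactly which constructs they count, so numbers from this sketch will not necessarily match theirs.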
Interesting work on using graph databases (here Neo4j) for software analysis.
Be sure to see the resources listed at the end of the post.
Continuum Analytics, the premier provider of Python-based data analytics solutions and services, today announced the release of Wakari version 1.0, an easy-to-use, cloud-based, collaborative Python environment for analyzing, exploring and visualizing large data sets.
Hosted on Amazon’s Elastic Compute Cloud (EC2), Wakari gives users the ability to share analyses and results via IPython notebook, visualize with Matplotlib, easily switch between multiple versions of Python and its scientific libraries, and quickly collaborate on analyses without having to download data locally to their laptops or workstations. Users can share code and results as simple web URLs, from which other users can easily create their own copies to modify and explore.
Previously in beta, the version 1.0 release of Wakari boasts a number of new features, including:
Premium access to SSH, ipcluster configuration, and the full range of Amazon compute nodes and clusters via a drop-down menu
Enhanced IPython notebook support, most notably an IPython notebook gallery and an improved UI for sharing
Bundles for simplified sharing of files, folders, and Python library dependencies
Multidisciplinary integrated research requires the ability to couple the diverse sets of data obtained from a range of complex experiments and computer simulations. Integrating data requires semantically rich information. In this paper an end-to-end use of semantically rich data in computational chemistry is demonstrated utilizing the Chemical Markup Language (CML) framework. Semantically rich data is generated by the NWChem computational chemistry software with the FoX library and utilized by the Avogadro molecular editor for analysis and visualization.
Results
The NWChem computational chemistry software has been modified and coupled to the FoX library to write CML compliant XML data files. The FoX library was expanded to represent the lexical input files and molecular orbitals used by the computational chemistry software. Draft dictionary entries and a format for molecular orbitals within CML CompChem were developed. The Avogadro application was extended to read in CML data, and display molecular geometry and electronic structure in the GUI allowing for an end-to-end solution where Avogadro can create input structures, generate input files, NWChem can run the calculation and Avogadro can then read in and analyse the CML output produced. The developments outlined in this paper will be made available in future releases of NWChem, FoX, and Avogadro.
Conclusions
The production of CML compliant XML files for computational chemistry software such as NWChem can be accomplished relatively easily using the FoX library. The CML data can be read in by a newly developed reader in Avogadro and analysed or visualized in various ways. A community-based effort is needed to further develop the CML CompChem convention and dictionary. This will enable the long-term goal of allowing a researcher to run simple “Google-style” searches of chemistry and physics and have the results of computational calculations returned in a comprehensible form alongside articles from the published literature.
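To make the idea of reading CML output a little more concrete, here is a minimal sketch that pulls atoms out of a small CML fragment with Python's standard library. The fragment is hand-written for illustration, with approximate water coordinates, and is not NWChem or FoX output. The function name read_atoms is hypothetical, and the sketch assumes the usual CML namespace and the atom attributes elementType, x3, y3 and z3.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace; check it against the files you actually receive.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-written fragment for illustration only (approximate water geometry).
SAMPLE_CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
</molecule>"""


def read_atoms(cml_text):
    """Return a list of (element, x, y, z) tuples from a CML string."""
    root = ET.fromstring(cml_text)
    atoms = []
    for atom in root.findall(".//cml:atom", CML_NS):
        atoms.append((
            atom.get("elementType"),
            float(atom.get("x3")),
            float(atom.get("y3")),
            float(atom.get("z3")),
        ))
    return atoms


for element, x, y, z in read_atoms(SAMPLE_CML):
    print(element, x, y, z)
```

Real CompChem output carries far more than geometry (the paper mentions molecular orbitals and input parameters), so this only shows the first step of a reader.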
Aside from its obvious importance for cheminformatics, I think there is another lesson in this article.
Integration of data required “…semantically rich information…,” but just as importantly, integration was not a goal in and of itself.
Integration was only part of a workflow that had other goals.
No doubt some topic maps are useful as end products of integrated data, but what of cases where integration is part of a workflow?
Think of the non-reusable data integration mappings that are offered by many enterprise integration packages.
A new feature of Lucene 4 – pluggable codecs – allows for the modification of Lucene’s underlying storage engine. Working with codecs and examining their output yields fascinating insights into how exactly Lucene’s search works in its most fundamental form.
The centerpiece of a Lucene codec is its postings format. Postings are a commonly thrown around word in the Lucene space. A Postings format is the representation of the inverted search index – the core data structure used to look up documents that contain a term. I think nothing really captures the logical look-and-feel of Lucene’s postings better than Mike McCandless’s SimpleTextPostingsFormat. SimpleText is a text-based representation of postings created for educational purposes. I’ve indexed a few documents in Lucene using SimpleText to demonstrate how postings are structured to allow for fast search:
A first step towards moving beyond being a search engine result consumer.
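The SimpleText output itself is in the original post. As a rough stand-in for the idea of postings, here is a minimal sketch of an inverted index in plain Python: each term maps to the documents and positions where it occurs. This is not Lucene's on-disk format, and the names build_postings, search_all and DOCS are hypothetical.

```python
from collections import defaultdict


def build_postings(docs):
    """Map each term to a sorted list of (doc_id, [positions]) entries."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return {term: sorted(entries.items()) for term, entries in index.items()}


def search_all(postings, *terms):
    """Return ids of documents containing every term (a simple AND query)."""
    doc_sets = [{doc_id for doc_id, _ in postings.get(t, [])} for t in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


# Hypothetical documents to index.
DOCS = [
    "lucene stores postings",
    "postings map terms to documents",
    "lucene is a search library",
]

postings = build_postings(DOCS)
print(postings["postings"])                        # [(0, [2]), (1, [0])]
print(search_all(postings, "lucene", "postings"))  # [0]
```

The lookup never scans document text: it goes straight from term to document ids, which is what makes the structure fast for search.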
Postgres has long been known as a stable database product that reliably stores your data. However, in recent years it has picked up many features, allowing it to become a much sexier database.
This video covers a whirlwind of Postgres features, which highlight why you should consider it for your next project. These include:
Datatypes
Using other languages within Postgres
Extensions including NoSQL inside your SQL database
Accessing your non-Postgres data (Redis, Oracle, MySQL) from within Postgres
Window Functions
Craig Kerstiens does a fast-paced overview of Postgres.
Welcome to Fact Tank, a new, real-time platform from the Pew Research Center, dedicated to finding news in the numbers.
Fact Tank will build on the Pew Research Center’s unique brand of data journalism. For years, our teams of writers and social scientists have combined rigorous research with high-quality storytelling to provide important information on issues and trends shaping the nation and the world.
Fact Tank will allow us to provide that sort of information at a faster pace, in an attempt to provide you with the information you need when you need it. We’ll fill the gap between major surveys and reports with shorter pieces using our data to give context to the news of the day. And we’ll scour other data sources, bringing you important insights on politics, religion, technology, media, economics and social trends.
An interesting source of additional data on current news stories.
Solr 4 has a subset of features that allow it to be run as a distributed fault-tolerant cluster, referred to as “SolrCloud”. Installing and configuring Solr on a multi-node cluster can seem daunting when you’re a developer who just wants to give the latest release a try. The wiki page is long and complex, and configuring nodes manually is laborious and error-prone. And while your OS has ZooKeeper/Solr packages, they are probably outdated. But it doesn’t have to be a lot of work: in this post I will show you how to deploy and test a Solr 4 cluster using just a few commands, using mechanisms you can easily adjust for your own deployments.
I am using a cluster consisting of virtual machines running Ubuntu 12.04 64bit and I am controlling them from my MacBook Pro. The Solr configuration will mimic the Two shard cluster with shard replicas and zookeeper ensemble example from the wiki.
You can run this on AWS EC2, but some special considerations apply, see the footnote.
We’ll use Fabric, a light-weight deployment tool that is basically a Python library to easily execute commands on remote nodes over ssh. Compared to Chef/Puppet it is simpler to learn and use, and because it’s an imperative approach it makes sequential orchestration of dependencies more explicit. Most importantly, it does not require a separate server or separate node-side software installation.
DISCLAIMER: these instructions and associated scripts are released under the Apache License; use at your own risk.
I strongly recommend you use disposable virtual machines to experiment with.
Something to get you excited about the upcoming weekend!
The second edition of MongoDB: The Definitive Guide is now available from O’Reilly! It covers both developing with and administering MongoDB. The book is language-agnostic: almost all of the examples are in JavaScript.
Looking forward to enjoying the second edition as much as the first!
Although, I am not really sure that always using JavaScript means you are “language-agnostic.”
The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.
After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.
If Bayesian inference is the destination, then mathematical analysis is a particular path towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, whereas the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.
Bayesian Methods for Hackers is designed as an introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course as an introductory book, we can only leave it at that: an introductory book. For the mathematically trained, they may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical-background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.
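To give a flavour of the computational route, here is a minimal sketch that estimates the bias of a coin by brute-force grid approximation instead of deriving the posterior by hand. It is a deliberately simple stand-in, not the probabilistic programming library the book uses. The uniform prior, the data (7 heads in 10 flips) and the function names are all assumptions made for this example.

```python
def grid_posterior(heads, flips, grid_size=1001):
    """Posterior over a coin's bias p on an evenly spaced grid, flat prior."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    tails = flips - heads
    # Unnormalised posterior = binomial likelihood x uniform prior.
    weights = [p ** heads * (1 - p) ** tails for p in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]


def posterior_mean(grid, probs):
    """Expected value of p under the gridded posterior."""
    return sum(p * w for p, w in zip(grid, probs))


# Hypothetical data: 7 heads in 10 flips.
grid, probs = grid_posterior(heads=7, flips=10)
print(round(posterior_mean(grid, probs), 3))
```

For this toy problem the exact answer is known (a Beta posterior with mean 8/12), so the numerical result can be checked against it. The point of the computational path is that the same brute-force recipe still works when no such closed form exists, although grids stop scaling after a handful of parameters, which is where sampling methods come in.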
Not yet complete but what is there you will find very useful.
This 2nd edition has more than 200 pages of pure data science, far more than the first edition. This new version of our very popular book will soon be available for download: we will make an announcement when it is officially published.
Sixty-two (62) new contributions split between data science recipes, data science discussions, data science resources.
If you can’t wait for the ebook, links to the contributions are given at Vincent’s post.
Most government statistics are mapped according to official geographical units. Whilst such units are essential for data analysis and making decisions about, for example, government spending, they are hard for many people to relate to and they don’t particularly stand out on a map. This is why I tried a new method back in July 2012 to show life expectancy statistics in a fresh light by mapping them on to London Tube stations. The resulting “Lives on the Line” map has been really popular with many people surprised at the extent of the variations in the data across London and also grateful for the way that it makes seemingly abstract statistics more easily accessible. To find out how I did it (and read some of the feedback) you can see here.
James gives a number of examples of the use of transportation lines making “abstract statistics more easily accessible.”
Worth a close look if you are interested in making dry municipal statistics part of the basis for social change.
The Rendition Project, a collaboration between UK academics and the NGO Reprieve, has produced one of the most detailed and illuminating research projects shedding light on the CIA’s extraordinary rendition project to date. Here’s how to use it.
Truly remarkable project to date, but could be even more successful with your assistance.
Not likely that any of the principals will wind up in the dock at the Hague.
On the other hand, exposing their crimes may deter others from similar adventures.
A few weeks ago, we integrated the full text of federal bills and regulations into our alert system, Scout. Now, if you visit CISPA or a fascinating cotton rule, you’ll see the original document – nicely formatted, but also well-integrated into Scout’s layout. There are a lot of good reasons to integrate the text this way: we want you to see why we alerted you to a document without having to jump off-site, and without clunky iframes.
As importantly, we wanted to do this in a way that would be easily reusable by other projects and people. So we built a tool called us-documents that makes it possible for anyone to do this with federal bills and regulations. It’s available as a Ruby gem, and comes with a command line tool so that you can use it with Python, Node, or any other language. It lives inside the unitedstates project at unitedstates/documents, and is entirely public domain.
This could prove to be real interesting. Both as a matter of content and a technique to replicate elsewhere.
The DCAT Application profile for data portals in Europe (DCAT-AP) is a specification based on the Data Catalogue vocabulary (DCAT) for describing public sector datasets in Europe. Its basic use case is to enable a cross-data portal search for data sets and make public sector data better searchable across borders and sectors. This can be achieved by the exchange of descriptions of data sets among data portals.
This final draft is open for public review until 10 June 2013. Members of the public are invited to download the specification and post their comments directly on this page. To be able to do so you need to be registered and logged in.
If you are interested in integration of data from European data portals, it is worth the time to register, etc.
Not all the data you are going to need to integrate a data set but at least a start in the right direction.
Farming communities in Africa and South Asia are becoming increasingly vulnerable to shock as the effects of climate change become a reality. This increased vulnerability, however, comes at a time when improved technology makes critical information more accessible than ever before. aWhere Weather, an online platform offering free weather data for locations in Western, Eastern and Southern Africa and South Asia provides instant and interactive access to highly localized weather data, instrumental for improved decision making and providing greater context in shaping policies relating to agricultural development and global health.
Weather Data in 9km Grid Cells
Weather data is collected at meteorological stations around the world and interpolated to create accurate data in detailed 9km grids. Within each cell, users can access historical, daily-observed and 8 days of daily forecasted ‘localized’ weather data for the following variables:
Precipitation
Minimum and Maximum Temperature
Minimum and Maximum Relative Humidity
Solar Radiation
Maximum and Morning Wind Speed
Growing degree days (dynamically calculated for your base and cap temperature)
These data prove essential for risk adaption efforts, food security interventions, climate-smart decision making, and agricultural or environmental research activities.
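The announcement does not say how growing degree days are computed. As a sketch of one common convention, the daily minimum and maximum temperatures are clamped to the range between the base and cap temperatures, averaged, and the base is subtracted. The formula choice, the function name and the sample temperatures below are assumptions, not aWhere's documented method.

```python
def growing_degree_days(t_min, t_max, base, cap):
    """Daily growing degree days with temperatures clamped to [base, cap]."""
    clamped_min = min(max(t_min, base), cap)
    clamped_max = min(max(t_max, base), cap)
    return (clamped_min + clamped_max) / 2 - base


# Hypothetical week of (min, max) temperatures in degrees Celsius.
WEEK = [(8, 32), (12, 28), (2, 9), (15, 25), (11, 31), (10, 20), (14, 35)]

daily = [growing_degree_days(lo, hi, base=10, cap=30) for lo, hi in WEEK]
print(daily)
print(sum(daily))  # accumulated growing degree days for the week
```

A cold day such as (2, 9) contributes zero rather than a negative value, and a very hot day is capped, which is why the base and cap temperatures need to match the crop being modelled.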
At least as a public observer, I could not determine how much “interpolation” is going into the weather data. That would have a major impact on the risk of accepting the data provided at face value.
I suspect it varies from little interpolation at all in heavily instrumented areas to quite a bit in areas with sparser readings. How much is unclear.
It may be that the amount of interpolation in the data is a factor of whether you use the free version or some upgraded commercial version.
Still, an interesting data source to combine with others, if you are mindful of the risks.
The schedule for CS188.2x hasn’t been announced, yet.
In the meantime, you can register for CS188.1x and peruse the videos, exercises, etc. while you wait for the second part of the course.
From the description:
CS188.1x is a new online adaptation of the first half of UC Berkeley’s CS188: Introduction to Artificial Intelligence. The on-campus version of this upper division computer science course draws about 600 Berkeley students each year.
Artificial intelligence is already all around you, from web search to video games. AI methods plan your driving directions, filter your spam, and focus your cameras on faces. AI lets you guide your phone with your voice and read foreign newspapers in English. Beyond today’s applications, AI is at the core of many new technologies that will shape our future. From self-driving cars to household robots, advancements in AI help transform science fiction into real systems.
CS188.1x focuses on Behavior from Computation. It will introduce the basic ideas and techniques underlying the design of intelligent computer systems. A specific emphasis will be on the statistical and decision–theoretic modeling paradigm. By the end of this course, you will have built autonomous agents that efficiently make decisions in stochastic and in adversarial settings. CS188.2x (to follow CS188.1x, precise date to be determined) will cover Reasoning and Learning. With this additional machinery your agents will be able to draw inferences in uncertain environments and optimize actions for arbitrary reward structures. Your machine learning algorithms will classify handwritten digits and photographs. The techniques you learn in CS188x apply to a wide variety of artificial intelligence problems and will serve as the foundation for further study in any application area you choose to pursue.
Lucene’s facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I’ll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.
To understand these features, and why they are important, we first need a little background. Lucene’s facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.
At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count.
This is in contrast to purely dynamic faceting implementations like ElasticSearch’s and Solr’s, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.
However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr’s UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.
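The index-time and search-time split described above can be sketched in a few lines of plain Python. This is a toy model of the idea (labels mapped to integer ids up front, then counted in an array per matched document), not Lucene's facet API. The class and function names and the sample labels are hypothetical.

```python
class Taxonomy:
    """Assigns a stable integer id to each facet label (index-time step)."""

    def __init__(self):
        self.label_to_id = {}
        self.id_to_label = []

    def ordinal(self, label):
        if label not in self.label_to_id:
            self.label_to_id[label] = len(self.id_to_label)
            self.id_to_label.append(label)
        return self.label_to_id[label]


def index_documents(docs, taxonomy):
    """Store, per document, the ids of its labels and of every ancestor."""
    encoded = []
    for labels in docs:
        ids = set()
        for label in labels:
            parts = label.split("/")
            for depth in range(1, len(parts) + 1):
                ids.add(taxonomy.ordinal("/".join(parts[:depth])))
        encoded.append(sorted(ids))
    return encoded


def top_facets(encoded, matched_doc_ids, taxonomy, n=3):
    """Search-time step: bump one array slot per id, then pick the top n."""
    counts = [0] * len(taxonomy.id_to_label)
    for doc_id in matched_doc_ids:
        for ordinal in encoded[doc_id]:
            counts[ordinal] += 1
    ranked = sorted(range(len(counts)), key=lambda i: (-counts[i], i))
    return [(taxonomy.id_to_label[i], counts[i]) for i in ranked[:n] if counts[i]]


# Hypothetical hierarchical facet labels for three documents.
DOCS = [
    ["Author/Lisa", "Year/2012"],
    ["Author/Bob", "Year/2012"],
    ["Author/Lisa", "Year/2013"],
]

taxonomy = Taxonomy()
encoded = index_documents(DOCS, taxonomy)
print(top_facets(encoded, [0, 1, 2], taxonomy))
# [('Author', 3), ('Year', 3), ('Author/Lisa', 2)]
```

All the string handling happens once, in index_documents. The per-query work in top_facets is only integer array increments, which is the trade the quoted passage describes: less flexibility at query time in exchange for cheaper counting.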
The dynamic range faceting sounds particularly useful.
It’s critical for analysts and presenters of data to share information in a way that people just get it. Enter data storytelling – a magical elixir to all your data communication woes! Well, maybe not quite. But you should be aware of recent efforts using this timeless approach to deliver information so naturally – through stories.
This exercise breaks down a structured (yet casual) introduction to data storytelling through a variety of resources. We wanted to provide a diversity of depth and inspiration. Feel free to skip around or follow our 4 week sequence. Print it and post it near the water cooler or slap it to your virtual desktop.
I don’t have a water cooler but I will post “30 Days to Data Storytelling” next to my monitors.
Whatever the subject, knowledge you can’t communicate to others is lost.
We have been blown away with the number and size of organizations who have downloaded the beta bits of this 100% open source, and native to Windows distribution of Hadoop and engaged Hortonworks and Microsoft around evolving their data architecture to respond to the challenges of enterprise big data.
With this key milestone HDP for Windows offers the millions of customers running their business on Microsoft technologies an ecosystem-friendly Hadoop-based solution that is built for the enterprise and purpose built for Windows. This release cements Apache Hadoop’s role as a key component of the next generation enterprise data architecture, across the broadest set of datacenter configurations as HDP becomes the first production-ready Apache Hadoop distribution to run on both Windows and Linux.
Additionally, customers now also have complete portability of their Hadoop applications between on-premise and cloud deployments via HDP for Windows and Microsoft’s HDInsight Service.
Two lessons here:
First, Hadoop is a very popular way to address enterprise big data.
Second, going where users are, not where they ought to be, is a smart business move.
A molecule editor, i.e. a program facilitating graphical input and interactive editing of molecules, is an indispensable part of every cheminformatics or molecular processing system. Today, when a web browser has become the universal scientific user interface, a tool to edit molecules directly within the web browser is essential. One of the most popular tools for molecular structure input on the web is the JME applet. Since its release nearly 15 years ago, however, the web environment has changed and Java applets are facing increasing implementation hurdles due to their maintenance and support requirements, as well as security issues. This prompted us to update the JME editor and port it to a modern Internet programming language – JavaScript.
Summary
The actual molecule editing Java code of the JME editor was translated into JavaScript with help of the Google Web Toolkit compiler and a custom library that emulates a subset of the GUI features of the Java runtime environment. In this process, the editor was enhanced by additional functionalities including a substituent menu, copy/paste, drag and drop and undo/redo capabilities and an integrated help. In addition to desktop computers, the editor supports molecule editing on touch devices, including iPhone, iPad and Android phones and tablets. In analogy to JME the new editor is named JSME. This new molecule editor is compact, easy to use and easy to incorporate into web pages.
Conclusions
A free molecule editor written in JavaScript was developed and is released under the terms of the permissive BSD license. The editor is compatible with JME, has practically the same user interface as well as the web application programming interface. The JSME editor is available for download from the project web page http://peter-ertl.com/jsme/
Just in case you were having any doubts about using JavaScript to power an annotation editor.
After over a year of R&D, five milestone releases, and two release candidates, we are happy to release Neo4j 1.9 today! It is available for download effective immediately. And the latest source code is available, as always, on Github.
The 1.9 release adds primarily three things:
Auto-Clustering, which makes Neo4j Enterprise clustering more robust & easier to administer, with fewer moving parts
Cypher language improvements make the language more functionally powerful and more performant, and
New welcome pages make learning easier for new users
Searching through all your content is fine – until you get a mountain of it with similar content, differentiated only by context. Then you’ll need to understand the meaning within the content. In this post I discuss how to do this using semantic techniques…
Organisations today have realised that for certain applications it is useful to have a consolidated search approach over several catalogues. This is most often the case when customers can interact with several parts of the company – sales, billing, service, delivery, fraud checks.
This approach is commonly called Enterprise Search, or Search and Discovery, which is where your content across several repositories is indexed in a separate search engine. Typically this indexing occurs some time after the content is added. In addition, it is not possible for a search engine to understand the full capabilities of every content system. This means complex mappings are needed between content, meta data and security. In some cases, this may be retrofitted with custom code as the systems do not support a common vocabulary around these aspects of information management.
Content Search
We are all used to content search, so much so that for today’s teenagers a search bar with a common (‘Google like’) grammar is expected. This simple yet powerful interface allows us to search for content (typically web pages and documents) that contain all the words or phrases that we need. Often this is broadened by the use of a thesaurus and word stemming (plays and played stem to the verb play), and combined with some form of weighting based on relative frequency within each unit of content.
Other techniques are also applied. Metadata is extracted or implied – author, date created, modified, security classification, Dublin Core descriptive data. Classification tools can be used (either at the content store or search indexing stages) to perform entity extraction (Cheese is a food stuff) and enrichment (Sheffield is a place with these geospatial co-ordinates). This provides a greater level of description of the term being searched for over and above simple word terms.
Using these techniques, additional search functionality can be provided. Search for all shops visible on a map using a bounding box, radius or polygon geospatial search. Return only documents where these words are within 6 words of each other. Perhaps weight some terms as more important than others, or optional.
These techniques are provided by many of the Enterprise class search engines out there today. Even Open Source tools like Lucene and Solr are catching up with this. They have provided access to information where before we had to rely on Information and Library Services staff to correctly classify incoming documents manually, as they did back in the paper bound days of yore.
Content search only gets you so far though.
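As a small illustration of one technique from the quoted passage, the "within 6 words of each other" proximity search, here is a minimal sketch in plain Python. Real search engines answer this from positional indexes rather than by rescanning text, so this is only a sketch of the matching rule. The function names and sample sentence are hypothetical.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within_distance(text, first, second, max_gap=6):
    """True if `first` and `second` occur within `max_gap` words of each other."""
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(
        abs(i - j) <= max_gap
        for i in first_positions
        for j in second_positions
    )


# Hypothetical sample sentence.
TEXT = "Sheffield is a city known for steel and cutlery"

print(within_distance(TEXT, "sheffield", "steel"))    # True: 6 words apart
print(within_distance(TEXT, "sheffield", "cutlery"))  # False: 8 words apart
```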
I was amening with the best of them until Adam reached the part about MarkLogic 7 going to add Semantic Web capabilities.
I didn’t see any mention of linked data replicating the semantic diversity that currently exists in data stores.
Making data more accessible isn’t going to make it less diverse.
Although making data more accessible may drive the development of ways to manage semantic diversity.
So perhaps there is a useful side to linked data after all.
Google IO wrapped up last week with a tremendous number of data-related announcements. Today’s post is going to focus on Google Compute Engine (GCE), Google’s answer to Amazon’s Elastic Compute Cloud (EC2) that allows you to create and run virtual compute instances within Google’s cloud. We have spent a good amount of time talking about GCE in the past, in particular, benchmarking it against EC2 here, here, here, and here.
The main GCE announcement at IO was, of course, the fact that now **anyone** and **everyone** can try out and use GCE. Yes, GCE instances now support up to 10 terabytes per disk volume, which is a BIG deal. However, the fact that GCE will use minute-by-minute pricing, which might not seem incredibly significant on the surface, is an absolute game changer.
Let’s say that I have a job that will take just a thousand instances each a little bit over an hour to finish (a total of just over a thousand “instance hours”). I launch my thousand instances, run the needed job, and then shut down my cloud 61 minutes later. Let’s also assume that Amazon and Google both charge about the same amount, say $0.50 per instance per hour (a relatively safe assumption) and that Amazon’s and Google’s instances have the same computational horsepower (this is not true, see my benchmark results). As Amazon charges by the hour, Amazon would charge me for two hours per instance or $1000.00 total (1000 instances x $0.50 per instance per hour x 2 hours per instance) whereas Google would only charge me $508.33 (1000 instances x $0.50 per instance per hour x 61/60 hours per instance). In this circumstance, Amazon’s hourly billing has almost doubled my costs but the impact is far worse.
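The arithmetic in that example is easy to check. The sketch below models only the two billing rules as the quote describes them, rounding up to whole hours versus charging for minutes used. The function names are hypothetical, and any minimum charges or other terms in the providers' actual pricing are not modelled.

```python
import math


def hourly_cost(instances, minutes, rate_per_hour):
    """Bill each instance for every started hour."""
    return instances * math.ceil(minutes / 60) * rate_per_hour


def per_minute_cost(instances, minutes, rate_per_hour):
    """Bill each instance only for the minutes actually used."""
    return instances * minutes * rate_per_hour / 60


# Figures taken from the quoted example: 1000 instances, 61 minutes, $0.50/hour.
by_hour = hourly_cost(1000, 61, 0.50)
by_minute = per_minute_cost(1000, 61, 0.50)

print(by_hour)                          # 1000.0
print(round(by_minute, 2))
print(round(by_hour / by_minute, 2))    # how many times more hourly billing costs
```

The gap is widest for jobs that run just past an hour boundary. A job of exactly 60 minutes costs the same under both rules.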
Sean does a great job covering the impact of minute-by-minute pricing for cloud computing.
Great news for the short run and I suspect even greater news for the long run.
What happens when instances and storage become too cheap to meter?
Like domestic long distance telephone service.
When anything that can be computed is within the reach of everyone, what will be computed?
Nina Zumel and I (John Mount) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered the Manning Early Access Program (MEAP) which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.
Building a search on BillTrack50 is fairly straightforward, however it isn’t exactly like doing a Google search. So there’s a few things you need to keep in mind, which I’ll explain in this post. There’s also a few tips and tricks advanced users might find useful. Any bills that are introduced later and meet your search terms will be automatically added to your bill sheet (if you made a bill sheet).
Tracking “thumb on the scale” (TOTS) at the state level? BillTrack50 is a great starting point.
BillTrack50 provides surface facts, to which you can add vote trading, influence peddling and other routine legislative activities.
Metaphor Identification in Large Texts Corpora by Yair Neuman, Dan Assaf, Yohai Cohen, Mark Last, Shlomo Argamon, Newton Howard, Ophir Frieder. (Neuman Y, Assaf D, Cohen Y, Last M, Argamon S, et al. (2013) Metaphor Identification in Large Texts Corpora. PLoS ONE 8(4): e62343. doi:10.1371/journal.pone.0062343)
Abstract:
Identifying metaphorical language-use (e.g., sweet child) is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification. The algorithms are variations of the same core algorithm. We evaluate the algorithms on two corpora of Reuters and the New York Times articles. The paper presents the most comprehensive study of metaphor identification in terms of scope of metaphorical phrases and annotated corpora size. Algorithms’ performance in identifying linguistic phrases as metaphorical or literal has been compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus.
A deep review of current work and promising new algorithms on metaphor identification.