Archive for the ‘News’ Category

Introducing Fact Tank

Friday, May 24th, 2013

Introducing Fact Tank by Alan Murray.

From the post:

Welcome to Fact Tank, a new, real-time platform from the Pew Research Center, dedicated to finding news in the numbers.

Fact Tank will build on the Pew Research Center’s unique brand of data journalism. For years, our teams of writers and social scientists have combined rigorous research with high-quality storytelling to provide important information on issues and trends shaping the nation and the world.

Fact Tank will allow us to provide that sort of information at a faster pace, in an attempt to provide you with the information you need when you need it. We’ll fill the gap between major surveys and reports with shorter pieces using our data to give context to the news of the day. And we’ll scour other data sources, bringing you important insights on politics, religion, technology, media, economics and social trends.

An interesting source of additional data on current news stories.

You Are Listening to The New York Times

Sunday, May 19th, 2013

You Are Listening to The New York Times by Hugh Mandeville.

From the post:

When the San Francisco Giants won the 2010 World Series, the post-victory celebrations got out of control. Revelers smashed windows, got into fistfights and started fires. A Muni bus and the metaverse were both set alight.

To track the chaos, Eric Eberhardt, a techie from the Bay Area, tuned in to a San Francisco police scanner station on soma.fm — while also listening to music. Something about the combination of ambient music and live police chatter clicked for Eberhardt, and youarelistening.to was born.

Eberhardt’s site is a mash-up of three APIs: police scanner audio from RadioReference.com, ambient music from SoundCloud and images from Flickr. The outcome is like a real-time soundtrack to Michael Mann’s movie “Heat.” My colleague Chase Davis, interactive news assistant editor, describes it as “‘Hearts of Space’ meets ‘The Wire.’”

(…)

My explorations inspired me to create a page on youarelistening.to that takes New York Times headlines from the Times Newswire API and reads them aloud using TTS-API.com’s text-to-speech API. I also created a page that reads trending tweets, using Twitter’s Search API.

Definitely has potential to enrich a user experience.

Imagine studying early 21st century history and when George W. Bush or Dick Cheney show up on your ereader, War Pigs plays in the background.

Trivia: Did you know that War Pigs was one of 165 songs that Clear Channel suggested could be inappropriate to play after 9/11? 2001 Clear Channel Memorandum.

Cat Stevens with Peace Train also made the list.

Terrorism we can survive. Those trying to protect us, I’m not so sure.

Contextifier: Automatic Generation of Annotated Stock Visualizations

Sunday, May 12th, 2013

Contextifier: Automatic Generation of Annotated Stock Visualizations by Jessica Hullman, Nicholas Diakopoulos and Eytan Adar.

Abstract:

Online news tools—for aggregation, summarization and automatic generation—are an area of fruitful development as reading news online becomes increasingly commonplace. While textual tools have dominated these developments, annotated information visualizations are a promising way to complement articles based on their ability to add context. But the manual effort required for professional designers to create thoughtful annotations for contextualizing news visualizations is difficult to scale. We describe the design of Contextifier, a novel system that automatically produces custom, annotated visualizations of stock behavior given a news article about a company. Contextifier’s algorithms for choosing annotations is informed by a study of professionally created visualizations and takes into account visual salience, contextual relevance, and a detection of key events in the company’s history. In evaluating our system we find that Contextifier better balances graphical salience and relevance than the baseline.

The authors use a stock graph as the primary context in which to link in other news about a publicly traded company.

Other aspects of Contextifier were focused on enhancement of that primary context.

The lesson here is that a tool with a purpose is easier to hone than a tool that could be anything for just about anybody.

I first saw this at Visualization Papers at CHI 2013 by Enrico Bertini.

New York Times – Article Search API v. 2

Sunday, May 5th, 2013

New York Times – Article Search API v. 2

From the documentation page:

With the Article Search API, you can search New York Times articles from Sept. 18, 1851 to today, retrieving headlines, abstracts, lead paragraphs, links to associated multimedia and other article metadata.

The prior Article Search API described itself as:

With the Article Search API, you can search New York Times articles from 1981 to today, retrieving headlines, abstracts, lead paragraphs, links to associated multimedia and other article metadata.

An addition of one hundred and thirty years of content for searching. Not bad for a v. 2 release.

On cursory review, the API does appear to have changed significantly.

For example, the default fields for each request in version 1.0 were body, byline, date, title, url.

In version 2.0, the default fields returned are: web_url, snippet, lead_paragraph, abstract, print_page, blog, source, multimedia, headline, keywords, pub_date, document_type, news_desk, byline, type_of_material, _id, and word_count.

Five default fields for version 1.0 versus seventeen for version 2.0.
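A quick way to check those counts and see which names survived is a set comparison. This is only a sketch: the two lists are copied from the field names quoted above, the variable and function names are my own, and nothing here calls the API itself.

```python
# Default fields as listed above for each version of the Article Search API.
V1_DEFAULT_FIELDS = ["body", "byline", "date", "title", "url"]
V2_DEFAULT_FIELDS = [
    "web_url", "snippet", "lead_paragraph", "abstract", "print_page",
    "blog", "source", "multimedia", "headline", "keywords", "pub_date",
    "document_type", "news_desk", "byline", "type_of_material", "_id",
    "word_count",
]


def compare_defaults(old, new):
    """Return (kept, dropped, added) field names, each sorted alphabetically."""
    old_set, new_set = set(old), set(new)
    kept = sorted(old_set & new_set)
    dropped = sorted(old_set - new_set)
    added = sorted(new_set - old_set)
    return kept, dropped, added


kept, dropped, added = compare_defaults(V1_DEFAULT_FIELDS, V2_DEFAULT_FIELDS)
print(len(V1_DEFAULT_FIELDS), "default fields in 1.0;", len(V2_DEFAULT_FIELDS), "in 2.0")
print("same name in both versions:", kept)
print("no same-named field in 2.0:", dropped)
print("new names in 2.0:", added)
```

A matching name does not prove matching meaning, and a missing name may simply be a rename, so the "dropped" list is a to-do list for manual checking rather than a verdict.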

There are changes in terminology that will make discovering all the changes from version 1.0 to version 2.0 non-trivial.

Two fields that were present in version 1.0 that I don’t see (under another name?) in version 2.0 are:

dbpedia_resource:

DBpedia person names mapped to Times per_facet terms. This field is case sensitive: values must be Mixed Case.

The Times per_facet is often more comprehensive than dbpedia_resource, but the DBpedia name is easier to use with other data sources. For more information about linked open data, see data.nytimes.com.

dbpedia_resource_url:

URLs to DBpedia person names that have been mapped to Times per_facet terms. This field is case sensitive: values must be Mixed Case.

For more information about linked open data, see data.nytimes.com.

More documentation is promised, which I hope includes a mapping from version 1.0 to version 2.0.

Certainly looks like the basis for annotating content in the New York Times archives as part of a topic map.

Where users input their authentication details for the New York Times and/or other pay-per-view sites.

I can’t imagine anyone objecting to you helping them sell their content. ;-)

Mapping the News [Idea for a NewsApp]

Tuesday, April 30th, 2013

NewsRel Uses Machine Learning To Summarize News Stories And Put Them On A Map by Frederic Lardinois.

From the post:

After 24 hours of staring at their screens, the teams that participated in our Disrupt NY 2013 Hackathon have now finished their projects and are currently presenting them onstage. With more than 160 hacks, there are far too many cool ones to write about, but one that stood out to me was NewsRel, an iPad-based news app that uses machine-learning techniques to understand how news stories relate to one other. The app uses Google Maps as its main interface and automatically decides which location is most appropriate for any given story.

The app currently uses Reuters’ RSS feed and analyzes the stories, looking for clusters of related stories and then puts them on the map. Say you are looking at a story about the Boston Marathon bombings. The app, of course, will show you a number of news stories about it clustered around Boston, then maybe something about the president’s comments about it from Washington and another article that relates it to the massacre during the Munich Olympics in 1972.

In addition to this, the team built an algorithm that picks the most important sentences from each story to summarize it for you.

No pointers to software, just the news blurb.

But it does raise an interesting possibility.

What if news video streams were tagged with geolocation and type information?

So I could exclude “train hits parade float” stories from several states away, automobile accidents and crime stories, and replace them with substantive commentary from the BBC or Al Jazeera.

Now that would be a video feed worth paying for. Particularly if for a premium it was commercial free.

Freedom from Wolf Blitzer’s whines in disaster areas should come as a free pre-set.

Just a small amount of additional semantics could lead to entirely new markets and delivery systems.
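To make the tagging idea concrete, here is a minimal sketch of such a filter. Everything in it is hypothetical: the Story fields, the type tags, the coordinates and the 300 km cutoff are my own placeholders, not anything NewsRel or a broadcaster publishes.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Story:
    title: str
    kind: str   # hypothetical type tag, e.g. "accident", "crime", "commentary"
    lat: float  # hypothetical geolocation tag, degrees
    lon: float


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def keep(story, home, local_only_kinds, max_km):
    """Keep a story unless it is a local-interest kind that happened far from home."""
    if story.kind not in local_only_kinds:
        return True
    return distance_km(home[0], home[1], story.lat, story.lon) <= max_km


# Made-up viewer location and stories, for illustration only.
home = (33.75, -84.39)
stories = [
    Story("Accident several states away", "accident", 32.00, -102.08),
    Story("Accident down the road", "accident", 33.80, -84.40),
    Story("Commentary from abroad", "commentary", 51.51, -0.13),
]
feed = [s.title for s in stories if keep(s, home, {"accident", "crime"}, max_km=300)]
print(feed)
```

Two tags per story, a type and a location, are enough to drop the distant accident while keeping both the local one and the commentary, however far away the commentary originates.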

PubMed Watcher (beta)

Thursday, April 25th, 2013

PubMed Watcher (beta)

After logging in with a Google account:

Welcome on PubMed Watcher!

Thanks for registering, here is what you need to know to get quickly started:

Step 1 – Add a Key Article

Define your research topic by setting up to four Key Articles. For instance you can use your own work as input or the papers of the lab you are working in at the moment. Key Articles describe the science you care about. The articles must be referenced on PubMed.

Step 2 – Read relevant stuff

PubMed Watcher will provide you with a feed of related articles, sorted by relevance and similarity in regards to the Key Articles content. The more Key Articles you have, the more tailored the list will be. PubMed Watcher helps to abstract away from journals, impact factors and date of publishing. Spend time reading, not searching! Come back every now and then to monitor your field and to get relevant literature to read.

Ready? Add your first Key Article or learn more about PubMed Watcher machinery.

OK, so I picked four seed articles and then read the “about,” where a “pinch of heuristics” says:

Now the idea behind PubMed Watcher is to pool the feeds coming from each one of your Key Article. If an article is present in more than one feed, it means that this article seems to be even more interesting to you, that’s the heuristic. The redundant article then gets a new higher score which is the sum of all its indivual scores. Example, let’s say you have two Key Articles named A and B. A has two similar articles F and G with respective similarity scores of 4 and 2. The Key Article B has two similar articles too: M and G with scores 7 and 6. The feed presented to you by PubMed Watcher will then be: G first (score of 6+2=8), M (score of 7) and finally F (4). This score is standardised in percentages (relative relatedness, the blue bars in the application), so here we would get: G (100%), M (88%) and F (50%). This metrics is not perfect yet it’s intuitive and gives good enough results; plus it’s fast to compute.
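The pooling heuristic is simple enough to sketch in a few lines. This is my reading of the description above, not PubMed Watcher's own code (that is on Github); the function name and data layout are assumptions, and it assumes at least one positive score.

```python
from collections import defaultdict


def pool_feeds(feeds):
    """Sum each article's similarity scores across all Key Article feeds.

    feeds maps a Key Article name to {related_article: similarity_score}.
    Returns (article, total_score, percent_of_top) tuples, best first.
    """
    totals = defaultdict(float)
    for related in feeds.values():
        for article, score in related.items():
            totals[article] += score
    if not totals:
        return []
    top = max(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(article, total, round(100 * total / top)) for article, total in ranked]


# The worked example from the quoted text: Key Articles A and B.
feeds = {"A": {"F": 4, "G": 2}, "B": {"M": 7, "G": 6}}
ranking = pool_feeds(feeds)
for article, total, percent in ranking:
    print(article, total, f"{percent}%")
```

Run on the example from the quote, it puts G first at 100%, then M at 88% and F at 50%, matching the figures given there.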

Paper on the technique:

PubMed related articles: a probabilistic topic-based model for content similarity by Jimmy Lin and W John Wilbur.

Code on Github.

The interface is fairly “lite” and you can change your four articles easily.

One thing I like from the start is that all I need do is pick one to four articles and I’m set up.

Hard to imagine an easier setup process that comes close to matching your interests.

Agrifeeds…

Wednesday, April 10th, 2013

Agrifeeds – What changes have been made and how

From the post:

The concept behind Agrifeeds has remained, in its core, the same. It harvests items from a collection of almost 200 feeds, in English, Spanish and French, and it offers the possibility of creating new feeds by filtering the aggregated content. The new version however has embellished and enriched the content being imported, thus enriching the items of the feeds it offers.

This has been accomplished with the use of both contributed modules, and modules that have been written purposely for Agrifeeds. For the sake of brevity, those modules that are either well known, or those that have been included mainly for visual purposed (e.g. the Calendar plugin for Views), will not be described in detail.

Despite having a garden and a few backyard chickens, agriculture isn’t my specialty. ;-)

I mention AIMS (Agricultural Information Management Service) as an example of interesting IT development outside my core reading area.

Curious how you would use topic maps with a continuous set of feeds?

Knight News Challenge – 40 Finalists

Monday, April 8th, 2013

Knight News Challenge – 40 Finalists

There are 18 days (as of today) before the evaluation of the forty (40) finalists in the Knight News Challenge closes.

You will need to average better than two (2) a day in order to see all of them.

Worthwhile because:

  • Your comments may help improve a project.
  • Your comments may assist in evaluation of a project.
  • You may get some great ideas for another project.
  • You may see ways to incorporate topic maps in one or more projects. (or not)

It is important to learn to contribute to projects that are not your own and may not be your top choice.

You may discover ideas, techniques and even people who you would otherwise miss.

A Newspaper Clipping Service with Cascading

Friday, April 5th, 2013

A Newspaper Clipping Service with Cascading by Sujit Pal.

From the post:

This post describes a possible implementation for an automated Newspaper Clipping Service. The end-user is a researcher (or team of researchers) in a particular discipline who registers an interest in a set of topics (or web-pages). An assistant (or team of assistants) then scour information sources to find more documents of interest to the researcher based on these topics identified. In this particular case, the information sources were limited to a set of “approved” newspapers, hence the name “Newspaper Clipping Service”. The goal is to replace the assistants with an automated system.

The solution I came up with was to analyze the original web pages and treat keywords extracted out of these pages as topics, then for each keyword, query a popular search engine and gather the top 10 results from each query. The search engine can be customized so the sites it looks at is restricted by the list of approved newspapers. Finally the URLs of the results are aggregated together, and only URLs which were returned by more than 1 keyword topic are given back to the user.

The entire flow can be thought of as a series of Hadoop Map-Reduce jobs, to first download, extract and count keywords from (web pages corresponding to) URLs, and then to extract and count search result URLs from the keywords. I’ve been wanting to play with Cascading for a while, and this seemed like a good candidate, so the solution is implemented with Cascading.
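The final aggregation step is easy to see in miniature. The sketch below is plain in-memory Python, not Cascading or Hadoop, and it is not Sujit Pal's code; the search function is a stand-in with canned results, and every name and URL in it is made up for illustration.

```python
from collections import defaultdict


def clip(keywords, search, min_keywords=2, top_n=10):
    """Return URLs found in the top results of at least min_keywords keywords."""
    found_by = defaultdict(set)
    for keyword in keywords:
        for url in search(keyword)[:top_n]:
            found_by[url].add(keyword)
    return sorted(url for url, kws in found_by.items() if len(kws) >= min_keywords)


# Stand-in for a search engine restricted to the approved newspapers.
CANNED_RESULTS = {
    "hadoop": ["http://example.com/a", "http://example.com/b"],
    "cascading": ["http://example.com/b", "http://example.com/c"],
}


def fake_search(keyword):
    """Return canned result URLs for a keyword (hypothetical data)."""
    return CANNED_RESULTS.get(keyword, [])


clippings = clip(["hadoop", "cascading"], fake_search)
print(clippings)
```

Only the URL returned for both keywords survives, which is the "more than 1 keyword topic" rule from the post. Collecting keywords in a set per URL means a URL that appears twice for the same keyword is still counted once.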

Hmmm, but an “automated system” leaves the user to sort, create associations, etc., for themselves.

Assistants with such a “clipping service” could curate the clippings by creating associations with other materials and adding non-obvious but useful connections.

Think of the front page of the New York Times as an interface to curated content behind the stories that appear on it.

Where “home” is the article on the front page.

Not only more prose but a web of connections to material you might not even know existed.

For example, in Beijing Flaunts Cross-Border Clout in Search for Drug Lord by Jane Perlez and Bree Feng (NYT) we learn that:

Under Lao norms, law enforcement activity is not done after dark, (Liu Yuejin, leader of the antinarcotics bureau of the Ministry of Public Security)

Could be important information, depending upon your reasons for being in Laos.

OpenNews Learning… [data recycling?]

Tuesday, March 19th, 2013

OpenNews Learning wants to provide lessons to developers in and out of newsrooms by Justin Ellis.

From the post:

If you ever wanted an “Ask This Old House”-style guide set in the universe of newsroom developers and designers, today you’re in luck: OpenNews Learning is a new kind of online education project that looks at the nuts and bolts of interactive projects through the eyes of the people who built them. It’s the newest arm of Knight-Mozilla OpenNews, the two-foundation collaboration that aims to strengthen the bonds between the worlds of journalism and software development.

One of the central ideas behind OpenNews is sharing knowledge, through building community and by putting outside developers directly into newsrooms. OpenNews Learning is an extension of that, designed to help developers (aspiring and otherwise) learn how specific projects were built. Consider it another way to “show your work.”

Following these projects should provide ample opportunities to suggest where topic maps could have been used.

I suspect most researchers would prefer data recycling over data mining.

Document Mining with Overview:…

Friday, March 15th, 2013

Document Mining with Overview:… A Digital Tools Tutorial by Jonathan Stray.

The slides from the Overview presentation I mentioned yesterday.

One of the few webinars I have ever attended where nodding off was not a problem! Interesting stuff.

It is designed for the use case where there “…is too much material to read on deadline.”

A cross between document mining and document management.

A cross that hides a lot of the complexity from the user.

Definitely a project to watch.

The Most Expensive Fighter Jet Ever Built, by the Numbers

Thursday, March 14th, 2013

The Most Expensive Fighter Jet Ever Built, by the Numbers by Theodoric Meyer.

From the post:

Thanks to the sequester, the Defense Department is now required to cut more than $40 billion this fiscal year out of its $549 billion budget. But one program that’s unlikely to take a significant hit is the F-35 Joint Strike Fighter, despite the fact that it’s almost four times more expensive than any other Pentagon weapons program that’s in the works.

We’ve compiled some of the most headache-inducing figures, from the program’s hefty cost overruns to the billions it’s generating in revenue for Lockheed Martin.

[See the post for the numbers, which are impressive.]

While the F-35 is billions over budget and years behind schedule, the program seems to be doing better recently. A Government Accountability Office report released this week found that Lockheed has made progress in improving supply and manufacturing processes and addressing technical problems.

“We’ve made enormous progress over the last few years,” Steve O’Bryan, Lockheed’s vice president of F-35 business development, told the Washington Post.

The military’s current head of the program, Lt. Gen. Christopher Bogdan, agreed that things have improved but said Lockheed and another major contractor, Pratt & Whitney, still have a ways to go.

“I want them to take on some of the risk of this program,” Bogdan said last month in Australia, which plans to buy 100 of the planes. “I want them to invest in cost reductions. I want them to do the things that will build a better relationship. I’m not getting all that love yet.”

A story that illustrates the utility of a topic map approach to news coverage.

The story has already spanned more than a decade and language like: “[t]he military’s current head of the program…,” makes me wonder about the prior military heads of the program.

Or for that matter, it isn’t really Lockheed or Pratt & Whitney, that are building (allegedly) the F-35 but identifiable teams of people within those organizations.

And those companies are paying bonuses, stock dividends, etc. during the term of the project.

No one person, or for that matter any one group of people, could chase down all the actors in a story like this one.

However, merging different investigations into distinct aspects of the story could assemble a mosaic clearer than any of its individual pieces.

Perhaps tying poor management, cost overruns, etc., to named individuals will have a greater impact than generalized stories about such practices have when the name is the DoD, Lockheed, etc.


PS: If you aren’t clinically depressed, read the GAO report.

Would you buy a plane where it isn’t known if the helmet mounted display, a critical control system, will work?

It’s like buying a car where a working engine is to-be-determined, maybe.

An F-35 topic map should start with the names, addresses and current status of everyone who signed any paperwork authorizing this project.

“Mixed Messages” on Cybersecurity [China ranks #12 among cyber-attackers]

Thursday, March 14th, 2013

Do you remember the “mixed messages” Dilbert cartoon?

Mixed Messages

Where an “honest” answer meant “mixed messages?”

I had that feeling this morning when I read Mark Rockwell’s post: German telecom company provides real-time map of Cyber attacks.

From the post:

In hopes of blunting mounting electronic assaults, a German telecommunications carrier unveiled a free online capability that shows where Cyber attacks are happening around the world in real time.

Deutsche Telekom, parent company of T-Mobile, put up what it calls its “Security dashboard” portal on March 6. The map, said the company, is based on attacks on its purpose-built network of decoy “honeypot” systems at 90 locations worldwide

Deutsche Telekom said it launched the online portal at the CeBIT telecommunications trade show in Hanover, Germany, to increase the visibility of advancing electronic threats.

“New cyber attacks on companies and institutions are found every day. Deutsche Telekom alone records up to 450,000 attacks per day on its honeypot systems and the number is rising. We need greater transparency about the threat situation. With its security radar, Deutsche Telekom is helping to achieve this,” said Thomas Kremer, board member responsible for Data Privacy, Legal Affairs and Compliance.

Which has a handy chart of the sources of attacks over the last month:

Top 15 of Source Countries (Last month)

Source of Attack              Number of Attacks
Russian Federation                    2,402,722
Taiwan, Province of China               907,102
Germany                                 780,425
Ukraine                                 566,531
Hungary                                 367,966
United States                           355,341
Romania                                 350,948
Brazil                                  337,977
Italy                                   288,607
Australia                               255,777
Argentina                               185,720
China                                   168,146
Poland                                  162,235
Israel                                  143,943
Japan                                   133,908

By measured “attacks,” the geographic location of China (not the Chinese government) is #12 as an origin of cyber-attacks.

After Russia, Taiwan (Province of China), Germany, Ukraine, Hungary, United States, and others.
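If you want to check the ranking yourself, a short sketch will do it. The counts are copied from the chart above; the function name and the "share of the top 15" calculation are my own additions, and the share says nothing about attacks from countries outside the chart.

```python
# Attack counts copied from the "Top 15 of Source Countries (Last month)" chart.
ATTACKS = {
    "Russian Federation": 2_402_722,
    "Taiwan, Province of China": 907_102,
    "Germany": 780_425,
    "Ukraine": 566_531,
    "Hungary": 367_966,
    "United States": 355_341,
    "Romania": 350_948,
    "Brazil": 337_977,
    "Italy": 288_607,
    "Australia": 255_777,
    "Argentina": 185_720,
    "China": 168_146,
    "Poland": 162_235,
    "Israel": 143_943,
    "Japan": 133_908,
}


def rank_of(country, counts):
    """Return the 1-based rank of a country when sorted by attack count, highest first."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return ordered.index(country) + 1


china_rank = rank_of("China", ATTACKS)
china_share = ATTACKS["China"] / sum(ATTACKS.values())
print("China rank:", china_rank)
print(f"China share of attacks in the top 15: {china_share:.1%}")
```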

Just in case you missed several recent news cycles, the Chinese government was being singled out as a cyber-attacker for policy or marketing reasons that are not clear.

This service makes the specious nature of those accusations apparent, although the motivations behind the reports remain unclear.

Before you incorporate any government data or report into a topic map, you should verify the information with at least two independent sources.

Document Mining with Overview:… [Webinar - March 15, 2013]

Thursday, March 14th, 2013

Document Mining with Overview: A Digital Tools Tutorial

From the post:

Friday, March 15, 2013 at 2:00pm Eastern Time Enroll Now

Overview is a free tool for journalists that automatically organizes a large set of documents by topic, and displays them in an interactive visualization for exploration, tagging, and reporting. Journalists have already used it to report on FOIA document dumps, emails, leaks, archives, and social media data. In fact it will work on any set of documents that is mostly text. It integrates with DocumentCloud and can import your projects, or you can upload data directly in CSV form.

You can’t read 10,000 pages on deadline, but Overview can help you rapidly figure out which pages are the important ones — even if you’re not sure what you’re looking for.

This training event is part of a series on digital tools in partnership with the American Press Institute and The Poynter Institute, funded by the John S. and James L. Knight Foundation.

See more tools in the Digital Tools Catalog.

I have been meaning to learn more about “Overview” and this looks like a good opportunity.

Interviewing Databases???

Sunday, March 10th, 2013

“We’re going to tell people how to interview databases”: The rise of data (big and small) in journalism

Caroline O’Donovan writes:

Viktor Mayer-Schönberger and Kenneth Cukier published their joint tome on big data this week, Big Data: A Revolution That Will Transform How We Live, Work and Think. Mayer-Schönberger, a professor of Internet governance and regulation at Oxford, and Cukier, the data editor of The Economist, argue that having access to vast amounts of data will soon overwhelm our natural human tendency to look for correlation and causality where there is none. In the near future, we’ll be able to rely on much larger pools of “messy” data rather than small pools of “clean” data to get more accurate answers to our questions.

“We are taking things we never thought of as informational and rendering them in data,” Mayer-Schönberger said in a talk Wednesday at the Berkman Center for Internet & Society at Harvard. “Once we think of it as data, we can organize it and extract new information.”

In their book, Mayer-Schönberger and Cukier give a number of examples of industries that will be changed forever by the new messiness of data. Bradford Cross cofounded FlightCaster.com, which predicted U.S. flight delays using data about flight times and weather patterns. The company was sold in 2011, at which point “Cross turned his sights on another aging industry.” He started Prismatic, one of a number of news aggregators that filters content for users by analyzing data about sharing frequency on social networks and user preferences.

Caroline quotes Cukier on “interviewing databases,” saying:

When we teach journalism in the future, we’re not just going to teach people the fundamentals of how to do an interview, or what a lede paragraph is. We’re going to tell people how to interview databases. And also, just as we train journalists by telling them that sometimes people that we interview are unfaithful and lie, we’re going to have to teach them to be suspicious of the data, because sometimes the data lies, too. You have to bring the same scrutiny as in the analog world — talking to people and observing — to the data as well.

I like the image of interviewing a database.

How many times do you think a database will be asked the same questions by different reporters?

Do you think recording and sharing those answers would save other reporters time and resources?

How about enabling other reporters to ask questions you forgot or didn’t know enough to ask?

If any of that rings a bell, there may be topic maps in your future.

Marketing Data Sets (Read Topic Maps)

Tuesday, March 5th, 2013

The National Institute for Computer-Assisted Reporting (NICAR) has forty-seven (47) databases for sale in bulk or by geographic region.

Data sets range from “AJC School Test Scores” and “FAA Accidents and Incidents” to “Social Security Administration Death Master File” and “Wage and Hour Enforcement.”

The data sets cover decades of records.

There is a one hundred (100) record sample for each database.

The samples offer an avenue to show paying customers what more is possible with topic maps, starting from a dataset they already know.

With all the talk of gun control in the United States, consider the Federal Firearms/Explosives Licensees database.

For free you can see:

Main documentation (readme.txt)

Sample Data (sampleatf_ffl.xls)

Record layout (Layout.txt)

Do remember that NICAR already has the attention of an interested audience, should you need a partner in marketing a fuller result.

Tools, Slides and Links from NICAR13 [News Investigation/Reporting]

Tuesday, March 5th, 2013

Tools, Slides and Links from NICAR13 by Chrys Wu.

The acronyms were new to me: NICAR (National Institute for Computer-Assisted Reporting), a program of IRE (Investigative Reporters & Editors).

From the post:

NICAR13 brings together some of the sharpest minds and most experienced hands in investigative journalism. Over four days, people share, discuss and teach techniques for hunting leads, gathering data, and presenting stories. Of all the conferences I go to, this one gets the highest marks from attendees for intensive, immediately applicable learning; networking and fun.

NICAR 2014 will be in Baltimore from Feb. 27 to March 2. You should be there.

For additional tutorials, videos, presentations and tips see the lists from 2012 and 2011.

A real wealth of material if you are interested in mining, analyzing and reporting data.

Enjoy!

I first saw this in a tweet by Chrys Wu.

Computational Journalism

Monday, February 25th, 2013

Computational Journalism by Jonathan Stray.

From the webpage:

Maybe it’s not obvious that computer science and journalism go together, but they do!

Computational journalism combines classic journalistic values of storytelling and public accountability with techniques from computer science, statistics, the social sciences, and the digital humanities.

This course, given at the University of Hong Kong during January-February 2013, is an advanced look at how techniques from visualization, natural language processing, social network analysis, statistics, and cryptography apply to four different areas of journalism: finding stories through data mining, communicating what you’ve learned, filtering an overwhelming volume of information, and tracking the spread of information and effects.

The course assumes knowledge of computer science, including standard algorithms and linear algebra. The assignments are in Python and require programming experience. But this introductory video, which explains the topics covered, is for everyone.

For more, see the syllabus, or jump directly to a lecture:

  1. Basics. Feature vectors, clustering, projections.
  2. Text analysis. Tokenization, TF-IDF, topic modeling.
  3. Algorithmic filters. Information overload. Newsblaster and Google News.
  4. Hybrid filters. Social networks as filters. Collaborative Filtering.
  5. Social network analysis. Using it in journalism. Centrality algorithms.
  6. Knowledge representation. Structured data. Linked open data. General Q&A.
  7. Drawing conclusions. Randomness. Competing hypotheses. Causation.
  8. Security, surveillance, and privacy. Cryptography. Threat modeling.

CS knowledge and programming experience still required.

Interfaces will lessen that need over time but that knowledge/experience will help you question when interfaces have given odd results.

I would settle for journalists who question reports, like the Mandiant advertisement on cybersecurity last week. (Crowdsourcing Cybersecurity: A Proposal (Part 1))

Even the talking heads on the PBS Sunday morning news treated it as serious content. It was poorly written/researched ad copy, nothing more.

Of course, you would have to read the first couple of pages to discover that, not just skim the press release.

I first saw this at Christophe Lalanne’s A bag of tweets / February 2013.

Finding tools vs. making tools:…

Sunday, February 17th, 2013

Finding tools vs. making tools: Discovering common ground between computer science and journalism by Nick Diakopoulos.

From the post:

The second Computation + Journalism Symposium convened recently at the Georgia Tech College of Computing to ask the broad question: What role does computation have in the practice of journalism today and in the near future? (I was one of its organizers.) The symposium attracted almost 150 participants, both technologists and journalists, to discuss and debate the issues and to forge a multi-disciplinary path forward around that question.

Topics for panels covered the gamut, from precision and data journalism, to verification of visual content, news dissemination on social media, sports and health beats, storytelling with data, longform interfaces, the new economic landscape of content, and the educational needs of aspiring journalists. But what made these sessions and topics really pop was that participants on both sides of the computation and journalism aisle met each other in a conversational format where intersections and differences in the ways they viewed these topics could be teased apart through dialogue. (Videos of the sessions are online.)

While the panelists were all too civilized for any brawls to break out, mixing two disciplines as different as computing and journalism nonetheless did lead to some interesting discussions, divergences, and opportunities that I’d like to explore further here. Keeping these issues top-of-mind should help as this field moves forward.

Tool foragers and tool forgers

The following metaphor is not meant to be incendiary, but rather to illuminate two different approaches to tool innovation that seemed apparent at the symposium.

Imagine you live about 10,000 years ago, on the cusp of the Neolithic Revolution. The invention of agriculture is just around the corner. It’s spring and you’re hungry after the long winter. You can start scrounging around for berries and other tasty roots to feed you and your family — or you can stop and try to invent some agricultural implements, tools adapted to your own local crops and soil that could lead to an era of prosperity. If you take the inventive approach, you might fail, and there’s a real chance you’ll starve trying — while foraging will likely guarantee you another year of subsistence life.

What role does computation have in your field of practice?

The Times Digital Archives

Saturday, February 9th, 2013

The Times Digital Archives

From the webpage:

Read by both world leaders and the general public, The Times has offered readers in-depth, award-winning and objective coverage of world events since its creation in 1785 and is the oldest daily newspaper in continuous publication.

The Times Digital Archive is an online, full-text facsimile of more than 200 years of The Times, one of the most highly regarded resources for the 19th – 20th Century history detailing every complete page of every issue from 1785. This historical newspaper archive allows researchers an unparalleled opportunity to search and view the best-known and most cited newspaper in the world online in its original published context.

Covers the time period 1785-2006.

Unfortunately, the publisher of this collection, GALE, has limited access to individuals at institutions with subscriptions.

Still, if you have access, this is a great resource for recent event topic maps.

Simon Rogers

Wednesday, February 6th, 2013

Simon Rogers

From the “about” page:

Simon Rogers is editor of guardian.co.uk/data, an online data resource which publishes hundreds of raw datasets and encourages its users to visualise and analyse them – and probably the world’s most popular data journalism website.

He is also a news editor on the Guardian, working with the graphics team to visualise and interpret huge datasets.

He was closely involved in the Guardian’s exercise to crowdsource 450,000 MP expenses records and the organisation’s coverage of the Afghanistan and Iraq Wikileaks war logs. He was also a key part of the Reading the Riots team which investigated the causes of the 2011 England disturbances.

Previously he was the launch editor of the Guardian’s online news service and has edited the paper’s science section. He has edited three Guardian books, including How Slow Can You Waterski and The Hutton Inquiry and its impact.

If you are interested in “data journalism,” data mining or visualization, Simon’s site is one of the first to bookmark.

Informer

Wednesday, February 6th, 2013

Informer Newsletter of the BCS Information Retrieval Specialist Group.

The Winter 2013 issue of the Informer has been published!

You will find:

Prior issues are also available.

BBC …To Explore Linked Data Technology [Instead of hand-curated content management]

Friday, February 1st, 2013

BBC News Lab to Explore Linked Data Technology by Angela Guess.

From the post:

Matt Shearer of the BBC recently reported that the BBC’s News Lab team will begin exploring linked data technologies. He writes, “Hi I’m Matt Shearer, delivery manager for Future Media News. I manage the delivery of the News Product and I also lead on BBC News Labs. BBC News Labs is an innovation project which was started during 2012 to help us harness the BBC’s wider expertise to explore future opportunities. Generally speaking BBC News believes in allowing creative technologists to innovate and influence the direction of the News product. For example the delivery of BBC News’ responsive design mobile service started in 2011 when we made space for a multidiscipline project to explore responsive design opportunities for BBC News. With this in mind the BBC News team setup News Labs to explore linked data technologies.”

Shearer goes on, “The BBC has been making use of linked data technologies in its internal content production systems since 2011. As explained by Jem Rayfield this enabled the publishing of news aggregation pages ‘per athlete’, ‘per sport’ and ‘per event’ for the 2012 Olympics – something that would not have been possible with hand-curated content management. Linked data is being rolled out on BBC News from early 2013 to enrich the connections between BBC News stories, content assets, the wider BBC website and the World Wide Web. We framed each challenge/opportunity for the News Lab in terms of a clear ‘problem space’ (as opposed to a set of requirements that may limit options) supported by research findings, audience needs, market needs, technology opportunities and framed with the BBC News Strategy.”

Read more here.

(emphasis added)

Apologies for the long quote but I wanted to capture the BBC’s comparison of using linked data to hand-curated content management in context.

I never dreamed the BBC was still using “hand-curated content management” as a measure of modern IT systems.

Quite remarkable.

On the other hand, perhaps they were being kind to the linked data experiment by using a measure that enables it to excel.

If you know which one, please comment.

Thanks!

Complete Guardian Dataset Listing!

Thursday, January 17th, 2013

All our datasets: the complete index by Chris Cross.

From the post:

Lost track of the hundreds of datasets published by the Guardian Datablog since it began in 2009? Thanks to ScraperWiki, this is the ultimate list and resource. The table below is live and updated every day – if you’re still looking for that ultimate dataset, the chance is we’ve already done it. Click below to find out

I am simply in awe of the number of datasets produced by the Guardian since 2009.

A few of the more interesting titles include:

You will find things in the hundreds of datasets you have wondered about and other things you can’t imagine wondering about. ;-)

Enjoy!

A Paywall In Your Future? [Curated Data As Revenue Stream]

Tuesday, December 25th, 2012

The New York Times Paywall Is Working Better Than Anyone Had Guessed by Edmund Lee.

From the post:

Ever since the New York Times rolled out its so-called paywall in March 2011, a perennial dispute has waged. Anxious publishers say they can’t afford to give away their content for free, while the blogger set claim paywalls tend to turn off readers accustomed to a free and open Web.

More than a year and a half later, it’s clear the New York Times’ paywall is not only valuable, it’s helped turn the paper’s subscription dollars, which once might have been considered the equivalent of a generous tithing, into a significant revenue-generating business. As of this year, the company is expected to make more money from subscriptions than from advertising — the first time that’s happened.

Digital subscriptions will generate $91 million this year, according to Douglas Arthur, an analyst with Evercore Partners. The paywall, by his estimate, will account for 12 percent of total subscription sales, which will top $768.3 million this year. That’s $52.8 million more than advertising. Those figures are for the Times newspaper and the International Herald Tribune, largely considered the European edition of the Times.

It’s a milestone that upends the traditional 80-20 ratio between ads and circulation that publishers once considered a healthy mix and that is now no longer tenable given the industrywide decline in newsprint advertising. Annual ad dollars at the Times, for example, has fallen for five straight years.

More importantly, subscription sales are rising faster than ad dollars are falling. During the 12 months after the paywall was implemented, the Times and the International Herald Tribune increased circulation dollars 7.1 percent compared with the previous 12-month period, while advertising fell 3.7 percent. Subscription sales more than compensated for the ad losses, surpassing them by $19.2 million in the first year they started charging readers online.

I don’t think gate-keeper and camera-ready copy publishers should take much comfort from this report.

Unlike those outlets, the New York Times has a “value-add” with regard to the news it reports.

Much like UI/UX design, the open question is: What do users see as a value-add? (Hopefully a significant number of users.)

A life or death question for a new content stream, fighting for attention.

Structure and Dynamics of Information Pathways in Online Media

Friday, December 14th, 2012

Structure and Dynamics of Information Pathways in Online Media by Manuel Gomez Rodriguez, Jure Leskovec, Bernhard Schölkopf.

Abstract:

Diffusion of information, spread of rumors and infectious diseases are all instances of stochastic processes that occur over the edges of an underlying network. Many times networks over which contagions spread are unobserved, and such networks are often dynamic and change over time. In this paper, we investigate the problem of inferring dynamic networks based on information diffusion data. We assume there is an unobserved dynamic network that changes over time, while we observe the results of a dynamic process spreading over the edges of the network. The task then is to infer the edges and the dynamics of the underlying network.

We develop an on-line algorithm that relies on stochastic convex optimization to efficiently solve the dynamic network inference problem. We apply our algorithm to information diffusion among 3.3 million mainstream media and blog sites and experiment with more than 179 million different pieces of information spreading over the network in a one year period. We study the evolution of information pathways in the online media space and find interesting insights. Information pathways for general recurrent topics are more stable across time than for on-going news events. Clusters of news media sites and blogs often emerge and vanish in matter of days for on-going news events. Major social movements and events involving civil population, such as the Libyan’s civil war or Syria’s uprise, lead to an increased amount of information pathways among blogs as well as in the overall increase in the network centrality of blogs and social media sites.

A close reading of this paper will have to wait for the holidays but it will be very near the top of the stack!

Transient subjects anyone?

Encyclo

Saturday, December 1st, 2012

Encyclo : An encyclopedia of the future of news from the Nieman Journalism Lab

From the about page:

Encyclo is an encyclopedia of the future of news, produced by the Nieman Journalism Lab at Harvard University.

You may already know the Lab for our reporting, analysis, and commentary on how the world of journalism is changing, both through our website and our Twitter feed. The Internet has revolutionized the way news is gathered, assembled, distributed, and consumed, and our mission is to learn about those changes, to identify what’s working and what isn’t, and to do our small part in helping that evolution along.

But our main site emphasizes new developments and the latest news. We think there’s great value in a resource that steps back a bit from the daily updates and focuses on background and context. What is it about Voice of San Diego that people find interesting? How has The New York Times been innovating? What model is Politico trying to achieve? Those kinds of questions are why we decided to build Encyclo — a resource on the most important organizations and issues in journalism’s evolution.

Another area where avoiding re-finding information, and having links between the subjects of stories, would be a tremendous benefit.

A site to watch and explore for opportunities for topic maps.

Spundge

Saturday, December 1st, 2012

First look: Spundge is software to help journalists to manage real-time data streams by Andrew Phelps.

From the post:

“Spundge is a platform that’s built to take a journalist from information discovery and tracking all the way to publishing, regardless of whatever internal systems they have to contend with,” he told me.

A user creates notebooks to organize material (a scheme familiar to Evernote users). Inside a notebook, a user can add streams from multiple sources and activate filters to refine by keyword, time (past few minutes, last week), location, and language.

Spundge extracts links from those sources and displays headlines and summaries in a blog-style river. A user can choose to save individual items to the notebook or hide them from view, and Spundge’s algorithms begin to learn what kind of content to show more or less of. A user can also save clippings from around the web with a bookmarklet (another Evernote-like feature). If a notebook is public, the stream can be embedded in webpages, à la Storify. (Here’s an example of a notebook tracking the ONA 2012 conference.)

Looks interesting but I wonder about the monochrome view it presents the user?

That is, a particular user chooses their settings and, until they change those settings, the limits of the content they are shown are set by that user.

As opposed to, say, a human-curated source like the New York Times. (Give me human editors and the New York Times.)

Or is the problem a lack of human-curated data feeds?

Benefits stigma: how newspapers report on welfare

Tuesday, November 20th, 2012

Benefits stigma: how newspapers report on welfare by Randeep Ramesh.

From the post:

New research out today looks at the benefits stigma in Britain. The Guardian’s social affairs editor takes a look at the most common myths and sees how content on welfare differs by newspapers.

Those working in benefits and with claimants have become increasingly exasperated with the gap between the reality of poor peoples’ lives and the rhetoric of welfare reform.

Such is the scale of successive governments’ disinformation that the report by Turn2us, part of anti-poverty charity Elizabeth Finn, calls for ministers to abandon briefing journalists in advance of their speeches and asks departments to seek corrections for “predictable and repeated media misinterpretations”.

It is articles like this one that have me contemplating a hard copy subscription to the Guardian.

Mapping the distortions won’t stop them but might sharpen your aim on their sources.

Paying for What Was Free: Lessons from the New York Times Paywall

Sunday, November 4th, 2012

Paying for What Was Free: Lessons from the New York Times Paywall

From the post:

In a national online longitudinal survey, participants reported their attitudes and behaviors in response to the recently implemented metered paywall by the New York Times. Previously free online content now requires a digital subscription to access beyond a small free monthly allotment. Participants were surveyed shortly after the paywall was announced and again 11 weeks after it was implemented to understand how they would react and adapt to this change. Most readers planned not to pay and ultimately did not. Instead, they devalued the newspaper, visited its Web site less frequently, and used loopholes, particularly those who thought the paywall would lead to inequality. Results of an experimental justification manipulation revealed that framing the paywall in terms of financial necessity moderately increased support and willingness to pay. Framing the paywall in terms of a profit motive proved to be a noncompelling justification, sharply decreasing both support and willingness to pay. Results suggest that people react negatively to paying for previously free content, but change can be facilitated with compelling justifications that emphasize fairness.

The original article: Jonathan E. Cook and Shahzeen Z. Attari. Cyberpsychology, Behavior, and Social Networking, ahead of print. doi:10.1089/cyber.2012.0251

Another data point in the struggle to find a viable model for delivery of online content.

The difficulty with “free” content, followed by discovering you still need to pay expenses for that content, is that consumers, when charged, gain nothing over when the content was free. They are losers in that proposition.

I mention this because topic maps that provide content over the web face the same economic challenges as other online content providers.

A model that I haven’t seen (you may have, so sing out) is one that offers the content for free, but where the links to other materials, the research that adds value to the content, are dead links without a subscription. True, someone could track down each and every reference, but if you are using the content as part of your job, do you really want to do that?

The full and complete content is simply made available to anyone who wants a copy. After all, the wider the circulation of the content, the more free advertising you are getting for your publication.

Delivery of PDF files with citations, sans links, for non-subscribers is perhaps one line of XSL-FO code. It satisfies the question of “access” and yet leaves publishers a new area to fill with features and value-added content.

Take, for example, less-than-full-article-level linking. If I wanted to read another thirty pages to find a citation was just boiler-plate, I hardly need a citation network, do I? Of course value-added content isn’t found directly under the lamp post, but requires some imagination.
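To make the “citations, sans links” idea above concrete, here is a minimal sketch in Python rather than XSL-FO. It assumes the article body is simple HTML in which each citation is wrapped in an ordinary anchor element. The names strip_links and render, and the sample text, are hypothetical, and a regular expression is only a stand-in for a real parser or stylesheet.

```python
import re

# Assumption: citations appear as plain, non-nested <a ...>label</a> elements.
LINK = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def strip_links(html: str) -> str:
    """Replace every anchor element with its visible label, dropping the link."""
    return LINK.sub(r"\1", html)


def render(html: str, subscriber: bool) -> str:
    """Subscribers get live links; everyone else gets the same text, sans links."""
    return html if subscriber else strip_links(html)


# Hypothetical sample content, for illustration only.
sample = 'See <a href="https://example.org/paper">Cook and Attari</a> for details.'

print(render(sample, subscriber=False))
```

The full text still circulates freely; only the link targets are held back, and that is where a publisher could layer on finer-grained, value-added linking for subscribers.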