Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 13, 2014

Al Jazeera

Filed under: News,Reporting — Patrick Durusau @ 6:58 pm

Al Jazeera just soft-launched its AJ+ online video network by Janko Roettgers.

From the post:

Qatar-based Al Jazeera soft-launched its AJ+ online video network Friday with a new YouTube channel as well as a dedicated Facebook page and Twitter account. Al Jazeera also announced AJ+ with a press release that described the network as “current affairs experience for mobiles and social streams,” and promised a formal launch later this year.

Cool!

Should be useful for topic maps on current events that try not to be US-centric.

June 12, 2014

Condensing News

Filed under: Data Mining,Information Overload,News,Reporting,Summarization — Patrick Durusau @ 7:27 pm

Information Overload: Can algorithms help us navigate the untamed landscape of online news? by Jason Cohn.

From the post:

Digital journalism has evolved to a point of paradox: we now have access to such an overwhelming amount of news that it’s actually become more difficult to understand current events. IDEO New York developer Francis Tseng is—in his spare time—searching for a solution to the problem by exploring its root: the relationship between content and code. Tseng received a grant from the Knight Foundation to develop Argos*, an online news aggregation app that intelligently collects, summarizes and provides contextual information for news stories. Having recently finished version 0.1.0, which he calls the first “complete-ish” release of Argos, Tseng spoke with veteran journalist and documentary filmmaker Jason Cohn about the role technology can play in our consumption—and comprehension—of the news.

Great story and very interesting software. And as Alyona notes in her tweet, it’s open source!

Any number of applications, particularly for bloggers who are scanning lots of source material every day.

Intended for online news, but a similar application would be useful for TV news as well. In the Atlanta, Georgia area a broadcast could be prefaced by:

  • Accidents (grisly ones) 25%
  • Crimes (various) 30%
  • News previously reported but it’s a slow day today 15%
  • News to be reported on a later broadcast 10%
  • Politics (non-contextualized posturing) 10%
  • Sports (excluding molesting stories reported under crimes) 5%
  • Weather 5%

I haven’t timed the news and some channels are worse than others but take that as a recurrent, public domain summary of Atlanta news. 😉

For digital news feeds, check out the Argos software!
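
Argos is open source, but the core move it makes, boiling a pile of related stories down to a short summary, can be sketched with nothing more than word frequencies. Here is a minimal, purely illustrative extractive summarizer in Python; it is not Argos's own pipeline.

```python
import re
from collections import Counter

def summarize(text, n_sentences=3):
    """Score sentences by summed word frequency and keep the top few."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    ranked = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True,
    )
    keep = set(ranked[:n_sentences])
    # Return the selected sentences in their original order.
    return " ".join(s for s in sentences if s in keep)
```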

I first saw this in a tweet by Alyona Medelyan.

June 7, 2014

A heuristic for sorting science stories in the news

Filed under: News,Reporting — Patrick Durusau @ 7:49 pm

A heuristic for sorting science stories in the news by David Spiegelhalter.

From the post:

Dominic Lawson’s article in the Sunday Times today [paywall] quotes me as having the rather cynical heuristic: “the very fact that a piece of health research appears in the papers indicates that it is nonsense.” I stand by this, but after a bit more consideration I would like to suggest a slightly more refined version for dealing with science stories in the news, particularly medical ones.

Ask yourself: if the study had come up with a negative result, would I be hearing about it? If NO, then don’t bother to read or listen to the story

(emphasis in the original)

This is a great post and deserves to be read in full.

After reading it, how would you answer this question: Would you use the same criteria for social media reports?

Granted, there is a lot of noise in some social media streams, but at the same time, some of them are quite high quality.

As far as the “mainstream” news, you are dumber for having heard it.

June 4, 2014

Overview and Splitting PDF Files

Filed under: News,PDF,Reporting — Patrick Durusau @ 4:09 pm

I have been seeing tweets from the Overview Project that as of today, you can split PDF files into pages without going through DocumentCloud or other tools.

I don’t have Overview installed so I can’t confirm that statement but if true, it is a step in the right direction.

Think about it for a moment.

If you “tag” a one hundred page PDF file with all the “tags” you need to return to that document, what happens? Sure, you can go back to that document, but then you have to search for the material you were tagging.

It is a question of the granularity of your “tagging.” Now imagine tagging a page in a PDF. Is it now easier for you to return to that one page? Can you also say it would be easier for someone else to return to the same page following your path?

Which makes you wonder about citation practices that simply cite an article and not a location within the article.

Are they trying to make your job as a reader that much harder?
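
If you need page-level splitting outside Overview, it is easy enough to do yourself. A minimal sketch with the pypdf library (not Overview's implementation; the file names are hypothetical):

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("report.pdf")  # hypothetical 100-page source document
for i, page in enumerate(reader.pages, start=1):
    writer = PdfWriter()
    writer.add_page(page)
    # One file per page, so a tag can point at a single page.
    with open(f"report-page-{i:03d}.pdf", "wb") as out:
        writer.write(out)
```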

May 30, 2014

BBC Radio Explorer:…

Filed under: BBC,News,Reporting — Patrick Durusau @ 1:56 pm

BBC Radio Explorer: a new way to listen to radio by James Cridland.

From the post:

The BBC has quietly released a prototype service called BBC Radio Explorer.

The service is the result of “10% time”, a loose concept that allows the BBC’s software engineers time to develop and play about with things. Unusually, this one is visible to the public, if you know where to look. But, with a quiet announcement on Twitter and no press release, you’ll be forgiven for not knowing it exists. That’s by design: since it’s not finished, every page tells us it’s “work-in-progress”.

BBC Radio Explorer is a relatively simple idea. Type something that you’re interested in, and the service plays you clips and programmes that it thinks you’ll like: one after the other. It’s a different way to listen to the BBC’s speech radio output, and it should unearth a lot of interesting programming from the BBC.

Technically, it’s nicely done: type a topic, and it instantly starts playing some audio. The BBC’s invested some time in clipping some of their programmes into small chunks, and typically you’ll get a little bit of the Today programme, or BBC Radio 5 live’s breakfast show, as well as longer-form programmes. You can skip forward and back to different clips, and a quite clever progress bar shows you images of what’s coming up, while the current programme slowly disappears. It’s a responsive site, and apparently works well on iOS devices too, though Android support is lacking.
….

James compares similar services and discusses a number of shortcomings of the service.

An old and familiar one is the inadequacy of BBC Radio Explorer’s search capabilities. Not unique to the BBC but common across search engines everywhere.

But on the whole, James takes this to be a worthwhile venture and I would have to agree.

Unless and until users become more vocal about what is lacking in current search capabilities, business as usual will prevail as search engines tweak their results to sell more ads.

May 23, 2014

Overview (new release)

Filed under: Journalism,News,Reporting — Patrick Durusau @ 6:57 pm

Overview (new release)

A new version of Overview was released last Monday. The GitHub page lists the following new features:

  • Overview will reserve less memory on Windows with 32-bit Java. That means it won’t support larger document sets; it also means it won’t crash on startup for some users.
  • Overview now starts in “single-user mode”. You won’t be prompted for a username or password.
  • Overview will automatically open up a browser window to http://localhost:9000 when it’s ready to go.
  • You can export huge document sets without running out of memory.

Installation and upgrade instructions: https://github.com/overview/overview-server/wiki/Installing-and-Running-Overview

For more details on how Overview supports “document-driven journalism,” see the Overview Project homepage.

May 17, 2014

Mapping Kidnappings in Nigeria (Updated)

Filed under: News,Reporting — Patrick Durusau @ 6:54 pm

Mapping Kidnappings in Nigeria (Updated) by Mona Chalabi.

From the post:

Editor’s note (May 16, 3:35 p.m.): This article contains many errors, some of them fundamental to the analysis.

The article repeatedly refers to the number and location of kidnappings. But the Global Database of Events, Language and Tone (GDELT) — the data source for the article — is a repository of media reports, not discrete events. As such, we should only have referred to “media reports of kidnappings,” not kidnappings.

This mistake led to other problems.

We should not have published an animated map showing “kidnappings” over time, or even “media reports of kidnappings” over time. Because we have no data on actual kidnappings, showing a time series requires normalizing the data to account for the increasing number of media reports overall. Thus, showing individual media reports is a mistake. The second map, showing “Kidnapping rate per 100,000 people, 1982-present,” has the same flaw.

This is a good example of why you should have a high degree of confidence in FiveThirtyEight.

Yes, the blog post admits to a number of errors but you should also note:

FiveThirtyEight placed the correction before the original article. You can’t see the misinformation without seeing the correction.

FiveThirtyEight did not spend days or weeks in denial, only to have to confess in the end to being wrong. (Any recent American President would be a study in contrast.)

FiveThirtyEight tells us what went wrong. Good for them and us because now we are both aware of that type of error.
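
The fix the editors describe is simple arithmetic: divide the count of kidnapping reports by the total volume of media reports for the same period, rather than plotting raw counts. A sketch with made-up numbers:

```python
# Illustrative figures only; these are not GDELT counts.
kidnapping_reports = {1982: 12, 1995: 40, 2010: 310, 2014: 950}
total_media_reports = {1982: 1.0e5, 1995: 4.0e5, 2010: 3.0e6, 2014: 9.0e6}

for year in sorted(kidnapping_reports):
    rate = kidnapping_reports[year] / total_media_reports[year]
    print(f"{year}: {rate:.2e} kidnapping reports per media report")
```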

In the unlikely event that you should ever make a public mistake, ;-), please consider following the example of FiveThirtyEight.

I first saw this in a tweet by Christopher Phipps.

May 9, 2014

The Data Journalism Handbook

Filed under: Journalism,News,Reporting — Patrick Durusau @ 6:41 pm

The Data Journalism Handbook edited by Jonathan Gray, Liliana Bounegru and Lucy Chambers.

From the webpage:

The Data Journalism Handbook is a free, open source reference book for anyone interested in the emerging field of data journalism.

It was born at a 48 hour workshop at MozFest 2011 in London. It subsequently spilled over into an international, collaborative effort involving dozens of data journalism’s leading advocates and best practitioners – including from the Australian Broadcasting Corporation, the BBC, the Chicago Tribune, Deutsche Welle, the Guardian, the Financial Times, Helsingin Sanomat, La Nacion, the New York Times, ProPublica, the Washington Post, the Texas Tribune, Verdens Gang, Wales Online, Zeit Online and many others.

A practical tome, it is available in English, Russian, French, German and Georgian.

A very useful and highly entertaining read.

Enjoy and recommend it to others!

April 19, 2014

Streamtools – Update

Filed under: News,Reporting,Stream Analytics,Visualization — Patrick Durusau @ 1:50 pm

streamtools 0.2.4

From the webpage:

This release contains:

  • toEmail and fromEmail blocks: use streamtools to receive and create emails!
  • linear modelling blocks: use streamtools to perform linear and logistic regression using stochastic gradient descent.
  • GUI updates : a new block reference/creation panel.
  • a kullback leibler block for comparing distributions.
  • added a tutorials section to streamtools available at /tutorials in your streamtools server.
  • many small bug fixes and tweaks.

See also: Introducing Streamtools.

+1 on news input becoming more stream-like. But streams, of water and news, can both become polluted.

Filtering water is a well-known science.

Filtering information is doable but with less certain results.

How do you filter your input? (Not necessarily automatically, algorithmically, etc. You have to define the filter first, then choose the means to implement it.)
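
For example, a keyword filter over a stream of items takes only a few lines of Python. This is in the spirit of chaining streamtools blocks, not streamtools itself, and the item format is an assumption.

```python
def keyword_filter(stream, keywords):
    """Yield only the items whose text mentions at least one keyword."""
    wanted = {k.lower() for k in keywords}
    for item in stream:          # item is assumed to be a dict with a "text" key
        text = item.get("text", "").lower()
        if any(k in text for k in wanted):
            yield item

# Usage with a hypothetical iterable of news items:
# for item in keyword_filter(news_items, ["election", "budget"]):
#     handle(item)
```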

I first saw this in a tweet by Michael Dewar.

March 26, 2014

Verification Handbook

Filed under: News,Reporting — Patrick Durusau @ 1:55 pm

Verification Handbook: A definitive guide to verifying digital content for emergency coverage

From the website:

Authored by leading journalists from the BBC, Storyful, ABC, Digital First Media and other verification experts, the Verification Handbook is a groundbreaking new resource for journalists and aid providers. It provides the tools, techniques and step-by-step guidelines for how to deal with user-generated content (UGC) during emergencies.

What

When a crisis breaks, trusted sources such as news and aid organisations must sift through and verify the mass of reports being shared and published, and report back to the public with accurate, fact-checked information. The handbook provides actionable advice to facilitate disaster preparedness in newsrooms, and best practices for how to verify and use information, photos and videos provided by the crowd.

Who

While it primarily targets journalists and aid providers, the handbook can be used by anyone. Its advice and guidance are valuable whether you are a news journalist, citizen reporter, relief responder, volunteer, journalism school student, emergency communication specialist, or an academic researching social media.

Interesting reading.

Now what we need is a handbook of common errors for reviewers.

I first saw this in Pete Warden’s Five short links, 18 March 2014.

March 17, 2014

If you could make a computer do anything with documents,…

Filed under: Document Classification,Document Management,News,Reporting — Patrick Durusau @ 7:56 pm

If you could make a computer do anything with documents, what would you make it do?

The Overview Project has made a number of major improvements in the last year and now they are asking for your opinion on what to do next.

They do have funding, developers and are pushing out new features. I take all of those to be positive signs.

No guarantee that what you ask for is possible with their resources or even of any interest to them.

But, you won’t know if you don’t ask.

I will be posting my answer to that question on this blog this coming Friday, 21 March 2014.

Spread the word! Get other people to try Overview and to answer the survey.

March 14, 2014

Introducing Streamtools:…

Filed under: News,Reporting,Visualization — Patrick Durusau @ 7:46 pm

Introducing Streamtools: A Graphical Tool for Working with Streams of Data by Mike Dewar.

From the post:

We see a moment coming when the collection of endless streams of data is commonplace. As this transition accelerates it is becoming increasingly apparent that our existing toolset for dealing with streams of data is lacking. Over the last 20 years we have invested heavily in tools that deal with tabulated data, from Excel, MySQL, and MATLAB to Hadoop, R, and Python+Numpy. These tools, when faced with a stream of never-ending data, fall short and diminish our creative potential.

In response to this shortfall we have created streamtools—a new, open source project by the New York Times R&D Lab which provides a general purpose, graphical tool for dealing with streams of data. It offers a vocabulary of operations that can be connected together to create live data processing systems without the need for programming or complicated infrastructure. These systems are assembled using a visual interface that affords both immediate understanding and live manipulation of the system.

I’m quite excited about this tool, although I would not go so far as to say it will “encourage new forms of reasoning” (emphasis in original). 😉

Still, this is an exciting new tool and I commend both the post and the tool to you.

March 12, 2014

“The Upshot”

Filed under: Journalism,News,Reporting — Patrick Durusau @ 8:03 pm

“The Upshot” is the New York Times’ replacement for Nate Silver’s FiveThirtyEight by John McDuling.

From the post:

“The Upshot.” That’s the name the New York Times is giving to its new data-driven venture, focused on politics, policy and economic analysis and designed to fill the void left by Nate Silver, the one-man traffic machine whose statistical approach to political reporting was a massive success.

David Leonhardt, the Times’ former Washington bureau chief, who is in charge of The Upshot, told Quartz that the new venture will have a dedicated staff of 15, including three full-time graphic journalists, and is on track for a launch this spring. “The idea behind the name is, we are trying to help readers get to the essence of issues and understand them in a contextual and conversational way,” Leonhardt says. “Obviously, we will be using data a lot to do that, not because data is some secret code, but because it’s a particularly effective way, when used in moderate doses, of explaining reality to people.”

The New York Times’ own public editor admitted that Silver, a onetime baseball stats geek, never really fit into the paper’s culture, and that “a number of traditional and well-respected Times journalists disliked his work.” But Leonhardt says being part of the Times is an “enormous advantage” for The Upshot. “The Times is in an extremely strong position digitally. We are going to be very much a Times product. Having said that, we are not going to do stuff the same way the Times does.” The tone, he said, will be more like having “a journalist sitting next to you, or sending you an email.”

I really like the New York Times for its long tradition of excellence in news gathering. Couple that with technologies to connect its staff’s collective insights with the dots and it would be a formidable enterprise.

March 7, 2014

Introducing the ProPublica Data Store

Filed under: Data,News,Reporting — Patrick Durusau @ 8:07 pm

Introducing the ProPublica Data Store by Scott Klein and Ryann Grochowski Jones.

From the post:

We work with a lot of data at ProPublica. It's a big part of almost everything we do — from data-driven stories to graphics to interactive news applications. Today we're launching the ProPublica Data Store, a new way for us to share our datasets and for them to help sustain our work.

Like most newsrooms, we make extensive use of government data — some downloaded from "open data" sites and some obtained through Freedom of Information Act requests. But much of our data comes from our developers spending months scraping and assembling material from web sites and out of Acrobat documents. Some data requires months of labor to clean or requires combining datasets from different sources in a way that's never been done before.

In the Data Store you'll find a growing collection of the data we've used in our reporting. For raw, as-is datasets we receive from government sources, you'll find a free download link that simply requires you agree to a simplified version of our Terms of Use. For datasets that are available as downloads from government websites, we've simply linked to the sites to ensure you can quickly get the most up-to-date data.

For datasets that are the result of significant expenditures of our time and effort, we're charging a reasonable one-time fee: In most cases, it's $200 for journalists and $2,000 for academic researchers. Those wanting to use data commercially should reach out to us to discuss pricing. If you're unsure whether a premium dataset will suit your purposes, you can try a sample first. It's a free download of a small sample of the data and a readme file explaining how to use it.

The datasets contain a wealth of information for researchers and journalists. The premium datasets are cleaned and ready for analysis. They will save you months of work preparing the data. Each one comes with documentation, including a data dictionary, a list of caveats, and details about how we have used the data here at ProPublica.

A data store you can feel good about supporting!

I first saw this at Nathan Yau’s ProPublica opened a data store.

March 6, 2014

Crisis News on Twitter

Filed under: Authoring Topic Maps,News,Reporting — Patrick Durusau @ 3:50 pm

Who to Follow on Twitter for Crisis News, Part 2: Venezuela by David Godsall.

From the post:

With political strife dominating so much of our news cycle these past months, and events from Ukraine to Venezuela rapidly unfolding, Twitter is one of the best ways to stay informed in real time. But when social media turns everyone into an information source, it can be a challenge to sort the signal from the noise and figure out who to trust.

To help you find reliable sources for some of the most timely geopolitical news stories, we’ve created a series of Twitter lists compiling trusted journalists, activists and citizens on the ground in the conflict regions. These are the people sharing the most up-to-date information, often from their own first hand experiences. In Part 1 of this series, we talked about sources of news from Ukraine.

Our second list in the series focuses on the events currently taking place in Venezuela:

If you are building a topic map for current events, you need information feeds. Twitter has some suggestions if you want to follow events in the Ukraine or Venezuela.

As with any information feed, use even the best feeds with caution. I saw Henry Kissinger on Charlie Rose. Kissinger was very even handed while Rose was an “America lectures the world” advocate. If you haven’t read The Ugly American by William J. Lederer and Eugene Burdick, you should.

It is a very crowded field for who would qualify as the “ugliest” American these days.

March 5, 2014

Introducing Source Guides

Filed under: News,Reporting — Patrick Durusau @ 4:29 pm

Introducing Source Guides by Erin Kissane.

From the post:

Topical collections for readers new and experienced

In the two-and-a-bit years we’ve been publishing Source, we’ve built up a solid archive of project walkthroughs, introductions to new tools and libraries, and case studies. They’re all tagged and searchable, but as with most archives presented primarily in reverse-chron order, pieces tend to attract less attention once they fall off the first page of a given section.

We’ve also been keeping an eye out for ways of inviting in readers who haven’t been following along since we started Source, and who may be a little newer to journalism code—either to the “code” or the “journalism” part.

Introducing Guides

Earlier this year, we got the OpenNews team together for a few workdays in space graciously lent to us by the New York Times, and in our discussion of the two above challenges, we hit on the idea of packaging articles from our archives into topical “guides” that could highlight the most useful and evergreen of our articles on a given subject. Ryan extended our CMS to allow for the easy creation of topical collections via the admin interface, and we started collecting and annotating pieces a few weeks ago.

Today, we’re launching Source Guides with three topics: News Apps Essentials and Better Mapping, which are just what they say on the tin; and the Care and Feeding of News Apps, a beyond-the-basics Guide that considers the introduction, maintenance, and eventual archiving of code projects in newsrooms. In the coming months, we’ll be rolling out a few more batches of Guides, and then adding to the list organically as new themes coalesce in the archives.

Reminds me of the vertical files (do they still call them that?) reference librarians used to maintain. A manila folder with articles, photocopies, etc., centered on some particular topic. Usually one that came up every year or of local interest.

Not all that far from a topic map except that you have to read all the text to collate related pieces together and every reader repeats that task.

Having said that, this is quite a remarkable project that merits your interest and support.

I first saw this in a tweet by Bryan Connor.

March 2, 2014

One Thing Leads To Another (#NICAR2014)

Filed under: Data Mining,Government,Government Data,News,Reporting — Patrick Durusau @ 11:51 am

A tweet this morning read:

overviewproject @overviewproject 1h
.@djournalismus talking about handling 2.5 million offshore leaks docs. Content equivalent to 50,000 bibles. #NICAR14

That sounds interesting! Can’t ever tell when a leaked document will prove useful. But where to find this discussion?

Following #NICAR14 leaves you with the impression this is a conference. (I didn’t recognize the hashtag immediately.)

Searching on the web, the hashtag led me to: 2014 Computer-Assisted Reporting Conference. (NICAR = National Institute for Computer-Assisted Reporting)

The handle @djournalismus offers the name Sebastian Mondial.

Checking the speakers list, I found this presentation:

Inside the global offshore money maze
Event: 2014 CAR Conference
Speakers: David Donald, Mar Cabra, Margot Williams, Sebastian Mondial
Date/Time: Saturday, March 1 at 2 p.m.
Location: Grand Ballroom West
Audio file: No audio file available.

The International Consortium of Investigative Journalists “Secrecy For Sale: Inside The Global Offshore Money Maze” is one of the largest and most complex cross-border investigative projects in journalism history. More than 110 journalists in about 60 countries analyzed a 260 GB leaked hard drive to expose the systematic use of tax havens. Learn how this multinational team mined 2.5 million files and cracked open the impenetrable offshore world by creating a web app that revealed the ownership behind more than 100,000 anonymous “shell companies” in 10 offshore jurisdictions.

Along the way I discovered the speakers list, which covers a wide range of subjects of interest to anyone mining data.

Another treasure is the Tip Sheets and Tutorial page. Here are six (6) selections out of sixty-one (61) items to pique your interest:

  • Follow the Fracking
  • Maps and charts in R: real newsroom examples
  • Wading through the sea of data on hospitals, doctors, medicine and more
  • Free the data: Getting government agencies to give up the goods
  • Campaign Finance I: Mining FEC data
  • Danger! Hazardous materials: Using data to uncover pollution

Not to mention that NICAR2012 and NICAR2013 are also accessible from the NICAR2014 page, with their own “tip” listings.

If you find this type of resource useful, be sure to check out Investigative Reporters and Editors (IRE).

About the IRE:

Investigative Reporters and Editors, Inc. is a grassroots nonprofit organization dedicated to improving the quality of investigative reporting. IRE was formed in 1975 to create a forum in which journalists throughout the world could help each other by sharing story ideas, newsgathering techniques and news sources.

IRE provides members access to thousands of reporting tip sheets and other materials through its resource center and hosts conferences and specialized training throughout the country. Programs of IRE include the National Institute for Computer Assisted Reporting, DocumentCloud and the Campus Coverage Project.

Learn more about joining IRE and the benefits of membership.

Sounds like a win-win offer to me!

You?

February 27, 2014

Same Documents – Multiple Trees

Filed under: News,Reporting — Patrick Durusau @ 11:58 am

View the same documents in different ways with multiple trees by Jonathan Stray.

From the post:

Starting today Overview supports multiple trees for each document set. That is, you can tell Overview to re-import your documents — or a subset of them — with different options, without uploading them again. You can use this to:

  • Focus on a subset of your documents, such as those with a particular tag or containing a specific word.
  • Use ignored and important words to try sorting your documents in different ways.

You create a new tree using a button next to your document set on the list page.

OK, overlapping markup it’s not, but this looks like a very useful feature!

January 27, 2014

ProPublica Launches New Version of SecureDrop

Filed under: Cybersecurity,News,Reporting,Security — Patrick Durusau @ 9:45 pm

ProPublica Launches New Version of SecureDrop by Trevor Timm.

From the post:

Today, ProPublica became the first US news organization to launch the new 0.2.1 version of SecureDrop, our open-source whistleblower submission system journalism organizations can use to safely accept documents and information from sources.

ProPublica, an independent, not-for profit news outlet, is known for their hard-hitting journalism and has won several Pulitzer Prizes since its founding just five and a half years ago. ProPublica’s mission focuses on “producing journalism that shines a light on exploitation of the weak by the strong and on the failures of those with power to vindicate the trust placed in them.”

It’s exactly the type of journalism that we aim to support at Freedom of the Press Foundation and we hope SecureDrop will help ProPublica further that mission.

Get your IT people to read this post and its references in detail.

Poor security is worse than no security at all. Poor security betrays the trust of those who relied on it.

Overview-Server – Developer Install

Filed under: News,Reporting — Patrick Durusau @ 4:31 pm

Setting up a development Environment

The Overview project has posted a four (4) step process to set up an Overview development environment (GitHub):

  1. Install PostgreSQL, a Java Development Kit and Git.
  2. git clone https://github.com/overview/overview-server.git
  3. cd overview-server
  4. ./run

That last command will take a long time — maybe an hour as it downloads and compiles all required components. It will be clear when it’s ready.

Overview lowers the bar for swimming in a sea of documents. Not quite big data style oceans of documents but goodly sized seas of documents.

Documents that are delivered in a multitude of formats, usually as inconveniently as possible.

The hope being that too many documents for timely/economical review will break any requester before they find embarrassing data.

I prefer to disappoint that hope.

Don’t you?

January 26, 2014

…a quantified-self, semantic-analysis tool to track web browsing

Filed under: Information Sharing,News,Reporting — Patrick Durusau @ 8:17 pm

The New York Times’ R&D Lab is building a quantified-self, semantic-analysis tool to track web browsing

From the post:

Let’s say you work in a modern digital newsroom. Your colleagues are looking at interesting stuff online all day long — reading stimulating news stories, searching down rabbit holes you’ve never thought of. There are probably connections between what the reporter five desks down from you is looking for and what you already know — or vice versa. Wouldn’t it be useful if you could somehow gather up that all that knowledge-questing and turn it into a kind of intraoffice intel?

A version of that vision is what Noah Feehan and others in The New York Times’ R&D Lab is working on with a new system called Curriculum. It started as an in-house browser extension he and Jer Thorp built last year called Semex, which monitored your browsing and, by semantically analyzing the web pages you visit, rendered it as a series of themes.

…if Semex was most useful to me as a way to record my cognitive context, the state in which I left a problem, maybe I could share that state with other people who might need to know it. Sharing topics from my browsing history with a close group of colleagues can afford us insight into one another’s processes, yet is abstracted enough (and constrained to a trusted group) to not feel too invasive…

Each user in a group has a Chrome extension that submits pageviews to a server to perform semantic analysis and publish a private, authenticated feed. (I should note here that the extension ignores any pages using HTTPS, to avoid analyzing emails, bank statements, and other secure pages.) Curriculum is carefully designed to be anonymous; that is, no topic in the feed can be traced back to any one particular user. The anonymity isn’t perfect, of course: because there are only five people using it, and because we five are in very close communication with each other, it is usually not too difficult to figure out who might be researching a particular topic.

Curriculum is kind of like a Fitbit for context, an effortless way to record what’s on our minds throughout the day and make it available to the people who need it most: the people we work with. The function Curriculum performs, that of semantic listening, is fantastically useful when people need to share their contexts (what they were working on, what approaches they were investigating, what problems they’re facing) with each other.

The Curriculum feed is truly a new channel of input for us, a stream of information of a different character than we’ve encountered before. Having access to the residue of our collective web travels has led to many questions, conversations, and jokes that wouldn’t have happened without it. (emphasis added)

Are you ready for real information sharing?

I was rather surprised that anyone in a newsroom would be that sensitive about their browsing history. I would stream mine to the Net if I thought anyone were interested. You might be offended by what you find, but that’s not my problem. 😉

I do know of rumored intelligence service projects that never got off the ground because of information sharing concerns. As well as one state legislature that decided it liked to talk about transparency more than it enjoyed practicing it.

While we call for tearing down data silos (those of others) are we anxious to keep our own personal data silos in place?
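
For a rough sense of what “semantic listening” on a pageview might involve, here is a deliberately crude sketch: skip secure pages (as the post says Curriculum does) and reduce a visited page to a few frequent terms. It is an illustration in Python, not the R&D Lab's browser extension, and the URL handling and stop list are assumptions.

```python
import re
from collections import Counter

import requests

STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "have"}

def page_topics(url, n=5):
    """Return a handful of candidate topic terms for a visited page."""
    if url.startswith("https://"):   # ignore secure pages, per the post
        return []
    html = requests.get(url, timeout=10).text
    text = re.sub(r"<[^>]+>", " ", html)   # crude tag stripping
    words = [w for w in re.findall(r"[a-z]{4,}", text.lower())
             if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(n)]
```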

January 25, 2014

How to use Overview to analyze social media posts

Filed under: News,Reporting — Patrick Durusau @ 3:57 pm

How to use Overview to analyze social media posts by Jonathan Stray.

From the post:

Even when 10,000 people post about the same topic, they’re not saying 10,000 different things. People talking about an event will focus on different aspects of it, or have different reactions, but many people will be saying pretty much the same thing. People posting about a product might be talking about one or another of its features, the price, their experience using it, or how it compares to competitors. Citizens of a city might be concerned about many different things, but which things are the most important? Overview’s document sorting system groups similar posts so you can find these conversations quickly, and figure out how many people are saying what.

This post explains how to use Overview to quickly figure out what the different “conversations” are, how many people are involved in each, and how they overlap.

I wondered at first about Jonathan mentioning Radian 6, Sysomos, and Datasift as tools to harvest social media data.

After thinking about it, I realized that all of these tools can capture content from a variety of social media feeds.

Suggestions of open source alternatives that can harvest from multiple social media feeds? Particularly with export capabilities.
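
One low-tech possibility, sketched here with a hypothetical feed URL: pull items from an RSS or Atom feed with feedparser and write them to the kind of CSV Overview can ingest. Real social-network APIs would need their own clients.

```python
import csv

import feedparser

feed = feedparser.parse("https://example.com/social-feed.rss")  # hypothetical feed

with open("posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "tags"])
    writer.writeheader()
    for entry in feed.entries:
        # Title plus summary as the document text, with a single fixed tag.
        writer.writerow({
            "text": f'{entry.get("title", "")} {entry.get("summary", "")}',
            "tags": "rss",
        })
```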

Thanks!

January 17, 2014

Algorithms are not enough:…

Filed under: News,Reporting — Patrick Durusau @ 8:06 pm

Algorithms are not enough: lessons bringing computer science to journalism by Jonathan Stray.

From the post:

There are some amazing algorithms coming out of the computer science community which promise to revolutionize how journalists deal with large quantities of information. But building a tool that journalists can use to get stories done takes a lot more than algorithms. Closing this gap has been one of the most challenging and rewarding aspects of building Overview, and I really think we’ve learned something.

I want to get into the process of going from algorithm to application here, because — somewhat to my surprise — I don’t think this process is widely understood. The computer science research community is going full speed ahead developing exciting new algorithms, but seems a bit disconnected from what it takes to get their work used. This is doubly disappointing, because understanding the needs of users often shows that you need a different algorithm.

The development of Overview is a story about text analysis algorithms applied to journalism, but the principles might apply to any sort of data analysis system. One definition says data science is the intersection of computer science, statistics, and subject matter expertise. This post is about connecting computer science with subject matter expertise.

I rather like the line:

This post is about connecting computer science with subject matter expertise.

If you have ever wondered about how an idea goes from one-off code to software that is easy to use for others, this is a post you need to read.

Jonathan being a reporter by trade makes the story all the more compelling.

It also makes me wonder if topic map interfaces should focus more on how users see the world and not so much on how topic map mavens see the world.

For example the precision of identification users expect may be very different from that of specialists.

Thoughts?

January 9, 2014

Getting Into Overview

Filed under: Data Mining,Document Management,News,Reporting,Text Mining — Patrick Durusau @ 7:09 pm

Getting your documents into Overview — the complete guide by Jonathan Stray.

From the post:

The first and most common question from Overview users is how do I get my documents in? The answer varies depending on the format of your material. There are three basic paths to get documents into Overview: as multiple PDFs, from a single CSV file, and via DocumentCloud. But there are several other tricks you might need, depending on your situation.

Great coverage of the first step towards using Overview.

Just in case you are not familiar with Overview (from the about page):

Overview is an open-source tool to help journalists find stories in large numbers of documents, by automatically sorting them according to topic and providing a fast visualization and reading interface. Whether from government transparency initiatives, leaks or Freedom of Information requests, journalists are drowning in more documents than they can ever hope to read.

There are good tools for searching within large document sets for names and keywords, but that doesn’t help find the stories you’re not specifically looking for. Overview visualizes the relationships among topics, people, and places to help journalists to answer the question, “What’s in there?”

Overview is designed specifically for text documents where the interesting content is all in narrative form — that is, plain English (or other languages) as opposed to a table of numbers. It also works great for analyzing social media data, to find and understand the conversations around a particular topic.

It’s an interactive system where the computer reads every word of every document to create a visualization of topics and sub-topics, while a human guides the exploration. There is no installation required — just use the free web application. Or you can run this open-source software on your own server for extra security. The goal is to make advanced document mining capability available to anyone who needs it.

Examples of people using Overview? See Completed Stories for a sampling.

Overview is a good response to government “disclosures” that attempt to hide wheat in lots of chaff.

December 25, 2013

Duplicate News Story Detection Revisited

Filed under: Deduplication,Duplicates,News,Reporting — Patrick Durusau @ 5:34 pm

Duplicate News Story Detection Revisited by Omar Alonso, Dennis Fetterly, and Mark Manasse.

Abstract:

In this paper, we investigate near-duplicate detection, particularly looking at the detection of evolving news stories. These stories often consist primarily of syndicated information, with local replacement of headlines, captions, and the addition of locally-relevant content. By detecting near-duplicates, we can offer users only those stories with content materially different from previously-viewed versions of the story. We expand on previous work and improve the performance of near-duplicate document detection by weighting the phrases in a sliding window based on the term frequency within the document of terms in that window and inverse document frequency of those phrases. We experiment on a subset of a publicly available web collection that is comprised solely of documents from news web sites. News articles are particularly challenging due to the prevalence of syndicated articles, where very similar articles are run with different headlines and surrounded by different HTML markup and site templates. We evaluate these algorithmic weightings using human judgments to determine similarity. We find that our techniques outperform the state of the art with statistical significance and are more discriminating when faced with a diverse collection of documents.

Detecting duplicates or near-duplicates of subjects (such as news stories) is part and parcel of a topic maps toolkit.
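
For a feel of the basic machinery, here is a much simpler cousin of what the paper does: word shingles compared with Jaccard similarity, without the term-frequency weighting the authors add. A minimal sketch, not the paper's algorithm:

```python
import re

def shingles(text, w=5):
    """Return the set of w-word shingles in a document."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def near_duplicate(doc_a, doc_b, threshold=0.6):
    """Flag two documents as near-duplicates above a similarity threshold."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```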

What I found curious about this paper was the definition of “content” to mean the news story and not online comments as well.

That’s a rather limited view of near-duplicate content. And it has a pernicious impact.

If a story quotes a lead paragraph or two from a New York Times story, comments may be made at the “near-duplicate” site, not the New York Times.

How much of a problem is that? When was the last time you saw a comment that was not in English in the New York Times?

Answer: Very unlikely you have ever seen such a comment:

If you are writing a comment, please be thoughtful, civil and articulate. In the vast majority of cases, we only accept comments written in English; foreign language comments will be rejected. Comments & Readers’ Reviews

If a story appears in the New York Times and a “near-duplicate” in Arizona, Italy, and Sudan, with comments, according to the authors, you will not have the opportunity to see that content.

That’s replacing American Exceptionalism with American Myopia.

Doesn’t sound like a winning solution to me.

I first saw this at Full Text Reports as Duplicate News Story Detection Revisited.

December 23, 2013

Hiding Interrogation Manual – In Plain Sight

Filed under: News,Reporting,Security — Patrick Durusau @ 8:52 pm

You’ll Never Guess Where This FBI Agent Left a Secret Interrogation Manual by Nick Baumann.

From the post:

In a lapse that national security experts call baffling, a high-ranking FBI agent filed a sensitive internal manual detailing the bureau’s secret interrogation procedures with the Library of Congress, where anyone with a library card can read it.

For years, the American Civil Liberties Union fought a legal battle to force the FBI to release a range of documents concerning FBI guidelines, including this one, which covers the practices agents are supposed to employ when questioning suspects. Through all this, unbeknownst to the ACLU and the FBI, the manual sat in a government archive open to the public. When the FBI finally relented and provided the ACLU a version of the interrogation guidebook last year, it was heavily redacted; entire pages were blacked out. But the version available at the Library of Congress, which a Mother Jones reporter reviewed last week, contains no redactions.

The 70-plus-page manual ended up in the Library of Congress, thanks to its author, an FBI official who made an unexplainable mistake. This FBI supervisory special agent, who once worked as a unit chief in the FBI’s counterterrorism division, registered a copyright for the manual in 2010 and deposited a copy with the US Copyright Office, where members of the public can inspect it upon request. What’s particularly strange about this episode is that government documents cannot be copyrighted.

A bit further on in the story it is reported:

Because the two versions are similar, a side-by-side comparison allows a reader to deduce what was redacted in the later version. The copyright office does not allow readers to take pictures or notes, but during a brief inspection, a few redactions stood out.

See Nick’s story for the redactions but what puzzled me was the “does not allow readers to take pictures or notes…” line.

Turns out what Mother Jones should have done was contact the ACLU, who is involved in litigation over this item.

Why?

Because under Circular 6 of the Copyright Office, copies of a deposit can be obtained under three (3) conditions, one of which is:

The Copyright Office Litigation Statement Form is completed and received from an attorney or authorized representative in connection with litigation, actual or prospective, involving the copyrighted work. The following information must be included in such a request: (a) the names of all parties involved and the nature of the controversy, and (b) the name of the court in which the actual case is pending. In the case of a prospective proceeding, the requestor must give a full statement of the facts of controversy in which the copyrighted work is involved, attach any letter or other document that supports the claim that litigation may be instituted, and make satisfactory assurance that the requested reproduction will be used only in connection with the specified litigation.

Contact the Records Research and Certification Section for a Litigation Statement Form. This form must be used. No substitute will be permitted. The form must contain an original signature and all information requested for the Copyright Office to process a request.

You can also get a court order but this one looks like a custom fit for the ACLU case.

It is hard to argue the government is in bad faith while ignoring routine administrative procedures to obtain the information you seek.

PS: If you have any ACLU contacts, please forward this post to them.

If you have Mother Jones contacts, suggest to them the drill is to get the information first, then break the story. They seem to have gotten that backwards on this one.

December 15, 2013

What is xkcd all about?…

Filed under: News,Reporting,Topic Models (LDA) — Patrick Durusau @ 9:10 pm

What is xkcd all about? Text mining a web comic by Jonathan Stray.

From the post:

I recently ran into a very cute visualization of the topics of XKCD comics. It’s made using a topic modeling algorithm where the computer automatically figures out what topics xkcd covers, and the relationships between them. I decided to compare this xkcd topic visualization to Overview, which does a similar sort of thing in a different way (here’s how Overview’s clustering works).

Stand back, I’m going to try science!

I knew that topic modeling had to have some practical use. 😉

Jonathan uses the wildly popular xkcd comic to illustrate some of the features of Overview.

Emphasis on “some.”

Something fun to start the week with!

Besides, you are comparing topic modeling algorithms on a known document base.

What could be more work related than that?
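
If you want to try the topic-modeling side yourself, here is a minimal sketch with scikit-learn's LDA. It is not the code behind the visualization, and the transcript strings below are made-up stand-ins for whatever corpus you scrape.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

transcripts = [  # hypothetical stand-ins for scraped comic transcripts
    "robot builds a robot to automate the robot factory",
    "graph of love over time plotted against coffee intake",
    "physicists argue about velociraptors and escape velocity",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(transcripts)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-8:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```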

December 14, 2013

Step-by-step instructions for using Overview

Filed under: Document Management,News,Reporting,Visualization — Patrick Durusau @ 8:06 pm

Step-by-step instructions for using Overview by Jonathan Stray.

The Overview project posted the first job ad that I ever posted to this blog: Overview: Visualization to Connect the Dots.

A great project that enables ordinary users to manage large numbers of documents, to mine them and then to visualize relationships, all part of the process of news investigations.

Jonathan has written very clear and useful instructions for using Overview.

It is an open source software project so if you see possible improvements or added features, sing out! Or even better, contribute such improvements and/or features to the project.

November 24, 2013

GDELT:…

Filed under: Data Mining,Graphs,News,Reporting,Topic Maps — Patrick Durusau @ 3:26 pm

GDELT: The Global Database of Events, Language, and Tone

From the about page:

The Global Database of Events, Language, and Tone (GDELT) is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world over the last two centuries down to the city level globally, to make all of this data freely available for open research, and to provide daily updates to create the first “realtime social sciences earth observatory.” Nearly a quarter-billion georeferenced events capture global behavior in more than 300 categories covering 1979 to present with daily updates.

GDELT is designed to help support new theories and descriptive understandings of the behaviors and driving forces of global-scale social systems from the micro-level of the individual through the macro-level of the entire planet by offering realtime synthesis of global societal-scale behavior into a rich quantitative database allowing realtime monitoring and analytical exploration of those trends.

GDELT’s evolving ability to capture ethnic, religious, and other social and cultural group relationships will offer profoundly new insights into the interplay of those groups over time, offering a rich new platform for understanding patterns of social evolution, while the data’s realtime nature will expand current understanding of social systems beyond static snapshots towards theories that incorporate the nonlinear behavior and feedback effects that define human interaction and greatly enrich fragility indexes, early warning systems, and forecasting efforts.

GDELT’s goal is to help uncover previously-obscured spatial, temporal, and perceptual evolutionary trends through new forms of analysis of the vast textual repositories that capture global societal activity, from news and social media archives to knowledge repositories.

Key Features


  • Covers all countries globally
  • Covers a quarter-century: 1979 to present
  • Daily updates every day, 365 days a year
  • Based on cross-section of all major international, national, regional, local, and hyper-local news sources, both print and broadcast, from nearly every corner of the globe, in both English and vernacular
  • 58 fields capture all available detail about event and actors
  • Ten fields capture significant detail about each actor, including role and type
  • All records georeferenced to the city or landmark as recorded in the article
  • Sophisticated geographic pipeline disambiguates and affiliates geography with actors
  • Separate geographic information for location of event and for both actors, including GNS and GNIS identifiers
  • All records include ethnic and religious affiliation of both actors as provided in the text
  • Even captures ambiguous events in conflict zones (“unidentified gunmen stormed the mosque and killed 20 civilians”)
  • Specialized filtering and linguistic rewriting filters considerably enhance TABARI’s accuracy
  • Wide array of media and emotion-based “importance” indicators for each event
  • Nearly a quarter-billion event records
  • 100% open, unclassified, and available for unlimited use and redistribution

The download page lists various data sets, including the GDELT Global Knowledge Graph and daily downloads of intake data.
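
The daily event exports are plain tab-delimited tables, so a few lines of pandas are enough to start poking at one. A sketch only: the file name and the column index below are assumptions, so check the GDELT codebook for the exact 58-field layout.

```python
import pandas as pd

# A GDELT daily event export (the file name is an example).
events = pd.read_csv("20140323.export.CSV", sep="\t", header=None, low_memory=False)

# Count events by the action-geography country code. The column position
# is an assumption; confirm it against the official codebook.
ACTION_GEO_COUNTRY = 51
print(events[ACTION_GEO_COUNTRY].value_counts().head(10))
```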

If you are looking for data to challenge your graph, topic map or data mining skills, GDELT is the right spot.

July 25, 2013

Comparing text to data by importing tags

Filed under: Interface Research/Design,News,Reporting,UX — Patrick Durusau @ 1:12 pm

Comparing text to data by importing tags by Jonathan Stray.

From the post:

Overview sorts documents into folders based on the topic of each document, as determined by analyzing every word in each document. But it can also be used to see how the document text relates to the date of publication, document type, or any other field related to each document.

This is possible because Overview can import tags. To use this feature, you will need to get your documents into a CSV file, which is a simple rows and columns spreadsheet format. As usual, the text of each document goes in the “text” column. But you can also add a “tags” column which gives the tag or tags to be initially assigned to each document, separated by commas if more than one.

Jonathan demonstrates this technique on the Afghanistan War Logs.
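
That CSV is easy to produce with the standard library. A minimal sketch using the column names the post describes; the documents themselves are made up.

```python
import csv

documents = [
    {"text": "Patrol report, Kandahar province ...", "tags": "2009,patrol"},
    {"text": "IED incident summary ...", "tags": "2010,ied"},
]

with open("documents.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "tags"])
    writer.writeheader()
    writer.writerows(documents)
```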

Associations at the level of a document are useful.

As Jonathan suggests: document + date of publication; document + document type, etc.

But doesn’t that leave the reader with the last semantic mile to travel on their own?

That is, I would rather have document + source/author + term in document + date of publication, and a host of other associations represented.

Otherwise, once I find the document, using tags perhaps, I have to retrace the steps of anyone who discovered the “document + source/author + term in document + date of publication” relationship before I did.

And anyone following me will have to retrace my steps.

How many searches get retraced in your department every month?

