NSA — Untangling the Web: A Guide to Internet Research

Wednesday, May 15th, 2013

A Freedom of Information Act (FOIA) request caused the NSA to disgorge its guide to web research, which is some six years out of date.

From the post:

The National Security Agency just released “Untangling the Web,” an unclassified how-to guide to Internet search. It’s a sprawling document, clocking in at over 650 pages, and is the product of many years of research and updating by a NSA information specialist whose name is redacted on the official release, but who is identified as Robyn Winder of the Center for Digital Content on the Freedom of Information Act request that led to its release.

It’s a droll document on many levels. First and foremost, it’s funny to think of officials who control some of the most sophisticated supercomputers and satellites ever invented turning to a .pdf file for tricks on how to track down domain name system information on an enemy website. But “Untangling the Web” isn’t for code-breakers or wire-tappers. The target audience seems to be staffers looking for basic factual information, like the preferred spelling of Kazakhstan, or telephonic prefix information for East Timor.

I take it as guidance on how “good” your application or service needs to be to pitch to the government.

I keep thinking that, to attract government attention, an application needs to fall just short of solving P = NP.

On the contrary, the government needs spell checkers, phone information and no doubt lots of other dull information, quickly.

Perhaps an app that signals fresh doughnuts from bakeries within X blocks would be just the thing.

Google’s Hybrid Approach to Research [Lessons For Topic Map Research?]

Friday, November 2nd, 2012

Google’s Hybrid Approach to Research by Alfred Spector, Peter Norvig, and Slav Petrov.

From the start of the article:

In this Viewpoint, we describe how we organize computer science research at Google. We focus on how we integrate research and development and discuss the benefits and risks of our approach. The challenge in organizing R&D is great because CS is an increasingly broad and diverse field. It combines aspects of mathematical reasoning, engineering methodology, and the empirical approaches of the scientific method. The empirical components are clearly on the upswing, in part because the computer systems we construct have become so large that analytic techniques cannot properly describe their properties, because the systems now dynamically adjust to the difficult-to-predict needs of a diverse user community, and because the systems can learn from vast datasets and large numbers of interactive sessions that provide continuous feedback.

We have also noted that CS is an expanding sphere, where the core of the field (theory, operating systems, and so forth) continues to grow in depth, while the field keeps expanding into neighboring application areas. Research results come not only from universities, but also from companies, both large and small. The way research results are disseminated is also evolving and the peer-reviewed paper is under threat as the dominant dissemination method. Open source releases, standards specifications, data releases, and novel commercial systems that set new standards upon which others then build are increasingly important.

This seems particularly useful:

Thus, we have structured the Google environment as one where new ideas can be rapidly verified by small teams through large-scale experiments on real data, rather than just debated. The small-team approach benefits from the services model, which enables a few engineers to create new systems and put them in front of users.

Particularly in terms of research and development for topic maps.

I confess to a fondness for the “…just debated” side but point out that developers aren’t users, whether for interface requirements or software capabilities.

Selling what you have debated or written isn’t the same thing as selling what customers want. You can verify that lesson with the Semantic Web folks.

Semantic impedance is going to grow along with “big data.”

Topic maps need to be poised to deliver a higher ROI in resolving semantic impedance than ad hoc solutions. And to deliver that ROI in the context of “big data” tools.

Book Review – “Universal Methods of Design”

Saturday, September 1st, 2012

Book Review – “Universal Methods of Design” by Cyd Harrell.

From the review:

I’ve never been one to use a lot of inspirational tools, like decks of design method cards. Day to day, I figure I have a very solid understanding of core practices and can make others up if I need to. But I’ve also been the leader of a fast-paced team that has been asked to solve all kinds of difficult problems through research and design, so sticking to my personal top five techniques was never an option. After all, only the most basic real-world research goals can be attained without combining and evolving methods.

So I was quite intrigued when I received a copy of Bella Martin and Bruce Hanington’s Universal Methods of Design, which presents summaries of 100 different research and analysis methods as two-page spreads in a nice, large-format hardback. Could this be the ideal reference for a busy research team with a lot of chewy problems to solve?

In short: yes. It functions as a great reference when we hear of a method none of us is familiar with, but more importantly it’s an excellent “unsticker” when we run into a challenge in the design or analysis of a study. I have a few quibbles with organization that I’ll get to in a minute, but in general this is a book that every research team should have on hand.

See the review for Cyd’s quibble.

For a copy near you, see: “Universal Methods of Design.”

Data-Intensive Librarians for Data-Intensive Research

Friday, August 10th, 2012

Data-Intensive Librarians for Data-Intensive Research by Chelcie Rowell.

From the post:

A packed house heard Tony Hey and Clifford Lynch present on The Fourth Paradigm: Data-Intensive Research, Digital Scholarship and Implications for Libraries at the 2012 ALA Annual Conference.

Jim Gray coined The Fourth Paradigm in 2007 to reflect a movement toward data-intensive science. Adapting to this change would, Gray noted, require an infrastructure to support the dissemination of both published work and underlying research data. But the return on investment for building the infrastructure would be to accelerate the transformation of raw data to recombined data to knowledge.

In outlining the current research landscape, Hey and Lynch underscored how right Gray was.

Hey led the audience on a whirlwind tour of how scientific research is practiced in the Fourth Paradigm. He showcased several projects that manage data from capture to curation to analysis and long-term preservation. One example he mentioned was the Dataverse Network Project that is working to preserve diverse scholarly outputs from published work to data, images and software.

Lynch reflected on the changing nature of the scientific record and the different collaborative structures that will be needed to define, generate and preserve that record. He noted that we tend to think of the scholarly record in terms of published works. In light of data-intensive science, Lynch said the definition must be expanded to include the datasets which underlie results and the software required to render data.

I wasn’t able to find a video of the presentations and/or slides but while you wait for those to appear, you can consult the homepages of Lynch and Hey for related materials.

Librarians already have searching and bibliographic skills, which are appropriate to the Fourth Paradigm.

What if they were to add big data design, if not processing, skills to their resumes?

What if articles in professional journals carried a byline in addition to the authors: Librarian(s): ?

NSF, NIH to Hold Webinar on Big Data Solicitation

Monday, April 30th, 2012

NSF, NIH to Hold Webinar on Big Data Solicitation by Erwin Gianchandani.

Guidance on BIGDATA Solicitation <= $25 Million

Webinar: Tuesday, May 8th, from 11am to 12pm ET. Registration closes 11:59pm PDT on Monday, May 7th.

From the post:

Late last month, the Administration unveiled a $200 million Big Data R&D Initiative, committing new funding to improve “our ability to extract knowledge and insights from large and complex collections of digital data.” The initiative includes a joint solicitation by the National Science Foundation (NSF) and National Institutes of Health (NIH), providing up to $25 million for Core Techniques and Technologies for Advancing Big Data Science and Engineering (BIGDATA).

Now NSF and NIH have announced a webinar “to describe the goals and focus of the BIGDATA solicitation, help investigators understand its scope, and answer any questions potential Principal Investigators (PIs) may have.”

The webinar will take place next week, on Tuesday, May 8th, from 11am to 12pm ET.

So, how clever are you really?

(The post has links to other materials you probably need to read before the webinar.)

Google in the World of Academic Research (Lead by Example?)

Thursday, April 5th, 2012

Google in the World of Academic Research by Whitney Grace.

From the post:

Librarians, teachers, and college professors all press their students not to use Google to research their projects, papers, and homework, but it is a dying battle. All students have to do is type in a few key terms and millions of results are displayed. The average student or person, for that matter, is not going to scour through every single result. If they do not find what they need, they simply rethink their initial key words and hit the search button again.

The Hindu recently wrote about, “Of Google and Scholarly Search,” the troubles researchers face when they only use Google and makes several suggestions for alternate search engines and databases.

The perennial complaint (academics used to debate the perennial philosophy, now the perennial complaint).

Is Google responsible for superficial searching and consequently superficial results?

Or do superficial Google results reflect our failure to train students in “doing” research?

What research models do students have to follow? In terms of research behavior?

In my next course, I will do a research problem by example. Good as well as bad results. What worked and what didn’t. And yes, Google will be in the mix of methods.

Why not? With four- and five-word queries and domain knowledge, I get pretty good results from Google. You?

Research Tip: Conference Proceedings (ACM DL)

Monday, January 2nd, 2012

To verify the expansion of the acronyms for Jeff Huang’s Best Paper Awards in Computer Science [2011], I used the ACM Digital Library.

If the conference is listed under conferences in the Digital Library, following the link results in a listing of the top ten (10) paper downloads in the last six (6) weeks and the top ten (10) “most cited article” listings.

Be aware it isn’t always the most recent papers that are the most downloaded.

Another way to keep abreast of what is of interest in a particular area of computing.

Lifting the veil on my “system”

Sunday, December 11th, 2011

Lifting the veil on my “system” by Meredith Farkas.

From the post:

I am a huge fan of research log and research process reflection assignments. Because research is a means to an end (the paper) and because people are often doing it in a rush, there is little reflection on process. What worked? What didn’t? What can I take from this experience for the next time I have to do something similar? Because this reflection is not usually written into the curriculum, students don’t learn enough from their mistakes or even the good things they did. Having a research log helps students become better researchers in the future and, most importantly, helps them to develop a “system” that works for them.

I definitely remember the many years that I did not have a system for research and writing. Most reference librarians have probably encountered a frantic student who realizes just before his/her paper is due that s/he can’t track down some of the sources they need to cite. Yeah, that was me (though I would have been too embarrassed to come to the reference desk). I probably never followed the same path twice and wasted a lot of time doing things over again because I wasn’t organized. Looking back, I wish a nice librarian had provided a session for me on developing a system for finding, organizing, reading and synthesizing information, because I wasted a lot of time and sweat needlessly.

What do you think? Would a topic mapping tool do better? Worse? About the same?

While you are at it, give Meredith some feedback as well.

Real scientists never report fraud

Saturday, November 12th, 2011

Real scientists never report fraud

Daniel Lemire writes (in part):

People who want to believe that “peer reviewed work” means “correct work” will object that this is just one case. But what about the recently dismissed Harvard professor Marc Hauser? We find exactly the same story. Marc Hauser published over 200 papers in the best journals, making up data as he went. Again colleagues, journals and collaborators failed to openly challenge him: it took naive students, that is, outsiders, to report the fraud.

While I agree that other “professionals” may not have time to closely check work in the peer review process (see some of the comments), I think that illustrates the valuable role that students can play in the publication process.

Why not have a departmental requirement that papers for publication be circulated among students with an anonymous but public comment mechanism? Students are as pressed for time as anyone but they have the added incentive of wanting to become skilled at criticism of ideas and writing.

Not only would such a review process increase the likelihood of detection of fraud, but it would catch all manner of poor writing or citation practices. I regularly encounter published CS papers that incorrectly cite other published work or that cite work eventually published but under other titles. No fraud, just poor practices.

Information Literacy 2.0

Friday, November 4th, 2011

Information Literacy 2.0 by Meredith Farkas.

From the post:

Critical inquiry in the age of social media

Ideas about information literacy have always adapted to changes in the information environment. The birth of the web made it necessary for librarians to shift more towards teaching search strategies and evaluation of sources. The tool-focused “bibliographic instruction” approach was later replaced by the skill-focused “information literacy” approach. Now, with the growth of Web 2.0 technologies, we need to start shifting towards providing instruction that will enable our patrons to be successful information seekers in the Web 2.0 environment, where the process of evaluation is quite a bit more nuanced.

Critical inquiry skills are among the most important in a world in which the half-life of information is rapidly shrinking. These days, what you know is almost less important than what you can find out. And finding out today requires a set of skills that are very different from what most libraries focus on. In addition to academic sources, a huge wealth of content is being produced by people every day in knowledgebases like Wikipedia, review sites like Trip Advisor, and in blogs. Some of this content is legitimate and valuable—but some of it isn’t.

While I agree with Meredith that evaluation of information is a critical skill, I am less convinced that it is a new one. Research, even pre-Internet, was never about simply finding resources for the purpose of citation. There always was an evaluative aspect with regard to sources.

I was able to take a doctoral seminar in research methods for Old Testament students that taught critical evaluation of resources. I don’t remember the text off hand but we were reading a transcription of a cuneiform text which had a suggested “emendation” (think added characters) for a broken place in the text. The professor asked whether we should accept the “emendation” or not and on what basis we would make that judgement. The article was by a known scholar so of course we argued about the “emendation” but never asked one critical question: What about the original text? The source the scholar was relying upon.

The theology library had a publication with an image of the text that we reviewed for the next class. Even though it was only a photograph, it was clear that you might get one, maybe two characters in the broken space of the text, but there was no way you would have the five or six required by the “emendation.”

We were told to never rely upon quotations, transcriptions of texts, etc., unless there was simply no way to verify the source. Not that many of us do that in practice but that is the ideal. There is even less excuse for relying on quotations and other secondary materials now that so many primary materials are easy to access online and more are coming online every day.

I think the lesson of information literacy 2.0 should be critical evaluation of information but, as part of that evaluation, to seek out the sources of the information. You would be surprised how many times what an author said is not what they are quoted as saying, when read in the context of the original.

Don’t trust your instincts

Wednesday, September 14th, 2011

Pennebaker is a word counter whose first rule is: “Don’t trust your instincts.”

Why? In part because our expectations shape our view of the data. (sound familiar?)

The review quotes the Drudge Report as posting a headline about President Obama that reads: “I ME MINE: Obama praises C.I.A. for bin Laden raid – while saying ‘I’ 35 Times.”

If the listener thinks President Obama is self-centered, the “I’s” have it as it were.

But Pennebaker has used his programs to mindlessly count usage of words in press conferences since Truman. Obama is the lowest I-word user of modern presidents.

That is only one illustration of how badly we can “look” at text or data and get it seriously wrong.
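If you want to try mindless counting on your own texts, here is a minimal sketch in Python. It is not Pennebaker’s program. The word list (I_WORDS) and the function name (i_word_rate) are my own assumptions for illustration, and a serious count would need a more careful definition of what qualifies as an “I-word.”

```python
import re

# Hypothetical list of first-person singular forms, chosen for illustration.
# This is an assumption, not Pennebaker's actual category definition.
I_WORDS = {"i", "me", "my", "mine", "myself", "i'm", "i've", "i'll", "i'd"}


def i_word_rate(text: str) -> float:
    """Return first-person singular words as a percentage of all words."""
    # Normalize curly apostrophes so "I’m" and "I'm" are counted alike.
    normalized = text.lower().replace("\u2019", "'")
    # A word is a run of letters, optionally followed by one apostrophe part.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", normalized)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in I_WORDS)
    return 100.0 * hits / len(words)


if __name__ == "__main__":
    # Invented sample sentences, not real transcripts.
    samples = {
        "speaker A": "I think my plan is mine",
        "speaker B": "We did what the team asked us to do",
    }
    for name, text in samples.items():
        print(f"{name}: {i_word_rate(text):.1f}% I-words")
```

The point of reporting a rate rather than a raw count is the one the headline misses: thirty-five uses of “I” means little until you know how many words were spoken in total.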

The Secret Life of Pronouns website has exercises to demonstrate how badly we get things wrong. (The videos are very entertaining.)

What does that mean for topic maps and authoring topic maps?

1. Don’t trust your instincts. (courtesy of Pennebaker)
2. View your data in different ways, ask unexpected questions.
3. Ask people unfamiliar with your data how they view it.
4. Read books on subjects you know nothing about. (Just general good advice.)
5. Ask known unconventional people to question your data/subjects. (Like me! Sorry, consulting plug.)

A Workflow for Digital Research Using Off-the-Shelf Tools

Monday, August 15th, 2011

A Workflow for Digital Research Using Off-the-Shelf Tools by William J. Turkel.

An excellent overview of useful tools for digital research.

One or more of these will be useful in authoring your next topic map.