Archive for the ‘WWW’ Category

NSA — Untangling the Web: A Guide to Internet Research

Wednesday, May 15th, 2013

NSA — Untangling the Web: A Guide to Internet Research

A Freedom of Information Act (FOIA) request caused the NSA to disgorge its guide to web research, which is some six years out of date.

From the post:

The National Security Agency just released “Untangling the Web,” an unclassified how-to guide to Internet search. It’s a sprawling document, clocking in at over 650 pages, and is the product of many years of research and updating by a NSA information specialist whose name is redacted on the official release, but who is identified as Robyn Winder of the Center for Digital Content on the Freedom of Information Act request that led to its release.

It’s a droll document on many levels. First and foremost, it’s funny to think of officials who control some of the most sophisticated supercomputers and satellites ever invented turning to a .pdf file for tricks on how to track down domain name system information on an enemy website. But “Untangling the Web” isn’t for code-breakers or wire-tappers. The target audience seems to be staffers looking for basic factual information, like the preferred spelling of Kazakhstan, or telephonic prefix information for East Timor.

I take it as guidance on how “good” your application or service needs to be to pitch to the government.

I keep thinking to attract government attention, an application needs to fall just short of solving P = NP?

On the contrary, the government needs spell checkers, phone information and no doubt lots of other dull information, quickly.

Perhaps an app that signals fresh doughnuts from bakeries within X blocks would be just the thing. ;-)

Seventh ACM International Conference on Web Search and Data Mining

Monday, May 13th, 2013

WSDM 2014 : Seventh ACM International Conference on Web Search and Data Mining

Abstract submission deadline: August 19, 2013
Paper submission deadline: August 26, 2013
Tutorial proposals due: September 9, 2013
Tutorial and paper acceptance notifications: November 25, 2013
Tutorials: February 24, 2014
Main Conference: February 25-28, 2014

From the call for papers:

WSDM (pronounced “wisdom”) is one of the premier conferences covering research in the areas of search and data mining on the Web. The Seventh ACM WSDM Conference will take place in New York City, USA during February 25-28, 2014.

WSDM publishes original, high-quality papers related to search and data mining on the Web and the Social Web, with an emphasis on practical but principled novel models of search, retrieval and data mining, algorithm design and analysis, economic implications, and in-depth experimental analysis of accuracy and performance.

WSDM 2014 is a highly selective, single track meeting that includes invited talks as well as refereed full papers. Topics covered include but are not limited to:

(…)

Papers emphasizing novel algorithmic approaches are particularly encouraged, as are empirical/analytical studies of specific data mining problems in other scientific disciplines, in business, engineering, or other application domains. Application-oriented papers that make innovative technical contributions to research are welcome. Visionary papers on new and emerging topics are also welcome.

Authors are explicitly discouraged from submitting papers that do not present clearly their contribution with respect to previous works, that contain only incremental results, and that do not provide significant advances over existing approaches.

Sets a high bar but one that can be met.

Would be very nice PR to have a topic map paper among those accepted.

Vote for Web Science MOOC!

Wednesday, May 1st, 2013

Please help me to realize my Web science massive open online course by René Pickhardt.

René has designed a Web Science MOOC but needs your vote at: https://moocfellowship.org/submissions/web-science to get the course funded.

Details on the course are at: Please help me to realize my Web science massive open online course.

The Web is important but to be honest, I am hopeful success here will encourage René to do a MOOC on graphs.

So I have an ulterior motive for promoting this particular MOOC. ;-)

Ultimate library challenge: taming the internet

Saturday, April 6th, 2013

Ultimate library challenge: taming the internet by Jill Lawless.

From the post:

Capturing the unruly, ever-changing internet is like trying to pin down a raging river. But the British Library is going to try.

For centuries, the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting on Saturday, it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s “digital memory”.

As if that’s not a big enough task, the library also has to make this digital archive available to future researchers – come time, tide or technological change.

The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole. The library says firsthand accounts of everything from the 2005 London transit bombings to Britain’s 2010 election campaign have already vanished.

“Stuff out there on the web is ephemeral,” said Lucie Burgess, the library’s head of content strategy. “The average life of a web page is only 75 days, because websites change, the contents get taken down.

“If we don’t capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost.”

For more details, see Jill’s post, Click to save the nation’s digital memory (British Library press release), or 100 websites: Capturing the digital universe (sample of results of archiving with only 100 sites).

The content gathered by the project will be made available to the public.

A welcome venture, particularly since the results will be made available to the public.

An unanswerable question but I do wonder how we would view Greek drama if all of it had been preserved?

Hundreds if not thousands of plays were written and performed every year.

The Complete Greek Drama lists only forty-seven (47) that have survived to this day.

If wholesale preservation is the first step, how do we preserve paths to what’s worth reading in a data labyrinth as a second step?

I first saw this in a tweet by Jason Ronallo.

Our Internet Surveillance State [Intelligence Spam]

Tuesday, March 26th, 2013

Our Internet Surveillance State by Bruce Schneier.

Nothing like a good rant to get your blood pumping during a snap of cold weather! ;-)

Bruce writes:

Maintaining privacy on the Internet is nearly impossible. If you forget even once to enable your protections, or click on the wrong link, or type the wrong thing, and you’ve permanently attached your name to whatever anonymous service you’re using. Monsegur slipped up once, and the FBI got him. If the director of the CIA can’t maintain his privacy on the Internet, we’ve got no hope.

In today’s world, governments and corporations are working together to keep things that way. Governments are happy to use the data corporations collect — occasionally demanding that they collect more and save it longer — to spy on us. And corporations are happy to buy data from governments. Together the powerful spy on the powerless, and they’re not going to give up their positions of power, despite what the people want.

And welcome to a world where all of this, and everything else that you do or is done on a computer, is saved, correlated, studied, passed around from company to company without your knowledge or consent; and where the government accesses it at will without a warrant.

Welcome to an Internet without privacy, and we’ve ended up here with hardly a fight.

I don’t disagree with anything Bruce writes but I do not counsel despair.

Nor would I suggest anyone stop using the “Internet, email, cell phones, web browser, social networking sites, search engines,” in order to avoid spying.

But remember that one of the reasons U.S. intelligence services have fallen on hard times is the increased reliance on “easy” data to collect.

Clipping articles from newspapers, or now copy-n-paste from emails and online zines, isn’t the same as having culturally aware human resources on the ground.

“Easy” data collection is far cheaper, but also less effective.

My suggestion is that everyone go “bare” and load up all listeners with as much junk as humanly possible.

Intelligence “spam” as it were.

Routinely threaten to murder fictitious characters in books or conspire to kidnap them. Terror plots, threats against Alderaan, for example.

Apparently even absurd threats, ‘One Definition of “Threat”,’ cannot be ignored.

A proliferation of fictional threats will leave them too little time to spy on people going about their lawful activities.

BTW, not legal advice but I have heard that directly communicating any threat to any law enforcement agency is a crime. And not a good idea in any event.

Nor should you threaten any person or place or institution that isn’t entirely and provably fictional.

When someone who thinks mining social network sites is a blow against terrorism overhears DC comic characters being threatened, that should be enough.

Aaron Swartz’s A Programmable Web: An Unfinished Work

Wednesday, March 13th, 2013

Aaron Swartz’s A Programmable Web: An Unfinished Work

Abstract:

This short work is the first draft of a book manuscript by Aaron Swartz written for the series “Synthesis Lectures on the Semantic Web” at the invitation of its editor, James Hendler. Unfortunately, the book wasn’t completed before Aaron’s death in January 2013. As a tribute, the editor and publisher are publishing the work digitally without cost.

From the author’s introduction:

“. . . we will begin by trying to understand the architecture of the Web — what it got right and, occasionally, what it got wrong, but most importantly why it is the way it is. We will learn how it allows both users and search engines to co-exist peacefully while supporting everything from photo-sharing to financial transactions.

We will continue by considering what it means to build a program on top of the Web — how to write software that both fairly serves its immediate users as well as the developers who want to build on top of it. Too often, an API is bolted on top of an existing application, as an afterthought or a completely separate piece. But, as we’ll see, when a web application is designed properly, APIs naturally grow out of it and require little effort to maintain.

Then we’ll look into what it means for your application to be not just another tool for people and software to use, but part of the ecology — a section of the programmable web. This means exposing your data to be queried and copied and integrated, even without explicit permission, into the larger software ecosystem, while protecting users’ freedom.

Finally, we’ll close with a discussion of that much-maligned phrase, ‘the Semantic Web,’ and try to understand what it would really mean.”

Table of Contents: Introduction: A Programmable Web / Building for Users: Designing URLs / Building for Search Engines: Following REST / Building for Choice: Allowing Import and Export / Building a Platform: Providing APIs / Building a Database: Queries and Dumps / Building for Freedom: Open Data, Open Source / Conclusion: A Semantic Web?

Even if you disagree with Aaron, on issues both large and small, as I do, it is a very worthwhile read.

But I will save my disagreements for another day. Enjoy the read!

Click Dataset [HTTP requests]

Tuesday, January 22nd, 2013

Click Dataset

From the webpage:

To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (‘Click Dataset’) of HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests.

Data available under terms and restrictions, including transfer by physical hard drive (~ 2.5 TB of data).
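The regular-expression step described above is easy to picture with a small sketch. This is only an illustration in Python, not the project's actual code: the patterns, the `parse_get` name and the sample payload are my own assumptions, and a real collector would sit behind a packet capture library rather than a byte string.

```python
import re

# Hypothetical patterns (not from the Click Dataset project):
# the request line of an HTTP GET at the very start of a TCP payload,
# and the two headers needed to rebuild a referrer -> target edge.
REQUEST_LINE = re.compile(rb"^GET (\S+) HTTP/1\.[01]\r\n")
HEADER = re.compile(rb"^(Host|Referer): *(.*?)\r?$", re.IGNORECASE | re.MULTILINE)


def parse_get(payload: bytes):
    """Return path, host and referer for a GET request, or None otherwise."""
    match = REQUEST_LINE.match(payload)
    if match is None:
        return None  # not an HTTP GET, so the collector would skip it
    record = {"path": match.group(1).decode("latin-1"), "host": None, "referer": None}
    for name, value in HEADER.findall(payload):
        record[name.decode("ascii").lower()] = value.decode("latin-1").strip()
    return record


sample = (
    b"GET /wiki/Web HTTP/1.1\r\n"
    b"Host: en.wikipedia.org\r\n"
    b"Referer: http://example.org/start\r\n"
    b"\r\n"
)
print(parse_get(sample))
```

The referer and host/path pair in each record is one edge of the “subset of the Web graph actually traversed by users.”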

Intrigued by the notion of a “subset of the Web graph actually traversed by users.”

Does that mean that semantic annotation should occur on the portion of the “…Web graph actually traversed by users” before reaching other parts?

If the language of 4,148,237 English Wikipedia pages is never in doubt for any user, do we really need triples to record that for every page?

Common Crawl URL Index

Thursday, January 10th, 2013

Common Crawl URL Index by Lisa Green.

From the post:

We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io graciously donated his time and skills to creating this valuable tool. You can read his guest blog post below and be sure to check out the triv.io site to learn more about how they help groups solve big data problems.

From Scott’s post:

If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, that first starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster , would agree.

Which is great news! However if you wanted to extract only a small subset, say every page from Wikipedia you still would have to pay that few hundred dollars. The individual pages are randomly distributed in over 200,000 archive files, which you must download and unzip each one to find all the Wikipedia pages. Well you did, until now.

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).
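To see why an index keyed on URL makes lookups by domain, subdomain or TLD cheap, here is a minimal sketch in Python. It assumes keys are built with the host labels reversed so that related URLs sort next to each other. That key layout, the function names and the `segment:offset` locations are my own illustration, not a description of the actual Common Crawl index format.

```python
from bisect import bisect_left
from urllib.parse import urlsplit


def index_key(url: str) -> str:
    """Build a sortable key: host labels reversed, followed by the path."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + (parts.path or "/")


def build_index(entries):
    """entries: iterable of (url, location) pairs; returns a sorted list."""
    return sorted((index_key(url), location) for url, location in entries)


def lookup(index, prefix: str):
    """Return every (key, location) pair whose key starts with prefix."""
    start = bisect_left(index, (prefix,))  # first key >= prefix
    hits = []
    for key, location in index[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        hits.append((key, location))
    return hits


# Hypothetical archive locations, purely for illustration.
index = build_index([
    ("http://en.wikipedia.org/wiki/Web", "segment1:100"),
    ("http://example.com/", "segment2:5"),
    ("http://de.wikipedia.org/wiki/Netz", "segment1:900"),
    ("http://example.org/a", "segment3:7"),
])
print(lookup(index, "org.wikipedia."))  # every Wikipedia subdomain
print(lookup(index, "org."))            # the whole .org TLD
```

Ending the prefix with a dot or slash avoids also matching unrelated hosts that merely begin with the same letters.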

What research project would you want to do first?

The Top 5 Website UX Trends of 2012

Saturday, December 29th, 2012

The Top 5 Website UX Trends of 2012

From the post:

User interface techniques continued to evolve in 2012, often blurring the lines between design, usability, and technology in positive ways to create an overall experience that has been both useful and pleasurable.

Infinite scrolling, for example, is a technological achievement that also helps the user by enabling a more seamless experience. Similarly, advances in Web typography have an aesthetic dimension but also represent a movement toward greater clarity of communication.

Quick coverage of:

  1. Single-Page Sites
  2. Infinite Scrolling
  3. Persistent Top Navigation or “Sticky Nav”
  4. The Death of Web 2.0 Aesthetics
  5. Typography Returns

Examples of each trend but you are left on your own for the details.

Good time to review your web presence for the coming year.

10 Rules for Persistent URIs [Actually only one] Present of Persistent URIs

Monday, December 24th, 2012

Interoperability Solutions for European Public Administrations got into the egg nog early:

D7.1.3 – Study on persistent URIs, with identification of best practices and recommendations on the topic for the MSs and the EC (PDF) (I’m not kidding, go see for yourself.)

Five (5) positive rules:

  1. Follow the pattern: http://(domain)/(type)/(concept)/(reference)
  2. Re-use existing identifiers
  3. Link multiple representations
  4. Implement 303 redirects for real-world objects
  5. Use a dedicated service

Five (5) negative rules:

  1. Avoid stating ownership
  2. Avoid version numbers
  3. Avoid using auto-increment
  4. Avoid query strings
  5. Avoid file extensions
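Several of these rules are purely about the shape of the URI, so they can be checked mechanically. The sketch below, in Python, is my own rough reading of the pattern rule and three of the negative rules. The `check_uri` name and the heuristics for spotting extensions and version numbers are assumptions, and rules such as ownership or auto-increment cannot be detected from syntax alone.

```python
import re
from urllib.parse import urlsplit


def check_uri(uri: str) -> list:
    """Return a list of rule violations; an empty list means no problem found."""
    problems = []
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    # Positive rule 1: http://(domain)/(type)/(concept)/(reference)
    if len(segments) != 3:
        problems.append("path is not /(type)/(concept)/(reference)")
    # Negative rule 4: avoid query strings
    if parts.query:
        problems.append("query string present")
    # Negative rule 5: avoid file extensions (heuristic: short suffix after a dot)
    if segments and re.search(r"\.[A-Za-z0-9]{1,5}$", segments[-1]):
        problems.append("file extension present")
    # Negative rule 2: avoid version numbers (heuristic: segments like v2 or v1.0)
    if any(re.fullmatch(r"v\d+(\.\d+)*", s, re.IGNORECASE) for s in segments):
        problems.append("version number present")
    return problems


print(check_uri("http://example.org/id/road/A123"))
print(check_uri("http://example.org/doc/v2/report.html?lang=en"))
```

Note that nothing in such a check says anything about whether the URI will still resolve in ten years.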

If the goal is “persistent” URIs, only “Use a dedicated service” has any relationship to making a URI “persistent.”

That is, five (5) or ten (10) years from now, a URI used as an identifier will return the same value as today.

The other nine rules have no relationship to persistence. Good arguments can be made for some of them, but persistence isn’t one of them.

Why the report hides behind the rhetoric of persistence I cannot say.

But you can satisfy yourself that only a “dedicated service” can persist a URI, whatever its form.

W3C confusion over identifiers and locators for web resources continues to plague this area.

There isn’t anything particularly remarkable about using a URI as an identifier. So long as it is understood that URI identifiers are just like any other identifier.

That is they can be indexed, annotated, searched for and returned to users with data about the object of the identification.

Viewed that way, the fact that once upon a time there was a resource at the location specified by a URI has little or nothing to do with the persistence of that URI.

So long as we have indexed the URI, that index can serve as a resolution of that URI/identifier for as long as the index persists. With additional information should we choose to create and provide it.

The EU document concedes as much when it says:

Without exception, all the use cases discussed in section 3 where a policy of URI persistence has been adopted, have used a dedicated service that is independent of the data originator. The Australian National Data Service uses a handle resolver, Dublin Core uses purl.org, services, data.gov.uk and publications.europa.eu are all also independent of a specific government department and could readily be transferred and run by someone else if necessary. This does not imply that a single service should be adopted for multiple data providers. On the contrary – distribution is a key advantage of the Web. It simply means that the provision of persistent URIs should be independent of the data originator.

That is if you read: “…independent of the data originator” to mean independent of a particular location on the WWW.

No changes in form, content, protocols, server software, etc., required. And you get persistent URIs.
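The idea of an index standing in as the resolver for URI identifiers can be sketched in a few lines of Python. Everything here (the `IdentifierIndex` class, its methods, the example URI) is hypothetical and only meant to show that resolution depends on the index, not on whatever once lived at the URI's location.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """What the index knows about one URI used as an identifier."""
    description: str
    annotations: list = field(default_factory=list)


class IdentifierIndex:
    """Hypothetical index that resolves URIs as identifiers, not locators."""

    def __init__(self):
        self._entries = {}

    def register(self, uri: str, description: str) -> None:
        self._entries[uri] = Entry(description)

    def annotate(self, uri: str, note: str) -> None:
        # Additional information, should we choose to create and provide it.
        self._entries[uri].annotations.append(note)

    def resolve(self, uri: str):
        """Return the entry for uri, or None if it was never indexed."""
        return self._entries.get(uri)


index = IdentifierIndex()
index.register("http://example.org/id/road/A123", "The A123 road")
index.annotate("http://example.org/id/road/A123", "Original server retired")
print(index.resolve("http://example.org/id/road/A123"))
```

The URI keeps resolving for as long as the index persists, even after the original server is gone.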

Merry Christmas to all and to all…, persistent URIs as identifiers (not locators)!

(I first saw this at: New Report: 10 Rules for Persistent URIs)

HTML5 and Canvas 2D – Feature Complete

Tuesday, December 18th, 2012

HTML5 and Canvas 2D have been released as feature complete drafts.

Not final but a stable target for development.

If you are interested in “testimonials,” see: HTML5 Definition Complete, W3C Moves to Interoperability Testing and Performance

Personally I prefer the single page HTML versions:

HTML5 single page version.

The Canvas 2D draft is already a single page version.

Now would be a good time to begin working on how you will use HTML5 and Canvas 2D for delivery of topic map based information.

BigMLer in da Cloud: Machine Learning made even easier [Amateur vs. Professional Models]

Sunday, December 9th, 2012

BigMLer in da Cloud: Machine Learning made even easier by Martin Prats.

From the post:

We have open-sourced BigMLer, a command line tool that will let you create predictive models much easier than ever before.

BigMLer wraps BigML’s API Python bindings to offer a high-level command-line script to easily create and publish Datasets and Models, create Ensembles, make local Predictions from multiple models, and simplify many other machine learning tasks. BigMLer is open sourced under the Apache License, Version 2.0.

“…will let you create predictive models much easier than ever before.”

Well…., true, but the amount of effort you invest in a predictive model has a relationship to the usefulness of the model for some given purpose.

It is a great idea to create an easy “on ramp” to introduce machine learning. But it may lead some users to confuse “…easier than ever before” models with professionally crafted models.

An old friend confided their organization was about to write a classification system for a well-known subject. Exciting to think they will put all past errors to rest while adding new capabilities.

But in reality librarians have labored in such areas for centuries. It isn’t a good target for a start-up project. Particularly for those innocent of existing classification systems and the theory/praxis that drove their creation.

Librarians didn’t invent the Internet. If they had, we wouldn’t be searching for ways to curate information on the Internet, in a backwards compatible way.

Linking Web Data for Education Project [Persisting Heterogeneity]

Friday, November 30th, 2012

Linking Web Data for Education Project

From the about page:

LinkedUp aims to push forward the exploitation of the vast amounts of public, open data available on the Web, in particular by educational institutions and organizations.

This will be achieved by identifying and supporting highly innovative large-scale Web information management applications through an open competition (the LinkedUp Challenge) and dedicated evaluation framework. The vision of the LinkedUp Challenge is to realise personalised university degree-level education of global impact based on open Web data and information. Drawing on the diversity of Web information relevant to education, ranging from Open Educational Resources metadata to the vast body of knowledge offered by the Linked Data approach, this aim requires overcoming substantial challenges related to Web-scale data and information management involving Big Data, such as performance and scalability, interoperability, multilinguality and heterogeneity problems, to offer personalised and accessible education services. Therefore, the LinkedUp Challenge provides a focused scenario to derive challenging requirements, evaluation criteria, benchmarks and thresholds which are reflected in the LinkedUp evaluation framework. Information management solutions have to apply data and learning analytics methods to provide highly personalised and context-aware views on heterogeneous Web data.

Before linked data, we had: “…interoperability, multilinguality and heterogeneity problems….”

After linked data, we have: “…interoperability, multilinguality and heterogeneity problems….” + linked data (with heterogeneity problems).

Not unexpected but still need a means of resolution. Topic maps anyone?

The personal cloud series

Sunday, October 21st, 2012

The personal cloud series by Jon Udell.

Excellent source of ideas on the web/cloud as we experience it today and as we may experience it tomorrow.

Going through prior posts now and will call some of them out for further discussion.

Which ones impress you the most?

Creating Your First HTML 5 Web Page [HTML5 - Feature Freeze?]

Saturday, August 18th, 2012

Creating Your First HTML 5 Web Page by Michael Dorf.

From the post:

Whether you have been writing web pages for a while or you are new to writing HTML, the new HTML 5 elements are still within your reach. It is important to learn how HTML 5 works since there are many new features that will make your pages better and more functional. Once you get your first web page under your belt you will find that they are very easy to put together and you will be on your way to making many more.

To begin, take a look at this base HTML page we will be working with. This is just a plain-ol’ HTML page, but we can start adding HTML5 elements to jazz it up!

But that’s not why I am posting it here. ;-)

A little later Michael says:

The new, simple DOCTYPE is much easier to remember and use than previous versions. The W3C is trying to stop versioning HTML so that backwards compatibility will become easier, so there are “technically” no more versions of HTML.

I’m not sure I follow on “…to stop versioning HTML so that backwards compatibility will become easier….”

Unless that means that HTML (5 I assume) is going into a feature/semantic freeze?

That would promote backwards compatibility but I am not sure it is a good solution.

Just curious if you have heard the same?

Comments?

Does Time Fix All? [And my response]

Saturday, August 18th, 2012

Does Time Fix All? by Daniel Lemire, starts off:

As an graduate, finding useful references was painful. What the librarians had come up with were terrible time-consuming systems. It took an outsider (Berners-Lee) to invent the Web. Even so, the librarians were slow to adopt the Web and you could often see them warn students against using the Web as part of their research. Some of us ignored them and posted our papers online, or searched for papers online. Many, many years later, we are still a crazy minority but a new generation of librarians has finally adopted the Web.

What do you conclude from this story?

Whenever you point to a difficult systemic problem (e.g., it is time consuming to find references), someone will reply that “time fixes everything”. A more sophisticated way to express this belief is to say that systems are self-correcting.

Here is my response:

From above: “… What the librarians had come up with were terrible time-consuming systems. It took an outsider (Berners-Lee) to invent the Web….”

Really?

You mean the librarians who had been working on digital retrieval since the late 1940s and subject retrieval longer than that? Those librarians?

With the web, every user repeats the search effort of others. Why isn’t repeating the effort of others a “terrible time-consuming system?”

BTW, Berners-Lee invented allowing 404s for hyperlinks. Significant because it lowered the overhead of hyperlinking enough to be practical. It was other CS types with high overhead hyperlinking. Not librarians.

Berners-Lee fixed hyperlinking maintenance, failed and continues to fail on IR. Or have you not noticed?

I won’t amplify my answer here but will wait to see what happens to my comment at Daniel’s blog.

Digital Methods

Sunday, December 4th, 2011

Digital Methods

From the website:

Welcome to the Digital Methods course, which is a focused section of the more expansive Digital Methods wiki. The Digital Methods course consists of seven units with digital research protocols, specially developed tools, tutorials as well as sample projects. In particular this course is dedicated to how else links, Websites, engines and other digital objects and spaces may be studied, if methods were to follow the medium, as opposed to importing standard methods from the social sciences more generally, including surveys, interviews and observation. Here digital methods are central. Short literature reviews are followed by distinctive digital methods approaches, step-by-step guides and exemplary projects.

Jack Park forwarded this link. A site that merits careful exploration. You will find things that you did not expect. Much like using the WWW. ;-)

Curious what parts of it you find to be the most useful/interesting?

The section on digital tools is my current favorite. I suspect that may change as I continue to explore the site.

Enjoy!