Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 10, 2012

Apache Nutch v1.6 and Apache Nutch v2.1 Releases

Filed under: Gora,HBase,Nutch,Search Engines,Solr — Patrick Durusau @ 10:45 am

Apache Nutch v1.6 Released

From the news:

The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes, the same in improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8. Please see the list of changes or the release report made in this version for a full breakdown. The release is available here.

See the Nutch 1.x tutorial.

Apache Nutch v2.1 Released

From the news:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.1. This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in elastic search. Please see the list of changes made in this version for a full breakdown. The release is available here.

See the Nutch 2.x tutorial.

I haven’t done a detailed comparison but roughly, Nutch 1.x keeps crawl data in its own CrawlDB/segment structures and relies upon Solr for indexing, while Nutch 2.x abstracts storage through Apache Gora, typically backed by HBase.

Surprised that isn’t in the FAQ.

Perhaps I will investigate further and offer a short summary of the differences.

December 9, 2012

Neo4j 1.9 M02 – Under the Hood

Filed under: Graphs,Neo4j,Networks — Patrick Durusau @ 8:28 pm

Neo4j 1.9 M02 – Under the Hood by Peter Neubauer.

From the post:

We have been working hard over the last weeks to tune and improve many aspects in the Neo4j internals, to deliver an even faster, more stable and less resource intensive graph database in this 1.9.M02 milestone release. Those efforts span a lot of areas that benefit everyone from the typical developer to sysops and to most other Neo4j users.

We are thrilled about the feedback we got from customers, and our community via Google Groups, Stack Overflow and Twitter. Thanks for helping us improve.

While the new changes might not be visible at the first glance, let’s look into Neo4j’s engine room to see what has changed.

Everyone’s most beloved query language, Cypher, has matured a lot thanks to Jake and Andres’ incredible work. They have made query execution much faster, for most use-cases, while utilizing less memory. The lazy execution of queries has sneaked away lately, so Andres caught it and put it back in. That means you can run queries with potentially infinitely large result sets without exhausting memory. Especially when streaming results (no aggregation and ordering) it will use only a tiny fraction of your memory. The very frequent construct ORDER BY … LIMIT … now benefits from a better top-n-select algorithm. These latest improvements are closing the performance gap to the core-API even more. We’ve also glimpsed a new internal SPI, that will allow Cypher to run even faster in the future.

Peter gives a quick tour of improvements in the latest milestone release of Neo4j.

Suggest you download the latest version to experiment with while you read Peter’s post.
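To see why the better top-n-select matters, here is a toy Python sketch (not Neo4j code, just the idea) of keeping ORDER BY … LIMIT results in a bounded heap instead of sorting the whole stream:

```python
import heapq
import random

# Simulate a lazily streamed result set of a million rows.
rows = ((random.random(), i) for i in range(1_000_000))

# A full sort would materialize and order every row just to keep ten:
#   top10 = sorted(rows)[:10]
# A top-n select keeps only ten rows in memory while consuming the stream:
top10 = heapq.nsmallest(10, rows)

print(top10[:3])
```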

Fun with Lucene’s faceted search module

Filed under: Faceted Search,Lucene,Search Engines,Searching — Patrick Durusau @ 8:16 pm

Fun with Lucene’s faceted search module by Mike McCandless.

From the post:

These days faceted search and navigation is common and users have come to expect and rely upon it.

Lucene’s facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice “getting started” examples in his second post.

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I’m sure there are more…

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.

Take some time over the holidays to play with faceted searches in Lucene.
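As a conceptual sketch only (the facet module itself is Java and far richer), counting hierarchical facet paths over a set of hits comes down to aggregating every prefix of each path:

```python
from collections import Counter

# Each hit carries hierarchical facet paths, e.g. Author/Lisa or
# Publish Date/2012/December (illustrative values, not real data).
hits = [
    {"facets": [("Author", "Lisa"), ("Publish Date", "2012", "December")]},
    {"facets": [("Author", "Bob"), ("Publish Date", "2012", "November")]},
    {"facets": [("Author", "Lisa"), ("Publish Date", "2011", "March")]},
]

counts = Counter()
for hit in hits:
    for path in hit["facets"]:
        # Count every prefix so parent facets aggregate their children.
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1

print(counts[("Author", "Lisa")])        # 2
print(counts[("Publish Date", "2012")])  # 2
```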

Humans are difficult [Design by Developer]

Filed under: Design,Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 7:10 pm

Humans are difficult by Kristina Chodorow.

From the post:

My web app, Noodlin, has two basic functions: 1) create notes and 2) connect them, so I tried to make it blindingly obvious how to do both in the UI. Unfortunately, when I first started telling people about it, the first feedback I got was, “how do you create a connection?”

The original scheme: you clicked on the dark border to start a connection.

At that point, the way you created a connection was to click on the border of a note (a dark border would appear around a note when the mouse got close). Okay, so that wasn’t obvious enough, even though the tutorial said, “click on my border to create a connection.” I learned a lesson there: no one reads tutorials. However, I didn’t know what users did expect.

I started trying things: I darkened the color of the border, I had little connections pop out of the edge and follow your mouse as you moved it near a note. I heard from one user that she had tried dragging from one note to another, so I made that work, too. But people were still confused.

So what tripped Kristina up? She has authored two books on MongoDB, among numerous other contributions, so she really knows her stuff.

In a phrase, design by developer.

All of her solutions were perfectly obvious to her, but as you will read in her post, not to her users.

Not a failed commercial software release (you can supply examples of those on your own), but it illustrates a major reason for semantic diversity:

We all view the world from a different set of default settings.

So we react to interfaces differently. The only way to discover which one will work for others is to ask.

BTW, strictly my default view but Kristina’s Noodlin is worth a long look!

BigMLer in da Cloud: Machine Learning made even easier [Amateur vs. Professional Models]

Filed under: Cloud Computing,Machine Learning,WWW — Patrick Durusau @ 5:19 pm

BigMLer in da Cloud: Machine Learning made even easier by Martin Prats.

From the post:

We have open-sourced BigMLer, a command line tool that will let you create predictive models much easier than ever before.

BigMLer wraps BigML’s API Python bindings to offer a high-level command-line script to easily create and publish Datasets and Models, create Ensembles, make local Predictions from multiple models, and simplify many other machine learning tasks. BigMLer is open sourced under the Apache License, Version 2.0.
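To get a sense of what is being wrapped, here is a hedged sketch of the same workflow written directly against the BigML Python bindings (assumes the bigml package, credentials in the environment, and placeholder file and field names):

```python
from bigml.api import BigML

# Credentials are read from BIGML_USERNAME / BIGML_API_KEY.
api = BigML()

source = api.create_source("training_data.csv")   # placeholder file
dataset = api.create_dataset(source)
model = api.create_model(dataset)
api.ok(model)                                      # wait for the model to finish

prediction = api.create_prediction(model, {"field1": "value1"})
print(prediction["object"]["output"])
```

BigMLer collapses roughly that sequence into a single command line.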

“…will let you create predictive models much easier than ever before.”

Well…, true, but the amount of effort you invest in a predictive model bears directly on how useful the model is for a given purpose.

It is a great idea to create an easy “on ramp” to introduce machine learning. But it may lead some users to confuse “…easier than ever before” models with professionally crafted models.

An old friend confided their organization was about to write a classification system for a well-known subject. Exciting to think they will put all past errors to rest while adding new capabilities.

But in reality librarians have labored in such areas for centuries. It isn’t a good target for a start-up project. Particularly for those innocent of existing classification systems and the theory/praxis that drove their creation.

Librarians didn’t invent the Internet. If they had, we wouldn’t be searching for ways to curate information on the Internet, in a backwards compatible way.

Dissing Disqus

Filed under: Blogs — Patrick Durusau @ 2:55 pm

I wanted to advise a blogger of a URL error I found. Was going to simply leave a comment with the correct information.

Imagine my surprise when I tried to authenticate using Twitter to be told that Disqus could:

  • Read Tweets from your timeline.
  • See who you follow, and follow new people.
  • Update your profile.
  • Post Tweets for you.

The “Read Tweets from your timeline.” doesn’t bother me. Mostly because I don’t write anything down I would mind being public. 😉

But, following new people, updating my profile and posting tweets, all for me, what about that doesn’t suck?

If you use Disqus, don’t expect any comments from me.

Abandoning Disqus could make such over-reaching less common.

Autocomplete Search with Redis

Filed under: Authoring Semantics,Authoring Topic Maps,AutoComplete,Redis — Patrick Durusau @ 2:43 pm

Autocomplete Search with Redis

From the post:

When we launched GetGlue HD, we built a faster and more powerful search to help users find the titles they were looking for when they want to check-in to their favorite shows and movies as they typed into the search box. To accomplish that, we used the in-memory data structures of the Redis data store to build an autocomplete search index.

Search Goals

The results we wanted to autocomplete for are a little different than the usual result types. The Auto complete with Redis writeup by antirez explores using the lexicographical ordering behavior of sorted sets to autocomplete for names. This is a great approach for things like usernames, where the prefix typed by the user is also the prefix of the returned results: typing mar could return Mara, Marabel, and Marceline. The deal-breaking limitation is that it will not return Teenagers From Mars, which is what we want our autocomplete to be able to do when searching for things like show and movie titles. To do that, we decided to roll our own autocomplete engine to fit our requirements. (Updated the link to the “Auto complete with Redis” post.)
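A hedged sketch of the word-level variant (not GetGlue’s actual engine; assumes a local Redis server, redis-py and toy data): index every word of every title under each of its prefixes, so typing mar surfaces Teenagers From Mars as well as Marceline.

```python
import redis

r = redis.Redis(decode_responses=True)

def index_title(title_id, title):
    # Index every word of the title under each of its prefixes.
    for word in title.lower().split():
        for i in range(1, len(word) + 1):
            r.zadd("ac:" + word[:i], {title_id: 0})

def autocomplete(prefix, limit=10):
    return r.zrange("ac:" + prefix.lower(), 0, limit - 1)

index_title("title:1", "Teenagers From Mars")
index_title("title:2", "Marceline")
print(autocomplete("mar"))  # both titles match
```

In a real index the scores would reflect popularity rather than 0, and very short prefixes would be trimmed or capped, but the shape of the idea is the same.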

Rather like the idea of autocomplete being more than just string completion.

What if while typing a name, “autocompletion” returns one or more choices for what it thinks you may be talking about? With additional properties/characteristics, you can disambiguate your usage by allowing your editor to tag the term.

Perhaps another way to ease the burden of authoring a topic map.

December 8, 2012

Piccolo: Distributed Computing via Shared Tables

Filed under: Annotation,Distributed Systems,Piccolo — Patrick Durusau @ 7:41 pm

Piccolo: Distributed Computing via Shared Tables

From the homepage:

Piccolo is a framework designed to make it easy to develop efficient distributed applications.

In contrast to traditional data-centric models (such as Hadoop) which present the user a single object at a time to operate on, Piccolo exposes a global table interface which is available to all parts of the computation simultaneously. This allows users to specify programs in an intuitive manner very similar to that of writing programs for a single machine.

Piccolo includes a number of optimizations to ensure that using this table interface is not just easy, but also fast:

Locality
To ensure locality of execution, tables are explicitly partitioned across machines. User code that interacts with the tables can specify a locality preference: this ensures that the code is executed locally with the data it is accessing.
Load-balancing
Not all load is created equal – often some partition of a computation will take much longer than others. Waiting idly for this task to finish wastes valuable time and resources. To address this, Piccolo can migrate tasks away from busy machines to take advantage of otherwise idle workers, all while preserving the locality preferences and the correctness of the program.
Failure Handling
Machine failures are inevitable, and generally occur when you’re at the most critical time in your computation. Piccolo makes checkpointing and restoration easy and fast, allowing for quick recovery in case of failures.
Synchronization
Managing the correct synchronization and update across a distributed system can be complicated and slow. Piccolo addresses this by allowing users to defer synchronization logic to the system. Instead of explicitly locking tables in order to perform updates, users can attach accumulation functions to a table: these are used automatically by the framework to correctly combine concurrent updates to a table entry.

The closer you are to the metal, the more aware you will be of the distributed nature of processing and data.

Will the success of distributed processing/storage be when all but systems architects are unaware of its nature?
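The accumulation-function idea from the Synchronization point above is easy to mimic on one machine. A toy Python sketch (emphatically not Piccolo’s API) of a table that combines concurrent updates with a user-supplied accumulator:

```python
class AccumTable:
    """Toy table that resolves updates with an accumulation function,
    instead of requiring callers to lock, read, modify and write."""

    def __init__(self, accumulate):
        self.accumulate = accumulate
        self.data = {}

    def update(self, key, value):
        if key in self.data:
            self.data[key] = self.accumulate(self.data[key], value)
        else:
            self.data[key] = value

# Word counting: updates to the same key simply sum.
counts = AccumTable(accumulate=lambda old, new: old + new)
for word in "to be or not to be".split():
    counts.update(word, 1)

print(counts.data)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```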

GraphLab vs. Piccolo vs. Spark

Filed under: GraphLab,Graphs,Networks,Piccolo,Spark — Patrick Durusau @ 7:26 pm

GraphLab vs. Piccolo vs. Spark by Danny Bickson.

From the post:

I got an interesting case study from Cui Henggang, a first year graduate student at CMU Parallel Data Lab. Cui implemented GMM on GraphLab, for comparing its performance to Piccolo and Spark. His collaborators on this project were Jinliang Wei and Wei Dai. The algorithm is described in Chris Bishop, Pattern Recognition and Machine Learning, Chapter 9.2, page 438.

Danny reports Cui will be releasing his report and posting his GMM code to the graphical models toolkit (GraphLab).

I will post a pointer when the report appears, here and probably in a new post as well.
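In the meantime, if you want to see the algorithm being benchmarked, here is a minimal single-machine sketch with scikit-learn (assumed here; the distributed GraphLab, Piccolo and Spark implementations in the study are of course far more involved):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian clusters in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.7, size=(200, 2)),
])

# Fit a two-component Gaussian mixture with EM (Bishop, PRML, section 9.2).
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

print(gmm.means_)          # estimated component means
print(gmm.predict(X[:5]))  # hard cluster assignments for a few points
```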

Applying “Lateral Thinking” to Data Quality

Filed under: Data,Data Quality — Patrick Durusau @ 7:08 pm

Applying “Lateral Thinking” to Data Quality by Ken O’Connor.

From the post:

I am a fan of Edward De Bono, the originator of the concept of Lateral Thinking. One of my favourite examples of De Bono’s brilliance, relates to dealing with the worldwide problem of river pollution.

River Discharge Pipe

De Bono suggested “each factory must be downstream of itself” – i.e. Require factories’ water inflow pipes to be just downstream of their outflow pipes.

Suddenly, the water quality in the outflow pipe becomes a lot more important to the factory. Apparently several countries have implemented this idea as law.

What has this got to do with data quality?

By applying the same principle to data entry, all downstream data users will benefit, and information quality will improve.

How could this be done?

So how do you move the data input pipe just downstream of the data outflow pipe?

Before you take a look at Ken’s solution, take a few minutes to brain storm about how you would do it.

Important for semantic technologies because there aren’t enough experts to go around. Meaning non-expert users will do a large portion of the work.

Comments/suggestions?

Practical Foundations for Programming Languages

Filed under: CS Lectures,Programming,Types — Patrick Durusau @ 5:32 pm

PFPL is out! by Robert Harper.

From the post:

Practical Foundations for Programming Languages, published by Cambridge University Press, is now available in print! It can be ordered from the usual sources, and maybe some unusual ones as well. If you order directly from Cambridge using this link, you will get a 20% discount on the cover price (pass it on).

Since going to press I have, inevitably, been informed of some (so far minor) errors that are corrected in the online edition. These corrections will make their way into the second printing. If you see something fishy-looking, compare it with the online edition first to see whether I may have already corrected the mistake. Otherwise, send your comments to me at rwh@cs.cmu.edu.

By the way, the cover artwork is by Scott Draves, a former student in my group, who is now a professional artist as well as a researcher at Google in NYC. Thanks, Scott!

Update: The very first author’s copy hit my desk today!

Congratulations to Robert!

The holidays are upon us so order early and often!

Looking at a Plaintext Lucene Index

Filed under: Indexing,Lucene — Patrick Durusau @ 5:24 pm

Looking at a Plaintext Lucene Index by Florian Hopf.

From the post:

The Lucene file format is one of the reasons why Lucene is as fast as it is. An index consists of several binary files that you can’t really inspect if you don’t use tools like the fantastic Luke.

Starting with Lucene 4 the format for these files can be configured using the Codec API. Several implementations are provided with the release, among those the SimpleTextCodec that can be used to write the files in plaintext for learning and debugging purposes.

Good starting point for learning more about Lucene indexes.

Library Hi Tech Journal seeks papers on LOV & LOD

Filed under: Linked Data,LOD,LOV — Patrick Durusau @ 2:44 pm

Library Hi Tech Journal seeks papers on LOV & LOD

From the post:

Library Hi Tech (LHT) seeks papers about new works, initiatives, trends and research in the field of linking and opening vocabularies. This call for papers is inspired by the 2012 LOV Symposium: Linking and Opening Vocabularies symposium and SKOS-2-HIVE —Helping Interdisciplinary Vocabulary Engineering workshop—held at the Universidad Carlos III de Madrid (UC3M).

This Library Hi Tech special issue might include papers delivered at the UC3M-LOV events and other original works related with this subject, not yet published.

Topics: LOV & LOD

Papers specifically addressing research and development activities, implementation challenges and solutions, and educative aspects of Linked Open Vocabularies (LOV) and/or in a broader sense Linked Open Data, are of particular interest.

Those interested in submitting an article should send papers before 30 January 2013. Full articles should be between 4,000 and 8,000 words. References should use the Harvard style. Please submit completed articles via the Scholar One online submission system. All final submissions will be peer reviewed.

On the style for references, you may find the Author Guidelines at LHT useful.

More generally, see Harvard System, posted by the University Library of Anglia Ruskin University.

Four Organizational Personas Of Disruptive Tech Adoption

Filed under: Marketing,Topic Maps — Patrick Durusau @ 2:27 pm

Monday’s Musings: Understand The Four Organizational Personas Of Disruptive Tech Adoption by R “Ray” Wang.

From the post:

Rapid innovation, flexible deployment options, and easy consumption models create favorable conditions for the proliferation of disruptive technology. In fact, convergence in the five pillars of enterprise disruption (i.e. social, mobile, cloud, big data, and unified communications), has led to new innovations and opportunities to apply disruptive technologies to new business models. New business models abound at the intersection of cloud and big data, social and mobile, social and unified communications, and cloud and mobile.

Unfortunately, most organizations are awash with discovering, evaluating, and consuming disruptive technologies. Despite IT budgets going down from 3 to 5% year over year, technology spending is up 18 to 20%. Why? Amidst constrained budgets, resources, and time limits, executives are willing to invest in disruptive technology to improve business outcomes. Consequently, successful adoption is the key challenge in consuming this torrent of innovation. This rapid pace of change and inability to consume innovation detract organizations from the realization of business value.

“Ray” writes the analysis of organizational personas from the perspective of someone within an organization who is pushing for adoption of a disruptive technology. The insights are quite useful for anyone with that perspective.

Have you used organizational personas to target adopters of disruptive technologies?

It is a low percentage shot to pitch a disruptive technology to a known “laggard.”

Is organizational history useful? Think of “market leaders” who created or adopted a disruptive technology and over time became cautious adopters or even laggards.

You can supply your own examples of current “market leaders” who are shaving algorithms but not doing anything fundamentally new or disruptive.

Suggestions/comments?

December 7, 2012

Open Babel: One year on

Filed under: Cheminformatics,Open Babel — Patrick Durusau @ 8:08 pm

Open Babel: One year on by Bailey Fallon.

From the post:

Just over a year ago, Journal of Cheminformatics published a paper describing the open source chemical toolbox, Open Babel. Despite almost 10 years as an independent project, a description of the features and implementation of Open Babel had never been formally published. However, in the 14 months since publication, the Open Babel paper has quickly become one of the most influential articles in Journal of Cheminformatics. It is the second most cited article in the journal according to Thomson Reuters Web of Science, and is amongst the most widely read, with close to 10 000 accesses. The software itself has been downloaded over 40 000 times in the last year alone.

Open Babel attempts to solve a common problem in cheminformatics – the need to convert between different chemical structure formats. It offers a solution by allowing anyone to search, convert, analyze, or store data in over 110 formats covering molecular modeling, chemistry, solid-state materials, biochemistry, and related areas.
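A hedged sketch of the conversion use case with Open Babel’s Python bindings (assumes pybel is installed; the SMILES string is just ethanol as a toy input):

```python
import pybel  # in Open Babel 3.x: from openbabel import pybel

# Read a molecule from a SMILES string (ethanol)...
mol = pybel.readstring("smi", "CCO")

# ...and write it back out in other formats.
print(mol.write("inchi"))
print(mol.write("mol2"))
```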

Introductory training guide to Open Babel (by Noel O’Boyle).

That’s impressive!

But you need to remember it wasn’t that many years ago when commercial conversion software offered more than 300 source and target formats.

Still, worth taking a deep look to see if there are useful lessons for topic maps.

Building graphs with Hadoop

Filed under: GraphBuilder,Graphs,Hadoop,Networks — Patrick Durusau @ 8:00 pm

Building graphs with Hadoop

From the post:

Faced with a mass of unstructured data, the first step of analysing it should be to organise it, and the first step of that process should be working out in what way it should be organised. But then that mass of data has to be fed into the graph which can take a long time and may be inefficient. That’s why Intel has announced the release of the open source GraphBuilder library, a tool that is meant to help scientists and developers working with large amounts of data build applications that make sense of this data.

The library plugs into Apache Hadoop and is designed to create graphs from big data sets which can then be used in applications. GraphBuilder is written in Java using the MapReduce parallel programming model and takes care of many of the complexities of graph construction. According to the developers, this makes it easier for scientists and developers who do not necessarily have skills in distributed systems engineering to make use of large data sets in their Hadoop applications. They can focus on writing the code that breaks the data up into meaningful nodes and useful edge information which can be run across the distributed architecture where the library also performs a wide range of other useful processes to optimise the data for later analysis.
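As a conceptual sketch only (GraphBuilder itself is Java on MapReduce), the “break the data into nodes and edges” step for something like a document-word graph amounts to:

```python
import re
from collections import defaultdict

docs = {
    "doc1": "graphs are everywhere",
    "doc2": "hadoop builds graphs at scale",
}

# Emit (document, word) edges, weighted by term frequency.
edges = defaultdict(int)
for doc_id, text in docs.items():
    for word in re.findall(r"[a-z]+", text.lower()):
        edges[(doc_id, word)] += 1

for (doc_id, word), weight in sorted(edges.items()):
    print(doc_id, word, weight)
```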

A nice way to re-use those Hadoop skills you have been busy acquiring!

Definitely on the weekend schedule!

Wikiweb [Clue to a topic map delivery interface]

Filed under: Graphs,Interface Research/Design,Networks,Visualization — Patrick Durusau @ 7:32 pm

Wikiweb

I don’t have an iPhone or iPad so I have to take the video at face value. 🙁

But, what it shows was quite impressive!

Still not convinced about graph layouts that move about but obviously some users really like them.

Imagine this display adapted to merged subject representatives. With configurable display of other subjects/connections.

Now that would rock!

DMR – Data Mining and Reporting (blog)

Filed under: Data Mining,Knime — Patrick Durusau @ 7:26 pm

DMR – Data Mining and Reporting by Rosaria Silipo.

Data mining focused on KNIME.

I followed a reference by Sandro Saitta to it.

KNIME website (in case it is unfamiliar).

Astronomy Resources [Zillman]

Filed under: Astroinformatics,Data — Patrick Durusau @ 6:38 pm

Astronomy Resources by Marcus P. Zillman.

From the post:

Astronomy Resources (AstronomyResources.info) is a Subject Tracer™ Information Blog developed and created by the Virtual Private Library™. It is designed to bring together the latest resources and sources on an ongoing basis from the Internet for astronomical resources which are listed below….

With some caveats, this may be of interest.

First, the level of content is uneven. It ranges from professional surveys (suitable for topic map explorations) to more primary/secondary education type materials. Nothing against the latter but the mix is rather jarring.

Second, I didn’t test every link but for example AstroGrid is a link to a project that was completed two years ago (2010).

Just in case you stumble across any of the “white papers” at http://www.whitepapers.us/, also by Marcus P. Zillman, do verify resources before citing them to others.

Reconstruct Gene Networks Using Shiny

Filed under: Graphs,Networks,R — Patrick Durusau @ 6:21 pm

Reconstruct Gene Networks Using Shiny by Jeff Allen

From the post:

We’ve been experimenting with RStudio’s new Shiny software as a way to quickly and easily create interactive, responsive web applications which are able to leverage complicated analytics back-ends built in the R programming language.

(graphic omitted)

We created a simple interface which can infer the structure of an underlying Gene Regulatory information based on gene expression patterns; the application is available at http://glimmer.rstudio.com/qbrc/grn/. (At the time of writing, Shiny’s file upload functionality is highly unstable and may not work from your machine — hopefully improvements to the project will resolve the issues shortly.)

Inferring networks. Sounds like inferring associations. (Being mindful of Eric Freese’s demo years ago of the family tree topic map application.)

When looking for connections, consider dropping an association in a map to see what may/may not realign. Or change the basis for the association.

For example, tracking lobbyists in the coming frenzy of tax and fiscal reform.

Learn R by trying R

Filed under: R,Teaching,Topic Maps — Patrick Durusau @ 6:06 pm

Learn R by trying R by David Smith.

From the post:

If you are new to R and want an introduction to the R language in the classic “learning by doing” way, Code School and O’Reilly have put together the Try R interactive tutorial.

This tutorial is a painless introduction to the R programming language. During the course you’ll become familiar with using vectors, matrices, factors, and data frames to manipulate data into powerful visualizations.

How would you translate this concept into a means of teaching topic maps?

As an online and responsive interface?

What would you choose as the domain for the topic map?

Or perhaps better, what trend indicators would you watch so you could pick something of broad current interest?

Would you change it seasonally?

Fiscal Cliff + OMB or Fool Me Once/Twice

Filed under: Government,Government Data,Marketing,Topic Maps — Patrick Durusau @ 12:00 pm

Call it a fiscal “cliff,” “slope,” “curb,” “bump,” or whatever, it is all the rage in U.S. news programming.

Two things are clear:

First, tax and fiscal policy are important for government services, the economy and citizens.

Second, the American people are being kept in near total darkness about what may, could or should be done in tax and fiscal policy.

House Speaker Boehner’s “proposal” to close some tax loopholes, some day, by some amount, is too vacuous to merit further comment.

President Obama has been clear on wanting an increase in taxes for income over $250,000, but there the clarity from the Obama administration stops.

The Office of Management and Budget issued OMB Report Pursuant to the Sequestration Transparency Act of 2012 (P. L. 112–155) as a PDF file. Meaning no one could easily evaluate its contents.

Especially:

Appendix A. Preliminary Estimates of Sequestrable and Exempt Budgetary Resources and Reduction in Sequestrable Budgetary Resources by OMB Account – FY 2013

and,

Appendix B. Preliminary Sequestrable / Exempt Classification by OMB Account and Type of Budgetary Resource

I converted Appendix A into a comma-separated data file, with a short commentary to alert the reader to issues in the data file. (OMB-Sequestration-Data-Appendix-A.zip)

For example:

  • Meaning and application of “offsets” varies throughout Appendix A of the OMB report.
  • The OMB report manages to multiply 0 by 7.6 percent for a result of $91 million (see the sketch after this list).
  • Appendix B has a different ordering of the accounts than Appendix A and uses different identifiers.
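A hedged sketch of the kind of arithmetic check that surfaces such anomalies (assumes pandas; the column names are hypothetical and the actual headers in the zip file may differ):

```python
import pandas as pd

df = pd.read_csv("OMB-Sequestration-Data-Appendix-A.csv")  # extracted from the zip

rate = 0.076  # the 7.6 percent reduction rate cited for this account type
expected = df["sequestrable_budgetary_resources"] * rate
anomalies = df[(expected - df["reduction"]).abs() > 1]

# Rows where the stated reduction does not follow from the stated base,
# e.g. a $91 million reduction against a $0 sequestrable base.
print(anomalies[["account", "sequestrable_budgetary_resources", "reduction"]])
```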

Whatever the intent of the report’s authors, it fails to provide meaningful information on the sequestration issue.

Contact the White House, your Senator or Representative.

Demand all proposals be accompanied by machine readable spreadsheets with details.

Demand your favorite news outlet carry no reports without data from any side in this debate. (Being ignored is the most powerful weapon against the White House, Congress and various federal agencies.)

Lobbyists, the OMB, members of Congress, all have those files. The public is the only side without the details.

Topic maps can map points of clarity as well as obscurity, assuming you have the files for mapping.

December 6, 2012

Kolmogorov Complexity – A Primer

Filed under: Complexity,Compression,String Matching — Patrick Durusau @ 11:46 am

Kolmogorov Complexity – A Primer by Jeremy Kun.

From the post:

Previously on this blog (quite a while ago), we’ve investigated some simple ideas of using randomness in artistic design (psychedelic art, and earlier randomized css designs), and measuring the complexity of such constructions. Here we intend to give a more thorough and rigorous introduction to the study of the complexity of strings. This naturally falls into the realm of computability theory and complexity theory, and so we refer the novice reader to our other primers on the subject (Determinism and Finite Automata, Turing Machines, and Complexity Classes; but Turing machines will be the most critical to this discussion).

Jeremy sets the groundwork necessary for a later post in this series (covering machine learning).

Digest this for a couple of days and I will point out the second post.
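Something concrete to play with alongside the primer: Kolmogorov complexity itself is uncomputable, but the length of a compressed encoding gives a crude, computable upper bound. A minimal sketch using only the standard library:

```python
import random
import string
import zlib

def compressed_length(s: str) -> int:
    # zlib-compressed size: a rough upper bound on the Kolmogorov
    # complexity of s, up to the constant overhead of the decompressor.
    return len(zlib.compress(s.encode("utf-8"), 9))

random.seed(0)
regular = "ab" * 500                                             # highly regular
noisy = "".join(random.choices(string.ascii_lowercase, k=1000))  # close to random

print(compressed_length(regular))  # small
print(compressed_length(noisy))    # close to the raw length
```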

History of the Book [Course resources]

Filed under: Books,History,Interface Research/Design — Patrick Durusau @ 11:45 am

History of the Book by Kate Martinson.

From the webpage:

This website consists of information relating to Art 43 – The History of the Book. Participants should consider this site as a learning tool for the class. It will contain updated documents, images for reference, necessary links, class announcements and other information necessary for participation in the course. It will be constantly modified throughout the semester. Questions or problems should be directed to Kate Martinson, or in the event of technical difficulties, to the Help Desk.

A large number of links to images and other materials on writing and book making around the world. From cuneiform tablets to electronic texts.

I encountered it while looking for material on book indexing.

Useful for studying the transmission of and access to information. Which may influence how you design your topic maps.

Grossly oversimplified but consider the labor involved in writing/accessing information on a cuneiform tablet, on a scroll, in a movable type codex or in electronic form.

At each stage the labor becomes less and the amount of recorded information (not the same as useful information) goes up.

Rather than presenting more information to a user, would it be better for topic maps to present less? And/or to make it difficult to add more information?

What if Facebook offered a filter to exclude coffee, pictures not taken by the sender, etc.? Would that make it a more useful information stream?

How We Read….[Does Your Topic Map Contribute to Information Overload?]

Filed under: Indexing,Information Overload,Interface Research/Design,Usability,Users — Patrick Durusau @ 11:43 am

How we read, not what we read, may be contributing to our information overload by Justin Ellis.

From the post:

Every day, a new app or service arrives with the promise of helping people cut down on the flood of information they receive. It’s the natural result of living in a time when an ever-increasing number of news providers push a constant stream of headlines at us every day.

But what if it’s the ways we choose to read the news — not the glut of news providers — that make us feel overwhelmed? An interesting new study out of the University of Texas looks at the factors that contribute to the concept of information overload, and found that, for some people, the platform on which news is being consumed can make all the difference between whether you feel overwhelmed.

The study, “News and the Overloaded Consumer: Factors Influencing Information Overload Among News Consumers” was conducted by Avery Holton and Iris Chyi. They surveyed more than 750 adults on their digital consumption habits and perceptions of information overload. On the central question of whether they feel overloaded with the amount of news available, 27 percent said “not at all”; everyone else reported some degree of overload.

The results imply that the more constrained the platform for delivering content, the less overwhelmed users feel: reading news on a cell phone at one extreme, for example, with the links and videos on Facebook at the other.

Which makes me curious about information interfaces in general and topic map interfaces in particular.

Does the traditional topic map interface (think Omnigator) contribute to a feeling of information overload?

If so, how would you alter that display to offer the user less information by default but allow its expansion upon request?

Compare to a book index, which offers sparse information on a subject that can be expanded by following a pointer to a fuller treatment of that subject.

I don’t think replicating a print index with hyperlinks in place of traditional references is the best solution but it might be a starting place for consideration.

Introducing Noodlin – A Brainstorming App

Filed under: Design,Graphics,Visualization — Patrick Durusau @ 11:43 am

Introducing Noodlin – A Brainstorming App by Kristina Chodorow.

From the webpage:

I’ve been working on a web app, Noodlin, for brainstorming online. Basically, Noodlin just lets you create notes and connect them. I’ve been using it for taking notes during meetings, figuring out who gets what for the holidays, and organizing The Definitive Guide. I think it might be handy for people studying for finals this time of year, too.

I find it really difficult to be creative while looking at a text editor: it’s just not a good form factor for organizing thoughts, taking notes, and coming up with new ideas. There’s a whole class of problems where people say, “Let’s go find a whiteboard” or start sticking post-its to a wall. Noodlin is an attempt to make this kind of thinking easier to do on a computer.

You may recognize Kristina Chodorow from her work on MongoDB.

Could be quite useful.

I would prefer the ability to host it locally, a few more shapes, and properties on the edges connecting shapes.

Tails: The Amnesic Incognito Live System [Data Mining Where You Shouldn’t]

Filed under: Security,Software — Patrick Durusau @ 11:42 am

Tails: The Amnesic Incognito Live System

From the webpage:

Privacy for anyone anywhere

Tails is a live DVD or live USB that aims at preserving your privacy and anonymity.

It helps you to:

  • use the Internet anonymously almost anywhere you go and on any computer: all connections to the Internet are forced to go through the Tor network;
  • leave no trace on the computer you’re using unless you ask it explicitly;
  • use state-of-the-art cryptographic tools to encrypt your files, email and instant messaging.

If you go data mining where you are unwanted, don’t use your regular user name and real address.

In fact, something like Tails might be in order.

Being mindful that possession of a USB stick with Tails on it could be considered a breach of security, should someone choose to take it that way.

Probably best to use a DVD disguised as a Lady Gaga disk. 😉

PS: Being mindful there is always the old fashioned hostile data mining, steal the drives: Swiss Spy Agency: Counter-Terrorism Secrets Stolen.

VO Inside: CALIFA Survey Data Release

Filed under: Astroinformatics,BigData — Patrick Durusau @ 11:41 am

VO Inside: CALIFA Survey Data Release

From the post:

The Calar Alto Legacy Integral Field Area Survey (CALIFA) is observing approximately 600 galaxies in the local universe using 250 observing nights with the PMAS/PPAK integral field spectrophotometer, mounted on the Calar Alto 3.5 m telescope. The first data release occurred on the 1st of November 2012. This DR comprises 200 datacubes corresponding to 100 CALIFA objects (one per setup: V500 and V1200). The data have been fully reduced, quality tested, and are scientifically useful.

The CALIFA survey team provides information here about accessing data with Topcat and other VO Tools.

Something for the astronomer on your gift list, another big data set!

Introduction to Databases [MOOC, Stanford, January 2013]

Filed under: CS Lectures,Database — Patrick Durusau @ 11:39 am

Introduction to Databases (info/registration link) – Starts January 15, 2013.

From the webpage:

About the Course

“Introduction to Databases” had a very successful public offering in fall 2011, as one of Stanford’s inaugural three massive open online courses. Since then, the course materials have been improved and expanded, and we’re excited to be launching a second public offering of the course in winter 2013. The course includes video lectures and demos with in-video quizzes to check understanding, in-depth standalone quizzes, a wide variety of automatically-checked interactive programming exercises, midterm and final exams, a discussion forum, optional additional exercises with solutions, and pointers to readings and resources. Taught by Professor Jennifer Widom, the curriculum draws from Stanford’s popular Introduction to Databases course.

Why Learn About Databases?

Databases are incredibly prevalent — they underlie technology used by most people every day if not every hour. Databases reside behind a huge fraction of websites; they’re a crucial component of telecommunications systems, banking systems, video games, and just about any other software system or electronic device that maintains some amount of persistent information. In addition to persistence, database systems provide a number of other properties that make them exceptionally useful and convenient: reliability, efficiency, scalability, concurrency control, data abstractions, and high-level query languages. Databases are so ubiquitous and important that computer science graduates frequently cite their database class as the one most useful to them in their industry or graduate-school careers.

Course Syllabus

This course covers database design and the use of database management systems for applications. It includes extensive coverage of the relational model, relational algebra, and SQL. It also covers XML data including DTDs and XML Schema for validation, and the query and transformation languages XPath, XQuery, and XSLT. The course includes database design in UML, and relational design principles based on dependencies and normal forms. Many additional key database topics from the design and application-building perspective are also covered: indexes, views, transactions, authorization, integrity constraints, triggers, on-line analytical processing (OLAP), JSON, and emerging NoSQL systems. Working through the entire course provides comprehensive coverage of the field, but most of the topics are also well-suited for “a la carte” learning.

Biography

Jennifer Widom is the Fletcher Jones Professor and Chair of the Computer Science Department at Stanford University. She received her Bachelors degree from the Indiana University School of Music in 1982 and her Computer Science Ph.D. from Cornell University in 1987. She was a Research Staff Member at the IBM Almaden Research Center before joining the Stanford faculty in 1993. Her research interests span many aspects of nontraditional data management. She is an ACM Fellow and a member of the National Academy of Engineering and the American Academy of Arts & Sciences; she received the ACM SIGMOD Edgar F. Codd Innovations Award in 2007 and was a Guggenheim Fellow in 2000; she has served on a variety of program committees, advisory boards, and editorial boards.

Another reason to take the course:

The structure and capabilities of databases shape the way we create solutions.

Consider normalization. An investment of time and effort that may be needed, for some problems, but not others.

Absent alternative approaches, you see every data problem as requiring normalization.

(You may anyway after taking this course. Education cannot impart imagination.)

50 years of Rolling Stones tours

Filed under: Mapping,Maps — Patrick Durusau @ 11:38 am

50 years of Rolling Stones tours by Nathan Yau.

From the post:

CartoDB mapped every Rolling Stones tour from 1963 to 2007.

This is awesome.

More to follow.
