Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 5, 2011

Does GE Think We’re Stupid?

Filed under: Graphs,Interface Research/Design — Patrick Durusau @ 6:45 pm

Does GE Think We’re Stupid?

A gem from Stephen Few:

The series of interactive data visualizations that have appeared on GE’s website over the last two years has provided a growing pool of silly examples. They attempt to give the superficial impression that GE cares about data while in fact providing almost useless content. They look fun, but communicate little. As such, they suggest that GE does not in fact care about the information and has little respect for the intelligence and interests of its audience. This is a shame, because the stories contained in these data sets are important.

(graphic omitted)

Most of the visualizations were developed by Ben Fry (including the colorful pie that Homer is drooling over above); someone who is able to design effective data visualizations, but shows no signs of this in the work that he’s done for GE. The latest visualization was designed by David McCandless, who has to my knowledge never produced an effective data visualization. In other words, GE has gone from bad to worse.

Before you decide this is over-the-top criticism, go read the original post and view the graphics.

The question I would raise (I suppose all those years in law were not wasted) is whether the GE graphics were intended to inform or to confuse.

If the latter, designed to make the public feel that these are issues beyond their ken and best left to experts, then these may be very successful graphics.

Even if not sinister in purpose, I think we need to attend very closely to what we assume about ourselves and graphics (and other interfaces). It may be, perhaps often, that it isn’t us but the interface that is at fault.

If you encounter a graphic you don’t understand, don’t assume it is you. If in writing, investigate further; if in class, ask for a better explanation; if in a meeting, ask for and follow up on the actual details.

Statsmodels

Filed under: Python,Statistics — Patrick Durusau @ 6:44 pm

Statsmodels

From the webpage:

scikits.statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. An extensive list of result statistics is available for each estimator. The results are tested against existing statistical packages to ensure that they are correct. The package is released under the open source Simplified BSD (2-clause) license.
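
As a quick taste of what that looks like in practice, here is a minimal ordinary least squares fit. The import path below is the modern statsmodels.api; at the time of the description above the package was imported as scikits.statsmodels, so adjust accordingly.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: y = 2 + 3x + noise
rng = np.random.default_rng(42)
x = rng.random(100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)

X = sm.add_constant(x)          # add an intercept column
results = sm.OLS(y, X).fit()    # estimate the model

print(results.params)           # should come out close to [2.0, 3.0]
print(results.summary())        # the "extensive list of result statistics"
```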

LibraryCloud News

Filed under: Legal Informatics,Library — Patrick Durusau @ 6:44 pm

LibraryCloud News

An update from the Harvard Library Innovation Laboratory:

I spent the week working on LibraryCloud News (a project that Jeff, David, and I have been batting around for a while). We hope LibraryCloud News will become the Hacker News for library dorks (instead of startup dorks). It’s a place where you can submit questions or a link to the community and then engage the community through comment-style discussion. (Exactly the way Reddit and Hacker News work.) LibraryCloud News is powered by the same code that powers Hacker News and is humming along in the Amazon Cloud. If you’re interested, we’d love to have you help us beta test LibraryCloud News at http://news.librarycloud.org/

OK, so I have a weakness for libraries and law libraries in particular. 😉

Still, I think this is a promising development and should be encouraged.

Now imagine harvesting the stories from this as a feed and mapping in related resources, so people pursuing the stories have related technologies, users and vendors alongside them. Beats the hell out of ads any day.

How to enter a data contest – machine learning for newbies like me

Filed under: Contest,Data Contest,Machine Learning — Patrick Durusau @ 6:43 pm

How to enter a data contest – machine learning for newbies like me

From the post:

I’ve not had much experience with machine learning; most of my work has been a struggle just to get data sets that are large enough to be interesting! That’s a big reason why I turned to the Kaggle community when I needed a good prediction algorithm for my current project. I wasn’t completely off the hook though, I still needed to create an example of our current approach, limited as it is, to serve as a benchmark for the teams. While I was at it, it seemed worthwhile to open up the code too, so I’ve created a new Github project:

https://github.com/petewarden/MLloWorld

It actually produces very poor results, but does demonstrate the basics of how to pull in the data and apply one of scikit-learn’s great collection of algorithms. If you get the itch there’s lots of room for improvement, and the contest has another two weeks to run!
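
For anyone who wants a feel for the kind of benchmark code the post describes, here is a minimal sketch of pulling in feature data and applying one of scikit-learn’s classifiers. The CSV file name and column layout are hypothetical, and the module paths are today’s rather than those of the 2011-era library.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data: features in the leading columns, label in the last column
data = np.loadtxt("training.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validated accuracy
print("mean accuracy: %.3f" % scores.mean())
```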

There is a case to be made for machine learning in the production of topic maps and what better motivation than contests for learning it?

Which makes me wonder how to structure something similar for topic maps: contests, that is, for creating topic maps from one or more data sets. Coming up with funding for a meaningful prize would not be as hard as setting up a task that is neither too easy nor too hard. At least not for the early contests, anyway. 😉

For the early ones, pride of first place might be enough.

Suggestions/Comments?

Go away kid, you bother me.

Filed under: Software,Solr — Patrick Durusau @ 6:42 pm

I was reminded of this W.C. Fields quote when I read the following from Formtek:

Formtek releases version 5.4.2 of the Formtek | Orion 5 SDK Pure Java API product for Linux, providing Full-Text-Search capability.

I went to the announcement (a pdf file) only to read:

Formtek releases version 5.4.2 of the Formtek | Orion 5 SDK Pure Java API™ product for Linux®, which provides support for:

  • Full-Text Indexing and Search

If you are a current customer, you can find out more by logging on at:

http://support.formtek.com/Login.asp

After logging on, click the link for Formtek Product Documentation to view the Product Release Notes for this release.

If you are not a current customer and would like more information, please contact us at sales@formtek.com.

The Formtek blog said: ECM: Formtek Announces SOLR Integration for Orion ECM, hence my interest in the Solr integration.

But that was all. Contrast that with announcements from other vendors, which often give specifics for everyone to read about the integration of open source projects into their software offerings, even proprietary ones.

I’m not interested enough to ask for more information from Formtek. Are you?

The cool aspects of Odiago WibiData

Filed under: Hadoop,HBase,Wibidata — Patrick Durusau @ 6:42 pm

The cool aspects of Odiago WibiData

From the post:

Christophe Bisciglia and Aaron Kimball have a new company.

  • It’s called Odiago, and is one of my gratifyingly more numerous tiny clients.
  • Odiago’s product line is called WibiData, after the justly popular We Be Sushi restaurants.
  • We’ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this is the place for — well, for the tech crunch.

WibiData is designed for management of, investigative analytics on, and operational analytics on consumer internet data, the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is a data management and analytic execution layer.

Still in private beta (you can sign up for notice) but the post covers the infrastructure with enough detail to be enticing.

Just as a tease (on my part):

where you’d have a single value in a relational table, you might have the equivalent of a whole relational table (or at least selection/view) in WibiData-enhanced HBase. For example, if a user visits the same web page ten times, and each time 50 attributes are recorded (including a timestamp), all 500 data – to use the word “data” in its original “plural of datum” sense – would likely be stored in the same WibiData cell.

You need to go read the post to put that in context.
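
To make the “whole table in a cell” idea concrete, here is a toy sketch in Python of a wide-row layout. The names are purely illustrative and have nothing to do with WibiData’s actual API.

```python
from collections import defaultdict

# Toy wide-row store: one "cell" per (user, page) holds every timestamped
# visit record, rather than one value per relational row.
store = defaultdict(list)

def record_visit(user, page, timestamp, **attributes):
    store[(user, page)].append({"ts": timestamp, **attributes})

record_visit("user42", "/home", 1320444000, referrer="google", device="mobile")
record_visit("user42", "/home", 1320447600, referrer="direct", device="desktop")

# Ten visits with 50 attributes each would all land in this one cell.
print(store[("user42", "/home")])
```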

I keep thinking all the “good” names are gone and then something like WibiData shows up. 😉

I suspect there are going to be a number of lessons to learn from this combination of HBase and Hadoop.

What Market Researchers could learn from eavesdropping on R2D2

Filed under: Machine Learning,Marketing — Patrick Durusau @ 6:41 pm

What Market Researchers could learn from eavesdropping on R2D2

From the post:

Scott asks: in the context of research and insight, why should we care about what the Machine Learning community is doing?

For those not familiar with Machine Learning, it is a scientific discipline related to artificial intelligence. But it is more concerned with the science of teaching machines to solve useful problems as opposed to trying to get machines to replicate human behavior. If you were to put it in Star Wars terms, a Machine Learning expert would be more focused on building the short, bleeping useful R2D2 than the shiny, linguistically gifted but clumsy C3P0—a machine that is useful and efficient as opposed to a machine that replicates behaviors and mannerisms of humans.

There are many techniques and approaches that marketing insights consultants could borrow from the Machine Learning community. The community is made up of a larger group of researchers and scientists as well as those concerned with market research, and their focus is improving algorithms that can be applied across a wide variety of scientific, technology, business, and engineering problems. And so it is a wonderful source of inspiration for approaches that can be adapted to our own industry.

Since topic mappers aren’t a large enough group to be the objects of study (yet), I thought this piece on how marketers view the machine learning community might be instructive.

Successful topic mappers will straddle semantic communities and to do that, they need to be adept at what I would call “semantic cross-overs.”

Semantic cross-overs are those people and written pieces that give you a view overarching two or more communities. They are almost always written more from one point of view than the other, but with enough of both to spark ideas in both camps.

Remember, crossing over between two communities isn’t measured by your view of the cross-over, but by that of members of the respective communities. In other words, your topic map between them may seem very clever to you, but unless it is clever to members of those communities, we call it: No Sale!

Spy vs. Spy

Filed under: Marketing,Topic Maps — Patrick Durusau @ 6:40 pm

I mentioned in Google+ Ripples: Revealing How Posts are Shared over Time that topic maps could be used to find the leakers of information about the killing of Osama bin Laden.

I did not mean to leave the impression that topic maps can only be used to find leakers. Topic maps can be used to find people with access to information not commonly available. Or to find people inclined to share such information. Or the reasons they might share it. Or those around them who might share it.

Leaked information, to be valuable, often must be matched with other information, from other sources. All of which is as much human judgement as the development of sources of information. Nary a drop of logic in any of it.

And the hunting of leakers isn’t a matter of deduction or formal logic either. I really don’t buy the analysis that Peirce would have said in the Wikileaks case: “Quick! Look for someone with a Lady Gaga CD!” (I will run down a reference to Peirce’s retelling of his racist account of tracking down stolen articles. It involves a great deal of luck and racism, not so much formal logic.)

How you choose to use topic maps, as a technology, is entirely up to you.

META’2012 International Conference on Metaheuristics and Nature Inspired Computing

META’2012 International Conference on Metaheuristics and Nature Inspired Computing

Dates:

  • Paper submission: May 15, 2012
  • Session/Tutorial submission: May 15, 2012
  • Paper notification: July 15, 2012
  • Session/Tutorial notification: June 15, 2012
  • Conference: October 27-31, 2012

From the website:

The 4th International Conference on Metaheuristics and Nature Inspired Computing, META’2012, will be held in Port El-Kantaoui (Sousse, Tunisia).

The conference will be a space for exchange, with sessions presenting research work as well as tutorials and training in metaheuristics and nature inspired computing.

The scope of the META’2012 conference includes, but is not limited to:

  • Local search, tabu search, simulated annealing, VNS, ILS, …
  • Evolutionary algorithms, swarm optimization, scatter search, …
  • Emergent nature inspired algorithms: quantum computing, artificial immune systems, bee colony, DNA computing, …
  • Parallel algorithms and hybrid methods with metaheuristics, machine learning, game theory, mathematical programming, constraint programming, co-evolutionary, …
  • Application to: logistics and transportation, telecommunications, scheduling, data mining, engineering design, bioinformatics, …
  • Theory of metaheuristics, landscape analysis, convergence, problem difficulty, very large neighbourhoods, …
  • Application to multi-objective optimization
  • Application in dynamic optimization, problems with uncertainty, bi-level optimization, …

The “proceedings” for Meta ’10 can be seen at: Meta ’10 papers. It would be more accurate to say “extended abstracts” because, for example,

Luis Filipe de Mello Santos, Daniel Madeira, Esteban Clua, Simone Martins and Alexandre Plastino. A parallel GRASP resolution for a GPU architecture

runs all of two (2) pages. That is about the average length of the other twenty (20) papers that I checked.

I like concise writing but two pages to describe a parallel GRASP setup on a GPU architecture? Just an enticement (there is an ugly word I could use) to get you to read the ISI journal with the article.

The conference and its content look very interesting. I can’t say I care for the marketing technique for the journals in question. I am not objecting to the marketing of the journals, but don’t say proceedings when what is meant is ads for the journals.

Expression cartography of human tissues using self organizing maps

Filed under: Bioinformatics,Biomedical,Self Organizing Maps (SOMs),Self-Organizing — Patrick Durusau @ 6:39 pm

Expression cartography of human tissues using self organizing maps by Henry Wirth, Markus Löffler, Martin von Bergen and Hans Binder. (BMC Bioinformatics 2011; 12:306)

Abstract:

Parallel high-throughput microarray and sequencing experiments produce vast quantities of multidimensional data which must be arranged and analyzed in a concerted way. One approach to addressing this challenge is the machine learning technique known as self organizing maps (SOMs). SOMs enable a parallel sample- and gene-centered view of genomic data combined with strong visualization and second-level analysis capabilities. The paper aims at bridging the gap between the potency of SOM-machine learning to reduce dimension of high-dimensional data on one hand and practical applications with special emphasis on gene expression analysis on the other hand.

A nice introduction to self organizing maps (SOMs) in a bioinformatics context. Think of them as yet another way to discover subjects about which people want to make statements and to which to attach data and analysis.
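
If you want to see the mechanics without a bioinformatics pipeline, here is a bare-bones SOM training loop in NumPy, a sketch of the general technique rather than the method used in the paper.

```python
import numpy as np

def train_som(data, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small self organizing map on rows of `data` (n_samples x n_features)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))
    # Grid coordinates, used by the neighbourhood function
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # Best-matching unit: the node whose weight vector is closest to the sample
        bmu = np.unravel_index(np.linalg.norm(weights - x, axis=-1).argmin(), (h, w))
        lr = lr0 * np.exp(-t / iters)          # decaying learning rate
        sigma = sigma0 * np.exp(-t / iters)    # shrinking neighbourhood
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
        weights += lr * influence[..., None] * (x - weights)
    return weights

# Each trained weight vector is a prototype; neighbouring nodes end up with
# similar profiles, which is what gives the "map" its cartographic feel.
```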

November 4, 2011

The number one trait you want in a data scientist

Filed under: Data Science,Jobs — Patrick Durusau @ 6:11 pm

The number one trait you want in a data scientist by Audrey Watters.

Description: DJ Patil on the traits of data scientists and how data science will evolve within companies.

From the post:

“Data scientist” is an on-the-rise job title, but what are the skills that make a good one? And how can both data scientists and the companies they work for make sure data-driven insights become actionable?

In a recent interview, DJ Patil (@dpatil), formerly the chief scientist at LinkedIn and now the data scientist in residence at Greylock Partners, discussed common data scientist traits and the challenges that those in the profession face getting their work onto company roadmaps.

An interesting read and good interview.

I think the #1 trait will surprise you.

Not all topic map authors are data scientists but it would be hard to write a good topic map and not be a data scientist.

Is this terminology that we want to adopt for ourselves in the topic map community? It is popular and might help on resumes and job applications.

Near-real-time readers with Lucene’s SearcherManager and NRTManager

Filed under: Indexing,Lucene,Software — Patrick Durusau @ 6:11 pm

Near-real-time readers with Lucene’s SearcherManager and NRTManager

From the post:

Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.

But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.

If you have access to the IndexWriter that’s actively changing the index (i.e., it’s in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice “best of both worlds” blend between the traditional immediate and eventual consistency models.

Getting into the hardcore parts of Lucene!

Understanding Lucene (or a similar indexing engine) is critical both to mining data and to delivering topic map based information to users.

Big Data: Case Studies, Best Practices and Why America should care

Filed under: BigData,Jobs,Marketing — Patrick Durusau @ 6:10 pm

Big Data: Case Studies, Best Practices and Why America should care by Themos Kalafatis.

From the post:

We know that Knowledge is Power. Due to Data Explosion more Data Scientists will be needed and being a Data Scientist becomes increasingly a “cool” profession. Needless to say that America should be preparing for the increased need for Predictive Analytics professionals in Research and Businesses.

Being able to collect, analyze and extract knowledge from a huge amount of Data is not only about Businesses being able to make the right decisions but also critical for a Country as a whole. The more efficient and fast this cycle is, the better for the Country that puts Analytics to work.

This blog post is actually about the words and phrases being used for this post: all words and phrases in the title of the post (and the introductory text) were carefully selected to produce specific thoughts, which can be broken down in three parts:

  • Being a Data Scientist has high value.
  • “Case Studies” and “Best Practices” communicate to readers successful applications and knowledge worthwhile reading.
  • “America should”. This phrase obviously creates specific emotions and feelings to Americans.

Being in a “cool” profession, or even a member of a “cool” profession, doesn’t guarantee good results. Whatever tools you are using, good analytical skills have to lie behind their use. I think topic maps have a role to play in managing “big data” and in being a tool that is reached for early and often.

Confidence Bias: Evidence from Crowdsourcing

Filed under: Bias,Confidence Bias,Crowd Sourcing,Interface Research/Design — Patrick Durusau @ 6:10 pm

Confidence Bias: Evidence from Crowdsourcing (CrowdFlower blog)

From the post:

Evidence in experimental psychology suggests that most people overestimate their own ability to complete objective tasks accurately. This phenomenon, often called confidence bias, refers to “a systematic error of judgment made by individuals when they assess the correctness of their responses to questions related to intellectual or perceptual problems.” 1 But does this hold up in crowdsourcing?

We ran an experiment to test for a persistent difference between people’s perceptions of their own accuracy and their actual objective accuracy. We used a set of standardized questions, focusing on the Verbal and Math sections of a common standardized test. For the 829 individuals who answered more than 10 of these questions, we asked for the correct answer as well as an indication of how confident they were of the answer they supplied.

We didn’t use any Gold in this experiment. Instead, we incentivized performance by rewarding those finishing in the top 10%, based on objective accuracy.

I am not sure why crowdsourcing would make a difference on the question of overestimation of ability, but now the answer is in: no. But do read the post for the details; I think you will find it useful when doing user studies.

For example, when you ask a user if some task is too complex as designed, are they likely to overestimate their ability to complete it, either to avoid being embarrassed in front of others or to avoid admitting that they really didn’t follow your explanation?

My suspicion is yes, so in addition to simply asking users if they understand particular search or other functions of an interface, you also need to film them using the interface with no help from you (or others).

You will remember from Size Really Does Matter… that Blair and Maron reported that lawyers overestimated their accuracy in document retrieval by 55%. Of course, the question of retrieval is harder to evaluate than those in the CrowdFlower experiment, but it is a bias you need to keep in mind.

bibleQuran: Comparing the Word Frequency between Bible and Quran

Filed under: Bible,Quran,Synonymy,Visualization — Patrick Durusau @ 6:10 pm

bibleQuran: Comparing the Word Frequency between Bible and Quran

From the post:

bibleQuran [pitchinteractive.com] by datavis design firm Pitch Interactive reveals the frequency of word usage between two of the most important holy books: the Bible and the Quran.

The densely populated interactive visualization allows people to search for any word (and similar variations of that word) to explore its frequency in both texts. As each verse is always visible, one is able to compare the relative density of ideas and topics between both passages. For instance, one could select verbs that represent acts of ‘terror’ or ‘love’, and investigate which book discusses the topics more. The little rectangles, each representing a verse that includes the chosen word, are then highlighted, and can be read in detail by hovering the mouse over them.

In addition to being a great graphic presentation of information, with my background and appreciation for both texts, you know why I had to include this post.

I like the synonym feature, although I reserve judgment on what is considered a synonym. 😉 I would have to read the original. Translations of both texts are, well, translations. Not really the same text in a very real sense of the word.

Just as a suggestion, I would do the word count statistics separately for the Old and New Testaments.
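
Along the lines of that suggestion, the underlying word counting is simple enough to try yourself; a sketch with collections.Counter (the file names and the crude tokenizer are my own assumptions, not part of the bibleQuran project):

```python
import re
from collections import Counter

def word_frequencies(path):
    """Lower-cased word counts for a plain-text file."""
    with open(path, encoding="utf-8") as f:
        return Counter(re.findall(r"[a-z']+", f.read().lower()))

# Hypothetical plain-text editions, counted separately as suggested above.
counts = {name: word_frequencies(path) for name, path in
          [("Old Testament", "ot.txt"), ("New Testament", "nt.txt"), ("Quran", "quran.txt")]}

for term in ("love", "mercy", "war"):
    print(term, {name: c[term] for name, c in counts.items()})
```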

Word of warning: Loads great with Firefox (7.1) on Windows XP, doesn’t load with IE 8 on Windows XP, doesn’t load with Firefox (3.6) on Ubuntu 10.04. So, your experience may vary.

Comments from users with other browser/OS combinations?

Paper about “BioStar” published in PLoS Computational Biology

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:09 pm

Paper about “BioStar” published in PLoS Computational Biology by Pierre Lindenbaum.

I have mentioned Biostar.

Pierre links to the paper, a blog entry about the paper and has collected tweets about it.

Be forewarned about the slides if you are sensitive to remarks comparing twelve-year-olds and politicians. Personally, I think the twelve-year-olds have just been insulted. 😉

Solr Performance Monitoring with SPM

Filed under: Java,Solr — Patrick Durusau @ 6:09 pm

Solr Performance Monitoring with SPM

From the post:

Originally delivered as Lightning Talk at Lucene Eurocon 2011 in Barcelona, this quick presentation shows how to use Sematext’s SPM service, currently free to use for unlimited time, to monitor Solr, OS, JVM, and more.

We built SPM because we wanted to have a good and easy to use tool to help us with Solr performance tuning during engagements with our numerous Solr customers. We hope you find our Scalable Performance Monitoring service useful! Please let us know if you have any sort of feedback, from SPM functionality and usability to its speed. Enjoy!

Nice set of slides!

I was relieved to discover that Sematext (I can spell it right with effort) is 100% organic, no GMOs! 😉

Please heed the call for the community to respond with feedback on SPM!

A Taxonomy of Enterprise Search and Discovery

A Taxonomy of Enterprise Search and Discovery by Tony Russell-Rose.

Abstract:

Classic IR (information retrieval) is predicated on the notion of users searching for information in order to satisfy a particular “information need”. However, it is now accepted that much of what we recognize as search behaviour is often not informational per se. Broder (2002) has shown that the need underlying a given web search could in fact be navigational (e.g. to find a particular site) or transactional (e.g. through online shopping, social media, etc.). Similarly, Rose & Levinson (2004) have identified the consumption of online resources as a further common category of search behaviour.

In this paper, we extend this work to the enterprise context, examining the needs and behaviours of individuals across a range of search and discovery scenarios within various types of enterprise. We present an initial taxonomy of “discovery modes”, and discuss some initial implications for the design of more effective search and discovery platforms and tools.

If you are flogging software/interfaces for search/discovery in an enterprise context, you really need to read this paper, in part because of its initial findings, but also because it establishes the legitimacy of evaluating how users search before designing an interface for them to search with. Users may not be able to articulate all of their search behaviors, which means you will have to do some observation to establish which elements make the difference between a successful interface and a less successful one. (No one wants to be the next Virtual Case File project at the FBI.)

Read the various types of searching as rough guides to what you may find true for your users. When in doubt, trust your observations of and feedback from your users. Otherwise you will have an interface that fits an abstract description in a paper but not your users. I leave it for you to judge which one results in repeat business.

Don’t take that as a criticism of the paper; I think it is one of the best I have read lately. My concern is that the evaluation of user needs and behaviour should be an ongoing process, not prematurely fixed or obscured by categories or typologies of how users “ought” to act.

The paper is also available in PDF format.

Information Literacy 2.0

Filed under: Information Retrieval,Research Methods — Patrick Durusau @ 6:08 pm

Information Literacy 2.0 by Meredith Farkas.

From the post:

Critical inquiry in the age of social media

Ideas about information literacy have always adapted to changes in the information environment. The birth of the web made it necessary for librarians to shift more towards teaching search strategies and evaluation of sources. The tool-focused “bibliographic instruction” approach was later replaced by the skill-focused “information literacy” approach. Now, with the growth of Web 2.0 technologies, we need to start shifting towards providing instruction that will enable our patrons to be successful information seekers in the Web 2.0 environment, where the process of evaluation is quite a bit more nuanced.

Critical inquiry skills are among the most important in a world in which the half-life of information is rapidly shrinking. These days, what you know is almost less important than what you can find out. And finding out today requires a set of skills that are very different from what most libraries focus on. In addition to academic sources, a huge wealth of content is being produced by people every day in knowledgebases like Wikipedia, review sites like Trip Advisor, and in blogs. Some of this content is legitimate and valuable—but some of it isn’t.

While I agree with Meredith that evaluation of information is a critical skill, I am less convinced that it is a new one. Research, even pre-Internet, was never about simply finding resources for the purpose of citation. There always was an evaluative aspect with regard to sources.

I was able to take a doctoral seminar in research methods for Old Testament students that taught critical evaluation of resources. I don’t remember the text offhand, but we were reading a transcription of a cuneiform text which had a suggested “emendation” (think added characters) for a broken place in the text. The professor asked whether we should accept the “emendation” or not, and on what basis we would make that judgement. The article was by a known scholar, so of course we argued about the “emendation” but never asked one critical question: What about the original text? The source the scholar was relying upon.

The theology library had a publication with an image of the text that we reviewed for the next class. Even though it was only a photograph, it was clear that you might get one, maybe two characters in the broken space of the text, but there was no way you would have the five or six required by the “emendation.”

We were told to never rely upon quotations, transcriptions of texts, etc., unless there was simply no way to verify the source. Not that many of us do that in practice but that is the ideal. There is even less excuse for relying on quotations and other secondary materials now that so many primary materials are easy to access online and more are coming online every day.

I think the lesson of information literacy 2.0 should be critical evaluation of information, but as part of that evaluation, to seek out the sources of the information. You would be surprised how many times what an author said is not what they are quoted as saying, when read in the context of the original.

More Data: Tweets & News Articles

Filed under: Dataset,News,TREC,Tweets — Patrick Durusau @ 6:07 pm

Via Max Lin’s blog, Ian Soboroff posted:

Two new collections being released from TREC today:

The first is the long-awaited Tweets2011 collection. This is 16 million tweets sampled by Twitter for use in the TREC 2011 microblog track. We distribute the tweet identifiers and a crawler, and you download the actual tweets using the crawler. http://trec.nist.gov/data/tweets/

The second is TRC2, a collection of 1.8 million news articles from Thomson Reuters used in the TREC 2010 blog track. http://trec.nist.gov/data/reuters/reuters.html

Both collections are available under extremely permissive usage agreements that limit their use to research and forbid redistribution, but otherwise are very open as data usage agreements go.

It may just be my memory but I don’t recall seeing topic map research with the older Reuters data set (the new one is too recent). Is that true?

Anyway, more large data sets for your research pleasure.

November 3, 2011

Google+ Ripples: Revealing How Posts are Shared over Time

Filed under: Google+,Ripples — Patrick Durusau @ 7:22 pm

Google+ Ripples: Revealing How Posts are Shared over Time

From the post:

Google+ Ripples [plus.google.com] is the first data visualization project from the elusive Big Picture Group, organized around (previous IBM Visual Communication Lab pioneers) Fernanda Viegas and Martin Wattenberg. It is a working demonstration of how aesthetics and functionality can still be effectively merged.

The ‘Ripple Diagram’ shows how a post spreads as people (publicly) share it using the Google+ service, with arrows indicating the direction of the sharing. A timeline at the bottom of the diagram allows the ripple to animate, revealing how this post was shared over time. People who have reshared the post are displayed with their own circle. Inside the circle are people who have reshared the post from that person (and so on). All circles are roughly sized based on the relative influence of that person.

Awesome graphics! You need to visit if for no other reason than the graphics!

As far as the content/idea goes, with just a little bit of tweaking and better tracking, the title could read: Revealing How Information is Shared over Time. Think about it: there were a limited number of people party to the mission against bin Laden and, according to the Sec. of Defense, there was a deal not to reveal some information about the mission. But by the following Monday (that was on Sunday), the deal fell apart as everyone leaked to the news media.

Now, just imagine that you have all the phone records for all the persons who were party to any or all of that information, plus records of most of the people they could have spoken to overnight. Does that sound like, over time, you would be able to find the leakers?

Particularly with a topic map to flesh out contacts of contacts, merging phone numbers, etc.

Nothing new, as Jack Park would say; you could do the same thing with pencil and paper, but with a topic map you can combine numerous occasions of leaking to establish patterns, etc. Something to think about.
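
To make the “contacts of contacts” idea concrete, here is a rough sketch using networkx and entirely made-up call records; the merging of phone numbers, aliases and other identifiers is exactly the part where a topic map would earn its keep.

```python
import networkx as nx

# Hypothetical call records: (caller, callee) pairs for the window of interest.
calls = [
    ("insider_1", "aide_a"), ("aide_a", "reporter_x"),
    ("insider_2", "friend_b"), ("friend_b", "reporter_y"),
    ("insider_1", "insider_2"),
]

G = nx.Graph()
G.add_edges_from(calls)

# Everyone within two hops of a known insider: the pool of possible leak paths.
insiders = {"insider_1", "insider_2"}
candidates = set()
for person in insiders:
    candidates |= set(nx.single_source_shortest_path_length(G, person, cutoff=2))

print(candidates - insiders)
```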

Introducing DocDiver

Introducing DocDiver by Al Shaw. The ProPublica Nerd Blog

From the post:

Today [4 Oct. 2011] we’re launching a new feature that lets readers work alongside ProPublica reporters—and each other—to identify key bits of information in documents, and to share what they’ve found. We call it DocDiver [1].

Here’s how it works:

DocDiver is built on top of DocumentViewer [2] from DocumentCloud [3]. It frames the DocumentViewer embed and adds a new right-hand sidebar with options for readers to browse findings and to add their own. The “overview” tab shows, at a glance, who is talking about this document and “key findings”—ones that our editors find especially illuminating or noteworthy. The “findings” tab shows all reader findings to the right of each page near where readers found interesting bits.

Graham Moore (NetworkedPlanet) mentioned earlier today that the topic map working group should look for technologies and projects where topic maps can make a real difference for a minimal amount of effort. (I’m paraphrasing, so if I got it wrong, blame me, not Graham.)

This looks like a case where an application is very close to having topic map capabilities, but not quite. The project already has users and developers, and I suspect they would be interested in anything that would improve their software without starting over. That would be the critical part: to leverage existing software and imbue it with subject identity as we understand the concept, to the benefit of current users of the software.

Neo4j in 10 Minutes + Plus Dr. Who!

Filed under: Dr. Who,Graphs,Neo4j — Patrick Durusau @ 7:21 pm

Dr. Who and Neo4j

Ian Robinson covers Neo4j in 10 minutes and then moves on to Dr. Who! Entertaining and quite useful. Presented on 2 November 2011.

It joins the fictional universe (it’s fiction?) with the “real” universe of making the show.

The more formal description:

Doctor Who is the world’s longest running science-fiction TV series. Battling daleks, cybermen and sontarans, and always accompanied by his trusted human companions, the last Timelord has saved earth from destruction more times than you’ve failed the build.

Neo4j is the world’s leading open source graph database. Designed to interrogate densely connected data with lightning speed, it lets you traverse millions of nodes in a fraction of the time it takes to run a multi-join SQL query. When these two meet, the result is an entertaining introduction to the complex history of a complex hero, and a rapid survey of the elegant APIs of a delightfully simple graph database.

Armed only with a data store packed full of geeky Doctor Who facts, by the end of this session we’ll have you tracking down pieces of memorabilia from a show that, like the graph theory behind Neo4j, is older than Codd’s relational model.

NoSQL Exchange – 2 November 2011

NoSQL Exchange – 2 November 2011

It doesn’t get much better or fresher (for non-attendees) than this!

  • Dr Jim Webber of Neo Technology starts the day by welcoming everyone to the first of many annual NOSQL eXchanges. View the podcast here…
  • Emil Eifrém gives a Keynote talk to the NOSQL eXchange on the past, present and future of NOSQL, and the state of NOSQL today. View the podcast here…
  • HANDLING CONFLICTS IN EVENTUALLY CONSISTENT SYSTEMS In this talk, Russell Brown examines how conflicting values are kept to a minimum in Riak and illustrates some techniques for automating semantic reconciliation. There will be practical examples from the Riak Java Client and other places.
  • MONGODB + SCALA: CASE CLASSES, DOCUMENTS AND SHARDS FOR A NEW DATA MODEL Brendan McAdams — creator of Casbah, a Scala toolkit for MongoDB — will give a talk on “MongoDB + Scala: Case Classes, Documents and Shards for a New Data Model”
  • REAL LIFE CASSANDRA Dave Gardner: In this talk for the NOSQL eXchange, Dave Gardner introduces why you would want to use Cassandra, and focuses on a real-life use case, explaining each Cassandra feature within this context.
  • DOCTOR WHO AND NEO4J Ian Robinson: Armed only with a data store packed full of geeky Doctor Who facts, by the end of this session we’ll have you tracking down pieces of memorabilia from a show that, like the graph theory behind Neo4j, is older than Codd’s relational model.
  • BUILDING REAL WORLD SOLUTION WITH DOCUMENT STORAGE, SCALA AND LIFT Aleksa Vukotic will look at how his company assessed and adopted CouchDB in order to rapidly and successfully deliver a next generation insurance platform using Scala and Lift.
  • ROBERT REES ON POLYGLOT PERSISTENCE Robert Rees: Based on his experiences of mixing CouchDB and Neo4J at Wazoku, an idea management startup, Robert talks about the theory of mixing your stores and the practical experience.
  • PARKBENCH DISCUSSION This Park Bench discussion will be chaired by Jim Webber.
  • THE FUTURE OF NOSQL AND BIG DATA STORAGE Tom Wilkie: Tom Wilkie takes a whistle-stop tour of developments in NOSQL and Big Data storage, comparing and contrasting new storage engines from Google (LevelDB), RethinkDB, Tokutek and Acunu (Castle).

And yes, I made a separate blog post on Neo4j and Dr. Who. 😉 What can I say? I am a fan of both.

Cache-Oblivious Search Trees Project (Fractal Trees, TokuDB)

Filed under: B-trees,Cache-Oblivious Search Trees,Fractal Trees,Search Trees — Patrick Durusau @ 7:20 pm

Cache-Oblivious Search Trees Project (Fractal Trees, TokuDB)

I watched a very disappointing presentation on Fractal Trees (used by Tokutek in the TokuDB) and so went looking for better resources.

The project is described as:

We implemented a cache-oblivious dynamic search tree as an alternative to the ubiquitious B-tree. We used a binary tree with a “van Emde Boas” layout whose leaves point to intervals in a “packed memory structure”. The search tree supports efficient lookup, as well as efficient amortized insertion and deletion. Efficient implementation of a B-tree requires understanding the cache-line size and page size and is optimized for a specific memory hierarchy. In contrast, a cache-oblivious dynamic search tree contains no machine-dependent variables, performs well on any memory hierarchy, and requires minimal user-level memory management. For random insertion of data, the data structure performs better than the Berkeley DB and a good implementation of B-trees. Another advantage of my data structure is that the packed memory array maintains data in sorted order, allows sequential reads at high speeds, and data insertions and deletions with few data writes on average. In addition, the data structure is easy to implement because he employed memory mapping rather than making the data stored on disk be a single level store.

We also have designed cache-oblivious search trees for which the keys can be very long (imagine a key, such as a DNA sequence, that is larger than main memory), and trees into which data can be quickly inserted.

One essential difference is that a B-tree performs random I/O, while a Fractal Tree converts random I/O into sequential I/O, which operates at near disk-bandwidth speeds.

At Tokutek, I would review the paper by Bradley C. Kuszmaul, How TokuDB Fractal Tree™ Indexes Work.

Impressive software for the right situation.

The background literature is interesting. Not sure if directly applicable to topic maps or not.

Trello

Filed under: Interface Research/Design — Patrick Durusau @ 7:18 pm

Trello – Organize anything, together

From the site:

Trello is a collaboration tool that organizes your projects into boards. In one glance, Trello tells you what’s being worked on, who’s working on what, and where something is in a process.

Trello requires a free account to use the tool. I am not suggesting that you need an online task management tool, this one or any other.

I mention it because of the simplicity of the interface.

Boards

A board is just a collection of lists (and lists hold the cards). You’ll probably want a board for each project you’re working on. You can add and start using a new board in seconds. You can glance at a board and get a handle on the status of any project.

Lists

Lists can be just simple lists, but they are most powerful when they represent a stage in a process. Simply drag your lists into place to represent your workflow. Move a card from one list to the next to show your progress.

Cards

Cards are tasks. You make a card to track something you or your team needs to get done. You can add attachments, embed video, assign users, add due dates, make checklists, or you can just add your card to a board with no fuss and no overhead, and know exactly what work needs to get done.
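
To see how little structure is involved, here is the board/list/card hierarchy sketched as a few Python dataclasses; the class and field names are my own guesses at the essentials, not Trello’s actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Card:
    title: str
    assignees: List[str] = field(default_factory=list)
    due: Optional[str] = None

@dataclass
class Lane:            # a Trello "list"; renamed to avoid clashing with typing.List
    name: str
    cards: List[Card] = field(default_factory=list)

@dataclass
class Board:
    name: str
    lanes: List[Lane] = field(default_factory=list)

# Moving a card from one lane to the next is the whole workflow story.
board = Board("Topic map project", [Lane("To do"), Lane("Doing"), Lane("Done")])
board.lanes[0].cards.append(Card("Draft occurrence types", assignees=["patrick"]))
board.lanes[1].cards.append(board.lanes[0].cards.pop())
```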

My question is: Is this enough for the average project? There is project management software like JIRA, but its interface is far more complex.

  1. What about Trello makes the interface easy? I’m not asking you to describe the interface; this post already does that. What is it about board/list/card that is familiar? In what other contexts do we use or see such arrangements?
  2. What other ways of visually organizing data seem common to you?
  3. What is it about other methods of organizing data that makes it “work” for you?

RethinkDB

Filed under: Key-Value Stores,Memcached,NoSQL,RethinkDB — Patrick Durusau @ 7:17 pm

RethinkDB

From the features page:

RethinkDB is a persistent, industrial-strength key-value store with full support for the Memcached protocol.

Powerful technology

  • Ten times faster on solid-state
  • Linear scaling across cores
  • Fine-grained durability control
  • Instantaneous recovery on power failure

Supported core features

  • Point queries
  • Atomic increment/decrement
  • Arbitrary atomic operations
  • Append/prepend operations
  • Values up to 10MB in size
  • Pipelining support
  • Row expiration support
  • Multi-GET support
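
Since RethinkDB speaks the Memcached protocol, a stock memcached client ought to be able to exercise several of the features above. A minimal sketch with the python-memcached library, assuming a server listening on the standard memcached port (whether this particular client plays nicely with RethinkDB is an assumption on my part):

```python
import memcache  # the python-memcached client

mc = memcache.Client(["127.0.0.1:11211"])

mc.set("visits", "0")
mc.incr("visits")                 # atomic increment
mc.incr("visits", 5)
mc.decr("visits")                 # atomic decrement

mc.set("log", "first entry")
mc.append("log", "; more data")   # append to an existing value

print(mc.get("visits"))
print(mc.get_multi(["visits", "log"]))   # multi-GET support
```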

I particularly liked this line:

Can I use RethinkDB even if I don’t have solid-state drives in my infrastructure?

While RethinkDB performs best on dedicated commodity hardware that has a multicore processor and is backed by solid-state storage, it will still deliver a performance advantage both on rotational drives and in the cloud. (emphasis added to the answer)

Don’t worry: your “rotational drives” and “cloud” account have not suddenly become obsolete. The skill you need to acquire before the next upgrade cycle is evaluating performance claims against your own processes and data.

It doesn’t matter that all the UN documents can be retrieved in sub-millisecond time, translated and served with a hot Danish, if you don’t use the same format, have no need for translation and are more of a custard tart fan. Vendor performance figures may attract your interest, but your decision making should be driven by performance figures that represent your environment.

Build into the acquisition budget funding for your staff to replicate a representative subset of your data and processes for testing with vendor software/hardware. True enough, after the purchase you will probably toss that subset, but remember that you will be living with the software purchase for years, and be known as the person who managed the project. Suddenly spending a little more money on making sure your requirements are met doesn’t sound so bad.

D3.js – Data Driven Documents

Filed under: D3,Graphics,Protovis,Visualization — Patrick Durusau @ 7:17 pm

D3.js – Data Driven Documents

From the webpage:

D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document. As a trivial example, you can use D3 to generate a basic HTML table from an array of numbers. Or, use the same data to create an interactive SVG bar chart with smooth transitions and interaction.

D3 is not a traditional visualization framework. Rather than provide a monolithic system with all the features anyone may ever need, D3 solves only the crux of the problem: efficient manipulation of documents based on data. This gives D3 extraordinary flexibility, exposing the full capabilities of underlying technologies such as CSS3, HTML5 and SVG. It avoids learning a new intermediate proprietary representation. With minimal overhead, D3 is extremely fast, supporting large datasets and dynamic behaviors for interaction and animation. And, for those common needs, D3’s functional style allows code reuse through a diverse collection of optional modules.

Any description of the graphics on my part would be inadequate. Visit the site, you will see what I mean.

Pay special attention to the architecture of D3, there may be lessons for future topic maps development.

D3 replaces Protovis, which is no longer under active development. Protovis is, however, still suggested as a source of examples, etc.

A Protovis Primer, Part 1

Filed under: Graphics,Interface Research/Design,Protovis,Visualization — Patrick Durusau @ 7:16 pm

A Protovis Primer, Part 1

From the post:

Protovis is a very powerful visualization toolkit. Part of what makes it special is that it is written in JavaScript and runs in the browser without the need for any plugins. Its clever use of JavaScript’s language features makes it very elegant, but it can also be confusing to people who are not familiar with functional programming concepts and the finer points of JavaScript. This multi-part tutorial shows how to create a visualization (my interactive Presidents Chart) in Protovis, and explains the concepts that are involved along the way.

This introduction is based on my experiences with using Protovis in my Visualization and Visual Communication class earlier this spring. While the concepts involved are really not that difficult, they are rather foreign to students who have not been exposed to functional programming. And since that is also the case for a lot of hobbyists and people wanting to do visualization who do not have a computer science background, I imagine they run into the same problems.

This has grown from being a single article into several parts (and is still expanding). Let me know if there are things that you don’t understand or that you think need to be covered in more detail, so I can tailor the next parts accordingly.

Protovis requires a modern browser, which means any recent version of Safari, Chrome, FireFox, or Opera. Internet Explorer does not work, because it does not support the HTML5 Canvas element. The visualizations in this article are all Protovis drawings (check out the source code!), with a fall-back to images for RSS readers and IE users. There is no real difference at this point, but once we get to interaction, you will want to read this in a supported browser.

See the comments as well for pointers.

A Protovis Primer, Part 2 – If you are interested in earthquakes, this is the tutorial for you! Plus really nifty bar chart techniques.

A Protovis Primer, Part 3 – Lives and office terms of US presidents. OK, so not every data set is a winner. 😉 Still, the techniques are applicable to other, more interesting data sets.

Neo4j’s Cypher internals – Part 2: All clauses, more Scala’s Parser Combinators and query entry point

Filed under: Cypher,Graphs,Neo4j,Query Language,Scala — Patrick Durusau @ 7:15 pm

Neo4j’s Cypher internals – Part 2: All clauses, more Scala’s Parser Combinators and query entry point

From the post:

In the previous post, I explained what Neo4j is and how graph traversal can be done in Neo4j using the Java API. Next, I introduced Cypher and how it helps write queries in order to retrieve data from the graph. After introducing Cypher’s syntax, we dissected the Start clause, which is the starting point (duh) for any query written in Cypher. If you haven’t read it, go there, and then come back to read this one.

In this second part, I’ll show the other clauses that exist in Cypher: Match, Where, Return, Skip and Limit, and OrderBy. Some will be simple, some not, and I’ll go into more detail on those clauses that aren’t so trivial. After that, we will take a look at the Cypher query entry point, and how the query parsing is unleashed.

Nuff said, let’s get down to business.

This and part 1 are starting points for understanding Cypher, a key to evaluating Neo4j as a topic map storage/application platform.

True enough, at present (1.4) Neo4j only supports 32 billion nodes, 32 billion relationships and 64 billion properties per database, but on the other hand, I have far fewer than 32 billion books, so at a certain level of coarseness it should be fine. 😉

BTW, I do collect CS texts, old as well as new. Mostly algorithm, parsing, graph, IR, database sort of stuff, but occasionally other stuff too. Just in case you have an author’s copy or need to clear out space for more books. Drop me a line if you would like to make a donation to my collection.

