Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 13, 2012

Keep the web weird

Filed under: Semantic Web,Semantics — Patrick Durusau @ 8:14 pm

Keep the web weird

Pete Warden writes:

I’m doing a short talk at SXSW tomorrow, as part of a panel on Creating the Internet of Entities. Preparing is tough because I don’t believe it’s possible, and even if it was I wouldn’t like it. Opposing better semantic tagging feels like hating on Girl Scout cookies, but I’ve realized that I like an internet full of messy, redundant, ambiguous data.

The stated goal of an Internet of Entities is a web where “real-world people, places, and things can be referenced unambiguously”. We already have that. Most pages give enough context and attributes for a person to figure out which real world entity it’s talking about. What the definition is trying to get at is a reference that a machine can understand.

The implicit goal of this and similar initiatives like Stephen Wolfram’s .data proposal is to make a web that’s more computable. Right now, the pages that make up the web are a soup of human-readable text, a long way from the structured numbers and canonical identifiers that programs need to calculate with. I often feel frustrated as I try to divine answers from chaotic, unstructured text, but I’ve also learned to appreciate the advantages of the current state of things.

Now there is a position that merits cheerful support!

You need to read what comes in between but Pete concludes:

The web is literature; sprawling, ambiguous, contradictory, and weird. Let’s preserve those as virtues, and write better code to cope with the resulting mess.

I remember Bible society staffers who were concerned that if non-professionals could publish their own annotations attached to the biblical text, the text might suffer as a result. I tried to assure them that despite years, centuries if not longer, of massed lay and professional effort, the biblical text has resisted all attempts to tame it. I see no reason to think that will change now or in the future.

Data and Machine Learning at Pycon 2012

Filed under: Conferences,Python — Patrick Durusau @ 8:14 pm

Some data and machine learning talks from PyCon US 2012

Marcel Caraciolo has mined the online videos from PyCon US 2012 to list the following:

  1. Practical Machine Learning with Python
  2. Python for data lovers, explore it, analyze it, map it
  3. Python and HDFS – Fast Storage for Large Data
  4. Restful APIs with TastyPie
  5. Storm: The Hadoop of Realtime Stream Processing
  6. Data, Design and Meaning
  7. Graph Processing with Python
  8. Pandas: powerful data analysis tools for Python
  9. Sage: Open Source Math with Python
  10. High Performance Python 1
  11. High Performance Python 2
  12. Introduction to Interactive Predictive Analytics with Python in scikit-learn
  13. Plotting with matplotlib
  14. IPython in-depth: high-productivity interactive and parallel python
  15. Data analysis in Python with pandas
  16. Social network analysis with Python
  17. Bayesian statistics made as simple as possible
  18. IPython: Python at your fingertips

Marcel has embedded all those videos at his post.

Question: Which of the PyCon 2012 videos do you think are important for data mining/topic maps?

Care to share as a comment to this post? Thanks!

March 12, 2012

Pluralistic Data?

Filed under: BigData,Functional Programming — Patrick Durusau @ 8:06 pm

Why Big Data Needs to be Functional by Dean Wampler.

Slides from Dean Wampler’s keynote at NEScala 2012.

I won’t spoil the ending for you so suffice it to say that functional programming is said to be relevant for big data tasks. 😉

Is looking at “big data” as “functional” saying that use of “big data” needs to be pluralistic?

That the cost of acquiring, cleaning, maintaining, etc., “big data” is big (sorry) enough that re-use is a real value?

Such that changing data that has already been gathered, cleaned, reconciled, i.e., imperative processing, is simply unthinkable?

Perhaps read-only access is the new norm, to protect your investment in “big data.”

Even the Clever Stumble

Filed under: Graphics,Humor — Patrick Durusau @ 8:05 pm

A reader’s guide to a New York Times graphic from Junkcharts.

If you feel like your presentation graphics or, heaven forbid, your topic map interface is somehow lacking, take heart.

Even the best professionals at information delivery stumble every now and again. Just less often than the rest of us.

Consider the chart that the New York Times posted with the results from Super Tuesday for Rick Santorum and Mitt Romney (see the blog post).

Even after understanding what it was trying to convey, I could not get that content from the original graphic.

Joins with MapReduce

Filed under: Joins,MapReduce — Patrick Durusau @ 8:05 pm

Joins with MapReduce by Buddhika Chamith.

From the post:

I have been reading up on Join implementations available for Hadoop for the past few days. In this post I recap some techniques I learnt during the process. The joins can be done on both the Map side and the Reduce side according to the nature of the data sets to be joined.

Covers examples of different types of joins.

Is there a MapReduce source with a wider range of examples? It would be useful to have a fairly full set of join examples using MapReduce.
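To make the most common technique concrete, here is a minimal sketch (mine, not from Buddhika's post) of a reduce-side join in the Hadoop Streaming style. The table tags and tab-separated field layout are assumptions for illustration:

```python
# Minimal sketch of a reduce-side (repartition) join, Hadoop Streaming style.
# Records from both tables are tagged with their source; records sharing a
# join key meet at the same reducer, where the two sides are crossed.
from itertools import groupby

def mapper(lines, table_tag, key_index):
    """Emit (join_key, table_tag, record) for every input record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield fields[key_index], table_tag, line.rstrip("\n")

def reducer(tagged_records):
    """Group by join key and cross the records from the two tables."""
    for key, group in groupby(sorted(tagged_records), key=lambda t: t[0]):
        left, right = [], []
        for _, tag, record in group:
            (left if tag == "L" else right).append(record)
        for l in left:
            for r in right:
                yield key, l, r

if __name__ == "__main__":
    users = ["1\talice", "2\tbob"]             # hypothetical users table
    orders = ["1\tbook", "1\tpen", "2\tlamp"]  # hypothetical orders table
    tagged = list(mapper(users, "L", 0)) + list(mapper(orders, "R", 0))
    for row in reducer(tagged):
        print(row)
```

In a real Hadoop Streaming job the sort-and-group step is done by the framework between the map and reduce phases; the in-memory sort here just stands in for it.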

Cross Validation vs. Inter-Annotator Agreement

Filed under: Annotation,LingPipe,Linguistics — Patrick Durusau @ 8:05 pm

Cross Validation vs. Inter-Annotator Agreement by Bob Carpenter.

From the post:

Time, Negation, and Clinical Events

Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.

LingPipe Chunk Annotation GUI

Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in

She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.

Lessons in the use of LingPipe tools!

If you are annotating texts or anticipate annotating texts, read this post.
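For readers new to the second term in the title: inter-annotator agreement is usually summarized with a chance-corrected statistic such as Cohen's kappa. A minimal sketch of that calculation (mine, not code from the post):

```python
# Cohen's kappa: chance-corrected agreement between two annotators who have
# labeled the same items. 1.0 is perfect agreement, 0.0 is chance level.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical phrase-level labels from two annotators.
print(cohens_kappa(["O", "NEG", "TIME", "O"], ["O", "NEG", "O", "O"]))  # ~0.56
```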

An Efficient Trie-based Method for Approximate Entity Extraction…

Filed under: Edit Distance,Entity Extraction,Tries — Patrick Durusau @ 8:05 pm

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints by Dong Deng, Guoliang Li, and Jianhua Feng. (PDF)

Abstract:

Dictionary-based entity extraction has attracted much attention from the database community recently, which locates substrings in a document into predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between substrings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. Firstly, they need to tune many parameters to achieve a high performance. Secondly, they are inefficient for large edit-distance thresholds. We propose a trie-based method to address these problems. We partition each entity into a set of segments. We prove that if a substring of the document is similar to an entity, it must contain a segment of the entity. Based on this observation, we first search segments from the document, and then extend the matching segments in both entities and the document to find similar pairs. To facilitate searching segments, we use a trie structure to index segments and develop an efficient trie-based algorithm. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. The experimental results show that our method achieves much higher performance compared with state-of-the-art studies.

Project page with author contact information. Code coming soon.

In case you are wondering why the project path includes the word “taste”:

To address these problems, we propose a trie-based method for dictionary-based approximate entity extraction with edit distance constraints, called TASTE. TASTE does not need to tune parameters. Moreover TASTE achieves much higher performance, even for large edit-distance thresholds.

Is there a word for a person who creates acronyms? Acronymist perhaps?

Deeply interesting paper on the use of tries for entity extraction. Interesting due to its performance but also because of its approach.

You do remember that tries were what made the original e-version of the OED (Oxford English Dictionary) possible? It was extremely responsive on machines far less powerful than the one in your smart phone.

One wonders how this approach, starting from a set of entities, would fare against the TREC legal archives. But that would be “seeding” the application with entities. It may be that being given queries against a dark corpus isn’t all that realistic.
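The pigeonhole trick at the heart of the paper is easy to sketch. If an entity is split into tau+1 segments, any substring within edit distance tau of it must contain one segment verbatim, so exact segment hits give candidates that are then verified. The sketch below is mine (with a plain dict standing in for the authors' trie) and assumes entities longer than tau+1 characters:

```python
# Pigeonhole-style approximate entity extraction: index entity segments,
# find them exactly in the document, then verify candidate windows by
# edit distance. Not the authors' TASTE code, just the core idea.

def edit_distance(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def partition(entity, tau):
    """Split entity into tau+1 roughly even segments, keeping their offsets."""
    k = tau + 1
    step = len(entity) // k
    cuts = [i * step for i in range(k)] + [len(entity)]
    return [(entity[cuts[i]:cuts[i + 1]], cuts[i]) for i in range(k)]

def extract(document, entities, tau):
    index = {}                      # segment text -> [(entity, segment offset)]
    for e in entities:
        for seg, pos in partition(e, tau):
            index.setdefault(seg, []).append((e, pos))
    hits = set()
    for seg, ents in index.items():
        start = document.find(seg)
        while start != -1:          # every exact occurrence of the segment
            for e, pos in ents:
                for delta in range(-tau, tau + 1):
                    begin = max(start - pos + delta, 0)
                    for length in range(len(e) - tau, len(e) + tau + 1):
                        cand = document[begin:begin + length]
                        if cand and edit_distance(cand, e) <= tau:
                            hits.add((cand, e))
            start = document.find(seg, start + 1)
    return hits

print(extract("met dr jon smith at noon", ["john smith"], tau=1))
# {('jon smith', 'john smith')}
```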

Cross Domain Search by Exploiting Wikipedia

Filed under: Linked Data,Searching,Wikipedia — Patrick Durusau @ 8:04 pm

Cross Domain Search by Exploiting Wikipedia by Chen Liu, Sai Wu, Shouxu Jiang, and Anthony K. H. Tung.

Abstract:

The abundance of Web 2.0 resources in various media formats calls for better resource integration to enrich user experience. This naturally leads to a new cross domain resource search requirement, in which a query is a resource in one modal and the results are closely related resources in other modalities. With cross domain search, we can better exploit existing resources.

Intuitively, tags associated with Web 2.0 resources are a straightforward medium to link resources with different modality together. However, tagging is by nature an ad hoc activity. They often contain noises and are affected by the subjective inclination of the tagger. Consequently, linking resources simply by tags will not be reliable. In this paper, we propose an approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to the tags, the concepts are therefore of higher quality. We develop effective methods for cross-modal search based on the concepts associated with resources. Extensive experiments were conducted, and the results show that our solution achieves good performance.

When the authors say “cross domain,” they are referring to different types of resources, say text vs. images or images vs. sound or any of those three vs. some other type of resource. One search can return “related” resources of different resource types.

Although the “cross domain” searching is interesting, I am more interested in the mapping that was performed on Wikipedia. The authors define three semantic relationships:

  • Link between Tag and Concept
  • Correlation of Concepts
  • Semantic Distance

It seems to me that the authors are attacking “big data,” which has unbounded semantics, from the “other” end. That is, they are mapping a finite universe of semantics (Wikipedia) and then using that finite mapping to mine a much larger, unbounded semantic universe.

Or perhaps creating a semantic lens through which to view “related resources” in a much larger semantic universe. And without the overhead of Linked Data, which is mentioned under other work.
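The paper defines its own measures, but to give a flavor of what a “semantic distance” between two concepts can look like, here is a hedged sketch (not the authors' formulation) that scores distance by the overlap of the pages linking to each concept, in the spirit of the Normalized Google Distance:

```python
import math

def semantic_distance(links_a, links_b, total_pages):
    """Distance between two concepts from the sets of pages that link to them.
    Smaller is semantically closer; no shared pages means maximal distance."""
    common = links_a & links_b
    if not common:
        return float("inf")
    f_a, f_b, f_ab = len(links_a), len(links_b), len(common)
    return (max(math.log(f_a), math.log(f_b)) - math.log(f_ab)) / (
        math.log(total_pages) - min(math.log(f_a), math.log(f_b)))

# Hypothetical inlink sets for two concepts in a 1,000-page corpus.
jaguar_cat = {"p1", "p2", "p3", "p4"}
jaguar_car = {"p3", "p5", "p6"}
print(semantic_distance(jaguar_cat, jaguar_car, total_pages=1000))  # ~0.24
```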

Introducing Spring Hadoop

Filed under: Hadoop,Spring Hadoop — Patrick Durusau @ 8:04 pm

Introducing Spring Hadoop by Costin Leau.

From the post:

I am happy to announce that the first milestone release (1.0.0.M1) for Spring Hadoop project is available and talk about some of the work we have been doing over the last few months. Part of the Spring Data umbrella, Spring Hadoop provides support for developing applications based on Hadoop technologies by leveraging the capabilities of the Spring ecosystem. Whether one is writing stand-alone, vanilla MapReduce applications, interacting with data from multiple data stores across the enterprise, or coordinating a complex workflow of HDFS, Pig, or Hive jobs, or anything in between, Spring Hadoop stays true to the Spring philosophy offering a simplified programming model and addresses "accidental complexity" caused by the infrastructure. Spring Hadoop, provides a powerful tool in the developer arsenal for dealing with big data volumes.

I rather like that, “accidental complexity.” 😉

Still, if you are learning Hadoop, Spring Hadoop may ease the learning curve. Not to mention making application development easier. Your mileage may vary but it is worth a long look.

Graph Degree Distributions using R over Hadoop

Filed under: Graphs,Hadoop,R — Patrick Durusau @ 8:04 pm

Graph Degree Distributions using R over Hadoop

From the post:

The purpose of this post is to demonstrate how to express the computation of two fundamental graph statistics — each as a graph traversal and as a MapReduce algorithm. The graph engines explored for this purpose are Neo4j and Hadoop. However, with respects to Hadoop, instead of focusing on a particular vertex-centric BSP-based graph-processing package such as Hama or Giraph, the results presented are via native Hadoop (HDFS + MapReduce). Moreover, instead of developing the MapReduce algorithms in Java, the R programming language is used. RHadoop is a small, open-source package developed by Revolution Analytics that binds R to Hadoop and allows for the representation of MapReduce algorithms using native R.

The two graph algorithms presented compute degree statistics: vertex in-degree and graph in-degree distribution. Both are related, and in fact, the results of the first can be used as the input to the second. That is, graph in-degree distribution is a function of vertex in-degree. Together, these two fundamental statistics serve as a foundation for more quantifying statistics developed in the domains of graph theory and network science.

Observes that up to 10 billion elements (nodes + edges) can be handled on a single server. In the 100 billion element range, multiple servers are required.

Despite the emphasis on “big data,” 10 billion elements would be sufficient for many purposes.

Interesting use of R with Hadoop.
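The two statistics themselves are simple enough to sketch without Hadoop at all. In plain Python (rather than the post's R-over-Hadoop setup), vertex in-degree is a count of edges by target, and the in-degree distribution is a count of vertices by that in-degree:

```python
# Vertex in-degree, then graph in-degree distribution computed from it.
# The map/reduce analog: map emits (target, 1), reduce sums; repeat on the result.
from collections import Counter

edges = [("a", "b"), ("c", "b"), ("a", "c"), ("d", "b"), ("d", "c")]  # toy edge list

in_degree = Counter(target for _, target in edges)   # step 1: per-vertex in-degree
distribution = Counter(in_degree.values())           # step 2: in-degree distribution

print(in_degree)      # Counter({'b': 3, 'c': 2})
print(distribution)   # Counter({3: 1, 2: 1}) -> one vertex of in-degree 3, one of in-degree 2
```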

Bio4jExplorer, new features and design!

Filed under: Bio4j,Bioinformatics,Medical Informatics — Patrick Durusau @ 8:04 pm

Bio4jExplorer, new features and design!

Pablo Pareja Tobes writes:

I’m happy to announce a new set of features for our tool Bio4jExplorer plus some changes in its design. I hope this may help both potential and current users to get a better understanding of Bio4j DB structure and contents.

Among the new features:

  • Node & Relationship Properties
  • Node & Relationship Data Source
  • Relationships Name Property

It may take time but even with “big data,” the source of data (as an aspect of validity or trust) is going to become a requirement.

Related-work.net – Product Requirement Document released!

Filed under: Bibliography,News — Patrick Durusau @ 8:04 pm

Related-work.net – Product Requirement Document released! by René Pickhardt.

From the post:

Recently I visited my friend Heinrich Hartmann in Oxford. We talked about various issues how research is done in these days and how the web could theoretically help to spread information faster and more efficiently connect people interested in the same paper / topics.

The idea of http://www.related-work.net was born. A scientific platform which is open source and open data and tries to solve those problems.

But we did not want to reinvent the wheel. So we did some research on existing online solutions and also asked people from various disciplines to name their problems. Find below our product requirement document! If you like our approach you can contact us, contribute to the source code, or find some starting documentation!

So the plan is to fork an open source question answer system and enrich it with the features fulfilling the needs of scientists and some social aspects (hopefully using neo4j as a supporting data base technology) which will eventually help to rank related work of a paper.

Feel free to provide us with feedback and wishes and join our effort!

More of a “first cut” at requirements than a requirements document but it is an interesting starting point.

What requirements would you add?

March 11, 2012

Corpus-Wide Association Studies

Filed under: Corpora,Data Mining,Linguistics — Patrick Durusau @ 8:10 pm

Corpus-Wide Association Studies by Mark Liberman.

From the post:

I’ve spent the past couple of days at GURT 2012, and one of the interesting talks that I’ve heard was Julian Brooke and Sali Tagliamonte, “Hunting the linguistic variable: using computational techniques for data exploration and analysis”. Their abstract (all that’s available of the work so far) explains that:

The selection of an appropriate linguistic variable is typically the first step of a variationist analysis whose ultimate goal is to identify and explain social patterns. In this work, we invert the usual approach, starting with the sociolinguistic metadata associated with a large scale socially stratified corpus, and then testing the utility of computational tools for finding good variables to study. In particular, we use the ‘information gain’ metric included in data mining software to automatically filter a huge set of potential variables, and then apply our own corpus reader software to facilitate further human inspection. Finally, we subject a small set of particularly interesting features to a more traditional variationist analysis.

This type of data-mining for interesting patterns is likely to become a trend in sociolinguistics, as it is in other areas of the social and behavioral sciences, and so it’s worth giving some thought to potential problems as well as opportunities.

If you think about it, the social/behavioral sciences are being applied to the results of data mining of user behavior now. Perhaps you can “catch the wave” early on this cycle of research.
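For those who have not met it, the “information gain” metric mentioned in the abstract measures how much knowing a feature's value reduces uncertainty about the social category. A small sketch with made-up sociolinguistic data:

```python
# Information gain of a candidate linguistic variable against a social category.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(labels) minus the weighted entropy of labels within each feature value."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Hypothetical: does a speaker use "like" as a discourse marker, by age group.
uses_like = [1, 1, 1, 0, 0, 0, 1, 0]
age_group = ["young", "young", "young", "old", "old", "old", "young", "old"]
print(information_gain(uses_like, age_group))  # 1.0 -- the feature fully separates the groups
```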

Big Graph-Processing Library From Twitter: Cassovary

Filed under: Cassovary,Graphs — Patrick Durusau @ 8:10 pm

Big Graph-Processing Library From Twitter: Cassovary

Seen on Alex Popescu’s myNoSQL: a report on Twitter’s release of Cassovary, a Scala library for processing large graphs.

See Alex’s post and the Github project page for details.

Challenges of Chinese Natural Language Processing

Filed under: Chinese,Homographs,Natural Language Processing,Segmentation — Patrick Durusau @ 8:10 pm

Thinkudo Labs is posting a series on Chinese natural language processing.

I will be gathering those posts here for ease of reference.

Challenges of Chinese Natural Language Processing – Segmentation

Challenges of Chinese Natural Language Processing – Homograph
(If you are betting this was the post that caught my attention, you are right in one.)

You will need the help of a native Chinese speaker for serious Chinese language processing, but understanding some of the issues ahead of time won’t hurt.
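To see why segmentation is hard, here is a hedged sketch of the classic baseline, forward maximum matching against a dictionary. The dictionary and example are mine, not from the Thinkudo posts:

```python
# Forward maximum matching: greedily take the longest dictionary word at each
# position. Works until an ambiguous string makes the greedy choice wrong.

def fmm_segment(text, dictionary, max_word_len=4):
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

dictionary = {"北京", "大学", "北京大学", "大学生", "生活"}
print(fmm_segment("北京大学生活", dictionary))
# ['北京大学', '生活'] -- greedy matching happens to be right here, but on
# "北京大学生" it returns ['北京大学', '生'] even when 北京 / 大学生 is intended.
```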

Old-style mapping provides a new take on our poverty maps

Filed under: Mapping,Maps,Visualization — Patrick Durusau @ 8:10 pm

Old-style mapping provides a new take on our poverty maps

John Burn-Murdoch writes:

Mapping data is tricky. The normal approach – as we used with our poverty maps today – is to create a choropleth – a map where areas are coloured. But there is another way – and it’s quite old. This intricate visualisation by Oliver O’Brien (via spatialanalysis.co.uk) illustrates the demographics of housing throughout Britain in a style dating back to the 19th Century. Echoing the work of philanthropist Charles Booth, the map highlights groups of buildings rather than block-areas. The result is a much more detailed visualisation, allowing viewers to drill down almost to household level.

The meaningful display of data isn’t a new task. I suspect there are a number of visualization techniques that lie in library stacks waiting to be re-discovered.

AlchemyAPI

Filed under: AlchemyAPI,Data Mining,Machine Learning — Patrick Durusau @ 8:10 pm

AlchemyAPI

From the documentation:

AlchemyAPI utilizes natural language processing technology and machine learning algorithms to analyze content, extracting semantic meta-data: information about people, places, companies, topics, facts & relationships, authors, languages, and more.

API endpoints are provided for performing content analysis on Internet-accessible web pages, posted HTML or text content.

To use AlchemyAPI, you need an access key. If you do not have an API key, you must first obtain one.

I haven’t used it but it looks like a useful service for information products meant for an end user.

Do you use such services? Any others you would suggest?

“All Models are Right, Most are Useless”

Filed under: Modeling,Regression,Statistics — Patrick Durusau @ 8:09 pm

“All Models are Right, Most are Useless”

A counter by Thad Tarpey to George Box’s saying: “all models are wrong, some are useful.” Includes a pointer to slides for the presentation.

Covers the fallacy of “reification” (in the modeling sense) among other amusements.

Useful to remember that maps are approximations as well.

Keyword Searching and Browsing in Databases using BANKS

Filed under: Database,Keywords,Searching — Patrick Durusau @ 8:09 pm

Keyword Searching and Browsing in Databases using BANKS

From the post:

BANKS is a system that enables keyword based searches on a relational database. As a paper that was published 10 years ago in ICDE 2002, it has won the most influential paper award for past decade this year at ICDE. Hearty congrats to the team from IIT Bombay’s CSE department.

Abstract:

With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results.

BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.

The paper: http://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/BanksICDE2002.pdf.

It is a very interesting paper.

BTW, can someone point me to the ICDE proceedings where it was voted best paper of the decade? I am assuming that ICDE = International Conference on Data Engineering. I am sure I am just overlooking the award and would like to include a pointer to it in this post. Thanks!
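Back to the technique: the core idea of BANKS (tuples as graph nodes, foreign keys as edges, an answer as a small tree connecting one matching node per keyword) can be sketched in a few lines. This is my illustration, not their heuristic; BANKS also weights edges and ranks answers by node prestige, which I skip here:

```python
# Toy BANKS-style keyword search: tuples are nodes, foreign-key links are
# edges, and an answer is a small tree connecting a node for each keyword.
# Here the "tree" is just the union of shortest paths from a candidate root.
import networkx as nx

g = nx.Graph()
g.add_edge("author:Sudarshan", "writes:1")   # hypothetical authorship tuples
g.add_edge("writes:1", "paper:BANKS")
g.add_edge("author:Hulgeri", "writes:2")
g.add_edge("writes:2", "paper:BANKS")

def keyword_nodes(graph, keyword):
    return [n for n in graph if keyword.lower() in n.lower()]

def answer_tree(graph, keywords):
    """Pick the root whose combined paths to all keyword matches are smallest."""
    best = None
    for root in graph:
        nodes = set()
        try:
            for kw in keywords:
                paths = [nx.shortest_path(graph, root, m) for m in keyword_nodes(graph, kw)]
                nodes.update(min(paths, key=len))   # closest match for this keyword
        except (nx.NetworkXNoPath, ValueError):
            continue
        if best is None or len(nodes) < len(best):
            best = nodes
    return best

print(answer_tree(g, ["Sudarshan", "BANKS"]))
# e.g. {'author:Sudarshan', 'writes:1', 'paper:BANKS'}
```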

Talend Open Studio for Big Data w/ Hadoop

Filed under: Hadoop,MapReduce,Talend,Tuple MapReduce,Tuple-Join MapReduce — Patrick Durusau @ 8:09 pm

Talend Empowers Apache Hadoop Community with Talend Open Studio for Big Data

From the post:

Talend, a global leader in open source integration software, today announced the availability of Talend Open Studio for Big Data, to be released under the Apache Software License. Talend Open Studio for Big Data is based on the world’s most popular open source integration product, Talend Open Studio, augmented with native support for Apache Hadoop. In addition, Talend Open Studio for Big Data will be bundled in Hortonworks’ leading Apache Hadoop distribution, Hortonworks Data Platform, constituting a key integration component of Hortonworks Data Platform, a massively scalable, 100 percent open source platform for storing, processing and analyzing large volumes of data.

Talend Open Studio for Big Data is a powerful and versatile open source solution for data integration that dramatically improves the efficiency of integration job design through an easy-to-use graphical development environment. Talend Open Studio for Big Data provides native support for Hadoop Distributed File System (HDFS), Pig, HBase, Sqoop and Hive. By leveraging Hadoop’s MapReduce architecture for highly-distributed data processing, Talend generates native Hadoop code and runs data transformations directly inside Hadoop for maximum scalability. This feature enables organizations to easily combine Hadoop-based processing, with traditional data integration processes, either ETL or ELT-based, for superior overall performance.

“By making Talend Open Studio for Big Data a key integration component of the Hortonworks Data Platform, we are providing Hadoop users with the ability to move data in and out of Hadoop without having to write complex code,” said Eric Baldeschwieler, CTO & co-founder of Hortonworks. “Talend provides the most powerful open source integration solution for enterprise data, and we are thrilled to be working with Talend to provide to the Apache Hadoop community such advanced integration capabilities.”
…..

Availability

Talend Open Studio for Big Data will be available in May 2012. A preview version of the product is available immediately at http://www.talend.com/download-tosbd.

Good news but we also know that the Hadoop paradigm is evolving: Tuple MapReduce: beyond the classic MapReduce.

Will early adopters of Hadoop be just as willing to migrate as the paradigm develops?

Are You An Invisible Librarian?

Filed under: Librarian/Expert Searchers,Library — Patrick Durusau @ 8:09 pm

Are librarians choosing to disappear from the information & knowledge delivery process? by Carl Grant.

Carl Grant writes:

As librarians, we frequently strive to connect users to information as seamlessly as possible. A group of librarians said to me recently: “As librarian intermediation becomes less visible to our users/members, it seems less likely it is that our work will be recognized. How do we keep from becoming victims of our own success?”

This is certainly not an uncommon question or concern. As our library collections have become virtual and as we increasingly stop housing the collections we offer, there is a tendency to see us as intermediaries serving as little more than pipelines to our members. We have to think about where we’re adding value to that information so that when delivered to the user/member that value is recognized. Then we need to make that value part of our brand. Otherwise, as stated by this concern, librarians become invisible and that seems to be an almost assured way to make sure our funding does the same. As evidenced by this recently updated chart on the Association of Research Libraries website, this seems to be the track we are on currently:

I ask Carl’s question more directly to make it clear that invisibility is a matter of personal choice for librarians.

Vast institutional and profession wide initiatives are needed but those do not relieve librarians of the personal responsibility for demonstrating the value add of library services in their day to day activities.

It is the users of libraries, those whose projects, research and lives are impacted by librarians who can (and will) come to the support of libraries and librarians, but only if asked and only if librarians stand out as the value-adds in libraries.

Without librarians, libraries may as well be random crates of books. (That might be a good demonstration of the value-add of libraries by the way.) All of the organization, retrieval, and other value adds are present due to librarians. Make that and other value adds visible. Market librarians as value adds at every opportunity.

At the risk of quoting too much, Grant gives a starting list of value adds for librarians:

… Going forward, we should be focusing on more fine-grained service goals and objectives and then selecting technology that supports those goals/objectives.  For instance, in today’s (2012) environment, I think we should be focusing on providing products that support these types of services:

  1. Access to the library collections and services from any device, at anytime from anywhere. (Mobile products)
  2. Massive aggregates of information that have been selected for inclusion because of their quality by either: a) librarians, or b) filtered by communities of users through ranking systems and ultimately reviewed and signed-off by librarians for final inclusion in those aggregates. (Cloud computing products are the foundation technology here)
  3. Discovery workbenches or platforms that allow the library membership to discover existing knowledge and build new knowledge in highly personalized manners. (Discovery products serve as the foundation, but they don’t yet have the necessary extensions)
  4. Easy access and integration of the full range of library services into other products they use frequently, such as course or learning management systems, social networking, discussion forums, etc.  (Products that offer rich API’s, extensive support of Apps and standards to support all types of other extensions)
  5. Contextual support, i.e. the ability for librarianship to help members understand the environment in which a particular piece of work was generated (for instance, Mark Twain’s writings, or scientific research—is this a peer reviewed publication? Who funded it and what are their biases?) is an essential value-add we provide.  Some of this is conveyed by the fact that the item is in collections we provide access to, but other aspects of this will require new products we’ve yet to see.
  6. Unbiased information. I’ve written about this in another post and I strongly believe we aren’t conveying the distinction we offer our members by providing access to knowledge that is not biased by constructs based on data unknown and inaccessible to them.   This is a huge differentiator and we must promote and ensure is understood.  If we do decide to use filtering technologies, and there are strong arguments this is necessary to meet the need of providing “appropriate” knowledge, then we should provide members with the ability to see and/or modify the data driving that filtering.  I’ve yet to see the necessary technology or products that provides good answers here.  
  7. Pro-active services (Analytics).  David Lankes makes the point in many of his presentations (here is one) that library services need to be far more pro-active.  He and I couldn’t agree more.   We need go get out there in front of our member needs.  Someone is up for tenure?  Let’s go to their office.  Find out what they need and get it to them.  (Analytic tools, coupled with massive aggregates of data are going to enable us to do this and a lot more.)

Which of these are you going to reduce down to an actionable item for discussion with your supervisor this week? It is really that simple. The recognition of the value add of librarians is up to you.

March 10, 2012

Exploring Wikipedia with Gremlin Graph Traversals

Filed under: Gremlin,Neo4j,Wikipedia — Patrick Durusau @ 8:21 pm

Exploring Wikipedia with Gremlin Graph Traversals by Marko Rodriguez.

From the post:

There are numerous ways in which Wikipedia can be represented as a graph. The articles and the href hyperlinks between them is one way. This type of graph is known as a single-relational graph because all the edges have the same meaning — a hyperlink. A more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph. For the purpose of this post, a middle ground representation is used. The vertices are Wikipedia articles and Wikipedia categories. The edges are hyperlinks between articles as well as taxonomical relations amongst the categories.

If you aren’t interested in graph representations of data before reading this post, it is likely you will be afterwards.

Take a few minutes to read it and then let me know what you think.
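If you want to play with the middle-ground representation Marko describes before touching a Wikipedia dump, a tiny version is easy to build. This is a Python/networkx illustration of the representation only, not Marko's Gremlin traversals, and the vertices are made up:

```python
# Article and category vertices; 'href' edges between articles, 'category'
# and 'subcategory' edges into the category taxonomy.
import networkx as nx

g = nx.MultiDiGraph()
g.add_edge("Graph theory", "Leonhard Euler", label="href")
g.add_edge("Graph theory", "Category:Graph theory", label="category")
g.add_edge("Category:Graph theory", "Category:Mathematics", label="subcategory")

def categories_of(graph, article):
    """Walk category/subcategory edges upward from an article."""
    found, frontier = set(), [article]
    while frontier:
        node = frontier.pop()
        for _, target, data in graph.out_edges(node, data=True):
            if data["label"] in ("category", "subcategory") and target not in found:
                found.add(target)
                frontier.append(target)
    return found

print(categories_of(g, "Graph theory"))
# {'Category:Graph theory', 'Category:Mathematics'}
```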

Campaign Finance Data in Real Time

Filed under: Data Source,Politics — Patrick Durusau @ 8:21 pm

Campaign Finance Data in Real Time by Derek Willis.

From the post:

Political campaigns can change every day. The Campaign Finance API now does a better job of keeping pace.

We worked with ProPublica, one of the heaviest users of the API, to make the API more real-time, and to surface more data, such as itemized contributions for every presidential candidate and “super PAC”.

When the API was launched, most of the data it served up was updated every week or, in some cases, on a daily basis. But we work for news organizations, and what is news right now can be old news tomorrow. Committees that raise and spend money influencing federal elections are filing reports every day, not just on the day that reports are due.

If you are mapping political data, the New York Times is a real treasure trove of information.

Read this post for more details on real time campaign finance data.

CS262 Programming Languages

Filed under: Programming,Web Browser — Patrick Durusau @ 8:21 pm

CS262 Programming Languages

From the webpage:

Starts: April 16, 2012.

Learn about programming languages while building a web browser! You will understand JavaScript and HTML from the inside-out in this exciting class.

The course is being taught by: Westley Weimer.

Westley Weimer is a Professor of Computer Science at the University of Virginia where he teaches computer science and leads research in programming languages and software engineering. He has won three awards for teaching and over half a dozen “best paper” awards for research. He has MS and PhD degrees from the University of California at Berkeley.

Hard to think of a better way to understand content delivery than by understanding its consumption.

Spreadsheets: Let’s Just Be Friends

Filed under: Marketing — Patrick Durusau @ 8:21 pm

Spreadsheets: Let’s Just Be Friends by Timothy Powers.

From the post:

Have you ever had that awkward conversation with a significant other where they tell you they just want to be friends?

Sometimes the news is hard to swallow. It forces you to ask yourself, “What could I have done better?”

This same tough conversation needs to happen with certain software applications too. People just stay in relationships with software for too long. That said, it’s time to have the “friend talk” and break up with spreadsheets.

You’ve never really loved them. It’s been a relationship of convenience – they just showed up one day on your laptop and the rest was history. Yes, they’re nice and have a good personality (as much as software can), but it’s time to cut the cord and just be friends.

You have to really admire the copy writers for IBM. Without a hint of a blush, the post concludes with notice of an upcoming presentation (March 7, 2012) that promises you will: “…see new solutions that will give you a more personal relationship with your data.”

So we go from “just being friends,” to a “more personal relationship” in a scant number of lines.

To be honest, I have never really wanted a personal relation with my data. Or with my computer for that matter. One is a resource and the other is a tool, nothing more. Maybe that view depends on your social skills. 😉

This piece may help you cast doubt on the suitability of spreadsheets for all cases but I would avoid promising that topic maps or any other technology offers a “personal relationship.” (Do you know if common law still has a breach of promise of marriage action?)

THOMSON REUTERS NEWS ANALYTICS FOR INTERNET NEWS AND SOCIAL MEDIA

Filed under: Marketing,News — Patrick Durusau @ 8:21 pm

THOMSON REUTERS NEWS ANALYTICS FOR INTERNET NEWS AND SOCIAL MEDIA

From the post:

Thomson Reuters News Analytics (TRNA) for Internet News and Social Media is a powerful tool that allows users to analyze millions of public and premium sources of internet content, tag and filter that content to focus on the most relevant sources, and turn the mass of data into actionable analytics that can be used to support trading, investment and risk management decisions.

The TRNA engine is based on tried and tested technology that is widely deployed by trading firms to analyze Reuters News and a host of other professional news wire services. TRNA for Internet News and Social Media leverages this core technology to analyze content sourced in collaboration with Moreover Technologies, which aggregates content from more than four million social media channels and 50,000 Internet news sites. This content is then analyzed in real-time by the TRNA engine, generating an output of quantifiable data points across a number of dimensions such as sentiment, relevance, and novelty. These and many other metrics can help analysts understand with greater context, what is being said and how it is being said across a number of media channels for a more complete picture.

I mention this story not because I think Thomson Reuters is using topic maps or that you will be likely to compete with them in the same markets.

No, I mention this story because Thomson Reuters doesn’t offer services for which there is no demand. That is to say that repackaging of information from “big data,” and other sources offers a new market for information products.

Using topic maps as the basis for repackaging streams of data into information products will enable you to leverage the talents of one analyst across any number of products, instead of tasking analysts with producing new graphics of well-known information.

Conceptual colors, negative proportions, mysterious axes, and all that

Filed under: Communication,Graphics,Visualization — Patrick Durusau @ 8:20 pm

Conceptual colors, negative proportions, mysterious axes, and all that

Junk Charts asks a serious question about the use of color for race representation but also has the oddest graphic I have seen in some time.

In other words, the oddest I have seen since the last time I looked at Junk Charts. It is a great site.

I report this graphic for your amusement and/or more serious discussion of the use of color for race representation.

Good graphics don’t guarantee communication, but bad graphics… well, you know that part.

Jer Thorpe on “Data” and “History”

Filed under: Graphics,Visualization — Patrick Durusau @ 8:20 pm

Jer Thorpe on “Data” and “History”

From the description:

At TEDxVancouver in November of last year, data artist and NYTLabs artist in residence Jer Thorpe talked about all matters relating to “data” and “history”, meanwhile showcasing his impressive range of creative data visualization projects, that all combine qualities of science, art and design.

A deeply moving and informative presentation on “data” and “history.”

Among other things, Jer talks about the HyperCard program for the early Macs, which enabled users to create not only their own information tracking programs but their own programs proper.

And you get a peek at exploration and visualization tools that “place data in human context.”

The visualization of threads of communication about New York Times stories is my current favorite. Which one is yours?

The thought does occur to me that having a HyperCard-like application to enable analysts to create interchangeable mappings or analyses of data could be quite useful in a number of contexts.

Ad targeting at Yahoo

Filed under: Ad Targeting,Marketing,User Targeting — Patrick Durusau @ 8:20 pm

Ad targeting at Yahoo by Greg Linden.

From the post:

A remarkably detailed paper, “Web-Scale User Modeling for Targeting” (PDF), will be presented at WWW 2012 that gives many insights into how Yahoo does personalized advertising.

In summary, the researchers describe a system used in production at Yahoo that does daily builds of large user profiles. Each profile contains tens of thousands of features that summarize the interests of each user from the web pages they have viewed, searches they made, and ads they have viewed, clicked on, and converted (bought something) on. They explain how important it is to use conversions, not just ad clicks, to train the system. They measure the importance of using recent history (what you did in the last couple days), of using fine-grained data (detailed categories and even some specific pages and queries), of using large profiles, and of including data about ad views (which is a huge and low quality data source since there are multiple ad views per page view), and find all those significantly help performance.

You need to read the paper and Greg’s analysis (+ additional references) if you are interested in user profiles/marketing.

Even if you are not, I think the paper offers a window into one view of user behavior. Whether that view works for you, your ad clients or topic map applications, is another question.
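As a back-of-the-envelope version of what the paper describes, a daily profile build reduces to weighted counts of features per user, with conversions weighted above clicks and views and older events decayed. The weights and decay below are my assumptions, not Yahoo's:

```python
# Toy user-profile build: per-user feature scores from an event log, with
# event-type weights and a per-day decay so recent history counts for more.
from collections import defaultdict

EVENT_WEIGHT = {"view": 0.1, "click": 1.0, "conversion": 5.0}  # assumed weights
DECAY = 0.9                                                    # assumed per-day decay

def build_profiles(events, today):
    """events: (user, feature, event_type, day) tuples -> {user: {feature: score}}."""
    profiles = defaultdict(lambda: defaultdict(float))
    for user, feature, event_type, day in events:
        profiles[user][feature] += EVENT_WEIGHT[event_type] * (DECAY ** (today - day))
    return profiles

events = [
    ("u1", "cat:travel", "view", 10),
    ("u1", "cat:travel", "click", 12),
    ("u1", "query:hotels paris", "conversion", 12),
]
print(dict(build_profiles(events, today=13)["u1"]))
# {'cat:travel': ~0.97, 'query:hotels paris': 4.5}
```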

Lang.Next 2012

Filed under: Conferences,Programming — Patrick Durusau @ 8:20 pm

Lang.Next 2012

April 2-4, 2012 – Redmond, WA

From the post:

Lang.NEXT 2012 is a cross-industry conference for programming language designers and implementers on the Microsoft Campus in Redmond, Washington. With three days of talks, panels and discussion on leading programming language work from industry and research, Lang.NEXT is the place to learn, share ideas and engage with fellow programming language design experts and enthusiasts. Native, functional, imperative, object oriented, static, dynamic, managed, interpreted… It’s a programming language geek fest.

Suspects for recruited presentations:

  • Andy Gordon, Microsoft
  • Andy Moran, Galois
  • Donna Malayeri, Microsoft
  • Dustin Campbell, Microsoft
  • Erik Meijer, Microsoft
  • Gilad Bracha, Google
  • Herb Sutter, Microsoft
  • James Noble, Victoria University of Wellington
  • Jeroen Frijters, Sumatra Software
  • John Cook, University of Texas Graduate School of Biomedical Sciences
  • Kim Bruce, Pomona College
  • Kunle Olukotun, Stanford
  • Luke Hoban, Microsoft
  • Mads Torgersen, Microsoft
  • Martin Odersky, EPFL, Typesafe
  • Martyn Lovell, Microsoft
  • Peter Alvaro, University of California at Berkeley
  • Robert Griesemer, Google
  • Walter Bright, Digital Mars
  • William Cook, University of Texas at Austin

Tweets, blogs, slides/papers, and videos for those of us unable to attend would be appreciated!

