May 16th, 2012
Actually the post is titled: Introducing the Knowledge Graph: things, not strings.
It reads in part:
Search is a lot about discovery—the basic human need to learn and broaden your horizons. But searching still requires a lot of hard work by you, the user. So today I’m really excited to launch the Knowledge Graph, which will help you discover new information quickly and easily.
Take a query like [taj mahal]. For more than four decades, search has essentially been about matching keywords to queries. To a search engine the words [taj mahal] have been just that—two words.
But we all know that [taj mahal] has a much richer meaning. You might think of one of the world’s most beautiful monuments, or a Grammy Award-winning musician, or possibly even a casino in Atlantic City, NJ. Or, depending on when you last ate, the nearest Indian restaurant. It’s why we’ve been working on an intelligent model—in geek-speak, a “graph”—that understands real-world entities and their relationships to one another: things, not strings.
The Knowledge Graph enables you to search for things, people or places that Google knows about—landmarks, celebrities, cities, sports teams, buildings, geographical features, movies, celestial objects, works of art and more—and instantly get information that’s relevant to your query. This is a critical first step towards building the next generation of search, which taps into the collective intelligence of the web and understands the world a bit more like people do.
Google’s Knowledge Graph isn’t just rooted in public sources such as Freebase, Wikipedia and the CIA World Factbook. It’s also augmented at a much larger scale—because we’re focused on comprehensive breadth and depth. It currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects. And it’s tuned based on what people search for, and what we find out on the web.
Google just set the bar for search/information appliances, including topic maps.
What is the value add of your appliance when compared to Google?
When people ask me to explain topic maps now I can say:
You know Google’s Knowledge Graph? It’s like that but customized to your interests and data.
(I would just leave it at that. Let them start imagining what they want to do beyond the reach of Google. In their “dark data.”)
Who knew? Google advertising for topic maps. Without any click-through. Amazing.
Posted in Google Knowledge Graph, Marketing, Topic Maps | 1 Comment »
May 16th, 2012
Mobilizing Knowledge Networks for Development
June 19—20, 2012
The World Bank Group
1818 H Street NW, Washington DC 20433
From the webpage:
The goal of the workshop is to explore ways to become better providers and connectors of knowledge in a world where the sources of knowledge are increasingly diverse and disbursed. At the World Bank, for example, we are seeking ways to connect with new centers of research, emerging communities of practice, and tap the practical experience of development organizations and the policy makers in rapidly developing economies. Our goal is to find better ways to connect those that have the development knowledge with those that need it, when they need it.
We are also seeking to engage research communities and civil society organizations through an Open Development initiative that makes data and publications freely available. We understand that many other organizations are exploring similar initiatives. The Conference and Knowledge fair will provide an opportunity for knowledge organizations working in development to learn from one another about their knowledge services, practices, and successes and challenges in providing these services.
You can register to attend in person or over the Internet.
As always, networking opportunities are what you make of them. This will be a good opportunity to spread the good news about topic maps.
Posted in Conferences, Marketing | No Comments »
May 16th, 2012
From the Bin Laden Letters: Mapping OBL’s Reach into Yemen
I puzzled over this headline. A close friend refers to President Obama as “OB1,” so I had a moment of confusion when reading it. It didn’t make sense for Bin Laden’s letters to map President Obama’s reach into Yemen.
With some diplomatic cables and White House internal documents, that would be an interesting visualization as well.
The visualizations come from mining a larger corpus of 70,000+ public sources for individuals mentioned in the Bin Laden letters.
What we don’t know is what means of analysis produced the visualizations in question.
Some process was used, just by way of example, to reduce redundant references to the same actors, events and relationships.
That isn’t a complaint, simply an observation. It isn’t possible to evaluate the techniques used to obtain the results.
It would be interesting to see Recorded Future in one of the TREC competitions. At least then the results would be against a shared data set.
Do be aware that when the text says “open source,” what is meant is “open source intelligence.”
The better practice would be to say “open source intelligence (OSINT)” and not “open source,” the latter having a well-recognized meaning in the software community.
Posted in Intelligence | No Comments »
May 16th, 2012
Need cash? NLnet advances open source technology by funding new projects
Next Round of Ideas Due: June 1st 2012.
Lead story at OpenSource.com today.
From the story:
If you have a valuable idea or project that can help create a more open global information society, and are looking for financial means to make your ideas come through, we might be able to help you. Indeed our mission is to fund open source projects and individuals to improve important and strategic networking technologies for the better of mankind. Whether this concerns more robust internet technologies and standards, privacy enhancing technologies or open document formats – we are open for your proposals.
We are independent. We are not like other funding bodies you may have experience with, because we only have to judge on quality and relevance, and not on politics or any other dimension. What is important for us is that the technology you develop and promote is usable for others and has real impact. And we are also interested to hear your inspiring ideas if you are unable to manage it yourself.
We spend our money in supporting strategic initiatives that contribute to an open information society, especially where these are aimed at development and dissemination of open standards and network related technology.
More details in the story or at the NLnet website.
What’s your great idea?
Posted in Funding, Open Source | No Comments »
May 16th, 2012
OpenSource.com
Not sure how I got to OpenSource.com but it showed up as a browser tab after a crash. Maybe it is a new feature and not a bug.
Thought I would take the opportunity to point it out (and record it here) as a source of projects and news from the open source community.
Not to mention data sets, source code, marketing opportunities, etc.
Posted in Open Data, Open Source | No Comments »
May 16th, 2012
Identifying And Weighting Integration Hypotheses On Open Data Platforms by Julian Eberius, Katrin Braunschweig, Maik Thiele, and Wolfgang Lehner.
Abstract:
Open data platforms such as data.gov or opendata.socrata.com provide a huge amount of valuable information. Their free-for-all nature, the lack of publishing standards and the multitude of domains and authors represented on these platforms lead to new integration and standardization problems. At the same time, crowd-based data integration techniques are emerging as new way of dealing with these problems. However, these methods still require input in form of specific questions or tasks that can be passed to the crowd. This paper discusses integration problems on Open Data Platforms, and proposes a method for identifying and ranking integration hypotheses in this context. We will evaluate our findings by conducting a comprehensive evaluation using on one of the largest Open Data platforms.
This is interesting work on Open Data platforms but it is marred by claims such as:
Open Data Platforms have some unique integration problems that do not appear in classical integration scenarios and which can only be identified using a global view on the level of datasets. These problems include partial- or duplicated datasets, partitioned datasets, versioned datasets and others, which will be described in detail in Section 4.
Really?
That would come as a surprise to the World Data Centre for Aerosols, which had Synthesis and INtegration of Global Aerosol Data Sets, Contract No. ENV4-CT98-0780 (DG 12 –EHKN), produced on data sets from 1999 to 2001. One of the specific issues they addressed was duplicate data sets.
More than a decade ago counts as a “classical integration scenario,” I think.
Another quibble. Cited sources do not support the text.
New forms of data management such as dataspaces and pay-as-you-go data integration [2, 6] are a hot topic in database research. They are strongly related to Open Data Platforms in that they assume large sets of heterogeneous data sources lacking a global or mediated schemata, which still should be queried uniformly.
…
2 M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Rec., 34:27–33, December 2005.
…
6 J. Madhavan, S. R. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, A. Halevy, and G. Inc. Web-scale Data Integration: You Can Only Afford to Pay As You Go. In Proc. of CIDR-07, 2007.
Articles written seven (7) and five (5) years ago do not justify a “hot topic in database research” claim today.
There are other issues, major and minor but for all that, this is important work.
I want to see reports that do justice to its importance.
Posted in Crowd Sourcing, Data Integration, Integration, Open Data | No Comments »
May 16th, 2012
Steve Miller writes in Politics of Data Models and Mining:
I recently came across an interesting thread, “Is data mining still a sin against the norms of econometrics?”, from the Advanced Business Analytics LinkedIn Discussion Group. The point of departure for the dialog is a paper entitled “Three attitudes towards data mining”, written by couple of academic econometricians.
The data mining “attitudes” range from the extremes that DM techniques are to be avoided like the plague, to one where “data mining is essential and that the only hope that we have of using econometrics to uncover true economic relationships is to be found in the intelligent mining of data.” The authors note that machine learning phobia is currently the norm in economics research.
Why is this? “Data mining is considered reprehensible largely because the world is full of accidental correlations, so that what a search turns up is thought to be more a reflection of what we want to find than what is true about the world.” In contrast, “Econometrics is regarded as hypothesis testing. Only a well specified model should be estimated and if it fails to support the hypothesis, it fails; and the economist should not search for a better specification.”
In other words, econometrics focuses on explanation, expecting its practitioners to generate hypotheses for testing with regression models. ML, on the other hand, obsesses on discovery and prediction, often content to let the data talk directly, without the distraction of “theory.” Just as bad, the results of black-box ML might not be readily interpretable for tests of economic hypotheses.
Watching other communities fight over odd questions is always more enjoyable than serious disputes of grave concern in our own. (See Using “Punning” to Answer httpRange-14 for example.)
I mention the economists’ dispute not simply to make jests at the expense of “econometricians.” (Do topic map supporters need a difficult name? TopicMapologists? Too short.)
The economists’ debate is missing an understanding that modeling requires some knowledge of the domain (mining, whether formal or informal) and mining requires some idea of an output (models, whether spoken or unspoken). A failing that is all too common across modeling/mining domains.
To put it another way:
We never stumble upon data that is “untouched by human hands.”
We never build models without knowledge of the data we are modeling.
The relevant question is: Does the model or data mining provide a useful result?
(Typically measured by your client’s joy or sorrow over your results.)
Posted in Data Mining, Data Models | No Comments »
May 16th, 2012
Have you ever gotten an advertising email with clean links in it? I mean a link without all the marketing crap appended to the end. The stuff you have to clean off before using it in a post or sending it to a friend?
Got my first one today. From Skills Matter on the free videos for their Progressive NoSQL Tutorials that just concluded.
High quality presentations, videos freely available after presentation, friendly links in email, just a few of the reasons to support Skills Matter.
The tutorials:
- Cassandra
- Consistency
- Couchbase
- CouchDB
- MongoDB
- Neo4j
- RavenDB
- Riak
Posted in Cassandra, CouchDB, Couchbase, MongoDB, Neo4j, NoSQL, RavenDB, Riak | No Comments »
May 16th, 2012
Multi-word synonym filter (synonym expansion at indexing time) Lucene-1622
From the description:
It would be useful to have a filter that provides support for indexing-time synonym expansion, especially for multi-word synonyms (with multi-word matching for original tokens).
The problem is not trivial, as observed on the mailing list. The problems I was able to identify (mentioned in the unit tests as well):
- if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., “big” in the synonym “big apple” for “new york city”) causes the document to match;
- there are problems with highlighting the original document when synonym is matched (see unit tests for an example),
- if the synonym is of different length than the original sequence of tokens to be matched, then phrase queries spanning the synonym and the original sequence boundary won’t be found. Example “big apple” synonym for “new york city”. A phrase query “big apple restaurants” won’t match “new york city restaurants”.
I am posting the patch that implements phrase synonyms as a token filter. This is not necessarily intended for immediate inclusion, but may provide a basis for many people to experiment and adjust to their own scenarios.
This remains an open issue as of 16 May 2012.
It is also an important open issue.
Think about it.
As “big data” gets larger and larger, at some point traditional ETL isn’t going to be practical. Due to storage, performance, selective granularity or other issues, ETL is going to fade into the sunset.
Indexing, on the other hand, which treats data “in situ” (“in position” for you non-archaeologists in the audience), avoids many of the issues with ETL.
The treatment of synonyms (synonyms across data sets, multi-word synonyms, specifying the ranges of synonyms for both indexing and search, synonym expansion, a whole range of synonym features and capabilities) needs to “man up” to take on “big data.”
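To make the first and third problems from the issue description concrete, here is a minimal sketch in plain Python. It does not use Lucene or the patch attached to the issue; the function names and the toy position index are my own assumptions, meant only to model what happens when a shorter synonym is stacked on the positions of a longer original phrase.

```python
from collections import defaultdict

def index_with_overlapping_synonym(tokens, phrase, synonym):
    """Build a toy position -> terms index, stacking the synonym's tokens
    on the positions where the original phrase starts (naive expansion)."""
    positions = defaultdict(set)
    for pos, tok in enumerate(tokens):
        positions[pos].add(tok)
    n = len(phrase)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            for offset, syn_tok in enumerate(synonym):
                positions[start + offset].add(syn_tok)
    return dict(positions)

def term_matches(positions, term):
    """True if the single term occurs at any position."""
    return any(term in terms for terms in positions.values())

def phrase_matches(positions, phrase):
    """True if the words of the phrase occur at consecutive positions."""
    return any(
        all(word in positions.get(start + i, ()) for i, word in enumerate(phrase))
        for start in positions
    )

doc = "new york city restaurants".split()
index = index_with_overlapping_synonym(
    doc, ["new", "york", "city"], ["big", "apple"]
)

# Problem 1: a partial synonym ("big") now matches the document.
print(term_matches(index, "big"))
# Problem 3: the phrase spanning the synonym boundary is not found,
# because "apple" sits at position 1 and "restaurants" at position 3.
print(phrase_matches(index, ["big", "apple", "restaurants"]))
```

In this toy index "big" and "apple" land on positions 0 and 1 while "city" keeps position 2, which is why the two-token synonym cannot line up with the token that follows the three-token original.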
Posted in Indexing, Lucene, Synonymy | No Comments »
May 16th, 2012
Managing context data for diverse operating spaces by Wenwei Xue, Hung Keng Pung, and Shubhabrata Sen.
Abstract:
Context-aware computing is an exciting paradigm in which applications perceive and react to changing environments in an unattended manner. To enable behavioral adaptation, a context-aware application must dynamically acquire context data from different operating spaces in the real world, such as homes, shops and persons. Motivated by the sheer number and diversity of operating spaces, we propose a scalable context data management system in this paper to facilitate data acquisition from these spaces. In our system, we design a gateway framework for all operating spaces and develop matching algorithms to integrate the local context schemas of operating spaces into a global set of domain schemas upon which SQL-based context queries can be issued from applications. The system organizes the operating space gateways as peers in semantic overlay networks and employs distributed query processing techniques over these overlays. Evaluation results on a prototype implementation demonstrate the effectiveness of our system design.
This article came up in a sweep for “semantic overlay networks.”
Encouraging recognition that results may need to vary based on physical context. Who knows? Perhaps recognition that the terminology for one domain and its journals/authors/monographs has different semantics than other domains.
Imagine that, a system that manages queries across semantic domains for users, as opposed to users having to understand all the possible semantic domains in advance to have useful query results (or better query results).
Perhaps the “context” metaphor may be a useful one in marketing topic maps. Less aggressive than “silo.” Let the client come up with that to characterize competing agencies or information sources.
“Context” in the sense of physical space is popular among the smart phone crowd so don’t neglect that as an avenue for topic maps as well. (Looking at your surroundings would mean breaking eye contact with your phone. Might miss an ad or something.)
Posted in Context-aware, Semantic Overlay Network | No Comments »
May 15th, 2012
No sorting and lack of structure undermine a chart
Kaiser Fung takes the Guardian newspaper, yes, that Guardian, to task for poor graphics on gay rights in the United States.
When people are critical of your graphics, take heart that even experts fail from time to time.
Posted in Graphics, Visualization | No Comments »
May 15th, 2012
History matters by Gene Golovchinsky.
Whose history? Your history. Your search history. Visualized.
Interested? Read more:
Exploratory search is an uncertain endeavor. Quite often, people don’t know exactly how to express their information need, and that need may evolve over time as information is discovered and understood. This is not news.
When people search for information, they often run multiple queries to get at different aspects of the information need, to gain a better understanding of the collection, or to incorporate newly-found information into their searches. This too is not news.
The multiple queries that people run may well retrieve some of the same documents. In some cases, there may be little or no overlap between query results; at other times, the overlap may be considerable. Yet most search engines treat each query as an independent event, and leave it to the searcher to make sense of the results. This, to me, is an opportunity.
Design goal: Help people plan future actions by understanding the present in the context of the past.
While web search engines such as Bing make it easy for people to re-visit some recent queries, and early systems such as Dialog allowed Boolean queries to be constructed by combining results of previously-executed queries, these approaches do not help people make sense of the retrieval histories of specific documents with respect to a particular information need. There is nothing new under the sun, however: Mark Sanderson’s NRT system flagged documents as having been previously retrieved for a given search task, VOIR used retrieval histograms for each document, and of course a browser maintains a limited history of activity to indicate which links were followed.
Our recent work in Querium (see here and here) seeks to explore this space further by providing searchers with tools that reflect patterns of retrieval of specific documents within a search mission.
Even more interested? Read Gene’s post in full.
If not, check your pulse.
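The idea of treating queries as related events rather than independent ones can be pictured with a small sketch. This is not Querium’s implementation; the function names and the sample session are hypothetical. It simply counts, for each document, how many earlier queries in a session retrieved it, then annotates a new result list with those counts.

```python
from collections import Counter

def retrieval_counts(query_results):
    """Map each document id to the number of queries that retrieved it.

    query_results: dict of query string -> list of document ids.
    """
    counts = Counter()
    for docs in query_results.values():
        counts.update(set(docs))  # count a document once per query
    return counts

def annotate(results, history_counts):
    """Pair each document in a new result list with its retrieval count."""
    return [(doc, history_counts.get(doc, 0)) for doc in results]

# Hypothetical session: three earlier queries and the documents they returned.
session = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d2", "d4"],
    "q3": ["d2", "d3", "d5"],
}
counts = retrieval_counts(session)
print(annotate(["d3", "d6"], counts))  # [('d3', 2), ('d6', 0)]
```

A document with a high count keeps resurfacing across queries, and one with a count of zero is new to the session. Either signal may be worth showing to the searcher.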
Posted in Search Behavior, Search Engines, Search History | No Comments »
May 15th, 2012
SIAM Data Mining 2012 Conference
Ryan Rosario writes:
From April 26-28 I had the pleasure to attend the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks that had more substance.
Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get everything out of Disneyland as I could. Seeing adults wearing Mickey ears carrying Mickey shaped balloons, and seeing girls dressed up as their favorite Disney princesses screams “fun” rather than “business”, but I managed to make time for both.
The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to run there, wait in line, guide myself through crowds, wait in line, get my food, eat it, and run back to the conference in 90 minutes on a weekend. After lunch on the first two days was another plenary session followed by breakout sessions. The evening of the first two days was reserved for poster sessions. Saturday hosted half-day and full-day workshops.
Below is my summary of the conference. Of course, such a summary is very high level my description may miss things, or may not be entirely correct if I misunderstood the speaker.
I doubt Ryan would claim his summary is “as good as being there” but in the absence of attending, you could do far worse.
Suggestions of papers from the conference that I should read first?
Posted in Conferences, Data Mining | No Comments »
May 15th, 2012
Using “Punning” to Answer httpRange-14
Jeni Tennison writes in her introduction:
As part of the TAG’s work on httpRange-14, Jonathan Rees has assessed how a variety of use cases could be met by various proposals put before the TAG. The results of the assessment are a matrix which shows that “punning” is the most promising method, unique in not failing on either ease of use (use case J) or HTTP consistency (use case M).
In normal use, “punning” is about making jokes based around a word that has two meanings. In this context, “punning” is about using the same URI to mean two (or more) different things. It’s most commonly used as a term of art in OWL but normal people don’t need to worry particularly about that use. Here I’ll explore what that might actually mean as an approach to the httpRange-14 issue.
Jeni writes quite well and if you are really interested in the details of this self-inflicted wound, read her post in its entirety.
The post is summarized when she says:
Thus an implication of this approach is that the people who define languages and vocabularies must specify what aspect of a resource a URI used in a particular way identifies.
Her proposal makes disambiguation explicit. A strategy that is more likely to be successful than others.
Following that statement she treats how to usefully proceed from that position. (No guarantee her position will carry the day but it would be a good thing if it does.)
Posted in Linked Data, RDF, Semantic Web | 1 Comment »
May 15th, 2012
Open Data Visualization: Keeping Traces of the Exploration Process by Benoît Otjacques, Mickaël Stefas, Maël Cornil, and Fernand Feltz.
Abstract:
This paper describes a system to support the visual exploration of Open Data. During his/her interactive experience with the graphics, the user can easily store the current complete state of the visualization application (called a viewpoint). Next, he/she can compose sequences of these viewpoints (called scenarios) that can easily be reloaded. This feature allows to keep traces of a former exploration process, which can be useful in single user (to support investigation carried out in multiple sessions) as well as in collaborative setting (to share points of interest identified in the data set).
I was unaware of this paper when I wrote my “knowledge toilet” post earlier today. This looks like an interesting starting point for discussion.
Just speculating but I think there will be a “sweet spot” for how much effort users will devote to recording their input. For some purposes it will need to be almost automatic. Like the relationship between search terms and links users choose. Crude but somewhat effective.
On the other hand, there will be professional researchers/authors who want to sell their semantic annotations/mappings of resources.
And applications/use cases in between.
Posted in Open Data, Visualization | 1 Comment »
May 15th, 2012
Operations on soft sets revisited by Ping Zhu and Qiaoyan Wen.
Abstract:
Soft sets, as a mathematical tool for dealing with uncertainty, have recently gained considerable attention, including some successful applications in information processing, decision, demand analysis, and forecasting. To construct new soft sets from given soft sets, some operations on soft sets have been proposed. Unfortunately, such operations cannot keep all classical set-theoretic laws true for soft sets. In this paper, we redefine the intersection, complement, and difference of soft sets and investigate the algebraic properties of these operations along with a known union operation. We find that the new operation system on soft sets inherits all basic properties of operations on classical sets, which justifies our definitions.
An interesting paper that will get you interested in soft sets if you aren’t already.
It isn’t easy going, even with the Alice and Bob examples, which I am sure the authors found immediately intuitive.
If you have data where numeric values cannot be assigned, it will be worth your while to explore this paper and the literature on soft sets.
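For readers new to the idea, a rough sketch may help. A soft set over a universe U is usually described as a mapping from parameters to subsets of U. The code below represents that mapping as a Python dict and implements the union commonly given in the soft-set literature (take the parameters of both sets, and union the subsets where a parameter is shared). Whether this is exactly the “known union operation” the authors rely on is my assumption, and the redefined intersection, complement, and difference from the paper are not reproduced here. The data is made up.

```python
def soft_union(f, g):
    """Union of two soft sets given as dicts: parameter -> set of objects.

    Parameters found in only one soft set keep their subset; shared
    parameters get the union of the two subsets.
    """
    return {
        param: set(f.get(param, set())) | set(g.get(param, set()))
        for param in f.keys() | g.keys()
    }

# Hypothetical universe of houses h1..h4, described by non-numeric parameters.
first = {"cheap": {"h1", "h2"}, "modern": {"h3"}}
second = {"cheap": {"h2", "h4"}, "wooden": {"h1"}}

combined = soft_union(first, second)
print(sorted(combined["cheap"]))  # ['h1', 'h2', 'h4']
```

Note that nothing here assigns a number to “cheap” or “modern”. Each parameter simply picks out the objects it applies to.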
Posted in Sets, Soft Sets | No Comments »
May 15th, 2012
Improving Schema Matching with Linked Data by Ahmad Assaf, Eldad Louw, Aline Senart, Corentin Follenfant, Raphaël Troncy, and David Trastour.
Abstract:
With today’s public data sets containing billions of data items, more and more companies are looking to integrate external data with their traditional enterprise data to improve business intelligence analysis. These distributed data sources however exhibit heterogeneous data formats and terminologies and may contain noisy data. In this paper, we present a novel framework that enables business users to semi-automatically perform data integration on potentially noisy tabular data. This framework offers an extension to Google Refine with novel schema matching algorithms leveraging Freebase rich types. First experiments show that using Linked Data to map cell values with instances and column headers with types improves significantly the quality of the matching results and therefore should lead to more informed decisions.
Personally I don’t find mapping Airport -> Airport Code all that convincing a demonstration.
The other problem I have is what happens after a user “accepts” a mapping?
Now what?
I can contribute my expertise to mappings between diverse schemas all day, even public ones.
What happens to all that human effort?
It is what I call the “knowledge toilet” approach to information retrieval/integration.
Software runs (I can’t count the number of times integration software has been run on Citeseer. Can you?) and a user corrects the results as best they are able.
Now what?
Oh, yeah, the next user or group of users does it all over again.
Why?
Because the user before them flushed the knowledge toilet.
The information had been mapped. Possibly even hand corrected by one or more users. Then it is just tossed away.
That has to seem wrong at some very fundamental level. Whatever semantic technology you choose to use.
I’m open to suggestions.
How do we stop flushing the knowledge toilet?
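As a starting point for discussion only, here is a minimal sketch of not flushing: accepted mappings are written to a file so the next user or run can look them up instead of redoing the work. The class name, file format, and fields are all hypothetical. This is not part of Google Refine or the framework in the paper, and it ignores the harder questions of disagreement between users and changing schemas.

```python
import json
from pathlib import Path

class MappingStore:
    """Keep user-accepted schema mappings in a JSON file between runs."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload whatever earlier users accepted, if the file exists.
        if self.path.exists():
            self.mappings = json.loads(self.path.read_text())
        else:
            self.mappings = {}

    def accept(self, source_column, target_type, user):
        """Record that `user` accepted source_column -> target_type."""
        entry = self.mappings.setdefault(
            source_column, {"target": target_type, "accepted_by": []}
        )
        entry["target"] = target_type
        if user not in entry["accepted_by"]:
            entry["accepted_by"].append(user)
        self.path.write_text(json.dumps(self.mappings, indent=2))

    def lookup(self, source_column):
        """Return the previously accepted target type, or None."""
        entry = self.mappings.get(source_column)
        return entry["target"] if entry else None
```

One user calls `accept("Airport", "Airport Code", "alice")`. A later session that opens the same file gets "Airport Code" back from `lookup("Airport")` without anyone repeating the correction.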
Posted in Linked Data, Schema | 4 Comments »
May 15th, 2012
Introducing Neo4j into a Relational Database Organisation
The details:
What: Neo4j User Group: Introducing Neo4j into a Relational Database Organisation
Where: The Skills Matter eXchange, London
When: 23 May 2012 Starts at 18:30
From the webpage:
This month, Toby O’Rourke and Michael McCarthy present their experiences of introducing Neo4j into Gamesys: a Relational Database Organisation.
You will hear about Toby and Michael’s experiences, including
- the path taken from spring data through tinkerpop, to straight neo then spring data again
- Satisfying the reporting requirements of a place built on a data warehouse approach
- Modelling our domain
- Experience of support contracts and the community as a whole
Just in case you need an additional reason to be in London on 23 May 2012, consult London Drum City Guide.
Posted in Neo4j, RDBMS | No Comments »
May 15th, 2012
Electronic Discovery Institute
From the home page:
The Electronic Discovery Institute is a non-profit organization dedicated to resolving electronic discovery challenges by conducting studies of litigation processes that incorporate modern technologies. The explosion in volume of electronically stored information and the complexity of its discovery overwhelms the litigation process and the justice system. Technology and efficient processes can ease the impact of electronic discovery.
The Institute operates under the guidance of an independent Board of Diplomats comprised of judges, lawyers and technical experts. The Institute’s studies will measure the relative merits of new discovery technologies and methods. The results of the Institute’s studies will be shared with the public free of charge. In order to obtain our free publications, you must create a free log-in with a legitimate user profile. We do not sell your information. Please visit our sponsors – as they provide altruistic support to our organization.
I encountered the Electronic Discovery Institute while researching information on electronic discovery. Since law was and still is an interest of mine, I wanted to record it here.
The area of e-discovery is under rapid development, in terms of the rules that govern it, the technology that it employs and its practice in real world situations with consequences for the players.
Commend this site/organization to anyone interested in e-discovery issues.
Posted in Law, Legal Informatics | No Comments »
May 15th, 2012
I was reading a paper on natural language processing (NLP) when it occurred to me to ask:
When is parsing of any data not natural language processing?
I hear the phrase, “natural language processing,” applied to a corpus of emails, blog posts, web pages, electronic texts, transcripts of international phone calls and the like.
Other than following others out of habit, why do we say those are subject to “natural language processing?”
As opposed to say a database schema?
When we “process” the column headers in a database schema, aren’t we engaged in “natural language processing?” What about SGML/XML schemas or instances they govern?
Being mindful of semantics, synonymy and polysemy, it’s hard to think of examples that are not “natural language processing.”
At least for data that would be meaningful if read by a person. Streams of numbers perhaps not, but the symbolism that defines their processing I would argue falls under natural language processing.
Thoughts?
Posted in Natural Language Processing | No Comments »
May 14th, 2012
Mining GitHub – Followers in Tinkerpop
Patrick Wagstrom writes:
Development of any moderately complex software package is a social process. Even if a project is developed entirely by a single person, there is still a social component that consists of all of the people who use the software, file bugs, and provide recommendations for enhancements. This social aspect is one of the driving forces behind the proliferation of social software development sites such as GitHub, SourceForge, Google Code, and BitBucket.
These sites combine together a variety of tools that are common for software development such as version control, bug trackers, mailing lists, release management, project planning, and wikis. In addition, some of these have more social aspects that allow you to find and follow individual developers or watch particular projects. In this post I’m going to show you how we can use some of this information to gain insight into a software development community, specifically the community around the Tinkerpop stack of tools for graph databases.
GitHub as a social community. Who knew?
Very instructive walk through Gremlin, GraphML, and R with a prepared data set. It doesn’t get much better than this!
Posted in Github, GraphML, Neo4j, R, TinkerPop | No Comments »
May 14th, 2012
Finite State Automata in Lucene by Mike McCandless
From the post:
Lucene Revolution 2012 is now done, and the talk Robert and I gave went well! We showed how we are using automata (FSAs and FSTs) to make great improvements throughout Lucene.
You can view the slides here.
This was the first time I used Google Docs exclusively for a talk, and I was impressed! The real-time collaboration was awesome: we each could see the edits the other was doing, live. You never have to “save” your document: instead, every time you make a change, the document is saved to a new revision and you can then use infinite undo, or step back through all revisions, to go back.
Finally, Google Docs covers the whole life-cycle of your talk: editing/iterating, presenting (it presents in full-screen just fine, but does require an internet connection; I exported to PDF ahead of time as a backup) and, finally, sharing with the rest of the world!
I must confess to disappointment when I read at slide 23 that “multi-token synonyms mess up graph.”
Particularly since I suspect that not only do synonyms need to be “multi-token” but “multi-dimensional” as well.
Posted in Finite State Automata, Lucene | No Comments »
May 14th, 2012
Sorting and Filtering Results in Custom Search
From the post:
Using Custom Search Engine (CSE), you can create rich search experiences that make it easier for visitors to find the information they’re looking for on your site. Today we’re announcing two improvements to sorting and filtering of search results in CSE.
First, CSE now supports UI-based results sorting, which you can enable in the Basics tab of the CSE control panel. Once you’ve updated the CSE element code on your site, a “sort by” picker will become visible at the top of the results section.
I am not sure I would call this a “rich search experience” but I suppose any improvement is better than none at all.
Curious how you evaluate the use of “product rich snippets” as being similar to Newcomb’s conferral of properties? (see the post for “product rich snippets”).
Or, for that matter, how would you, in an indexing context, “confer” on an index entry additional information that does not appear in the document?
Information to be used when the index is searched.
Comments?
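One way to picture the question is a toy inverted index in which entries carry properties that were never in any document. This is only a sketch under my own assumptions: the class and method names are hypothetical, the tokenizer is a naive whitespace split, and nothing here reflects how Google CSE or any real indexer works.

```python
from collections import defaultdict

class ConferringIndex:
    """Toy inverted index whose entries can carry conferred properties."""

    def __init__(self):
        self.postings = defaultdict(set)    # term -> ids of documents containing it
        self.conferred = defaultdict(dict)  # term -> properties absent from the documents

    def add_document(self, doc_id, text):
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def confer(self, term, **properties):
        """Attach information to an index entry that no document supplies."""
        self.conferred[term.lower()].update(properties)

    def search(self, query):
        q = query.lower()
        hits = set(self.postings.get(q, set()))
        # Also match entries whose conferred property values equal the query.
        for term, props in self.conferred.items():
            if q in (str(value).lower() for value in props.values()):
                hits |= self.postings.get(term, set())
        return hits
```

After `add_document("d1", "Order the XJ-200 today")` and `confer("xj-200", category="camera")`, a search for `camera` finds `d1` even though the word never occurs in the text.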
Posted in Google CSE, Searching | No Comments »
May 14th, 2012
CDG – Community Data Generator
From the post:
CDG is a datawarehouse generator and the newest member of the Ctools family. Given the definition of dimensions that we want, CDG will randomize data within certain parameters and output 3 different things:
- Database and table ddl for the fact table
- A file with inserts for the fact table
- Mondrian schema file to be used within pentaho
While most of the documentation mentions the usage within the scope of Pentaho there’s absolutely nothing that prevents the resulting database to be used in different contexts.
I had mentioned ctools before but not in any detail. This was the additional resource that made me pick them back up.
It isn’t hard to see how this data generator will be useful.
For subject-centric software, generating files with known “same subject” characteristics would be more useful.
Thoughts, suggestions or pointers to work on generation of such files?
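As a starting point for discussion, here is a minimal sketch of such a generator. It is not part of CDG or CTools; the function names and the sample name variants are my own assumptions. Each record carries the identifier of the subject it really represents, so the output of merging software can be checked against known answers.

```python
import random

# Hypothetical test subjects, each with name variants that should merge.
SUBJECTS = {
    "S1": ["IBM", "I.B.M.", "International Business Machines"],
    "S2": ["Lucene", "Apache Lucene"],
}

def generate_same_subject_records(subjects, variants_per_subject=3, seed=0):
    """Return shuffled records; each keeps its true subject_id as ground truth."""
    rng = random.Random(seed)  # seeded so a test file can be regenerated exactly
    records = []
    for subject_id, names in sorted(subjects.items()):
        for _ in range(variants_per_subject):
            records.append({"subject_id": subject_id, "name": rng.choice(names)})
    rng.shuffle(records)
    return records

def expected_groups(records):
    """Map each subject_id to the positions of the records that should merge."""
    groups = {}
    for position, record in enumerate(records):
        groups.setdefault(record["subject_id"], []).append(position)
    return groups
```

Strip the `subject_id` field before handing the records to the software under test, then compare its merges to `expected_groups`.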
Posted in Ctools, Data | No Comments »
May 14th, 2012
C*Tools
From the webpage:
The CTools are a Webdetails Open Source project composed by a collection of Pentaho plugins. Its purpose is to streamline the implementation and design process, expanding even further the range of possibilities of Pentaho Dashboards. This page represents our effort to keep you up to date with our latest developments. Have fun, dazzle your clients and build a “masterpiece of a Dashboard”.
Tools include:
CCC: Community Charting Components (CCC) is a charting library on top of Protovis, a very powerful free and open-source visualization toolkit.
CBF: Focused on a multi-project/ multi-environment scenario, the Community Build Framework (CBF) is the way to setup and deploy Pentaho based applications.
CDA: Community Data Access (CDA) is a Pentaho plugin designed for accessing data with great flexibility. Born for overcoming some cons of the older implementation, CDA allows you to access any of the various Pentaho data sources and:
- join different datasources just by editing an XML file
- cache queries providing a great boost in performance.
- deliver data in different formats (csv, xls, etc.) through the Pentaho User
CDE: The Community Dashboard Editor (CDE) is the outcome of real-world needs: It was born to greatly simplify the creation, edition and rendering of dashboards.
CDF: Community Dashboard Framework (CDF) is a project that allows you to create friendly, powerful, fully featured dashboards on top of the Pentaho BI server. Former Pentaho dashboards had several drawbacks from a developer’s point of view. The developing process was awkward, it required know-how of web technologies and programming languages, and basically it was time-consuming. CDF emerged as a need for a framework that overcame all those difficulties. The final result is a powerful framework featuring the following:
- It is based on Open Source technologies.
- It separates logic (JavaScript) from the presentation (HTML, CSS)
- It features a life cycle with components interacting with each other
- It uses AJAX
- It is extensible, which gives the users a high level of customization:
  - Advanced users can extend the library of components.
  - They also can insert their own snippets of JavaScript and jQuery code.
CST: Community Startup Tabs (CST) represents the easiest way to define and implement the Pentaho startup tabs depending on the user that logs into the PUC. Ranging from a single institutional page to a list of dashboards or reports among other contents, the tabs that each Pentaho user uses to open after logging into the PUC vary depending on the user preferences, or his/her role in the company. Then, why let Pentaho open always the same home page for everyone? The list of tabs to be opened automatically right after the login can be different depending on the user thanks to CST. Community Startup Tabs (CST) is a plugin with the following features:
- it allows you to define different startup tabs for each user that logs into the PUC.
- it is easy to configure.
- it allows you to define startup tabs based on user names or user roles.
- for the definition of the startup tabs it allows you to specify user names or roles using regular expressions.
The trick to dashboards (as opposed to some, nameless, applications) is to deliver obviously useful options and information to users.
Posted in Ctools, Dashboard, Pentaho | No Comments »
May 14th, 2012
TREC Document Review Project on Hiatus, Recommind Asked to Withdraw
From the post:
TREC Legal Track — part of the U.S. government’s Text Retrieval Conference — announced last week that the 2012 edition of its annual document review project for testing new systems is canceled, while prominent e-discovery software company Recommind confirmed that it’s been asked to leave the project for prematurely sharing results.
These difficulties highlight the need for:
- open data sets and
- protocols for reporting of results as they occur.
That requires a data set with relevance judgments and other work.
Have you thought about the Open Relevance Project at the Apache Software Foundation?
Email archives from Apache projects, the backbone of the web as we know it, are ripe for your contributions.
Let me be the first to ask Recommind to join in building a public data set for everyone.
Posted in Data Mining, Data Source, Open Relevance Project, TREC | No Comments »
May 14th, 2012
ETL 2.0 – Data Integration Comes of Age by Robin Bloor PhD & Rebecca Jozwiak.
Well…, sort of.
It is a “white paper” and all that implies but when you read:
Versatility of Transformations and Scalability
All ETL products provide some transformations but few are versatile. Useful transformations may involve translating data formats and coded values between the data sources and the target (if they are, or need to be, different). They may involve deriving calculated values, sorting data, aggregating data, or joining data. They may involve transposing data (from columns to rows) or transposing single columns into multiple columns. They may involve performing look-ups and substituting actual values with looked-up values accordingly, applying validations (and rejecting records that fail) and more. If the ETL tool cannot perform such transformations, they will have to be hand coded elsewhere – in the database or in an application.
It is extremely useful if transformations can draw data from multiple sources and data joins can be performed between such sources “in flight,” eliminating the need for costly and complex staging. Ideally, an ETL 2.0 product will be rich in transformation options since its role is to eliminate the need for direct coding all such data transformations.
you start to lose what little respect you had for industry “white papers.”
Not once in this white paper is the term “semantics” used. It is also innocent of using the term “documentation.”
Don’t you think an ETL 2.0 application should enable re-use of “useful transformations?”
Wouldn’t that be a good thing?
Instead of IT staff starting from zero with every transformation request?
Failure to capture the semantics of data leaves you at ETL 2.0, while everyone else is at ETL 3.0.
What does your business sense tell you about that choice?
(ETL 3.0 – Documented, re-usable, semantics for data and data structures. Enables development of transformation modules for particular data sources.)
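A rough sketch of what I mean by documented, re-usable transformations follows. It is an illustration under my own assumptions, not a description of any ETL product: the `Transformation` class, the registry and the field names are all hypothetical. The point is only that the semantics travel with the transformation, so the next request does not start from zero.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass(frozen=True)
class Transformation:
    name: str
    source_field: str
    target_field: str
    semantics: str                 # what the source and target values mean
    apply: Callable[[Any], Any]    # the conversion itself

REGISTRY: Dict[str, Transformation] = {}

def register(transformation: Transformation) -> Transformation:
    """Store a transformation under its name so later requests can re-use it."""
    if transformation.name in REGISTRY:
        raise ValueError("already registered: " + transformation.name)
    REGISTRY[transformation.name] = transformation
    return transformation

def run(record: Dict[str, Any], names: List[str]) -> Dict[str, Any]:
    """Apply the named transformations to a copy of the record."""
    out = dict(record)
    for name in names:
        t = REGISTRY[name]
        out[t.target_field] = t.apply(record[t.source_field])
    return out

# Hypothetical example entry; field names are invented for illustration.
register(Transformation(
    name="fahrenheit_to_celsius",
    source_field="temp_f",
    target_field="temp_c",
    semantics="Source stores air temperature in degrees Fahrenheit; "
              "target expects degrees Celsius, rounded to one decimal.",
    apply=lambda f: round((f - 32) * 5 / 9, 1),
))
```

Anyone who later meets a `temp_f` column can look up `REGISTRY["fahrenheit_to_celsius"].semantics` before deciding whether the existing transformation fits their data.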
Posted in Data Integration, ETL | No Comments »
May 14th, 2012
Web Developers Can Now Easily “Play” with RDFa by Eric Franzon.
From the post:
Yesterday, we announced RDFa.info, a new site devoted to helping developers add RDFa (Resource Description Framework-in-attributes) to HTML.
Building on that work, the team behind RDFa.info is announcing today the release of “PLAY,” a live RDFa editor and visualization tool. This release marks a significant step in providing tools for web developers that are easy to use, even for those unaccustomed to working with RDFa.
“Play” is an effort that serves several purposes. It is an authoring environment and markup debugger for RDFa that also serves as a teaching and education tool for Web Developers. As Alex Milowski, one of the core RDFa.info team, said, “It can be used for purposes of experimentation, documentation (e.g. crafting an example that produces certain triples), and testing. If you want to know what markup will produce what kind of properties (triples), this tool is going to be great for understanding how you should be structuring your own data.”
A useful site for learning RDFa that is open for contributions, such as examples and documentation.
Posted in RDF, RDFa, Semantic Web | No Comments »
May 14th, 2012
Cloudera Manager 4.0 Beta released by Aparna Ramani
From the post:
We’re happy to announce the Beta release of Cloudera Manager 4.0.
This version of Cloudera Manager includes support for CDH4 Beta2 and several new features for both the Free edition and the Enterprise edition.
This is the last beta before the GA release.
The details are:
I’m pleased to inform our users and customers that we have released the Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4) 2nd and final beta today. We received great feedback from the community from the first beta and this release incorporates that feedback as well as a number of new enhancements.
CDH4 has a great many enhancements compared to CDH3.
- Availability – a high availability namenode, better job isolation, improved hard disk failure handling, and multi-version support
- Utilization – multiple namespaces and a slot-less resource management model
- Performance – improvements in HBase, HDFS, MapReduce, Flume and compression performance
- Usability – broader BI support, expanded API options, a more responsive Hue with broader browser support
- Extensibility – HBase co-processors enable developers to create new kinds of real-time big data applications, the new MapReduce resource management model enables developers to run new data processing paradigms on the same cluster resources and storage
- Security – HBase table & column level security and Zookeeper authentication support
Some items of note about this beta:
This is the second (and final) beta for CDH4, and this version has all of the major component changes that we’ve planned to incorporate before the platform goes GA. The second beta:
- Incorporates the Apache Flume, Hue, Apache Oozie and Apache Whirr components that did not make the first beta
- Broadens the platform support back out to our normal release matrix of Red Hat, CentOS, SUSE, Ubuntu and Debian
- Standardizes our release matrix of supported databases to include MySQL, PostgreSQL and Oracle
- Includes a number of improvements to existing components like adding auto-failover support to HDFS’s high availability feature and adding multi-homing support to HDFS and MapReduce
- Incorporates a number of fixes that were identified during the first beta period like removing a HBase performance regression
Not as romantic as your subject analysis activities but someone has to manage the systems that implement your analysis!
Not to mention that skills here make you more attractive in any big data context.
Posted in Cloudera, Hadoop, MapReduce | No Comments »
May 14th, 2012
Lucene conference touches many areas of growth in search by Andy Oram.
From the post:
With a modern search engine and smart planning, web sites can provide visitors with a better search experience than Google. For instance, Google may well turn up interesting results if you search for a certain kind of shirt, but a well-designed clothing site can also pull up related trousers, skirts, and accessories. It’s not Google’s job to understand the intricate interrelationships of data on a particular web property, but the site’s own team can constantly tune searches to reflect what the site has to offer and what its visitors uniquely need.
Hence the importance of search engines like Solr, based on the Lucene library. Both are open source Apache projects, maintained by Lucid Imagination, a company founded to commercialize the underlying technology. I attended parts of Lucid Imagination’s conference this week, Lucene Revolution, and found Lucene evolving in the ways much of the computer industry is headed.
Andy’s summary of the conference will make you wonder two things:
- Why weren’t you at the Lucene Revolution conference this year?
- Where are the videos from Lucene Revolution 2012?
I won’t ever be able to answer #1 but will post an answer to #2 as soon as it is available.
Posted in BigData, Lucene, LucidWorks, Solr | No Comments »