Machine Learning with Hadoop by Josh Patterson.
Very current (Sept. 2011) review of Hadoop, data mining and related issues. Plus pointers to software projects such as Lumberyard, which deals with terabyte-sized time series data.
HTML Data Task Force, chaired by Jeni Tennison.
Another opportunity to participate in important work at the W3C without a membership. The “details” of getting diverse formats to work together.
Close analysis may show the need for changes to syntaxes, etc., but as far as mapping goes, topic maps can take syntaxes as they are. Could be an opportunity to demonstrate working solutions for actual use cases.
From the wikipage:
This HTML Data Task Force considers RDFa 1.1 and microdata as separate syntaxes, and conducts a technical analysis on the relationship between the two formats. The analysis discusses specific use cases and provides guidance on what format is best suited for what use cases. It further addresses the question of how different formats can be used within the same document when required and how data expressed in the different formats can be combined by consumers.
The task force MAY propose modifications in the form of bug reports and change proposals on the microdata and/or RDFa specifications, to help users to easily transition between the two syntaxes or use them together. As with all such comments, the ultimate decisions on implementing these will rest with the respective Working Groups.
Further, the Task Force should also produce draft specifications of mapping algorithms from HTML+microdata content to RDF, as well as a mapping of RDFa to microdata’s JSON format. These MAY serve as input documents to possible future recommendation track works. These mappings should be, if possible, generic, i.e., they should not be dependent on any particular vocabulary. A goal for these mappings should be to facilitate the use of both formats with the same vocabularies without creating incompatibilities.
The Task Force will also consider design patterns for vocabularies, and provide guidance on how vocabularies should be shaped to be usable with both microdata and RDFa and potentially with microformats. These patterns MAY lead to change proposals of existing (RDF) vocabularies, and MAY result in general guidelines for the design of vocabularies for structured data on the web, building on existing community work in this area.
The Task Force liaises with the SWIG Web Schemas Task Force to ensure that lessons from real-world experience are incorporated into the Task Force recommendations and that any best practices described by the Task Force are synchronised with real-world practice.
The Task Force conducts its work through the public-html-data-tf@w3.org mailing list (use this link to subscribe or look at the public archives), as well as on the #html-data-tf channel of the (public) W3C IRC server.
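To get a feel for what such a generic mapping involves, here is a rough sketch (mine, not the Task Force’s draft algorithm) that flattens a parsed microdata item into RDF-style triples. The item structure, the vocabulary-resolution rule and the URIs are illustrative assumptions only:

```python
from itertools import count

# A parsed microdata item as a plain dict: an optional itemid, an itemtype and
# properties whose values are strings or nested items. Purely hypothetical data.
item = {
    "itemid": "http://example.org/people/alice",
    "itemtype": "http://schema.org/Person",
    "properties": {
        "name": ["Alice"],
        "knows": [{"itemtype": "http://schema.org/Person",
                   "properties": {"name": ["Bob"]}}],
    },
}

_blank = count(1)

def item_to_triples(item, vocab_base=None):
    """Return (subject, triples) for a microdata item.

    Items without an itemid get a generated blank-node label; property names
    are resolved against the itemtype's base, one common (not the only) convention.
    """
    subject = item.get("itemid") or "_:b%d" % next(_blank)
    itemtype = item.get("itemtype")
    triples = []
    if itemtype:
        triples.append((subject, "rdf:type", itemtype))
    base = vocab_base or (itemtype.rsplit("/", 1)[0] + "/" if itemtype else "")
    for prop, values in item.get("properties", {}).items():
        predicate = base + prop
        for value in values:
            if isinstance(value, dict):               # nested item: recurse
                nested_subject, nested = item_to_triples(value, vocab_base)
                triples.append((subject, predicate, nested_subject))
                triples.extend(nested)
            else:                                     # plain literal value
                triples.append((subject, predicate, value))
    return subject, triples

_, triples = item_to_triples(item)
for t in triples:
    print(t)
```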
Web Schemas Task Force, chaired by R.V. Guha (Google).
Here is your opportunity to participate in some very important work at the W3C without a W3C membership.
From the wiki page:
This is the main Wiki page for W3C’s Semantic Web Interest Group Web Schemas task force.
The taskforce chair is R.V.Guha (Google).
In scope are collaborations on mappings, tools, extensibility and cross-syntax interoperability. An HTML Data group is nearby; detailed discussion about Web data syntax belongs there.
See the charter for more details.
The group uses the public-vocabs@w3.org mailing list
- See public-vocabs@w3.org archives
- To subscribe, send a message to public-vocabs-request@w3.org with Subject: subscribe (see lists.w3.org for more details).
- If you are new to the W3C community, you will need to go through the archive approval process before your posts show up in the archives.
- To edit this wiki, you’ll need a W3C account; these are available to all
Groups who maintain Web Schemas are welcome to use this forum as a feedback channel, in addition to whatever independent mechanisms they also offer.
The following from the charter makes me think that topic maps may be relevant to the task at hand:
Participants are encouraged to use the group to take practical steps towards interoperability amongst diverse schemas, e.g. through development of mappings, extensions and supporting tools. Those participants who maintain vocabularies in any format designed for wide-scale public Web use are also welcome to participate in the group as a ‘feedback channel’, including practicalities around syntax, encoding and extensibility (which will be relayed to other W3C groups as appropriate).
DM SIG “Bayesian Statistical Reasoning” 5/23/2011 by Prof. David Draper, PhD.
I think you will be surprised at how interesting and even compelling this presentation becomes at points. Particularly his comments early in the presentation about needing an analogy machine, to find things not expressed in the way you usually look for them. And he has concrete examples of where that has been needed.
Title: Bayesian Statistical Reasoning: an inferential, predictive and decision-making paradigm for the 21st century
Professor Draper gives examples of Bayesian inference, prediction and decision-making in the context of several case studies from medicine and health policy. There will be points of potential technical interest for applied mathematicians, statisticians, and computer scientists.
Broadly speaking, statistics is the study of uncertainty: how to measure it well, and how to make good choices in the face of it. Statistical activities are of four main types: description of a data set, inference about the underlying process generating the data, prediction of future data, and decision-making under uncertainty. The last three of these activities are probability based.
Two main probability paradigms are in current use: the frequentist (or relative-frequency) approach, in which you restrict attention to phenomena that are inherently repeatable under “identical” conditions and define P(A) to be the limiting relative frequency with which A would occur in hypothetical repetitions, as n goes to infinity; and the Bayesian approach, in which the arguments A and B of the probability operator P(A|B) are true-false propositions (with the truth status of A unknown to you and B assumed by you to be true), and P(A|B) represents the weight of evidence in favor of the truth of A, given the information in B.
The Bayesian approach includes the frequentist paradigm as a special case, so you might think it would be the only version of probability used in statistical work today, but (a) in quantifying your uncertainty about something unknown to you, the Bayesian paradigm requires you to bring all relevant information to bear on the calculation; this involves combining information both internal and external to the data you’ve gathered, and (somewhat strangely) the external-information part of this approach was controversial in the 20th century, and (b) Bayesian calculations require approximating high-dimensional integrals (whereas the frequentist approach mainly relies on maximization rather than integration), and this was a severe limitation to the Bayesian paradigm for a long time (from the 1750s to the 1980s).
The external-information problem has been solved by developing methods that separately handle the two main cases: (1) substantial external information, which is addressed by elicitation techniques, and (2) relatively little external information, which is covered by any of several methods for (in the jargon) specifying diffuse prior distributions. Good Bayesian work also involves sensitivity analysis: varying the manner in which you quantify the internal and external information across reasonable alternatives, and examining the stability of your conclusions.
Around 1990 two things happened roughly simultaneously that completely changed the Bayesian computational picture:
- Bayesian statisticians belatedly discovered that applied mathematicians (led by Metropolis), working at the intersection between chemistry and physics in the 1940s, had used Markov chains to develop a clever algorithm for approximating integrals arising in thermodynamics that are similar to the kinds of integrals that come up in Bayesian statistics, and
- desktop computers finally became fast enough to implement the Metropolis algorithm in a feasibly short amount of time.
As a result of these developments, the Bayesian computational problem has been solved in a wide range of interesting application areas with small-to-moderate amounts of data; with large data sets, variational methods are available that offer a different approach to useful approximate solutions.
The Bayesian paradigm for uncertainty quantification does appear to have one remaining weakness, which coincides with a strength of the frequentist paradigm: nothing in the Bayesian approach to inference and prediction requires you to pay attention to how often you get the right answer (this is a form of calibration of your uncertainty assessments), which is an activity that’s (i) central to good science and decision-making and (ii) natural to emphasize from the frequentist point of view. However, it has recently been shown that calibration can readily be brought into the Bayesian story by means of decision theory, turning the Bayesian paradigm into an approach that is (in principle) both logically internally consistent and well-calibrated.
In this talk I’ll (a) offer some historical notes about how we have arrived at the present situation and (b) give examples of Bayesian inference, prediction and decision-making in the context of several case studies from medicine and health policy. There will be points of potential technical interest for applied mathematicians, statisticians and computer scientists.
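Since the Metropolis algorithm carries so much of the story above, here is a minimal sketch (mine, not Professor Draper’s) of random-walk Metropolis used to approximate a posterior mean. The Beta-Binomial toy problem, flat prior and tuning values are all made up, chosen only because the exact answer is available to check against:

```python
import math
import random

random.seed(42)

# Toy problem: estimate theta (a success probability) after observing k successes
# in n trials, with a flat prior. The posterior is Beta(k+1, n-k+1), so the exact
# posterior mean is (k+1)/(n+2), which lets us check the sampler.
n, k = 20, 14

def log_posterior(theta):
    """Log of the unnormalised posterior: binomial likelihood times a flat prior."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

def metropolis(log_target, start, steps, step_size=0.1):
    """Random-walk Metropolis: propose a Gaussian jitter, accept with the usual ratio."""
    samples, current, current_lp = [], start, log_target(start)
    for _ in range(steps):
        proposal = current + random.gauss(0.0, step_size)
        proposal_lp = log_target(proposal)
        if random.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

draws = metropolis(log_posterior, start=0.5, steps=50000)[5000:]  # drop burn-in
print("Metropolis estimate of posterior mean:", sum(draws) / len(draws))
print("Exact posterior mean:", (k + 1) / (n + 2))
```

The point of the toy example is only that the same handful of lines works unchanged when the integral has no closed form.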
The Getty Search Gateway at all things cataloged
Interesting review of the new search capabilities at the Getty. Covers their use of Solr and some of its more interesting capabilities. Searches across collections and other information sources.
After reading the post and using the site, what, in particular, would you do differently with a topic map?
Introduction to Restricted Boltzmann Machines
While I was at Edwin Chen’s blog, I discovered this post on Restricted Boltzmann Machines which begins:
Suppose you ask a bunch of users to rate a set of movies on a 0-100 scale. In classical factor analysis, you could then try to explain each movie and user in terms of a set of latent factors. For example, movies like Star Wars and Lord of the Rings might have strong associations with a latent science fiction and fantasy factor, and users who like Wall-E and Toy Story might have strong associations with a latent Pixar factor.
Restricted Boltzmann Machines essentially perform a binary version of factor analysis. (This is one way of thinking about RBMs; there are, of course, others, and lots of different ways to use RBMs, but I’ll adopt this approach for this post.) Instead of users rating a set of movies on a continuous scale, they simply tell you whether they like a movie or not, and the RBM will try to discover latent factors that can explain the activation of these movie choices.
Not for the novice user but something you may run across in the analysis of data sets or need yourself. Excellent pointers to additional resources.
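For anyone who wants to poke at the movie example, here is a small sketch (not Chen’s code) of a binary RBM trained with one step of contrastive divergence (CD-1) on a made-up like/dislike matrix; the data, dimensions and learning settings are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: rows are users, columns are movies, entries are like (1) / dislike (0).
# The first three users lean toward one made-up cluster, the last three toward another.
data = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float)

n_visible, n_hidden = data.shape[1], 2        # 2 hidden units ~ 2 latent factors
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b_visible = np.zeros(n_visible)
b_hidden = np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

learning_rate = 0.1
for epoch in range(5000):
    # Positive phase: hidden activations driven by the data.
    h_prob = sigmoid(data @ W + b_hidden)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase (CD-1): reconstruct the visibles, then recompute the hiddens.
    v_prob = sigmoid(h_sample @ W.T + b_visible)
    h_prob_neg = sigmoid(v_prob @ W + b_hidden)
    # Gradient step on the difference between the two correlation statistics.
    W += learning_rate * (data.T @ h_prob - v_prob.T @ h_prob_neg) / len(data)
    b_visible += learning_rate * (data - v_prob).mean(axis=0)
    b_hidden += learning_rate * (h_prob - h_prob_neg).mean(axis=0)

# Which hidden "factor" does each user activate?
print(np.round(sigmoid(data @ W + b_hidden), 2))
```

With only two hidden units, users with similar tastes should light up the same hidden unit, which is the “binary factor analysis” reading Chen describes.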
Introduction to Latent Dirichlet Allocation by Edwin Chen.
From the introduction:
Suppose you have the following set of sentences:
- I like to eat broccoli and bananas.
- I ate a banana and spinach smoothie for breakfast.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.
What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
The question, of course, is: how does LDA perform this discovery?
About as smooth an explanation of Latent Dirichlet Allocation as you are going to find.
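If you want to see roughly that behavior on Chen’s five sentences, here is a quick sketch using scikit-learn’s LatentDirichletAllocation (not necessarily the same inference method Chen describes, and on a corpus this tiny the split will be noisier than his idealized output):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

# Bag-of-words counts; stop words removed so topics aren't dominated by "and", "a", etc.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(sentences)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)        # rows: sentences, columns: topic proportions

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:4]]
    print("Topic %d:" % topic_idx, ", ".join(top_terms))
print(doc_topics.round(2))
```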
Linked Data for Education and Technology-Enhanced Learning (TEL)
From the website:
Interactive Learning Environments special issue on Linked Data for Education and Technology-Enhanced Learning (TEL)
- Special issue website: http://linkededucation.org/ile-special-issue/
- Journal website: http://www.tandf.co.uk/journals/ile
IMPORTANT DATES
================
- 30 November 2011: Paper submission deadline (11:59pm Hawaiian time)
- 30 March 2012: Notification of first review round
- 30 April 2012: Submission of major revisions
- 15 July 2012: Notification of major revision reviews
- 15 August 2012: Submission of minor revisions
- 30 August 2012: Notification of acceptance
- late 2012 : Publication
OVERVIEW
=========
While sharing of open learning and educational resources on the Web has become common practice over the last years, a large amount of research has been dedicated to interoperability between educational repositories based on semantic technologies. However, although the Semantic Web has seen large-scale success in its recent incarnation as a Web of Linked Data, there is still little adoption of the successful Linked Data principles in the domains of education and technology-enhanced learning (TEL). This special issue builds on the fundamental belief that the Linked Data approach has the potential to fulfill the TEL vision of Web-scale interoperability of educational resources as well as highly personalised and adaptive educational applications. The special issue solicits research contributions exploring the promises of the Web of Linked Data in TEL by gathering researchers from the areas of the Semantic Web and educational science and technology.
TOPICS OF INTEREST
=================
We welcome papers describing current trends in research on (a) how technology-enhanced learning approaches take advantage of Linked Data on the Web and (b) how Linked Data principles and semantic technologies are being applied in technology-enhanced learning contexts. Both application-oriented and theoretical papers are welcome. Relevant topics include but are not limited to the following:
- Using Linked Data to support interoperability of educational resources
- Linked Data for informal learning
- Personalisation and context-awareness in TEL
- Usability and advanced user interfaces in learning environments and Linked Data
- Light-weight TEL metadata schemas
- Exposing learning object metadata via RDF/SPARQL & service-oriented approaches
- Semantic & syntactic mappings between educational metadata schemas and standards
- Controlled vocabularies, ontologies and terminologies for TEL
- Personal & mobile learning environments and Linked Data
- Learning flows and designs and Linked Data
- Linked Data in (visual) learning analytics and educational data mining
- Linked Data in organizational learning and learning organizations
- Linked Data for harmonizing individual learning goals and organizational objectives
- Competency management and Linked Data
- Collaborative learning and Linked Data
- Linked-data driven social networking collaborative learning
DSL for the Uninitiated by Debasish Ghosh
From the post:
One of the main reasons why software projects fail is the lack of communication between the business users, who actually know the problem domain, and the developers who design and implement the software model. Business users understand the domain terminology, and they speak a vocabulary that may be quite alien to the software people; it’s no wonder that the communication model can break down right at the beginning of the project life cycle.
A DSL (domain-specific language)1,3 bridges the semantic gap between business users and developers by encouraging better collaboration through shared vocabulary. The domain model that the developers build uses the same terminologies as the business. The abstractions that the DSL offers match the syntax and semantics of the problem domain. As a result, users can get involved in verifying business rules throughout the life cycle of the project.
This article describes the role that a DSL plays in modeling expressive business rules. It starts with the basics of domain modeling and then introduces DSLs, which are classified according to implementation techniques. The article then explains in detail the design and implementation of an embedded DSL from the domain of securities trading operations.
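For a feel of what an embedded DSL looks like in practice, here is a toy order-booking sketch in Python. It is my own illustration, not code from the article, and the vocabulary (buy, at_limit, valid_for) is invented:

```python
class Order:
    """A tiny embedded DSL: each method returns self, so an order reads close to
    the business user's own sentence: buy 100 of "IBM" at a limit of 45, good for the day."""

    def __init__(self):
        self.terms = {}

    def buy(self, quantity, instrument):
        self.terms.update(side="buy", quantity=quantity, instrument=instrument)
        return self

    def sell(self, quantity, instrument):
        self.terms.update(side="sell", quantity=quantity, instrument=instrument)
        return self

    def at_limit(self, price):
        self.terms["limit_price"] = price
        return self

    def valid_for(self, duration):
        self.terms["validity"] = duration
        return self

# Reads close to the trader's vocabulary, but it is ordinary Python underneath,
# so business rules (e.g. "a limit order must carry a price") can be checked in code.
order = Order().buy(100, "IBM").at_limit(45.0).valid_for("day")
print(order.terms)
```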
The subject identity and merging requirements of a particular domain are certainly issues where users, who actually know the problem domain, should be in the lead. Moreover, if users object to some merging operation result, that will bring notice to perhaps unintended consequences of an identity or merging rule.
Perhaps the rule is incorrect, perhaps there are assumptions yet to be explored, but the focus is on the user’s understanding of the domain, where it should be (assuming the original coding is correct).
This sounds like a legend to me.
BTW, the comments point to Lisp resources that got to DSLs first (as is the case with most/all programming concepts):
Matthias Felleisen | Thu, 04 Aug 2011 22:26:46 UTC
DSLs have been around in the LISP world forever. The tools for building them and for integrating them into the existing toolchain are far more advanced than in the JAVA world. For an example, see
http://www.ccs.neu.edu/scheme/pubs/#pldi11-thacff for a research-y introduction
or
http://hashcollision.org/brainfudge/ for a hands-on introduction.
You might also want to simply start at the Racket homepage.
OCaml for the Masses by Yaron Minsky (Appears in ACM’s Queue)
Great article and merits a close read.
As does the first comment, which gives a concrete example of “readable” code failing to preserve, even for its author, “why” a particular code block was written. As the commenter points out, literate programming is the key to capturing “why” code was written, which is a prerequisite to effective long-term maintenance.
Curious what you see as the upsides/downsides of using topic maps as overlays on code to provide literate programming?
True enough, it is “quicker” to write comments inline, but are your comments consistent from one instance of being written inline to another? And do your comments in code tend to be briefer than if you are writing them separately? Or are you deceiving yourself in thinking your code is “commented” when the comments are as cryptic as your code? (Not that topic maps can help with that issue, just curious.)
I see downsides in the fragility of pointing into code that may change frequently, but that may be a matter of how pointers are constructed. You could experiment with strings of varying lengths as identifiers, as sketched below. If they are of sufficient length, do you really want different documentation for each instance? Hmmm, you could create identifiers by highlighting text and, after some time period, the app returns you to each of those so you can write the documentation.
Other upsides/downsides?
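One cheap way to experiment with the “strings as identifiers” idea: key each external “why” note to a hash of the whitespace-normalized snippet it describes, so that when the code drifts the note shows up as orphaned. A purely illustrative sketch, names and scheme my own:

```python
import hashlib

def anchor(snippet):
    """Identify a code snippet by a hash of its whitespace-normalized text,
    so trivial reformatting does not break the link but real edits do."""
    normalized = " ".join(snippet.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]

# External "why" notes, keyed by anchor rather than by file/line number.
notes = {}

snippet = "if retries > 3: backoff *= 2"
notes[anchor(snippet)] = "Why: upstream API throttles aggressively after 3 rapid failures."

def notes_for(source_text):
    """Return (anchor, note, still_present) for every note, flagging orphans."""
    present = {anchor(line) for line in source_text.splitlines() if line.strip()}
    return [(a, n, a in present) for a, n in notes.items()]

current_source = "if retries > 5: backoff *= 2\n"   # the code has drifted
for a, note, ok in notes_for(current_source):
    status = "ok" if ok else "ORPHANED - revisit"
    print(a, status, "-", note)
```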
Neo4j Spatial: Why Should You Care? by Peter Neubauer at SamGIS 2011.
A very nice slide deck from Peter Neubauer on Neo4j Spatial! Great images!
From the webpage:
Springer offers two options for MARC records for Springer eBook collections:
1. Free Springer MARC records, SpringerProtocols MARC records & eBook Title Lists
- Available free of charge
- Generated using Springer metadata containing most common fields
- Pick, download and install Springer MARC records in 4 easy steps
2. Free OCLC MARC records
- Available free of charge
- More enhanced MARC records
- Available through OCLC WORLDCAT service
This looks like very good topic map fodder.
I saw this at all things cataloged.
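If you do want to treat these records as topic map fodder, here is a rough sketch using the pymarc library, counting topical subject headings (650 fields) as candidate topics. The filename and the choice of field are my assumptions:

```python
from collections import Counter
from pymarc import MARCReader  # pip install pymarc

# "springer.mrc" is a placeholder for whichever record batch you downloaded.
subject_counts = Counter()
with open("springer.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:                      # skip records pymarc could not parse
            continue
        for field in record.get_fields("650"):  # topical subject headings
            subject_counts[field.value()] += 1

for subject, count in subject_counts.most_common(10):
    print(count, subject)
```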