Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 18, 2014

Artificial Intelligence | Natural Language Processing

Filed under: Artificial Intelligence,CS Lectures,Natural Language Processing — Patrick Durusau @ 4:26 pm

Artificial Intelligence | Natural Language Processing by Christopher Manning.

From the webpage:

This course is designed to introduce students to the fundamental concepts and ideas in natural language processing (NLP), and to get them up to speed with current research in the area. It develops an in-depth understanding of both the algorithms available for the processing of linguistic information and the underlying computational properties of natural languages. Word-level, syntactic, and semantic processing from both a linguistic and an algorithmic perspective are considered. The focus is on modern quantitative techniques in NLP: using large corpora, statistical models for acquisition, disambiguation, and parsing. Also, it examines and constructs representative systems.

Lectures with notes.

If you are new to natural language processing, it would be hard to point at a better starting point.

Enjoy!

Build Roads not Stagecoaches

Filed under: Data,Integration,Subject Identity — Patrick Durusau @ 3:40 pm

Build Roads not Stagecoaches by Martin Fenner.

Describing Eric Hysen’s keynote, Martin says:

In his keynote he described how travel from Cambridge to London in the 18th and early 19th century improved mainly as a result of better roads, made possible by changes in how these roads are financed. Translated to today, he urged the audience to think more about the infrastructure and less about the end products:

Ecosystems, not apps

— Eric Hysen

On Tuesday at csv,conf, Nick Stenning – Technical Director of the Open Knowledge Foundation – talked about data packages, an evolving standard to describe data that are passed around between different systems. He used the metaphor of containers, and how they have dramatically changed the transportation of goods in the last 50 years. He argued that the cost of shipping was in large part determined by the cost of loading and unloading, and the container has dramatically changed that equation. We are in a very similar situation with datasets, where most of the time is spent translating between different formats, joining things together that use different names for the same thing [emphasis added], etc.

…different names for the same thing.

Have you heard that before? 😉

But here is the irony:

When I thought more about this I realized that these building blocks are exactly the projects I get most excited about, i.e. projects that develop standards or provide APIs or libraries. Some examples would be

  • ORCID: unique identifiers for scholarly authors

OK, but many authors already have unique identifiers in DBLP, Library of Congress, Twitter, and other places I have not listed.

Nothing against ORCID, but adding yet another identifier isn’t all that helpful.

A mapping between identifiers, so that having one means I can leverage the others: now that is what I call infrastructure.

You?
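To make that concrete, here is a minimal sketch (mine, not from Martin's post) of what such a mapping might look like. Every identifier value below is invented for illustration.

```python
# A toy identifier mapping: one author, several identifier schemes.
# All identifier values here are made up for illustration.
author_ids = {
    "orcid": "0000-0000-0000-0000",
    "dblp": "00/0000",
    "loc": "n00000000",
    "twitter": "@example_author",
}

def same_author(id_map, scheme, value):
    """True if (scheme, value) names the author described by id_map."""
    return id_map.get(scheme) == value

# Knowing any one identifier lets us leverage all the others.
if same_author(author_ids, "dblp", "00/0000"):
    print("ORCID for this author:", author_ids["orcid"])
```

Scale that up across authors and identifier schemes and you have infrastructure, not another silo.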

Duplicate Tool Names

Filed under: Duplicates,Names — Patrick Durusau @ 9:25 am

You wait ages for somebody to develop a bioinformatics tool called ‘Kraken’ and then three come along at once by Keith Bradnam.

From the post:

So Kraken is either a universal genomic coordinate translator for comparative genomics, or a tool for ultrafast metagenomic sequence classification using exact alignments, or even a set of tools for quality control and analysis of high-throughput sequence data. The latter publication is from 2013, and the other two are from this year (2014).

Yet another illustration that names are not enough.

A URL identifier would not help unless you recognize the URL.

Identification with name/value plus other key/value pairs?

Leaves everyone free to choose whatever names they like.

It also enables the rest of us to tell tools (or other subjects) with the same names apart.
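A minimal sketch of the idea (mine, not any particular topic map implementation); the years come from the post above, the other property values just paraphrase the three descriptions:

```python
# Three tools all named "Kraken", told apart by additional
# key/value pairs rather than by name alone.
tools = [
    {"name": "Kraken", "domain": "comparative genomics", "year": 2014},
    {"name": "Kraken", "domain": "metagenomic classification", "year": 2014},
    {"name": "Kraken", "domain": "sequence quality control", "year": 2013},
]

def matches(subject, **properties):
    """A subject matches if every supplied key/value pair agrees."""
    return all(subject.get(k) == v for k, v in properties.items())

# The name alone is ambiguous; name plus one more property is not.
hits = [t for t in tools if matches(t, name="Kraken", year=2013)]
print(hits)  # only the 2013 quality-control tool
```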

Simple concept. Easy to apply. Disappoints people who want to be in charge of naming things.

Sounds like three good reasons to me, especially the last one.

July 17, 2014

Scikit-learn 0.15 release

Filed under: Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 6:16 pm

Scikit-learn 0.15 release by Gaƫl Varoquaux.

From the post:

Highlights:

Quality— Looking at the commit log, there has been a huge amount of work to fix minor annoying issues.

Speed— There has been a huge effort put in making many parts of scikit-learn faster. Little details all over the codebase. We do hope that you’ll find that your applications run faster. For instance, we find that the worst case speed of Ward clustering is 1.5 times faster in 0.15 than 0.14. K-means clustering is often 1.1 times faster. KNN, when used in brute-force mode, got faster by a factor of 2 or 3.

Random Forest and various tree methods— The random forest and various tree methods are much much faster, use parallel computing much better, and use less memory. For instance, the picture on the right shows the scikit-learn random forest running in parallel on a fat Amazon node, and nicely using all the CPUs with little RAM usage.

Hierarchical agglomerative clustering— Complete linkage and average linkage clustering have been added. The benefit of these approaches compared to the existing Ward clustering is that they can take an arbitrary distance matrix.

Robust linear models— Scikit-learn now includes RANSAC for robust linear regression.

HMMs are deprecated— We have been discussing for a long time removing HMMs, which do not fit the focus of scikit-learn on predictive modeling. We have created a separate hmmlearn repository for the HMM code. It is looking for maintainers.

And much more— plenty of “minor things”, such as better support for sparse data, better support for multi-label data…
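If you want to kick the tires on two of the highlights above, here is a short sketch using the scikit-learn API as I understand it; treat the parameter names as assumptions and check the documentation for your version.

```python
# A hedged sketch of two 0.15 highlights: parallel random forests and
# RANSAC-based robust regression. Not taken from the release notes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RANSACRegressor

# Random forest: n_jobs=-1 asks scikit-learn to use every available core.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X, y)
print("forest training accuracy:", forest.score(X, y))

# RANSAC: fit a linear model that shrugs off a block of gross outliers.
rng = np.random.RandomState(0)
Xr = rng.uniform(-5, 5, size=(200, 1))
yr = 2.0 * Xr.ravel() + 1.0 + rng.normal(scale=0.5, size=200)
yr[:20] += 30                      # contaminate the first 20 points
ransac = RANSACRegressor(random_state=0)
ransac.fit(Xr, yr)
print("estimated slope:", ransac.estimator_.coef_[0])
print("inliers found:", ransac.inlier_mask_.sum())
```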

Get thee to Scikit-learn!

April 2014 Crawl Data Available

Filed under: Common Crawl — Patrick Durusau @ 2:40 pm

April 2014 Crawl Data Available by Stephen Merity.

From the post:

The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-15/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to Blekko for their ongoing donation of URLs for our crawl!

Well, at 183TB, I don’t guess I am going to have a local copy. 😉
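If you do pull down the path listings, the prefixing step described above is just a loop. A minimal sketch; the local file name is hypothetical, so substitute whichever gzipped listing you downloaded:

```python
# Turn each relative path in a gzipped listing into full S3 and HTTP URLs.
# "segment.paths.gz" is a placeholder name for a downloaded path listing.
import gzip

S3_PREFIX = "s3://aws-publicdatasets/"
HTTP_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"

with gzip.open("segment.paths.gz", "rt") as paths:
    for line in paths:
        relative = line.strip()
        print(S3_PREFIX + relative)
        print(HTTP_PREFIX + relative)
```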

Enjoy!

July 16, 2014

FDA Recall Data

Filed under: Government Data,Open Access — Patrick Durusau @ 6:53 pm

OpenFDA Provides Ready Access to Recall Data by Taha A. Kass-Hout.

From the post:

Every year, hundreds of foods, drugs, and medical devices are recalled from the market by manufacturers. These products may be labeled incorrectly or might pose health or safety issues. Most recalls are voluntary; in some cases they may be ordered by the U.S. Food and Drug Administration. Recalls are reported to the FDA, and compiled into its Recall Enterprise System, or RES. Every week, the FDA releases an enforcement report that catalogues these recalls. And now, for the first time, there is an Application Programming Interface (API) that offers developers and researchers direct access to all of the drug, device, and food enforcement reports, dating back to 2004.

The recalls in this dataset provide an illuminating window into both the safety of individual products and the safety of the marketplace at large. Recent reports have included such recalls as certain food products (for not containing the vitamins listed on the label), a soba noodle salad (for containing unlisted soy ingredients), and a pain reliever (for not following laboratory testing requirements).

You will get warnings that this data is “not for clinical use.”

Sounds like a treasure trove of data if you are looking for products still being sold despite being recalled.

Or if you want to advertise for “victims” of faulty products that have been recalled.

I think both of those are non-clinical uses. 😉
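A quick sketch of pulling reports for that first use case; the endpoint path, query parameters, and field names follow my reading of the openFDA documentation, so treat them as assumptions and verify against the docs.

```python
# A hedged sketch of querying the openFDA enforcement-report API for
# recalls that are still listed as ongoing.
import requests

url = "https://api.fda.gov/food/enforcement.json"
params = {"search": "status:Ongoing", "limit": 5}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
for report in response.json().get("results", []):
    print(report.get("recall_number"), "-", report.get("product_description"))
```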

Darwin’s ship library goes online

Filed under: Library — Patrick Durusau @ 3:48 pm

Darwin’s ship library goes online by Dennis Normile.

From the post:

As Charles Darwin cruised the world on the HMS Beagle, he had access to an unusually well-stocked 400-volume library. That collection, which contained the observations of numerous other naturalists and explorers, has now been recreated online. As of today, all of more than 195,000 pages and 5000 illustrations from the works are available for the perusal of scholars and armchair naturalists alike, thanks to the Darwin Online project.

Perhaps it isn’t the amount of information you have available but how deeply you understand it that makes a difference.

Yes?

Which gene did you mean?

Filed under: Annotation,Semantics,Tagging — Patrick Durusau @ 3:38 pm

Which gene did you mean? by Barend Mons.

Abstract:

Computational Biology needs computer-readable information records. Increasingly, meta-analysed and pre-digested information is being used in the follow up of high throughput experiments and other investigations that yield massive data sets. Semantic enrichment of plain text is crucial for computer aided analysis. In general people will think about semantic tagging as just another form of text mining, and that term has quite a negative connotation in the minds of some biologists who have been disappointed by classical approaches of text mining. Efforts so far have tried to develop tools and technologies that retrospectively extract the correct information from text, which is usually full of ambiguities. Although remarkable results have been obtained in experimental circumstances, the widespread use of information mining tools is lagging behind earlier expectations. This commentary proposes to make semantic tagging an integral process to electronic publishing.

From within the post:

If all words had only one possible meaning, computers would be perfectly able to analyse texts. In reality however, words, terms and phrases in text are highly ambiguous. Knowledgeable people have few problems with these ambiguities when they read, because they use context to disambiguate ‘on the fly’. Even when fed a lot of semantically sound background information, however, computers currently lag far behind humans in their ability to interpret natural language. Therefore, proper semantic tagging of concepts in texts is crucial to make Computational Biology truly viable. Electronic Publishing has so far only scratched the surface of what is needed.

Open Access publication shows great potential, and is essential for effective information mining, but it will not achieve its full potential if information continues to be buried in plain text. Having true semantic mark up combined with open access for mining is an urgent need to make possible a computational approach to life sciences.

Creating semantically enriched content as part and parcel of the publication process should be a winning strategy.

First, for current data, estimates of what others will be searching for should not be hard to obtain. That will help focus tagging on the material users are seeking. Second, a current and growing base of enriched material will help answer questions about the return on enriching material.
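A toy sketch of the tagging idea itself (mine, not any particular publishing pipeline): resolve an ambiguous symbol against a lexicon at authoring time and emit an identifier alongside the surface text. The identifiers below are invented.

```python
# Disambiguate a gene symbol using context and attach an identifier,
# so downstream mining tools don't have to guess. IDs are made up.
gene_lexicon = {
    "p53": {"human": "GENE:0001", "mouse": "GENE:0002"},
}

def tag(symbol, organism_context):
    gene_id = gene_lexicon.get(symbol, {}).get(organism_context)
    return f"{symbol} [{gene_id}]" if gene_id else symbol

print(tag("p53", "human"))   # p53 [GENE:0001]
print(tag("p53", "mouse"))   # p53 [GENE:0002]
```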

Other suggestions for BMC Bioinformatics?

Introducing Source Han Sans:…

Filed under: Fonts,Language — Patrick Durusau @ 2:57 pm

Introducing Source Han Sans: An open source Pan-CJK typeface by Caleb Belohlavek.

From the post:

Adobe, in partnership with Google, is pleased to announce the release of Source Han Sans, a new open source Pan-CJK typeface family that is now available on Typekit for desktop use. If you don't have a Typekit account, it's easy to set one up and start using the font immediately with our free subscription. And for those who want to play with the original source files, you can get those from our download page on SourceForge.

It’s rather difficult to describe your semantics when you can’t write in your own language.

Kudos to Adobe and Google for sponsoring this project!

I first saw this in a tweet by James Clark.

…[S]emantically enriched open pharmacological space…

Filed under: Bioinformatics,Biomedical,Drug Discovery,Integration,Semantics — Patrick Durusau @ 2:25 pm

Scientific competency questions as the basis for semantically enriched open pharmacological space development by Kamal Azzaoui, et al. (Drug Discovery Today, Volume 18, Issues 17–18, September 2013, Pages 843–852)

Abstract:

Molecular information systems play an important part in modern data-driven drug discovery. They do not only support decision making but also enable new discoveries via association and inference. In this review, we outline the scientific requirements identified by the Innovative Medicines Initiative (IMI) Open PHACTS consortium for the design of an open pharmacological space (OPS) information system. The focus of this work is the integration of compound–target–pathway–disease/phenotype data for public and industrial drug discovery research. Typical scientific competency questions provided by the consortium members will be analyzed based on the underlying data concepts and associations needed to answer the questions. Publicly available data sources used to target these questions as well as the need for and potential of semantic web-based technology will be presented.

Pharmacology may not be your space but this is a good example of what it takes for semantic integration of resources in a complex area.

Despite the "…you too can be a brain surgeon with our new web-based app…" claims from various sources, semantic integration has been, is, and will remain difficult under the best of circumstances.

I don’t say that to discourage anyone but to avoid the let-down when integration projects don’t provide easy returns.

It is far better to plan for incremental and measurable benefits along the way than to fashion grandiose goals that are ever receding on the horizon.

I first saw this in a tweet by ChemConnector.

July 15, 2014

Free Companies House data to boost UK economy

Filed under: Government,Government Data,Open Data — Patrick Durusau @ 4:57 pm

Free Companies House data to boost UK economy

From the post:

Companies House is to make all of its digital data available free of charge. This will make the UK the first country to establish a truly open register of business information.

As a result, it will be easier for businesses and members of the public to research and scrutinise the activities and ownership of companies and connected individuals. Last year (2013/14), customers searching the Companies House website spent £8.7 million accessing company information on the register.

This is a considerable step forward in improving corporate transparency; a key strand of the G8 declaration at the Lough Erne summit in 2013.

It will also open up opportunities for entrepreneurs to come up with innovative ways of using the information.

This change will come into effect from the second quarter of 2015 (April – June).

In a side bar, Business Secretary Vince Cable said in part:

Companies House is making the UK a more transparent, efficient and effective place to do business.

I'm not sure about "efficient," but providing incentives for lawyers and others to track down insider trading and other business-as-usual practices, and arming them with open data, would be a step in the right direction.

I first saw this in a tweet by Hadley Beeman.

Spy vs. Spies

Filed under: Cybersecurity,Security — Patrick Durusau @ 4:41 pm

XRay: Enhancing the Web's Transparency with Differential Correlation by Mathias Lécuyer, et al.

Abstract:

Today’s Web services – such as Google, Amazon, and Facebook – leverage user data for varied purposes, including personalizing recommendations, targeting advertisements, and adjusting prices. At present, users have little insight into how their data is being used. Hence, they cannot make informed choices about the services they choose. To increase transparency, we developed XRay, the first fine-grained, robust, and scalable personal data tracking system for the Web. XRay predicts which data in an arbitrary Web account (such as emails, searches, or viewed products) is being used to target which outputs (such as ads, recommended products, or prices). XRay’s core functions are service agnostic and easy to instantiate for new services, and they can track data within and across services. To make predictions independent of the audited service, XRay relies on the following insight: by comparing outputs from different accounts with similar, but not identical, subsets of data, one can pinpoint targeting through correlation. We show both theoretically, and through experiments on Gmail, Amazon, and YouTube, that XRay achieves high precision and recall by correlating data from a surprisingly small number of extra accounts.

Not immediately obvious until someone explains it: any system that reacts based on input you control can be investigated, whether by dark marketing forces or government security agencies.
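Here is a toy sketch of the differential-correlation insight, not the XRay code: accounts hold overlapping subsets of inputs, and an output that appears for every account holding a given input, and for none of the accounts lacking it, gets attributed to that input.

```python
# Toy differential correlation: attribute an output to the input(s) that
# consistently co-occur with it across accounts. All data is invented.
accounts = {
    "A": {"inputs": {"email_travel", "email_shoes"}, "outputs": {"ad_flights", "ad_sneakers"}},
    "B": {"inputs": {"email_travel"}, "outputs": {"ad_flights"}},
    "C": {"inputs": {"email_shoes"}, "outputs": {"ad_sneakers"}},
}

def attribute(output):
    with_out = [a for a in accounts.values() if output in a["outputs"]]
    without = [a for a in accounts.values() if output not in a["outputs"]]
    # Inputs shared by every account that saw the output...
    candidates = set.intersection(*(a["inputs"] for a in with_out))
    # ...minus inputs held by accounts that did not see it.
    for a in without:
        candidates -= a["inputs"]
    return candidates

print(attribute("ad_flights"))   # {'email_travel'}
print(attribute("ad_sneakers"))  # {'email_shoes'}
```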

Be aware that provoking government security agencies is best left to professionals. 😉

The next step will be to have bots that project false electronic trails for us to throw advertisers (or others) off track.

Very much worth your time to read.

Graph Classes and their Inclusions

Filed under: Graphs,Mathematics — Patrick Durusau @ 4:25 pm

Information System on Graph Classes and their Inclusions

From the webpage:

What is ISGCI?

ISGCI is an encyclopaedia of graph classes with an accompanying Java application that helps you to research what's known about particular graph classes. You can:

  • check the relation between graph classes and get a witness for the result
  • draw clear inclusion diagrams
  • colour these diagrams according to the complexity of selected problems
  • find the P/NP boundary for a problem
  • save your diagrams as Postscript, GraphML or SVG files
  • find references on classes, inclusions and algorithms

As of 2014-07-06, the database contains 1497 classes and 176,888 inclusions.

If you are past the giddy stage of “Everything’s a graph!,” you may find this site useful.

RDFUnit

Filed under: Linked Data,RDF,Semantic Web — Patrick Durusau @ 4:04 pm

RDFUnit – an RDF Unit-Testing suite

From the post:

RDFUnit is a test driven data-debugging framework that can run automatically generated (based on a schema) and manually generated test cases against an endpoint. All test cases are executed as SPARQL queries using a pattern-based transformation approach.

For more information on our methodology please refer to our report:

Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sƶren Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri in Proceedings of the 23rd International Conference on World Wide Web.

RDFUnit in a Nutshell

  • Test case: a data constraint that involves one or more triples. We use SPARQL as a test definition language.
  • Test suite: a set of test cases for testing a dataset
  • Status: Success, Fail, Timeout (complexity) or Error (e.g. network). A Fail can be an actual error, a warning or a notice
  • Data Quality Test Pattern (DQTP): Abstract test cases that can be instantiated into concrete test cases using pattern bindings
  • Pattern Bindings: valid replacements for a DQTP variable
  • Test Auto Generators (TAGs): Converts RDFS/OWL axioms into concrete test cases

If you are working with RDF data, this will certainly be helpful.
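To make the "test cases are SPARQL queries" idea concrete, here is a hedged sketch using rdflib rather than RDFUnit itself: the SELECT query returns violating resources, and a non-empty result set means the test case fails.

```python
# A small rdflib sketch of a SPARQL-based data test case (not RDFUnit code):
# flag any ex:age value that is not an xsd:integer.
from rdflib import Graph

data = """
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:alice ex:age "thirty"^^xsd:string .
ex:bob   ex:age "42"^^xsd:integer .
"""

test_case = """
PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?value WHERE {
  ?s ex:age ?value .
  FILTER (datatype(?value) != xsd:integer)
}
"""

g = Graph()
g.parse(data=data, format="turtle")

violations = list(g.query(test_case))
status = "Fail" if violations else "Success"
print(status, violations)   # ex:alice's string-typed age is reported
```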

BTW, don’t miss the publications further down on the homepage for RDFUnit.

I first saw this in a tweet by Marin Dimitrov.

Classification and regression trees

Filed under: Classification,Machine Learning,Regression,Trees — Patrick Durusau @ 3:47 pm

Classification and regression trees by Wei-Yin Loh.

Abstract:

Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1, 14–23. DOI: 10.1002/widm.8.

A bit more challenging than CSV formats but also very useful.
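If you want to see the recursive partitioning the abstract describes, here is a minimal scikit-learn sketch; it uses the current API, so check the docs for your version.

```python
# Fit a shallow classification tree and dump its partitioning decisions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each line of the dump is one partition of the feature space.
print(export_text(tree, feature_names=list(iris.feature_names)))
```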

I heard a joke many years ago from a then U.S. Assistant Attorney General who said:

To create a suspect list for a truck hijacking in New York, you choose files with certain name characteristics, delete the ones that are currently in prison and those that remain are your suspect list. (paraphrase)

If topic maps can represent any "subject" then they should be able to represent "group subjects" as well. We may know that our particular suspect is a member of a group, but we just don't know which member of the group is our suspect.

Think of it as a topic map that evolves as more data/analysis is brought to the map, and members of a group subject can be broken out into smaller groups or even individuals.

In fact, displaying summaries of the characteristics of group members in response to classification/regression could well help with the subject analysis process: an interactive construction/mining of the topic map, as it were.

Great paper whether you use it for topic map subject analysis or more traditional purposes.

Linked Data Guidelines (Australia)

Filed under: Linked Data,Semantic Web — Patrick Durusau @ 3:30 pm

First Version of Guidelines for Publishing Linked Data released by Allan Barger.

From the post:

The Australian Government Linked Data Working group (AGLDWG) is pleased to announce the release of a first version of a set of guidelines for the publishing of Linked Datasets on data.gov.au at:

https://github.com/AGLDWG/TR/wiki/URI-Guidelines-for-publishing-linked-datasets-on-data.gov.au-v0.1

The ā€œURI Guidelines for publishing Linked Datasets on data.gov.auā€ document provides a set of general guidelines aimed at helping Australian Government agencies to define and manage URIs for Linked Datasets and the resources described within that are published on data.gov.au. The Australian Government Linked Data Working group has developed the report over the last two years while the first datasets under the environment.data.gov.au sub-domain have been published following the patterns defined in this document.

Thought you might find this useful in mapping linked data sets from the Australian government to:

  • non-Australian government linked data sets
  • non-government linked data sets
  • non-linked data data sets (all sources)
  • pre-linked data data sets (all sources)
  • post-linked data data sets (all sources)

Enjoy!

CSV validator – a new digital preservation tool

Filed under: CSV,Data — Patrick Durusau @ 3:19 pm

CSV validator – a new digital preservation tool by David Underdown.

From the post:

Today marks the official release of a new digital preservation tool developed by The National Archives, CSV Validator version 1.0. This follows on from well-known tools such as DROID and the PRONOM database used in file identification (discussed in several previous blog posts). The release comprises the validator itself, but perhaps more importantly, it also includes the formal specification of a CSV schema language for describing the allowable content of fields within CSV (Comma Separated Value) files, which gives something to validate against.

Odd to find two presentations about CSV on the same day!

Adam Retter presented on this project today (slides).

It will be interesting to see how much cross-pollination occurs with the CSV on the Web Working Group.

Suggest you follow both groups.
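To make the schema-validation idea concrete, here is a toy sketch in Python; this is not the CSV Schema language the National Archives defines, just the shape of the idea.

```python
# Validate CSV content against declared per-column rules.
import csv
import io
import re

schema = {
    "id":   lambda v: re.fullmatch(r"\d+", v) is not None,
    "date": lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
}

sample = "id,date\n1,2014-07-15\ntwo,15/07/2014\n"

for line_no, row in enumerate(csv.DictReader(io.StringIO(sample)), start=2):
    for column, rule in schema.items():
        if not rule(row[column]):
            print(f"line {line_no}: bad value {row[column]!r} in column {column!r}")
```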

CSV on the Web

Filed under: CSV,Standards,XQuery — Patrick Durusau @ 2:56 pm

CSV on the Web – What’s Happening in the W3C Working Group by Jeni Tennison.

After seeing Software Carpentry: Lessons Learned yesterday, I have a new appreciation for documenting the semantics of data as used by its users.

Not to say we don’t need specialized semantic syntaxes and technologies, but if we expect market share, then we need to follow the software and data users are using.

How important is CSV?

Jeni gives the stats as:

  • >90% of open data is tabular
  • 2/3rds of "CSV" files on data.gov.uk aren't machine-readable

Which means people use customized solutions (read: vendor lock-in).

A good overview of the CSV WG's work so far, with a request for your assistance.

I need to start following this workgroup. Curious to see if they reuse XQuery addressing to annotate CSV files, columns, rows, cells.

PS: If you don't see arrows in the presentation (I didn't), use your space bar to change slides and Esc to see all the slides.

Visualizing ggplot2 internals…

Filed under: Ggplot2,Graphics,Visualization — Patrick Durusau @ 1:27 pm

Visualizing ggplot2 internals with shiny and D3 by Carson Sievert.

From the post:

As I started this project, I became frustrated trying to understand/navigate through the nested list-like structure of ggplot objects. As you can imagine, it isn't an optimal approach to print out the structure every time you want to check out a particular element. Out of this frustration came an idea to build this tool to help interact with and visualize this structure. Thankfully, my wonderful GSoC mentor Toby Dylan Hocking agreed that this project could bring value to the ggplot2 community and encouraged me to pursue it.

By default, this tool presents a radial Reingold–Tilford tree of this nested list structure, but also has options to use the collapsible or Cartesian versions. It also leverages the shinyAce package, which allows users to send arbitrary ggplot2 code to a shiny server that evaluates the results and re-renders the visuals. I'm quite happy with the results as I think this tool is a great way to quickly grasp the internal building blocks of ggplot(s). Please share your thoughts below!

I started with the blog post about the visualization but seeing the visualization is more powerful:

Visualizing ggplot2 internals (demo)

I rather like the radial layout.

For either topic map design or analysis, this looks like a good technique to explore the properties we assign to subjects.

Be Secure, Be Very Secure

Filed under: Cybersecurity,Security — Patrick Durusau @ 1:04 pm

Using strong crypto, TunnelX offers a conversation tool that no one can snoop on by Jeff John Roberts.

From the post:

Between NSA surveillance and giant corporations that sniff our messages for ad money, it sometimes feels as if there's no such thing as a private online conversation. An intriguing group of techno-types and lawyers are trying to change that with a secure new messaging service called TunnelX.

TunnelX, which is free, offers online "tunnels" where two people can meet and share messages and media in a space no one else can see. While TunnelX isn't the only company trying to restore privacy in the post-Snowden era, its tool is worth a look because it is aimed at everyday people — and not just the usual crowd of crypto-heads and paranoiacs.

Jeff gives a good overview of TunnelX and how it can be used by ordinary users.

TunnelX gives the technical skinny as:

Tunnel X "superenciphers" all stored messages and uploaded files with AES, TwoFish, and Serpent using different 256-bit keys for each layer. (AES is the cipher approved by the U.S. National Security Agency for encrypting classified data across all U.S. government agencies; TwoFish and Serpent are the two most well-known "runner-up" AES candidates.)

Tunnel X allows only SSL/TLS-encrypted connections (sometimes called “https” connections). Furthermore, we strongly encourage you to connect with the latest version of TLS (1.2). Finally, as part of our SSL/TLS setup, Tunnel X only allows connections which are secured with a PFS (perfect forward secrecy) ciphersuite. PFS is a technology which prevents encrypted messages from being stored and then decrypted in the future if a server’s private SSL key is ever compromised.
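A toy sketch of layered ("superenciphered") symmetric encryption, not TunnelX's implementation; the cryptography package's Fernet recipe (AES-based) simply stands in for each cipher layer here.

```python
# Layered symmetric encryption: each layer gets its own key, and layers
# are peeled off in reverse order on decryption. Illustration only.
from cryptography.fernet import Fernet

keys = [Fernet.generate_key() for _ in range(3)]   # one key per layer

def encrypt_layers(plaintext: bytes) -> bytes:
    data = plaintext
    for key in keys:                     # innermost layer first
        data = Fernet(key).encrypt(data)
    return data

def decrypt_layers(ciphertext: bytes) -> bytes:
    data = ciphertext
    for key in reversed(keys):           # peel layers in reverse order
        data = Fernet(key).decrypt(data)
    return data

message = b"meet at the usual place"
assert decrypt_layers(encrypt_layers(message)) == message
```

The point of the layering is that compromising one key still leaves the remaining layers intact.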

Under “What is a tunnel?” on the homepage you will find a list of technologies that TunnelX does not use!

I just created an account and the service merits high marks for ease of use!

The one feature I did not see, and that would be useful, is a "delete on read" setting so that messages are deleted as soon as they are read by the intended recipient.

Just another layer of security on top of what TunnelX already offers.

For all the layers of security, realize the black shirts don’t need to decrypt your messages once they discover your identity.

Knowing your identity, they can apply very unreliable techniques to extract messages from you personally. That is one of the problems with saviors of civilization. Given the stakes, no atrocity is beyond them.

July 14, 2014

Flax Clade PoC

Filed under: Indexing,Solr,Taxonomy — Patrick Durusau @ 6:20 pm

Flax Clade PoC by Tom Mortimer.

From the webpage:

Flax Clade PoC is a proof-of-concept open source taxonomy management and document classification system, based on Apache Solr. In its current state it should be considered pre-alpha. As open-source software you are welcome to try, use, copy and modify Clade as you like. We would love to hear any constructive suggestions you might have.

Tom Mortimer tom@flax.co.uk


Taxonomies and document classification

Clade taxonomies have a tree structure, with a single top-level category (e.g. in the example data, "Social Psychology"). There is no distinction between parent and child nodes (except that the former has children) and the hierarchical structure of the taxonomy is completely orthogonal to the node data. The structure may be freely edited.

Each node represents a category, which is represented by a set of “keywords” (words or phrases) which should be present in a document belonging to that category. Not all the keywords have to be present – they are joined with Boolean OR rather than AND. A document may belong to multiple categories, which are ranked according to standard Solr (TF-IDF) scoring. It is also possible to exclude certain keywords from categories.

Clade will also suggest keywords to add to a category, based on the content of the documents already in the category. This feature is currently slow as it uses the standard Solr MoreLikeThis component to analyse a large number of documents. We plan to improve this for a future release by writing a custom Solr plugin.

Documents are stored in a standard Solr index and are categorised dynamically as taxonomy nodes are selected. There is currently no way of writing the categorisation results to the documents in SOLR, but see below for how to export the document categorisation to an XML or CSV file.

A very interesting project!

I am particularly interested in the dynamic categorisation when nodes are selected.
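A minimal sketch of the keyword-OR idea in plain Python (no Solr, and a crude overlap count instead of TF-IDF); only the "Social Psychology" category name comes from the example data, the keywords are invented.

```python
# Categorise a document by keyword overlap: a document belongs to a
# category if any of its keywords occur, ranked by a crude overlap score.
categories = {
    "Social Psychology": {"attitude", "conformity", "group"},
    "Cognitive Psychology": {"memory", "attention", "perception"},
}

def categorise(text):
    words = set(text.lower().split())
    scores = {name: len(words & keywords) for name, keywords in categories.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in ranked if score > 0]

doc = "A study of conformity and group attitude under time pressure"
print(categorise(doc))   # [('Social Psychology', 3)]
```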

CMU Machine Learning Summer School (2014)

Filed under: Machine Learning — Patrick Durusau @ 4:37 pm

CMU Machine Learning Summer School (2014)

From the webpage:

Machine Learning is a foundational discipline that forms the basis of much modern data analysis. It combines theory from areas as diverse as Statistics, Mathematics, Engineering, and Information Technology with many practical and relevant real life applications. The focus of the current summer school is big data analytics, distributed inference, scalable algorithms, and applications to the digital economy. The event is targeted at research students, IT professionals, and academics from all over the world.

This school is suitable for all levels, both for researchers without previous knowledge in Machine Learning, and those wishing to broaden their expertise in this area. That said, some background will prove useful. For a research student, the summer school provides a unique, high-quality, and intensive period of study. It is ideally suited for students currently pursuing, or intending to pursue, research in Machine Learning or related fields. Limited scholarships are available for students to cover accommodation, registration costs, and partial travel expenses.

Videos have been posted at YouTube!

Enjoy!

An Empirical Investigation into Programming Language Syntax

Filed under: Language,Language Design,Programming,Query Language — Patrick Durusau @ 4:02 pm

An Empirical Investigation into Programming Language Syntax by Greg Wilson.

A great synopsis of Andreas Stefik and Susanna Siebert’s “An Empirical Investigation into Programming Language Syntax.” ACM Transactions on Computing Education, 13(4), Nov. 2013.

A sample to interest you in the post:

  1. Programming language designers needlessly make programming languages harder to learn by not doing basic usability testing. For example, “…the three most common words for looping in computer science, for, while, and foreach, were rated as the three most unintuitive choices by non-programmers.”
  2. C-style syntax, as used in Java and Perl, is just as hard for novices to learn as a randomly-designed syntax. Again, this pain is needless, because the syntax of other languages (such as Python and Ruby) is significantly easier.

Let me repeat part of that:

C-style syntax, as used in Java and Perl, is just as hard for novices to learn as a randomly-designed syntax.

Randomly-designed syntax?

Now, think about the latest semantic syntax or semantic query syntax you have read about.

Was it designed for users? Was there any user testing at all?

Is there a lesson here for designers of semantic syntaxes and query languages?

Yes?

I first saw this in Greg Wilson’s Software Carpentry: Lessons Learned video.

Software Carpentry: Lessons Learned

Filed under: Programming,Teaching — Patrick Durusau @ 3:38 pm

Software Carpentry: Lessons Learned by Greg Wilson.

In paper form: Software Carpentry: Lessons Learned.

I would suggest seeing the presentation and reading the paper.

Wilson says: Read How Learning Works

Relevant to how you impart information to your clients/customers.

BTW, the impact of computers on learning is negligible, so stop the orders for tablets/iPads.

List of Emacs Starter Kits

Filed under: Editor — Patrick Durusau @ 3:06 pm

List of Emacs Starter Kits by Xah Lee.

A list of nine (9) Emacs starter kits and links to tutorials.

No, sorry, there is no recommendation on which one is best.

That depends on your preferences and requirements. But unlike some editors (no names), they are completely customizable.

I first saw this in a tweet by ErgoEmacs.

L(λ)THW

Filed under: Lisp,Programming — Patrick Durusau @ 2:54 pm

L(λ)THW Learn Lisp The Hard Way by "the Phoeron" Colin J.E. Lupton

From the preface:

TANSTAAFL

“There Ain't No Such Thing As A Free Lunch… anything free costs twice as much in the long run or turns out worthless.”

Robert A. Heinlein, The Moon Is A Harsh Mistress

Programming is hard. Anyone who says differently is either trying to make you feel inferior to them or sell you something. In the case of many “easy-to-learn” programming languages, both happen to be true. But you're not here for inefficient, glorified, instant-gratification scripting languages that pigeon-hole you into prescribed execution control-flows and common use-cases. You're not here for monolithic imperative languages that have to be tamed into submission for the simplest tasks. You've sought out the Hard Way, and the hardest language to master, Common Lisp. You're not afraid of working for what you want or committing to a new way of thinking—you're here because you want to be a Lisp Hacker, and you're not going to let anything get in your way.

That being said, learning Lisp is not an impossible dream. Like any skill, practice makes perfect—and that's what the Hard Way is all about. Lisp may seem like an ancient mystical secret, cherished and protected by an impenetrable cabal of hacker elites, but that, much like the language's popular association solely with Artificial Intelligence applications, is a misconception. You don't have to be a genius or Black Hat to crack the mystery surrounding the language and the open-source subculture where it thrives. You just have to follow a few essential steps, and apply them without fail.

The biggest secret to Lisp is that it is actually the simplest programming language ever created—and that, coupled with its expressiveness and elegance, is why it is favored exclusively by the best programmers in the world. With hard work, attention to detail, and careful reflection over the subject material, you will be up and running with Lisp and writing real applications much earlier than you could with other, lesser languages.

Lisp dates from 1958, so there is no shortage of materials if you need to search for online resources. 😉

The resources section is a bit "lite," but the works cited do point you to more material.

Enjoy!

I first saw this in a tweet by Computer Science.

Quoc Le's Lectures on Deep Learning

Filed under: Machine Learning,Neural Networks — Patrick Durusau @ 1:32 pm

Quoc Le's Lectures on Deep Learning by Gaurav Trivedi.

From the post:

Dr. Quoc Le from the Google Brain project team (yes, the one that made headlines for creating a cat recognizer) presented a series of lectures at the Machine Learning Summer School (MLSS ’14) in Pittsburgh this week. This is my favorite lecture series from the event till now and I was glad to be able to attend them.

The good news is that the organizers have made available the entire set of video lectures in 4K for you to watch. But since Dr. Le did most of them on the board and did not provide any accompanying slides, I decided to put the contents of the lectures along with the videos here.

I like Gaurav’s “enhanced” version over the straight YouTube version.

I need to go back and look at the cat recognizer. Particularly if I can use it as a filter on a Twitter stream. 😉

I first saw this in Nat Torkington’s Four short links: 14 July 2014.

July 13, 2014

Abstract Algebra

Filed under: Algebra,Mathematics — Patrick Durusau @ 7:19 pm

Abstract Algebra by Benedict Gross, PhD, George Vasmer Leverett Professor of Mathematics, Harvard University.

From the webpage:

Algebra is the language of modern mathematics. This course introduces students to that language through a study of groups, group actions, vector spaces, linear algebra, and the theory of fields.

Videos, notes, problem sets from the Harvard Extension School.

The relationship between these videos and those found on YouTube isn’t clear.

The textbook for the class was Algebra by Michael Artin. (There is a 2nd edition now.)

There are two comments that may motivate you to pursue these lectures:

First, Gross remarks in the first session that there are numerous homework assignments because you are learning a language. Which makes me curious: why isn't math taught like a language?

Second, the Wikipedia article on abstract algebra observes in part:

Numerous textbooks in abstract algebra start with axiomatic definitions of various algebraic structures and then proceed to establish their properties. This creates a false impression that in algebra axioms had come first and then served as a motivation and as a basis of further study. The true order of historical development was almost exactly the opposite. For example, the hypercomplex numbers of the nineteenth century had kinematic and physical motivations but challenged comprehension. Most theories that are now recognized as parts of algebra started as collections of disparate facts from various branches of mathematics, acquired a common theme that served as a core around which various results were grouped, and finally became unified on a basis of a common set of concepts. An archetypical example of this progressive synthesis can be seen in the history of group theory.

Interesting that techniques are developed for quite practical reasons but later justified with greater formality.

Suggests that semantic integration should focus on practical results and leave formal justification for later.

Yes?

I first saw this in a tweet by Steven Strogatz.

Bad Spellers and Typists Rejoice

Filed under: Editor — Patrick Durusau @ 3:54 pm

Bad Spellers and Typists Rejoice

From the post:

Some people are bad spellers or at least consistently have trouble with certain words. Others can spell but are poor typists and constantly mistype words. Some, I suppose, fall into both categories. If any of this describes you, don't despair: Bruce Connor has a solution for you.

Over at Endless Parentheses he presents a bit of Elisp that will look up the correct spelling with ispell (or aspell or whatever you're using) and make an abbreviation for your incorrect spelling so that it will be automatically corrected in the future. It's probably not for everyone but if you consistently make spelling errors—for whatever reason—it may be helpful.

Endless Parentheses is a fairly new blog that concentrates on short, mostly weekly, Emacs-oriented posts. As of this writing there are only eight short posts so you might want to read them all. It won't take much time and you'll probably learn something.

It sounds like they are playing my song!

I make the same spelling errors consistently! 😉

Enjoy!

July 12, 2014

Grimoire

Filed under: Clojure,Functional Programming,Programming — Patrick Durusau @ 7:47 pm

Grimoire

From the description in Clojure Weekly, July 9th, 2014:

Grimoire is a new documentation service and community-contributed example repo similar to ClojureDocs.org. ClojureDocs.org still comes up high in Google searches, despite documenting Clojure 1.3, by providing examples that are not easy to find elsewhere. Grimoire is instead up to date and also gives access to all the examples in ClojureDocs. Grimoire is open to community contribution via git pull requests and will likely be improved in the future to make contributions easier.

For more on Grimoire see: Of Mages and Grimoires.

Use, contribute and enjoy!

