Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 10, 2011

Google1000 dataset

Filed under: Dataset,Image Recognition,Machine Learning — Patrick Durusau @ 6:46 pm

Google1000 dataset

From the post:

This is a dataset of scans of 1000 public domain books that was released to the public at ICDAR 2007. At the time there was no public serving infrastructure, so few people actually got the 120GB dataset. It has since been hosted on Google Cloud Storage and made available for public download: (see the post for the links)

Intended for OCR and machine learning purposes, the results of which you may wish to unite in topic maps with other resources.

Machine Learning (Carnegie Mellon University)

Filed under: Computer Science,CS Lectures,Machine Learning — Patrick Durusau @ 6:33 pm

Machine Learning 10-701/15-781, Spring 2011, Carnegie Mellon University, by Tom Mitchell.

Course Description:

Machine Learning is concerned with computer programs that automatically improve their performance through experience (e.g., programs that learn to recognize human faces, recommend music and movies, and drive autonomous robots). This course covers the theory and practical algorithms for machine learning from a variety of perspectives. We cover topics such as Bayesian networks, decision tree learning, Support Vector Machines, statistical learning methods, unsupervised learning and reinforcement learning. The course covers theoretical concepts such as inductive bias, the PAC learning framework, Bayesian learning methods, margin-based learning, and Occam’s Razor. Short programming assignments include hands-on experiments with various learning algorithms, and a larger course project gives students a chance to dig into an area of their choice. This course is designed to give a graduate-level student a thorough grounding in the methodologies, technologies, mathematics and algorithms currently needed by people who do research in machine learning.

I don't know how other disciplines are faring, but for a variety of CS topics there are enough excellent online materials to complete the equivalent of an undergraduate degree, if not a master's degree, in CS.

November 5, 2011

How to enter a data contest – machine learning for newbies like me

Filed under: Contest,Data Contest,Machine Learning — Patrick Durusau @ 6:43 pm

How to enter a data contest – machine learning for newbies like me

From the post:

I’ve not had much experience with machine learning, most of my work has been a struggle just to get data sets that are large enough to be interesting! That’s a big reason why I turned to the Kaggle community when I needed a good prediction algorithm for my current project. I wasn’t completely off the hook though, I still needed to create an example of our current approach, limited as it is, to serve as a benchmark for the teams. While I was at it, it seemed worthwhile to open up the code too, so I’ve created a new Github project:

https://github.com/petewarden/MLloWorld

It actually produces very poor results, but does demonstrate the basics of how to pull in the data and apply one of scikit-learn’s great collection of algorithms. If you get the itch there’s lots of room for improvement, and the contest has another two weeks to run!
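If you have never used scikit-learn, here is a minimal sketch of the sort of baseline the post describes: read a feature table, fit a classifier, and check it against held-out data. The file name and column layout are assumptions for illustration, not details from Pete's repository.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical contest data: last column is the label, the rest are features.
data = np.loadtxt("train.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print("validation accuracy: %.3f" % accuracy_score(y_val, clf.predict(X_val)))
```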

There is a case to be made for machine learning in the production of topic maps, and what better motivation for learning it than a contest?

Which makes me wonder how to structure something similar for topic maps, that is, contests for creating topic maps from one or more data sets. Coming up with funding for a meaningful prize would not be as hard as setting up contests that are neither too easy nor too hard. At least not for the early contests anyway. 😉

For the early ones, pride of first place might be enough.

Suggestions/Comments?

What Market Researchers could learn from eavesdropping on R2D2

Filed under: Machine Learning,Marketing — Patrick Durusau @ 6:41 pm

What Market Researchers could learn from eavesdropping on R2D2

From the post:

Scott asks: in the context of research and insight, why should we care about what the Machine Learning community is doing?

For those not familiar with Machine Learning, it is a scientific discipline related to artificial intelligence. But it is more concerned with the science of teaching machines to solve useful problems as opposed to trying to get machines to replicate human behavior. If you were to put it in Star Wars terms, a Machine Learning expert would be more focused on building the short, bleeping useful R2D2 than the shiny, linguistically gifted but clumsy C3P0—a machine that is useful and efficient as opposed to a machine that replicates behaviors and mannerisms of humans.

There are many techniques and approaches that marketing insights consultants could borrow from the Machine Learning community. The community is made up of a larger group of researchers and scientists as well as those concerned with market research, and their focus is improving algorithms that can be applied across a wide variety of scientific, technology, business, and engineering problems. And so it is a wonderful source of inspiration for approaches that can be adapted to our own industry.

Since topic mappers aren't numerous enough to be objects of study (yet), I thought this piece on how marketers view the machine learning community might be instructive.

Successful topic mappers will straddle semantic communities and to do that, they need to be adept at what I would call “semantic cross-overs.”

Semantic cross-overs are those people and written pieces that give you a view overarching two or more communities. They are almost always written more from one point of view than the other, but with enough of both to spark ideas in both camps.

Remember, crossing over between two communities isn't judged by your view of the cross-over, but by the views of members of the respective communities. In other words, your topic map between them may seem very clever to you, but unless it is clever to members of those communities, we call it: No Sale!

November 1, 2011

Lab 49 Blog

Filed under: Artificial Intelligence,Finance Services,Machine Learning — Patrick Durusau @ 3:33 pm

Lab 49 Blog

From the main site:

Lab49 is a technology consulting firm that builds advanced solutions for the financial services industry. Our clients include many of the world’s largest investment banks, hedge funds and exchanges. Lab49 designs and delivers some of the most sophisticated and forward-thinking financial applications in the industry today, and has an impeccable delivery record on mission critical systems.

Lab49 helps clients effect positive change in their markets through technological innovation and a rich fabric of industry best practices and first-hand experience. From next-generation trading platforms to innovative risk aggregation and reporting systems to entirely new investment ventures, we enable our clients to realize new business opportunities and gain competitive advantage.

Lab49 cultivates a collaborative culture that is both innovative and delivery-focused. We value intelligent, experienced, and personable engineering professionals that work with clients as partners. With a proven ability to attract and retain industry-leading engineering talent and to forge and leverage valued partnerships, Lab49 continues to innovate at the vanguard of software and technology.

A very interesting blog sponsored by what appears to be a very interesting company, Lab 49.

October 29, 2011

Jubatus

Filed under: Jubatus,Machine Learning — Patrick Durusau @ 7:19 pm

Jubatus: Distributed Online Machine Learning Framework

From the webpage:

The Jubatus library is an online machine learning framework which runs in a distributed environment. The Jubatus library includes these functions:

  • multi-class/binary classification,
  • pre-processing data (for natural language), and
  • process management.

Talk about something that will make you perk up on a rainy afternoon!
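Jubatus ships its own client libraries, which I have not used, so here is a stand-in sketch of what online multi-class learning looks like, using scikit-learn's SGDClassifier with partial_fit: the model is updated batch by batch instead of being retrained on the full data. The data is synthetic and this is not Jubatus code.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1, 2])
# Older scikit-learn versions spell this loss "log" instead of "log_loss".
clf = SGDClassifier(loss="log_loss")

# Pretend the data arrives in small batches, as it would from a stream.
centers = rng.normal(scale=5.0, size=(3, 4))
for _ in range(200):
    y_batch = rng.integers(0, 3, size=20)
    X_batch = centers[y_batch] + rng.normal(size=(20, 4))
    clf.partial_fit(X_batch, y_batch, classes=classes)  # online update

y_test = rng.integers(0, 3, size=500)
X_test = centers[y_test] + rng.normal(size=(500, 4))
print("accuracy on a fresh batch: %.2f" % clf.score(X_test, y_test))
```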

October 20, 2011

Learning Richly Structured Representations From Weakly Annotated Data

Filed under: Artificial Intelligence,Computer Science,Machine Learning — Patrick Durusau @ 6:42 pm

Learning Richly Structured Representations From Weakly Annotated Data by Daphne Koller. (DeGroot Lecture, Carnegie Mellon University, October 14, 2011).

Abstract:

The solution to many complex problems requires that we build up a representation that spans multiple levels of abstraction. For example, to obtain a semantic scene understanding from an image, we need to detect and identify objects and assign pixels to objects, understand scene geometry, derive object pose, and reconstruct the relationships between different objects. Fully annotated data for learning richly structured models can only be obtained in very limited quantities; hence, for such applications and many others, we need to learn models from data where many of the relevant variables are unobserved. I will describe novel machine learning methods that can train models using weakly labeled data, thereby making use of much larger amounts of available data, with diverse levels of annotation. These models are inspired by ideas from human learning, in which the complexity of the learned models and the difficulty of the training instances tackled changes over the course of the learning process. We will demonstrate the applicability of these ideas to various problems, focusing on the problem of holistic computer vision.

If your topic map application involves computer vision, this is a must-see video.

For text/data miners, are you faced with similar issues? Limited amounts of richly annotated training data?

I saw a slide, which I will run down later, that showed text running from plain text to text annotated with ontological data. I mention that because that isn't what a user sees when they "read" a text. They see implied relationships, references to other subjects, other instances of a particular subject, and all of that passes in the instant of recognition.

Perhaps the problem of correct identification in text is one of too few dimensions rather than too many.
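As a toy illustration of learning from weakly annotated data, here is a sketch using scikit-learn's LabelSpreading, where 90% of the training labels are hidden (marked -1) and the algorithm propagates the few known labels across the data. The dataset and the fraction hidden are arbitrary choices, not anything from Koller's lecture.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import LabelSpreading

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)

# Weak annotation: hide 90% of the labels (-1 means "unlabelled" to sklearn).
y_weak = y.copy()
hidden = rng.random(len(y)) < 0.9
y_weak[hidden] = -1

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_weak)
acc = (model.transduction_[hidden] == y[hidden]).mean()
print("accuracy on the points that were never labelled: %.2f" % acc)
```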

October 15, 2011

25 years of Machine Learning Journal

Filed under: Machine Learning — Patrick Durusau @ 4:29 pm

25 years of Machine Learning Journal

KDnuggets reports free access to the Machine Learning Journal until 31 October 2011.

Take the time to decide if you need access on a regular basis.

RadioVision: FMA Melds w Echo Nest’s Musical Brain

Filed under: Data Mining,Machine Learning,Natural Language Processing — Patrick Durusau @ 4:28 pm

RadioVision: FMA Melds w Echo Nest’s Musical Brain

From the post:

The Echo Nest has indexed the Free Music Archive catalog, integrating the most incredible music intelligence platform with the finest collection of free music.

The Echo Nest has been called “the most important music company on Earth” for good reason: 12 years of research at UC Berkeley, Columbia and MIT factored into the development of their “musical brain.” The platform combines large-scale data mining, natural language processing, acoustic analysis and machine learning to automatically understand how the online world describes every artist, extract musical attributes like tempo and time signature, learn about music trends (see: “hotttnesss“), and a whole lot more. Echo Nest then shares all of this data through a free and open API. [read more here]

Add music to your topic map!

October 14, 2011

Hierarchical Temporal Memory

Filed under: CS Lectures,Hierarchical Temporal Memory (HTM),Machine Learning — Patrick Durusau @ 6:23 pm

Hierarchical Temporal Memory: How a Theory of the Neocortex May Lead to Truly Intelligent Machines by Jeff Hawkins.

Don’t skip because of the title!

Hawkins covers his theory of the neocortex, but however you feel about that, two-thirds of the presentation is on algorithms, completely new material.

Very cool presentation on "Fixed Sparsity Distributed Representation" and lots of neuroscience material. I need to listen to it again and then read the books/papers.

What I liked about it was the notion that even in very noisy or incomplete data contexts, highly reliable identifications can be made.

True enough, Hawkins was talking about vision, etc., but he didn’t bring up any reasons why that could not work in other data environments.

In other words, when can a program treat extra data about a subject as noise and recognize it anyway?

Or, if some information about a subject is missing, can a program still reliably recognize it?

Or, if we only want to store some of the information about a subject, can we still have reliable recognition?

Don’t know if any, some or all of those are possible but it is certainly worth finding out.
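Hawkins' own algorithms are more involved, but here is a toy numpy sketch of why sparse distributed representations tolerate noisy and missing bits: each stored code has only a few active bits, so even a corrupted observation still overlaps its own code far more than anyone else's.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_active, n_subjects = 2048, 40, 100

# Each subject gets a sparse binary code: 40 active bits out of 2048.
codes = np.zeros((n_subjects, n_bits), dtype=bool)
for code in codes:
    code[rng.choice(n_bits, n_active, replace=False)] = True

def recognize(observed):
    """Return the stored subject whose code overlaps the observation most."""
    return int((codes & observed).sum(axis=1).argmax())

target = 17
noisy = codes[target].copy()
noisy[rng.choice(np.flatnonzero(noisy), 15, replace=False)] = False  # drop 15 real bits
noisy[rng.choice(n_bits, 15, replace=False)] = True                  # add 15 noise bits
print(recognize(noisy) == target)  # overlap with the right code still wins
```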

Description:

Jeff Hawkins (Numenta founder) presents as part of the UBC Department of Computer Science’s Distinguished Lecture Series, March 18, 2010.

Coaxing computers to perform basic acts of perception and robotics, let alone high-level thought, has been difficult. No existing computer can recognize pictures, understand language, or navigate through a cluttered room with anywhere near the facility of a child. Hawkins and his colleagues have developed a model of how the neocortex performs these and other tasks. The theory, called Hierarchical Temporal Memory, explains how the hierarchical structure of the neocortex builds a model of its world and uses this model for inference and prediction. To turn this theory into a useful technology, Hawkins has created a company called Numenta. In this talk Hawkins will describe the theory, its biological basis, and progress in applying Hierarchical Temporal Memory to machine learning problems.

Part of this theory was described in Hawkins’ 2004 book, On Intelligence. Further information can be found at www.Numenta.com

October 10, 2011

FPGA-based MapReduce Framework for Machine Learning

Filed under: Machine Learning,MapReduce — Patrick Durusau @ 6:19 pm

FPGA-based MapReduce Framework for Machine Learning by Ningyi XU.

From the description:

Machine learning algorithms are becoming increasingly important in our daily life. However, training on very large scale datasets is usually very slow. FPGA is a reconfigurable platform that can achieve high parallelism and data throughput. Many works have been done on accelerating machine learning algorithms on FPGA. In this paper, we adapt Google’s MapReduce model to FPGA by realizing an on-chip MapReduce framework for machine learning algorithms. A processor scheduler is implemented for the maximum computation resource utilization and load balancing. In accordance with the characteristics of many machine learning algorithms, a common data access scheme is carefully designed to maximize data throughput for large scale dataset. This framework hides the task control, synchronization and communication away from designers to shorten development cycles. In a case study of RankBoost acceleration, up to 31.8x speedup is achieved versus CPU-based design, which is comparable with a fully manually designed version. We also discuss the implementations of two other machine learning algorithms, SVM and PageRank, to demonstrate the capability of the framework.

Not quite ready for general use but this looks very promising.

The usual discussion of "big data" made me start thinking: do we need lots of instances of data to have "big data," or have we trimmed down the data that surrounds us so we can manage it without "big data" requirements?

There are lots of examples of the former; can you think of examples of the latter?

October 8, 2011

Machine Learning on Hadoop at Huffington Post | AOL

Filed under: Hadoop,Machine Learning — Patrick Durusau @ 8:12 pm

Machine Learning on Hadoop at Huffington Post | AOL

Nice slide deck on creating a pluggable platform for testing large numbers of algorithms and then selecting the best.
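The deck is about Hadoop-scale infrastructure, but the core idea, plug in many algorithms and keep the one that scores best, looks like this in miniature with scikit-learn and a toy dataset:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
candidates = {
    "logistic regression": LogisticRegression(max_iter=2000),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(n_estimators=200),
    "gradient boosting": GradientBoostingClassifier(),
}
# Cross-validate every candidate and rank them, best first.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {score:.3f}")
```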

October 5, 2011

Machine Learning Module

Filed under: Machine Learning — Patrick Durusau @ 6:58 pm

Machine Learning Module

For anyone taking the machine learning course at Stanford this Fall, here are some supplemental materials you may find interesting.

October 4, 2011

Adding Machine Learning to a Web App

Filed under: Artificial Intelligence,Machine Learning,Web Applications — Patrick Durusau @ 7:53 pm

Adding Machine Learning to a Web App by Richard Dallaway.

As Richard points out, the example is contrived and I don’t think you will be rushing off to add machine learning to a web app based on these slides.

That said, I think his point that you should pilot the data first is a good one.

If you misunderstand the data, then your results are not going to be very useful. Hmmm, maybe there is an AI/ML axiom in there somewhere. Probably already discovered; let me know if you run across it.

October 2, 2011

Machine Learning with Hadoop

Filed under: Data Mining,Hadoop,Machine Learning — Patrick Durusau @ 6:34 pm

Machine Learning with Hadoop by Josh Patterson.

Very current (Sept. 2011) review of Hadoop, data mining and related issues. Plus pointers to software projects such as Lumberyard, which deals with terabyte-sized time series data.

September 27, 2011

Learning Discriminative Metrics via Generative Models and Kernel Learning

Filed under: Kernel Methods,Machine Learning — Patrick Durusau @ 6:50 pm

Learning Discriminative Metrics via Generative Models and Kernel Learning by Yuan Shi, Yung-Kyun Noh, Fei Sha, and Daniel D. Lee.

Abstract:

Metrics specifying distances between data points can be learned in a discriminative manner or from generative models. In this paper, we show how to unify generative and discriminative learning of metrics via a kernel learning framework. Specifically, we learn local metrics optimized from parametric generative models. These are then used as base kernels to construct a global kernel that minimizes a discriminative training criterion. We consider both linear and nonlinear combinations of local metric kernels. Our empirical results show that these combinations significantly improve performance on classification tasks. The proposed learning algorithm is also very efficient, achieving order of magnitude speedup in training time compared to previous discriminative baseline methods.

A combination of machine learning techniques within a single framework.
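To make the kernel-combination idea concrete, here is a minimal sketch: two RBF kernels with different bandwidths stand in for the paper's locally learned metric kernels, and a weighted sum of them feeds a precomputed-kernel SVM. It illustrates combining base kernels, not the authors' algorithm, and a real experiment would pick the weight on a validation split rather than the test set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two "base kernels" with different bandwidths.
gammas = (1e-4, 1e-5)
K_tr = [rbf_kernel(X_tr, X_tr, gamma=g) for g in gammas]
K_te = [rbf_kernel(X_te, X_tr, gamma=g) for g in gammas]

best = (0.0, None)
for w in np.linspace(0, 1, 11):  # convex combination weights
    clf = SVC(kernel="precomputed").fit(w * K_tr[0] + (1 - w) * K_tr[1], y_tr)
    acc = clf.score(w * K_te[0] + (1 - w) * K_te[1], y_te)
    if acc > best[0]:
        best = (acc, w)
print("best accuracy %.3f at weight %.1f" % best)
```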

It may be some bias in my reading patterns, but I don't recall any explicit combination of human + machine learning techniques. I don't take analysis of search logs to be an explicit human contribution, since the analysis is guessing as to why one particular link and not another was chosen. I suppose time spent on the chosen resource might be an indication, but a search log per se isn't going to give that level of detail.

For that level of detail you would need browsing history. It would be interesting to see if a research library or perhaps an employer (fewer "consent" issues) would permit browsing history collection over some long period of time, say 3 to 6 months, so that not only the search log but the entire browsing history is captured.

Hard to say if that would result in enough increased accuracy on search results to be worth the trouble.

An interesting paper that combines purely machine learning techniques and promises significant gains. What these plus human learning would produce remains a subject for future research papers.

September 26, 2011

DAta Mining & Exploration (DAME)

Filed under: Astroinformatics,Data Mining,Machine Learning — Patrick Durusau @ 7:00 pm

DAta Mining & Exploration (DAME)

From the website:

What is DAME

Nowadays, many scientific areas share the same need of being able to deal with massive and distributed datasets and to perform on them complex knowledge extraction tasks. This simple consideration is behind the international efforts to build virtual organizations such as, for instance, the Virtual Observatory (VObs). DAME (DAta Mining & Exploration) is an innovative, general purpose, Web-based, distributed data mining infrastructure specialized in Massive Data Sets exploration with machine learning methods.

Initially fine tuned to deal with astronomical data only, DAME has evolved in a general purpose platform program, hosting a cloud of applications and services useful also in other domains of human endeavor.

DAME is an evolving platform and new services as well as additional features are continuously added. The modular architecture of DAME can also be exploited to build applications, finely tuned to specific needs.

Follow DAME on YouTube

The project represents what is commonly considered an important element of e-science: a stronger multi-disciplinary approach based on the mutual interaction and interoperability between different scientific and technological fields (nowadays defined as X-Informatics, such as Astro-Informatics). Such an approach may have significant implications in the Knowledge Discovery in Databases process, where even near-term developments in the computing infrastructure which links data, knowledge and scientists will lead to a transformation of the scientific communication paradigm and will improve the discovery scenario in all sciences.

So far there is only one video on YouTube, and it could lose the background music with no ill effect.

The lessons learned (or applied) here should be applicable to other situations with very large data sets, say from satellites orbiting the Earth?

September 23, 2011

Top Scoring Pairs for Feature Selection in Machine Learning and Applications to Cancer Outcome Prediction

Filed under: Bioinformatics,Biomedical,Classifier,Machine Learning,Prediction — Patrick Durusau @ 6:15 pm

Top Scoring Pairs for Feature Selection in Machine Learning and Applications to Cancer Outcome Prediction by Ping Shi, Surajit Ray, Qifu Zhu and Mark A Kon.

BMC Bioinformatics 2011, 12:375 doi:10.1186/1471-2105-12-375 Published: 23 September 2011

Abstract:

Background

The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.

Results

We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher’s discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets

Conclusions

The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.

Knowing the tools that are already in use in bioinformatics will help you design topic map applications of interest to those in that field. And this is a very nice combination of methods to study on its own.
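For readers who want the flavor of the hybrid scheme, here is a sketch: score gene pairs by the top-scoring-pair criterion (how differently the ordering of two genes behaves across the two classes), keep the genes from the top-k pairs, and hand that reduced subspace to an SVM. The data is synthetic, so the accuracy is meaningless; only the wiring is of interest.

```python
import numpy as np
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))      # toy expression matrix: samples x genes
y = rng.integers(0, 2, size=100)    # toy binary outcome

def tsp_scores(X, y):
    """Score each gene pair by |P(g_i < g_j | y=0) - P(g_i < g_j | y=1)|."""
    scores = {}
    for i, j in combinations(range(X.shape[1]), 2):
        less = X[:, i] < X[:, j]
        scores[(i, j)] = abs(less[y == 0].mean() - less[y == 1].mean())
    return scores

k = 10
scores = tsp_scores(X, y)
top_pairs = sorted(scores, key=scores.get, reverse=True)[:k]
genes = sorted({g for pair in top_pairs for g in pair})

acc = cross_val_score(SVC(kernel="linear"), X[:, genes], y, cv=5).mean()
print("genes kept:", genes, " CV accuracy: %.2f" % acc)
```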

ParLearning 2012 (silos or maps?)

ParLearning 2012 : Workshop on Parallel and Distributed Computing for Machine Learning and Inference Problems

Dates:

When May 25, 2012 – May 25, 2012
Where Shanghai, China
Submission Deadline Dec 19, 2011
Notification Due Feb 1, 2012
Final Version Due Feb 21, 2012

From the notice:

HIGHLIGHTS

  • Foster collaboration between HPC community and AI community
  • Applying HPC techniques for learning problems
  • Identifying HPC challenges from learning and inference
  • Explore a critical emerging area with strong industry interest without overlapping with existing IPDPS workshops
  • Great opportunity for researchers worldwide for collaborating with Chinese Academia and Industry

CALL FOR PAPERS

Authors are invited to submit manuscripts of original unpublished research that demonstrate a strong interplay between parallel/distributed computing techniques and learning/inference applications, such as algorithm design and libraries/framework development on multicore/ manycore architectures, GPUs, clusters, supercomputers, cloud computing platforms that target applications including but not limited to:

  • Learning and inference using large scale Bayesian Networks
  • Large scale inference algorithms using parallel TPIC models, clustering and SVM etc.
  • Parallel natural language processing (NLP).
  • Semantic inference for disambiguation of content on web or social media
  • Discovering and searching for patterns in audio or video content
  • On-line analytics for streaming text and multimedia content
  • Comparison of various HPC infrastructures for learning
  • Large scale learning applications in search engine and social networks
  • Distributed machine learning tools (e.g., Mahout and IBM parallel tool)
  • Real-time solutions for learning algorithms on parallel platforms

If you are wondering what role topic maps have to play in this arena, ask yourself the following question:

Will the systems and techniques demonstrated at this conference use the same means to identify the same subjects?*

If your answer is no, what would you suggest is the solution for mapping different identifications of the same subjects together?

My answer to that question is to use topic maps.

*Whatever you ascribe as its origin, semantic diversity is part and parcel of the human condition. We can either develop silos or maps across silos. Which do you prefer?

September 22, 2011

Sparse Machine Learning Methods for Understanding Large Text Corpora

Filed under: Machine Learning,Sparse Learning,Text Analytics — Patrick Durusau @ 6:30 pm

Sparse Machine Learning Methods for Understanding Large Text Corpora (pdf) by Laurent El Ghaoui, Guan-Cheng Li, Viet-An Duong, Vu Pham, Ashok Srivastava, and Kanishka Bhaduri. Status: Accepted for publication in Proc. Conference on Intelligent Data Understanding, 2011.

Abstract:

Sparse machine learning has recently emerged as powerful tool to obtain models of high-dimensional data with high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents.

I suppose it depends on your background (mine includes a law degree and a decade of practice) but when I read:

The ASRS data contains several of the crucial challenges involved under the general banner of “large-scale text data understanding”. First, its scale is huge, and growing rapidly, making the need for automated analyses of the processed reports more crucial than ever. Another issue is that the reports themselves are far from being syntactically correct, with lots of abbreviations, orthographic and grammatical errors, and other shortcuts. Thus we are not facing a corpora with well-structured language having clearly defined rules, as we would if we were to consider a corpus of laws or bills or any other well-redacted data set.

I thought I would fall out of my chair. I don’t think I have ever heard of a “corpus of laws or bills” being described as a “…well-redacted data set.”

There was a bill passed in the US Congress last year that, despite being acted on by both Houses and who knows how many production specialists, was passed without a name.

Apologies for the digression.

From the paper:

Our paper makes the claim that sparse learning methods can be very useful to the understanding of large text databases. Of course, machine learning methods in general have already been successfully applied to text classification and clustering, as evidenced for example by [21]. We will show that sparsity is an important added property that is a crucial component in any tool aiming at providing interpretable statistical analysis, allowing in particular efficient multi-document summarization, comparison, and visualization of huge-scale text corpora.

You will need to read the paper for the details but I think it clearly demonstrates that sparse learning methods are useful for exploring large text databases. While it may be the case that your users have a view of their data, it is equally likely that you will be called upon to mine a text database and to originate a navigation overlay for it. That will require exploring the data and developing an understanding of it.
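As a small taste of what "sparse" buys you, here is a sketch of comparative summarization in the paper's spirit: an L1-penalized logistic regression over TF-IDF features separates two tiny corpora, and the handful of terms with non-zero weights act as the comparative summary. The example documents are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "runway incursion during taxi after tower clearance misunderstood",
    "aircraft crossed the runway hold short line during taxi",
    "altitude deviation during climb after autopilot disengaged",
    "crew noticed altitude deviation while reprogramming the autopilot",
]
labels = [0, 0, 1, 1]  # 0 = runway reports, 1 = altitude reports

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# The L1 penalty drives most coefficients to exactly zero;
# C is set loosely here so a few terms survive on this tiny example.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0).fit(X, labels)
terms = np.array(vec.get_feature_names_out())
kept = clf.coef_[0] != 0
for term, weight in sorted(zip(terms[kept], clf.coef_[0][kept]),
                           key=lambda p: -abs(p[1])):
    print(f"{term:15s} {weight:+.2f}")  # the surviving terms are the "summary"
```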

For all the projections of the need for data analysts and required technical skills, without insight and imagination they will just be going through the motions.

(Applying sparse learning methods to new areas is an example of imagination.)

September 21, 2011

Using Machine Learning to Detect Malware Similarity

Filed under: Machine Learning,Malware,Similarity — Patrick Durusau @ 7:07 pm

Using Machine Learning to Detect Malware Similarity by Sagar Chaki.

From the post:

Malware, which is short for “malicious software,” consists of programming aimed at disrupting or denying operation, gathering private information without consent, gaining unauthorized access to system resources, and other inappropriate behavior. Malware infestation is of increasing concern to government and commercial organizations. For example, according to the Global Threat Report from Cisco Security Intelligence Operations, there were 287,298 “unique malware encounters” in June 2011, double the number of incidents that occurred in March. To help mitigate the threat of malware, researchers at the SEI are investigating the origin of executable software binaries that often take the form of malware. This posting augments a previous posting describing our research on using classification (a form of machine learning) to detect “provenance similarities” in binaries, which means that they have been compiled from similar source code (e.g., differing by only minor revisions) and with similar compilers (e.g., different versions of Microsoft Visual C++ or different levels of optimization).

Interesting study in the development of ways to identify a subject that is trying to hide. Not to mention some hard core disassembly and other techniques.

September 19, 2011

Enrycher

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 7:56 pm

Enrycher

Interesting site but you have to dig for further information.

ENRYCHER – SERVICE ORIENTED TEXT ENRICHMENT is a paper that I located about the site.

From the introduction:

In our experience, many knowledge extraction scenarios generally consist of multiple steps, starting with natural language processing, which are in turn used in higher level annotations, either as entities or document-level annotations. This in turn yields a rather complex dependency scheme between separate components. Such complexity growth is a common scenario in general information systems development. Therefore, we decided to mitigate this by applying a service-oriented approach to integration of a knowledge extraction component stack. The motivation behind Enrycher[17] is to have a single web service endpoint that could perform several of these steps, which we refer to as ‘enrichments’, without requiring the user to bother with setting up pre-processing infrastructure himself.

Note the critical statement: "…without requiring the user to bother with setting up pre-processing infrastructure himself."

The lower the bar to entry, the more participants you will have. What’s unclear about that?
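Enrycher is exposed as a web service, and I have not written against its endpoint, so here is a local stand-in for the kind of enrichment stack the paper describes, using NLTK to go from raw text to tokens, part-of-speech tags and named entities. The NLTK package names for the downloads vary slightly between versions.

```python
import nltk

# One-time model downloads; names may differ slightly across NLTK versions.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

text = "Patrick Durusau wrote about Enrycher after reading a paper from the Jozef Stefan Institute."
tokens = nltk.word_tokenize(text)   # low-level natural language processing ...
tagged = nltk.pos_tag(tokens)       # ... feeds the next enrichment step ...
tree = nltk.ne_chunk(tagged)        # ... which feeds entity-level annotation.

for subtree in tree:
    if hasattr(subtree, "label"):   # named entities found by the chunker
        print(subtree.label(), " ".join(tok for tok, _ in subtree))
```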

September 13, 2011

Practical Aggregation of Semantical Program Properties for Machine Learning Based Optimization

Filed under: Machine Learning,Statistical Learning,Vectors — Patrick Durusau @ 7:14 pm

Practical Aggregation of Semantical Program Properties for Machine Learning Based Optimization by Mircea Namolaru, Albert Cohen, Grigori Fursin, Ayal Zaks, and Ari Freund.

ABSTRACT

Iterative search combined with machine learning is a promising approach to design optimizing compilers harnessing the complexity of modern computing systems. While traversing a program optimization space, we collect characteristic feature vectors of the program, and use them to discover correlations across programs, target architectures, data sets, and performance. Predictive models can be derived from such correlations, effectively hiding the time-consuming feedback-directed optimization process from the application programmer.

One key task of this approach, naturally assigned to compiler experts, is to design relevant features and implement scalable feature extractors, including statistical models that filter the most relevant information from millions of lines of code. This new task turns out to be a very challenging and tedious one from a compiler construction perspective. So far, only a limited set of ad-hoc, largely syntactical features have been devised. Yet machine learning is only able to discover correlations from information it is fed with: it is critical to select topical program features for a given optimization problem in order for this approach to succeed.

We propose a general method for systematically generating numerical features from a program. This method puts no restrictions on how to logically and algebraically aggregate semantical properties into numerical features. We illustrate our method on the difficult problem of selecting the best possible combination of 88 available optimizations in GCC. We achieve 74% of the potential speedup obtained through iterative compilation on a wide range of benchmarks and four different general-purpose and embedded architectures. Our work is particularly relevant to embedded system designers willing to quickly adapt the optimization heuristics of a mainstream compiler to their custom ISA, microarchitecture, benchmark suite and workload. Our method has been integrated with the publicly released MILEPOST GCC [14].

Read the portions on extracting features, inference of new relations, extracting relations from programs, extracting features from relations and tell me this isn’t a description of pre-topic map processing! 😉
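The paper extracts its features inside GCC's intermediate representation; as a very rough analogue of "turn program properties into numerical features and learn from them," here is a sketch that counts a few syntactic constructs in source snippets and fits a model to predict which flag bundle wins. Everything in it, snippets, labels, features, is made up for illustration.

```python
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def features(source):
    """Crude syntactic counts standing in for the paper's semantic relations."""
    return [
        len(source.splitlines()),
        len(re.findall(r"\bfor\b|\bwhile\b", source)),  # loop constructs
        len(re.findall(r"\bif\b", source)),             # branches
        len(re.findall(r"\w+\s*\(", source)),           # call sites (roughly)
    ]

# Invented training set: snippets plus the index of the flag bundle that
# "won" an imaginary iterative search (labels 0, 1, 2).
snippets = [
    "for (i = 0; i < n; i++) a[i] = b[i] * c;",
    "if (x > 0) y = f(x); else y = g(x);",
    "while (p) { p = p->next; count++; }",
    "for (i = 0; i < n; i++) for (j = 0; j < n; j++) c[i][j] = 0;",
]
best_flags = [1, 0, 1, 2]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(np.array([features(s) for s in snippets]), best_flags)
print(model.predict([features("for (k = 0; k < m; k++) s += v[k];")]))
```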

September 11, 2011

Efficient P2P Ensemble Learning with Linear Models on Fully Distributed Data

Filed under: Ensemble Methods,Machine Learning,P2P — Patrick Durusau @ 7:02 pm

Efficient P2P Ensemble Learning with Linear Models on Fully Distributed Data by Róbert Ormándi, István Hegedűs, and Márk Jelasity.

Abstract:

Machine learning over fully distributed data poses an important problem in peer-to-peer (P2P) applications. In this model we have one data record at each network node, but without the possibility to move raw data due to privacy considerations. For example, user profiles, ratings, history, or sensor readings can represent this case. This problem is difficult, because there is no possibility to learn local models, yet the communication cost needs to be kept low. Here we propose gossip learning, a generic approach that is based on multiple models taking random walks over the network in parallel, while applying an online learning algorithm to improve themselves, and getting combined via ensemble learning methods. We present an instantiation of this approach for the case of classification with linear models. Our main contribution is an ensemble learning method which-through the continuous combination of the models in the network-implements a virtual weighted voting mechanism over an exponential number of models at practically no extra cost as compared to independent random walks. Our experimental analysis demonstrates the performance and robustness of the proposed approach.

Interesting. In a topic map context, I wonder about creating associations based on information that is not revealed to the peer making the association. Or to the peer suggesting the association?
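To see the gossip idea in motion without any networking, here is a toy simulation: each node holds one labelled record, several linear models take random walks over the nodes updating themselves with an online logistic-regression step, and at the end the models are averaged as a crude stand-in for the paper's virtual voting. It is a cartoon of the approach, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_features, n_models, n_steps = 200, 5, 10, 2000

# One labelled record per node (the fully distributed setting).
true_w = rng.normal(size=n_features)
X = rng.normal(size=(n_nodes, n_features))
y = (X @ true_w > 0).astype(float)

def sgd_step(w, x, target, lr=0.1):
    """One logistic-regression gradient step on a single record."""
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return w - lr * (p - target) * x

# Several models take independent random walks, learning online as they go.
models = [np.zeros(n_features) for _ in range(n_models)]
for _ in range(n_steps):
    for m in range(n_models):
        node = rng.integers(n_nodes)          # hop to a random node
        models[m] = sgd_step(models[m], X[node], y[node])

ensemble = np.mean(models, axis=0)            # combine the walked models
accuracy = ((X @ ensemble > 0) == (y > 0)).mean()
print("ensemble accuracy on the nodes' records: %.2f" % accuracy)
```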

September 7, 2011

Photometric Catalogue of Quasars and Other Point Sources in the Sloan Digital Sky Survey

Filed under: Astroinformatics,Machine Learning — Patrick Durusau @ 6:57 pm

Photometric Catalogue of Quasars and Other Point Sources in the Sloan Digital Sky Survey by Sheelu Abraham, Ninan Sajeeth Philip, Ajit Kembhavi, Yogesh G Wadadekar, and Rita Sinha. (Submitted on 9 Nov 2010 (v1), last revised 25 Aug 2011 (this version, v3))

Abstract:

We present a catalogue of about 6 million unresolved photometric detections in the Sloan Digital Sky Survey Seventh Data Release classifying them into stars, galaxies and quasars. We use a machine learning classifier trained on a subset of spectroscopically confirmed objects from 14th to 22nd magnitude in the SDSS i-band. Our catalogue consists of 2,430,625 quasars, 3,544,036 stars and 63,586 unresolved galaxies from 14th to 24th magnitude in the SDSS i-band. Our algorithm recovers 99.96% of spectroscopically confirmed quasars and 99.51% of stars to i ~ 21.3 in the colour window that we study. The level of contamination due to data artefacts for objects beyond i = 21.3 is highly uncertain and all mention of completeness and contamination in the paper are valid only for objects brighter than this magnitude. However, a comparison of the predicted number of quasars with the theoretical number counts shows reasonable agreement.

OK, admittedly of more interest to me than to probably anyone else who reads this blog.

Still, every machine learning technique and data requirement that you learn has potential application in other fields.

Vowpal Wabbit 6.0

Filed under: Machine Learning,Vowpal Wabbit — Patrick Durusau @ 6:47 pm

Vowpal Wabbit 6.0

From the post:

I just released Vowpal Wabbit 6.0. Since the last version:

  1. VW is now 2-3 orders of magnitude faster at linear learning, primarily thanks to Alekh. Given the baseline, this is loads of fun, allowing us to easily deal with terafeature datasets, and dwarfing the scale of any other open source projects. The core improvement here comes from effective parallelization over kilonode clusters (either Hadoop or not). This code is highly scalable, so it even helps with clusters of size 2 (and doesn’t hurt for clusters of size 1). The core allreduce technique appears widely and easily reused—we’ve already used it to parallelize Conjugate Gradient, LBFGS, and two variants of online learning. We’ll be documenting how to do this more thoroughly, but for now “README_cluster” and associated scripts should provide a good starting point.
  2. The new LBFGS code from Miro seems to commonly dominate the existing conjugate gradient code in time/quality tradeoffs.
  3. The new matrix factorization code from Jake adds a core algorithm.
  4. We finally have basic persistent daemon support, again with Jake’s help.
  5. Adaptive gradient calculations can now be made dimensionally correct, following up on Paul’s post, yielding a better algorithm. And Nikos sped it up further with SSE native inverse square root.
  6. The LDA core is perhaps twice as fast after Paul educated us about SSE and representational gymnastics.

All of the above was done without adding significant new dependencies, so the code should compile easily.

The VW mailing list has been slowly growing, and is a good place to ask questions.
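If you want to point VW at your own data, the input is a plain-text format of roughly `label | name:value name:value ...` per line; here is a sketch that writes a scikit-learn dataset into that shape, with typical command lines noted in comments (check `vw --help` on your build before trusting the flags).

```python
from sklearn.datasets import load_breast_cancer

# Labels for logistic loss are -1/+1; the output file name is arbitrary.
X, y = load_breast_cancer(return_X_y=True)
with open("train.vw", "w") as out:
    for row, label in zip(X, y):
        feats = " ".join(f"f{i}:{value:.4f}" for i, value in enumerate(row))
        out.write(f"{1 if label else -1} | {feats}\n")

# Then, from the shell (typical invocations):
#   vw -d train.vw --loss_function logistic -f model.vw
#   vw -d test.vw -i model.vw -t -p predictions.txt
```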

September 1, 2011

Greenplum Community

Filed under: Algorithms,Analytics,Machine Learning,SQL — Patrick Durusau @ 6:00 pm

A post by Alex Popescu, Data Scientist Summit Videos, led me to discover the Greenplum Community.

Hosted by Greenplum:

Greenplum is the pioneer of Enterprise Data Cloud™ solutions for large-scale data warehousing and analytics, providing customers with flexible access to all their data for business intelligence and advanced analytics. Greenplum offers industry-leading performance at a low cost for companies managing terabytes to petabytes of data. Data-driven businesses around the world, including NASDAQ, NYSE Euronext, Silver Spring Networks and Zions Bancorporation, have adopted Greenplum Database-based products to support their mission-critical business functions.

Registration (free) brings access to the videos from the Data Scientist Summit.

The “community” is focused on Greenplum software (there is a “community” edition). Do be aware that Greenplum Database CE is a 1.7 GB download. Just so you know.

August 31, 2011

Big Learning 2011

Filed under: Conferences,Humor,Machine Learning — Patrick Durusau @ 7:39 pm

Big Learning 2011 : Big Learning: Algorithms, Systems, and Tools for Learning at Scale

Dates

When Dec 16, 2011 – Dec 17, 2011
Where Sierra Nevada, Spain
Submission Deadline Sep 30, 2011
Notification Due Oct 21, 2011
Final Version Due Nov 11, 2011

From the call:

Big Learning: Algorithms, Systems, and Tools for Learning at Scale

NIPS 2011 Workshop (http://www.biglearn.org)

Submissions are solicited for a two day workshop December 16-17 in Sierra Nevada, Spain.

This workshop will address tools, algorithms, systems, hardware, and real-world problem domains related to large-scale machine learning (“Big Learning”). The Big Learning setting has attracted intense interest with active research spanning diverse fields including machine learning, databases, parallel and distributed systems, parallel architectures, and programming languages and abstractions. This workshop will bring together experts across these diverse communities to discuss recent progress, share tools and software, identify pressing new challenges, and to exchange new ideas. Topics of interest include (but are not limited to):

It looks like an interesting conference but “big” doesn’t add anything.

To head off future “big” clutter, I hereby claim copyright, trademark, etc., protection under various galactic and inter-galactic treaties and laws for:

  • big blogging
  • big tweeting
  • big microformats
  • big IM
  • big IM’NOT
  • big smileys
  • big imaginary but not instantiated spaces
  • big cells
  • big things that are not cells
  • big words that look like CS at a distance
  • big …. well, I will be expanding this list with your non-obscene suggestions, provided you transfer ownership to me.

August 29, 2011

RuSSIR/EDBT 2011 Summer School

Filed under: Dataset,Machine Learning — Patrick Durusau @ 6:25 pm

RuSSIR/EDBT 2011 Summer School

Machine learning tasks, with task descriptions and training set data.

RuSSIR machine learning contest winners' presentations

Contest tasks are described on http://bit.ly/russir2011. Results are presented in the previous post: http://bit.ly/pr6bSz

Yura Perov: http://dl.dropbox.com/u/1572852/RussirResults/yura_perov_ideas_for_practical_task.pptx

Dmitry Kan and Ivan Golubev: http://dl.dropbox.com/u/1572852/RussirResults/Russir_regression-task-ivan_dima.pptx

Nikita Zhiltsov: http://dl.dropbox.com/u/1572852/RussirResults/nzhiltsov_task2.pdf

August 28, 2011

NPSML Library – C – Machine Learning

Filed under: Machine Learning,Software — Patrick Durusau @ 7:59 pm

Naval Postgraduate School Machine Learning Library (NPSML Library)

At present, a pre-release, C-based machine learning package. Do note the file format requirements.
