Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 11, 2012

AlchemyAPI

Filed under: AlchemyAPI,Data Mining,Machine Learning — Patrick Durusau @ 8:10 pm

AlchemyAPI

From the documentation:

AlchemyAPI utilizes natural language processing technology and machine learning algorithms to analyze content, extracting semantic meta-data: information about people, places, companies, topics, facts & relationships, authors, languages, and more.

API endpoints are provided for performing content analysis on Internet-accessible web pages, posted HTML or text content.

To use AlchemyAPI, you need an access key. If you do not have an API key, you must first obtain one.

I haven’t used it but it looks like a useful service for information products meant for an end user.
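From the documentation, a call looks roughly like the sketch below. Since I have not tried the service, treat the endpoint path and parameter names as assumptions to verify against the API reference:

    # A sketch of calling an AlchemyAPI entity-extraction endpoint.
    # The endpoint path and parameter names are assumptions from the
    # docs as I read them -- verify against the current API reference.
    import json
    import urllib.parse
    import urllib.request

    API_KEY = "YOUR_ACCESS_KEY"  # obtained from the AlchemyAPI site
    ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetRankedNamedEntities"

    params = urllib.parse.urlencode({
        "apikey": API_KEY,
        "url": "http://www.example.com/some-article.html",
        "outputMode": "json",
    })

    with urllib.request.urlopen(ENDPOINT + "?" + params) as response:
        result = json.loads(response.read().decode("utf-8"))

    # Each entity carries a type (Person, Company, ...), text, and relevance.
    for entity in result.get("entities", []):
        print(entity["type"], entity["text"], entity.get("relevance"))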

Do you use such services? Any others you would suggest?

March 6, 2012

mlpy: Machine Learning Python

Filed under: Machine Learning,Python — Patrick Durusau @ 8:09 pm

mlpy: Machine Learning Python by Davide Albanese, Roberto Visintainer, Stefano Merler, Samantha Riccadonna, Giuseppe Jurman, and Cesare Furlanello.

Abstract:

mlpy is a Python Open Source Machine Learning library built on top of NumPy/SciPy and the GNU Scientific Libraries. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, it works with Python 2 and 3 and it is distributed under GPL3 at the project website.

There must have been a publication requirement because the paper doesn’t really add anything to the already excellent documentation at the project site. More of a short summary/overview sort of document.

The software, on the other hand, deserves your close attention.
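To give you a taste of the interface, here is a minimal classification sketch. The LDAC class and its learn()/pred() methods follow the project documentation as I read it, so verify the names against the current docs:

    # A minimal mlpy classification sketch. The LDAC class and its
    # learn()/pred() methods reflect the project docs as I recall them;
    # treat the exact names as assumptions and check the documentation.
    import numpy as np
    import mlpy

    # Two toy 2-D classes.
    x = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
    y = np.array([1, 1, 2, 2])

    ldac = mlpy.LDAC()        # linear discriminant analysis classifier
    ldac.learn(x, y)          # fit on the training data
    print(ldac.pred(np.array([[1.2, 1.9], [5.5, 8.5]])))  # -> [1 2]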

I guess the authors got the memo on GPL licensing? 😉

March 3, 2012

Skytree: Big Data Analytics

Skytree: Big Data Analytics

Released just this past week, Skytree offers both local and cloud-based data analytics.

From the website:

Skytree Server can accurately perform machine learning on massive datasets at high speed.

In the same way a relational database system (or database accelerator) is designed to perform SQL queries efficiently, Skytree Server is designed to efficiently perform machine learning on massive datasets.

Skytree Server’s scalable architecture performs state-of-the-art machine learning methods on data sets that were previously too big for machine learning algorithms to process. Leveraging advanced algorithms implemented on specialized systems and dedicated data representations tuned to machine learning, Skytree Server delivers up to 10,000 times performance improvement over existing approaches.

Currently supported machine learning methods:

  • Neighbors (Nearest, Farthest, Range, k, Classification)
  • Kernel Density Estimation and Non-parametric Bayes Classifier
  • K-Means
  • Linear Regression
  • Support Vector Machines (SVM)
  • Fast Singular Value Decomposition (SVD)
  • The Two-point Correlation

There is a “free” local version with a data limit (100,000 records) and of course the commercial local and cloud versions.

Comments?

March 1, 2012

Target, Pregnancy and Predictive Analytics (parts 1 and 2)

Filed under: Data Analysis,Machine Learning,Predictive Analytics — Patrick Durusau @ 9:02 pm

Dean Abbott wrote a pair of posts on a New York Times article about Target predicting if customers are pregnant.

Target, Pregnancy and Predictive Analytics (part 1)

Target, Pregnancy and Predictive Analytics (part 2)

Read both. I truly liked his conclusion that models give us the patterns in data, but it is up to us to “recognize” the patterns as significant.

BTW, I do wonder what the difference is between the New York Times snooping for secrets to sell newspapers and Target doing the same to sell products? If you know, please give a shout!

February 29, 2012

Announcing Google-hosted workshop videos from NIPS 2011

Filed under: Machine Learning,Music,Neuroinformatics,Semantics — Patrick Durusau @ 7:21 pm

Announcing Google-hosted workshop videos from NIPS 2011 by John Blitzer and Douglas Eck.

From the post:

At the 25th Neural Information Processing Systems (NIPS) conference in Granada, Spain last December, we engaged in dialogue with a diverse population of neuroscientists, cognitive scientists, statistical learning theorists, and machine learning researchers. More than twenty Googlers participated in an intensive single-track program of talks, nightly poster sessions and a workshop weekend in the Spanish Sierra Nevada mountains. Check out the NIPS 2011 blog post for full information on Google at NIPS.

In conjunction with our technical involvement and gold sponsorship of NIPS, we recorded the five workshops that Googlers helped to organize on various topics from big learning to music. We’re now pleased to provide access to these rich workshop experiences to the wider technical community.

Watch videos of Googler-led workshops on the YouTube Tech Talks Channel.

You will find links to these, and several other videos, at the original post.

Suspect everyone will find something they will enjoy!

Comments on any of these that you find particularly useful?

Will the Circle Be Unbroken? Interactive Annotation!

I have to agree with Bob Carpenter that the title is a bit much:

Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

From the post:

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build a reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

Both the paper and Bob’s review merit close reading.

February 19, 2012

SML: Scalable Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 8:35 pm

SML: Scalable Machine Learning

Alex Smola’s lectures on Scalable Machine Learning at Berkeley with a wealth of supplemental materials.

Overview:

Scalable Machine Learning occurs when Statistics, Systems, Machine Learning and Data Mining are combined into flexible, often nonparametric, and scalable techniques for analyzing large amounts of data at internet scale. This class aims to teach methods which are going to power the next generation of internet applications. The class will cover systems and processing paradigms, an introduction to statistical analysis, algorithms for data streams, generalized linear methods (logistic models, support vector machines, etc.), large scale convex optimization, kernels, graphical models and inference algorithms such as sampling and variational approximations, and explore/exploit mechanisms. Applications include social recommender systems, real time analytics, spam filtering, topic models, and document analysis.

Just to give you a taste for the content, the first set of lectures is on Hardware and covers:

  • Hardware
    • Processor, RAM, buses, GPU, disk, SSD, network, switches, racks, server centers
    • Bandwidth, latency and faults
  • Basic parallelization paradigms
    • Trees, stars, rings, queues
    • Hashing (consistent, proportional), sketched in code after this list
    • Distributed hash tables and P2P
  • Storage
    • RAID
    • Google File System / HadoopFS
    • Distributed (key, value) storage
  • Processing
    • MapReduce
    • Dryad
    • S4 / stream processing
  • Structured access beyond SQL
    • BigTable
    • Cassandra
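
As a taste of the parallelization material, here is a toy consistent-hashing ring in Python, my own illustration rather than code from the course:

    # A toy consistent-hashing ring, one of the parallelization
    # primitives the lectures cover. Keys map to the first virtual node
    # clockwise on the ring, so adding or removing a server moves only
    # a small fraction of the keys.
    import bisect
    import hashlib

    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, nodes, replicas=100):
            # Each node appears many times ("virtual nodes") for balance.
            self._ring = sorted((_hash("%s:%d" % (n, i)), n)
                                for n in nodes for i in range(replicas))
            self._keys = [h for h, _ in self._ring]

        def node_for(self, key):
            # Walk clockwise to the first virtual node at or after the key.
            idx = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
            return self._ring[idx][1]

    ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
    print(ring.node_for("user:42"))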

Each set of lectures was back to back (to reduce travel time for Smola).

Hardware influences our thinking and design choices so it was good to see the lectures starting with coverage of hardware.

There is an interesting point near the end of the first lecture about never using editors to create editorial data. Alex explains that query results were at one point validated by women in their twenties, so other perspectives on query results were not reflected in the results. He suggested getting users to provide data for search validation rather than using experts to label the data.

I would split his comments on editorial content into:

  1. Editorial content from experts
  2. Editorial content from users

I would put #1 in the same category as getting ontologists or linked data types to markup data. It works for them and from their point of view, but that doesn’t mean it works for the users of the data.

On the other hand, #2, content from users about how they think about their data and what constitutes a good result, seems a lot more appealing to me.

I would say that Alex’s point isn’t to avoid editors altogether but to choose one’s editors carefully, favoring the users who will be using the results of the searches. (And to avoid the activity of labeling; there are better ways to get the needed data from users.)

That doesn’t work for a generalized search interface like Google but then a public ….., err, water trough is a public water trough.

February 18, 2012

Hadoop and Machine Learning: Present and Future

Filed under: Hadoop,Machine Learning — Patrick Durusau @ 5:26 pm

Hadoop and Machine Learning: Present and Future by Josh Wills.

Presentation at LA Machine Learning.

Josh Wills is Cloudera’s Director of Data Science, working with customers and engineers to develop Hadoop-based solutions across a wide range of industries. Prior to joining Cloudera, Josh worked at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+. Prior to Google, Josh worked at a variety of startups, sometimes as a Software Engineer and sometimes as a Statistician. He earned his Bachelor’s degree in Mathematics from Duke University and his Master’s in Operations Research from The University of Texas at Austin.

A very practice-oriented view of Hadoop and machine learning. If you aren’t excited about Hadoop and machine learning already, you will be after this presentation!

February 16, 2012

Code for Machine Learning for Hackers

Filed under: Machine Learning,R — Patrick Durusau @ 6:52 pm

Code for Machine Learning for Hackers by Drew Conway.

Drew writes:

For those interested, my co-author John Myles White is hosting the code at his Github, which can be accessed at:

https://github.com/johnmyleswhite/ML_for_Hackers

Drew and John wrote Machine Learning for Hackers, which has just been released, but their publisher hadn’t updated its website to point to the code repository (as of the time of Drew’s post).

February 12, 2012

RTextTools Short Course

Filed under: Machine Learning,R — Patrick Durusau @ 5:14 pm

RTextTools Short Course

The post:

Attached are some of the materials from the recent short course at UNC. For confidentiality reasons, we are unable to present all of the materials, but this is enough to get someone started. 1. Lecture; 2. Intro to R; 3. NY Times; 4. Congressional Bills. Hope this proves helpful.

Brief lecture notes and three (3) examples of R code to get you started.

February 7, 2012

Machine Learning for Hackers

Filed under: Machine Learning,R — Patrick Durusau @ 4:35 pm

Machine Learning for Hackers: Case Studies and Algorithms to Get You Started by Drew Conway, John Myles White.

Publisher’s Description:

Now that storage and collection technologies are cheaper and more precise, methods for extracting relevant information from large datasets are within the reach of any experienced programmer willing to crunch data. With this book, you’ll learn machine learning and statistics tools in a practical fashion, using black-box solutions and case studies instead of a traditional math-heavy presentation.

By exploring each problem in this book in depth—including both viable and hopeless approaches—you’ll learn to recognize when your situation closely matches traditional problems. Then you’ll discover how to apply classical statistics tools to your problem. Machine Learning for Hackers is ideal for programmers from private, public, and academic sectors.

From Twitter traffic it appears that the print version has gone to the printers.

Interested in your comments when either the eBook or print versions become available.

Drew’s blog, Zero Intelligence Agents, makes me confident that what appears will be high quality.

Curious that O’Reilly doesn’t mention that it is entirely in R. That to me would be a selling point.

February 5, 2012

Machine Learning (BETA)

Filed under: HPCC,Machine Learning — Patrick Durusau @ 8:08 pm

Machine Learning (BETA)

From HPCC Systems:

An extensible set of Machine Learning (ML) and Matrix processing algorithms to assist with business intelligence; covering supervised and unsupervised learning, document and text analysis, statistics and probabilities, and general inductive inference related problems.

The ML project is designed to create an extensible library of fully parallel machine learning routines; the early stages of a bottom up implementation of a set of algorithms which are easy to use and efficient to execute. This library leverages the distributed nature of the HPCC Systems architecture, providing extreme scalability to both the high-level implementation of the machine learning algorithms and the underlying matrix algebra library, extensible to tens of thousands of features on billions of training examples.

Some of the most representative algorithms in the different areas of machine learning have been implemented, including k-means for clustering, naive Bayes classifiers, ordinary linear regression, logistic regression, correlations (including Pearson and Kendall’s tau), and association routines to perform association analysis and pattern prediction. The document tokenization and text classifiers included, with n-gram extraction and analysis, provide the basis to perform statistical grammar inference based natural language processing. Univariate statistics such as mean, median, mode, variance and percentile ranking are supported along with standard statistical measures such as Student-t, Normal, Poisson, Binomial, Negative Binomial and Exponential.

In case you need reminding, this is the open-sourced LexisNexis engine.

Unlike algorithms that run on top of summarized big data, these algorithms run on big data.

See if that makes a difference for your use cases.

January 26, 2012

Sixth Annual Machine Learning Symposium

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 6:55 pm

Sixth Annual Machine Learning Symposium sponsored by the New York Academy of Sciences.

There were eighteen (18) presentations and any attempt to summarize on my part would do injustice to one or more of them.

Post your comments and suggestions for which ones I should watch first. Thanks!

January 16, 2012

Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)

Filed under: Database,Knowledge Discovery,Machine Learning — Patrick Durusau @ 2:43 pm

The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) will take place in Bristol, UK from September 24th to 28th, 2012.

Dates:

Abstract submission deadline: Thu 19 April 2012
Paper submission deadline: Mon 23 April 2012
Early author notification: Mon 28 May 2012
Author notification: Fri 15 June 2012
Camera ready submission: Fri 29 June 2012
Conference: Mon – Fri, 24-28 September, 2012.

From the call for papers:

The European Conference on “Machine Learning” and “Principles and Practice of Knowledge Discovery in Databases” (ECML-PKDD) provides an international forum for the discussion of the latest high-quality research results in all areas related to machine learning and knowledge discovery in databases and other innovative application domains.

Submissions are invited on all aspects of machine learning, knowledge discovery and data mining, including real-world applications.

The overriding criteria for acceptance will be a paper’s:

  • potential to inspire the research community by introducing new and relevant problems, concepts, solution strategies, and ideas;
  • contribution to solving a problem widely recognized as both challenging and important;
  • capability to address a novel area of impact of machine learning and data mining.

Other criteria are scientific rigour and correctness, challenges overcome, quality and reproducibility of the experiments, and presentation.

I rather like that: quality and reproducibility of the experiments.

As opposed to the “just believe in the power of ….” and you will get all manner of benefits. But no one can produce data to prove those claims.

Reminds me of the astronomer in Samuel Johnson’s Rasselas who claimed:

I have possessed for five years the regulation of the weather and the distribution of the seasons. The sun has listened to my dictates, and passed from tropic to tropic by my direction; the clouds at my call have poured their waters, and the Nile has overflowed at my command. I have restrained the rage of the dog-star, and mitigated the fervours of the crab. The winds alone, of all the elemental powers, have hitherto refused my authority, and multitudes have perished by equinoctial tempests which I found myself unable to prohibit or restrain. I have administered this great office with exact justice, and made to the different nations of the earth an impartial dividend of rain and sunshine. What must have been the misery of half the globe if I had limited the clouds to particular regions, or confined the sun to either side of the equator?’”

And when asked how he knew this to be true, replied:

“‘Because,’ said he, ‘I cannot prove it by any external evidence; and I know too well the laws of demonstration to think that my conviction ought to influence another, who cannot, like me, be conscious of its force. I therefore shall not attempt to gain credit by disputation. It is sufficient that I feel this power that I have long possessed, and every day exerted it. But the life of man is short; the infirmities of age increase upon me, and the time will soon come when the regulator of the year must mingle with the dust. The care of appointing a successor has long disturbed me; the night and the day have been spent in comparisons of all the characters which have come to my knowledge, and I have yet found none so worthy as thyself.’” (emphasis added)

Project Gutenberg has a copy online: Rasselas, Prince of Abyssinia, by Samuel Johnson.

For my part, I think semantic integration has been, is and will be hard, not to mention expensive.

Determining your ROI is just as necessary for a semantic integration project, whatever technology you choose, as for any other project.

January 15, 2012

Machine Learning: Ensemble Methods

Filed under: Ensemble Methods,Machine Learning — Patrick Durusau @ 9:16 pm

Machine Learning: Ensemble Methods by Ricky Ho.

Ricky gives a brief overview of ensemble methods in machine learning.

Not enough for practical application but enough to orient yourself to learn more.

From the post:

Ensemble Method is a popular approach in Machine Learning based on the idea of combining multiple models. For example, by mixing different machine learning algorithms (e.g. SVM, logistic regression, Bayesian network), an ensemble method can automatically pick the algorithmic model that fits the data best. On the other hand, by mixing different parameter sets of the same algorithmic model (e.g. Random forest, Boosting tree), it can pick the best set of parameters for that model.
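
To make the first flavor concrete, here is a hand-rolled majority-vote ensemble over three model families, a minimal scikit-learn sketch of my own rather than code from Ricky’s post:

    # Majority-vote ensemble over different model families.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)
    X_train, y_train, X_test = X[:200], y[:200], X[200:]

    models = [LogisticRegression(), GaussianNB(), SVC()]
    for m in models:
        m.fit(X_train, y_train)

    # Each row of `votes` is one model's predictions; take the
    # per-column majority as the ensemble prediction.
    votes = np.array([m.predict(X_test) for m in models])
    ensemble_pred = np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), axis=0, arr=votes)
    print(ensemble_pred[:10])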

January 13, 2012

scikit-learn 0.10

Filed under: Machine Learning,Python — Patrick Durusau @ 8:16 pm

scikit-learn 0.10

With a list of 27 items that include words like “new,” “added,” “fixed,” “refactored,” etc., you know this is a change log you want to do more than skim.

In case you have been under a programming rock somewhere, scikit-learn is a Python machine learning library. See the scikit-learn homepage.
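
If you need a taste before diving into the change log, the whole library revolves around a fit/predict pattern; a minimal sketch, not tied to anything new in 0.10:

    # The fit/predict pattern that runs through all of scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    iris = load_iris()
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(iris.data, iris.target)           # train
    print(clf.predict(iris.data[:5]))         # predict (here, on training samples)
    print(clf.score(iris.data, iris.target))  # mean accuracy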

January 10, 2012

Stanford Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 7:59 pm

Stanford Machine Learning by Alex Holehouse.

From the webpage:

The following notes represent a complete, stand alone interpretation of Stanford’s machine learning course presented by Professor Andrew Ng and originally posted on the ml-class.org website during the fall 2011 semester. The topics covered are shown below, although for a more detailed summary see lecture 19. The only content not covered here is the Octave/MATLAB programming.

All diagrams are my own or are directly taken from the lectures, full credit to Professor Ng for a truly exceptional lecture course.

The (Real) Semantic Web Requires Machine Learning

Filed under: Machine Learning,Semantic Web — Patrick Durusau @ 7:57 pm

The (Real) Semantic Web Requires Machine Learning by John O’Neil.

From the longer quote below:

…different people will almost inevitably create knowledge encodings that can’t easily be compared, because they use different — sometimes subtly, maddeningly different — basic definitions and concepts. Another difficult problem is to decide when entity names refer to the “same” real-world thing. Even worse, if the entity names are defined in two separate places, when and how should they be merged?

And the same is true for relationships between entities. (Full stop.)

The author thinks statistical analysis will be able to distinguish both entities and relationships between them, which I am sure will be true to some degree.

I would characterize that as a topic map authoring aid but it would also be possible to simply accept the statistical results.

It is refreshing to see someone recognize the “semantic web” is the one created by users and not as dictated by other authorities.

From the post:

We think about the semantic web in two complementary (and equivalent) ways. It can be viewed as:

  • A large set of subject-verb-object triples, where the verb is a relation and the subject and object are entities

OR

  • As a large graph or network, where the nodes of the graph are entities and the graph’s directed edges or arrows are the relations between nodes.

As a reminder, entities are proper names, like people, places, companies, and so on. Relations are meaningful events, outcomes or states, like BORN-IN, WORKS-FOR, MARRIED-TO, and so on. Each entity (like “John O’Neil”, “Attivio” or “Newton, MA”) has a type (like “PERSON”, “COMPANY” or “LOCATION”) and each relation is constrained to only accept certain types of entities. For example, WORKS-FOR may require a PERSON as the subject and a COMPANY as the object.
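
To make those type constraints concrete, here is a small sketch in Python, entirely my own illustration of the WORKS-FOR example:

    # Type-constrained triples along the lines of the post's examples.
    RELATION_SIGNATURES = {
        "WORKS-FOR":  ("PERSON", "COMPANY"),
        "MARRIED-TO": ("PERSON", "PERSON"),
        "BORN-IN":    ("PERSON", "LOCATION"),
    }

    entity_types = {
        "John O'Neil": "PERSON",
        "Attivio":     "COMPANY",
        "Newton, MA":  "LOCATION",
    }

    def valid_triple(subject, verb, obj):
        # A triple is well-typed when its subject and object match the
        # relation's declared signature.
        want = RELATION_SIGNATURES.get(verb)
        return (want is not None
                and entity_types.get(subject) == want[0]
                and entity_types.get(obj) == want[1])

    print(valid_triple("John O'Neil", "WORKS-FOR", "Attivio"))  # True
    print(valid_triple("Attivio", "WORKS-FOR", "John O'Neil"))  # False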

How semantic web information is organized and transmitted is described by a blizzard of technical standards and XML namespaces. Once you escape from that, the basic goals of the semantic web are (1) to allow a lot of useful information about the world to be simply expressed, in a way that (2) allows computers to do useful things with it.

Almost immediately, some problems crop up. As generations of artificial intelligence researchers have learned, it can be really difficult to encode real-world knowledge into predicate logic, which is more-or-less what the semantic web is. The same AI researchers also learned that different people will almost inevitably create knowledge encodings that can’t easily be compared, because they use different — sometimes subtly, maddeningly different — basic definitions and concepts. Another difficult problem is to decide when entity names refer to the “same” real-world thing. Even worse, if the entity names are defined in two separate places, when and how should they be merged? For example, do an Internet search for “John O’Neil”, and try to decide which of the results refer to how many different people. Believe me, all the results are not for the same person.

As for relations, it’s difficult to tell when they really mean the same thing across different knowledge encodings. No matter how careful you are, if you want to use relations to infer new facts, you have few resources to check to see if the combined information is valid.

So, when each web site can define its own entities and relations, independently of any other web site, how do you reconcile entities and relations defined by different people?
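
As a first, admittedly naive pass at the entity half of that question, you can cluster entity names by string similarity. My own sketch; the threshold is arbitrary, and real systems need far more evidence than spelling:

    # Cluster entity names by string similarity.
    from difflib import SequenceMatcher

    names = ["John O'Neil", "John O'Neill", "J. O'Neil", "Jane O'Neil"]

    def similar(a, b, threshold=0.85):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    clusters = []
    for name in names:
        for cluster in clusters:
            if any(similar(name, member) for member in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])

    # Near-duplicate spellings cluster together; abbreviated forms
    # like "J. O'Neil" may be missed, which is exactly the problem.
    print(clusters)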

December 27, 2011

scikits-image – Name Change

Filed under: Image Processing,Machine Learning,Names,Python — Patrick Durusau @ 7:13 pm

scikits-image – Name Change.

Speaking of naming issues, do note that scikits-image has become skimage, although as of 27 December 2011, PyPI (the Python Package Index) isn’t aware of the change.

On the other hand, a search for sklearn (the new name for scikit-learn) resolves to the current package name scikit-learn-0.9.tar.gz.

I will drop the administrators a note because the text shifts between the two names without explanation on the sklearn page.

I got clued in about the change at: http://pythonvision.org/blog/2011/December/skimage04.

So, how do we deal with all the prior uses of the “scikits-image” and “scikit-learn” identifiers that are about to be disconnected from the software they once named?

Eventually the package pages will be innocent of either one, save perhaps in increasingly old change logs.

Assume I run across a blog post or article that is two or three years old with an interesting technique that uses the old names. Other than by chance, how do I find the package under its new name? And if I do find it, how can I save other people from the same time investment and dependence on luck for the result?

To be sure, the package search mechanism drops me at the right place, but what if I am not expecting the resolution to another name? Will I think this is a different package?

December 25, 2011

Learning Machine Learning with Apache Mahout

Filed under: Machine Learning,Mahout — Patrick Durusau @ 6:06 pm

Learning Machine Learning with Apache Mahout

From the post:

Once in a while I get questions like Where to start learning more on machine learning. Other than the official sources I think there is quite good coverage also in the Mahout community: Since it was founded several presentations have been given that give an overview of Apache Mahout, introduce special features or even go into more details on particular implementations. Below is an attempt to create a collection of talks given so far without any claim to contain links to all videos or lectures. Feel free to add your favourite in the comments section. In addition I linked to some online courses with further material to get you started.

When looking for books, of course check out Mahout in Action. Also, Taming Text and the data mining book that comes with Weka are good starting points for practitioners.

Nice collection of resources on getting started with Apache Mahout.

December 23, 2011

Machine Learning and Hadoop

Filed under: Hadoop,Machine Learning — Patrick Durusau @ 4:28 pm

Machine Learning and Hadoop

Interesting slide deck from Josh Wills, Tom Pierce, and Jeff Hammerbacher of Cloudera.

The mention of “Pareto optimization” reminds me of a debate tournament judge who had written his dissertation on that topic, and who carefully pointed out that it wasn’t possible to “know” how close (or far away) a society was from any optimal point. 😉 Oh well, it was a “case” that sounded good to opponents unfamiliar with economic theory at any rate. An example of those “critical evaluation” skills I was talking about a day or so ago.

Not that you can’t benefit from machine learning and Hadoop. You can, but ask careful questions and persist until you are given answers that make sense. To you. With demonstrable results.

In other words, don’t be afraid to ask “stupid” questions and keep on asking them until you are satisfied with the answers. Or hire someone who is willing to play that role.

December 20, 2011

PURDUE Machine Learning Summer School 2011

Filed under: Machine Learning — Patrick Durusau @ 8:25 pm

PURDUE Machine Learning Summer School 2011

The coverage of the summer school is very impressive. The lecture titles and presenters were:

  • Machine Learning for Statistical Genetics by Karsten Borgwardt
  • Large-scale Machine Learning and Stochastic Algorithms by Leon Bottou
  • Divide and Recombine (D&R) for the Analysis of Big Data by William S. Cleveland
  • Privacy Issues with Machine Learning: Fears, Facts, and Opportunities by Chris Clifton
  • The MASH project. An open platform for the collaborative development of feature extractors by Francois Fleuret
  • Techniques for Massive-Data Machine Learning, with Application to Astronomy by Alex Gray
  • Mining Heterogeneous Information Networks by Jiawei Han
  • Machine Learning for a Rainy Day by Sergey Kirshner
  • Machine Learning for Discovery in Legal Cases by David D. Lewis
  • Classic and Modern Data Clustering by Marina Meilă
  • Modeling Complex Social Networks: Challenges and Opportunities for Statistical Learning and Inference by Jennifer Neville
  • Using Heat for Shape Understanding and Retrieval by Karthik Ramani
  • Learning Rhythm from Live Music by Christopher Raphael
  • Introduction to supervised, unsupervised and partially-supervised training algorithms by Dale Schuurmans
  • A Machine Learning Approach for Complex Information Retrieval Applications by Luo Si
  • A Short Course on Reinforcement Learning by Satinder Singh Baveja
  • Graphical Models for the Internet by Alexander Smola
  • Optimization for Machine Learning by S V N Vishwanathan
  • Survey of Boosting from an Optimization Perspective by Manfred K. Warmuth

Now that would be a summer school to remember!

December 19, 2011

Journal of Computing Science and Engineering

Filed under: Bioinformatics,Computer Science,Linguistics,Machine Learning,Record Linkage — Patrick Durusau @ 8:09 pm

Journal of Computing Science and Engineering

From the webpage:

Journal of Computing Science and Engineering (JCSE) is a peer-reviewed quarterly journal that publishes high-quality papers on all aspects of computing science and engineering. The primary objective of JCSE is to be an authoritative international forum for delivering both theoretical and innovative applied researches in the field. JCSE publishes original research contributions, surveys, and experimental studies with scientific advances.

The scope of JCSE covers all topics related to computing science and engineering, with a special emphasis on the following areas: embedded computing, ubiquitous computing, convergence computing, green computing, smart and intelligent computing, and human computing.

I got here from following a sponsor link at a bioinformatics conference.

Then just picking at random from the current issue I see:

A Fast Algorithm for Korean Text Extraction and Segmentation from Subway Signboard Images Utilizing Smartphone Sensors by Igor Milevskiy, Jin-Young Ha.

Abstract:

We present a fast algorithm for Korean text extraction and segmentation from subway signboards using smart phone sensors in order to minimize computational time and memory usage. The algorithm can be used as preprocessing steps for optical character recognition (OCR): binarization, text location, and segmentation. An image of a signboard captured by a smart phone camera while holding the phone at an arbitrary angle is rotated by the detected angle, as if the image had been taken with the phone held horizontally. Binarization is only performed once, on the subset of connected components instead of the whole image area, resulting in a large reduction in computational time. Text location is guided by the user’s marker-line placed over the region of interest in the binarized image via the smart phone touch screen. Text segmentation then utilizes the connected-component data received in the binarization step and cuts the string into individual images for the designated characters. The resulting data could be used as OCR input, hence solving the most difficult part of OCR on text areas included in natural scene images. The experimental results showed that the binarization algorithm of our method is 3.5 and 3.7 times faster than the Niblack and Sauvola adaptive-thresholding algorithms, respectively. In addition, our method achieved better quality than other methods.

Secure Blocking + Secure Matching = Secure Record Linkage by Alexandros Karakasidis, Vassilios S. Verykios.

Abstract:

Performing approximate data matching has always been an intriguing problem for both industry and academia. This task becomes even more challenging when the requirement of data privacy rises. In this paper, we propose a novel technique to address the problem of efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, we utilize a secure blocking component based on phonetic algorithms statistically enhanced to improve security. Second, we use a secure matching component where actual approximate matching is performed using a novel private approach of the Levenshtein Distance algorithm. Our goal is to combine the speed of private blocking with the increased accuracy of approximate secure matching.
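
The private Levenshtein variant is the paper’s contribution, but the underlying measure is the classic dynamic-programming edit distance, which is worth having in hand when reading the paper:

    # Classic dynamic-programming Levenshtein (edit) distance, the
    # measure the paper's secure matching component builds on; the
    # paper's private variant is more involved than this.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("Durusau", "Durusow"))  # 2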

A Survey of Transfer and Multitask Learning in Bioinformatics by Qian Xu, Qiang Yang.

Abstract:

Machine learning and data mining have found many applications in biological domains, where we look to build predictive models based on labeled training data. However, in practice, high quality labeled data is scarce, and labeling new data incurs high costs. Transfer and multitask learning offer an attractive alternative: by allowing useful knowledge to be extracted and transferred from data in auxiliary domains, they help counter the lack of data in the target domain. In this article, we survey recent advances in transfer and multitask learning for bioinformatics applications. In particular, we survey several key bioinformatics application areas, including sequence classification, gene expression data analysis, biological network reconstruction and biomedical applications.

And the ones I didn’t list from the current issue are just as interesting and relevant to identity/mapping issues.

This journal is a good example of people who have deliberately reached further across disciplinary boundaries than most.

About the only excuse left for not doing so is the discomfort of being the newbie in a field not your own.

Is that a good enough reason to miss possible opportunities to make critical advances in your home field? (Only you can answer that for yourself. No one can answer it for you.)

December 17, 2011

Vowpal Wabbit

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 7:50 pm

Vowpal Wabbit version 6.1

Refinements in 6.1:

  1. The cluster parallel learning code better supports multiple simultaneous runs, and other forms of parallelism have been mostly removed. This incidentally significantly simplifies the learning core.
  2. The online learning algorithms are more general, with support for l1 regularization (via a truncated gradient variant, sketched in code after this list) and l2 regularization, and a generalized form of variable metric learning.
  3. There is a solid persistent server mode which can train online, as well as serve answers to many simultaneous queries, either in text or binary.
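
For a feel of the truncated gradient idea behind the l1 support in item 2, here is a toy Python simplification, my own sketch rather than VW’s actual implementation:

    # Toy truncated-gradient L1 for online learning: take a gradient
    # step, then shrink each weight toward zero and clip at zero. This
    # is a one-step-per-update simplification of the technique.
    import numpy as np

    def sgd_truncated_l1(data, dim, eta=0.1, l1=0.01, epochs=5):
        w = np.zeros(dim)
        for _ in range(epochs):
            for x, y in data:                  # x: features, y: +1/-1
                if y * np.dot(w, x) < 1.0:     # hinge-loss subgradient step
                    w += eta * y * x
                # Truncation: pull each weight toward 0 by eta*l1.
                w = np.sign(w) * np.maximum(np.abs(w) - eta * l1, 0.0)
        return w

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)
    y = np.sign(X[:, 0] + 0.1 * rng.randn(100))
    print(sgd_truncated_l1(list(zip(X, y)), dim=10))  # sparse-ish weights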

December 5, 2011

What is the “hashing trick”?

Filed under: Hashing,Machine Learning — Patrick Durusau @ 7:53 pm

What is the “hashing trick”?

I suspect this question:

I’ve heard people mention the “hashing trick” in machine learning, particularly with regards to machine learning on large data.

What is this trick, and what is it used for? Is it similar to the use of random projections?

(Yes, I know that there’s a brief page about it here. I guess I’m looking for an overview that might be more helpful than reading a bunch of papers.)

comes up fairly often. The answer given is unusually helpful so I wanted to point it out here.
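
For the impatient, the core of the trick fits in a few lines of Python: hash feature names straight into a fixed number of buckets, so no dictionary of features is ever built. A minimal sketch of the idea, not anyone’s production code:

    # Feature hashing: map tokens to a fixed-width sparse vector.
    import hashlib

    def hashed_features(tokens, n_buckets=2**20):
        vec = {}
        for tok in tokens:
            h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
            idx = h % n_buckets
            # Signed variant: a second hash bit reduces collision bias.
            sign = 1 if (h >> 64) % 2 == 0 else -1
            vec[idx] = vec.get(idx, 0) + sign
        return vec  # sparse {bucket: count} representation

    print(hashed_features("the hashing trick maps tokens to buckets".split()))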

MTI ML

Filed under: Machine Learning — Patrick Durusau @ 4:31 pm

MTI ML

From the webpage:

This package provides machine learning algorithms optimized for large text categorization tasks and is able to combine several text categorization solutions. The advantages of this package compared to existing approaches are: 1) its speed, 2) it is able to work with a large number of categorization problems and, 3) it provides the ability to compare several text categorization tools based on meta-learning. This website describes how to download, install and run MTI ML. An example data set is provided to verify the installation of the tool. More detailed instructions on using the tool are available here.

As usual with NIH projects: high-quality work and lots of data.

December 4, 2011

CS545: Machine Learning (Fall 2011)

Filed under: Machine Learning,Python — Patrick Durusau @ 8:17 pm

CS545: Machine Learning (Fall 2011)

From the Overview page:

In this class you will learn about a variety of approaches to using a computer to discover patterns in data. The approaches include techniques from statistics, linear algebra, and artificial intelligence. Students will be required to solve written exercises, implement a number of machine learning algorithms and apply them to sets of data, and hand in written reports describing the results.

For implementations, we will be using Python. You may download and install Python on your computer, and work through the on-line tutorials to help prepare for this course. For the written reports, we will be using LaTeX, a document preparation system freely available on all platforms.

There has always been a lot of CS stuff online but the last couple of years it seems to have exploded. Python jockeys will like this one.

November 29, 2011

Deep Learning

Filed under: Artificial Intelligence,Deep Learning,Machine Learning — Patrick Durusau @ 8:42 pm

Deep Learning… moving beyond shallow machine learning since 2006!

From the webpage:

Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence.

This website is intended to host a variety of resources and pointers to information about Deep Learning. In these pages you will find

  • a reading list
  • links to software
  • datasets
  • a discussion forum
  • as well as tutorials and cool demos

I encountered this site via its Deep Learning Tutorial, which is only one of the tutorial-type resources available on its Tutorials page.

I mention that because the Deep Learning Tutorial looks like it would be of interest to anyone doing data or entity mining.

November 17, 2011

Machine Learning with Python – Logistic Regression

Filed under: Machine Learning,Python — Patrick Durusau @ 8:39 pm

Machine Learning with Python – Logistic Regression

From the post:

I decided to start a new series of posts now focusing on general machine learning with several snippets for anyone to use with real problems or real datasets. Since I am studying machine learning again with a great course online offered this semester by Stanford University, one of the best ways to review the content learned is to write some notes about what I learned. The best part is that it will include examples with Python, Numpy and Scipy. I expect you will enjoy all those posts!

This could be really nice; I will post updates as new posts arrive.
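
While waiting for the next posts, here is a bare-bones logistic regression fit by batch gradient descent with NumPy, my own sketch in the spirit of the series:

    # Logistic regression via batch gradient descent, NumPy only.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(X, y, lr=0.1, epochs=1000):
        X = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            grad = X.T.dot(sigmoid(X.dot(w)) - y) / len(y)
            w -= lr * grad                          # batch gradient step
        return w

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(float)       # linearly separable toy data
    w = fit_logistic(X, y)
    preds = sigmoid(np.column_stack([np.ones(len(X)), X]).dot(w)) > 0.5
    print("training accuracy:", (preds == y.astype(bool)).mean())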

November 11, 2011

29th International Conference on Machine Learning (ICML-2012)

Filed under: Conferences,Machine Learning — Patrick Durusau @ 7:38 pm

29th International Conference on Machine Learning (ICML-2012), June 26 to July 1, 2012

Dates:

  • Workshop and tutorial proposals due February 10, 2012
  • Paper submissions due February 24, 2012
  • Author response period April 9–12, 2012
  • Author notification April 30, 2012
  • Workshop submissions due May 7, 2012
  • Workshop author notification May 21, 2012
  • Tutorials June 26, 2012
  • Main conference June 27–29, 2012
  • Workshops June 30–July 1, 2012

From the call for papers:

The 29th International Conference on Machine Learning (ICML 2012) will be held at the University of Edinburgh, Scotland, from June 26 to July 1 2012.

ICML 2012 invites the submission of engaging papers on substantial, original, and previously unpublished research in all aspects of machine learning. We welcome submissions of innovative work on systems that are self adaptive, systems that improve their own performance, or systems that apply logical, statistical, probabilistic or other formalisms to the analysis of data, to the learning of predictive models, to cognition, or to interaction with the environment. We welcome innovative applications, theoretical contributions, carefully evaluated empirical studies, and we particularly welcome work that combines all of these elements. We also encourage submissions that bridge the gap between machine learning and other fields of research.

