Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 11, 2011

The joy of algorithms and NoSQL: a MongoDB example (part 2)

Filed under: Algorithms,Cheminformatics,MapReduce,MongoDB — Patrick Durusau @ 6:35 pm

The joy of algorithms and NoSQL: a MongoDB example (part 2)

From the post:

In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities. Depending on the target Tanimoto coefficient, the MongoDB solution is able to screen a database of a million compounds in subsecond time. To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the particular target Tanimoto. Even though this optimization is in place, the number of compounds returned by this query increases significantly when the target Tanimoto is lowered. The example code on the GitHub repository, for instance, imports and indexes ~25000 chemical compounds. When a target Tanimoto of 0.8 is employed, the query returns ~700 compounds. When the target Tanimoto is lowered to 0.6, the number of returned compounds increases to ~7000. Using the MongoDB explain functionality, one is able to observe that the internal MongoDB query execution time increases only slightly, compared to the overhead of transferring the full list of 7000 compounds to the remote Java application. Hence, it would make more sense to perform the calculations local to where the data is stored. Welcome to MongoDB’s built-in map-reduce functionality!

Screening “…a database of a million compounds in subsecond time” sounds useful in a topic map context.
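Not the article’s code, but here is a rough sketch of what pushing the Tanimoto computation into MongoDB looks like from Python, assuming each compound document carries a fingerprint array fp and a precomputed fp_count (collection and field names are my own invention). The count bounds in the query are the screening trick mentioned above: a compound can only reach Tanimoto t against a query with qc features if t·qc <= fp_count <= qc/t.

```python
# A sketch only (hypothetical schema; pymongo's pre-4.0 map_reduce API).
from math import ceil, floor
from pymongo import MongoClient
from bson.code import Code

client = MongoClient()
compounds = client.chem.compounds

query_fp = [12, 47, 301, 589]     # feature hashes of the query molecule
qc, target = len(query_fp), 0.7

mapper = Code("""
function () {
    var common = 0;
    for (var i = 0; i < this.fp.length; i++)
        if (query_fp.indexOf(this.fp[i]) >= 0) common++;
    var tanimoto = common / (this.fp_count + qc - common);
    if (tanimoto >= target) emit(this._id, tanimoto);
}""")
reducer = Code("function (key, values) { return values[0]; }")

# The screen: a compound can only reach Tanimoto t when
# t * qc <= fp_count <= qc / t, so everything else is pruned up front.
hits = compounds.map_reduce(
    mapper, reducer, "tanimoto_hits",
    query={"fp_count": {"$gte": ceil(target * qc), "$lte": floor(qc / target)}},
    scope={"query_fp": query_fp, "qc": qc, "target": target})

for doc in hits.find():
    print(doc["_id"], doc["value"])
```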

August 10, 2011

Leaky Topic Maps?

Filed under: Algorithms,Encryption — Patrick Durusau @ 7:13 pm

A Cloud that Can’t Leak

From the post:

Imagine getting a friend’s advice on a personal problem and being safe in the knowledge that it would be impossible for your friend to divulge the question, or even his own reply.

Researchers at Microsoft have taken a step toward making something similar possible for cloud computing, so that data sent to an Internet server can be used without ever being revealed. Their prototype can perform statistical analyses on encrypted data despite never decrypting it. The results worked out by the software emerge fully encrypted, too, and can only be interpreted using the key in the possession of the data’s owner.

Uses a technique called homomorphic encryption.

The article says 5 to 10 years before practical application, but it was 30 years between the proposal of homomorphic encryption and a formal proof that it was even possible. In the 2 or 3 years since that proof, a number of almost-practical demonstrations have emerged. I would not bet on the 5 to 10 year time frame.
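Homomorphic encryption comes in flavors. The Microsoft prototype needs more than the classic additively homomorphic schemes, but a toy Paillier example (my sketch, with absurdly small parameters) shows the basic trick of computing on ciphertexts:

```python
# Toy Paillier: products of ciphertexts decrypt to sums of plaintexts.
# Parameters are far too small for real use; illustration only.
from math import gcd
import random

p, q = 61, 53
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
g = n + 1

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)            # Python 3.8+ modular inverse

def encrypt(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = encrypt(12), encrypt(30)
assert decrypt((c1 * c2) % n2) == 42           # the sum, never decrypted
```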

August 8, 2011

Introduction to Algorithms

Filed under: Algorithms — Patrick Durusau @ 6:23 pm

Introduction to Algorithms by Prof. Erik Demaine and Prof. Charles Leiserson

From the description:

This course teaches techniques for the design and analysis of efficient algorithms, emphasizing methods useful in practice. Topics covered include: sorting; search trees, heaps, and hashing; divide-and-conquer; dynamic programming; amortized analysis; graph algorithms; shortest paths; network flow; computational geometry; number-theoretic algorithms; polynomial and matrix calculations; caching; and parallel computing.

The iTunes free version.

If you go to MIT: 6.006 Introduction to Algorithms, you can pick up lecture notes, exams and solutions, and assignments (without solutions).

Must be the coming of Fall. A young person’s mind turns to CS lectures. 😉

August 7, 2011

Computational Fairy Tales

Filed under: Algorithms,CS Lectures — Patrick Durusau @ 7:07 pm

Computational Fairy Tales by Jeremy Kubica.

From the post:

Computer science concepts as told through fairy tales.

Quite an interesting collection of stories.

July 22, 2011

You Too Can Use Hadoop Inefficiently!!!

Filed under: Algorithms,Graphs,Hadoop,RDF,SPARQL — Patrick Durusau @ 6:15 pm

The headline Hadoop’s tremendous inefficiency on graph data management (and how to avoid it) certainly got my attention.

But when you read the paper, Scalable SPARQL Querying of Large RDF Graphs, it isn’t Hadoop’s “tremendous inefficiency,” but actually that of SHARD, an RDF triple store that uses flat text files for storage.

Or as the authors say in their paper (6.3 Performance Comparison):

Figure 6 shows the execution time for LUBM in the four benchmarked systems. Except for query 6, all queries take more time on SHARD than on the single-machine deployment of RDF-3X. This is because SHARD’s use of hash partitioning only allows it to optimize subject-subject joins. Every other type of join requires a complete redistribution of data over the network within a Hadoop job, which is extremely expensive. Furthermore, its storage layer is not at all optimized for RDF data (it stores data in flat files).
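A toy illustration of the point (mine, not the paper’s): hash partitioning by subject keeps subject-subject joins local to a partition, while any other join pattern may straddle partitions and force a shuffle.

```python
# A toy illustration: partition triples by hash of subject, then run a
# subject-subject join per partition with no data movement.
from zlib import crc32

triples = [
    ("alice", "knows",   "bob"),
    ("alice", "worksAt", "acme"),
    ("bob",   "knows",   "carol"),
    ("bob",   "worksAt", "acme"),
]

K = 2
parts = [[] for _ in range(K)]
for s, p, o in triples:
    parts[crc32(s.encode()) % K].append((s, p, o))

# ?x knows ?y . ?x worksAt ?z -- both triples of a match share a
# subject, hence a partition, so each partition joins independently.
for i, part in enumerate(parts):
    knows = [(s, o) for s, p, o in part if p == "knows"]
    works = [(s, o) for s, p, o in part if p == "worksAt"]
    print(i, [(s, y, z) for s, y in knows for t, z in works if s == t])

# A subject-object join (?x knows ?y . ?y worksAt ?z) has no such
# guarantee: the matching triples may sit in different partitions,
# which in SHARD means a full redistribution inside a Hadoop job.
```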

Saying that SHARD (not as well known as Hadoop) was using Hadoop inefficiently would not have the “draw” of allegations about Hadoop’s failure to process graph data efficiently.

Sure, I write blog lines for “draw” but let’s ‘fess up in the body of the blog article. Readers shouldn’t have to run down other sources to find the real facts.

July 20, 2011

K-sort: A new sorting algorithm that beats Heap sort for n <= 70 lakhs!

Filed under: Algorithms,Search Algorithms — Patrick Durusau @ 12:56 pm

K-sort: A new sorting algorithm that beats Heap sort for n <= 70 lakhs! by Kiran Kumar Sundararajan, Mita Pal, Soubhik Chakraborty, N.C. Mahanti.

From the description:

Sundararajan and Chakraborty (2007) introduced a new version of Quick sort removing the interchanges. Khreisat (2007) found this algorithm to be competing well with some other versions of Quick sort. However, it uses an auxiliary array thereby increasing the space complexity. Here, we provide a second version of our new sort where we have removed the auxiliary array. This second improved version of the algorithm, which we call K-sort, is found to sort elements faster than Heap sort for an appreciably large array size (n <= 70,00,000) for uniform U[0, 1] inputs.

OK, so some people have small data, <= 70 lakhs (7,000,000 elements), to sort at one time. 😉 It also shows that there are still interesting things to say about sorting, an operation of interest to topic mappers and others.
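If you want to poke at claims like this yourself, a crude timing harness is easy. A sketch (mine, not the paper’s code; Python’s built-in sort stands in for the challenger since K-sort isn’t reproduced here):

```python
# Compare heap sort against another sort on uniform U[0,1] inputs.
# The paper's range runs up to 70 lakhs = 7,000,000 elements.
import heapq, random, time

def heap_sort(xs):
    heapq.heapify(xs)
    return [heapq.heappop(xs) for _ in range(len(xs))]

for n in (10**5, 10**6):
    data = [random.random() for _ in range(n)]
    t0 = time.perf_counter()
    heap_sort(data[:])                 # copy: heap_sort mutates its input
    t1 = time.perf_counter()
    sorted(data)                       # substitute your K-sort here
    t2 = time.perf_counter()
    print(n, f"heap={t1 - t0:.2f}s", f"builtin={t2 - t1:.2f}s")
```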

July 19, 2011

Excel DataScope

Filed under: Algorithms,Cloud Computing,Excel Datascope,Hadoop — Patrick Durusau @ 7:51 pm

Excel DataScope

From the webpage:

From the familiar interface of Microsoft Excel, Excel DataScope enables researchers to accelerate data-driven decision making. It offers data analytics, machine learning, and information visualization by using Windows Azure for data and compute-intensive tasks. Its powerful analysis techniques are applicable to any type of data, ranging from web analytics to survey, environmental, or social data.

And:

Excel DataScope is a technology ramp between Excel on the user’s client machine, the resources that are available in the cloud, and a new class of analytics algorithms that are being implemented in the cloud. An Excel user can simply select an analytics algorithm from the Excel DataScope Research Ribbon without concern for how to move their data to the cloud, how to start up virtual machines in the cloud, or how to scale out the execution of their selected algorithm in the cloud. They simply focus on exploring their data by using a familiar client application.

Excel DataScope is an ongoing research and development project. We envision a future in which a model developer can publish their latest data analysis algorithm or machine learning model to the cloud and within minutes Excel users around the world can discover it within their Excel Research Ribbon and begin using it to explore their data collection. (emphasis added)

I added emphasis to the last sentence because that is the sort of convenience that will make cloud computing and collaboration meaningful.

Imagine that sort of sharing across MS and non-MS cloud resources. Well, you would have to have an Excel DataScope interface on non-MS cloud resources, but one hopes that will be a product offering in the near future.

July 1, 2011

…filling space — without cubes

Filed under: Algorithms,Data Structures,Mathematics — Patrick Durusau @ 2:56 pm

Princeton researchers solve problem filling space — without cubes

From the post:

Whether packing oranges into a crate, fitting molecules into a human cell or getting data onto a compact disc, wasted space is usually not a good thing.

Now, in findings published June 20 in the Proceedings of the National Academy of Sciences, Princeton University chemist Salvatore Torquato and colleagues have solved a conundrum that has baffled mathematical minds since ancient times — how to fill three-dimensional space with multi-sided objects other than cubes without having any gaps.

The discovery could lead to scientists finding new materials and could lead to advances in communications systems and computer security.

“You know you can fill space with cubes,” Torquato said, “We were looking for another way.” In the article “New Family of Tilings of Three-Dimensional Euclidean Space by Tetrahedra and Octahedra,” he and his team show they have solved the problem.

Not immediately useful for topic maps, but it will be interesting to see whether new data structures emerge from this work.

See the article: New Family of Tilings of Three-Dimensional Euclidean Space by Tetrahedra and Octahedra (pay-per-view site)

June 23, 2011

Advanced Topics in Machine Learning

Advanced Topics in Machine Learning

A course by Andreas Krause and Daniel Golovin at Caltech. Lecture notes and readings that will keep you entertained for some time.

Overview:

How can we gain insights from massive data sets?

Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In particular, in this course we will study:

  • Online learning: How can we learn when we cannot fit the training data into memory? We will cover no-regret online algorithms; bandit algorithms; sketching and dimension reduction.
  • Active learning: How should we choose few expensive labels to best utilize massive unlabeled data? We will cover active learning algorithms, learning theory and label complexity.
  • Nonparametric learning on large data: How can we let complexity of classifiers grow in a principled manner with data set size? We will cover large-scale kernel methods; Gaussian process regression, classification, optimization and active set methods.
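As a taste of the bandit material in the first bullet, a minimal epsilon-greedy sketch (my example, not course code):

```python
# Epsilon-greedy multi-armed bandit: mostly exploit the best arm seen
# so far, explore a random arm with probability eps.
import random

def epsilon_greedy(pull, n_arms, rounds, eps=0.1):
    counts = [0] * n_arms
    values = [0.0] * n_arms            # running mean reward per arm
    total = 0.0
    for _ in range(rounds):
        if random.random() < eps:
            arm = random.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: values[a])
        r = pull(arm)
        total += r
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
    return total

# Bernoulli arms with hidden payout rates; the best arm pays 0.7.
rates = [0.2, 0.5, 0.7]
print(epsilon_greedy(lambda a: float(random.random() < rates[a]),
                     n_arms=3, rounds=10_000))
```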

Why would a non-strong AI person list so much machine learning stuff?

Two reasons:

1) Machine learning techniques are incredibly useful in appropriate cases.

2) You have to understand machine learning to pick out the appropriate cases.

June 20, 2011

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Filed under: Algorithms,Entity Extraction,String Matching — Patrick Durusau @ 3:31 pm

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction by Guoliang Li, Dong Deng, and Jianhua Feng in SIGMOD ’11.

Abstract:

Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from a document. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings in the document that approximately match entities in a given dictionary. Existing methods to address this problem support either token-based similarity (e.g., Jaccard Similarity) or character-based dissimilarity (e.g., Edit Distance). It calls for a unified method to support various similarity/dissimilarity functions, since a unified method can reduce the programming efforts, hardware requirements, and the manpower. In addition, many substrings in the document have overlaps, and we have an opportunity to utilize the shared computation across the overlaps to avoid unnecessary redundant computation. In this paper, we propose a unified framework to support many similarity/dissimilarity functions, such as jaccard similarity, cosine similarity, dice similarity, edit similarity, and edit distance. We devise efficient filtering algorithms to utilize the shared computation and develop effective pruning techniques to improve the performance. The experimental results show that our method achieves high performance and outperforms state-of-the-art studies.

Entity extraction should be high on your list of topic map skills. A good set of user-defined dictionaries is a good start. Creating those dictionaries as topic maps is an even better start.

On string metrics, you might want to visit Sam’s String Metrics, which lists some thirty algorithms.
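As a baseline for what “approximate dictionary-based entity extraction” means, here is a naive sketch using edit distance, one of the dissimilarity functions the paper unifies. Faerie’s contribution is avoiding exactly the redundant work over overlapping substrings that this version does:

```python
# Naive baseline (not Faerie's algorithm): slide windows over the
# document and keep substrings within edit distance 1 of an entry.
def edit_distance(a, b):
    dp = list(range(len(b) + 1))       # Wagner-Fischer, one row at a time
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

dictionary = ["chicago", "new york"]
doc = "flights from chcago to new york"
tau = 1
hits = []
for entry in dictionary:
    w = len(entry)
    for k in range(max(1, w - tau), w + tau + 1):   # candidate lengths
        for i in range(len(doc) - k + 1):
            sub = doc[i:i + k]
            if edit_distance(sub, entry) <= tau:
                hits.append((sub, entry))
# Note the overlapping matches -- the shared computation Faerie exploits.
print(hits)
```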

June 15, 2011

Partitioning Graph Databases

Filed under: Algorithms,Graph Partitioning,Graphs — Patrick Durusau @ 3:11 pm

Partitioning Graph Databases by Alex Averbuch and Martin Neumann.

Focusing on Neo4j, the thesis reports that, compared to random partitioning, the algorithms herein reduce inter-partition traffic by 40 to 90%, depending on the dataset.

Abstract:

The amount of globally stored, electronic data is growing at an increasing rate. This growth is both in size and connectivity, where connectivity refers to the increasing presence of, and interest in, relationships between data [12]. An example of such data is the social network graph created and stored by Twitter [2].

Due to this growth, demand is increasing for technologies that can process such data. Currently relational databases are the predominant data storage technology, but they are poorly suited to processing connected data as they are optimized for index-intensive operations. Conversely, the storage engines of graph databases are optimized for graph computation as they store records adjacent to one another, linked by direct references. This enables retrieval of adjacent elements in constant time, regardless of graph size, and allows for relationships to be followed without performing costly index lookups. However, as data volume increases these databases outgrow the resources available on a single computer, and partitioning the data becomes necessary. At present, few graph databases are capable of doing this [6].

In this work we evaluate the viability of using graph partitioning algorithms as a means of partitioning graph databases, with focus on the Neo4j graph database [4]. For this purpose, a prototype partitioned database was developed. Then, three partitioning algorithms were explored and one implemented. During evaluation, three graph datasets were used: two from production applications, and one synthetically generated. These were partitioned in various ways and the impact on database performance was measured. To gauge this impact, we defined one synthetic access pattern per dataset and executed each one on the partitioned datasets. Evaluation took place in a simulation environment, which ensured repeatability and made it possible to measure certain metrics, such as network traffic and load balance.

Simulation results show that, compared to random partitioning, use of a graph partitioning algorithm reduced inter-partition traffic by 40-90%, depending on dataset. Executing the algorithm intermittently during database usage was shown to maintain partition quality, while requiring only 1% of the computation time of initially partitioning the datasets. Finally, a strong correlation was found between theoretic graph partitioning quality metrics and the generated inter-partition traffic under non-uniform access patterns. Our results suggest that use of such algorithms to partition graph databases can result in significant performance benefits, and warrants further investigation.
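To see the kind of effect the thesis measures, here is a small sketch of my own (assuming networkx is installed): compare the edge cut of a random bisection with a Kernighan-Lin bisection, with cut edges standing in for inter-partition traffic.

```python
# Random bisection vs. Kernighan-Lin bisection on a clustered graph;
# fewer cut edges means less inter-partition traffic.
import random
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

G = nx.powerlaw_cluster_graph(1000, 4, 0.3, seed=1)   # social-network-ish

nodes = list(G)
random.shuffle(nodes)
half = set(nodes[: len(nodes) // 2])
random_cut = sum(1 for u, v in G.edges if (u in half) != (v in half))

a, b = kernighan_lin_bisection(G, seed=1)
kl_cut = sum(1 for u, v in G.edges if (u in a) != (v in a))

print(f"random cut: {random_cut}, Kernighan-Lin cut: {kl_cut}")
```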

May 30, 2011

Databases For Machine Learning Experiments

Filed under: Algorithms,Machine Learning — Patrick Durusau @ 6:57 pm

Databases For Machine Learning Experiments

From the website:

An experiment database is a database designed to store learning experiments in full detail, aimed at providing a convenient platform for the study of learning algorithms.

By submitting all details about the learning algorithms, datasets, experimental setup and results, experiments can be easily reproduced and reused in further studies.

By querying and mining the database, it allows easy, thorough analysis of learning algorithms while providing all information to correctly interpret the results.

To get a first idea, watch the video tutorial (updated!) of our explorer tool. Or start querying online by looking at some examples!

Video tutorials:

Experiment Database for Machine Learning Tutorial – SQL Querying

Experiment Database for Machine Learning Tutorial – Video Querying

Very interesting site. Wondering how something similar could be done to illustrate the use of topic maps?

Has anyone used this in connection with a class on machine learning?

May 18, 2011

Practical Machine Learning

Filed under: Algorithms,Machine Learning,Statistical Learning,Statistics — Patrick Durusau @ 6:45 pm

Practical Machine Learning, by Michael Jordan (UC Berkeley).

From the course webpage:

This course introduces core statistical machine learning algorithms in a (relatively) non-mathematical way, emphasizing applied problem-solving. The prerequisites are light; some prior exposure to basic probability and to linear algebra will suffice.

This is the Michael Jordan who gave a Posner Lecture at the 24th Annual Conference on Neural Information Processing Systems 2010.

May 17, 2011

TunedIT

Filed under: Algorithms,Data Mining,Machine Learning — Patrick Durusau @ 2:52 pm

TunedIT: Machine Learning & Data Mining Algorithms. Automated Tests, Repeatable Experiments, Meaningful Results.

There are two parts to the TunedIT site:

TunedIT Research

TunedIT Research is an open platform for reproducible evaluation of machine learning and data mining algorithms. Everyone may use TunedIT tools to launch reproducible experiments and share results with others. Reproducibility is achieved through automation. Datasets and algorithms, as well as experimental results, are collected in central databases: Repository and Knowledge Base, to enable comparison of a wide range of algorithms, and to facilitate dissemination of research findings and cooperation between researchers. Everyone may access the contents of TunedIT and contribute new resources and results.

TunedIT Challenge

The TunedIT project was established in 2008 as a free and open experimentation platform for data mining scientists, specialists and programmers. It was extended in 2009 with a framework for online data mining competitions, used initially for laboratory classes at universities. Today, we provide a diverse range of competition types – for didactic, scientific and business purposes.

  • Student Challenge — For closed members groups. Perfectly suited to organize assignments for students attending laboratory classes. Restricted access and visibility, only for members of the group. FREE of charge
  • Scientific Challenge — Open contest for non-commercial purpose. Typically associated with a conference, journal or scientific organization. Concludes with public dissemination of results and winning algorithms. May feature prizes. Fee: FREE or 20%
  • Industrial Challenge — Open contest with commercial purpose. Intellectual Property can be transferred at the end. No requirement for dissemination of solutions. Fee: 30%

This looks like a possible way to generate some publicity about and interest in topic maps.

Suggestions of existing public data sets that would be of interest to a fairly broad audience?

Thinking we are likely to model some common things the same and other common things differently.

Would be interesting to compare results.

May 2, 2011

Algoviz.org

Filed under: Algorithms,Data Structures,Visualization — Patrick Durusau @ 10:34 am

Algoviz.org: The Algorithm Visualization Portal

From the website:

AlgoViz.org is a gathering place for users and developers of algorithm visualizations and animations (AVs). It is a gateway to AV-related services, collections, and resources.

An amazing resource for algorithm visualizations and animations. The “catalog” has over 500 entries, along with an annotated bibliography of papers on algorithm visualization, field reports on the use of visualizations in the classroom, forums, and other resources.

Visualization of merging algorithms is going to take on increasing importance as TMCL and TMQL increase the range of merging opportunities.

Building on prior techniques and experiences with visualization seems like a good idea.

April 22, 2011

Introduction to Algorithms and Computational Complexity

Filed under: Algorithms — Patrick Durusau @ 1:06 pm

Three lectures on algorithms and computational complexity from Yuri Gurevich, courtesy of Channel 9.

Yuri works at Microsoft Research and is the inventor of abstract state machines.

Introduction to Algorithms and Computational Complexity, 1 of n

Introduction to Algorithms and Computational Complexity, 2 of n

Introduction to Algorithms and Computational Complexity, 3 of 3

April 15, 2011

Sorting algorithms demonstrated with Hungarian folk dance

Filed under: Algorithms,Visualization — Patrick Durusau @ 6:28 am

Courtesy of FlowingData: Sorting algorithms demonstrated with Hungarian folk dance

OK, it’s not Dancing with the Stars but it is a way to visualize sorting algorithms.

And we don’t know which visualization will appeal to a given user and make things “click” for them.

Thoughts about how to illustrate topic map operations using dance?

April 11, 2011

University of Florida Sparse Matrix Collection

Filed under: Algorithms,Computational Geometry,Graphs,Matrix,Networks — Patrick Durusau @ 5:49 am

University of Florida Sparse Matrix Collection

It’s not what you think. Well, it is but it is so much more. You really have to see the images at this site.

Abstract (from the paper by the same title):

We describe the University of Florida Sparse Matrix Collection, a large and actively growing set of sparse matrices that arise in real applications. The Collection is widely used by the numerical linear algebra community for the development and performance evaluation of sparse matrix algorithms. It allows for robust and repeatable experiments: robust because performance results with artificially-generated matrices can be misleading, and repeatable because matrices are curated and made publicly available in many formats. Its matrices cover a wide spectrum of domains, including those arising from problems with underlying 2D or 3D geometry (such as structural engineering, computational fluid dynamics, model reduction, electromagnetics, semiconductor devices, thermodynamics, materials, acoustics, computer graphics/vision, robotics/kinematics, and other discretizations) and those that typically do not have such geometry (optimization, circuit simulation, economic and financial modeling, theoretical and quantum chemistry, chemical process simulation, mathematics and statistics, power networks, and other networks and graphs). We provide software for accessing and managing the Collection, from MATLAB, Mathematica, Fortran, and C, as well as an online search capability. Graph visualization of the matrices is provided, and a new multilevel coarsening scheme is proposed to facilitate this task.

A Java viewer for matrices is also found here.
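Getting a matrix from the Collection into a program is straightforward. A sketch, assuming you have downloaded a matrix in MatrixMarket (.mtx) format and have SciPy installed (the filename below is just an example from the collection):

```python
# Load a sparse matrix from the Collection and solve a system with it.
import numpy as np
from scipy.io import mmread
from scipy.sparse.linalg import spsolve

A = mmread("bcsstk01.mtx").tocsc()   # example: a small structural matrix
print(A.shape, A.nnz)                # dimensions and nonzero count

b = np.ones(A.shape[0])
x = spsolve(A, b)                    # solve Ax = b on the sparse matrix
print(x[:5])
```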

NSF Workshop on Algorithms In The Field: Request Ideas

Filed under: Algorithms — Patrick Durusau @ 5:44 am

NSF Workshop on Algorithms In The Field: Request Ideas

From the post:

NSF Workshop on Algorithms In The Field (W8F) will assemble CSists from networking, databases, social networks, data mining, machine learning, graphics/vision/geometry/robotics and those from algorithms and theoretical computer science. It is a wider collection than you’d find in most meetings yet small enough to discuss or even debate.

Many of you are native experts in 8F. Many of you have first hand experience of being algorithms researchers and working with others on a common challenge or vice versa. Also, many of you have thought about frustrations and triumphs, in collaborations that involve algorithms/theory researchers.

The post goes on to list some ideas and suggests that you can roll your own.

The workshop is in the DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ, May 16–18, 2011, by invitation only, so get your suggestions/applications in early.

This would be a perfect venue for suggestions/discussions of merging algorithms.

April 7, 2011

Third Workshop on Massive Data Algorithmics (MASSIVE 2011)

Filed under: Algorithms,BigData,Subject Identity — Patrick Durusau @ 7:26 pm

Third Workshop on Massive Data Algorithmics (MASSIVE 2011)

From the website:

Tremendous advances in our ability to acquire, store and process data, as well as the pervasive use of computers in general, have resulted in a spectacular increase in the amount of data being collected. This availability of high-quality data has led to major advances in both science and industry. In general, society is becoming increasingly data driven, and this trend is likely to continue in the coming years.

The increasing number of applications processing massive data means that in general focus on algorithm efficiency is increasing. However, the large size of the data, and/or the small size of many modern computing devices, also means that issues such as memory hierarchy architecture often play a crucial role in algorithm efficiency. Thus the availability of massive data also means many new challenges for algorithm designers.

Forgive me for mentioning it, but what is the one thing all algorithms have in common, whether for massive data or not?

Ah, yes, some presumption about the identity of the subjects to be processed.

It would be rather difficult to efficiently process anything unless you knew where you were starting and with what.

Making the subjects processed by algorithms efficiently interchangeable seems like a good thing to me.

March 10, 2011

Top Ten Algorithms in Data Mining

Filed under: Algorithms,Data Mining — Patrick Durusau @ 9:13 am

Top Ten Algorithms in Data Mining

A summary of the paper in which data mining algorithms were nominated and voted on by ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners to come up with a top-10 list.

I was curious about how the entries on the list from 2007 have fared.

I searched CiteseerX limiting the publication year to 2010.

The results, algorithm followed by citation count, were as follows:

  1. C4.5 – 41
  2. The k-Means algorithm – 86
  3. Support Vector Machines – 64
  4. The Apriori algorithm – 46
  5. Expectation-Maximization – 41
  6. PageRank – 19
  7. AdaBoost – 11
  8. k-Nearest Neighbor Classification – 36*
  9. Naive Bayes – 25
  10. CART (Classification and Regression Trees) – 11

*Searched as “k-Nearest Neighbor”.

Not a scientific study but enough variation to make me curious about:

  1. Broader survey of algorithm citation.
  2. What articles cite more than one algorithm?
  3. Are there any groupings by subject of study?

Not a high priority item but something I want to return to examine more closely.

March 6, 2011

Gaussian Processes for Machine Learning

Filed under: Algorithms,Guassian Processes,Machine Learning — Patrick Durusau @ 3:31 pm

Gaussian Processes for Machine Learning

Complete text of:

Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K. I. Williams, MIT Press, 2006. ISBN-10 0-262-18253-X, ISBN-13 978-0-262-18253-9.

I like the quote from James Clerk Maxwell that goes:

The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man’s mind.

Interesting. Is our identification of subjects probabilistic or is our identification of what we thought others meant probabilistic?

Or both? Neither?

From the preface:

Over the last decade there has been an explosion of work in the “kernel machines” area of machine learning. Probably the best known example of this is work on support vector machines, but during this period there has also been much activity concerning the application of Gaussian process models to machine learning tasks. The goal of this book is to provide a systematic and unified treatment of this area. Gaussian processes provide a principled, practical, probabilistic approach to learning in kernel machines. This gives advantages with respect to the interpretation of model predictions and provides a well founded framework for learning and model selection. Theoretical and practical developments of over the last decade have made Gaussian processes a serious competitor for real supervised learning applications.
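To make the preface concrete, here is a minimal GP regression sketch of my own (an RBF kernel, noisy targets, and the standard posterior mean and variance formulas; not the book’s code):

```python
# Gaussian process regression in a few lines: posterior mean and
# variance at test inputs, given noisy training observations.
import numpy as np

np.random.seed(0)

def rbf(a, b, ell=1.0, sf=1.0):
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

X = np.array([-2.0, -1.0, 0.5, 1.5])             # training inputs
y = np.sin(X) + 0.05 * np.random.randn(len(X))   # noisy targets
Xs = np.linspace(-3, 3, 7)                       # test inputs

K = rbf(X, X) + 0.05**2 * np.eye(len(X))         # K + sigma_n^2 I
Ks = rbf(X, Xs)

mean = Ks.T @ np.linalg.solve(K, y)              # posterior mean
var = rbf(Xs, Xs).diagonal() - np.einsum(
    "ij,ij->j", Ks, np.linalg.solve(K, Ks))      # posterior variance
print(np.c_[Xs, mean, var])
```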

I am downloading the PDF version but have just ordered a copy from Amazon.

If you want to encourage MIT Press and other publishers to put materials online as well as in print, order a copy of this and other online materials.

Saying online copies don’t hurt print sales isn’t as convincing as hearing the cash register go “cha-ching!”

(I would also drop a note to the press saying you bought a copy of the online book as well.)

February 28, 2011

Book of Proof

Filed under: Algorithms,Mathematics — Patrick Durusau @ 8:21 am

Book of Proof by Richard Hammack.

Important for topic maps research, this book also captures an important distinction that is sometimes overlooked in topic maps.

From the Introduction:

This is a book about how to prove theorems.

Until this point in your education, you may have regarded mathematics as being a primarily computational discipline. You have learned to solve equations, compute derivatives and integrals, multiply matrices and find determinants; and you have seen how these things can answer practical questions about the real world. In this setting, your primary goal in using mathematics has been to compute answers.

But there is another approach to mathematics that is more theoretical than computational. In this approach, the primary goal is to understand mathematical structures, to prove mathematical statements, and even to discover new mathematical theorems and theories. The mathematical techniques and procedures that you have learned and used up until now have their origins in this theoretical side of mathematics. For example, in computing the area under a curve, you use the Fundamental Theorem of Calculus. It is because this theorem is true that your answer is correct. However, in your calculus class you were probably far more concerned with how that theorem could be applied than in understanding why it is true. But how do we know it is true? How can we convince ourselves or others of its validity? Questions of this nature belong to the theoretical realm of mathematics. This book is an introduction to that realm.

This book will initiate you into an esoteric world. You will learn to understand and apply the methods of thought that mathematicians use to verify theorems, explore mathematical truth and create new mathematical theories. This will prepare you for advanced mathematics courses, for you will be better able to understand proofs, write your own proofs and think critically and inquisitively about mathematics.

Quite legitimately there are topic map activities that are concerned with the efficient application and processing of particular ways to identify subjects and to determine when subject sameness has occurred.

It is equally legitimate to investigate how subject identity is viewed in different domains and the nature of data structures that can best represent those views.

Either one without the other is incomplete.

For those walking on the theoretical side of the street, I think this volume will prove to be quite valuable.

February 18, 2011

The Next Generation of Apache Hadoop MapReduce

Filed under: Algorithms,Hadoop,MapReduce,NoSQL,Topic Maps — Patrick Durusau @ 5:02 am

The Next Generation of Apache Hadoop MapReduce by Arun C Murthy (@acmurthy)

From the webpage:

In the Big Data business running fewer larger clusters is cheaper than running more small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high-availability is built in from the beginning; as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Since I posted the note about OpenStack and it is Friday, it seemed like a natural. Something to read over the weekend!

Saw this first at Alex Popescu’s myNoSQL – The Next Generation of Apache Hadoop MapReduce, which is sporting a new look!

Building a free, massively scalable cloud computing platform

Filed under: Algorithms,Cloud Computing,Topic Maps — Patrick Durusau @ 4:59 am

Building a free, massively scalable cloud computing platform

Soren Hansen at FOSDEM 2011:

A developer’s look into Openstack architecture

OpenStack is a very new, very popular cloud computing project, backed by Rackspace and NASA. It’s all free software (Apache Licensed) and is in production use already.

It’s written entirely in Python and uses Twisted, Eventlet, AMQP, SQLAlchemy, WSGI and many other high quality libraries and standards.

We’ll take a detailed look at the architecture and dive into some of the challenges we’ve faced building a platform that is supposed to handle millions of gigabytes of data and millions of virtual machines, and how we’ve dealt with them.

Video of the presentation

This presentation will be of interest to those who think the answer to topic map processing is to turn up the power dial.

That is one answer and it may be the best one under some circumstances.

That said, given the availability of cloud computing resources, the question becomes one of rolling your own or simply buying the necessary cycles.

Unless you are just trying to throw money at your IT department (that happens even in standards organizations), I suspect buying the cycles is the most cost effective option.

“Free” software really isn’t, particularly server-class software, unless you have all-volunteer administrators, donated hardware, backup facilities, etc.

Note that NASA, one of the sponsors of this project, can whistle up, if not language authors, then major contributors to any software package of interest to it. Can your organization say the same thing?

Just to be fair, my suspicions are in favor of a mix of algorithm development, innovative data structures and high end computing structures. (Yes, I know, I cheated and made three choices. Author’s choice.)

February 17, 2011

Clever Algorithms: Nature-Inspired Programming Recipes

Filed under: Algorithms,Artificial Intelligence — Patrick Durusau @ 7:03 am

Clever Algorithms: Nature-Inspired Programming Recipes

From the website:

The book “Clever Algorithms: Nature-Inspired Programming Recipes” by Jason Brownlee PhD describes 45 algorithms from the field of Artificial Intelligence. All algorithm descriptions are complete and consistent to ensure that they are accessible, usable and understandable by a wide audience.

5 Reasons To Read:

  1. 45 algorithms described.
  2. Designed specifically for Programmers, Research Scientists and Interested Amateurs.
  3. Complete code examples in the Ruby programming language.
  4. Standardized algorithm descriptions.
  5. Algorithms drawn from the popular fields of Computational Intelligence, Metaheuristics, and Biologically Inspired Computation.
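To give the flavor of the recipes, here is a tiny genetic algorithm of my own for the OneMax problem, in Python rather than the book’s Ruby:

```python
# Minimal genetic algorithm: tournament selection, one-point crossover,
# bit-flip mutation, maximizing the number of 1-bits (OneMax).
import random

BITS, POP, GENS = 32, 40, 60

def fitness(ind):
    return sum(ind)

def tournament(pop):
    return max(random.sample(pop, 3), key=fitness)

def crossover(a, b):
    cut = random.randrange(1, BITS)
    return a[:cut] + b[cut:]

def mutate(ind, rate=1.0 / BITS):
    return [bit ^ (random.random() < rate) for bit in ind]

pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(tournament(pop), tournament(pop)))
           for _ in range(POP)]
best = max(pop, key=fitness)
print(fitness(best), "/", BITS)
```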

A recent advance in graph node coloring reduced the problem from millions of years to under 7 seconds on a desktop box.

Question: Are you willing to step beyond the current orthodoxy?
