Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 28, 2011

UCCS Department of Mathematics Math Courses

Filed under: Mathematics — Patrick Durusau @ 9:34 pm

UCCS Department of Mathematics Math Courses

I am sure everyone is wondering what math skills they can pick up this Spring. 😉 You will be glad to learn that the University of Colorado at Colorado Springs (UCCS) has four online courses this Spring and an archive of more than fifty (50) earlier courses.

The classes are free but you do need to create an account to view the recorded content.

For the Spring 2012 Semester you will find:

  • Math 1360 – Calculus II – Shannon Michaux – (MathOnline Course)
  • Math 2350 – Calculus III – Dr. Jenny Dorrington – (MathOnline Course)
  • Math 3110 – Theory of Numbers – Dr. Gene Abrams – (MathOnline Course)
  • Math 3400 – Introduction to Differential Equations – Dr. Radu Cascaval – (MathOnline Course)

Videos are recorded and posted the same day as the class sessions. (No credit or certificates. But for some positions being able to do the job counts for a good bit.)

Luke – The Lucene Index Toolbox v. 3.5.0

Filed under: Lucene,Luke — Patrick Durusau @ 9:33 pm

Luke – The Lucene Index Toolbox v. 3.5.0

Andrzej Bialecki writes:

I’m happy to announce the release of Luke – The Lucene Index Toolbox, version 3.5.0. This release includes Lucene 3.5.0 libraries, and you can download it from:

http://code.google.com/p/luke

Changes in version 3.5.0 (released on 2011.12.28):
* Update to Lucene 3.5.0 and fix some deprecated API usage.
* Issue 49: fix faulty logic that prevented opening indexes in read-only mode (MarkHarwood).
* Issue 43: fix left-over references to Field (merkertr).
* Issue 42: Luke should indicate if a field is a numeric field (merkertr).

Enjoy!

PS. Merry Christmas and a happy New Year to you all! 🙂

About Luke (from its homepage):

Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.

Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways:

  • browse by document number, or by term
  • view documents / copy to clipboard
  • retrieve a ranked list of most frequent terms
  • execute a search, and browse the results
  • analyze search results
  • selectively delete documents from the index
  • reconstruct the original document fields, edit them and re-insert to the index
  • optimize indexes
  • open indexes consisting of multiple parts, and located on Hadoop filesystem
  • and much more…

Current stable release of Luke is 3.5.0 and it includes Lucene 3.5.0 and Hadoop 0.20.2. Available is also Luke 1.0.1 (using Lucene 3.0.1), 0.9.9.1 based on Lucene 2.9.1, and other versions as well – please see the Downloads section.

Luke releases are numbered the same as the version of Lucene libraries that they use (plus a minor number in case of bugfix releases).

Below is a screenshot of the application showing the Overview section, which displays the details of the index format and some overall statistics.

Luke Overview tab
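Luke itself is a Java GUI over Lucene's index APIs, but to give a feel for one of the statistics it reports, the ranked list of most frequent terms, here is a toy counter in Python over a few made-up strings (not a Lucene index):

```python
# Toy illustration of "most frequent terms" over a small corpus; this is not
# Lucene code, just a rough analogue of the statistic Luke displays.
from collections import Counter
import re

docs = [
    "Lucene is a mature, high-performance Java search engine.",
    "Luke opens existing Lucene indexes and browses terms and documents.",
    "Luke can reconstruct document fields and optimize indexes.",
]

counts = Counter()
for doc in docs:
    counts.update(re.findall(r"[a-z]+", doc.lower()))

for term, freq in counts.most_common(10):
    print(f"{freq:3d}  {term}")
```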

Automatically creating tags for big blogs with WordPress (possible upgrade)

Filed under: Tagging — Patrick Durusau @ 9:33 pm

Automatically creating tags for big blogs with WordPress (possible upgrade)

Ajay Ohri writes:

I use the simple-tags plugin in WordPress for automatically creating and posting tags. I am hoping this makes the site better to navigate. Given the fact that I had not been a very efficient tagger before, this plugin can really be useful for someone in creating tags for more than 100 (or 1000 posts) especially WordPress based blog aggregators. (added the hyperlink to simple-tags)

I am thinking about possible changes to this blog to make it more useful. Both for me and you.

Curious if anyone has experience with the “simple-tags” plugin? Was it useful?

Do you think it would be useful with the type of material you find here?
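I have no experience with the plugin myself, but to make the idea concrete, here is a minimal sketch of keyword-based auto-tagging in Python; it is not the PHP plugin, and the tag vocabulary and posts are invented:

```python
# Minimal sketch of keyword-based auto-tagging; the tag vocabulary and posts
# are invented for illustration, not taken from the simple-tags plugin.
posts = {
    "post-1": "Luke is a toolkit for inspecting Lucene indexes.",
    "post-2": "HBase offers random, realtime access to Big Data on Hadoop.",
}

tag_vocabulary = ["lucene", "hadoop", "hbase", "topic maps", "visualization"]

def suggest_tags(text, vocabulary):
    """Return the tags whose keyword appears in the post text."""
    lowered = text.lower()
    return [tag for tag in vocabulary if tag in lowered]

for post_id, body in posts.items():
    print(post_id, suggest_tags(body, tag_vocabulary))
```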

MDM Goes Beyond the Data Warehouse

Filed under: Master Data Management,MDM — Patrick Durusau @ 9:32 pm

MDM Goes Beyond the Data Warehouse

Rich Sherman writes:

Enterprises are awash with data from customers, suppliers, employees and their operational systems. Most enterprises have data warehousing (DW) or business intelligence (BI) programs, which sometimes have been operating for many years. The DW/BI programs frequently do not provide the consistent information needed by the business because of multiple and often inconsistent lists of customers, prospects, employees, suppliers and products. Master data management (MDM) is the initiative that is needed to address the problem of inconsistent lists or dimensions.

The reality is that for many years, whether people realized it or not, the DW has served as the default MDM repository. This happened because the EDW had to reconcile and produce a master list of data for every data subject area that the business needs for performing enterprise analytics. Years before the term MDM was coined, MDM was referred to as reference data management. But DW programs have fallen short of providing effective MDM solutions for several reasons.

Interesting take on the problems faced in master data management projects. (Yes, I added index entries for MDM and “master data management.” People might look under one and not the other.)

It occurs to me that the transition towards a master data list includes understanding the data systems that will eventually migrate to the master system. Topic maps could play a useful role in creating the mapping to the master system, as well as in finding commonalities in the other systems to be migrated to it.

Documenting the master system with a topic map would give such a project a leg up, as they say, on its eventual migration to some other system.

And there are always alien data systems, structured differently from the internal MDM system (assuming that comes to pass), which could also be mapped into the master system using topic maps. I say “assuming that comes to pass” about MDM systems because reference data management, had it been implemented, would already have solved the problems that MDM faces today.
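To make that kind of mapping concrete, here is a toy crosswalk sketch in Python (all codes and names are invented, and no particular MDM product is implied):

```python
# Toy crosswalk from local system codes to a master customer list; all codes
# and names are invented for illustration.
master_customers = {"M-001": "Acme Corp", "M-002": "Globex"}

local_to_master = {        # the mapping a topic map could document and maintain
    "CUST-17": "M-001",
    "AC-ME":   "M-001",    # same customer, different local identifier
    "GLX-9":   "M-002",
}

local_records = [("CUST-17", 1200.0), ("GLX-9", 350.0), ("NEWCO-1", 75.0)]

unmapped = []
for local_id, amount in local_records:
    master_id = local_to_master.get(local_id)
    if master_id is None:
        unmapped.append(local_id)     # needs a human (or a topic map) to resolve
    else:
        print(master_id, master_customers[master_id], amount)

print("Unmapped local identifiers:", unmapped)
```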

IT services are not regarded as projects with defined end points. After all, users expect IT services every day, and such services are necessary for any enterprise to conduct business.

Perhaps data integration should move from a “project” orientation to a “process” orientation, so that continued investment and management of the integration process is ongoing and not episodic. That would create a base for in-house expertise at data integration and a continual gathering of information and expertise to anticipate data integration issues, instead of trying to solve them in hindsight.

Apache HBase 0.90.5 is now available

Filed under: Hadoop,HBase — Patrick Durusau @ 9:31 pm

Apache HBase 0.90.5 is now available

From Jonathan Hsieh at Cloudera:

Apache HBase 0.90.5 is now available. This release of the scalable distributed data store inspired by Google’s BigTable is a fix release that covers 81 issues, including 5 considered blockers and 11 considered critical. This release addresses several robustness and resource leakage issues, fixes rare data-loss scenarios having to do with splits and replication, and improves the atomicity of bulk loads. This version includes some new supporting features, including improvements to hbck and an offline meta-rebuild disaster recovery mechanism.

The 0.90.5 release is backward compatible with 0.90.4. Many of the fixes in this release will be included as part of CDH3u3.

I like the HBase page:

Welcome to Apache HBase!

HBase is the Hadoop database. Think of it as a distributed scalable Big Data store.

When Would I Use HBase?

Use HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Concise, to the point, either you are interested or you are not. Doesn’t waste time on hand wringing about “big data,” “oh, what shall we do?,” or parades of data horrors.

Do you think something similar for topic maps would need an application-area approach? That is, to focus deeply on a particular problem rather than on all the possible uses of topic maps?
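If you want to try that random, realtime read/write access from Python, one common route is HBase’s Thrift gateway with the happybase client. A rough sketch (the table name, column family and row keys are mine, and a Thrift server must be running):

```python
# Rough sketch of random reads/writes against HBase via the Thrift gateway,
# using the happybase client. Table name, column family and keys are
# illustrative only; an HBase Thrift server must be running.
import happybase

connection = happybase.Connection("localhost")   # Thrift gateway host
table = connection.table("articles")             # assumes the table exists

# Write a row (random write): column family "d", qualifier "title".
table.put(b"row-2011-12-28", {b"d:title": b"Apache HBase 0.90.5 is now available"})

# Read it back by key (random read).
print(table.row(b"row-2011-12-28"))

# Scan a key range.
for key, data in table.scan(row_start=b"row-2011-12-01", row_stop=b"row-2012-01-01"):
    print(key, data)

connection.close()
```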

Apache Whirr 0.7.0 has been released

Filed under: Cloud Computing,Clustering (servers),Mahout,Whirr — Patrick Durusau @ 9:30 pm

Apache Whirr 0.7.0 has been released

From Patrick Hunt at Cloudera:

Apache Whirr release 0.7.0 is now available. It includes changes covering over 50 issues, four of which were considered blockers. Whirr is a tool for quickly starting and managing clusters running on cloud services like Amazon EC2. This is the first Whirr release as a top level Apache project (previously releases were under the auspices of the Incubator). In addition to improving overall stability some of the highlights are described below:

Support for Apache Mahout as a deployable component is new in 0.7.0. Mahout is a scalable machine learning library implemented on top of Apache Hadoop.

  • WHIRR-384 – Add Mahout as a service
  • WHIRR-49 – Allow Whirr to use Chef for configuration management
  • WHIRR-258 – Add Ganglia as a service
  • WHIRR-385 – Implement support for using nodeless, masterless Puppet to provision and run scripts

Whirr 0.7.0 will be included in a scheduled update to CDH4.

Getting Involved

The Apache Whirr project is working on a number of new features. The How To Contribute page is a great place to start if you’re interested in getting involved as a developer.

Cluster management or even the “cloud” in your topic map future?

You could do worse than learning one of the most recent Apache top level projects to prepare for a future that may arrive sooner than you think!

December 27, 2011

Royal Statistical Society Christmas quiz

Filed under: Humor — Patrick Durusau @ 7:22 pm

Royal Statistical Society Christmas quiz

Report from the Guardian of the Royal Statistical Society annual Christmas quiz, including details on how to enter.

Thoughts on having something similar for semantic technologies? 😉

After all, at some point it was the first Christmas quiz for the RSS as well.

London – International Software Development Conference 2012 QCon

Filed under: Conferences,Programming — Patrick Durusau @ 7:16 pm

London – International Software Development Conference 2012 QCon

Training: March 5-6, Conference: March 7-9

I know, I was thinking the same thing. March? London? London weather? British airport/hotel/street corner security?

But, then I followed the video link and saw these presentations from prior conferences:

  • Nikolai Onken – “Mobile JavaScript Development”
  • Robert C. Martin – “Bad Code, Craftsmanship, Engineering, and Certification”
  • Joe Armstrong – “Message Passing Concurrency in Erlang”
  • Oren Eini – “The Wizardry of Scaling”
  • Ralph Johnson – “A Pattern Language for Parallel Programming”

I won’t mislead you, I won’t be there. But it is because I can’t and not because of the London negatives I mentioned above.

You can do me a favor. Please blog about the presentations you like best.

Like voting, it encourages the bastards. 😉

A Month of Math Software

Filed under: Mathematics — Patrick Durusau @ 7:15 pm

A Month of Math Software

This issue covers November 2011, but past issues are also available. There is really too much to quote or describe, so go take a look and suggest anything you think should be mentioned specifically here.

Typesafe Stack

Filed under: Akka,Scala — Patrick Durusau @ 7:14 pm

Typesafe Stack

From the website:

Scala. Akka. Simple.

A 100% open source, integrated distribution offering Scala, Akka, sbt, and the Scala IDE for Eclipse.

The Typesafe Stack makes it easy for developers to get started building scalable software systems with Scala and Akka. The Typesafe Stack is based on the most recent stable versions of Scala and Akka, and provides all of the major components needed to develop and deploy Scala and Akka applications.

Go ahead! You need something new to put on your new, shiny 5TB disk drive. 😉

scikits-image – Name Change

Filed under: Image Processing,Machine Learning,Names,Python — Patrick Durusau @ 7:13 pm

scikits-image – Name Change.

Speaking of naming issues, do note that scikits-image has become skimage, although as of 27 December 2011, PyPi – The Python Package Index isn’t aware of the change.

On the other hand, a search for sklearn (the new name for scikit-learn) resolves to the current package name scikit-learn-0.9.tar.gz.

I will drop the administrators a note, because on the sklearn page the text shifts between the two names without explanation.

I got clued in about the change at: http://pythonvision.org/blog/2011/December/skimage04.

So, how do we deal with all the prior uses of the “scikits-image” and “scikit-learn” identifiers that are about to be disconnected from the software they once named?

Eventually the package pages will be innocent of either one, save perhaps in increasingly old change logs.

Assume I run across a blog post or article that is two or three years old with an interesting technique that uses the old names. Other than by chance, how do I find the package under its new name? And if I do find it, how can I save other people from the same time investment and depending on luck for the result?

To be sure, the package search mechanism puts me out at the right place but what if I am not expecting the resolution to another name? Will I think this is another package?
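One low-tech way to cope in code that has to survive the rename is a compatibility import that tries the new name first and falls back to the old one. A sketch (the old scikits.* import paths are written from memory, so treat them as assumptions):

```python
# Compatibility shim: prefer the new package names (skimage, sklearn) and fall
# back to the old scikits.* namespace packages if only those are installed.
# The old import paths are written from memory; verify against your install.
try:
    import skimage
except ImportError:
    import scikits.image as skimage      # pre-rename releases (assumption)

try:
    import sklearn
except ImportError:
    import scikits.learn as sklearn      # scikit-learn before the rename

print("image package:", skimage.__name__, "learn package:", sklearn.__name__)
```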

How Important Are Names?

Filed under: Marketing — Patrick Durusau @ 7:12 pm

In some cases, very important:

Medication Errors Injure 1.5 Million People and Cost Billions of Dollars Annually; Report Offers Comprehensive Strategies for Reducing Drug-Related Mistakes

Medication errors are among the most common medical errors, harming at least 1.5 million people every year, says a new report from the Institute of Medicine of the National Academies. The extra medical costs of treating drug-related injuries occurring in hospitals alone conservatively amount to $3.5 billion a year, and this estimate does not take into account lost wages and productivity or additional health care costs, the report says.

One of the causes?:

Confusion caused by similar drug names accounts for up to 25 percent of all errors reported to the Medication Error Reporting Program operated cooperatively by U.S. Pharmacopeia (USP) and the Institute for Safe Medication Practices (ISMP). In addition, labeling and packaging issues were cited as the cause of 33 percent of errors, including 30 percent of fatalities, reported to the program. Drug naming terms should be standardized as much as possible, and all companies should be required to use the standardized terms, the report urges. FDA, AHRQ, and the pharmaceutical industry should collaborate with USP, ISMP, and other appropriate organizations to develop a plan to address the problems associated with drug naming, labeling, and packaging by the end of 2007.

Similar drug names?
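To see how easily look-alike names collide, here is a quick check with Python’s difflib over a frequently cited confusable trio (Celebrex, Celexa, Cerebyx); it only illustrates string-level similarity, it is not a clinical tool:

```python
# Quick illustration of why similar drug names get confused: string similarity
# between three frequently cited look-alike names. Not a clinical tool.
from difflib import SequenceMatcher
from itertools import combinations

names = ["Celebrex", "Celexa", "Cerebyx"]

for a, b in combinations(names, 2):
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    print(f"{a} vs {b}: similarity {ratio:.2f}")
```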

Before you jump to the conclusion that I am going to recommend topic maps as a solution, let me assure you I’m not. Nor would I recommend RDF or any other “semantic” technology that I am aware of.

In part because the naming/identification issue here, as in many places, is only part of a very complex social and economic set of issues. To focus on the “easy” part, ;-), that is identification, is to lose sight of many other requirements.

To be effective, a solution can’t only address the issue that your technology or product is good at addressing but it must address other issues as well.

I have written to the National Academies to see if there is an update on this report. This report rather optimistically suggests a number of actions that I find unlikely to occur without government intervention.

PS: Products that incorporate topic maps or RDF based technologies may form a part of a larger solution to medication errors but that isn’t the same thing as being “the” answer.

Best Maps and Visualizations of 2011

Filed under: Maps,Visualization — Patrick Durusau @ 7:11 pm

Best Maps and Visualizations of 2011

From Spatialanalysis.co.uk a selection of rather stunning maps and visualizations.

The “naming rivers and places” map illustrates the use of different identifiers by different communities, in this case based on geographic location. I am not sure how you would handle the topography, but similar maps could be constructed of terminology usage by profession or occupation, or between specialities within a profession or occupation.

Rickshaw

Filed under: Graphs,Visualization — Patrick Durusau @ 7:11 pm

Rickshaw

From the webpage:

Rickshaw is a JavaScript toolkit for creating interactive time series graphs, developed at Shutterstock.

Includes the ability to update in real time.

Do you click through parts of a map built with Rickshaw to a topic map, or does a topic map update a map made with Rickshaw, or do you use a topic map at all with Rickshaw?

Depends on your requirements for identifiers and their recognition.

Computer Vision & Math

Filed under: Image Recognition,Image Understanding,Mathematics — Patrick Durusau @ 7:10 pm

Computer Vision & Math

From the website:

The main part of this site is called Home of Math. It’s an online mathematics textbook that contains over 800 articles with over 2000 illustrations. The level varies from beginner to advanced.

Try our image analysis software. Pixcavator is a light-weight program intended for scientists and engineers who want to automate their image analysis tasks but lack a significant computing background. This image analysis software allows the analyst to concentrate on the science and lets us take care of the math.

If you create image analysis applications, consider Pixcavator SDK. It provides a simple tool for developing new image analysis software in a variety of fields. It allows the software developer to concentrate on the user’s needs instead of development of custom algorithms.

Thinking, Fast and Slow

Thinking, Fast and Slow by Daniel Kahneman, Farrar, Straus and Giroux, New York, 2011.

I got a copy of “Thinking, Fast and Slow” for Christmas and it has already proven to be an enjoyable read.

Kahneman says early on (page 28):

The premise of this book is that it is easier to recognize other people’s mistakes than our own.

I thought about that line when I read a note from a friend that topic maps needed more than my:

tagging everything with “Topic Maps….”

Which means I haven’t been clear about the reasons for the breadth of materials I have been and will be covering in this blog.

One premise of this blog is that the use and recognition of identifiers is essential for communication.

Another premise of this blog is that it is easier for us to study the use and recognition of identifiers by others, much for the same reasons we can recognize the mistakes of others more easily.

The use and recognition of identifiers by others aren’t mistakes but they may be different from those we would make. In cases where they differ from ours, we have a unique opportunity to study the choices made and the impacts of those choices. And we may learn patterns in those choices that we can eventually see in our own choices.

Understanding the use and recognition of identifiers in a particular circumstance, and the requirements for that use and recognition, is the first step towards deciding whether topic maps would be useful in that circumstance, and in what way.

For example, when processing social security records in the United States, anything other than “bare” identifiers like a social security number may be unnecessary and add load with no corresponding benefit. Aligning social security records with bank records, however, might require reconsidering the judgement to use only social security numbers. (Some information sharing is “against the law.” But as the Sheriff in “O Brother, Where Art Thou?” says: “The law is a man made thing.” Laws change, or you can commission absurdist interpretations of them.)
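As a toy sketch of that identifier judgement (all records below are invented): matching on the bare SSN is a one-line join, but aligning with bank records that lack an SSN forces a fallback to weaker identifiers such as name plus date of birth.

```python
# Toy record alignment: SSA-style records keyed by SSN versus bank records
# that may lack an SSN. All data is invented for illustration.
ssa_records = {
    "078-05-1120": {"name": "Jane Doe", "dob": "1970-01-01"},
    "219-09-9999": {"name": "John Roe", "dob": "1965-06-15"},
}

bank_records = [
    {"ssn": "078-05-1120", "name": "J. Doe", "dob": "1970-01-01", "balance": 1000},
    {"ssn": None, "name": "John Roe", "dob": "1965-06-15", "balance": 250},
]

for rec in bank_records:
    if rec["ssn"] in ssa_records:                     # the "bare" identifier suffices
        print("SSN match:", rec["ssn"], rec["balance"])
    else:                                             # weaker identifiers needed
        candidates = [ssn for ssn, p in ssa_records.items()
                      if p["name"] == rec["name"] and p["dob"] == rec["dob"]]
        print("Fallback match on name+dob:", candidates, rec["balance"])
```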

Topic maps aren’t everywhere but identifiers and recognition of identifiers are.

Understanding identifiers and their recognition will help you choose the most appropriate solution to a problem.

December 26, 2011

Hive Plots

Filed under: Hive Plots,Networks,Visualization — Patrick Durusau @ 8:24 pm

Hive Plots

From the website:

Hive plots — for the impatient

The hive plot is a rational visualization method for drawing networks. Nodes are mapped to and positioned on radially distributed linear axes — this mapping is based on network structural properties. Edges are drawn as curved links. Simple and interpretable.

The purpose of the hive plot is to establish a new baseline for visualization of large networks — a method that is both general and tunable and useful as a starting point in visually exploring network structure.

You will really have to visit the link to properly experience hive plots. No description on my part would really be adequate.
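Still, to make the geometry concrete, here is a minimal matplotlib sketch of the idea: a random graph, with node degree standing in for the structural property that picks the axis and the position along it, and edges drawn as curved links. This is my own toy, not the authors’ implementation.

```python
# Minimal hive-plot sketch: nodes are assigned to one of three axes by a
# structural property (degree class) and positioned along the axis by degree;
# edges are drawn as quadratic Bezier curves. A toy, not the reference code.
import math
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import PathPatch
import networkx as nx

G = nx.erdos_renyi_graph(60, 0.08, seed=42)
degrees = dict(G.degree())
max_deg = max(degrees.values()) or 1

def axis_of(node):
    # Assumption: low / medium / high degree buckets pick the axis.
    d = degrees[node]
    return 0 if d <= max_deg / 3 else (1 if d <= 2 * max_deg / 3 else 2)

def position(node):
    # Radial position along the axis grows with degree.
    angle = math.radians(90 + 120 * axis_of(node))
    r = 1 + 4 * degrees[node] / max_deg
    return r * math.cos(angle), r * math.sin(angle)

fig, ax = plt.subplots(figsize=(6, 6))
for u, v in G.edges():
    (x1, y1), (x2, y2) = position(u), position(v)
    # A control point pulled toward the origin bends the edge into a curve.
    ctrl = ((x1 + x2) / 4, (y1 + y2) / 4)
    path = Path([(x1, y1), ctrl, (x2, y2)],
                [Path.MOVETO, Path.CURVE3, Path.CURVE3])
    ax.add_patch(PathPatch(path, fill=False, lw=0.5, alpha=0.5))

xs, ys = zip(*(position(n) for n in G))
ax.scatter(xs, ys, s=15, zorder=3)
ax.set_aspect("equal")
ax.axis("off")
plt.show()
```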

Haskell – New Release

Filed under: Haskell — Patrick Durusau @ 8:23 pm

Haskell – New Release

Current stable release: 2011.4.0.0 (December 2011) – Download today!

A Christmas Miracle

Filed under: Dataset,Government Data — Patrick Durusau @ 8:22 pm

A Christmas Miracle

From the post:

Data files on 407 banks, between the dates of 2007 to 2009, on the daily borrowing with the US Federal Reserve bank. The data sets are available from Bloomberg at this address: data

This is an unprecedented look into the day-to-day transactions of banks with the Feds during one of the worse and unusual times in US financial history. A time of weekend deals, large banks being summoned to sign contracts, and all around chaos. For the economist, technocrat, and R enthusiasts this is the opportunity of a life time to examine and analyze financial data normally held in the strictest of confidentiality. A good comparison would be taking all of the auto companies and getting their daily production, sales, and cost data for two years and sending it out to the world. Never has happened.

Not to get too excited, what were released were daily totals, not the raw data itself.

Being a naturally curious person, when someone releases massaged data where the raw data would have been easier, I have to wonder: what would I see if I had the raw data? Or, perhaps in a topic maps context, what subjects could I link up with the raw data that I can’t with the massaged data?

Pattern

Filed under: Data Mining,Python — Patrick Durusau @ 8:21 pm

Pattern

From the webpage:

Pattern is a web mining module for the Python programming language.

It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).

The module is bundled with 30+ example scripts.

Consider it to be a late stocking stuffer. 😉
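As a taste of the text-analysis side, here is a toy tf-idf plus cosine similarity computation in plain Python; it illustrates the metrics the description mentions but does not use Pattern’s own API.

```python
# Toy tf-idf + cosine similarity, illustrating the metrics Pattern bundles;
# plain Python, not Pattern's API.
import math
from collections import Counter

docs = [
    "pattern is a web mining module for python",
    "the module bundles tools for data retrieval and text analysis",
    "cosine similarity compares tf idf weighted documents",
]
tokenized = [d.split() for d in docs]

def tf_idf(doc, corpus):
    tf = Counter(doc)
    n = len(corpus)
    return {t: (tf[t] / len(doc)) * math.log(n / sum(1 for d in corpus if t in d))
            for t in tf}

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

vectors = [tf_idf(d, tokenized) for d in tokenized]
print("doc0 vs doc1:", round(cosine(vectors[0], vectors[1]), 3))
print("doc1 vs doc2:", round(cosine(vectors[1], vectors[2]), 3))
```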

Galaxy: Data Intensive Biology for Everyone

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:20 pm

Galaxy: Data Intensive Biology for Everyone (main site)

Galaxy 101: The first thing you should try (tutorial)

Work through the tutorial, keeping track of where you think subject identity (and any tests you care to suggest) would be useful.

I don’t know but suspect this is representative of what researchers in the field expect in terms of capabilities.

With a little effort I suspect it would make a nice basis to start a conversation about what subject identity could add that would be of interest.

NoSQL Conference CGN

Filed under: NoSQL — Patrick Durusau @ 8:19 pm

NoSQL Conference CGN

Important dates:

  • 1 February 2012 – Deadline for proposals
  • 1 March 2012 – Accepted speakers announced
  • 29-30 May 2012 – Conference, Cologne, Germany

From the website:

NoSQL is taking the IT-world by storm: originally devised to tackle growing amounts of data, NoSQL now forms the base for a large variety of business solutions.

NoSQL – currently the most efficient way to manage large data repositories – is taking a leading role in the next generation of database technologies. The upcoming conference NoSQL matters will present the innovations at the forefront of the area – and right in the center of Europe.

Over two days international experts will present topics associated with NoSQL technologies, and outline challenges and solutions for the administration of large data repositories. The conference aims to contribute creatively to the understanding of NoSQL in terms of development and practical use. NoSQL matters takes place in Cologne, where 2000 years of history are combined with modern technology.

I understand the tourist facilities at Ur are in disrepair so I guess having the conference at a more recent location is ok. 😉

Beyond Relational

Filed under: Database,MySQL,Oracle,PostgreSQL,SQL,SQL Server — Patrick Durusau @ 8:19 pm

Beyond Relational

I originally arrived at this site because of a blog hosted there with lessons on Oracle 10g. Exploring a bit I decided to post about it.

Seems to have fairly broad coverage, from Oracle and PostgreSQL to TSQL and XQuery.

Likely to be a good site for learning cross-overs between systems that you can map for later use.

Suggestions of similar sites?

From Information to Knowledge: On Line Access to Legal Information

Filed under: Law - Sources,Legal Informatics — Patrick Durusau @ 8:18 pm

From Information to Knowledge: On Line Access to Legal Information

Collection of pointers to slides, abstracts and some papers on access to legal information, including classification, ontologies, reports on experiences with current systems, etc.

Focused on Europe and “open access” to legal materials.

Maybe useful background information for discussions about topic maps and legal materials.

Mondeca helps to bring Electronic Patient Record to reality

Filed under: Biomedical,Data Integration,Health care,Medical Informatics — Patrick Durusau @ 8:13 pm

Mondeca helps to bring Electronic Patient Record to reality

This has been out for a while but I just saw it today.

From the post:

Data interoperability is one of the key issues in assembling unified Electronic Patient Records, both within and across healthcare providers. ASIP Santé, the French national healthcare agency responsible for implementing nation-wide healthcare management systems, has been charged to ensure such interoperability for the French national healthcare.

The task is a daunting one since most healthcare providers use their own custom terminologies and medical codes. This is due to a number of issues with standard terminologies: 1) standard terminologies take too long to be updated with the latest terms; 2) significant internal data, systems, and expertise rely on the usage of legacy custom terminologies; and 3) a part of the business domain is not covered by a standard terminology.

The only way forward was to align the local custom terminologies and codes with the standard ones. This way local data can be automatically converted into the standard representation, which will in turn allow to integrate it with the data coming from other healthcare providers.

I assume the alignment of local custom terminologies is an ongoing process, so that as the local terminologies change, re-alignment occurs as well?

Kudos to Mondeca: they played an active role in the early days of XTM and I suspect that experience has influenced (for the good) their approach to this project.

December 25, 2011

Let It Crash

Filed under: Akka — Patrick Durusau @ 6:08 pm

Let It Crash

Who else but the Akka team would choose a blog title like: Let It Crash. 😉

An early post? Read on:

Location Transparency: Remoting in Akka 2.0

The remoting capabilities of Akka 2.0 are really powerful. Something that has not been as powerful is the documentation of the Akka remoting. We are constantly striving to improve it and this blog post will, hopefully, shed some light on the topic.

The remoting contains functionality not only to lookup a remote actor and send messages to it but also to deploy actors on remote nodes. These two types of interaction are referred to as:

  • Lookup
  • Creation

In the section below the two different approaches will be explained.
(It may be worth pointing out that a combination of the two ways is, of course, also feasible)

(see the post for the rest of it)

Encouraging, because the team realizes that its documentation leaves something to be desired and, just as importantly, wants to do something about it.

Looking forward to more posts like this one.

Drop by and leave an encouraging word.

Visualizing Reuters Editorial Investment

Filed under: Visualization — Patrick Durusau @ 6:07 pm

Visualizing Reuters Editorial Investment by Matthew Hurst.

From the post:

This is a very early view of a work in progress. The process is to crawl Reuters, extract the attribution of each article (writers and editors) and extract the mention of country names. Then, using gephi, to visualize the relationships, thus – in this case – showing which editors are associated with the mention of which countries. In this snippet, countries have mutual links (red) with other countries they are collocated with. Editors have directed edges (green) with the country mentions they are associated with.

See the post for the image of relationships.
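A rough sketch of that pipeline in Python (invented article snippets, a tiny country list, and a GEXF export that gephi can open):

```python
# Rough sketch of the pipeline Hurst describes: find country mentions per
# article, link editors to countries, and export a GEXF file for gephi.
# Articles, editors, and the country list are invented for illustration.
import networkx as nx

articles = [
    {"editors": ["Editor A"], "text": "Talks between Greece and Germany resumed."},
    {"editors": ["Editor B"], "text": "Markets in Japan and Germany fell sharply."},
]
countries = {"Greece", "Germany", "Japan"}

G = nx.DiGraph()
for article in articles:
    mentioned = {c for c in countries if c in article["text"]}
    for editor in article["editors"]:
        for country in mentioned:
            G.add_edge(editor, country)       # editor -> mentioned country

nx.write_gexf(G, "reuters_editors.gexf")       # open this file in gephi
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```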

Arnetminer

Filed under: Networks,Social Networks — Patrick Durusau @ 6:07 pm

Arnetminer: search and mining of academic social networks

From the webpage:

Arnetminer (arnetminer.org) aims to provide comprehensive search and mining services for researcher social networks. In this system, we focus on: (1) creating a semantic-based profile for each researcher by extracting information from the distributed Web; (2) integrating academic data (e.g., the bibliographic data and the researcher profiles) from multiple sources; (3) accurately searching the heterogeneous network; (4) analyzing and discovering interesting patterns from the built researcher social network. The main search and analysis functions in arnetminer include:

  • Profile search: input a researcher name (e.g., Jie Tang), the system will return the semantic-based profile created for the researcher using information extraction techniques. In the profile page, the extracted and integrated information include: contact information, photo, citation statistics, academic achievement evaluation, (temporal) research interest, educational history, personal social graph, research funding (currently only US and CN), and publication records (including citation information, and the papers are automatically assigned to several different domains).
  • Expert finding: input a query (e.g., data mining), the system will return experts on this topic. In addition, the system will suggest the top conference and the top ranked papers on this topic. There are two ranking algorithms, VSM and ACT. The former is similar to the conventional language model and the latter is based on our Author-Conference-Topic (ACT) model. Users can also provide feedbacks to the search results.
  • Conference analysis: input a conference name (e.g., KDD), the system returns who are the most active researchers on this conference, and the top-ranked papers.
  • Course search: input a query (e.g., data mining), the system will tell you who are teaching courses relevant to the query.
  • Associate search: input two researcher names, the system returns the association path between the two researchers. The function is based on the well-known "six-degree separation" theory.
  • Sub-graph search: input a query (e.g., data mining), the system first tells you what topics are relevant to the query (e.g., five topics "Data mining", "XML Data", "Data Mining / Query Processing", "Web Data / Database design", "Web Mining" are relevant), and then display the most important sub-graph discovered on each relevant topic, augmented with a summary for the sub-graph.
  • Topic browser: based on our Author-Conference-Topic (ACT) model, we automatically discover 200 hot topics from the publications. For each topic, we automatically assign a label to represent its meanings. Furthermore, the browser presents the most active researchers, the most relevant conferences/papers, and the evolution trend of the topic is discovered.
  • Academic ranks: we define 8 measures to evaluate the researcher's achievement. The measures include "H-index", "Citation", "Uptrend", "Activity", "Longevity", "Diversity", "Sociability", "New Star". For each measure, we output a ranking list in different domains. For example, one can search who have the highest citation number in the "data mining" domain.
  • User management: one can register as a user to: (1) modify the extracted profile information; (2) provide feedback on the search results; (3) follow researchers in arnetminer; (4) create an arnetminer page (which can be used to advertise confs/workshops, or recruit students).

Arnetminer.org has been in operation on the internet for more than three years. Currently, the academic network includes more than 6,000 conferences, 3,200,000 publications, 700,000 researcher profiles. The system attracts users from more than 200 countries and receives >200,000 access logs per day. The top five countries where users come from are United States, China, Germany, India, and United Kingdom.

A rich data source and a way to explore who’s who in particular domains.
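The “Associate search” function is essentially a shortest path over the co-author network. A toy version (the co-author edges are invented, not Arnetminer’s data):

```python
# Toy version of Arnetminer's "Associate search": the association path between
# two researchers is a shortest path in a co-author graph. Edges are invented.
import networkx as nx

coauthors = nx.Graph()
coauthors.add_edges_from([
    ("Jie Tang", "Researcher B"),
    ("Researcher B", "Researcher C"),
    ("Researcher C", "Researcher D"),
    ("Jie Tang", "Researcher E"),
])

path = nx.shortest_path(coauthors, source="Jie Tang", target="Researcher D")
print(" -> ".join(path))
```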

New Entrez Genome

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:07 pm

New Entrez Genome Released on November 9, 2011

From the announcement:

Historically the Entrez Genome data model was designed for complete genomes of microorganisms (Archaea, Eubacteria, and Viruses) and a very few eukaryotic genomes such as human, yeast, worm, fly and thale cress (Arabidopsis thaliana). It also included individual complete genomes of organelles and plasmids. Despite the name, the Entrez Genome database record has been a chromosome (or organelle or plasmid) rather than a genome.

The new Genome resource uses a new data model where a single record provides information about the organism (usually a species), its genome structure, available assemblies and annotations, and related genome-scale projects such as transcriptome sequencing, epigenetic studies and variation analysis. As before, the Genome resource represents genomes from all major taxonomic groups: Archaea, Bacteria, Eukaryote, and Viruses. The old Genome database represented only Refseq genomes, while the new resource extends this scope to all genomes either provided by primary submitters (INSDC genomes) or curated by NCBI staff (RefSeq genomes).

The new Genome database shares a close relationship with the recently redesigned BioProject database (formerly Genome Project). Primary information about genome sequencing projects in the new Genome database is stored in the BioProject database. BioProject records of type “Organism Overview” have become Genome records with a Genome ID that maps uniquely to a BioProject ID. The new Genome database also includes all “genome sequencing” records in BioProject.

BTW, just in case you ever wonder about changes in identifiers causing problems:

The new Genome IDs cannot be directly mapped to the old Genome IDs because the data types are very different. Each old Genome ID represented a single sequence that can still be found in Entrez Nucleotide using standard Entrez searches or the E-utilities. We recommend that you convert old Genome IDs to Nucleotide GI numbers using the following remapping file available on the NCBI FTP site:
ftp://ftp.ncbi.nih.gov/genomes/old_genomeID2nucGI

The Genome site.
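A sketch of applying that remapping file in Python; I am assuming a simple two-column, whitespace-delimited layout of old Genome ID and nucleotide GI, so check the actual file before relying on it:

```python
# Sketch: convert old Genome IDs to nucleotide GI numbers using NCBI's
# remapping file. The two-column, whitespace-delimited layout is an
# assumption; inspect the file from the NCBI FTP site before use.
def load_remapping(path):
    mapping = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 2 or line.startswith("#"):
                continue
            old_id, nuc_gi = parts[:2]
            mapping[old_id] = nuc_gi
    return mapping

def convert(old_ids, mapping):
    return {old: mapping.get(old, "NOT FOUND") for old in old_ids}

# Example usage (file path and IDs are placeholders):
# mapping = load_remapping("old_genomeID2nucGI")
# print(convert(["12345", "67890"], mapping))
```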

LittleSis

Filed under: Government Data,Politics — Patrick Durusau @ 6:07 pm

LittleSis* is a free database of who-knows-who at the heights of business and government. (*opposite of Big Brother).

Quick Summary: LittleSis is tracking “21,390 organizations, 64,453 people, and 339,769 connections between them”

From the “about” page:

LittleSis is a free database detailing the connections between powerful people and organizations.

We bring transparency to influential social networks by tracking the key relationships of politicians, business leaders, lobbyists, financiers, and their affiliated institutions. We help answer questions such as:

  • Who do the wealthiest Americans donate their money to?
  • Where did White House officials work before they were appointed?
  • Which lobbyists are married to politicians, and who do they lobby for?

All of this information is public, but scattered. We bring it together in one place. Our data derives from government filings, news articles, and other reputable sources. Some data sets are updated automatically; the rest is filled in by our user community.

Their blog is known as: Eyes on the Ties.

Just in case you are interested in politics. Looks like the sort of effort that would benefit from using a topic map.

