Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 1, 2013

Eight UX Design Trends for 2013

Filed under: Design,Interface Research/Design,Usability,Visualization — Patrick Durusau @ 8:07 pm

Eight UX Design Trends for 2013

The post is very visual, so you will need to consult it directly, but I can list the titles of the experiences:

  • Downsampling
  • Foodism
  • Quantified Ambition
  • Augmented Dialogue
  • Sensory Bandwidth
  • Agile Economies
  • Faceted Video
  • RetroFuturism

One or more of these may help distinguish your product/services from less successful ones.

Concurrency Improvements in TokuDB v6.6 (Part 1)

Filed under: Concurrent Programming,Indexing — Patrick Durusau @ 8:07 pm

Concurrency Improvements in TokuDB v6.6 (Part 1)

From the post:

With TokuDB v6.6 out now, I’m excited to present one of my favorite enhancements: concurrency within a single index. Previously, while there could be many SQL transactions in-flight at any given moment, operations inside a single index were fairly serialized. We’ve been working on concurrency for a few versions, and things have been getting a lot better over time. Today I’ll talk about what to expect from v6.6. Next time, we’ll see why.

Impressive numbers as always!

Should get you interested in learning how this was done as an engineering matter. (That’s in part 2.)

Tracking 5.3 Billion Mutations: Using MySQL for Genomic Big Data

Filed under: MySQL,TokuDB,Tokutek — Patrick Durusau @ 8:07 pm

Tracking 5.3 Billion Mutations: Using MySQL for Genomic Big Data by Lawrence Schwartz.

From the post:

The Organization: The Philip Awadalla Laboratory is the Medical and Population Genomics Laboratory at the University of Montreal. Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations. Its research includes work relevant to all types of human diseases: genetic, immunological, infectious, chronic and cancer. Using genomic data from single-nucleotide polymorphisms (SNP), next-generation re-sequencing, and gene expression, along with modern statistical tools, the lab is able to locate genome regions that are associated with disease pathology and virulence as well as study the mechanisms that cause the mutations.

The Challenge: The lab’s genomic research database is following 1400 individuals with 3.7 million shared mutations, which means it is tracking 5.3 billion mutations. Because the representation of genomic sequence is a highly compressible series of letters, the database requires less hardware than a typical one. However, it must be able to store and retrieve data quickly in order to respond to research requests.

Thibault de Malliard, the researcher tasked with managing the lab’s data, adds hundreds of thousands of records every day to the lab’s MySQL database. The database must be able to process the records ASAP so that the researchers can make queries and find information quickly. However, as the database grew to 200 GB, its performance plummeted. de Malliard determined that the database’s MyISAM storage engine was having difficulty keeping up with the fire hose of data, pointing out that a single sequencing batch could take days to run.

Anticipating that the database could grow to 500 GB or even 1 TB within the next year, de Malliard began to search for a storage engine that would maintain performance no matter how large his database got.

Insertion Performance: “For us, TokuDB proved to be over 50x faster to add or update data into big tables,” according to de Malliard. “Adding 1M records took 51 min for MyISAM, but 1 min for TokuDB. So inserting one sequencing batch with 48 samples and 1.5M positions would take 2.5 days for MyISAM but one hour with TokuDB.”
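A quick back-of-the-envelope check of those figures (the 72M-record batch size is inferred from the quoted 48 samples and 1.5M positions):

```python
# Back-of-the-envelope check of the quoted insertion figures.
MYISAM_MIN_PER_M = 51  # minutes to insert 1M records with MyISAM
TOKUDB_MIN_PER_M = 1   # minutes to insert 1M records with TokuDB

batch_records_m = 48 * 1.5  # 48 samples x 1.5M positions = 72M records

myisam_days = batch_records_m * MYISAM_MIN_PER_M / (60 * 24)
tokudb_hours = batch_records_m * TOKUDB_MIN_PER_M / 60

print(f"MyISAM: {myisam_days:.1f} days")    # ~2.5 days, matching the quote
print(f"TokuDB: {tokudb_hours:.1f} hours")  # ~1.2 hours, roughly the quoted "one hour"
```

The numbers are internally consistent: 51 min/M versus 1 min/M is the quoted "over 50x" speedup.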

OK, so it’s not “big data.” But it was critical data to the lab.

Maybe instead of “big data” we should be talking about “critical” or even “relevant” data.

Remember the story of the data analyst with “830 million GPS records of 80 million taxi trips” whose analysis confirmed what taxi drivers already knew: they stop driving when it rains. He could have asked a taxi driver or two. See Starting Data Analysis with Assumptions.

Take a look at TokuDB when you need a “relevant” data solution.

Topic Discovery With Apache Pig and Mallet

Filed under: Latent Dirichlet Allocation (LDA),MALLET,Pig — Patrick Durusau @ 8:07 pm

Topic Discovery With Apache Pig and Mallet

One of only two posts from this blog in 2012, but it is a useful one.

From the post:

A common desire when working with natural language is topic discovery. That is, given a set of documents (eg. tweets, blog posts, emails) you would like to discover the topics inherent in those documents. Often this method is used to summarize a large corpus of text so it can be quickly understood what that text is ‘about’. You can go further and use topic discovery as a way to classify new documents or to group and organize the documents you’ve done topic discovery on.

Walks through the use of Pig and Mallet on a newsgroup data set.

I have been thinking about getting one of those unlimited download newsgroup accounts.

Maybe I need to go ahead and start building some newsgroup data sets.

BBC …To Explore Linked Data Technology [Instead of hand-curated content management]

Filed under: Linked Data,LOD,News — Patrick Durusau @ 8:07 pm

BBC News Lab to Explore Linked Data Technology by Angela Guess.

From the post:

Matt Shearer of the BBC recently reported that the BBC’s News Lab team will begin exploring linked data technologies. He writes, “Hi I’m Matt Shearer, delivery manager for Future Media News. I manage the delivery of the News Product and I also lead on BBC News Labs. BBC News Labs is an innovation project which was started during 2012 to help us harness the BBC’s wider expertise to explore future opportunities. Generally speaking BBC News believes in allowing creative technologists to innovate and influence the direction of the News product. For example the delivery of BBC News’ responsive design mobile service started in 2011 when we made space for a multidiscipline project to explore responsive design opportunities for BBC News. With this in mind the BBC News team setup News Labs to explore linked data technologies.”

Shearer goes on, “The BBC has been making use of linked data technologies in its internal content production systems since 2011. As explained by Jem Rayfield this enabled the publishing of news aggregation pages ‘per athlete’, ‘per sport’ and ‘per event’ for the 2012 Olympics – something that would not have been possible with hand-curated content management. Linked data is being rolled out on BBC News from early 2013 to enrich the connections between BBC News stories, content assets, the wider BBC website and the World Wide Web. We framed each challenge/opportunity for the News Lab in terms of a clear ‘problem space’ (as opposed to a set of requirements that may limit options) supported by research findings, audience needs, market needs, technology opportunities and framed with the BBC News Strategy.”

Read more here.

(emphasis added)

Apologies for the long quote, but I wanted to capture, in context, the BBC’s comparison of linked data to hand-curated content management.

I never dreamed the BBC was still using “hand-curated content management” as a measure of modern IT systems.

Quite remarkable.

On the other hand, perhaps they were being kind to the linked data experiment by using a measure that enables it to excel.

If you know which one, please comment.

Thanks!

Ocean Biogeographic Information System (OBIS)

Filed under: Biology,Oceanography — Patrick Durusau @ 8:04 pm

Ocean Biogeographic Information System (OBIS)

Someone suggested to me recently that pointers to data for topic maps would be quite useful.

In that vein, consider the records held by the OBIS system:

Below is an overview of some of the vital statistics of OBIS, including number of records available through the search interface, number of species and number of datasets; the numbers between brackets are those for the last two data loads, and show progress booked since then. The graph shows how the number of records increased over time.

  • Number of records: 35.5 (33.6, 32.7, 32.3) million
    • Number of records identified to species or infraspecies: 27.32 (26.3, 25.52, 25.19) million
    • Number of records identified to genus or better: 31.1 (29.8, 28.5, 28.4) million
  • Number of valid species with data reported to OBIS: 146,496 (145,899; 145,317; 145,153)
  • Number of valid marine taxa in OBIS: 163,313 (162,139; 161,620; 161,493)
    • Number of valid marine species: 120,259 (119,337; 118,937; 118,801)
    • Number of valid marine genera: 27,333 (27,228; 27,154; 27,086)
  • Number of datasets: 1,130 (1,125; 1,072; 1,056)
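A quick computation on the counts above shows how well-resolved the records are taxonomically:

```python
# Share of OBIS records resolved to each taxonomic level
# (figures from the overview above, in millions of records).
total = 35.5
to_species = 27.32
to_genus = 31.1

print(f"identified to species or infraspecies: {to_species / total:.0%}")  # 77%
print(f"identified to genus or better:         {to_genus / total:.0%}")   # 88%
```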

Talk about an opportunity to integrate data into the historical records of marine biology!

Still Not a MOOC, but…

Filed under: Machine Learning — Patrick Durusau @ 8:04 pm

John Langford and Yann LeCun are teaching a large scale machine learning class at NYU that was announced to not be a MOOC. See: NYU Large Scale Machine Learning Class [Not a MOOC]

However, see: Remote large scale learning class participation.

John and Yann have arranged for the lectures and slides to be posted with a one-day delay, and there is a discussion forum if you are interested.

Still not a MOOC, but a wonderful opportunity for those of us who cannot attend in person.
