Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 21, 2015

600 Terabytes/30,000 Instances Exposed (Not OPM or Sony)

Filed under: Cybersecurity,MongoDB,Security — Patrick Durusau @ 1:17 pm

Graeme Burton writes in Check your NoSQL database – 600 terabytes of MongoDB data publicly exposed on the internet that the inventor of Shodan, John Matherly, claims 595.2 terabytes (TB) of data are exposed in MongoDB instances running without authentication.

For the story on how the data on MongoDB instances was collected, see It’s the Data, Stupid! by John Matherly.

The larger item of interest is Shodan itself: “Shodan is the world’s first search engine for Internet-connected devices.”

Considering how well everyone has done with computer security to date, being able to search Internet-connected devices should not be a problem. Yes?

From the homepage:

Explore the Internet of Things

Use Shodan to discover which of your devices are connected to the Internet, where they are located and who is using them.

See the Big Picture

Websites are just one part of the Internet. There are power plants, Smart TVs, refrigerators and much more that can be found with Shodan!

Monitor Network Security

Keep track of all the computers on your network that are directly accessible from the Internet. Shodan lets you understand your digital footprint.

Get a Competitive Advantage

Who is using your product? Where are they located? Use Shodan to perform empirical market intelligence.

My favorite is the last one, “…to perform empirical market intelligence.” You bet!

There are free accounts so I signed up for one to see what I could see. 😉

Here are some of the popular saved searches (that you can see with a free account):

  • Webcam – best ip cam search I have found yet.
  • Cams – admin admin
  • Netcam – Netcam
  • dreambox – dreambox
  • default password – Finds results with “default password” in the banner; the named defaults might work!
  • netgear – user: admin pass: password
  • 108.223.86.43 – Trendnet IP Cam
  • ssh – ssh
  • Router w/ Default Info – Routers that give their default username/password as admin/1234 in their banner.
  • SCADA – SCADA systems search

With the free account, you can only see the first fifty (50) results for a search.

I’m not sure I agree that the pricing is “simple” but it is attractive. Note the difference between query credits and scan credits. The first applies to searches of the Shodan database and the second applies to networks you have targeted.

The 20K+ routers w/ default info could be a real hoot!

You know, this might be a cost effective alternative for the lower level NSA folks.

BTW, in case you are looking for it: the API documentation.
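If you would rather script your searches than click through the web UI, the official Python client makes that a short exercise. A minimal sketch, assuming you have run pip install shodan and have an API key from your account page (the key and query string are illustrative, and some search filters require a paid plan):

    import shodan

    api = shodan.Shodan("YOUR_API_KEY")  # key from your Shodan account page

    try:
        # Roughly the kind of query behind the MongoDB numbers above.
        results = api.search("product:MongoDB")
        print("Total matches:", results["total"])
        for match in results["matches"][:10]:  # free accounts see limited results
            print(match["ip_str"], match["port"], match.get("org", "?"))
    except shodan.APIError as exc:
        print("Shodan error:", exc)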

Definitely worth a bookmark and blog entries about your experiences with it.

It could well be that being insecure in large numbers is a form of cyberdefense.

Who is going to go after you when numerous larger, more well known targets are out there for the taking? And with IoT, the number of targets is going to increase geometrically.

June 29, 2015

Streaming Data IO in R

Filed under: JSON,MongoDB,R,Streams — Patrick Durusau @ 2:59 pm

Streaming Data IO in R – curl, jsonlite, mongolite by Jeroen Ooms.

Abstract:

The jsonlite package provides a powerful JSON parser and generator that has become one of the standard methods for getting data in and out of R. We discuss some recent additions to the package, in particular support for streaming (large) data over http(s) connections. We then introduce the new mongolite package: a high-performance MongoDB client based on jsonlite. MongoDB (from “humongous”) is a popular open-source document database for storing and manipulating very big JSON structures. It includes a JSON query language and an embedded V8 engine for in-database aggregation and map-reduce. We show how mongolite makes inserting and retrieving R data to/from a database as easy as converting it to/from JSON, without the bureaucracy that comes with traditional databases. Users that are already familiar with the JSON format might find MongoDB a great companion to the R language and will enjoy the benefits of using a single format for both serialization and persistency of data.

R, JSON, MongoDB, what’s not to like? 😉

From useR! 2015.

Enjoy!

May 18, 2015

A Virtual Database between MongoDB, ElasticSearch, and MarkLogic

Filed under: ElasticSearch,MarkLogic,MongoDB — Patrick Durusau @ 2:21 pm

A Virtual Database between MongoDB, ElasticSearch, and MarkLogic by William Candillon.

From the post:

Virtual Databases enable developers to write applications regardless of the underlying database technologies. We recently updated a database infrastructure from MongoDB and ElasticSearch to MarkLogic without touching the codebase.

We just flipped a switch. We updated the database infrastructure of an application (20k LOC) from MongoDB and Elasticsearch to MarkLogic without changing a single line of code.

Earlier this year, we published a tutorial that shows how the 28msec query technology can enable developers to write applications regardless of the underlying database technology. Recently, we had the opportunity to put it to the test on both a real world use case and a substantial codebase.

At 28msec, we have designed [1] and implemented [2] an open source modern data warehouse called CellStore. Whereas traditional data warehousing solutions can only support hundreds of fixed dimensions and thus need to ETL the data to analyze, cell stores support an unbounded number of dimensions. Our implementation of the cell store paradigm is around 20k lines of JSONiq queries. Originally the implementation was running on top of MongoDB and Elasticsearch.
….

1. http://arxiv.org/pdf/1410.0600.pdf
2. http://github.com/28msec/cellstore

Impressive work and it merits a separate post on the underlying technology, CellStore.

February 18, 2015

40000 vulnerable MongoDB databases

Filed under: Cybersecurity,MongoDB,Security — Patrick Durusau @ 5:17 pm

Discovered 40000 vulnerable MongoDB databases on the Internet by Pierluigi Paganini.

From the post:

Today MongoDB is used by many organizations, the bad news is that nearly 40,000 entities running MongoDB are exposed and vulnerable to risks of hacking attacks.

Three students from the University of Saarland in Germany, Kai Greshake, Eric Petryka and Jens Heyens, discovered that MongoDB databases running at TCP port 27017 as a service of several thousand commercial web servers are exposed on the Internet without proper defense measures.

In MongoDB databases at risk – Several thousand MongoDBs without access control on the Internet, Jens Heyens, Kai Greshake, Eric Petryka, report the cause as:

The reason for this problem is twofold:

  • The defaults of MongoDB are tailored for running it on the same physical machine or virtual machine instances.
  • The documentations and guidelines for setting up MongoDB servers with Internet access may not be sufficiently explicit when it comes to the necessity to activate access control, authentication, and transfer encryption mechanisms.

Err, “…may not be sufficiently explicit…?”

You think?

Looking at Install MongoDB on Ubuntu, do you see a word about securing access to MongoDB? Nope.

How about Security Introduction? A likely place for new users to check. Nope.

Authentication has your first clue about the localhost exception but doesn’t mention network access at all.

You finally have to reach Network Exposure and Security before you start learning how to restrict access to your MongoDB instance.

Or if you have grabbed the latest MongoDB documentation as a PDF file (2.6), the security information you need starts at page 286.

I set up a MongoDB instance a couple of weeks ago and remember being amazed that there wasn’t even a default admin password. As a former sysadmin I knew that was trouble, so I hunted through the documentation until I finally hit upon the necessary information.

Limiting access to a MongoDB instance should be covered in the installation document, in bold, perhaps even red, letters saying the security steps are necessary before starting your MongoDB instance.
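For the impatient, here is a minimal sketch of the check and the first fix, using current PyMongo (the user name and password are placeholders; after creating the user you still have to restart mongod with authorization enabled and bind it away from public interfaces):

    from pymongo import MongoClient
    from pymongo.errors import OperationFailure

    # Connect with no credentials; on an unsecured instance this just works.
    client = MongoClient("mongodb://localhost:27017/")

    try:
        print("Visible without auth:", client.list_database_names())
        # Wide open. Creating an admin user is the first step toward locking down.
        client.admin.command(
            "createUser", "admin",
            pwd="change-me",
            roles=[{"role": "userAdminAnyDatabase", "db": "admin"}],
        )
    except OperationFailure:
        print("Access denied -- authentication appears to be enabled.")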

Failure to provide security instructions has resulted in 39,890 vulnerable MongoDBs on the Internet.

Failed to be explicit? More like failed documentation. (full stop)

Users do a bad enough job with security without providing them with bad documentation.

Call me if you need $paid documentation assistance.

December 12, 2014

SlamData

Filed under: MongoDB,SlamData — Patrick Durusau @ 8:16 pm

SlamData

From the about page:

SlamData was formed in early 2014 in recognition that the primary methods for analytics on NoSQL data were far too complex and resource intensive. Even simple questions required learning new technologies, writing complex ETL processes or even coding. We created the SlamData project to address this problem.

In contrast to legacy vendors, which emphasize trying to make the data fit legacy analytics infrastructure, SlamData focuses on trying to make the analytics infrastructure fit the data.

The SlamData solution provides a common ANSI SQL compatible interface to NoSQL data. This makes modern NoSQL data accessible to anyone. SlamData retains the leading developers of the SlamData open source project and provides commercial support and training around the open source analytics technology.

I first encountered SlamData in MongoDB gets its first native analytics tool by Andrew C. Oliver, who writes in part:


In order to deal with the difference between documents and tables, SlamData extends SQL with an XPath-like notation. Rather than querying from a table name (or collection name), you might query FROM person[*].address[*].city. This should represent a short learning curve for SQL-loving data analysts or power business users, while being inconsequential for developers.

The power of SlamData resides in its back-end SlamEngine, which implements a multidimensional relational algorithm and deals with the data without reformatting the infrastructure. The JVM (Scala) back end supplies a REST interface, which allows developers to access SlamData’s algorithm for their own uses.

The SlamData front end and SlamEngine are both open source and waiting for you to download them.
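To make the path notation concrete, here is a plain-Python analogue of what FROM person[*].address[*].city unnests (the document shape is invented for illustration):

    # Invented nested documents of the shape the SlamData path
    # FROM person[*].address[*].city would traverse.
    people = [
        {"name": "Ada", "address": [{"city": "London"}, {"city": "Turin"}]},
        {"name": "Grace", "address": [{"city": "New York"}]},
    ]

    # Each [*] unnests an array; the trailing .city projects a field.
    cities = [addr["city"] for person in people for addr in person["address"]]
    print(cities)  # ['London', 'Turin', 'New York']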

My major curiosity is about the extension to SQL and the SlamEngine’s “multidimensional relational algorithm.”

I was planning on setting up MongoDB for something else so perhaps this will be the push to get that project started.

Enjoy!

March 12, 2014

Building a tweet ranking web app using Neo4j

Filed under: Graphs,MongoDB,Neo4j,node-js,Python,Tweets — Patrick Durusau @ 7:28 pm

Building a tweet ranking web app using Neo4j by William Lyon.

From the post:

I spent this past weekend hunkered down in the basement of the local Elk’s club, working on a project for a hackathon. The project was a tweet ranking web application. The idea was to build a web app that would allow users to login with their Twitter account and view a modified version of their Twitter timeline that shows them tweets ranked by importance. Spending hours every day scrolling through your timeline to keep up with what’s happening in your Twitter network? No more, with Twizzard!

The project uses the following components:

  • Node.js web application (using Express framework)
  • MongoDB database for storing basic user data
  • Integration with Twitter API, allowing for Twitter authentication
  • Python script for fetching Twitter data from Twitter API
  • Neo4j graph database for storing Twitter network data
  • Neo4j unmanaged server extension, providing additional REST endpoint for querying / retrieving ranked timelines per user

Looks like a great project and good practice as well!

Curious what you think of the ranking of tweets:

How can we score Tweets to show users their most important Tweets? Users are more likely to be interested in tweets from users they are more similar to and from users they interact with the most. We can calculate metrics to represent these relationships between users, adding an inverse time decay function to ensure that the content at the top of their timeline stays fresh.

That’s one measure of “importance.” Being able to assign a rank would be useful as well, say for the British Library.

Do take notice of the Jaccard similarity index.
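If the Jaccard index is new to you, it is just intersection over union, and a toy version of the kind of scoring the post describes fits in a few lines of Python (the weighting and half-life below are my guesses, not Twizzard’s actual formula):

    import math

    def jaccard(a, b):
        """Jaccard similarity: |A ∩ B| / |A ∪ B|, in [0, 1]."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def score(similarity, interactions, age_hours, half_life=24.0):
        # Inverse time decay keeps the top of the timeline fresh.
        decay = math.exp(-math.log(2) * age_hours / half_life)
        return similarity * (1 + interactions) * decay

    my_follows = {"neo4j", "mongodb", "python", "graphs"}
    their_follows = {"neo4j", "graphs", "espresso"}
    print(score(jaccard(my_follows, their_follows), interactions=3, age_hours=6))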

Would you say that possessing at least one identical string (id, subject identifier, subject indicator) is a form of similarity measure?

What other types of similarity measures do you think would be useful for topic maps?

I first saw this in a tweet by GraphemeDB.

August 27, 2013

MongoDB Training

Filed under: MongoDB — Patrick Durusau @ 6:11 pm

Free Online MongoDB Training

Classes include:

  • MongoDB for Java Developers
  • MongoDB for Node.js Developers
  • MongoDB for Developers
  • MongoDB for DBAs

The Fall semester is getting closer and you are thinking about classes, football, dates, parties, ….

MongoDB University can’t help you with the last three but it does have free classes.

You have to handle the other stuff on your own. 😉

PS: What books do you see next to the programmer in the picture? I see a C++ Nutshell book next to “The C Programming Language.” Anything else you recognize?

August 23, 2013

Aggregation Options on Big Data Sets Part 1… [MongoDB]

Filed under: Aggregation,MongoDB — Patrick Durusau @ 5:26 pm

Aggregation Options on Big Data Sets Part 1: Basic Analysis using a Flights Data Set by Daniel Alabi and Sweet Song, MongoDB Summer Interns.

From the post:

Flights Dataset Overview

This is the first of three blog posts from this summer internship project showing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

The first dataset explored was a domestic flights dataset. The Bureau of Transportation Statistics provides information for every commercial flight from 1987, but we narrowed down our project to focus on the most recent available data for the past year (April 2012-March 2013).

We were particularly attracted to this dataset because it contains a lot of fields that are well suited for manipulation using the MongoDB aggregation framework.

To get started, we wanted to answer a few basic questions concerning the dataset:

  1. When is the best time of day/day of week/time of year to fly to minimize delays?
  2. What types of planes suffer the most delays? How old are these planes?
  3. How often does a delay cascade into other flight delays?
  4. What was the effect of Hurricane Sandy on air transportation in New York? How quickly did the state return to normal?

A series of blog posts to watch!

I thought the comment:

We were particularly attracted to this dataset because it contains a lot of fields that are well suited for manipulation using the MongoDB aggregation framework.

was remarkably honest.

The Department of Transportation Table/Field guide reveals that the fields are mostly populated by codes, IDs and date/time values.

Values that lend themselves to easy aggregation.
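For question 1, for instance, the aggregation boils down to a single $group/$sort pipeline. A sketch with current PyMongo (database, collection and field names are guesses at the interns’ schema):

    from pymongo import MongoClient

    flights = MongoClient()["flying"]["flights"]  # names are assumed

    # Average arrival delay by day of week, best days first.
    pipeline = [
        {"$group": {"_id": "$dayOfWeek", "avgDelay": {"$avg": "$arrDelay"}}},
        {"$sort": {"avgDelay": 1}},
    ]
    for day in flights.aggregate(pipeline):
        print(day["_id"], round(day["avgDelay"], 1))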

Looking forward to harder aggregation examples as this series develops.

July 30, 2013

Comparing MongoDB, MySQL, and TokuMX Data Layout

Filed under: MongoDB,MySQL,TokuDB — Patrick Durusau @ 12:34 pm

Comparing MongoDB, MySQL, and TokuMX Data Layout by Zardosht Kasheff.

From the post:

A lot is said about the differences in the data between MySQL and MongoDB. Things such as “MongoDB is document based”, “MySQL is relational”, “InnoDB has a clustering key”, etc. Some may wonder how TokuDB, our MySQL storage engine, and TokuMX, our MongoDB product, fit in with these data layouts. I could not find anything describing the differences with a simple Google search, so I figured I’d write a post explaining how things compare.

So who are the players here? With MySQL, users are likely familiar with two storage engines: MyISAM, the original default up until MySQL 5.5, and InnoDB, the current default since MySQL 5.5. MongoDB has only one storage engine, and we’ll refer to it as “vanilla Mongo storage”. And of course, there is TokuDB for MySQL, and TokuMX.

First, let’s get some quick terminology out of the way. Documents and collections in MongoDB can be thought of as rows and tables in MySQL, respectively. And while not identical, fields in MongoDB are similar to columns in MySQL. A full SQL to MongoDB mapping can be found here. When I refer to MySQL, what I say applies to TokuDB, InnoDB, and MyISAM. When I say MongoDB, what I say applies to TokuMX and vanilla Mongo storage.

Great contrast of MongoDB and MySQL data formats.

Data formats are essential to understanding the capabilities and limitations of any software package.

June 21, 2013

TokuMX: High Performance for MongoDB

Filed under: Fractal Trees,Indexing,MongoDB,Tokutek — Patrick Durusau @ 6:20 pm

TokuMX: High Performance for MongoDB

From the webpage:

TokuMX™ for MongoDB is here!

Tokutek, whose Fractal Tree® indexing technology has brought dramatic performance and scalability to MySQL and MariaDB users, now brings those same benefits to MongoDB users.

TokuMX is open source performance-enhancing software for MongoDB that makes MongoDB more performant in large applications with demanding requirements. In addition to replacing B-tree indexing with more modern technology, TokuMX adds transaction support, document-level locking for concurrent writes, and replication.

You have seen the performance specs on Fractal Tree indexing.

Now they are available for MongoDB!

May 23, 2013

MongoDB: The Definitive Guide 2nd Edition is Out!

Filed under: MongoDB,NoSQL — Patrick Durusau @ 12:24 pm

MongoDB: The Definitive Guide 2nd Edition is Out! by Kristina Chodorow.

From the webpage:

The second edition of MongoDB: The Definitive Guide is now available from O’Reilly! It covers both developing with and administering MongoDB. The book is language-agnostic: almost all of the examples are in JavaScript.

Looking forward to enjoying the second edition as much as the first!

Although I am not really sure that always using JavaScript means you are “language-agnostic.” 😉

May 9, 2013

MongoDB as in-memory DB

Filed under: Cybersecurity,MongoDB,Security — Patrick Durusau @ 2:14 pm

How to use MongoDB as a pure in-memory DB (Redis style) by Antoine Girbal.

From the post:

There has been a growing interest in using MongoDB as an in-memory database, meaning that the data is not stored on disk at all. This can be super useful for applications like:

  • a write-heavy cache in front of a slower RDBMS system
  • embedded systems
  • PCI compliant systems where no data should be persisted
  • unit testing where the database should be light and easily cleaned

That would be really neat indeed if it was possible: one could leverage the advanced querying / indexing capabilities of MongoDB without hitting the disk. As you probably know the disk IO (especially random) is the system bottleneck in 99% of cases, and if you are writing data you cannot avoid hitting the disk.

One sweet design choice of MongoDB is that it uses memory-mapped files to handle access to data files on disk. This means that MongoDB does not know the difference between RAM and disk, it just accesses bytes at offsets in giant arrays representing files and the OS takes care of the rest! It is this design decision that allows MongoDB to run in RAM with no modification.

Reports getting 20K writes per second on a single core.
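Numbers like that are easy to sanity-check on your own hardware. A rough PyMongo sketch, assuming a mongod whose --dbpath sits on the tmpfs mount the post describes (acknowledged single-document inserts; batching or unacknowledged writes will go much faster):

    import time
    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017/")["bench"]["writes"]
    coll.drop()

    n = 100_000
    start = time.perf_counter()
    for i in range(n):
        coll.insert_one({"_id": i, "payload": "x" * 100})
    elapsed = time.perf_counter() - start
    print(f"{n / elapsed:,.0f} acknowledged inserts/sec")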

I can imagine topic map scenarios where no data should be persisted.

You?

April 19, 2013

How to Compare NoSQL Databases

Filed under: Aerospike,Benchmarks,Cassandra,Couchbase,Database,MongoDB,NoSQL — Patrick Durusau @ 12:45 pm

How to Compare NoSQL Databases by Ben Engber. (video)

From the description:

Ben Engber, CEO and founder of Thumbtack Technology, will discuss how to perform tuned benchmarking across a number of NoSQL solutions (Couchbase, Aerospike, MongoDB, Cassandra, HBase, others) and to do so in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results.

We also discuss how to extend benchmarking tests to simulate different kinds of failure scenarios to help evaluate the maintainability and recoverability of different systems. This requires carefully constructed tests and significant knowledge of the underlying databases — the talk will help evaluators overcome the common pitfalls and time sinks involved in trying to measure this.

Lastly we discuss the YCSB benchmarking tool, its significant limitations, and the significant extensions and supplementary tools Thumbtack has created to provide distributed load generation and failure simulation.

Ben makes a very good case for understanding the details of your use case versus the characteristics of particular NoSQL solutions.

Where you will find “better” performance depends on non-obvious details.

Watch the use of terms like “consistency” in this presentation.

The paper Ben refers to: Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs.

Forty-three pages of analysis and charts.

Slow but interesting reading.

If you are into the details of performance and NoSQL databases.

April 13, 2013

Mongraph

Filed under: MongoDB,Mongraph,Neo4j — Patrick Durusau @ 6:00 pm

Mongraph

From the readme:

Mongraph combines a document-storage database with graph-database relationships by creating a corresponding node for each document.
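The pattern is easy to picture: store the payload in MongoDB, mirror a node keyed by the _id in Neo4j, and keep the relationships on the graph side. A sketch of that idea in Python, using PyMongo and the official neo4j driver (connection details and names are placeholders, not Mongraph’s API):

    from pymongo import MongoClient
    from neo4j import GraphDatabase

    # Store the document in MongoDB...
    docs = MongoClient()["app"]["documents"]
    doc_id = docs.insert_one({"title": "hello", "body": "..."}).inserted_id

    # ...and mirror it as a node keyed by the Mongo _id, so relationships
    # live in the graph while the payload stays in the document store.
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))
    with driver.session() as session:
        session.run("MERGE (d:Document {mongo_id: $id})", id=str(doc_id))
    driver.close()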

Flies in the face of every app being the “universal” app orthodoxy but still worth watching.

March 26, 2013

Wanted: Evaluators to Try MongoDB with Fractal Tree Indexing

Filed under: Fractal Trees,MongoDB,Tokutek — Patrick Durusau @ 4:43 am

Wanted: Evaluators to Try MongoDB with Fractal Tree Indexing by Tim Callaghan.

From the post:

We recently resumed our discussion around bringing Fractal Tree indexes to MongoDB. This effort includes Tokutek’s interview with Jeff Kelly at Strata as well as my two recent tech blogs, which describe the compression achieved on a generic MongoDB data set and the performance improvements we measured using our implementation of Sysbench for MongoDB. I have a full line-up of benchmarks and blogs planned for the next few months as our project continues. Many of these will be deeply technical and written by the Tokutek developers.

We have a group of evaluators running MongoDB with Fractal Tree Indexes, but more feedback is always better. So …

Do you want to participate in the process of bringing high compression and extreme performance gains to MongoDB? We’re looking for MongoDB experts to test our build on your real-world workloads and benchmarks. Evaluator feedback will be used in creating the product road map. Please email me at tim@tokutek.com if interested.

You keep reading about the performance numbers on MongoDB.

Aren’t you curious if those numbers are true for your use case?

Here’s your opportunity to find out!

March 25, 2013

CSA: Upgrade Immediately to MongoDB 2.4.1

Filed under: MongoDB — Patrick Durusau @ 4:08 am

CSA: Upgrade Immediately to MongoDB 2.4.1

Alex Popescu advises:

If you are running MongoDB 2.4, upgrade immediately to 2.4.1. Details here.

March 19, 2013

MongoDB 2.4 Release

Filed under: Lucene,MongoDB,NoSQL,Searching,Solr — Patrick Durusau @ 1:11 pm

MongoDB 2.4 Release

From the webpage:

Developer Productivity

  • Capped Arrays simplify development by making it easy to incorporate fixed, sorted lists for features like leaderboards and logging.
  • Geospatial Enhancements enable new use cases with support for polygon intersections and analytics based on geospatial data.
  • Text Search provides a simplified, integrated approach to incorporating search functionality into apps (Note: this feature is currently in beta release).

Operations

  • Hash-Based Sharding simplifies deployment of large MongoDB systems.
  • Working Set Analyzer makes capacity planning easier for ops teams.
  • Improved Replication increases resiliency and reduces administration.
  • Mongo Client creates an intuitive, consistent feature set across all drivers.

Performance

  • Faster Counts and Aggregation Framework Refinements make it easier to leverage real-time, in-place analytics.
  • V8 JavaScript Engine offers better concurrency and faster performance for some operations, including MapReduce jobs.

Monitoring

  • On-Prem Monitoring provides comprehensive monitoring, visualization and alerting on more than 100 operational metrics of a MongoDB system in real time, based on the same application that powers 10gen’s popular MongoDB Monitoring Service (MMS). On-Prem Monitoring is only available with MongoDB Enterprise.



Security
….

  • Kerberos Authentication enables enterprise and government customers to integrate MongoDB into existing enterprise security systems. Kerberos support is only available in MongoDB Enterprise.
  • Role-Based Privileges allow organizations to assign more granular security policies for server, database and cluster administration.

You can read more about the improvements to MongoDB 2.4 in the Release Notes. Also, MongoDB 2.4 is available for download on MongoDB.org.

Lots to look at in MongoDB 2.4!

But I am curious about the beta text search feature.

MongoDB Text Search: Experimental Feature in MongoDB 2.4 says:

Text search (SERVER-380) is one of the most requested features for MongoDB. 10gen is working on an experimental text-search feature, to be released in v2.4, and we’re already seeing some talk in the community about the native implementation within the server. We view this as an important step towards fulfilling a community need.

MongoDB text search is still in its infancy and we encourage you to try it out on your datasets. Many applications use both MongoDB and Solr/Lucene, but realize that there is still a feature gap. For some applications, the basic text search that we are introducing may be sufficient. As you get to know text search, you can determine when MongoDB has crossed the threshold for what you need. (emphasis added)
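Trying the beta out is straightforward. A sketch against a 2.4-era server with PyMongo (in 2.4 text search had to be switched on at startup and was exposed as a command rather than a query operator; the database and field names are illustrative):

    from pymongo import MongoClient, TEXT

    # mongod must have been started with:
    #   mongod --setParameter textSearchEnabled=true
    db = MongoClient()["blog"]
    db.posts.create_index([("body", TEXT)])

    # 2.4 exposes text search as a database command, not a query operator.
    res = db.command("text", "posts", search="mongodb indexing")
    for hit in res.get("results", []):
        print(round(hit["score"], 2), hit["obj"]["body"][:60])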

So, why isn’t MongoDB incorporating Solr/Lucene instead of a home grown text search feature?

Seems like users could leverage their Solr/Lucene skills with their MongoDB installations.

Yes?

March 8, 2013

Databases & Dragons

Filed under: MongoDB,Software — Patrick Durusau @ 5:17 pm

Databases & Dragons by Kristina Chodorow.

From the post:

Here are some exercises to battle-test your MongoDB instance before going into production. You’ll need a Database Master (aka DM) to make bad things happen to your MongoDB install and one or more players to try to figure out what’s going wrong and fix it.

Should be of interest if you are developing MongoDB to go into production.

The idea should also be of interest if you are developing other software to go into production.

Most software (not all) works fine with expected values, other components responding correctly, etc.

But those are the very conditions your software may not encounter in production.

Where’s your “databases & dragons” test for your software?

March 1, 2013

MongoDB + Fractal Tree Indexes = High Compression

Filed under: Fractal Trees,Indexing,MongoDB,Requirements — Patrick Durusau @ 5:31 pm

MongoDB + Fractal Tree Indexes = High Compression by Tim Callaghan.

You may have heard that MapR Technologies broke the MinuteSort Record by sorting 15 billion 100-byte records in 60 seconds. They used 2,103 virtual instances in the Google Compute Engine, each with four virtual cores and one virtual disk, totaling 8,412 virtual cores and 2,103 virtual disks. See: Google Compute Engine, MapR Break MinuteSort Record.

So, the next time you have 8,412 virtual cores and 2,103 virtual disks, you know what is possible. 😉

But if you have less firepower than that, you will need to be clever:

One doesn’t have to look far to see that there is strong interest in MongoDB compression. MongoDB has an open ticket from 2009 titled “Option to Store Data Compressed” with Fix Version/s planned but not scheduled. The ticket has a lot of comments, mostly from MongoDB users explaining their use-cases for the feature. For example, Khalid Salomão notes that “Compression would be very good to reduce storage cost and improve IO performance” and Andy notes that “SSD is getting more and more common for servers. They are very fast. The problems are high costs and low capacity.” There are many more in the ticket.

In prior blogs we’ve written about significant performance advantages when using Fractal Tree Indexes with MongoDB. Compression has always been a key feature of Fractal Tree Indexes. We currently support the LZMA, quicklz, and zlib compression algorithms, and our architecture allows us to easily add more. Our large block size creates another advantage as these algorithms tend to compress large blocks better than small ones.

Given the interest in compression for MongoDB and our capabilities to address this functionality, we decided to do a benchmark to measure the compression achieved by MongoDB + Fractal Tree Indexes using each available compression type. The benchmark loads 51 million documents into a collection and measures the size of all files in the file system (--dbpath).
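Measuring that yourself is simple enough: point a directory walk at the --dbpath and total the file sizes before and after loading. A small Python sketch (the path is illustrative):

    import os

    def dbpath_bytes(dbpath):
        """Total size of all files under a mongod --dbpath."""
        return sum(
            os.path.getsize(os.path.join(root, name))
            for root, _dirs, files in os.walk(dbpath)
            for name in files
        )

    print(dbpath_bytes("/var/lib/mongodb") / 1e9, "GB")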

More benchmarks to follow and you should remember that all benchmarks are just that, benchmarks.

Benchmarks do not represent experience with your data, under your operating load and network conditions, etc.

Investigate software based on the first, purchase software based on the second.

February 20, 2013

NoSQL is Great, But You Still Need Indexes [MongoDB for example]

Filed under: Fractal Trees,Indexing,MongoDB,NoSQL,TokuDB,Tokutek — Patrick Durusau @ 9:23 pm

NoSQL is Great, But You Still Need Indexes by Martin Farach-Colton.

From the post:

I’ve said it before, and, as is the nature of these things, I’ll almost certainly say it again: your database performance is only as good as your indexes.

That’s the grand thesis, so what does that mean? In any DB system — SQL, NoSQL, NewSQL, PostSQL, … — data gets ingested and organized. And the system answers queries. The pain point for most users is around the speed to answer queries. And the query speed (both latency and throughput, to be exact) depends on how the data is organized. In short: Good Indexes, Fast Queries; Poor Indexes, Slow Queries.

But building indexes is hard work, or at least it has been for the last several decades, because almost all indexing is done with B-trees. That’s true of commercial databases, of MySQL, and of most NoSQL solutions that do indexing. (The ones that don’t do indexing solve a very different problem and probably shouldn’t be confused with databases.)

It’s not true of TokuDB. We build Fractal Tree Indexes, which are much easier to maintain but can still answer queries quickly. So with TokuDB, it’s Fast Indexes, More Indexes, Fast Queries. TokuDB is usually thought of as a storage engine for MySQL and MariaDB. But it’s really a B-tree substitute, so we’re always on the lookout for systems where we can improve the indexing.

Enter MongoDB. MongoDB is beloved because it makes deployment fast. But when you peel away the layers, you get down to a B-tree, with all the performance headaches and workarounds that they necessitate.

That’s the theory, anyway. So we did some testing. We ripped out the part of MongoDB that takes care of secondary indices and plugged in TokuDB. We’ve posted the blogs before, but here they are again, the greatest hits of TokuDB+MongoDB: we show a 10x insertion performance, a 268x query performance, and a 532x (or 53,200% if you prefer) multikey index insertion performance. We also discussed covered indexes vs. clustered Fractal Tree Indexes.

Did somebody declare February 20th to be performance release day?

Did I miss that memo? 😉

Like every geek, I like faster. But, here’s my question:

Have there been any studies on the impact of faster systems on searching and decision making by users?

My assumption is that the faster I get a non-responsive result from a search, the sooner I can improve it.

But that’s an assumption on my part.

Is that really true?

January 17, 2013

MongoDB Text Search Tutorial

Filed under: MongoDB,Search Engines,Searching,Text Mining — Patrick Durusau @ 7:26 pm

MongoDB Text Search Tutorial by Alex Popescu.

From the post:

Today is the day of the experimental MongoDB text search feature. Tobias Trelle continues his posts about this feature, providing examples of the query syntax (negation, phrase search), filtering and projections, indexing of multiple text fields, and details about the stemming solution used (Snowball). According to the previous post, even more advanced queries should be supported.
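For a taste of the syntax Trelle walks through, here is how phrase search, negation, and filtering/projection look via the 2.4 text command in PyMongo (collection and field names are made up):

    from pymongo import MongoClient

    db = MongoClient()["blog"]  # 2.4-era server with textSearchEnabled=true

    # Phrase search: quote the phrase inside the search string.
    db.command("text", "posts", search='"fractal tree"')

    # Negation: a leading minus excludes documents containing the term.
    db.command("text", "posts", search="mongodb -mysql")

    # Filtering and projection ride along as extra command fields.
    db.command("text", "posts", search="indexing",
               filter={"status": "published"}, project={"title": 1})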

Alex also has a list of his posts on the text search feature for MongoDB.

December 30, 2012

MongoDB Puzzlers #1

Filed under: MongoDB,Query Language — Patrick Durusau @ 9:48 pm

MongoDB Puzzlers #1 by Kristina Chodorow.

If you are not too deeply invested in the fiscal cliff debate, ;-), you may enjoy the distraction of a puzzler based on the MongoDB query language.

Collecting puzzlers for MongoDB and other query languages would be a good idea.

Something to be enjoyed in times of national “crisis,” aka, collective hand wringing by the media.

When is “Hello World,” Not “Hello World?”

Filed under: Graphs,MongoDB,Neo4j,Software — Patrick Durusau @ 8:43 pm

To answer that question, you need to see the post: Travel NoSQL Application – Polyglot NoSQL with SpringData on Neo4J and MongoDB.

Just a quick sample:

In this Fuse day, Tikal Java group decided to continue its previous Fuse research for NoSQL, but this time from a different point of view – SpringData and polyglot persistence. We had two goals in this Fuse day: try working with more than one NoSQL in the same application, and also taking advantage of SpringData data access abstractions for NoSQL databases. We decided to take MongoDB as document DB and Neo4J as graph database and put them behind an existing, classic and well known application – the Spring Travel sample application.

It is more than the usual “Hello World” example for languages and a bit more than for most applications.

It would be a nice trend to see more robust, perhaps “Hello World+” examples.

What is your enhanced “Hello World+” going to look like in 2013?

December 16, 2012

Searching an Encrypted Document Collection with Solr4, MongoDB and JCE

Filed under: Encryption,MongoDB,Security,Solr — Patrick Durusau @ 8:00 pm

Searching an Encrypted Document Collection with Solr4, MongoDB and JCE by Sujit Pal.

From the post:

A while back, someone asked me if it was possible to make an encrypted document collection searchable through Solr. The use case was patient records – the patient is the owner of the records, and the only person who can search through them, unless he temporarily grants permission to someone else (for example his doctor) for diagnostic purposes. I couldn’t come up with a good way of doing it off the bat, but after some thought, came up with a design that roughly looked like the picture below:

With privacy being all the rage, a very timely post.

Not to mention an opportunity to try out Solr4.

October 31, 2012

MongoSV 2012

Filed under: Conferences,MongoDB — Patrick Durusau @ 2:54 pm

MongoSV 2012

From the webpage:

December 4th Santa Clara, CA

MongoSV is an annual one-day conference in Silicon Valley, CA, dedicated to the open source, non-relational database MongoDB.

There are five (5) tracks, morning and afternoon sessions, a final session followed by a conference party from 5:30 PM to 8 PM.

Any summary is going to miss something of interest for someone. Take the time to review the schedule.

While you are there, register for the conference as well. A unique annual opportunity to mix-n-meet with MongoDB enthusiasts!

MongoDB and Fractal Tree Indexes (Webinar) [13 November 2012]

Filed under: Fractal Trees,MongoDB — Patrick Durusau @ 11:05 am

Webinar: MongoDB and Fractal Tree Indexes by Tim Callaghan.

From the post:

This webinar covers the basics of B-trees and Fractal Tree Indexes, the benchmarks we’ve run so far, and the development road map going forward.

Date: November 13th
Time: 2 PM EST / 11 AM PST
REGISTER TODAY

If you aren’t familiar with Fractal Tree Indexes and MongoDB, this is your opportunity to catch up!

October 24, 2012

Online Education- MongoDB and Oracle R Enterprise

Filed under: MongoDB,Oracle,R — Patrick Durusau @ 7:03 pm

Online Education- MongoDB and Oracle R Enterprise by Ajay Ohri.

Ajay brings news of two MongoDB online courses, one for developers and one for DBAs, and an Oracle offering on R.

The MongoDB classes started Monday (22nd of October) so you had better hurry to register.

October 18, 2012

10gen: Growing the MongoDB world

Filed under: MongoDB — Patrick Durusau @ 10:36 am

10gen: Growing the MongoDB world by Dj Walker-Morgan.

From the post:

10gen, the company set up by the creators of the open source NoSQL database MongoDB, has been on a roll recently, creating business partnerships with numerous companies, making it a hot commercial proposition without creating any apparent friction with its open source community. So what has brought MongoDB to the fore?

One factor has been how easy it is to get up and running with the database, a feature that the company wants to actively maintain. 10gen president Max Schireson explained: “I think that it’s honestly a combination of the functionality of MongoDB itself, but also the effort that we’ve invested in packaging for the open source community. I see some open source companies taking the approach of ‘oh yeah the code’s open source but you’ll need a PhD to actually get a working build of it unless you are a subscriber’. While that might help monetisation, that’s not a way to build a big community”.

Schireson says the company isn’t going to stand still though: although it’s easy to get a single node up and running, over time they want to make it easier to get more complex, sharded implementations configured and deployed. “As people use more and more functionality, that of necessity brings in more complexity, we’re looking for ways to make that easier,” he says, pointing to the cluster manager being developed as a native part of MongoDB, which should make it easier to manage and upgrade clusters.

Always appreciate a plug for good documentation.

May not work for you but it certainly worked here.

October 9, 2012

How MongoDB’s Journaling Works

Filed under: MongoDB — Patrick Durusau @ 2:05 pm

How MongoDB’s Journaling Works by Kristina Chodorow.

From the post:

I was working on a section on the gooey innards of journaling for The Definitive Guide, but then I realized it’s an implementation detail that most people won’t care about. However, I had all of these nice diagrams just laying around.

Well, journaling may be “an implementation detail,” but Kristina explains it well and some “implementation details” shape our views of what is or isn’t possible.

Doesn’t hurt to know more than we did when we started reading the post.

Is your appreciation of journaling the same or different after reading Kristina’s post?

September 14, 2012

Looking for MongoDB users to test Fractal Tree Indexing

Filed under: Fractal Trees,Indexing,MongoDB,Tokutek — Patrick Durusau @ 10:03 am

Looking for MongoDB users to test Fractal Tree Indexing by Tim Callaghan.

In my three previous blogs I wrote about our implementation of Fractal Tree Indexes on MongoDB, showing a 10x insertion performance increase, a 268x query performance increase, and a comparison of covered indexes and clustered indexes. The benchmarks show the difference that rich and efficient indexing can make to your MongoDB workload.

It’s one thing for us to benchmark MongoDB + TokuDB and another to measure real world performance. If you are looking for a way to improve the performance or scalability of your MongoDB deployment, we can help and we’d like to hear from you. We have a preview build available for MongoDB v2.2 that you can run with your existing data folder, drop/add Fractal Tree Indexes, and measure the performance differences. Please email me at tim@tokutek.com if interested.

Here is your chance to try these speed improvements out on your data!
