Diving into HDFS by Julia Evans.
From the post:
Yesterday I wanted to start learning about how HDFS (the Hadoop Distributed File System) works internally. I knew that
- It’s distributed, so one file may be stored across many different machines
- There’s a namenode, which keeps track of where all the files are stored
- There are data nodes, which contain the actual file data
But I wasn’t quite sure how to get started! I knew how to navigate the filesystem from the command line (hadoop fs -ls /, and friends), but not how to figure out how it works internally.
Colin Marc pointed me to this great library called snakebite which is a Python HDFS client. In particular he pointed me to the part of the code that reads file contents from HDFS. We’re going to tear it apart a bit and see what exactly it does!
…
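To make the three bullet points in the excerpt concrete, here is a toy, in-memory sketch of the split between a namenode and data nodes. It is my own illustration, not code from Julia's post or from snakebite, and it is not the real HDFS protocol: the names ToyNameNode, ToyDataNode, write_file and read_file are hypothetical, the 4-byte block size is chosen only so the example is easy to follow, and replication is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataNode:
    """Hypothetical data node: holds raw block bytes, keyed by block id."""
    blocks: dict = field(default_factory=dict)


@dataclass
class ToyNameNode:
    """Hypothetical namenode: holds only metadata, never file contents."""
    # path -> [(block_id, datanode_index), ...]
    files: dict = field(default_factory=dict)


def write_file(namenode, datanodes, path, data, block_size=4):
    """Split data into fixed-size blocks and place them round-robin."""
    locations = []
    for i, offset in enumerate(range(0, len(data), block_size)):
        block_id = f"{path}#{i}"
        node_index = i % len(datanodes)
        datanodes[node_index].blocks[block_id] = data[offset:offset + block_size]
        locations.append((block_id, node_index))
    namenode.files[path] = locations


def read_file(namenode, datanodes, path):
    """Ask the namenode where the blocks live, then fetch each one in order."""
    return b"".join(
        datanodes[node_index].blocks[block_id]
        for block_id, node_index in namenode.files[path]
    )


namenode = ToyNameNode()
datanodes = [ToyDataNode(), ToyDataNode(), ToyDataNode()]
write_file(namenode, datanodes, "/hello.txt", b"hello, hdfs!")

print(namenode.files["/hello.txt"])  # block ids and which node holds each
print(read_file(namenode, datanodes, "/hello.txt"))
```

The point of the sketch is only the division of labor: the namenode answers "where are the blocks?", and the bytes themselves come from the data nodes. How a real client does this over the wire is what Julia's post digs into.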
Be cautious reading Julia’s post!
Her enthusiasm can be infectious. 😉
Seriously, I take Julia’s posts as the way CS topics are supposed to be explored. While there is hard work, there is also the thrill of discovery. Not a bad approach to have.