The Technologies behind Hive

In this article, we will briefly introduce the backend technologies behind Hive.

Modern web development follows different methodologies from traditional software engineering. Development cycles are much shorter than they used to be, and user requirements can change quickly, which demands flexibility in production and, ultimately, in the choice of programming language. A comparatively heavyweight language such as Java is not a good fit for the fast pace this project requires.

Several years ago, PHP was a popular choice for taking a prototype to production within a few weeks. However, once the code base grows beyond a certain size, PHP code becomes very hard to maintain [1]. In recent years, Python and Ruby have become better choices in terms of code maintenance and ecosystem support.

We chose Python over Ruby simply because one of our group members is more familiar with it.

Python has particularly strong support for web development. Numerous web frameworks, such as Quixote, Django, web.py and Tornado, let developers pick the one that best suits their needs. These frameworks greatly reduce the workload, so developers can spend their time on the application rather than on low-level logic such as parsing raw HTTP requests. With their help, simple RESTful web services can be built in a short time.

The arrival of Web 2.0 greatly increased the interaction between users and web applications. As a result, web applications need to serve a large number of client requests. The difficulty of handling many concurrent HTTP connections is known as the C10K problem [2]. Tornado and nginx are good examples of how to address the C10K problem: both use non-blocking I/O to handle large numbers of connections, especially when clients hold long-lived connections to the backend servers, as in Comet.
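To make the non-blocking idea concrete, here is a minimal sketch. It uses Python's standard asyncio module rather than Tornado's own event loop, and the names handle_client, serve_all and run_demo are hypothetical. Each simulated client stands in for a mostly idle long-lived connection, and all of them are multiplexed on a single thread.

```python
import asyncio
import time
from typing import List, Tuple


async def handle_client(client_id: int, wait_seconds: float) -> str:
    # Stand-in for a long-lived, mostly idle connection (e.g. a Comet
    # long-poll). Awaiting hands control back to the event loop instead
    # of blocking a thread while nothing is happening.
    await asyncio.sleep(wait_seconds)
    return "client %d served" % client_id


async def serve_all(num_clients: int, wait_seconds: float) -> List[str]:
    # Every simulated connection runs concurrently on one thread.
    pending = [handle_client(i, wait_seconds) for i in range(num_clients)]
    return await asyncio.gather(*pending)


def run_demo(num_clients: int = 1000, wait_seconds: float = 0.1) -> Tuple[int, float]:
    # Returns how many clients were served and the wall-clock time taken.
    start = time.perf_counter()
    results = asyncio.run(serve_all(num_clients, wait_seconds))
    return len(results), time.perf_counter() - start
```

Because the waits overlap, the total time stays close to a single wait rather than growing with the number of clients, which is the property that makes this model suitable for many idle connections.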

Hive uses nginx to serve static files such as JavaScript, images and CSS. It also acts as a load balancer, routing requests to the app servers running on Tornado.
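The routing decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not nginx configuration or Hive's actual setup: the /static/ prefix, the backend addresses and the names RoundRobinBalancer and route are all assumptions, and round-robin is used because it is nginx's default balancing method.

```python
import itertools
from typing import Iterable


class RoundRobinBalancer:
    """Hands out backends in turn, one per request."""

    def __init__(self, backends: Iterable[str]) -> None:
        backend_list = list(backends)
        if not backend_list:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backend_list)

    def pick(self) -> str:
        return next(self._cycle)


def route(path: str, balancer: RoundRobinBalancer,
          static_prefix: str = "/static/") -> str:
    # Static assets are answered by the front server itself; everything
    # else is forwarded to the next app server in the rotation.
    if path.startswith(static_prefix):
        return "nginx (static file)"
    return balancer.pick()


# Hypothetical addresses for two Tornado app server processes.
balancer = RoundRobinBalancer(["127.0.0.1:8001", "127.0.0.1:8002"])
```

Calling route("/api/posts", balancer) repeatedly alternates between the two hypothetical app servers, while any path under /static/ never reaches them.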

Tornado is both a web server and a web framework. It is written purely in Python and performed well serving FriendFeed, which was later acquired by Facebook.

The final choice is the database. This area has been the subject of much debate because of the NoSQL movement. As the name suggests, a NoSQL database is essentially a database without SQL semantics. It usually provides a simple key-value data model and performs well under heavy reads and writes over large datasets. However, because of the limited data model, much of the data logic ends up in the application. This increases developer work and makes maintenance harder. More importantly, most NoSQL databases have not yet proven their reliability in large distributed environments. On the other hand, although a traditional RDBMS such as MySQL has its limitations when scalability becomes important, careful application-level partitioning can avoid this drawback [3].
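As a rough illustration of application-level partitioning, the sketch below maps each user to one of several MySQL instances by hashing the user ID. It is not Hive's actual scheme: the shard names and the functions shard_index and database_for are hypothetical, and hashing by user ID is just one possible partitioning key.

```python
import hashlib
from typing import Sequence


def shard_index(user_id: str, num_shards: int) -> int:
    # A stable hash keeps the mapping identical across processes and
    # restarts (Python's built-in hash() is randomized for strings).
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def database_for(user_id: str, shards: Sequence[str]) -> str:
    # The application, not the database, decides where a row lives.
    return shards[shard_index(user_id, len(shards))]


# Hypothetical names for three MySQL instances.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2"]
```

All queries for a given user then go to database_for(user_id, SHARDS). One known trade-off of plain modulo hashing is that changing the number of shards remaps most keys, so the shard count has to be planned in advance or replaced with a lookup table.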

Since our group members are more familiar with MySQL and the case for NoSQL has not been persuasive, we chose MySQL as our main storage system. A NoSQL database such as Redis could still be added as a cache when the system reaches its capacity.

In conclusion, we chose Python as the main programming language of the application and Tornado as the web framework and server. Static files are served by nginx, which also acts as a load balancer, while MySQL stores the rest of the data.
