Big data can refer to any large amount of data in any form. It could be data that is somewhat structured in a more traditional database sense, or it could be a completely unstructured collection of files. Regardless of the type, managing it in the cloud requires some special tools. These are some of the technologies that might help you along the way.
Apache Hadoop – This open source project is a distributed computing platform that spreads data processing across a cluster of machines. Its distributed file system, HDFS, stores data regardless of how it is structured, if at all.
MapReduce – A programming model for distributed computing over large data sets: a map step transforms input records into key-value pairs, and a reduce step aggregates the values for each key. Hadoop includes an open source implementation of this model.
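The MapReduce idea can be sketched in a few lines of ordinary Python. This toy word count runs the map, shuffle, and reduce phases in one process; in a real cluster, the framework would distribute each phase across many machines. The function names and sample documents are illustrative, not part of any real framework.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does across nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key (here, sum the counts).
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data in the cloud", "big data tools"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["big"] is 2, counts["cloud"] is 1
```

The appeal of the model is that map and reduce are both trivially parallel: map can run on each input split independently, and reduce can run on each key group independently.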
Apache Cassandra – Originally developed by Facebook, this open source data management system is designed to handle high-volume, high-traffic scenarios without the stringent structural requirements of a traditional SQL database.
Amazon DynamoDB – A cloud-based, fully managed NoSQL database service, DynamoDB takes a lot of work off your shoulders by performing storage and indexing tasks for you. It can integrate with Hadoop via Amazon Elastic MapReduce and is available as a service, billed on a throughput basis.
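One idea Cassandra and DynamoDB share is hash partitioning: each row's partition key is hashed to decide which node stores it, so data and traffic spread evenly across the cluster. Here is a minimal in-memory sketch of that idea, assuming a fixed node count; the class name, key format, and API are all illustrative and not any real driver.

```python
import hashlib

class PartitionedStore:
    """Toy key-value store that hash-partitions rows across nodes,
    loosely in the spirit of Cassandra/DynamoDB partitioning.
    Purely illustrative; not a real client library."""

    def __init__(self, num_nodes=4):
        # Each "node" is just an in-memory dict in this sketch.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the partition key to pick a node deterministically.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)

store = PartitionedStore()
store.put("user:42", {"name": "Ada"})
```

Because the hash is deterministic, any client can locate a row without a central index lookup; production systems refine this with consistent hashing so nodes can join and leave without rehashing everything.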
There are many other big data tools out there, both proprietary and open source, and many large enterprise database vendors now offer their own big data storage and analysis applications and appliances. Most businesses will want a combination of on-premises, private cloud, and public cloud systems in what is now best described as a hybrid cloud.