
Big Data

What is Big Data? 

As you all know, nowadays there are dozens of devices around us, each with many sensors... For example, I have a smartphone, a laptop, a smart TV, a tablet, a TV box, a washing machine, an IPTV box, a smart watch... All of these have different systems and sensors, and all of them can connect to the internet. Besides, I have lots of devices which cannot connect to the internet. I don't need the data from these devices' sensors to be processed, but their operating systems do it anyway. All of them create MBs or GBs of data. 

Let's see the big picture. For example, a plane has thousands of sensors. In just one flight, a plane's systems create GBs or even TBs of data. This data needs to be processed. And something does this.

Or think about Facebook, Instagram, Twitter, Google, Amazon... Have you ever thought about how they can process all that data? At almost the same time, millions of people search a query, or share text, photos, videos or something else. Or think of government data: citizenship numbers, social security data, addresses, criminal records, photos, health records... This data has to be analyzed, processed and stored quickly. And of course reported easily and quickly too. Think about it... What a huge system, what a complex software bundle...

Traditional software cannot deal with data that is too large. So people needed a new approach, new software, new systems.

Big data technologies can easily handle all the data, even when it is too large for a single machine. They can handle petabytes, exabytes or more... They use distributed processing technology: big data systems build clusters of machines and run the work on them in parallel.
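The distributed idea can be illustrated with the classic MapReduce word-count pattern. This is just a minimal single-machine sketch in plain Python (the chunking and function names are my own invention, not from any particular framework); real systems like Hadoop run the map and reduce phases across many cluster nodes:

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    # Each mapper counts words in its own chunk independently,
    # so chunks could live on (and be processed by) different machines.
    return Counter(chunk.split())

def reduce_phase(a, b):
    # The reducer merges the partial counts into one final result.
    return a + b

# Pretend these three chunks are stored on three different cluster nodes.
chunks = ["big data big systems", "data pipelines", "big pipelines"]
partial_counts = [map_phase(c) for c in chunks]   # map step
total = reduce(reduce_phase, partial_counts)      # reduce step
print(total["big"])   # -> 3
```

Because the mappers never need to see each other's chunks, the same pattern scales from this toy example to thousands of machines.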

So how can we handle all of this?

First of all, I searched for and began to learn open source solutions. There are some popular software projects developed and maintained by the Apache Software Foundation. As you know, this is an open source community.

Steps of processing big data:
  1. Data input
  2. Collecting and queuing this data (Kafka)
  3. MapReduce step: processing, analyzing (Hadoop and/or Spark)
  4. Saving in a database (Hadoop HDFS, HBase, MongoDB)
  5. Showing as a report on a dashboard
  1. Data Input: 
    This is the first step. This is the data you want to process. It comes from your social media application, your devices' sensors, your sales team etc...
  2. Apache Kafka: 
    Apache Kafka is a distributed streaming platform. It can be used for building real-time streaming data pipelines. 
  3. MapReduce Step: 
    Apache Hadoop and Apache Spark: 
    These are frameworks and collections of libraries to handle big data. They use distributed computing technologies on clustered computers. The difference between them is that Hadoop has its own file system (HDFS) but Spark does not have one. Developers claim Spark is much faster than Hadoop.
    Apache Hive and Apache Pig:
    These are platforms for reading, writing and managing large datasets residing in distributed cluster storage. 
    Hive uses an SQL-like query language (HiveQL)
    Pig uses its own scripting language (Pig Latin)
  4. Saving the Data in a Database:
    After processing and analyzing your data, you need to store it.
  5. Showing as a Report on a Dashboard:
    You need to code yourself a dashboard which gets the data from your database and shows it to you. The interface varies according to your needs. There may be ready-to-use solutions but I did not have time to search for them.
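To make the five steps above concrete, here is a toy end-to-end sketch using only the Python standard library. A `queue.Queue` stands in for Kafka, a simple aggregation stands in for the MapReduce step, and an in-memory SQLite database stands in for the storage layer; the sensor names and values are invented for illustration:

```python
import queue
import sqlite3

# Steps 1-2: data input, collected into a queue (a stand-in for Kafka).
events = queue.Queue()
for reading in [("sensor_a", 10), ("sensor_b", 5), ("sensor_a", 7)]:
    events.put(reading)

# Step 3: processing/analyzing - aggregate the readings per sensor.
totals = {}
while not events.empty():
    sensor, value = events.get()
    totals[sensor] = totals.get(sensor, 0) + value

# Step 4: saving the results in a database (SQLite instead of HDFS/HBase).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE report (sensor TEXT, total INTEGER)")
db.executemany("INSERT INTO report VALUES (?, ?)", totals.items())

# Step 5: the "dashboard" - report the stored results.
for sensor, total in db.execute("SELECT sensor, total FROM report ORDER BY sensor"):
    print(sensor, total)
```

In a real deployment each of these stand-ins would be a distributed system of its own, but the shape of the pipeline is the same: ingest, queue, process, store, report.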

I am a newbie at this... I am searching and learning these great technologies and I will update here as I learn.
____________________________
My way to publish something here:
  1. Post something on the blog.
  2. Then update the pages if the post is about one of the pages in the "pages" section.
  3. The period between the first and the second step may take a few days or weeks.
So if you search the entire site, maybe you can find out more.
