What Do You Know About Greenplum - Open Source Data Warehouse?


What is Greenplum?

Greenplum is an advanced, fully featured open source data warehouse. It provides powerful and rapid analytics on petabyte-scale data volumes.


Greenplum Database is uniquely geared toward big data analytics. It is powered by what the project describes as the world's most advanced cost-based query optimizer, which delivers high analytical query performance on large data volumes.

Greenplum Database is:
  • A highly scalable, shared-nothing database.
  • A platform for advanced analytics on any (and all) data.
  • An enterprise-ready platform capable of flexing with your needs.

"Greenplum is an MPP database."

What is MPP?

MPP stands for massively parallel processing: the coordinated processing of a program by multiple processors (cores, CPUs, or systems), which requires a messaging interface to coordinate the processing of data.

This architecture is commonly called “shared nothing” or loosely coupled. The process begins with the client issuing a query, which is passed to the Master node. The Master holds the data dictionary and session information, which it uses to generate an execution plan designed to retrieve the needed information from each underlying node.

Parallel execution is the implementation of the execution plan generated by the Master node. Each node executes the unique query it has received and returns its result, which the Master gathers and delivers to the client.
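
To make the flow above a bit more concrete, here is a minimal Python sketch of the scatter/gather idea. It is only a toy model and not how Greenplum is implemented: the names `NUM_SEGMENTS`, `segment_for`, `distribute`, `segment_sum` and `master_sum` are made up for this example, `crc32` stands in for whatever hash the database really uses, and threads stand in for separate segment hosts.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for the example


def segment_for(key: str) -> int:
    """Pick a segment by hashing the distribution key (illustrative hash only)."""
    return crc32(key.encode("utf-8")) % NUM_SEGMENTS


def distribute(rows):
    """Place each (key, amount) row on exactly one segment: shared nothing."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for key, amount in rows:
        segments[segment_for(key)].append((key, amount))
    return segments


def segment_sum(local_rows):
    """Work done on one segment: aggregate only the rows it owns."""
    return sum(amount for _, amount in local_rows)


def master_sum(segments):
    """Master role: send the same work to every segment, then gather the partials."""
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(segment_sum, segments))
    return sum(partials)


rows = [(f"customer-{i}", i) for i in range(1, 101)]
segments = distribute(rows)
total = master_sum(segments)
print(total)  # 5050, the sum of 1..100
```

Each segment only ever touches its own rows, and the only thing that travels back to the master is a small partial result. That is the essence of the shared-nothing approach.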

Greenplum Architecture...



Features of Greenplum...


  • Massively parallel processing architecture
  • Petabyte-scale loading
  • Innovative query optimizer
  • Polymorphic data storage and execution
  • Advanced machine learning
  • External data accessibility

GPDB High Availability


  • Master Host Mirroring


If the primary master fails, the replication process stops, and an administrator can activate the standby master. Upon activation of the standby master, the replicated logs reconstruct the state of the primary master at the time of the last successfully committed transaction.
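
As a rough illustration of "reconstructing the state at the last committed transaction", here is a small Python sketch. It assumes a made-up log format of `("set", txid, key, value)` and `("commit", txid)` records. Greenplum's real replication logs look nothing like this, and `replay` is a hypothetical helper.

```python
def replay(log):
    """Rebuild state from a replicated log, applying only committed transactions."""
    state = {}
    pending = {}  # txid -> list of (key, value) changes not yet committed
    for record in log:
        kind = record[0]
        if kind == "set":
            _, txid, key, value = record
            pending.setdefault(txid, []).append((key, value))
        elif kind == "commit":
            _, txid = record
            for key, value in pending.pop(txid, []):
                state[key] = value
    # Anything still in `pending` never committed, so it is discarded.
    return state


# The primary master failed after transaction 2 wrote changes but before it committed.
replicated_log = [
    ("set", 1, "a", 1),
    ("commit", 1),
    ("set", 2, "a", 2),
    ("set", 2, "b", 5),
]
standby_state = replay(replicated_log)
print(standby_state)  # {'a': 1}
```

The standby ends up with the effects of transaction 1 only, because transaction 2 never reached a commit record before the failure.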

  • Segment Mirroring



With mirroring enabled, Greenplum Database automatically fails over to a mirror segment when a primary segment goes down. Provided one segment instance is online per portion of data, users may not realize a segment is down.

A mirror segment is an identical copy of the primary segment that has the same segment number. A primary segment and its mirror are never placed on the same host, so that the failure of a single host does not take down the whole system. The mirror always lives on a different host.
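
The "different host" rule can be sketched in a few lines of Python. The placement rule used here (the mirror goes to the next host, wrapping around) is just an assumption for the example, and `place_mirrors` and `survives_host_failure` are hypothetical helpers, not Greenplum utilities.

```python
def place_mirrors(num_hosts, segments_per_host):
    """Assign each primary segment a mirror on a different host (next host, wrapping)."""
    if num_hosts < 2:
        raise ValueError("mirroring on a separate host needs at least two hosts")
    layout = []
    segment_no = 0
    for host in range(num_hosts):
        for _ in range(segments_per_host):
            layout.append({
                "segment": segment_no,
                "primary_host": host,
                "mirror_host": (host + 1) % num_hosts,
            })
            segment_no += 1
    return layout


def survives_host_failure(layout, failed_host):
    """True if every segment still has a primary or mirror on a surviving host."""
    return all(
        seg["primary_host"] != failed_host or seg["mirror_host"] != failed_host
        for seg in layout
    )


layout = place_mirrors(num_hosts=4, segments_per_host=2)
for seg in layout:
    print(seg)
```

With this layout, losing any single host still leaves one copy of every segment online.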

Segment mirroring employs a physical file replication scheme—data file I/O at the primary is replicated to the secondary so that the mirror's files are identical to the primary's files. Data in Greenplum Database are represented with tuples, which are packed into blocks. Database tables are stored in disk files consisting of one or more blocks. A change to a tuple changes the block it is saved in, which is then written to disk on the primary and copied over the network to the mirror. The mirror updates the corresponding block in its copy of the file.
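
Here is a tiny Python sketch of the tuple, block and file idea described above. Everything in it is a simplification assumed for the example: `BLOCK_SIZE`, `SegmentFile` and `update_tuple` are invented names, and the "network copy" is just a second method call.

```python
BLOCK_SIZE = 4  # tuples per block; a hypothetical value for the example


class SegmentFile:
    """A data file modeled as a list of fixed-size blocks of tuples."""

    def __init__(self, num_blocks):
        self.blocks = [[None] * BLOCK_SIZE for _ in range(num_blocks)]

    def write_block(self, block_no, block):
        self.blocks[block_no] = list(block)


def update_tuple(primary, mirror, tuple_no, value):
    """Change one tuple, write its block on the primary, then copy it to the mirror."""
    block_no, offset = divmod(tuple_no, BLOCK_SIZE)
    block = list(primary.blocks[block_no])
    block[offset] = value
    primary.write_block(block_no, block)  # written to disk on the primary
    mirror.write_block(block_no, block)   # stands in for the copy over the network
    return block_no


primary = SegmentFile(num_blocks=3)
mirror = SegmentFile(num_blocks=3)
changed_block = update_tuple(primary, mirror, tuple_no=5, value="row-5")
print(changed_block, primary.blocks == mirror.blocks)
```

Note that the unit being shipped is the whole block, not the single tuple, which is why the mirror's files stay identical to the primary's.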

If a primary segment fails, the file replication process stops and the mirror segment automatically starts as the active segment instance. The now active mirror's system state becomes Change Tracking, which means the mirror maintains a system table and change-log of all blocks updated while the primary segment is unavailable. 

When the failed primary segment is repaired and ready to be brought back online, an administrator initiates a recovery process and the system goes into Resynchronization state. The recovery process applies the logged changes to the repaired primary segment. The system state changes to Synchronized when the recovery process completes.
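
The three states mentioned above (Synchronized, Change Tracking, Resynchronization) can be modeled as a small state machine. The Python sketch below is a toy under stated assumptions: `Mode` and `SegmentPair` are hypothetical names, blocks are plain integers, and writes arriving during resynchronization are not modeled.

```python
from enum import Enum


class Mode(Enum):
    SYNCHRONIZED = "Synchronized"
    CHANGE_TRACKING = "Change Tracking"
    RESYNCHRONIZATION = "Resynchronization"


class SegmentPair:
    """One primary/mirror pair, with each file modeled as a list of block values."""

    def __init__(self, num_blocks):
        self.primary = [0] * num_blocks
        self.mirror = [0] * num_blocks
        self.mode = Mode.SYNCHRONIZED
        self.changed = set()  # change log: blocks updated while the primary is down

    def write(self, block_no, value):
        if self.mode is Mode.SYNCHRONIZED:
            self.primary[block_no] = value
            self.mirror[block_no] = value
        elif self.mode is Mode.CHANGE_TRACKING:
            self.mirror[block_no] = value  # the mirror is now the active segment
            self.changed.add(block_no)
        else:
            raise RuntimeError("writes during resynchronization are not modeled")

    def fail_primary(self):
        self.mode = Mode.CHANGE_TRACKING

    def recover(self):
        """Administrator-initiated recovery: apply the logged changes to the primary."""
        self.mode = Mode.RESYNCHRONIZATION
        for block_no in sorted(self.changed):
            self.primary[block_no] = self.mirror[block_no]
        self.changed.clear()
        self.mode = Mode.SYNCHRONIZED


pair = SegmentPair(num_blocks=4)
pair.write(0, 1)     # replicated to both copies
pair.fail_primary()  # the mirror takes over in Change Tracking
pair.write(2, 9)     # only the mirror sees this; block 2 is logged
pair.recover()
print(pair.mode.value, pair.primary == pair.mirror)
```

Only the blocks recorded in the change log are copied back, so recovery does not need to re-send the whole file.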

So that's it about Greenplum. See you again with a new topic!
Thank you!
Resources: https://gpdb.docs.pivotal.io
