IBM SoftLayer IaaS – notes from a 2-day training class in NYC

I attended SoftLayer training in NYC and wrote up a few pages of notes. I really like the idea of building IaaS systems via web control panels and APIs, and SoftLayer delivers on this.

Overview

  • 21k customers in 140 countries
  • 15 data centers, 18 network points of presence (PoPs)
  • Mix and match of virtual (diverse set of hypervisors) and bare metal servers, all managed via web control panel and/or API
  • Deployment in real time with a high degree of automation
  • Some customers build a hybrid solution using SoftLayer in addition to their own datacenter, connecting via VPN or leased line

Server Architecture

  • While most cloud providers offer only virtualized resources on shared infrastructure, SoftLayer offers the option of bare metal and/or virtualization, and the option of shared and/or dedicated infrastructure.
  • Redundancy in some cases stops at the rack, not the server. For example, multiple power supplies for the rack, not for each server in the rack
  • Server options
    • Multi-tenant (you don’t know who/what else is running on the same resources as you)
      • Virtual (public node)
        • Managed Citrix Xen hypervisor
        • Monthly/Hourly billing
        • Up to 16 cores
        • Local storage or SAN
        • Free 5 TB outbound data transfer if you choose monthly billing
        • 15-minute provisioning
    • Single tenant (all resources dedicated to a single customer, aka “private cloud”)
      • Bare metal
        • Optional (unmanaged) hypervisor, such as Citrix Xen, VMware, Hyper-V, Parallels
        • Monthly billing; in some instances hourly billing is available
        • Free 20 TB outbound bandwidth per month
        • Optional private network, private rack
        • Options on CPUs, up to 36 internal drives (build your own NAS), NVIDIA Tesla GPUs http://www.nvidia.com/object/tesla-servers.html
        • 2-4 hour provisioning. That’s the time it takes for the machine to become visible to the customer. Additional time is needed to apply the operating system and applications.
      • Virtual (private node)
        • Pretty much the same as multi-tenant virtual except that you have dedicated hardware
        • You can install as many virtual machines as you want on your hardware
    • Customers may deploy their own software appliances, but there is no option to ever deploy their own hardware
    • Image Templates
      • The software/configuration of a physical or virtual machine
      • Applied to a machine to create a runtime environment
      • Two types of image templates
        • Standard
          • Virtual machine only
          • Any operating system
          • Citrix Xen only
        • Flex
          • Both physical and virtual machines
          • Red Hat (RHEL) and Windows only
          • All hypervisors

Networking

  • Three networks
    • Public (2 NICs, both usable rather than just for redundancy)
      • Bare metal: 20 TB outbound bandwidth per month
      • Virtual: 5 TB outbound bandwidth per month. Can be pooled if some servers aren’t publicly exposed
    • Private (2 NICs, both usable rather than just for redundancy)
      • No limitations on bandwidth. Great for backups across multiple datacenters
      • Private VLANs can include servers in multiple datacenters. A server can connect (span) to multiple VLANs
    • Management/Admin (1 NIC)
  • SoftLayer SLA: “reasonable efforts to provide 100% service”
  • VPN
    • tunnels: SSL, PPTP, IPSec
    • Recommends managing with FortiGate or Vyatta appliances
  • SoftLayer Looking Glass: Test latency between your datacenter and SoftLayer, or between resources within or across SoftLayer datacenters
  • Content Delivery Network
  • Load Balancing
  • Firewalls
    • Fortinet FortiGate 3000 series http://www.fortinet.com/products/fortigate/3000series.html
    • Shared hardware
      • Multi-tenant
      • Managed through Customer Portal & APIs. No console access because it’s shared hardware.
      • Configured to protect a single server
    • Dedicated hardware
      • Same as above, but single-tenant, yet still no console access.
      • Configured to protect a single server or an entire VLAN
    • Dedicated appliance
      • Same as dedicated hardware, but provides access to console and native tools. This gives the customer more capabilities.
  • Gateway Appliance
    • Vyatta
      • Applies to any portion of, or the entire, customer infrastructure at SoftLayer
      • Used for
        • IPSec VPN tunnels
        • NAT
        • Firewall
        • Router
      • Configured by console or the Vyatta GUI via VPN, not via the SoftLayer Customer Portal or API
  • DNS Options
    • Customer uses their own DNS that’s external to SoftLayer
    • Customer uses SoftLayer’s DNS, which is redundant across datacenters
    • Customer uses 3rd-party DNS
    • Customer runs their own DNS hosted on their own machines within SoftLayer

Security

  • Much easier to deploy/configure security via the SoftLayer Customer Portal than in a traditional datacenter. One common source of vulnerabilities is incomplete or incorrect security deployments, so an easier-to-use method suggests that it’s easier to create a secure system.
  • Offerings
    • McAfee (Windows) anti-virus
    • DDoS – detect and isolate (take offline) machines that are under attack, but there is no service to remediate the threat
      • Cisco Guard DDoS protection
      • Arbor Peakflow traffic analysis
      • Arbor ATLAS Global Traffic Analyzer
    • Servers local to each datacenter for Windows and Red Hat updates
    • IDS/IPS protection
      • Nessus vulnerability assessment and reporting
      • McAfee host intrusion protection
    • FortiGate firewalls
    • US Gov’t standards
      • Drive wiping using same tools as Dept of Defense (DoD)
      • NIST SP 800-53 US Gov’t standard
      • Federal Information Security Management Act (FISMA).
      • FedRAMP datacenters
      • Health Insurance Portability and Accountability Act (HIPAA). Will sign an agreement with the customer.
    • Two factor authentication
      • Symantec identity protection
      • Windows Azure Multi-Factor Authentication
    • VPN
      • Client-to-site SSL or PPTP, and site-to-site IPSec
  • Datacenters are
    • Service Organization Control (SOC) 2 certified
    • Payment Card Industry Data Security Standard (PCI-DSS) for bare metal and single-tenant virtual. Not recommended for multi-tenant.
    • Tier 3
      • 99.982% availability (translates to < 1.6 hours of downtime per year)
      • Multiple power/cooling
      • N+1 fault tolerant
      • Can sustain 72 hour power outage
    • Physical security. All items mentioned are good, but seemed typical of other datacenters I’ve been to or learned about.
    • Cloud Security Alliance (CSA) self-assessment, but not yet certified

Data

Managed services

  • Backup plans
  • Security plans, patching, server hardening
  • Monitoring
  • DBA
  • Change Management

APIs

  • Implemented using SOAP and XML-RPC
  • Available as Representational State Transfer (REST); see the sketch after this list
  • Supports a wide range of languages
  • 264 services (20 of which are high level) comprising a total of 3,421 API calls
  • Can be used to scale an implementation up and down in an automated manner. There’s a new package for this called OnScale; I’m not sure at what level it compares or competes with PureApplications on SoftLayer
  • Can be used to create a custom branded Customer Portal for reselling services
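
Here’s a minimal sketch of what the REST flavor looks like from Python. This is my own illustration, not official SoftLayer sample code: it assumes the api.softlayer.com/rest/v3 endpoint, HTTP basic authentication with your portal username and API key, and uses the SoftLayer_Account service’s getVirtualGuests call as an example. Check the current API documentation before relying on any of it.

# Minimal sketch (my own, not official sample code): list my virtual guests
# via the SoftLayer REST API, assuming basic auth with username + API key.
import requests

SL_USERNAME = "myuser"      # hypothetical credentials, for illustration only
SL_API_KEY = "my-api-key"

url = "https://api.softlayer.com/rest/v3/SoftLayer_Account/getVirtualGuests.json"
resp = requests.get(url, auth=(SL_USERNAME, SL_API_KEY), timeout=30)
resp.raise_for_status()

# Each guest comes back as a JSON object; print a couple of common fields.
for guest in resp.json():
    print(guest.get("hostname"), guest.get("primaryIpAddress"))

The same services are exposed over SOAP and XML-RPC, and SoftLayer publishes language bindings that wrap them, so portal-style automation is just more calls in this same pattern.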

Compared to other cloud providers

  • A lot of marketing hype, although the Gartner Magic Quadrant wasn’t at all kind to SoftLayer
  • The comparison to Amazon AWS showed higher performance and availability at lower cost, but used bare metal servers for the comparison. It didn’t show whether SoftLayer virtual is comparable to AWS, although in theory SoftLayer would cost less.
  • Catalyst: incubator to help small companies with infrastructure costs http://www.softlayer.com/catalyst

IBM PureApplications for Hybrid IaaS Cloud

IBM PureApplications provides an on-premises cloud. #PureApp for SoftLayer provides off-premises cloud solutions. ibm.co/TNzV8m @Prolifics

Video includes clip from my manager @Prolifics, Mike Hastie.

Facebook compresses its 300 petabyte Hadoop Hive data warehouse layer by a factor of 8x

Facebook’s 300 PB data warehouse grows by approximately 600 TB per day and resides on more than 100k servers (although I’m not certain how many of those are Hadoop nodes). With the brute-force approach of more storage and more servers reaching a logistical limit, the Facebook engineers have increased their level of data compression to 8x (using a custom modification of the Hortonworks ORCFile) from a previous 5x (using RCFile). The Hortonworks ORCFile is generally faster than RCFile when reading, but slower when writing. Facebook’s custom ORCFile was fastest on both read and write and also achieved the best compression.
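
A quick back-of-envelope calculation (my own arithmetic, not Facebook’s figures, and it assumes the 300 PB and 600 TB/day numbers are measured at the old 5x ratio) shows why the change matters:

# Rough sketch of the savings from moving from 5x to 8x compression.
# Assumes 300 PB on disk and 600 TB/day growth at the old 5x ratio.
old_ratio, new_ratio = 5.0, 8.0
footprint_old_pb = 300.0
daily_growth_old_tb = 600.0

logical_pb = footprint_old_pb * old_ratio                           # ~1500 PB of logical data
footprint_new_pb = logical_pb / new_ratio                           # ~187.5 PB at 8x
daily_growth_new_tb = daily_growth_old_tb * old_ratio / new_ratio   # ~375 TB/day

print(f"footprint at 8x: {footprint_new_pb:.1f} PB, "
      f"saving {footprint_old_pb - footprint_new_pb:.1f} PB")
print(f"daily growth at 8x: ~{daily_growth_new_tb:.0f} TB/day")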

Source:

Hortonworks Sandbox Hive tutorial

I much preferred this tutorial, which uses Hive, to the previous one, which used Pig. Using the same dataset in each example made the comparison clearer.

Pig makes sense for sequential steps, such as an ETL job. Hive seemed better suited for tasks comparable to ones in which we’d write stored procedures within a more traditional database server.

Another difference came with debugging.

  • The Pig editor bundled into the Hortonworks sandbox isn’t very sophisticated as IDEs go. No breakpoints, viewing of data, etc. Perhaps there’s a way to accomplish this, but (thankfully) it isn’t covered in such an early stage tutorial. There’s a button to upload a UDF jar, so I’ve got to research how one develops that jar outside of the Pig script editor.
  • The Hive tutorial makes it easier to view progress at each step, since you can think of each step as an independent SQL (actually HiveQL) statement. If the programming task were far more complex, I could see myself structuring the Pig scripts in a way that might be easier to debug than Hive.
  • Hive seemed good for an ad-hoc query and Pig for a complex procedural task.
  • The next tutorial combines Pig and Hive. I’ll see how that shapes my perceptions.

Hortonworks Sandbox Pig tutorial

I just completed the Hortonworks Pig tutorial. It seemed very straightforward, yet I ran into one problem.

The Pig script as specified was:

batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

Yet it generated an error. I wasn’t able to understand the logs well enough (yet!) to debug it, so I fell back to Googling it and found this:

http://hortonworks.com/community/forums/topic/error-while-running-sand-box-tutorial-for-pig-script/

As best I can understand, the input data has column headers, yet the script assumes no column headers. So the fix is to filter out any row with non-numeric data.

batting = load 'Batting.csv' using PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = filter runs_raw by runs > 0;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

I suppose that there’s also a way to filter out just the first row, but my Pig isn’t anywhere near good enough for that.
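
Another workaround, outside of Pig entirely (my own suggestion, not from the tutorial or the forum post): strip the header line from Batting.csv before uploading it to the sandbox, so the original script works unchanged. A few lines of Python would do it:

# Remove the header row from Batting.csv before uploading it (my own workaround).
with open("Batting.csv") as src, open("Batting_noheader.csv", "w") as dst:
    next(src)          # skip the header line
    for line in src:
        dst.write(line)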

Other than that, Pig seems interesting. Sort of a procedural programming language version of a subset of what the next tutorial shows us in Hive.

I spent some time today using the Hortonworks Hadoop sandbox

I downloaded the Hortonworks sandbox today. I’m using the version that runs as a virtual machine under Oracle VirtualBox. The sandbox can run in as little as 2GB RAM, but requires 4GB in order to enable Ambari and HBase. Good thing that I have 8GB in my laptop.

The “Hello World” tutorial provided me with hands-on experience:

  • Uploading a file into HCatalog
  • Typing queries into Beeswax, which is a GUI for Hive
  • Running a more complex query by writing a short script in Pig

There are a lot more tutorials. I’ll update this blog post after I finish each tutorial.

Sources:

3 types of clueless Big Data “experts”

I just read a very funny (and informative) article on InfoWorld about clueless “cloud experts”. It’s very easy to translate into any tech vertical, but it made me recall so many examples of people who don’t understand Big Data.

  1. I built Big Data applications years ago.
    I have a good friend (who I hope never reads this) who insists that he built a Big Data application in 1992 using Apple HyperCard, with both the executable and the data distributed on one CD-ROM. Of course, that was “a lot” of data in 1992. So one question, if we want to be pedantic: if you don’t use Hadoop, can it be a Big Data application?
  2. Big Data has no privacy. Isn’t that what the NSA proved?
    This misconception is the exact opposite of the truth. The NSA uses Accumulo, a highly secure data store built on top of Hadoop, and siphons data from all sorts of systems all over the planet. Sure, it probably pulls from some Hadoop systems, but for the NSA to gather so much data, doesn’t it make sense that the vast majority must come from ordinary non-Hadoop systems?
  3. Big Data is the answer for everything.
    I know a guy who suggested using Hadoop (running the Teradata distribution no less!) to store data feeds that we’re not ready to run ETL on yet. Wouldn’t a simple fileshare be a lot easier?

Source:

 

Hadoop job scheduling that takes network bandwidth into account

A research paper from Cornell University discusses scheduling Hadoop jobs based upon an analysis of available network bandwidth. Typically a Hadoop cluster considers only server node availability when scheduling. The approach assumes Software Defined Networking (SDN), a new front in virtualization technology that is critical for dynamic scaling of clouds.
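
I can’t reproduce the paper’s actual algorithm from the summary, but here’s a toy sketch (entirely my own illustration, not the authors’ method) of the general idea: among nodes with free task slots, prefer a data-local node, and otherwise pick the node with the most available bandwidth to wherever the block lives, e.g., as reported by an SDN controller.

# Toy illustration (mine, not the paper's algorithm): pick a node for a task using
# both slot availability and measured bandwidth to the data block's location.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_slots: int
    bandwidth_gbps: dict  # available bandwidth to other nodes, e.g. from an SDN controller

def place_task(block_location, nodes):
    candidates = [n for n in nodes if n.free_slots > 0]
    if not candidates:
        raise RuntimeError("no free task slots")
    local = [n for n in candidates if n.name == block_location]
    if local:
        return local[0].name  # data-local placement wins
    # Otherwise prefer the candidate with the best measured path to the data.
    return max(candidates, key=lambda n: n.bandwidth_gbps.get(block_location, 0.0)).name

nodes = [
    Node("dn1", free_slots=0, bandwidth_gbps={}),
    Node("dn2", free_slots=2, bandwidth_gbps={"dn1": 4.0}),
    Node("dn3", free_slots=1, bandwidth_gbps={"dn1": 9.5}),
]
print(place_task("dn1", nodes))  # dn1 is busy, so dn3 wins on bandwidth to dn1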

Source:

Running Hadoop on VMware

Normally we’d like to think of Hadoop running on hundreds of racks of commodity hardware, but that doesn’t mean that we should forget all of the reasons why we love virtualization.

This case study explains the how and why, and provides benchmarks from the experiment of running Hadoop on VMware. Of course the experiment was successful, since the study was published by VMware.

The moral of the story is that just because Hadoop can run on commodity hardware doesn’t mean that it has to, or that it’s the best way to deploy.

Source:

Xplenty offers HaaS in AWS

Xplenty offers Hadoop as a Service for Amazon Web Services in all AWS global regions. This HaaS offering promises a “coding free design environment”, in addition, of course, to the hardware-free environment that AWS provides.

Source: