Messaging system

Designing, building and running it in production

by Jakub Oboza / @jakuboboza

Heads Up

  • Programming languages - options/choices
  • Scalability of the platform and how it looks after a few years
  • Fault tolerance and its day-to-day benefits
  • Operational costs and availability of programmers

SMS Gateway

My example is directly related to sending SMS messages :)

Before we start...

Becoming a telecommunications specialist.

MT

mobile terminated

MO

mobile originated

DR

Delivery report

Year - 2009

The beginning

We need to go back in time to understand some of the decisions.

Requirements for the project

  • Not a startup environment
  • Contract with supplier limiting us to 9 months
  • ...not 9... it was later cut to 6 months!
  • Diversification of resources! It can't be written in an unsupported language

What are we doing?

We give clients an API they can use to send MTs and receive MOs

We also provide other services to various television channels etc.

We also do SMS PAYMENTS! This is actually the most important bit.

Programming Languages

  • Hard decision
  • Fault tolerant
  • Scalable
  • Team support
  • Available resources we can get
  • Support? Any?
  • ....

Ruby

  • + Team knows it! Easy to learn
  • + Rails, Merb, Sinatra, etc.
  • + Awesome testing tools
  • - Not scalable
  • - Slow
  • - (2009) No real JVM support
  • - Not fault tolerant, hard restarts

Java

  • Java 1.5. (1.6 was close, but not there yet)
  • + and - = JVM
  • Team was not super fluent with it, but we had two teams full of Java devs
  • It was not clear how to achieve fault tolerant systems
  • Restarts, hot code swapping was not there
  • Testing tools not cool
  • Big plus was a lot of ready-to-use libs

Scala and Go

Not present yet! (in a business context)

Haskell

  • 2009: the era of GHC 6.*
  • Almost no useful libs, no aeson, no mysql-simple, no snap
  • Very limited access to developers
  • Pure functional code, very trustworthy
  • Hard

Erlang

  • Fault tolerant out of box
  • Easy to learn syntax
  • Support in team
  • OTP
  • Available commercial support
  • - hard to get devs

Database

  • It is 2009
  • Can't really lose any single MO/MT
  • Not many experts in running a DB under high load
  • We had to be sure it would be reliable and perform well

MySQL

  • + A lot of experience in team
  • Easy to use and monitor
  • It crashed a few times, not really fault tolerant
  • - No master-master replication
  • - Network splits can hurt a slave a lot

Riak

  • Seemed solid
  • - Nobody had a lot of experience with it
  • Fault tolerant
  • Not slow! Speed was OK
  • Distributed, everything accessible everywhere
  • Basho offered commercial support

  • Early builds 0.12
  • Sleepy scheduler problem
  • Timeouts
  • Upgrades

Architecture | Design

Split FRONTEND and BACKEND.

BACKEND = the part that talks to operators, processes MO/MT/DR requests and handles the API

FRONTEND = where you configure the gateway: add or remove connections, etc.

The only connection between the two parts was the "Instructions Queue"

Frontend

One big monolithic Rails app

Backend

Interesting part

Nodes, many nodes

Type of nodes

  • Operator nodes (SMPP, UCP, HTTP)
  • Client nodes (HTTP, SMPP)
  • Router nodes
  • Conf nodes
  • Counter and Callback nodes

Simple MO flow

Queues

Many queues

Every node uses queues to communicate with other nodes
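
A minimal sketch of the idea in Python (the real backend is not shown in these slides, so everything here is an assumption: the `router_node` function, the message fields and the short code `63103` are hypothetical names used only for illustration). One node reads from its own inbox queue and forwards each MO to the queue of the client that owns the short code.

```python
import queue
import threading

STOP = object()  # sentinel that tells a node to shut down


def router_node(inbox, client_queues):
    """Hypothetical router node: forward each MO to a client queue by short code."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        target = client_queues.get(msg["short_code"])
        if target is not None:
            target.put(msg)
        # Messages for unknown short codes are silently dropped in this sketch.


def run_demo():
    """Push one MO through the router and return what the client queue received."""
    router_inbox = queue.Queue()
    client_a = queue.Queue()
    worker = threading.Thread(
        target=router_node, args=(router_inbox, {"63103": client_a})
    )
    worker.start()
    router_inbox.put({"type": "MO", "short_code": "63103", "body": "hello"})
    router_inbox.put(STOP)
    worker.join()
    return client_a.get_nowait()
```

Nodes never call each other directly in this sketch; they only share queues, so a slow or restarted node leaves messages waiting instead of losing them.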

Why not use RabbitMQ? - I don't know! RabbitMQ is great

Operator nodes

  • You can't have just one - you need 3+
  • Fault tolerant
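
One way to read "you need 3+" is simple failover: try the next operator node when one is down. A hedged sketch, assuming each node can be modelled as a callable that either returns a receipt or raises; `OperatorDown` and `send_with_failover` are hypothetical names, not part of the gateway.

```python
class OperatorDown(Exception):
    """Raised by a (hypothetical) operator node that cannot take the message."""


def send_with_failover(message, operator_nodes):
    """Try each operator node in order and return the first successful result."""
    for node in operator_nodes:
        try:
            return node(message)
        except OperatorDown:
            continue  # this node is down, try the next one
    raise OperatorDown(f"all {len(operator_nodes)} operator nodes failed")


def broken_node(message):
    raise OperatorDown("connection lost")


def healthy_node(message):
    return f"accepted:{message}"
```

With `[broken_node, broken_node, healthy_node]` the message still goes out; with a single node, one lost connection stops all traffic to that operator.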

Client nodes

  • Some clients prefer the HTTP API
  • Some prefer SMPP

Other types of nodes

  • Counter
  • Callback
  • ...

Deployment and Distribution

  • On every physical box we have many nodes
  • Some boxes have only operator nodes or client nodes
  • Deployment: Capistrano

Early Life

  • Early days were great, nothing broke...
  • Actually, some stuff broke

Problems I

  • Riak Timeouts and concurrency
  • Throttle Queues, Queue size
  • Zombie keys
  • Corrupted data due to not using vector clocks (vclocks) in Riak
  • Multi-datacenter replication going crazy
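
To show why skipping vector clocks corrupts data, here is a generic vector clock sketch in Python. It is not Riak's implementation or API; the function names and node names are assumptions made for illustration. The point is that two writes made from the same starting version on different nodes are detected as a conflict instead of one silently overwriting the other.

```python
def increment(clock, actor):
    """Return a copy of the clock with the actor's counter bumped by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    """Classify two clocks as equal, ordered, or concurrent (conflict)."""
    a_sees_b = descends(a, b)
    b_sees_a = descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if a_sees_b:
        return "a_newer"
    if b_sees_a:
        return "b_newer"
    return "conflict"


base = increment({}, "node1")      # first write
left = increment(base, "node2")    # update made on node2
right = increment(base, "node3")   # concurrent update made on node3
```

Here `compare(left, base)` reports that `left` is newer, while `compare(left, right)` reports a conflict that the application has to resolve.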

Problems II

  • Round-robining Riak connections
  • SSDs, NUMA (Non-Uniform Memory Access)
  • Many queues to a single queue
  • Duplicate MOs and MTs because SMPP is not reliable
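
A common way to live with duplicates is to make processing idempotent: remember recently seen message IDs and drop repeats. The slides do not say how the gateway solved this, so the sketch below is an assumption; `Deduplicator` and its `max_ids` parameter are hypothetical.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember the most recently seen message IDs, up to a fixed limit."""

    def __init__(self, max_ids=100_000):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def is_new(self, message_id):
        """Return True the first time an ID is seen, False for a repeat."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # keep hot IDs from being evicted
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict the least recently seen ID
        return True
```

The memory limit means a very late duplicate can still slip through, so the window has to be sized against how long an operator may keep retrying.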

Problems III

  • A Riak upgrade happened, and then this happened....
  • +zdnfgtse = do-not-f*cking-go-to-sleep-ever

Maintenance

  • Not upgrading will eventually bite you
  • Logs, graphs, more detailed view into gateway
  • That day when you discover 50 000 000 files in one directory
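
One hedged way to avoid that day is to spread files over hashed subdirectories. The sketch assumes files can be located by name alone; the root path and `sharded_path` are hypothetical. With two levels of two hex characters there are 65,536 directories, so 50 000 000 files come to roughly 760 per directory if the hash spreads them evenly.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(root, filename, levels=2, width=2):
    """Place a file under subdirectories derived from a hash of its name."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)


example = sharded_path("/var/spool/gateway", "msg-1.dat")
```

The same name always maps to the same directory, so no index is needed to find a file again.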

Logs

Logs are the most important thing in your LIFE if you are a developer. Not the tool, not your family... logs. A good log format is brilliant, a bad one is a bane.

Logs (2)

Rails logs


Processing PostsController#create (for 127.0.0.1 at 2008-09-08 11:52:54) [POST]
  Session ID: BAh7BzoMY3NyZl9pZCIlMDY5MWU1M2I1ZDRjODBlMzkyMWI1OTg2NWQyNzViZjYiCmZsYXNoSUM6J0FjdGl
vbkNvbnRyb2xsZXI6OkZsYXNoOjpGbGFzaEhhc2h7AAY6CkB1c2VkewA=--b18cd92fba90eacf8137e5f6b3b06c4d724596a4
  Parameters: {"commit"=>"Create", "post"=>{"title"=>"Debugging Rails",
 "body"=>"I'm learning how to print in logs!!!", "published"=>"0"},
 "authenticity_token"=>"2059c1286e93402e389127b1153204e0d1e275dd", "action"=>"create", "controller"=>"posts"}
New post: {"updated_at"=>nil, "title"=>"Debugging Rails", "body"=>"I'm learning how to print in logs!!!",
 "published"=>false, "created_at"=>nil}
Post should be valid: true
  Post Create (0.000443)   INSERT INTO "posts" ("updated_at", "title", "body", "published",
 "created_at") VALUES('2008-09-08 14:52:54', 'Debugging Rails',
 'I''m learning how to print in logs!!!', 'f', '2008-09-08 14:52:54')
The post was saved and now the user is going to be redirected...
Redirected to #
Completed in 0.01224 (81 reqs/sec) | DB: 0.00044 (3%) | 302 Found [http://localhost/posts]
            

Logs (3)

CSV logs


2014-01-16 12:38:52.364377,mt_sent,65423-GUID,KKK01HU,three,000111111111,63103,KK1544,0011111111111:160114123852
2014-01-16 12:39:48.736609,mt_sent,cf2e99-GUID,KKK01HU,three,000111111111,84383,KF1250,000111111111:160114123948
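
The payoff of a CSV format is that a few lines of code can answer questions about traffic. A small Python sketch using the two sample lines above; only the timestamp and event name are interpreted, because the meaning of the remaining columns is not given here, and `parse_log` and `count_by_event` are hypothetical helper names.

```python
import csv
import io
from collections import Counter
from datetime import datetime

SAMPLE = """\
2014-01-16 12:38:52.364377,mt_sent,65423-GUID,KKK01HU,three,000111111111,63103,KK1544,0011111111111:160114123852
2014-01-16 12:39:48.736609,mt_sent,cf2e99-GUID,KKK01HU,three,000111111111,84383,KF1250,000111111111:160114123948
"""


def parse_log(text):
    """Return (timestamp, event, remaining_fields) tuples for each log line."""
    events = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        timestamp = datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S.%f")
        events.append((timestamp, row[1], row[2:]))
    return events


def count_by_event(events):
    """Count how many times each event name occurs."""
    return Counter(name for _, name, _ in events)


events = parse_log(SAMPLE)
```

Doing the same with the multi-line Rails format would need a custom parser that stitches lines back together first.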
            

Logs (4)

The logstash experiment

Monitoring, Graphs

  • logstash
  • kibana
  • hekad
  • graphite!

hekad

  • written in Go
  • fast and low on CPU
  • still had imperfections

Graphite

  • Graphite is awesome
  • Did you know you can base your alarms on Graphite?
  • and many more....
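
Alarms need numbers to arrive first. Graphite's carbon daemon accepts a plaintext line of the form `path value timestamp`, by default on TCP port 2003. A minimal sketch, assuming a carbon listener on `localhost`; the metric name `gateway.mt_sent.count` and both function names are hypothetical.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="localhost", port=2003):
    """Send one plaintext line to a carbon listener (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


line = format_metric("gateway.mt_sent.count", 42, 1389875932)
```

Calling `send_metric(line)` only works when a carbon daemon is actually listening, so the formatting is kept separate to make it easy to check on its own.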

Thank you :D

Any questions?