Scalability of the platform and how it looks after a few years
Fault tolerance and its day-to-day benefits
Operational costs and availability of programmers.
SMS Gateway
My example is directly related to sending SMS messages :)
Before we start...
Becoming a telecommunications specialist.
MT
mobile terminated
MO
mobile originated
DR
Delivery report
Year - 2009
The beginning
We need to go back in time to understand some things and decisions.
Requirements for the project
Not a startup environment
Contract with supplier limiting us to 9 months
...not 9... it was later cut to 6 months!
Diversification of resources! It can't be written in an unsupported language
What are we doing?
We give clients an API so they can send MTs and receive MOs through us
We also provide other services to various television channels etc.
We also do SMS PAYMENTS! This is actually the most important bit.
Programming Languages
Hard decision
Fault tolerant
Scalable
Team support
Available resources we can get
Support? Any?
....
Ruby
+ Team knows it! Easy to learn
+ Rails, Merb, Sinatra, etc.
+ Awesome testing tools
- Not scalable
- Slow
- (2009) No real JVM support
- Not fault tolerant, hard restarts
Java
Java 1.5. (1.6 was close, but not there yet)
+ and - = JVM
Team was not super fluent with it, but we had 2 teams full of Java devs
It was not clear how to build fault-tolerant systems
Restarts were needed; hot code swapping was not there
Testing tools were not cool
Big plus was a lot of ready-to-use libs
Scala and Go
Not present yet! (in context of business)
Haskell
2009: the era of GHC 6.*
Almost no useful libs, no aeson, no mysql-simple, no snap
Very limited access to developers
Pure functional code, very trustworthy
Hard
Erlang
Fault tolerant out of box
Easy to learn syntax
Support in team
OTP
Available commercial support
- hard to get devs
Database
It is 2009
Can't really lose any single MO/MT
Not many experts in handling high load volumes on a DB
We had to be sure it would be reliable and perform well
MySQL
+ A lot of experience in team
Easy to use and monitor
It crashed a few times, not really fault tolerant
- No master-master replication
- Network splits can hurt a slave a lot
Riak
Seemed solid
- Nobody had a lot of experience with it
Fault tolerant
Not slow! Speed was OK
Distributed, everything accessible everywhere
Basho offered commercial support
Early builds 0.12
Sleepy scheduler problem
Timeouts
Upgrades
Architecture | Design
Split FRONTEND and BACKEND.
BACKEND = the part that talks to operators, handles MO/MT/DR requests and serves the API
FRONTEND = where you configure the gateway: add or remove connections, etc.
The only connection between the two parts was the "Instructions Queue"
Frontend
One big monolithic Rails app
Backend
Interesting part
Nodes, many nodes
Type of nodes
Operator nodes (SMPP, UCP, HTTP)
Client nodes (HTTP, SMPP)
Router nodes
Conf nodes
Counter and Callback nodes
Simple MO flow
Queues
Many queues
Every node uses queues to communicate with other nodes
Why not use RabbitMQ? - I don't know! RabbitMQ is great
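A minimal sketch of the "every node talks through queues" idea, with Python threads standing in for nodes. The short codes, message fields, and the names router_node, client_node and run_mo_flow are all hypothetical illustrations; this is not the gateway's actual code.

```python
import queue
import threading

STOP = object()  # sentinel telling a node to shut down


def router_node(inbox, client_inboxes):
    """Route each MO to the client queue that owns the destination short code."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            # Propagate shutdown to every client node.
            for q in client_inboxes.values():
                q.put(STOP)
            return
        target = client_inboxes.get(msg["to"])
        if target is not None:  # unknown short codes are dropped in this sketch
            target.put(msg)


def client_node(inbox, delivered):
    """Stand-in for a client node: records what it would push to the client."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            return
        delivered.append(msg)


def run_mo_flow(messages):
    """Feed MOs in as an operator node would and return what each client got."""
    router_inbox = queue.Queue()
    client_inboxes = {"7001": queue.Queue(), "7002": queue.Queue()}
    delivered = {code: [] for code in client_inboxes}

    threads = [threading.Thread(target=router_node,
                                args=(router_inbox, client_inboxes))]
    for code, inbox in client_inboxes.items():
        threads.append(threading.Thread(target=client_node,
                                        args=(inbox, delivered[code])))
    for t in threads:
        t.start()

    # The operator side only ever writes to the router's queue.
    for msg in messages:
        router_inbox.put(msg)
    router_inbox.put(STOP)
    for t in threads:
        t.join()
    return delivered
```

No node calls another node directly: each one only reads its own inbox and writes to someone else's, so a slow or restarted node just means messages wait in a queue.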
Operator nodes
You can't have just one - you need 3+
Fault tolerant
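Why 3+ matters can be shown with a small failover sketch: if one operator node is down, the message goes out through the next one. The names OperatorNodeDown, make_node and send_with_failover are hypothetical, and a real node would hold an SMPP/UCP/HTTP connection rather than a flag.

```python
class OperatorNodeDown(Exception):
    """Raised by a node that cannot reach the operator."""


def make_node(up):
    """Build a fake operator node; `up` decides whether sending succeeds."""
    def send(message):
        if not up:
            raise OperatorNodeDown("connection refused")
    return send


def send_with_failover(nodes, message):
    """Try each (name, send) pair in order; return the name that accepted."""
    errors = []
    for name, send in nodes:
        try:
            send(message)
            return name
        except OperatorNodeDown as exc:
            errors.append((name, str(exc)))
    raise RuntimeError(f"all operator nodes failed: {errors}")
```

With a single node, the first failure is an outage. With three, one can be down and one can be upgrading while the third still sends.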
Client nodes
Some clients prefer the HTTP API
Some prefer SMPP
Other types of nodes
Counter
Callback
...
Deployment and Distribution
On every physical box we have many nodes
Some boxes have only operator nodes or client nodes
Deployment: Capistrano
Early Life
Early days were great, nothing broke...
Actually, some stuff broke
Problems I
Riak Timeouts and concurrency
Throttle Queues, Queue size
Zombie keys
Corrupted data due to not using vector clocks (vclocks) in Riak
Multi datacenter replication going crazy
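The vector clock point deserves a short illustration. This is a sketch of the general technique, not Riak's client API: each writer bumps its own counter, and two versions conflict when neither clock has seen everything the other has. Ignore the clocks and one of two concurrent writes silently overwrites the other. The function names and the node names in the usage are hypothetical.

```python
def increment(clock, actor):
    """Return a copy of `clock` with `actor`'s counter bumped by one."""
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new


def descends(a, b):
    """True if clock `a` has seen every event that clock `b` has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    """Classify two versions: equal, a_newer, b_newer, or conflict."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a_newer"
    if descends(b, a):
        return "b_newer"
    return "conflict"  # concurrent writes: both versions must be kept or merged
```

For example, if two nodes both read the same version and each writes an update, compare on the two resulting clocks returns "conflict", which is the signal to merge rather than overwrite.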
Problems II
Round-robining Riak connections
SSDs, NUMA (Non-Uniform Memory Access)
Many queues to a single queue
Duplicate MOs and MTs because SMPP is not reliable
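One common way to cope with duplicated MOs and MTs is to remember recently seen message IDs and drop repeats. The sketch below assumes every message carries a stable ID, which is an assumption and not something these slides state; the Deduplicator class and its max_entries parameter are hypothetical.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember the most recent message IDs and reject ones already seen."""

    def __init__(self, max_entries=100_000):
        self.max_entries = max_entries
        self._seen = OrderedDict()  # insertion order doubles as age order

    def is_new(self, message_id):
        if message_id in self._seen:
            # Seen before: refresh its age and report a duplicate.
            self._seen.move_to_end(message_id)
            return False
        self._seen[message_id] = True
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True
```

The bounded size is a trade-off: memory stays flat, but a duplicate arriving after its ID was evicted will slip through, so the window has to be sized for the longest retry delay you expect.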
Problems III
A Riak upgrade happened, and then this happened...
+zdnfgtse = do-not-f*cking-go-to-sleep-ever
Maintenance
Not upgrading will eventually bite you
Logs, graphs, more detailed view into gateway
That day when you discover 50 000 000 files in one directory
Logs
Logs are the most important thing in your LIFE if you are a developer.
Not the tool, not your family... logs. A good log format is brilliant; a bad one is a bane.
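As an illustration of what a "good" format might mean, here is a sketch that writes one self-contained, machine-parseable line per event. The field names (event, msg_id, operator) and the log_line helper are hypothetical and not the gateway's real format.

```python
import json
from datetime import datetime, timezone


def log_line(event, **fields):
    """Render one event as a single JSON line with a UTC timestamp."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), "event": event}
    record.update(fields)
    # sort_keys keeps the field order stable, which makes lines easy to diff.
    return json.dumps(record, sort_keys=True)
```

Calling log_line("mt_sent", msg_id="abc", operator="op1") yields a single line that can be grepped by msg_id or parsed by a script, with no need to stitch several lines back together.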
Logs (2)
Rails logs
Processing PostsController#create (for 127.0.0.1 at 2008-09-08 11:52:54) [POST]
Session ID: BAh7BzoMY3NyZl9pZCIlMDY5MWU1M2I1ZDRjODBlMzkyMWI1OTg2NWQyNzViZjYiCmZsYXNoSUM6J0FjdGl
vbkNvbnRyb2xsZXI6OkZsYXNoOjpGbGFzaEhhc2h7AAY6CkB1c2VkewA=--b18cd92fba90eacf8137e5f6b3b06c4d724596a4
Parameters: {"commit"=>"Create", "post"=>{"title"=>"Debugging Rails",
"body"=>"I'm learning how to print in logs!!!", "published"=>"0"},
"authenticity_token"=>"2059c1286e93402e389127b1153204e0d1e275dd", "action"=>"create", "controller"=>"posts"}
New post: {"updated_at"=>nil, "title"=>"Debugging Rails", "body"=>"I'm learning how to print in logs!!!",
"published"=>false, "created_at"=>nil}
Post should be valid: true
Post Create (0.000443) INSERT INTO "posts" ("updated_at", "title", "body", "published",
"created_at") VALUES('2008-09-08 14:52:54', 'Debugging Rails',
'I''m learning how to print in logs!!!', 'f', '2008-09-08 14:52:54')
The post was saved and now the user is going to be redirected...
Redirected to #
Completed in 0.01224 (81 reqs/sec) | DB: 0.00044 (3%) | 302 Found [http://localhost/posts]