A Hybrid Cloud Approach from FraudGuard.io that Handles 50M Requests a Day - High Scalability

 highscalability.com  01/28/2019 17:02:20 

Monday, January 28, 2019 at 9:02AM

 This is a guest post from Ryan Averill at FraudGuard.io.

At FraudGuard.io we are a team of just a few developers, all working with our customers to make their applications as safe as possible. We have been working on FraudGuard for about 3 years and we’ve had paying customers for more than 2 years now. The main idea behind FraudGuard is for us to get attacked so you don’t have to. In other words: reduce the overall number of attacks your application receives each day by leveraging our threat data. We do this by taking attack data from our network of honeypots and sharing it with you directly via API. Unlike businesses that just run services like Maxmind, which update only occasionally, we run the entire process in house so we can immediately share real-time attack data from around the world.

Originally the plan was to never sell FraudGuard’s data but instead to secure a massive collection of WordPress and OpenCart applications that we were responsible for developing, managing and securing. We had so much success reducing application attacks that we even saw a significant reduction in nightly PagerDuty escalations for the team overall. The integration in this instance was actually very simple; we just wrote a couple of quick plugins that validate inbound internet traffic to critical application components. For the record, this is also our official recommended implementation of FraudGuard.

Once we decided to build a system capable of collecting enough useful data, we found that storing and retaining that much data in the cloud (originally DigitalOcean) simply wasn’t as cost effective as we wanted. While our front-end applications look fairly simple to the end user, the massive datasets we collect 24/7 were really expensive to store and constantly process. In addition to our very large datasets, our backend also relies on dozens of complicated microservices as well as quite a few open source applications.

FraudGuard is in the first column second row

While we previously ran on DigitalOcean (and loved them), at the time they also placed a significant restriction on the number of public IPs they would allocate - 1 IP per server. Since we collect all our own attack and non-attack data and don't rely on any third-party service, we need as many public IPv4 and IPv6 IPs as we can afford in order to collect data.

What we decided to do was break with the norm and lease a 1/4 rack at a local Tampa, Florida data center - Hivelocity. Working with an actual datacenter gave us a huge amount of flexibility in terms of pricing, fixed a couple of our technical limitations (mainly public IP restrictions) and also gave us fairly cheap permanent resource allocation that was previously quite expensive in the cloud, especially in comparison with on-demand (non-reserved) pricing. To provide some context, take our previous infrastructure expenses at DigitalOcean. Completely removing our honeypot costs from the equation, just running the major backend components of the business requires an x-large (at least 8 cores) MySQL instance and large (at least 4 cores) Redis and CouchDB instances, preferably multiplied by two so that we can fail over between instances and have at least one read replica. At that time (I know Droplets have dropped in price since then) those expenses alone significantly exceeded our entire month's hosting quote at Hivelocity, and we haven’t even discussed pricing application servers, hosting auxiliary applications, etc.

Inside our rack we’ve gone through many iterations of hardware, first HP and now Supermicro, as well as many upgrades to parts. It also takes a special kind of super passionate team to go to a physical datacenter or call a NOC to get assistance in the middle of the night instead of just logging in and clicking reboot. But in some instances it may be worth looking into from a pricing perspective.

In our current configuration we run 4 Supermicro FatTwins. They run an extremely sizeable workload and have been phenomenally reliable, as has our datacenter so far. Each FatTwin is actually two dual-processor machines per 2U, so we have 8 hosts in total in our 1/4 rack (complete physical machines - I believe they only share redundant power supplies). Each host is pretty simple and just runs Kubernetes on Rancher. So far we’ve been really fortunate to not have any problems, and it makes the entire setup and management process a breeze. For our network layer (and since we are just a very small company) we run a 10Gbps Ubiquiti Networks switch and a Ubiquiti Networks firewall, the USG-PRO-4. For storage, CPU and memory, each host has 6 drives, which we’ve upgraded to 6x 1TB Kingston SSDs, along with 48GB of DDR3 memory and 16 cores. In total and with RAID this gives us about 30TB usable capacity, 128 Xeon cores and 384GB memory - which is great for now; even with the last few years' growth we are probably using 3/5 of that capacity today.
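The totals above follow directly from the per-host figures. A quick Python check of the arithmetic is below; it only computes raw storage, because the RAID level behind the roughly 30TB usable figure is not stated here and the sketch does not try to guess it.

```python
# Per-host figures taken from the paragraph above.
CHASSIS = 4             # Supermicro FatTwin units
HOSTS_PER_CHASSIS = 2   # two dual-processor machines per 2U
DRIVES_PER_HOST = 6
DRIVE_TB = 1
CORES_PER_HOST = 16
MEMORY_GB_PER_HOST = 48

hosts = CHASSIS * HOSTS_PER_CHASSIS
total_cores = hosts * CORES_PER_HOST
total_memory_gb = hosts * MEMORY_GB_PER_HOST
raw_storage_tb = hosts * DRIVES_PER_HOST * DRIVE_TB  # before any RAID overhead

print(f"hosts:       {hosts}")
print(f"cores:       {total_cores}")
print(f"memory (GB): {total_memory_gb}")
print(f"raw SSD TB:  {raw_storage_tb}")
```

This gives 8 hosts, 128 cores, 384GB of memory and 48TB of raw SSD, so the quoted usable capacity reflects RAID overhead on top of the raw figure.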

One of the many questions we usually get is about the initial expense of building the infrastructure above, and to be fair it's a lot of money. Most of the purchases we made were bulk purchases and used/refurb hardware to try to keep costs low. In addition, so that we can stay online whenever hardware fails, we keep 2 entire spare hosts (with all internal parts) and identical spare network gear on hand just a few miles away. One of the many nice parts about this setup is that, with enough planning, we hope to remain under our allocated bandwidth and power caps (10 Amps), so we basically just pay a flat fee every month, which makes budgeting and forecasting pretty easy for the business. For the record we aren’t at all against cloud hosting (we still use AWS every day in fact) or cloud pricing in general, but if you can afford the initial expenditure (which we were lucky enough to be able to do with our first few large customers) and you have the need to warrant the capacity or a similar use case, what we pay each month in hosting is a fraction of the price for 10x the capacity.

Our web apps are mostly Laravel and our APIs are all written in the Laravel micro framework Lumen - both of which we are all huge fans of (we love you Taylor, the creator of Laravel). Each web app and microservice is dockerized and runs in Rancher, including support for cross-host redundancy with almost no config required, which is really awesome. Each of our apps is stored in version control in GitHub and built in Jenkins pipelines. So far we’ve had almost no issues with this configuration, which has been pretty epic.

Now inside our rack we use many technologies to run the crux of our business: our attack correlation engine. Most of this data is persisted in MySQL, but we also heavily use RabbitMQ for queues, Redis for cache and CouchDB for NoSQL document storage. The majority of our other application stacks (outside of our web apps that run PHP) run in a combination of Python and a little Ruby.

One of the greatest things about this hybrid design, which we never even planned for, is that we can leverage our datacenter hardware for our own auxiliary apps, our massive attack correlation engine and most importantly our free tier of cheap users that never pay us. For our higher tier of paying customers we leverage a separate endpoint with full AWS multi-region support, but only for those users that pay for it.

Moving between AWS and our colocated environment is easy with an OpenVPN tunnel direct to our VPC. Once you’re inside our VPC the setup is very simple - we just use RDS Aurora (effectively a read-only replica), Redis in ElastiCache, CouchDB for NoSQL in EC2 and the same dockerized apps running in ECS. Auto-scaling is a cinch with EC2 auto-scaling groups, and provisioning is super fast with the ECS AMI in a launch config. Lastly, load balancing all runs in ALBs with health checks in EC2 target groups. This was a really easy way to have a small and inexpensive but quickly and automatically scalable environment for our top tier customers. In addition, there are no cross-dependencies for our paying customers running in AWS; each system is entirely independent in terms of serving API traffic. In the case of downtime at our datacenter our paying customers will continue to receive data as normal, and if AWS were ever to have issues in both our availability zones simultaneously, we could even make a quick DNS change and push those paying customers back into our colocated rack.
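The failover reasoning above can be written down as a small decision function. This is only a sketch of the logic, in Python: the `Health` fields and the `"aws"` / `"colo"` labels are hypothetical names for the example, and the DNS change itself is assumed to be made by an operator rather than by this code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Health:
    """Hypothetical health snapshot for the paid-tier endpoint."""
    aws_zone_a: bool
    aws_zone_b: bool
    colo: bool


def paid_tier_target(health: Health) -> str:
    """Pick where paid-tier API traffic should be pointed."""
    # AWS keeps serving as long as at least one availability zone is healthy.
    if health.aws_zone_a or health.aws_zone_b:
        return "aws"
    # Both zones are down: fall back to the colocated rack if it is up.
    if health.colo:
        return "colo"
    # Nothing is healthy, so leave DNS where it is rather than flap.
    return "aws"


print(paid_tier_target(Health(aws_zone_a=False, aws_zone_b=False, colo=True)))
```

Because each environment serves API traffic independently, a datacenter outage alone never changes the answer for paid customers; only a simultaneous failure of both AWS zones does.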

For our logging layer we are big fans of Graylog; it is also how we spawn quite a few alerts, through Graylog streams that are usually fired off to Slack. For error tracking we just recently switched to Sentry, which has been great so far, and for external monitoring we pay for Uptimerobot. We use open source Prometheus for a lot of our system monitoring as well as general business metrics, and Piwik (now called Matomo) for analytics. For payments we pass through to Stripe, which is definitely the way to go. Our beautiful API docs are all generated with the open source Ruby project Slate. Last is our distributed job management, for which we rely heavily on a large Jenkins cluster that we run in house.

Our actual honeypot application is entirely custom and is mostly Python now. We built it as a bunch of container images running in various cloud providers that all report attacks back to our on-premise infrastructure. For the honeypots we run in house, we have a segmented and entirely unconnected portion of infrastructure capable of handling this attack traffic. In our case when I say segmented I mean they share a PDU - that's about it. We actually run 1 of our 8 hosts as a single isolated machine running its own hypervisor, pfSense firewall and Docker cluster, with its own IP space and even its own physical network drop.

Part of our large Jenkins workload is constantly rebuilding images that we run on premise and in the cloud. Attacks typically don't last long because public IPs are constantly changing, and if an image is ever breached it is promptly wiped by the recurring rebuild schedule. At FraudGuard we also collect a bunch of other data with jobs that also run in Jenkins - this includes capturing IPs of open public proxies, TOR exit nodes, spam abusers, and a whole lot more.
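To make the idea of combining those collected lists concrete, here is a minimal Python sketch that merges several per-category IP lists into one lookup table. It is an illustration only: the category names, the feed format (one IP per line) and the `build_index` helper are assumptions, not a description of the real collection jobs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_index(feeds: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each IP to the set of categories it was seen in."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for category, lines in feeds.items():
        for line in lines:
            ip = line.strip()
            if ip:  # skip blank lines in a feed
                index[ip].add(category)
    return dict(index)


# Hypothetical feeds using documentation-only addresses.
feeds = {
    "open_proxy": ["192.0.2.10", "192.0.2.11"],
    "tor_exit": ["192.0.2.11", ""],
    "spam": ["198.51.100.7"],
}
index = build_index(feeds)
print(sorted(index["192.0.2.11"]))
```

Keeping the categories as a set per IP means an address that shows up in several lists is stored once and still reports every reason it was flagged.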

Pretty much the rest of the services required to run a SaaS run in AWS, including but not limited to Route 53 for DNS, CloudFront for CDN, S3 for object storage, SES and WorkMail for email, Glacier for backups, etc.

There is a lot more to FraudGuard but that's a good start. At a minimum it's been a lot of fun getting up to nearly 50M API requests per day. If you are a developer responsible for securing an application, a FraudGuard.io integration might be able to help you. If you want to chat, send us an email anytime; we can be reached at [email protected]
