"The future is already here – it's just not evenly distributed." — William Gibson
Do you like this sort of Stuff? Please support me on Patreon. I'd really appreciate it. Know anyone looking for a simple book explaining the cloud? Then please recommend my well reviewed (30 reviews on Amazon and 72 on Goodreads!) book: Explain the Cloud Like I'm 10. They'll love it and you'll be their hero forever.
$10 billion: Apple services revenue; 1.49B: Facebook daily active users; 34: cache sites for iOS rollout; 87M: paying Spotify users; 20k: Facebook's new large-scale dataset for video description as a new challenge for multi-sentence video description; 6 million: online court case dataset; 125 quadrillion: Sierra supercomputer calculations each second; 1500: per day automated chaos experiments run at Netflix; 12: neurons needed to park a car; 2025: boots on Mars; 600: free online courses; 94%: accuracy of AI lawyer; 4 billion: source code files held in Software Heritage Foundation; 33,606: Facebook headcount;
@cdixon: 1/ There have been many great software-related forum posts over the years. Some favorites…2/ Tim Berners-Lee, proposing the World Wide Web in 1991...3/ Linus Torvalds proposing Linux, also in 1991...4/ Marc Andreessen proposing images for web browsers, in 1993...5/ And, of course, Satoshi proposing Bitcoin, 10 years ago today.
Federal Register: In this final rule, the Librarian of Congress adopts exemptions to the provision of the Digital Millennium Copyright Act (“DMCA”) that prohibits circumvention of technological measures that control access to copyrighted works, codified in the United States Code.
@johncutlefish: Them: “Why can’t the teams just focus on shipping!? Maybe it is a generational thing, but these people just keep asking questions....” Me: “But you hired people who solve puzzles / problem solve for a living, what did you expect?”
David Rosenthal: The big successes in the field haven't come from consensus building around a roadmap, they have come from idiosyncratic individuals such as Brewster Kahle, Roberto di Cosmo and Jason Scott identifying a need and building a system to address it no matter what "the community" thinks. We have a couple of decades of experience showing that "the community" is incapable of coming to a coherent consensus that leads to action on a scale appropriate to the problem. In any case, describing road-mapping as "research" is a stretch.
Simon Wistow: Observability goes beyond monitoring, enabling the proactive introspection of distributed systems for greater operational visibility. Observability allows you to ask open-ended questions and have the data you need in order to explore the data to find answers. In short, observability gives you the information you need to make better decisions using real data.
@keithchambers: We use Mesos + Marathon in Yammer at Microsoft. It works fine for us. But long term we don’t want to run any of our own infrastructure (in Yammer) so we will look to move to Service Fabric or Azure Kubernetes Service. We have bigger fish to fry so we’ll be on Mesos for years.
Hannah Fry: we need something like the FDA for algorithms
Geoff Huston: The industrial revolution was certainly triggered by the refinement of the steam engine, but the social revolution was far larger in scope than the invention of a simple mechanical device. In a similar line of thought, maybe it’s not the Internet or its governance that lies at the heart of many of today’s issues. Maybe it's the broader issues of our enthusiastic adoption of computing and communications that form a propulsive force for change in today’s world.
Tony Albrecht: That’s a suspicious gap. One should never ignore a suspicious gap. Upon further investigation, we found a few lines of code used to communicate with the League client - these should have been quick. They just polled the client and instantly returned if there were no messages there. This issue was handed back to the team responsible for the code and they quickly found the problem. The polling code was waiting for at least 1 millisecond, and that polling code was called twice! We’d found our 2ms stall! The code in question had been refactored about a year ago, but wasn’t being used in game, so we hadn’t noticed the 1ms stall.
Tim Merel: This year's record $25 billion games deals over just 9 months could mean we've hit the top of the market, as detailed in Digi-Capital's new Games Report Q4 2018. Last time this happened, games investment and M&A plunged into a games deals ice age at the lowest level in a decade. The games deals cycle has swung from boom to bust twice already since 2010, so the next 6 months will determine if history repeats itself or the dollars continue to flow.
Qnovo: Statistically speaking, battery events occur at the rate of about 10 to 100 failures for every one million devices (in technical lingo, 10 to 100 ppm). This may sound like a small numerical figure, but when multiplied with the billions of devices that use batteries, the number of safety problems becomes very troubling. Yet, there are technologies that can reduce this figure by a factor of 100 (down to parts per billion or even lower). It’s time that the battery safety is taken far more seriously.
evanb: I'm privileged to be an early-science user of Sierra (and Lassen) to pursue lattice QCD calculations while the machine is still in commissioning and acceptance. It's completely absurd. A calculation that took about a year on Titan last year we reproduced and blew past with a week on Sierra.
Facebook: In Facebook’s data centers, Oomd, in conjunction with PSI metrics and cgroup2, is increasing reliability and efficiency, driving large-capacity gains and significant increases in resource utilization.
cheriot: Can we not debate the term "serverless" on every FaaS and PaaS article?
Joel Hruska: According to DigiTimes, NAND prices — which have already fallen 50 percent this year — could be set for another 50 percent decline.
tehlike: I work in mobile ads, and for fun, i injected 50ms, 100ms, 250ms, 500ms latency in an experiment. The results were not as dramatic. 500ms had less than 1% impact. 50ms looked almost positive (didn't wait enough to have it statistically significant). This is 2018. Results are rather nonlinear, and different formats with different latency requirements had different results (like formats lending themselves to advance preloading would not be impacted at all even at 500ms, but others which had tighter UX requirements would).
Ginger Campbell: The key implication is that the synapses within the mouse are quite diverse and Dr. Grant observed that it is possible that each synapse is unique.
Brian Bailey: It turns out that it has been fairly consistent across the past 15 years where around 30% of all projects were able to achieve first silicon success. This changed significantly in 2018 where only 26% of design projects were able to achieve first silicon success.
Karen Hao: The researchers found [re: the trolley problem] that countries’ preferences differ widely, but they also correlate highly with culture and economics. For example, participants from collectivist cultures like China and Japan are less likely to spare the young over the old—perhaps, the researchers hypothesized, because of a greater emphasis on respecting the elderly.
Adrian Cockcroft: As datacenters migrate to cloud, fragile and manual disaster recovery will be replaced by chaos engineering. Testing failure mitigation will move from a scary annual experience to automated continuous chaos.
@philandstuff: There's an idea out there that it shouldn't be possible to SSH into production systems. I disagree. Let's start with some common ground: I don't think that people should routinely SSH into boxes. I think infrastructure should be managed by code. I rarely SSH to machines myself. Logs should be stored and queried centrally. Same with metrics. So, people ask: what do we still need SSH for, then? I'll tell you what: debugging.
@copyconstruct: A good writeup about the rollout of Zipkin at Netflix for distributed tracing. "At 0.1% sampling, all regions combined, at peak we see 135k trace events/sec which results in 240 MB/sec network in to Kafka. Elasticsearch: 90 nodes"
Matt Klein: I think that’s where we’re breaking down right now. I think we’ve really swung too far as an industry to thinking that infrastructure is magic and it is the solution to all of our problems with infrastructure as code. Don’t get me wrong. I think we can do a lot from an infrastructure perspective and we can consolidate a lot of roles or a lot of functions that would have previously been there. I think that we’ve swung a little hard and we don’t recognize how important all of these other things are in terms of reliability, operational agility, education, documentation, support. I just don’t think we’ve focused on that enough and that’s creating problems.
Jeremy Daly: So is operations going away in a serverless world? I really don’t think so. While it’s certainly true that more adaptation will be required of them, we still need people to do things like plan and handle disaster recovery, configure and optimize managed services/databases, analyze tracing reports, replay failed events, and monitor overall system health. Sure they may need to jump in and help code once in a while, but to rely on only developers to navigate and support the complexity of serverless cloud-based applications, IMO, would be taking a huge risk.
@Obdurodon: Code (and especially comment) as though the next person reading your code will be doing so under extreme bug-hunting pressure. They won't appreciate how super-clever you were, but will appreciate context and warnings about pitfalls.
@wiredferret: #VelocityConf @krisnova: Flexibility is not freedom, it’s chaos. Conquering the chaos is where we find freedom.
@zackbloom: A @Cloudflare Workers customer increased the amount of traffic they send us by 100k requests per second today. They didn't have to do anything, we didn't have to do anything, it all just quietly, peacefully, seamlessly, works.
Matthew Barlocker: The guaranteed limit for c5.large is 823,806 packets per second and the best effort is 994,007. You could look at this like “How dare AWS throttle me!” or you could thank them for giving the extra 20.66% throughput when they’re not busy. I’m more of a glass half-full type of guy.
@aallan: The official report into the #SoyuzMS10 anomaly has concluded that the abort was caused by a sensor failure at booster separation…. External video of launch shows the anomalous booster separation (separation at 1 min 24 sec)...This is the most worrying thing about the #SoyuzMS10 anomaly. That the cause is not an isolated incident, but is part of a pattern of poor quality control issues that have now caused a number of incidents. This was just the first on a crewed launch.
@benthompson: My daughter deleted an entire report because, after only using Google Docs previously, she had to use Word. She was completely befuddled by the idea of "save", especially because she didn't have a pre-existing folder for her class in the dialog. Finally she just quit the app
@rothgar: If you're chasing Kubernetes because that's what the tech giants use you should know: Google uses Borg; Facebook uses systemd; Twitter uses mesos; Netflix uses mesos. I'm not sure what Amazon or Microsoft use but I guarantee it's not k8s
@pushrax: Shopify is on k8s. We've had to scale part of our system to many clusters purely because of API load. Overall it's working well for us. Autoscaling in progress, it's tough because of the flash sale problem. Thinking about machine learning approaches.
André Arko: Between rust-aws-lambda and docker-lambda, I was able to port my parser to accept an AWS S3 Event, and output a few lines of JSON with counters in them. From there, I can read those tiny files out of S3 and import the counts into a database. With Rust in Lambda, each 1GB file takes about 23 seconds to download and parse. That’s about a 78x speedup compared to each Python Glue worker.
Shawna Williams: Most people think that the seat of our thoughts is the cerebral cortex, says Nate Sawtell, a neuroscientist at Columbia University’s Zuckerman Institute who was not involved in the work. The new paper’s “provocative suggestion,” he says, is that “maybe what makes you you is not just the cerebral cortex, but a more distributed network.”
Trail of Bits: Attackers look for weird machines to defeat modern exploit mitigations. Weird machines are partially Turing-complete snippets of code that inherently exist in “loose contracts” around functions and groups of functions. A loose contract is a piece of code that not only implements the intended program, but does it in such a way that the program undergoes more state changes than it should (i.e. the set of preconditions of a program state is larger than necessary). We want “tight contracts” for better security, where the program only changes state on exactly the intended preconditions, and no “weird” or unintended state changes can arise.
Brian Bailey: “In the past 35 years, the semiconductor industry and the EDA/IP industry have been spending most of their precious talents and resources readying and porting designs to new process nodes, generations after generations,” states Chi-Ping Hsu, executive director at Avatar Integrated Systems. “Moore’s Law definitely brought significant integration and scaling that were never imaginable before. But at the same time, it sucked all the talent to do mostly scaling/porting/integration type of engineering, instead of new innovations.”
Conscious Entities: [Sean Carroll] maintains that the idea of Boltzmann brains is cognitively unstable. If we really are such a brain, or some similar entity, we have no reason to think that the external world is anything like what we think it is. But all our ideas about entropy and the universe come from the very observations that those ideas now apparently undermine. We don’t quite have a contradiction, but we have an idea that removes the reasons we had for believing in it. We may not strictly be able to prove such ideas wrong, but it seems reasonable, methodologically at least, to avoid them.
You know how they say distributed systems are hard? GitHub is a great example of why: outage amplification. At GitHub a 24 hour and 11 minute service degradation was caused by a mere 43 second network outage. Why? GitHub continues the proud tradition of full and transparent post-mortems: October 21 post-incident analysis.
It wasn't a configuration error as so often happens, nor was it an upgrade problem, it was the other common problem—a failure during routine maintenance. And as often happens with politics, it wasn't the incident that was the big problem, it was the cascading series of failures radiating out from the rootish cause. For more on this there's A Large Scale Study of Data Center Network Reliability: "Maintenance failures contribute the most documented incidents; 2× higher rate of human errors than hardware errors." And the conclusion brings it home: "As software systems grow in complexity, interconnectedness, and geographic distribution, unwanted behavior from network infrastructure has the potential to become a key limiting factor in the ability to reliably operate distributed software systems at a large scale."
GitHub runs their own datacenters. While replacing failing 100G optical equipment, connectivity was lost between their US East Coast network hub and their primary US East Coast data center for 43 seconds. On the network failure, Raft picked the secondary datacenter as the write primary. On connection reestablishment, all writes were directed to the secondary datacenter. Can you see the problem? A bunch of writes (30+ minutes) are now stranded at the old primary because they couldn't be replicated during the network outage. In addition, user experience degraded because database latency increased because traffic now had to flow to the west coast. So they decided to restore both sites from backups so they would be in sync and start replaying queued jobs. Backups are taken every 4 hours. So that's good. What's bad is the backups are multiple terabytes and take forever to transfer from a remote backup service and load into MySQL. At this point GitHub was madly investigating ways to speed up the process. And of course as new users woke up and started putting more write pressure on the databases, that slowed the process down too. After restoring service they then started working on resolving data inconsistencies.
Obviously this is not how GitHub wanted the process to go. So what are they going to change? It sounds like they are going to do what a lot of people do, which is put a human in the loop to determine when a fail-over should happen. It would have been a lot better to endure a 43 second failure rather than fail over. Most errors are transient and can be waited out. They also want to move to an active-active architecture. One might think Microsoft can help with that. They will also proactively test their assumptions.
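The "wait out transient failures" idea can be sketched as a small guard that only proposes a fail-over once the primary has been unreachable for longer than a grace period. This is a minimal sketch and not GitHub's tooling; the class name `FailoverGuard`, the 60-second grace period, and the caller-supplied timestamps are all assumptions made for illustration.

```python
class FailoverGuard:
    """Propose fail-over only after the primary has been down longer than a grace period."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.down_since = None  # time of the first failed health check, if any

    def observe(self, healthy, now):
        """Record one health-check result taken at time `now` (in seconds).

        Returns True when a fail-over should be proposed, to a human or to automation.
        """
        if healthy:
            self.down_since = None  # the outage ended, so forget it
            return False
        if self.down_since is None:
            self.down_since = now
        return (now - self.down_since) >= self.grace_seconds


# Hypothetical timeline: a 43-second blip checked once per second, 60-second grace period.
guard = FailoverGuard(grace_seconds=60)
blip = [guard.observe(healthy=False, now=t) for t in range(0, 43)]
recovered = guard.observe(healthy=True, now=43)
print(any(blip), recovered)
```

With these assumed numbers the blip never crosses the grace period, so no fail-over is proposed. The trade-off is that a real, lasting outage is also endured for the full grace period before anything happens.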
boulos: At first, I had assumed this was Glacier (“it took a long time to download”). But the daily retrieval testing suggests it’s likely just regular S3. Multiple TB sounds like less than 10. So the question becomes “Did GitHub have less than 100 Gbps of peering to AWS?”. I hope that’s an action item if restores were meant to be quick (and likely this will be resolved by migrating to Azure, getting lots of connectivity, etc.).
terom: Reading this post-mortem and their MySQL HA post, this incident deserves a talk titled: "MySQL semi-synchronous replication with automatic inter-DC failover to a DR site: how to turn a 47s outage into a 24h outage requiring manual data fixes"
js2: In my career, the worst outages (longest downtime) I can recall have been due to HA + automatic failover. Everything from early NetApp clustering solutions corrupting the filesystem to cross-country split-brain issues like this.
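The arithmetic behind boulos's question about restore bandwidth can be worked through quickly. This is a rough sketch under stated assumptions: the 10 TB figure is the commenter's upper bound rather than a number from GitHub, units are decimal, the link is assumed fully utilized, and the time to load the data into MySQL is ignored.

```python
def transfer_seconds(size_terabytes, link_gbps):
    """Ideal time to move `size_terabytes` (decimal TB) over a `link_gbps` link at full utilization."""
    bits = size_terabytes * 1e12 * 8
    return bits / (link_gbps * 1e9)


for gbps in (1, 10, 100):
    secs = transfer_seconds(10, gbps)
    print(f"{gbps:>3} Gbps: {secs / 60:.1f} minutes")
```

Under these assumptions 10 TB moves in about 13 minutes at 100 Gbps and takes roughly 22 hours at 1 Gbps, which is why the available bandwidth to the backup store matters so much for restore time.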
Our study spans thousands of intra data center network incidents across seven years, and eighteen months of inter data center network incidents. A Large Scale Study of Data Center Network Reliability: During this time, automated repair software fixed 99.7% of RSW (rack switch) failures, 99.5% of FSW (fabric switch) failures, and 75% of core device failures...We find the root cause of 29% of network incidents is undetermined...Maintenance failures contribute the most documented root causes (17%)...Hardware failures represent 13% of the root causes, while human-induced misconfiguration and software bugs occur at nearly double the rate (25%) of those caused by hardware failures...We observe a similar rate of misconfiguration incidents (13%)...A potpourri of accidents and capacity planning issues makes up the last 16% of incidents...Higher incident rates occur on higher bandwidth devices...Lower incident rates occur on fabric network devices...RSW incidents are increasing over time...Core devices have the most incidents, but they are low severity...Fabric networks have less severe incidents than cluster networks...Cluster network incidents increased steadily over time...Cluster networks have 2.8× the incidents versus fabric networks...Incident rates vary by 3 orders of magnitude across device types...Larger networks have longer incident resolution times...Typical edge node failure rate is on the order of months...Typical edge node recovery time is on the order of hours...There is high variance in both edge node MTBF and MTTR...Edge node failure rate is similar across most continents...Edge nodes recover within 1 day on average on all continents.
Understanding Production: What can you measure?: Production is everything. If your software doesn’t perform in production, it doesn’t perform. Thankfully there’s a range of information that you can measure and monitor that helps you understand your production system and solve any issues that may arise...A metric is just a number that tells you something about the system...A log is just a series of events that have been recorded...Instrumentation is the process of adding extra code at runtime to an existing codebase in order to measure timings of different operations...Profiling is measuring what part of your application is consuming a particular resource...Continuous Profiling is the ability to get always-on profiling data from production into the hands of developers quickly and usably...So many production systems only implement one or two of these approaches and as a result can encounter downtime or performance issues that leave a bad taste in your customers’ mouths.
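Instrumentation in the sense described above, extra code that measures the timing of an operation, can be illustrated with a small decorator. This is a minimal sketch, not the article's tooling; the `timings` store, the `timed` decorator, and the `load_profile` function are hypothetical names, and a production system would ship the numbers to a metrics backend instead of keeping them in a dictionary.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process store: operation name -> list of durations in seconds.
timings = defaultdict(list)


def timed(name):
    """Decorator that records how long each call to the wrapped function takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the call raised an exception.
                timings[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("load_profile")
def load_profile(user_id):
    time.sleep(0.01)  # stand-in for real work
    return {"id": user_id}


load_profile(42)
durations = timings["load_profile"]
print(f"calls={len(durations)} max={max(durations) * 1000:.1f}ms")
```

The recorded durations are a metric in the article's terms: a number that tells you something about the system, here the latency of one operation.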
MySQL Cluster 7.6.8 performance jump of up to 240%: One interesting thing was that I found a bitmask that had zeroing of the bitmask in the constructor, it turned out that this constructor was called twice in filtering a row and neither of them was required. So fixing this simple thing removed about 20 ns of CPU usage and in this case about 3-4% performance improvement...One of the biggest reasons for bad performance in modern software applications is instruction cache misses...In this code path I was optimising I found that I had roughly 1 billion instruction cache misses over a short period (20 seconds if I remember correctly). I managed with numerous changes to decrease the number of instruction cache misses to 100 million in the same amount of time. I also found some simple fixes that cut away a third of the processing time. In the end I found myself looking at the cost being brought down to around 250ns. So comparing the performance of this scan filtering with 7.5.10 we have optimised this particular code path by 240%.
Google says almost everywhere you have a configuration parameter you should use machine learning. Walmart applies the same idea, just in a different domain. Data Science In Walmart Supply Chain Technology: Delivery Promising, Order Sourcing, Picking Optimization, Packing Optimization, Lane Planning (Mixed Integer Programming Problem), Continuous Moves (Combinatorial Optimization Problem), DC to Store Delivery Routing and Scheduling (VRPTW), Grocery Delivery Routing and Scheduling (VRPTW and Assignment Problem), Map Routing (Graph Theory — Shortest Path Problem), Supply Shaping, Last Mile Order Pickup, Customer Waiting Time Estimate and Control (Queuing Theory).
How can you figure out all the things that can go wrong? This is not new stuff. On one five 9s project we went through a process called Failure modes and effects analysis (FMEA) on every layer of software and hardware. Brutal, but useful. Adrian Cockcroft: "Chaos Engineering - What is it, and where it's going" - Chaos Conf 2018. Divides failures into 4 layers: operations; application; software stack; infrastructure. Infrastructure failures: Device failures; CPU failures; Datacenter failures; Internet failures. Software stack failures: Time bombs; Date bombs; Expiration; Revocation; Exploit; Language bugs; Runtime bugs; Protocol problems. Application failures: Time bombs; Date bombs; Content bombs; Configuration; Versioning; Cascading failures; Cascading overload; Retry storms. Operations failures: Poor capacity planning; Inadequate incident management; Failure to initiate incident; Unable to access monitoring dashboards; Insufficient observability of systems; Incorrect corrective actions.
Peloton aims to support large-scale batch jobs, e.g., millions of active tasks, 50,000 hosts, and 1,000 tasks per second. Each month, these clusters run three million jobs and 36 million containers. Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads: There are four main categories of compute cluster workloads used at Uber: stateless, stateful, batch, and daemon jobs...Co-locating diverse workloads on shared clusters is key to improving cluster utilization and reducing overall cluster costs...Resource overcommitment and job preemption are key to improving cluster resource utilization...As Uber’s services move towards an active-active architecture, we will have capacity reserved for disaster recovery (DR) in each data center...Uber’s online workloads spike during big events, like Halloween or New Year’s Eve...Different workloads have resource profiles that are often complementary to each other...Peloton is built on top of Mesos, leveraging it to aggregate resources from different hosts and launch tasks as Docker containers...To achieve high-availability and scalability, Peloton uses an active-active architecture with four separate daemon types: job manager, resource manager, placement engine, and host manager...All four daemons depend on Apache Zookeeper for service discovery and leader election...With Peloton, we use hierarchical max-min fairness for resource management, which is elastic in nature...Every resource pool has different resource dimensions, such as those for CPUs, memory, disk size, and GPUs...We are also planning to migrate Mesos workloads to Kubernetes using Peloton, which would help Peloton adoption in the major cloud services, as Kubernetes already enjoys extensive support in that realm...Peloton supports distributed TensorFlow, gang scheduling, and Horovod...Peloton has its own Apache Spark driver, similar to those used for YARN, Mesos, and Kubernetes...Autonomous vehicle workloads...Peloton runs large batch workloads, such as Spark and distributed TensorFlow, for Uber’s Maps team...Uber’s Marketplace team runs the platform that matches drivers with riders for our ridesharing business and determines dynamic pricing.
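The max-min fairness mentioned above can be illustrated for the simplest case. This is a sketch only: it is flat rather than hierarchical, handles a single resource dimension, and is not Peloton's implementation; the function name `max_min_fair` and the pool names and demands are made up for the example.

```python
def max_min_fair(capacity, demands):
    """Split `capacity` across pools so that no pool gets more than it asked for
    and any unused share is redistributed to pools that still want more."""
    allocation = {pool: 0.0 for pool in demands}
    unsatisfied = set(demands)
    remaining = float(capacity)
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)  # equal split of what is left
        for pool in sorted(unsatisfied):
            grant = min(share, demands[pool] - allocation[pool])
            allocation[pool] += grant
            remaining -= grant
        # Pools whose demand is now met drop out of the next round.
        unsatisfied = {p for p in unsatisfied if demands[p] - allocation[p] > 1e-9}
    return allocation


# Hypothetical pools competing for 100 CPUs.
print(max_min_fair(100, {"batch": 20, "stateless": 50, "stateful": 80}))
```

In this example the small pool gets its full demand of 20 and the other two split the remaining 80 evenly. The elasticity the article mentions shows up when total demand is below capacity: every pool simply gets what it asked for and the rest stays free.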
Modern war is going to the drones. The Era of the Drone Swarm Is Coming, and We Need to Be Ready for It: the more drones in a swarm, the more capable the swarm. Larger underwater swarms can cover greater distances in the search for adversary submarines or surface vessels. Larger swarms can better survive some defenses. The loss of a dozen drones would significantly degrade the capabilities of a twenty-drone swarm, but would be insignificant to a thousand-drone swarm...A future drone swarm need not consist of the same type and size of drones, but incorporate both large and small drones equipped with different payloads. Joining a diverse set of drones creates a whole that is more capable than the individual parts. A single drone swarm could even operate across domains, with undersea and surface drones or ground and aerial drones coordinating their actions...Customizable drone swarms offer flexibility to commanders, enabling them to add or remove drones as needed. This requires common standards for inter-drone communication, so that new drones can easily be added to the swarm. Similarly, the swarm must be able to adapt to the removal of drones, either intentionally or through hostile action...Advances in technology may also harden the swarm against electronic warfare vulnerabilities. Novel forms of communication may weaken or entirely eliminate those vulnerabilities. For example, drone swarms could communicate on the basis of stigmergy. Stigmergy is an indirect form of communication used by ants and other swarming insects.
google-research/bert (paper): BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
facebookresearch/Horizon: Horizon has allowed us to improve the image quality of 360-degree video viewed on Facebook by optimizing bit rate parameters in real time. The platform factors in both the amount of available bandwidth and the amount of video that’s already been buffered, to determine whether it’s possible to shift to higher quality video. This process takes advantage of RL’s ability to create incentives in the moment, using new and unsupervised data — it works while a given video is playing, rather than analyzing performance and carefully annotated data after the fact.
OCTOPTICOM: an open-ended puzzle programming game about designing and optimizing optical computing devices. Use lasers, mirrors, filters and other components to read, transform and write sequences of colored squares.
Efficient Synchronization of State-based CRDTs: In this paper we: 1) identify two sources of inefficiency in current synchronization algorithms for delta-based CRDTs; 2) bring the concept of join decomposition to state-based CRDTs; 3) exploit join decompositions to obtain optimal deltas and 4) improve the efficiency of synchronization algorithms; and finally, 5) evaluate the improved algorithms.
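To make the terms concrete, here is a minimal sketch of a state-based grow-only counter, with a join and a delta built from single-entry pieces of the local state. It is an illustration of the general idea only, not the paper's algorithms; the function names and replica ids are assumptions.

```python
def join(a, b):
    """Join (least upper bound) of two grow-only counter states: per-replica maximum."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def value(state):
    """The counter's value is the sum of every replica's contribution."""
    return sum(state.values())


def delta(local, remote):
    """Smallest state that, joined into `remote`, brings it up to date with `local`.

    Built from single-entry pieces of `local`, keeping only those `remote` has not seen.
    """
    return {r: n for r, n in local.items() if n > remote.get(r, 0)}


a = {"A": 5, "B": 2, "C": 7}
b = {"A": 5, "B": 3}
d = delta(a, b)  # only the entry for "C" needs to travel
merged = join(b, d)
print(d, merged == join(a, b), value(merged))
```

Shipping `d` instead of the full state `a` gives the receiver the same result after the join, which is the kind of saving delta-based synchronization is after.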
Dissecting Apple’s Meta-CDN during an iOS Update: Content delivery networks (CDN) contribute more than 50% of today’s Internet traffic. Meta-CDNs, an evolution of centrally controlled CDNs, promise increased flexibility by multihoming content. So far, efforts to understand the characteristics of Meta-CDNs focus mainly on third-party Meta-CDN services. A common, but unexplored, use case for Meta-CDNs is to use the CDN's mapping infrastructure to form self-operated Meta-CDNs integrating third-party CDNs. These CDNs assist in the build-up phase of a CDN’s infrastructure or mitigate capacity shortages by offloading traffic...We found Apple to use two major CDNs, Akamai and Limelight, as well as its own content delivery infrastructure. Our findings are two-fold: First, we detected 34 cache sites, and second, we determined the internal structure of the cache sites by analyzing HTTP headers.
RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor: RobinHood dynamically allocates cache space to those backends responsible for high request tail latency (cache-poor backends), while stealing space from backends that do not affect the request tail latency (cache-rich backends). In doing so, RobinHood makes compromises that may seem counter-intuitive (e.g., significantly increasing the tail latencies of certain backends).
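The cache-rich to cache-poor reallocation described above can be sketched as a single rebalancing step. This is a loose illustration and not the paper's algorithm: the 1% `tax_rate`, the use of a per-backend "blocking count" (how often a backend was the slowest part of a tail request) as the measure of cache poverty, and the backend names are all assumptions.

```python
def reallocate(cache_mb, blocking_counts, tax_rate=0.01):
    """Take `tax_rate` of every backend's cache and hand the pooled space back
    in proportion to how often each backend blocked a slow request."""
    pool = 0.0
    after_tax = {}
    for backend, size in cache_mb.items():
        tax = size * tax_rate
        after_tax[backend] = size - tax
        pool += tax
    total_blocking = sum(blocking_counts.get(b, 0) for b in cache_mb)
    if total_blocking == 0:
        return dict(cache_mb)  # nobody is cache-poor, so leave the allocation unchanged
    return {
        b: after_tax[b] + pool * blocking_counts.get(b, 0) / total_blocking
        for b in cache_mb
    }


# Hypothetical backends: "search" blocks most tail requests, "users" never does.
sizes = {"users": 1000.0, "ads": 1000.0, "search": 1000.0}
print(reallocate(sizes, {"search": 90, "ads": 10}))
```

Total cache space is conserved. A backend that never blocks tail requests only ever loses space, which is how its own tail latency can get worse while request tail latency improves.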
Applying Deep Learning To Airbnb Search: So would we recommend deep learning to others? That would be a wholehearted Yes. And it’s not only because of the strong gains in the online performance of the model. Part of it has to do with how deep learning has transformed our roadmap ahead. Earlier the focus was largely on feature engineering, but after the move to deep learning, trying to do better math on the features manually has lost its luster. This has freed us up to investigate problems at a higher level, like how can we improve our optimization objective, and are we accurately representing all our users? Two years after taking the first steps towards applying neural networks to search ranking, we feel we are just getting started.