1. The End of the Router?
Hannes Gredler
hannes@rtbrick.com
2. • Past: Director Product Management with Juniper
• 18 years work experience @ Juniper, Cisco,
Tellium, Infinera & Tata Consultancy
• MS (ECE), Rutgers NJ, MBA IIM Ahmedabad
• Software expertise in platform infrastructure,
applications, OO and component software
• Global PLM exposure with market segments
using advanced routing technologies
• 2 Patents, 3 Publications
• Past: Distinguished Engineer with Juniper
• 19 Years working experience, developing,
deploying and supporting Network Software
• Expertise: BGP, Link-state IGPs, MPLS
• 20+ Patents
• 20+ Proposed Standards, RFCs
• http://www.arkko.com/tools/allstats/hannesgredler.html
• IETF WG chair (IS-IS)
Exploring the Wild, Wild West
Pravin Bhandarkar
CEO & Founder
Hannes Gredler
CTO & Founder
3. The “path of the more”
- Chassis Based
- Proprietary Base OS
- Closed databases
- One/few Management planes
- Hardware Optimized
- Curated Software Release
- Waterfall Development
- Black-box tests
The “path of the less”
- Pizza Box Based
- Linux (ONL, Ubuntu)
- noSQL Type Value Stores
- Many Management planes
- Developer Optimized
- micro-Service Architecture
- Continuous Integration
- Integration/Acceptance tests
4. BUILD A SYSTEM OF LITTLE BRICKS
• Microservice architecture / UNIX pipeline model
• Small pieces of software, serving a unique purpose
• Easy transfer of state from one stage to next
• Every node is a filter / transformer
[Diagram: I/O Handler → Input policy → BestPath Selection → Output policy → I/O Handler]
iod -p bgp -h 192.168.1.1 | bgp_policy_i -f customerA-in | bgp_best_path_default | bgp_policy_o -f core-out | iod -p bgp -h 192.168.1.2
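As a sketch of the filter/transformer idea (a hypothetical illustration, not the actual rtbrick tooling), a pipeline stage can be a small process that parses state objects from the previous stage, applies its one purpose, and re-serializes the survivors for the next stage:

```python
import json

def policy_in(route):
    # Hypothetical input policy: drop routes whose next hop is in 10.0.0.0/8.
    if route["next_hop"].startswith("10."):
        return None
    return route

def run_stage(lines):
    # One pipeline stage: parse each JSON object coming from the previous
    # stage, apply the policy, and re-serialize survivors for the next stage.
    out = []
    for line in lines:
        route = policy_in(json.loads(line))
        if route is not None:
            out.append(json.dumps(route))
    return out
```

Because each stage only speaks a serialized state format on its input and output, stages compose exactly like UNIX pipeline commands.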
7. • Prevailing resiliency model
• Large units, 1:1 box redundancy, hard-state coupling, bi-polar availability
• Failure is the exception
• Swarm-based resiliency model
• Small units, network redundancy, soft-state coupling, gradual degradation
• Failure is the norm
RESILIENCY
13. Net-Centric 2016
• Issue: lack of abstraction / model
• Every present management protocol is manually bolted onto some implementation of some state-keeping entity
• All bolt-on management protocols are condemned to lag (and hence fail)
MANAGEMENT PLANE (1)
17. •Bit-optimized, binary protocols from the 80s
• Emphasis on state-representation rather than state-flow
• Micro-optimization thinking creeping through system design
• Re-inventing the wheel for every protocol
• FSM, In-memory-DB, Serializer/De-serializer, Flow-Control
• No sense of parallelism, horizontal scaling and fault-domains
CONTROL-PLANE
18. Feature Development time
ON PRODUCTIVITY
• Only 5% of time writing Application code
• Lots of repetitive tasks (re-inventing the wheel)
• IMDB, HA, config processing and UI management carry a perpetual cost
[Diagram: development time split across In-Memory Database, Config Parser, diff processing (“figure out what to do”), Proto/IPC Parser, UI/Mgmt, HA, and transfer to SQA / bug triaging / early support cycle, with App Logic only a small slice]
19. Hardware vs. Dev-Time optimization?
[Diagram, today: Vendor (Development, SQA-Test) hands off to Service Provider (Acceptance-Test, Deploy); Time to Revenue (TTR) O(months)]
[Diagram, proposed: Vendor & Service Provider share one Dev → Test → Deploy loop; Time to Revenue (TTR) O(days)]
21. Database centric / Distributed Data Store
bds://local/bgp.neighbor
bds://local/isis.adj
bds://local/isis.lsdb.l2
bds://217.160.181.216/bgp.rib-in
PUBSUB
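The pub/sub idea behind such a data store can be sketched in a few lines (a toy model with invented names; the real BDS API is not shown here): consumers subscribe on a table path, and every insert is fanned out to them:

```python
class DataStore:
    """Toy table store with publish/subscribe, in the spirit of bds:// paths."""

    def __init__(self):
        self.tables = {}       # path -> list of objects
        self.subscribers = {}  # path -> list of callbacks

    def subscribe(self, path, callback):
        self.subscribers.setdefault(path, []).append(callback)

    def publish(self, path, obj):
        # Insert the object, then fan it out to every subscriber of the table.
        self.tables.setdefault(path, []).append(obj)
        for cb in self.subscribers.get(path, []):
            cb(obj)

store = DataStore()
adjacencies = []
store.subscribe("local/isis.adj", adjacencies.append)
store.publish("local/isis.adj", {"neighbor": "router-b", "state": "up"})
```

The producer never knows who consumes its state, which is what makes the components independently replaceable.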
25. • No true programmability
• Architecture centered around IP / MPLS and ACLs thereof
• Hard-coded pipelines
• No notion of “code”-like branching, nesting, looping
• Forwarding tables oversized
• Every prefix may carry 100% of link speed. Really?
• No support for hardware/software hybrids
• Lack of host-path bandwidth
DATA-PLANE
27. • All features packaged at compile time
• Slows down everybody else
• Black-box testing
• Test matrix grows N^2
[Diagram: statically compiled monolithic NOS bundling IS-IS, BGP, RSVP, LDP, Netflow, Sflow, OSPF, TRILL, STP, PIM, L3VPN, L2VPN, SR]
CURATED RELEASE MODEL (1)
28. • Fix: plugin architecture
• Load plugins at runtime
• Load (& pay for) what you need
• Faster test qualification
• Only a small subset of the test matrix executed
[Diagram: modular NOS with core infra (DB, IPC, PKG) and IS-IS, BGP, SR, Netflow as dynamically loaded libraries]
CURATED RELEASE MODEL (2)
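The plugin idea can be sketched as follows (a toy registry in Python; a real NOS would load shared libraries via dlopen or a package manager, and the class name here is invented):

```python
import importlib

class Nos:
    """Toy modular NOS: protocol plugins are registered at runtime."""

    def __init__(self):
        self.plugins = {}

    def load_plugin(self, name, module_path=None):
        # In a real system this would dlopen() a shared object shipped as its
        # own package; here we simply import a Python module by name.
        module = importlib.import_module(module_path or name)
        self.plugins[name] = module
        return module
```

Only the plugins actually loaded need to be qualified together, which is what shrinks the test matrix.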
30. The Lindy effect is a theory of the life expectancy of non-perishable things, like a technology or an idea: every additional day of survival may imply a longer remaining life expectancy.
THE LINDY EFFECT (1)
Dear <name>, thanks for giving me the opportunity to present my upcoming venture, called ‘rtbrick’.
As the name ‘brick’ implies, we want to build modular, scalable routing software targeted at a disintegrated network market.
As a side effect this also deviates from the prevailing industry resiliency model.
Because there can always be more than one worker for a given state-keeping entity, failure of a component is no longer the exception but the norm: new processes join the swarm, old ones leave, and so on.
If the system is built from small components that may fail at any time, and merely degrades when one of them does, the result is a very robust system.
An engineer from the 90s is obsessed with performance at the expense of abstraction and re-usability.
Therefore a lot of the major state-keeping modules are kept together and often compiled to run in one process. There is a high price to pay for the ease of state-sharing between components:
all state-handling entities become one large, fragile fault-domain.
Furthermore, multiple CPU cores can only be leveraged by making the code even more brittle, using multithreading inside that single process.
Let's start with the management plane:
The observation is that every management protocol (whether it is SNMP, NETCONF or anything else) constantly trails the original feature or functionality.
Why is this? The issue is a general lack of abstraction across all state-keeping components inside a router.
Because of this, every management method (add/change/delete/show) has to be manually coded and bolted on top of every state-keeping entity; worse, this has to be repeated for every management method in every management protocol.
Even assuming just one management method per management protocol, the dilemma is clear: one has to code every connector between a management method and a state-keeping entity.
The anticipation is that the number of such connections in a modern router is on the order of thousands.
So the fix for this is actually straightforward: rather than bolting every management method onto its state-keeping entity, one first stores all state in a fast in-memory database.
Management methods are then implemented simply by routing them to the correct element in the database; done right, this can even happen at runtime, in the field, by the network operator. More importantly, the any-management-method to any-state-keeper full mesh has been broken down into a star, which is a much smaller problem to tackle: one no longer writes management methods, only generic connectors, one per management protocol.
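The star topology can be sketched like this (all names hypothetical): every state keeper writes into one central store, and a single generic connector per management protocol routes any management method against it:

```python
class StateStore:
    """Central in-memory store that all state keepers write into."""

    def __init__(self):
        self.state = {}  # (table, key) -> object

    def add(self, table, key, obj):
        self.state[(table, key)] = obj

    def delete(self, table, key):
        self.state.pop((table, key), None)

    def show(self, table):
        return {k: v for (t, k), v in self.state.items() if t == table}

def generic_connector(store, method, table, key=None, obj=None):
    # One connector per management protocol, not one per state keeper:
    # it just routes the management method to the store.
    if method == "add":
        store.add(table, key, obj)
    elif method == "delete":
        store.delete(table, key)
    elif method == "show":
        return store.show(table)
```

A new management protocol now costs one connector instead of one bolt-on per state-keeping entity.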
Which gets us to the state-keeping part of the control-plane ...
I could argue that the protocols we deal with here are bit-optimized binary protocols from the 80s, but again, per the Lindy effect, these protocols will be around for quite some time.
Perhaps my biggest point about control-plane protocol implementation is that many implementors fall into the trap of internalizing IETF specifications, which put more emphasis on the "representation" of state than on how state should flow through the system.
As such, implementors often miss the parts common to all protocols: protocol parsers, hierarchical protocol encoders, in-memory databases, state compression, and so on.
Furthermore, there is no sense of parallelizing operations, and that has an important impact on both resiliency and even utilization of all CPU cores of a given system.
Most routing stacks I have seen are C/C++ based, and the size of the software module (> 1.5M LoC) has exceeded any practical reason.
Furthermore, there is no CASE tooling such as model-based code generation; everything is hand-crafted.
An unsolved problem for all but one vendor (Alcatel-Lucent TiMOS) is that they do not utilize more than one CPU core for a given protocol. In other words, the software gets slower every year.
The story on utilizing more than one CPU core is not convincing: most vendors need to fundamentally rewrite their software to take advantage of the new normal.
Even mobile phones today have 4 cores or more.
But the worst thing is the lack of whitebox testing and the split responsibility between software developer and software tester.
Today's testing is black-box testing: the entire router is deployed in a heavy-weight lab or in a heavy-weight (Type-1 hypervisor) virtual environment, and test scripts generate external stimulus and check internal state transitions.
The practical problem is that the person doing the development is not the one running the test, which leads to interfacing issues when, e.g., the code is of bad quality. This interfacing between parties translates into additional latency.
The inherent promise of DevOps is to reduce the time-to-revenue of new network functionality, which today is on the order of months.
Using our system, new network functionality can be developed, tested and rolled out in a few days.
The core of the system is a distributed data store which holds every piece of state in the system. Whether it is interface information, IS-IS adjacencies, BGP RIBs or forwarding tables, every state is stored in the back store.
The Brick Data Store (BDS) is a model-driven indexing and data-replication vehicle.
BDS has been designed for speed and consistency: data insertion can progress as fast as 1M updates per second per CPU core.
The back store is configured by defining tables, objects and their attributes, akin to an SQL server, using a JSON config file.
The back store also allows one to quickly locate the originator of an object and to generate local replicas of that data for local (in-situ) processing.
BDS is fully horizontally scalable: if there are large tables (for example RIB-ins), it can shard the workload across a set of worker processes, all without changing a single line of code.
This database-centric design allows us to do all the cool HA things, like live software upgrades and restarting components without loss of service. Furthermore, this design paradigm greatly minimizes the amount of boiler-plate code one has to write to develop new networking code.
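Two of the ideas above, a JSON table definition and hash-based sharding across workers, can be illustrated as follows (the schema fields and function names are invented for illustration, not the actual BDS schema):

```python
import json

# Hypothetical table definition in the spirit of a JSON-configured back store.
SCHEMA = json.loads("""
{
  "table": "bgp.rib-in",
  "key": "prefix",
  "attributes": ["prefix", "next_hop", "as_path"]
}
""")

def shard_for(key, num_workers):
    # Pick a worker for an object key; resharding only changes num_workers,
    # not the application code that inserts objects.
    return hash(key) % num_workers

def insert(shards, obj, num_workers):
    # Validate against the declared attributes, then place in a shard.
    assert set(obj) <= set(SCHEMA["attributes"])
    shards.setdefault(shard_for(obj[SCHEMA["key"]], num_workers), []).append(obj)
```

Because the sharding function lives in the store rather than in the application, scaling out is a configuration change, not a code change.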
Now let's come to the data plane. The main issue I see here is the lack of true programmability: everything existing is centered around the three data-plane protocols, and whenever you want to add some new combination (e.g. MPLS context tables), you need a re-spin of the ASIC. It would be desirable to treat a data-plane forwarding path as flexibly as code, with all the usual concepts like branching, looping and subroutines.
I also think most current data-planes are overdone (read: too expensive); a lookup hierarchy with hot, cold and software-served prefixes would again speed up getting new functionality into the network.
My last point of criticism is the concept of a system-wide "release".
The prevailing model is a statically compiled binary with all features, even the ones that you do not need or want. The main disadvantage is that you need to test the full feature matrix and cannot take any shortcuts, as you do not know which features will be activated.
A full-release test cycle may take several weeks to qualify.
In any modern software system you have a wide variety of software packages and patches, and usually a good package manager that keeps the zoo of combinations under control. Why can we not package state-keeping entities as small packages that interact with each other via some extensible state-exchange protocol?
Web architectures with JSON as their state-sharing vehicle come to mind.
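A sketch of JSON as the state-sharing vehicle between two small packages (the component and function names are invented for illustration):

```python
import json

def isis_export_adjacency(neighbor, state):
    # Producer: a hypothetical IS-IS package serializes an adjacency as JSON.
    return json.dumps({"proto": "isis", "neighbor": neighbor, "state": state})

def topology_import(message, topology):
    # Consumer: a separate topology package deserializes and stores it.
    # Unknown fields are simply ignored, which keeps the exchange extensible.
    msg = json.loads(message)
    topology[msg["neighbor"]] = msg["state"]
    return topology
```

Either side can be repackaged or upgraded independently, as long as both keep speaking the same self-describing state format.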
Something called the Lindy effect applies to all technologies and ideas.
It says that the longer a technology or idea has been around (and proven useful), the longer it is ultimately going to stay around. Call it human inertia or whatever; it is a very empirical effect.
Automobiles are a good example of the Lindy effect:
the concept of an engine-powered carriage on 4 wheels that gets one or more persons from A to B has been around now for more than 100 years.
What has changed is all the components inside: transmission, fuel, autonomous piloting software, entertainment systems, and how these cars are engineered and designed.
What has not changed is the original idea (and also the favorite color of buyers).
If Lindy is right, the core idea of the automobile is going to be around for another 100 years.
If Lindy is right, the core idea of the router is going to be around for another 30 years; however, just like the automobile, a lot has to change in its components.
How can the networking industry evolve? I think we are right now witnessing an industry transformation that has happened in other verticals as well. In the 60s the car industry was in control of its designs and everything between steering wheel, engine and tires.
The car industry figured out that it could not innovate fast enough, and hence opened its eco-system to suppliers, each specialized in their domain.
Today they are just owners of the brand and the integration process; pretty much all moving parts come from suppliers.
Opening the full eco-system of components to suppliers would mean that the three planes (management, control and data plane), respectively steering wheel, engine and tires, get designed and implemented using best-in-class suppliers in each segment.
I am convinced that if the brand owners in the networking space do not open up their eco-systems, then for sure their competition will.
If that happens, then innovation is coming back again, and the answer to "have we reached the end of the router?" is hopefully going to be "No".
Thank you !