Integratingplabintopgeni

This page contains a brain dump of some of Main.DavidJohnson's initial thoughts on using parts of the PlanetLab code for ProtoGeni (formerly located at bas:~johnsond/pgeni/pl.src.notes).

This file describes key bits of the PlanetLab components that we may want to leverage for GENI, and how we'll hack them and integrate our current stuff to get the best of plab and elab into protogeni.

General issues

Does it make sense for NMs to run on components that the user can entirely blow away? Out of necessity, at least the MA clearly must be able to exert some control outside the NM. Does that control go in the NM, or somewhere else?

BootCD

(Looks like devs may eventually change the BootCD build approach; they don't seem to know how yet.)

Basically, all the boot cd does is boot into an mfs so that the boot manager can take over. We can do anything we want here as long as we invoke the boot manager... may as well use theirs to start, though!

For the future, since we probably want to boot other OSs, we'll want to integrate our support for writing a "boot header" onto disk. But, we don't want to use the freebsd bootloader (need to use linux as our mfs to maximize our chance of booting into the debug/setup env on any/all machines). Thus, we would have to figure out how to shoehorn boot info into the grub bootblock, and hack grub to do the right thing. The reason to use grub is that it can boot any current OS we might want to boot (freebsd, linux, windows).

Actually, we could just keep using the freebsd loader as the primary loader, keep some state in the MBR (probably less than we do today), and then figure out where to boot from: network, disk, or debug (usb/cd). At least this should work, assuming the bsd loader always works to jump to a loader in diskN:partM.

Also, what I think we're going to want to do is have two usb dongles in each machine. One will be write-protected, and will contain the static debug/setup OS and tools. The idea is that these will never (rarely) change. The second dongle will be large, and will not be write-protected. It will act as a cache for node config info, disk images, overlay software (e.g., tarballs to deploy in slivers, atop disk images, etc). The hashes for these files will be stored at Emulab, or the GMC (whichever ends up making more sense), and as the node boots, the BootManager (and perhaps NodeManager, later in the process) can check the remote hash against the cached contents. Of course, anything stored in the cache must be fetchable at any time from Emulab or GMC.
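A minimal sketch of that hash check (the cache mount point, hash algorithm, and helper name below are made up for illustration):

    import hashlib
    import os

    CACHE_DIR = "/mnt/cache"   # hypothetical mount point for the writable dongle

    def cached_copy_is_valid(name, remote_hash, algo="sha1"):
        """Compare a cached file's hash against the hash published by
        Emulab/GMC; trust the cached copy only if they match."""
        path = os.path.join(CACHE_DIR, name)
        if not os.path.exists(path):
            return False
        h = hashlib.new(algo)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest() == remote_hash

    # Usage: if the cached copy is stale or missing, fall back to fetching
    # the file from Emulab/GMC (fetch helper not shown).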

Note, we can store network settings for the node in the cache, all except the node_id and the node_key. Those must be in a place where the node can always find them, even if the cache dongle is wiped (e.g., on the write-protected dongle).

This process will be useful for any of our non-cluster nodes (in emulab or geni), including local and remote wireless nodes.

BootManager

Lots of hardware detection code; the most important part is the code that detects which linux kernel modules are needed.

The BootManager authenticates with PLC on each API call by computing an HMAC, keyed with the node_key, over a serialized version of the call's parameter list. The auth structure includes this HMAC, the node_id, and the node's IP.
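A sketch of what that per-call auth might look like on our side (the serialization and field names below are guesses; the real BootManager code defines the exact canonical form):

    import hashlib
    import hmac

    def make_node_auth(node_id, node_ip, node_key, params):
        """Build the auth struct for one API call: an HMAC keyed with
        node_key over the serialized parameter list."""
        serialized = "".join(str(p) for p in params)   # placeholder serialization
        digest = hmac.new(node_key.encode(), serialized.encode(),
                          hashlib.sha1).hexdigest()
        return {
            "AuthMethod": "hmac",   # field names are illustrative only
            "node_id": node_id,
            "node_ip": node_ip,
            "value": digest,
        }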

The BootManager does require an HTTP boot server to upload log files to (and probably for other things too).

If we allow blowing away the disk, we need to save off some persistent files that the BootManager depends on (e.g., "configuration", which stores "environment" variables).

When it runs, the BootManager takes a single argument that defines the state it must get the node into (i.e., new, inst, rins, boot, dbg); the BootManager then drives the node to the desired state via a bunch of logical "steps" (sketched after the list below). For now, the only really important things about these steps are:

  • what kinds of external communication they necessitate (esp with PLC);
  • what kind of state they store on the disk persistently across boots;
  • what kind of vserver/linux specific stuff is embedded in the logic and setup.
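To make the state/step structure concrete, here is a rough sketch of the driver loop; the step names and the per-state step lists are invented, not the real BootManager steps:

    # Hypothetical steps; the real BootManager has many more, and the
    # per-state lists below are illustrative only.
    def check_hardware(vars): ...
    def install_bootstrapfs(vars): ...
    def write_node_config(vars): ...
    def chain_boot(vars): ...
    def start_debug(vars): ...

    STEPS = {
        "new":  [check_hardware, install_bootstrapfs, write_node_config, chain_boot],
        "inst": [check_hardware, install_bootstrapfs, write_node_config, chain_boot],
        "rins": [install_bootstrapfs, write_node_config, chain_boot],
        "boot": [write_node_config, chain_boot],
        "dbg":  [start_debug],
    }

    def run(target_state, vars):
        for step in STEPS[target_state]:
            step(vars)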

PLC API function calls:

Local persistent files:

  • AUTH_FAILURE_COUNT_FILE (counts failed auths, so that the node can be moved into the dbg state after N failures)
  • "configuration"
  • /etc/planetlab/session (?)
  • ( there are others ... )

BootServer HTTP calls:

  • getnodeid.php (gets nodeid based on mac addrs on the node, IF the nodeid wasn't in the config file on the usbkey)
  • ( there are others, I just didn't write them all down. )

Basically, to support booting entire disk images, we are going to have to do a more generic job of sending a configuration file to the node so that it knows what to do (instead of just checking for/setting up lvm, vservers, etc). Like tmcc, but hopefully better. Why not just xmlrpc-ize the process (see the sketch below)? Making the linux-specific, lvm, and vserver code optional depending on the node config shouldn't be too hard... but I need to think more about the right way to specify and abstract it, to maximize future node configurability (plus support heterogeneous platforms, perhaps?).
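For example, something like the following (the call name, return shape, and server URL are all made up; the point is just that the node asks for a config blob and only runs the pieces it is told to):

    import xmlrpc.client

    def configure_node(auth):
        """Ask the server what this node should become, then run only the
        pieces (lvm, vservers, ...) the returned config asks for."""
        server = xmlrpc.client.ServerProxy(
            "https://boss.example.net/protogeni/xmlrpc")   # hypothetical URL
        config = server.GetNodeConfiguration(auth)          # hypothetical call
        # e.g. config == {"disk_image": "FEDORA8-STD", "use_lvm": False,
        #                 "vservers": False, "overlays": ["tools.tar.gz"]}
        if config.get("use_lvm"):
            setup_lvm(config)
        if config.get("vservers"):
            setup_vservers(config)
        return config

    def setup_lvm(config): ...        # placeholders for the existing plab logic,
    def setup_vservers(config): ...   # which becomes optional per-node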

I think we should split the BootManager into two parts: a basic BootManager, which handles the initial boot, checking in with control, etc., and which then hands off control to a ConfigurationManager that brings the node into whichever mode(s?) are required. This should be able to support multiple reboots during a single configuration process. Then it's easy to support many different configuration backends... and multiple ones at the same time.
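A tiny sketch of the hand-off side, assuming the ConfigurationManager checkpoints its progress (the file location and format are invented) so it can survive reboots in the middle of a configuration run:

    import json
    import os

    STATE_FILE = "/mnt/cache/config_progress.json"   # hypothetical location

    class ConfigurationManager:
        """The proposed second half of the BootManager: owns the multi-step,
        possibly multi-reboot, configuration process."""

        def __init__(self, node_config):
            self.config = node_config
            self.done = self._load()

        def _load(self):
            if os.path.exists(STATE_FILE):
                with open(STATE_FILE) as f:
                    return json.load(f)
            return []

        def checkpoint(self, step_name):
            # Persist progress so a reboot mid-configuration resumes where
            # it left off instead of starting over.
            self.done.append(step_name)
            with open(STATE_FILE, "w") as f:
                json.dump(self.done, f)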

BootstrapFS

This is the fedora-based tarball that the BootManager downloads from PLC and dumps onto the node. We could keep supporting this, plus perhaps add support for a frisbee disk image.

We have the buildscripts to build the thing (building really means installing rpms on a base fedora install), but we'll have to assemble our own Fedora base image to act as the build server, I think (this is after 30s of looking at stuff).

NodeManager

The NM authenticates all its calls to PLC in the same way the BootManager does.

The NM makes these API calls:

We may (will) have to replace their rspec and ticket stuff depending on what we do for GENI.

What if we split the NodeManager into a frontend CM interface containing the standard stuff that all CM backends will need (i.e., a records database, communication with MAs and the GMC, auth checks, others?), plus several backends for node/hw-specific configuration? I think this means the backends should be pluggable... sort of like venkat's stuff! That way, we can enable a whole bunch of them in a single "application", where a single node can take on multiple roles, or we can just couple a CM with a single backend.

So, we need to define a good interface on the CM for the backends to call into. Probably this includes getting configuration info, sending management notifications, etc.
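For example (the method names below are just guesses at what "getting configuration info" and "sending management notifications" would look like as an interface):

    class CMFrontend:
        """Frontend with the standard stuff every backend needs: records db,
        talking to MAs/GMC, auth checks, etc. (stubbed here)."""

        def __init__(self):
            self.backends = []

        def register(self, backend):
            self.backends.append(backend)

        def get_config(self, slice_name):
            raise NotImplementedError   # records db / GMC lookup goes here

        def notify(self, message):
            raise NotImplementedError   # management notification to MA/GMC

    class VserverBackend:
        """One of possibly several pluggable, node/hw-specific backends
        enabled on a single node."""

        def __init__(self, frontend):
            self.frontend = frontend

        def create_sliver(self, slice_name):
            cfg = self.frontend.get_config(slice_name)
            # ... vserver-specific setup driven by cfg ...
            self.frontend.notify("created sliver for %s" % slice_name)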

But, is putting a node into a single mode enough? Can't some nodes take on multiple roles at once?

PLCAPI

This isn't so bad. We just steal the API calls we need (they are logically separated). However, I will probably have to rewrite their database interface since it's tightly coupled to their notion that the database essentially stores discrete objects, and our schema is more complex. See XXX.

For now, we can just add these calls into our current xmlrpc server. However, auth will be slightly different; we can generate an ssl cert for all nodes (or one per node), and they can do "basic" auth with that to fit our model, but then there's also all the per-call auth in the API... and that's probably a good thing to keep.
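A sketch of the node side of that, using the stock Python xmlrpc client with a client cert (the paths, URL, and example call are hypothetical):

    import ssl
    import xmlrpc.client

    ctx = ssl.create_default_context()
    ctx.load_cert_chain("/etc/emulab/client.pem")   # per-node (or shared) cert

    server = xmlrpc.client.ServerProxy(
        "https://boss.example.net/protogeni/xmlrpc", context=ctx)

    # The cert handles the "basic" SSL-level auth; the existing per-call
    # auth struct can still be passed as the first argument, e.g.:
    # tickets = server.GetTickets(auth_struct, slice_name)   # illustrative call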

PLC DB Schema

To minimize the pain of taking at least the NM, and probably some of the monitoring/limiting tools, we need to take the following parts of the schema:

  • Slices (tables slices, slice_node, slice_person, slice_attribute_types, slice_attribute)
    • we could lift the slices table straight out, or nearly so.
    • if we are really going to be geni compatible, the slice_node table should replace our reserved table, since nodes will be allocated to slices, not experiments. Experiments are a service overlay, not a resource container -- or if they are a resource container, it is by extension. So, what I am really recommending here is that we interpose the new abstraction of a "slice" as the resource container. For now, we could map a pid/eid to a slicename (see the sketch after this list)...
    • slice_person currently maps membership in a slice; does geni need more than that in the auth model?
    • we ought to rip out all the places where the plab code assumes a slicename is composed of a site name and a slicename (e.g., utah_myslice)
    • slice_attribute_types defines the set of possible attributes that may be set in slice_attributes. plab uses these to control special permissions for the slivers (e.g., Proper permissions, raw sockets, reserved ports, etc). This isn't necessarily the best mechanism for this, but it's fine. We ought to have experiment attributes too, but I'm not sure what all purposes they could serve... For now, we just lift this stuff.
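As a straw man for the pid/eid-to-slicename mapping mentioned above (the separator and format are placeholders, especially given the point about not baking a naming convention into the code):

    def slicename_from_experiment(pid, eid):
        """Trivial pid/eid -> slicename mapping so existing experiments can
        be treated as slices for now."""
        return "%s-%s" % (pid, eid)

    def experiment_from_slicename(slicename):
        pid, eid = slicename.split("-", 1)
        return pid, eid

    # e.g. slicename_from_experiment("testbed", "myexp") == "testbed-myexp"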

Objects/tables we need to map onto our schema:

  • nodes
    • we don't have a notion of external dns names for nodes
    • we don't map widearea nodes to sites
    • nodes don't have ssh host keys
    • we need to add more boot states (e.g., debug, ...)
  • node_groups
    • the value of this is debatable, but we may as well have/allow it -- it's like event groups, but for nodes.
  • Sites:
    • Although GENI will not necessarily have a map of users to sites, it may well have a map of nodes deployed at sites. Thus, we should keep only that part for our own organizational purposes. So, we just have a node->site mapping, and per-site info (e.g., contacts, authorities for the site).
  • Keys: we already have a map of users to ssh pubkeys.

Objects/Concepts with obvious mapping problems, that we need:

  • API access control
    • PLC does RBAC on its API. The following roles exist: admin, pi, user, tech, node, anonymous, peer. We need to support at least the first 5 right away, or rip out their access control.
    • How do we map our permissions? We have pid/eid, group, and various permissions (project_root, group_root, local_root, user, admin). Of the two sets, project_root<->pi, admin<->admin, and (group_root, local_root, user)<->user; but we don't have a node or tech concept. For now, I think we can get away without a tech role, but we need the equivalent of a node role so that the node can both read and write config/log info (which we don't have today). (A straw-man mapping is sketched after this list.)
  • NodeNetwork abstractions:
    • Basically, we need to store static IP-level config information for widearea nodes that cannot use dhcp. To do this, PLC has abstractions:
      • network_type (e.g., 'ipv4'),
      • network_method (i.e., static, dhcp, proxy),
      • nodenetworks (this roughly maps to our interfaces table, but it has additional IP-level config details like gateway, bcast addr, dns servers)
    • their network type is similar to our link type, I think... but what do we have for widearea nodes?
    • do we need to store more detailed info than IP? What about the case where people have nodes with static IP info? We need to store this in the db, and maybe their solution is good. Alternatives:
      • use the interface_settings table (bad);
      • extend the interfaces table with more fields (bad because most nodes don't need this kind of static info);
      • extend the widearea_nodeinfo table;
      • add a table that extends the widearea_nodeinfo table (i.e., with the static IP level config stuff that non-dhcp nodes need to get on the network);
      • extend the interfaces tables (either by adding a new table (e.g., widearea_interfaces) that has more details (like gateway, network, bcast addr, dns servers, hostname), or by adding fields to interfaces).
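Here is a straw-man version of the permission-to-role mapping from the API access control item above (purely illustrative; the unmapped roles are the open question):

    # Our trust levels -> PLC roles; "tech" and "node" have no equivalent
    # on our side today, and "node" is the one we actually need to add.
    EMULAB_TO_PLC_ROLE = {
        "project_root": "pi",
        "group_root":   "user",
        "local_root":   "user",
        "user":         "user",
        "admin":        "admin",
    }

    def plc_role(trust_level):
        return EMULAB_TO_PLC_ROLE.get(trust_level)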

Stuff we can ignore:

  • Peers (their peering mechanism isn't very good, I feel, and we should just implement the GENI way going forward)
  • user accounts: we already have the same notions
  • addresses
  • power control (ours is superior)

-- Main.DavidJohnson - 13 Nov 2007