This page contains a brain dump of some of Main.DavidJohnson's initial thoughts
on using parts of the PlanetLab code for ProtoGeni (formerly located at
bas:~johnsond/pgeni/pl.src.notes).

This file describes key bits of the PlanetLab components that we may want to
leverage for GENI, and how we'll hack them and integrate our current stuff to
get the best of plab and elab into protogeni.


'''General issues'''

Does it make sense for NMs to run on components that the user can entirely blow
away?  Out of necessity, clearly at least the MA must be able to do some non-NM
control.  Does that control go in the NM, or somewhere else?


'''BootCD'''

(Looks like devs may eventually change the BootCD build approach; they don't
seem to know how yet.)

Basically, all the boot cd does is boot into an mfs so that the boot manager
can take over.  We can do anything we want here as long as we invoke the boot
manager... may as well use theirs to start, though!

For the future, since we probably want to boot other OSs, we'll want to
integrate our support for writing a "boot header" onto disk.  But, we don't
want to use the freebsd bootloader (we need to use linux as our mfs to maximize
our chance of booting into the debug/setup env on any/all machines).  Thus, we
would have to figure out how to shoehorn boot info into the grub bootblock, and
hack grub to do the right thing.  The reason to use grub is that it can boot
any current OS we might want to boot (freebsd, linux, windows).

Actually, we could just keep using the freebsd loader as the primary loader,
keep some state in the MBR (probably less than we do today), and then figure
out where to boot from: network, disk, or debug (usb/cd).  At least this should
work, assuming the bsd loader always works to jump to a loader in diskN:partM.

Also, what I think we're going to want to do is have two usb dongles in each
machine.  One will be write-protected, and will contain the static debug/setup
OS and tools.  The idea is that these will never (or at least rarely) change.
The second dongle will be large, and will not be write-protected.  It will act
as a cache for node config info, disk images, and overlay software (e.g.,
tarballs to deploy in slivers or atop disk images).  The hashes for these files
will be stored at Emulab, or the GMC (whichever ends up making more sense), and
as the node boots, the BootManager (and perhaps NodeManager, later in the
process) can check the remote hash against the cached contents.  Of course,
anything stored in the cache must be fetchable at any time from Emulab or GMC.

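The cache check described above could look roughly like the following.  This is
only a sketch under assumptions: the notes don't say which hash to use (SHA-256
is picked here), and `remote_hashes` stands in for whatever a hypothetical call
to Emulab or the GMC would return.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a cached file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_entries(cache_dir, remote_hashes):
    """Return the names whose cached copy is missing or fails the hash check.

    remote_hashes maps file name -> hex digest (hypothetical format; the
    real source would be Emulab or the GMC).
    """
    stale = []
    for name, expected in sorted(remote_hashes.items()):
        cached = Path(cache_dir) / name
        if not cached.is_file() or file_sha256(cached) != expected:
            stale.append(name)
    return stale
```

Anything returned by `stale_entries` would then be re-fetched from Emulab or
the GMC before the boot continues.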

Note, we can store network settings for the node in the cache, all except the
node_id and the node_key.  Those must be in a place where the

This process will be useful for any of our non-cluster nodes (in emulab or
geni), including local and remote wireless nodes.


'''BootManager'''

Lots of hardware detection code; most important is code to detect which linux
kernel modules are needed.

The BootManager authenticates with PLC on each API call by computing an hmac
using the node_key over a serialized version of the parameter list for the
call.  The auth structure includes this hmac, the node_id and IP.

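To make the per-call auth concrete, here is a minimal sketch of the idea.  The
serialization (sorted JSON), the digest (SHA-1), and the field names in the
auth structure are all assumptions for illustration; the real BootManager and
PLC code define their own canonical form and should be checked before copying
any of this.

```python
import hashlib
import hmac
import json


def build_auth(node_id, node_ip, node_key, params):
    """Build an auth structure for one API call (hypothetical field names)."""
    # Assumption: sorted, compact JSON as the serialized parameter list.
    message = json.dumps(params, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hmac.new(node_key.encode("utf-8"), message, hashlib.sha1).hexdigest()
    return {"node_id": node_id, "node_ip": node_ip, "value": digest}


def check_auth(auth, node_key, params):
    """Server side: recompute the hmac from the stored node_key and compare."""
    expected = build_auth(auth["node_id"], auth["node_ip"], node_key, params)["value"]
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(expected, auth["value"])
```

Because the hmac covers the parameters, a captured auth structure cannot be
reused for a call with different arguments.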

It does require an http boot server to upload log files to (and probably other
stuff too).

If we allow blowing away the disk, we need to save off some persistent files
that the BootManager depends on (e.g., `configuration`, which stores
"environment" variables).

When it runs, the BootManager takes a single argument that defines the state it
must get the node to enter (i.e., new, inst, rins, boot, dbg); then, the
BootManager drives the node to the desired state via a bunch of logical
"steps".  For now, the only really important things about these steps are:
 * what kinds of external communication they necessitate (esp with PLC);
 * what kind of state they store on the disk persistently across boots;
 * what kind of vserver/linux specific stuff is embedded in the logic and
   setup.

PLC API function calls:
 * BootCheckAuthentication (auth the node)
 * BootUpdateNode (update pub half of ssh hostkey with plc)
 * BootNotifyOwners (tell plc and owners if node doesn't meet min hw req, or
   when the node changes state)
 * BootGetNodeDetails (plc returns session, boot state, hw options, node
   networks (although the BootManager only uses the primary one right now),
   etc)
 * BootUpdateNode (sends current network settings to PLC, including mac,
   gateway, network, bcast, mask, dns1, dns2)
 * GetNodes, GetNodeGroups (used to determine which groups this node is in)

Local persistent files:
 * AUTH_FAILURE_COUNT_FILE (count failed auths, so that the node can be moved
   into the dbg state after seeing N of them)
 * "configuration"
 * /etc/planetlab/session (?)
 * ( there are others ... )

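As an illustration of the AUTH_FAILURE_COUNT_FILE idea from the list above, a
persistent counter might be handled like this.  It is a sketch only: the file
format (a single integer) and the function names are made up here, not taken
from the BootManager source.

```python
from pathlib import Path


def record_auth_failure(count_file, max_failures):
    """Bump the on-disk failure count; return True once the node should go to dbg."""
    path = Path(count_file)
    try:
        count = int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        # Missing or corrupt file: start counting from zero.
        count = 0
    count += 1
    path.write_text(str(count))
    return count >= max_failures


def clear_auth_failures(count_file):
    """Reset the counter after a successful auth."""
    try:
        Path(count_file).unlink()
    except FileNotFoundError:
        pass
```

Note that this file is exactly the kind of state that has to live somewhere
safe if users are allowed to blow away the disk.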

BootServer HTTP calls:
 * getnodeid.php (gets nodeid based on mac addrs on the node, IF the nodeid
   wasn't in the config file on the usbkey)
 * ( there are others, I just didn't write them all down. )

Basically, to support booting entire disk images, we are going to have to do a
more generic job of sending a configuration file to the node so that it knows
what to do (instead of just checking for/setting up lvm, vservers, etc).  Like
tmcc, but hopefully better.  Why not just xmlrpc-ize the process?
Making the linux-specific, lvm, and vservers code optional depending on the
node config shouldn't be too hard... but I need to think more about the right
way to specify and abstract it, to maximize future node configurability (plus
support heterogeneous platforms, perhaps?).

I think we should split the BootManager into two parts: a basic BootManager,
which handles the initial boot, checking in with control, etc; it then hands
off control to a ConfigurationManager to bring the node into whichever mode(s?)
are required.  This should be able to support multiple reboots during a single
Configuration process.  Then, it's easy to support many different configuration
backends... and multiple ones at the same time.


'''BootstrapFS'''

This is the fedora-based tarball that the BootManager downloads from PLC and
dumps onto the node.  We could allow this, plus perhaps add support for a
frisbee disk image.

We have the buildscripts to build the thing (building really means installing
rpms on a base fedora install), but we'll have to assemble our own Fedora base
image to act as the build server, I think (this is after 30s of looking at
stuff).

'''NodeManager'''

The NM auths all its calls to PLC in the same way as the BootManager.

The NM makes these API calls:
 * GetSlivers
 * BootGetNodeDetails (get a session key, then use that when chatting with
   PLC after)
 * GetSession()

We may (will) have to replace their rspec and ticket stuff depending on what we
do for GENI.

What if we split up the NodeManager into a frontend CM interface, with the
standard stuff that all CM backends will need (e.g., a records database,
communication to MAs and GMC, auth checks, others?), but with several backends
for node/hw-specific configuration?  I think this means the backends should be
pluggable... sort of like venkat's stuff!  That way, we can enable a whole
bunch of them in a single "application", where a single node can take on
multiple roles, or we can just couple a CM with a single backend.

So, we need to define a good interface on the CM for the backend to call into.
Probably this includes getting configuration info, sending management
notifications, etc.

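One way the frontend/backend split could be shaped is sketched below.  Every
class and method name here is hypothetical, chosen only to show a CM frontend
that owns the shared pieces (records, notifications) and backends that call
back into it; none of it comes from the NodeManager code.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Node/hw-specific configuration backend (hypothetical interface)."""

    name = "backend"

    @abstractmethod
    def configure(self, cm, sliver):
        """Set up one sliver, calling back into the CM as needed."""


class ComponentManager:
    """Frontend holding the pieces every backend needs."""

    def __init__(self):
        self.backends = {}       # backend name -> Backend instance
        self.records = {}        # sliver name -> configuration dict
        self.notifications = []  # stand-in for messages to MAs / GMC

    def register(self, backend):
        self.backends[backend.name] = backend

    # --- interface the backends call into ---
    def get_config(self, sliver):
        return self.records.get(sliver, {})

    def notify(self, message):
        self.notifications.append(message)

    # --- entry point: one node may take on several roles at once ---
    def configure(self, sliver, roles):
        for role in roles:
            self.backends[role].configure(self, sliver)


class VServerBackend(Backend):
    name = "vserver"

    def configure(self, cm, sliver):
        cm.notify(f"vserver: configured {sliver} with {cm.get_config(sliver)}")


class DiskImageBackend(Backend):
    name = "diskimage"

    def configure(self, cm, sliver):
        cm.notify(f"diskimage: loaded image for {sliver}")
```

Coupling a CM with a single backend is then just the case where only one
backend is registered; passing several roles to `configure` covers the
multi-role node.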

But, is putting a node into a single mode enough?  Can't some nodes take on
multiple roles at once?


'''PLCAPI'''

This isn't so bad.  We just steal the API calls we need (they are logically
separated).  However, I will probably have to rewrite their database interface
since it's tightly coupled to their notion that the database essentially stores
discrete objects, and our schema is more complex.  See XXX.

For now, we can just add these calls into our current xmlrpc server.  However,
auth will be slightly different; we can generate an ssl cert for all nodes (or
for each node), and they can do "basic" auth with that to fit our model, but
then there's all the per-xmlrpc call auth in the API... and that's probably a
good thing to keep.


'''PLC DB Schema'''

To minimize the pain of taking at least the NM, and probably some of the
monitoring/limiting tools, we need to take the following parts of the schema:

 * Slices (tables `slices`, `slice_node`, `slice_person`,
   `slice_attribute_types`, `slice_attribute`)
   * we could lift the slices table straight out, or nearly so.
   * if we are really going to be geni compatible, the slice_node table
     should replace our reserved table, since nodes will be allocated in
     slices, not experiments.  Experiments are a service overlay, not a
     resource container -- or if they are a resource container, it is by
     extension.
     So, what I am really recommending here is that we interpose the new
     abstraction of a "slice" as the resource container.  For now, we
     '''could''' map a pid/eid to a slicename...
   * slice_person currently maps membership in a slice; does geni need more
     than that in the auth model?
   * we ought to rip out all the places where the plab code assumes a
     slicename is composed of a site name and a slicename (e.g.,
     utah_myslice)
   * slice_attribute_types defines the set of possible attributes that may
     be set in slice_attributes.  plab uses these to control special
     permissions for the slivers (e.g., to control Proper permissions, raw
     sockets, reserved ports, etc).  This isn't necessarily the best
     mechanism to do this, but it's fine.  We ought to have experiment
     attributes, but I'm not sure all of what purposes they could serve...
     For now, we just lift this stuff.

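For the interim pid/eid to slicename mapping mentioned above, a lookup table
that treats the slice name as opaque would avoid baking in the
site_slicename parsing assumption.  The class below is a hypothetical sketch
of that idea (in practice this would be a db table, not an in-memory dict).

```python
class SliceMap:
    """Two-way map between (pid, eid) experiments and opaque slice names."""

    def __init__(self):
        self._by_experiment = {}
        self._by_slice = {}

    def bind(self, pid, eid, slicename):
        key = (pid, eid)
        if key in self._by_experiment or slicename in self._by_slice:
            raise ValueError("experiment or slice name is already mapped")
        self._by_experiment[key] = slicename
        self._by_slice[slicename] = key

    def slice_for(self, pid, eid):
        return self._by_experiment[(pid, eid)]

    def experiment_for(self, slicename):
        # Looked up, never parsed out of the name itself.
        return self._by_slice[slicename]
```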

Objects/tables we need to map onto our schema:

 * `nodes`
   * we don't have a notion of external dns names for nodes
   * we don't map widearea nodes to sites
   * nodes don't have ssh host keys
   * we need to add more boot states (i.e., debug,
 * `node_groups`
   * the value of this is debatable, but we may as well have/allow it --
     it's like event groups, but for nodes.
 * Sites:
   * Although GENI will not necessarily have a map of users to sites, it may
     well have a map of nodes deployed at sites.  Thus, we should keep only
     that part for our own organizational purposes.  So, we just have
     node->site mapping, and per-site info (e.g., contacts, authorities for
     the site).
 * Keys: we already have a map of users to ssh pubkeys.


Objects/Concepts with obvious mapping problems, that we need:

 * API access control
   * PLC does RBAC on its API.  The following roles exist:
     admin,pi,user,tech,node,anonymous,peer.  We need to support at least
     the first 5 right away, or rip out their access control.
   * How do we map our permissions?  We have pid/eid, group, and various
     permissions (project_root, group_root, local_root, user, admin).  Of
     the two sets, project_root<->pi, admin<->admin,
     (group_root,local_root,user)<->user; but we don't have a node/tech
     concept.  For now, I think we can get away without having tech role,
     but we need the equivalent of a node role so that the node can both
     read and write (which we don't have) config/log info.

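The permission mapping proposed above is small enough to write down directly.
The dict below just restates it; the `node` role handling and the per-method
role sets passed to `may_call` are illustrative assumptions, not the PLC
implementation.

```python
# Our permissions -> PLC roles, as proposed in the notes above.
EMULAB_TO_PLC_ROLE = {
    "admin": "admin",
    "project_root": "pi",
    "group_root": "user",
    "local_root": "user",
    "user": "user",
}

# We have no equivalent of this today; nodes would need it to read and
# write config/log info.  The "tech" role is left out for now.
NODE_ROLE = "node"


def plc_role(emulab_permission):
    """Translate one of our permissions into the PLC role it maps to."""
    try:
        return EMULAB_TO_PLC_ROLE[emulab_permission]
    except KeyError:
        raise ValueError(f"no PLC role for permission {emulab_permission!r}")


def may_call(role, allowed_roles):
    """Hypothetical RBAC check: is this role in the method's allowed set?"""
    return role in allowed_roles
```

For example, a method restricted to `{"admin", "node"}` would accept a node
presenting `NODE_ROLE` but reject a plain project member.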

 * NodeNetwork abstractions:
   * Basically, we need to store static IP-level config information for
     widearea nodes that cannot use dhcp.  To do this, PLC has abstractions:
     * network_type (e.g., 'ipv4'),
     * network_method (i.e., static, dhcp, proxy),
     * nodenetworks (this roughly maps to our interfaces table, but it has
       additional IP-level config details like gateway, bcast addr, dns
       servers)
   * their network type is similar to our link type, I think... but what do
     we have for widearea nodes?
   * do we need to store more detailed info than IP?  What about people
     whose nodes have static IP info?  We need to store this in the db, and
     maybe their solution is good.
     Alternatives:
     * use the interface_settings table (bad);
     * extend the interfaces table with more fields (bad because most nodes
       don't need this kind of static info);
     * extend the widearea_nodeinfo table;
     * add a table that extends the widearea_nodeinfo table (i.e., with the
       static IP level config stuff that non-dhcp nodes need to get on the
       network);
     * extend the interfaces tables (either by adding a new table (e.g.,
       widearea_interfaces) that has more details (like gateway, network,
       bcast addr, dns servers, hostname); or by adding fields to
       interfaces).

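To get a feel for the last alternative (a separate widearea_interfaces table
alongside interfaces), here is a rough sketch.  The column names are guesses
based on the fields listed above, the `interfaces` table is a stripped-down
stand-in rather than our real schema, and sqlite is used only so the example
runs on its own.

```python
import sqlite3

SCHEMA = """
CREATE TABLE interfaces (          -- simplified stand-in, not the real table
    node_id TEXT NOT NULL,
    iface   TEXT NOT NULL,
    ip      TEXT,
    mac     TEXT,
    PRIMARY KEY (node_id, iface)
);
CREATE TABLE widearea_interfaces ( -- hypothetical extension table
    node_id  TEXT NOT NULL,
    iface    TEXT NOT NULL,
    method   TEXT NOT NULL CHECK (method IN ('static', 'dhcp', 'proxy')),
    gateway  TEXT,
    network  TEXT,
    bcast    TEXT,
    netmask  TEXT,
    dns1     TEXT,
    dns2     TEXT,
    hostname TEXT,
    PRIMARY KEY (node_id, iface),
    FOREIGN KEY (node_id, iface) REFERENCES interfaces (node_id, iface)
);
"""


def open_db():
    """Create an in-memory database with both tables."""
    db = sqlite3.connect(":memory:")
    db.row_factory = sqlite3.Row
    db.executescript(SCHEMA)
    return db


def static_config(db, node_id):
    """Return the static IP-level settings a non-dhcp node needs at boot."""
    rows = db.execute(
        """SELECT i.iface, i.ip, w.gateway, w.network, w.bcast,
                  w.netmask, w.dns1, w.dns2, w.hostname
             FROM interfaces AS i
             JOIN widearea_interfaces AS w
               ON w.node_id = i.node_id AND w.iface = i.iface
            WHERE i.node_id = ? AND w.method = 'static'""",
        (node_id,),
    )
    return [dict(row) for row in rows]
```

Cluster nodes simply have no row in the extension table, which is the
argument for this layout over adding fields to interfaces itself.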

Stuff we can ignore:

 * Peers (their peering mechanism isn't very good, I feel, and we should just
   implement the GENI way going forward)
 * user accounts: we already have the same notions
 * addresses
 * power control (ours is superior)


-- Main.DavidJohnson - 13 Nov 2007