Changes from Version 1 of IntegratingPlabIntoPGeni

Author:
trac (IP: 127.0.0.1)
Timestamp:
03/26/08 18:04:59 (2 years ago)
  • IntegratingPlabIntoPGeni

    v0 v1  
     1 
     2 
     3This page contains a brain dump of some of Main.DavidJohnson's initial thoughts 
     4on using parts of the PlanetLab code for ProtoGeni (formerly located at 
     5bas:~johnsond/pgeni/pl.src.notes). 
     6 
     7This file describes key bits of the PlanetLab components that we may want to 
     8leverage for GENI, and how we'll hack them and integrate our current stuff to 
     9get the best of plab and elab into protogeni. 
     10 
     11 
     12'''General issues''' 
     13 
     14Does it make sense for NMs to run on components that the user can entirely blow 
     15away?  Out of necessity, clearly at least the MA must be able to do some non-NM 
     16control.  Does that control go in the NM, or somewhere else? 
     17 
     18 
     19'''BootCD''' 
     20 
     21(Looks like devs may eventually change the BootCD build approach; they don't 
     22seem to know how yet.) 
     23 
     24Basically, all the boot cd does is boot into an mfs so that the boot manager 
     25can take over.  We can do anything we want here as long as we invoke the boot 
     26manager... may as well use theirs to start, though! 
     27 
     28For the future, since we probably want to boot other OSs, we'll want to 
     29integrate our support for writing a "boot header" onto disk.  But, we don't 
     30want to use the freebsd bootloader (need to use linux as our mfs to maximize 
     31our chance of booting into the debug/setup env on any/all machines).  Thus, we 
     32would have to figure out how to shoehorn boot info into the grub bootblock, and 
     33hack grub to do the right thing.  The reason to use grub is that it can boot 
     34any current OS we might want to boot (freebsd, linux, windows). 
     35 
     36Actually, we could just keep using the freebsd loader as the primary loader, 
     37keep some state in the MBR (probably less than we do today), and then figure 
     38out where to boot from: network, disk, or debug (usb/cd).  At least this should 
     39work, assuming the bsd loader always works to jump to a loader in diskN:partM. 
     40 
     41Also, what I think we're going to want to do is have two usb dongles in each 
     42machine.  One will be write-protected, and will contain the static debug/setup 
     43OS and tools.  The idea is that these will never (rarely) change.  The second 
     44dongle will be large, and will not be write-protected.  It will act as a cache 
     45for node config info, disk images, overlay software (e.g., tarballs to deploy 
     46in slivers, atop disk images, etc.).  The hashes for these files will be stored 
     47at Emulab, or the GMC (whichever ends up making more sense), and as the node 
     48boots, the BootManager (and perhaps NodeManager, later in the process) can 
     49check the remote hash against the cached contents.  Of course, anything stored 
     50in the cache must be fetchable at any time from Emulab or GMC. 
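
A minimal sketch of the cache check described above, assuming SHA-256 hashes and that the remote hash has already been fetched from Emulab or the GMC.  The function names are made up for illustration; nothing here is existing BootManager code.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a cached file in chunks so large disk images need not fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_entry_is_fresh(path: Path, remote_hash: str) -> bool:
    """True if the cached file exists and matches the remotely stored hash.

    Assumption: remote_hash is a hex-encoded SHA-256 digest.
    """
    return path.is_file() and sha256_of(path) == remote_hash.lower()
```

If the check fails, the node would refetch the file from Emulab or the GMC and overwrite the cached copy.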
     51 
     52Note, we can store network settings for the node in the cache, all except the 
     53node_id and the node_key.  Those must be in a place where the 
     54 
     55This process will be useful for any of our non-cluster nodes (in emulab or 
     56geni), including local and remote wireless nodes. 
     57 
     58 
     59'''BootManager''' 
     60 
     61Lots of hardware detection code; most important is code to detect which linux 
     62kernel modules are needed. 
     63 
     64The BootManager authenticates with PLC on each API call by computing an hmac 
     65using the node_key over a serialized version of the parameter list for the 
     66call.  The auth structure includes this hmac, the node_id and IP. 
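
Roughly, the per-call auth looks like the sketch below.  The serialization format, the digest algorithm (SHA-256 here), and the field names in the auth dict are all assumptions for illustration; the real BootManager and PLC code may differ on every one of them.

```python
import hashlib
import hmac


def serialize_params(params: list) -> bytes:
    # Assumption: a simple deterministic join of the stringified parameters.
    return ("[" + ",".join(str(p) for p in params) + "]").encode("utf-8")


def compute_mac(node_key: str, params: list) -> str:
    """HMAC over the serialized parameter list, keyed with the node_key."""
    return hmac.new(node_key.encode("utf-8"),
                    serialize_params(params),
                    hashlib.sha256).hexdigest()


def build_auth(node_key: str, node_id: int, node_ip: str, params: list) -> dict:
    """Node side: build the auth structure sent along with an API call."""
    # Field names are hypothetical.
    return {"node_id": node_id, "node_ip": node_ip,
            "value": compute_mac(node_key, params)}


def verify_auth(node_key: str, auth: dict, params: list) -> bool:
    """Server side: recompute the HMAC and compare in constant time."""
    return hmac.compare_digest(compute_mac(node_key, params), auth["value"])
```

The server looks up the node_key by node_id, so the key itself never goes over the wire.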
     67 
     68Does require an http boot server to upload log files to (probably other stuff 
     69too). 
     70 
     71If we allow blowing away the disk, we need to save off some persistent files 
     72that the BootManager depends on (e.g., `configuration`, which stores 
     73"environment" variables). 
     74 
     75When it runs, the BootManager takes a single argument that defines the state it 
     76must get the node to enter (i.e., new, inst, rins, boot, dbg); then, the 
     77BootManager drives the node to the desired state via a bunch of logical 
     78"steps".  For now, the only really important things about these steps are: 
     79   * what kinds of external communication they necessitate (esp with PLC); 
     80   * what kind of state they store on the disk persistently across boots; 
     81   * what kind of vserver/linux specific stuff is embedded in the logic and 
     82     setup. 
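
To make the state/step structure concrete, here is a toy sketch.  The state names come from the list above; the step names and which steps belong to which state are invented placeholders, not the real BootManager step list.

```python
from typing import Callable, Dict, List

Step = Callable[[dict], None]


def make_step(name: str) -> Step:
    """Build a placeholder step that only records that it ran."""
    def step(ctx: dict) -> None:
        ctx.setdefault("ran", []).append(name)
    return step


# Hypothetical steps; the real ones talk to PLC and touch the disk.
authenticate = make_step("authenticate")
check_hardware = make_step("check_hardware")
install = make_step("install")
chain_boot = make_step("chain_boot")
start_debug_shell = make_step("start_debug_shell")

# Hypothetical mapping from the requested state to the steps that reach it.
STEPS_FOR_STATE: Dict[str, List[Step]] = {
    "new": [authenticate, check_hardware, install, chain_boot],
    "inst": [authenticate, check_hardware, install, chain_boot],
    "rins": [authenticate, install, chain_boot],
    "boot": [authenticate, chain_boot],
    "dbg": [authenticate, start_debug_shell],
}


def drive_to_state(state: str, ctx: dict) -> None:
    """Run, in order, every step registered for the requested state."""
    if state not in STEPS_FOR_STATE:
        raise ValueError(f"unknown boot state: {state!r}")
    for step in STEPS_FOR_STATE[state]:
        step(ctx)
```

Keeping the table separate from the driver is what would let us swap in non-vserver steps per node config later.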
     83 
     84PLC API function calls: 
     85   * BootCheckAuthentication (auth the node) 
     86   * BootUpdateNode (update pub half of ssh hostkey with plc) 
     87   * BootNotifyOwners (tell plc and owners if node doesn't meet min hw req, or 
     88     when the node changes state) 
     89   * BootGetNodeDetails (plc returns session, boot state, hw options, node 
     90     networks (although the BootManager only uses the primary one right now), 
     91     etc) 
     92   * BootUpdateNode (sends current network settings to PLC, including mac, 
     93     gateway, network, bcast, mask, dns1, dns2) 
     94   * GetNodes, GetNodeGroups (used to determine which groups this node is in) 
     95 
     96Local persistent files: 
     97   * AUTH_FAILURE_COUNT_FILE (count failed auths, so that the node can be moved 
     98     into the dbg state after hearing N of them) 
     99   * "configuration" 
     100   * /etc/planetlab/session (?) 
     101   * ( there are others ... ) 
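
As an example of how small this persistent state is, the auth failure counter could be handled like the sketch below.  The limit of 3 and the one-integer file format are assumptions; I haven't checked what the BootManager actually writes.

```python
from pathlib import Path

MAX_AUTH_FAILURES = 3  # hypothetical value of N


def read_failure_count(path: Path) -> int:
    """Return the stored count, treating a missing or garbled file as zero."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def record_auth_failure(path: Path, limit: int = MAX_AUTH_FAILURES) -> bool:
    """Increment the count; return True once the node should enter dbg."""
    count = read_failure_count(path) + 1
    path.write_text(f"{count}\n")
    return count >= limit


def reset_failure_count(path: Path) -> None:
    """Clear the counter after a successful authentication."""
    try:
        path.unlink()
    except FileNotFoundError:
        pass
```

This file has to live somewhere that survives a disk wipe if we let users blow away the disk.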
     102 
     103BootServer HTTP calls: 
     104   * getnodeid.php (gets nodeid based on mac addrs on the node, IF the nodeid 
     105     wasn't in the config file on the usbkey) 
     106   * ( there are others, I just didn't write them all down. ) 
     107 
     108Basically, to support booting entire disk images, we are going to have to do a 
     109more generic job of sending a configuration file to the node so that it knows 
     110what to do (instead of just checking for/setting up lvm, vservers, etc).  Like 
     111tmcc, but hopefully better.  Why not just xmlrpc-ize the process? 
     112Making the linux-specific, lvm, and vservers code optional, depending on 
     113the node config, shouldn't be too hard... but I need to think more about the 
     114right way to specify and abstract it, to maximize future node configurability 
     115(plus support heterogeneous platforms, perhaps?). 
     116 
     117I think we should split the BootManager into two parts: a basic BootManager, 
     118which handles the initial boot, checking in with control, etc., and then hands 
     119off control to a ConfigurationManager, which brings the node into whichever 
     120mode(s?) are required.  This should be able to support multiple reboots during 
     121a single Configuration process.  Then, it's easy to support many different 
     122configuration backends... and multiple ones at the same time. 
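
One way a ConfigurationManager could survive reboots in the middle of a configuration is to persist the names of completed steps and resume from there.  This is only a sketch; the JSON state file and the step names are assumptions, not anything that exists today.

```python
import json
from pathlib import Path
from typing import Callable, List, Tuple

# A step is (name, action); the action returns True if the node must
# reboot before the next step can run.
Step = Tuple[str, Callable[[], bool]]


def run_configuration(steps: List[Step], state_file: Path) -> str:
    """Run the steps not yet completed; return 'reboot' or 'done'."""
    done = json.loads(state_file.read_text()) if state_file.exists() else []
    for name, action in steps:
        if name in done:
            continue  # finished before an earlier reboot
        needs_reboot = action()
        done.append(name)
        state_file.write_text(json.dumps(done))
        if needs_reboot:
            return "reboot"
    return "done"
```

The basic BootManager would call this on every boot and reboot the node whenever it returns 'reboot'.  The state file must live on storage that the configuration steps themselves don't wipe.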
     123 
     124 
     125'''BootstrapFS''' 
     126 
     127This is the fedora-based tarball that the BootManager downloads from PLC and 
     128dumps onto the node.  We could allow this, plus perhaps add support for a 
     129frisbee disk image. 
     130 
     131We have the buildscripts to build the thing (building really means installing 
     132rpms on a base fedora install), but we'll have to assemble our own Fedora base 
     133image to act as the build server, I think (this is after 30s of looking at 
     134stuff). 
     135 
     136'''NodeManager''' 
     137 
     138The NM auths all its calls to PLC in the same way as the BootManager. 
     139 
     140The NM makes these API calls: 
     141   * GetSlivers 
     142   * BootGetNodeDetails (get a session key, then use that when chatting with 
     143     PLC after) 
     144   * GetSession() 
     145 
     146We may (will) have to replace their rspec and ticket stuff depending on what we 
     147do for GENI. 
     148 
     149What if we split up the NodeManager into a frontend CM interface, with the 
     150standard stuff that all CM backends will need (e.g., a records database, 
     151communication to MAs and GMC, auth checks, others?), but with several backends 
     152for node/hw-specific configuration?  I think this means the backends should be 
     153pluggable... sort of like venkat's stuff!  That way, we can enable a whole 
     154bunch of them in a single "application", where a single node can take on 
     155multiple roles, or we can just couple a CM with a single backend. 
     156 
     157So, we need to define a good interface on the CM for the backend to call into. 
     158Probably this contains getting configuration info, sending management 
     159notification, etc. 
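
A rough sketch of what that frontend/backend split might look like.  All class and method names here (ComponentManager, Backend, get_config, notify) are hypothetical; the point is just that backends only see the small interface the CM exposes.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """A node/hw-specific configuration backend."""
    name = "backend"

    @abstractmethod
    def configure(self, cm: "ComponentManager") -> None:
        """Configure the node, calling back into the CM as needed."""


class ComponentManager:
    """Frontend: the shared services that every backend calls into."""

    def __init__(self, configs: dict):
        self._configs = configs      # stand-in for the records database
        self._backends = {}
        self.notifications = []      # stand-in for messages to MAs/GMC

    def register(self, backend: Backend) -> None:
        self._backends[backend.name] = backend

    def get_config(self, backend_name: str) -> dict:
        return self._configs.get(backend_name, {})

    def notify(self, source: str, message: str) -> None:
        self.notifications.append((source, message))

    def configure_all(self) -> None:
        # A node can take on several roles: run every enabled backend.
        for backend in self._backends.values():
            backend.configure(self)


class VserverBackend(Backend):
    """Example backend; a real one would create the slivers."""
    name = "vserver"

    def configure(self, cm: ComponentManager) -> None:
        slivers = cm.get_config(self.name).get("slivers", [])
        cm.notify(self.name, f"configured {len(slivers)} slivers")
```

Registering one backend gives the "CM coupled with a single backend" case; registering several gives the multi-role node.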
     160 
     161But, is putting a node into a single mode enough?  Can't some nodes take on 
     162multiple roles at once? 
     163 
     164 
     165'''PLCAPI''' 
     166 
     167This isn't so bad.  We just steal the API calls we need (they are logically 
     168separated).  However, I will probably have to rewrite their database interface 
     169since it's tightly coupled to their notion that the database essentially stores 
     170discrete objects, and our schema is more complex.  See XXX. 
     171 
     172For now, we can just add these calls into our current xmlrpc server.  However, 
     173auth will be slightly different; we can generate an ssl cert for all nodes (or 
     174for each node), and they can do "basic" auth with that to fit our model, but 
     175then there's all the per-xmlrpc call auth in the API... and that's probably a 
     176good thing to keep. 
     177 
     178 
     179'''PLC DB Schema''' 
     180 
     181To minimize the pain of taking at least the NM, and probably some of the 
     182monitoring/limiting tools, we need to take the following parts of the schema: 
     183 
     184   * Slices (tables `slices`, `slice_node`, `slice_person`, 
     185     `slice_attribute_types`, `slice_attribute`) 
     186      * we could lift the slices table straight out, or nearly so. 
     187      * if we are really going to be geni compatible, the slice_node table 
     188        should replace our reserved table, since nodes will be allocated in 
     189        slices, not experiments.  Experiments are a service overlay, not a 
     190        resource container -- or if they are a resource container, it is by 
     191        extension.  
     192        So, what I am really recommending here is that we interpose the new 
     193        abstraction of a "slice" as the resource container.  For now, we 
     194        '''could''' map a pid/eid to a slicename... 
     195      * slice_person currently maps membership in a slice; does geni need more 
     196        than that in the auth model? 
     197      * we ought to rip out all the places where the plab code assumes a 
     198        slicename is composed of a site name and a slicename (e.g., 
     199        utah_myslice) 
     200      * slice_attribute_types defines the set of possible attributes that may 
     201        be set in slice_attribute.  plab uses these to control special 
     202        permissions for the slivers (e.g., to control Proper permissions, raw 
     203        sockets, reserved ports, etc).  This isn't necessarily the best 
     204        mechanism to do this, but it's fine.  We ought to have experiment 
     205        attributes, but I'm not sure all of what purposes they could serve... 
     206        For now, we just lift this stuff. 
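
For the interim pid/eid to slicename mapping, something like the following would do.  The "pid_eid" naming scheme and the SliceRegistry class are assumptions for illustration.  Note that the reverse lookup goes through a table rather than splitting the name on "_", which is exactly the kind of name parsing we want to rip out of the plab code.

```python
from typing import Dict, Tuple


class SliceRegistry:
    """Interim mapping between (pid, eid) experiments and slice names."""

    def __init__(self) -> None:
        self._by_experiment: Dict[Tuple[str, str], str] = {}
        self._by_slice: Dict[str, Tuple[str, str]] = {}

    def slice_for(self, pid: str, eid: str) -> str:
        """Return the slice name for an experiment, creating it if needed."""
        key = (pid, eid)
        if key not in self._by_experiment:
            name = f"{pid}_{eid}"  # hypothetical naming scheme
            if name in self._by_slice:
                # e.g. ("a_b", "c") and ("a", "b_c") both give "a_b_c"
                raise ValueError(f"slice name collision: {name!r}")
            self._by_experiment[key] = name
            self._by_slice[name] = key
        return self._by_experiment[key]

    def experiment_for(self, slicename: str) -> Tuple[str, str]:
        """Look the experiment up by name; never parse the name itself."""
        return self._by_slice[slicename]
```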
     207 
     208Objects/tables we need to map onto our schema: 
     209 
     210   * `nodes` 
     211      * we don't have a notion of external dns names for nodes 
     212      * we don't map widearea nodes to sites 
     213      * nodes don't have ssh host keys 
     214      * we need to add more boot states (i.e., debug, 
     215   * `node_groups` 
     216      * the value of this is debatable, but we may as well have/allow it -- 
     217        it's like event groups, but for nodes. 
     218   * Sites: 
     219      * Although GENI will not necessarily have a map of users to sites, it may 
     220        well have a map of nodes deployed at sites.  Thus, we should keep only 
     221        that part for our own organization purposes.  So, we just have 
     222        node->site mapping, and per-site info (e.g., contacts, authorities for 
     223        the site).  
     224   * Keys: we already have a map of users to ssh pubkeys. 
     225 
     226 
     227Objects/Concepts with obvious mapping problems, that we need: 
     228 
     229   * API access control 
     230      * PLC does RBAC on its API.  The following roles exist: 
     231        admin,pi,user,tech,node,anonymous,peer.  We need to support at least 
     232        the first 5 right away, or rip out their access control. 
     233      * How do we map our permissions?  We have pid/eid, group, and various 
     234        permissions (project_root, group_root, local_root, user, admin).  Of 
     235        the two sets, project_root<->pi, admin<->admin, 
     236        (group_root,local_root,user)<->user; but we don't have a node/tech 
     237        concept.  For now, I think we can get away without having tech role, 
     238        but we need the equivalent of a node role so that the node can both 
     239        read and write (which we don't have) config/log info. 
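
The mapping above is small enough to write down as a table.  A sketch follows; the function name and the is_node flag are made up, and the tech role is deliberately left out for now.

```python
# Emulab permission -> PLC role, as proposed above.
EMULAB_TO_PLC_ROLE = {
    "admin": "admin",
    "project_root": "pi",
    "group_root": "user",
    "local_root": "user",
    "user": "user",
}


def plc_role_for(emulab_permission: str, is_node: bool = False) -> str:
    """Translate an Emulab permission into the PLC role used for RBAC.

    Nodes have no Emulab permission today, so callers flag them explicitly
    (is_node is a hypothetical stand-in for whatever we end up adding).
    """
    if is_node:
        return "node"
    try:
        return EMULAB_TO_PLC_ROLE[emulab_permission]
    except KeyError:
        raise ValueError(f"no PLC role for permission {emulab_permission!r}")
```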
     240 
     241   * NodeNetwork abstractions: 
     242      * Basically, we need to store static IP-level config information for 
     243        widearea nodes that cannot use dhcp.  To do this, PLC has abstractions: 
     244         * network_type (e.g., 'ipv4'), 
     245         * network_method (e.g., static, dhcp, proxy), 
     246         * nodenetworks (this roughly maps to our interfaces table, but it has 
     247           additional IP-level config details like gateway, bcast addr, dns 
     248           servers) 
     249      * their network type is similar to our link type, I think... but what do 
     250        we have for widearea nodes? 
     251      * do we need to store more detailed info than IP?  What about people 
     252        whose nodes have static IP info?  We need to store this in the db, 
     253        and maybe their solution is good. 
     254        Alternatives: 
     255         * use the interface_settings table (bad); 
     256         * extend the interfaces table with more fields (bad because most nodes 
     257           don't need this kind of static info); 
     258         * extend the widearea_nodeinfo table; 
     259         * add a table that extends the widearea_nodeinfo table (i.e., with the 
     260           static IP level config stuff that non-dhcp nodes need to get on the 
     261           network); 
     262         * extend the interfaces table, either by adding a new table (i.e., 
     263           widearea_interfaces) that has more details (like gateway, network, 
     264           bcast addr, dns servers, hostname), or by adding fields to 
     265           interfaces. 
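
To get a feel for what a row in a widearea_interfaces-style table would carry, here is a sketch of the record.  The field names are assumptions loosely based on the list above, not an actual schema.  Note that network and bcast can be derived from ip and netmask, so we might not need to store them at all.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import List


@dataclass
class WideareaInterface:
    """Static IP-level config for a widearea node that cannot use dhcp."""
    mac: str
    ip: str
    netmask: str
    gateway: str
    dns_servers: List[str] = field(default_factory=list)
    hostname: str = ""
    method: str = "static"  # cf. PLC's network_method

    @property
    def network(self) -> ipaddress.IPv4Network:
        # strict=False lets us pass a host address together with the mask.
        return ipaddress.IPv4Network(f"{self.ip}/{self.netmask}", strict=False)

    @property
    def broadcast(self) -> str:
        return str(self.network.broadcast_address)

    def validate(self) -> None:
        """Reject a gateway that the node could not reach directly."""
        if ipaddress.IPv4Address(self.gateway) not in self.network:
            raise ValueError("gateway is outside the interface's network")
```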
     266 
     267Stuff we can ignore: 
     268 
     269   * Peers (their peering mechanism isn't very good, I feel, and we should just 
     270     implement the GENI way going forward) 
     271   * user accounts: we already have the same notions 
     272   * addresses 
     273   * power control (ours is superior) 
     274 
     275 
     276-- Main.DavidJohnson - 13 Nov 2007