Tuesday, July 19, 2022

Persistent crypto device passthrough on s390x KVM


With v2.22.0 the s390-tools (aka s390utils on RHEL) have an important addition that helps to pass through crypto domains to KVM guests: they allow for persistence and help the user to avoid invalid configurations. (Thanks to Matthew Rosato for explaining the details to me.)

As you might remember from my previous post, crypto device passthrough (with libvirt) consists of three main steps:

  1. Remove the host driver from the device and assign vfio_ap
  2. Start a mediated device of type vfio_ap-passthrough with assigned adapter, usage domain and optionally control domain
  3. Attach the mediated device via its UUID and the <hostdev> element to the KVM domain
For a while now, we have had the nice tool mdevctl to help with step 2 above; see, for example, the official RHEL 8 user documentation. Actually, libvirt also integrates with mdevctl, so you can manage your mediated devices via libvirt's nodedev API.

However, if you reboot your LPAR, several configurations need to be persisted in your environment:

  1. Of course, you need the mediated device configuration and the KVM definition to be persisted.
  2. The passthrough driver needs to be loaded; this might depend on the kernel you are using. On RHEL you can configure the kernel module vfio_ap to be automatically loaded at boot as described here. Otherwise, trying to define a device via the nodedev API might just tell you: unsupported configuration: invalid parent device 'ap_matrix'
  3. Finally, the crypto devices' driver assignments need to be persisted.
While 1. is taken care of by libvirt (with mdevctl's help), 2. and 3. are what s390-tools provides as of release v2.22.0. (Mind that it relies on the kernel uevents BINDINGS=complete and COMPLETECOUNT=X, which might not be available in older kernel versions, e.g. v4.x.)

As a result, we can now set up workloads in KVM guests that leverage the crypto devices (e.g. hardware-accelerated crypto operations) once and configure them to start automatically. After we reboot the LPAR (e.g. when we apply important security updates to the kernel), our KVM guests are restarted automatically.

How to persist crypto device driver assignments

Suppose the vfio_ap kernel module is loaded (v2.22.0's new ap udev rule takes care of this, too) and the mediated device configured as well.
In order to configure the device driver assignment, instead of using sysfs directly, we'll make use of the new type ap on the lszdev and chzdev commands.

Check if your s390-tools version supports ap handling

Simply run

# lszdev --list-types

and confirm that it's listed

ap           Cryptographic Adjunct Processor (AP) device

Configure apmask and aqmask with chzdev

chzdev is not a new tool. Important for the persistence of device configurations are the flags:

-a, --active
           Apply changes to the active configuration only.

-p, --persistent
           Apply changes to persistent configuration only.

where omitting both flags has the same effect as passing them both: any change is applied immediately and will be restored on reboot.

The interesting part is how apmask and aqmask are configured: chzdev uses decimal values, which can be more convenient. You might remember that each mask represents the adapters or domains, respectively, as an array of 256 bits, where a 1 means 'used by the host' and a 0 means 'free for use by another driver (vfio_ap)'; for details see the kernel doc.

We've been used to the hexadecimal representation when using the lszcrypt tool or the sysfs directly, e.g.

# lszcrypt

02       CEX7A Accelerator online            0
02.002b  CEX7A Accelerator online            0
02.0032  CEX7A Accelerator online            0
02.0033  CEX7A Accelerator online            0

So, if we want to pass the queue 02.002b through, we have to get the decimal values of its adapter and domain. If you are not comfortable with base conversion and have a bash console, you can achieve this, e.g. for 2b, like this:

# echo "obase=10; ibase=16; 2B" | bc

(It's important you use the upper case letter.)
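
If Python is at hand, the same conversion can be sketched as follows. This is only an illustration: the helper name apqn_to_chzdev_args is made up for this post and is not part of s390-tools. It takes the card.domain notation printed by lszcrypt and builds the decimal settings for chzdev.

```python
def apqn_to_chzdev_args(apqn: str, to_host: bool = False) -> str:
    """Translate a lszcrypt-style 'card.domain' hex pair into decimal chzdev settings.

    Hypothetical helper for illustration; it only builds the argument string.
    """
    card_hex, domain_hex = apqn.split(".")
    adapter = int(card_hex, 16)    # int() accepts upper- and lower-case hex digits
    domain = int(domain_hex, 16)
    for value in (adapter, domain):
        if not 0 <= value <= 255:
            raise ValueError(f"{value} is outside the 256-bit mask range")
    # "-" takes the id away from the host, "+" hands it back
    sign = "+" if to_host else "-"
    return f"apmask={sign}{adapter} aqmask={sign}{domain}"


print(apqn_to_chzdev_args("02.002b"))  # apmask=-2 aqmask=-43
```

The printed string can then be appended to chzdev -t ap.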

Now instead of echoing into the sysfs path, you can simply issue

# chzdev -t ap apmask=-2 aqmask=-43

(remember the "-" means to "take away" from the host)

and confirm via

# lszdev -t ap
  Description        : Cryptographic Adjunct Processor (AP) device
  Modules            : ap
  Active             : yes
  Persistent         : yes

  apmask     "0-1,3-255"    "0-1,3-255"
  aqmask     "0-42,44-255"  "0-42,44-255"

The tool will write out the persistent setting in a udev rule making sure the assignment is restored after reboot.

chzdev allows for more sophisticated operations, e.g.

# chzdev -t ap apmask=+2,-4-8 aqmask=+43,-0-42

will return 02.002b to the host and give the adapter range 4-8 and the domain range 0-42 to the vfio_ap driver:

# lszdev -t ap
  Description        : Cryptographic Adjunct Processor (AP) device
  Modules            : ap
  Active             : yes
  Persistent         : yes

  apmask     "0-3,9-255"  "0-3,9-255"
  aqmask     "43-255"     "43-255"
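
To double-check such relative changes before applying them, the mask arithmetic can be modelled with plain sets. The following is a minimal sketch, assuming the "+id", "-id" and "-first-last" notation used in the commands above; the function names are made up for illustration, and absolute settings such as 0-255 are deliberately not handled.

```python
def parse_ids(spec: str) -> set:
    """Expand '4-8' to {4, ..., 8} and '2' to {2}."""
    first, _, last = spec.partition("-")
    low = int(first)
    high = int(last) if last else low
    return set(range(low, high + 1))


def apply_changes(mask: set, changes: str) -> set:
    """Apply comma-separated relative changes such as '+2,-4-8' to a set of host-owned ids."""
    result = set(mask)
    for item in changes.split(","):
        ids = parse_ids(item[1:])
        if item[0] == "+":
            result |= ids      # give the ids back to the host
        elif item[0] == "-":
            result -= ids      # take the ids away from the host
        else:
            raise ValueError(f"only relative changes are supported: {item!r}")
    return result


def format_mask(mask: set) -> str:
    """Render a set of ids as ranges, e.g. '0-3,9-255'."""
    ids = sorted(mask)
    ranges = []
    i = 0
    while i < len(ids):
        start = end = ids[i]
        while i + 1 < len(ids) and ids[i + 1] == end + 1:
            i += 1
            end = ids[i]
        ranges.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(ranges)


apmask = set(range(256)) - {2}     # state after apmask=-2
aqmask = set(range(256)) - {43}    # state after aqmask=-43
print(format_mask(apply_changes(apmask, "+2,-4-8")))    # 0-3,9-255
print(format_mask(apply_changes(aqmask, "+43,-0-42")))  # 43-255
```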

This also allows us to easily reset all of our configurations during testing via chzdev -t ap apmask=0-255 aqmask=0-255.

Protection against bad configurations

As a further improvement for users, the tools check the current environment and integrate with mdevctl to help avoid conflicting configurations. For example:

Don't allow for mediated device definitions if another device already uses the same device

With the following nodedev successfully defined, we can't return the adapter to the host:

# virsh nodedev-dumpxml 
  <capability type='mdev'>
    <type id='vfio_ap-passthrough'/>
    <iommuGroup number='1'/>
    <attr name='assign_adapter' value='0x02'/>
    <attr name='assign_domain' value='0x002b'/>

# chzdev -t ap apmask=+2
chzdev: apmask conflicts with mdev d36d7d0f-cf3d-4fef-bb9c-ed393954996b APQN 2.43
chzdev: persistent apmask conflicts with defined autostart mdev d36d7d0f-cf3d-4fef-bb9c-ed393954996b APQN 2.43
ap device type configure failed
    Error: Invalid configuration

If interested you can check out further validations in ap_check's source code here.

Thursday, February 3, 2022

Customer focus powered by SBT and Automation

Recently we experimented with Session Based Testing (as opposed to scripted testing) and reflected on how to integrate it into our processes.

We asked questions like

  • What makes human testers unique?
  • Is there space for product validation in regression testing?
  • Can we reduce test documentation efforts?

    This was a collaborative effort. I am very grateful to everybody who gave feedback, joined testing sessions or participated in some other way to help us learn and improve our testing.

    Hopefully, others will join the conversation and maybe find something useful in our findings.

    You can download the slides with speaker notes as PDF here.

    Thursday, July 8, 2021

    Testing shared memory communications with Linux on Z

    Mainframes allow for shared memory communications between LPARs on the same box through ISM - internal shared memory. More information about the Linux device driver can be found here.

    IBM provides open-source tools including smc_run which easily converts an application's usage of TCP/IP sockets to SMC (shared memory connection) sockets.

    Let's suppose you want to check two of your LPARs can communicate with each other.

    First you'd want to check if ISM is available. As ISM devices are exposed as virtual PCI devices, you can simply run lspci on each LPAR.

    # lspci
    00:00.0 Non-VGA unclassified device: IBM Internal Shared Memory (ISM) virtual PCI device
    If your admin tells you that you have been provided with the ISM device but you don't see it, you might have to power it on.
    # echo 1 > /sys/bus/pci/slots/0000032/power
    Now, smc_run will convert any TCP/IP socket usage to an SMC socket. So let's suppose you have an echo server using the AF_INET address family that you can start from the command line; you'd simply run the same command prefixed with smc_run.
    [host1]# smc_run python3 echo_server.py --host my_host_name.example.org --port 12345
    You can then simply send data from a client in the same way.
    [host2]# smc_run python3 send_data.py --host my_host_name.example.org --port 12345
    some data
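
The scripts echo_server.py and send_data.py are not shown in this post. As a rough idea of what such a pair could look like, here is a minimal sketch using plain AF_INET stream sockets; all function names are made up for illustration, and the demo runs both sides on the loopback interface instead of two LPARs.

```python
import socket
import threading


def serve_once(server: socket.socket) -> None:
    """Accept a single connection and echo back everything it sends."""
    conn, _addr = server.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def send_and_receive(host: str, port: int, payload: bytes) -> bytes:
    """Send the payload and collect the echoed reply until the server closes."""
    chunks = []
    with socket.create_connection((host, port)) as client:
        client.sendall(payload)
        client.shutdown(socket.SHUT_WR)   # signal that we are done sending
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


def demo() -> bytes:
    """Run server and client in one process on an ephemeral loopback port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server,), daemon=True)
    worker.start()
    reply = send_and_receive("127.0.0.1", port, b"some data")
    worker.join(timeout=5)
    server.close()
    return reply


if __name__ == "__main__":
    print(demo().decode())
```

Nothing in the code refers to SMC; the point of smc_run is that an unmodified TCP program like this one is started through it.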
    Looks like it's working, right? But does it really?

    If you power off one of the ISM devices, the above scenario still works. How can that be?

    Reading through the manpages, we'll find in the af_smc manpage:

    SMC socket capabilities are negotiated at connection setup. If one peer is not SMC capable, further socket processing falls back to TCP usage automatically.

    So, how can we make sure that our LPARs really communicate through the ISM?

    Luckily, the smc-tools package delivers another tool, smcss. It shows details for AF_SMC socket connections. The Mode column shows how data is exchanged:

    SMCD     The SMC socket uses SMC-D for data exchange.

    SMCR     The SMC socket uses SMC-R for data exchange.

    TCP        The SMC socket uses the TCP protocol for data exchange, because an SMC connection could not be established.

    And really, the difference can be confirmed while the connection is open depending on the availability of ISM on both LPARs.

    [host1]# smcss
    State   UID   Inode   Local Address       Peer Address        Intf Mode
    ACTIVE  00000 22045079  0000 SMCD
    [host1]# smcss
    State   UID   Inode   Local Address       Peer Address        Intf Mode
    ACTIVE  00000 22049662  0000 TCP 0x05000000/0x03030000
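
For an automated check, the Mode values could be extracted from captured smcss output. The sketch below assumes that the mode appears as a standalone whitespace-separated token (SMCD, SMCR or TCP) on each connection line, as in the listings above; real output has more columns, and the function names are made up for illustration.

```python
KNOWN_MODES = ("SMCD", "SMCR", "TCP")


def connection_modes(smcss_output: str) -> list:
    """Return the Mode token of every connection line in the given smcss output."""
    modes = []
    for line in smcss_output.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] == "State":   # skip blank lines and the header
            continue
        mode = next((token for token in tokens if token in KNOWN_MODES), None)
        if mode is not None:
            modes.append(mode)
    return modes


def all_use_ism(smcss_output: str) -> bool:
    """True only if there is at least one connection and all of them report SMCD."""
    modes = connection_modes(smcss_output)
    return bool(modes) and all(mode == "SMCD" for mode in modes)


sample = """State   UID   Inode   Local Address       Peer Address        Intf Mode
ACTIVE  00000 22045079  0000 SMCD
ACTIVE  00000 22049662  0000 TCP 0x05000000/0x03030000"""
print(connection_modes(sample))  # ['SMCD', 'TCP']
print(all_use_ism(sample))       # False
```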

    Finally, since version 1.5 the smc-tools also offer another tool that helps to check ISM live connectivity without a TCP application: smc_chk. You can simply run:

    [host1]# smc_chk -S
    Server started on port 37374
    [host2]# smc_chk -C -p 37374
    Test with target IP and port 37374
      Live test (SMC-D and SMC-R, EXPERIMENTAL)
         Success, using SMC-D

    Friday, September 4, 2020

    How to align on the right - partition alignment algorithm

    In MSDOS 6.22 there are alignment restrictions for partitions. This means that for a partition of size capacity = |start - end| = end - start, the partition boundaries (start, end) must coincide with certain boundaries; in this case, cylinder boundaries.

    The following alignment algorithm is taken from the libvirt virtualization API.

    In short, the algorithm will make sure the allocated contiguous space is aligned on the right, that is, at the end. If the available free space for alignment already starts at a given boundary value, it will be fully aligned [1].

    We'll have:

    1. Input: c := capacity, l := alignment interval, s := start
    2. Output: e := end
    All of these values are in ℤ (actually ℕ). For e we require: c <= e - s and e + 1 = n*l (for some natural n); that is, the required capacity fits into the allocated space and the end is aligned, ending one unit before the next interval starts.

    Let r := l - (c mod l). We understand r as the extra space required to reach the interval boundary. E.g. if l is 512 (think of sector size) and I need to allocate capacity 618, then 1*l won't cover c; instead I'd have to use 2*l = 1024 >= 618. But then I have 1024 - 618 = 406 = 512 - (618 mod 512) of extra space that I need to allocate but that wasn't really required.

    The algorithm handles three cases:
    1. s = m*l, for some m (the start is aligned at a boundary)
    2. s != m*l; s mod l <= r (the start offset fits into the extra space reserved for alignment)
    3. s != m*l; s mod l > r (the start offset doesn't fit) 
    For 1. the correct e is quite easy, we already know how much extra space to align and subtract 1 to have the partition end just before the next boundary in order to have the next partition start exactly at boundary.

    (1)    e = s + c + r - 1

    This is the base for the other two cases.

    For 2. (1) would surpass the boundary:

    boundary=s            ...            s+c          boundary=s+c+r
        |                                 |                 |

    boundary        s         ...        s+c      boundary        s+c+r
        |           |                     |           |             |

    But we know that s mod l <= r, therefore s+c+r - s mod l >= s + c proving that the alignment on the right would fit the required capacity. Thus:

    (2)    e = s + c + r - s mod l - 1

    And e is still on boundary: e = s + c + r - s mod l - 1 =  (c + r) + (s - s mod l) - 1 = n_1*l + n_2*l - 1.

    Now for 3., from the above, e - s = s + c + r - s mod l - s = c + r - s mod l < c. If we originally had defined r to be 2*l - (c mod l), then e - s > c. But we didn't do that because for 1. and 2. that would be a waste of space. However, here for 3. we don't have another choice, so we add another l:

    (3)    e = s + c + r + l - s mod l - 1

    In the referenced algorithm, you'll see that s mod l is always subtracted. Let's keep in mind that s is at a boundary iff s mod l = 0. So we can actually summarize

    (4)    e = s + c + r + d - s mod l - 1, where d := 0 if s mod l <= r, else d := l.
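
As a quick sanity check, formula (4) can be written down in a few lines of Python. This is a sketch of the formula as derived in this post, not a copy of the libvirt code; the function and variable names are made up, and the comparison uses s mod l <= r as in case 2.

```python
def aligned_end(start: int, capacity: int, interval: int) -> int:
    """Return the end e for start s, capacity c and alignment interval l, following (4)."""
    extra = interval - (capacity % interval)    # r: space up to the next boundary
    offset = start % interval                   # s mod l: 0 iff the start is aligned
    bump = 0 if offset <= extra else interval   # d: one more interval in case 3
    return start + capacity + extra + bump - offset - 1


# The three cases with l = 512 and c = 618 (so r = 406):
for s in (0, 100, 500):
    e = aligned_end(s, 618, 512)
    print(s, e, (e + 1) % 512 == 0, e - s + 1 >= 618)
# 0 1023 True True
# 100 1023 True True
# 500 1535 True True
```

In all three cases e + 1 lands on a boundary and the units from s to e (inclusive) cover the requested capacity.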

    [1]  I wonder if the need for a first partition not starting exactly at the second cylinder to save space or some other MSDOS restrictions are the reason for not aligning the partition start, too.

    Monday, July 27, 2020

    Set up Crypto Card passthrough with KVM on IBM Z (vfio-ap)

    Crypto Cards on IBM Z systems provide secure key encryption.

    This security feature can be passed through to KVM guests.

    The passthrough is available through the vfio_ap kernel module (paired with the homonymous driver). It uses another passthrough interface, namely, the VFIO mediated device framework (represented by kernel module vfio_mdev).

    More details can be found in the kernel doc. Here, the focus is on setting up a single passthrough using libvirt.

    What we need
    1. A System Z host with a crypto card, KVM guest
    2. lszcrypt command (from s390-tools, often preinstalled with the distro; the package name can be s390utils, too)
    What we do
    1. Identify the device
    2. Mark device queues as not usable by host
    3. Create mediated device
    4. Assign crypto device to mediated device
    5. Attach mediated device to guest
    6. Verify setup
    Identify the device

    # lszcrypt -V

    01          CEX5C CCA-Coproc  online         1        0     11     08 S--D--N--  cex4card   
    01.0011     CEX5C CCA-Coproc  online         1        0     11     08 S--D--N--  cex4queue  

    We need two pieces of information:
    • HWTYPE: passthrough is only supported if this number is >= 10
    • CARD.DOMAIN: 0x01 (adapter id), 0x0011 (domain id)
    Mark device queues as not usable by host
    • Mark adapter not usable by host:
      • echo -0x01 > /sys/bus/ap/apmask
    • Mark device queues not usable by host:
      • echo -0x0011 > /sys/bus/ap/aqmask
    lszcrypt now should only list the card, not the queue.

    # lszcrypt
    01          CEX5C CCA-Coproc  online         4

    Create mediated device

    cd /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough
    uuid=$(uuidgen)
    echo $uuid > create

    Assign crypto device to mediated device

    In the device directory $uuid that we just created below vfio_ap-passthrough/devices:

    cd devices/$uuid
    echo 0x01 > assign_adapter
    echo 0x0011 > assign_domain

    We can confirm assignment:

    # lszcrypt -V
    01          CEX5C CCA-Coproc  online         4        0     11     08 S--D--N--  cex4card   
    01.0011     CEX5C CCA-Coproc  -              4        0     11     08 S--D--N--  vfio_ap    

    Attach the mediated device to guest

    Use $uuid for the mediated device and modify guest domain xml (e.g. virsh edit)
    <hostdev mode='subsystem' type='mdev' model='vfio-ap'>
      <source><address uuid='$uuid'/></source>
    </hostdev>

    Verify setup

    After starting the guest:

    root@guest # lszcrypt -V

    01          CEX5C CCA-Coproc  online         1        0     11     08 S--D--N--  cex4card   
    01.0011     CEX5C CCA-Coproc  online         1        0     11     08 S--D--N--  cex4queue  

    Tuesday, April 28, 2020

    The testing effort growth

    Just wondering what a proof for "Number of test cases grows exponentially" could look like...

    Define a program to be a function from a set of input variables to a set of output variables,

    P = Input x P|Output = I x O 
      = { (j_1,...,j_n) x (o_1,...,o_m) }
      = { (j_1,...,j_n,o_1,...,o_m) }.

    Each component of Input is supposed to have at least 2 elements. (If not, the input variable will never change the image value, in other words, the program's behavior, and can therefore be eliminated. The case where an input value is only determined to be defined or not corresponds to Input ≃ { 0, {0} }.)

    Also, I need to restrict to the case where the dim(I) > 1 because if not each extra value to test adds exactly one test case (linear growth).

    Define a feature to be a selection of a subset of input variables combined with their image, that is, it's a restriction of the program to a subset of the domain, F = P|J, J < I.

    We take a test case to be a random variable

    T : Input x Ω -> Input x Output
    T(i, ω) = (i, T_1(i, ω),..., T_m(i, ω)).

    Each T_j is an expected output or simply expectation or post-condition.

    A test passes if T_j(i, ω) = P(i)_j for all j, and fails otherwise, for a test execution ω.

    A test case of a feature is then simply T|F = T|J x Ω where F = P|J.

    For a given feature we can add new behavior minimally by
    1. extending the domain of an input variable J_i by an additional value j' that defines a new behavior of the program, that is for some i, J_i' = J_i + { j' } and Input = J_1 x ... x J_i' x ... x J_n
    2. extending the set of input variables by an additional dimension J_n+1
    Both will result in an extended feature F'.

    For 2. it is easy to see that

    #T|F' = #{ (J_1,...,J_n) x J_n+1 x O } 
          = #J * #J_n+1 * # O 
          = #T|F * #J_n+1
          ≥ #T|F * 2.

    As for 1., given that dim(J) > 1 each new value j' creates another full set of combinations of input variables (j_1,...,j',...,j_n), that is, the number of added test cases is

    #T|F' - #T|F = #(J_1 x ... x ^J_i x ... x J_n)
                 ≥ 2^(n-1)

    where ^J_i denotes not selecting this component.


    #T|F' ≥ #T|F + 2^(n-1).

    So, in general adding a new value to test, the lower bound for growth is 2^(n-1).  ⃞
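
The counting argument can be tried out numerically. The sketch below only counts input combinations (it assumes one expected output tuple per input, so outputs don't add to the count); the function names and example domains are made up for illustration.

```python
from itertools import product


def count_cases(domains) -> int:
    """Number of input combinations, i.e. the size of J_1 x ... x J_n."""
    return sum(1 for _ in product(*domains))


def added_by_new_value(domains, index: int) -> int:
    """Test cases added when the domain at position index gets one more value."""
    extended = [list(d) + ["new"] if i == index else list(d)
                for i, d in enumerate(domains)]
    return count_cases(extended) - count_cases(domains)


def factor_of_new_variable(domains, new_domain) -> float:
    """Growth factor when a whole new input variable is appended."""
    return count_cases(list(domains) + [new_domain]) / count_cases(domains)


binary = [[0, 1], [0, 1], [0, 1]]                    # n = 3, two values each
print(count_cases(binary))                           # 8
print(added_by_new_value(binary, 0))                 # 4, which equals 2^(n-1)
print(factor_of_new_variable(binary, [0, 1]))        # 2.0
```

With larger domains the added number only grows, e.g. for [["a", "b", "c"], [0, 1], [True, False]] a new value in the second variable adds 3 * 2 = 6 combinations.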

    Considering that the effort E|F of testing a feature depends on the selected test suite S|F < T|F, we might dare say that selecting S|F and the methods to evaluate T(i, ω) is a very important activity in testing.

    I hope this makes sense...

    Thursday, March 26, 2020

    ssh into libvirt guest

    Libvirt sets up NAT mode by default, allowing guests to communicate, for example, with the internet, with each other and with the host. A host interface vnetX will be created and normally an IP will be assigned automatically.

    # virsh domifaddr vm
     Name       MAC address          Protocol     Address
     vnet0      52:54:00:6f:dc:90    ipv4
    # virsh dumpxml --inactive vm
        <interface type="network">
          <mac address="52:54:00:6f:dc:90"/>
          <source network="default"/>
          <model type="virtio"/>
          <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x0"/>
        </interface>

    However, vnetX itself won't really get the IP assigned; that's not how it's intended. So, by default, neither the IP nor any name is available for ssh.

    What you can do instead is use the NSS module libvirt-nss.

    If you want to easily ssh into the guest using its libvirt name, set
    # /etc/nsswitch.conf:
    hosts:       files libvirt_guest dns
    and make sure to have the correct sshd configuration in your guest.