Vendors use different terminology to specify the performance of their Hardware Security Modules (HSMs). Common terms are transactions, exponentiations, encryptions or signatures per second, or microseconds per transaction, exponentiation, etc. Performance statistics that use different units cannot be compared directly. We’re trying to overcome that by using a common unit. This post elaborates on a small application for performance measurement.
Performance depends on the algorithm and the size of the key. Mostly, 1024-bit RSA private key operations are used, but that is often not specified. Units like “encryption” or “verification” are biased as well, as both encryption and verification are public key operations (and thus use small exponents), which are much faster than decrypting or signing. Counting “exponentiations” is sometimes used to inflate the statistics. For example, a 1024-bit RSA key implies 512 exponentiations for a single “transaction” (the performance numbers are blown up by a factor of 2^9 ... on paper).
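To make the unit problem concrete, here is a minimal sketch that converts two of the units mentioned above into signatures per second. The factor of 512 comes from the paragraph above; the function names and the input figures are made up for illustration and do not come from any vendor data sheet.

```python
# Sketch: normalize differently expressed performance figures to sig/sec.
# Function names and sample inputs are hypothetical.

# Per the text above: one 1024-bit RSA "transaction" counted as 2**9 exponentiations.
EXPONENTIATIONS_PER_SIGNATURE = 2 ** 9


def from_exponentiations_per_second(exp_per_sec):
    """Convert an 'exponentiations per second' figure to signatures per second."""
    return exp_per_sec / EXPONENTIATIONS_PER_SIGNATURE


def from_microseconds_per_signature(usec_per_sig):
    """Convert a 'microseconds per signature' figure to signatures per second."""
    return 1_000_000 / usec_per_sig


if __name__ == "__main__":
    # An impressive-looking exponentiation count shrinks by a factor of 512.
    print(from_exponentiations_per_second(6_656_000))  # 13000.0
    # A latency figure inverts into a rate.
    print(from_microseconds_per_signature(80))  # 12500.0
```

Once everything is expressed in sig/sec, the two hypothetical figures turn out to be in the same range, which is not obvious from the raw numbers.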
Performance is only comparable when using the same standard measurement unit. Since most vendors use 1024-bit RSA key signatures per second (sig/sec), let’s use that for a performance specification conformance test (or... let’s check the marketing on the box).
For this test we’re using a Sun Fire T2000 with three SCA6000 cards. The technical specification promises “Up to 13,000 RSA operations per second with 1,024-bit keys”. All three combined should reach about 39,000 RSA signatures/second... in theory.
A commonly used method to measure performance is the OpenSSL speed test. However, it is not possible to specify keys that are located on the HSM. Also, an engine is needed to let OpenSSL use the PKCS11 interface. The well-known OpenSC PKCS11 engine assumes that keys are on the HSM, while the RSA speed test generates its own key, causing the speed test to fail. Sun’s PKCS11 engine is fully supported (thanks to Darren J. Moffat for pointing that out, see his comment below), but the patches for OpenSSL are not supported by Sun. Lastly, the OpenSSL speed test uses fork/wait/pipe (with the undocumented -multi and -elapsed options for proper timing), whereas we want to use threads (less overhead, no IPC). So it was time to write a small performance test application that uses native PKCS11 calls.
The result of that speed test is a whopping 39353 sig/sec for a 1024-bit RSA private key. This was verified independently with the Unix time utility (for elapsed time) and the Solaris kstat utility (for actual hardware transactions).
Or... signing 7 million records in less than 3 minutes.
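The arithmetic behind these numbers can be checked with a few lines. All figures are taken from the text above; only the function names are ours.

```python
# Sketch: check the back-of-the-envelope numbers quoted above.

CARDS = 3                 # SCA6000 cards in the T2000
SPEC_PER_CARD = 13_000    # sig/sec promised per card
MEASURED_RATE = 39_353    # sig/sec measured with all cards
RECORDS = 7_000_000       # records to sign


def theoretical_rate(cards, per_card):
    """Combined specification rate if the cards scale linearly."""
    return cards * per_card


def seconds_to_sign(records, rate):
    """Elapsed seconds needed to sign `records` items at `rate` sig/sec."""
    return records / rate


if __name__ == "__main__":
    print(theoretical_rate(CARDS, SPEC_PER_CARD))            # 39000
    print(round(seconds_to_sign(RECORDS, MEASURED_RATE)))    # about 178 seconds
```

At the measured rate, 7 million signatures take roughly 178 seconds, which is just under the 180 seconds in 3 minutes.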
hsm-speed implementation notes
Download the hsm-speed package.
Simply creating a loop in which data is signed might not get the desired performance. A single loop performed at about 1600 sig/sec, while the specification promised 13000 sig/sec per card. A single loop (one processing thread) cannot submit requests fast enough to keep the bus filled. Creating multiple processing threads seems the obvious answer, especially since the T2000 uses an UltraSPARC T1 processor with 32 simultaneous processing threads. The speed test is therefore multithreaded (using pthreads for portability, not the Solaris native threads), and gets about 13200 sig/sec on a single card. Note that there is also the option to fork processes, which effectively gives multithreading per forked process. Since forking has more overhead than threading, and threading has more overhead than looping, a straightforward way to maximize performance is to increase the loop iteration count until it adds no more speed. Then increase the number of threads until that adds no more, then increase the number of forks.
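The threads-times-iterations structure can be sketched as follows. This is not the hsm-speed code, which uses pthreads and native PKCS11 calls; it is a Python illustration in which fake_sign is a hypothetical stand-in that sleeps for a millisecond to mimic waiting on the card.

```python
# Sketch: measure signing throughput for a given number of threads and
# loop iterations. fake_sign is a placeholder, not a real PKCS11 call.
import threading
import time


def fake_sign(data):
    """Stand-in for a signing call; the sleep mimics a round trip to the card."""
    time.sleep(0.001)
    return b"sig:" + data


def worker(iterations, counter, lock):
    """Sign `iterations` times, then add the local count to the shared total."""
    done = 0
    for _ in range(iterations):
        fake_sign(b"record")
        done += 1
    with lock:
        counter[0] += done


def measure(threads, iterations):
    """Return (total signatures, signatures per second) for one run."""
    counter = [0]
    lock = threading.Lock()
    pool = [
        threading.Thread(target=worker, args=(iterations, counter, lock))
        for _ in range(threads)
    ]
    start = time.perf_counter()
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    elapsed = time.perf_counter() - start
    return counter[0], counter[0] / elapsed


if __name__ == "__main__":
    # Raise the thread count step by step and watch where the rate levels off.
    for threads in (1, 2, 4, 8):
        total, rate = measure(threads, iterations=200)
        print(f"{threads} threads: {total} signatures, {rate:.0f} sig/sec")
```

Each worker counts locally and takes the lock only once at the end, so the counter itself does not become the bottleneck being measured.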
Solaris Cryptographic Framework notes
The Solaris cryptographic framework allows different slot configurations. The “Metaslot” serves as a single virtual slot with the combined capabilities of all the tokens and slots that have been installed. The “Keystore” slot groups only the crypto hardware together. The order in which multiple calls to C_FindObjects return objects from the metaslot is the reverse of that of the keystore slot. Hence, a search for a key that does not specify the object class will return the private key first on the metaslot, and the public key first on the keystore slot. Effectively, when using the keystore slot, a C_SignInit that returns the error “CKR_KEY_TYPE_INCONSISTENT” might be the result of not having specified CKO_PRIVATE_KEY in the search template for C_FindObjectsInit().
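The effect of the search template can be shown with a toy model. The attribute and class constants below use the values defined by PKCS11; everything else (find_objects, the dictionaries, the label "speedkey") is hypothetical and only mimics template matching, not the Solaris framework itself.

```python
# Sketch: why the first search result depends on slot order unless the
# template pins the object class. A toy model, not a PKCS11 binding.

CKA_CLASS = 0x00
CKA_LABEL = 0x03
CKO_PUBLIC_KEY = 0x02
CKO_PRIVATE_KEY = 0x03


def find_objects(objects, template):
    """Return objects whose attributes match every entry in the template."""
    return [
        obj for obj in objects
        if all(obj.get(attr) == value for attr, value in template.items())
    ]


public_key = {CKA_CLASS: CKO_PUBLIC_KEY, CKA_LABEL: "speedkey"}
private_key = {CKA_CLASS: CKO_PRIVATE_KEY, CKA_LABEL: "speedkey"}

# The two slot types return the same objects in opposite order.
metaslot_objects = [private_key, public_key]
keystore_objects = [public_key, private_key]

loose_template = {CKA_LABEL: "speedkey"}
strict_template = {CKA_LABEL: "speedkey", CKA_CLASS: CKO_PRIVATE_KEY}

if __name__ == "__main__":
    # Label only: the first hit on the keystore slot is the public key.
    print(find_objects(keystore_objects, loose_template)[0] is public_key)   # True
    # Label plus class: the private key comes back regardless of order.
    print(find_objects(keystore_objects, strict_template)[0] is private_key)  # True
```

With the class in the template, the code no longer depends on which slot configuration happens to be active.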
Another problem encountered with the Metaslot configuration is a bug in meta_release_slot_session, used by C_CloseAllSessions, that causes a nasty segmentation fault when trying to close a certain number of idle sessions. This is circumvented by closing individual sessions one by one, though that is a tiny bit detrimental to the overall performance.
It is essential that the cards have the same firmware. Exporting the keystore information to another card requires the same firmware on both cards.
The PIN is a combination of the username and password, separated by a colon. When the password requirement for the SCA6000 is set to high, the password must be at least 8 characters long. However, the Solaris getpass() call (from stdlib.h) only returns the first 8 characters, so there is no room left for the username. The GNU getpass() (libc) does not have this limitation. To circumvent this issue, use getpassphrase() on Solaris. Note that this function is not portable.
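A small illustration of the truncation problem follows. The username and password are invented, and truncate_to_limit merely imitates an 8-character cut-off; it does not call the Solaris function.

```python
# Sketch: what an 8-character input limit does to a "username:password" PIN.
# The credentials and helper names are hypothetical.


def build_pin(username, password):
    """Combine username and password into the colon-separated PIN format."""
    return f"{username}:{password}"


def truncate_to_limit(entered, limit=8):
    """Imitate an input routine that keeps only the first `limit` characters."""
    return entered[:limit]


if __name__ == "__main__":
    pin = build_pin("admin", "s3cretpw")
    print(pin)                     # admin:s3cretpw
    print(truncate_to_limit(pin))  # admin:s3
```

After truncation only two characters of the password survive, so a minimum password length of 8 can never be satisfied this way.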
Notes on PKCS11
Threads that share a single session might interfere with each other between a C_SignInit and a C_Sign call, which leads to unpredictable behavior. A thread-safe way of sharing sessions is to use mutex locks, but that significantly reduces the benefit of using threads. One way to avoid interference without mutex locks is to create one session per thread. Since operations on separate sessions can safely interleave without interfering with each other, this is a very effective way to guarantee thread safety without locking.
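The difference between a shared session and one session per thread can be sketched with a toy session object. The Session class below is hypothetical. It only models the rule that a session holds one active signing operation at a time, loosely following the PKCS11 return codes CKR_OPERATION_ACTIVE and CKR_OPERATION_NOT_INITIALIZED; a real token may behave differently.

```python
# Sketch: one active signing operation per session. A toy model, not a
# PKCS11 binding; the interleaving is written out by hand to keep it
# deterministic.


class Session:
    """Holds at most one pending sign operation, like a PKCS11 session."""

    def __init__(self):
        self.active_key = None

    def sign_init(self, key):
        if self.active_key is not None:
            raise RuntimeError("operation active")  # cf. CKR_OPERATION_ACTIVE
        self.active_key = key

    def sign(self, data):
        if self.active_key is None:
            raise RuntimeError("operation not initialized")
        key, self.active_key = self.active_key, None
        return f"{key}({data})"


def interleaved(session_a, session_b):
    """Two callers whose init and sign steps interleave, as threads might."""
    session_a.sign_init("keyA")
    session_b.sign_init("keyB")
    return session_a.sign("dataA"), session_b.sign("dataB")


if __name__ == "__main__":
    # One session per caller: both signatures come out correctly.
    print(interleaved(Session(), Session()))
    # One shared session: the second init collides with the first.
    shared = Session()
    try:
        interleaved(shared, shared)
    except RuntimeError as err:
        print("shared session failed:", err)
```

With one session per thread, the interleaving that breaks the shared case is harmless, and no lock is needed around the init and sign pair.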