Flecks of Rust #9: Splitting a string by whitespace

Splitting a string by whitespace is a common task in text processing. In this fleck of Rust I'll show you how to do it for a real-life task.

Working with MIDI synthesizers, I often need to receive MIDI data from them, particularly MIDI System Exclusive dumps. Often I use SysEx Librarian by Snoize, but in command line scenarios I rely on Geert Bevin's ReceiveMIDI (and its complement, SendMIDI).

ReceiveMIDI and SendMIDI have a text format for MIDI messages. When you start ReceiveMIDI and make it listen to a specific MIDI port, any messages that appear are shown one per line, something like this (slightly edited for clarity):

midi-clock
channel  1   note-on           C3  77
midi-clock
midi-clock
channel  1   note-off          C3   0
midi-clock
midi-clock
channel  1   note-on           E3  69
midi-clock
midi-clock
channel  1   note-off          E3   0
midi-clock
midi-clock
channel  1   note-on           G3  70
midi-clock
midi-clock
channel  1   note-off          G3   0
midi-clock
midi-clock
active-sensing

In particular, MIDI System Exclusive messages are received with all the message data bytes, in decimal or hex as specified:

% receivemidi dev "Studio 1824" hex
system-exclusive hex 40 00 61 00 0A 01 dec

What I would like to do is filter out all other messages besides MIDI System Exclusive, and then collect the data bytes for further processing. As you can see from the examples above, each text representation of a MIDI message begins with the name of the message, followed by some whitespace, and then some message-specific data.

In the case of MIDI System Exclusive messages, the text representation starts with system-exclusive, which is followed by hexadecimal bytes (or decimal, depending on the command-line parameter specified, if any). The initiator and terminator of a System Exclusive message (0xF0 and 0xF7) are not included.
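Since the initiator and terminator are left out of the text representation, anything that needs a complete System Exclusive message has to put them back. A minimal sketch of that step follows; the frame_sysex helper and the constant names are made up for this illustration and are not part of ReceiveMIDI or syxreceive.

```rust
// MIDI System Exclusive initiator and terminator bytes.
const SYSEX_START: u8 = 0xF0;
const SYSEX_END: u8 = 0xF7;

/// Hypothetical helper: wraps the data bytes in the SysEx initiator
/// and terminator to form a complete message.
fn frame_sysex(data: &[u8]) -> Vec<u8> {
    let mut message = Vec::with_capacity(data.len() + 2);
    message.push(SYSEX_START);
    message.extend_from_slice(data);
    message.push(SYSEX_END);
    message
}

fn main() {
    // The data bytes from the ReceiveMIDI example above.
    let data = [0x40, 0x00, 0x61, 0x00, 0x0A, 0x01];
    let message = frame_sysex(&data);
    println!("{:02X?}", message);
}
```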

So, to decode MIDI System Exclusive messages I need to:

Here I'll just concentrate on the first task. Usually I pipe the output of ReceiveMIDI into my program, reading from standard input. For each line I receive, I split it on whitespace. If the first element of the resulting vector is not system-exclusive, I just skip the line.

Splitting and processing

The following code is an extract from a larger program called syxreceive (found in the syxpack-rs crate). It reads standard input one line at a time and processes each line as it arrives.

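fn main() { // enclosing function is not shown in the extract; main is assumed here
    loop {
        let mut input = String::new();
        match std::io::stdin().read_line(&mut input) {
            // Zero bytes read means end of input.
            Ok(0) => return,
            Ok(_) => {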
                let parts: Vec<&str> = input.split_whitespace().collect();

                // We want at least "system-exclusive", "hex" or "dec", and one byte
                if parts.len() < 3 {
                    continue;
                }

                // Only deal with SysEx:
                if parts[0] == "system-exclusive" {
                    // Get the base of the byte strings.
                    let base = if parts[1] == "hex" { 16 } else { 10 };

                    let mut data: Vec<u8> = Vec::new();

                    for part in &parts[2..] {
                        match u8::from_str_radix(part, base) {
                            Ok(b) => data.push(b),
                            Err(_) => {
                                //eprintln!("Error in byte string '{}': {}", part, e);
                                continue;
                            }
                        }
                    }
                    println!("data = {:?}", data);
                }
            },
            Err(e) => {
                eprintln!("{}", e);
                std::process::exit(1);
            }
        }
    }
}

If input contains system-exclusive hex 40 00 61 00 0A 01 dec, as in the ReceiveMIDI example, then parts will have the following content:

[0] "system-exclusive"
[1] "hex"
[2] "40"
[3] "00"
[4] "61"
[5] "00"
[6] "0A"
[7] "01"
[8] "dec"

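The first two elements are checked and then skipped by taking the slice parts[2..]. The last one, dec, is dropped as well, because it does not parse as a byte: read as hexadecimal, DEC is 3564, which does not fit in a u8, so from_str_radix returns an error. After the loop, the data vector should contain [64, 0, 97, 0, 10, 1].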

A quick test:

    $ echo "system-exclusive hex 40 00 61 00 0A 01 dec" | target/debug/syxreceive
    data = [64, 0, 97, 0, 10, 1]
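To try the same logic without piping anything through standard input, the per-line work can be pulled into a function. This is only a sketch: parse_sysex_line is a name invented here, not something found in syxreceive, and it mirrors the extract by treating any base marker other than hex as decimal and by silently dropping tokens that do not parse as a byte.

```rust
/// Hypothetical helper: returns the data bytes if the line is a
/// System Exclusive message in ReceiveMIDI's text format, otherwise None.
fn parse_sysex_line(line: &str) -> Option<Vec<u8>> {
    let parts: Vec<&str> = line.split_whitespace().collect();

    // We want at least "system-exclusive", "hex" or "dec", and one byte.
    if parts.len() < 3 || parts[0] != "system-exclusive" {
        return None;
    }

    // Get the base of the byte strings.
    let base = if parts[1] == "hex" { 16 } else { 10 };

    // Keep every token that parses as a u8 in that base; drop the rest.
    let data: Vec<u8> = parts[2..]
        .iter()
        .filter_map(|part| u8::from_str_radix(part, base).ok())
        .collect();

    Some(data)
}

fn main() {
    let sysex = "system-exclusive hex 40 00 61 00 0A 01 dec";
    let note = "channel  1   note-on           C3  77";

    println!("{:?}", parse_sysex_line(sysex)); // prints Some([64, 0, 97, 0, 10, 1])
    println!("{:?}", parse_sysex_line(note)); // prints None
}
```

Returning an Option keeps the filtering decision (is this a SysEx line at all?) separate from whatever is done with the bytes afterwards.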

Using split_whitespace vs. split

When searching for a library function to split a string on whitespace, you might first reach for the split function. However, if there is any chance that the parts of a string might be separated by more than one whitespace character, you will want to use split_whitespace instead, as in the example above.

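That is because split produces an empty string between every pair of consecutive separators, so a run of spaces shows up as a series of empty elements. If you don't want that, you need to filter out the empty elements after splitting, whereas split_whitespace never produces them in the first place. It also ignores leading and trailing whitespace, including the newline that read_line leaves at the end of the line.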
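The difference is easy to see on a shortened ReceiveMIDI line with runs of spaces in it. The two helper functions below are named for this illustration only.

```rust
/// Splits on a single space character, keeping the empty strings
/// that appear between consecutive spaces.
fn split_on_space(line: &str) -> Vec<&str> {
    line.split(' ').collect()
}

/// Splits on runs of whitespace; never yields empty strings.
fn split_clean(line: &str) -> Vec<&str> {
    line.split_whitespace().collect()
}

fn main() {
    // Two spaces after "channel", three after "1".
    let line = "channel  1   note-on";

    println!("{:?}", split_on_space(line));
    // prints ["channel", "", "1", "", "", "note-on"]

    println!("{:?}", split_clean(line));
    // prints ["channel", "1", "note-on"]

    // Filtering the empty strings out of split's result gives the same parts.
    let filtered: Vec<&str> = line.split(' ').filter(|s| !s.is_empty()).collect();
    assert_eq!(filtered, split_clean(line));
}
```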

Note also that split_whitespace uses the Unicode definition of whitespace (the White_Space property), which covers many more characters than you will typically meet in machine-readable content. If you know your input is ASCII and split_whitespace turns out to be a performance problem, which is unlikely, you could try the split_ascii_whitespace function instead.
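The two functions only disagree when the input contains whitespace outside ASCII. A small sketch with an EM SPACE (U+2003) between the first two tokens shows this; the input string is invented for the example and is not something ReceiveMIDI is known to produce.

```rust
/// Tokens as seen by the Unicode-aware splitter.
fn unicode_tokens(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Tokens as seen by the ASCII-only splitter.
fn ascii_tokens(text: &str) -> Vec<&str> {
    text.split_ascii_whitespace().collect()
}

fn main() {
    // U+2003 EM SPACE is Unicode whitespace but not ASCII whitespace.
    let text = "40\u{2003}00 61";

    // split_whitespace treats the EM SPACE as a separator: three tokens.
    assert_eq!(unicode_tokens(text).len(), 3);

    // split_ascii_whitespace does not, so the first two stay glued together.
    assert_eq!(ascii_tokens(text).len(), 2);

    // On plain ASCII input the two agree.
    let plain = "40 00  61";
    assert_eq!(unicode_tokens(plain), ascii_tokens(plain));
}
```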