Splitting a string by whitespace is a common enough case in text processing. In this fleck of Rust I'll show you how to do it for a real-life task.
Working with MIDI synthesizers, I often need to receive MIDI data from them, particularly MIDI System Exclusive dumps. Often I use SysEx Librarian by Snoize, but in command line scenarios I rely on Geert Bevin's ReceiveMIDI (and its complement, SendMIDI).
ReceiveMIDI and SendMIDI have a text format for MIDI messages. When you start ReceiveMIDI and make it listen to a specific MIDI port, any messages that appear are shown one per line, something like this (slightly edited for clarity):
midi-clock
channel 1 note-on C3 77
midi-clock
midi-clock
channel 1 note-off C3 0
midi-clock
midi-clock
channel 1 note-on E3 69
midi-clock
midi-clock
channel 1 note-off E3 0
midi-clock
midi-clock
channel 1 note-on G3 70
midi-clock
midi-clock
channel 1 note-off G3 0
midi-clock
midi-clock
active-sensing
In particular, MIDI System Exclusive messages are received with all the message data bytes, in decimal or hex as specified:
% receivemidi dev "Studio 1824" hex
system-exclusive hex 40 00 61 00 0A 01 dec
What I would like to do is filter out all other messages besides MIDI System Exclusive, and then collect the data bytes for further processing. As you can see from the examples above, each text representation of a MIDI message begins with the name of the message, followed by some whitespace, and then some message-specific data.
In the case of MIDI System Exclusive messages, the text representation starts with system-exclusive, which is followed by hexadecimal bytes (or decimal, depending on the command-line parameter specified, if any). The initiator and terminator of a System Exclusive message (0xF0 and 0xF7) are not included.
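Since the text format leaves out the framing bytes, anything that needs a complete System Exclusive message has to add them back. The following is only a sketch of that step: frame_sysex is a hypothetical helper made up for this illustration, not something taken from ReceiveMIDI or syxpack-rs.

```rust
/// Wraps SysEx payload bytes in the initiator (0xF0) and terminator (0xF7).
/// `frame_sysex` is a hypothetical helper, used here for illustration only.
fn frame_sysex(payload: &[u8]) -> Vec<u8> {
    let mut message = Vec::with_capacity(payload.len() + 2);
    message.push(0xF0);
    message.extend_from_slice(payload);
    message.push(0xF7);
    message
}

fn main() {
    // The data bytes from the ReceiveMIDI example line.
    let payload = [0x40, 0x00, 0x61, 0x00, 0x0A, 0x01];
    let message = frame_sysex(&payload);
    // Print the framed message as hexadecimal bytes.
    println!("{:02X?}", message);
}
```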
So, to decode MIDI System Exclusive messages I need to pick out the lines that start with system-exclusive and convert the bytes that follow into u8 values.
Here I'll just concentrate on the first task. Usually I pipe the output of ReceiveMIDI into my program, reading from standard input. For each line I receive, I split it on whitespace. If the first element of the resulting vector is not system-exclusive, I just skip the line.
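The "skip everything that is not SysEx" check can be sketched on its own, without collecting the parts into a vector first. The helper name is_sysex_line is an assumption made for this example and does not appear in syxreceive.

```rust
/// Returns true when the first whitespace-separated token of `line`
/// is the ReceiveMIDI message name "system-exclusive".
/// `is_sysex_line` is a hypothetical helper, not part of syxreceive.
fn is_sysex_line(line: &str) -> bool {
    line.split_whitespace().next() == Some("system-exclusive")
}

fn main() {
    let lines = [
        "midi-clock",
        "channel 1 note-on C3 77",
        "system-exclusive hex 40 00 61 00 0A 01 dec",
        "",
    ];
    for line in lines {
        // Only the third line is reported as a SysEx line.
        println!("{:?} -> {}", line, is_sysex_line(line));
    }
}
```

Because split_whitespace is lazy, asking for only the first token does not scan the rest of the line.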
The following code is an extract from a larger program called syxreceive (found in the syxpack-rs crate). The loop body runs once for each line received from standard input.
// Wrapped in main here so that the extract compiles on its own.
fn main() {
    loop {
        let mut input = String::new();
        match std::io::stdin().read_line(&mut input) {
            // Zero bytes read means end of input.
            Ok(0) => return,
            Ok(_) => {
                let parts: Vec<&str> = input.split_whitespace().collect();
                // We want at least "system-exclusive", "hex" or "dec", and one byte
                if parts.len() < 3 {
                    continue;
                }
                // Only deal with SysEx:
                if parts[0] == "system-exclusive" {
                    // Get the base of the byte strings.
                    let base = if parts[1] == "hex" { 16 } else { 10 };
                    let mut data: Vec<u8> = Vec::new();
                    for part in &parts[2..] {
                        match u8::from_str_radix(part, base) {
                            Ok(b) => data.push(b),
                            // Skip anything that is not a valid byte,
                            // such as a trailing "dec".
                            Err(_) => continue,
                        }
                    }
                    println!("data = {:?}", data);
                }
            }
            Err(e) => {
                eprintln!("{}", e);
                std::process::exit(1);
            }
        }
    }
}
If input contains system-exclusive hex 40 00 61 00 0A 01 dec like in the ReceiveMIDI example, then parts will have the following content:
[0] "system-exclusive"
[1] "hex"
[2] "40"
[3] "00"
[4] "61"
[5] "00"
[6] "0A"
[7] "01"
[8] "dec"
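The first two elements will be checked but then discarded. The last one, dec, is discarded as well: read as a hexadecimal number it is too large to fit in a u8, so the conversion fails and the element is skipped.
After processing a slice of the parts vector, parts[2..], the data vector should contain [64, 0, 97, 0, 10, 1].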
A quick test:
$ echo "system-exclusive hex 40 00 61 00 0A 01 dec" | target/debug/syxreceive
data = [64, 0, 97, 0, 10, 1]
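To check the parsing without piping anything through standard input, the same logic can be moved into a function. This is a sketch rather than the code in syxreceive: the name parse_sysex_line and the Option return type are assumptions made for the example, and filter_map stands in for the explicit loop.

```rust
/// Parses one line of ReceiveMIDI output.
/// Returns the data bytes for a "system-exclusive" line, or None otherwise.
/// `parse_sysex_line` is a hypothetical name chosen for this sketch.
fn parse_sysex_line(line: &str) -> Option<Vec<u8>> {
    let parts: Vec<&str> = line.split_whitespace().collect();
    // Need the message name, "hex" or "dec", and at least one byte.
    if parts.len() < 3 || parts[0] != "system-exclusive" {
        return None;
    }
    let base = if parts[1] == "hex" { 16 } else { 10 };
    // Keep only the tokens that parse as a u8 in the chosen base.
    let data: Vec<u8> = parts[2..]
        .iter()
        .filter_map(|part| u8::from_str_radix(part, base).ok())
        .collect();
    Some(data)
}

fn main() {
    let line = "system-exclusive hex 40 00 61 00 0A 01 dec";
    assert_eq!(parse_sysex_line(line), Some(vec![64, 0, 97, 0, 10, 1]));
    assert_eq!(parse_sysex_line("midi-clock"), None);
    // The trailing "dec" is dropped because 0xDEC does not fit in a u8.
    assert!(u8::from_str_radix("dec", 16).is_err());
}
```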
split_whitespace vs. split
When searching for a library function to split a string on whitespace, you might first reach for the split function. However, if there is any chance that the parts of a string might be separated by more than one whitespace character, you will want to use split_whitespace instead, as in the example above.
That is because split treats every occurrence of the separator as a boundary, so a run of consecutive whitespace characters produces empty strings in the result. If you don't want that, you will need to filter out any empty elements after splitting, but split_whitespace does not include any empty elements in the first place.
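A small comparison makes the difference visible. The input line here is made up for the example, with extra spaces added between the parts.

```rust
fn main() {
    // Two spaces after the message name, three after "hex".
    let line = "system-exclusive  hex   40 00";

    // split(' ') treats every single space as a separator,
    // so runs of spaces leave empty strings behind.
    let with_split: Vec<&str> = line.split(' ').collect();
    assert_eq!(with_split, ["system-exclusive", "", "hex", "", "", "40", "00"]);

    // Filtering out the empty elements gives the same result
    // that split_whitespace produces directly.
    let filtered: Vec<&str> = line.split(' ').filter(|s| !s.is_empty()).collect();
    let with_split_whitespace: Vec<&str> = line.split_whitespace().collect();
    assert_eq!(with_split_whitespace, ["system-exclusive", "hex", "40", "00"]);
    assert_eq!(filtered, with_split_whitespace);
}
```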
Note also that split_whitespace uses the Unicode definition of whitespace, which is a lot broader than what you will typically encounter in machine-readable content. In the unlikely case that split_whitespace is giving you performance problems, you could try the split_ascii_whitespace function instead.
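The two functions only differ on input that contains non-ASCII whitespace. As an illustration, the made-up string below uses an em space (U+2003) between the first two bytes, which is not something ReceiveMIDI is known to emit.

```rust
fn main() {
    // An em space (U+2003) between "40" and "00", a plain space before "61".
    let line = "40\u{2003}00 61";

    // The Unicode-aware version splits on the em space too.
    let unicode: Vec<&str> = line.split_whitespace().collect();
    assert_eq!(unicode, ["40", "00", "61"]);

    // The ASCII-only version leaves the em space inside the first element.
    let ascii: Vec<&str> = line.split_ascii_whitespace().collect();
    assert_eq!(ascii, ["40\u{2003}00", "61"]);
}
```

For plain ASCII input such as the ReceiveMIDI output above, both functions return the same parts.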