Crafting standard datasets¶

In the earlier tutorials, we saw an example of how to use qusi on a transit dataset and how that dataset was constructed. We’ll expand on the ways we can work with the datasets here to create a dataset that allows you to apply qusi to your own dataset.

The content of this tutorial will round out the material you need to create a standard binary or multi-class classification problem with simple light curves. In addition to the content of this tutorial, qusi includes a way to easily inject data into light curves (with any mix of real or synthetic data). This will be discussed in Crafting injected datasets. Finally, qusi is not limited to simple light curves and classification. We’ll explore more complex datasets in Advanced crafting datasets.

Light curve lengths¶

In the Basic transit identification dataset construction tutorial, we showed how the transit dataset was constructed. One piece we (purposely) glossed over at the time, is that qusi is doing several preprocessing steps on the data before it’s handed to the neural network (NN). We’ll talk more about the other preprocessing steps in the next section, but the first one to note is a uniform lengthening step. Some of the NNs used by qusi (though not all) require a uniform light curve input length. To accomplish this, the default preprocessing will truncate light curves that are too long and repeat light curves that are too short. Perhaps surprisingly, in addition to enabling NNs that require uniform length, this uniform lengthening (along with some other preprocessing) can actually improve the NN’s predictive performance by reducing overfitting due to the randomization of flux positions this preprocessing adds.

However, the uniform length is set to a specific default value. A good choice for this might be the median length of the light curves in your dataset. For now, we’ll set the length for the preprocessing step, but keep all other preprocessing steps the same. To do this, at the top of our dataset file, we’ll first import some necessary functions:

from functools import partial
from qusi.transform import default_light_curve_post_injection_transform

Then, were we specify the construction of our dataset, we’ll add an additional input parameter. So taking what we had in the previous tutorial, we can now change the dataset creation statement to:

train_light_curve_dataset = LightCurveObservationDataset.new(
    light_curve_collections=[positive_train_light_curve_collection,
                             negative_train_light_curve_collection],
    post_injection_transform = partial(default_light_curve_post_injection_transform, length=4000)
)

Let’s clarify what’s happening here. The LightCurveObservationDataset.new() constructor takes as input a parameter called post_injection_transform. This function will be applied to our light curves before they get handed to the NN. default_light_curve_observation_post_injection_transform is the default set of preprocessing transforms qusi uses. We’ll look at these transforms in more detail in the next section. partial is a Python builtin function, that takes another function as input, along with a parameter of that function, and creates a new function with that parameter prefilled. So partial(default_light_curve_observation_post_injection_transform, length=4000) is taking our default transforms, setting the uniforming lengthening step to 4000, then giving us back the updated function, which we can then give to the dataset. The advantage to this approach is that post_injection_transform is completely flexible, as we’ll explore more in the next section.

Before we run the updated code, we also need to use a NN model which expects our new input size. Luckily, qusi has NN architectures that automatically resize their components for a given input size. So the only other change from the existing code is to change Hadryss.new() to Hadryss.new(input_length=4000).

Modifying the preprocessing¶

In the previous section, we only changed the length of that the uniform lengthening preprocessing transform is using. However, we still used all the remaining default preprocessing steps that are contained in default_light_curve_observation_post_injection_transform. Let’s take a look at what the default one does. It looks something like:

def default_light_curve_observation_post_injection_transform(x: LightCurveObservation, length: int, randomize: bool = True) -> (Tensor, Tensor):
    x = remove_infinite_flux_data_points_from_light_curve_observation(x)
    x = remove_nan_flux_data_points_from_light_curve_observation(x)
    if randomize:
        x = randomly_roll_light_curve_observation(x)
    x = from_light_curve_observation_to_fluxes_array_and_label_array(x)
    x = (make_uniform_length(x[0], length=length), x[1])
    x = pair_array_to_tensor(x)
    x = (normalize_tensor_by_modified_z_score(x[0]), x[1])
    return x

It’s a function that takes in a LightCurveObservation and spits out two Tensors, one for the fluxes and one for the label to predict. Most of the data transform functions within have names that are largely descriptive, but we’ll walk through them anyway. remove_nan_flux_data_points_from_light_curve_observation removes time steps from a LightCurveObservation where the flux is NaN. randomly_roll_light_curve_observation randomly rolls the light curve (a random cut is made and the two segments’ order is swapped). from_light_curve_observation_to_fluxes_array_and_label_array extracts two NumPy arrays from a LightCurveObservation, one for the fluxes and one from the label (which in this case will be an array with a single value). make_uniform_length performs the uniform lengthening on the fluxes as we discussed in the previous section. pair_array_to_tensor converts the pair of NumPy arrays to a pair of PyTorch tensors (PyTorch’s equivalent of an array). normalize_tensor_by_modified_z_score normalizes a tensor based on the median absolute deviation. Notice, this is only applied to the flux tensor, not the label tensor. The randomize parameter enables or disables randomization of the functions which may include randomization. During training, randomization should be on to make sure we get variation in training observations. During evaluation and inference, it should be off to get repeatable results. In our previous example, to keep the code simple, we did not disable randomization for the validation dataset. Although in most cases it will not make a major difference, randomization should be disabled on the validation dataset. It should only be enabled for the training dataset.

It’s worth noting, default_light_curve_observation_post_injection_transform is just a function that can be replaced as desired. To remove one of the preprocessing steps or add in an addition one, we can simply make a modified version of this function. Additionally, qusi does not require the transform function to output only the fluxes and a binary label. The Hadryss NN model expects these two types of values for training, but other models may take advantage of the times of the light curve, or they may predict multi-class or regression labels.