Basic transit identification dataset construction

This tutorial will show you how we built the datasets for the Basic transit identification with prebuilt components tutorial. This will also show how this can be adjusted for your own datasets. This tutorial expects you have already completed the Basic transit identification with prebuilt components tutorial and that you have the examples directory as obtained in that tutorial.

User defined components of a qusi dataset

For qusi to work with your data, it needs 3 components to be defined:

  1. A function that finds your data files.

  2. A function that loads the input data from a given file. The input data is often the fluxes and the times of a light curve.

  3. A function that loads the label or target value for a file. This is what the neural network (NN) is going to try to predict, such as a classification label.

Creating a function to find the data

The first thing we’ll do is provide functions on where to find the data. A simple version of this might look like:

def get_positive_train_paths():
    return list(Path('data/spoc_transit_experiment/train/positives').glob('*.fits'))

This functions says to create a Path object for a directory at data/spoc_transit_experiment/train/positives. Then, it obtains all the files ending with the .fits extension within that directory. It puts that in a list and returns that list. In particular, qusi expects a function that takes no input parameters and outputs a list of Paths.

In our example code, we’ve split the data based on if it’s train, validation, or test data and we’ve split the data based on if it’s positive or negative data. And we provide a function for each of the 6 permutations of this, which is almost identical to what’s above. You can see the above function and other 5 similar functions near the top of scripts/dataset.py.

qusi is flexible in how the paths are provided, and this construction of having a separate function for each type of data is certainly not the only way of approaching this. Depending on your task, another option might serve better. In another tutorial, we will explore a few example alternatives. However, to better understand those alternatives, it’s first useful to see the rest of this dataset construction.

Creating a function to load the data

The next thing qusi needs is the function to load the data. In our example case, we use the same function to do this in all cases. It looks like:

def load_times_and_fluxes_from_path(path):
    light_curve = TessMissionLightCurve.from_path(path)
    return light_curve.times, light_curve.fluxes

This uses a builtin class in qusi that is designed for loading light curves from TESS mission FITS files. However, the important thing is that your function returns two comma separated values, which is a NumPy array of the times and a NumPy array of the fluxes of your light curve. And the function takes a single Path object as input. These Path objects will be one of the ones we returned from the functions in the previous section. But you can write any code you need to get from a Path to the two arrays that represent times and fluxes. For example, if your file is a simple CSV file, it would be easy to use Pandas to load the CSV file and extract the time column and the flux column as two arrays which are then returned at the end of the function. You will see the above function in scripts/dataset.py.

Creating a function to provide a label for the data

Now we need to define what label belongs to each light curve in our data. In the next section, we’re going to join the function we made previously to get the paths to the function that assigns the labels. Since we already defined different functions to get the paths for positive cases and negative cases and we’re going to explicitly join this to a label function, for this current use case, we don’t actually need the label function to contain any real logic in it. We’re going to define two functions, that always return 0 (for negative) or always return 1 (for positive). They look like:

def positive_label_function(path):
    return 1

def negative_label_function(path):
    return 0

Note, qusi expects the label functions to take in a Path object as input, even if we don’t end up using it. This is because, it allows for more flexible configurations. For example, in a different situation, the data might not be split into positive and negative directories, but instead, the label data might be contained within the user’s data file itself. Also, in other cases, this label can also be something other than 0 and 1. The label is whatever the NN is attempting to predict for the input light curve. But for our binary classification case, 0 and 1 are what we want to use. Once again, you can see these functions in scripts/dataset.py.

Creating a light curve collection

Note

In the below classes, qusi uses the word “observation” in the statistics meaning of the term. That is, a single example of the data with a label. It is not referring to an “observation” in astrophysical terminology. Unfortunately, when different disiplines collide, they can have conflicting terminology.

Now we’re going to join the various functions we’ve just defined into LightCurveObservationCollections. For the case of positive train light curves, this looks like:

positive_train_light_curve_collection = LightCurveObservationCollection.new(
    get_paths_function=get_positive_train_paths,
    load_times_and_fluxes_from_path_function=load_times_and_fluxes_from_path,
    load_label_from_path_function=positive_label_function)

This defines a collection of labeled light curves where qusi knows how to obtain the paths, how to load the times and fluxes of the light curves, and how to load the labels. This LightCurveObservationCollection.new(... function takes in the three pieces we just built earlier. Note that you pass in the functions themselves, not the output of the functions. So for the get_paths_function parameter, we pass get_positive_train_paths, not get_positive_train_paths() (notice the difference in parenthesis). qusi will call these functions internally. However, the above bit of code is not by itself in scripts/dataset.py as the rest of the code in this tutorial was. This is because qusi doesn’t use this collection by itself. It uses it as part of a dataset. We will explain why there’s this extra layer in a moment.

Creating a dataset

Finally, we build the dataset qusi uses to train the network. First, we’ll take a look and then unpack it:

def get_transit_train_dataset():
    positive_train_light_curve_collection = LightCurveObservationCollection.new(
        get_paths_function=get_positive_train_paths,
        load_times_and_fluxes_from_path_function=load_times_and_fluxes_from_path,
        load_label_from_path_function=positive_label_function)
    negative_train_light_curve_collection = LightCurveObservationCollection.new(
        get_paths_function=get_negative_train_paths,
        load_times_and_fluxes_from_path_function=load_times_and_fluxes_from_path,
        load_label_from_path_function=negative_label_function)
    train_light_curve_dataset = LightCurveDataset.new(
        standard_light_curve_collections=[positive_train_light_curve_collection,
                                          negative_train_light_curve_collection])
    return train_light_curve_dataset

This is the function which generates the training dataset we called in the Basic transit identification with prebuilt components tutorial. The parts of this function are as follows. First, we create the positive_train_light_curve_collection. This is exactly what we just saw in the previous section. Next, we create a negative_train_light_curve_collection. This is almost identical to its positive counterpart, except now we pass the get_negative_train_paths and negative_label_function instead of the positive versions. Then there is the train_light_curve_dataset = LightCurveDataset.new( line. This creates a qusi dataset built from these two collections. The reason the collections are separate is that LightCurveDataset has several mechanisms working under-the-hood. Notably for this case, LightCurveDataset will balance the two light curve collections. We know of a lot more light curves that don’t have planet transits in them than we do light curves that do have planet transits. In the real world case, it’s thousands of times more at least. But for a NN, it’s usually useful to during the training process to show equal amounts of the positives and negatives. LightCurveDataset will do this for us. You may have also noticed that we passed these collections in as the standard_light_curve_collections parameter. LightCurveDataset also allows for passing different types of collections. Notably, collections can be passed such that light curves from one collection will be injected into another. This is useful for injecting synthetic signals into real telescope data. However, we’ll save the injection options for another tutorial.

You can see the above get_transit_train_dataset dataset creation function in the scripts/dataset.py file. The only part of that file we haven’t yet looked at in detail is the get_transit_validation_dataset and get_transit_finite_test_dataset functions. However, these are nearly identical to the above get_transit_train_dataset expect using the validation and test path obtaining functions above instead of the train ones.

Adjusting this for your own binary classification task

Now for a few quick starting points for how you might adjust this for your own binary classification task.

If you were simply looking for a different type of phenomena that wasn’t transits, but were still searching in the TESS SPOC data, you would just need to replace the light curve FITS files in the data directory with the ones that match what you’re trying to search for. Then, nothing else would need to change, and you would be able to use these same scripts to train a NN for search for that phenomena.

Let’s say you weren’t searching for TESS SPOC data, but were searching some other telescope data with light curves around the same length. If you’re still able to put the light curves in the same data directory file structure as previously, you would just need to update the load_times_and_fluxes_from_path function to tell qusi how to get the times and fluxes from your file paths. And again, the existing scripts will allow you to start training a NN to search for your phenomena.

Working with light curves with a different length is also fairly simple, but does require passing a different integer to the preprocessing step.