Quickstart

To get started with LSDB, we will demonstrate a very common workflow: crossmatching a small set of objects of interest with a large survey catalog stored in HiPSCat format (Gaia), applying cuts to the data, and saving the final result.

The first thing you need to do is to import our package.

[1]:
import lsdb

Create a pandas DataFrame with the equatorial coordinates (right ascension and declination, in degrees) of your objects.

[2]:
import pandas as pd

# The coordinates (ra, dec) for our objects of interest
objects = [(180.080612, 9.507394), (179.884664, 10.479632), (179.790319, 9.551745)]

objects_df = pd.DataFrame(objects, columns=["ra", "dec"])
objects_df
[2]:
ra dec
0 180.080612 9.507394
1 179.884664 10.479632
2 179.790319 9.551745
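Since the coordinates must be in degrees, a quick sanity check before building a catalog can catch unit mix-ups (for example, right ascension given in hours). The sketch below assumes the conventional ranges of [0, 360) degrees for right ascension and [-90, 90] degrees for declination. The helper `validate_coordinates` is hypothetical and is not part of LSDB or pandas.

```python
# Hypothetical helper for illustration; not part of LSDB or pandas.
def validate_coordinates(objects):
    """Raise ValueError if any (ra, dec) pair, in degrees, is out of range."""
    for ra, dec in objects:
        if not 0.0 <= ra < 360.0:
            raise ValueError(f"ra={ra} is outside [0, 360) degrees")
        if not -90.0 <= dec <= 90.0:
            raise ValueError(f"dec={dec} is outside [-90, 90] degrees")
    return True


# The same coordinates used in the tutorial
objects = [(180.080612, 9.507394), (179.884664, 10.479632), (179.790319, 9.551745)]
coordinates_ok = validate_coordinates(objects)
print(coordinates_ok)
```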

Now that the data is in a DataFrame, you can create an LSDB catalog from it, which puts it in HiPSCat format.

[3]:
my_object_catalog = lsdb.from_dataframe(objects_df, catalog_name="my_object_catalog", catalog_type="object")
my_object_catalog
[3]:
lsdb Catalog my_object_catalog:
ra dec Norder Dir Npix
npartitions=1
6917529027641081856 double[pyarrow] double[pyarrow] uint8[pyarrow] uint64[pyarrow] uint64[pyarrow]
18446744073709551615 ... ... ... ... ...

Next, read the catalog we want to crossmatch with. Because we are downloading it from a web source, we need to install an additional package (aiohttp). If your catalog is already in local storage, you can call read_hipscat directly.

[4]:
!pip install aiohttp --quiet

In this tutorial we will read a small cone region of Gaia DR3, 1 degree across (0.5 degree radius) and centered at ra=180, dec=10, which should contain our objects. LSDB typically reads into memory only the minimal amount of data our workflow needs, and manually providing it with spatial information helps it identify which files to read.

[5]:
from lsdb.core.search import ConeSearch

gaia_path = "https://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/gaia_dr3/gaia"

gaia = lsdb.read_hipscat(gaia_path, search_filter=ConeSearch(ra=180, dec=10, radius_arcsec=0.5 * 3600))
gaia
[5]:
lsdb Catalog gaia:
solution_id designation source_id random_index ref_epoch ra ra_error dec dec_error parallax parallax_error parallax_over_error pm pmra pmra_error pmdec pmdec_error ra_dec_corr ra_parallax_corr ra_pmra_corr ra_pmdec_corr dec_parallax_corr dec_pmra_corr dec_pmdec_corr parallax_pmra_corr parallax_pmdec_corr pmra_pmdec_corr astrometric_n_obs_al astrometric_n_obs_ac astrometric_n_good_obs_al astrometric_n_bad_obs_al astrometric_gof_al astrometric_chi2_al astrometric_excess_noise astrometric_excess_noise_sig astrometric_params_solved astrometric_primary_flag nu_eff_used_in_astrometry pseudocolour pseudocolour_error ra_pseudocolour_corr dec_pseudocolour_corr parallax_pseudocolour_corr pmra_pseudocolour_corr pmdec_pseudocolour_corr astrometric_matched_transits visibility_periods_used astrometric_sigma5d_max matched_transits new_matched_transits matched_transits_removed ipd_gof_harmonic_amplitude ipd_gof_harmonic_phase ipd_frac_multi_peak ipd_frac_odd_win ruwe scan_direction_strength_k1 scan_direction_strength_k2 scan_direction_strength_k3 scan_direction_strength_k4 scan_direction_mean_k1 scan_direction_mean_k2 scan_direction_mean_k3 scan_direction_mean_k4 duplicated_source phot_g_n_obs phot_g_mean_flux phot_g_mean_flux_error phot_g_mean_flux_over_error phot_g_mean_mag phot_bp_n_obs phot_bp_mean_flux phot_bp_mean_flux_error phot_bp_mean_flux_over_error phot_bp_mean_mag phot_rp_n_obs phot_rp_mean_flux phot_rp_mean_flux_error phot_rp_mean_flux_over_error phot_rp_mean_mag phot_bp_rp_excess_factor phot_bp_n_contaminated_transits phot_bp_n_blended_transits phot_rp_n_contaminated_transits phot_rp_n_blended_transits phot_proc_mode bp_rp bp_g g_rp radial_velocity radial_velocity_error rv_method_used rv_nb_transits rv_nb_deblended_transits rv_visibility_periods_used rv_expected_sig_to_noise rv_renormalised_gof rv_chisq_pvalue rv_time_duration rv_amplitude_robust rv_template_teff rv_template_logg rv_template_fe_h rv_atm_param_origin vbroad vbroad_error vbroad_nb_transits grvs_mag 
grvs_mag_error grvs_mag_nb_transits rvs_spec_sig_to_noise phot_variable_flag l b ecl_lon ecl_lat in_qso_candidates in_galaxy_candidates non_single_star has_xp_continuous has_xp_sampled has_rvs has_epoch_photometry has_epoch_rv has_mcmc_gspphot has_mcmc_msc in_andromeda_survey classprob_dsc_combmod_quasar classprob_dsc_combmod_galaxy classprob_dsc_combmod_star teff_gspphot teff_gspphot_lower teff_gspphot_upper logg_gspphot logg_gspphot_lower logg_gspphot_upper mh_gspphot mh_gspphot_lower mh_gspphot_upper distance_gspphot distance_gspphot_lower distance_gspphot_upper azero_gspphot azero_gspphot_lower azero_gspphot_upper ag_gspphot ag_gspphot_lower ag_gspphot_upper ebpminrp_gspphot ebpminrp_gspphot_lower ebpminrp_gspphot_upper libname_gspphot Norder Dir Npix
npartitions=1
7782220156096217088 int64[pyarrow] string[pyarrow] int64[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] int64[pyarrow] int64[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] bool[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] int64[pyarrow] double[pyarrow] int64[pyarrow] int64[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] bool[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] string[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] bool[pyarrow] bool[pyarrow] int64[pyarrow] bool[pyarrow] bool[pyarrow] bool[pyarrow] bool[pyarrow] bool[pyarrow] bool[pyarrow] bool[pyarrow] 
bool[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] string[pyarrow] uint8[pyarrow] uint64[pyarrow] uint64[pyarrow]
18446744073709551615 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
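To check that the cone really contains the objects of interest, one can compute each object's angular separation from the cone center and compare it with the radius. The sketch below uses the haversine formula in plain Python. The function name `angular_separation_deg` is an illustrative assumption, and this is not how LSDB performs the cone search internally.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    hav = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(hav)))


# Cone parameters from the tutorial: radius_arcsec = 0.5 * 3600
cone_ra, cone_dec = 180.0, 10.0
radius_deg = (0.5 * 3600) / 3600  # convert arcseconds to degrees

objects = [(180.080612, 9.507394), (179.884664, 10.479632), (179.790319, 9.551745)]

separations = [angular_separation_deg(cone_ra, cone_dec, ra, dec) for ra, dec in objects]
inside = [sep <= radius_deg for sep in separations]
print(inside)
```

All three objects fall inside the cone, each a little under 0.5 degrees from its center.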

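Let’s crossmatch them! As a result we will have a catalog with the objects from Gaia that match our initial objects of interest, according to a specified maximum distance. We will use the default K nearest neighbors algorithm with k=1 and a maximum separation distance of 1 degree (3600 arcseconds). The Gaia catalog we read has no margin cache, so we pass require_right_margin=False, and LSDB warns that the results may be incomplete.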

[6]:
result = my_object_catalog.crossmatch(gaia, n_neighbors=1, radius_arcsec=1 * 3600, require_right_margin=False)
/home/docs/checkouts/readthedocs.org/user_builds/lsdb/envs/latest/lib/python3.10/site-packages/lsdb/dask/crossmatch_catalog_data.py:117: RuntimeWarning: Right catalog does not have a margin cache. Results may be incomplete and/or inaccurate.
  warnings.warn(
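Conceptually, a crossmatch with k=1 pairs each object on the left with its single closest object on the right, and keeps the pair only if the distance is within the radius. The brute-force sketch below illustrates that idea in plain Python. The function names and coordinates are made up for illustration, and LSDB's own implementation works partition by partition and is far more efficient than this.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    hav = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(hav))) * 3600


def nearest_neighbor_crossmatch(left, right, radius_arcsec):
    """Return (left_index, right_index, distance_arcsec) for each left point
    whose closest right point lies within radius_arcsec."""
    matches = []
    for i, (ra_l, dec_l) in enumerate(left):
        best = None  # (right_index, distance) of the closest point so far
        for j, (ra_r, dec_r) in enumerate(right):
            dist = angular_separation_arcsec(ra_l, dec_l, ra_r, dec_r)
            if best is None or dist < best[1]:
                best = (j, dist)
        if best is not None and best[1] <= radius_arcsec:
            matches.append((i, best[0], best[1]))
    return matches


# Made-up coordinates purely for illustration (not real catalog entries)
left = [(10.0, 0.0), (20.0, 0.0)]
right = [(10.0, 0.001), (10.0, 0.5), (25.0, 0.0)]

matches = nearest_neighbor_crossmatch(left, right, radius_arcsec=5.0)
print(matches)
```

Here the first left point is matched to a neighbor about 3.6 arcseconds away, while the second has no neighbor within 5 arcseconds and is dropped from the result.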

Now let’s say we wish to apply a cut to our data to get all the objects with a mean magnitude in the G band greater than 18 and a total number of along-scan (AL) observations greater than 200. We can build a query expression and filter the catalog. Because of the crossmatch, the names of the Gaia columns in the query need to include the name of the catalog as a suffix, _gaia.

[7]:
result = result.query("phot_g_mean_mag_gaia > 18 and astrometric_n_obs_al_gaia > 200")
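The query string follows the same expression syntax as pandas `DataFrame.query`, so it can be tried out on a small DataFrame first. The sketch below applies the same expression to a table with made-up values. Only the column names, which follow the suffix convention described above, come from the tutorial.

```python
import pandas as pd

# Made-up values for illustration only; the column names follow the
# <column>_<catalog name> suffix convention produced by the crossmatch.
crossmatched = pd.DataFrame(
    {
        "ra_my_object_catalog": [180.0, 179.9, 179.8],
        "phot_g_mean_mag_gaia": [17.2, 18.6, 19.1],
        "astrometric_n_obs_al_gaia": [350, 150, 420],
    }
)

# Keep rows that satisfy both conditions
selected = crossmatched.query("phot_g_mean_mag_gaia > 18 and astrometric_n_obs_al_gaia > 200")
print(selected)
```

Only the last row satisfies both conditions. Keep in mind that larger magnitudes correspond to fainter sources, so a cut of "greater than 18" selects the fainter objects.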

Our workflow takes advantage of Dask’s “lazy” evaluation. This means that so far we have only been defining a set of tasks, which will be executed at our command. When that happens, data will be read into memory and operations will be distributed among the available workers for parallel computation. To trigger this we will call compute on the catalog that resulted from the crossmatch. The final result will be presented in a pandas DataFrame.

[8]:
result.compute()
[8]:
ra_my_object_catalog dec_my_object_catalog Norder_my_object_catalog Dir_my_object_catalog Npix_my_object_catalog solution_id_gaia designation_gaia source_id_gaia random_index_gaia ref_epoch_gaia ... ag_gspphot_lower_gaia ag_gspphot_upper_gaia ebpminrp_gspphot_gaia ebpminrp_gspphot_lower_gaia ebpminrp_gspphot_upper_gaia libname_gspphot_gaia Norder_gaia Dir_gaia Npix_gaia _dist_arcsec
_hipscat_index
7800225727302860800 180.080612 9.507394 0 0 6 1636148068921376768 Gaia DR3 3900112810537137536 3900112810537137536 272226905 2016.0 ... 0.0022 0.0161 0.0039 0.0012 0.0088 MARCS 2 0 108 34.558381

1 rows × 161 columns
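The idea behind lazy evaluation can be illustrated with a toy class that records operations and runs them only when `compute` is called. This is a deliberately simplified sketch in plain Python. The `LazyPipeline` class is hypothetical, it is not Dask's implementation, and it does no parallel scheduling.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated collection (not Dask's implementation)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that loads the data
        self._steps = tuple(steps)   # functions queued up to run later

    def map(self, func):
        # Record the step and return a new pipeline; nothing is executed yet.
        return LazyPipeline(self._source, self._steps + (func,))

    def compute(self):
        # Only now is the data loaded and every queued step applied in order.
        data = self._source()
        for step in self._steps:
            data = step(data)
        return data


calls = []  # records when the data is actually loaded


def load():
    calls.append("load")
    return [17.2, 18.6, 19.1]  # made-up magnitudes


pipeline = LazyPipeline(load).map(lambda mags: [m for m in mags if m > 18])
executed_before_compute = list(calls)  # still empty: defining steps ran nothing

faint = pipeline.compute()  # loading and filtering happen here
print(executed_before_compute, faint)
```

Defining the pipeline costs almost nothing. The data is loaded and filtered only when `compute` is called, which is why the earlier cells in this tutorial return quickly.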

You will want to save your resulting catalog to disk, especially if it is too large to fit in memory or if you will need to use it later on. The catalog exposes the to_hipscat API for that; you just need to provide it with a path for the target base directory and a catalog name.

[9]:
result.to_hipscat("lsdb_catalogs", catalog_name="my_object_catalog_x_gaia")