Importing catalogs to HiPSCat format

This notebook presents two ways of importing catalogs to HiPSCat format. The first uses the lsdb.from_dataframe() method, which is convenient for loading smaller catalogs from a single dataframe; the second uses the hipscat-import pipeline, which is designed to scale to larger catalogs.

[1]:
import lsdb
import os
import pandas as pd
import tempfile

We will be importing small_sky_order1 from a single CSV file:

[2]:
catalog_name = "small_sky_order1"
test_data_dir = os.path.join("../../tests", "data")

Let’s define the input and output paths:

[3]:
# Input paths
catalog_dir = os.path.join(test_data_dir, catalog_name)
catalog_csv_path = os.path.join(catalog_dir, f"{catalog_name}.csv")

# Temporary directory for the intermediate/output files
tmp_dir = tempfile.TemporaryDirectory()

lsdb.from_dataframe

[4]:
%%time

# Read simple catalog from its CSV file
catalog = lsdb.from_dataframe(
    pd.read_csv(catalog_csv_path),
    catalog_name="from_dataframe",
    catalog_type="object",
    highest_order=5,
    threshold=100,
)

# Save it to disk in HiPSCat format
catalog.to_hipscat(f"{tmp_dir.name}/from_dataframe")
CPU times: user 747 ms, sys: 15.9 ms, total: 763 ms
Wall time: 759 ms
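The highest_order and threshold parameters control how the catalog is partitioned: HEALPix order k divides the sky into 12 × 4^k pixels, and a pixel holding more rows than the threshold is subdivided until the highest order is reached. A quick sketch of the pixel-count arithmetic (pure Python, independent of lsdb):

```python
def n_healpix_pixels(order: int) -> int:
    """Number of HEALPix pixels at a given order: the base tessellation
    has 12 pixels, and each order increment splits every pixel into 4."""
    return 12 * 4**order


def pixel_area_deg2(order: int) -> float:
    """Approximate area of one pixel in square degrees
    (the whole sky covers ~41253 deg^2)."""
    return 41253.0 / n_healpix_pixels(order)


print(n_healpix_pixels(5))  # 12288 pixels at order 5, the cap used above
print(pixel_area_deg2(5))
```

With highest_order=5 and threshold=100, no resulting partition should hold more than 100 rows unless it already sits at order 5 and cannot be split further.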

HiPSCat import pipeline

Let’s install the latest development version of hipscat-import, directly from its main branch:

[5]:
!pip install git+https://github.com/astronomy-commons/hipscat-import.git@main --quiet
[6]:
from dask.distributed import Client
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.pipeline import pipeline_with_client
[7]:
%%time

args = ImportArguments(
    ra_column="ra",
    dec_column="dec",
    highest_healpix_order=5,
    pixel_threshold=100,
    file_reader="csv",
    input_file_list=[catalog_csv_path],
    output_artifact_name="from_import_pipeline",
    output_path=tmp_dir.name,
    resume=False,
)

with Client(n_workers=1) as client:
    pipeline_with_client(args, client)
CPU times: user 1.01 s, sys: 57.1 ms, total: 1.07 s
Wall time: 8.17 s
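Conceptually, both importers perform the same adaptive partitioning: any pixel whose row count exceeds the threshold is split into its four children at the next order, until the count fits or the highest order is reached. A toy recursive sketch of that idea (illustrative only; the real pipeline splits rows by their sky positions, not by dividing counts):

```python
def partition(pixel_counts, threshold, order=0, highest_order=5):
    """Toy partitioner: `pixel_counts` maps a pixel id at `order` to its
    row count. Counts above `threshold` are split across the pixel's four
    children at the next order (here split near-evenly for illustration)."""
    result = {}
    for pix, count in pixel_counts.items():
        if count <= threshold or order == highest_order:
            result[(order, pix)] = count
        else:
            # Child pixel ids of pixel `pix` are 4*pix .. 4*pix + 3.
            children = {
                4 * pix + i: count // 4 + (1 if i < count % 4 else 0)
                for i in range(4)
            }
            result.update(partition(children, threshold, order + 1, highest_order))
    return result


# 131 rows in a single order-0 pixel with threshold 100 split exactly once:
print(partition({44: 131}, threshold=100))
```

Every resulting partition is at or below the threshold, and the total row count is preserved.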

Let’s read both catalogs from disk and check that the two methods produced the same output:

[8]:
from_dataframe_catalog = lsdb.read_hipscat(f"{tmp_dir.name}/from_dataframe")
from_dataframe_catalog
[8]:
lsdb Catalog from_dataframe:
id ra dec ra_error dec_error Norder Dir Npix
npartitions=4
12682136550675316736 int64[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] int64[pyarrow] uint8[pyarrow] uint64[pyarrow] uint64[pyarrow]
12970366926827028480 ... ... ... ... ... ... ... ...
13258597302978740224 ... ... ... ... ... ... ... ...
13546827679130451968 ... ... ... ... ... ... ... ...
18446744073709551615 ... ... ... ... ... ... ... ...
[9]:
from_import_pipeline_catalog = lsdb.read_hipscat(f"{tmp_dir.name}/from_import_pipeline")
from_import_pipeline_catalog
[9]:
lsdb Catalog from_import_pipeline:
id ra dec ra_error dec_error Norder Dir Npix
npartitions=4
12682136550675316736 int64[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] int64[pyarrow] uint8[pyarrow] uint64[pyarrow] uint64[pyarrow]
12970366926827028480 ... ... ... ... ... ... ... ...
13258597302978740224 ... ... ... ... ... ... ... ...
13546827679130451968 ... ... ... ... ... ... ... ...
18446744073709551615 ... ... ... ... ... ... ... ...
[10]:
# Verify that the catalogs contain the same HEALPix pixels
assert from_dataframe_catalog.get_healpix_pixels() == from_import_pipeline_catalog.get_healpix_pixels()

# Verify that the resulting dataframes contain the same data
sorted_from_dataframe = from_dataframe_catalog.compute().sort_index()
sorted_from_import_pipeline = from_import_pipeline_catalog.compute().sort_index()
pd.testing.assert_frame_equal(sorted_from_dataframe, sorted_from_import_pipeline)
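The sort-then-compare pattern used above works for any pair of dataframes that may differ only in row order; a small self-contained pandas example (toy data, not the catalogs from this notebook):

```python
import pandas as pd

# Two dataframes with identical contents but rows in a different order.
a = pd.DataFrame({"ra": [10.0, 20.0], "dec": [-60.0, -50.0]}, index=[2, 1])
b = pd.DataFrame({"ra": [20.0, 10.0], "dec": [-50.0, -60.0]}, index=[1, 2])

# Align on the index before comparing; raises AssertionError on mismatch.
pd.testing.assert_frame_equal(a.sort_index(), b.sort_index())
```

Without the sort_index() calls, assert_frame_equal would fail on the index order alone.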

Finally, tear down the directory used for the intermediate/output files:

[11]:
tmp_dir.cleanup()