Importing catalogs to HiPSCat format#
This notebook presents two ways of importing catalogs to HiPSCat format. The first uses the lsdb.from_dataframe()
method, which is convenient for loading smaller catalogs from a single DataFrame, while the second uses the hipscat-import pipeline, which is designed for larger datasets.
[1]:
import lsdb
import os
import pandas as pd
import tempfile
We will be importing small_sky_order1 from a single CSV file:
[2]:
catalog_name = "small_sky_order1"
test_data_dir = os.path.join("../../tests", "data")
Let’s define the input and output paths:
[3]:
# Input paths
catalog_dir = os.path.join(test_data_dir, catalog_name)
catalog_csv_path = os.path.join(catalog_dir, f"{catalog_name}.csv")
# Temporary directory for the intermediate/output files
tmp_dir = tempfile.TemporaryDirectory()
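As an optional sanity check, we can preview the first few rows of the CSV before importing it:
# Optional: peek at the raw catalog data before importing
pd.read_csv(catalog_csv_path).head()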
lsdb.from_dataframe#
[4]:
%%time
# Read simple catalog from its CSV file
catalog = lsdb.from_dataframe(
    pd.read_csv(catalog_csv_path),
    catalog_name="from_dataframe",
    catalog_type="object",
    highest_order=5,
    threshold=100,
)
# Save it to disk in HiPSCat format
catalog.to_hipscat(f"{tmp_dir.name}/from_dataframe")
CPU times: user 747 ms, sys: 15.9 ms, total: 763 ms
Wall time: 759 ms
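The highest_order and threshold arguments control the partitioning: rows are grouped into HEALPix pixels, and pixels holding more rows than the threshold are split into finer pixels, up to the maximum order. As an optional check, we can list the pixels that were produced:
# Optional: inspect the partitioning produced by from_dataframe
catalog.get_healpix_pixels()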
HiPSCat import pipeline#
Let’s install the latest development version of hipscat-import from its main branch:
[5]:
!pip install git+https://github.com/astronomy-commons/hipscat-import.git@main --quiet
[6]:
from dask.distributed import Client
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.pipeline import pipeline_with_client
[7]:
%%time
args = ImportArguments(
    ra_column="ra",
    dec_column="dec",
    highest_healpix_order=5,
    pixel_threshold=100,
    file_reader="csv",
    input_file_list=[catalog_csv_path],
    output_artifact_name="from_import_pipeline",
    output_path=tmp_dir.name,
    resume=False,
)
with Client(n_workers=1) as client:
    pipeline_with_client(args, client)
CPU times: user 1.01 s, sys: 57.1 ms, total: 1.07 s
Wall time: 8.17 s
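The pipeline writes the new catalog in HiPSCat format under the chosen output path. As an optional check, we can list what was written to disk (the exact layout may vary between hipscat-import versions):
# Optional: list the files written by the import pipeline
sorted(os.listdir(os.path.join(tmp_dir.name, "from_import_pipeline")))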
Let’s read both catalogs from disk and check that the two methods produced the same output:
[8]:
from_dataframe_catalog = lsdb.read_hipscat(f"{tmp_dir.name}/from_dataframe")
from_dataframe_catalog
[8]:
lsdb Catalog from_dataframe:
|                      | id             | ra              | dec             | ra_error       | dec_error      | Norder         | Dir             | Npix            |
|----------------------|----------------|-----------------|-----------------|----------------|----------------|----------------|-----------------|-----------------|
| npartitions=4        |                |                 |                 |                |                |                |                 |                 |
| 12682136550675316736 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | int64[pyarrow] | int64[pyarrow] | uint8[pyarrow] | uint64[pyarrow] | uint64[pyarrow] |
| 12970366926827028480 | ...            | ...             | ...             | ...            | ...            | ...            | ...             | ...             |
| 13258597302978740224 | ...            | ...             | ...             | ...            | ...            | ...            | ...             | ...             |
| 13546827679130451968 | ...            | ...             | ...             | ...            | ...            | ...            | ...             | ...             |
| 18446744073709551615 | ...            | ...             | ...             | ...            | ...            | ...            | ...             | ...             |
[9]:
from_import_pipeline_catalog = lsdb.read_hipscat(f"{tmp_dir.name}/from_import_pipeline")
from_import_pipeline_catalog
[9]:
lsdb Catalog from_import_pipeline:
|                      | id             | ra              | dec             | ra_error       | dec_error      | Norder         | Dir             | Npix            |
|----------------------|----------------|-----------------|-----------------|----------------|----------------|----------------|-----------------|-----------------|
| npartitions=4        |                |                 |                 |                |                |                |                 |                 |
| 12682136550675316736 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | int64[pyarrow] | int64[pyarrow] | uint8[pyarrow] | uint64[pyarrow] | uint64[pyarrow] |
| 12970366926827028480 | ...            | ...             | ...             | ...            | ...            | ...            | ...             | ...             |
| 13258597302978740224 | ...            | ...             | ...             | ...            | ...            | ...            | ...             | ...             |
| 13546827679130451968 | ...            | ...             | ...             | ...            | ...            | ...            | ...             | ...             |
| 18446744073709551615 | ...            | ...             | ...             | ...            | ...            | ...            | ...             | ...             |
[10]:
# Verify that the pixels they contain are similar
assert from_dataframe_catalog.get_healpix_pixels() == from_import_pipeline_catalog.get_healpix_pixels()
# Verify that resulting dataframes contain the same data
sorted_from_dataframe = from_dataframe_catalog.compute().sort_index()
sorted_from_import_pipeline = from_import_pipeline_catalog.compute().sort_index()
pd.testing.assert_frame_equal(sorted_from_dataframe, sorted_from_import_pipeline)
Finally, tear down the directory used for the intermediate / output files:
[11]:
tmp_dir.cleanup()