{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Transmission FTIR Spectra\n",
    "\n",
    "- This Jupyter notebook provides an example workflow for processing transmission FTIR spectra through PyIRoGlass. \n",
    "\n",
    "- The Jupyter notebook and data can be accessed here: https://github.com/SarahShi/PyIRoGlass/blob/main/docs/examples/transmission_ftir/. \n",
    "\n",
    "- You need to have the PyIRoGlass PyPi package on your machine once. If you have not done this, please uncomment (remove the #) symbol and run the cell below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!pip install PyIRoGlass"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load Python Packages and Data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Python Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Import packages\n",
    "\n",
    "import PyIRoGlass as pig\n",
    "\n",
    "from IPython.display import Image\n",
    "\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "\n",
    "pig.__version__"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set paths to data\n",
    "\n",
    "Update the path to the directory containing the spectra, as well as the paths to the chemistry and thickness data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "spectrum_path = 'SPECTRA/'\n",
    "print(spectrum_path)\n",
    "\n",
    "chemistry_thickness_path = 'ChemThick.csv'\n",
    "print(chemistry_thickness_path)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set desired output file directory name\n",
    "\n",
    "Update the `export_path` to the desired output location."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "export_path = 'RESULTS'\n",
    "print(export_path)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load transmission FTIR spectra along with chemistry thickness data\n",
    "\n",
    "We will use the class `pig.SampleDataLoader` to load all FTIR spectra and chemistry thickness data. The class takes the arguments: \n",
    "\n",
    "- `spectrum_path`: String or list path to the directory with spectral data\n",
    "- `chemistry_thickness_path`: String path to CSV file with glass chemistry and thickness data\n",
    "\n",
    "and contains the methods: \n",
    "\n",
    "- `load_spectrum_directory`: Loads spectral data\n",
    "- `load_chemistry_thickness`: Loads chemistry thickness data\n",
    "- `load_all_data`: Loads spectral and chemistry thickness data\n",
    "\n",
    "Here, we use `load_all_data`. This returns the outputs of: \n",
    "\n",
    "- `dfs_dict`: Dictionary where the keys are file identifiers and values are DataFrames with spectral data\n",
    "- `chemistry`: DataFrame of chemical data\n",
    "- `thickness`: DataFrame of thickness data\n",
    "\n",
    "The file names from the spectra (what comes before the .CSV) are important when we load in melt compositions and thicknesses. Unique identifiers identify the same samples. Make sure that this `ChemThick.CSV` file has the same sample names as the loaded spectra."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "loader = pig.SampleDataLoader(spectrum_path=spectrum_path, chemistry_thickness_path=chemistry_thickness_path)\n",
    "dfs_dict, chemistry, thickness = loader.load_all_data()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at what `dfs_dict`, a dictionary of transmission FTIR spectra, looks like. Samples are identified by their file names (keys) and the wavenumber and absorbance data are stored as dataframes for each spectrum (values)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "dfs_dict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display `chemistry`, the DataFrame of glass compositions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "chemistry"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display `thickness`, the DataFrame of wafer thicknesses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "thickness"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See that the sample names of the spectra in `dfs_dict`, `chemistry`, and `thickness` all align. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# We're ready to roll -- MCMC, here we come! \n",
    "\n",
    "We use the function `pig.calculate_baselines`, which takes in two arguments: \n",
    "\n",
    "- `dfs_dict`: Dictionary where the keys are file identifiers and values are DataFrames with spectral data\n",
    "- `export_path`: Desired output directory name, or `None` to prevent figure generation\n",
    "\n",
    "and returns: \n",
    "\n",
    "- `Volatile_PH`: DataFrame with peak heights and their associated uncertainties\n",
    "- `failures`: List of file identifiers for which analysis failed\n",
    "\n",
    "Running this code will take a few seconds to minutes per spectra, as it is fitting $\\mathrm{10^6}$ baselines and peaks to your spectrum to sample uncertainty. If any samples fail, they will be returned in the list `failures`. It took 20 seconds to process 1 spectrum on my M2 Macbook Pro with 12 CPU cores. The same task takes about 2 minutes on Google Colab. \n",
    "\n",
    "The function automatically saves this file as a CSV, so you have this information. We will also use this DataFrame to calculate concentrations. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "Volatile_PH, failures = pig.calculate_baselines(dfs_dict, export_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pig.calculate_baselines` returns `Volatile_PH`, a DataFrame of the output peak heights and associated uncertainties. Let's see what is included. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "Volatile_PH"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can look at all the columns in this DataFrame, given the size. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "Volatile_PH.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All columns with the prefix of `PH` represent a peak height. All columns with the suffix of `_M` represent the mean value, and the suffix of `_STD` represents 1 $\\sigma$. \n",
    "\n",
    "The column `H2Ot_3550_SAT` returns a `-` if the sample is not saturated, and a `*` if the sample is saturated. This is based on the maximum absorbance of the peak, and the warning of `*` indicates that we must consider the concentrations more. The following functions calculating concentration handle this and will suggest best values to use. \n",
    "\n",
    "The columns `STN_P5200` and `STN_P4500` represent the signal to noise ratios for the $\\mathrm{H_2O_{m,5200}}$ and $\\mathrm{OH^-_{4500}}$ peaks. If the values are greater than 4, indicating that the signal is meaningful, the ERR_5200 and ERR_4500 peaks return a `-` value. If signal-to-noise is too low, the warning of `*` is returned. \n",
    "\n",
    "The columns after describe the fitting parameters for generating the baseline and the $\\mathrm{H_2O_{m,1635}}$ peak, so you can generate the baseline yourself. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Outputs\n",
    "\n",
    "Quite few figures, log files, and npz files are generated by `pig.calculate_baselines`, assuming you provide an export path. Let's look at a few of them together. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PyIRoGlass creates this figure for visualizing how each peak within the 1000-5500 cm $\\mathrm{^{-1}}$ is fit, with their peak heights shown. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "Image(\"https://github.com/sarahshi/PyIRoGlass/raw/main/docs/_static/AC4_OL49_021920_30x30_H2O_a.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can visualize how well PyIRoGlass does in fitting this transmission FTIR spectrum, with the modelfit figure. This plots the fit from `MC3` against the transmission FTIR spectrum, with the residual in fit. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "Image(\"https://github.com/sarahshi/PyIRoGlass/raw/main/docs/_static/AC4_OL49_021920_30x30_H2O_a_modelfit.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The histogram figure shows the distribution of posterior probability densities, with the mean value displayed in the navy dashed line. The shaded region represents the 68% confidence interval around the value. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "Image(\"https://github.com/sarahshi/PyIRoGlass/raw/main/docs/_static/AC4_OL49_021920_30x30_H2O_a_histogram.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pairwise figure plots the posterior probability density distribution for the 16 fitting parameters of Equation 10, allowing for the visualization of covariance within the parameters. Accounting for covariance allows us to properly account for uncertainty. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "Image(\"https://github.com/sarahshi/PyIRoGlass/raw/main/docs/_static/AC4_OL49_021920_30x30_H2O_a_pairwise.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The trace figure shows how the parameters evolve through MCMC sampling. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "Image(\"https://github.com/sarahshi/PyIRoGlass/raw/main/docs/_static/AC4_OL49_021920_30x30_H2O_a_trace.png\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LOG and NPZ\n",
    "\n",
    ".log files record the performance of the MCMC algorithm through the samples, and the best parameters at each 10% increment. These are shown above. \n",
    "\n",
    ".npz files store all the best-parameters, sampled parameters, etc. in a ready-to-use `NumPy` format. \n",
    "\n",
    "We won't open these here, but these are quite useful to review! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Concentrations\n",
    "\n",
    "We now want to convert all those peak heights (with uncertainties) to concentrations (with uncertainties), by applying the Beer-Lambert Law. We do so by using the `pig.calculate_concentrations` function, which takes in these parameters and samples over `N` samples for a secondary MCMC: \n",
    "\n",
    "- `Volatile_PH`: DataFrame with peak heights and their associated uncertainties; the output from `pig.calculate_baselines`\n",
    "- `chemistry`: DataFrame of chemical data\n",
    "- `thickness`: DataFrame of thickness data\n",
    "- `export_path`: Output directory name, or `None` to prevent figure generation\n",
    "- `N`: Number of Monte Carlo simulations to perform for uncertainty estimation, with default of 500,000\n",
    "- `T`: Temperature in Celsius at which the density is calculated, with default of 25°C\n",
    "- `P`: Pressure in bars at which the density is calculated, with default of 1 bar\n",
    "- `model`: Choice of density model, with options of the default \"LS\" for Lesher and Spera (2015) and \"IT\" for Iacovino and Till (2019)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "concentrations_df = pig.calculate_concentrations(Volatile_PH, chemistry, thickness, export_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're all done now! Let's display the `concentrations_df` DataFrame, which contains all results. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "concentrations_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a few things to note. Each column with the suffix `_MEAN` represents the mean value, `_BP` represents the best-parameter from MCMC, and `_STD` represents the standard deviation. We recommend the use of the `H2Ot_MEAN`, `H2Ot_STD`, `CO2_MEAN`, and `CO2_STD` columns. The columns with the suffix `_STN` show the signal-to-noise ratio of the NIR peaks, and the columns with the prefix `ERR_` just process this information, returning a `-` if the peaks are meaningful and a `*` if the signal is too low. \n",
    "\n",
    "Concentrations of $\\mathrm{H_2O}$ depend on whether your sample is saturated or not. If your sample is unsaturated (marked by `H2Ot_3550_SAT=='-'`), the column `H2Ot_MEAN==H2Ot_3550_M`. If your sample is saturated (marked by `H2Ot_3550_SAT=='*'`), the column of `H2Ot_MEAN==(H2Om_1635_BP+OH_4500_M)`. The $\\mathrm{H_2O_{t, 3550}}$ peak cannot be used, given potential nonlinearity in the Beer-Lambert Law. See the discussion of this handling of speciation in the paper. \n",
    "\n",
    "The column `Density` contains the densities used for the final concentration. The values between `Density` and `Density_Sat` will be different if the sample is saturated, showing the difference in densities when using variable concentrations of $\\mathrm{H_2O_m}$. \n",
    "\n",
    "`Tau` and `Eta` calculate the compositional parameters required for determining molar absorptivity. All calculated molar absorptivities and their uncertainties (`Sigma_` prefix) from the inversion are provided in the DataFrame. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}