Merging Excel Data

Data Integration

This notebook walks through the process of integrating data from four different data sources (Excel files) into a master file. For efficiency, the data manipulation and cleanup are done in Python. After the processing is done, a master Excel file is exported.


Raw data files and table definitions can be found at https://www.kaggle.com/anikannal/solar-power-generation-data


** NOTE: The purpose of this post is to show a possible workflow for data cleanup and merging. Although the code is annotated to explain the general logic of each subsection, it is not intended to be a step-by-step tutorial.

Problem Identification

  • The customer would like to analyze power generation data from two solar power plants and their respective weather sensor logs.
  • The information resides in four separate Excel workbooks (two power generation files and two weather sensor files).
  • Customer requests:
    — Consolidate the data into one workbook
    — Manipulate the data to improve readability
      — Change codes to easy-to-read labels
      — Create a code dictionary for future reference
    — Conduct basic data cleanup

Plant 1 Power Generation Data – Sample
(Plant 2 file has the same structure and data formats)

plant1.JPG

Plant 2 Weather Sensor Data
(Plant 1 file has the same structure and data formats)

wx2.JPG

WORKFLOW:

1) Load Dependencies
import pandas as pd
import numpy as np
import datetime as dt
2) Import source Excel files
def import_xlx(file):
    '''Import an Excel file into a dataframe.'''
    path = 'D:/Springboard/Projects/SolarPower/data/raw/'
    # import dates as strings and clean them up in Python
    df = pd.read_excel(path + file, dtype={'DATE_TIME': str})
    return df
p1 = import_xlx('Plant_1_Generation_Data.xlsx')
p2 = import_xlx('Plant_2_Generation_Data.xlsx')
w1 = import_xlx('Plant_1_Weather_Sensor_Data.xlsx')
w2 = import_xlx('Plant_2_Weather_Sensor_Data.xlsx')
3) Analyze structure and content of powerplant files
p1.head(3)
    DATE_TIME         PLANT_ID  SOURCE_KEY       DC_POWER  AC_POWER  DAILY_YIELD  TOTAL_YIELD
0   15-05-2020 00:00  4135001   1BY6WEcLGh8j5v7  0.0       0.0       0.0          6259559.0
1   15-05-2020 00:00  4135001   1IF53ai7Xc0U56Y  0.0       0.0       0.0          6183645.0
2   15-05-2020 00:00  4135001   3PZuoBAID5Wc2HD  0.0       0.0       0.0          6987759.0

Check to see if there are missing values anywhere in the dataset.

p1.isnull().sum()

DATE_TIME      0
PLANT_ID       0
SOURCE_KEY     0
DC_POWER       0
AC_POWER       0
DAILY_YIELD    0
TOTAL_YIELD    0
dtype: int64
#check for data type consistency
p1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68778 entries, 0 to 68777
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE_TIME    68778 non-null  object 
 1   PLANT_ID     68778 non-null  int64  
 2   SOURCE_KEY   68778 non-null  object 
 3   DC_POWER     68778 non-null  float64
 4   AC_POWER     68778 non-null  float64
 5   DAILY_YIELD  68778 non-null  float64
 6   TOTAL_YIELD  68778 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.7+ MB
4) Perform data cleanup

Plant ID and Source Key appear to be categorical values. Let’s see how many possible values each holds.

print(p1.PLANT_ID.unique())
[4135001]

There is only one ID in this column. For easier identification, let’s change the ID to ‘1’.

p1['PLANT_ID'] = '1'
coder = {}
coder[4135001] = 'PLANT ID: 1'

Check data in ‘SOURCE_KEY’ column.

sources = p1.SOURCE_KEY.unique()
print(sources)
print()
print('There are '+ str(len(p1.SOURCE_KEY.unique())) + ' distinct values.')
['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
 'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
 'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
 'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
 'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
 'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']

There are 22 distinct values.

There are 22 distinct values in the source column, and they are hard to read. We will map them to the format ‘P1_SRC_X’, where X is the source number for Plant 1.

We will store the key:value combination for later reference.

#create dictionary to hold source key mapping
p1_source = {}
idx = 1
for source in sources:
    p1_source[source] = 'P1_SRC_' + str(idx)
    idx += 1

#replace old codes with new values
p1['SOURCE_KEY'] = p1['SOURCE_KEY'].replace(p1_source)

The DATE_TIME values are being treated as strings. Perform data cleanup to standardize the format and convert the column to a DATETIME type.

Separate DATE and TIME into their own columns for easier manipulation; after clean-up, combine them back into a DATE_TIME column. ** This is needed because some date entries in the file have inconsistent formats. Removing the time component allows the parsing logic to correctly identify and format the dates.

p1['DATE'] = p1['DATE_TIME'].str.slice(0,10)
p1['TIME'] = p1['DATE_TIME'].str.slice(11,16)
p1.head(3)
    DATE_TIME         PLANT_ID  SOURCE_KEY  DC_POWER  AC_POWER  DAILY_YIELD  TOTAL_YIELD  DATE        TIME
0   15-05-2020 00:00  1         P1_SRC_1    0.0       0.0       0.0          6259559.0    15-05-2020  00:00
1   15-05-2020 00:00  1         P1_SRC_2    0.0       0.0       0.0          6183645.0    15-05-2020  00:00
2   15-05-2020 00:00  1         P1_SRC_3    0.0       0.0       0.0          6987759.0    15-05-2020  00:00
#Convert the Date column from string to date type
p1['DATE'] = pd.to_datetime(p1['DATE'])
p1.drop('DATE_TIME', axis = 1, inplace=True)

# add seconds for time standardization
p1['TIME'] = p1['TIME'] + ':00'
# convert str to time
p1['TIME'] = pd.to_datetime(p1['TIME'], format= '%H:%M:%S').dt.time
p1['DATE_TIME'] = ''

#Combine date + time into DATE_TIME column
# (pd.datetime is deprecated; use the datetime module imported as dt)
for idx in range(len(p1)):
    p1.iloc[idx, 8] = dt.datetime.combine(p1.iloc[idx, 6], p1.iloc[idx, 7])

# Reorder columns    
p1.drop(['DATE','TIME'], axis=1, inplace=True)
p1 = p1[['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'DC_POWER', 'AC_POWER', 'DAILY_YIELD', 'TOTAL_YIELD']]
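As an aside, the row-by-row `combine` loop above makes one Python call per row and can be slow on ~68k rows. A vectorized alternative (a sketch on toy data, assuming the same cleaned DATE/TIME columns) is:

```python
import pandas as pd

# Toy frame shaped like the cleaned columns: a datetime DATE column
# and a time-of-day string TIME column, to be recombined.
df = pd.DataFrame({
    'DATE': pd.to_datetime(['2020-05-15', '2020-05-15']),
    'TIME': ['00:00:00', '00:15:00'],
})

# String concatenation + a single to_datetime call replaces the loop
df['DATE_TIME'] = pd.to_datetime(
    df['DATE'].dt.strftime('%Y-%m-%d') + ' ' + df['TIME']
)
print(df['DATE_TIME'].iloc[1])  # 2020-05-15 00:15:00
```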


# Visualize current state of the dataset
p1.sample(5)
       DATE_TIME            PLANT_ID  SOURCE_KEY  DC_POWER      AC_POWER     DAILY_YIELD  TOTAL_YIELD
45176  2020-06-06 19:00:00  1         P1_SRC_3    0.000000      0.000000     6496.000000  7158982.000
54772  2020-11-06 08:15:00  1         P1_SRC_8    2913.857143   285.814286   258.428571   6724468.429
35861  2020-02-06 07:00:00  1         P1_SRC_21   1669.857143   163.400000   59.142857    7247825.143
14118  2020-05-22 11:00:00  1         P1_SRC_8    7082.857143   692.971429   2251.142857  6574115.143
38471  2020-03-06 13:00:00  1         P1_SRC_15   11547.428570  1127.485714  4329.142857  7253659.143

Power plant 1 dataset is ready.

Repeat the steps above to process the Plant 2 file.

5) Analyze structure and content
p2.head(3)
    DATE_TIME            PLANT_ID  SOURCE_KEY       DC_POWER  AC_POWER  DAILY_YIELD  TOTAL_YIELD
0   2020-05-15 00:00:00  4136001   4UPUqMRk7TRMgml  0.0       0.0       9425.000000  2.429011e+06
1   2020-05-15 00:00:00  4136001   81aHJ1q11NBPMrL  0.0       0.0       0.000000     1.215279e+09
2   2020-05-15 00:00:00  4136001   9kRcWv60rDACzjR  0.0       0.0       3075.333333  2.247720e+09
p2.isnull().sum()

DATE_TIME      0
PLANT_ID       0
SOURCE_KEY     0
DC_POWER       0
AC_POWER       0
DAILY_YIELD    0
TOTAL_YIELD    0
dtype: int64
p2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67698 entries, 0 to 67697
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE_TIME    67698 non-null  object 
 1   PLANT_ID     67698 non-null  int64  
 2   SOURCE_KEY   67698 non-null  object 
 3   DC_POWER     67698 non-null  float64
 4   AC_POWER     67698 non-null  float64
 5   DAILY_YIELD  67698 non-null  float64
 6   TOTAL_YIELD  67698 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.6+ MB
6) Perform data cleanup
print(p2.PLANT_ID.unique())
[4136001]
p2['PLANT_ID'] = '2'
coder[4136001] = 'PLANT ID: 2'

sources = p2.SOURCE_KEY.unique()
print(sources)
print('There are '+ str(len(p2.SOURCE_KEY.unique())) + ' values.')

['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
 'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
 'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
 'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
 'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
 'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']
There are 22 values.

#create dictionary to hold source key mapping

p2_source = {}
idx = 1
for source in sources:
    p2_source[source] = 'P2_SRC_' + str(idx)
    idx += 1

p2['SOURCE_KEY'] = p2['SOURCE_KEY'].replace(p2_source)

p2['DATE'] = p2['DATE_TIME'].str.slice(0,10)
p2['TIME'] = p2['DATE_TIME'].str.slice(11,16)
p2['DATE'] = pd.to_datetime(p2['DATE'])
p2.drop('DATE_TIME', axis = 1, inplace=True)

p2['TIME'] = p2['TIME'] + ':00'
p2['TIME'] = pd.to_datetime(p2['TIME'], format= '%H:%M:%S').dt.time
p2['DATE_TIME'] = ''
for idx in range(len(p2)):
    p2.iloc[idx, 8] = dt.datetime.combine(p2.iloc[idx, 6], p2.iloc[idx, 7])

p2.drop(['DATE','TIME'], axis=1, inplace=True)
p2 = p2[['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'DC_POWER',      
         'AC_POWER', 'DAILY_YIELD', 'TOTAL_YIELD']]
p2.sample(5)
       DATE_TIME            PLANT_ID  SOURCE_KEY  DC_POWER    AC_POWER    DAILY_YIELD  TOTAL_YIELD
47837  2020-06-08 14:15:00  2         P2_SRC_6    0.000000    0.000000    1330.000000  1.795072e+09
32534  2020-06-01 07:45:00  2         P2_SRC_15   202.053333  198.093333  222.666667   5.937129e+08
24500  2020-05-28 06:00:00  2         P2_SRC_20   15.120000   14.593333   1.266667     2.305880e+06
61875  2020-06-15 05:45:00  2         P2_SRC_8    0.000000    0.000000    0.000000     2.670741e+06
14387  2020-05-22 09:45:00  2         P2_SRC_2    0.000000    0.000000    1605.000000  1.215316e+09

Both Power Plant Generation files are ready.

Process Weather Sensor files

6) Analyze structure and content
w1.head(3)
    DATE_TIME            PLANT_ID  SOURCE_KEY       AMBIENT_TEMPERATURE  MODULE_TEMPERATURE  IRRADIATION
0   2020-05-15 00:00:00  4135001   HmiyD2TTLFNqkNe  25.184316            22.857507           0.0
1   2020-05-15 00:15:00  4135001   HmiyD2TTLFNqkNe  25.084589            22.761668           0.0
2   2020-05-15 00:30:00  4135001   HmiyD2TTLFNqkNe  24.935753            22.592306           0.0
w1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3182 entries, 0 to 3181
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   DATE_TIME            3182 non-null   object 
 1   PLANT_ID             3182 non-null   int64  
 2   SOURCE_KEY           3182 non-null   object 
 3   AMBIENT_TEMPERATURE  3182 non-null   float64
 4   MODULE_TEMPERATURE   3182 non-null   float64
 5   IRRADIATION          3182 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 149.3+ KB
7) Perform data cleanup

Code for readability:

w1.PLANT_ID.unique()
array([4135001], dtype=int64)
w1['PLANT_ID'] = '1'
w1['SOURCE_KEY'].unique()
array(['HmiyD2TTLFNqkNe'], dtype=object)
w1['SOURCE_KEY'] = 'WXS1'
coder['HmiyD2TTLFNqkNe'] = 'WXS1'

Process Date/Time data:

w1['DATE'] = w1['DATE_TIME'].str.slice(0,10)
w1['DATE'] = pd.to_datetime(w1['DATE'])
w1['TIME'] = w1['DATE_TIME'].str.slice(11,16)
w1['TIME'] = w1['TIME'] + ':00'
w1['TIME'] = pd.to_datetime(w1['TIME'], format= '%H:%M:%S').dt.time
w1.drop('DATE_TIME', axis=1, inplace=True)
w1['DATE_TIME'] = ''
for idx in range(len(w1)):
    w1.iloc[idx, 7] = dt.datetime.combine(w1.iloc[idx, 5], w1.iloc[idx, 6])
w1.drop(['DATE','TIME'], axis=1, inplace=True)
w1= w1[['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION']]
w1.head(3)
    DATE_TIME            PLANT_ID  SOURCE_KEY  AMBIENT_TEMPERATURE  MODULE_TEMPERATURE  IRRADIATION
0   2020-05-15 00:00:00  1         WXS1        25.184316            22.857507           0.0
1   2020-05-15 00:15:00  1         WXS1        25.084589            22.761668           0.0
2   2020-05-15 00:30:00  1         WXS1        24.935753            22.592306           0.0

Process the second weather file

8) Analyze structure and content
w2.head(3)
    DATE_TIME            PLANT_ID  SOURCE_KEY       AMBIENT_TEMPERATURE  MODULE_TEMPERATURE  IRRADIATION
0   2020-05-15 00:00:00  4136001   iq8k7ZNt4Mwm3w0  27.004764            25.060789           0.0
1   2020-05-15 00:15:00  4136001   iq8k7ZNt4Mwm3w0  26.880811            24.421869           0.0
2   2020-05-15 00:30:00  4136001   iq8k7ZNt4Mwm3w0  26.682055            24.427290           0.0
w2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3259 entries, 0 to 3258
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   DATE_TIME            3259 non-null   object 
 1   PLANT_ID             3259 non-null   int64  
 2   SOURCE_KEY           3259 non-null   object 
 3   AMBIENT_TEMPERATURE  3259 non-null   float64
 4   MODULE_TEMPERATURE   3259 non-null   float64
 5   IRRADIATION          3259 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 152.9+ KB
9) Perform data cleanup

Code for readability:

w2.PLANT_ID.unique()

array([4136001], dtype=int64)
w2.SOURCE_KEY.unique()
array(['iq8k7ZNt4Mwm3w0'], dtype=object)
w2['PLANT_ID'] = '2'
w2['SOURCE_KEY'] = 'WXS2'
coder['iq8k7ZNt4Mwm3w0'] = 'WXS2'

Process Date/Time data:

w2['DATE'] = w2['DATE_TIME'].str.slice(0,10)
w2['DATE'] = pd.to_datetime(w2['DATE'])
w2['TIME'] = w2['DATE_TIME'].str.slice(11,16)
w2['TIME'] = w2['TIME'] + ':00'
w2['TIME'] = pd.to_datetime(w2['TIME'], format= '%H:%M:%S').dt.time
w2.drop('DATE_TIME', axis=1, inplace=True)
w2['DATE_TIME'] = ''

for idx in range(len(w2)):
    w2.iloc[idx, 7] = dt.datetime.combine(w2.iloc[idx, 5], w2.iloc[idx, 6])
w2.drop(['DATE','TIME'], axis=1, inplace=True)
w2= w2[['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION']]
w2.head(3)
    DATE_TIME            PLANT_ID  SOURCE_KEY  AMBIENT_TEMPERATURE  MODULE_TEMPERATURE  IRRADIATION
0   2020-05-15 00:00:00  2         WXS2        27.004764            25.060789           0.0
1   2020-05-15 00:15:00  2         WXS2        26.880811            24.421869           0.0
2   2020-05-15 00:30:00  2         WXS2        26.682055            24.427290           0.0

Merge datasets into Master File

10) Prepare dataset for merge
# join powerplant datasets
df_power = pd.concat([p1,p2], ignore_index=True)
# join wx sensor datasets
df_wx = pd.concat([w1,w2], ignore_index=True)
# rename common column name to avoid conflict
df_power.rename(columns={'SOURCE_KEY':'PLANT_SOURCE'}, inplace=True)
# rename columns for readability
df_wx.rename(columns={'SOURCE_KEY':'WX_SENSOR', 'AMBIENT_TEMPERATURE': 'AMBIENT_TEMP', 'MODULE_TEMPERATURE':'MODULE_TEMP'}, inplace=True)
11) Merge Datasets
df = pd.merge(df_power, df_wx, how='left', on=['DATE_TIME', 'PLANT_ID'])
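A minimal sketch of what this left merge does, using hypothetical toy frames (not the actual project data): every generation row is kept, and the single weather reading per timestamp/plant is repeated across all inverters sharing that key.

```python
import pandas as pd

# Toy stand-ins: several inverter readings vs one weather row per timestamp
power = pd.DataFrame({
    'DATE_TIME': ['2020-05-15 00:00', '2020-05-15 00:00', '2020-05-15 00:15'],
    'PLANT_ID':  ['1', '1', '1'],
    'DC_POWER':  [0.0, 0.0, 12.5],
})
wx = pd.DataFrame({
    'DATE_TIME': ['2020-05-15 00:00', '2020-05-15 00:15'],
    'PLANT_ID':  ['1', '1'],
    'AMBIENT_TEMP': [25.18, 25.08],
})

# how='left' keeps every power row; the matching weather reading is
# duplicated across rows that share the DATE_TIME/PLANT_ID key.
merged = pd.merge(power, wx, how='left', on=['DATE_TIME', 'PLANT_ID'])
print(len(merged))                      # 3
print(merged['AMBIENT_TEMP'].tolist())  # [25.18, 25.18, 25.08]
```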
12) Conduct final cleanup on Master File
Limit the level of precision of the temperature, irradiation, power, and yield values (rounding).
df['AMBIENT_TEMP'] = round(df['AMBIENT_TEMP'],4)
df['MODULE_TEMP'] = round(df['MODULE_TEMP'],4)
df['IRRADIATION'] = round(df['IRRADIATION'],6)
df['DC_POWER'] = round(df['DC_POWER'],4)
df['AC_POWER'] = round(df['AC_POWER'],4)
df['DAILY_YIELD'] = round(df['DAILY_YIELD'],4)
df['TOTAL_YIELD'] = round(df['TOTAL_YIELD'],4)
df
        DATE_TIME            PLANT_ID  PLANT_SOURCE  DC_POWER  AC_POWER  DAILY_YIELD  TOTAL_YIELD   WX_SENSOR  AMBIENT_TEMP  MODULE_TEMP  IRRADIATION
0       2020-05-15 00:00:00  1         P1_SRC_1      0.0       0.0       0.0          6259559.0     WXS1       25.1843       22.8575      0.0
1       2020-05-15 00:00:00  1         P1_SRC_2      0.0       0.0       0.0          6183645.0     WXS1       25.1843       22.8575      0.0
2       2020-05-15 00:00:00  1         P1_SRC_3      0.0       0.0       0.0          6987759.0     WXS1       25.1843       22.8575      0.0
3       2020-05-15 00:00:00  1         P1_SRC_4      0.0       0.0       0.0          7602960.0     WXS1       25.1843       22.8575      0.0
4       2020-05-15 00:00:00  1         P1_SRC_5      0.0       0.0       0.0          7158964.0     WXS1       25.1843       22.8575      0.0
...
136471  2020-06-17 23:45:00  2         P2_SRC_18     0.0       0.0       4157.0       520758.0      WXS2       23.2029       22.5359      0.0
136472  2020-06-17 23:45:00  2         P2_SRC_19     0.0       0.0       3931.0       121131356.0   WXS2       23.2029       22.5359      0.0
136473  2020-06-17 23:45:00  2         P2_SRC_20     0.0       0.0       4322.0       2427691.0     WXS2       23.2029       22.5359      0.0
136474  2020-06-17 23:45:00  2         P2_SRC_21     0.0       0.0       4218.0       106896394.0   WXS2       23.2029       22.5359      0.0
136475  2020-06-17 23:45:00  2         P2_SRC_22     0.0       0.0       4316.0       209335741.0   WXS2       23.2029       22.5359      0.0

136476 rows × 11 columns

Export Master File

13) Create a lookup table of new and old codes for export:

# combine source-key and plant/sensor mappings into one lookup dictionary
code = {**p1_source, **p2_source, **coder}
code_df = pd.DataFrame.from_dict(code, orient='index', columns=['New_Code'])
code_df['Old_Code'] = code_df.index
code_df.reset_index(drop=True, inplace=True)
code_df.head()
    New_Code  Old_Code
0   P1_SRC_1  1BY6WEcLGh8j5v7
1   P1_SRC_2  1IF53ai7Xc0U56Y
2   P1_SRC_3  3PZuoBAID5Wc2HD
3   P1_SRC_4  7JYdWkrLSPkdwr4
4   P1_SRC_5  McdE0feGgRqW7Ca

14) Export the merged master file to an Excel workbook:

file = 'D:/Springboard/Projects/SolarPower/data/final/Combined_Power_Generation.xlsx'
with pd.ExcelWriter(file, engine='openpyxl') as writer:  
    df.to_excel(writer, sheet_name='Master')
    code_df.to_excel(writer, sheet_name='Code_Lookup')
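To sanity-check the export, the workbook can be read back with `sheet_name=None`, which returns a `{sheet_name: DataFrame}` dict. A sketch with small stand-in frames (hypothetical values; `openpyxl` is assumed to be installed):

```python
import pandas as pd

# Small stand-in frames for df and code_df (hypothetical values)
df = pd.DataFrame({'DC_POWER': [0.0, 12.5]})
code_df = pd.DataFrame({'New_Code': ['P1_SRC_1'],
                        'Old_Code': ['1BY6WEcLGh8j5v7']})

file = 'Combined_Power_Generation.xlsx'
with pd.ExcelWriter(file, engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Master')
    code_df.to_excel(writer, sheet_name='Code_Lookup')

# sheet_name=None loads every worksheet into a dict keyed by sheet name;
# index_col=0 restores the index that to_excel wrote out
sheets = pd.read_excel(file, sheet_name=None, index_col=0)
print(sorted(sheets))  # ['Code_Lookup', 'Master']
```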

Final Master File

Master worksheet

Master.JPG
