Excel Power Query for Data Cleanup

Check out this introductory series detailing data cleanup workflows with Power Query in MS Excel.

Video 1: Introduction to Power Query
– Standardize text formatting on select columns
– Remove duplicate entries

Video 2: Formatting Phone Numbers
– How to remove unwanted characters
– How to clean ‘null’ entries
– How to split a column by character count
– How to create a custom column with a formula that formats the phone number as (xxx) xxx-xxxx

Video 3: Text Extraction with Conditional Columns
– How to create a Conditional Column
– Navigating the IF statement to set conditions

Business Data Merge and Cleanup

Here is the general workflow I used to complete a recent data merge and cleanup project.

Tasks:

  • Merge data from 5 different data structures (approximately 4 GB across 148 files)
  • Extract businesses related to commercial painting
  • Format data
  • Create a consolidated Master file
  • Create separate files divided by the State where each business is located

Tools:

  • Python’s Pandas library
  • MS Office Visual Basic for Applications (VBA)

Workflow

1. Explore data structure – First, we explore the structure of each data set. Below we can see that many of the column titles do not match. For example, for Company Name we have ‘COMPANY_NAME’, ‘NAME’, ‘Company’ and ‘Business Name’. Also, as expected, not all datasets contain the same type of information.
(Header screenshots: Data Sources 1–5)

2. Create Master Dataset – To merge all the data, a Master Dataset schema was engineered to collect every column represented across the datasets, and each source's items were mapped to the corresponding columns of the Master Dataset, as sketched below.
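A minimal sketch of that mapping step in pandas (illustrative: only the Company Name variants come from the headers shown above, and `dataframes` is assumed to be a list holding the already-loaded source files):

import pandas as pd

# One rename map per source, translating its column names onto the
# Master Dataset schema; only the Company Name variants are real here.
column_maps = [
    {'COMPANY_NAME': 'COMPANY_NAME'},   # data source 1
    {'NAME': 'COMPANY_NAME'},           # data source 2
    {'Company': 'COMPANY_NAME'},        # data source 3
    {'Business Name': 'COMPANY_NAME'},  # data source 4 (and so on for source 5)
]

frames = [df.rename(columns=m) for df, m in zip(dataframes, column_maps)]
master = pd.concat(frames, ignore_index=True)  # unmatched columns become NaN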

3. Define Top Level Filter and extract records – The following code looked for the term ‘paint’ in the columns ‘COMPANY_NAME’, ‘SIC_DESCRIPTION’, and ‘WEBSITE’. Records with a match were extracted and compiled.
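The original code is not reproduced here, but the idea can be sketched in pandas roughly as follows (assuming one source file is already loaded into a DataFrame df):

# Case-insensitive search for 'paint' in the three target columns;
# keep any record that matches in at least one of them.
cols = ['COMPANY_NAME', 'SIC_DESCRIPTION', 'WEBSITE']
mask = pd.Series(False, index=df.index)
for col in cols:
    mask |= df[col].str.contains('paint', case=False, na=False)
paint_df = df[mask]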

Because each of the datasets had a unique structure, this code had to be tweaked for each file structure.

4. Apply second filter – A look at the resulting data indicates we still have many records that we do not want. As the client is interested in residential and commercial painting, we do not need businesses that work on automobiles, face paint, or paintball. We search for and filter out those records.
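A hedged sketch of this exclusion filter (the keyword pattern is illustrative):

# Drop records whose company name mentions the unwanted lines of business.
excluded = 'auto|face paint|paintball'
paint_df = paint_df[~paint_df['COMPANY_NAME']
                    .str.contains(excluded, case=False, na=False)]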

5. Drop duplicated business entries – We consider any businesses that share the same Company Name, Address, and City to be duplicates, and we keep only the first reference in the data set. This brings the total record count from 1,158,119 down to 318,802.
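In pandas this can be sketched as follows (the ADDRESS and CITY column names are assumptions about the Master Dataset schema):

# Keep only the first occurrence of each Company Name/Address/City triple.
paint_df = paint_df.drop_duplicates(
    subset=['COMPANY_NAME', 'ADDRESS', 'CITY'], keep='first')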

6. Format and Standardize entries – The phone and fax numbers contained mixed formats. Some numbers were entered without any formatting, such as 14072332211; others were formatted as (###) ###-####. The code applied first strips each entry down to digits only and then applies the (###) ###-#### format.
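A sketch of that two-step cleanup (the PHONE column name is an assumption; the 14072332211 example comes from the data described above):

import re

def format_phone(raw):
    '''Strip non-digits, drop a leading 1, then apply (###) ###-####.'''
    digits = re.sub(r'\D', '', str(raw))
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]      # e.g. 14072332211 -> 4072332211
    if len(digits) != 10:
        return raw               # leave unparseable entries as-is
    return f'({digits[:3]}) {digits[3:6]}-{digits[6:]}'

paint_df['PHONE'] = paint_df['PHONE'].apply(format_phone)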

Other text-type entries were standardized.

7. Create final files – With the data filtered and cleaned up, a Master data file was created, as well as a set of files divided by State.
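A sketch of the per-State split (the paths and the STATE column name are illustrative):

# Export the Master file, then one file per State.
paint_df.to_csv('master.csv', index=False)
for state, group in paint_df.groupby('STATE'):
    group.to_csv(f'by_state/{state}.csv', index=False)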

Deliverable Sample

Data Cleanup in Python

This notebook shows the initial data cleanup workflow for a capstone project in fulfillment of Springboard’s Data Science Track Program.

The data was retrieved from the National Transportation Safety Board (NTSB) website. The original data resides in a 20-table MS Access database. The pertinent information was exported to Comma-Separated Values (CSV) files utilizing Access’ query functions. This resulted in six distinct files. The data was then imported, analyzed, and cleaned utilizing Python Pandas methods and other libraries. Finally, a single merged file was exported, ready for Exploratory Data Analysis (EDA).
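As a rough sketch of that import-and-merge step (the file names and the 'ev_id' join key are hypothetical; see the notebook linked below for the actual workflow):

import pandas as pd

# Load the six CSV exports (names are placeholders).
files = ['events.csv', 'aircraft.csv', 'engines.csv',
         'injury.csv', 'findings.csv', 'narratives.csv']
tables = [pd.read_csv(f) for f in files]

# Merge on a shared event identifier, assumed here to be 'ev_id'.
merged = tables[0]
for t in tables[1:]:
    merged = merged.merge(t, on='ev_id', how='left')
merged.to_csv('ntsb_merged.csv', index=False)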

DATA CLEANUP NOTEBOOK

Merging Excel Data

Data Integration

This notebook highlights the process of integrating data from four different data sources (Excel files) into a master file. For efficiency, the data manipulation and cleanup are done with Python. After the processing is done, a master Excel file is exported.


Raw data files and table definitions can be found at https://www.kaggle.com/anikannal/solar-power-generation-data

View this notebook

** NOTE: The purpose of this post is to show a possible workflow for data cleanup and merging. Although the code is annotated to identify the general logic of each subsection, it is not intended to be a step-by-step tutorial.

Problem Identification

  • Customer would like to analyze data on power generation from two solar power plants and their respective weather sensor logs.
  • The information resides in four separate Excel workbooks (two power generation files and two weather sensor files).
  • Customer requests:
    — Consolidate data into one workbook
    — Manipulate data to facilitate comprehensibility
      — Change Codes to easy-to-read labels
      — Create Code dictionary for future reference
    — Conduct basic data cleanup

Plant 1 Power Generation Data – Sample
(Plant 2 file has the same structure and data formats)

plant1.JPG

Plant 2 Weather Sensor Data
(Plant 1 file has the same structure and data formats)

wx2.JPG

WORKFLOW:

1) Load Dependencies
import pandas as pd
import numpy as np
import datetime as dt
2) Import source Excel files
def import_xlx(file):
    '''Import an Excel file into a target dataframe.'''
    data_dir = 'D:/Springboard/Projects/SolarPower/data/raw/'
    # import dates as strings and manipulate them in Python
    df = pd.read_excel(data_dir + file, dtype={'DATE_TIME': str})
    return df
p1 = import_xlx('Plant_1_Generation_Data.xlsx')
p2 = import_xlx('Plant_2_Generation_Data.xlsx')
w1 = import_xlx('Plant_1_Weather_Sensor_Data.xlsx')
w2 = import_xlx('Plant_2_Weather_Sensor_Data.xlsx')
3) Analyze structure and content of powerplant files
p1.head(3)
   DATE_TIME         PLANT_ID  SOURCE_KEY       DC_POWER  AC_POWER  DAILY_YIELD  TOTAL_YIELD
0  15-05-2020 00:00  4135001   1BY6WEcLGh8j5v7       0.0       0.0          0.0    6259559.0
1  15-05-2020 00:00  4135001   1IF53ai7Xc0U56Y       0.0       0.0          0.0    6183645.0
2  15-05-2020 00:00  4135001   3PZuoBAID5Wc2HD       0.0       0.0          0.0    6987759.0

Check to see if there are missing values anywhere in the dataset.

p1.isnull().sum()

DATE_TIME      0
PLANT_ID       0
SOURCE_KEY     0
DC_POWER       0
AC_POWER       0
DAILY_YIELD    0
TOTAL_YIELD    0
dtype: int64
#check for data type consistency
p1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68778 entries, 0 to 68777
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE_TIME    68778 non-null  object 
 1   PLANT_ID     68778 non-null  int64  
 2   SOURCE_KEY   68778 non-null  object 
 3   DC_POWER     68778 non-null  float64
 4   AC_POWER     68778 non-null  float64
 5   DAILY_YIELD  68778 non-null  float64
 6   TOTAL_YIELD  68778 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.7+ MB
4) Perform data cleanup

Plant ID and Source Key appear to be categorical values. Let’s see how many possible values each holds.

print(p1.PLANT_ID.unique())
[4135001]

There is only one ID in this column. For easier identification, let’s change the ID to ‘1’.

p1['PLANT_ID'] = '1'
coder = {}                    # lookup of original codes -> new labels
coder[4135001] = 'PLANT ID: 1'

Check the data in the ‘SOURCE_KEY’ column.

sources = p1.SOURCE_KEY.unique()
print(sources)
print()
print('There are '+ str(len(p1.SOURCE_KEY.unique())) + ' distinct values.')
['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
 'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
 'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
 'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
 'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
 'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']

There are 22 distinct values.

There are 22 different values for the source column, and these are hard to read. We are going to map these values to something easier to read. The format will be ‘P1_SRC_X’ for Plant 1, Source X.

We will store the key:value combinations for later reference.

#create dictionary to hold source key mapping
p1_source = {}
idx = 1
for source in sources:
    p1_source[source] = 'P1_SRC_' + str(idx)
    idx += 1

#replace old codes with new values
p1['SOURCE_KEY'] = p1['SOURCE_KEY'].replace(p1_source)

The DATE_TIME values are being treated as strings. Perform data cleanup to standardize the format and make the column a DATETIME type.

Separate DATES and TIMES into their own columns for easier manipulation; after cleanup, bring them back together into a DATE_TIME column. ** This is needed because some date entries in the file have inconsistent formats. Removing the time components allows the parsing logic to correctly identify and format the dates.

p1['DATE'] = p1['DATE_TIME'].str.slice(0,10)
p1['TIME'] = p1['DATE_TIME'].str.slice(11,16)
p1.head(3)
   DATE_TIME         PLANT_ID  SOURCE_KEY  DC_POWER  AC_POWER  DAILY_YIELD  TOTAL_YIELD  DATE        TIME
0  15-05-2020 00:00  1         P1_SRC_1         0.0       0.0          0.0    6259559.0  15-05-2020  00:00
1  15-05-2020 00:00  1         P1_SRC_2         0.0       0.0          0.0    6183645.0  15-05-2020  00:00
2  15-05-2020 00:00  1         P1_SRC_3         0.0       0.0          0.0    6987759.0  15-05-2020  00:00
#Convert the Date column from string to date type
p1['DATE'] = pd.to_datetime(p1['DATE'])
p1.drop('DATE_TIME', axis = 1, inplace=True)

# add seconds for time standardization
p1['TIME'] = p1['TIME'] + ':00'
# convert str to time
p1['TIME'] = pd.to_datetime(p1['TIME'], format= '%H:%M:%S').dt.time
p1['DATE_TIME'] = ''

#Combine date + time into DATE_TIME column
for idx in range(len(p1)):
    p1.iloc[idx,8] = dt.datetime.combine(p1.iloc[idx,6], p1.iloc[idx,7])

# Reorder columns    
p1.drop(['DATE','TIME'], axis=1, inplace=True)
p1 = p1[['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'DC_POWER', 'AC_POWER', 'DAILY_YIELD', 'TOTAL_YIELD']]


# Visualize current status of dataset
p1.sample(5)
       DATE_TIME            PLANT_ID  SOURCE_KEY      DC_POWER     AC_POWER  DAILY_YIELD  TOTAL_YIELD
45176  2020-06-06 19:00:00  1         P1_SRC_3        0.000000     0.000000  6496.000000  7158982.000
54772  2020-11-06 08:15:00  1         P1_SRC_8     2913.857143   285.814286   258.428571  6724468.429
35861  2020-02-06 07:00:00  1         P1_SRC_21    1669.857143   163.400000    59.142857  7247825.143
14118  2020-05-22 11:00:00  1         P1_SRC_8     7082.857143   692.971429  2251.142857  6574115.143
38471  2020-03-06 13:00:00  1         P1_SRC_15   11547.428570  1127.485714  4329.142857  7253659.143

Power plant 1 dataset is ready.

Repeat the steps above to process the Power Plant 2 file.

5) Analyze structure and content
p2.head(3)
   DATE_TIME            PLANT_ID  SOURCE_KEY       DC_POWER  AC_POWER  DAILY_YIELD  TOTAL_YIELD
0  2020-05-15 00:00:00  4136001   4UPUqMRk7TRMgml       0.0       0.0  9425.000000  2.429011e+06
1  2020-05-15 00:00:00  4136001   81aHJ1q11NBPMrL       0.0       0.0     0.000000  1.215279e+09
2  2020-05-15 00:00:00  4136001   9kRcWv60rDACzjR       0.0       0.0  3075.333333  2.247720e+09
p2.isnull().sum()

DATE_TIME      0
PLANT_ID       0
SOURCE_KEY     0
DC_POWER       0
AC_POWER       0
DAILY_YIELD    0
TOTAL_YIELD    0
dtype: int64
p2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67698 entries, 0 to 67697
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE_TIME    67698 non-null  object 
 1   PLANT_ID     67698 non-null  int64  
 2   SOURCE_KEY   67698 non-null  object 
 3   DC_POWER     67698 non-null  float64
 4   AC_POWER     67698 non-null  float64
 5   DAILY_YIELD  67698 non-null  float64
 6   TOTAL_YIELD  67698 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.6+ MB
6) Perform data cleanup
print(p2.PLANT_ID.unique())
[4136001]
p2['PLANT_ID'] = '2'
coder[4136001] = 'PLANT ID: 2'

sources = p2.SOURCE_KEY.unique()
print(sources)
print('There are '+ str(len(p2.SOURCE_KEY.unique())) + ' values.')

['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
 'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
 'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
 'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
 'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
 'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']
There are 22 values.

#create dictionary to hold source key mapping
p2_source = {}
idx = 1
for source in sources:
    p2_source[source] = 'P2_SRC_' + str(idx)
    idx += 1

#replace old codes with new values
p2['SOURCE_KEY'] = p2['SOURCE_KEY'].replace(p2_source)

p2['DATE'] = p2['DATE_TIME'].str.slice(0,10)
p2['TIME'] = p2['DATE_TIME'].str.slice(11,16)
p2['DATE'] = pd.to_datetime(p2['DATE'])
p2.drop('DATE_TIME', axis = 1, inplace=True)

p2['TIME'] = p2['TIME'] + ':00'
p2['TIME'] = pd.to_datetime(p2['TIME'], format= '%H:%M:%S').dt.time
p2['DATE_TIME'] = ''
for idx in range(len(p2)):
    p2.iloc[idx,8] = dt.datetime.combine(p2.iloc[idx,6], p2.iloc[idx,7])

p2.drop(['DATE','TIME'], axis=1, inplace=True)
p2 = p2[['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'DC_POWER',      
         'AC_POWER', 'DAILY_YIELD', 'TOTAL_YIELD']]
p2.sample(5)
       DATE_TIME            PLANT_ID  SOURCE_KEY   DC_POWER    AC_POWER  DAILY_YIELD   TOTAL_YIELD
47837  2020-06-08 14:15:00  2         P2_SRC_6     0.000000    0.000000  1330.000000  1.795072e+09
32534  2020-06-01 07:45:00  2         P2_SRC_15  202.053333  198.093333   222.666667  5.937129e+08
24500  2020-05-28 06:00:00  2         P2_SRC_20   15.120000   14.593333     1.266667  2.305880e+06
61875  2020-06-15 05:45:00  2         P2_SRC_8     0.000000    0.000000     0.000000  2.670741e+06
14387  2020-05-22 09:45:00  2         P2_SRC_2     0.000000    0.000000  1605.000000  1.215316e+09

Both Power Plant Generation files are ready.

Process Weather Sensor files

7) Analyze structure and content
w1.head(3)
   DATE_TIME            PLANT_ID  SOURCE_KEY       AMBIENT_TEMPERATURE  MODULE_TEMPERATURE  IRRADIATION
0  2020-05-15 00:00:00  4135001   HmiyD2TTLFNqkNe            25.184316           22.857507          0.0
1  2020-05-15 00:15:00  4135001   HmiyD2TTLFNqkNe            25.084589           22.761668          0.0
2  2020-05-15 00:30:00  4135001   HmiyD2TTLFNqkNe            24.935753           22.592306          0.0
w1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3182 entries, 0 to 3181
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   DATE_TIME            3182 non-null   object 
 1   PLANT_ID             3182 non-null   int64  
 2   SOURCE_KEY           3182 non-null   object 
 3   AMBIENT_TEMPERATURE  3182 non-null   float64
 4   MODULE_TEMPERATURE   3182 non-null   float64
 5   IRRADIATION          3182 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 149.3+ KB
8) Perform data cleanup

Recode values for readability:

w1.PLANT_ID.unique()
array([4135001], dtype=int64)
w1['PLANT_ID'] = '1'
w1['SOURCE_KEY'].unique()
array(['HmiyD2TTLFNqkNe'], dtype=object)
w1['SOURCE_KEY'] = 'WXS1'
coder['HmiyD2TTLFNqkNe'] = 'WXS1'

Process Date/Time data:

w1['DATE'] = w1['DATE_TIME'].str.slice(0,10)
w1['DATE'] = pd.to_datetime(w1['DATE'])
w1['TIME'] = w1['DATE_TIME'].str.slice(11,16)
w1['TIME'] = w1['TIME'] + ':00'
w1['TIME'] = pd.to_datetime(w1['TIME'], format= '%H:%M:%S').dt.time
w1.drop('DATE_TIME', axis=1, inplace=True)
w1['DATE_TIME'] = ''
for idx in range(len(w1)):
    w1.iloc[idx,7] = dt.datetime.combine(w1.iloc[idx,5], w1.iloc[idx,6])
w1.drop(['DATE','TIME'], axis=1, inplace=True)
w1 = w1[['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION']]
w1.head(3)
   DATE_TIME            PLANT_ID  SOURCE_KEY  AMBIENT_TEMPERATURE  MODULE_TEMPERATURE  IRRADIATION
0  2020-05-15 00:00:00  1         WXS1                  25.184316           22.857507          0.0
1  2020-05-15 00:15:00  1         WXS1                  25.084589           22.761668          0.0
2  2020-05-15 00:30:00  1         WXS1                  24.935753           22.592306          0.0

Process the second weather file

9) Analyze structure and content
w2.head(3)
   DATE_TIME            PLANT_ID  SOURCE_KEY       AMBIENT_TEMPERATURE  MODULE_TEMPERATURE  IRRADIATION
0  2020-05-15 00:00:00  4136001   iq8k7ZNt4Mwm3w0            27.004764           25.060789          0.0
1  2020-05-15 00:15:00  4136001   iq8k7ZNt4Mwm3w0            26.880811           24.421869          0.0
2  2020-05-15 00:30:00  4136001   iq8k7ZNt4Mwm3w0            26.682055           24.427290          0.0
w2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3259 entries, 0 to 3258
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   DATE_TIME            3259 non-null   object 
 1   PLANT_ID             3259 non-null   int64  
 2   SOURCE_KEY           3259 non-null   object 
 3   AMBIENT_TEMPERATURE  3259 non-null   float64
 4   MODULE_TEMPERATURE   3259 non-null   float64
 5   IRRADIATION          3259 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 152.9+ KB
10) Perform data cleanup

Recode values for readability:

w2.PLANT_ID.unique()

array([4136001], dtype=int64)
w2.SOURCE_KEY.unique()
array(['iq8k7ZNt4Mwm3w0'], dtype=object)
w2['PLANT_ID'] = '2'
w2['SOURCE_KEY'] = 'WXS2'
coder['iq8k7ZNt4Mwm3w0'] = 'WXS2'

Process Date/Time data:

w2['DATE'] = w2['DATE_TIME'].str.slice(0,10)
w2['DATE'] = pd.to_datetime(w2['DATE'])
w2['TIME'] = w2['DATE_TIME'].str.slice(11,16)
w2['TIME'] = w2['TIME'] + ':00'
w2['TIME'] = pd.to_datetime(w2['TIME'], format= '%H:%M:%S').dt.time
w2.drop('DATE_TIME', axis=1, inplace=True)
w2['DATE_TIME'] = ''

for idx in range(len(w2)):
    w2.iloc[idx,7] = dt.datetime.combine(w2.iloc[idx,5], w2.iloc[idx,6])
w2.drop(['DATE','TIME'], axis=1, inplace=True)
w2 = w2[['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION']]
w2.head(3)
   DATE_TIME            PLANT_ID  SOURCE_KEY  AMBIENT_TEMPERATURE  MODULE_TEMPERATURE  IRRADIATION
0  2020-05-15 00:00:00  2         WXS2                  27.004764           25.060789          0.0
1  2020-05-15 00:15:00  2         WXS2                  26.880811           24.421869          0.0
2  2020-05-15 00:30:00  2         WXS2                  26.682055           24.427290          0.0

Merge datasets into Master File

11) Prepare datasets for merge
# join powerplant datasets
df_power = pd.concat([p1,p2], ignore_index=True)
# join wx sensor datasets
df_wx = pd.concat([w1,w2], ignore_index=True)
# rename common column name to avoid conflicts
df_power.rename(columns={'SOURCE_KEY':'PLANT_SOURCE'}, inplace=True)
# rename columns for readability
df_wx.rename(columns={'SOURCE_KEY':'WX_SENSOR', 'AMBIENT_TEMPERATURE': 'AMBIENT_TEMP', 'MODULE_TEMPERATURE':'MODULE_TEMP'}, inplace=True)
12) Merge Datasets
df = pd.merge(df_power, df_wx, how='left', left_on=['DATE_TIME', 'PLANT_ID'], right_on=['DATE_TIME', 'PLANT_ID'])
13) Conduct final cleanup on Master File
Limit the level of precision (rounding) of the temperature, irradiation, power, and yield values.
df['AMBIENT_TEMP'] = round(df['AMBIENT_TEMP'],4)
df['MODULE_TEMP'] = round(df['MODULE_TEMP'],4)
df['IRRADIATION'] = round(df['IRRADIATION'],6)
df['DC_POWER'] = round(df['DC_POWER'],4)
df['AC_POWER'] = round(df['AC_POWER'],4)
df['DAILY_YIELD'] = round(df['DAILY_YIELD'],4)
df['TOTAL_YIELD'] = round(df['TOTAL_YIELD'],4)
df
        DATE_TIME            PLANT_ID  PLANT_SOURCE  DC_POWER  AC_POWER  DAILY_YIELD  TOTAL_YIELD  WX_SENSOR  AMBIENT_TEMP  MODULE_TEMP  IRRADIATION
0       2020-05-15 00:00:00  1         P1_SRC_1           0.0       0.0          0.0    6259559.0  WXS1            25.1843      22.8575          0.0
1       2020-05-15 00:00:00  1         P1_SRC_2           0.0       0.0          0.0    6183645.0  WXS1            25.1843      22.8575          0.0
2       2020-05-15 00:00:00  1         P1_SRC_3           0.0       0.0          0.0    6987759.0  WXS1            25.1843      22.8575          0.0
3       2020-05-15 00:00:00  1         P1_SRC_4           0.0       0.0          0.0    7602960.0  WXS1            25.1843      22.8575          0.0
4       2020-05-15 00:00:00  1         P1_SRC_5           0.0       0.0          0.0    7158964.0  WXS1            25.1843      22.8575          0.0
...
136471  2020-06-17 23:45:00  2         P2_SRC_18          0.0       0.0       4157.0     520758.0  WXS2            23.2029      22.5359          0.0
136472  2020-06-17 23:45:00  2         P2_SRC_19          0.0       0.0       3931.0  121131356.0  WXS2            23.2029      22.5359          0.0
136473  2020-06-17 23:45:00  2         P2_SRC_20          0.0       0.0       4322.0    2427691.0  WXS2            23.2029      22.5359          0.0
136474  2020-06-17 23:45:00  2         P2_SRC_21          0.0       0.0       4218.0  106896394.0  WXS2            23.2029      22.5359          0.0
136475  2020-06-17 23:45:00  2         P2_SRC_22          0.0       0.0       4316.0  209335741.0  WXS2            23.2029      22.5359          0.0

136476 rows × 11 columns

Export Master File

14) Create a lookup table of new and old codes for export:

code = {**p1_source, **p2_source}
code = {**code, **coder}
code_df = pd.DataFrame.from_dict(code, orient='index', columns=['New_Code'])
code_df['Old_Code'] = code_df.index
code_df.reset_index(drop=True, inplace=True)
code_df.head()
   New_Code  Old_Code
0  P1_SRC_1  1BY6WEcLGh8j5v7
1  P1_SRC_2  1IF53ai7Xc0U56Y
2  P1_SRC_3  3PZuoBAID5Wc2HD
3  P1_SRC_4  7JYdWkrLSPkdwr4
4  P1_SRC_5  McdE0feGgRqW7Ca

15) Export the merged master file to an Excel workbook:

file = ('D:/Springboard/Projects/SolarPower/data/'
        'final/Combined_Power_Generation.xlsx')
with pd.ExcelWriter(file, engine='openpyxl') as writer:  
    df.to_excel(writer, sheet_name='Master')
    code_df.to_excel(writer, sheet_name='Code_Lookup')

Final Master File

Master worksheet

Master.JPG