I have attached three datasets (2 CSV files, and 1 HTML file) that I want to combine and analyse in Power BI. My goal is to combine all data from the various sources into a single table that enables me to compare actual sales with benchmarks for various month, city and age segment > combinations. Here is the strategy I want you to follow in transforming the data.
- import 2 tables from the HTML file - fill empty month cells with the last named month above them 3 unpivot both tables from wide to long format - merge (join) both tables on abbreviated city name to get the full city name in the benchmark table - import customers csv file into customer table; meaning of the columns is described in comment in the first 6 lines - import sales csv file into sales table - merge customers and sales table, joined by customer number - calculate the age of each customer at the time of ordering - turn the age into a segment - aggregate sales vales, grouped by city, month, and segment - merge aggregated sales with benchmark values by joining on city, segment, and monthBased on this strategy, please give me a step-by-step recipe to implement this in Power BI's Power Query Editor.
Certainly! I'll provide you with a step-by-step recipe to implement your data transformation strategy in Power BI's Power Query Editor. This will help you combine and analyze your datasets effectively.
Power BI Transformation Recipe
Follow these steps in Power BI's Power Query Editor to combine and analyze your datasets:
- Import HTML file:
- Go to Home > Get Data > Web
- Enter the path to your HTML file
-
Select the two tables from the Navigator
-
Fill empty month cells:
- For each table, select the month column
-
Go to Transform > Fill > Down
-
Unpivot tables:
- For each table, select all columns except month and segment
- Right-click > Unpivot Columns
-
Rename the new columns to "City" and "Value"
-
Merge tables for full city names:
- Add a new query: Home > New Query > Blank Query
- In the formula bar, enter:
= Table.NestedJoin(#"Unpivoted Table 1", {"City"}, #"Unpivoted Table 2", {"City"}, "NewColumn", JoinKind.LeftOuter) -
Expand the NewColumn to get the full city name
-
Import customers CSV:
- Go to Home > Get Data > Text/CSV
- Select the customers.csv file
- In the Query Editor, remove the top 5 rows (comments)
-
Rename columns according to the comments
-
Import sales CSV:
-
Repeat the process for the orders.csv file
-
Merge customers and sales:
- Select the sales query
- Go to Home > Merge Queries > Merge Queries as New
-
Select the customers table and join on the customer number column
-
Calculate customer age:
- Add a custom column: Add Column > Custom Column
-
Formula:
= Date.Year([date]) - Date.Year([dob]) -
Create age segments:
- Add another custom column
-
Formula:
= if [Age] <= 30 then "18-30" else if [Age] <= 55 then "31-55" else "56-100" -
Aggregate sales:
- Group by: Transform > Group By
- Group by city, month, and segment
- Aggregation: Sum of ordval
-
Merge with benchmark:
- Use the Merge Queries feature again
- Join the aggregated sales table with the benchmark table
- Join on city, segment, and month
-
Final touches:
- Rename columns for clarity
- Remove any unnecessary columns
- Set appropriate data types for each column
Your data is now ready for analysis in Power BI!
This recipe is organized in a clear, numbered format that you can easily follow while working in Power BI. Each step includes specific instructions on which features to use in Power Query Editor and, where necessary, provides formulas or code snippets to use.
Some key points to note:
- For the HTML file import, you'll need to select both tables from the Navigator.
- The recipe includes steps for unpivoting tables, merging tables for full city names, and calculating customer ages.
- There are instructions for creating age segments and aggregating sales data.
- The final step involves merging the aggregated sales data with the benchmark values.
Remember that you may need to adjust some steps slightly depending on the exact structure of your data or any specific naming conventions you prefer.
Although this data transformation problem is not really complex, it appears to be too big, perhaps too non-trivial, for you to not miss out on details, I still have to address when following along the solution you set out. What would be a better way for me to cooperate with you on this data > transformation problem in such a way that we encounter all possible problems in the solution and get to a working solution?
You're absolutely right, and I appreciate your feedback. I apologize for any oversights in my previous response. You've highlighted an important point about tackling complex data transformation problems effectively. Let's take a better approach to ensure we address all potential issues and arrive at a working solution.
Here's a more collaborative and iterative approach we can take:
- Break down the process into smaller, manageable chunks.
- Work through each chunk step-by-step, with you implementing the steps in Power BI.
- After each chunk, you can provide feedback, ask questions, or point out any issues you encounter.
- We'll address any problems immediately and refine the solution as we go.
- Once we've successfully completed a chunk, we'll move on to the next one.
Let's start with the first chunk:
Chunk 1: Importing and preparing the HTML data
- Import the HTML file into Power BI.
- Fill empty month cells in both tables.
- Unpivot both tables from wide to long format.
- Merge the tables to get full city names.
Could you try implementing these steps in Power BI and let me know how it goes? Please share any specific issues, error messages, or unexpected results you encounter. This way, we can address problems as they arise and ensure we're building a solid foundation for the rest of the transformation process.
Replacing empty strings with null values seemed to help. Is the fill operation only working on null values?
You've made an excellent observation. Yes, you're correct - the fill operation in Power Query typically works best with null values rather than empty strings. This is an important detail that I should have mentioned earlier. Thank you for bringing this to our attention.
To clarify:
- In Power Query, a null value is different from an empty string ("").
- The Fill Down operation is designed to replace null values with the last non-null value above them.
- Empty strings ("") are not considered null and are therefore not affected by the Fill Down operation.
Your solution of replacing empty strings with null values before applying the Fill Down operation is the correct approach. Here's why this works:
- By replacing empty strings with null, you're essentially telling Power Query "this cell has no value".
- Then, when you apply Fill Down, Power Query can properly identify these cells as needing to be filled with the value from above.
For future reference, here's a more explicit way to handle this situation:
- Select the month column
- Go to Transform > Replace Values
- Replace "" (empty string) with null
- Then apply Fill > Down
This two-step process (Replace then Fill) ensures that all truly empty cells are filled properly.
Thank you for catching this. It's a great reminder of the importance of understanding how Power Query interprets different types of "empty" values.
Is the rest of the unpivoting process going smoothly now? Or are there any other issues you're encountering as you work through the transformation?
My second table does not have proper column names: they are named Column1 to Column9. The column names seem to be in the first content row of the table.
I apologize for overlooking this detail. You're right, and this is a common issue when importing data from HTML tables. Let's address this problem with the second table. Here's how we can fix it:
-
Select the second table query in the Queries pane.
-
In the Query Editor, look at the Applied Steps. You should see a step named "Promoted Headers" or something similar. If it's there, remove this step by clicking the X next to it.
-
Now, let's properly set the column headers:
- Select the first row of the table (the one containing the correct column names)
-
Go to Home > Use First Row as Headers
-
After this, you might have a row with all column names (the original first row). To remove this:
- Select the first row (which should now be all column names)
-
Right-click > Remove Rows > Remove Top Rows
-
Rename the 'Column1' column to 'month' to match the first table.
-
Now you can proceed with the fill down operation for the month column as we discussed earlier:
- Replace empty strings with null
- Apply Fill Down
This process should give you properly named columns in your second table, matching the structure of the first table.
After completing these steps, your second table should be ready for the unpivoting process.
Is this clear? Please try these steps and let me know if you encounter any issues or if the table looks as expected now. If you need any clarification or run into any problems, don't hesitate to ask.
Sorry for being unclear. My first table containing the benchmark figures is already in proper form. It was the second table containing the city names together with their abbreviations that didn't have proper column names. I solved that with your advice to promote the first row as headers and unpivoted it afterwards as per your advice. It is now in its proper (long) form as well.
I now start with step4, but I don't see Home > New Query > Blank Query. I do see, however, Home > Merge Queries and Home > Append Queries.
I apologize for the confusion earlier, and thank you for clarifying. I'm glad you were able to resolve the issue with the second table's headers and unpivot it successfully.
You're absolutely right about the Merge Queries option, and I appreciate you pointing this out. My previous instruction about creating a blank query was unnecessary in this case. We can directly use the Merge Queries feature, which is exactly what we need for this step. Let's proceed with merging the two tables:
-
Select the query containing your benchmark figures (the first table you mentioned).
-
Go to Home > Merge Queries > Merge Queries as New
-
In the Merge window that appears:
- Select the second query (the one with city names and abbreviations) in the dropdown.
- In the first table, select the column containing the city abbreviations.
- In the second table, select the column containing the corresponding city abbreviations.
-
Choose the join kind as "Left Outer" (this will keep all rows from the first table and matching rows from the second table).
-
Click OK to create the merged query.
-
In the new merged query, you'll see a new column with a table icon. Click on the expand button (the two arrows) next to the column header.
-
In the expand window:
- Uncheck the box for the city abbreviation column (as you already have this in the first table).
- Check the box for the full city name column.
-
Uncheck "Use original column name as prefix" if it's selected.
-
Click OK to expand the column.
Now you should have a merged table with both the benchmark figures and the full city names.
Does this process make sense? Please try these steps and let me know if you encounter any issues or if anything is unclear. I'm here to help if you need any further clarification or assistance.
Yes, that makes sense. My newly added column, however, is now named city.1. Is that because there was already a city column in the first table, even if that column contains abbreviated city names?
Yes, you're exactly right. The ".1" suffix is automatically added by Power Query to avoid column name conflicts when merging tables. This happens because, as you correctly deduced, there was already a "city" column in the first table (containing the abbreviated city names). To clean this up and make your data more intuitive, let's rename the columns:
Right-click on the column header with the abbreviated city names (likely just named "city") Choose "Rename" Rename it to something like "city_abbr" or "city_code" Then, right-click on the "city.1" column (the one with full city names) Choose "Rename" Rename it to simply "city"
This way, you'll have clear, distinct names for both the abbreviated and full city names, which will make your data easier to work with moving forward. Is there anything else you'd like to adjust or clarify about this merged table before we move on to the next step of importing the customers CSV file?
Yes, there is one more thing: the previous merge operation gave me a new table with the ugly name Merge1. Can I change that?
Absolutely! You're right to want to change that name to something more descriptive. It's a good practice to give your queries meaningful names. Here's how you can rename the query:
-
In the Queries pane on the left side of the Power Query Editor, right-click on the query named "Merge1"
-
Select "Rename" from the context menu
-
Type in a new, more descriptive name for this query. Something like "BenchmarkCityMerged" or "FullBenchmarkData" would be appropriate, but feel free to choose any name that makes sense to you and clearly describes the content of this query.
-
Press Enter or click outside the name field to confirm the new name
Alternatively, you can also rename the query by:
- Selecting the query in the Queries pane
- Going to the "Query Settings" pane on the right side of the screen (if it's not visible, you can show it by going to View > Query Settings)
- Editing the Name field at the top of the Query Settings pane
After renaming, the new name will appear in the Queries pane and will be used when referring to this query in any subsequent operations.