
Revamp database #3766

Open

wants to merge 89 commits into base: release-4.0

Conversation

@ShiZhongming (Collaborator) commented on Jan 30, 2025

We are redesigning the database.

  • All Excel files are converted to .csv files.
  • We are also adding a verification feature to check whether every data point is present and conforms with the schema file.
  • If not, CEA will point out the problematic data point for the user to modify.

Soon, I will submit another PR updating all the CEA feature scripts to be compatible with the new database.
The two PRs can then be reviewed together.
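As a rough illustration of the planned verification step, the sketch below checks a .csv file against a small schema, reporting missing columns and out-of-range values per row. The helper name check_csv_against_schema and the schema fragment are hypothetical; the actual checks will follow the CEA schema file.

import pandas as pd

# Hypothetical schema fragment for illustration only; the real definitions live in the CEA schema file.
SCHEMA = {
    "columns": {
        "name": {"type": "string"},
        "Hs_ag": {"type": "float", "min": 0.0, "max": 1.0},
    }
}

def check_csv_against_schema(csv_path, schema):
    """Return a list of human-readable issues found in the CSV file."""
    df = pd.read_csv(csv_path)
    issues = []
    for col, spec in schema["columns"].items():
        if col not in df.columns:
            issues.append(f"Missing column: {col}")
            continue
        if "min" in spec:
            for idx in df.index[df[col] < spec["min"]]:
                issues.append(f"{col} is too low in row {idx}: {df.at[idx, col]}")
        if "max" in spec:
            for idx in df.index[df[col] > spec["max"]]:
                issues.append(f"{col} is too high in row {idx}: {df.at[idx, col]}")
    return issues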

Summary by CodeRabbit

  • New Features

    • Enhanced data migration and verification tools for CEA-4 format.
    • Added verbose logging options for migration and verification processes.
    • Improved schema validation for database and file formats.
    • Introduced new functions for better file and data verification.
  • Improvements

    • Updated verification functions to provide more detailed error reporting.
    • Added support for checking file integrity and schema compliance.
    • Implemented more robust error handling during data migration.
    • Enhanced output messages for better clarity on process status.
    • Improved handling of the 'building:levels' attribute in shapefiles.
  • Bug Fixes

    • Refined data type and column validation mechanisms.
    • Improved handling of file paths and database item conversions.
    • Adjusted logic for populating the 'reference' column in shapefiles.

@ShiZhongming ShiZhongming added the priority-must-have (These requirements are critical and must be included in CEA) and feature-database labels on Jan 30, 2025
@ShiZhongming ShiZhongming added this to the CEA4 - Mar'25 milestone Jan 30, 2025
@ShiZhongming ShiZhongming requested a review from reyery January 30, 2025 07:37

coderabbitai bot commented Jan 30, 2025


📥 Commits

Reviewing files that changed from the base of the PR and between 03b5aa9 and fc32904.

📒 Files selected for processing (1)
  • cea/interfaces/dashboard/api/inputs.py (2 hunks)

Walkthrough

This pull request introduces enhancements to the City Energy Analyst (CEA) data management format helper modules, focusing on data migration and verification processes. Key modifications include the addition of a verbose parameter for optional logging in migration functions, the implementation of new schema validation functions, and the creation of robust migration utilities for transitioning between CEA-3 and CEA-4 data formats. The updates provide improved control over data transformation and validation processes while retaining the overall control flow.

Changes

  • cea/datamanagement/format_helper/cea4_migrate.py: Added verbose parameter to migrate_dbf_to_csv and migrate_cea3_to_cea4 functions for logging. Updated print statements for success and error messages based on verbosity.
  • cea/datamanagement/format_helper/cea4_migrate_db.py: New script for database migration, including functions for Excel to CSV conversion, file processing, and migration orchestration. Added multiple new functions for handling migration tasks.
  • cea/datamanagement/format_helper/cea4_verify.py: Enhanced schema validation with new mapping dictionaries and added verify_file_against_schema_4 function. Updated existing verification functions to integrate schema validation.
  • cea/datamanagement/format_helper/cea4_verify_db.py: New module for comprehensive database verification with multiple validation functions. Added functions for path management, schema validation, and error reporting.
  • cea/datamanagement/format_helper/format_helper.py: Updated exec_cea_format_helper to manage both verification and migration processes, adjusting output messages and control flow.
  • cea/datamanagement/zone_helper.py: Modified assign_attributes function to improve handling of the 'building:levels' attribute and introduced a new 'reference' column for clarity.
  • cea/interfaces/dashboard/api/inputs.py: Changed data source from Excel to CSV in get_choices function.

Possibly related PRs

  • CEA4 Format Helper - Phase 1 #3753: The changes in the main PR are related to the modifications in the migrate_dbf_to_csv and migrate_cea3_to_cea4 functions, which are also present in the retrieved PR, indicating a direct connection in their functionality.

Suggested labels

feature-data-management, about-interface




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🔭 Outside diff range comments (3)
cea/datamanagement/format_helper/cea4_migrate.py (3)

Line range hint 203-227: Add error handling for file operations and edge cases.

The function has several areas that could benefit from improved error handling:

  1. The case where 'Name' is in list_missing_columns isn't properly handled
  2. File operations (to_csv, remove) should be wrapped in try-except blocks
  3. The duplicate names error message could be more specific

Consider this improved implementation:

 def migrate_dbf_to_csv(scenario, item, required_columns, verbose=False):
     list_missing_columns = verify_dbf_3(scenario, item, required_columns)
     if list_missing_columns:
         print(f'! Ensure column(s) are present in the {item}.dbf: {list_missing_columns}')
+        return False
+    
+    if 'Name' in list_missing_columns:
+        print(f'! Required column "Name" is missing in {item}.dbf')
+        return False
+
+    list_names_duplicated = verify_name_duplicates_3(scenario, item)
+    if list_names_duplicated:
+        duplicates_str = ', '.join(list_names_duplicated)
+        print(f'! Found duplicate building names in {item}.dbf: {duplicates_str}')
+        return False
+
+    try:
+        df = dbf_to_dataframe(path_to_input_file_without_db_3(scenario, item))
+        df.rename(columns=columns_mapping_dict_name, inplace=True)
+        df.rename(columns=columns_mapping_dict_typology, inplace=True)
+        
+        output_path = path_to_input_file_without_db_4(scenario, item)
+        df.to_csv(output_path, index=False)
+        
+        input_path = path_to_input_file_without_db_3(scenario, item)
+        os.remove(input_path)
+        
+        if verbose:
+            print(f'- {item}.dbf has been migrated from CEA-3 to CEA-4 format.')
+        return True
+    except Exception as e:
+        print(f'! Error during migration of {item}.dbf: {str(e)}')
+        return False
-    else:
-        if 'Name' not in list_missing_columns:
-            list_names_duplicated = verify_name_duplicates_3(scenario, item)
-            if list_names_duplicated:
-                print(f'! Ensure name(s) are unique in {item}.dbf: {list_names_duplicated} is duplicated.')
-            else:
-                df = dbf_to_dataframe(path_to_input_file_without_db_3(scenario, item))
-                df.rename(columns=columns_mapping_dict_name, inplace=True)
-                df.rename(columns=columns_mapping_dict_typology, inplace=True)
-                df.to_csv(path_to_input_file_without_db_4(scenario, item), index=False)
-                os.remove(path_to_input_file_without_db_3(scenario, item))
-                if verbose:
-                    print(f'- {item}.dbf has been migrated from CEA-3 to CEA-4 format.')

Line range hint 233-374: Consider breaking down the large function into smaller, focused functions.

The migrate_cea3_to_cea4 function is quite long (140+ lines) and handles multiple responsibilities. This makes it harder to maintain and test.

Consider breaking it down into smaller functions:

  1. migrate_zone_files
  2. migrate_surroundings_files
  3. migrate_building_properties

Example structure:

def migrate_zone_files(scenario, verbose=False):
    """Handle migration of zone.shp and related files."""
    # Extract zone.shp migration logic here
    pass

def migrate_surroundings_files(scenario, verbose=False):
    """Handle migration of surroundings.shp."""
    # Extract surroundings.shp migration logic here
    pass

def migrate_building_properties(scenario, verbose=False):
    """Handle migration of building properties files."""
    # Extract building properties migration logic here
    pass

def migrate_cea3_to_cea4(scenario, verbose=False):
    """Main migration function that orchestrates the migration process."""
    dict_missing = cea4_verify(scenario)
    if all(not value for value in dict_missing.values()):
        return
        
    migrate_zone_files(scenario, verbose)
    migrate_surroundings_files(scenario, verbose)
    migrate_building_properties(scenario, verbose)

Line range hint 362-362: Fix typo in variable name.

There's a typo in the variable name used in the error message.

-            print('! Ensure column(s) are present in supply_systems.csv: {list_missing_columns_supply_system_4}.'.format(list_missing_columns_supply_system_4=list_missing_columns_supply_systems_4))
+            print(f'! Ensure column(s) are present in supply_systems.csv: {list_missing_columns_supply_systems_4}')
🧹 Nitpick comments (17)
cea/datamanagement/format_helper/cea4_verify.py (6)

11-12: Consider a minimal dependency approach for type checks.
NumPy is imported solely for numeric type checks, which can also be performed using Python's built-in int and float without additional overhead. If there's no plan to leverage NumPy's extended functionality, consider removing this import to streamline dependencies.

- import numpy as np

28-39: Centralize or unify locator mappings for maintainability.
These mappings are convenient, but consider centralizing them in a single constants file or enumerating items in one place so that references stay consistent across the codebase. This helps avoid duplication and ensures easier updates when new items or schema locators are added.
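One possible shape for such a shared constants module is sketched below; the module name and the example entries are illustrative only, and the real keys would come from the existing locator methods.

# format_constants.py (suggested module, does not exist yet)
MAPPING_ITEM_TO_LOCATOR = {
    "zone": "get_zone_geometry",                  # example entry
    "surroundings": "get_surroundings_geometry",  # example entry
}

def locator_for(item):
    """Look up the schema locator for an input item, failing loudly on unknown items."""
    try:
        return MAPPING_ITEM_TO_LOCATOR[item]
    except KeyError:
        raise ValueError(f"No schema locator registered for item '{item}'")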


40-50: Use a sentinel value for missing id_column instead of an empty string.
Relying on empty strings for items like terrain, weather, or streets could lead to accidental misuse via df.at[idx, ""] if overlooked. Consider assigning a meaningful sentinel (e.g., None) or skipping ID-based checks for those items.

- mapping_dict_input_item_to_id_column = {'terrain': '', 'weather': '', ...}
+ mapping_dict_input_item_to_id_column = {'terrain': None, 'weather': None, ...}

125-125: Remove unused col_attr assignment.
According to static analysis (F841), col_attr is assigned but never used. Remove it to clean up extraneous code.

- col_attr = 'Attribute'
🧰 Tools
🪛 Ruff (0.8.2)

125-125: Local variable col_attr is assigned to but never used

Remove assignment to unused variable col_attr

(F841)


309-329: Unify CSV verification to avoid duplication.
This function and verify_file_against_schema_4 partially overlap in intent (checking columns, verifying naming). Consider leveraging the schema-based approach from verify_file_against_schema_4 to handle CSV structure consistently, which increases uniformity and reduces maintenance overhead.


419-432: Potential improvement in aggregating errors.
Currently, partial errors or missing columns are printed if verbose is true but not aggregated thoroughly. Consider returning a combined list of errors to facilitate automated checks or CI validations. This supports better integration with external analytics.
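A minimal sketch of such aggregation, assuming a per-item check callable that returns a list of issue strings (the helper name is hypothetical):

def collect_verification_issues(scenario, items, check_item, verbose=False):
    """Aggregate issues from a per-item check into one list, e.g. for CI or log files."""
    all_issues = []
    for item in items:
        issues = check_item(scenario, item)  # any callable returning a list of strings
        all_issues.extend(f"{item}: {issue}" for issue in issues)
        if verbose and issues:
            print(f"! {len(issues)} issue(s) found in {item}")
    return all_issues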

cea/datamanagement/format_helper/cea4_migrate_db.py (4)

2-2: Fix the spelling in the docstring.

The docstring has "Mirgate" instead of "Migrate."

- Mirgate the format of the DB to CEA-4 format.
+ Migrate the format of the DB to CEA-4 format.

27-49: Add documentation for the rename_dict usage.

Although the dictionary entries are fairly self-explanatory, a brief comment or docstring explaining the rationale and usage of rename_dict would help maintainability.
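For example, a short comment block above the dictionary could state what it maps and where it is applied; the entries below are placeholders, not the actual mapping.

# Maps CEA-3 column headers (keys) to their CEA-4 equivalents (values).
# Applied via DataFrame.rename(columns=rename_dict) during migration, so any
# column not listed here keeps its original name.
rename_dict = {
    "OLD_COLUMN_A": "new_column_a",  # placeholder entry
    "OLD_COLUMN_B": "new_column_b",  # placeholder entry
}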


58-84: Consider simplifying the if-elif chain.

path_to_db_file_3 relies on many if-elif statements. A dictionary-based approach can reduce complexity and enhance maintainability.

-    if item == 'technology':
-        path_db_file = os.path.join(scenario, "inputs", "technology")
-    elif item == "CONSTRUCTION_STANDARD":
-        path_db_file = os.path.join(scenario, "inputs", "technology", "archetypes", "CONSTRUCTION_STANDARD.xlsx")
-    ...
-    else:
-        raise ValueError(f"Unknown item {item}")
+    path_map = {
+        'technology': os.path.join(scenario, "inputs", "technology"),
+        'CONSTRUCTION_STANDARD': os.path.join(scenario, "inputs", "technology", "archetypes", "CONSTRUCTION_STANDARD.xlsx"),
+        # ... other mappings ...
+    }
+    try:
+        path_db_file = path_map[item]
+    except KeyError:
+        raise ValueError(f"Unknown item {item}")

91-136: Validate file operations in excel_tab_to_csv.

The function reads multiple sheets in an Excel file, renames columns, and saves CSVs. Adding more explicit error handling (e.g., retry or fallback) can improve robustness if any sheet is malformed.
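A sketch of a per-sheet guard that skips malformed sheets and reports them instead of aborting the whole conversion (function and path names are placeholders):

import os
import pandas as pd

def excel_tabs_to_csv_safe(path_excel, directory_csv):
    """Convert each sheet of a workbook to CSV, collecting failures instead of raising."""
    failed_sheets = []
    excel_file = pd.ExcelFile(path_excel)
    for sheet_name in excel_file.sheet_names:
        try:
            df = excel_file.parse(sheet_name)
            df.to_csv(os.path.join(directory_csv, f"{sheet_name}.csv"), index=False)
        except Exception as exc:
            failed_sheets.append((sheet_name, str(exc)))
    return failed_sheets  # the caller decides whether any failure is fatal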

🧰 Tools
🪛 Ruff (0.8.2)

122-122: f-string without any placeholders

Remove extraneous f prefix

(F541)

🪛 GitHub Actions: Ruff

[error] 122-122: F541: f-string without any placeholders. Remove extraneous f prefix

cea/datamanagement/format_helper/cea4_verify_db.py (3)

25-46: Document or centralize global constants.

Constants like ARCHETYPES, SCHEDULES_FOLDER, and others would benefit from additional context or referencing a configuration module. This reduces the risk of silent mismatches.


136-185: Avoid repetitive path-building logic.

path_to_db_file_4 duplicates the large, conditional approach from path_to_db_file_3. You could unify or parameterize them in a single utility to reduce code repetition.


551-567: Consolidate check_directory_contains_csv usage.

The same function appears in cea4_migrate_db and here. Refactoring it into a shared utility module will improve maintainability and consistency.
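The shared helper could be as small as the sketch below (the module location is a suggestion, not an existing file):

# e.g. cea/datamanagement/format_helper/shared.py (suggested location)
import os

def check_directory_contains_csv(directory):
    """Return True if the directory exists and holds at least one .csv file."""
    if not os.path.isdir(directory):
        return False
    return any(name.lower().endswith(".csv") for name in os.listdir(directory))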

cea/datamanagement/format_helper/format_helper.py (2)

25-42: Handle partial migration or verification.

When bool_migrate is True, the script fully migrates then verifies. If partial data is missing or only some items need re-checking, consider selectively processing them to optimize time and reduce potential file conflicts.


98-118: Provide richer logging or reports.

The final summary prints either success or partial failure but doesn't output the details in a structured way. Generating a more detailed log file or structured JSON could help users diagnose issues quickly.
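For instance, the verification outcome could be written next to the scenario as a machine-readable report; the file name and dictionary shape below are assumptions, not part of the PR:

import json
import os

def write_verification_report(scenario, dict_missing, dict_missing_db):
    """Persist the verification outcome as JSON so users and CI can inspect it later."""
    report = {
        "scenario": os.path.basename(scenario),
        "inputs": dict_missing,        # assumed: item -> list of problems
        "database": dict_missing_db,   # assumed: item -> list of problems
    }
    report_path = os.path.join(scenario, "cea4_format_report.json")  # hypothetical file name
    with open(report_path, "w") as f:
        json.dump(report, f, indent=2)
    return report_path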

cea/datamanagement/format_helper/cea4_migrate.py (2)

276-277: Fix inconsistent string formatting.

The error messages use inconsistent string formatting styles. Consider using f-strings consistently throughout the code.

-                        raise ValueError('! typology.dbf exists but does not follow the CEA-3 format. CEA cannot proceed with the data migration. '
-                                         'Check the following column(s) for CEA-3 format: {list_missing_attributes_typology_3}.'.format(list_missing_attributes_typology_3=list_missing_attributes_typology_3)
-                                         )
+                        raise ValueError(f'! typology.dbf exists but does not follow the CEA-3 format. CEA cannot proceed with the data migration. '
+                                       f'Check the following column(s) for CEA-3 format: {list_missing_attributes_typology_3}')

-                raise ValueError('! zone.shp exists but follows neither the CEA-3 nor CEA-4 format. CEA cannot proceed with the data migration.'
-                                 'Check the following column(s) for CEA-3 format: {list_missing_attributes_zone_3}.'.format(list_missing_attributes_zone_3=list_missing_attributes_zone_3),
-                                 'Check the following column(s) for CEA-4 format: {list_missing_attributes_zone_4}.'.format(list_missing_attributes_zone_4=list_missing_attributes_zone_4)
-                                 )
+                raise ValueError(f'! zone.shp exists but follows neither the CEA-3 nor CEA-4 format. CEA cannot proceed with the data migration. '
+                               f'Check the following column(s) for CEA-3 format: {list_missing_attributes_zone_3}. '
+                               f'Check the following column(s) for CEA-4 format: {list_missing_attributes_zone_4}')

Also applies to: 286-289


373-374: Remove commented-out print statements.

These commented-out print statements should be removed as they add noise to the code.

-        # # Print: End
-        # print('-' * 49)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dcc4090 and 3dd77bc.

📒 Files selected for processing (5)
  • cea/datamanagement/format_helper/cea4_migrate.py (10 hunks)
  • cea/datamanagement/format_helper/cea4_migrate_db.py (1 hunks)
  • cea/datamanagement/format_helper/cea4_verify.py (6 hunks)
  • cea/datamanagement/format_helper/cea4_verify_db.py (1 hunks)
  • cea/datamanagement/format_helper/format_helper.py (4 hunks)
🧰 Additional context used
🪛 GitHub Actions: Ruff
cea/datamanagement/format_helper/format_helper.py

[error] 11-11: F401: cea.datamanagement.format_helper.cea4_verify.print_verification_results_4 imported but unused. Remove unused import

cea/datamanagement/format_helper/cea4_migrate_db.py

[error] 122-122: F541: f-string without any placeholders. Remove extraneous f prefix

🪛 Ruff (0.8.2)
cea/datamanagement/format_helper/cea4_verify.py

125-125: Local variable col_attr is assigned to but never used

Remove assignment to unused variable col_attr

(F841)

cea/datamanagement/format_helper/cea4_migrate_db.py

122-122: f-string without any placeholders

Remove extraneous f prefix

(F541)


333-333: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: windows-latest
  • GitHub Check: macos-latest
  • GitHub Check: ubuntu-latest
🔇 Additional comments (5)
cea/datamanagement/format_helper/cea4_verify.py (2)

397-408: Shapefile and file existence checks appear correct.
Verifications for terrain, weather, and streets are straightforward and handle missing files appropriately.


457-457: Ensure top-level usage matches intended verbosity configuration.
Calling cea4_verify(scenario, verbose=True) correctly aligns with the new signature. If further adjustments to verbosity or error aggregation are made, remain consistent across invocation points.

cea/datamanagement/format_helper/cea4_verify_db.py (1)

55-96: Verify correctness of mapping dictionaries.

Ensure the item-to-schema mappings are updated whenever you add or remove data categories. Inconsistent mappings will cause runtime errors or invalid file lookups.

To confirm usage across the codebase, consider a script:

✅ Verification successful

Mapping dictionary verification completed successfully.

The item-to-schema mappings are correctly defined and consistent with the database structure. All mapped items have:

  • Corresponding schema definitions in cea/schemas.yml
  • Appropriate getter functions in the codebase
  • Proper file paths and required fields defined
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Searching references for each mapping dictionary key.
rg "CONSTRUCTION_TYPE" 
rg "USE_TYPE"
rg "SCHEDULES"
# Add more as needed

Length of output: 11657


Script:

#!/bin/bash
# Get all schema definitions
echo "=== Schema Definitions ==="
cat cea/schemas.yml

# Search for all getter functions referenced in the mapping
echo -e "\n=== Function References ==="
rg "get_database_assemblies_envelope|get_database_assemblies_hvac|get_database_assemblies_supply|get_database_components_conversion|get_database_components_distribution|get_database_components_feedstocks"

Length of output: 65705


Script:

#!/bin/bash
# Search for database schema definitions
echo "=== Database Schemas ==="
find . -type f -name "*.yml" -exec grep -l "file_path: inputs/database" {} \;

# Search for getter function implementations
echo -e "\n=== Getter Functions ==="
rg "def get_database_" --type py

Length of output: 1479


Script:

#!/bin/bash
# Get the full schema definitions for database files
echo "=== Database Schema Definitions ==="
grep -A 50 "file_path: inputs/database" cea/schemas.yml

Length of output: 63432

cea/datamanagement/format_helper/format_helper.py (2)

65-74: Validate scenario list usage.

If scenarios_to_verify_and_migrate is empty or incorrectly set, the loop is skipped. Consider warning the user or defaulting to the current directory to avoid confusion.


10-10: ⚠️ Potential issue

Remove the unused import.

print_verification_results_4 is unused, causing a lint error (F401). Removing or using it will silence the warning.

-from cea.datamanagement.format_helper.cea4_verify import cea4_verify, print_verification_results_4
+from cea.datamanagement.format_helper.cea4_verify import cea4_verify

Likely invalid or redundant comment.

Comment on lines +140 to +141
if id_column not in missing_columns:


⚠️ Potential issue

Dead condition or missing code block.
The if id_column not in missing_columns: check is immediately followed by a blank line, suggesting a missing statement or indentation. If the intention is to skip certain checks when the id_column is missing, ensure the logic is clearly implemented.
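If skipping is the intent, the guard could be made explicit along these lines (a sketch of the intended control flow, not the PR's code):

def report_invalid_rows(df, col_name, invalid_mask, id_column, errors):
    """Append one error per invalid row, using the id column only when it is available."""
    if id_column and id_column in df.columns:
        for idx in df.index[invalid_mask]:
            errors.append(f"Invalid value in '{col_name}' for row '{df.at[idx, id_column]}': {df.at[idx, col_name]}")
    elif invalid_mask.any():
        errors.append(f"Invalid values found in '{col_name}' (no identifier column available)")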

Comment on lines +92 to +103
"""
Validate a file against a schema section in a YAML file.

Parameters:
- scenario (str): Path to the scenario.
- item (str): Locator for the file to validate (e.g., 'get_zone_geometry').
- self: Reference to the calling class/module.
- verbose (bool, optional): If True, print validation errors to the console.

Returns:
- List[dict]: List of validation errors.
"""

🛠️ Refactor suggestion

Docstring parameters do not match the function signature.
The docstring mentions verbose but the function signature omits it. Either remove verbose from the docstring or add it to the function signature if you intend to support console-based output.

 def verify_file_against_schema_4(scenario, item):
     """
-    Validate a file against a schema section in a YAML file.
+    Validate a file against a schema section in a YAML file.
 
     ...
-    - verbose (bool, optional): If True, print validation errors to the console.
 
     Returns:
     - List[dict]: List of validation errors.
     """

Comment on lines +337 to +393
def cea4_verify(scenario, verbose=False):

#1. about zone.shp and surroundings.shp
list_missing_attributes_zone = []
list_missing_attributes_surroundings = []
list_missing_files_shp_building_geometry = verify_file_exists_4(scenario, SHAPEFILES)

if 'zone' not in list_missing_files_shp_building_geometry:
list_missing_attributes_zone = verify_shp(scenario, 'zone', COLUMNS_ZONE_4)
list_missing_attributes_zone, list_issues_against_schema_zone = verify_file_against_schema_4(scenario, 'zone')
if list_missing_attributes_zone:
if print_results:
print('! Ensure attribute(s) are present in zone.shp: {missing_attributes_zone}'.format(missing_attributes_zone=list_missing_attributes_zone))
if 'name' not in list_missing_attributes_zone:
list_names_duplicated = verify_name_duplicates_4(scenario, 'zone')
if list_names_duplicated:
if print_results:
print('! Ensure name(s) are unique in zone.shp: {list_names_duplicated} is duplicated.'.format(list_names_duplicated=list_names_duplicated))
if verbose:
print('! Ensure attribute(s) are present in zone.shp: {missing_attributes_zone}.'.format(missing_attributes_zone=', '.join(map(str, list_missing_attributes_zone))))
if list_issues_against_schema_zone:
print('! Check values in zone.shp:')
print("\n".join(f" {item}" for item in list_issues_against_schema_zone))

if 'name' not in list_missing_attributes_zone:
list_names_duplicated = verify_name_duplicates_4(scenario, 'zone')
if list_names_duplicated:
if verbose:
print('! Ensure name(s) are unique in zone.shp: {list_names_duplicated} is duplicated.'.format(list_names_duplicated=', '.join(map(str, list_names_duplicated))))

if 'surroundings' not in list_missing_files_shp_building_geometry:
list_missing_attributes_surroundings = verify_shp(scenario, 'surroundings', COLUMNS_SURROUNDINGS_4)
list_missing_attributes_surroundings, list_issues_against_schema_surroundings = verify_file_against_schema_4(scenario, 'surroundings')
if list_missing_attributes_surroundings:
if print_results:
print('! Ensure attribute(s) are present in surroundings.shp: {missing_attributes_surroundings}'.format(missing_attributes_surroundings=list_missing_attributes_surroundings))
if 'name' not in list_missing_attributes_surroundings:
list_names_duplicated = verify_name_duplicates_4(scenario, 'surroundings')
if list_names_duplicated:
if print_results:
print('! Ensure name(s) are unique in surroundings.shp: {list_names_duplicated} is duplicated.'.format(list_names_duplicated=list_names_duplicated))
if verbose:
print('! Ensure attribute(s) are present in surroundings.shp: {missing_attributes_surroundings}.'.format(missing_attributes_surroundings=', '.join(map(str, list_missing_attributes_surroundings))))
if list_issues_against_schema_surroundings:
print('! Check values in surroundings.shp:')
print("\n".join(f" {item}" for item in list_issues_against_schema_surroundings))
if 'name' not in list_missing_attributes_surroundings:
list_names_duplicated = verify_name_duplicates_4(scenario, 'surroundings')
if list_names_duplicated:
if verbose:
print('! Ensure name(s) are unique in surroundings.shp: {list_names_duplicated} is duplicated.'.format(list_names_duplicated=', '.join(map(str, list_names_duplicated))))

#2. about .csv files under the "inputs/building-properties" folder
list_missing_columns_air_conditioning = []
list_missing_columns_architecture = []
list_missing_columns_indoor_comfort = []
list_missing_columns_internal_loads = []
list_missing_columns_supply_systems = []
dict_list_missing_columns_csv_building_properties = {}

list_missing_files_csv_building_properties = verify_file_exists_4(scenario, CSV_BUILDING_PROPERTIES_4)
if list_missing_files_csv_building_properties:
if print_results:
print('! Ensure .csv file(s) are present in the building-properties folder: {missing_files_csv_building_properties}'.format(missing_files_csv_building_properties=list_missing_files_csv_building_properties))

for item, columns in [
('air_conditioning', COLUMNS_AIR_CONDITIONING_4),
('architecture', COLUMNS_ARCHITECTURE_4),
('indoor_comfort', COLUMNS_INDOOR_COMFORT_4),
('internal_loads', COLUMNS_INTERNAL_LOADS_4),
('supply_systems', COLUMNS_SUPPLY_SYSTEMS_4)
]:
if verbose:
print('! Ensure .csv file(s) are present in the building-properties folder: {missing_files_csv_building_properties}.'.format(missing_files_csv_building_properties=', '.join(map(str, list_missing_files_csv_building_properties))))

for item in ['air_conditioning', 'architecture', 'indoor_comfort', 'internal_loads', 'supply_systems']:
if item not in list_missing_files_csv_building_properties:
locals()[f'list_missing_columns_{item}'] = verify_csv_file(
scenario, item, columns, print_results=print_results
)
list_missing_columns_csv_building_properties, list_issues_against_csv_building_properties = verify_file_against_schema_4(scenario, item)
dict_list_missing_columns_csv_building_properties[item] = list_missing_columns_csv_building_properties
if verbose:
if list_missing_columns_csv_building_properties:
print('! Ensure column(s) are present in {item}.csv: {missing_columns}.'.format(item=item, missing_columns=', '.join(map(str, list_missing_columns_csv_building_properties))))
if list_issues_against_csv_building_properties:
print('! Check values in {item}.csv: ')
print("\n".join(f" {item}" for item in list_issues_against_csv_building_properties))
else:
dict_list_missing_columns_csv_building_properties[item] = []


🛠️ Refactor suggestion

Maintain consistent naming for “columns” vs. “attributes.”
The function returns list_missing_attributes_zone and list_issues_against_schema_zone, although the schema verification logic deals with columns. This can confuse users, particularly for shapefiles and CSVs. Standardize to “columns” or “fields” for clarity across different data formats.

- list_missing_attributes_zone, list_issues_against_schema_zone
+ list_missing_columns_zone, list_issues_against_schema_zone

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 342 to 420
def migrate_cea3_to_cea4_db(scenario):

#0. verify if everything is already in the correct format for CEA-4
dict_missing = cea4_verify_db(scenario)
if all(not value for value in dict_missing.values()):
pass
else:
# Verify missing files for CEA-3 and CEA-4 formats
list_problems_construction_type = dict_missing.get('CONSTRUCTION_TYPE')
list_problems_use_type = dict_missing.get('USE_TYPE')
list_problems_schedules = dict_missing.get('SCHEDULES')
list_problems_envelope = dict_missing.get('ENVELOPE')
list_problems_hvac = dict_missing.get('HVAC')
list_problems_supply = dict_missing.get('SUPPLY')
list_problems_conversion = dict_missing.get('CONVERSION')
list_problems_distribution = dict_missing.get('DISTRIBUTION')
list_problems_feedstocks = dict_missing.get('FEEDSTOCKS')

#1. about archetypes - construction types
path_3 = path_to_db_file_3(scenario, 'CONSTRUCTION_STANDARD')
if list_problems_construction_type and os.path.isfile(path_3):
path_csv = path_to_db_file_4(scenario, 'CONSTRUCTION_TYPE')
merge_excel_tab_to_csv(path_3, 'STANDARD', path_csv, rename_dict=rename_dict)

#2. about archetypes - use types
path_3 = path_to_db_file_3(scenario, 'USE_TYPE_PROPERTIES')
if list_problems_use_type and os.path.isfile(path_3):
path_csv = path_to_db_file_4(scenario, 'USE_TYPE')
merge_excel_tab_to_csv(path_3, 'code', path_csv)

if list_problems_schedules:
move_txt_modify_csv_files(scenario)

#3. about assemblies
path_3 = path_to_db_file_3(scenario, 'ENVELOPE')
if list_problems_envelope and os.path.isfile(path_3):
excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'ENVELOPE'), rename_dict=rename_dict)
path_3 = path_to_db_file_3(scenario, 'HVAC')
if list_problems_hvac and os.path.isfile(path_3):
excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'HVAC'), rename_dict=rename_dict)
path_3 = path_to_db_file_3(scenario, 'SUPPLY')
if list_problems_supply and os.path.isfile(path_3):
excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'SUPPLY'), rename_dict=rename_dict)

#4. about components
path_3 = path_to_db_file_3(scenario, 'CONVERSION')
if list_problems_conversion and os.path.isfile(path_3):
excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'CONVERSION'), rename_dict=rename_dict)
path_3 = path_to_db_file_3(scenario, 'DISTRIBUTION')
if list_problems_distribution and os.path.isfile(path_3):
excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'DISTRIBUTION'), rename_dict=rename_dict)
path_3 = path_to_db_file_3(scenario, 'FEEDSTOCKS')
if list_problems_feedstocks and os.path.isfile(path_3):
excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'FEEDSTOCKS'), rename_dict=rename_dict)

# # Print: End
# print('-' * 49)


## --------------------------------------------------------------------------------------------------------------------
## Main function
## --------------------------------------------------------------------------------------------------------------------


def main(config):
# Start the timer
t0 = time.perf_counter()
assert os.path.exists(config.general.project), 'input file not found: %s' % config.project

scenario = config.scenario
scenario_name = os.path.basename(scenario)

# Print: Start
div_len = 37 - len(scenario_name)
print('-' * 50)
print("-" * 1 + ' Scenario: {scenario} '.format(scenario=scenario_name) + "-" * div_len)

# Execute the migration
migrate_cea3_to_cea4_db(scenario)

🛠️ Refactor suggestion

Reduce the size of migrate_cea3_to_cea4_db.

This function coordinates multiple steps (verifying DB, merging sheets, etc.). Splitting it into smaller helper functions can improve readability and maintainability.


# Handle the special case of renaming
if sheet_name == 'CONSTRUCTION':
output_path = os.path.join(directory_csv, f"MASS.csv")

⚠️ Potential issue

Remove extraneous f-string prefix.

The pipeline flagged an f-string without placeholders (F541). Removing the f will fix the lint error.

- output_path = os.path.join(directory_csv, f"MASS.csv")
+ output_path = os.path.join(directory_csv, "MASS.csv")
🧰 Tools
🪛 Ruff (0.8.2)

122-122: f-string without any placeholders

Remove extraneous f prefix

(F541)

🪛 GitHub Actions: Ruff

[error] 122-122: F541: f-string without any placeholders. Remove extraneous f prefix

Comment on lines +333 to +334
except Exception as e:
pass

⚠️ Potential issue

Remove unused exception variable.

The local variable e is assigned but never used (F841). Either handle it or omit the variable.

-except Exception as e:
+except Exception:
    pass
🧰 Tools
🪛 Ruff (0.8.2)

333-333: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

Comment on lines +241 to +278
for col_name, col_specs in schema_columns.items():
if col_name not in df.columns:
missing_columns.append(col_name)
continue

if id_column not in missing_columns:

col_data = df[col_name]

# Check type
if col_specs['type'] == 'string':
invalid = ~col_data.apply(lambda x: isinstance(x, str) or pd.isnull(x))
elif col_specs['type'] == 'int':
invalid = ~col_data.apply(lambda x: isinstance(x, (int, np.integer)) or pd.isnull(x))
elif col_specs['type'] == 'float':
invalid = ~col_data.apply(lambda x: isinstance(x, (float, int, np.floating, np.integer)) or pd.isnull(x))
else:
invalid = pd.Series(False, index=col_data.index) # Unknown types are skipped

for idx in invalid[invalid].index:
identifier = df.at[idx, id_column]
errors.append({col_attr: col_name, "Issue": "Invalid type", "Row": identifier, "Value": col_data[idx]})
errors.append(f"The {col_name} value for row {identifier} is invalid ({col_data[idx]}). Please check the data type.")

# Check range
if 'min' in col_specs:
out_of_range = col_data[col_data < col_specs['min']]
for idx, value in out_of_range.items():
identifier = df.at[idx, id_column]
errors.append(f"The {col_name} value for row {identifier} is too low ({value}). It should be at least {col_specs['min']}.")

if 'max' in col_specs:
out_of_range = col_data[col_data > col_specs['max']]
for idx, value in out_of_range.items():
identifier = df.at[idx, id_column]
errors.append(f"The {col_name} value for row {identifier} is too high ({value}). It should be at most {col_specs['max']}.")



🛠️ Refactor suggestion

Unify dictionary- and string-based error reporting.

The code sometimes appends dict objects and sometimes appends string messages to errors. Adopting a single structure (e.g., always a dict) can simplify logs and downstream error handling.
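One way to standardize is a tiny error-record constructor used everywhere (a hypothetical helper, not part of the PR):

def make_error(column, issue, row=None, value=None):
    """Build a uniform error record so downstream reporting can rely on one shape."""
    return {"Column": column, "Issue": issue, "Row": row, "Value": value}

# Usage: errors.append(make_error("Hs_ag", "Invalid type", row="B1001", value="abc"))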

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (10)
cea/datamanagement/format_helper/cea4_migrate_db.py (5)

1-4: Fix minor spelling in docstring heading.

Line 2 says "Mirgate" instead of "Migrate." This small change will improve clarity.

-"""
-Mirgate the format of the DB to CEA-4 format.
- 
-"""
+"""
+Migrate the format of the DB to CEA-4 format.
+
+"""

60-82: Convert large if-elif chain to a dynamic lookup.

Consider using a dictionary mapping each item to its corresponding path instead of multiple conditional branches. This will make the code more concise and easier to maintain.

paths_map = {
    'technology': os.path.join(scenario, "inputs", "technology"),
    'CONSTRUCTION_STANDARD': os.path.join(scenario, "inputs", "technology", "archetypes", "CONSTRUCTION_STANDARD.xlsx"),
    ...
}
try:
    path_db_file = paths_map[item]
except KeyError:
    raise ValueError(f"Unknown item {item}")

91-99: Refine Excel reading and docstring details.

  1. Reuse the loaded excel_data object to parse sheets instead of calling pd.read_excel again, which can be more efficient for large files:
    -df = pd.read_excel(path_excel, sheet_name=sheet_name)
    +df = excel_data.parse(sheet_name=sheet_name)
  2. The docstring omits the verbose parameter description. Add a short explanation to maintain completeness.

Also applies to: 106-110, 113-136


229-312: Split the function into smaller helpers.

move_txt_modify_csv_files performs multiple tasks:

  • Moves .txt files
  • Processes .csv files
  • Compiles rows into a consolidated DataFrame

Splitting them into separate, cohesive functions will make the code more readable and testable.


405-449: Consider adopting a structured logging approach.

Because this is a user-facing script that prints progress, using Python’s built-in logging module and different log levels (INFO, WARNING, ERROR) could help maintain clarity as the project grows.
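A minimal sketch with the standard-library logger (logger name and helper are illustrative):

import logging

logger = logging.getLogger("cea4_migrate_db")
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

def log_migration_step(item, ok, detail=""):
    """Report one migration step at the appropriate log level instead of a bare print()."""
    if ok:
        logger.info("Migrated %s to CEA-4 format. %s", item, detail)
    else:
        logger.error("Failed to migrate %s: %s", item, detail)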

cea/datamanagement/format_helper/format_helper.py (5)

65-70: Provide more context in the printed messages.

The lines printing user-facing messages (65-70) are helpful but consider adding a note about where to find logs or further instructions for verifying or migrating data, improving user guidance.


83-98: Streamline scenario loop.

You skip hidden directories and files, which is sensible. However, consider documenting or logging that skip logic to inform users about directories or files not processed. This can help in debugging unexpected omissions.


89-97: Consider merging dict_missing and dict_missing_db earlier.

By merging them right after each call, you could simplify the logic of partial or complete successes. This is optional but might reduce branching complexity.
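For example, the two result dictionaries could be folded into one right after verification; the sketch assumes both map item names to lists of problems:

def merge_problem_dicts(dict_missing, dict_missing_db):
    """Combine input and database verification results into a single item -> problems mapping."""
    merged = {}
    for source in (dict_missing, dict_missing_db):
        for item, problems in source.items():
            merged.setdefault(item, []).extend(problems or [])
    return merged

# A scenario passes when no item has outstanding problems:
# all_clear = not any(merge_problem_dicts(dict_missing, dict_missing_db).values())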


100-106: Document scenario problems more systematically.

Instead of storing dict_result objects in a log list, consider writing them to a structured JSON or CSV file. This can help with automated analysis and better record-keeping.


114-118: Maintain a single exit message.

Lines 114-118 display distinct messages for migration vs. verification. If the logic grows more complex, consider using a single standard completion message with parameter placeholders to keep the logs consistent.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3dd77bc and 4ca2b65.

📒 Files selected for processing (2)
  • cea/datamanagement/format_helper/cea4_migrate_db.py (1 hunks)
  • cea/datamanagement/format_helper/format_helper.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
cea/datamanagement/format_helper/cea4_migrate_db.py

333-333: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: windows-latest
  • GitHub Check: macos-latest
🔇 Additional comments (4)
cea/datamanagement/format_helper/cea4_migrate_db.py (3)

333-334: Remove assignment to unused exception variable
(e.g., replace except Exception as e: with except Exception:).

The local variable e is never used. Either handle it or omit the variable to avoid linter warnings.

🧰 Tools
🪛 Ruff (0.8.2)

333-333: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


342-449: Reduce the size of migrate_cea3_to_cea4_db.

Splitting the function into smaller, well-defined helper functions (e.g., for verifying or migrating each data type) will improve maintainability and readability, as previously noted in older reviews.


170-190: Verify handling of duplicate rows after merging.

When merging sheets with pd.merge(..., how='outer'), multiple rows could arise if the same key exists across sheets. Ensure no unintended duplicates appear.

You can run a quick check with this script to identify repeated keys in the merged DataFrame:

The script prints any duplicated lines. If the output is empty, there are no duplicates.

✅ Verification successful

Merge operation is properly implemented and safe from duplicates

The implementation safely handles potential duplicates because:

  • The merge operation uses a verified key column that exists in all sheets
  • Column names are properly managed with prefixes for assembly sheets
  • Pandas' merge operation inherently handles matching keys correctly with outer joins
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Searching for duplicated rows in merged CSV outputs of the codebase
fd -e csv -E '.*_ASSEMBLIES.*' --exec bash -c '
  for f; do
    echo "Checking for duplicates in $f"
    tail -n +2 "$f" | sort | uniq -d
  done
' {} +

Length of output: 5735


Script:

#!/bin/bash
# Search for test files related to this functionality
echo "=== Searching for related test files ==="
rg -l "cea4_migrate_db" --type py

echo -e "\n=== Looking for CSV files in data management directories ==="
fd -e csv . cea/datamanagement

echo -e "\n=== Checking implementation details ==="
rg -A 5 -B 5 "pd\.merge\(" --type py

Length of output: 13685

cea/datamanagement/format_helper/format_helper.py (1)

25-42: Ensure consistent handling of migration and verification.

In exec_cea_format_helper, blocks for migration vs. verification are separated by conditionals. Confirm that any new features (like advanced logging or extended validations) are applied equally to both paths if intended.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (2)
cea/datamanagement/format_helper/cea4_migrate_db.py (1)

325-338: ⚠️ Potential issue

Improve error handling in delete_files function.

The function silently ignores all exceptions, which could hide important errors.

 def delete_files(path, verbose=False):
     try:
         shutil.rmtree(path)
         if verbose:
             print(f"Deleted directory: {path}")
-    except Exception as e:
-        pass
+    except Exception as e:
+        if verbose:
+            print(f"Failed to delete {path}: {e}")
🧰 Tools
🪛 Ruff (0.8.2)

336-336: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

cea/datamanagement/format_helper/cea4_verify_db.py (1)

232-264: 🛠️ Refactor suggestion

Standardize error reporting format.

The function mixes dictionary and string formats for error reporting. Standardize on a single format (preferably dictionary) for consistent error handling and processing.

-errors.append(f"The {col_name} value for row {identifier} is invalid ({col_data[idx]}). Please check the data type.")
+errors.append({
+    "Column": col_name,
+    "Issue": "Invalid type",
+    "Row": identifier,
+    "Value": col_data[idx],
+    "Message": "Please check the data type"
+})
🧹 Nitpick comments (9)
cea/datamanagement/format_helper/cea4_migrate_db.py (4)

1-4: Fix typo in docstring.

There's a typo in the docstring: "Mirgate" should be "Migrate".

-"""
-Mirgate the format of the DB to CEA-4 format.
-
-"""
+"""
+Migrate the format of the DB to CEA-4 format.
+
+"""

59-60: Consider using inputlocator for path management.

The comment indicates paths are hardcoded without using the inputlocator script. This could make the code less maintainable and more prone to errors if paths change.

Consider implementing support for the inputlocator script to make path management more robust and maintainable.


409-452: Add error handling for configuration validation.

The main function assumes the configuration is valid but only checks for project existence.

Consider adding more robust configuration validation:

def validate_config(config):
    """Validate the configuration parameters."""
    if not hasattr(config, 'general') or not hasattr(config.general, 'project'):
        raise ValueError("Invalid configuration: missing 'general.project'")
    if not hasattr(config, 'scenario'):
        raise ValueError("Invalid configuration: missing 'scenario'")
    if not os.path.exists(config.general.project):
        raise FileNotFoundError(f"Project not found: {config.general.project}")
    return True

def main(config):
    # Validate configuration
    validate_config(config)
    # ... rest of the code ...

443-445: Uncomment or remove commented code.

The code contains commented-out functionality for deleting files. This should either be implemented or removed.

-    # delete_files(path_to_db_file_4(scenario, 'database'))
cea/datamanagement/format_helper/cea4_verify_db.py (5)

1-4: Enhance module docstring.

The current docstring is too brief. Consider adding more details about:

  • Purpose and functionality of the module
  • Key functions and their roles
  • Expected inputs/outputs
  • Dependencies
-"""
-Verify the format of DB for CEA-4 model.
-
-"""
+"""
+Verify and validate the database format for CEA-4 model.
+
+This module provides functionality to:
+- Verify the existence of required database files
+- Validate CSV files against predefined schemas
+- Check relationships between different database components
+- Ensure data consistency across the database
+
+Key functions:
+- cea4_verify_db: Main verification function
+- verify_file_against_schema_4_db: Schema validation
+- verify_components_exist: Component verification
+
+Dependencies:
+- pandas: For CSV file handling
+- cea.schemas: For schema definitions
+"""

25-96: Move large dictionaries to configuration files.

Consider moving the large mapping dictionaries to separate configuration files (JSON/YAML) for better maintainability and readability. This will make the code more modular and easier to update.
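A sketched loading pattern; the file name db_mappings.yml and its keys are made up for illustration:

import os
import yaml

def load_db_mappings(config_dir):
    """Read the item-to-schema/path mappings from a YAML file instead of hard-coding them."""
    # Expected shape, e.g.: {"CONSTRUCTION_TYPE": {"schema": "...", "path": "..."}}
    with open(os.path.join(config_dir, "db_mappings.yml")) as f:  # hypothetical file name
        return yaml.safe_load(f)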


251-258: Simplify type checking using pandas dtypes.

Replace manual type checking with pandas' built-in dtype functionality for more robust and maintainable code.

-            if col_specs['type'] == 'string':
-                invalid = ~col_data.apply(lambda x: isinstance(x, str) or pd.isnull(x))
-            elif col_specs['type'] == 'int':
-                invalid = ~col_data.apply(lambda x: isinstance(x, (int, np.integer)) or pd.isnull(x))
-            elif col_specs['type'] == 'float':
-                invalid = ~col_data.apply(lambda x: isinstance(x, (float, int, np.floating, np.integer)) or pd.isnull(x))
-            else:
-                invalid = pd.Series(False, index=col_data.index)  # Unknown types are skipped
+            dtype_map = {
+                'string': 'object',
+                'int': 'int64',
+                'float': 'float64'
+            }
+            if col_specs['type'] in dtype_map:
+                try:
+                    col_data.astype(dtype_map[col_specs['type']])
+                    invalid = pd.Series(False, index=col_data.index)
+                except (ValueError, TypeError):
+                    invalid = ~col_data.isna()
+            else:
+                invalid = pd.Series(False, index=col_data.index)  # Unknown types are skipped

235-239: Remove commented code.

Remove the commented code block as it's no longer needed. If this is a TODO item, convert it to a proper TODO comment or implement the functionality.

-    # if id_column not in df.columns:
-    #     if verbose:
-    #         print(f"! CEA was not able to verify {os.path.basename(file_path)} "
-    #               f"as a unique row identifier column such as (building) name or (component) code is not present.")

775-804: Enhance error handling and progress reporting in main function.

Add more robust error handling and better progress reporting for long-running operations.

 def main(config):
-    # Start the timer
-    t0 = time.perf_counter()
-    assert os.path.exists(config.general.project), 'input file not found: %s' % config.project
+    try:
+        # Start the timer
+        t0 = time.perf_counter()
+        
+        # Validate configuration
+        if not os.path.exists(config.general.project):
+            raise FileNotFoundError(f'Input file not found: {config.project}')
+        if not hasattr(config, 'scenario'):
+            raise ValueError('Scenario not specified in configuration')
 
-    # Get the scenario name
-    scenario = config.scenario
-    scenario_name = os.path.basename(scenario)
+        # Get the scenario name
+        scenario = config.scenario
+        scenario_name = os.path.basename(scenario)
 
-    # Print: Start
-    div_len = 37 - len(scenario_name)
-    print('+' * 60)
-    print("-" * 1 + ' Scenario: {scenario} '.format(scenario=scenario_name) + "-" * div_len)
+        # Print: Start
+        div_len = 37 - len(scenario_name)
+        print('+' * 60)
+        print("-" * 1 + f' Scenario: {scenario_name} ' + "-" * div_len)
+        print("Starting verification process...")
 
-    # Execute the verification
-    dict_missing_db = cea4_verify_db(scenario, verbose=True)
+        # Execute the verification with progress updates
+        print("1/3: Verifying database structure...")
+        dict_missing_db = cea4_verify_db(scenario, verbose=True)
 
-    # Print the results
-    print_verification_results_4_db(scenario_name, dict_missing_db)
+        print("2/3: Analyzing results...")
+        print_verification_results_4_db(scenario_name, dict_missing_db)
 
-    # Print the time used for the entire processing
-    time_elapsed = time.perf_counter() - t0
+        # Print the time used for the entire processing
+        time_elapsed = time.perf_counter() - t0
+        print("3/3: Completed verification")
 
-    # Print: End
-    print('+' * 60)
-    print('The entire process of CEA-4 format verification is now completed - time elapsed: %.2f seconds' % time_elapsed)
+        # Print: End
+        print('+' * 60)
+        print(f'Verification completed successfully - time elapsed: {time_elapsed:.2f} seconds')
+    except Exception as e:
+        print(f"Error during verification: {str(e)}")
+        raise
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ca2b65 and 7e74f20.

📒 Files selected for processing (2)
  • cea/datamanagement/format_helper/cea4_migrate_db.py (1 hunks)
  • cea/datamanagement/format_helper/cea4_verify_db.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
cea/datamanagement/format_helper/cea4_migrate_db.py

336-336: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: windows-latest
  • GitHub Check: macos-latest
  • GitHub Check: ubuntu-latest
🔇 Additional comments (1)
cea/datamanagement/format_helper/cea4_verify_db.py (1)

16-16: ⚠️ Potential issue

Update copyright year.

The copyright year is set to 2025, which is in the future. Update it to the current year.

-__copyright__ = "Copyright 2025, Architecture and Building Systems - ETH Zurich"
+__copyright__ = "Copyright 2024, Architecture and Building Systems - ETH Zurich"

Likely invalid or redundant comment.

Comment on lines 345 to 403
def migrate_cea3_to_cea4_db(scenario):

    #0. verify if everything is already in the correct format for CEA-4
    dict_missing = cea4_verify_db(scenario)
    if all(not value for value in dict_missing.values()):
        pass
    else:
        # Verify missing files for CEA-3 and CEA-4 formats
        list_problems_construction_type = dict_missing.get('CONSTRUCTION_TYPE')
        list_problems_use_type = dict_missing.get('USE_TYPE')
        list_problems_schedules = dict_missing.get('SCHEDULES')
        list_problems_envelope = dict_missing.get('ENVELOPE')
        list_problems_hvac = dict_missing.get('HVAC')
        list_problems_supply = dict_missing.get('SUPPLY')
        list_problems_conversion = dict_missing.get('CONVERSION')
        list_problems_distribution = dict_missing.get('DISTRIBUTION')
        list_problems_feedstocks = dict_missing.get('FEEDSTOCKS')

        #1. about archetypes - construction types
        path_3 = path_to_db_file_3(scenario, 'CONSTRUCTION_STANDARD')
        if list_problems_construction_type and os.path.isfile(path_3):
            path_csv = path_to_db_file_4(scenario, 'CONSTRUCTION_TYPE')
            merge_excel_tab_to_csv(path_3, 'STANDARD', path_csv, rename_dict=rename_dict)

        #2. about archetypes - use types
        path_3 = path_to_db_file_3(scenario, 'USE_TYPE_PROPERTIES')
        if list_problems_use_type and os.path.isfile(path_3):
            path_csv = path_to_db_file_4(scenario, 'USE_TYPE')
            merge_excel_tab_to_csv(path_3, 'code', path_csv, rename_dict=rename_dict_2)

        if list_problems_schedules:
            move_txt_modify_csv_files(scenario)

        #3. about assemblies
        path_3 = path_to_db_file_3(scenario, 'ENVELOPE')
        if list_problems_envelope and os.path.isfile(path_3):
            excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'ENVELOPE'), rename_dict=rename_dict)
        path_3 = path_to_db_file_3(scenario, 'HVAC')
        if list_problems_hvac and os.path.isfile(path_3):
            excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'HVAC'), rename_dict=rename_dict)
        path_3 = path_to_db_file_3(scenario, 'SUPPLY')
        if list_problems_supply and os.path.isfile(path_3):
            excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'SUPPLY'), rename_dict=rename_dict)

        #4. about components
        path_3 = path_to_db_file_3(scenario, 'CONVERSION')
        if list_problems_conversion and os.path.isfile(path_3):
            excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'CONVERSION'), rename_dict=rename_dict)
        path_3 = path_to_db_file_3(scenario, 'DISTRIBUTION')
        if list_problems_distribution and os.path.isfile(path_3):
            excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'DISTRIBUTION'), rename_dict=rename_dict)
        path_3 = path_to_db_file_3(scenario, 'FEEDSTOCKS')
        if list_problems_feedstocks and os.path.isfile(path_3):
            excel_tab_to_csv(path_3, path_to_db_file_4(scenario, 'FEEDSTOCKS'), rename_dict=rename_dict)

    # # Print: End
    # print('-' * 49)



🛠️ Refactor suggestion

Refactor migrate_cea3_to_cea4_db for better maintainability.

The function is quite long and handles multiple responsibilities. Breaking it down would improve maintainability and testability.

Consider splitting into smaller functions:

  1. verify_migration_requirements
  2. migrate_construction_types
  3. migrate_use_types
  4. migrate_assemblies
  5. migrate_components

Example refactor:

def migrate_construction_types(scenario, list_problems):
    path_3 = path_to_db_file_3(scenario, 'CONSTRUCTION_STANDARD')
    if list_problems and os.path.isfile(path_3):
        path_csv = path_to_db_file_4(scenario, 'CONSTRUCTION_TYPE')
        merge_excel_tab_to_csv(path_3, 'STANDARD', path_csv, rename_dict=rename_dict)

def migrate_assemblies(scenario, problems_dict):
    for item in ['ENVELOPE', 'HVAC', 'SUPPLY']:
        path_3 = path_to_db_file_3(scenario, item)
        if problems_dict[item] and os.path.isfile(path_3):
            excel_tab_to_csv(path_3, path_to_db_file_4(scenario, item), rename_dict=rename_dict)
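With those helpers (names as listed above, all hypothetical), the top-level function would reduce to orchestration, roughly:

def migrate_cea3_to_cea4_db(scenario):
    dict_missing = cea4_verify_db(scenario)
    if all(not value for value in dict_missing.values()):
        return  # already in CEA-4 format, nothing to migrate
    migrate_construction_types(scenario, dict_missing.get('CONSTRUCTION_TYPE'))
    migrate_use_types(scenario, dict_missing.get('USE_TYPE'))
    if dict_missing.get('SCHEDULES'):
        move_txt_modify_csv_files(scenario)
    migrate_assemblies(scenario, dict_missing)
    migrate_components(scenario, dict_missing)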

Comment on lines +94 to +139
def excel_tab_to_csv(path_excel, directory_csv, rename_dict=None, verbose=False):
    """
    Converts each sheet of an Excel file into individual CSV files.

    Parameters:
    - path_excel (str): The path to the input Excel file.
    - directory_csv (str): The directory where CSV files should be saved.
    - rename_dict (dict, optional): Dictionary for renaming columns. Keys are old column names, values are new names.
    """
    # Ensure the output directory exists
    os.makedirs(directory_csv, exist_ok=True)

    # Get the file name without the extension
    file_name = os.path.splitext(os.path.basename(path_excel))[0]

    # Read the Excel file
    try:
        excel_data = pd.ExcelFile(path_excel)
    except Exception as e:
        raise ValueError(f"Error reading Excel file: {e}")

    # Loop through each sheet and save as a CSV
    for sheet_name in excel_data.sheet_names:
        try:
            df = pd.read_excel(path_excel, sheet_name=sheet_name)
            # Rename columns based on the rename_dict
            if rename_dict:
                df.rename(columns=rename_dict, inplace=True)

            # Handle the special case of renaming
            if sheet_name == 'CONSTRUCTION':
                output_path = os.path.join(directory_csv, "MASS.csv")
            elif file_name == 'HVAC':
                output_path = os.path.join(directory_csv, f"HVAC_{sheet_name}.csv")
            elif file_name == 'SUPPLY':
                output_path = os.path.join(directory_csv, f"SUPPLY_{sheet_name}.csv")
            else:
                output_path = os.path.join(directory_csv, f"{sheet_name}.csv")

            # Save the file
            df.to_csv(output_path, index=False)
            if verbose:
                print(f"Saved {sheet_name} to {output_path}")
        except Exception as e:
            print(f"Failed to save sheet {sheet_name} as .csv format Error: {e}")


🛠️ Refactor suggestion

Enhance error handling in excel_tab_to_csv function.

The function catches exceptions but continues processing other sheets when one fails. This could lead to partial data migration without clear indication of what failed.

Consider:

  1. Collecting all errors and returning a summary
  2. Adding a flag to control whether to continue on error
  3. Logging the total number of successful/failed conversions
 def excel_tab_to_csv(path_excel, directory_csv, rename_dict=None, verbose=False):
+    errors = []
+    success_count = 0
+    total_sheets = 0
     # ... existing code ...
     for sheet_name in excel_data.sheet_names:
+        total_sheets += 1
         try:
             # ... existing code ...
             df.to_csv(output_path, index=False)
+            success_count += 1
             if verbose:
                 print(f"Saved {sheet_name} to {output_path}")
         except Exception as e:
-            print(f"Failed to save sheet {sheet_name} as .csv format Error: {e}")
+            errors.append(f"Sheet {sheet_name}: {str(e)}")
+    if verbose:
+        print(f"Converted {success_count}/{total_sheets} sheets successfully")
+    if errors:
+        print("Errors encountered:")
+        for error in errors:
+            print(f"- {error}")
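For the second point, a flag that aborts on the first failure could build on the same errors list (the continue_on_error name is an assumption, not part of the PR):

def excel_tab_to_csv(path_excel, directory_csv, rename_dict=None, verbose=False, continue_on_error=True):
    # ... existing setup and error bookkeeping as above ...
    for sheet_name in excel_data.sheet_names:
        try:
            df = pd.read_excel(path_excel, sheet_name=sheet_name)
            # ... rename columns and build output_path as above ...
            df.to_csv(output_path, index=False)
        except Exception as e:
            if not continue_on_error:
                raise  # stop the migration at the first failing sheet
            errors.append(f"Sheet {sheet_name}: {e}")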

Comment on lines 600 to 768
        list_missing_files_csv_assemblies = verify_file_exists_4_db(scenario, [ASSEMBLIES], dict_ASSEMBLIES_COMPONENTS[ASSEMBLIES])
        add_values_to_dict(dict_missing_db, ASSEMBLIES, list_missing_files_csv_assemblies)
        if list_missing_files_csv_assemblies:
            if verbose:
                print('! Ensure .csv file(s) are present in the ASSEMBLIES>{ASSEMBLIES} folder: {list_missing_files_csv}.'.format(ASSEMBLIES=ASSEMBLIES, list_missing_files_csv=', '.join(map(str, list_missing_files_csv_assemblies))))

        list_list_missing_columns_csv = verify_assembly(scenario, ASSEMBLIES, list_missing_files_csv_assemblies, verbose)
        add_values_to_dict(dict_missing_db, ASSEMBLIES, [item for sublist in list_list_missing_columns_csv for item in sublist])

        list_existing_files_csv = list(set(dict_ASSEMBLIES_COMPONENTS[ASSEMBLIES]) - set(list_missing_files_csv_assemblies))
        # Verify is all values in the construction_type.csv file are defined in the assemblies.csv file
        dict_missing_assemblies = verify_assemblies_exist(scenario, ASSEMBLIES, list_existing_files_csv, dict_missing_db['CONSTRUCTION_TYPE'], archetypes='CONSTRUCTION_TYPE')
        list_missing_names_assemblies = list(dict_missing_assemblies.keys())
        add_values_to_dict(dict_missing_db, ASSEMBLIES, list_missing_names_assemblies)
        if list_missing_names_assemblies:
            if verbose:
                for key, values in dict_missing_assemblies.items():
                    print('! Ensure .csv file(s) are present in COMPONENTS>{ASSEMBLIES} folder: {missing_name_assemblies}, with assembly(ies) defined: {assemblies}.'.format(ASSEMBLIES=ASSEMBLIES, missing_name_assemblies=key, assemblies=', '.join(map(str, values))))

    #4. verify columns and values in .csv files for assemblies
    for COMPONENTS in COMPONENTS_FOLDERS:
        list_missing_files_csv_components = verify_file_exists_4_db(scenario, [COMPONENTS], dict_ASSEMBLIES_COMPONENTS[COMPONENTS])
        if list_missing_files_csv_components:
            if verbose:
                print('! Ensure .csv file(s) are present in the COMPONENTS>{COMPONENTS} folder: {list_missing_files_csv}.'.format(COMPONENTS=COMPONENTS, list_missing_files_csv=', '.join(map(str, list_missing_files_csv_components))))
            add_values_to_dict(dict_missing_db, COMPONENTS, list_missing_files_csv_components)
        else:
            add_values_to_dict(dict_missing_db, COMPONENTS, [])

    #5. verify columns and values in .csv files for components - conversion
    if not dict_missing_db['CONVERSION']:
        list_conversion_db = get_csv_filenames(path_to_db_file_4(scenario, 'CONVERSION'))
        dict_missing_conversion = verify_components_exist(scenario, 'SUPPLY', ['SUPPLY_HEATING', 'SUPPLY_COOLING'], ['primary_components', 'secondary_components', 'tertiary_components'], 'CONVERSION', code_to_name=True)
        if dict_missing_conversion:
            list_missing_names_conversion = list(dict_missing_conversion.keys())
            add_values_to_dict(dict_missing_db, 'CONVERSION', list_missing_names_conversion)
            if verbose:
                for key, values in dict_missing_conversion.items():
                    print('! Ensure .csv file(s) are present in COMPONENTS>CONVERSION folder: {missing_name_conversion}, with component(s) defined: {components}.'.format(missing_name_conversion=key, components=', '.join(map(str, values))))
        for sheet in list_conversion_db:
            list_missing_columns_csv_conversion, list_issues_against_csv_conversion = verify_file_against_schema_4_db(scenario, 'CONVERSION', sheet_name=sheet)
            add_values_to_dict(dict_missing_db, 'CONVERSION', list_missing_columns_csv_conversion)
            add_values_to_dict(dict_missing_db, 'CONVERSION', list_issues_against_csv_conversion)
            if verbose:
                if list_missing_columns_csv_conversion:
                    print('! Ensure column(s) are present in {conversion}.csv: {missing_columns}.'.format(conversion=sheet, missing_columns=', '.join(map(str, list_missing_columns_csv_conversion))))
                if list_issues_against_csv_conversion:
                    print('! Check value(s) in {conversion}.csv:')
                    print("\n".join(f" {item}" for item in list_issues_against_csv_conversion))

    #6. verify columns and values in .csv files for components - distribution
    if not dict_missing_db['DISTRIBUTION']:
        list_missing_files_csv_distribution_components = verify_file_exists_4_db(scenario, ['DISTRIBUTION'], DISTRIBUTION_COMPONENTS)
        if list_missing_files_csv_distribution_components:
            print('! Ensure .csv file(s) are present in the COMPONENTS>DISTRIBUTION folder: {list_missing_files_csv}.'.format(list_missing_files_csv=', '.join(map(str, list_missing_files_csv_distribution_components))))

        for sheet in DISTRIBUTION_COMPONENTS:
            list_missing_columns_csv_distribution, list_issues_against_csv_distribution = verify_file_against_schema_4_db(scenario, 'DISTRIBUTION', sheet_name=sheet)
            add_values_to_dict(dict_missing_db, 'DISTRIBUTION', list_missing_columns_csv_distribution)
            add_values_to_dict(dict_missing_db, 'DISTRIBUTION', list_issues_against_csv_distribution)
            if verbose:
                if list_missing_columns_csv_distribution:
                    print('! Ensure column(s) are present in DISTRIBUTION.csv: {missing_columns}.'.format(missing_columns=', '.join(map(str, list_missing_columns_csv_distribution))))
                if list_issues_against_csv_distribution:
                    print('! Check value(s) in DISTRIBUTION.csv:')
                    print("\n".join(f" {item}" for item in list_issues_against_csv_distribution))

    #7. verify columns and values in .csv files for components - feedstocks
    if not dict_missing_db['FEEDSTOCKS']:
        list_missing_files_csv_feedstocks_components = verify_file_exists_4_db(scenario, ['FEEDSTOCKS'], ['ENERGY_CARRIERS'])
        if list_missing_files_csv_feedstocks_components:
            print('! Ensure .csv file(s) are present in COMPONENTS>FEEDSTOCKS folder: {list_missing_files_csv}.'.format(list_missing_files_csv=', '.join(list_missing_files_csv_feedstocks_components)))
            add_values_to_dict(dict_missing_db, 'FEEDSTOCKS', list_missing_files_csv_feedstocks_components)

        list_feedstocks_db = get_csv_filenames(path_to_db_file_4(scenario, 'FEEDSTOCKS'))
        dict_missing_feedstocks = verify_components_exist(scenario, 'SUPPLY', SUPPLY_ASSEMBLIES, ['feedstock'], 'FEEDSTOCKS')
        if dict_missing_feedstocks:
            list_missing_names_feedstocks = list(dict_missing_feedstocks.keys())
            add_values_to_dict(dict_missing_db, 'FEEDSTOCKS', list_missing_names_feedstocks)
            if verbose:
                for key, _ in dict_missing_feedstocks.items():
                    print('! Ensure .csv file(s) are present in COMPONENTS>FEEDSTOCKS folder: {list_missing_feedstocks}.'.format(list_missing_feedstocks=', '.join(map(str, [key]))))
        if 'ENERGY_CARRIERS' not in list_missing_files_csv_feedstocks_components:
            list_missing_energy_carriers = find_missing_values_directory_column(path_to_db_file_4(scenario, 'FEEDSTOCKS'), path_to_db_file_4(scenario, 'FEEDSTOCKS', 'ENERGY_CARRIERS'), 'cost_and_ghg_tab')
            if list_missing_energy_carriers:
                if verbose:
                    print('! Ensure feedstock(s) are defined in the ENERGY_CARRIERS.csv: {list_missing_energy_carriers}.'.format(list_missing_energy_carriers=', '.join(map(str, list_missing_energy_carriers))))
                add_values_to_dict(dict_missing_db, 'FEEDSTOCKS', list_missing_energy_carriers)
        for sheet in list_feedstocks_db:
            list_missing_columns_csv_feedstocks, list_issues_against_csv_feedstocks = verify_file_against_schema_4_db(scenario, 'FEEDSTOCKS', sheet_name=sheet)
            add_values_to_dict(dict_missing_db, 'FEEDSTOCKS', list_missing_columns_csv_feedstocks)
            add_values_to_dict(dict_missing_db, 'FEEDSTOCKS', list_issues_against_csv_feedstocks)
            if verbose:
                if list_missing_columns_csv_feedstocks:
                    print('! Ensure column(s) are present in {feedstocks}.csv: {missing_columns}.'.format(feedstocks=sheet, missing_columns=', '.join(map(str, list_missing_columns_csv_feedstocks))))
                if list_issues_against_csv_feedstocks:
                    print('! Check value(s) in {feedstocks}.csv:')
                    print("\n".join(f" {item}" for item in list_issues_against_csv_feedstocks))

    return dict_missing_db

🛠️ Refactor suggestion

Decompose cea4_verify_db into smaller functions.

The function is too long (168 lines) and handles multiple responsibilities. Consider breaking it down into smaller, focused functions for better maintainability and testing.

Suggested structure:

def verify_archetypes(scenario, verbose=False):
    """Verify columns and values in archetypes CSV files."""
    pass

def verify_schedules(scenario, verbose=False):
    """Verify columns and values in schedules CSV files."""
    pass

def verify_assemblies(scenario, verbose=False):
    """Verify columns and values in assemblies CSV files."""
    pass

def verify_components(scenario, verbose=False):
    """Verify columns and values in components CSV files."""
    pass

def cea4_verify_db(scenario, verbose=False):
    """Main verification function."""
    dict_missing_db = {}
    
    # Verify each section
    dict_missing_db.update(verify_archetypes(scenario, verbose))
    dict_missing_db.update(verify_schedules(scenario, verbose))
    dict_missing_db.update(verify_assemblies(scenario, verbose))
    dict_missing_db.update(verify_components(scenario, verbose))
    
    return dict_missing_db

Comment on lines 134 to 185
# The paths are relatively hardcoded for now without using the inputlocator script.
# This is because we want to iterate over all scenarios, which is currently not possible with the inputlocator script.
def path_to_db_file_4(scenario, item, sheet_name=None):

    if item == 'database':
        path_db_file = os.path.join(scenario, "inputs", "database")
    elif item == "CONSTRUCTION_TYPE":
        path_db_file = os.path.join(scenario, "inputs", "database", "ARCHETYPES", "CONSTRUCTION_TYPE.csv")
    elif item == "USE_TYPE":
        path_db_file = os.path.join(scenario, "inputs", "database", "ARCHETYPES", "USE_TYPE.csv")
    elif item == "SCHEDULES":
        if sheet_name is None:
            path_db_file = os.path.join(scenario, "inputs", "database", "ARCHETYPES", "SCHEDULES")
        else:
            path_db_file = os.path.join(scenario, "inputs", "database", "ARCHETYPES", "SCHEDULES", "{use_type}.csv".format(use_type=sheet_name))
    elif item == "MONTHLY_MULTIPLIER":
        path_db_file = os.path.join(scenario, "inputs", "database", "ARCHETYPES", "SCHEDULES", "MONTHLY_MULTIPLIER.csv")
    elif item == "ENVELOPE":
        if sheet_name is None:
            path_db_file = os.path.join(scenario, "inputs", "database", "ASSEMBLIES", "ENVELOPE")
        else:
            path_db_file = os.path.join(scenario, "inputs", "database", "ASSEMBLIES", "ENVELOPE", "{envelope_assemblies}.csv".format(envelope_assemblies=sheet_name))
    elif item == "HVAC":
        if sheet_name is None:
            path_db_file = os.path.join(scenario, "inputs", "database", "ASSEMBLIES", "HVAC")
        else:
            path_db_file = os.path.join(scenario, "inputs", "database", "ASSEMBLIES", "HVAC", "{hvac_assemblies}.csv".format(hvac_assemblies=sheet_name))
    elif item == "SUPPLY":
        if sheet_name is None:
            path_db_file = os.path.join(scenario, "inputs", "database", "ASSEMBLIES", "SUPPLY")
        else:
            path_db_file = os.path.join(scenario, "inputs", "database", "ASSEMBLIES", "SUPPLY", "{supply_assemblies}.csv".format(supply_assemblies=sheet_name))
    elif item == "CONVERSION":
        if sheet_name is None:
            path_db_file = os.path.join(scenario, "inputs", "database", "COMPONENTS", "CONVERSION")
        else:
            path_db_file = os.path.join(scenario, "inputs", "database", "COMPONENTS", "CONVERSION", "{conversion_components}.csv".format(conversion_components=sheet_name))
    elif item == "DISTRIBUTION":
        if sheet_name is None:
            path_db_file = os.path.join(scenario, "inputs", "database", "COMPONENTS", "DISTRIBUTION")
        else:
            path_db_file = os.path.join(scenario, "inputs", "database", "COMPONENTS", "DISTRIBUTION", "{conversion_components}.csv".format(conversion_components=sheet_name))
    elif item == "FEEDSTOCKS":
        if sheet_name is None:
            path_db_file = os.path.join(scenario, "inputs", "database", "COMPONENTS", "FEEDSTOCKS")
        else:
            path_db_file = os.path.join(scenario, "inputs", "database", "COMPONENTS", "FEEDSTOCKS", "{feedstocks_components}.csv".format(feedstocks_components=sheet_name))
    else:
        raise ValueError(f"Unknown item {item}")

    return path_db_file


🛠️ Refactor suggestion

Refactor path construction to use templates.

The current implementation uses many if-elif branches with hardcoded paths. Consider refactoring to use path templates and a mapping dictionary for better maintainability.

+# Define path templates at module level
+PATH_TEMPLATES = {
+    'database': ['inputs', 'database'],
+    'CONSTRUCTION_TYPE': ['inputs', 'database', 'ARCHETYPES', 'CONSTRUCTION_TYPE.csv'],
+    'USE_TYPE': ['inputs', 'database', 'ARCHETYPES', 'USE_TYPE.csv'],
+    'SCHEDULES': {
+        'base': ['inputs', 'database', 'ARCHETYPES', 'SCHEDULES'],
+        'sheet': ['inputs', 'database', 'ARCHETYPES', 'SCHEDULES', '{use_type}.csv']
+    },
+    # Add other templates...
+}

 def path_to_db_file_4(scenario, item, sheet_name=None):
-    if item == 'database':
-        path_db_file = os.path.join(scenario, "inputs", "database")
-    elif item == "CONSTRUCTION_TYPE":
-        path_db_file = os.path.join(scenario, "inputs", "database", "ARCHETYPES", "CONSTRUCTION_TYPE.csv")
-    # ... many elif branches ...
+    if item not in PATH_TEMPLATES:
+        raise ValueError(f"Unknown item {item}")
+    
+    template = PATH_TEMPLATES[item]
+    if isinstance(template, dict):
+        # Handle cases with sheet_name
+        path_parts = template['sheet' if sheet_name else 'base']
+    else:
+        path_parts = template
+    
+    # Construct the path
+    path = os.path.join(scenario, *path_parts)
+    
+    # Format if needed
+    if sheet_name and '{' in path:
+        path = path.format(
+            use_type=sheet_name,
+            envelope_assemblies=sheet_name,
+            hvac_assemblies=sheet_name,
+            supply_assemblies=sheet_name,
+            conversion_components=sheet_name,
+            feedstocks_components=sheet_name
+        )
+    
+    return path

Comment on lines +323 to +336
def verify_assembly(scenario, ASSEMBLIES, list_missing_files_csv, verbose=False):
    list_existing_files_csv = list(set(dict_ASSEMBLIES_COMPONENTS[ASSEMBLIES]) - set(list_missing_files_csv))
    list_list_missing_columns_csv = []
    for assembly in list_existing_files_csv:
        list_missing_columns_csv, list_issues_against_csv = verify_file_against_schema_4_db(scenario, ASSEMBLIES, sheet_name=assembly)
        list_list_missing_columns_csv.append(list_missing_columns_csv)
        if verbose:
            if list_missing_columns_csv:
                print('! Ensure column(s) are present in {assembly}.csv: {missing_columns}.'.format(assembly=assembly, missing_columns=', '.join(map(str, list_missing_columns_csv))))
            if list_issues_against_csv:
                print('! Check values in {assembly}.csv:')
                print("\n".join(f" {item}" for item in list_issues_against_csv))
    return list_list_missing_columns_csv


🛠️ Refactor suggestion

Improve error handling in verify_assembly function.

The function silently continues on errors and doesn't validate input parameters. Add proper error handling and input validation.

 def verify_assembly(scenario, ASSEMBLIES, list_missing_files_csv, verbose=False):
+    if not isinstance(ASSEMBLIES, str) or ASSEMBLIES not in ASSEMBLIES_FOLDERS:
+        raise ValueError(f"Invalid ASSEMBLIES value: {ASSEMBLIES}")
+    if not isinstance(list_missing_files_csv, list):
+        raise ValueError("list_missing_files_csv must be a list")
+
     list_existing_files_csv = list(set(dict_ASSEMBLIES_COMPONENTS[ASSEMBLIES]) - set(list_missing_files_csv))
     list_list_missing_columns_csv = []
     for assembly in list_existing_files_csv:
-        list_missing_columns_csv, list_issues_against_csv = verify_file_against_schema_4_db(scenario, ASSEMBLIES, sheet_name=assembly)
-        list_list_missing_columns_csv.append(list_missing_columns_csv)
+        try:
+            list_missing_columns_csv, list_issues_against_csv = verify_file_against_schema_4_db(scenario, ASSEMBLIES, sheet_name=assembly)
+            list_list_missing_columns_csv.append(list_missing_columns_csv)
+        except Exception as e:
+            if verbose:
+                print(f"Error verifying assembly {assembly}: {str(e)}")
+            list_list_missing_columns_csv.append([])

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +110 to +128
dict_code_to_name = {'CH':'VAPOR_COMPRESSION_CHILLERS',
                     'CT':'COOLING_TOWERS',
                     'HEX':'HEAT_EXCHANGERS',
                     'ACH':'ABSORPTION_CHILLERS',
                     'BO':'BOILERS',
                     'HP':'HEAT_PUMPS',
                     'BH':'BORE_HOLES',
                     'OEHR':'COGENERATION_PLANTS',
                     'FU':'COGENERATION_PLANTS',
                     'CCGT':'COGENERATION_PLANTS',
                     'FC':'FUEL_CELLS',
                     'PU':'HYDRAULIC_PUMPS',
                     'PV':'PHOTOVOLTAIC_PANELS',
                     'PVT':'PHOTOVOLTAIC_THERMAL_PANELS',
                     'TR':'POWER_TRANSFORMERS',
                     'SC': 'SOLAR_THERMAL_PANELS',
                     'TES':'THERMAL_ENERGY_STORAGES',
                     'AC':'UNITARY_AIR_CONDITIONERS',
                     }

💡 Codebase verification

Duplicate Mapping Issue in dict_code_to_name

The duplicate mapping for 'COGENERATION_PLANTS' is confirmed. In the file cea/datamanagement/format_helper/cea4_verify_db.py, the keys 'OEHR', 'FU', and 'CCGT' all map to 'COGENERATION_PLANTS'.

  • Consider using distinct names if these represent different components, or document why multiple codes map to the same component.
🔗 Analysis chain

Fix duplicate mapping for 'COGENERATION_PLANTS'.

The dict_code_to_name contains multiple keys ('OEHR', 'FU', 'CCGT') mapping to the same value 'COGENERATION_PLANTS'. Consider using a more specific naming scheme or documenting why multiple codes map to the same component.



🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other potential duplicate mappings in the codebase

# Search for similar dictionary patterns with potential duplicates
rg -A 5 "dict_.*_to_.*\s*=\s*{" | grep -B 5 -A 5 ":'[^']*'.*:'\\1'"

Length of output: 218


Script:

#!/usr/bin/env python3
import ast
import sys

filename = "cea/datamanagement/format_helper/cea4_verify_db.py"
with open(filename, "r") as f:
    content = f.read()

start_token = "dict_code_to_name"
idx = content.find(start_token)
if idx == -1:
    print("dict_code_to_name not found")
    sys.exit(1)

dict_start = content.find("{", idx)
dict_end = content.find("}", dict_start)
if dict_start == -1 or dict_end == -1:
    print("Dictionary boundaries not found")
    sys.exit(1)

dict_literal = content[dict_start:dict_end+1]
try:
    mapping = ast.literal_eval(dict_literal)
except Exception as e:
    print("Error parsing dictionary:", e)
    sys.exit(1)

reverse = {}
for code, name in mapping.items():
    reverse.setdefault(name, []).append(code)

duplicates_found = False
for name, codes in reverse.items():
    if len(codes) > 1:
        print(f"{name}: {codes}")
        duplicates_found = True

if not duplicates_found:
    print("No duplicates found.")

Length of output: 596


Script:

#!/bin/bash
python3 <<'EOF'
import ast
import sys

filename = "cea/datamanagement/format_helper/cea4_verify_db.py"
with open(filename, "r") as f:
    content = f.read()

start_token = "dict_code_to_name"
idx = content.find(start_token)
if idx == -1:
    print("dict_code_to_name not found")
    sys.exit(1)

dict_start = content.find("{", idx)
dict_end = content.find("}", dict_start)
if dict_start == -1 or dict_end == -1:
    print("Dictionary boundaries not found")
    sys.exit(1)

dict_literal = content[dict_start:dict_end+1]
try:
    mapping = ast.literal_eval(dict_literal)
except Exception as e:
    print("Error parsing dictionary:", e)
    sys.exit(1)

reverse = {}
for code, name in mapping.items():
    reverse.setdefault(name, []).append(code)

duplicates_found = False
for name, codes in reverse.items():
    if len(codes) > 1:
        print(f"{name}: {codes}")
        duplicates_found = True

if not duplicates_found:
    print("No duplicates found.")
EOF

Length of output: 53

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
cea/datamanagement/format_helper/cea4_verify_db.py (4)

113-131: ⚠️ Potential issue

Fix duplicate mapping in dict_code_to_name.

Multiple codes ('OEHR', 'FU', 'CCGT') map to the same value 'COGENERATION_PLANTS'. This could lead to confusion and potential bugs.

Consider using more specific names or documenting why multiple codes map to the same component:

 dict_code_to_name = {
     'CH':'VAPOR_COMPRESSION_CHILLERS',
     'CT':'COOLING_TOWERS',
     'HEX':'HEAT_EXCHANGERS',
     'ACH':'ABSORPTION_CHILLERS',
     'BO':'BOILERS',
     'HP':'HEAT_PUMPS',
     'BH':'BORE_HOLES',
-    'OEHR':'COGENERATION_PLANTS',
-    'FU':'COGENERATION_PLANTS',
-    'CCGT':'COGENERATION_PLANTS',
+    'OEHR':'ORGANIC_RANKINE_COGENERATION',
+    'FU':'FUEL_COGENERATION',
+    'CCGT':'COMBINED_CYCLE_COGENERATION',
     'FC':'FUEL_CELLS',
     'PU':'HYDRAULIC_PUMPS',
     'PV':'PHOTOVOLTAIC_PANELS',
     'PVT':'PHOTOVOLTAIC_THERMAL_PANELS',
     'TR':'POWER_TRANSFORMERS',
     'SC': 'SOLAR_THERMAL_PANELS',
     'TES':'THERMAL_ENERGY_STORAGES',
     'AC':'UNITARY_AIR_CONDITIONERS',
     }

139-191: 🛠️ Refactor suggestion

Refactor path construction to use templates.

The current implementation uses many if-elif branches with hardcoded paths. This makes maintenance difficult and is prone to errors.

Consider using path templates and a mapping dictionary:

+# Define path templates at module level
+PATH_TEMPLATES = {
+    'database': ['inputs', 'database'],
+    'archetypes': {
+        'base': ['inputs', 'database', 'ARCHETYPES'],
+        'construction': ['inputs', 'database', 'ARCHETYPES', 'CONSTRUCTION_TYPE.csv'],
+        'use_type': ['inputs', 'database', 'ARCHETYPES', 'USE_TYPE.csv'],
+        'schedules': ['inputs', 'database', 'ARCHETYPES', 'SCHEDULES'],
+        'schedules_library': ['inputs', 'database', 'ARCHETYPES', 'SCHEDULES', 'SCHEDULES_LIBRARY', '{use_type}.csv'],
+        'monthly_multiplier': ['inputs', 'database', 'ARCHETYPES', 'SCHEDULES', 'MONTHLY_SCHEDULE_COEFFICIENTS.csv']
+    },
+    'assemblies': {
+        'envelope': ['inputs', 'database', 'ASSEMBLIES', 'ENVELOPE', '{envelope_assemblies}.csv'],
+        'hvac': ['inputs', 'database', 'ASSEMBLIES', 'HVAC', '{hvac_assemblies}.csv'],
+        'supply': ['inputs', 'database', 'ASSEMBLIES', 'SUPPLY', '{supply_assemblies}.csv']
+    },
+    'components': {
+        'conversion': ['inputs', 'database', 'COMPONENTS', 'CONVERSION', '{conversion_components}.csv'],
+        'distribution': ['inputs', 'database', 'COMPONENTS', 'DISTRIBUTION', '{distribution_components}.csv'],
+        'feedstocks': ['inputs', 'database', 'COMPONENTS', 'FEEDSTOCKS', 'FEEDSTOCK_LIBRARY', '{feedstocks_components}.csv'],
+        'energy_carriers': ['inputs', 'database', 'COMPONENTS', 'FEEDSTOCKS', 'ENERGY_CARRIERS.csv']
+    }
+}

 def path_to_db_file_4(scenario, item, sheet_name=None):
+    """
+    Construct file path based on item type and optional sheet name.
+    
+    Args:
+        scenario (str): Scenario path
+        item (str): Item type (e.g., 'database', 'CONSTRUCTION_TYPE')
+        sheet_name (str, optional): Sheet name for specific files
+    
+    Returns:
+        str: Constructed file path
+    """
-    if item == 'database':
-        path_db_file = os.path.join(scenario, "inputs", "database")
-    elif item == "CONSTRUCTION_TYPE":
-        path_db_file = os.path.join(scenario, "inputs", "database", "ARCHETYPES", "CONSTRUCTION_TYPE.csv")
-    # ... many elif branches ...
+    def build_path(template):
+        if isinstance(template, list):
+            return os.path.join(scenario, *template)
+        return os.path.join(scenario, *template['base' if sheet_name is None else sheet_name])
+    
+    for category, templates in PATH_TEMPLATES.items():
+        if item in templates:
+            path = build_path(templates[item])
+            if '{' in path:
+                path = path.format(
+                    use_type=sheet_name,
+                    envelope_assemblies=sheet_name,
+                    hvac_assemblies=sheet_name,
+                    supply_assemblies=sheet_name,
+                    conversion_components=sheet_name,
+                    distribution_components=sheet_name,
+                    feedstocks_components=sheet_name
+                )
+            return path
+            
+    raise ValueError(f"Unknown item {item}")

330-343: ⚠️ Potential issue

Improve error handling in verify_assembly function.

The function silently continues on errors and doesn't validate input parameters.

Add proper error handling and input validation:

 def verify_assembly(scenario, ASSEMBLIES, list_missing_files_csv, verbose=False):
+    """
+    Verify assembly files against schema.
+    
+    Args:
+        scenario (str): Scenario path
+        ASSEMBLIES (str): Assembly type
+        list_missing_files_csv (list): List of missing CSV files
+        verbose (bool, optional): Enable verbose output
+    
+    Returns:
+        list: List of missing columns
+    
+    Raises:
+        ValueError: If input parameters are invalid
+    """
+    if not isinstance(ASSEMBLIES, str) or ASSEMBLIES not in ASSEMBLIES_FOLDERS:
+        raise ValueError(f"Invalid ASSEMBLIES value: {ASSEMBLIES}")
+    if not isinstance(list_missing_files_csv, list):
+        raise ValueError("list_missing_files_csv must be a list")
+
     list_existing_files_csv = list(set(dict_ASSEMBLIES_COMPONENTS[ASSEMBLIES]) - set(list_missing_files_csv))
     list_list_missing_columns_csv = []
     for assembly in list_existing_files_csv:
-        list_missing_columns_csv, list_issues_against_csv = verify_file_against_schema_4_db(scenario, ASSEMBLIES, sheet_name=assembly)
-        list_list_missing_columns_csv.append(list_missing_columns_csv)
+        try:
+            list_missing_columns_csv, list_issues_against_csv = verify_file_against_schema_4_db(scenario, ASSEMBLIES, sheet_name=assembly)
+            list_list_missing_columns_csv.append(list_missing_columns_csv)
+        except Exception as e:
+            if verbose:
+                print(f"Error verifying assembly {assembly}: {str(e)}")
+            list_list_missing_columns_csv.append([])

607-782: 🛠️ Refactor suggestion

Decompose cea4_verify_db into smaller functions.

The function is too long (175 lines) and handles multiple responsibilities.

Break it down into smaller, focused functions:

+def verify_archetypes(scenario, verbose=False):
+    """Verify columns and values in archetypes CSV files."""
+    dict_missing = {}
+    list_missing_files = verify_file_exists_4_db(scenario, ARCHETYPES)
+    # ... archetypes verification logic ...
+    return dict_missing
+
+def verify_schedules(scenario, verbose=False):
+    """Verify columns and values in schedules CSV files."""
+    dict_missing = {}
+    # ... schedules verification logic ...
+    return dict_missing
+
+def verify_assemblies(scenario, verbose=False):
+    """Verify columns and values in assemblies CSV files."""
+    dict_missing = {}
+    # ... assemblies verification logic ...
+    return dict_missing
+
+def verify_components(scenario, verbose=False):
+    """Verify columns and values in components CSV files."""
+    dict_missing = {}
+    # ... components verification logic ...
+    return dict_missing
+
 def cea4_verify_db(scenario, verbose=False):
     """
     Verify the database for the CEA-4 format.
-
-    :param scenario: the scenario to verify
-    :param verbose: if True, print the results
-    :return: a dictionary with the missing files
+    
+    Args:
+        scenario (str): The scenario to verify
+        verbose (bool): If True, print the results
+    
+    Returns:
+        dict: Dictionary with missing files and validation errors
+    
+    Raises:
+        ValueError: If scenario path is invalid
     """
+    if not os.path.isdir(scenario):
+        raise ValueError(f"Invalid scenario path: {scenario}")
 
     dict_missing_db = {}
 
-    #1. verify columns and values in .csv files for archetypes
-    list_missing_files_csv_archetypes = verify_file_exists_4_db(scenario, ARCHETYPES)
-    # ... many lines of verification logic ...
+    # Verify each section
+    dict_missing_db.update(verify_archetypes(scenario, verbose))
+    dict_missing_db.update(verify_schedules(scenario, verbose))
+    dict_missing_db.update(verify_assemblies(scenario, verbose))
+    dict_missing_db.update(verify_components(scenario, verbose))
 
     return dict_missing_db
🧹 Nitpick comments (2)
cea/datamanagement/format_helper/cea4_verify_db.py (2)

1-4: Enhance module docstring.

The current docstring is too brief. Consider adding more details about the module's purpose, functionality, and usage.

-"""
-Verify the format of DB for CEA-4 model.
-
-"""
+"""
+Verify the format and schema compliance of databases in the CEA-4 model.
+
+This module provides functionality to:
+- Validate CSV files against predefined schemas
+- Check for missing required files and columns
+- Verify data types and value ranges
+- Ensure consistency between different database components
+
+Usage:
+    from cea.datamanagement.format_helper import cea4_verify_db
+    dict_missing = cea4_verify_db(scenario, verbose=True)
+"""

25-47: Add documentation for constants.

The lists and dictionaries lack documentation explaining their purpose and structure.

+# Database structure constants
+# These lists define the expected files and folders in the database
+
 ARCHETYPES = ['CONSTRUCTION_TYPE', 'USE_TYPE']
 SCHEDULES_FOLDER = ['SCHEDULES']
 SCHEDULES_LIBRARY_FOLDER = ['SCHEDULES_LIBRARY']
+
+# Assembly types and their components
 ENVELOPE_ASSEMBLIES = ['MASS', 'TIGHTNESS', 'FLOOR', 'WALL', 'WINDOW', 'SHADING', 'ROOF']
 HVAC_ASSEMBLIES = ['HVAC_CONTROLLER', 'HVAC_HOT_WATER', 'HVAC_HEATING', 'HVAC_COOLING', 'HVAC_VENTILATION']
 SUPPLY_ASSEMBLIES = ['SUPPLY_COOLING', 'SUPPLY_ELECTRICITY', 'SUPPLY_HEATING', 'SUPPLY_HOT_WATER']
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e74f20 and 4af83cd.

📒 Files selected for processing (1)
  • cea/datamanagement/format_helper/cea4_verify_db.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: windows-latest
  • GitHub Check: macos-latest
  • GitHub Check: ubuntu-latest
🔇 Additional comments (2)
cea/datamanagement/format_helper/cea4_verify_db.py (2)

788-817: LGTM!

The main function is well-structured with proper error handling and timing measurement.


16-16: ⚠️ Potential issue

Update copyright year.

The copyright year is set to 2025, which is incorrect as it is a future date.

-__copyright__ = "Copyright 2025, Architecture and Building Systems - ETH Zurich"
+__copyright__ = "Copyright 2024, Architecture and Building Systems - ETH Zurich"

Likely invalid or redundant comment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
cea/datamanagement/format_helper/cea4_verify_db.py (2)

327-340: 🛠️ Refactor suggestion

Improve error handling in verify_assembly function.

The function needs better error handling and input validation.

Apply the error handling improvements from the past review comment.


609-786: 🛠️ Refactor suggestion

Decompose cea4_verify_db into smaller functions.

The function is too long (>150 lines) and handles multiple responsibilities, making it hard to maintain and test.

Apply the decomposition pattern from the past review comment that suggests breaking it down into smaller, focused functions for each verification task.

cea/datamanagement/format_helper/cea4_migrate_db.py (2)

94-139: 🛠️ Refactor suggestion

Enhance error handling in excel_tab_to_csv function.

The function catches exceptions but continues processing other sheets when one fails. This could lead to partial data migration without clear indication of what failed.


374-429: 🛠️ Refactor suggestion

Refactor migrate_cea3_to_cea4_db for better maintainability.

The function is quite long and handles multiple responsibilities. Breaking it down would improve maintainability and testability.

🧹 Nitpick comments (3)
cea/datamanagement/format_helper/cea4_verify_db.py (1)

239-243: Remove or implement commented code.

The commented code block appears to be a validation check for id_column. Either implement this check or remove the comments to maintain code clarity.

cea/datamanagement/format_helper/cea4_migrate_db.py (2)

1-4: Fix typo in docstring.

There's a typo in the docstring: "Mirgate" should be "Migrate".

-Mirgate the format of the DB to CEA-4 format.
+Migrate the format of the DB to CEA-4 format.

27-54: Consider using an enum or dataclass for rename mappings.

The rename dictionaries contain static mappings that would benefit from being defined as enums or dataclasses for better type safety and maintainability.

from dataclasses import dataclass
from typing import ClassVar, Dict

@dataclass(frozen=True)
class ColumnMappings:
    # ClassVar keeps these as shared constants; a plain Dict field would raise the
    # "mutable default" ValueError when the dataclass is created
    STANDARD_MAPPINGS: ClassVar[Dict[str, str]] = {
        'STANDARD': 'const_type',
        'YEAR_START': 'year_start',
        # ... other mappings
    }

    USE_TYPE_MAPPINGS: ClassVar[Dict[str, str]] = {
        'Code': 'use_type',
        'code': 'use_type',
    }
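A hypothetical call site would then read df.rename(columns=ColumnMappings.STANDARD_MAPPINGS, inplace=True), which keeps the mappings importable and greppable rather than scattered module-level dict literals.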
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4af83cd and bca55a5.

📒 Files selected for processing (2)
  • cea/datamanagement/format_helper/cea4_migrate_db.py (1 hunks)
  • cea/datamanagement/format_helper/cea4_verify_db.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
cea/datamanagement/format_helper/cea4_migrate_db.py

365-365: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

cea/datamanagement/format_helper/cea4_verify_db.py

140-140: Module level import not at top of file

(E402)


140-140: Redefinition of unused os from line 7

Remove definition: os

(F811)

🪛 GitHub Actions: Ruff
cea/datamanagement/format_helper/cea4_verify_db.py

[error] 140-140: Module level import not at top of file.


[error] 140-140: Redefinition of unused os from line 7. Remove definition: os.

🔇 Additional comments (4)
cea/datamanagement/format_helper/cea4_verify_db.py (3)

114-132: Document or refactor duplicate mappings in dict_code_to_name.

Multiple codes ('OEHR', 'FU', 'CCGT') map to 'COGENERATION_PLANTS'. This could lead to confusion or errors in code maintenance.

Let's verify if these codes represent different types of cogeneration plants:

What are the differences between OEHR, FU, and CCGT cogeneration plants?

142-187: Consider using path templates for better maintainability.

The current implementation uses hardcoded paths. While functional, it could be improved by:

  1. Using path templates to reduce repetition
  2. Leveraging the inputlocator script for consistent path handling across CEA

Apply the refactoring pattern from the past review comment that suggests using path templates and a mapping dictionary.


792-821: LGTM! Well-structured main function.

The main function has good error handling, progress reporting, and timing information.

cea/datamanagement/format_helper/cea4_migrate_db.py (1)

474-476: Uncomment cleanup code in failure path.

The cleanup code for the new database in case of migration failure is commented out. This should either be implemented or removed if not needed.

     else:
-        # delete_files(path_to_db_file_4(scenario, 'database'))
+        delete_files(path_to_db_file_4(scenario, 'database'))
✅ Verification successful

Uncomment the cleanup call for proper failure handling.
After reviewing the snippet in cea/datamanagement/format_helper/cea4_migrate_db.py (lines 474–476), it’s clear that the deletion of the new database file on migration failure was intentionally commented out. Since leaving temporary database files might lead to unintended resource consumption or conflicts on subsequent runs, the cleanup code should be active. Please ensure that calling

delete_files(path_to_db_file_4(scenario, 'database'))

in the failure path does not have side effects elsewhere. Un-commenting this line will help to keep the environment clean in case of migration failure.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts or web queries found

Comment on lines +7 to +14
import os
import cea.config
import time
import pandas as pd
import numpy as np
from cea.schemas import schemas
import re


⚠️ Potential issue

Fix import issues.

  1. Remove the duplicate import of os at line 140 as it's already imported at line 7.
  2. Move all imports to the top of the file.

Apply this diff to fix the import issues:

-import os
-import cea.config
-import time
-import pandas as pd
-import numpy as np
-from cea.schemas import schemas
-import re
+import os
+import re
+import time
+import numpy as np
+import pandas as pd
+import cea.config
+from cea.schemas import schemas

 __author__ = "Zhongming Shi"
 # ... rest of the code ...

-import os  # Remove this line

Also applies to: 140-140

Comment on lines +354 to +367
def delete_files(path, verbose=False):
    """
    Delete all files in a directory

    Parameters:
    - path (str): The path to the directory
    """
    try:
        shutil.rmtree(path)
        if verbose:
            print(f"Deleted directory: {path}")
    except Exception as e:
        pass


⚠️ Potential issue

Improve error handling in delete_files function.

The function silently ignores all exceptions, which could hide important errors.

 def delete_files(path, verbose=False):
     try:
         shutil.rmtree(path)
         if verbose:
             print(f"Deleted directory: {path}")
-    except Exception as e:
-        pass
+    except Exception as e:
+        print(f"Warning: Failed to delete {path}: {e}")
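If the intent is simply to tolerate a directory that is already gone while still surfacing real failures, another sketch (an alternative, not the suggestion above) is to check for the directory and let genuine errors propagate:

def delete_files(path, verbose=False):
    if not os.path.isdir(path):
        return  # nothing to delete; a missing directory is not an error here
    shutil.rmtree(path)  # permission or locked-file problems will raise and surface to the caller
    if verbose:
        print(f"Deleted directory: {path}")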
🧰 Tools
🪛 Ruff (0.8.2)

365-365: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

Comment on lines +440 to +483
def main(config):
    # Start the timer
    t0 = time.perf_counter()
    assert os.path.exists(config.general.project), 'input file not found: %s' % config.project

    scenario = config.scenario
    scenario_name = os.path.basename(scenario)

    # Print: Start
    div_len = 37 - len(scenario_name)
    print('-' * 50)
    print("-" * 1 + ' Scenario: {scenario} '.format(scenario=scenario_name) + "-" * div_len)

    # Execute the migration
    migrate_cea3_to_cea4_db(scenario)

    # Execute the verification again
    dict_missing = cea4_verify_db(scenario, verbose=True)

    # Print the verification results
    print_verification_results_4_db(scenario_name, dict_missing)

    # If verification is passed, remove the old database
    if all(not value for value in dict_missing.values()):
        delete_files(path_to_db_file_3(scenario, 'technology'))
        # Print: End
        # print("-" * 1 + ' Scenario: {scenario} - end '.format(scenario=scenario_name) + "-" * 50)
        print('+' * 104)

        # Print the time used for the entire processing
        time_elapsed = time.perf_counter() - t0
        print('The entire process of Database migration from CEA-3 to CEA-4 is now completed and successful - time elapsed: %.2f seconds' % time_elapsed)

    # if verification is failed, keep the old database, remove the new one
    else:
        # delete_files(path_to_db_file_4(scenario, 'database'))
        # Print: End
        # print("-" * 1 + ' Scenario: {scenario} - end '.format(scenario=scenario_name) + "-" * 50)
        print('+' * 104)

        # Print the time used for the entire processing
        time_elapsed = time.perf_counter() - t0
        print('The process of Database migration from CEA-3 to CEA-4 is not entirely successful - time elapsed: %.2f seconds' % time_elapsed)

🛠️ Refactor suggestion

Add error handling for configuration validation.

The main function should handle potential configuration errors more gracefully.

 def main(config):
     # Start the timer
     t0 = time.perf_counter()
-    assert os.path.exists(config.general.project), 'input file not found: %s' % config.project
+    if not os.path.exists(config.general.project):
+        print(f"Error: Project path not found: {config.project}")
+        return False
 
     scenario = config.scenario
+    if not scenario:
+        print("Error: Scenario path not specified in configuration")
+        return False
+
     scenario_name = os.path.basename(scenario)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def main(config):
-    # Start the timer
-    t0 = time.perf_counter()
-    assert os.path.exists(config.general.project), 'input file not found: %s' % config.project
-    scenario = config.scenario
-    scenario_name = os.path.basename(scenario)
-    # Print: Start
-    div_len = 37 - len(scenario_name)
-    print('-' * 50)
-    print("-" * 1 + ' Scenario: {scenario} '.format(scenario=scenario_name) + "-" * div_len)
-    # Execute the migration
-    migrate_cea3_to_cea4_db(scenario)
-    # Execute the verification again
-    dict_missing = cea4_verify_db(scenario, verbose=True)
-    # Print the verification results
-    print_verification_results_4_db(scenario_name, dict_missing)
-    # If verification is passed, remove the old database
-    if all(not value for value in dict_missing.values()):
-        delete_files(path_to_db_file_3(scenario, 'technology'))
-        # Print: End
-        # print("-" * 1 + ' Scenario: {scenario} - end '.format(scenario=scenario_name) + "-" * 50)
-        print('+' * 104)
-        # Print the time used for the entire processing
-        time_elapsed = time.perf_counter() - t0
-        print('The entire process of Database migration from CEA-3 to CEA-4 is now completed and successful - time elapsed: %.2f seconds' % time_elapsed)
-    # if verification is failed, keep the old database, remove the new one
-    else:
-        # delete_files(path_to_db_file_4(scenario, 'database'))
-        # Print: End
-        # print("-" * 1 + ' Scenario: {scenario} - end '.format(scenario=scenario_name) + "-" * 50)
-        print('+' * 104)
-        # Print the time used for the entire processing
-        time_elapsed = time.perf_counter() - t0
-        print('The process of Database migration from CEA-3 to CEA-4 is not entirely successful - time elapsed: %.2f seconds' % time_elapsed)
+def main(config):
+    # Start the timer
+    t0 = time.perf_counter()
+    if not os.path.exists(config.general.project):
+        print(f"Error: Project path not found: {config.project}")
+        return False
+    scenario = config.scenario
+    if not scenario:
+        print("Error: Scenario path not specified in configuration")
+        return False
+    scenario_name = os.path.basename(scenario)
+    # Print: Start
+    div_len = 37 - len(scenario_name)
+    print('-' * 50)
+    print("-" * 1 + ' Scenario: {scenario} '.format(scenario=scenario_name) + "-" * div_len)
+    # Execute the migration
+    migrate_cea3_to_cea4_db(scenario)
+    # Execute the verification again
+    dict_missing = cea4_verify_db(scenario, verbose=True)
+    # Print the verification results
+    print_verification_results_4_db(scenario_name, dict_missing)
+    # If verification is passed, remove the old database
+    if all(not value for value in dict_missing.values()):
+        delete_files(path_to_db_file_3(scenario, 'technology'))
+        # Print: End
+        # print("-" * 1 + ' Scenario: {scenario} - end '.format(scenario=scenario_name) + "-" * 50)
+        print('+' * 104)
+        # Print the time used for the entire processing
+        time_elapsed = time.perf_counter() - t0
+        print('The entire process of Database migration from CEA-3 to CEA-4 is now completed and successful - time elapsed: %.2f seconds' % time_elapsed)
+    # if verification is failed, keep the old database, remove the new one
+    else:
+        # delete_files(path_to_db_file_4(scenario, 'database'))
+        # Print: End
+        # print("-" * 1 + ' Scenario: {scenario} - end '.format(scenario=scenario_name) + "-" * 50)
+        print('+' * 104)
+        # Print the time used for the entire processing
+        time_elapsed = time.perf_counter() - t0
+        print('The process of Database migration from CEA-3 to CEA-4 is not entirely successful - time elapsed: %.2f seconds' % time_elapsed)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (2)
cea/datamanagement/zone_helper.py (2)

481-482: Update the default construction year.

The hardcoded year 2000 seems outdated, especially since the print statement mentions 2020 as the default year.

Update the default year to match the documentation:

-            zone_df["start_date"] = 2000
+            zone_df["start_date"] = 2020
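
To keep the assigned default and the printed message from drifting apart again, the value could live in one named constant (a sketch, not existing CEA code):

DEFAULT_CONSTRUCTION_YEAR = 2020  # assumed default, kept in sync with the log message

zone_df["start_date"] = DEFAULT_CONSTRUCTION_YEAR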

113-116: Review Comment: Underground Buildings Handling Still Unresolved

🔗 Analysis chain

Address TODO comment about underground buildings.

The TODO comment indicates potential issues with handling underground or partially underground buildings in the radiation script. This could affect the accuracy of energy calculations.

Let's verify if this is still an issue:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the radiation script has been updated to handle underground buildings

# Search for height-related code in radiation scripts
rg -A 5 "height.*=.*0" --glob "cea/**/radiation*.py"

# Search for related issues or PRs
gh issue list --search "radiation underground buildings"
gh pr list --search "radiation underground buildings"

Length of output: 418

🧹 Nitpick comments (1)
cea/datamanagement/zone_helper.py (1)

80-84: LGTM! The changes improve robustness and efficiency.

The modifications enhance the code by:

  1. Adding .all() to properly check if all values are NaN
  2. Using scalar multiplication for better efficiency
  3. Adding data source tracking through the 'reference' column

Consider using pandas' built-in fillna for a more idiomatic approach:

-            shapefile['building:levels'] = 3 * no_buildings
+            shapefile['building:levels'] = pd.Series([3] * no_buildings).fillna(3)
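
If the column is already present but only partially filled, the more common idiom is to fill just the gaps, coercing OSM's string-typed tags first (a sketch under that assumption; not the code in this PR):

# sketch: default only the missing storey counts to 3, assuming the column already exists
shapefile['building:levels'] = pd.to_numeric(shapefile['building:levels'], errors='coerce').fillna(3)
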
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bca55a5 and f611b38.

📒 Files selected for processing (1)
  • cea/datamanagement/zone_helper.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: windows-latest
  • GitHub Check: macos-latest
  • GitHub Check: ubuntu-latest

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
cea/interfaces/dashboard/api/inputs.py (1)

468-469: Update remaining Excel operations to use CSV format.

Several functions still use Excel format, which is inconsistent with the database revamp objectives:

  • validate_input_database reads Excel files
  • database_dict_to_file writes Excel files

Consider updating these functions to use CSV format. Here's a suggested implementation:

- df = pd.read_excel(db_path, sheet_name=None)
+ df = pd.read_csv(db_path)

- def database_dict_to_file(db_dict, db_path):
-     with pd.ExcelWriter(db_path) as writer:
-         for sheet_name, data in db_dict.items():
-             df = pd.DataFrame(data).dropna(axis=0, how='all')
-             df.to_excel(writer, sheet_name=sheet_name, index=False)
-     print('Database file written to {}'.format(db_path))
+ def database_dict_to_file(db_dict, db_path):
+     for sheet_name, data in db_dict.items():
+         df = pd.DataFrame(data).dropna(axis=0, how='all')
+         csv_path = db_path.replace('.xlsx', f'_{sheet_name}.csv')
+         df.to_csv(csv_path, index=False)
+     print('Database files written to {}'.format(os.path.dirname(db_path)))

Also applies to: 487-492
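
For the read path, a rough counterpart to the old sheet_name=None behaviour could gather the per-sheet CSVs back into a dict. A sketch only; the _{sheet_name}.csv naming follows the suggestion above and is not an existing CEA convention:

import glob
import os
import pandas as pd

def database_file_to_dict(db_path):
    """Collect the per-sheet CSVs written by database_dict_to_file into one dict (sketch)."""
    base = os.path.splitext(db_path)[0]
    prefix = os.path.basename(base) + '_'
    db_dict = {}
    for csv_path in sorted(glob.glob(base + '_*.csv')):
        sheet_name = os.path.splitext(os.path.basename(csv_path))[0][len(prefix):]
        db_dict[sheet_name] = pd.read_csv(csv_path)
    return db_dict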

🧹 Nitpick comments (1)
cea/interfaces/dashboard/api/inputs.py (1)

442-449: Enhance database validation for CSV format.

While error handling is robust, the database validation could be enhanced to specifically verify CSV format compliance.

Consider adding CSV-specific validation in the check_input_database function:

@router.get('/databases/check')
async def check_input_database(config: CEAConfig):
    locator = cea.inputlocator.InputLocator(config.scenario)
    try:
        locator.verify_database_template()
+       # Add CSV-specific validation
+       for file_path in locator.get_database_files():
+           if file_path.endswith('.csv'):
+               df = pd.read_csv(file_path)
+               if df.empty:
+                   raise ValueError(f"Empty CSV file: {file_path}")
+               required_cols = get_required_columns(file_path)
+               missing_cols = [col for col in required_cols if col not in df.columns]
+               if missing_cols:
+                   raise ValueError(f"Missing columns in {file_path}: {missing_cols}")
    except (IOError, ValueError) as e:
        print(e)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e),
        )
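
Note that get_required_columns is not an existing CEA helper; a hedged sketch of how it could be backed by the expected-column definitions (the mapping below is a placeholder to be filled from schemas.yml):

import os

REQUIRED_COLUMNS_BY_FILE = {
    # 'CONSTRUCTION_TYPES.csv': ['code', 'description'],  # placeholder entries
}

def get_required_columns(file_path):
    """Return the columns the CEA-4 schema expects for a given database CSV (sketch)."""
    return REQUIRED_COLUMNS_BY_FILE.get(os.path.basename(file_path), [])
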
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f611b38 and 03b5aa9.

📒 Files selected for processing (1)
  • cea/interfaces/dashboard/api/inputs.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: check_environment_yml
🔇 Additional comments (1)
cea/interfaces/dashboard/api/inputs.py (1)

504-505: LGTM! Verify CSV format consistency.

The change from Excel to CSV aligns with the database revamp objectives. The implementation handles edge cases appropriately.

Run this script to verify that all CSV files have the required column structure:
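
A sketch of such a check, assuming the CEA-4 database CSVs sit under the scenario's inputs folder and that the expected columns are supplied by hand or derived from schemas.yml (the path and the mapping below are placeholders):

import os
import pandas as pd

DB_DIR = 'path/to/scenario/inputs/database'  # placeholder: point this at the CEA-4 database folder
EXPECTED_COLUMNS = {
    # 'CONSTRUCTION_TYPES.csv': ['code', 'description'],  # placeholder: fill from schemas.yml
}

for root, _, files in os.walk(DB_DIR):
    for name in files:
        if not name.endswith('.csv'):
            continue
        path = os.path.join(root, name)
        df = pd.read_csv(path)
        if df.empty:
            print(f'EMPTY: {path}')
        missing = [col for col in EXPECTED_COLUMNS.get(name, []) if col not in df.columns]
        if missing:
            print(f'MISSING COLUMNS in {path}: {missing}')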
