
Cytnx Storage Refactor Proposal #500

Open
10 tasks
IvanaGyro opened this issue Nov 8, 2024 · 11 comments
Labels
enhancement New feature or request


@IvanaGyro
Collaborator

IvanaGyro commented Nov 8, 2024

Introduction

Per offline (not on GitHub) discussions on 2024.10.03 and 2024.11.06, some chapters of the original unpublished draft have been removed from this proposal. Only the chapters that reached consensus remain.

This proposal outlines a plan for refactoring the Storage and Tensor classes, along with their respective implementation classes. The goal of this refactoring is to enhance the efficiency of the program and minimize human errors. Please note that this proposal is not finalized and does not constitute a definitive list of items to be implemented. We encourage discussions regarding any details of the proposed refactoring plan.

Importantly, this proposal DOES NOT alter the APIs for Python.

Advantages

Refactoring takes time, and there is no point in refactoring if there is no advantage. Here are the advantages we will gain by applying the plan described in the following sections.

  1. Reduced Code Volume: A smaller codebase will facilitate easier comprehension for newcomers.
  2. Shorter Build Times: A reduced codebase will lead to faster build times.
  3. Elimination of boost::intrusive_ptr Dependencies: Currently, we use two components from Boost: intrusive_ptr and unordered_map. Once this refactor removes boost::intrusive_ptr, replacing boost::unordered_map with std::unordered_map lets us eliminate the Boost dependency entirely.
  4. Decreased Human Error: By encapsulating the Storage class, we can lower the likelihood of contributors inadvertently accessing memory incorrectly.
  5. Enabled Compile-Time Checks: The proposed changes allow more bugs to be identified at compile time and let the compiler manage type conversions effectively.
  6. Flexibility for Other Backends: While we currently support PyTorch as a backend, the current design binds the UniTensor class to PyTorch's API. This proposal aims to decouple them.

Class Diagrams

Below are class diagrams illustrating the current and proposed designs. Both diagrams omit specifiers such as const, virtual, etc., for clarity.

---
title: Current Design
---
classDiagram
  class Storage {
    +Init(...) void
    +append~T~(...) void
    +at~T~(...) T
    +to_(int device) void
    +to(int device) intrusive_ptr~Storage_base~
    +fill(const T &val) void
    +static from_vector(...) Storage
    +boost::intrusive_ptr~Storage_base~ _impl
  }
  class Storage_base {
    +Init(...) void
    +_Init_byptr(...) void
    +_create_new_sametype() intrusive_ptr~Storage_base~
    +clone() intrusive_ptr~Storage_base~
    +Move_memory(...) intrusive_ptr~Storage_base~
    +Move_memory_(...) void
    +to_(int device) void
    +to(int device) intrusive_ptr~Storage_base~
    +PrintElem_byShape(...) void
    +print_elems() void
    +real() intrusive_ptr~Storage_base~
    +imag() intrusive_ptr~Storage_base~
    +intrusive_ptr~Storage_base~ _impl
    +void *Mem
    +unsigned long long len
    +unsigned long long cap
    +unsigned int dtype
    +int device
  }
  class ComplexDoubleStorage
  class FloatStorage
  class intrusive_ptr_base~T~
  namespace boost {
    class intrusive_ptr~T~
  }
  class Tensor {
    +intrusive_ptr~Tensor_impl~ _impl
    +Init(...) void
  }
  class Tensor_impl {
    -Storage _storage
    Init(...) void
    +to_(int device) void
    +to(int device) intrusive_ptr~Tensor_impl~
    +at~T~(locator) T &
    +at(locator) Sproxy
  }
  class UniTensor {
    +intrusive_ptr~UniTensor_base~ _impl;
  }
  class UniTensor_base {

  }
  class BlockUniTensor {
    +vector~Tensor~ _blocks
  }
  class DenseUniTensor {
    +Tensor _block
  }
   
  Storage "1..*" o-- "1" Storage_base
  intrusive_ptr_base~Storage_base~ <|-- Storage_base
  intrusive_ptr_base~T~ <.. intrusive_ptr~T~
  Storage_base <|-- ComplexDoubleStorage
  Storage_base <|-- FloatStorage
  Tensor "1..*" o-- "1" Tensor_impl
  intrusive_ptr_base~T~ <|-- Tensor_impl
  Tensor_impl "1" o-- "1" Storage
  intrusive_ptr_base~Tensor_impl~ <|-- Tensor_impl
  intrusive_ptr_base~UniTensor_base~ <|-- UniTensor_base
  UniTensor "1..*" o-- "1" UniTensor_base
  UniTensor_base <|-- BlockUniTensor
  UniTensor_base <|-- DenseUniTensor
  BlockUniTensor "1" *-- "0..*" Tensor
  DenseUniTensor "1" *-- "1" Tensor

And here is the class diagram of the new design. The diagram only shows what changes; the levels from class UniTensor upward are not affected. Each interface only shows the methods that must be implemented to bridge other backends.

---
title: New Design
---
classDiagram
  class intrusive_ptr_base~T~
  namespace boost {
    class intrusive_ptr~T~
  }

  class StorageBase {
    <<Interface>>
    +data() void*
    +size() size_t
    +capacity() size_t
    +resize(size_t size) void
    +reserve(size_t size) void
    +dtype() int
    +device() int
    +clone() unique_ptr~StorageBase~
    +to(int device) unique_ptr~StorageBase~
  }
  class Storage {
    +append(cytnx_variant value) void
    +at(...) cytnx_reference_variant
    +clone() unique_ptr~StorageBase~
    +to(int device) unique_ptr~StorageBase~
    +fill(cytnx_variant value) void
    -unique_ptr~StorageBase~ implementation_
  }
  class StorageImplementation~T, Allocator~ {
    +clone() unique_ptr~StorageBase~
    +to(int device) unique_ptr~StorageBase~
    +friend ostream& operator<<() void
  }
  class StorageFactory {
    <<Interface>>
    +MakeStorage(int dtype, int device): unique_ptr<StorageBase>
  }
  class StorageFactoryImplementation {
    +MakeStorage(int dtype, int device): unique_ptr<StorageBase>
  }
  class TensorBase {
    <<Interface>>
    +at(locator) cytnx_reference_variant
    +reshape_(shape) TensorBase&
    +contiguous_() TensorBase&
    +permute_(vector~size_t~ dimensions) TensorBase&
    +shape() const vector~size_t~&
    +dtype() int
    +device() int
    +clone() unique_ptr~TensorBase~
    +to(int device) unique_ptr~TensorBase~
    -data() void*
    -HasContiguousIterator() bool
    %% do not force to have an implementation of Storage
  }
  class Tensor {
    +real() Tensor
    +imag() Tensor
    +clone() unique_ptr~TensorBase~
    +to(int device) unique_ptr~TensorBase~
    -unique_ptr~TensorBase~ implementation_
  }
  class TensorImplementation~T, Allocator~ {
    +clone() unique_ptr~TensorBase~
    +to(int device) unique_ptr~TensorBase~
    +at(locator) T
    -StorageImplementation~T, Allocator~ storage_
  }
  class TensorFactory {
    <<Interface>>
    +MakeTensor(int dtype, int device): unique_ptr<TensorBase>
  }
  class TensorFactoryImplementation {
    +MakeTensor(int dtype, int device): unique_ptr<TensorBase>
  }
  class UniTensor {
    +intrusive_ptr~UniTensor_base~ _impl;
  }
  class UniTensor_base {
  }
  class BlockUniTensor {
    +vector~Tensor~ _blocks
  }
  class DenseUniTensor {
    +Tensor _block
  }
   
  intrusive_ptr_base <.. intrusive_ptr
  StorageBase <|.. Storage
  StorageBase <|.. StorageImplementation
  Storage "1" *-- "1" StorageImplementation
  StorageFactory <|.. StorageFactoryImplementation
  StorageFactory <-- Storage
  TensorBase <|.. Tensor
  TensorBase <|.. TensorImplementation
  Tensor "1" *-- "1" TensorImplementation
  TensorImplementation "1" *-- "1" StorageImplementation
  TensorFactory <|.. TensorFactoryImplementation
  TensorFactory <-- Tensor
  intrusive_ptr_base~UniTensor_base~ <|-- UniTensor_base
  UniTensor "1..*" o-- "1" UniTensor_base
  UniTensor_base <|-- BlockUniTensor
  UniTensor_base <|-- DenseUniTensor
  BlockUniTensor "1" *-- "0..*" Tensor
  DenseUniTensor "1" *-- "1" Tensor

Proposed Changes

Templatization of Implementations of Storage and Tensor

The current design repeatedly defines implementations for Storage, resulting in numerous duplicated functions. Additionally, Tensor_impl lacks type information, preventing compile-time type checking and leading to runtime checks that can degrade performance.

class ComplexDoubleStorage : public Storage_base {};
class FloatStorage : public Storage_base {};
...
// Tensor_impl doesn't provide the information about the type of data it contains to the compiler.
class Tensor_impl {
private:
  Storage _storage;
};

The new design templatizes the implementations of Storage and Tensor to enable type checking at compile time. The compiler can also help determine whether one type can be converted to another. The new design also makes the containers for the CPU and the containers for the GPU different types. By separating them, we can easily use an STL container, such as std::vector, to manage the memory, and it also opens the door to using thrust::vector to manage the memory on both the CPU and the GPU.

template <typename T, typename Allocator>
class StorageImplementation {
private:
  std::vector<T, Allocator> storage_;
};

template <typename T, typename Allocator>
class TensorImplementation {
private:
  StorageImplementation<T, Allocator> storage_;
};

StorageImplementation<cytnx_int64, std::allocator<cytnx_int64>>; // Allocating the memory on CPU
StorageImplementation<cytnx_int64, GpuAllocator<cytnx_int64>>; // Allocating the memory on GPU
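
For illustration, here is a minimal sketch of what a UMA-based GpuAllocator could look like; GpuAllocator is not existing Cytnx code, and the sketch assumes CUDA unified memory via cudaMallocManaged.

// Hypothetical sketch only: a std::vector-compatible allocator backed by CUDA unified memory.
#include <cstddef>
#include <new>

#include <cuda_runtime.h>

template <typename T>
struct GpuAllocator {
  using value_type = T;

  GpuAllocator() = default;
  template <typename U>
  GpuAllocator(const GpuAllocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    T* ptr = nullptr;
    if (cudaMallocManaged(&ptr, n * sizeof(T)) != cudaSuccess) throw std::bad_alloc();
    return ptr;
  }
  void deallocate(T* ptr, std::size_t) noexcept { cudaFree(ptr); }
};

template <typename T, typename U>
bool operator==(const GpuAllocator<T>&, const GpuAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const GpuAllocator<T>&, const GpuAllocator<U>&) { return false; }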

Encapsulation of All Classes

Using class Storage_base as an example:

class Storage_base {
public:
  void *Mem;
  unsigned long long len;
  unsigned long long cap;
  unsigned int dtype;
  int device;
  intrusive_ptr<Storage_base> _impl;
};

Making member variables public increases the risk of accessing the memory illegally.

void foo(Storage_base& storage) {
  // A contributor forgets to update the number of values contained.
  void* old_ptr = storage.Mem;
  storage.Mem = malloc(2 * /* size corresponding to dtype */ 4);
  free(old_ptr);
}

// The other contributor uses `foo()`
Storage_base storage; // contains 10 values
foo(storage);
std::cout << reinterpret_cast<double*>(storage.Mem)[9]; // illegal access

Besides, both the Google C++ Style Guide and the C++ Core Guidelines recommend keeping member variables private.

Here is a change for class Storage_base.

template <typename T, typename Allocator>
class StorageImplementation : public StorageBase {
public:
  StorageImplementation(int dtype, int device) : dtype_(dtype), device_(device) {}

  int dtype() { return dtype_; }
  int device() { return device_; }
  size_t size() { return storage_.size(); } // replaces `len`
  T* data() { return storage_.data(); } // replaces `Mem`
  size_t capacity() { return storage_.capacity(); } // replaces `cap`
  // Use the encapsulated interface to modify the values of `len`, `cap`, and `Mem`.
  void resize(size_t count) { storage_.resize(count); }

private:
  int dtype_;
  int device_;
  std::vector<T, Allocator> storage_;
};

The class diagram of the new design will be updated to reflect the member variables whose access control is changed or that are removed.

Creation of Interfaces: class StorageBase and class TensorBase

The introduction of these interfaces allows for a cleaner separation between interface definitions and their implementations, enhancing flexibility for future integrations with other libraries. There are two options for implementation.

Added on 2024.11.08: This proposal asks developers to implement the type-less StorageBase and TensorBase when integrating with other libraries. However, following #496, it may also be possible to ask developers to implement typed StorageBase<typename> and TensorBase<typename>.

Option 1 (Not Selected)

This option proposed making existing classes base classes for implementations from other libraries but presented several issues including forward declaration requirements and mixed responsibilities.

class StorageFactory; // Need forward declaration here.

class Storage {
public:
  Storage(int dtype, int device, const StorageFactory& factory = StorageFactoryImplementation()) {
    implementation_ = factory.MakeStorage(dtype, device);
  }
  
  virtual ~Storage() = default;

  virtual cytnx_reference_variant at(size_t position) {}

protected:    
  Storage() = default; // leave `implementation_` null

private: 
  std::unique_ptr<Storage> implementation_;
};


class StorageFactory {
public:
  virtual std::unique_ptr<Storage> MakeStorage(int dtype, int device) = 0;
};


class CustomUint64Storage : public Storage {
public:
  // Can omit ":Storage()" because Storage::Storage() is the default constructor.
  CustomUint64Storage():Storage() {}

  // Is overriding `at()` necessary? It is not clear because we cannot mark any
  // member function of `class Storage` as a pure virtual function.
private:
  StorageFromOtherLibrary storage_;
};


class StorageFactoryForOtherLibrary : public StorageFactory {
public:
  std::unique_ptr<Storage> MakeStorage(int dtype, int device) override { return std::make_unique<CustomUint64Storage>(); }
};

// usage:
Storage storage(Type.Uint64, Device.cpu, StorageFactoryForOtherLibrary());
// storage.implementation_ is an instance of `class CustomUint64Storage`

There are three problems with this option.

  1. There must be a forward declaration, which hides a dependency and violates the style guide.
  2. It mixes the interface with its implementation. Users who want to use a backend provided by another library will not be clear about which functions need to be overridden.
  3. Every implementation class, like class CustomUint64Storage, contains the unused variable implementation_ which wastes memory.

Option 2 (Selected)

The selected option introduces abstract interfaces (class StorageBase and class TensorBase) that must be implemented by any concrete storage or tensor class:

class StorageBase {
  ...
};
class Storage: public StorageBase {
public:
  Storage(int dtype, int device, const StorageFactory& factory = StorageFactoryImplementation()) {
    implementation_ = factory.MakeStorage(dtype, device);
  }
  
  ~Storage() = default;

  cytnx_reference_variant at(size_t position) override {}

protected:    
  // With option 2, the default constructor is not needed.
  // Storage() = default;

private: 
  std::unique_ptr<Storage> implementation_;
};


class StorageFactory {
public:
  virtual std::unique_ptr<Storage> MakeStorage(int dtype, int device) = 0;
};


class CustomUint64Storage : public StorageBase {
public:
  CustomUint64Storage() {}

private:
  StorageFromOtherLibrary storage_;
};


class StorageFactoryForOtherLibrary : public StorageFactory {
public:
  std::unique_ptr<Storage> MakeStorage(int dtype, int device) override { return std::make_unique<CustomUint64Storage>(); }
};

// usage:
void foo(const StorageBase& storage) {
  // do something with the storage
}

Storage storage(Type.Uint64, Device.cpu, StorageFactoryForOtherLibrary());
// storage.implementation_ is an instance of `class CustomUint64Storage`
foo(storage);
foo(CustomUint64Storage()); // Encouraged: use the concrete class to make the contained data type explicit.

Except for the public member functions that this proposal aims to remove, the public member functions of class StorageBase and class TensorBase will be copied from the current class Storage and class Tensor, respectively. Below are the pure virtual functions of class StorageBase and class TensorBase. These pure virtual functions must be overridden by every implementation.

class StorageBase {
public:
  virtual ~StorageBase() = default;  // allows deleting implementations through unique_ptr<StorageBase>
  virtual void* data() = 0;
  virtual size_t size() = 0;
  virtual size_t capacity() = 0;
  virtual void resize(size_t size) = 0;
  virtual void reserve(size_t size) = 0;
  virtual int dtype() = 0;
  virtual int device() = 0;
  virtual unique_ptr<StorageBase> clone() = 0;
  virtual unique_ptr<StorageBase> to(int device) = 0;
};


class TensorBase {
public:
  virtual ~TensorBase() = default;  // allows deleting implementations through unique_ptr<TensorBase>
  virtual cytnx_reference_variant at(locator) = 0;
  virtual TensorBase& reshape_(shape) = 0;
  virtual TensorBase& contiguous_() = 0;
  virtual TensorBase& permute_(vector<size_t> dimensions) = 0;
  virtual const vector<size_t>& shape() = 0;
  virtual int dtype() = 0;
  virtual int device() = 0;
  virtual unique_ptr<TensorBase> clone() = 0;
  virtual unique_ptr<TensorBase> to(int device) = 0;

private:
  virtual void* data() = 0;
  virtual bool HasContiguousIterator() = 0;
};

There is no function for getting Storage from TensorBase because we don't expect other libraries to have a component equivalent to Storage. Under this assumption, the functions that currently consume class Storage must have their arguments changed to iterators or raw pointers.
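
As a hedged illustration of that change, the hypothetical Exp_inplace below takes a raw pointer plus a size instead of a Storage (an iterator pair would work the same way); the function name is an assumption, not an existing Cytnx function.

#include <cmath>
#include <cstddef>

// Before (conceptually): void Exp_inplace(Storage& storage);
// After: the algorithm only needs a typed, contiguous range.
template <typename T>
void Exp_inplace(T* data, std::size_t size) {
  for (std::size_t i = 0; i < size; ++i) data[i] = std::exp(data[i]);
}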

The reason for adding HasContiguousIterator() is to support tensor implementations that store their data in a multi-dimensional data structure. We may consider providing alternative, less efficient algorithms, such as Exp(), for that kind of tensor implementation. If we only want to support tensor implementations that store data in a one-dimensional array, we should consider replacing the whole Storage component with vector<T>, because the functionality of Storage is almost identical to that of vector.

Discussion on Removal of Types

To facilitate type elimination in the C++ API, it is crucial to retain the clone() functions, as well as the class StorageBase, class TensorBase, class Storage, and class Tensor. If we only remove types during the Python binding process, these components will no longer be necessary, resulting in a cleaner and more straightforward codebase.

The usage of clone() will be further discussed in the next section.

Blind Storage from Shapes

Another area for improvement is how Storage interacts with shapes. The Storage class currently relies on shape information that is passed to its member functions by the caller, which can lead to implicit coupling between storage management and shape handling.

To foster better separation of concerns, we propose moving the functions below from Storage into Tensor.

void GetElem_byShape_v2(
    boost::intrusive_ptr<Storage_base> &out,
    const std::vector<cytnx_uint64> &shape,
    const std::vector<std::vector<cytnx_uint64>> &locators,
    const cytnx_uint64 &Nunit);
void GetElem_byShape(
    boost::intrusive_ptr<Storage_base> &out,
    const std::vector<cytnx_uint64> &shape,
    const std::vector<cytnx_uint64> &mapper,
    const std::vector<cytnx_uint64> &len,
    const std::vector<std::vector<cytnx_uint64>> &locators);
void SetElem_byShape(
    boost::intrusive_ptr<Storage_base> &in,
    const std::vector<cytnx_uint64> &shape,
    const std::vector<cytnx_uint64> &mapper,
    const std::vector<cytnx_uint64> &len,
    const std::vector<std::vector<cytnx_uint64>> &locators,
    const bool &is_scalar);
void SetElem_byShape_v2(
    boost::intrusive_ptr<Storage_base> &in,
    const std::vector<cytnx_uint64> &shape,
    const std::vector<std::vector<cytnx_uint64>> &locators,
    const cytnx_uint64 &Nunit,
    const bool &is_scalar);
virtual boost::intrusive_ptr<Storage_base> Move_memory(
    const std::vector<cytnx_uint64> &old_shape,
    const std::vector<cytnx_uint64> &mapper,
    const std::vector<cytnx_uint64> &invmapper);
virtual void Move_memory_(
    const std::vector<cytnx_uint64> &old_shape,
    const std::vector<cytnx_uint64> &mapper,
    const std::vector<cytnx_uint64> &invmapper);
virtual void PrintElem_byShape(
    std::ostream &os,
    const std::vector<cytnx_uint64> &shape,
    const std::vector<cytnx_uint64> &mapper = {});

With the new design, we can consider eliminating the entire Storage component, as it essentially functions as a one-dimensional array. This functionality can be effectively replaced by using std::vector or thrust::vector as a member variable within the Tensor component. By making this change, we can streamline the implementation and reduce complexity in the codebase.
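
As a rough sketch of that simplification (the member layout here is an assumption for illustration, not the final design), a tensor implementation could own its buffer directly:

#include <cstddef>
#include <memory>
#include <vector>

template <typename T, typename Allocator = std::allocator<T>>
class TensorImplementation {
 public:
  explicit TensorImplementation(std::vector<std::size_t> shape)
      : shape_(std::move(shape)), data_(NumElements(shape_)) {}

 private:
  static std::size_t NumElements(const std::vector<std::size_t>& shape) {
    std::size_t count = 1;
    for (std::size_t dimension : shape) count *= dimension;
    return count;
  }

  std::vector<std::size_t> shape_;
  std::vector<T, Allocator> data_;  // replaces the separate Storage component
};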

Steps

Following these steps will enable the compiler to detect more mistakes made during the refactoring process.

  • Make all member variables private.
  • Convert class Storage_base into an abstract class.
  • Replace USIInit, Storage::Init(), Storage_base::Init(), and Storage_base::_Init_byptr() with constructors and factory classes.
  • Templatize utils_internal::Movemem_cpu_x, utils_internal::Fill_gpu, and their CUDA counterparts.
  • Templatize class Storage_base.
  • Add template information to class Tensor_impl.
  • Design iterators for Storage (a brief sketch follows this list).
  • Create class StorageBase and class TensorBase.
  • Update functions that consume class Storage or class Tensor to instead consume class StorageBase, class TensorBase, or raw pointers.
  • Create a custom tensor class for bridging with PyTorch.
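
For the iterator item above, a minimal sketch, assuming the std::vector-backed StorageImplementation proposed earlier (StorageBase inheritance omitted for brevity), could simply forward the vector's iterators:

#include <vector>

template <typename T, typename Allocator>
class StorageImplementation /* : public StorageBase */ {
 public:
  using iterator = typename std::vector<T, Allocator>::iterator;
  using const_iterator = typename std::vector<T, Allocator>::const_iterator;

  iterator begin() { return storage_.begin(); }
  iterator end() { return storage_.end(); }
  const_iterator begin() const { return storage_.begin(); }
  const_iterator end() const { return storage_.end(); }

 private:
  std::vector<T, Allocator> storage_;
};
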
@ianmccul
Contributor

ianmccul commented Nov 8, 2024

Looks good to me -- some minor points:

  1. StorageImplementation doesn't need to store the dtype, since that is just the index of T in the enumeration, and it certainly shouldn't be stored as an int that could get accidentally changed. In my sketch in Type safety of the C++ code #496 I have a constexpr templated variable, so StorageImplementation<T, Allocator>::dtype() could be implemented simply as { return dtype_index<T>; }. (A minimal sketch is given after this list.)

  2. Where the data is stored (CPU, GPU etc) is effectively part of the type of StorageImplementation via the Allocator parameter, but only indirectly. It might be better to make this more explicit and extensible, with the idea that other storage devices might be added in the future, eg it might be 'stored' on another node of an MPI calculation, or on disk.

  3. Storage_base should definitely be a template, even if nothing else is. Replace the void* with a T*, or better still, a templated type. Something to consider: is it possible to add a storage type that isn't directly accessible via a pointer? The Thrust library uses device_ptr<T>, which is what ought to be used to denote the underlying storage. You really shouldn't reinterpret_cast between things like a device_ptr<T> and void* (do they even have the same size?!)

  4. I don't understand the Storage class in Option 2, it contains a std::unique_ptr<Storage>, i.e. a pointer to its own type? How does that work? Is the intent that Storage contains a std::unique_ptr<StorageBase> ? In that case, Storage itself is effectively a wrapper around the pointer to StorageBase, and shouldn't inherit from StorageBase itself.

  5. Similarly, Storage_base contains intrusive_ptr<Storage_base> _impl;, which I think cannot be intended.
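
As a minimal sketch of the constexpr lookup mentioned in point 1 (the name dtype_index and the numbering of the type list are assumptions here, not the actual #496 code):

#include <complex>

template <typename T> inline constexpr int dtype_index = -1;  // unknown/unsupported type
template <> inline constexpr int dtype_index<std::complex<double>> = 1;
template <> inline constexpr int dtype_index<std::complex<float>> = 2;
template <> inline constexpr int dtype_index<double> = 3;
template <> inline constexpr int dtype_index<float> = 4;
// ... remaining dtypes ...

template <typename T, typename Allocator>
class StorageImplementation /* : public StorageBase */ {
 public:
  int dtype() const { return dtype_index<T>; }
};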

@IvanaGyro
Collaborator Author

IvanaGyro commented Nov 8, 2024

Thanks @ianmccul for commenting on this proposal.

  1. I will include the implementation in Type safety of the C++ code #496
  2. Because this proposal uses std::vector to store data, the type of the allocator must be decided at compile time. Currently, the memory accessed by the GPU is allocated with the Unified Memory Access (UMA) feature, and CUDA handles the transfer between the CPU and the GPU. So we only have to assign an allocator designed for CUDA at compile time to support both CPU and GPU computation. If we want to decide the type of the allocator at runtime, the implementation will be more complex; it can be a feature in the future.
  3. and 5. Storage_base is the current implementation. It will be dropped.
  4. You are correct. There is a typo.

@ianmccul
Contributor

ianmccul commented Nov 19, 2024

I was having another look at this. I don't understand why classes Storage and Storage_base exist in the new design. TensorImplementation<T, Allocator> contains a StorageImplementation<T>, so Storage and StorageBase appear to be otherwise unused, so the only reason for having them is if it simplifies StorageImplementation<T> by having the base classes.

But that isn't the case -- the code on the dev-master branch for StorageImplementation<T> is much more complicated, and many more lines of code, because it has to manage converting types to/from void* for Storage_base. And it is slower too, because although StorageImplementation<T> obviously knows the type T, operations on it go via the set of overloaded virtual functions of Storage_base. The implementations of these virtual functions are all identical -- they just forward to a template function anyway.

For example, from StorageImplementation.cpp,

template <typename DType>
void StorageImplementation<DType>::fill(const cytnx_complex128 &val) {
  Fill(val);
}

template <typename DType>
void StorageImplementation<DType>::fill(const cytnx_complex64 &val) {
  Fill(val);
}

template <typename DType>
void StorageImplementation<DType>::fill(const cytnx_double &val) {
  Fill(val);
}

(and so on)

There is no reason for this to be a virtual function. Any caller of StorageImplementation<T>::fill() can just call the template function Fill() instead. There are many other similar examples. It seems to me that StorageImplementation<T> could be a fairly simple class that manages a T* for memory (or maybe a std::vector<T>), and TensorImplementation<T> holds a std::shared_ptr<StorageImplementation<T>> (so no need for StorageImplementation<T> to do its own reference counting). Then Storage and Storage_base go away completely, saving a few thousand lines of code.

If you still need an untyped Storage class in the new design, then it could be done in a few ways - a good way would be as a std::variant of shared_ptr<StorageImplementation<T>> objects (i.e. what the TensorImplementation<T> is holding).
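
A minimal sketch of that variant-based untyped Storage (the type list and member names are illustrative) might be:

#include <complex>
#include <memory>
#include <variant>
#include <vector>

template <typename T>
class StorageImplementation {
  std::vector<T> storage_;
};

class Storage {
  std::variant<std::shared_ptr<StorageImplementation<double>>,
               std::shared_ptr<StorageImplementation<float>>,
               std::shared_ptr<StorageImplementation<std::complex<double>>>
               /* other types */>
      implementation_;
};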

@ianmccul
Contributor

ianmccul commented Nov 19, 2024

Related to the above, but a separate issue,

  template <typename DType>
  template <typename OtherDType>
  void StorageImplementation<DType>::Fill(const OtherDType &value) {
    if constexpr (!std::is_constructible_v<DType, OtherDType>) {
      cytnx_error_msg(true, "[ERROR] cannot fill %s value into %s container",
                      Type.getname(Type.cy_typeid(OtherDType())).c_str(),
                      Type.getname(Type.cy_typeid(DType())).c_str());
      return;
    } else {
      if (this->device == Device.cpu) {
        utils_internal::FillCpu(this->Mem, static_cast<DType>(value), this->len);
      } else {
#ifdef UNI_GPU
        checkCudaErrors(cudaSetDevice(this->device));
        utils_internal::FillGpu(this->Mem, static_cast<DType>(value), this->len);
#else
        cytnx_error_msg(true, "[ERROR][fill] fatal internal, %s",
                        "storage is on gpu without CUDA support\n");
#endif
      }
    }
  }

I can't see any reason to defer the error here to runtime. If you call Fill<T,U> where T and U are incompatible types, then it should be a compiler error. The place where it becomes a runtime error is in the untyped interface, i.e. in Tensor::Fill(Scalar s). Only at this point, do we need to determine the dtype of the Tensor and the Scalar, and then it is a runtime error if they do not match. For example, (untested code!)

template <typename T>
class StorageImplementation {
   template <typename U> void Fill(U value); // existing Fill() function
};

template <typename T>
class TensorImplementation {
   StorageImplementation<T> storage_;
public:
   template <typename U>
   void Fill(U x) { storage_.Fill(x); }
};

class Scalar {
   std::variant<int, float, double /* other types */> data_;
};

class Tensor {
   std::variant<
       std::shared_ptr<TensorImplementation<int>>,
       std::shared_ptr<TensorImplementation<float>>,
       std::shared_ptr<TensorImplementation<double>>
       /* other types */
   > data_;

public:
   void Fill(Scalar s);
};

// Helper for Tensor::Fill (simplified version)
template <typename T, typename U>
void FillImpl(std::shared_ptr<TensorImplementation<T>>& tensor_impl, U scalar_val) {
   if constexpr (!std::is_constructible_v<T, U>) {
       cytnx_error_msg(true, "Cannot fill type %s into tensor of type %s.",
                       cy_type_str<U>, cy_type_str<T>); // cy_type_str does not yet exist
   } else {
       tensor_impl->Fill(scalar_val);
   }
}

// Tensor::Fill implementation, forward to FillImpl via the visitor pattern
void Tensor::Fill(Scalar s) {
   std::visit(
       [](auto&& tensor_impl, auto&& scalar_val) {
           FillImpl(tensor_impl, scalar_val);
       },
       this->data_, s.data_);
}

@IvanaGyro
Collaborator Author

IvanaGyro commented Nov 23, 2024

I can't see any reason to defer the error here to runtime. If you call Fill<T,U> where T and U are incompatible types, then it should be a compiler error.

We still hold untyped Storage now, so it should be a runtime error if DType cannot be constructed from OtherDType.

@ianmccul
Contributor

OK, let me think about that. I agree at some point it needs to be a runtime error, but that should be in a function/method that belongs to the untyped interface, not a member of the typed interface StorageImplementation<Type>. I'll look at the code and see if I can find a solution.

@ianmccul
Contributor

ianmccul commented Nov 24, 2024

Hi @IvanaGyro, I had a look to see if I could find a nice way to implement Fill() (and related functions), but it ended up being a bit too intrusive for a single PR. So I am in favour of using what you have now, as something that is a useful improvement.

However, there is a problem when compiling the code without UNI_GPU support:

        if constexpr (std::is_same_v<GpuAllocator<T>, Allocator>) {
          checkCudaErrors(cudaSetDevice(device));
          if constexpr (std::is_same_v<cytnx_complex128, T>) {
            utils_internal::cuComplexmem_gpu_cdtd(out->data(), storage_.data(), storage_.size(),
                                                  /* get_real */ false);
          } else {
            utils_internal::cuComplexmem_gpu_cftf(out->data(), storage_.data(), storage_.size(),
                                                  /* get_real */ false);
          }

Unfortunately, using if constexpr doesn't work this way. It still needs to be syntactically correct C++ code, meaning that non-template functions in both branches need to have been previously declared. For a template function you have more freedom because you can refer to templates that would otherwise be a compile-time error to instantiate, but the compiler still needs to be able to recognise it as a syntactically valid template. So if you compile without UNI_GPU, the compiler doesn't have the declaration of the cuda functions and it cannot continue, even though the cuda function would never get called.

I think there is a design question here, in the sense that it is the storage that is in charge of managing these operations, but the way you have it now, it is the Allocator that determines whether you want the CPU or the GPU version. So the storage needs to know the details of the allocator, which is not a great design. But because it is a template, it all comes down to spelling: in effect, you have two different classes, one for CPU storage, and one for GPU storage, implemented by alternative partial specializations of StorageImplementation<T, Allocator>. You could even make that explicit if you want:

template <typename T>
using CpuStorage = StorageImplementation<T, std::allocator<T>>;

template <typename T>
using GpuStorage = StorageImplementation<T, GpuAllocator<T>>;

So the solution to get the template to compile with and without UNI_GPU is to split the implementation into those two cases. You could even put GpuStorage into its own header, which is conditionally included only if UNI_GPU is defined, and that way you don't have to litter the code with #ifdef UNI_GPU everywhere.

They could both inherit from a common base class, so functions that are the same in both the GPU and CPU cases can be implemented just once. That intermediate base class could be a template of T, or the Allocator, or both; whatever is easier.
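
A rough sketch of that layering (class names are illustrative; GpuAllocator is assumed from the proposal above):

#include <memory>
#include <vector>

template <typename T>
struct GpuAllocator;  // assumed UMA-backed allocator from the proposal

// Code that is identical for the CPU and GPU cases lives in the common base.
template <typename T, typename Allocator>
class StorageCommon {
 protected:
  std::vector<T, Allocator> storage_;
};

template <typename T>
class CpuStorage : public StorageCommon<T, std::allocator<T>> {
  // CPU-specific operations
};

template <typename T>
class GpuStorage : public StorageCommon<T, GpuAllocator<T>> {
  // GPU-specific operations; in the real code this would live in a UNI_GPU-only header
};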

@IvanaGyro
Collaborator Author

Suppose we only support GPUs with the UMA feature: is there any negative impact if we always use GpuAllocator<T> when UNI_GPU is defined, no matter whether device is set to Device.cpu or Device.cuda? Both Storage::to() and Storage::to_() are preserved per the offline discussion on 2024.10.13 (mentioned at the very beginning of this proposal). The most efficient way to implement the "in-place" Storage::to_() is to allocate the memory only in unified memory. If there are negative impacts of doing this, we have to trade off between Storage::to_() and those negative impacts.

@ianmccul
Contributor

Suppose we only support GPUs with the UMA feature: is there any negative impact if we always use GpuAllocator<T> when UNI_GPU is defined, no matter whether device is set to Device.cpu or Device.cuda? Both Storage::to() and Storage::to_() are preserved per the offline discussion on 2024.10.13 (mentioned at the very beginning of this proposal). The most efficient way to implement the "in-place" Storage::to_() is to allocate the memory only in unified memory. If there are negative impacts of doing this, we have to trade off between Storage::to_() and those negative impacts.

That could work. In this model it isn't really a GpuAllocator, since it doesn't necessarily use the GPU for computation, but rather a UnifiedAllocator or a similar name/concept. This leads to the question of whether you want to implement the std::vector-based allocator vs the [Gpu|Unified]Allocator as separate classes at all, or just as a compile-time switch based on UNI_GPU. Having them as separate classes makes it more extensible, but in practice cytnx is unlikely to ever support non-unified GPU memory or anything that isn't directly readable via a pointer -- being able to access *Mem is too baked in. So in this case you could try something like

#ifdef UNI_GPU
template <typename T>
using StorageAllocator = GpuAllocator<T>;
#else
template <typename T>
using StorageAllocator = std::allocator<T>;
#endif

Alternatively, have two versions of StorageImplementation<T>, one for the CPU and one for the GPU. In that case, I suggest something like

#ifdef UNI_GPU
#include "StorageImplementation_gpu.hpp"
#else
#include "StorageImplementation_cpu.hpp"
#endif

as a cleaner approach than sprinkling #ifdef UNI_GPU throughout the code. Sprinkling preprocessor tests throughout the code makes it harder to maintain, e.g. if you are fixing a bug in the cpu-specific part, you have to be careful not to affect the code that is conditional on the gpu, and vice versa. But my preference would be to name them differently, so that it is always clear exactly what code is running. E.g., if you get a bug report from someone and it has a stack trace that leads to some function in GpuAllocator, then you know immediately where it was, rather than having to ask them whether UNI_GPU was configured when they installed the library (which they might not know!).

In any of these approaches, you could use an intermediate base class for common code between the cpu and gpu cases.

@ianmccul
Contributor

By the way, if you use std::vector as the backing store, you will have the vector<bool> problem. You could try a metafunction hack to use std::vector<char> when T is bool.
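
A minimal sketch of that metafunction hack (names are illustrative):

#include <memory>
#include <type_traits>
#include <vector>

// Store bool as char so the backing vector has ordinary element storage.
template <typename T>
using storage_value_t = std::conditional_t<std::is_same_v<T, bool>, char, T>;

template <typename T, typename Allocator = std::allocator<storage_value_t<T>>>
using backing_vector = std::vector<storage_value_t<T>, Allocator>;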

@IvanaGyro
Collaborator Author

I prefer one StorageImplementation<T> and only making the allocator different when UNI_GPU is defined. Using vector<char> is possible. These two changes were introduced in #469. However, #469 has some problems, so I will never merge it. We have to redo some work.
