docs: update docs based on version 1.60.0
谭彪 committed Nov 2, 2024
1 parent 2637249 commit 8aa34bb
Showing 17 changed files with 265 additions and 163 deletions.
2 changes: 1 addition & 1 deletion datakit.template.yaml
@@ -194,7 +194,7 @@ spec:
# value: iploc
# # ---iploc-end
image: pubrepo.guance.com/datakit/datakit:{{.Version}}
imagePullPolicy: Always
imagePullPolicy: IfNotPresent
name: datakit
ports:
- containerPort: 9529
2 changes: 1 addition & 1 deletion internal/datakit/dkconf.go
@@ -153,7 +153,7 @@ ulimit = 64000
# Datakit will upload data points if cached(in memory) points
# reached(>=) the max_cache_count or the flush_interval triggered.
max_cache_count = 1000
flush_workers = 0 # default to (cpu_core * 2 + 1)
flush_workers = 0 # default to (cpu_core * 2)
flush_interval = "10s"
# Queue size of feed.
40 changes: 40 additions & 0 deletions internal/export/doc/en/changelog.md
@@ -1,5 +1,45 @@
# Changelog

## 1.60.0 (2024/10/18) {#cl-1.60.0}

This release is an iterative update, with the following main changes:

### New Features {#cl-1.60.0-new}

- Added a new Prometheus v2 collector, which parses significantly faster than the v1 collector (#2427).
- [APM Automatic Instrumentation](datakit-install.md#apm-instrumentation): by setting specific flags during Datakit installation, we can automatically inject APM into the corresponding Java/Python applications once they are restarted (#2139).
- RUM Session Replay now supports blacklist rules configured in the Guance console (#2424).
- The Datakit [`/v1/write/:category` interface](apis.md#api-v1-write) now supports multiple compression formats (HTTP `Content-Encoding`) (#2368).

### Bug Fixes {#cl-1.60.0-fix}

- Fixed a crash in the HTTP service caused by the Gin timeout middleware (#2423).
- Fixed a timestamp unit issue in the New Relic collector (#2417).
- Fixed a crash issue caused by the Pipeline function `point_window()` (#2416).

### Performance Improvements {#cl-1.60.0-opt}

- Many performance optimizations have been made in this version (#2414):

    - The experimental point-pool feature is now enabled by default.
    - Improved Prometheus exporter data collection performance and reduced memory consumption.
    - Enabled [HTTP API rate limiting](datakit-conf.md#set-http-api-limit) by default to prevent sudden traffic from consuming too much memory.
    - Added a [WAL disk queue](datakit-conf.md#dataway-wal) to avoid excessive memory usage when uploads block. The new disk queue *caches data that fails to upload by default*.
    - Refined Datakit's own memory usage metrics, adding memory usage breakdowns across multiple dimensions.
    - Added a WAL panel to the `datakit monitor -V` display.
- Improved KubernetesPrometheus collection performance (#2426).
- Improved container log collection performance (#2425).
- Removed debug-related fields from logging to reduce network traffic and storage.

### Compatibility Adjustments {#cl-1.60.0-brk}

- Due to some performance adjustments, there are compatibility differences in the following areas:

    - The maximum size of a single HTTP body upload has been reduced to 1MB, and the maximum size of a single log entry has likewise been reduced to 1MB. This change reduces the pooled memory Datakit holds under low load.
    - The original failed-retry disk queue has been deprecated (it was not enabled by default). The new version enables a new failed-retry disk queue by default.

---

## 1.39.0 (2024/09/25) {#cl-1.39.0}

This release is an iterative update with the following changes:
85 changes: 54 additions & 31 deletions internal/export/doc/en/datakit-conf.md
@@ -202,10 +202,10 @@ We can also enable `content_encoding = "v2"`([:octicons-tag-24: Version-1.32.0](

```toml
[io]
feed_chan_size = 4096 # length of the data-processing queue (a job typically carries multiple points)
max_cache_count = 512 # points buffered before a bulk send is triggered
feed_chan_size = 1 # length of the compact queue
max_cache_count = 1000 # points buffered before a bulk send is triggered
flush_interval = "10s" # flush at least once every 10s
flush_workers = 8 # upload workers, default CPU-core * 2 + 1
flush_workers = 0 # upload workers, default is the limited CPU cores * 2
```

See [corresponding description in k8s](datakit-daemonset-deploy.md#env-io) for blocking mode
@@ -215,34 +215,6 @@ We can also enable `content_encoding = "v2"`([:octicons-tag-24: Version-1.32.0](
See [here](datakit-daemonset-deploy.md#env-io)
<!-- markdownlint-enable -->

#### IO Disk Cache {#io-disk-cache}

[:octicons-tag-24: Version-1.5.8](changelog.md#cl-1.5.8) · [:octicons-beaker-24: Experimental](index.md#experimental)

When DataKit fails to send data, the disk cache can be turned on so that critical data is not lost. The disk cache temporarily stores data on disk when uploading to Dataway fails, then later fetches it from disk and uploads it again.

<!-- markdownlint-disable MD046 -->
=== "`datakit.conf`"

```toml
[io]
enable_cache = true # turn on disk caching
cache_all = false # cache all categories (by default, metric, object, and dial-testing points are not cached)
cache_max_size_gb = 5 # specify a disk size of 5GB
```

=== "Kubernetes"

See [here](datakit-daemonset-deploy.md#env-io)
<!-- markdownlint-enable -->

---
<!-- markdownlint-disable MD046 -->
???+ attention

The `cache_max_size_gb` option controls the maximum disk capacity for each data category. Since there are 10 categories, configuring 5GB for each means total disk usage may reach 50GB.
<!-- markdownlint-enable -->

### Resource Limit {#resource-limit}

Because the amount of data DataKit processes cannot be estimated in advance, DataKit may consume a large share of its node's resources if they are not physically limited. We can limit them with cgroups on Linux or job objects on Windows, configured as follows in *datakit.conf*:
@@ -319,6 +291,57 @@ Dataway has the following settings to be configured:

See [here](datakit-daemonset-deploy.md#env-dataway) for configuration under Kubernetes.

#### WAL Queue Configuration {#dataway-wal}

[:octicons-tag-24: Version-1.60.0](changelog.md#cl-1.60.0)

In the `[dataway.wal]` section, we can adjust the configuration of the WAL queue:

```toml
[dataway.wal]
max_capacity_gb = 2.0 # 2GB of disk space reserved for each category (M/L/O/T/...)
workers = 0 # WAL flush workers (defaults to the limited CPU cores)
mem_cap = 0 # in-memory queue capacity (defaults to the limited CPU cores)
fail_cache_clean_interval = "30s" # interval for cleaning up data that failed to upload
```

The disk files are located in the *cache/dw-wal* directory under the Datakit installation directory:

```shell
/usr/local/datakit/cache/dw-wal/
├── custom_object
│   └── data
├── dialtesting
│   └── data
├── dynamic_dw
│   └── data
├── fc
│   └── data
├── keyevent
│   └── data
├── logging
│   ├── data
│   └── data.00000000000000000000000000000000
├── metric
│   └── data
├── network
│   └── data
├── object
│   └── data
├── profiling
│   └── data
├── rum
│   └── data
├── security
│   └── data
└── tracing
    └── data

13 directories, 14 files
```

Here, except for *fc*, which is the failure-retry queue, each directory corresponds to a data category. When an upload fails, the data is cached under *fc*, and Datakit periodically retries uploading it later. Note that `max_capacity_gb` is reserved per category, so with the dozen-plus categories shown above, total disk usage can reach a multiple of that single-category value.
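To see how much disk the WAL actually occupies, a plain `du` over the cache directory is enough; a minimal sketch, assuming the default Linux installation path shown above:

```shell
# Per-category WAL disk usage under the default install path
du -sh /usr/local/datakit/cache/dw-wal/*
```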

### Dataway Sinker {#dataway-sink}

See [here](../deployment/dataway-sink.md)
30 changes: 29 additions & 1 deletion internal/export/doc/en/datakit-install.md
@@ -315,6 +315,35 @@ Only Linux and Windows ([:octicons-tag-24: Version-1.15.0](changelog.md#cl-1.15.
- `DK_LIMIT_CPUMAX`: Maximum CPU usage, default 30.0
- `DK_LIMIT_MEMMAX`: Maximum memory (including swap) in MB, default 4096 (4GB). A combined usage sketch follows this list.
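For example, a minimal sketch that applies both limits at install time, reusing the standard install command shown in the next section (the limit values here are hypothetical, and `<TOKEN>` is a placeholder):

```shell
# Install Datakit with a 20% CPU cap and a 2GB memory cap (hypothetical values)
DK_LIMIT_CPUMAX=20.0 \
DK_LIMIT_MEMMAX=2048 \
DK_DATAWAY=https://openway.guance.com?token=<TOKEN> \
bash -c "$(curl -L https://static.guance.com/datakit/install.sh)"
```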

### APM Instrumentation {#apm-instrumentation}

[:octicons-tag-24: Version-1.60.0](changelog.md#cl-1.60.0) · [:octicons-beaker-24: Experimental](index.md#experimental)

By specifying `DK_APM_INSTRUMENTATION_ENABLED=host` in the installation command, you can automatically inject APM for Java/Python applications:

```shell
DK_APM_INSTRUMENTATION_ENABLED=host \
DK_DATAWAY=https://openway.guance.com?token=<TOKEN> \
bash -c "$(curl -L https://static.guance.com/datakit/install.sh)"
```

After Datakit is installed, reopen a shell and restart the corresponding Java/Python applications.

To enable or disable this feature later, modify the value of the `instrumentation_enabled` configuration under `[apm_inject]` in the `datakit.conf` file (see the sketch after this list):

- Value `"host"`, enable
- Value `""` or `"disable"`, disable

Operating environment requirements:

- Linux system
- CPU architecture: x86_64 or arm64
- C standard library: glibc 2.4 and above, or musl
- Java 8 and above
- Python 3.7 and above

In Kubernetes, you can inject APM through the [Datakit Operator](datakit-operator.md#datakit-operator-inject-lib).

### Other Installation Options {#env-others}

| Environment Variable Name | Sample | Description |
Expand All @@ -332,7 +361,6 @@ Only Linux and Windows ([:octicons-tag-24: Version-1.15.0](changelog.md#cl-1.15.
| `DK_VERBOSE` | `on` | Enable more verbose info during install(only for Linux/Mac)[:octicons-tag-24: Version-1.19.0](changelog.md#cl-1.19.0) |
| `DK_CRYPTO_AES_KEY` | `0123456789abcdfg` | Use the encrypted password decryption key to protect plaintext passwords in the collector. [:octicons-tag-24: Version-1.31.0](changelog.md#cl-1.31.0) |
| `DK_CRYPTO_AES_KEY_FILE` | `/usr/local/datakit/enc4dk` | An alternative way to configure the key; it takes priority over the previous one. Put the key in a file and set the file path through this environment variable. |
| `DK_APM_INSTRUMENTATION_ENABLED` | `host`, `disable` | Enable APM automatic injection for newly started Java and Python applications on the host. |

## FAQ {#faq}
<!-- markdownlint-disable MD013 -->
9 changes: 5 additions & 4 deletions internal/export/doc/en/datakit-metrics.md
@@ -51,19 +51,20 @@ We can also play with other metrics too (change the `grep` string), all available m
|*internal/httpcli*|SUMMARY|`datakit_httpcli_dns_cost_seconds`|`from`|HTTP DNS cost|
|*internal/httpcli*|SUMMARY|`datakit_httpcli_tls_handshake_seconds`|`from`|HTTP TLS handshake cost|
|*internal/httpcli*|SUMMARY|`datakit_httpcli_http_connect_cost_seconds`|`from`|HTTP connect cost|
|*internal/io/dataway*|GAUGE|`datakit_io_dataway_wal_mem_len`|`category`|Dataway WAL's memory queue length|
|*internal/io/dataway*|SUMMARY|`datakit_io_flush_failcache_bytes`|`category`|IO flush fail-cache bytes(in gzip) summary|
|*internal/io/dataway*|SUMMARY|`datakit_io_build_body_cost_seconds`|`category,encoding,stage`|Build point HTTP body cost|
|*internal/io/dataway*|SUMMARY|`datakit_io_build_body_batches`|`category,encoding`|Batch HTTP body batches|
|*internal/io/dataway*|SUMMARY|`datakit_io_build_body_batch_bytes`|`category,encoding,type`|Batch HTTP body size|
|*internal/io/dataway*|SUMMARY|`datakit_io_build_body_batch_points`|`category,encoding`|Batch HTTP body points|
|*internal/io/dataway*|SUMMARY|`datakit_io_dataway_wal_flush`|`category,gzip,queue`|Dataway WAL worker flushed bytes|
|*internal/io/dataway*|COUNTER|`datakit_io_dataway_point_total`|`category,status`|Dataway uploaded points, partitioned by category and send status(HTTP status)|
|*internal/io/dataway*|COUNTER|`datakit_io_wal_point_total`|`category,status`|WAL queued points|
|*internal/io/dataway*|COUNTER|`datakit_io_dataway_point_bytes_total`|`category,enc,status`|Dataway uploaded point bytes, partitioned by category and point send status (HTTP status)|
|*internal/io/dataway*|COUNTER|`datakit_io_dataway_http_drop_point_total`|`category,error`|Dataway write drop points|
|*internal/io/dataway*|SUMMARY|`datakit_io_dataway_api_latency_seconds`|`api,status`|Dataway HTTP request latency partitioned by HTTP API(method@url) and HTTP status|
|*internal/io/dataway*|COUNTER|`datakit_io_http_retry_total`|`api,status`|Dataway HTTP retried count|
|*internal/io/dataway*|SUMMARY|`datakit_io_grouped_request`|`category`|Grouped requests under sinker|
|*internal/io/dataway*|GAUGE|`datakit_io_dataway_wal_mem_len`|`category`|Dataway WAL's memory queue length|
|*internal/io/filter*|COUNTER|`datakit_filter_update_total`|`N/A`|Filters(remote) updated count|
|*internal/io/filter*|GAUGE|`datakit_filter_last_update_timestamp_seconds`|`N/A`|Filter last update time|
|*internal/io/filter*|COUNTER|`datakit_filter_point_total`|`category,filters,source`|Filter points of filters|
@@ -143,6 +144,7 @@ We can also play with other metrics too (change the `grep` string), all available m
|*internal/plugins/inputs/promremote*|SUMMARY|`datakit_input_promremote_collect_points`|`source`|Total number of promremote collection points|
|*internal/plugins/inputs/promremote*|SUMMARY|`datakit_input_promremote_time_diff_in_second`|`source`|Time diff with local time|
|*internal/plugins/inputs/promremote*|COUNTER|`datakit_input_promremote_no_time_points_total`|`source`|Total number of promremote collection no time points|
|*internal/plugins/inputs/promv2*|SUMMARY|`datakit_input_promv2_scrape_points`|`source,remote`|The number of points scraped from the endpoint|
|*internal/plugins/inputs/proxy/bench/client*|GAUGE|`api_elapsed_seconds`|`N/A`|Proxied API elapsed seconds|
|*internal/plugins/inputs/proxy/bench/client*|COUNTER|`api_post_bytes_total`|`api,status`|Proxied API post bytes total|
|*internal/plugins/inputs/proxy/bench/client*|SUMMARY|`api_latency_seconds`|`api,status`|Proxied API latency|
Expand All @@ -169,14 +171,13 @@ We can also playing other metrics too(change the `grep` string), all available m
|*internal/prom*|GAUGE|`datakit_input_prom_stream_size`|`mode,source`|Stream size|
|*internal/statsd*|SUMMARY|`datakit_input_statsd_collect_points`|`N/A`|Total number of statsd collection points|
|*internal/statsd*|SUMMARY|`datakit_input_statsd_accept_bytes`|`N/A`|Accept bytes from network|
|*internal/tailer*|COUNTER|`datakit_input_logging_socket_feed_message_count_total`|`network`|socket feed to IO message count|
|*internal/tailer*|SUMMARY|`datakit_input_logging_socket_log_length`|`network`|record the length of each log line|
|*internal/tailer*|COUNTER|`datakit_tailer_collect_multiline_state_total`|`source,filepath,multilinestate`|Tailer multiline state total|
|*internal/tailer*|COUNTER|`datakit_tailer_file_rotate_total`|`source,filepath`|Tailer rotate total|
|*internal/tailer*|COUNTER|`datakit_tailer_buffer_force_flush_total`|`source,filepath`|Tailer force flush total|
|*internal/tailer*|COUNTER|`datakit_tailer_parse_fail_total`|`source,filepath,mode`|Tailer parse fail total|
|*internal/tailer*|GAUGE|`datakit_tailer_open_file_num`|`mode`|Tailer open file total|
|*internal/tailer*|COUNTER|`datakit_input_logging_socket_connect_status_total`|`network,status`|Connect and close counts for `net.Conn`|
|*internal/tailer*|COUNTER|`datakit_input_logging_socket_feed_message_count_total`|`network`|Socket feed-to-IO message count|
|*internal/tailer*|SUMMARY|`datakit_input_logging_socket_log_length`|`network`|Length of each socket log line|
|*internal/trace*|COUNTER|`datakit_input_tracing_total`|`input,service`|The total links number of Trace processed by the trace module|
|*internal/trace*|COUNTER|`datakit_input_sampler_total`|`input,service`|The sampler number of Trace processed by the trace module|
|*vendor/github.com/GuanceCloud/cliutils/diskcache*|SUMMARY|`diskcache_dropped_data`|`path,reason`|Dropped data during Put() when capacity reached.|
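As a quick usage example, the WAL series added in this release can be checked against Datakit's self-metrics endpoint; a minimal sketch, assuming the default port 9529 (as in the DaemonSet template above) and a Prometheus-style */metrics* path:

```shell
# Scrape Datakit self-metrics and filter for the Dataway WAL series
curl -s http://localhost:9529/metrics | grep dataway_wal
```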