forked from adaptivecomputing/torque
CHANGELOG
c - crash b - bug fix e - enhancement f - new feature n - note
NOTE: the CHANGELOG file is now deprecated. Please check the release notes page
on the Adaptive Computing Website. For example, the 6.0.1 release notes can be
found:
http://docs.adaptivecomputing.com/9-0-1/releaseNotes/help.htm
6.0.0
b - TRQ-3245. Enable reporter mom to correctly handle UNKNOWN role.
b - TRQ-3242. Fix problem where the resource string argument to the prologue
script was getting garbled.
b - TRQ-3232. Start threadpool at pbs_mom start.
b - TRQ-3117. Fix a misspelling in the number_successful tag (was number_successfull)
e - TRQ-3131. Add capability to pass environment variables to pbsdsh.
b - TRQ-3185. Create subdirs when server attribute use_job_subdirs set.
5.1.2
b - TRQ-2675. Fix small errors in suse init.d scripts.
b - TRQ-3235. Fix problem when path to error, output or execution environment
contains one or more spaces.
e - TRQ-3098. Add the ability to set a parameter exit_code_canceled_job to force all
canceled jobs to have the same exit code regardless of the state they were in
when they were canceled.
e - TRQ-2836. Make node health check run on sister nodes when configured for job
start and job end as well.
e - TRQ-2843. Add the qmgr setting dont_write_nodes_file to make it so that nodes
cannot be edited dynamically.
f - TRQ-2897. Add the ability to adopt running processes into a job with pbs_track.
b - TRQ-3189. Never delete a running job because of a dependency.
5.1.1.2
e - TRQ-3197. Add support for RHEL7 and SLES12.
5.1.1
b - TRQ-2947. Fix a race condition on deleting jobs which are failing to start.
c - TRQ-3068. Fix a race condition where a job may be deleted while its pointer
is still in the alljobs container.
b - TRQ-2753. Fix a memory leak in generating the authoritative okclients list.
b - TRQ-2332. Fix a job dependency problem when the failover server comes up. This
only affects users running high availability.
b - TRQ-3023. Fix a bug when ALPS incorrectly returns a permanent confirmation failure.
b - TRQ-2833. Set CUDA_VISIBLE_DEVICES to only the indices for this host when
it will be set.
b - TRQ-3039. Fix a deadlock when deleting a job where other jobs have after any dependencies
on the first job.
f - TRQ-2782. Distribute job files into subdirectories when server attribute use_jobs_subdirs
set to true. Default is false (do not distribute job files).
b - TRQ-3116. Make qsub only retry on transient errors.
b - TRQ-3122. Fix a problem with login_property not working correctly (cray only).
b - TRQ-3114. Fix an issue where an asynchronously started job is stuck with a
substate of starting after a failed job start.
b - TRQ-3110. Handle slot limits correctly when jobs are preempted.
e - TRQ-3095. Add the server setting disable_automatic_requeue to stop jobs from being
requeued if they experience a transient failure on the mom.
e - TRQ-2307. Fix problems where mom restarts intermittently fail.
b - TRQ-2946. Make qmgr able to handle Cray numeric node ids.
b - TRQ-2790. Make offlining cray compute nodes persist across restarts.
e - TRQ-3104. Add millisecond precision to the Torque log file.
e - TRQ-2881. Add node health check error messages to a node's notes and therefore pbsnodes
output.
b - TRQ-3166. Add another safety check before killing stray jobs.
5.0.2
b - TRQ-3029. Make it so that pbs_server can't have active threads when the main
thread exits.
b - TRQ-3012. Fix memory leaks that happen each time a job is run.
b - TRQ-2966. Improved job rerun speed which had been significantly slowed down
starting in 4.2.6. Also, pbs_mom now correctly accounts for job resources when
user jobs call setsid more than once.
c - TRQ-2987. Fix a crash around job exits due to incorrect error code handling.
b - TRQ-2841. Fix some ways that max_user_queuable can become incorrect.
b - TRQ-3097. Fixed a problem where failed job submissions would count against
the max_user_queuable count and could not be cleared until pbs_server
was restarted.
b - TRQ-3087. Fixed a problem where completed jobs were counted against max_user_queuable
when restarting pbs_server. Also if the max_user_queuable was set on a
queue and the number of queued jobs and completed jobs were over the
maximum then the last jobs submitted would not get loaded.
5.0.1.h2
b - Reverted a change in 5.0.0 which made it so a user could not submit a
job from a node which had been allowed using the acl_hosts list. The
change to 5.0.0 made it so the ruserok call could not be made to check
for user authorization.
5.0.1
e - TRQ-2410. Improved qstat behavior in cases where bad job IDs were referenced
in the command.
e - TRQ-2460. Two new fields were added to the accounting file for completed jobs:
total_execution_slots and unique_node_count. total_execution_slots should be
20 for a job that requests nodes=2:ppn=10. unique_node_count should be the
number of unique hosts the job occupied.
e - TRQ-2594. TORQUE now uses the Munge API rather than forking when configured
with the --enable-munge-auth option.
e - TRQ-2863. Reduced verbosity in error logging in HA environments.
e - TRQ-2868. TORQUE now allows for the modification of the output location
based on the Mother superior hostname. An environment variable ($HOSTNAME)
has been added to the job's environment.
e - TRQ-2882. Improved trqauthd error messages to be more meaningful and less
redundant.
e - TRQ-2890. Added stderr capturing when using -o option.
b - TRQ-2025. Fixed bug where giving a bad queue name to qstat -Q results in
duplicate output.
b - TRQ-2292. Fixed bug where some tasks were incorrectly listed as 0 in 'qstat -a'
when requesting specific nodes.
b - TRQ-2367. Fixed bug related to accounting records on large systems.
b - TRQ-2411. Fixed output format bug in cases where multiple job IDs are passed
into qstat.
b - TRQ-2646. Fixed bug where qsub did not process args correctly when using
a submit filter.
b - TRQ-2652. Fixed parsing bug when using hostlist ranges in qsub.
b - TRQ-2653. Fixed build bug related to newer Intel MIC libraries installing
in different locations.
b - TRQ-2730. Fixed problem where GPUs were not split between NUMA nodes. You
now need to specify which gpus belong to each node board in the mom.layout
file. A sample mom.layout file might look like:
nodes=0 gpu=0
nodes=1 gpu=1
Also please note that this only works if you use nvml. The nvidia-smi
command is not supported.
b - TRQ-2732. Fixed bug where OU files were being left in spool when job was
preempted or requeued.
b - TRQ-2759. Fixed bug where reported cput was incorrect.
b - TRQ-2760. Fixed unexpected error when running 'pbsnodes -l offline -n'.
b - TRQ-2795. Fixed bug where jobs were rejected because the max_user_queuable
limit had been reached even though there were no jobs in the queue.
b - TRQ-2828. Fixed bug where 'momctl -q clearmsg' didn't clear error messages
properly.
b - TRQ-2837. Fixed bug where GPU modes were not passed to sister nodes.
b - TRQ-2852. Fixed bug while writing resources_default units to serverdb file.
b - TRQ-2885, CVE-2014-3684. Fixed issue around unauthorized termination
of processes.
b - TRQ-2890. Improved pbsdsh to better handle simultaneous use of -o and -s
options. Also fixed some problems where -o output was sometimes getting
truncated.
b - TRQ-2904. Fixed bug where TORQUE was not honoring KeepCompleted server
parameter when job_nanny was set to true.
b - TRQ-2918. Fixed problem with remote client job submission during
ruserok() calls.
b - TRQ-2919. Fixed deadlock issue when running 'qdel -p' as non-root user.
b - TRQ-2937. Fixed bug in qsub -m when TORQUE is configured --with-sendmail.
Some missing newlines were added.
b - TRQ-2956. Fixed bug where HOST_NAME_SUFFIX was no longer adding suffix to job names.
c - TRQ-2928, TRQ-2921, TRQ-2855, TRQ-2854, TRQ-2853, TRQ-2835. Fixed various crashes.
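The TRQ-2460 entry above gives one worked value for total_execution_slots (20 for nodes=2:ppn=10). The arithmetic can be sketched as below. This is only an illustration, not TORQUE's accounting code: the function name, the handling of '+'-separated chunks, and the default of one slot per node when ppn is absent are all assumptions.

```python
def total_execution_slots(nodes_spec: str) -> int:
    """Sum (node count * ppn) over each '+'-separated chunk of a nodes request.

    Hypothetical helper for illustration; not part of TORQUE.
    """
    total = 0
    for chunk in nodes_spec.split("+"):
        parts = chunk.split(":")
        # A numeric first field is a node count; anything else is treated
        # as a single named host (assumption).
        count = int(parts[0]) if parts[0].isdigit() else 1
        ppn = 1  # assumed default when ppn is not given
        for part in parts[1:]:
            if part.startswith("ppn="):
                ppn = int(part[len("ppn="):])
        total += count * ppn
    return total


print(total_execution_slots("2:ppn=10"))  # 20, the value quoted in the entry
```

unique_node_count is a separate quantity: it counts distinct hosts, so the same request would report 2 if the job landed on two different hosts.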
5.0.0
e - TRQ-2083. Remove job status polling from TORQUE. Have pbs_server only poll a
mom for a job's information if the information hasn't been received for 5
minutes. Otherwise, this information is communicated with the mom's status
information.
e - TRQ-2309. Have TORQUE recognize when a request to run a job specifies a node
list and directly access those nodes instead of searching linearly.
e - TRQ-1539. Condense the exec_host list to have one entry per node instead of one
entry per execution slot. The node entry contains a string specifying each
execution slot index. Also no longer display the value of exec_port in qstat.
f - TRQ-2363. Make it so that if you execute qrerun all - which previously
returned an error - it will ask for confirmation, and then place all running
jobs in a queued state without contacting the moms. This is meant to be used
only when the entire cluster has gone down and can't be contacted.
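To picture the TRQ-1539 change in 5.0.0, the sketch below condenses a per-slot exec_host string into one entry per node. The exact syntax is an assumption here: it takes the old form to be host/index entries joined by '+', and the new form to list each node once with comma-separated indices and ranges. The function names are hypothetical.

```python
def to_ranges(indices):
    """Render a sorted list of ints as '0-2,5' style ranges."""
    ranges = []
    start = prev = indices[0]
    for i in indices[1:]:
        if i == prev + 1:
            prev = i
            continue
        ranges.append((start, prev))
        start = prev = i
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def condense_exec_host(per_slot: str) -> str:
    """Group 'host/index' entries by host, keeping first-seen host order."""
    slots = {}
    for entry in per_slot.split("+"):
        host, index = entry.rsplit("/", 1)
        slots.setdefault(host, []).append(int(index))
    return "+".join(
        f"{host}/{to_ranges(sorted(set(idx)))}" for host, idx in slots.items()
    )


print(condense_exec_host("n1/0+n1/1+n1/2+n2/0+n2/2"))  # n1/0-2+n2/0,2
```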
4.5.0
b - TRQ-2319. Replace two Torque functions with ones from the hwloc package.
n - Portable Hardware Locality (hwloc) package version 1.2 or higher must be
installed when using cpusets (--enable-cpuset). Previously at least version
1.1 was required.
b - TRQ-2373. Fix login nodes restricting the number of jobs to the number specified
by np=X.
e - TRQ-2044. Create a unique identifier for all jobs in TORQUE. This makes it
so that we're performing integer comparisons instead of string comparisons
for finding jobs.
4.2.9
b - TRQ-2730. Make nvml and numa-support configurations work together. The admin must
now specify which gpus are on which node board the same way it is done with mic
co-processors, adding gpu=X[-Y] to the mom.layout line for that node board.
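As a rough illustration of the gpu=X[-Y] syntax described above, the following sketch reads one mom.layout line into a small dictionary. It is not TORQUE's parser: it assumes space-separated key=value tokens (as in the sample lines nodes=0 gpu=0 given under 5.0.1) and ignores any other layout options.

```python
def parse_layout_line(line: str) -> dict:
    """Parse a 'nodes=N gpu=X[-Y]' line. Hypothetical helper for illustration."""
    fields = dict(token.split("=", 1) for token in line.split())
    result = {"nodes": fields["nodes"], "gpus": []}
    gpu = fields.get("gpu")
    if gpu is not None:
        # 'X' means a single GPU index; 'X-Y' means the inclusive range X..Y.
        first, _, last = gpu.partition("-")
        result["gpus"] = list(range(int(first), int(last or first) + 1))
    return result


print(parse_layout_line("nodes=0 gpu=0"))    # {'nodes': '0', 'gpus': [0]}
print(parse_layout_line("nodes=1 gpu=2-3"))  # {'nodes': '1', 'gpus': [2, 3]}
```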
4.2.8
b - TRQ-2501. Fix the total number of execution slots having a count that is off-by-one for
every Cray compute node.
b - TRQ-2498. Fixed a memory leak when using qrun -a (asynchronous). Also fixed a write
after free error that could lead to memory corruption.
b - Fixed the thread pool manager so it would free idle nodes. Also changed the default
thread stack sizes to a maximum of 8 MB and a minimum of 1 MB.
4.2.7
b - TRQ-2423. Fix a bug where cpusets would incorrectly be reported on mpi jobs.
b - TRQ-2329. Fix a problem where nodes could be allocated to array subjobs even
after the job was deleted.
b - TRQ-2351. Fix an issue where moms that are before 4.2.6 can't run jobs if the
server is 4.2.6.
e - Made it so trqauthd cannot be loaded more than once. trqauthd opens a UNIX
domain name file to do its communication with client commands. If the
UNIX domain name file exists trqauthd will not load. By default this file
is /tmp/trqauthd-unix. It can be configured to point to a different directory.
If trqauthd will not start and you know there are no other instances of trqauthd
running you should delete the UNIX domain file and try again.
b - TRQ-2319. Replace two Torque functions with ones from the hwloc package.
n - Portable Hardware Locality (hwloc) package version 1.2 or higher must be
installed when using cpusets (--enable-cpuset). Previously at least version
1.1 was required.
b - TRQ-2373. Fix login nodes restricting the number of jobs to the number specified
by np=X.
b - TRQ-2354. Fix an issue with potential overflow in user job counts. Also fix a
user being considered different if from a different submit host.
b - TRQ-2369. Fix a problem with pbs_mom recovering which cpu indices were in use for
jobs that were running at shutdown and still running at the time the mom restarted.
b - TRQ-2377. Jobs with future start dates were being placed in queued after being
deleted if they were deleted before their start date and keep_completed kept them
around long enough. Fix this.
c - TRQ-2347. Fix a segfault around re-sending batch requests.
b - TRQ-2270. Fix some problems with TORQUE continuing to have nodes in a free state
when the host is down.
b - TRQ-2395. Fix a problem when running jobs on non-Cray nodes reporting to a pbs
server running in cray enabled mode.
n - TRQ-2299. Make it so that the reporter mom doesn't fork to send its update.
4.2.6.1.h1
b - TRQ-2395. Fix a problem when running jobs on non-Cray nodes reporting to a pbs
server running in cray enabled mode.
4.2.6.1
b - TRQ-2351. Fix an issue where moms that are before 4.2.6 can't run jobs if the
server is 4.2.6.
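e - Made it so trqauthd cannot be loaded more than once. trqauthd opens a UNIX
domain name file to do its communication with client commands. If the
UNIX domain name file exists trqauthd will not load. By default this file
is /tmp/trqauthd-unix. It can be configured to point to a different directory.
If trqauthd will not start and you know there are no other instances of trqauthd
running you should delete the UNIX domain file and try again.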
4.2.6
b - TRQ-2273. Job start time is hard coded to 5 minutes. If the prolog takes longer
than that to run the job will be requeued without killing the prolog. This
b - TRQ-2395. Fix a problem when running jobs on non-Cray nodes reporting to a pbs
server running in cray enabled mode.
n - TRQ-2299. Make it so that the reporter mom doesn't fork to send its update.
b - TRQ-2111. Fix a rare case of running jobs being deleted without having their
resources freed.
b - TRQ-2208. Stop having pbs_mom use trqauthd when it is checkpointing a job.
e - TRQ-2022. Make pbs_mom capable of handling either naming convention for cpuset
files, those with the 'cpuset.' prefix and those without.
b - TRQ-2259. Fix a problem for multi-node jobs: vmem was being stored in mem and
vice versa from the sisters.
b - TRQ-2280. Save properties added to cray compute nodes in the nodes file if the
file is overwritten by pbs_server.
4.2.5
e - Remove the mom asking for a job status before sending an obit to pbs_server
for a job that has exited. This is unnecessary overhead.
b - TRQ-2097. Make it so that the proper errno is stored for non-blocking sockets
at connect time.
b - TRQ-2111. Make queued jobs never hold node resources.
c - TRQ-2155. Fix a crash in trqauthd.
e - TRQ-2058. Add the option of having the pbs_mom daemon read the mom hierarchy
file instead of having to get it from pbs_server. To do this, copy the
hierarchy to mom_priv/mom_hierarchy.
e - TRQ-2058. Add the -n option to pbs_server, telling pbs_server not to send a
hierarchy over the network unless it is requested by pbs_mom.
e - TRQ-2020. Add the option of setting properties (features) for cray compute
nodes in the nodes file. Syntax: node_id cray_compute property_name.
4.2.4
b - TRQ-1802. Make the environment variable $PBS_NUM_NODES accurate for multi-req
jobs.
e - TRQ-1832. Add the ability to add a login_property to a job at the queue level
by setting required_login_property on the queue.
e - TRQ-1925. Make pbs_mom smart enough to reserve extra memory nodes for non-numa
configured TORQUE when more memory is requested than reserved.
e - TRQ-1923. Make job aborts for a mother superior not recognizing the job a bit
more intelligent - if the job has been reported in the last 180 seconds in the
mom's status update don't abort it.
b - TRQ-1934. Ask for canonical hostnames on the default address family without
specifying for uniformity in the code.
b - TRQ-2003. For cray fix a miscalculation of nppn and width when mppdepth is
provided for the job.
e - TRQ-1833. Optimize starting jobs by not internally tracking the jobid for each
execution slot used by the job. Reduce string buildup and manipulation in other
internal places as well. Job start for large jobs has been optimized to be up
to 150X faster according to internal testing.
b - TRQ-2030. Fix an ALPS 1.2 bug with labels on nodes. In 1.2 labels would be
repeated like this: labelnamelabelname... Cray only.
b - TRQ-1914. Fix after type dependencies not being removed from arrays.
b - TRQ-2015. Fix a problem where pbs_mom processes get stuck in a defunct state when
doing a qrerun on a job. qrerun is not required to make this happen. Just the
action of requeuing a running job on the mom causes this to happen.
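The TRQ-1923 entry in this release describes a simple recency test before aborting a job that the mother superior does not recognize. A minimal sketch of that test follows. The names are hypothetical and the timestamps are plain seconds; only the 180-second window comes from the entry.

```python
RECENT_REPORT_WINDOW = 180  # seconds, from the TRQ-1923 entry


def should_abort_unrecognized_job(last_reported_at, now):
    """Return True only if the mom's status updates have not mentioned the job
    within the last RECENT_REPORT_WINDOW seconds.

    last_reported_at is None when the job was never seen in a status update.
    """
    if last_reported_at is None:
        return True
    return (now - last_reported_at) > RECENT_REPORT_WINDOW


print(should_abort_unrecognized_job(1000, 1100))  # False: reported 100 s ago
print(should_abort_unrecognized_job(1000, 1300))  # True: reported 300 s ago
```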
4.2.3
b - TRQ-1653. Arrays depending on non-array jobs were broken. Fix this.
b - Add retries on transient failures to setuid and seteuid calls. TRQ-1541.
e - Add support for qstat -f -u <user>. This results in qstat -f output for only
the specified user.
e - TRQ-1798. Make pbs_server calculate mppmaxnodect more accurately for Cray.
e - Add a timeout for mother superior when cleaning up a job. Instead of waiting
infinitely for sisters to confirm that a job has exited, consider the job dead
after 10 minutes. This time can be adjusted by setting $job_exit_wait_time in
the mom's config file (time in seconds). This prevents jobs from being stuck
infinitely if a compute node crashes or if a mom daemon becomes unresponsive.
TRQ-1776.
e - Add the parameter default_features to queues. TRQ-1794. The other way of adding
a feature to all jobs in a queue (setting resources_default.neednodes) is
circumvented if a user requests a feature in the nodes request. Setting
default_features overcomes this issue.
b - If privileged ports are disabled, make pbs_moms not check if incoming connections
from mother superior are on privileged ports. TRQ-1669.
c - TRQ-1784, bugzilla #231. Fix a crash for modifying arrays with qalter.
e - Add two mom config parameters: max_join_job_wait_time and resend_join_job_wait_time.
The first specifies how long pbs_mom should wait before deciding that join jobs
will never be received, and defaults to 10 minutes. The latter specifies how long
pbs_mom should wait before attempting to resend join jobs to moms that it hasn't
received replies from, and this defaults to 5 minutes. Both are specified in
seconds. Prior to this functionality mother superior would wait indefinitely
for the join job replies. Please carefully consider what these values should be
for your site and set them appropriately. TRQ-1790.
e - If an error happens communicating with one MIC, attempt to communicate with the
others instead of failing the entire routine.
e - Reintroduced the procct resource for queues which allows jobs to be managed based
on the number of procs requested. TRQ-1623
b - TRQ-1709. Fix -l gpus=X,other_things being parsed incorrectly.
b - TRQ-1639. Gpu status information wasn't being displayed correctly.
b - TRQ-1826. mppdepth is now passed correctly to the ALPS reservation.
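The interaction between max_join_job_wait_time and resend_join_job_wait_time (TRQ-1790 above) can be sketched as a small decision function. This is an illustration under assumptions, not pbs_mom's logic: the function and its return values are hypothetical, and it assumes the resend timer restarts after each send. Only the defaults of 600 and 300 seconds come from the entry.

```python
def join_job_action(since_first_send, since_last_send,
                    max_join_job_wait_time=600,
                    resend_join_job_wait_time=300):
    """Decide what mother superior does about missing join job replies.

    Both elapsed times are in seconds. Returns 'give_up', 'resend' or 'wait'.
    """
    if since_first_send >= max_join_job_wait_time:
        return "give_up"   # replies will never arrive; stop waiting
    if since_last_send >= resend_join_job_wait_time:
        return "resend"    # try the silent moms again
    return "wait"


print(join_job_action(120, 120))  # wait
print(join_job_action(310, 310))  # resend
print(join_job_action(600, 290))  # give_up
```

Setting the resend interval at or above the maximum wait would mean no resend ever happens in this sketch, which is one reason to pick the two values together.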
4.2.2
b - Make job_starter work for parallel jobs as well as serial. (TRQ-1577 - thanks
to NERSC for the patch)
b - Fix one issue with being able to submit jobs to the cray while offline. TRQ-1595.
e - Make the abort and email messages for jobs more specific when they are killed
for going over a limit. TRQ-1076.
e - Add mom parameter mom_oom_immunize, making the mom immune to being killed in out
of memory conditions. Default is now true. (thanks to Lukasz Flis for this work)
b - Don't count completed jobs against max_user_queuable. TRQ-1420.
e - For mics, set the variable $OFFLOAD_DEVICES with a list of MICs to use for the
job.
b - make pbs_track compatible with display_job_server_suffix = false. The user
has to set NO_SERVER_SUFFIX in the environment. TRQ-1389
b - Fix the way we monitor if a thread is active. Before we used the id, but if the
thread has exited, the id is no longer valid and this will cause a crash. Use
pthread_cleanup functionality instead. TRQ-1745.
b - TRQ-1751. Add some code to handle a corrupted job file where the job file says it
is running but there is no exec host list. These jobs now will receive a system
hold.
b - Fixed problem where max_queuable and max_user_queuable would fail incorrectly.
TRQ-1494
b - Cray: nppn wasn't being specified in reservations. Fix this. TRQ-1660.
4.2.1
b - Fix a deadlock when submitting two large arrays consecutively, the second
depending on the first. TRQ-1646 (reported by Jorg Blank).
4.2.0
f - Support the MIC architecture. This was co-developed with Doug Johnson at
Ohio Supercomputer Center (OSC) and provides support for the Intel® MIC
architecture similar to GPU support in TORQUE.
b - Fix a queue deadlock. TRQ-1435
b - Fix an issue with multi-node jobs not reporting resources completely. TRQ-1222.
b - Make the API not retry for 5 consecutive timeouts. TRQ-1425
b - Fix a deadlock when no files can be copied from compute nodes to pbs_server.
TRQ-1447.
b - Don't strip quotes from values in scripts before specific processing. TRQ-1632
4.1.6
b - Make job_starter work for parallel jobs as well as serial. (TRQ-1577 - thanks
to NERSC for the patch, backported from 4.2.2)
b - Fix one issue with being able to submit jobs to the cray while offline. TRQ-1595.
backported from 4.2.2
4.1.5
b - For cray: make sure that reservations are released when jobs are requeued. TRQ-1572.
b - For cray: support the mppdepth directive. Bugzilla #225.
c - If the job is no longer valid after attempting to lock the array in get_jobs_array(),
make sure the array is valid before attempting to unlock it. TRQ-1598.
e - For cray: make it so you can continue to submit jobs to pbs_server even if you have
restarted it while the cray is offline. TRQ-1595.
b - Don't log an invalid connection message when close_conn() is called on 65535
(PBS_LOCAL_CONNECTION). TRQ-1557.
4.1.4
e - When in cray mode, write physmem and availmem in addition to totmem so that
Moab correctly reads memory info.
e - Specifying size, nodes, and mppwidth are all mutually exclusive, so reject
job submissions that attempt to specify more than one of these. TRQ-1185.
b - Merged changes for revision 7000 by hand because the merge was not clean. This
fixes problems with a deadlock when doing job dependencies using synccount/syncwith.
TRQ-1374
b - Fix a segfault in req_jobobit due to an off-by-one error. TRQ-1361.
e - Add the svn revision to --version outputs. TRQ-1357.
b - Fix a race condition in mom hierarchy reporting. TRQ-1378.
b - Fixed pbs_mom so epilogue will only run once. TRQ-1134
b - Fix some debug output escaping into job output. TRQ-1360.
b - Fixed a problem where server threads all get stuck in a poll. The problem
was an infinite loop created in socket_wait_for_read if poll returned -1.
TRQ-1382
b - Fix a Cray-mode bug with jobs ending immediately when spanning nodes of
different proc counts when specifying -l procs. TRQ-1365.
b - Don't fail to make the tmpdir for sister moms. bugzilla #220, TRQ-1403.
c - Fix crashes due to unprotected array accesses. TRQ-1395.
b - Fixed a deadlock in get_parent_dest_queues when the queue_parent_name
and queue_dest_name are the same. TRQ-1413. 11/7/12
b - Fixed segfault in req_movejob where the job ji_qhdr was NULL. TRQ-1416
b - Fix a conflict in the code for heterogeneous jobs and regular jobs.
b - For alps jobs, use the login nodes evenly even when one goes down. TRQ-1317.
b - Display the correct 'Assigned Cpu Count' in momctl output. TRQ-1307.
b - Make pbs_original_connect() no longer hang if the host is down. TRQ-1388.
b - Make epilogues run only once and be executed by the child and not the main
pbs_mom process. TRQ-937.
b - Reduce the error messages in HA mode from moms. They now only log errors if
no server could be contacted. TRQ-1385.
b - Fixed a seg-fault in send_depend_req. Also fixed a deadlock in depend_on_term.
TRQ-1430 and TRQ-1436.
b - Fixed a null pointer dereference seg-fault when checking for disallowed types
TRQ-1408.
b - Fix a counting problem when running multi-req ALPS jobs (cray only). TRQ-1431.
b - Remove red herring error messages 'did not find work task for local request'.
These tasks are no longer created since issue_Drequest blocks until it gets the
reply and then processes it. TRQ-1423.
b - Fixed a problem where qsub was not applying the submit filter when given in the torque.cfg
file. TRQ-1446
e - When the mom has no jobs, check the aux path to make sure it is clean and
that we aren't leaving any files there. TRQ-1240.
b - Made it so that threads taken up by poll job tasks cannot consume all available
threads in the thread pool. This will make it so other work can continue if
poll jobs get stuck for whatever reason and that the server will recover. TRQ-1433
b - Fix a deadlock when recording alps reservations. TRQ-1421.
b - Fixed a segfault in req_jobobit caused by NULL pointer assignment to variable
pa. TRQ-1467
b - Fixed deadlock in remove_array. remove_array was calling get_arry with allarrays_mutex
locked. TRQ-1466
b - Fixed a problem with an end of file error when running momctl -dx. TRQ-1432.
b - Fix a deadlock in rare cases on job insertion. TRQ-1472.
b - Fix a deadlock after restarting pbs_server when it was SIGKILL'd before a
job array was done cloning. TRQ-1474.
b - Fix a Cray-related deadlock. Always lock the reporter mom before a compute
node. TRQ-1445
b - Additional fix for TRQ-1472. In rm_request on the mom pbs_tcp_timeout was
getting set to 0 which made it so the MOM would fail reading incoming data
if it had not already arrived. This would cause momctl -to fail with an
end of file message.
e - Add a safety net to resend any obits for exiting jobs on the mom that still
haven't cleaned up after five minutes. TRQ-1458.
b - Fix cray running jobs being cancelled after a restart due to jobs not being
set to the login nodes. TRQ-1482.
b - Fix a bug that using -V got rid of -v. TRQ-1457.
b - Make qsub -I -x work again. TRQ-1483.
c - Fix a potential crash when getting the status of a login node in cray mode.
TRQ-1491.
4.1.3
b - fix a security loophole that potentially allowed an interactive job to run
as root due to not resetting a value when $attempt_to_make_dir and $tmpdir
are set. TRQ-1078.
b - fix down_on_error for the server. TRQ-1074.
b - prevent pbs_server from spinning in select due to sockets in CLOSE_WAIT.
TRQ-1161.
e - Have pbs_server save the queues each time before exiting so that legacy
formats are converted to xml after upgrading. TRQ-1120.
b - Fix phantom jobs being left on the pbs_moms and blocking jobs for Cray
hardware. TRQ-1162. (Thanks Matt Ezell)
b - Fix a race condition on free'd memory when checking for orphaned alps
reservations. TRQ-1181. (Thanks Matt Ezell)
b - If interrupted when reading the terminal type for an interactive job continue
trying to read instead of giving up. TRQ-1091.
b - Fix displaying elapsed time for a job. TRQ-1133.
b - Make offlining nodes persistent after shutting down. TRQ-1087.
b - Fixed a memory leak when calling net_move. net_move allocates memory for args
and starts a thread on send_job. However, args were not getting released
in send_job. TRQ-1199
b - Changed pbs_connect to check for a server name. If it is passed in only that
server name is tried for a connection. If no server name is given then the
default list is used. The previous behavior was to try the name passed in and
the default server list. This would lead to confusion in utilities like qstat
when querying for a specific server. If the server specified was not available,
information from the remaining list would still be returned.
TRQ-1143.
e - Make issue_Drequest wait for the reply and have functions continue processing
immediately after instead of the added overhead of using the threadpool.
c - tm_adopt() calls caused pbs_mom to crash. Fix this. TRQ-1210.
b - Array element 0 wasn't showing up in qstat -t output. TRQ-1155.
b - Cores with multiple processing units were being incorrectly assigned in cpusets.
Additionally, multi-node jobs were getting the cpu list from each node in each
cpuset, also causing problems. TRQ-1202.
b - Finding subjobs (for heterogeneous jobs) wasn't compatible with hostnames that
have dashes. TRQ-1229.
b - Removed the call to wait_request from the main_loop on pbs_server. All of our communication
is handled directly and there is no longer a need to wait for an out of band
reply from a client. TRQ-1161.
e - Modified output for qstat -r. Expanded Req'd Time to include seconds and centered Elap Time
over its column.
b - Fixed a bug found at Univ. of Michigan where a corrupt .JB file would cause
pbs_server to seg-fault and restart.
b - Don't leave quotes on any arguments passed to the resource list. TRQ-1209.
b - Fix a race condition that causes deadlock when two threads are routing the same job.
b - Fixed a bug with qsub where environment variables were not getting populated with the
-v option. TRQ-1228.
b - This time for sure. TRQ-1228. When max_queuable or max_user_queuable were set it
was still possible to go over the limit. This was because a job is qualified
in the call to req_quejob but does not get inserted into the queue until svr_enquejob
is called in req_commit, four network requests later. In a multi-threaded environment
this allowed several jobs to be qualified and put in the pipeline before they
were actually committed to a queue.
b - If max_user_queuable or max_queuable were set on a queue TORQUE would not honor
the limit when filling those queues from a routing queue. This has now
been fixed. TRQ-1088.
b - Fixed seg-fault when running jobs asynchronously. TRQ-1252.
b - Fixed a bug with SIGHUP to pbs_server. The signal handler (change_logs()) does file I/O
which is not safe inside a signal handler. This caused pbs_server to be up but
unresponsive to any commands. TRQ-1250 and TRQ-1224
b - Job dependencies didn't work with display_job_server_suffix=false. Fixed. TRQ-1255.
b - Don't report alps reservation ids if a node is in interactive mode. TRQ-1251.
b - Only attempt to cancel an orphaned alps reservation a maximum of one time per
iteration. TRQ-1251.
b - Fix a deadlock when recording an alps reservation on the server side. Cray only.
TRQ-1272.
c - Fix mismanagement of the ji_globid. TRQ-1262.
c - Setting display_job_server_suffix=false crashed with job arrays. Fixed. bugzilla #216
b - Restore the asynchronous functionality. TRQ-1284.
e - Made it so pbs_server will come up even if a job cannot recover because of a missing
job dependency. TRQ-1287
b - Fixed a segfault in the path from do_tcp to tm_request to tm_eof. In this path we freed
the tcp channel three times. The call to DIS_tcp_cleanup was removed from tm_eof and
tm_request. TRQ-1232.
b - Fix a deadlock in logging when the machine is out of disk space. TRQ-1302.
b - Fixed a deadlock which occurs when there is a job with a dependency that is being moved
from a routing queue to an execution queue. TRQ-1294
e - Retry cleanup with the mom every 20 seconds for jobs that are stuck in an exiting state.
TRQ-1299.
b - Enabled qsub filters to be accessed from a non-default location. TRQ-1127
b - Put back the ability to write the resources_used data to the accounting logs. This was in 4.1.1
and 4.1.2 but failed to make it into 4.1.3. TRQ-1329
c - Fix a double free if the same chan is stored on two tasks for a job. TRQ-1299.
b - Changed pbs_original_connect to retry a failed connect attempt
MAX_RETRIES (5) times before returning failure. This will
reduce the number of client commands that fail due to a connection
failure. TRQ-1355
b - Fix the proliferation of "Non-digit found where a digit was expected" messages, due
to an off-by-one error. TRQ-1230.
b - Fixed a deadlock caused by queue not getting released when jobs are aborted when
moving jobs from a routing queue to an execution queue. TRQ-1344.
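The retry behavior described above for pbs_original_connect (TRQ-1355) can be
illustrated with a minimal sketch. This is not TORQUE's C code: connect_once is a
hypothetical callable standing in for a single connect attempt, the optional
delay is an addition for illustration, and the sketch assumes that MAX_RETRIES
counts the total number of attempts.

```python
import time

MAX_RETRIES = 5  # value quoted in the changelog entry


def connect_with_retry(connect_once, max_retries=MAX_RETRIES, delay_seconds=0.0):
    """Call connect_once() until it succeeds or max_retries attempts have failed.

    connect_once is a hypothetical callable: it returns a connection handle
    on success and raises OSError on failure.
    """
    last_error = None
    for _ in range(max_retries):
        try:
            return connect_once()
        except OSError as err:
            last_error = err
            if delay_seconds > 0:
                time.sleep(delay_seconds)  # optional pause between attempts
    # Every attempt failed: report a single failure to the caller.
    raise ConnectionError(
        f"connect failed after {max_retries} attempts"
    ) from last_error
```

A client command built on such a helper only sees a failure when every attempt
fails, which is the effect the entry describes.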
4.1.2
e - Add the ability to run a single job partially on CRAY hardware and partially
on hardware external to the CRAY in order to allow visualization of
large simulations.
4.1.1
e - pbs_server will now detect and release orphaned ALPS reservations
b - Fixed a deadlock with nodes in stream_eof after call to svr_connect.
b - resources_used information now appears in the accounting log again
TRQ-1083 and bugzilla 198.
b - Fixed a seg-fault found at LBNL where freeaddrinfo would crash because
of uninitialized memory.
b - Fixed a deadlock in handle_complete_second_time. We were not unlocking
when exiting svr_job_purge.
e - Added the wrappers lock_ji_mutex and unlock_ji_mutex to do the mutex locking
for all job->ji_mutex locks.
e - admins can now set the global max_user_queuable limit using qmgr. TRQ-978.
b - No longer make multiple alps reservation parameters for each alps reservation.
This creates problems for the aprun -B command.
b - Fix a problem running extremely large jobs with alps 1.1 and 1.2. Reservations
weren't correctly created in the past. TRQ-1092.
b - Fixed a deadlock with a queue mutex caused by calling qstat -a <queue1> <queue2>
b - Fixed a memory corruption bug, double free in check_if_orphaned. To fix this
issue_Drequest was modified to always free the batch request regardless of
any errors.
b - Fix a potential segfault when using munge but not having set authorized users.
TRQ-1102
b - Added a modified version of a patch submitted by Matt Ezell for Bugzilla 207.
This fixes a seg-fault in qsub if Moab passes an environment variable without
a value.
b - fix an error in parsing environment variables with commas, newlines, etc. TRQ-1113
b - fixed a deadlock with array jobs running simultaneously with qstat.
b - Fixed qsub -v option. Variable list was not getting passed in to job environment.
TRQ-1128
b - TRQ-1116. mail is now sent on job start again.
b - TRQ-1118. Cray jobs are now recovered correctly after a restart.
b - TRQ-1109. Fixed x11 forwarding for interactive jobs. (qsub -I -X). Previous to
this fix interactive jobs would not run any x applications such as xterm, xclock,
etc.
b - TRQ-1161, Fixes a problem where TORQUE gets into a high CPU utilization condition.
The problem was that in the function process_pbs_server_port no error was
returned if the call to getpeername() failed in the default case.
b - TRQ-1161. This fixes another case that would cause a thread to spin on poll
in start_process_pbs_server_port. A call to the dis function would return
an error but the code would close the connection and return the error code which
was a value less than 20. start_process_pbs_server_port did not recognize the low
error code value and would keep calling into process_pbs_server_port.
b - qdel'ing a running job in the cray environment was trying to communicate with the
cray compute instead of the login node. This is now fixed. TRQ-1184.
b - TRQ-1161. Fixed a problem in stream_eof where a svr_connect was used to connect
to a MOM to see if it was still there. On successful connection the connection
is closed but the wrong function (close_conn) with the wrong argument (the
handle returned by svr_connect()) was used. Replaced with svr_disconnect
b - Make it so that procct is never shown to Moab or users. TRQ-872.
b - TRQ-1182. Fixed a problem where jobs with dependencies were deleted on
the restart of pbs_server.
b - TRQ-1199. Fixed memory leaks found by Valgrind. Fixed a leak when routing jobs
to a remote server, memory leak with procct, memory leak creating queues,
memory leak with mom_server_valid_message_source and a memory leak in req_track.
4.1.0
e - make free_nodes() only look at nodes in the exec_host list and not examine
all nodes to check if the job at hand was there. This should greatly speed
up freeing nodes.
f - add the server parameter interactive_jobs_can_roam (Cray only). When set to
true, interactive jobs can have any login as mother superior, but by default
all interactive jobs will have their submit_host as mother superior
b - Fixed TRQ-696. Jobs get stuck in running state.
b - Fixed a problem where interactive jobs using X-forwarding would fail
because TORQUE thought DISPLAY was not set. The problem was that
DISPLAY was set using lowercase internally. TRQ-1010
e - Add a hostname/address caching feature to alleviate stress on DNS.
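The hostname/address caching idea in the 4.1.0 notes above can be sketched as
follows. This is an illustrative sketch only, not TORQUE's implementation: the
AddressCache class, its method names, and the lookup counter are hypothetical,
and the resolver is injectable so the behavior can be checked without touching
DNS.

```python
import socket


class AddressCache:
    """Remember hostname -> address lookups so repeated queries skip the resolver."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver  # injectable so the sketch can run offline
        self._by_name = {}
        self.lookups = 0           # how many times the resolver was really called

    def resolve(self, hostname):
        # Only the first request for a name reaches the resolver.
        if hostname not in self._by_name:
            self.lookups += 1
            self._by_name[hostname] = self._resolver(hostname)
        return self._by_name[hostname]

    def invalidate(self, hostname):
        # Drop a stale entry, e.g. after a node's address changes.
        self._by_name.pop(hostname, None)
```

The trade-off of any such cache is staleness: an entry stays in use until it is
explicitly invalidated, so a real implementation needs some policy for that.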
4.0.3
b - fix qdel -p all - was performing a qdel all. TRQ-947
b - fix some memory leaks in 4.0.2 on the mom and server TRQ-944
c - TRQ-973. Fix a possibility of a segfault in netcounter_incr()
b - removed memory manager from alloc_br and free_br to solve a memory leak
b - fixes to communications between pbs_sched and pbs_server. TRQ-884
b - fix server crash caused by gpu mode not being right after gpus=x:. TRQ-948.
b - fix logic in torque.setup so it does not say successfully started when
trqauthd failed to start. TRQ-938.
b - fix segfaults on job deletes, dependencies, and cases where a batch
request is held in multiple places. TRQ-933, 988, 990
e - TRQ-961/bugzilla-176 - add the configure option --with-hwloc-path=PATH
to allow installing hwloc to a non-default location.
c - fix a crash when using job dependencies that fail - TRQ-990
e - Cache addresses and names to prevent calling getnameinfo() and getaddrinfo()
too often. TRQ-993
c - fix a crash around re-running jobs
e - change so some Moab environment variables will be put into environment for
the prologue and epilogue scripts. TRQ-967.
b - make command line arguments override the job script arguments. TRQ-1033.
b - fix a pbs_mom crash when using blcr. TRQ-1020.
e - Added patch to buildutils/pbs_mkdirs.in which enables pbs_mkdirs to run
silently. Patch submitted by Bas van der Vlies. Bugzilla 199.
4.0.2
e - Change so init.d script variables get set based on the configure command.
TRQ-789, TRQ-792.
b - Fix so qrun jobid[] does not cause pbs_server segfault. TRQ-865.
b - Fix to validate qsub -l nodes=x against resources_max.nodes the same as v2.4.
TRQ-897.
b - bugzilla #185. Empty arrays should no longer be loaded and now when qdel'ed
they will be deleted.
b - bugzilla #182. The serverdb will now correctly write out memory allocated.
b - bugzilla #188. The deadlock when using job logging is resolved
b - bugzilla #184. pbs_server will no longer log an erroneous error when the 12th
job array is submitted.
e - Allow pbs_mom to change users group on stderr/stdout files. Enabled by configuring
Torque with CFLAGS='-DRESETGROUP'. TRQ-908.
e - Have the parent intermediate mom process wait for the child to open the demux before
moving on for more precise synchronization for radix jobs.
e - Changed the way jobs queued in a routing queue are updated. A thread is now launched
at startup and by default checks every 10 seconds to see if there are jobs
in the routing queues that can be promoted to execution queues.
b - Fix so pbs_mom will compile when configured with --with-nvml-lib=/usr/lib and
--with-nvml-include. TRQ-926.
b - fix pbs_track to add its process to the cpuset as well. TRQ-925.
b - Fix so gpu count gets written out to server nodes file when using
--enable-nvidia-gpus. TRQ-927.
b - change pbs_server to listen on all interfaces. TRQ-923
b - Fix so "pbs_server --ha" does not fail when checking path for server.lock file. TRQ-907.
b - Fixed a problem in qmgr where only 9 commands could be completed before a failure.
Bugzilla 192 and TRQ-931
b - Fix to prevent deadlock on server restart with completed job that had a dependency.
TRQ-936.
b - prevent TORQUE from losing connectivity with Moab when starting jobs asynchronously
TRQ-918
b - prevent the API from segfaulting when passed a negative socket descriptor
b - don't allow pbs_tcp_timeout to ever be less than 5 minutes - may be temporary
b - fix pbs_server so it fails if another instance of pbs_server is already
running on same port. TRQ-914.
4.0.1
b - Fix trqauthd init scripts to use correct path to trqauthd.
b - fix so multiple stage in/out files can again be used with qsub -W
b - fix so comma separated file list can be used with qsub -W stagein/stageout.
Matches qsub documentation again.
b - Only seed the random number generator once
b - The code to run the epilogue set of scripts was removed when refactoring the
obit code. The epilogues are now run as part of post_epilogue. preobit_reply
is no longer used.
b - if using a default hierarchy and moms on non-default ports, pass that information
along in the hierarchy
e - Make pbs_server contact pbs_moms in the order in which they appear in the hierarchy
in order to reduce errors on start-up of a large cluster.
b - fix another possibility for deadlock with routing queues
e - move some of the main loop functionality to the threadpool in order to increase
responsiveness.
e - Enabled the configuration to be able to write the path of the library directory
to /etc/ld.so.conf.d in a file named libtorque.conf. The file will be created
by default during make install. The configuration can be made to not install this
file by using the configure option --without-loadlibfile
b - Fixed a bug where Moab was using the option SYNCJOBID=TRUE which allows Moab
to create the job ids in TORQUE. With this in place if TORQUE were terminated
it would delete all jobs submitted through msub when pbs_server was restarted.
This fix recovers all jobs whether submitted with msub or qsub when pbs_server
restarts.
b - fix for where pbsnodes displays outdated gpu_status information.
b - fix problem with '+' and segfault when using multiple node gpu requests.
b - Fixed a bug in svr_connect. If the value for func were null then the newly
created connection was not added to the svr_conn table. This was not right.
We now always add the new connection to svr_conn.
b - fix problem with mom segfault when using 8 or more gpus on mom node.
b - Fix so child pbs_mom does not remain running after qdel on slow starting job.
TRQ-860.
b - Made it so the MOM will let pbs_server know it is down after momctl -s is invoked.
e - Made it so localhost is no longer hard coded. The string comes from getnameinfo.
b - fix a mom hierarchy error for running the moms on non-default ports
b - Fix server segfault for where mom in nodes file is not in mom_hierarchy. TRQ-873.
b - Fix so pbs_mom won't segfault after a qdel is done for a job that is still
running the prologue. TRQ-832.
b - Fix for segfault when using routing queues in pbs_server. TRQ-808
b - Fix so epilogue.precancel runs only once and only for cancelled jobs. TRQ-831.
b - Added a close socket to validate_socket to properly terminate the connection.
Moved the free of the incoming variable sock to process_svr_conn from the
beginning of the function to the end. This fixed a problem where the client
would always get a RST when trying to close its end of the connection.
b - routing to a routing queue now works again, TRQ-905, bugzilla 186
b - Fix server segfaults that happened doing qhold for blcr job. TRQ-900.
n - TORQUE 4.0.1 released 5/3/2012
4.0.0
e - make a threadpool for TORQUE server. The number of threads is
customizable using min_threads and max_threads, and idle time before
exiting can be set using thread_idle_seconds.
e - make pbs_server multi-threaded in order to increase responsiveness and scalability.
e - remove the forking from pbs_server running a job, the thread handling the request just
waits until the job is run.
e - change qdel to simply send qdel all - previously this was executed by a qstat and a qdel
of every individual job
e - no longer fork to send mail, just use a thread
e - use hwloc as the backbone for cpuset support in TORQUE (contributed by Dr. Bernd Kallies)
e - add the boolean variable $use_smt to mom config. If set to false, this skips logical
cores and uses only physical cores for the job. It is true by default.
(contributed by Dr. Bernd Kallies)
n - with the multi-threading the pbs_server -t create and -t cold commands could no longer
ask for user input from the command line. The call to ask if the user wants to continue
was moved higher in the initialization process and some of the wording changed to
reflect what is now happening.
e - if cpusets are configured but aren't found and cannot be mounted, pbs_mom will now fail to
start instead of failing silently.
e - Change node_spec from an N^2 (but average 5N) algorithm to an N algorithm with respect
to nodes. We only loop over each node once at a maximum.
e - Abandon pbs_iff in favor of trqauthd. trqauthd is a daemon to be started once that can
perform pbs_iff's functionality, increasing speed and enabling future security
enhancements
e - add mom_hierarchy functionality for reporting. The file is located in
<TORQUE_HOME>/server_priv/mom_hierarchy, and can be written to tell moms to send
updates to other moms who will pass them on to pbs_server. See docs for details
e - add a unit testing framework (check). It is compiled with --with-check and tests
are executed using make check. The framework is complete but not many tests have
been written as of yet.
b - Made changes to IM protocol where commands were either not waiting for a reply
or not sending a reply. Also made changes to close connections that were left
open.
b - Fix for where qmgr record_job_info is True and server hangs on startup.
e - Mom rejection messages are now passed back to qrun when possible
e - Added the option -c for startup. By default, the server attempts to send the mom
hierarchy file to all moms on startup, and all moms update the server and request
the hierarchy file. If both are trying to do this at once, it can cause a lot of
traffic. -c tells pbs_server to wait 10 minutes to attempt to contact moms that
haven't contacted it, reducing this traffic.
e - Added mom parameter -w to reduce start times. With this parameter the mom waits to send its
first update until the server sends it the mom hierarchy file, or until 10
minutes have passed. This should reduce large cluster startup times.
3.0.5
b - fix for writing too much data when job_script is saved to job log.
b - fix for where pbs_mom would not automatically set gpu mode.
b - fix for aligning qstat -r output when configured with -DTXT.
e - Change size of transfer block used on job rerun from 4k to 64k.
b - With nvidia gpus, TORQUE was losing the directive of what nodes it should
run the job on from Moab. Corrected.
e - add the $PBS_WALLTIME variable to jobs, thanks to a patch from Mark Roberts
n - change moab_array_compatible server parameter so it defaults to true
e - change to allow pbs_mom to run if configured with --enable-nvidia-gpus but
installed on a node without Nvidia gpus.
3.0.4
c - fix a buffer being overrun with nvidia gpus enabled
b - no longer leave zombie processes when munge authenticating.
b - no longer reject procs if it is the second argument to -l
b - when having pbs_mom re-read the config file, old servers were kept, and pbs_mom
attempted to communicate with those as well. Now they are cleared and only the
new server(s) are contacted.
b - pbsnodes -l can now search on all valid node states
e - Added functionality that allows the values for the server parameter
authorized_users to use wild cards for both the user and host portion.
e - Improvements in munge handling of client connections and authentication.
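The wildcard matching added for authorized_users in the 3.0.4 notes above can be
illustrated with a small sketch. It assumes entries take the form user@host and
uses Python's fnmatch for the matching. TORQUE's own matching rules may differ:
fnmatch also accepts ? and [..] patterns, and this sketch compares names
case-sensitively. The is_authorized helper and the sample names are
hypothetical.

```python
from fnmatch import fnmatchcase


def is_authorized(user, host, patterns):
    """Return True if user@host matches any 'userpattern@hostpattern' entry."""
    for pattern in patterns:
        user_pat, sep, host_pat = pattern.partition("@")
        if not sep:
            continue  # entry without a host part: treated as malformed here
        # Both halves must match; '*' may appear in either one.
        if fnmatchcase(user, user_pat) and fnmatchcase(host, host_pat):
            return True
    return False
```

For example, under these assumptions the entry moab@node* would accept user moab
from node07 but reject any other user from the same host.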
3.0.3
b - fix for bugzilla #141 - qsub was overwriting the path variable in PBSD_authenticate
e - automatically create and mount /dev/cpuset when TORQUE is configured but the cpuset
directory isn't there
b - fix a bug where node lines past 256 characters were rejected. This buffer has been
made much larger (8192 characters)
b - clear out exec_gpus as needed
b - fix for bugzilla #147 - recreate $PBS_NODESFILE file when restarting a blcr
checkpointed job
b - Applied patch submitted by Eric Roman for resmom/Makefile.am (Bugzilla #147)
b - Fix for adding -lcr for BLCR makefiles (Bugzilla #146)
c - fix a potential segfault when using asynchronous runjob with an array slot limit
b - fix bugzilla #135, stagein was deleting directory instead of file
b - fix bugzilla #133, qsub submit filter, the -W arguments are not all there
e - add a mom config option - $attempt_to_make_dir - to give the user the option to
have TORQUE attempt to create the directories for their output file if they don't exist
b - Fixed momctl to return an error on failure. Prior to this fix momctl always returned 0
regardless of success or failure.
e - Change to allow qsub -l ncpus=x:gpus=x which adds a resource list entry for both
b - fix so user epilogues are run as user instead of root
b - No longer report a completion code if a job is pre-empted using qrerun.
c - Fix a crash in record_jobinfo() - this is fixed by backporting dynamic strings from
4.0.0 so that all of the resizing is done in a central location, fixing the crash.
b - No longer count down walltime for jobs that are suspending or have stopped running
for any other reasons
e - add a mom config option - $ext_pwd_retry - to specify # of retries on
checking for password validity.
3.0.2
c - check if the file pointer to /dev/console can be opened. If not, don't attempt to write to it
b - fix a potential buffer overflow security issue in job names and host address names
b - restore += functionality for nodes when using qmgr. It was overwriting old properties
b - fix bugzilla #134, qmgr -= was deleting all entries
e - added the ability in qsub to submit jobs requesting total gpus for job instead of gpus per node:
-l ncpus=X,gpus=Y
b - do not prepend ${HOME} with the current dir for -o and -e in qsub
e - allow an administrator using the proxy user submission to also set the job id to be used
in TORQUE. This makes TORQUE easier to use in grid configurations.
b - fix jobs named with -J not always having the server name appended correctly
b - make it so that jobs named like arrays via -J have legal output and error file names
b - make a fix for ATTR_node_exclusive - qsub wasn't accepting -n as a valid argument
3.0.1
e - updated qsub's man page to include ATTR_node_exclusive
b - when updating the nodes file, write out the ports for the mom if needed
b - fix a bug for non-NUMA systems that was continuously increasing memory values
e - the queue files are now stored as XML, just like the serverdb
e - Added code from 2.5-fixes which will try and find nodes that did not
resolve when pbs_server started up. This is in reference to Bugzilla
bug 110.
e - make gpus compatible with NUMA systems, and add the node attribute
numa_gpu_node_str for an additional way to specify gpus on node boards
e - Add code to verify the group list as well when VALIDATEGROUPS is set in torque.cfg
b - Fix a bug where if geometry requests are enabled and cpusets are enabled, the cpuset
wasn't deleted unless a geometry request was made.
b - Fix a race condition for pbs_mom -q, exitstatus was getting overwritten and as a result
jobs weren't always re-queued on pbs_server, but were being deleted instead.
e - Add a configure option --with-tcp-retry-limit to prevent potential 4+ hour hangs on
pbs_server. We recommend --with-tcp-retry-limit=2
n - Changing the way to set ATTR_node_exclusive from -E to -n, in order to continue
compatibility with Moab.
b - preserve the order on array strings in TORQUE, like the route_destinations for a
routing queue
b - fix bugzilla #111, multi-line environment variables causing errors in TORQUE.
b - allow apostrophes in Mail_Users attributes, as apostrophes are rare but legal email
characters
b - restored functionality for -W umask as reported in bugzilla 115
b - Updated torque.spec.in to be able to handle the snapshot names of builds.
b - fix pbs_mom -q to work with parallel jobs
b - Added code to free the mom.lock file during MOM shutdown.
e - Added new MOM configure option job_starter. This option will pass
the script submitted in qsub to the executable or script provided
b - fixed a bug in set_resources that prevented the last resource in a list from being
checked. As a result the last item in the list would always be added
without regard to previous entries.
e - altered the prologue/epilogue code to allow root squashing
f - added the mom config parameter $reduce_prolog_checks. This makes it so TORQUE only checks
to verify that the file is a regular file and is executable.
e - allow more than 5 concurrent connections to TORQUE using pbsD_connect. Increase it to 10
b - fix a segfault when receiving an obit for a job that no longer exists
e - Added options to conditionally build munge, BLCR, high-availability, cpusets,
and spooling. Also allows customization of the sendmail path and allows for
optional XML conversion to serverdb.
b - also remove the procct resource when it is applied because of a default
c - fix a segfault when queue has acl_group_enable and acl_group_sloppy set
true and no acl_groups are defined.
3.0.0
e - serverdb is now stored as xml, this is no longer configurable.
f - added --enable-numa-support for supporting NUMA-type architectures. We
have tested this build on UV and Altix machines. The server treats the
mom as a node with several special numa nodes embedded, and the pbs_mom
reports on these numa nodes instead of itself as a whole.
f - for numa configurations, pbs_mom creates cpusets for memory as well as
cpus
e - adapted the task manager interface to interact properly with NUMA
systems, including tm_adopt
e - Added autogen.sh to make life easier in a Makefile.in-less world.
e - Modified buildutils/pbs_mkdirs.in to create server_priv/nodes file
at install time. The file only shows examples and a link to the
TORQUE documentation.
f - added ATTR_node_exclusive to allow a job to have a node exclusively.
f - added --enable-memacct to use an extra protocol in order to
accurately track jobs that exceed their memory limits and kill
them
e - when ATTR_node_exclusive is set, reserve the entire node (or entire
numa node if applicable) in the cpuset
n - Changed the protocol versions for all client-to-server, mom-to-server and
mom-to-mom protocols from 1 to 2. The changes to the protocol in this version
of TORQUE will make it incompatible with previous versions.
e - when a select statement is used, tally up the memory requests and mark
the total in the resource list. This allows memory enforcement for
NUMA jobs, but doesn't affect others as memory isn't enforced for
multinode jobs
e - add an asynchronous option to qdel
b - do not reply when an asynchronous reply has already been sent
e - make the mem, vmem, and cput usage available on a per-mom basis using momctl -d2
(Dr. Bernd Kallies)
e - move the memory monitor functionality to linux/mom_mach.c in order to store the
more accurate statistics for usage, and still use it for applying limits.
(Dr. Bernd Kallies)
e - when pbs_mom is compiled to use cpusets, instead of looking at all processes,
only examine the ones in cpuset task files. For busy machines (especially large
systems like UVs) this can greatly reduce job monitoring/harvesting times.
(Dr. Bernd Kallies)
e - when cpusets are configured and memory pressure enabled, add the ability to
check memory pressure for a job. Using $memory_pressure_threshold and
$memory_pressure_duration in the mom's config, the admin sets a threshold at
which a job becomes a problem. If duration is set, the job will be killed if
it exceeds the threshold for the configured number of checks. If duration isn't
set, then an error is logged.
(Dr. Bernd Kallies)
e - change pbs_track to look for the executable in the existing path so it doesn't always
need a complete path.
(Dr. Bernd Kallies)
e - report sessions on a per numa node basis when NUMA is enabled
(Dr. Bernd Kallies)
b - Merged revision 4325 from 2.5-fixes. Fixed a problem where the -m n
(request no mail on qsub) was not always being recognized.
e - Merged buildutils/torque.spec.in from 2.4-fixes.
Refactored torque spec file to comply with established RPM best
practices, including the following:
- Standard installation locations based on RPM macro configuration
(e.g., %{_prefix})
- Latest upstream RPM conditional build semantics with fallbacks for
older versions of RPM (e.g., RHEL4)
- Initial set of optional features (GUI, PAM, syslog, SCP) with more
planned
- Basic working configuration automatically generated at install-time
- Reduce the number of unnecessary subpackages by consolidating where
it makes sense and using existing RPM features (e.g., --excludedocs).
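The $memory_pressure_threshold / $memory_pressure_duration behavior described in
the 3.0.0 notes above can be sketched as a small state machine. This is an
illustration under stated assumptions, not pbs_mom's code: the class name and
the returned strings are hypothetical, "exceeds" is taken to mean strictly
greater than the threshold, and the checks are assumed to be consecutive, so the
counter resets whenever pressure drops back to or below the threshold.

```python
class MemoryPressureMonitor:
    """Track one job's memory pressure across periodic checks."""

    def __init__(self, threshold, duration=None):
        self.threshold = threshold
        self.duration = duration  # None models "duration isn't set"
        self.over_count = 0       # consecutive checks above the threshold

    def check(self, pressure):
        """Return 'ok', 'over', 'log', or 'kill' for one pressure reading."""
        if pressure <= self.threshold:
            self.over_count = 0   # assumption: a good reading resets the count
            return "ok"
        self.over_count += 1
        if self.duration is None:
            return "log"          # no duration configured: only log an error
        if self.over_count >= self.duration:
            return "kill"         # over threshold for the configured number of checks
        return "over"             # over threshold, but not for long enough yet
```

With a threshold of 1000 and a duration of 3, the readings 1500, 1500, 500, 1500,
1500, 1500 would yield over, over, ok, over, over, kill under these assumptions.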
2.5.10
b - Fixed a problem where pbs_mom would crash if check_pwd returns NULL. This could
happen for example if LDAP was down and getpwnam returns NULL.
e - Added code to delete a job on the MOM if a job is in the EXITED substate and
going through the scan_for_exiting code. This happens when an obit has been
sent and the obit reply received, but the PBS_BATCH_DeleteJob has not been
received from the server on the MOM. This fix allows the MOM to delete the
job and free up resources even if the server for some reason does not send
the delete job request.
b - TRQ-608: Removed code to check for blocking mode in write_nonblocking_socket().
Fixes problem with interactive jobs (qsub -I) exiting prematurely.
c - fix a buffer being overrun with nvidia gpus enabled (backported from 3.0.4)
b - Fixed a problem in 2.5.9 where the job_array structure was modified
without changing the version or creating an upgrade path. This made
it incompatible with previous versions of TORQUE 2.5 and 3.0.
Added new array structure job_array_259. This is the original torque
2.5.9 job_array structure with the num_purged element added in the middle
of the structure. job_array_259 was created so users could upgrade from 2.5.9
and 3.0.3 to later versions of TORQUE. The job_array structure was
modified by moving the num_purged element to the bottom of the structure.
pbsd_init now has an upgrade path for job arrays from version 3 to version
4. However, there is an exceptional case when upgrading from 2.5.9 or 3.0.3
where pbs_server must be started using a new -u option.
b - no longer leave zombie processes when munge authenticating. (backported from 3.0.4)
2.5.9
e - change mom to only log "cannot find nvidia-smi in PATH" once when built
with --enable-nvidia-gpus and running on a node that does not have Nvidia
drivers installed.
b - Change so gpu states get set/unset correctly. Fixes problems with multiple
exclusive jobs being assigned to same gpu and where next job gets rejected
because gpu state was not reset after last shared gpu job finished.
e - Added a 1 millisecond sleep to src/lib/Libnet/net_client.c client_to_svr()
if connect fails with EADDRINUSE, EINVAL or EADDRNOTAVAIL. For these cases
TORQUE will retry the connect again. This fix increases the chance of success
on the next iteration.