This page refers to the gLite version of CREAM. For CREAM released with EMI, please refer to the new CREAM wiki: http://wiki.italiangrid.org/CREAM

Known problems in the CREAM software, or in other software modules affecting a CREAM-based CE (the list refers to known problems affecting the release of the software currently in production)

  • Bug #86238: Torque should be configured to suppress all mails (mail_domain=never). Otherwise the bupdater process of the blparser will keep dying.
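A minimal sketch of how this setting might be applied and verified, assuming the standard Torque `qmgr` client is available on the Torque server host. The helper name `torque_mail_suppressed` is hypothetical; it only inspects `qmgr -c 'print server'` output passed on standard input.

```shell
# Return 0 if the server configuration read from stdin disables mail,
# i.e. contains a line of the form "set server mail_domain = never".
# (Hypothetical helper; not part of Torque or CREAM.)
torque_mail_suppressed() {
    grep -Eq '^[[:space:]]*set[[:space:]]+server[[:space:]]+mail_domain[[:space:]]*=[[:space:]]*never[[:space:]]*$'
}

# Intended use on the Torque server (run as a Torque manager):
#   qmgr -c 'set server mail_domain = never'
#   qmgr -c 'print server' | torque_mail_suppressed && echo "mail suppressed"
```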
  • Bug #78062: with CREAM CE 1.6.4, in the standard output of the job wrapper the following message is shown:
jw_echo: command not found.
The workaround is to modify the CREAM jw template, as described here, replacing "jw_echo" with "echo". In any case, this issue does not cause any particular problem.
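The replacement of "jw_echo" with "echo" could be scripted roughly as follows. This is only a sketch: the function name `fix_jw_echo` is hypothetical, and the template path is left as an argument because it depends on how the job wrapper template was customized at the site.

```shell
# Replace every call to the missing jw_echo helper with plain echo in
# the given job wrapper template. The original file is kept as
# <template>.orig so the change can be reverted.
fix_jw_echo() {
    template=$1
    cp -p "$template" "$template.orig" || return 1   # keep a backup
    sed 's/jw_echo/echo/g' "$template.orig" > "$template"
}
```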
  • Bug #78331: The bupdater log file is not created with SGE
  • When installing/updating a CREAM CE node, a dependency problem such as:
Error: Missing Dependency: libcares.so.0()(64bit) is needed by package glite-security-gss-2.0.0-6.sl5.x86_64
may be reported. In this case, run the update as follows:
yum update --exclude=c-ares
In case of a fresh installation, instead, add an exclude line to the .repo file (/etc/yum.repos.d/slc5-updates.repo or main SL repository, starting with SL 5.6):
exclude=c-ares
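For a fresh installation, adding the exclude line could be automated along these lines. This is a sketch under assumptions: the function name `add_cares_exclude` is hypothetical, and the file is assumed not to carry any other `exclude=` line (if it does, the sketch refuses to edit it, because the lists would have to be merged by hand).

```shell
# Append "exclude=c-ares" after every [section] header of a yum .repo
# file, unless the file already excludes c-ares.
add_cares_exclude() {
    repo=$1
    if grep -Eq '^exclude=.*c-ares' "$repo"; then
        return 0                      # already excluded, nothing to do
    fi
    if grep -q '^exclude=' "$repo"; then
        echo "existing exclude= line in $repo, edit manually" >&2
        return 1
    fi
    awk '{ print } /^\[.*\]/ { print "exclude=c-ares" }' "$repo" > "$repo.tmp" &&
        cat "$repo.tmp" > "$repo" && rm -f "$repo.tmp"
}

# Example: add_cares_exclude /etc/yum.repos.d/slc5-updates.repo
```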
  • GGUS ticket #55015: when the CREAM CE is not a Torque server, there can be communication errors if the maui (and probably torque) server and client are NOT from the same build.
A common scenario/example when this can happen:
  • The maui server is a 32bit binary deployed on a 32bit LCG-CE
  • The 64bit maui client is deployed on a 64bit CREAM-CE
From the CREAM-CE node perform:
[root@cream-ce]# diagnose -g
If you see:
ERROR:    lost connection to server
ERROR:    cannot request service (status)
you are affected by the problem.
A possible workaround is the following:
  • At the LCG-CE:
    • Create a cron file to dump the `diagnose -g` output to a file
    [root@lcg-ce]# cat <<EOF >> /etc/cron.d/diagnose-for-cream
    > */5 * * * * root  /usr/bin/diagnose -g > /export/dir/to/cream-ce/diagnose.out
    > EOF
    The interval set in /etc/cron.d/diagnose-for-cream must be chosen by the site experts; the value above is only an example.
    • Export over NFS the directory where the file is located
    [root@lcg-ce]# cat /etc/exports
    
    /export/dir/to/cream-ce            cream-ce(rw,map_identity,no_root_squash,sync)
    
  • At the CREAM-CE:
    • Include/mount the remote directory to a local one
    [root@cream-ce]# cat /etc/fstab | grep diagnose
    lcg-ce:/export/dir/to/cream-ce                /import/dir/to/cream-ce         nfs    defaults,bg        0 0
    
    • Feed the lcg-info-dynamic-scheduler with the diagnose output file.
    [root@cream-ce]# cat /opt/glite/etc/lcg-info-dynamic-scheduler.conf|grep vomaxjobs-maui
    
    vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h lcg-ce –infile /import/dir/to/cream-ce/diagnose-for-cream
    
(Thanks to Marios Chatziangelou and Dennis van Dok for having provided the possible workaround)
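The check for the two error lines in the `diagnose -g` output can be sketched as a small filter. The function name `maui_client_affected` is hypothetical; the patterns are taken from the error messages quoted above, and nothing else about the maui output format is assumed.

```shell
# Return 0 if the text on stdin shows the client/server mismatch
# symptoms, i.e. both ERROR lines are present.
maui_client_affected() {
    out=$(cat)
    printf '%s\n' "$out" | grep -Eq '^ERROR:[[:space:]]+lost connection to server' &&
    printf '%s\n' "$out" | grep -Eq '^ERROR:[[:space:]]+cannot request service \(status\)'
}

# Intended use on the CREAM-CE node:
#   diagnose -g 2>&1 | maui_client_affected && echo "affected"
```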
  • Execution of DAG jobs on CREAM based CE through the gLite WMS is not implemented yet.
  • RFC proxies are not supported yet. It is possible to submit to a CREAM CE using an RFC proxy, but the delegated proxy is not usable because of this bug in the delegation code.
  • If you see an error message such as "log_success_msg: command not found" starting/stopping tomcat, please check this page. At any rate this message is basically harmless
  • After an update of the CREAM RPM, it is mandatory to reconfigure (via yaim)
  • On SL(c)4 when kerberos is used, the /usr/bin/id executable can be affected by a problem: the exit code could be different than 0, as in the following example:
# su dteam001 -
[dteam001@cert-08 glexec]$ id
uid=1651(dteam001) gid=2688(dteam) groups=2688(dteam),1090601808
[dteam001@cert-08 glexec]$ echo $?
1
Because of this problem, job submission may fail with the error "Authorization error: System error reading local user information".
In this case the following workaround can be applied:
  • Create (and chmod +x) a script (e.g. /opt/glite/etc/glite-ce-cream/id-wrapper.sh) which issues the id command, but returns 0 as exit code, e.g.:
#!/bin/sh
/usr/bin/id
exit 0
  • In /opt/glite/etc/glite-ce-cream/cream-config.xml and in /opt/glite/etc/lcas/lcas-glexec.db, replace every occurrence of /usr/bin/id with the pathname of this script.
  • Restart tomcat
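The first two steps of this workaround might be scripted as below. It is a sketch only: `install_id_wrapper` is a hypothetical name, the wrapper content is the one shown above, and the wrapper path is assumed not to contain the characters `|` or `&` (which would confuse the sed expression).

```shell
# Create the wrapper and point the given configuration files at it.
# $1 is the wrapper path; the remaining arguments are the files in
# which /usr/bin/id must be replaced. Each file is backed up once as
# <file>.bak, and the edit is always derived from that backup.
install_id_wrapper() {
    wrapper=$1
    shift
    printf '#!/bin/sh\n/usr/bin/id\nexit 0\n' > "$wrapper" || return 1
    chmod +x "$wrapper"
    for conf in "$@"; do
        [ -e "$conf.bak" ] || cp -p "$conf" "$conf.bak" || return 1
        sed "s|/usr/bin/id|$wrapper|g" "$conf.bak" > "$conf"
    done
}

# Example, followed by a tomcat restart:
#   install_id_wrapper /opt/glite/etc/glite-ce-cream/id-wrapper.sh \
#       /opt/glite/etc/glite-ce-cream/cream-config.xml \
#       /opt/glite/etc/lcas/lcas-glexec.db
```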
  • Don't use special characters in the CREAM_DB_USER and CREAM_DB_PASSWORD yaim variables
  • Problems have been reported if jobs are submitted through the WMS to a CREAM CE deployed on a machine installed using a non-English language. This is because of different representations of decimal numbers. The workaround in this case is to uncomment the line:
LANG=en_US
in $CATALINA_HOME/conf/tomcat5.conf and then restart tomcat
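Uncommenting the line can be done by hand or with a short sed-based helper such as the sketch below. The function name `uncomment_lang` is hypothetical, and the line is assumed to be present in the file in the commented form `#LANG=en_US` (optionally with spaces around the `#`).

```shell
# Uncomment the LANG=en_US line in a Tomcat configuration file,
# leaving every other line untouched.
uncomment_lang() {
    conf=$1
    sed 's/^[[:space:]]*#[[:space:]]*\(LANG=en_US\)[[:space:]]*$/\1/' "$conf" > "$conf.tmp" &&
        cat "$conf.tmp" > "$conf" && rm -f "$conf.tmp"
}

# Example: uncomment_lang "$CATALINA_HOME/conf/tomcat5.conf"
```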
  • For jobs running on SL4 WNs, some env variables (GLITE_LOCATION LCG_LOCATION EDG_LOCATION GLOBUS_LOCATION) are not set.
The fix was provided for SL5 but not for SL4 (which is now frozen).
Site admins can deploy this rpm on the SL4 WNs to address this issue
  • With some Torque versions, qsub has been observed crashing with glibc detecting a double free or corruption. Although this is a problem to be addressed in Torque, adding export MALLOC_CHECK_=0 to /opt/glite/etc/blah.config should help
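A sketch of adding the export line without duplicating it on repeated runs. The helper name `ensure_malloc_check` is hypothetical; the file path is passed as an argument.

```shell
# Append "export MALLOC_CHECK_=0" to the given file unless that exact
# line is already there.
ensure_malloc_check() {
    conf=$1
    grep -q '^export MALLOC_CHECK_=0$' "$conf" && return 0
    # Make sure the last line is newline-terminated before appending.
    [ -z "$(tail -c 1 "$conf")" ] || echo >> "$conf"
    echo 'export MALLOC_CHECK_=0' >> "$conf"
}

# Example: ensure_malloc_check /opt/glite/etc/blah.config
```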

Problems for which fixes have already been released in production

  • Bug #73481: The blparser for SGE is not able to manage already finished jobs. This means that cancelled jobs, or jobs that did not finish cleanly, are reported with a wrong status. Fix provided with gLite 3.2 Update 32
  • Bug #78565: problem with truncation of arguments. Fix provided with gLite 3.2 Update 23
  • Bugs #73224 and #73109 in yaim-core: the variable GLOBUS_TCP_PORT_RANGE may not be properly defined, which causes problems with gridftp. See the workaround specified in these bugs. Fix provided with gLite 3.2 Update 22.
  • In the old CREAM CE (CREAM CE < 1.6.3) the sudoers file was overwritten at each yaim reconfiguration and filled with only the entries needed for CREAM. This meant that local customizations were lost. In the new CREAM CE (CREAM CE 1.6.3) yaim:
    • checks if the installed sudo version supports the include directive (this should be the case for SL5)
    • cleans from /etc/sudoers the CREAM related stuff existing from a previous installation
    • if sudo supports include directives (this should be the case for SL5), yaim sets the CREAM related stuff in /etc/sudoers.forcream and adds in /etc/sudoers the include of /etc/sudoers.forcream
There is a problem with the cleaning part, which doesn't work properly if the user names don't include the name of the group: bug #76235.
For gLite 3.2, when updating to CREAM CE 1.6.3 from previous versions this means that the sudoers file could have some problems after the yaim reconfiguration. When the sudoers file is manually fixed, the problems won't happen anymore in following yaim reconfigurations.
Fix provided with CREAM CE 1.6.4 (released with gLite 3.2 Update 22)
  • Bug #69320: due to a bug in yaim-core 4.0.12-1 in combination with lcg-info-dynamic-software 1.0.3-3, introduced in glite-CREAM since version 3.2.5-0, clean installations of glite-CREAM may fail to publish software tags. This is because yaim-core is no longer creating the directory /opt/edg/var/info/$VO, where $VO is the VOs supported by the CREAM CE. This is used by lcg-info-dynamic-software to publish software tags.
The workaround is to create these directories manually along with an empty .list file inside each for holding the tags. As the faulty yaim core does in fact create the directories, but in the wrong location, it is easiest to move the directories to the correct place:
mkdir -p /opt/edg/var/info
mv /opt/glite/var/info/<VO> /opt/edg/var/info
where <VO> is replaced by the name of each VO supported. The ownership and permissions will already be correct on the directories yaim has created, along with an empty .list file inside.
Do not move the directories named <subcluster> from /opt/glite/var/info.
Fix provided with gLite 3.2 Update 22
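For a CE supporting several VOs, the mkdir/mv steps of this workaround can be wrapped in a loop. The sketch below uses a hypothetical function name, `move_vo_tag_dirs`, and takes the two directory roots and the VO names as arguments; the VO names in the usage comment are placeholders.

```shell
# Move the per-VO tag directories created in the wrong place to where
# lcg-info-dynamic-software expects them. $1 and $2 are the source and
# destination roots; the remaining arguments are VO names. Only the
# named VO directories are moved, so <subcluster> directories stay put.
move_vo_tag_dirs() {
    src=$1 dst=$2
    shift 2
    mkdir -p "$dst" || return 1
    for vo in "$@"; do
        if [ -d "$src/$vo" ] && [ ! -e "$dst/$vo" ]; then
            mv "$src/$vo" "$dst/"
        fi
    done
}

# Example (VO names are placeholders):
#   move_vo_tag_dirs /opt/glite/var/info /opt/edg/var/info dteam atlas
```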
  • Because of bug #37366 (in gsoap-plugin) some error messages are not propagated properly; in this case users simply get something like:
Received NULL fault; the error is due to another cause: : FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client]
The problem has been fixed in gLite 3.2, so it is relevant for direct submissions done by a gLite 3.1 UI and by submissions through WMS for a gLite 3.1 WMS
  • Bug #74807: there are problems if the mapping for a certain user is changed, and jobs refer to a delegationid created before the change.
Until the fix is available, the workaround is to manually remove each delegationid created by that user before the mapping change, as shown in the following example.
use delegationdb;
delete from t_credential where dlg_id='cert12345678' and local_user='dteam002';
Fix provided with gLite 3.2 Update 22.
  • Bug #70287: when using the new blah blparser, jobs queued for more than one hour are considered "lost" (i.e. they are considered failed with "reason 999"). The workaround is to choose a high value (e.g. 86400) for the alldone_interval attribute in /opt/glite/etc/blah.config and then restart the blparser (/opt/glite/etc/init.d/glite-ce-blahparser restart). Problem addressed with CREAM CE 1.6.3 released with gLite 3.2 Update 20.
  • Because of a non-backward-compatible change in Torque (related to the -W option), the CREAM CE doesn't work with Torque versions >= 2.4. Problem addressed with CREAM CE 1.6.3 released with gLite 3.2 Update 20.
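Setting alldone_interval could be scripted as in the sketch below. It assumes that blah.config uses plain `key=value` lines, which should be checked against the actual file; the helper name `set_blah_option` is hypothetical.

```shell
# Set key=value in a configuration file made of key=value lines,
# replacing an existing assignment or appending a new one.
set_blah_option() {
    conf=$1 key=$2 value=$3
    if grep -q "^$key=" "$conf"; then
        sed "s|^$key=.*|$key=$value|" "$conf" > "$conf.tmp" &&
            cat "$conf.tmp" > "$conf" && rm -f "$conf.tmp"
    else
        echo "$key=$value" >> "$conf"
    fi
}

# Example, followed by the blparser restart:
#   set_blah_option /opt/glite/etc/blah.config alldone_interval 86400
#   /opt/glite/etc/init.d/glite-ce-blahparser restart
```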
  • Bug #68225 in CREAM affecting CREAM CE 1.6.x (released first with gLite 3.2 update 12).
There are problems if there are special characters (e.g. '-') in the pool account groups: a problem with the sudoer file is reported.
As a workaround edit the /etc/sudoers file simply replacing the special characters in the alias names with valid ones. Problem addressed with CREAM CE 1.6.3 released with gLite 3.2 Update 20.
  • Because of bug #58515 in voms-api-java, there are problems whenever the Email= field is present in the certificate of a VOMS server.
Fix provided with gLite 3.2 Update 17
  • Bug #63714 in voms-api-java, which doesn't support the critical extension Issuing Distribution Point. This affects the KEK CA. Until the fix is provided by the VOMS product team, the workaround is to replace /var/lib/tomcat5/webapps/ce-cream/WEB-INF/lib/vomsjapi.jar with this file and restart tomcat.
Fix provided with gLite 3.2 Update 17
  • Bug #68159 in CREAM affecting CREAM CE 1.6.x (released first with gLite 3.2 update 12).
There are problems if there are special characters (e.g. '-') in the pool account users and/or groups.
As a workaround, replace the file $CATALINA_HOME/webapps/ce-cream/WEB-INF/lib/glite-ce-common-java.jar with the new jar file and restart tomcat.
Fix provided with gLite 3.2 Update 17
  • There is a memory leak in util-java which can cause an OutOfMemory problem in the CREAM CE: see bug #69554
The workaround is:
  • Update the util-java rpm. take the new one from here
  • cp /opt/glite/share/java/glite-security-util-java.jar /usr/share/tomcat5/webapps/ce-cream/WEB-INF/lib/glite-security-util-java.jar
  • Restart tomcat
Fix provided with gLite 3.2 Update 17
  • Because of bug #69545 asynchronous commands can be processed very slowly. The workaround is:
    • Replace /usr/share/tomcat5/webapps/ce-cream/WEB-INF/lib/glite-ce-cream-api-java-common.jar with this file
    • Restart tomcat
Fix provided with gLite 3.2 Update 17
  • Bug #56762: since gLite 3.1 Update 56 (patch #3259) it is not possible to specify NodeNumber or CpuNumber in the JDL when JobType is Normal. Until the patch fixing this problem is available, please apply the following workaround:
    • Replace $CATALINA_HOME/webapps/ce-cream/WEB-INF/lib/glite-jdl-api-java.jar with this file
    • Restart tomcat
This problem doesn't affect the CREAM CE version for gLite 3.2/sl5
Fix provided with gLite 3.1.0 Update 65 (patch #3898)
  • Bug #67302 in trustmanager affecting the CREAM CE 1.6 for gLite 3.2 (the one released with gLite 3.2 Update 12).
The bug affects the users of the following CAs:
/C=AU/O=APACGrid
/C=IL/O=IUCC
/C=CN/O=HEP
Fix provided with patch #4119 (released with gLite 3.2 Update 13)
  • Because of bug #17046 (in trustmanager) if there are CA changes, it is necessary to restart tomcat
  • Because of some serious problems with the new blparser (in particular bug #55438), it is suggested to keep using the old parser (i.e.: BLPARSER_WITH_UPDATER_NOTIFIER=false, which is the default). With CREAM CE 1.6 (patch #3959, released with gLite 3.2.0 Update 12) the new blparser can instead be used (and it is the default option)
  • Bugs #58941 and #61493: in some cases glexec may map to a different account than the one mapped by gridftpd, which causes problems.
Because of this problem it is not possible to have both of the following in the grid-mapfile:
  • a mapping of some role/group to a static account;
  • a wildcard to map unrecognized roles/groups to pool accounts
Fix provided with patch #3959, released with gLite 3.2.0 Update 12. In the meantime it is possible to apply the following workaround after having configured via yaim:
  • echo "user_identity_switch_by = lcmaps" >> /opt/glite/etc/glexec.conf
  • wget --no-check-certificate "https://savannah.cern.ch/bugs/download.php?file_id=11374" -O lcmaps-glexec.db
  • cp -p /opt/glite/etc/lcmaps/lcmaps-suexec.db /opt/glite/etc/lcmaps/lcmaps-suexec.db.old
  • cat lcmaps-glexec.db > /opt/glite/etc/lcmaps/lcmaps-suexec.db
  • Only for glite 3.2/sl5_x86_64: sed -i -e 's|/opt/glite/lib/|/opt/glite/lib64/|' /opt/glite/etc/lcmaps/lcmaps-suexec.db
  • Bug #61790: There can be problems if there are some "strange" characters in the subject DN. E.g. if there is a ":" in the subject DN, the sandbox directory has a name with ":", and this is a character not accepted by PBS.
Patch #3959, released with gLite 3.2.0 Update 12, provides a fix for this problem (the sandbox dir will have only alpha-numeric chars + the '_' char).
In the meantime the workaround is to replace:
/var/lib/tomcat5/webapps/ce-cream/WEB-INF/lib/glite-ce-cream.jar with this file
/var/lib/tomcat5/webapps/ce-monitor/WEB-INF/lib/glite-ce-monitor.jar with this file
/var/lib/tomcat5/webapps/ce-cream/WEB-INF/lib/glite-ce-common-java.jar and /var/lib/tomcat5/webapps/ce-monitor/WEB-INF/lib/glite-ce-common-java.jar with this file
and restart tomcat
  • Bug #62893 : in some cases the JobWrapper running on the WN could not download in time the fresh proxy from the CREAM CE node. Fix provided with patch #3959, released with gLite 3.2.0 Update 12. In the meantime the workaround is to replace the CREAM jw template (following the instructions reported at: http://grid.pd.infn.it/cream/field.php?n=Main.HowToCustomizeTheCREAMJobWrapper) with this one.
  • CREAM/CEMon could "crash" reporting in its log files "too many open files" (see bug #52651): fix provided with patch #3959, released with gLite 3.2.0 Update 12
As a workaround in the meantime it is suggested:
  • To create the index for the extra_attribute table (see bug #52876): not needed if you are already using gLite 3.1 Update 56
  • To increase the number of file descriptors (e.g. to 4096) for tomcat. You can do it editing /etc/security/limits.conf and adding:
tomcat           soft    nofile          4096
tomcat           hard    nofile          4096
  • Bug #47254: if the proxy used to talk with a CREAM based CE is shorter than 10 minutes, the following problem could be seen:
CREAM Register returned error "MethodName=[jobRegister] Timestamp=[Fri 20 Feb 2009 16:24:32] ErrorCode=[0] Description=[system error]
FaultCause=[cannot create the job's working directory! The problem seems to be related to glexec]"
Actually in these cases glexec is not to blame: the problem is instead in the proxy used by CREAM for this glexec operation
Fix provided with patch #3959, released with gLite 3.2.0 Update 12
  • Old (expired) proxies delegated to a CREAM based CE are not deleted (bugs #33730 and #49497)
Fix provided with patch #3959, released with gLite 3.2.0 Update 12
  • Bug #45914 (in glexec) and in BLAH: sometimes glexec (used by CREAM/BLAH) can fail reporting something like:
gLExec has detected an input file change during the use of the file
CREAM CE patch #3959, released with gLite 3.2.0 Update 12, is not affected by this problem anymore
  • Bug #47804: for a LSF based CREAM CE, the yaim variable BATCH_CONF_DIR must be set to the directory where there is the lsf.conf. yaim-cream-ce assumes that in the same directory there is also the lsf.profile script, while this is not always the case. Waiting for the fix for this bug, the workaround is to edit /opt/glite/etc/blah.config (setting the proper path of the lsf.profile script), and restart tomcat
Fix provided with patch #3959, released with gLite 3.2.0 Update 12
  • Bug #56518: The BLAH blparser doesn't automatically start after a reboot
Fix provided with patch #3959, released with gLite 3.2.0 Update 12
  • Bug #52942: if a ISB/OSB file transfer done by the CREAM job wrapper fails, the failure reason is not properly reported.
Fix provided with patch #3959, released with gLite 3.2.0 Update 12.
In the meantime the workaround is to replace /var/lib/tomcat5/webapps/ce-cream/WEB-INF/lib/glite-ce-cream-api-java-common.jar with this file and restart tomcat
  • Because of bug #57141, there are problems to properly configure gLite 3.1 CREAM (32-bit) on a 64-bit node. Fix provided with patch #3438.
  • Bug #43830: There are problems if there are more than 32000 active jobs for a given user: fix provided with gLite 3.1 Update 56
  • Bug #47447: The JDL attribute MaxOutputSandboxSize is not properly managed: fix provided with gLite 3.1 Update 56
  • Because of bug #22436, the cert files of VOMS servers (the ones in /etc/grid-security/vomsdir) must have .pem as suffix: this is however managed by the yaim-cream-ce conf procedure. With gLite 3.1 Update 56 VOMS server certificates are not needed anymore
  • Bug #48144: the CREAM job purger directory is not cleared when the purge operation is called (explicitly or by the automatic purger) if the group to which the user is mapped differs from the name of the VO. This can help to trigger bug #43830: fix provided with gLite 3.1 Update 56
  • On glite 3.1 (which is supported just for sl4_ia32), BLAH and the blparser don't work on 64bit machines: fix provided with gLite 3.1 update 56
  • Bug #52876: The extra_attribute table in the CREAM DB has no keys/indexes defined. This results in performance problems for some operations. Fix provided with gLite 3.1 Update 56
  • Because of bug #44924, if the CREAM jobwrapper fails to download the proxy from the CREAM CE node, after 5 attempts it gives up and cancels the job with "proxy expired", even if the proxy is still valid. The recent changes in yaim-core (which introduced a limit on the number of concurrent gridftp connections), can trigger this problem. For the time being the workaround is to remove /opt/globus/etc/gridftp.conf from the CREAM CE and restart the gridftpd: fix provided with gLite 3.1 Update 56
  • Bug #48083: if the mapping of a certain user changes, that user may no longer be able to submit jobs (the error message will be a generic glexec error in the creation of the sandbox dir). Fix provided with gLite 3.1 Update 56
  • After introducing the fix for Bug #45887, YAIM has stopped creating the /opt/edg/var/info directories. This is a mistake, since the version of lcg-tags that is able to write in the new directory /opt/glite/var/info/<SubClusterUniqueId>/<vo> is released in Patch #2940, which hasn't been certified yet. Old directories need to be supported for a while. The workaround is needed in clean installations or when a new VO is added to the CE. Edit $INSTALL_ROOT/glite/yaim/functions/config_gip_vo_tag and add at the end the old code to create the /opt/edg/var/info directory:
    for VO in $VOS; do
        dir=${INSTALL_ROOT}/edg/var/info/$VO
        mkdir -p $dir
        f=$dir/$VO.list
        [ -f $f ] || touch $f
        # work out the sgm user for this VO
        sgmusers=`users_getspecialusers $VO sgm`
        sgmuser=`echo $sgmusers | cut -d " " -f 1`
        vogroup=`users_getvogroup ${VO}`
        sgmgroup=`users_getspecialgroup ${VO} sgm`

        # the primary group reported by id overrides the value above
        sgmgroup=`id -g -n $sgmuser`
        chown -R ${sgmuser}:${sgmgroup} $dir
        yaimlog DEBUG "$vogroup, $sgmgroup"
        if [ "x$vogroup" = "x$sgmgroup" ]; then
            yaimlog DEBUG "Removing group writeability of files in $dir, sgm's primary group is equal to pool account's primary group."
            chmod -R go-w $dir
        else
            yaimlog DEBUG "Adding group writeability of files in $dir, sgm's primary group is different to pool account's primary group."
            chmod -R ug+rw,o-w $dir
        fi
    done
  • There is a bug in lcas, which crashes (and therefore glexec doesn't work) if the number of delegations is higher than the number of RDNs in the subject. For example the following proxy:
subject   : /O=GermanGrid/OU=GSI/CN=Kilian Schwarz/CN=proxy/CN=proxy/CN=proxy/CN=proxy
issuer    : /O=GermanGrid/OU=GSI/CN=Kilian Schwarz/CN=proxy/CN=proxy/CN=proxy
identity  : /O=GermanGrid/OU=GSI/CN=Kilian Schwarz/CN=proxy/CN=proxy/CN=proxy
type      : proxy
strength  : 1024 bits
is affected by the problem, since the number of RDNs is 3 (O=GermanGrid, OU=GSI, CN=Kilian Schwarz) and the number of delegations is 4.
Fix provided with gLite 3.1 Update 54.
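The counting in the example above can be reproduced with a short shell function. This is a sketch under assumptions: the name `lcas_proxy_check` is hypothetical, each delegation is assumed to append one "/CN=proxy" component to the subject (as in the example), and RDN values are assumed not to contain "/".

```shell
# Print "affected" if the proxy subject has more trailing /CN=proxy
# components (delegations) than the underlying identity has RDNs.
lcas_proxy_check() {
    subject=$1
    identity=$subject
    delegations=0
    # Strip trailing /CN=proxy components one at a time, counting them.
    while [ "${identity%/CN=proxy}" != "$identity" ]; do
        identity=${identity%/CN=proxy}
        delegations=$((delegations + 1))
    done
    # Each RDN starts with "/", so count the slashes in what is left.
    slashes=$(printf '%s' "$identity" | tr -cd '/')
    rdns=${#slashes}
    if [ "$delegations" -gt "$rdns" ]; then
        echo "affected"
    else
        echo "not affected"
    fi
}
```

Called with the subject of the example proxy (3 RDNs, 4 delegations), the function reports it as affected; with one delegation fewer it does not.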
  • Because of bug #36470 (in LB) LB processes in the CE could not run properly after a yaim (re)configuration. Please check the "Cream CE post-configuration" in the CREAM CE conf instructions via yaim. Fix provided with gLite 3.1 Update 52
  • Because of bug #47152 (in LCMAPS) there might be problems if many-to-one static account mapping is used. This results in a glexec failure. As a workaround, VOMS pool accounts should be used instead of static ones. Fix provided with gLite 3.1 Update 49