Symptom
The device crashes (kernel panic).
Analysis
[ 189.052980][ T5068] Unable to handle kernel paging request at virtual address 00046ffca9037bf9
[ 189.052991][ T5068] Mem abort info:
[ 189.052997][ T5068] ESR = 0x0000000096000004
[ 189.053005][ T5068] EC = 0x25: DABT (current EL), IL = 32 bits
[ 189.053013][ T5068] SET = 0, FnV = 0
[ 189.053020][ T5068] EA = 0, S1PTW = 0
[ 189.053027][ T5068] FSC = 0x04: level 0 translation fault
[ 189.053035][ T5068] Data abort info:
[ 189.053039][ T5068] ISV = 0, ISS = 0x00000004
[ 189.053045][ T5068] CM = 0, WnR = 0
[ 189.053053][ T5068] [00046ffca9037bf9] address between user and kernel address ranges
[ 189.053064][ T5068] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 189.053311][ T5068] Dumping ftrace buffer:
[ 189.053331][ T5068] (ftrace buffer empty)
[ 189.055391][ T5068] CPU: 1 PID: 5068 Comm: binder:1027_3 Tainted: G WC OE 6.1.118-android14-11-maybe-dirty #1
[ 189.055405][ T5068] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 189.055412][ T5068] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 189.055426][ T5068] pc : dpm_complete+0x128/0x44c
[ 189.055451][ T5068] lr : dpm_complete+0x114/0x44c
[ 189.055462][ T5068] sp : ffffffc0243fbb40
[ 189.055468][ T5068] x29: ffffffc0243fbb60 x28: ffffff8035d52580 x27: ffffffc00a1fc000
[ 189.055489][ T5068] x26: ffffffc00a1fc210 x25: ffffffc0243fbb48 x24: ffffff8093e724a0
[ 189.055508][ T5068] x23: ffffff8093e72518 x22: ffffff8093e72400 x21: ffffffc0092f0ae9
[ 189.055527][ T5068] x20: ffffffc00a1fc1c0 x19: 0000000000000010 x18: ffffffc022c2d078
[ 189.055545][ T5068] x17: 000000007b71745f x16: 000000007b71745f x15: ffffff8179342180
[ 189.055564][ T5068] x14: 0000000000000010 x13: ffffffc0082809d4 x12: ffffffc00939e698
[ 189.055582][ T5068] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffc00a0c7000
[ 189.055600][ T5068] x8 : a9046ffca9037bfd x7 : 3a4d50006574656c x6 : 0000101a1e00090b
[ 189.055619][ T5068] x5 : 0b09001e1a100000 x4 : 0000008000000000 x3 : ffffff8056d3a9c8
[ 189.055637][ T5068] x2 : 00000000ffff93a3 x1 : 0000000000000000 x0 : ffffff8093e72400
[ 189.055657][ T5068] Call trace:
[ 189.055663][ T5068] dpm_complete+0x128/0x44c
[ 189.055677][ T5068] suspend_devices_and_enter+0x894/0xc04
[ 189.055698][ T5068] pm_suspend+0x330/0x694
[ 189.055711][ T5068] state_store+0x104/0x1c8
[ 189.055724][ T5068] kobj_attr_store+0x30/0x48
[ 189.055747][ T5068] sysfs_kf_write+0x54/0x6c
[ 189.055769][ T5068] kernfs_fop_write_iter+0x104/0x1a4
[ 189.055789][ T5068] vfs_write+0x244/0x2e0
[ 189.055805][ T5068] ksys_write+0x78/0xe8
[ 189.055816][ T5068] __arm64_sys_write+0x1c/0x2c
[ 189.055829][ T5068] invoke_syscall+0x58/0x114
[ 189.055845][ T5068] el0_svc_common+0xb4/0xfc
[ 189.055857][ T5068] do_el0_svc+0x24/0x84
[ 189.055867][ T5068] el0_svc+0x2c/0x90
[ 189.055884][ T5068] el0t_64_sync_handler+0x68/0xb4
[ 189.055897][ T5068] el0t_64_sync+0x1a4/0x1a8
[ 189.055920][ T5068] Code: b40002a8 f9400508 b40003e8 aa1603e0 (b85fc110)
[ 189.055933][ T5068] ---[ end trace 0000000000000000 ]---
[ 189.169167][ T5068] Kernel panic - not syncing: Oops: Fatal exception
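The "Mem abort info" lines can be cross-checked by decoding the ESR value by hand. The sketch below assumes the standard ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts WnR in ISS bit 6 and the fault status code in ISS bits 5:0). The helper names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed ESR_ELx layout: EC = [31:26], IL = [25], ISS = [24:0]. */
static inline unsigned esr_ec(uint64_t esr)  { return (unsigned)((esr >> 26) & 0x3f); }
static inline unsigned esr_il(uint64_t esr)  { return (unsigned)((esr >> 25) & 0x1); }
static inline unsigned esr_iss(uint64_t esr) { return (unsigned)(esr & 0x1ffffff); }

/* Data-abort specific ISS fields: WnR = ISS[6], FSC = ISS[5:0]. */
static inline unsigned esr_dabt_wnr(uint64_t esr) { return (esr_iss(esr) >> 6) & 0x1; }
static inline unsigned esr_dabt_fsc(uint64_t esr) { return esr_iss(esr) & 0x3f; }

/* Print the decoded fields in roughly the same shape as the kernel log. */
static inline void esr_print(uint64_t esr)
{
    printf("EC=0x%02x IL=%u ISS=0x%08x WnR=%u FSC=0x%02x\n",
           esr_ec(esr), esr_il(esr), esr_iss(esr),
           esr_dabt_wnr(esr), esr_dabt_fsc(esr));
}
```

Calling `esr_print(0x96000004)` gives EC 0x25 (data abort taken at the current EL), WnR 0 (a read) and FSC 0x04 (level 0 translation fault), which matches the log: the kernel read through a pointer that is not a valid kernel address.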
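Initial module localization
The problem occurs during system suspend.
Devices are suspended one after another.
The device that fails is disp_feature/disp-DSI-0.
Something goes wrong in the suspend flow: the class of disp-DSI-0 looks as if it has been unregistered.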
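Issue 1
The dmesg output shows that two threads run the initialization flow at the same time:
Around 7.0x s, thread T615 reaches mi_display_pwrkey_callback_set
Around 7.04 s, the pwrkey IRQ fires on thread T710
Around 7.40 s, T710 initializes mi disp_core and mi disp_log
Around 7.45 s, T675 initializes mi disp_core and mi disp_log again, finds them already initialized, and returns immediately
Around 7.45 s, T675 initializes mi disp_feature
This gives the first issue:
Display initialization is being triggered from the power-key interrupt handler instead of going through the normal display flow. This needs to be reworked.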
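One way to make the initialization safe even when two threads enter it is to do the "already initialized?" check and the publication of the global under the same lock. The following is a minimal user-space sketch of that idea, not the driver's code: `feature_init_once`, `struct disp_feature_model`, `g_feature` and `init_runs` are hypothetical names, and a pthread mutex stands in for what would be a kernel `struct mutex`.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

struct disp_feature_model { int version; };    /* hypothetical stand-in */

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static struct disp_feature_model *g_feature;   /* models g_disp_feature */
static int init_runs;                          /* counts real initializations */

/* Check and publish under one lock, so two callers cannot both initialize. */
int feature_init_once(void)
{
    struct disp_feature_model *f;
    int ret = 0;

    pthread_mutex_lock(&init_lock);
    if (g_feature)                 /* another caller already finished init */
        goto out;

    f = calloc(1, sizeof(*f));
    if (!f) {
        ret = -ENOMEM;
        goto out;
    }
    f->version = 1;
    g_feature = f;                 /* publish while still holding the lock */
    init_runs++;
out:
    pthread_mutex_unlock(&init_lock);
    return ret;
}
```

With this shape, a second caller (for example a power-key callback that arrives early) either waits for the first one to finish or sees the global already set, and it never reaches an error path that tears down state owned by the first caller.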
Issue 2
Line 4538: [ 7.456376][ T710] sysfs: cannot create duplicate filename '/devices/virtual/mi_display/disp_feature'
Line 4549: [ 7.467624][ T710] CPU: 1 PID: 710 Comm: irq/135-pm8941_ Tainted: G WC OE 6.1.118-android14-11-maybe-dirty #1
Line 4559: [ 7.485547][ T710] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
Line 4560: [ 7.485552][ T710] Call trace:
Line 4561: [ 7.485555][ T710] dump_backtrace+0xf4/0x11c
Line 4562: [ 7.485569][ T710] show_stack+0x18/0x24
Line 4563: [ 7.485573][ T710] dump_stack_lvl+0x60/0x90
Line 4564: [ 7.485580][ T710] sysfs_create_dir_ns+0xf0/0x150
Line 4565: [ 7.485588][ T710] kobject_add_internal+0x228/0x478
Line 4566: [ 7.485595][ T710] kobject_add+0x94/0x10c
Line 4567: [ 7.485600][ T710] device_add+0x144/0x618
Line 4568: [ 7.485607][ T710] device_create_groups_vargs+0xcc/0x12c
Line 4570: [ 7.499011][ T710] device_create+0x58/0x80
Line 4571: [ 7.499017][ T710] mi_disp_feature_init+0xdc/0x20c [msm_drm]
Line 4573: [ 7.510902][ T710] mi_get_disp_feature+0x20/0x40 [msm_drm]
Line 4575: [ 7.522143][ T710] mi_display_powerkey_callback+0x18/0x80 [msm_drm]
Line 4577: [ 7.537274][ T710] pm8941_pwrkey_irq+0x1e8/0x330 [pm8941_pwrkey]
Line 4578: [ 7.537302][ T710] irq_thread_fn+0x44/0xa4
Line 4579: [ 7.537315][ T710] irq_thread+0x164/0x290
Line 4580: [ 7.537320][ T710] kthread+0x10c/0x154
Line 4581: [ 7.537328][ T710] ret_from_fork+0x10/0x20
Line 4583: [ 7.547231][ T710] kobject_add_internal failed for disp_feature with -EEXIST, don't try to register things with the same name in the same directory.
Line 4588: [ 7.559217][ T710] [mi_disp:mi_disp_feature_init [msm_drm]] [E]create device failed for disp_feature
Line 4591: [ 7.572531][ T710] ------------[ cut here ]------------
Line 4593: [ 7.584887][ T710] remove_proc_entry: removing non-empty directory '/proc/mi_display', leaking at least 'mipi_rw_prim'
Line 4594: [ 7.584917][ T710] WARNING: CPU: 1 PID: 710 at fs/proc/generic.c:720 remove_proc_entry+0x1e0/0x1ec
Line 4595: [ 7.584935][ T710] Modules linked in: rmnet_wlan(OE) rmnet_offload(OE) rmnet_perf(OE) rmnet_shs(OE) rmnet_perf_tether(OE) rmnet_core(OE) gauge_iio(E) ipanetm(OE)
Line 4625: [ 7.672205][ T710] CPU: 1 PID: 710 Comm: irq/135-pm8941_ Tainted: G WC OE 6.1.118-android14-11-maybe-dirty #1
Line 4626: [ 7.672211][ T710] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
Line 4627: [ 7.672214][ T710] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Line 4628: [ 7.672219][ T710] pc : remove_proc_entry+0x1e0/0x1ec
Line 4629: [ 7.672234][ T710] lr : remove_proc_entry+0x1e0/0x1ec
Line 4630: [ 7.672240][ T710] sp : ffffffc00b4a3c60
Line 4631: [ 7.672242][ T710] x29: ffffffc00b4a3c80 x28: 0000000000000000 x27: 00000000ffffffff
Line 4632: [ 7.672250][ T710] x26: 0000000000000001 x25: ffffffc00a1a4580 x24: 000000000000000a
Line 4633: [ 7.672256][ T710] x23: 000000000000000a x22: ffffffc009318048 x21: ffffff804c52b180
Line 4634: [ 7.672263][ T710] x20: ffffff804c52b22c x19: ffffff804c52b200 x18: ffffffc00aafd048
Line 4635: [ 7.672269][ T710] x17: 0000000000000015 x16: 00000000000000a4 x15: ffffffc00902ec88
Line 4636: [ 7.672276][ T710] x14: 0000000000000001 x13: 000000000000004e x12: 0000000000000018
Line 4637: [ 7.672282][ T710] x11: 00000000ffffffff
Line 4640: [ 7.687628][ T710] x10: ffffffc00a09eb5c x9 : 67aa0542b3522000
Line 4641: [ 7.687638][ T710] x8 : 67aa0542b3522000 x7 : 656c20746120676e x6 : 0000000000000027
Line 4642: [ 7.687644][ T710] x5 : ffffff8179154234 x4 : ffffffc0093675d5 x3 : ffff0a00ffffff04
Line 4643: [ 7.687651][ T710] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000063
Line 4644: [ 7.687658][ T710] Call trace:
Line 4645: [ 7.687663][ T710] remove_proc_entry+0x1e0/0x1ec
Line 4646: [ 7.687673][ T710] mi_disp_core_deinit+0x34/0x60 [msm_drm]
Line 4653: [ 7.705247][ T710] mi_disp_feature_init+0x16c/0x20c [msm_drm]
Line 4663: [ 7.722296][ T710] mi_get_disp_feature+0x20/0x40 [msm_drm]
Line 4669: [ 7.739086][ T710] mi_display_powerkey_callback+0x18/0x80 [msm_drm]
Line 4671: [ 7.762509][ T710] pm8941_pwrkey_irq+0x1e8/0x330 [pm8941_pwrkey]
Line 4672: [ 7.762528][ T710] irq_thread_fn+0x44/0xa4
Line 4673: [ 7.762539][ T710] irq_thread+0x164/0x290
Line 4674: [ 7.762544][ T710] kthread+0x10c/0x154
Line 4675: [ 7.762550][ T710] ret_from_fork+0x10/0x20
Line 4677: [ 7.784476][ T710] ---[ end trace 0000000000000000 ]---
Line 4678: [ 7.784632][ T710] [mi_disp:mi_display_powerkey_callback [msm_drm]] [E]invalid dsi_display or dsi_panel ptr
pm8941_pwrkey_irq ultimately ends up calling mi_disp_core_deinit. The corresponding code:
void mi_disp_core_deinit(void)
{
    if (!g_disp_core)
        return;
    debugfs_remove_recursive(g_disp_core->debugfs_dir);
    remove_proc_entry(MI_DISPLAY_PROCFS_DIR, NULL);
    class_destroy(g_disp_core->class);
    kfree(g_disp_core);
    g_disp_core = NULL; // g_disp_core is cleared, but copies of ->class held elsewhere are not
}
This destroys g_disp_core->class, kfrees g_disp_core, and then sets g_disp_core to NULL.
I specifically asked an AI about this:
{% tip success %}
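- class_destroy removes the class
- kfree(g_disp_core) does not zero the memory that g_disp_core points to; it only hands the block back to the allocator, so the block may be reused from then on
- g_disp_core = NULL only changes the pointer variable itself, so that it points to NULL instead of the old address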
{% endtip %}
Looking further at the callers of mi_disp_core_deinit, the call site is the following code:
int mi_disp_feature_init(void)
{
    int ret = 0;
    struct disp_feature *df = NULL;
    struct disp_core *disp_core = NULL;
    int i;

    ret = mi_disp_core_init();
    if (ret < 0)
        return -ENODEV;
    mi_disp_log_init();
    disp_core = mi_get_disp_core();
    if (!disp_core)
        return -ENODEV;
    if (g_disp_feature) {
        DISP_INFO("mi disp_feature already initialized, return!\n");
        return 0;
    }
    df = kzalloc(sizeof(struct disp_feature), GFP_KERNEL);
    if (!df) {
        DISP_ERROR("can not allocate Buffer\n");
        ret = -ENOMEM;
        goto err_core_deinit;
    }
    ret = mi_disp_cdev_register(DISP_FEATURE_DEVICE_NAME,
            &disp_feature_fops, &df->cdev);
    if (ret < 0) {
        DISP_ERROR("cdev register failed for %s\n", DISP_FEATURE_DEVICE_NAME);
        goto err_alloc_mem;
    }
    df->dev_id = df->cdev->dev;
    df->class = disp_core->class; // disp_feature takes a copy of disp_core->class
    df->pdev = device_create(df->class, NULL, df->dev_id, df, DISP_FEATURE_DEVICE_NAME);
    if (IS_ERR(df->pdev)) {
        DISP_ERROR("create device failed for %s\n", DISP_FEATURE_DEVICE_NAME); // this error shows up in the log
        ret = -ENODEV;
        goto err_cdev_register;
    }
    df->version = MI_DISP_FEATURE_VERSION;
    for (i = MI_DISP_PRIMARY; i < MI_DISP_MAX; i++) {
        df->d_display[i].dev = NULL;
        df->d_display[i].display = NULL;
        df->d_display[i].disp_id = MI_DISP_MAX;
        df->d_display[i].intf_type = MI_INTF_MAX;
        mutex_init(&df->d_display[i].mutex_lock);
    }
    INIT_LIST_HEAD(&df->client_list);
    spin_lock_init(&df->client_spinlock);
    g_disp_feature = df; // on the first init, the allocated df is published through the global
    DISP_INFO("mi disp_feature driver initialized!\n");
    if (hwconf_init() < 0) {
        DISP_ERROR("can not initialize hwconf.\n");
    }
    return 0;
err_cdev_register: // execution jumps here
    mi_disp_cdev_unregister(df->cdev);
err_alloc_mem:
    kfree(df);
err_core_deinit:
    mi_disp_core_deinit(); // and ends up here
    return ret;
}
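goto err_cdev_register
err_cdev_register: // execution jumps here
    mi_disp_cdev_unregister(df->cdev); // unregister the cdev
err_alloc_mem:
    kfree(df); // hand df's memory back to the allocator
err_core_deinit:
    mi_disp_core_deinit(); // and ends up here
mi_disp_cdev_unregister is implemented as:
void mi_disp_cdev_unregister(struct cdev *cdev)
{
    unregister_chrdev_region(cdev->dev, 1);
    cdev_del(cdev);
    cdev = NULL;
}
This is where the second issue shows up.
cdev is a formal parameter, that is, a local variable of the function. Setting it to NULL has no effect on the caller's argument,
so df->cdev should still be non-NULL. Looking at g_disp_feature->cdev confirms that it was indeed not cleared.
The function's disassembly confirms this as well.
x0 carries the value of cdev. On entry the function saves x0 into x19, and all later operations work on x19 rather than on x0 directly.
The ldr x19, [sp, #0x10] near the end is most likely just the epilogue restoring the caller's x19 from the stack (x19 is callee-saved under AAPCS64), not x19 being used in place of sp. Nothing in the function stores NULL back through a pointer, and the compiler is free to drop cdev = NULL as a dead store, so the caller's df->cdev keeps its old value after the function returns.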
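The pass-by-value behaviour is easy to reproduce outside the kernel. The sketch below uses a hypothetical `struct cdev_model` and two made-up functions: one mirrors the shape of mi_disp_cdev_unregister, the other takes a pointer to the caller's pointer so that it can really clear it. It only illustrates the C semantics and is not a proposed patch.

```c
#include <assert.h>
#include <stddef.h>

struct cdev_model { int dev; int registered; };   /* hypothetical stand-in */

/* Same shape as the original: the NULL assignment only changes the local copy. */
void unregister_by_value(struct cdev_model *cdev)
{
    assert(cdev != NULL);
    cdev->registered = 0;   /* models the real unregister work */
    cdev = NULL;            /* no effect on the caller's pointer */
    (void)cdev;             /* silence "set but not used" warnings */
}

/* Variant that receives the address of the caller's pointer and clears it. */
void unregister_and_clear(struct cdev_model **pcdev)
{
    if (!pcdev || !*pcdev)
        return;
    (*pcdev)->registered = 0;
    *pcdev = NULL;          /* the caller's pointer is now NULL */
}
```

After `unregister_by_value(p)` the caller's `p` still holds the old address, while after `unregister_and_clear(&p)` it is NULL. An alternative with the same effect is to leave the helper as it is and have the caller assign NULL to df->cdev itself right after the call.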
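Issue 3
goto err_cdev_register
err_cdev_register: // execution jumps here
    mi_disp_cdev_unregister(df->cdev); // unregister the cdev
err_alloc_mem:
    kfree(df); // hand df's memory back to the allocator
err_core_deinit:
    mi_disp_core_deinit(); // and ends up here
After kfree(df), neither df nor g_disp_feature is set to NULL,
which is an easy way to run into trouble.
Note the following:
Once g_disp_feature = df has run, df and g_disp_feature point to the same block of memory, but they are two separate pointer variables stored at different addresses. kfree(df) only hands that block back to the allocator. After the block has been reused, df and g_disp_feature both still hold the old address, and dereferencing either of them is a use-after-free that will misbehave.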
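The aliasing point can be modelled in a few lines of user-space C. In this sketch `feature_create`, `feature_destroy` and `g_feature` are hypothetical names; the destroy helper frees the block once and clears both the caller's handle and the global alias, so no pointer is left holding the stale address.

```c
#include <stdlib.h>

struct feature_model { int version; };      /* hypothetical stand-in */

static struct feature_model *g_feature;     /* models g_disp_feature */

/* Allocate a feature and publish it through the global, like g_disp_feature = df. */
struct feature_model *feature_create(void)
{
    struct feature_model *f = calloc(1, sizeof(*f));

    if (f) {
        f->version = 1;
        g_feature = f;      /* two pointers now refer to one block */
    }
    return f;
}

/* Free the block once and clear every alias this module knows about. */
void feature_destroy(struct feature_model **pf)
{
    if (!pf || !*pf)
        return;
    if (g_feature == *pf)
        g_feature = NULL;   /* clear the global alias first */
    free(*pf);
    *pf = NULL;             /* then clear the caller's handle */
}
```

The design choice here is to make the free routine responsible for the aliases, so that callers cannot forget one of them. It only covers aliases the module itself tracks; copies held elsewhere still need their own handling.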
Summary
Although three issues were found, the actual crash happens because the class was destroyed and that state was never propagated to g_disp_feature. Both g_disp_core and g_disp_feature need to be set to NULL.
df->class = disp_core->class; // disp_feature takes a copy of disp_core->class
void mi_disp_core_deinit(void)
{
    if (!g_disp_core)
        return;
    debugfs_remove_recursive(g_disp_core->debugfs_dir);
    remove_proc_entry(MI_DISPLAY_PROCFS_DIR, NULL);
    class_destroy(g_disp_core->class);
    kfree(g_disp_core);
    g_disp_core = NULL; // g_disp_core is cleared, but the copy in df->class is not and still gives access to the destroyed class
}
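As a result, the suspend flow assumed the class still existed, which caused the crash. In Trace32 every member of the class looks corrupted, which suggests the memory block has since been reused by someone else.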
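One possible direction for a fix, sketched here under assumptions, is to stop a failing initializer from tearing down a core that another, successful initializer still depends on. The user-space model below counts users of the shared core and only destroys the class when the last user is gone. `core_get`, `core_put`, `struct core_model` and `class_alive` are hypothetical names; the real driver would also need locking around these operations, which is omitted here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of the shared core: class_alive stands in for ->class. */
struct core_model { int class_alive; int users; };

static struct core_model *g_core;           /* models g_disp_core */

/* Create the core on first use and take a reference on it. */
struct core_model *core_get(void)
{
    if (!g_core) {
        g_core = calloc(1, sizeof(*g_core));
        if (!g_core)
            return NULL;
        g_core->class_alive = 1;            /* class_create() would go here */
    }
    g_core->users++;
    return g_core;
}

/* Drop a reference; destroy the class only when nobody uses it anymore. */
void core_put(void)
{
    if (!g_core)
        return;
    assert(g_core->users > 0);
    if (--g_core->users > 0)
        return;                             /* another user still relies on the class */
    g_core->class_alive = 0;                /* class_destroy() would go here */
    free(g_core);
    g_core = NULL;
}
```

In this model, the initializer that succeeded keeps its reference, so the error path of a second, failing initializer calls `core_put()` without destroying the class that the first one copied into its own structure.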