
Linux Time Management

Introduction

The previous articles analyzed the scheduler's data structures, its initialization, and how tasks are added to it. This article explains how the scheduler performs task scheduling while the system is running normally.

System timers

Our topic is the scheduler, but it depends on the system timers, so we first look briefly at how timers are organized in the kernel and how they drive the scheduler's periodic invocation.

The kernel describes each hardware timer with a struct clock_event_device. Every hardware timer has its own resolution and raises a clock interrupt at an interval determined by it. Each CPU has a tick_device that describes the hardware timer the CPU currently uses (each CPU has its own run queue). The clock interrupt of the tick_device advances the tick counter jiffies (only one CPU is responsible for this), and the interrupt handler also calls into the scheduler. The low-resolution timers commonly used in drivers are implemented by comparing against jiffies. With high-resolution timers (hrtimer) the situation is different: an ordinary hrtimer is created whose callback runs the scheduler tick, and its interval is the same as the clock tick.

So on every clock tick the scheduler checks whether a reschedule is needed.

Clock interrupt

When the clock interrupt fires, tick_handle_periodic() is called first, and it does most of its work through tick_periodic(). Let us look at tick_handle_periodic():

```c
void tick_handle_periodic(struct clock_event_device *dev)
{
    /* ID of the current CPU */
    int cpu = smp_processor_id();
    /* Expiry time of the next clock event */
    ktime_t next = dev->next_event;

    tick_periodic(cpu);

    /* In periodic mode the device re-arms itself, so just return */
    if (dev->mode != CLOCK_EVT_MODE_ONESHOT)
        return;

    /*
     * By the time this handler runs, more than one tick period may
     * already have elapsed on the clock_event_device. In that case
     * tick_periodic() is called several times so that jiffies and
     * the time of day are updated correctly.
     */
    for (;;) {
        /*
         * Setup the next period for devices, which do not have
         * periodic mode:
         */
        /* Compute the next expiry time */
        next = ktime_add(next, tick_period);

        /* Program the next expiry time; 0 means success */
        if (!clockevents_program_event(dev, next, false))
            return;
        /*
         * Have to be careful here. If we're in oneshot mode,
         * before we call tick_periodic() in a loop, we need
         * to be sure we're using a real hardware clocksource.
         * Otherwise we could get trapped in an infinite
         * loop, as the tick_periodic() increments jiffies,
         * which then will increment time, possibly causing
         * the loop to trigger again and again.
         */
        if (timekeeping_valid_for_hres())
            tick_periodic(cpu);
    }
}
```

The function calls tick_periodic() and then checks whether the clock event device runs in one-shot or periodic mode. In periodic mode it returns immediately. In one-shot mode it does the following:

(1) Compute the next expiry time.
(2) Program the device with that expiry time.
(3) If programming fails (the expiry time has already passed) and timekeeping is backed by a real hardware clocksource, call tick_periodic() again to account for the missed tick.
(4) Go back to step (1).

Inside tick_periodic() the main path is tick_periodic() -> update_process_times() -> scheduler_tick(). The last function, scheduler_tick(), is the one that matters for scheduling. First, tick_periodic() and update_process_times():

```c
/*
 * Called periodically by the tick_device.
 * Updates jiffies and the accounting of the current task.
 * Only one CPU updates jiffies; the other CPUs only update their own
 * current task.
 */
static void tick_periodic(int cpu)
{
    if (tick_do_timer_cpu == cpu) {
        /* This CPU is responsible for updating the time */
        write_seqlock(&jiffies_lock);

        /* Keep track of the next tick event */
        tick_next_period = ktime_add(tick_next_period, tick_period);

        /* Advance the tick counter: jiffies += 1 */
        do_timer(1);
        write_sequnlock(&jiffies_lock);
        /* Update wall time, i.e. the time of day as we know it */
        update_wall_time();
    }
    /* Update the current task's accounting; this leads to the scheduler */
    update_process_times(user_mode(get_irq_regs()));
    profile_tick(CPU_PROFILING);
}

void update_process_times(int user_tick)
{
    struct task_struct *p = current;
    int cpu = smp_processor_id();

    /* Note: this timer irq context must be accounted for as well. */
    /* Charge the tick to the task's user-mode or kernel-mode CPU time */
    account_process_tick(p, user_tick);
    /* Check for expired timers and run their handlers */
    run_local_timers();
    rcu_check_callbacks(cpu, user_tick);
#ifdef CONFIG_IRQ_WORK
    if (in_irq())
        irq_work_tick();
#endif
    /* The scheduler's tick */
    scheduler_tick();
    run_posix_cpu_timers(p);
}
```

Together these two functions increment jiffies by 1, update the wall time, charge the tick to the current task's user-mode or kernel-mode CPU time, and run any timers that have expired. After that comes the most important call, scheduler_tick(). It updates some per-CPU and per-task data and then calls the task_tick() hook of the current task's scheduling class. For ordinary tasks that hook is task_tick_fair().

```c
void scheduler_tick(void)
{
    /* ID of the current CPU */
    int cpu = smp_processor_id();
    /* Run queue of the current CPU */
    struct rq *rq = cpu_rq(cpu);
    /* Task currently running on this CPU, in practice the same as current */
    struct task_struct *curr = rq->curr;
    /* Update the scheduler clock for this tick */
    sched_clock_tick();

    raw_spin_lock(&rq->lock);
    /* Update the run queue's clock */
    update_rq_clock(rq);
    curr->sched_class->task_tick(rq, curr, 0);
    /* Update the CPU load */
    update_cpu_load_active(rq);
    raw_spin_unlock(&rq->lock);

    perf_event_task_tick();

#ifdef CONFIG_SMP
    rq->idle_balance = idle_cpu(cpu);
    trigger_load_balance(rq);
#endif
    /* rq->last_sched_tick = jiffies; */
    rq_last_tick_reset(rq);
}

/*
 * task_tick() of the CFS scheduling class
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    /* Walk up the group hierarchy and update each level */
    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        /* Update the entity's runtime and decide whether it must be rescheduled */
        entity_tick(cfs_rq, se, queued);
    }

    if (numabalancing_enabled)
        task_tick_numa(rq, curr);

    update_rq_runnable_avg(rq, 1);
}
```
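The call curr->sched_class->task_tick(rq, curr, 0) in scheduler_tick() is a dispatch through a table of function pointers. The following is a minimal user-space sketch of that pattern, not kernel code: every type and function name ending in _sketch is hypothetical, and the "three ticks" rule merely stands in for the real CFS decision.

```c
#include <stddef.h>

struct task_sketch;

/* Hypothetical, heavily reduced stand-in for struct sched_class. */
struct sched_class_sketch {
    const char *name;
    /* Per-class tick hook, analogous to sched_class->task_tick */
    void (*task_tick)(struct task_sketch *curr);
};

/* Hypothetical stand-in for struct task_struct. */
struct task_sketch {
    const struct sched_class_sketch *sched_class;
    unsigned long ticks_seen;   /* illustrative counter only */
    int need_resched;           /* stands in for TIF_NEED_RESCHED */
};

/* "Fair" hook: asks for a reschedule after three ticks (made-up rule). */
static void task_tick_fair_sketch(struct task_sketch *curr)
{
    curr->ticks_seen++;
    if (curr->ticks_seen >= 3)
        curr->need_resched = 1;
}

/* A second hypothetical class whose hook only counts ticks. */
static void task_tick_other_sketch(struct task_sketch *curr)
{
    curr->ticks_seen++;
}

static const struct sched_class_sketch fair_class_sketch = {
    "fair", task_tick_fair_sketch
};
static const struct sched_class_sketch other_class_sketch = {
    "other", task_tick_other_sketch
};

/* The generic tick code does not know which class it is calling into. */
static void scheduler_tick_sketch(struct task_sketch *curr)
{
    curr->sched_class->task_tick(curr);
}
```

The generic code stays identical for every class; only the pointer stored in the task decides which hook runs, which is why task_tick_fair() is reached for ordinary tasks.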
Clearly the most important function at this point is entity_tick(), because it decides whether the current task has to be scheduled out. One thing must be clear first: CFS organizes tasks in a red-black tree keyed by their vruntime. The smaller a task's vruntime, the further left it sits in the tree, and the next task to run is always the one on the leftmost node. While a task runs, its vruntime grows with its real runtime, but at a rate that depends on its weight: the larger the weight of the running task (the higher its priority), the more slowly its vruntime grows, so the more CPU time it receives. On every clock interrupt entity_tick() updates the vruntime of the current task. While a task is not running on a CPU its vruntime stays unchanged.

```c
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    /* Update the current task's runtime, including its virtual runtime */
    update_curr(cfs_rq);

    /*
     * Ensure that runnable average is periodically updated.
     */
    update_entity_load_avg(curr, 1);
    update_cfs_rq_blocked_load(cfs_rq, 1);
    update_cfs_shares(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
    /*
     * queued ticks are scheduled to match the slice, so don't bother
     * validating it and just reschedule.
     */
    /* If queued is 1, the task running on this queue must be rescheduled */
    if (queued) {
        /* Mark the current task as needing to be scheduled out */
        resched_curr(rq_of(cfs_rq));
        return;
    }
    /*
     * don't let the period tick interfere with the hrtick preemption
     */
    if (!sched_feat(DOUBLE_TICK) && hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
        return;
#endif
    /* Check whether a reschedule is needed */
    if (cfs_rq->nr_running > 1)
        check_preempt_tick(cfs_rq, curr);
}
```

A later article covers how CFS handles vruntime in detail; for now it is enough to know the above. entity_tick() first updates the real and virtual runtime of the current task. This matters because the updated values are what the preemption check uses. The check itself is done by check_preempt_tick(), the last call in entity_tick(), which applies two criteria:

(1) If the real time the task has run since it was last picked exceeds the slice allotted to it, it must be rescheduled.
(2) Otherwise, once the task has run for at least sysctl_sched_min_granularity, if its vruntime exceeds the vruntime of the leftmost task in the tree by more than that slice, it must be rescheduled.

With these two criteria in mind the code of check_preempt_tick() is easy to follow.

```c
/*
 * Check whether the current task must be preempted.
 * Two tests: has the task used more real runtime than its slice, and
 * is its virtual runtime too far ahead of the next task's.
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    /*
     * ideal_runtime: how long the task should run
     * delta_exec:    real runtime accumulated since it was picked
     * If delta_exec exceeds ideal_runtime the task should yield the CPU.
     */
    unsigned long ideal_runtime, delta_exec;
    struct sched_entity *se;
    s64 delta;

    /*
     * The scheduling period is the real time needed for every task on
     * the CFS queue to run once. ideal_runtime is this task's share:
     *   ideal_runtime = period * task weight / total queue weight
     */
    ideal_runtime = sched_slice(cfs_rq, curr);
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
    if (delta_exec > ideal_runtime) {
        /* Ran longer than its slice: schedule it out */
        resched_curr(rq_of(cfs_rq));
        /*
         * The current task ran long enough, ensure it doesn't get
         * re-elected due to buddy favours.
         */
        /* Clear any of the cfs_rq's last, next, skip pointers that refer to curr */
        clear_buddies(cfs_rq, curr);
        return;
    }

    /*
     * Ensure that a task that missed wakeup preemption by a
     * narrow margin doesn't have to wait for a full slice.
     * This also mitigates buddy induced latencies under load.
     */
    if (delta_exec < sysctl_sched_min_granularity)
        return;
    /* Entity of the next task to run (leftmost in the tree) */
    se = __pick_first_entity(cfs_rq);
    /* Current task's vruntime minus the next task's vruntime */
    delta = curr->vruntime - se->vruntime;

    /* Current vruntime is still smaller than the next task's: keep running */
    if (delta < 0)
        return;

    if (delta > ideal_runtime)
        /*
         * Current vruntime is ahead of the next task's by more than
         * one slice, so the next task deserves the CPU more.
         * resched_curr() marks the current task to be scheduled out.
         */
        resched_curr(rq_of(cfs_rq));
}

/*
 * resched_curr - mark rq's current task 'to be rescheduled now'.
 *
 * On UP this means the setting of the need_resched flag, on SMP it
 * might also involve a cross-CPU call to trigger the scheduler on
 * the target CPU.
 */
/* Sets TIF_NEED_RESCHED in the current task's thread_info->flags */
void resched_curr(struct rq *rq)
{
    struct task_struct *curr = rq->curr;
    int cpu;

    lockdep_assert_held(&rq->lock);

    /* Flag already set: nothing to do */
    if (test_tsk_need_resched(curr))
        return;

    /* CPU that owns this run queue */
    cpu = cpu_of(rq);
    /* The run queue belongs to the CPU we are running on */
    if (cpu == smp_processor_id()) {
        /* Set the need-resched flag, kept in the task's thread_info */
        set_tsk_need_resched(curr);
        /* Record the request in this CPU's preemption state as well */
        set_preempt_need_resched();
        return;
    }

    /* Another CPU: set the flag and notify that CPU */
    if (set_nr_and_not_polling(curr))
        smp_send_reschedule(cpu);
    else
        trace_sched_wake_idle_without_ipi(cpu);
}
```

At this point a task that needs to be rescheduled has been marked, and a task that does not simply keeps running. You may wonder why the task is only marked and nothing is actually done about it. As described in my post "linux调度器源码分析 - 概述(一)", one of the moments at which scheduling happens is the return from an interrupt, and that part is implemented in assembly. Everything above runs inside the clock interrupt. When the interrupt returns, the assembly return path (the original post names ret_from_sys_call; the exact label depends on the architecture and kernel version) checks whether the need-resched flag is set and, if so, jumps to schedule(), which ends up in __schedule().

```c
static void __sched __schedule(void)
{
    /* prev: task being switched out (the current task); next: task being switched in */
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    /* Disable preemption */
    preempt_disable();
    /* ID of the current CPU */
    cpu = smp_processor_id();
    /* Run queue of the current CPU */
    rq = cpu_rq(cpu);
    rcu_note_context_switch(cpu);
    prev = rq->curr;

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    /*
     * Make sure that signal_pending_state()->signal_pending() below
     * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
     * done by the caller to avoid the race with signal_wake_up().
     */
    smp_mb__before_spinlock();
    /* Lock the run queue */
    raw_spin_lock_irq(&rq->lock);
    /* Counter of involuntary context switches */
    switch_count = &prev->nivcsw;

    /*
     * Kernel preemption sets PREEMPT_ACTIVE in thread_info's
     * preempt_count and clears it after schedule() returns, so
     * PREEMPT_ACTIVE means we got here through kernel preemption.
     * A non-zero prev->state means the task is not TASK_RUNNING.
     */
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        /* Not TASK_RUNNING and not entered through kernel preemption */
        if (unlikely(signal_pending_state(prev->state, prev))) {
            /* A signal is pending: make the task runnable again */
            prev->state = TASK_RUNNING;
        } else {
            /*
             * No pending signal: remove the task from the run queue.
             * Reaching this point means the task is either exiting
             * or about to sleep.
             */
            deactivate_task(rq, prev, DEQUEUE_SLEEP);
            prev->on_rq = 0;

            /*
             * If a worker went to sleep, notify and ask workqueue
             * whether it wants to wake up a task to maintain
             * concurrency.
             */
            if (prev->flags & PF_WQ_WORKER) {
                /* The task is a workqueue worker */
                struct task_struct *to_wakeup;

                to_wakeup = wq_worker_sleeping(prev, cpu);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup);
            }
        }
        switch_count = &prev->nvcsw;
    }

    /* Update the run queue's clock */
    if (task_on_rq_queued(prev) || rq->skip_clock_update < 0)
        update_rq_clock(rq);

    /*
     * Pick the next task. The result is always a task, never a group:
     * pick_next_task() descends through groups until it finds one.
     */
    next = pick_next_task(rq, prev);
    /*
     * Clear TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED for prev. They
     * requested this reschedule and are no longer needed.
     */
    clear_tsk_need_resched(prev);
    clear_preempt_need_resched();
    rq->skip_clock_update = 0;

    if (likely(prev != next)) {
        /* One more context switch on this CPU */
        rq->nr_switches++;
        /* The new task becomes this CPU's current task */
        rq->curr = next;

        ++*switch_count;

        /* The actual context switch happens here */
        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * The context switch have flipped the stack from under us
         * and restored the local variables which were saved when
         * this task called schedule() in the past. prev == current
         * is still correct, but it can be moved to another cpu/rq.
         */
        /* The task may now be on a different CPU: fetch cpu and rq again */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        /* The next task is the current task: just release the lock */
        raw_spin_unlock_irq(&rq->lock);
    /* Work to do after the context switch */
    post_schedule(rq);

    /* Re-enable preemption without rescheduling immediately */
    sched_preempt_enable_no_resched();
    if (need_resched())
        goto need_resched;
}
```

The comments explain each step of __schedule(). Choosing the next task is delegated to pick_next_task(), and the switch itself to context_switch(). First, how pick_next_task() chooses the next task:

```c
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
    const struct sched_class *class = &fair_sched_class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    if (likely(prev->sched_class == class &&
               rq->nr_running == rq->cfs.h_nr_running)) {
        /* Every runnable task is on the CFS queue, so use the CFS class directly */
        p = fair_sched_class.pick_next_task(rq, prev);
        if (unlikely(p == RETRY_TASK))
            goto again;

        /* assumes fair_sched_class->next == idle_sched_class */
        if (unlikely(!p))
            p = idle_sched_class.pick_next_task(rq, prev);

        return p;
    }

again:
    /*
     * Other classes have runnable tasks: walk the classes from the
     * highest priority to the lowest and take the first task offered.
     */
    for_each_class(class) {
        p = class->pick_next_task(rq, prev);
        if (p) {
            if (unlikely(p == RETRY_TASK))
                goto again;
            return p;
        }
    }

    BUG(); /* the idle class will always have a runnable task */
}
```

pick_next_task() is where task priority shows most clearly. It first checks whether all runnable tasks are on the CFS queue. If not, there are tasks of higher priority than ordinary ones (real-time tasks among them). The kernel orders the scheduling classes from high to low priority and, when picking, starts at the highest class and moves to the next lower one whenever a class has nothing to run. This is the for_each_class() loop in the code, which expands to:

```c
#define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)
```

The priority order of the scheduling classes is:

stop_sched_class -> dl_sched_class -> rt_sched_class -> fair_sched_class -> idle_sched_class
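Returning briefly to the tick-time preemption test: the slice formula and the two criteria of check_preempt_tick() can be condensed into a few pure functions. This is a user-space sketch under stated assumptions, not the kernel implementation. The function names are hypothetical, the weights in the usage note are illustrative rather than taken from the kernel's nice-to-weight table, and the vruntime scaling delta_exec * 1024 / weight is the usual CFS relation, which the text above describes only in words.

```c
#include <stdint.h>
#include <stdbool.h>

/* Weight of a nice-0 task; vruntime advances at wall-clock rate for it. */
#define NICE_0_WEIGHT_SKETCH 1024ULL

/* ideal_runtime = period * task weight / total queue weight */
static uint64_t slice_ns_sketch(uint64_t period_ns, uint64_t weight,
                                uint64_t total_weight)
{
    return period_ns * weight / total_weight;
}

/* Heavier tasks accumulate vruntime more slowly for the same real runtime. */
static uint64_t vruntime_delta_ns_sketch(uint64_t delta_exec_ns, uint64_t weight)
{
    return delta_exec_ns * NICE_0_WEIGHT_SKETCH / weight;
}

/* Mirrors the decision order of check_preempt_tick(). */
static bool tick_should_resched_sketch(uint64_t delta_exec_ns,
                                       uint64_t ideal_runtime_ns,
                                       uint64_t min_granularity_ns,
                                       uint64_t curr_vruntime,
                                       uint64_t leftmost_vruntime)
{
    int64_t delta;

    /* Criterion 1: the task has used up its slice. */
    if (delta_exec_ns > ideal_runtime_ns)
        return true;
    /* Too little runtime so far: do not preempt yet. */
    if (delta_exec_ns < min_granularity_ns)
        return false;
    /* Criterion 2: compare against the leftmost entity's vruntime. */
    delta = (int64_t)(curr_vruntime - leftmost_vruntime);
    if (delta < 0)
        return false;
    return (uint64_t)delta > ideal_runtime_ns;
}
```

With a 6 ms period and two tasks of weight 2048 and 1024, the slices come out as 4 ms and 2 ms, and both tasks advance their vruntime by the same 2 ms over one period, which is what keeps the tree balanced.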
pick_next_task() returns the task descriptor of the chosen task, and context_switch() is then called to switch to it.

```c
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;

    prepare_task_switch(rq, prev, next);

    mm = next->mm;
    oldmm = prev->active_mm;
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_start_context_switch(prev);

    if (!mm) {
        /* next has no memory descriptor, so it is a kernel thread */
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        /*
         * Tell the architecture code that the address space need not
         * be switched:
         *   if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
         *       this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
         */
        enter_lazy_tlb(oldmm, next);
    } else
        /* Switch the virtual address space */
        switch_mm(oldmm, mm, next);

    if (!prev->mm) {
        /* The task being switched out is a kernel thread */
        prev->active_mm = NULL;
        /* Hand back the borrowed oldmm */
        rq->prev_mm = oldmm;
    }
    /*
     * Since the runqueue lock will be released by the next
     * task (which is an invalid locking op but in the case
     * of the scheduler it's an obvious special-case), so we
     * do an early lockdep release here:
     */
    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);

    context_tracking_task_switch(prev, next);

    /* Switch registers and kernel stack; current now refers to next */
    switch_to(prev, next, prev);

    /* Compiler barrier */
    barrier();
    /*
     * this_rq must be evaluated again because prev may have moved
     * CPUs since it called schedule(), thus the 'rq' on its stack
     * frame will be invalid.
     */
    finish_task_switch(this_rq(), prev);
}
```

This completes the selection of a task and the switch to it.

Summary

The general principle and the source of the scheduler have now been covered. Further details, such as the calculations inside CFS and the handling of real-time tasks, are explained in other articles.

Linux time management

Time management has a very important place in the kernel. Compared with event-driven code, a large number of kernel functions are time-driven. The kernel has to keep track of the system's running time as well as the current date and time.

First, the role of the RTC inside the kernel. A Linux system has two clocks: the real-time clock and the system timer.

Real-time clock

One is the "Real Time Clock" (RTC), also called the CMOS clock or hardware clock, powered by a coin-cell battery. It keeps the time while the operating system is shut down; a running system does not use it. At boot the kernel reads the RTC to initialize the wall time, which is stored in the variable xtime. Wall time is simply the current real-world time.

System timer

The other is the "system clock", also called the kernel clock, software clock or system timer. It counts in software, driven by timer interrupts. The system timer is the most important part of the kernel's time mechanism. It provides a periodic interrupt: it raises the clock interrupt on its own at the frequency HZ (the tick rate). When the clock interrupt occurs the kernel handles it in the clock interrupt handler timer_interrupt().

The system timer is managed entirely by the operating system, which is why it is also called the system clock or software clock. At boot the kernel initializes the system time from the RTC; from then on the operating system keeps it running at a fixed frequency. The system time is therefore not a clock in the traditional sense; it represents time by counting timer ticks.

The kernel clock does not exist while the system is powered off. When the operating system starts, the kernel clock is synchronized by reading the RTC, and at shutdown the system time is written back to the RTC.

The global variable jiffies records the total number of ticks since the system booted. Every timer interrupt increments jiffies by one. The kernel initializes it to 0 at boot, and the clock interrupt handler increments it from then on. Since there are HZ clock interrupts per second, jiffies grows by HZ each second, and the system uptime in seconds is jiffies/HZ.

jiffies can be converted to seconds with (jiffies/HZ), and seconds to jiffies with (seconds*HZ).

A tick is the reciprocal of HZ, that is, the time between two timer interrupts. With HZ = 250 a tick is 4 milliseconds.

jiffies is only a time relative to system boot. To obtain the absolute time or wall time the RTC is needed. The kernel records wall time in the variable xtime: at boot it reads the RTC into xtime, and when the system halts it writes the wall time back to the RTC. do_gettimeofday() reads the wall time.

The system timer and its interrupt handler are the hub of the kernel's management mechanisms. The following work is done periodically by the timer interrupt handler:

(1) Update the system uptime, jiffies.
(2) Update the current wall time, xtime.
(3) On symmetric multiprocessor (SMP) systems, balance the run queues across processors.
(4) Check whether the current task has used up its time slice and, if so, reschedule.

As mentioned, the Linux kernel interacts with the RTC on only two occasions:

1) At boot the kernel reads the date and time from the RTC (initialization of the Linux system time). rtc_read_time(rtc, &tm) reads the RTC, and do_settimeofday(&tv) initializes the system time xtime.

In alarm.c (kernel\drivers\rtc), alarm_late_init() is registered with late_initcall(alarm_late_init):

```c
static int __init alarm_late_init(void)
{
    unsigned long flags;
    struct timespec tmp_time, system_time;

    /* this needs to run after the rtc is read at boot */
    spin_lock_irqsave(&alarm_slock, flags);
    /* We read the current rtc and system time so we can later calculate
     * elapsed realtime to be (boot_systemtime + rtc - boot_rtc) ==
     * (rtc - (boot_rtc - boot_systemtime))
     */
    getnstimeofday(&tmp_time);
    ktime_get_ts(&system_time);
    alarms[ANDROID_ALARM_ELAPSED_REALTIME_WAKEUP].delta =
        alarms[ANDROID_ALARM_ELAPSED_REALTIME].delta =
            timespec_to_ktime(timespec_sub(tmp_time, system_time));

    spin_unlock_irqrestore(&alarm_slock, flags);
    return 0;
}
```

In hctosys.c (kernel\drivers\rtc), rtc_hctosys() is registered with late_initcall(rtc_hctosys). The call chain is:

start_kernel() -> late_initcall(rtc_hctosys) -> rtc_hctosys() -> rtc_read_time(rtc, &tm) -> do_settimeofday(&tv) -> xtime = *tv

```c
int rtc_hctosys(void)
{
    int err = -ENODEV;
    struct rtc_time tm;
    struct timespec tv = {
        .tv_nsec = NSEC_PER_SEC >> 1,
    };
    struct rtc_device *rtc = rtc_class_open(CONFIG_RTC_HCTOSYS_DEVICE);

    if (rtc == NULL) {
        pr_err("%s: unable to open rtc device (%s)\n",
               __FILE__, CONFIG_RTC_HCTOSYS_DEVICE);
        goto err_open;
    }

    err = rtc_read_time(rtc, &tm);
    if (err) {
        dev_err(rtc->dev.parent,
                "hctosys: unable to read the hardware clock\n");
        goto err_read;
    }

    err = rtc_valid_tm(&tm);
    if (err) {
        dev_err(rtc->dev.parent,
                "hctosys: invalid date/time\n");
        goto err_invalid;
    }

    rtc_tm_to_time(&tm, &tv.tv_sec);

    do_settimeofday(&tv);

    dev_info(rtc->dev.parent,
             "setting system clock to "
             "%d-%02d-%02d %02d:%02d:%02d UTC (%u)\n",
             tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
             tm.tm_hour, tm.tm_min, tm.tm_sec,
             (unsigned int) tv.tv_sec);

err_invalid:
err_read:
    rtc_class_close(rtc);

err_open:
    rtc_hctosys_ret = err;

    return err;
}
```

2) The kernel writes the date and time back to the RTC when needed.

At boot the kernel reads the RTC to initialize the kernel clock, also called wall time, which is kept in xtime. When the system sleeps, the CPU is powered down and the system clock stops. The RTC suspend function rtc_suspend() therefore reads both the RTC and the system clock and saves the values, together with their difference, in static variables. After wake-up, rtc_resume() reads both clocks again. The system clock has barely advanced in between, because timekeeping was stopped while the CPU was off, whereas the RTC kept counting. The elapsed RTC time minus the elapsed system time is the time spent asleep, and rtc_resume() passes it to timekeeping_inject_sleeptime(&sleep_time) to bring the system clock xtime up to date. If the RTC and the system time are identical, the difference between them is 0.
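The compensation described above can be checked with plain arithmetic. The sketch below works in whole seconds and uses hypothetical names (clock_snapshot_sketch and the two functions); it mirrors the subtraction that rtc_resume() performs but leaves out nanoseconds and the drift correction.

```c
#include <stdint.h>

/* One reading of both clocks, in seconds (hypothetical type). */
struct clock_snapshot_sketch {
    int64_t rtc_sec;     /* value read from the RTC */
    int64_t system_sec;  /* value of the system clock */
};

/*
 * Time spent asleep: what the RTC counted between the two snapshots,
 * minus what the system clock itself already counted in that window.
 */
static int64_t sleep_seconds_sketch(struct clock_snapshot_sketch at_suspend,
                                    struct clock_snapshot_sketch at_resume)
{
    int64_t rtc_elapsed = at_resume.rtc_sec - at_suspend.rtc_sec;
    int64_t system_elapsed = at_resume.system_sec - at_suspend.system_sec;

    return rtc_elapsed - system_elapsed;
}

/* System time after injecting the sleep time; negative values are ignored. */
static int64_t system_time_after_resume_sketch(
        struct clock_snapshot_sketch at_suspend,
        struct clock_snapshot_sketch at_resume)
{
    int64_t sleep = sleep_seconds_sketch(at_suspend, at_resume);

    if (sleep < 0)
        sleep = 0;
    return at_resume.system_sec + sleep;
}
```

Because only differences of each clock with itself are used, a constant offset between the RTC and the system time (4000 seconds in the test values) cancels out and is preserved across the suspend.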
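On some systems, however, the hardware RTC can only be read and not written, so the difference between the two is not 0.

rtc_suspend() and rtc_resume() in class.c (kernel\drivers\rtc):

```c
/* old_rtc, old_system and old_delta are file-scope static variables in class.c */
static int rtc_suspend(struct device *dev, pm_message_t mesg)
{
    struct rtc_device *rtc = to_rtc_device(dev);
    struct rtc_time tm;
    struct timespec delta, delta_delta;

    if (strcmp(dev_name(&rtc->dev), CONFIG_RTC_HCTOSYS_DEVICE) != 0)
        return 0;

    /* snapshot the current RTC and system time at suspend*/
    rtc_read_time(rtc, &tm);
    getnstimeofday(&old_system);
    rtc_tm_to_time(&tm, &old_rtc.tv_sec);

    /*
     * To avoid drift caused by repeated suspend/resumes,
     * which each can add ~1 second drift error,
     * try to compensate so the difference in system time
     * and rtc time stays close to constant.
     */
    delta = timespec_sub(old_system, old_rtc);
    delta_delta = timespec_sub(delta, old_delta);
    if (delta_delta.tv_sec < -2 || delta_delta.tv_sec >= 2) {
        /*
         * if delta_delta is too large, assume time correction
         * has occured and set old_delta to the current delta.
         */
        old_delta = delta;
    } else {
        /* Otherwise try to adjust old_system to compensate */
        old_system = timespec_sub(old_system, delta_delta);
    }

    return 0;
}

static int rtc_resume(struct device *dev)
{
    struct rtc_device *rtc = to_rtc_device(dev);
    struct rtc_time tm;
    struct timespec new_system, new_rtc;
    struct timespec sleep_time;

    if (strcmp(dev_name(&rtc->dev), CONFIG_RTC_HCTOSYS_DEVICE) != 0)
        return 0;

    /* snapshot the current rtc and system time at resume */
    getnstimeofday(&new_system);
    rtc_read_time(rtc, &tm);
    if (rtc_valid_tm(&tm) != 0) {
        pr_debug("%s: bogus resume time\n", dev_name(&rtc->dev));
        return 0;
    }
    rtc_tm_to_time(&tm, &new_rtc.tv_sec);
    new_rtc.tv_nsec = 0;

    if (new_rtc.tv_sec < old_rtc.tv_sec) {
        pr_debug("%s: time travel!\n", dev_name(&rtc->dev));
        return 0;
    }

    /* calculate the RTC time delta (sleep time)*/
    sleep_time = timespec_sub(new_rtc, old_rtc);

    /*
     * Since these RTC suspend/resume handlers are not called
     * at the very end of suspend or the start of resume,
     * some run-time may pass on either sides of the sleep time
     * so subtract kernel run-time between rtc_suspend to rtc_resume
     * to keep things accurate.
     */
    sleep_time = timespec_sub(sleep_time,
                              timespec_sub(new_system, old_system));

    if (sleep_time.tv_sec >= 0)
        timekeeping_inject_sleeptime(&sleep_time);
    return 0;
}
```

All periodic events are driven by the system timer. The system timer is a programmable hardware chip that raises interrupts at a fixed frequency. This interrupt is the timer interrupt; its handler updates the system time and runs the tasks that have to execute periodically. The system timer and the clock interrupt handler are the hub of the Linux kernel's management mechanisms.

1 The notion of time in the kernel

The hardware provides the kernel with a system timer for measuring elapsed time. The timer raises the clock interrupt on its own at a frequency that can be programmed in advance, called the tick rate. When the clock interrupt occurs the kernel handles it with a special interrupt handler. The kernel knows the interval between two successive clock interrupts; this interval is called a tick. From this known interval the kernel computes the wall time and the system uptime. Wall time is the real-world time, kept in the variable xtime, and the kernel offers a set of system calls for obtaining the actual date and time. The system uptime, the time elapsed since boot, is useful to both user space and the kernel, because many programs need to know how much time has passed.

2 Tick rate

The system timer frequency is defined statically by the preprocessor as HZ, the number of clock interrupts per second, and the hardware is programmed according to HZ at boot. The value of HZ differs between architectures. The kernel defines the actual value in <asm/param.h>. The tick rate is HZ and the period is 1/HZ. On i386 the tick rate is 1000; on most other architectures (ARM included) it is 100.

3 jiffies

The global variable jiffies records the total number of ticks since the system booted. The kernel initializes it to 0 at boot, and every clock interrupt handler run increments it. Since there are HZ clock interrupts per second, jiffies grows by HZ each second, so the uptime in seconds is jiffies/HZ, and jiffies = seconds*HZ.

jiffies is declared in <linux/jiffies.h>:

extern unsigned long volatile jiffies;

The keyword volatile tells the compiler to reload the variable from main memory on every access instead of using a copy cached in a register, so that loops polling jiffies behave as expected.

3.1 Internal representation of jiffies

jiffies is always an unsigned long, so it is 32 bits wide on 32-bit architectures. At a tick rate of 100 it overflows after 497 days; at 1000 it overflows after 49.7 days.

3.2 User space and HZ

In kernels before 2.6, changing the value of HZ in the kernel could produce wrong results in some user-space programs. The kernel exported this value to user space in ticks per second, and after the interface had been stable for a long time applications came to depend on that particular HZ value. Changing the definition of HZ in the kernel therefore broke a relation that user space treated as constant, since user space does not know the new HZ. To avoid such errors the kernel has to scale every jiffies value it exports. For that purpose the kernel defines USER_HZ, the value that user space sees. On ARM, HZ = USER_HZ.

4 Hardware clocks and timers

The architecture provides two devices for timekeeping: the system timer discussed above and the real-time clock. The real-time clock (RTC) stores the system time persistently; even when the system is switched off it keeps counting, powered by a small battery on the mainboard. At boot the kernel reads the RTC to initialize the wall time, which is stored in xtime. The system timer plays the most important role in the kernel's timing mechanism. Timer implementations differ between architectures, but the basic idea is the same everywhere: provide an interrupt that fires periodically.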
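The figures in section 3.1 and the conversions between jiffies and seconds are easy to verify, and the same arithmetic shows why a wrapped counter must not be compared with a plain >=. The sketch below models a 32-bit jiffies counter with uint32_t in user space. The function names are hypothetical; the signed-difference comparison follows the idea behind the kernel's time_after() macro.

```c
#include <stdint.h>
#include <stdbool.h>

/* Days until a 32-bit tick counter wraps at the given tick rate. */
static double wrap_days_sketch(unsigned hz)
{
    return 4294967296.0 / hz / 86400.0;
}

/* seconds -> ticks and ticks -> seconds, as in (seconds*HZ) and (jiffies/HZ) */
static uint32_t seconds_to_jiffies_sketch(uint32_t seconds, unsigned hz)
{
    return seconds * hz;
}

static uint32_t jiffies_to_seconds_sketch(uint32_t ticks, unsigned hz)
{
    return ticks / hz;
}

/*
 * True if tick value a is later than b, even across a wrap.
 * The unsigned subtraction wraps modulo 2^32; reading the result as a
 * signed number gives the direction of the difference. The conversion
 * to int32_t relies on two's-complement behaviour, which mainstream
 * compilers provide.
 */
static bool time_after_sketch(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}
```

A timeout computed shortly before the wrap lands at a small number. A naive test such as now >= timeout then reports the timeout as expired immediately, while the signed-difference test still gives the right answer.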
5 Clock interrupt handler

Now let us see how the clock interrupt handler is implemented. It consists of two parts: an architecture-dependent part and an architecture-independent part. The architecture-dependent routine is registered with the kernel as the interrupt handler of the system timer so that it runs whenever the clock interrupt occurs. What exactly it does depends on the architecture, but most handlers do at least the following:

(1) Take the xtime_lock lock to protect access to jiffies_64 and the wall time xtime.
(2) Acknowledge or re-arm the system clock if required.
(3) Periodically update the real-time clock from the wall time.
(4) Call the architecture-independent routine do_timer().

Through do_timer() the interrupt service routine does the following work:

(1) Increment jiffies_64 by 1.
(2) Update resource usage statistics, such as the system time and user time consumed by the current task.
(3) Run the dynamic timers that have expired.
(4) Call scheduler_tick().
(5) Update the wall time stored in xtime.

When do_timer() returns, the architecture-dependent handler continues with its remaining work, releases xtime_lock and exits.

All of this happens once every 1/HZ seconds, so on a PC with HZ = 1000 the clock interrupt handler runs 1000 times per second.

6 Wall time

The current real time (wall time) is defined in kernel/timer.c:

struct timespec xtime;

The timespec structure is defined in <linux/time.h>:

```c
struct timespec {
    time_t tv_sec;   /* seconds */
    long   tv_nsec;  /* nanoseconds */
};
```

xtime.tv_sec holds the number of seconds elapsed since January 1, 1970. xtime.tv_nsec holds the nanoseconds elapsed since the start of the current second.

Reading or writing xtime requires xtime_lock, which is a seqlock. Readers use read_seqbegin() and read_seqretry(). The call chain of the gettimeofday system call is:

SYSCALL_DEFINE2(gettimeofday, ...) -> do_gettimeofday(&ktv) -> getnstimeofday(&now)

```c
/* time.c (kernel\kernel) */
SYSCALL_DEFINE2(gettimeofday, struct timeval __user *, tv,
                struct timezone __user *, tz)
{
    if (likely(tv != NULL)) {
        struct timeval ktv;
        do_gettimeofday(&ktv);
        if (copy_to_user(tv, &ktv, sizeof(ktv)))
            return -EFAULT;
    }
    if (unlikely(tz != NULL)) {
        if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
            return -EFAULT;
    }
    return 0;
}

/* timekeeping.c (kernel\kernel\time) */
void do_gettimeofday(struct timeval *tv)
{
    struct timespec now;

    getnstimeofday(&now);
    tv->tv_sec = now.tv_sec;
    tv->tv_usec = now.tv_nsec / 1000;
}

/* timekeeping.c (kernel\kernel\time) */
void getnstimeofday(struct timespec *ts)
{
    unsigned long seq;
    s64 nsecs;

    WARN_ON(timekeeping_suspended);

    do {
        seq = read_seqbegin(&xtime_lock);

        *ts = xtime;
        nsecs = timekeeping_get_ns();

        /* If arch requires, add in gettimeoffset() */
        nsecs += arch_gettimeoffset();

    } while (read_seqretry(&xtime_lock, seq));

    timespec_add_ns(ts, nsecs);
}
```
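The read side of getnstimeofday() follows the sequence-lock pattern: take a snapshot of a counter, read the data, and retry if the counter changed in the meantime. Below is a minimal user-space sketch of that pattern with C11 atomics. It is not the kernel's seqlock: all names are hypothetical, it assumes a single writer (the kernel additionally serializes writers with a spinlock), and it uses the default sequentially consistent atomics for simplicity rather than tuned memory orderings.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical time value protected by a sequence counter. */
struct seq_time_sketch {
    atomic_uint seq;        /* even: stable, odd: write in progress */
    _Atomic int64_t sec;
    _Atomic int64_t nsec;
};

/* Writer: make the counter odd, update the data, make it even again. */
static void seq_time_write_sketch(struct seq_time_sketch *t,
                                  int64_t sec, int64_t nsec)
{
    atomic_fetch_add(&t->seq, 1u);
    atomic_store(&t->sec, sec);
    atomic_store(&t->nsec, nsec);
    atomic_fetch_add(&t->seq, 1u);
}

/* Reader, step 1: wait until no write is in progress, remember the counter. */
static unsigned seq_time_read_begin_sketch(struct seq_time_sketch *t)
{
    unsigned s;

    while ((s = atomic_load(&t->seq)) & 1u)
        ;   /* a writer is active: spin */
    return s;
}

/* Reader, step 2: nonzero means a write happened and the read must be redone. */
static int seq_time_read_retry_sketch(struct seq_time_sketch *t, unsigned start)
{
    return atomic_load(&t->seq) != start;
}

/* Complete read loop, shaped like the one in getnstimeofday(). */
static void seq_time_get_sketch(struct seq_time_sketch *t,
                                int64_t *sec, int64_t *nsec)
{
    unsigned s;

    do {
        s = seq_time_read_begin_sketch(t);
        *sec = atomic_load(&t->sec);
        *nsec = atomic_load(&t->nsec);
    } while (seq_time_read_retry_sketch(t, s));
}
```

Readers never block the writer, which suits a value that the timer interrupt updates often and many callers read. The price is that a reader may have to repeat its work.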
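getnstimeofday() is exported with EXPORT_SYMBOL(getnstimeofday). Its read loop repeats until the reader is sure that no write interfered while it was reading. If the clock interrupt handler updated xtime during the loop, read_seqretry() returns nonzero and the loop runs again. The main interface for obtaining the wall time from user space is gettimeofday().

7 Timers

Timers, sometimes called dynamic timers or kernel timers, are the basis for managing time in the kernel. Using one is simple: do some initialization, set an expiry time, specify the function to run on expiry, and activate the timer. The function runs automatically when the timer expires. Timers do not run periodically; a timer is destroyed once it has expired, which is one reason why they are called dynamic timers.

7.1 Using timers

A timer is represented by struct timer_list, defined in <linux/timer.h>.

```c
struct timer_list {
    struct list_head entry;            /* entry in the list of timers */
    unsigned long expires;             /* expiry time, in jiffies */
    spinlock_t lock;                   /* lock protecting the timer */
    void (*function)(unsigned long);   /* timer handler function */
    unsigned long data;                /* argument passed to the handler */
    struct tvec_t_base_s *base;        /* internal field, do not touch */
};
```

The kernel provides a set of timer interfaces that simplify timer management. They are all declared in <linux/timer.h>, and most are implemented in kernel/timer.c.

To create a timer, define it first:

struct timer_list my_timer;

Then initialize the structure. This must be done before any other timer function is applied to the timer:

init_timer(&my_timer);

Now fill in the required fields:

```c
my_timer.expires = jiffies + delay;        /* tick count at which the timer expires */
my_timer.data = 0;                         /* pass 0 to the handler */
my_timer.function = my_timer_function;     /* function called on expiry */
```

my_timer.expires is the expiry time as an absolute tick count. The handler runs once the current jiffies value is equal to or greater than it. The handler must have the following prototype:

void my_timer_function(unsigned long data);

The data argument makes it possible to register several timers with one handler and tell them apart by this argument.

Activate the timer:

add_timer(&my_timer);

Sometimes the expiry time of an already active timer has to be changed. The kernel provides mod_timer() for this:

mod_timer(&my_timer, jiffies + new_delay);

mod_timer() also works on timers that are initialized but not yet active, in which case it activates them. When mod_timer() returns, the timer is active and set to the new expiry time.

To stop a timer before it expires, use del_timer():

del_timer(&my_timer);

or del_timer_sync(), which must not be used in interrupt context.

8 Delaying execution

8.1 Busy waiting

The simplest way to delay is busy waiting (a busy loop). It is only suitable when the delay is a whole number of ticks or when precision does not matter much. A better approach lets the kernel schedule other tasks while the code waits:

```c
unsigned long delay = jiffies + 5 * HZ;

while (time_before(jiffies, delay))
    cond_resched();
```

cond_resched() schedules another task, but only takes effect if the need_resched flag is set. In any case, delays must never happen while holding a lock or with interrupts disabled.

8.2 Short delays

Sometimes kernel code (usually a driver) needs a very short delay, shorter than a tick, and needs it to be quite precise. This typically happens when synchronizing with hardware. The kernel provides delay functions for microseconds and milliseconds, declared in <linux/delay.h>; they do not use jiffies:

```c
void udelay(unsigned long usecs);
void mdelay(unsigned long msecs);
```

Experience shows that udelay() should not be used for delays longer than 1 millisecond; mdelay() is safer in that case.

The nanosecond, microsecond and millisecond delays must be short; an overly long delay is reported as an error. Header: delay.h

```c
void ndelay(unsigned long nsecs);
void udelay(unsigned long usecs);
void mdelay(unsigned long msecs);
void msleep(unsigned int millisecs);
void ssleep(unsigned int seconds);
```

Long delays. Header: jiffies.h / time.h

```c
/* compute the deadline once, outside the loop */
unsigned long timeout = jiffies + msecs_to_jiffies(delay_time);

while (time_before(jiffies, timeout))
    schedule();
```

8.3 schedule_timeout()

A better way to delay execution is schedule_timeout(), used like this:

```c
set_current_state(TASK_INTERRUPTIBLE); /* put the task into interruptible sleep */
schedule_timeout(s * HZ);              /* sleep, wake up after s seconds */
```

Its only argument is the relative delay in jiffies. The example puts the task on the interruptible sleep queue (which means other tasks can run in the meantime) for s seconds. Before calling schedule_timeout() the task must be set to one of the two states TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, otherwise it will not sleep. The calling code must not hold any lock, because a task holding a lock must not sleep. When the task is scheduled again, execution resumes where it went to sleep.

Time-related commands

date: display or set the system date and time

date does not take the time from the RTC. It reads the kernel's xtime, and what it sets is also the kernel's xtime, not the RTC.

Syntax:

date [-d <string>] [-u] [+format]
date [-s <string>] [-u] [+format]

The first form displays the system date or time; arguments starting with % are format parameters that select how the date or time is shown. The second form sets the system date and time. Only the administrator may set them. Without arguments, date shows the current date and time.

Options:

-d <string>, --date <string>    display the date and time described by the string; the string must be enclosed in double quotes
-s <string>, --set <string>     set the date and time according to the string; the string must be enclosed in double quotes
-u, --universal                 display or set universal time (GMT)
--help                          online help
--version                       version information

Format parameters, time fields:

%H    hour (00..23)
%I    hour (01..12)
%k    hour (0..23)
%l    hour (1..12)
%M    minute (00..59)
%p    AM or PM
%r    time, 12-hour (hh:mm:ss AM or PM)
%s    seconds since 1970-01-01 00:00:00
%S    second (00..59)
%T    time, 24-hour (hh:mm:ss)
%X    time representation (%H:%M:%S)
%Z    time zone

Format parameters, date fields:

%a    abbreviated weekday name (Sun..Sat)
%A    full weekday name (Sunday..Saturday)
%b    abbreviated month name (Jan..Dec)
%B    full month name (January..December)
%c    date and time (Mon Nov 8 14:12:46 CST 1999)
%d    day of month (01..31)
%D    date (mm/dd/yy)
%h    same as %b
%j    day of year (001..366)
%m    month (01..12)
%w    day of week (0 is Sunday)
%W    week of year (00..53, Monday as first day of the week)
%x    date representation (mm/dd/yy)
%y    last two digits of the year (99 for 1999)
%Y    year (for example 1970, 1996)

Note that only the superuser can set the time with date; ordinary users can only display it. date reads and sets the system clock and does not touch the RTC.

Usage examples

Format: date MMDDhhmmYYYY.ss (month, day, time, year, seconds); alternatively date -s month/day/year and date -s hour:minute:second.

#date                            //show the system time
#date -s <string>                //set the current time; only root may set it, others can only view it
#date -s 20120608                //set the date to 20120608; the time of day becomes 00:00:00
#date -s 12:23:23                //set the time of day; the date is not changed
#date -s "12:12:23 2006-10-10"   //set date and time together
#date 060812232012               //month, day, hour, minute, year written out in full; sets time and date

CST: China Standard Time. This reading may be specific to RedHat Linux.

UTC: Coordinated Universal Time, also called world standard time. The abbreviation derives from the English "Coordinated Universal Time" and the French "Temps Universel Coordonné". Mainland China, Hong Kong, Macau, Taiwan, Mongolia, Singapore, Malaysia, the Philippines and Western Australia are all 8 hours ahead of UTC, that is UTC+8.

GMT: Greenwich Mean Time is the standard time at the Royal Greenwich Observatory on the outskirts of London, because the prime meridian is defined as the meridian passing through it.

After setting the system time it still has to be synchronized to the hardware clock.

Show the RTC time:
hwclock -r

Write the system time to the RTC:
hwclock -w

Set the system time from the RTC:
hwclock -s

Afterwards the RTC information can be inspected like this:

cat /proc/driver/rtc
rtc_time : 09:32:13
rtc_date : 2011-03-24
alrm_time : **:**:**
alrm_date : 2063-**-31
alarm_IRQ : no
alrm_pending : no
24hr : yes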
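The format parameters of date listed earlier largely coincide with the conversion specifiers of the C library function strftime(), so the same formatting is available to programs. The following sketch formats a timestamp in UTC; the helper name format_utc_sketch is hypothetical, and only specifiers from standard C are used.

```c
#include <stddef.h>
#include <time.h>

/*
 * Format timestamp t (seconds since 1970-01-01 00:00:00 UTC) according
 * to fmt, writing at most len bytes into buf.
 * Returns the number of characters written, or 0 on failure.
 * gmtime() returns a pointer to static storage, so this helper is not
 * thread-safe.
 */
static size_t format_utc_sketch(time_t t, const char *fmt,
                                char *buf, size_t len)
{
    struct tm *tm_utc = gmtime(&t);

    if (tm_utc == NULL)
        return 0;
    return strftime(buf, len, fmt, tm_utc);
}
```

For example, format_utc_sketch(0, "%Y-%m-%d %H:%M:%S", buf, sizeof buf) fills buf with 1970-01-01 00:00:00, the same text that date -u -d @0 "+%Y-%m-%d %H:%M:%S" prints with GNU date.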
